mnoGoSearch now works better with huge documents. The maximum number of words collected from each document was raised from "64K words per section" to "2048K words per section". The data format in DBMode=single was changed, so users of DBMode=single have to reindex their documents from the beginning. The data format in DBMode=multi and DBMode=blob was not changed; reindexing in these modes is only necessary for huge documents (bigger than approximately 512K), to make indexer collect more words from these documents. The new limit allows documents with text size up to about 16Mb to be fully indexed.
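The size figures above are consistent with an average of roughly 8 bytes of text per word. That average is an assumption inferred from the numbers in this entry, not a documented constant. A quick check of the arithmetic under that assumption:

```python
# Assumption (not from the mnoGoSearch sources): about 8 bytes of text per word,
# inferred from "64K words" corresponding to "approximately 512K" of text.
ASSUMED_BYTES_PER_WORD = 8

OLD_WORD_LIMIT = 64 * 1024      # "64K words per section"
NEW_WORD_LIMIT = 2048 * 1024    # "2048K words per section"


def approx_text_size(word_limit: int) -> int:
    """Approximate text size in bytes that fits into the given word limit."""
    return word_limit * ASSUMED_BYTES_PER_WORD


old_size_kb = approx_text_size(OLD_WORD_LIMIT) // 1024            # size in K
new_size_mb = approx_text_size(NEW_WORD_LIMIT) // (1024 * 1024)   # size in Mb
```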
The LoadTagInfo search.htm command was added, to make tag values available in search results using $(tag).
The LoadURLInfo search.htm command was added, to switch off loading extra section values from the urlinfo table for performance purposes.
The StripAccents yes/no command was added into indexer.conf and search.htm to make accent-insensitive searches possible with databases that do not support accent-insensitive collations. When StripAccents is set to yes, all accented letters are converted to their non-accented counterparts when writing or looking up the word index.
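The idea of applying the same conversion both when writing and when looking up the word index can be illustrated with a minimal sketch. This is not mnoGoSearch's implementation: the function names and the in-memory index are hypothetical, and Unicode decomposition is just one possible way to strip accents.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Return the word with combining accent marks removed."""
    # NFD splits an accented letter into a base letter plus combining marks,
    # e.g. "é" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFD", word)
    # Keep the base letters and drop the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical in-memory word index: stripped word -> set of document ids.
index = {}
for doc_id, word in [(1, "coté"), (2, "cote"), (3, "côte")]:
    # The conversion is applied when writing to the index ...
    index.setdefault(strip_accents(word), set()).add(doc_id)


def lookup(query_word: str) -> set:
    # ... and again when looking a word up, so both sides agree.
    return index.get(strip_accents(query_word), set())
```

Letters that have no canonical decomposition (for example "ø") pass through this sketch unchanged, so a real implementation would likely need an explicit mapping table.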
Content-Type "application/http" is now understood - an HTTP response with headers.
Content-Type "application/http" now works with external parsers: if the result type of a parser is "application/http", indexer treats the result as a full HTTP response and parses both headers and content.
PostgreSQL driver now understands the "setnames" DBAddr parameter to set client encoding. If a non-empty "setnames" parameter is given, PQsetClientEncoding() is executed immediately after establishing a connection to the server.
Fixed that highlighting didn't work in some cases when a search query contained two or more phrases.
Performance improvement: the "sorting results by score" step is now much faster on big results (0.01 second vs 1.00 second on results returning one million documents).
Performance improvement: searching for a single word is now about three times faster on big results.
Some indexes were added into SQL schema to make searches with tag and category limit faster (Feature request #772).
Feature request #1364 "highlight collation matches" was implemented. Now when using an accent-insensitive collation (for example, latin1_general_ci with MySQL), search.cgi takes into account all word forms for excerpts and highlighting. For example, searches for French "cote" will also highlight "coté" and vice versa, if the non-exact word form generated hits.
MySQL driver now understands setnames parameter in DBAddr (feature request #1326).
MySQL driver now understands sqllogbin parameter in DBAddr (feature request #697).
DebugSQL parameter to DBAddr is now understood. When DebugSQL is set to yes, indexer and search.cgi print all SQL queries sent to the database. mnoGoSearch must be compiled using ./configure --with-debug ... to make this feature work.
MinCoordFactor and MaxCoordFactor impact is now calculated separately for each section.
"nwf" parameter is now understood in DBAddr string, to set its value per database.
"HoldBadHrefs 0" now means never delete unavailable documents from the database automatically (e.g. when remote host is down), which improves indexing speed, and which is now default behavior. Only positive HoldBadHrefs values activate automatic deletion.
Data type of urlinfo.sval was changed from TEXT to MEDIUMTEXT in MySQL table structure, to allow storing sections longer than 64K.
Bug#1733 "'indexer -Ewordstat' problem with PostgreSQL" was fixed.
Bug #1054 "indexer does not index html files without body tag" was fixed. A new special section with name "nobody" is now understood. If this section is configured, then indexer collects words outside the <body>...</body> tags. The default behavior is still not to index words outside these tags.
Bug#768 "User defined section is too short (1Kb limit)" was fixed.
Bug#1654 "SQLWordForms doesn't work with cluster" was fixed. Those using cluster should upgrade node.xml using the latest version of node.xml-dist.
Bug#1713 "Square brackets in DOCTYPE makes XML parser fail" was fixed.
Bug#1739 "indexer doesn't understand Content-Encoding for robots.txt" was fixed.
Bug#1740 "'UseRemoteContentType yes' doesn't work." was fixed.
Bug#1741 "'indexer.conf -Eblob -t tag' fails with 'Unknown table 's' in WHERE clause'" was fixed.
Fixed that indexer ignored the LogLevel command.
Fixed that popularity rank calculation didn't work with Interbase/Firebird. A missing column "url.shows" was added into SQL schema.
Fixed that phrase search didn't work in some cases (a bug since 3.3.0).
"ResultContentType none" is now understood to suppress printing of the "Content-Type" HTTP header by search.cgi. This is useful if you execute search.cgi from another Web application which sends HTTP headers itself.
The "ue" search.cgi parameter is now understood again, to exclude documents with the given URL pattern from search results. This feature was broken in 3.2.x.
indexer now uses the UDM_TMP_DIR and TMPDIR environment variables when creating temporary files (e.g. for external parsers) instead of the default /tmp.
Fixed that standalone dash character was considered as a separate word with "Dehyphenate yes", so for the queries like "a - b", search.cgi incorrectly searched for three words: "a", "-", "b", which never returned results in "find all words" mode.
Fixed that "UseCookie yes" made indexer crash when fetching data from HTDB sources.
Fixed that excerpts generated from cached copy of TEXT files didn't work (bug since 3.3.0).
Bug#746 "Stopwords in a long boolean query" was fixed.
Bug#1016 "Indexer is selecting wrong Content-Type" was fixed.
Bug#1024 "Clear database limitations do not work: error ORA-01795" was fixed.
Bug#1044 "-Ewordstat: incorrect unicode sequence" was fixed.
Bug#1110 "'invalid UTF-8 byte sequence detected' when INSERT INTO dictXX" was fixed. This error happened when indexing into PostgreSQL with DBMode=multi. The "intag" column type was changed from TEXT to BYTEA in the tables "dict00".."dictFF".
Bug#1182 "Indexer crashes with -a -y 'content/type'" was fixed.
Bug#1427 "ORA-01785: maximum number of expressions in a list is 1000" was fixed.
Bug#1436 "Cannot run -Ewordstat, ORA-01400: cannot insert NULL" was fixed.
Bug#1615 "The identifier "PATH_MAX" is undefined" was fixed.
Bug#1641 "Documentation problem" was fixed.
Bug#1659 "GroupBySite doesn't work in cluster mode" was fixed.
Bug#1679 "search.cgi dumps core on OpenBSD 4.0 when I search for non existing word" was fixed.
Bug#1693 "User defined sections don't work for text/plain files" was fixed.
Bug#1716 "Can't limit indexer to documents matching language" was fixed.
Bug#1725 "Navigation doesn't work when using a single cluster node" was fixed.
Bug#1726 "DateFormat doesn't work in cluster" was fixed.
Relevancy improvement: Fixed that average word distance was considered to be very big in the case when words were found in different sections (e.g. one word in "body" and one word in "title"). Word pairs from different sections are not taken into account anymore for distance calculation.
Relevancy improvement: Average word distance is now calculated taking into account "wf" values for the sections - the final score is now more sensitive to word distances in the sections with higher "wf" values.
DBMode=blob&LiveUpdates=yes is now understood in the DBAddr parameter. If LiveUpdates=yes is specified, it's possible to crawl up to several thousand documents without fully recreating the search index with "indexer -Eblob". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.
The "text" and "html" keywords were added into the "Section" command syntax, to apply either text or HTML parser for data returned from a "simple" HTDBDoc query. This option is useful if the source SQL table stores data in HTML format. The default value is "text". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.
Column with name "last_mod_time" is now considered as modification time of the documents, returned from "simple" HTDBDoc queries.
A new syntax to display N rightmost characters from a template variable was added. For example, $(URL:-10). Thanks to Eggert Ehmke for the idea and the original patch.
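As an illustration of what "N rightmost characters" means for $(URL:-10), here is a minimal sketch. The helper name and the sample URL are hypothetical, and the behaviour for values shorter than N (returning the whole value) is an assumption.

```python
def rightmost(value: str, n: int) -> str:
    """Return the last n characters of value (the whole value if it is shorter)."""
    # A non-positive n yields an empty string; value[-0:] would otherwise
    # return the entire string.
    return value[-n:] if n > 0 else ""


# Hypothetical URL value of a search result.
url = "http://www.example.com/index.html"
tail = rightmost(url, 10)
```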
Performance improvements in score calculation with non-empty "nwf" parameter were made.
Fixed that "simple" HTDBDoc queries didn't work with Interbase/Firebird, because the driver returned empty column names.
Fixed a bug which made search.cgi crash when generating a link to "cached copy" with a template having multiple DBAddr commands.
Fixed a bug in character set conversion, which made indexer crash in rare cases.
Fixed that "indexer -Cw" didn't empty the "bdict" table.
Fixed a bug in cluster code which made search.cgi crash on processing of a front-end template with "Suggest yes" when search didn't return any results.
Cluster support was added. A typical cluster consists of several database machines and a single front-end machine. The front-end machine receives HTTP requests from a user's browser, forwards search queries to the database machines using the HTTP protocol, receives back a limited number of top search results (in a simple XML format based on the OpenSearch specifications) from every database machine, then parses and merges the results, and displays them ordered by score, applying an HTML template. This approach distributes operations with high CPU and hard disk consumption between the database machines in parallel, leaving simple merge and HTML template processing functions to the front-end machine. As of version 3.3.0, mnoGoSearch allows joining up to 256 database machines into a single cluster.
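The merge step performed by the front-end machine can be sketched as follows. This is only an illustration under stated assumptions: each database machine is assumed to return its hits already sorted by descending score, the (score, url) tuples and the example URLs are hypothetical, and XML parsing and HTTP transport are left out.

```python
import heapq
from itertools import islice


def merge_results(per_node_results, limit):
    """Merge per-node hit lists (each sorted by descending score) into one top list."""
    # heapq.merge with reverse=True expects inputs sorted from largest to
    # smallest and yields the combined stream in the same order.
    merged = heapq.merge(*per_node_results, key=lambda hit: hit[0], reverse=True)
    # Only the first `limit` hits are needed for one page of results.
    return list(islice(merged, limit))


# Hypothetical top results returned by two database machines.
node_a = [(0.93, "http://a.example/1"), (0.40, "http://a.example/2")]
node_b = [(0.88, "http://b.example/1"), (0.75, "http://b.example/2")]

top = merge_results([node_a, node_b], limit=3)
```

Because every node returns only a limited number of its best hits, the front end merges short lists instead of sorting the full result set.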
node.xml-dist, an XML template for a cluster database machine, is now installed into the /etc directory.
"DBAddr http://hostname/search.cgi/node.xml" search.htm command was added, to specify an URL of a cluster database machine interface with XML format.
"DBAddr file:///path/to/node.xml" search.htm command was added, to specify a static XML search response. This is mostly for test purposes.
Two cluster types were implemented: a merge cluster, which joins results from several independent databases, each created by its own indexer.conf, and a distributed cluster, created by a single indexer.conf, in which indexer automatically distributes the search index between database machines.
The default distribution type was changed from "remainder" to "quotient". Thus, for an indexer.conf having three DBAddr commands, distribution is done as follows:
URLs with seed 0..85 go to the first DBAddr
URLs with seed 86..170 go to the second DBAddr
URLs with seed 171..255 go to the third DBAddr
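A minimal sketch of the two distribution types follows. The formulas are assumptions, not taken from the mnoGoSearch sources: the quotient formula was chosen because it yields contiguous seed ranges with seed 85 going to the first DBAddr, seed 170 to the second and seed 171 to the third, and the modulo-based variant is only a guess at what the previous default did.

```python
SEED_COUNT = 256  # seeds are in the range 0..255


def node_by_quotient(seed: int, nodes: int) -> int:
    """Assumed quotient distribution: contiguous seed ranges per node."""
    return seed * nodes // SEED_COUNT


def node_by_remainder(seed: int, nodes: int) -> int:
    """Assumed previous distribution: seeds interleaved between nodes."""
    return seed % nodes


def seed_ranges(nodes: int):
    """Return the (first_seed, last_seed) range served by each node."""
    ranges = {}
    for seed in range(SEED_COUNT):
        node = node_by_quotient(seed, nodes)
        low, high = ranges.get(node, (seed, seed))
        ranges[node] = (min(low, seed), max(high, seed))
    return [ranges[node] for node in sorted(ranges)]


three_node_ranges = seed_ranges(3)
```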
Maximum amount of words collected from a document was changed from 64K words per document to 64K words per section - positions are now enumerated per section, starting from the beginning of each section separately.
"SaveSectionSize yes/no" indexer.conf and search.htm command was added. When SaveSectionSize is set to yes, indexer stores additional information about section sizes, making it possible to generate better score values, as well as to do "exact section match" searches. Default value is "yes".
Relevancy improvement: "WordDensityFactor num" search.htm command was added. Num is a number in the range 0..255 to specify impact of word frequency on the result score. This feature works with "SaveSectionSize yes". The default value is 25.
Exact section match syntax was added:
title="Apache web server"
This feature works with "SaveSectionSize yes".
"WordFormFactor num" search.htm command was added to give more weight to the word forms originally written in the search query and less weight to generated word forms using ispell dictionaries and synonyms. Use with a number 0..255. Default value is 255. 255 means to give the same weight to the original and generated forms. 0 means maximum effect, i.e. weight for a generated word form is much smaller than weight for the original word form.
The performance of the excerpt generation code was improved. Excerpt generation from CachedCopy is now about 6-12% faster.
Using URL and Tag limits is now possible with "indexer -Eblob", e.g.:
./indexer -Eblob -u "%subdir%"
./indexer -Eblob -t tag
This is to generate a search index over a subset of all documents collected during crawling.
Using "Limit" command is also possible with "indexer -Eblob", e.g.:
indexer.conf command:
Limit subdir "SELECT rec_id FROM url WHERE url LIKE '%/subdir/%'"
command line:
./indexer -Eblob --fl=subdir
"ResultContentType type" search.htm command was added to specify Content-Type header generated by search.cgi. The default value is "text/html".
"Dehyphenate yes/no" search.htm command was added. When "Dehyphenate yes" is specified, searching for "peace-making" also will return documents having "peacemaking". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.
Clone template variables were changed: clones are now returned in the same row with the document itself, using CloneN prefix, e.g.: $(Clone0.URL). The "<!--clone-->" search.htm section and the $(CL) variable are not supported anymore.
DetectClones is now "no" by default, for performance purposes.
"CollectLinks yes/no" indexer.conf command was added. The default value is "no", which improves indexing performance by not populating the "links" table. As a side effect, PopRank calculation is not possible in the default configuration. If PopRank is important for your installation, specify "CollectLinks yes" in indexer.conf.
Default sort order was changed from "RP" (score, then popularity) to "R" (score). This change improves search performance for the installations where PopRank is not important.
Indexer now honors <a rel="nofollow"> tags. Thanks to Jeff Veit for the contribution.
A simplified format of HTDBDoc command was added:
HTDBDoc "SELECT title, body FROM docs WHERE id=$2"
SQL column names are associated with "Section" names. Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.
It's now possible to specify wf as a parameter for the DBAddr search.htm command, which is useful when merging two or more databases, to give a higher score to results coming from a desired database:
DBAddr mysql://root@localhost/db1/?wf=FFFF
DBAddr mysql://root@localhost/db2/?wf=1111
DBAddr mysql://root@localhost/db3/?wf=1111
MaxResults parameter was added for DBAddr, which is useful to add a limited number of sponsored links at the top of search results:
DBAddr mysql://root@localhost/avd/?wf=FFFF&MaxResults=1
DBAddr mysql://root@localhost/db1/?wf=1111
DBAddr mysql://root@localhost/db2/?wf=1111
$(DBOrder) template variable was added to display the original order of a document in its database result, before multiple DBAddr search results were merged into the final result. It is equal to $(Order) when using only a single DBAddr command in search.htm.
FOR template operator was added. Loop limits can be both constants:
<!FOR NAME="a" FROM="10" TO="20">a=$(a)<!ENDFOR>
and variables that were previously set, for example by the SET operator:
<!SET NAME="from" CONTENT="80">
<!SET NAME="to" CONTENT="90">
<!FOR NAME="a" FROM="$(from)" TO="$(to)">a=$(a)<!ENDFOR>
"[no title]" is not added automatically anymore: an empty string is printed instead. One can use IF template operator to reproduce 3.2.x behaviour:
<!IF NAME="title" CONTENT="">[no title]<!ELSE>$&(title)<!ENDIF>
Various indexing and search performance improvements were made.
Fixed that indexer didn't work with MySQL-5.1.15-GPL.
"indexer -?" now prints its help page to stdout instead of stderr.
A "#version" record is now put into the table "bdict" when running "indexer -Eblob". mnoGoSearch version ID is put as its value. For example, mnoGoSearch 3.3.0 will put "30300" string.
Preliminary implementation for DBMode=rawblob in search.htm was added. This mode is designed for direct search from the table "bdicti" without having to run "indexer -Eblob" and is intended for use with small search databases as a replacement for DBMode=single. In the future releases it will also be reused for real-time index updates - to avoid running "indexer -Eblob" when only a small number of documents were changed.