8.5. Relevancy

8.5.1. Ordering documents

By default, DataparkSearch sorts results first by relevancy and then by popularity rank.

8.5.2. Relevancy calculation

The relevancy of every found document is calculated as 100% multiplied by the cosine of the angle between the weight vector of the query and the weight vector of the document. The number of vector coordinates equals the number of word forms in the search query multiplied by the number of sections defined in indexer.conf. Each coordinate corresponds to a query word matched in one of the document sections. The value of a coordinate depends on the weight of that section, defined by the wf parameter (see Section 8.1.3), and on the kind of match: the exact word from the search query, one of its word forms, or a synonym. One additional coordinate holds the average distance between the searched words in the document. In the query vector this coordinate is equal to 0.
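The cosine formula described above can be illustrated with a minimal Python sketch. The function name relevancy_percent and the sample vectors are hypothetical. The actual layout and values of the coordinates are determined by DataparkSearch itself and depend on the configure-time options.

```python
import math

def relevancy_percent(query_vec, doc_vec):
    """Return 100 * cos(angle between the two weight vectors)."""
    if len(query_vec) != len(doc_vec):
        raise ValueError("vectors must have the same number of coordinates")
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    if norm_q == 0 or norm_d == 0:
        # Assumption made for this sketch: an all-zero vector means zero relevancy.
        return 0.0
    return 100.0 * dot / (norm_q * norm_d)

# Hypothetical example: 2 query word forms x 2 sections = 4 coordinates,
# plus one trailing coordinate for the average word distance (0 for the query).
query = [1.0, 1.0, 1.0, 1.0, 0.0]
doc = [1.0, 0.0, 1.0, 1.0, 0.5]
score = relevancy_percent(query, doc)
```

In this sketch a document vector that points in the same direction as the query vector scores 100, and a larger value in the distance coordinate turns the document vector away from the query vector and lowers the score.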

Since section definitions are located only in the indexer.conf file, use the NumSections command in searchd.conf or in search.htm to specify the number of sections used. By default, this value is 256. Note that NumSections does not affect document ordering, only the relevancy value.

Table 8-2. Configure-time parameters to tune relevancy calculation (switches for configure)

--enable-fullrel

This option enables the full version of the relevancy calculation. Value by default: disabled (the fast relevancy calculation is used).

--disable-reldistance

This option disables accounting for the average word distance in the relevancy calculation. Value by default: enabled.

--disable-relposition

This option disables accounting for the position of the first query word in the relevancy calculation. Value by default: enabled.

--disable-relwrdcount

This option disables accounting for word counts in the relevancy calculation. Value by default: enabled.

--with-bestpos=NUM

This option specifies NUM as the best value for the position of the first query word in a found document. Value by default: 4.

--with-bestwrdcnt=NUM

This option specifies NUM as the best number of occurrences of each query word in a found document. Value by default: 11.

--with-distfactor=NUM

This option specifies NUM as the factor applied to the average word distance in the relevancy calculation. Value by default: 0.2.

--with-posfactor=NUM

This option specifies NUM as the factor applied to the difference between the position of the first query word in a found document and the best position specified by the --with-bestpos option. Value by default: 0.5.

--with-wrdcntfactor=NUM

This option specifies NUM as the factor applied to the difference between the count of query words in a found document and the best value specified by the --with-bestwrdcnt option. Value by default: 0.4.

--with-wrdunifactor=NUM

This option specifies NUM as the factor applied to the deviation of the query word counts from a uniform distribution. Value by default: 0.6.

8.5.3. Popularity rank

DataparkSearch supports two methods of popularity rank calculation. The method used in previous versions is called "Goo", and the new method is called "Neo". By default, the Goo method is used. To select the desired PopRank calculation method, use the PopRankMethod command:


PopRankMethod Neo

For the Neo method, and for the full functionality of the Goo method, you need to enable link collection with the CollectLinks yes command in your indexer.conf file. This slows indexing down slightly. By default, link collection is not enabled.

If you place the PopRankSkipSameSite yes command in the indexer.conf file, indexer will use only inter-site links (i.e. links from a page on one site to a page on another site) for the popularity rank calculation.
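The effect of this filter can be pictured with a short Python sketch. It assumes that a "site" is identified by the host name of the URL, which is an assumption of this example and not a statement about how indexer decides site membership. The function name inter_site_links is hypothetical.

```python
from urllib.parse import urlsplit

def inter_site_links(links):
    """Keep only (source_url, target_url) pairs whose host names differ.

    Assumption of this sketch: two URLs belong to the same site
    exactly when their host names are equal.
    """
    return [
        (src, dst)
        for src, dst in links
        if urlsplit(src).hostname != urlsplit(dst).hostname
    ]

links = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # same site
    ("http://a.example/index.html", "http://b.example/"),            # inter-site
]
kept = inter_site_links(links)
```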

8.5.3.1. "Goo" popularity rank calculation method

The popularity rank is calculated in two stages. In the first stage, the value of the Weight parameter of every server is divided by the number of links from this server. This gives the weight of one link from this server. In the second stage, for every page the weights of all links pointing to this page are summed. This sum is the popularity rank of the page.

By default, the value of the Weight parameter is equal to 1 for all indexed servers. You may change this value with the Weight command in the indexer.conf file, or directly in the server table if you load the server configuration from this table.
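The two stages described above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the indexer implementation. The function name goo_popularity_rank and the data layout (a list of server and target page pairs) are assumptions made for the example.

```python
from collections import defaultdict

def goo_popularity_rank(links, server_weight=None):
    """Compute a Goo-style popularity rank.

    links: iterable of (source_server, target_page) pairs.
    server_weight: optional dict of per-server Weight values (default 1).
    """
    server_weight = server_weight or {}
    links = list(links)

    # Stage 1: weight of one link = server Weight / number of links from the server.
    out_count = defaultdict(int)
    for server, _page in links:
        out_count[server] += 1
    link_weight = {
        server: server_weight.get(server, 1.0) / count
        for server, count in out_count.items()
    }

    # Stage 2: rank of a page = sum of the weights of all links pointing to it.
    rank = defaultdict(float)
    for server, page in links:
        rank[page] += link_weight[server]
    return dict(rank)

links = [
    ("a.example", "http://b.example/1"),
    ("a.example", "http://b.example/2"),
    ("c.example", "http://b.example/1"),
]
ranks = goo_popularity_rank(links)
```

With the default Weight of 1, each of the two links from a.example weighs 0.5 and the single link from c.example weighs 1, so the first page gets a rank of 1.5 and the second a rank of 0.5.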

If you place the PopRankFeedBack yes command in the indexer.conf file, indexer will calculate site weights before the page rank calculation. To do that, indexer calculates the sum of the popularity ranks of all pages from the same site. If this sum is greater than 1, the site weight is set to this sum; otherwise, the site weight is set to 1.
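The site weight rule above amounts to taking the larger of the summed rank and 1. A minimal Python sketch, assuming a hypothetical mapping from pages to sites (the names feedback_site_weights, page_rank and site_of are illustrative only):

```python
def feedback_site_weights(page_rank, site_of):
    """Derive site weights from the popularity ranks of the site's pages.

    page_rank: dict mapping page -> popularity rank.
    site_of:   dict mapping page -> site (hypothetical layout for this sketch).
    """
    totals = {}
    for page, rank in page_rank.items():
        site = site_of[page]
        totals[site] = totals.get(site, 0.0) + rank
    # The weight is the sum if it is greater than 1, otherwise 1.
    return {site: total if total > 1 else 1.0 for site, total in totals.items()}

page_rank = {"p1": 1.5, "p2": 0.5, "p3": 0.25}
site_of = {"p1": "a.example", "p2": "a.example", "p3": "b.example"}
weights = feedback_site_weights(page_rank, site_of)
```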

If you place the PopRankUseTracking yes command in the indexer.conf file, indexer will calculate the site weight as the number of tracked queries restricted to this site.

If you place the PopRankUseShowCnt yes command in the search.htm (or searchd.conf) file, then for every result shown to the user the corresponding url.shows value is increased by 1, provided the relevancy of this result is greater than or equal to the value specified by the PopRankShowCntRatio command (the default value is 25.0). If you place PopRankUseShowCnt yes in the indexer.conf file, indexer will add to the URL's PopularityRank the value of url.shows multiplied by the value specified by the PopRankShowCntWeight command (the default value is 0.01).
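The two halves of this mechanism can be sketched in Python. The constants mirror the documented defaults of PopRankShowCntRatio and PopRankShowCntWeight. The function names and the in-memory dict standing in for the url.shows column are assumptions made for illustration.

```python
SHOW_CNT_RATIO = 25.0   # documented default of PopRankShowCntRatio
SHOW_CNT_WEIGHT = 0.01  # documented default of PopRankShowCntWeight

def record_show(shows, url, relevancy, ratio=SHOW_CNT_RATIO):
    """Search side: count a shown result only if it is relevant enough."""
    if relevancy >= ratio:
        shows[url] = shows.get(url, 0) + 1

def boosted_rank(pop_rank, show_count, weight=SHOW_CNT_WEIGHT):
    """Indexer side: add the weighted show count to the popularity rank."""
    return pop_rank + show_count * weight

shows = {}  # stands in for the url.shows values in this sketch
record_show(shows, "http://b.example/1", relevancy=30.0)  # counted
record_show(shows, "http://b.example/1", relevancy=25.0)  # counted (equal to ratio)
record_show(shows, "http://b.example/1", relevancy=10.0)  # ignored
rank = boosted_rank(1.5, shows["http://b.example/1"])
```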

8.5.3.2. "Neo" popularity rank calculation method

This method treats all pages as neurons and the links between pages as links between neurons, so an error back-propagation algorithm can be used to train this neural network. The popularity rank of a page is the activity level of the corresponding neuron.

You may use the PopRankNeoIterations command to specify the number of iterations of the Neo Popularity Rank calculation. The default value is 3.

By default, the Neo Popularity Rank is calculated along with indexing. To speed up indexing, you may postpone the Popularity Rank calculation using the PopRankPostpone command:


PopRankPostpone yes

Then you may calculate the Neo Popularity Rank after indexing, in the same way as for the Goo method, i.e.: indexer -TR

8.5.4. Boolean search

Please note that in a boolean search for two or more words you have to enter the operators (&, |, ~) explicitly. That is, it is necessary to enter "a & book" instead of "a book" (without the quotation marks).
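A front end that builds boolean queries from plain user input could insert the operator itself. The following Python sketch assumes that every word should be joined with the & operator. The helper name to_boolean_query is hypothetical and not part of DataparkSearch.

```python
def to_boolean_query(text):
    """Join whitespace-separated words with the & operator."""
    return " & ".join(text.split())

query = to_boolean_query("a book")
```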

8.5.5. Crosswords

This feature assigns the words between <a href="xxx"> and </a> also to the document this link leads to. It works in SQL database mode and is not supported in cache mode. To enable Crosswords, use the CrossWords yes command in indexer.conf and search.htm.