Multi-threaded indexing is now possible, which significantly reduces the time needed to create the search index (indexer --index) on a multi-CPU machine. indexer --index now honors the -N option, which specifies the number of threads to start.
A new command IndexerThreads was added to specify the default number of threads started for indexing.
A new command IndexCacheSize was added to specify the amount of RAM used for the search index cache when running indexer --index.
A new command IPRequestPerMinLimit was added for more polite crawling.
A new command FollowLinks was added to fine-tune which kinds of links should be followed.
The CollectLinks command was enhanced to fine-tune which kinds of links should be stored in the database.
A new command AjaxLinks was added, to crawl AJAX-run Web sites.
The DNSCacheTimeOut command was added.
The Proxy command was changed to understand the full URL notation including the authorization part (previously it required the host:port format). The ProxyAuthBasic command was removed. Also, Proxy can now accept a list of proxy addresses, which makes indexer randomly choose one of the addresses to download every document.
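As an illustration of the behavior described above, here is a minimal Python sketch that picks one proxy at random from a list and splits a full proxy URL into its parts, including the authorization part. It is not mnoGoSearch code: the function name pick_proxy, the PROXIES list, and the example.com addresses are hypothetical and used only for illustration.

```python
import random
from urllib.parse import urlsplit


def pick_proxy(proxies, rng=random):
    """Pick one proxy URL at random and split it into its parts."""
    parts = urlsplit(rng.choice(proxies))
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,      # None when the URL has no authorization part
        "password": parts.password,  # None when the URL has no authorization part
    }


# Hypothetical proxy addresses, for illustration only.
PROXIES = [
    "http://user:secret@proxy1.example.com:3128",
    "http://proxy2.example.com:8080",
]

if __name__ == "__main__":
    # A new proxy is chosen for every document to download.
    for _ in range(3):
        print(pick_proxy(PROXIES))
```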
A special purpose section ResponseTime was added, to store document download time in the database.
Robots exclusion protocol improvements were made. indexer now understands the * and $ patterns in robots.txt, and respects the X-Robots-Tag HTTP header.
The Robots command now understands more possible values (robotstxt, xrobotstag, meta and rel) to fine tune which robot directives should be respected. Previously only yes (respect all robot directives) and no (ignore all robot directives) values were understood.
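The * and $ patterns mentioned above are commonly interpreted as "any sequence of characters" and "end of the URL path". The following Python sketch shows one way to match a path against such a pattern by translating it into a regular expression. It illustrates the general technique under that assumption and is not taken from the indexer source; the function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """Patterns match from the start of the path (a prefix match unless '$' is used)."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(path_matches("/*.pdf$", "/docs/a.pdf"))
    print(path_matches("/*.pdf$", "/docs/a.pdf?x=1"))
```

A rule without * or $ keeps its traditional meaning as a plain path prefix.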
When crawling with multiple threads (indexer -Nxxx) on Windows, threads do not lock each other any more when resolving host names.
Search templates now use a C/C++-style language instead of the old language with cumbersome tag based operators.
Extended word statistics variables are now available in the search template.
New document properties UniqueWordHitVector and SectionHitVector are now available at search time, presenting information about the word hit distribution inside the entire document and its individual sections.
When searching in the m=any (find any of the words) mode, search.cgi now displays the query words that are missing in the document.
TODO: Line ending modifiers are now understood in template operators, to avoid redundant empty lines in search.cgi output.
The default template was rewritten to use CSS styles instead of the inline HTML formatting. This makes it easier to customize the search template according to the user's site design. Various other minor template improvements were made.
The quality for the document body snippet was improved for HTML documents.
Fixed that OS environment variables were not available in search templates on Windows.
Cached copies are now stored in a separate table cachedcopy without wrapping to base64.
A new command CachedCopyEncoding was added to control cached copy compression.
The table bdicti was removed. In earlier versions it stored pre-parsed documents and was used at indexing time. Now indexing is performed directly from the cached document copies. This change reduces the disk space used by the mnoGoSearch database.
In earlier versions, indexer --index experienced a significant performance degradation when the table bdicti grew bigger than the OS disk cache. The new format fixes this problem. Indexing time now grows linearly with the document collection size.
The third parameter in the command Section (responsible for the section length) is now optional. The default indexer.conf file now does not specify the length values for the sections, so indexer does not store any information into the table urlinfo by default. At search time, the variables containing various document parts (the body snippet, the title, content type, etc) can now be created from the document cached copies stored in the table cachedcopy. The table urlinfo is now empty in the default configuration, but it still can be used to store user defined sections, or the standard sections when needed.
Changes in the document sections configuration in indexer.conf do not require full re-crawling any longer. The changes immediately take effect after the next indexer --index run.
The link information storage format was changed for better performance and for easier indexing of the link text, e.g.:
<a href="http://www.site.com">link text</a>
Also, the new link format makes mnoGoSearch convenient for SEO purposes and for other kinds of site analysis. See the new structure for the table links for details.
A new section with the name ilinktext is now understood, meaning the text from the incoming links of the referencing documents, between the <a href=".."> and </a> tags.
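To make the notion of link text concrete, the following Python sketch collects (href, text) pairs from an HTML fragment, where the text is everything between the opening <a href=".."> tag and the closing </a> tag. It uses the standard html.parser module and is only an illustration; the class and function names are hypothetical, and it does not reflect how indexer stores link text in the database.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> element currently open, if any
        self._chunks = []   # text pieces seen inside the open <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def collect_links(html):
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(collect_links('<a href="http://www.site.com">link text</a>'))
```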
It's now possible to limit search to the incoming link text only by specifying the wf=00010000 parameter to search.cgi (assuming the default ID of the section ilinktext).
The structure of the table links for MySQL and PostgreSQL now uses partitioning for better indexing performance.
A new table redirect was added to store simple redirect links, e.g. the URL from the Location header of a 301 Moved Permanently HTTP response. Redirects are stored separately from the hypertext links for better performance of the popularity calculation.
The code performing grouping results by sites was rewritten. When crawling, indexer does not populate the table server with unique site names any more. This improves crawling performance. The column url.site_id was removed.
Popularity is now automatically calculated when running indexer --index in the default configuration.
A new command UsePopularity was added to change the default behavior.
A new command line option indexer --rewritepop is now understood to calculate popularity without recreating the entire search index.
A new command PopularityFactor was added to change how a document's popularity affects its search result score.
The -R command line parameter to indexer (which forced indexer to calculate popularity after crawling) is no longer supported.
The commands PopRankSkipSameSite, PopRankFeedBack, PopRankUseTracking, PopRankUseShowCnt, PopRankShowCntRatio and PopRankShowCntWeight were removed.
A few performance improvements were made in the built-in parsers, the character set and Unicode routines, and the memory management routines.
Fixed that the implementation of utf-8 did not detect malformed byte sequences in some cases, which led to database errors (e.g. invalid byte sequence for encoding "UTF8" in case of PostgreSQL).
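For readers who want to check input for this kind of problem before it reaches the database, here is a small Python sketch that tests whether a byte string is well-formed UTF-8. It relies on Python's own strict decoder, not on the mnoGoSearch routines, and the function name is hypothetical.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed_utf8("héllo".encode("utf-8")))
    # 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    print(is_well_formed_utf8(b"\xc3\x28"))
```

The strict decoder also rejects overlong encodings and encoded surrogates, which are typical sources of "invalid byte sequence" errors.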
indexer running in the SQL interpreter mode (e.g. --create, --drop, --sqlmon) no longer recognizes an empty comment immediately followed by a semicolon (e.g. /**/;) as the end of the current SQL query. This change allows the use of complex stored procedures in the structure. See the definition of the stored procedure links_insert_trigger_func implementing link partitioning in create/pgsql/create.txt as an example.
Some columns in the SQL structure were renamed to avoid using SQL reserved words (qinfo.value to qinfo.sval, qtrack.found to qtrack.nfound).
The column intag in the table dict was renamed to coord. The column intag in the tables bdict and dict00..dictFF was renamed to coords.
Experimental CMake support was added. Currently CMake is used to build mnoGoSearch on Windows only. On UNIX-like platforms it's still recommended to use the configure script generated by the GNU autotools.
mnoGoSearch is now compiled using the Filesystem Hierarchy Standard (FHS) layout by default, which is slightly different from the traditional mnoGoSearch layout. Use ./configure --disable-fhs-layout to compile with the traditional mnoGoSearch layout.
The code was reorganized in a more modular way: a large number of enums were introduced in place of untyped integer constants, database and thread handlers were added, and other code quality and extensibility improvements were made. These changes make the code easier to maintain and prepare the ground for a plugin infrastructure in the future. The API was slightly changed (e.g. the structure of the UDM_RESULT data type).
XML documents that start with <urlset xmlns="..."> or <sitemapindex xmlns="..."> are now automatically considered to be sitemap protocol files. The built-in XML parser collects links from such files, but does not put their words into the search index.
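The sitemap detection described above can be sketched in Python as follows: check the root element name, and if it is urlset or sitemapindex, collect the URLs from the loc elements instead of indexing the words. This is an illustration built on xml.etree.ElementTree, not the built-in mnoGoSearch XML parser; the function names are hypothetical, and the sketch checks only the root element name, not the presence of the xmlns attribute.

```python
import xml.etree.ElementTree as ET

SITEMAP_ROOTS = {"urlset", "sitemapindex"}


def local_name(tag):
    # ElementTree renders namespaced tags as '{namespace}name'.
    return tag.rsplit("}", 1)[-1]


def sitemap_links(xml_text):
    """Return the <loc> URLs if the document is a sitemap file, else None."""
    root = ET.fromstring(xml_text)
    if local_name(root.tag) not in SITEMAP_ROOTS:
        return None
    return [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text
    ]


if __name__ == "__main__":
    doc = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://www.site.com/a.html</loc></url>"
        "</urlset>"
    )
    print(sitemap_links(doc))
```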
The built-in HTML parser now understands
<meta property="name" content="...">
as a synonym for
<meta name="name" content="...">
and
<meta name="http-equiv" content="...">
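The synonym handling for meta tags can be illustrated with a short Python sketch based on html.parser. It treats the property attribute as a fallback for the name attribute when collecting meta values. This is only a sketch of the idea, not the built-in HTML parser; the class name is hypothetical, and the http-equiv form is deliberately left out.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect meta values keyed by name, accepting property as a synonym."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # Treat property="..." as a synonym for name="...".
        key = a.get("name") or a.get("property")
        if key and a.get("content") is not None:
            self.meta[key.lower()] = a["content"]


if __name__ == "__main__":
    collector = MetaCollector()
    collector.feed('<meta property="og:title" content="Hello">')
    collector.feed('<meta name="description" content="World">')
    print(collector.meta)
```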
The old file-based search result cache and the Cache command were removed. Use the new search result cache (introduced in 3.3.8) instead.
Support for categories was removed. The table categories was removed. The column server.category was removed. Use user defined limits instead.
Database structure files for Access, mSQL, Solid, SapDB were removed.
The Crosswords command and the crossdict table were removed.
The ResultContentType command was removed. Now search.htm explicitly outputs the desired content type.