Knowledgebase:
Compatibility of stemmed searches and generic language support
11 September 2014 02:02 PM

Summary

In MarkLogic Server v7.0-2, the language-specific tokenizer keys were removed for languages for which MarkLogic provides only generic language support, so that all of these languages now share a single key. Greek is one example of a language in this class. This change was made as part of an optimization for languages for which MarkLogic Server has advanced stemming and tokenization support.

Stemmed searches that include characters from languages without advanced language support may not return the expected results when they are run on MarkLogic Server v7.0-2 or later against content that was loaded on a version earlier than v7.0-2.

Resolution

To run these stemmed searches successfully, you can either:

  • Reindex the database; or
  • Reinsert the affected documents (i.e. the documents that contain characters in languages for which MarkLogic Server only has generic language support).
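
As a minimal sketch of the reinsert option, the query below rewrites each affected document in place so that it is tokenized again by the current server version. The URI list is a hypothetical placeholder, and the sketch assumes the set is small enough to process in a single transaction; a larger set would need to be batched. Permissions, collections and quality are passed explicitly so that the reinsert does not reset them to defaults.

```xquery
xquery version "1.0-ml";

(: Hypothetical list of affected URIs; replace with the documents
   that contain characters from generically supported languages. :)
let $uris := ('test.xml')
for $uri in $uris
return
  (: Reinsert the existing content under the same URI, carrying over
     the document's current permissions, collections and quality. :)
  xdmp:document-insert(
    $uri,
    fn:doc($uri),
    xdmp:document-get-permissions($uri),
    xdmp:document-get-collections($uri),
    xdmp:document-get-quality($uri)
  )
```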

If neither option is possible in your environment, you can run the query unstemmed.
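
One way to run a query unstemmed is to wrap the search text in cts:word-query with the "unstemmed" option. The sketch below applies this to the test document used in the example that follows. It assumes the word searches index is enabled on the database; without it, the unfiltered estimate may not be resolved from an unstemmed index.

```xquery
xquery version "1.0-ml";

(: Same checks as the stemmed query, but with stemming disabled
   for the search term. :)
xdmp:estimate(
  cts:search(fn:doc('test.xml'), cts:word-query('ε', 'unstemmed'))
),
cts:contains(fn:doc('test.xml'), cts:word-query('ε', 'unstemmed'))
```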

An Example

The following example demonstrates the issue:

  1. On MarkLogic Server version 7.0-1, insert a document (test.xml) that contains the Greek character 'ε'.
  2. Run this query:
    xdmp:estimate( cts:search( doc('test.xml'), 'ε')),
    cts:contains( doc('test.xml'), 'ε')
  3. The query will return the correct results: 1, true
  4. Upgrade MarkLogic Server to version 7.0-3 or later and run the query again.
  5. The query will return incorrect results: 0, false
  6. Reindex the database and re-run the query.
  7. The query will return the correct results once again: 1, true
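
For step 1, the test document could be created with a query along the following lines. This is only a sketch: the element name is hypothetical, and any document containing the character 'ε' should serve the same purpose.

```xquery
xquery version "1.0-ml";

(: Insert a small test document containing the Greek character 'ε'.
   The <doc> element name is an illustrative assumption. :)
xdmp:document-insert('test.xml', <doc>ε</doc>)
```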
     