Abstract
In MarkLogic Server version 9, the default tokenization and stemming code has changed for all languages (except English tokenization, which is unchanged). Some tokenization and stemming behavior will therefore differ between MarkLogic 8 and MarkLogic 9. We expect that, in most cases, results will be better in MarkLogic 9.
The Release Notes explain how to manage this change under Default Stemming and Tokenization Libraries Changed for Most Languages, and describe further related features under New Stemming and Tokenization.
In-depth discussion is provided below for those interested in details.
General Comments on Incompatibilities
General implications of tokenization incompatibilities
If you do not reindex, old content may no longer match the same searches, even for unstemmed searches.
General tokenization incompatibilities
There are some edge-case changes in the handling of apostrophes in some languages. In general this is not a problem, but some specific words may now include an apostrophe in a token, or break at one, where they previously did not.
Tokenization is generally faster for all languages except English and Norwegian (which use the same tokenization as before).
General implications of stemming incompatibilities
Where there is only one stem, and it is now different: Old data will not match stemmed searches without reindexing, even for the same word.
Where the new stems are more precise: Content that used to match a query may not match any more, even with reindexing.
Where there are new stems, but the primary stem is unchanged: Content that used to not match a query may now match it with advanced stemming or above. With basic stemming there should be no change.
Where the decompounding is different, but the concatenation of the components is the same: Under decompounding, content may match a query when it used to not match, or may not match a query when it used to match, when the query or content involves something with one of the old/new components. Matching under advanced or basic stemming would be generally the same.
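The cases above can be sketched in miniature. This is a simplified model, not MarkLogic's index format or stemming output: the words and stems are invented placeholders, and "basic" vs. "advanced" is reduced to "index only the primary stem" vs. "index all stems".

```python
# Simplified model of stemmed matching. All words/stems are invented;
# this is not MarkLogic's actual stemmer output or index layout.

def index_terms(word, stem_table, level):
    """Terms stored in the index for a word under a given stemming level."""
    stems = stem_table.get(word, [word])
    return {stems[0]} if level == "basic" else set(stems)

def query_matches(query_word, indexed_terms, stem_table, level):
    # A stemmed query matches when the query's stems intersect the terms
    # that were written into the index for the content word.
    return bool(index_terms(query_word, stem_table, level) & indexed_terms)

# Case 1: the single stem changed. Content indexed under the old stemmer
# no longer matches a query stemmed with the new one, until reindexed.
old_table = {"word1": ["stemA"]}
new_table = {"word1": ["stemB"]}
stale = index_terms("word1", old_table, "basic")   # what's in the forest
fresh = index_terms("word1", new_table, "basic")   # after reindexing
print(query_matches("word1", stale, new_table, "basic"))  # False: stale index
print(query_matches("word1", fresh, new_table, "basic"))  # True after reindex

# Case 2: new alternative stems, primary unchanged. Basic stemming is
# unaffected; advanced stemming can now match more content than before.
old_table2 = {"word2": ["stemC"], "word3": ["stemD"]}
new_table2 = {"word2": ["stemC", "stemD"], "word3": ["stemD"]}
idx = index_terms("word3", new_table2, "advanced")
print(query_matches("word2", idx, old_table2, "advanced"))  # False: old stems
print(query_matches("word2", idx, new_table2, "advanced"))  # True: shared stemD
```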
General stemming incompatibilities
- MarkLogic now has general algorithms backing up explicit stemming dictionaries. Words not found in the default dictionaries will sometimes be stemmed when they previously were not.
- Diminutives/augmentatives are not usually stemmed to base form.
- Comparatives/superlatives are not usually stemmed to base form.
- There are differences in the exact stems for pronoun case variants.
- Stemming is more precise and restricted by common usage. For example, if the past participle of a verb is not usually used as an adjective, then the past participle will not be included as an alternative stem. Similarly, plural forms that only have technical or obscure usages might not stem to the singular form.
Stems for past participles will typically include the past participle itself as an alternative stem.
- The preferred order of stems is not always the same: this will affect search under basic stemming.
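The first point above — explicit dictionaries backed by a general algorithm — can be sketched as a dictionary-first stemmer with a crude suffix-stripping fallback. The dictionary entries, suffix rules, and function names are invented for illustration and bear no relation to MarkLogic's actual stemming implementation.

```python
# Sketch of dictionary-first stemming with an algorithmic fallback.
# The dictionary and suffix rules here are invented illustrations.

DICTIONARY = {"ran": ["run", "ran"], "mice": ["mouse"]}

# Crude suffix-stripping fallback for words the dictionary does not know.
SUFFIXES = ("ings", "ing", "ed", "es", "s")

def algorithmic_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def stem(word):
    # Explicit dictionary entries win; unknown words fall back to the
    # algorithm, so they are now stemmed where previously they were not.
    if word in DICTIONARY:
        return DICTIONARY[word]
    return [algorithmic_stem(word)]

print(stem("mice"))      # ['mouse']  (dictionary hit)
print(stem("ran"))       # ['run', 'ran']  (dictionary, ordered stems)
print(stem("blogging"))  # ['blogg']  (fallback strips 'ing')
```

The ordering of the dictionary lists matters in this model: under basic stemming only the first stem would be indexed, which is why a changed preferred order of stems affects search results.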
Reindexing
It is advisable to reindex to be sure there are no incompatibilities. Where the data in the forests (tokens or stems) does not match the current behavior, reindexing is recommended; this requires either a forced reindex or a reload of the specific documents containing the affected data. For many languages this can be avoided if queries do not touch the specific cases. For certain languages (see below) the incompatibility is great enough that reindexing is essential.
Language Notes
Below we give some specific information and recommendations for various languages.
Arabic
stemming
The Arabic dictionaries are much larger than before. Implications: (1) better precision, but (2) slower stemming.
Chinese (Simplified)
tokenization
Tokenization is broadly incompatible.
The new tokenizer uses a corpus-based language model. Better precision can be expected.
recommendation
Reindex all Chinese (simplified).
Chinese (Traditional)
tokenization
Tokenization is broadly incompatible.
The new tokenizer uses a corpus-based language model. Better precision can be expected.
recommendation
Reindex all Chinese (traditional).
Danish
tokenization
This language now has algorithmic stemming, and may have slight tokenization differences around certain edge cases.
recommendation
Reindex all Danish content if you are using stemming.
Dutch
stemming
There will be much more decompounding in general, but MarkLogic will not decompound certain known lexical items (e.g., "bastaardwoorden").
recommendation
Reindex Dutch if you want to query with decompounding.
English
stemming
Stems for British spelling variants may include the British variant as an additional stem, although the first stem will still be the US variant.
Stemming produces more alternative stems. Implications are (1) stemming is slightly slower and (2) index sizes are slightly larger (with advanced stemming).
Finnish
tokenization
This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation
Reindex all content in this language if you are using stemming.
French
See general comments above.
German
stemming
Decompounding now applies to more than just pure noun combinations. For example, it applies to "noun plus adjectives" compound terms. Decompounding is more aggressive, which can result in identification of more false compounds. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) for compound terms, search gives better recall, with some loss of precision.
recommendation
Reindex all German.
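The effect of decompounding on matching can be sketched as follows: under decompounding, the components of a compound are indexed alongside the whole word, so a query on one component can match the compound. The split shown (Dampfschiff = Dampf + Schiff) is a real German compound, but the indexing model is a simplification, not MarkLogic's implementation.

```python
# Sketch of decompounded indexing: the index holds the full word plus each
# component. A simplified model, not MarkLogic's actual index behavior.

def index_terms(word, decompounds):
    return {word} | set(decompounds.get(word, []))

old_splits = {}                                    # old: little decompounding
new_splits = {"dampfschiff": ["dampf", "schiff"]}  # new: aggressive splitting

query = "schiff"
print(query in index_terms("dampfschiff", old_splits))  # False: no components
print(query in index_terms("dampfschiff", new_splits))  # True: component match
```

This is why more aggressive decompounding improves recall for compound terms (and costs index space), while false compound splits cost some precision.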
Hungarian
tokenization
This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation
Reindex all content in this language if you are using stemming.
Italian
See general comments above.
Japanese
tokenization
Tokenization is broadly incompatible.
The tokenizer provides internal flags that the stemmer requires. This means that (1) tokenization is incompatible for all words at the storage level due to the extra information and (2) if you install a custom tokenizer for Japanese, you must also install a custom stemmer.
stemming
Stemming is broadly incompatible.
recommendation
Reindex all Japanese content.
Korean
stemming
Particles (e.g., 이다) are dropped from stems; they used to be treated as components for decompounding.
There is different stemming of various honorific verb forms.
North Korean variants are not in the dictionary, though they may be handled by the algorithmic stemmer.
recommendation
Reindex Korean unless you use decompounding.
Norwegian (Bokmal)
stemming
Previously, hardly any decompounding was in evidence; now it is pervasive.
Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where it comes to compounds.
recommendation
Reindex Bokmal if you want to query with decompounding.
Norwegian (Nynorsk)
stemming
Previously hardly any decompounding was in evidence; now it is pervasive.
Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where it comes to compounds.
recommendation
Reindex Nynorsk if you want to query with decompounding.
Norwegian (generic 'no')
stemming
Previously 'no' was treated as an unsupported language; now it is treated as both Bokmal and Nynorsk: for a word present in both dialects, all stem variants from both will be present.
recommendation
Do not use 'no' unless you really must; reindex if you want to query it.
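The union behavior described above can be sketched as follows. The words and stems are invented placeholders, not real Norwegian output; the point is only that generic 'no' merges the stem variants of both dialects.

```python
# Sketch of generic 'no' stemming as the union of both dialects' stems.
# Words and stems are invented placeholders, not real Norwegian output.

bokmal_stems = {"word": ["stemA", "stemB"]}
nynorsk_stems = {"word": ["stemA", "stemC"]}

def stems_no(word):
    # Generic 'no' carries every variant from both dialects, de-duplicated,
    # which makes stemmed search broader (and indexes larger) than either
    # dialect alone.
    merged = bokmal_stems.get(word, []) + nynorsk_stems.get(word, [])
    return list(dict.fromkeys(merged))  # preserve order, drop duplicates

print(stems_no("word"))  # ['stemA', 'stemB', 'stemC']
```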
Persian
See general comments above.
Portuguese
stemming
More precision with respect to feminine variants (e.g., ator vs atriz).
Romanian
tokenization
This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation
Reindex all content in this language if you are using stemming.
Russian
stemming
Inflectional variants of cardinal or ordinal numbers are no longer stemmed to a base form.
Inflectional variants of proper nouns may stem together due to the backing algorithm, but it will be via affix-stripping, not to the nominal form.
Stems for many verb forms used to be the perfective form; they are now the simple infinitive.
Stems used to drop ё but now preserve it.
recommendation
Reindex all Russian.
Spanish
See general comments above.
Swedish
stemming
Previously hardly any decompounding was in evidence; now it is pervasive.
Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where it comes to compounds.
recommendation
Reindex Swedish if you want to query with decompounding.
Tamil
tokenization
This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation
Reindex all content in this language if you are using stemming.
Turkish
tokenization
This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation
Reindex all content in this language if you are using stemming.