MarkLogic Server v9 Tokenization and Stemming 04 October 2019 04:32 PM
Abstract

In MarkLogic Server version 9, the default tokenization and stemming code has been changed for all languages (except English tokenization). Some tokenization and stemming behavior will change between MarkLogic 8 and MarkLogic 9. We expect that, in most cases, results will be better in MarkLogic 9.

Information is given for managing this change in the Release Notes at "Default Stemming and Tokenization Libraries Changed for Most Languages", and for further related features at "New Stemming and Tokenization". In-depth discussion is provided below for those interested in details.

General Comments on Incompatibilities

General implications of tokenization incompatibilities

If you do not reindex, old content may no longer match the same searches, even for unstemmed searches.

General tokenization incompatibilities

There are some edge-case changes in the handling of apostrophes in some languages; in general this is not a problem, but some specific words may include/break at apostrophes.
Tokenization is generally faster for all languages except English and Norwegian (which use the same tokenization as before).

General implications of stemming incompatibilities

Where there is only one stem, and it is now different: Old data will not match stemmed searches without reindexing, even for the same word.
Where the new stems are more precise: Content that used to match a query may not match any more, even with reindexing.
Where there are new stems, but the primary stem is unchanged: Content that used to not match a query may now match it with advanced stemming or above. With basic stemming there should be no change.
Where the decompounding is different, but the concatenation of the components is the same: Under decompounding, content may match a query when it used to not match, or may not match a query when it used to match, when the query or content involves something with one of the old/new components. Matching under advanced or basic stemming would be generally the same.
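The first implication above (a single stem that changes between versions) can be pictured with a toy inverted index. This is only a sketch under stated assumptions: the two stem tables are invented for illustration, loosely modeled on the Russian note further down about stems that used to drop ё, and none of the function names correspond to MarkLogic APIs.

```python
from collections import defaultdict

# Invented stem tables, NOT MarkLogic's real dictionaries.
OLD_STEMS = {"ёлки": "елка"}  # hypothetical pre-upgrade stem (ё dropped)
NEW_STEMS = {"ёлки": "ёлка"}  # hypothetical post-upgrade stem (ё kept)


def stem(word, table):
    """Return the table's stem, or the word itself if it is unknown."""
    return table.get(word, word)


def build_index(docs, table):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[stem(word, table)].add(doc_id)
    return index


def stemmed_search(index, word, table):
    """Stem the query word with the given table and look it up."""
    return index.get(stem(word, table), set())


docs = {"doc1": "ёлки"}

stale_index = build_index(docs, OLD_STEMS)  # content indexed before the upgrade
fresh_index = build_index(docs, NEW_STEMS)  # same content after a forced reindex

# After the upgrade, queries are stemmed with the new table.
before_reindex = stemmed_search(stale_index, "ёлки", NEW_STEMS)
after_reindex = stemmed_search(fresh_index, "ёлки", NEW_STEMS)

print(before_reindex)  # set()
print(after_reindex)   # {'doc1'}
```

The query and the document contain exactly the same word, yet the stale index stores it under the old stem, so the lookup misses until the content is reindexed.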

General stemming incompatibilities

MarkLogic now has general algorithms backing up explicit stemming dictionaries. Words not found in the default dictionaries will sometimes be stemmed when they previously were not.
Diminutives/augmentatives are not usually stemmed to base form.
Comparatives/superlatives are not usually stemmed to base form.
There are differences in the exact stems for pronoun case variants.
Stemming is more precise and restricted by common usage. For example, if the past participle of a verb is not usually used as an adjective, then the past participle will not be included as an alternative stem. Similarly, plural forms that only have technical or obscure usages might not stem to the singular form.
Otherwise, the stems of a past participle will typically include the past participle itself as an alternative stem.
The preferred order of stems is not always the same: this will affect search under basic stemming.

Reindexing

It is advisable to reindex to be sure there are no incompatibilities. Where the data in the forests (tokens or stems) does not match the current behavior, reindexing is recommended. This will have to be a forced reindex or a reload of the specific documents containing the offending data. For many languages this can be avoided if queries do not touch on the specific cases that changed. For certain languages (see below) the incompatibility is great enough that it is essential to reindex.

Language Notes

Below we give some specific information and recommendations for various languages.

Arabic
stemming: The Arabic dictionaries are much larger than before. Implications: (1) better precision, but (2) slower stemming.

Chinese (Simplified)
tokenization: Tokenization is broadly incompatible. The new tokenizer uses a corpus-based language model. Better precision can be expected.
recommendation: Reindex all Chinese (Simplified) content.

Chinese (Traditional)
tokenization: Tokenization is broadly incompatible. The new tokenizer uses a corpus-based language model. Better precision can be expected.
recommendation: Reindex all Chinese (Traditional) content.

Danish
tokenization: This language now has algorithmic stemming, and may have slight tokenization differences around certain edge cases.
recommendation: Reindex all Danish content if you are using stemming.

Dutch
stemming: There will be much more decompounding in general, but MarkLogic will not decompound certain known lexical items (e.g., "bastaardwoorden").
recommendation: Reindex Dutch if you want to query with decompounding.

English
stemming: British variants may include the British variant as an additional stem, although the first stem will still be the US variant. Stemming produces more alternative stems. Implications are (1) stemming is slightly slower and (2) index sizes are slightly larger (with advanced stemming).

Finnish
tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation: Reindex all content in this language if you are using stemming.

French
See general comments above.

German
stemming: Decompounding now applies to more than just pure noun combinations. For example, it applies to "noun plus adjective" compound terms. Decompounding is more aggressive, which can result in identification of more false compounds. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) for compound terms, search gives better recall, with some loss of precision.
recommendation: Reindex all German.

Hungarian
tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation: Reindex all content in this language if you are using stemming.

Italian
See general comments above.

Japanese
tokenization: Tokenization is broadly incompatible. The tokenizer provides internal flags that the stemmer requires.
This means that (1) tokenization is incompatible for all words at the storage level due to the extra information and (2) if you install a custom tokenizer for Japanese, you must also install a custom stemmer.
stemming: Stemming is broadly incompatible.
recommendation: Reindex all Japanese content.

Korean
stemming: Particles (e.g., 이다) are dropped from stems; they used to be treated as components for decompounding. There is different stemming of various honorific verb forms. North Korean variants are not in the dictionary, though they may be handled by the algorithmic stemmer.
recommendation: Reindex Korean unless you use decompounding.

Norwegian (Bokmal)
stemming: Previously, hardly any decompounding was in evidence; now it is pervasive. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where compounds are concerned.
recommendation: Reindex Bokmal if you want to query with decompounding.

Norwegian (Nynorsk)
stemming: Previously, hardly any decompounding was in evidence; now it is pervasive. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where compounds are concerned.
recommendation: Reindex Nynorsk if you want to query with decompounding.

Norwegian (generic 'no')
stemming: Previously 'no' was treated as an unsupported language; now it is treated as both Bokmal and Nynorsk: for a word present in both dialects, all stem variants from both will be present.
recommendation: Do not use 'no' unless you really must; reindex if you want to query it.

Persian
See general comments above.

Portuguese
stemming: More precision with respect to feminine variants (e.g., ator vs atriz).

Romanian
tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation: Reindex all content in this language if you are using stemming.

Russian
stemming: Inflectional variants of cardinal or ordinal numbers are no longer stemmed to a base form. Inflectional variants of proper nouns may stem together due to the backing algorithm, but it will be via affix-stripping, not to the nominal form. Stems for many verb forms used to be the perfective form; they are now the simple infinitive. Stems used to drop ё but now preserve it.
recommendation: Reindex all Russian.

Spanish
See general comments above.

Swedish
stemming: Previously, hardly any decompounding was in evidence; now it is pervasive. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where compounds are concerned.
recommendation: Reindex Swedish if you want to query with decompounding.

Tamil
tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation: Reindex all content in this language if you are using stemming.

Turkish
tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
recommendation: Reindex all content in this language if you are using stemming.
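
The recall and precision trade-off mentioned for the decompounding languages (German, Norwegian, Swedish) can be illustrated with a small model. This is a sketch only: the split table is hand-written for the example, the second split is deliberately the wrong reading of the word, and nothing here reflects how MarkLogic actually decompounds.

```python
from collections import defaultdict

# Hand-written splits for illustration only; a real decompounder derives
# them from dictionaries and algorithms.
SPLITS = {
    "hausboot": ["haus", "boot"],      # plausible true compound
    "staubecken": ["staub", "ecken"],  # deliberately wrong split; the
                                       # intended reading is stau + becken
}


def index_terms(word, decompound):
    """Terms a word is indexed under: itself, plus components if enabled."""
    terms = {word}
    if decompound:
        terms.update(SPLITS.get(word, []))
    return terms


def build_index(docs, decompound):
    """Map each term to the set of document ids indexed under it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for term in index_terms(word, decompound):
                index[term].add(doc_id)
    return index


docs = {"d1": "Hausboot", "d2": "Staubecken"}

plain = build_index(docs, decompound=False)
decomp = build_index(docs, decompound=True)

print(sorted(plain.get("boot", set())))    # []
print(sorted(decomp.get("boot", set())))   # ['d1']  recall gained
print(sorted(decomp.get("staub", set())))  # ['d2']  false match, precision lost
```

More aggressive splitting finds more component matches, which is where the extra recall comes from, but every false compound adds index entries and unwanted hits, which matches the space and precision costs listed in the notes above.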