MarkLogic Server v9 Tokenization and Stemming
04 October 2019 04:32 PM
Abstract

In MarkLogic Server version 9, the default tokenization and stemming code has changed for all languages (except English tokenization). Some tokenization and stemming behavior will change between MarkLogic 8 and MarkLogic 9. We expect that, in most cases, results will be better in MarkLogic 9. Information on managing this change is given in the Release Notes at Default Stemming and Tokenization Libraries Changed for Most Languages, and on further related features at New Stemming and Tokenization. An in-depth discussion is provided below for those interested in the details.

General Comments on Incompatibilities

General implications of tokenization incompatibilities

If you do not reindex, old content may no longer match the same searches, even for unstemmed searches.

General tokenization incompatibilities

There are some edge-case changes in the handling of apostrophes in some languages; in general this is not a problem, but some specific words may include, or break at, apostrophes differently than before. Tokenization is generally faster for all languages except English and Norwegian (which use the same tokenization as before).

General implications of stemming incompatibilities

- Where there is only one stem, and it is now different: old data will not match stemmed searches without reindexing, even at the basic stemming level.
- Where the new stems are more precise: content that used to match a query may not match any more, even with advanced stemming.
- Where there are new stems, but the primary stem is unchanged: content that used to not match a query may now match it with advanced stemming.
- Where the decompounding is different, but the concatenation of the components is the same: under decompounding, content may match a query when it used to not match, or may not match a query when it used to match, when the query or content involves one of the old or new components. Matching under advanced or basic stemming would be generally the same.

General stemming incompatibilities
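The stemming-level implications above can be illustrated with a small model. This is plain Python, not MarkLogic's stemmer or API; the words, stems, and level names are hypothetical stand-ins for the behavior described.

```python
# Toy model of stemmed matching at a "basic" level (first stem only)
# and an "advanced" level (all stems). Words and stems here are
# hypothetical; this illustrates the incompatibility modes above,
# not actual MarkLogic stemmer output.

# Stems as the old (v8-era) stemmer produced them: first stem is primary.
OLD_STEMS = {"ran": ["run"], "colours": ["colour"]}
# Stems as the newer stemmer might produce them: a US-variant primary
# stem, with the British variant kept as an additional stem.
NEW_STEMS = {"ran": ["run"], "colours": ["color", "colour"]}

def index_terms(word, stems, level):
    """Terms stored in the index for a word at a given stemming level."""
    s = stems.get(word, [word])
    return {s[0]} if level == "basic" else set(s)

def matches(content_word, query_word, content_stems, query_stems, level):
    """True if a stemmed query on query_word matches content_word."""
    return bool(index_terms(content_word, content_stems, level)
                & index_terms(query_word, query_stems, level))

# Content indexed with the old stems, queried with the new stems:
# at the basic level, "colours" (indexed as "colour") no longer
# matches a query stemmed to "color" -- until the content is reindexed.
stale = matches("colours", "colours", OLD_STEMS, NEW_STEMS, "basic")
fresh = matches("colours", "colours", NEW_STEMS, NEW_STEMS, "basic")
print(stale, fresh)  # stale content misses; reindexed content matches
```

Note that in this model the advanced level would still match the stale content (the old stem "colour" survives as an alternative stem), which is why the severity of the incompatibility depends on which stemming level your queries use.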
Reindexing

It is advisable to reindex to be sure there are no incompatibilities. Where the data in the forests (tokens or stems) does not match the current behavior, reindexing is recommended. This will have to be a forced reindex or a reload of the specific documents containing the offending data. For many languages this can be avoided if queries do not touch on the specific cases. For certain languages (see below) the incompatibility is great enough that it is essential to reindex.

Language Notes

Below we give specific information and recommendations for various languages.

Arabic

Stemming: The Arabic dictionaries are much larger than before. Implications: (1) better precision, but (2) slower stemming.

Chinese (Simplified)

Tokenization: Tokenization is broadly incompatible. The new tokenizer uses a corpus-based language model. Better precision can be expected.

Recommendation: Reindex all Chinese (Simplified) content.

Chinese (Traditional)

Tokenization: Tokenization is broadly incompatible. The new tokenizer uses a corpus-based language model. Better precision can be expected.

Recommendation: Reindex all Chinese (Traditional) content.

Danish

Tokenization: This language now has algorithmic stemming, and may have slight tokenization differences around certain edge cases.

Recommendation: Reindex all Danish content if you are using stemming.

Dutch

Stemming: There will be much more decompounding in general, but MarkLogic will not decompound certain known lexical items (e.g., "bastaardwoorden").

Recommendation: Reindex Dutch content if you want to query with decompounding.

English

Stemming: Stems for British spelling variants may include the British variant as an additional stem, although the first stem will still be the US variant. Stemming produces more alternative stems. Implications: (1) stemming is slightly slower and (2) index sizes are slightly larger (with advanced stemming).

Finnish

Tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
Recommendation: Reindex all Finnish content if you are using stemming.

French

See the general comments above.

German

Stemming: Decompounding now applies to more than just pure noun combinations; for example, it applies to "noun plus adjective" compound terms. Decompounding is more aggressive, which can result in the identification of more false compounds. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) for compound terms, search gives better recall, with some loss of precision.

Recommendation: Reindex all German content.

Hungarian

Tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

Recommendation: Reindex all Hungarian content if you are using stemming.

Italian

See the general comments above.

Japanese

Tokenization: Tokenization is broadly incompatible. The tokenizer provides internal flags that the stemmer requires. This means that (1) tokenization is incompatible for all words at the storage level due to the extra information, and (2) if you install a custom tokenizer for Japanese, you must also install a custom stemmer.

Stemming: Stemming is broadly incompatible.

Recommendation: Reindex all Japanese content.

Korean

Stemming: Particles (e.g., 이다) are dropped from stems; they used to be treated as components for decompounding. There is different stemming of various honorific verb forms. North Korean variants are not in the dictionary, though they may be handled by the algorithmic stemmer.

Recommendation: Reindex Korean content unless you use decompounding.

Norwegian (Bokmål)

Stemming: Previously, hardly any decompounding was in evidence; now it is pervasive. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where compounds are concerned.

Recommendation: Reindex Bokmål content if you want to query with decompounding.
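The effect of pervasive decompounding described for German (and for Norwegian and Swedish below) can be sketched with a small model. This is plain Python, not MarkLogic's API; the compound word and its split are hypothetical illustrations.

```python
# Toy model of decompounding: the index stores the whole compound plus
# its components, so a query on a single component can match a document
# containing the compound. The split below is a hypothetical German
# example, not actual MarkLogic stemmer output.

def index_with_decompounding(word, splits):
    """Index terms for a word: the word itself plus any known components."""
    return {word} | set(splits.get(word, []))

# Old behavior: hardly any decompounding.
OLD_SPLITS = {}
# New behavior: pervasive decompounding (hypothetical split of
# "Orangensaft", i.e. "orange juice").
NEW_SPLITS = {"orangensaft": ["orangen", "saft"]}

def matches(content_word, query_word, splits):
    """True if a query term matches the indexed terms for content_word."""
    return query_word in index_with_decompounding(content_word, splits)

# A query for "saft" did not match the compound before, but does now:
# better recall for compounds, at the cost of a larger index and the
# occasional false compound.
print(matches("orangensaft", "saft", OLD_SPLITS))  # old behavior
print(matches("orangensaft", "saft", NEW_SPLITS))  # new behavior
```

This is also why the recommendations for these languages hinge on whether you query with decompounding: content indexed without the component terms cannot match component queries until it is reindexed.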
Norwegian (Nynorsk)

Stemming: Previously, hardly any decompounding was in evidence; now it is pervasive. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where compounds are concerned.

Recommendation: Reindex Nynorsk content if you want to query with decompounding.

Norwegian (generic 'no')

Stemming: Previously, 'no' was treated as an unsupported language; now it is treated as both Bokmål and Nynorsk: for a word present in both dialects, all stem variants from both will be present.

Recommendation: Do not use 'no' unless you really must; reindex if you want to query it.

Persian

See the general comments above.

Portuguese

Stemming: More precision with respect to feminine variants (e.g., ator vs. atriz).

Romanian

Tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

Recommendation: Reindex all Romanian content if you are using stemming.

Russian

Stemming: Inflectional variants of cardinal and ordinal numbers are no longer stemmed to a base form. Inflectional variants of proper nouns may stem together due to the backing algorithm, but via affix-stripping, not to the nominal form. Stems for many verb forms used to be the perfective form; they are now the simple infinitive. Stems used to drop ё but now preserve it.

Recommendation: Reindex all Russian content.

Spanish

See the general comments above.

Swedish

Stemming: Previously, hardly any decompounding was in evidence; now it is pervasive. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where compounds are concerned.

Recommendation: Reindex Swedish content if you want to query with decompounding.

Tamil

Tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.
Recommendation: Reindex all Tamil content if you are using stemming.

Turkish

Tokenization: This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

Recommendation: Reindex all Turkish content if you are using stemming.
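Finally, the general point made at the top, that even unstemmed searches can break without reindexing when the tokenizer changes, can be sketched as follows. The apostrophe handling here is hypothetical (plain Python, not MarkLogic's tokenizers), illustrating the kind of edge case involved.

```python
# Toy model of why unstemmed searches can break when the tokenizer
# changes: the index holds tokens produced at load time, but queries
# are tokenized with the current tokenizer. The apostrophe behavior
# below is hypothetical, not actual MarkLogic tokenizer output.

def old_tokenize(text):
    """Old behavior (hypothetical): break words at apostrophes."""
    return text.replace("'", " ").split()

def new_tokenize(text):
    """New behavior (hypothetical): keep apostrophes inside words."""
    return text.split()

content = "l'avion"                          # loaded under the old tokenizer
indexed_tokens = set(old_tokenize(content))  # stale index: {"l", "avion"}

# The same phrase, tokenized as a query by the new tokenizer, yields
# {"l'avion"}, which is absent from the stale index -- no match until
# the document is reindexed or reloaded.
query_tokens = set(new_tokenize(content))
print(query_tokens <= indexed_tokens)
```

Reindexing rebuilds the stored tokens with the current tokenizer, after which query and index tokens agree again.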