Knowledgebase:
MarkLogic Server v9 Tokenization and Stemming
30 May 2017 04:42 PM

Abstract

In MarkLogic Server version 9, the default tokenization and stemming code has been changed for all languages (except English tokenization). Some tokenization and stemming behavior will change between MarkLogic 8 and MarkLogic 9. We expect that, in most cases, results will be better in MarkLogic 9.

The Release Notes explain how to manage this change in "Default Stemming and Tokenization Libraries Changed for Most Languages" and describe further related features in "New Stemming and Tokenization".

In-depth discussion is provided below for those interested in details.

General Comments on Incompatibilities

General implications of tokenization incompatibilities

If you do not reindex, old content may no longer match the same searches, even for unstemmed searches.

General tokenization incompatibilities

There are some edge-case changes in the handling of apostrophes in some languages; in general this is not a problem, but some specific words may include/break at apostrophes.

Tokenization is generally faster for all languages except English and Norwegian (which use the same tokenization as before).

General implications of stemming incompatibilities

Where there is only one stem, and it is now different: Old data will not match stemmed searches without reindexing, even for the same word.

Where the new stems are more precise: Content that used to match a query may not match any more, even with reindexing.

Where there are new stems, but the primary stem is unchanged: Content that used to not match a query may now match it with advanced stemming or above. With basic stemming there should be no change.

Where the decompounding is different, but the concatenation of the components is the same:  Under decompounding, content may match a query when it used to not match, or may not match a query when it used to match, when the query or content involves something with one of the old/new components. Matching under advanced or basic stemming would be generally the same.
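The cases above share one mechanism: stems are computed once when a document is indexed and again when a query runs, and a stemmed search only matches when the two agree. The following Python sketch illustrates the first case (a single stem that has changed). It is a toy model under stated assumptions, not MarkLogic code: the `build_index` and `stemmed_search` helpers and both stem tables are hypothetical, and the example stems are only loosely based on the Russian note later in this article.

```python
# Toy model of index-time vs. query-time stemming. This is NOT MarkLogic code:
# the helpers and stem tables are hypothetical and exist only for illustration.

# Hypothetical stem tables, loosely based on the Russian note in this article
# (old stems dropped "ё", new stems preserve it).
OLD_STEMS = {"ёлка": ["елка"]}
NEW_STEMS = {"ёлка": ["ёлка"]}


def build_index(docs, stems):
    """Map every stem of every word to the ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            for stem in stems.get(word, [word]):
                index.setdefault(stem, set()).add(doc_id)
    return index


def stemmed_search(index, word, stems):
    """Return ids of documents whose indexed stems match any stem of `word`."""
    hits = set()
    for stem in stems.get(word, [word]):
        hits |= index.get(stem, set())
    return hits


docs = {1: "ёлка"}

# Index built before the upgrade, queried after it: the same word no longer matches.
stale_index = build_index(docs, OLD_STEMS)
print(stemmed_search(stale_index, "ёлка", NEW_STEMS))  # set()

# After reindexing, index-time and query-time stems agree again.
fresh_index = build_index(docs, NEW_STEMS)
print(stemmed_search(fresh_index, "ёлка", NEW_STEMS))  # {1}
```

In this model the stale index still holds the old stem, so the query's new stem finds nothing until the document is reindexed.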

General stemming incompatibilities

  • MarkLogic now has general algorithms backing up explicit stemming dictionaries.  Words not found in the default dictionaries will sometimes be stemmed when they previously were not.
  • Diminutives/augmentatives are not usually stemmed to base form.
  • Comparatives/superlatives are not usually stemmed to base form.
  • There are differences in the exact stems for pronoun case variants.
  • Stemming is more precise and restricted by common usage. For example, if the past participle of a verb is not usually used as an adjective, then the past participle will not be included as an alternative stem. Similarly, plural forms that only have technical or obscure usages might not stem to the singular form.
  • Past participles will typically include the past participle as an alternative stem.
  • The preferred order of stems is not always the same: this will affect search under basic stemming.

Reindexing

Reindexing is advisable to be sure there are no incompatibilities, and it is recommended wherever the data in the forests (tokens or stems) does not match the current behavior. This must be either a forced reindex or a reload of the specific documents containing the offending data. For many languages, reindexing can be avoided if queries do not touch on the specific cases that changed. For certain languages (see below), the incompatibility is great enough that reindexing is essential.

Language Notes

Below we give some specific information and recommendations for various languages.

Arabic

stemming

The Arabic dictionaries are much larger than before. Implications:  (1) better precision, but (2) slower stemming.

Chinese (Simplified)

tokenization

Tokenization is broadly incompatible.

The new tokenizer uses a corpus-based language model.  Better precision can be expected.

recommendation

Reindex all Chinese (simplified).

Chinese (Traditional)

tokenization

Tokenization is broadly incompatible.

The new tokenizer uses a corpus-based language model.  Better precision can be expected.

recommendation

Reindex all Chinese (traditional).

Danish

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all Danish content if you are using stemming.

Dutch

stemming

There will be much more decompounding in general, but MarkLogic will not decompound certain known lexical items (e.g., "bastaardwoorden").

recommendation

Reindex Dutch if you want to query with decompounding.

English

stemming

British variants may include the British variant as an additional stem, although the first stem will still be the US variant.

Stemming produces more alternative stems. Implications are (1) stemming is slightly slower and (2) index sizes are slightly larger (with advanced stemming).

Finnish

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

French

See general comments above.

German

stemming

Decompounding now applies to more than just pure noun combinations. For example, it applies to "noun plus adjectives" compound terms. Decompounding is more aggressive, which can result in identification of more false compounds. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) for compound terms, search gives better recall, with some loss of precision.
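The recall/precision trade-off described above can be sketched with a toy dictionary-based decompounder. This is only an illustration under stated assumptions, not MarkLogic's algorithm: the `LEXICON`, `splits`, and `component_terms` names are hypothetical, and the two German words were chosen because one of them ("Urinstinkt") is a well-known ambiguous compound.

```python
# Toy dictionary-based decompounding. NOT MarkLogic's algorithm: the lexicon
# and helper names are hypothetical and chosen only to show the trade-off.

LEXICON = {"haus", "tür", "ur", "instinkt", "urin", "stinkt"}


def splits(word, lexicon):
    """Return every way to split `word` into two or more lexicon entries."""
    results = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            if tail in lexicon:
                results.append([head, tail])
            # Also try splitting the remainder further.
            results.extend([head] + rest for rest in splits(tail, lexicon))
    return results


def component_terms(word, lexicon):
    """Collect every component that appears in any split of `word`."""
    return {part for split in splits(word, lexicon) for part in split}


# Better recall: a search for "tür" can now reach documents containing "haustür".
print(splits("haustür", LEXICON))  # [['haus', 'tür']]

# Lower precision: an aggressive splitter also finds the false compound
# "urin" + "stinkt" inside "urinstinkt" ("ur" + "instinkt").
print(splits("urinstinkt", LEXICON))  # [['ur', 'instinkt'], ['urin', 'stinkt']]
print("urin" in component_terms("urinstinkt", LEXICON))  # True
```

Each extra component is an extra term to compute and store, which is consistent with the slower stemming and larger space use noted above.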

recommendation

Reindex all German.

Hungarian

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

Italian

See general comments above.

Japanese

tokenization

Tokenization is broadly incompatible.

The tokenizer provides internal flags that the stemmer requires.  This means that (1) tokenization is incompatible for all words at the storage level due to the extra information and (2) if you install a custom tokenizer for Japanese, you must also install a custom stemmer.

stemming

Stemming is broadly incompatible.

recommendation

Reindex all Japanese content.

Korean

stemming

Particles (e.g., 이다) are dropped from stems; they used to be treated as components for decompounding.

There is different stemming of various honorific verb forms.

North Korean variants are not in the dictionary, though they may be handled by the algorithmic stemmer.

recommendation

Reindex Korean unless you use decompounding.

Norwegian (Bokmal)

stemming

Previously, hardly any decompounding was in evidence; now it is pervasive.

Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least when it comes to compounds.

recommendation

Reindex Bokmal if you want to query with decompounding.

Norwegian (Nynorsk)

stemming

Previously hardly any decompounding was in evidence; now it is pervasive.

Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least when it comes to compounds.

recommendation

Reindex Nynorsk if you want to query with decompounding.

Norwegian (generic 'no')

stemming

Previously 'no' was treated as an unsupported language; now it is treated as both Bokmal and Nynorsk: for a word present in both dialects, all stem variants from both will be present.

recommendation

Do not use 'no' unless you really must; reindex if you want to query it.

Persian

See general comments above.

Portuguese

stemming

Stemming is more precise with respect to feminine variants (e.g., ator vs. atriz).

Romanian

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

Russian

stemming

Inflectional variants of cardinal or ordinal numbers are no longer stemmed to a base form.

Inflectional variants of proper nouns may stem together due to the backing algorithm, but it will be via affix-stripping, not to the nominal form.

Stems for many verb forms used to be the perfective form; they are now the simple infinitive.

Stems used to drop ё but now preserve it.

recommendation

Reindex all Russian.

Spanish

See general comments above.

Swedish

stemming

Previously hardly any decompounding was in evidence; now it is pervasive.

Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least when it comes to compounds.

recommendation

Reindex Swedish if you want to query with decompounding.

Tamil

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

Turkish

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.
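The per-language recommendations above can be collected into a small lookup for planning an upgrade. The Python sketch below only restates the recommendations given in this article. The `Reindex` enum, the `RECOMMENDATIONS` table, and the `should_reindex` helper are hypothetical names, the keys are the language headings used here rather than MarkLogic language codes, and the generic Norwegian entry assumes the content is queried at all.

```python
from enum import Enum


class Reindex(Enum):
    ALWAYS = "reindex all content"
    IF_STEMMING = "reindex if you are using stemming"
    IF_DECOMPOUNDING = "reindex if you want to query with decompounding"
    UNLESS_DECOMPOUNDING = "reindex unless you use decompounding"


# Restates the recommendations in this article. Languages without a specific
# recommendation (Arabic, English, French, Italian, Persian, Portuguese,
# Spanish) are intentionally absent: see the general comments instead.
RECOMMENDATIONS = {
    "Chinese (Simplified)": Reindex.ALWAYS,
    "Chinese (Traditional)": Reindex.ALWAYS,
    "German": Reindex.ALWAYS,
    "Japanese": Reindex.ALWAYS,
    "Russian": Reindex.ALWAYS,
    "Norwegian (generic 'no')": Reindex.ALWAYS,  # assumes you query it at all
    "Danish": Reindex.IF_STEMMING,
    "Finnish": Reindex.IF_STEMMING,
    "Hungarian": Reindex.IF_STEMMING,
    "Romanian": Reindex.IF_STEMMING,
    "Tamil": Reindex.IF_STEMMING,
    "Turkish": Reindex.IF_STEMMING,
    "Dutch": Reindex.IF_DECOMPOUNDING,
    "Norwegian (Bokmal)": Reindex.IF_DECOMPOUNDING,
    "Norwegian (Nynorsk)": Reindex.IF_DECOMPOUNDING,
    "Swedish": Reindex.IF_DECOMPOUNDING,
    "Korean": Reindex.UNLESS_DECOMPOUNDING,
}


def should_reindex(language, uses_stemming=False, uses_decompounding=False):
    """Return True/False per the article, or None if it gives no specific advice."""
    rule = RECOMMENDATIONS.get(language)
    if rule is None:
        return None
    if rule is Reindex.ALWAYS:
        return True
    if rule is Reindex.IF_STEMMING:
        return uses_stemming
    if rule is Reindex.IF_DECOMPOUNDING:
        return uses_decompounding
    return not uses_decompounding  # Reindex.UNLESS_DECOMPOUNDING


print(should_reindex("German"))                          # True
print(should_reindex("Danish", uses_stemming=True))      # True
print(should_reindex("Swedish"))                         # False
print(should_reindex("French"))                          # None
```

A `None` result means the article gives no language-specific recommendation, in which case the general reindexing advice still applies.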
