MarkLogic Server stores text in Unicode NFC normalized form
28 January 2017 10:29 PM
In MarkLogic Server, all text is converted into Unicode NFC normalized form before tokenization and storage.
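Unicode treats character sequences that normalize to the same NFC form as canonically equivalent: they represent the same abstract text and should be handled identically. See the Unicode normalization FAQ and Conformance Requirements in the Unicode Standard.
For example, consider the codepoints U+2126 (Ω, OHM SIGN) and U+03A9 (Ω, GREEK CAPITAL LETTER OMEGA). NFC normalization maps U+2126 to U+03A9, so the two are canonically equivalent. This is shown in the U+2126 entry of the Unicode code chart for the Letterlike Symbols block (U+2100 to U+214F).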
You can see the effect of normalization, both on its own and as part of tokenization, by running the following in MarkLogic Server's Query Console:
xquery version "1.0-ml";
(: equivalence of Ω forms :)
let $s := fn:codepoints-to-string (xdmp:hex-to-integer ('2126'))
let $token := cts:tokenize ($s)
return (
  'original: '||xdmp:integer-to-hex (fn:string-to-codepoints ($s)),
  'normalized: '||xdmp:integer-to-hex (fn:string-to-codepoints (fn:normalize-unicode ($s, 'NFC'))),
  'tokenized: '||xdmp:describe ($token, (), ())
)
The results show the original value, the normalized value, and the resulting token:
original: 2126
normalized: 3a9
tokenized: cts:word("Ω")
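
As a complementary check, the sketch below compares the two codepoints directly, first as raw strings and then after applying NFC to the ohm sign. It is an illustration rather than part of the original example: the variable names `$ohm` and `$omega` are hypothetical, and it assumes the standard XQuery function `fn:codepoint-equal`, which compares strings codepoint by codepoint and so avoids any collation-dependent matching.

```xquery
xquery version "1.0-ml";
(: Illustrative sketch: $ohm and $omega are hypothetical names. :)
(: U+2126 OHM SIGN and U+03A9 GREEK CAPITAL LETTER OMEGA :)
let $ohm   := fn:codepoints-to-string (xdmp:hex-to-integer ('2126'))
let $omega := fn:codepoints-to-string (xdmp:hex-to-integer ('3a9'))
return (
  (: codepoint-by-codepoint comparison of the strings as constructed :)
  'raw strings equal: '||fn:string (fn:codepoint-equal ($ohm, $omega)),
  (: the same comparison after NFC-normalizing the ohm sign :)
  'equal after NFC: '||fn:string (fn:codepoint-equal (fn:normalize-unicode ($ohm, 'NFC'), $omega))
)
```

Given the normalization result shown above (2126 becomes 3a9), the second comparison should succeed where the first does not. That is why text containing either form ends up stored and tokenized as the same word.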