MarkLogic Server stores text in Unicode NFC normalized form
28 January 2017 10:29 PM

Summary

Text is stored in MarkLogic Server in Unicode NFC normalized form.

Discussion

In MarkLogic Server, all text is converted into Unicode NFC normalized form before tokenization and storage. 

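Unicode treats strings that normalize to the same NFC form as canonically equivalent: they represent the same abstract characters and should be interpreted and displayed identically. See the Unicode normalization FAQ and the Conformance Requirements in the Unicode Standard.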

Example

For example, consider the canonical equivalence of the code points U+2126 (Ω, OHM SIGN) and U+03A9 (Ω, GREEK CAPITAL LETTER OMEGA). The U+2126 entry in the Unicode code chart for the Letterlike Symbols block (U+2100 to U+214F) lists this equivalence.

You can see the effects of normalization alone, and during tokenization, by running the following in MarkLogic Server's Query Console:

xquery version "1.0-ml";
(: equivalence of Ω forms :)
let $s := fn:codepoints-to-string (xdmp:hex-to-integer ('2126'))
let $token := cts:tokenize ($s)
return (
    'original: '||xdmp:integer-to-hex (fn:string-to-codepoints ($s)),
    'normalized: '||xdmp:integer-to-hex (fn:string-to-codepoints (fn:normalize-unicode ($s, 'NFC'))),
    'tokenized: '||xdmp:describe ($token, (), ())
)

The results show the original value, the normalized value, and the resulting token:

original: 2126
normalized: 3a9
tokenized: cts:word("Ω")
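
The same kind of check can be applied to a combining sequence. The following is a minimal sketch, not part of the original article: local:hex is a hypothetical helper introduced here for illustration, and the sample string ("e" followed by U+0301 COMBINING ACUTE ACCENT) is an assumed input. It uses fn:codepoint-equal so that the comparison is made codepoint by codepoint rather than through a collation.

```xquery
xquery version "1.0-ml";
(: Hypothetical helper: list a string's codepoints as space-separated hex :)
declare function local:hex ($s as xs:string) as xs:string {
    fn:string-join (
        for $cp in fn:string-to-codepoints ($s)
        return xdmp:integer-to-hex ($cp),
        ' ')
};

(: assumed sample input: "e" (U+0065) followed by U+0301 COMBINING ACUTE ACCENT :)
let $decomposed := fn:codepoints-to-string ((101, 769))
let $ohm := fn:codepoints-to-string (xdmp:hex-to-integer ('2126'))
let $omega := fn:codepoints-to-string (xdmp:hex-to-integer ('03A9'))
return (
    'decomposed: '||local:hex ($decomposed),
    'composed: '||local:hex (fn:normalize-unicode ($decomposed, 'NFC')),
    (: codepoint-by-codepoint comparison before and after normalization :)
    'raw equal: '||fn:codepoint-equal ($ohm, $omega),
    'NFC equal: '||fn:codepoint-equal (fn:normalize-unicode ($ohm, 'NFC'), $omega)
)
```

Under NFC, the two-codepoint sequence should compose into the single codepoint U+00E9, and the OHM SIGN should match GREEK CAPITAL LETTER OMEGA only after normalization.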