MarkLogic Server stores text in Unicode NFC normalized form
28 January 2017 10:29 PM
In MarkLogic Server, all text is converted into Unicode NFC normalized form before tokenization and storage.
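For example, consider the NFC equivalence of the code points U+2126 (Ω, OHM SIGN) and U+03A9 (Ω, GREEK CAPITAL LETTER OMEGA): the two are canonically equivalent, and NFC maps U+2126 to U+03A9. This mapping is shown in the U+2126 entry of the Unicode code chart for the Letterlike Symbols block (U+2100 to U+214F).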
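The same mapping can be checked outside MarkLogic Server. The following is a minimal sketch that uses Python's standard `unicodedata` module, not a MarkLogic API; the helper name `to_nfc` and the constant names are illustrative assumptions. It only demonstrates NFC normalization and does not reproduce MarkLogic's tokenization.

```python
import unicodedata

# Illustrative constants (not part of any MarkLogic API).
OHM_SIGN = "\u2126"      # U+2126 OHM SIGN
GREEK_OMEGA = "\u03a9"   # U+03A9 GREEK CAPITAL LETTER OMEGA


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


normalized = to_nfc(OHM_SIGN)

# Show the code point before and after normalization.
print(f"original:   U+{ord(OHM_SIGN):04X}")
print(f"normalized: U+{ord(normalized):04X}")

# The normalized ohm sign is the Greek capital omega.
print(normalized == GREEK_OMEGA)
```

Text that is already in NFC, such as U+03A9 itself, passes through `to_nfc` unchanged, so applying the normalization more than once has no further effect.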
You can see the effects of normalization alone, and during tokenization, by running the following in MarkLogic Server's Query Console:
The results show the original value, the normalized value, and the resulting token: