Knowledgebase:
Valid characters in a MarkLogic Document URI
24 November 2020 05:13 AM

Introduction

A document uniform resource identifier (URI) is a string of characters that identifies a document stored in MarkLogic Server. This article describes which characters MarkLogic 8 supports in a document URI.

ASCII

MarkLogic 8 allows all printable ASCII characters to be used in a document URI (i.e. decimal range 32-126).

List of allowed special characters within ASCII range

<space> ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Note that the ASCII space character (decimal 32) can be used; however, it should not be used as a prefix or a suffix of the URI.
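As a rough client-side illustration of the rules above, the following Python sketch checks whether a candidate URI consists only of printable ASCII characters (decimal 32 through 126) and has no leading or trailing space. The helper name is_plain_ascii_uri is hypothetical and not part of any MarkLogic API, and rejecting an empty string is an assumption made for the example.

```python
def is_plain_ascii_uri(uri: str) -> bool:
    """Hypothetical helper: True if every character is printable ASCII
    (decimal 32-126) and the URI does not start or end with a space."""
    if not uri:
        # Assumption for this sketch: an empty string is not a usable URI.
        return False
    if uri[0] == " " or uri[-1] == " ":
        # A space is allowed inside the URI, but not as a prefix or suffix.
        return False
    return all(32 <= ord(ch) <= 126 for ch in uri)


print(is_plain_ascii_uri("/docs/my report (v2).xml"))
print(is_plain_ascii_uri(" /docs/leading-space.xml"))
```

A check like this only flags URIs that fall outside the plain ASCII case; such URIs may still be valid, as described in the next section.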

Other Character Sets

MarkLogic Server supports UTF-8 encoding. In addition to the ASCII characters listed above, any valid UTF-8 character can be used within a document URI in MarkLogic Server.

Examples include decimal range 384-591 (Latin Extended-B) and decimal range 880-1023 (Greek and Coptic).
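To make the decimal ranges concrete, the Python sketch below prints the decimal code points of a few non-ASCII characters and shows how a URI containing them grows when encoded as UTF-8. The sample URI and the helper name in_range are made up for illustration.

```python
def in_range(ch, low, high):
    """True if the character's decimal code point lies in [low, high]."""
    return low <= ord(ch) <= high


uri = "/docs/αβγ.xml"            # hypothetical URI with Greek letters
encoded = uri.encode("utf-8")    # each Greek letter becomes two bytes

print([ord(c) for c in "αβγ"])   # decimal code points of the Greek letters
print(in_range("α", 880, 1023))  # is it inside the 880-1023 range?
print(len(uri), len(encoded))    # character count vs. UTF-8 byte count
```

The character count and the byte count differ because characters outside the ASCII range take more than one byte in UTF-8.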

External Considerations

Some interfaces (such as XCC/J) and datatypes may place additional restrictions on the characters allowed in a MarkLogic document URI. For example, the xs:anyURI datatype is more restrictive and does not allow & (decimal code 38) or < (decimal code 60). Consider the following scenario.

A schema is loaded into the database, and validation is applied before an XML document is inserted into the database.

Now below query will fail to insert a document with URI having a

The above code fails with the error listed below:

[1.0-ml] XDMP-DOCENTITYREF: xdmp:unquote("<?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?>&#10;<...") -- Invalid entity reference "." at line 2

 

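To resolve this issue, the function xdmp:url-encode can be used, for example:

(: Assumption: $n holds the decimal code point of the special character, e.g. 38 for "&" :)
let $n := 38
let $node := xdmp:unquote(fn:concat('<?xml version="1.0" encoding="UTF-8"?>
<tns:simpleuri xmlns:tns="http://www.example.org/uri" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.example.org/uri uri.xsd ">',
(: URL-encode the special character before it is placed in the XML text :)
xdmp:url-encode(fn:codepoints-to-string($n)), '.org
</tns:simpleuri>'))
return $node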

The MarkLogic knowledge base article "Using URL encoding to handle special characters in a document URI" explains a recommended approach for safely handling special characters using URL encoding. A document URI containing special characters, as described in that article, should be encoded before the document is inserted into MarkLogic 8.
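For applications that build URIs on the client side before sending them to MarkLogic, the same idea can be sketched in Python with the standard library. This uses urllib.parse.quote, not xdmp:url-encode, and the two may differ in detail (for example in how a space is encoded), so treat it as an illustration of percent-encoding rather than an exact equivalent. The helper name encode_uri and the sample URI are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_uri(uri: str) -> str:
    """Percent-encode a URI, leaving '/' and unreserved characters as-is."""
    return quote(uri, safe="/")


raw = "/docs/a&b<c>.xml"   # hypothetical URI containing & and <
enc = encode_uri(raw)

print(enc)                 # the & and < characters are now percent-encoded
print(unquote(enc) == raw) # decoding restores the original string
```

Whichever encoder is used, the same one should be applied consistently when documents are inserted and when they are later looked up, so that both sides agree on the stored URI.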

Summary

While it is possible to load documents into MarkLogic Server whose URIs contain unencoded special characters, the recommended best practice is to URL-encode document URIs. Doing so helps you design robust applications that are free from the side effects such characters can cause in other areas of your application stack.

Additional References

ISO/IEC 8859-1

W3Schools: HTML Unicode (UTF-8) Reference

 
