Using URL encoding to handle special characters in a document URI

24 November 2020 10:59 AM

Introduction

Special care may be needed when loading documents into MarkLogic Server if the document URI contains one or more special characters. In this article, we walk through a scenario in which exceptions are thrown when a URI containing a special character is not handled properly, and then discuss how to handle such URIs. The article uses built-in functions (and the encode method of the java.net.URLEncoder class) and demonstrates their usage through a couple of samples created using XCC/J.

Relationship between URI and URL

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. The most common form of URI is the Uniform Resource Locator (URL).

A URL is a URI that, in addition to identifying a web resource, specifies the means of acting upon or obtaining the representation, specifying both its primary access mechanism and network location. For example, the URL 'http://example.org/wiki/Main_Page' refers to a resource identified as /wiki/Main_Page whose representation, in the form of HTML and related code, is obtainable via HyperText Transfer Protocol (http) from a network host whose domain name is example.org.
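As a small illustration of the parts named above, the following sketch uses the standard java.net.URI class to split the example URL into its access mechanism, network host, and resource path. The class and method names (UrlPartsDemo, parts) are hypothetical and chosen only for this example.

```java
import java.net.URI;

class UrlPartsDemo {
    // Returns "scheme|host|path" for the given URL string.
    static String parts(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "|" + uri.getHost() + "|" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints "http|example.org|/wiki/Main_Page"
        System.out.println(parts("http://example.org/wiki/Main_Page"));
    }
}
```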

While it is possible to load documents into MarkLogic Server with document URIs that contain unencoded special characters, we recommend URL encoding document URIs as a best practice. Doing so helps you design robust applications that are free from the side effects such special characters can cause in other areas of your application stack.

Importance of URL encoding

URL encoding is often required to convert special characters (such as "/", "&", "#", ...), because special characters:

have special meaning in some contexts; or
are not valid characters for a URL; or
could be altered during transfer.

For instance, the "#" character needs to be encoded because it has a special meaning: it introduces an HTML anchor (the fragment part of a URL). The <space> character needs to be encoded because it is not a valid URL character. Also, some characters, such as "~", might not be transported properly across the internet.
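The effect of an unencoded "#" can be seen with a short sketch based on java.net.URI. The document path /docs/a#1.xml is a made-up example, and FragmentDemo and its methods are hypothetical names. With the raw "#", everything after it is treated as a fragment and is no longer part of the path; with "%23", the full name stays in the path.

```java
import java.net.URI;

class FragmentDemo {
    // Decoded path component of the URL.
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    // Fragment component of the URL, or null when there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String raw = "http://example.org/docs/a#1.xml";
        String encoded = "http://example.org/docs/a%231.xml";

        System.out.println(pathOf(raw));          // prints "/docs/a"
        System.out.println(fragmentOf(raw));      // prints "1.xml"
        System.out.println(pathOf(encoded));      // prints "/docs/a#1.xml"
        System.out.println(fragmentOf(encoded));  // prints "null"
    }
}
```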

Consider an example in which a parameter is supplied in a URL and the parameter value contains a special character:

The parameter is "movie1" and its value is "Fast & Furious".

The parameter may be submitted via a URL such as "http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast & Furious". In this example, the space and "&" characters need special handling; otherwise the URL may not be interpreted properly (for example, the associated GET request may fail).

These characters can be encoded as follows:
Space as '%20' or '+'
'&' as '%26'

And thus the URL, after encoding, would look like 'http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast+%26+Furious'.
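The same encoding can be reproduced with java.net.URLEncoder. This is a minimal sketch: MovieParamDemo and buildUrl are hypothetical names, and the Charset overload of encode assumes Java 10 or later (on older JDKs, encode(value, "UTF-8") is the equivalent call).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class MovieParamDemo {
    // Appends a single encoded name=value pair to the base URL.
    static String buildUrl(String base, String name, String value) {
        return base + "?"
                + URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "="
                + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = buildUrl(
                "http://www.awebsite.com/encodingurls/submitmoviename.html",
                "movie1",
                "Fast & Furious");
        // prints "http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast+%26+Furious"
        System.out.println(url);
    }
}
```

Note that only the parameter name and value are encoded, not the whole URL; encoding the full string would also escape the ":", "/" and "?" characters that are being used for their reserved purposes.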

What is URL encoding?

URL encoding is the process of converting a string into a valid URL format. A valid URL format means that the URL contains only "alpha | digit | safe | extra | escape" characters. Various established standards specify URLs, including W3C standards and RFC 1738 (referenced below).

Safe and unsafe characters

Based on Web Standards, the following quick reference chart explains which characters are “safe” and which characters should be encoded in URLs.

Classification	Included characters	Encoding required?
Safe characters	Alphanumerics [`0-9a-zA-Z`], special characters `$-_.+!*'()`, and reserved characters used for their reserved purposes (e.g., question mark used to denote a query string)	NO
ASCII control characters	Includes the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal).	YES
Non-ASCII characters	Includes the entire “top half” of the ISO-Latin set 80-FF hex (128-255 decimal).	YES
Reserved characters	`$ & + , / : ; = ? @` (not including blank space)	YES*
Unsafe characters	Includes the blank/empty space and " < > # % { } | \ ^ ~ [ ] `	YES

* Note: Reserved characters only need encoding when not used for their defined, reserved purposes.

For complete details on these character classifications, see RFC 1738.
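To see how one common encoder treats these classes, the sketch below runs a few sample strings through java.net.URLEncoder. Note that URLEncoder implements the application/x-www-form-urlencoded scheme, which is stricter than the chart above: it leaves only letters, digits, and the characters . - * _ unchanged, turns a space into "+", and percent-encodes everything else as UTF-8 bytes. CharClassDemo and enc are hypothetical names, and the Charset overload assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CharClassDemo {
    // Encodes the input using UTF-8 and form-style URL encoding.
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(enc("abc-123_.*")); // prints "abc-123_.*" (left unchanged)
        System.out.println(enc("$!'()"));      // prints "%24%21%27%28%29"
        System.out.println(enc("~#<>"));       // prints "%7E%23%3C%3E"
        System.out.println(enc("a b"));        // prints "a+b"
        System.out.println(enc("\u00E9"));     // e-acute, prints "%C3%A9"
    }
}
```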

Walkthrough of an example Scenario using XCC/J

Let's take a look at a sample created to connect to MarkLogic Server using the XCC/J connector.

We will start with a case in which a document URI contains a special character that is not handled properly while the document is loaded into MarkLogic Server. We will then resolve the problem by using URL encoding.

Consider the following code:
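The listing itself is not reproduced in this copy of the article, so the following is only a rough stand-in, not the original sample. It builds the same ad hoc query text that appears in the log output further down, using plain string concatenation with no escaping. The class and method names are hypothetical, and the XCC/J calls are shown only as comments because they need the XCC/J library on the classpath; the connection string is a placeholder.

```java
class BadUriQueryDemo {
    // Builds the XQuery text by inserting the URI as-is, with no encoding or escaping.
    static String buildInsertQuery(String uri) {
        return "xdmp:document-insert(\"" + uri + "\", <test/>)";
    }

    public static void main(String[] args) {
        String badUri = "&.xml";
        String query = buildInsertQuery(badUri);
        // prints: Full adHocQuery being executed: xdmp:document-insert("&.xml", <test/>)
        System.out.println("Full adHocQuery being executed: " + query);

        // Assumed XCC/J usage (not compiled here; placeholder credentials and host):
        // ContentSource cs = ContentSourceFactory.newContentSource(
        //         URI.create("xcc://user:password@localhost:8000"));
        // Session session = cs.newSession();
        // try {
        //     session.submitRequest(session.newAdhocQuery(query));
        // } catch (RequestException e) {
        //     e.printStackTrace();
        // }
    }
}
```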

In the above code, we create an ad hoc query (newAdhocQuery) that calls xdmp:document-insert, passing in the URI (with the special character). The request is submitted inside a try-catch block to handle any exception raised while submitting it.

Running this code produces the following exception:

Full adHocQuery being executed: xdmp:document-insert("&.xml", <test/>)
com.marklogic.xcc.exceptions.XQueryException: XDMP-ENTITYREF: (err:XPST0003) Invalid entity reference ".xml"
[Session: user=[user], cb={default} [ContentSource: user=admin, cb={none} [provider: address=localhost/127.0.0.1:8000, pool=1/64]]]
[Client: XCC/8.0-1, Server: XDBC/8.0-1.1]
in /eval, on line 1
expr:

Notice that no '&' character appears in the exception message. In XQuery, '&' introduces an entity reference, so the text that follows it (".xml") is parsed as an entity name, which is invalid. To resolve this issue, we can use the encode method of the java.net.URLEncoder class to encode such characters. Now consider the following example:

As the example shows, the URI containing the special character is encoded before it is used:
String badUri = "&.xml";
String goodUri = URLEncoder.encode(badUri, "UTF-8");

Running this code successfully loads the document with the encoded URI, %26.xml.
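A self-contained sketch of the encoding step is shown below. It is not the original sample: the class and method names are hypothetical, the query is only built and printed rather than submitted through XCC/J, and the Charset overloads assume Java 10 or later (the two-argument String form used above behaves the same way).

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class EncodedUriQueryDemo {
    // URL encodes a document URI using UTF-8.
    static String encodeUri(String uri) {
        return URLEncoder.encode(uri, StandardCharsets.UTF_8);
    }

    // Builds the XQuery text for inserting an empty <test/> document at the given URI.
    static String buildInsertQuery(String uri) {
        return "xdmp:document-insert(\"" + uri + "\", <test/>)";
    }

    public static void main(String[] args) {
        String badUri = "&.xml";
        String goodUri = encodeUri(badUri);

        System.out.println(goodUri);                   // prints "%26.xml"
        System.out.println(buildInsertQuery(goodUri)); // prints: xdmp:document-insert("%26.xml", <test/>)

        // The original name can be recovered later by decoding the stored URI.
        System.out.println(URLDecoder.decode(goodUri, StandardCharsets.UTF_8)); // prints "&.xml"
    }
}
```

Because the document is stored under the encoded URI, any later lookup needs to apply the same encoding to the name before querying for it.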

Another example scenario using curl

In this example, we use curl to load a simple XML document whose URI contains a special character (ム). The scenario is similar to the one above, except that this time curl is used to load the document into MarkLogic Server.

Consider the following curl command:

curl --anyauth --user username:password -X PUT -T ./test.xml -i -H "Content-type: application/xml" http://localhost:8000/v1/documents?uri=/%e3%83%a0.xml

Here are the contents of test.xml: <test><sample>test 1</sample></test>

Running the above curl command fails with "400 Bad Request":

{"errorResponse":{"statusCode":400, "status":"Bad Request", "messageCode":"REST-INVALIDPARAM", "message":"REST-INVALIDPARAM: (err:FOER0000) Invalid parameter: invalid uri: /πâá.xml"}}

To resolve this issue, we can use the --data-urlencode option provided by curl to encode data.

Now consider the following example:

curl --anyauth --user username:password -X PUT -T ./test.xml -i -H "Content-type: application/xml" http://localhost:8000/v1/documents --data-urlencode uri=/%e3%83%a0.xml -G

--data-urlencode URL encodes the value of the uri parameter, and -G makes curl append that encoded data to the request URL as a query string instead of sending it in the request body.

Running this command successfully loads the document with the encoded URI, /%e3%83%a0.xml.
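The reason the stored URI is the literal text /%e3%83%a0.xml can be sketched in Java. Encoding a value that is already percent-encoded escapes each "%" as "%25", so after the server decodes the parameter once, the percent sequences are still present in the URI. This sketch uses java.net.URLEncoder as a stand-in for curl's --data-urlencode; the two produce the same result for this input but are not guaranteed to match for every character. DoubleEncodeDemo and its methods are hypothetical names, and the Charset overloads assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DoubleEncodeDemo {
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String dec(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Encoding the raw character once ("\u30E0" is the katakana character used above).
        System.out.println(enc("/\u30E0.xml"));     // prints "%2F%E3%83%A0.xml"

        // Encoding the already-encoded value, as in the second curl command.
        String sent = enc("/%e3%83%a0.xml");
        System.out.println(sent);                   // prints "%2F%25e3%2583%25a0.xml"

        // One round of decoding yields the URI the document is stored under.
        System.out.println(dec(sent));              // prints "/%e3%83%a0.xml"
    }
}
```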