Knowledgebase:
Using URL encoding to handle special characters in a document URI
15 May 2019 08:20 AM

Introduction

Special care may be needed when loading documents into MarkLogic Server if the document URI contains one or more special characters. In this article, we walk through a scenario where exceptions are thrown when such a URI is not handled properly, and then discuss how to handle these URIs. The article uses built-in functions (and the encode method of the java.net.URLEncoder class) and demonstrates their usage through a couple of samples written with XCC/J.

Relationship between URI and URL

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. The most common form of URI is the Uniform Resource Locator (URL).

A URL is a URI that, in addition to identifying a web resource, specifies how to act upon or obtain a representation of it by giving both its primary access mechanism and its network location. For example, the URL 'http://example.org/wiki/Main_Page' refers to a resource identified as /wiki/Main_Page whose representation, in the form of HTML and related code, is obtainable via the HyperText Transfer Protocol (http) from a network host whose domain name is example.org.
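As a small illustration of that breakdown, the sketch below uses the standard java.net.URI class to split the example URL into its access mechanism, network location, and resource path. The class name UrlPartsExample and the method describe are hypothetical names chosen for this illustration.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of XCC/J.
final class UrlPartsExample {

    // Returns the scheme, host, and path of a URL as a readable string.
    static String describe(String url) {
        URI uri = URI.create(url);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Prints: scheme=http, host=example.org, path=/wiki/Main_Page
        System.out.println(describe("http://example.org/wiki/Main_Page"));
    }
}
```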

While it is possible to load documents into MarkLogic Server whose URIs contain unencoded special characters, we recommend URL encoding document URIs. Doing so helps you design robust applications, free from the side effects that such characters can cause in other areas of your application stack.

Importance of URL encoding

URL encoding is often required to convert special characters (such as "/", "&", "#", ...), because special characters: 

  1. have special meaning in some contexts; or
  2. are not valid characters in a URL; or
  3. could be altered during transfer. 

For instance, the "#" character needs to be encoded because it has a special meaning: it introduces a fragment identifier (an HTML anchor). The space character needs to be encoded because it is not a valid URL character. Also, some characters, such as "~", might not be transported reliably across the internet.
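The first two cases can be observed with the standard java.net.URI parser. The following is a minimal sketch, assuming hypothetical URLs under example.org; the class name SpecialCharacterDemo and its methods are illustrative names only.

```java
import java.net.URI;

// Hypothetical demo class; the URLs used here are made up for illustration.
final class SpecialCharacterDemo {

    // Path component as seen by the parser (percent-escapes are decoded).
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    // Fragment component, or null when the URL has none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    // True when the string is accepted as a syntactically valid URI.
    static boolean parses(String url) {
        try {
            URI.create(url);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // An unencoded '#' splits the intended path into path + fragment.
        System.out.println(pathOf("http://example.org/docs/chapter#1.xml"));     // /docs/chapter
        System.out.println(fragmentOf("http://example.org/docs/chapter#1.xml")); // 1.xml

        // Encoded as %23, the '#' stays part of the path.
        System.out.println(pathOf("http://example.org/docs/chapter%231.xml"));   // /docs/chapter#1.xml

        // An unencoded space is rejected; %20 is accepted.
        System.out.println(parses("http://example.org/docs/my file.xml"));       // false
        System.out.println(parses("http://example.org/docs/my%20file.xml"));     // true
    }
}
```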

Consider an example where a parameter is supplied in a URL and the parameter value contains special characters:

  • Parameter is "movie1" and its value is "Fast & Furious"

The parameter may be submitted via a URL such as "http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast & Furious". In this example, the space and "&" characters need to be handled specially; otherwise the URL may not be interpreted properly. For example, the associated GET request may fail.

These characters can be encoded as follows:
      Space as '%20' or '+' (the '+' form applies to query strings)
      '&' as '%26'

And thus the URL, after encoding, would look like 'http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast+%26+Furious'.
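The encoding of the parameter value above can be reproduced with the encode method of java.net.URLEncoder. This is a minimal sketch: the class name MovieUrlExample and the helpers encodeValue and buildUrl are hypothetical, and UTF-8 is assumed as the character encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration only.
final class MovieUrlExample {

    // Encodes a single query-string name or value (assumes UTF-8).
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // Builds base?name=value with the name and value encoded separately.
    static String buildUrl(String base, String name, String value) {
        return base + "?" + encodeValue(name) + "=" + encodeValue(value);
    }

    public static void main(String[] args) {
        // Prints the encoded URL ending in ?movie1=Fast+%26+Furious
        System.out.println(buildUrl(
                "http://www.awebsite.com/encodingurls/submitmoviename.html",
                "movie1",
                "Fast & Furious"));
    }
}
```

Note that only the parameter name and value are passed to the encoder. Encoding the whole URL would also escape the "?" and "=" that are being used for their reserved purposes.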

What is URL encoding?

URL encoding is the process of converting a string into a valid URL format. Valid URL format means that the URL contains only "alpha | digit | safe | extra | escape" characters. There are various established standards for URL specifications, including the W3C documents listed below:

  1. http://www.w3.org/Addressing/URL/url-spec.html
  2. http://www.w3.org/International/francois.yergeau.html 

Safe and unsafe characters

Based on Web Standards, the following quick reference chart explains which characters are “safe” and which characters should be encoded in URLs. 

Classification: Safe characters
Included characters: Alphanumerics [0-9a-zA-Z], special characters $-_.+!*'(), and reserved characters used for their reserved purposes (e.g., question mark used to denote a query string)
Encoding required? NO

Classification: ASCII Control characters
Included characters: The ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F hex (127 decimal)
Encoding required? YES

Classification: Non-ASCII characters
Included characters: The entire “top half” of the ISO-Latin set, 80-FF hex (128-255 decimal)
Encoding required? YES

Classification: Reserved characters
Included characters: $ & + , / : ; = ? @ (not including blank space)
Encoding required? YES*

Classification: Unsafe characters
Included characters: The blank/empty space and " < > # % { } | \ ^ ~ [ ] `
Encoding required? YES

* Note: Reserved characters only need encoding when not used for their defined, reserved purposes.
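To make the chart concrete, the sketch below percent-encodes every byte of a string that falls outside the chart's "safe" row. It is an illustration under stated assumptions, not a standards-complete encoder: the class ChartEncoder and its methods are hypothetical names, UTF-8 is assumed for non-ASCII input, and the characters "$", "+" and "," are treated as safe because they appear in the safe row even though the chart also lists them as reserved. Reserved characters are always encoded here, which is appropriate only when they are not serving their reserved purpose.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical encoder that follows the quick reference chart above.
final class ChartEncoder {

    // Special characters listed in the chart's "safe" row.
    private static final String SAFE_SPECIALS = "$-_.+!*'(),";

    // True for alphanumerics and the safe special characters.
    static boolean isSafe(int b) {
        return (b >= 'a' && b <= 'z')
                || (b >= 'A' && b <= 'Z')
                || (b >= '0' && b <= '9')
                || SAFE_SPECIALS.indexOf(b) >= 0;
    }

    // Replaces every non-safe byte of the UTF-8 form with %XX.
    static String percentEncode(String input) {
        StringBuilder out = new StringBuilder();
        for (byte raw : input.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isSafe(b)) {
                out.append((char) b);
            } else {
                out.append('%').append(String.format("%02X", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode("Fast & Furious")); // Fast%20%26%20Furious
        System.out.println(percentEncode("a#b~c"));          // a%23b%7Ec
    }
}
```

This differs from java.net.URLEncoder, which applies HTML form encoding rules: it writes a space as "+" and leaves only alphanumerics and the characters ".", "-", "*" and "_" unescaped.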

For complete details on these character classifications, see RFC 1738.

Walkthrough of an example scenario using XCC/J

Let's take a look at a sample created to connect to MarkLogic Server using the XCC/J connector. 

We will start with a case where a document URI contains a special character that is not handled properly while loading the document into MarkLogic Server. We will then resolve the problem by using URL encoding.

Consider the following code: