Using URL encoding to handle special characters in a document URI
24 November 2020 10:59 AM
Introduction

Special care may need to be taken when loading documents into MarkLogic Server where the document URI contains one or more special characters. In this article, we walk through a scenario where exceptions are thrown if a URI with a special character is not handled properly, and then we talk about how to handle such URIs. This article takes advantage of inbuilt functions for encoding.

Relationship between URI and URL

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. The most common form of URI is the Uniform Resource Locator (URL). A URL is a URI that, in addition to identifying a web resource, specifies the means of acting upon or obtaining its representation by giving both its primary access mechanism and its network location.

While it is possible to load documents into MarkLogic Server where the document URI contains special characters that are not encoded, it is recommended to follow best practices by URL encoding document URIs. Doing so will help you design robust applications, free from the side effects caused by such special characters in other areas of your application stack.

Importance of URL encoding

URL encoding is often required to convert special characters (such as "/", "&", "#", ...), because special characters may have a special meaning in some contexts, may not be valid URL characters, or may not transport properly across the internet.
For instance, the "#" character needs to be encoded because it has the special meaning of an HTML anchor. The space character needs to be encoded because it is not a valid URL character. Also, some characters, such as "~", might not transport properly across the internet. Consider the case where a parameter is supplied in a URL and the parameter value has a special character in it. When the parameter is submitted via a URL, the special characters in its value can be encoded so that the URL remains valid.

What is URL encoding?

URL encoding is the process of converting a string into a valid URL format. Valid URL format means that the URL contains only "alpha | digit | safe | extra | escape" characters. For URL specifications, there are various established standards, including W3C standards.
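To make the parameter case concrete, here is a minimal sketch that encodes a single parameter value with java.net.URLEncoder, which appears in the references. The class name, the sample value, the host and the parameter name are all hypothetical and chosen only for illustration. Note that URLEncoder follows form-encoding rules, so a space is written as "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryParamEncodingSketch {

    /** Encodes one query-parameter value as UTF-8 (form style: a space becomes '+'). */
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical parameter value containing a space, "&" and "#".
        String raw = "Tom & Jerry #1";
        String encoded = encodeValue(raw);
        // The host and parameter name are placeholders, not taken from the article.
        System.out.println("http://example.com/search?title=" + encoded);
    }
}
```

With this input the value is emitted as Tom+%26+Jerry+%231, so the "&" can no longer be mistaken for a parameter separator and the "#" can no longer be read as an anchor.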
Safe and unsafe characters

Based on web standards, the following quick reference chart explains which characters are “safe” and which characters should be encoded in URLs.
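The same classification can be expressed in code. The following is a minimal sketch, assuming the character classes described in RFC 1738 (section 2.2): alphanumerics and the specials $-_.+!*'(), are treated as safe, the characters ;/?:@=& as reserved, and everything else (space, control characters, non-ASCII characters, "#", "%", "~" and so on) as unsafe. The class and enum names are hypothetical. RFC 3986 later revised some of these classes, for example by treating "~" as unreserved.

```java
final class Rfc1738CharClasses {

    enum CharClass { SAFE, RESERVED, UNSAFE }

    // Special characters that RFC 1738 allows unencoded, in addition to alphanumerics.
    private static final String SAFE_SPECIALS = "$-_.+!*'(),";
    // Characters that RFC 1738 reserves for a special meaning within a URL scheme.
    private static final String RESERVED = ";/?:@=&";

    static CharClass classify(char c) {
        boolean alphanumeric = (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9');
        if (alphanumeric || SAFE_SPECIALS.indexOf(c) >= 0) {
            return CharClass.SAFE;
        }
        if (RESERVED.indexOf(c) >= 0) {
            return CharClass.RESERVED;
        }
        // Everything else: space, control characters, non-ASCII, "#", "%", "~", ...
        return CharClass.UNSAFE;
    }

    public static void main(String[] args) {
        // '\u30E0' is the katakana character used in the curl example of this article.
        char[] samples = {'a', '7', '_', '/', '&', ' ', '#', '~', '\u30E0'};
        for (char c : samples) {
            System.out.println("'" + c + "' -> " + classify(c));
        }
    }
}
```

A character classified as UNSAFE should always be encoded, while a RESERVED character depends on how it is being used.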
* Note: Reserved characters only need encoding when not used for their defined, reserved purposes. For complete details on these character classifications, please check RFC 1738.

Walkthrough of an example scenario using XCC/J

Let's take a look at a sample created to connect to MarkLogic Server using the XCC/J connector. We will start with a case where a document URI contains a special character that is not handled safely while loading the document into MarkLogic Server. Next, we will resolve it by encoding the URI. Consider the following code:
Running the code above produces the following exception:
Notice that there is no '
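A minimal sketch of the encoding step on its own, assuming java.net.URLEncoder (listed in the references) and a hypothetical document URI. It encodes each path segment separately so that "/" keeps its reserved role as a separator, and it stops short of opening an XCC/J session. The class and method names are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class DocumentUriEncodingSketch {

    /** Percent-encodes each path segment of a document URI, keeping "/" as the separator. */
    static String encodeDocumentUri(String uri) {
        String[] segments = uri.split("/", -1); // -1 keeps leading and trailing empty segments
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            out.append(encodeSegment(segments[i]));
        }
        return out.toString();
    }

    private static String encodeSegment(String segment) {
        try {
            // URLEncoder targets form data and writes a space as "+". A literal "+" in the
            // input is already emitted as "%2B", so rewriting "+" to "%20" afterwards is safe.
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document URI; \u30E0 is the katakana character "mu" plus a space.
        System.out.println(encodeDocumentUri("/docs/\u30E0 sample.xml"));
    }
}
```

For the sample URI this prints /docs/%E3%83%A0%20sample.xml, which contains only ASCII characters and can be passed on as the document URI.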
As you can see in the above example, the URI containing the special character is URL encoded before the document is loaded. Running this code will successfully load the document with the encoded URI.

Another example scenario using curl

In this example, we use curl to load a simple XML document with a URI that contains a special character (ム). The scenario is similar to the one above, except that this time we use curl to load the document into MarkLogic. Consider the following curl command:
Here are the contents of test.xml:

<test><sample>test 1</sample></test>

Running the above curl command to load a simple XML document with a URI that contains a special character (ム) fails with "400 Bad Request":
To resolve this issue, we can use the --data-urlencode option provided by curl to encode the data. Now consider the example below:
Running this code will successfully load the document with the encoded URI.

Conclusion

While it is possible to load documents into MarkLogic Server where the document URI contains special characters that are not encoded, it is recommended to follow best practices by URL encoding document URIs. Doing so will help you design robust applications, free from the side effects caused by such special characters in other areas of your application stack.

References

I. http://www.permadi.com/tutorial/urlEncoding/
II. http://perishablepress.com/stop-using-unsafe-characters-in-urls/
III. http://www.ietf.org/rfc/rfc3986.txt (RFC 3986 on URI)
IV. http://www.ietf.org/rfc/rfc1738.txt (RFC 1738 on URL)
V. http://developer.marklogic.com/products/xcc
VI. http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html
VIII. https://ec.haxx.se/http-post.html