Knowledgebase:
URL decoding throwing errors
24 May 2016 01:30 PM

Summary

This article describes the errors that xdmp:url-decode() can throw when decoding URLs, and how to detect invalid characters so that those errors can be avoided.

Details

When a URL is decoded with xdmp:url-decode(), some percent-encoded sequences can cause one of two errors to be thrown:

  1. XDMP-UTF8SEQ is thrown if the percent-encoded bytes do not form a valid UTF-8 octet sequence. A good description of UTF-8 can be found at: https://en.wikipedia.org/wiki/UTF-8 
  2. XDMP-CODEPOINT is thrown if the UTF-8 octet sequence specifies a Unicode codepoint invalid for XML.
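The two failure modes above can be reproduced outside MarkLogic. The following is a minimal Python sketch, not MarkLogic's implementation: the function name classify_url_decode is hypothetical, and the mapping of each failure to an error name is an assumption based on the descriptions above.

```python
from urllib.parse import unquote_to_bytes


def classify_url_decode(encoded: str) -> str:
    """Return "ok" or the name of the error the input would resemble."""
    # Turn percent-escapes into raw octets without interpreting them yet.
    raw = unquote_to_bytes(encoded)
    try:
        # Strict UTF-8 decoding fails on malformed octet sequences.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "XDMP-UTF8SEQ"
    for ch in text:
        cp = ord(ch)
        # XML 1.0 characters: tab, LF, CR, U+0020..U+D7FF,
        # U+E000..U+FFFD, U+10000..U+10FFFF.
        valid = (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
        if not valid:
            return "XDMP-CODEPOINT"
    return "ok"
```

For example, "%E9" is a lone Latin-1 byte rather than a UTF-8 sequence and falls into the first category, while "%00" is well-formed UTF-8 for a codepoint that XML does not allow and falls into the second. Python's strict decoder also rejects encoded surrogates, so this sketch reports them in the first category; MarkLogic may classify them differently.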

The specification for Uniform Resource Identifier (URI): Generic Syntax can be found here: https://tools.ietf.org/html/rfc3986. In particular, the following passage explains why certain percent-encoded sequences are invalid: "Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters."
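As an illustration of the rule quoted from RFC 3986, the Python sketch below encodes the same character two ways: first as UTF-8 and then percent-encoded, as the RFC requires, and then through a legacy single-byte encoding. The helper name decodes_as_utf8 is hypothetical, and Python's urllib.parse stands in for whatever produced the URL.

```python
from urllib.parse import quote, unquote

# RFC 3986 style: encode as UTF-8, then percent-encode each octet.
correct = quote("é")

# Legacy style: percent-encode the single Latin-1 byte instead.
legacy = quote("é", encoding="latin-1")


def decodes_as_utf8(encoded: str) -> bool:
    """True if the percent-encoded octets form valid UTF-8."""
    try:
        unquote(encoded, errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Here correct is "%C3%A9" and decodes cleanly, while legacy is "%E9", which is not a valid UTF-8 octet sequence. URLs produced the second way are a likely source of the first error described above.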

The expression below evaluates to true when a codepoint is not a valid XML character. Make sure to remove any such characters prior to URL decoding.

(codepoint <= 0x8) ||
(codepoint >= 0xb && codepoint <= 0xc) ||
(codepoint > 0xd && codepoint < 0x20) ||
(codepoint >= 0xd800 && codepoint < 0xe000) ||
(codepoint > 0xfffd && codepoint < 0x10000) ||
(codepoint >= 0x110000)
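One way to apply the check above is to wrap it in a function and use it to clean a percent-encoded string before it is passed to the decoder. The following Python sketch makes several assumptions: the names is_invalid_codepoint and sanitize_encoded are hypothetical, malformed UTF-8 octets are silently dropped, and the cleaned text is re-encoded so that every character outside the unreserved set is percent-encoded.

```python
from urllib.parse import quote, unquote_to_bytes


def is_invalid_codepoint(codepoint: int) -> bool:
    """Same ranges as the expression above: True if not allowed in XML."""
    return (
        codepoint <= 0x8
        or 0xB <= codepoint <= 0xC
        or 0xD < codepoint < 0x20
        or 0xD800 <= codepoint < 0xE000
        or 0xFFFD < codepoint < 0x10000
        or codepoint >= 0x110000
    )


def sanitize_encoded(encoded: str) -> str:
    """Return a percent-encoded string with invalid content removed."""
    raw = unquote_to_bytes(encoded)
    # errors="ignore" drops octets that are not valid UTF-8.
    text = raw.decode("utf-8", errors="ignore")
    # Drop characters that XML does not allow.
    kept = "".join(ch for ch in text if not is_invalid_codepoint(ord(ch)))
    # Re-encode as UTF-8 and percent-encode; safe="" also encodes "/".
    return quote(kept, safe="")
```

With this sketch, "a%00b" becomes "ab" and "%C3%A9%E9" becomes "%C3%A9". Dropping characters silently may not suit every application; rejecting the request or logging the offending input are reasonable alternatives.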
