Knowledgebase:
URL decoding throwing errors
24 May 2016 01:30 PM

Summary

This article describes the errors that can be thrown when decoding URLs, and how to detect invalid characters in order to avoid those errors.

Details

When decoding URLs with xdmp:url-decode(), certain characters can cause one of two errors to be thrown:

  1. XDMP-UTF8SEQ is thrown if the percent-encoded bytes do not form a valid UTF-8 octet sequence. A good description of UTF-8 can be found at: https://en.wikipedia.org/wiki/UTF-8 
  2. XDMP-CODEPOINT is thrown if the UTF-8 octet sequence specifies a Unicode codepoint invalid for XML.
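
The first failure mode can be reproduced outside MarkLogic. The Python sketch below is an illustration only, not MarkLogic's implementation: it percent-decodes a string to raw bytes and checks whether those bytes form a valid UTF-8 sequence. The helper name decodes_to_valid_utf8 is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decodes_to_valid_utf8(encoded: str) -> bool:
    """Return True if the percent-decoded bytes form a valid UTF-8 sequence."""
    raw = unquote_to_bytes(encoded)  # e.g. "%C3%A9" becomes b"\xc3\xa9"
    try:
        raw.decode("utf-8")  # strict decoding raises on malformed sequences
    except UnicodeDecodeError:
        return False
    return True


# Valid: 0xC3 0xA9 is the two-octet UTF-8 encoding of "é".
print(decodes_to_valid_utf8("caf%C3%A9"))
# Invalid: a lone 0xE9 octet (Latin-1 "é") is not a complete UTF-8 sequence.
print(decodes_to_valid_utf8("caf%E9"))
```

A string that passes this check can still contain a codepoint that is invalid for XML (for example "%0B"), which corresponds to the second error.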

The specification for the Uniform Resource Identifier (URI): Generic Syntax can be found here: https://tools.ietf.org/html/rfc3986. In particular, the following section explains why certain characters are invalid: "Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters."
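
As a small illustration of the rule quoted above, the following Python sketch encodes a non-ASCII character as UTF-8 and then percent-encodes each octet. It uses Python's standard urllib.parse module rather than any MarkLogic function, so it only demonstrates the encoding rule itself.

```python
from urllib.parse import quote

char = "é"                      # U+00E9, a non-ASCII character
octets = char.encode("utf-8")   # step 1: encode as UTF-8, giving two octets
encoded = quote(char, safe="")  # step 2: percent-encode each octet

print(octets.hex())  # hexadecimal view of the UTF-8 octets
print(encoded)       # percent-encoded form of the same octets
```

A URL produced this way decodes cleanly. An octet that was percent-encoded from some other character encoding (such as Latin-1) generally does not.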

The expression below can be used to detect invalid characters: it evaluates to true when codepoint is not valid for XML. Make sure to remove any invalid characters prior to URL decoding.

(codepoint <= 0x8) ||
(codepoint >= 0xb && codepoint <= 0xc) ||
(codepoint > 0xd && codepoint < 0x20) ||
(codepoint >= 0xd800 && codepoint < 0xe000) ||
(codepoint > 0xfffd && codepoint < 0x10000) ||
(codepoint >= 0x110000)
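
The same range check can be wrapped in a function and applied to every character of a decoded string. The Python sketch below is one possible way to do this, assuming the input is percent-encoded UTF-8. The function names is_invalid_xml_codepoint and find_invalid_codepoints are hypothetical and are not part of any MarkLogic API.

```python
from typing import List
from urllib.parse import unquote_to_bytes


def is_invalid_xml_codepoint(codepoint: int) -> bool:
    """Mirror of the range check above: True if the codepoint is invalid for XML."""
    return (
        codepoint <= 0x8
        or 0xB <= codepoint <= 0xC
        or 0xD < codepoint < 0x20
        or 0xD800 <= codepoint < 0xE000
        or 0xFFFD < codepoint < 0x10000
        or codepoint >= 0x110000
    )


def find_invalid_codepoints(encoded: str) -> List[int]:
    """Percent-decode `encoded` and list the codepoints that are invalid for XML.

    Raises UnicodeDecodeError if the decoded octets are not valid UTF-8.
    """
    text = unquote_to_bytes(encoded).decode("utf-8")
    return [ord(ch) for ch in text if is_invalid_xml_codepoint(ord(ch))]


# Vertical tab (0x0B) is flagged; horizontal tab (0x09) is allowed in XML.
print([hex(cp) for cp in find_invalid_codepoints("a%0Bb%09c")])
```

An empty result means no character in the decoded string falls in the invalid ranges. Any codepoint that is reported identifies a character to remove or replace in the input.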
