Text Extraction Security
08 November 2016 02:03 PM


Binary documents often carry associated metadata. For example, an image may record when and where it was taken. MarkLogic Server can extract this metadata from binary documents (e.g. images, MS Office and Adobe PDF files) using XQuery built-in functions and conversion pipelines that rely on third-party software.

This article describes the security vulnerabilities reported for text extraction and the MarkLogic releases that contain the resolution.


MarkLogic Server's built-in function xdmp:document-filter() extracts metadata and text from binary documents as XHTML. Additionally, the server's xdmp:pdf-convert() function and the Content Processing Framework (CPF) help convert HTML, Adobe PDF and Microsoft Office documents to XML.

However, these mechanisms rely on third-party software, such as the Iceni "Argus PDF converter" and "Perceptive Document Filters" from Lexmark, to extract text and metadata from a wide variety of document formats.

Recently, both Iceni and Lexmark issued security alerts for vulnerabilities in these products and incorporated fixes into their most recent releases. They have published the following CVEs:

For Iceni:

  • CVE-2016-8333 and CVE-2016-8335
    • An exploitable stack-based buffer overflow vulnerability

The latest version of Iceni (v6.6.5) patches the security issues listed above.

For Lexmark:

  • CVE-2016-5646
    • An exploitable heap overflow vulnerability exists in the Compound Binary Format (CBFF) parser functionality of the Lexmark Perceptive Document Filters Library.
  • CVE-2016-4336
    • An exploitable out-of-bounds write vulnerability exists in the Bzip2 parsing of the Perceptive Document Filters.
  • CVE-2016-4335
    • An exploitable buffer overflow vulnerability exists in the XLS parsing of the Perceptive Document Filters conversion functionality.

These are considered to be vulnerabilities of "High" severity based on CVSS base scores in excess of 7.0. A carefully crafted PDF, CBFF, Bzip2, or XLS file could cause a buffer overflow, which can result in arbitrary code execution.

The latest version of Lexmark Isys (v11.3) patches the security issues listed above.



MarkLogic has issued an update which includes these fixes.

The latest releases of MarkLogic Server versions 7 (7.0-6.8) and 8 (8.0-6) incorporate the latest fixes for Iceni and Lexmark Isys and are available for download from our Community website.
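
As a rough aid for checking a deployment against the releases named above, the following Python sketch compares a server version string with the 7.0-6.8 and 8.0-6 minimums. It is an illustration rather than an official tool. The helper names (parse_version, includes_fix) are hypothetical, the version string is assumed to follow the major.minor-patch[.sub] pattern seen in this article, and it is assumed that later releases within the same major version retain the fixes.

```python
import re
from typing import Optional, Tuple

# Minimum patched releases named in this article, keyed by major version.
PATCHED_MINIMUMS = {
    7: (7, 0, 6, 8),  # 7.0-6.8
    8: (8, 0, 6, 0),  # 8.0-6
}

# Assumed format: major.minor-patch with an optional .sub suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)-(\d+)(?:\.(\d+))?$")


def parse_version(version: str) -> Tuple[int, int, int, int]:
    """Parse a string such as '8.0-6' or '7.0-6.8' into a 4-tuple of ints."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, sub = match.groups()
    return int(major), int(minor), int(patch), int(sub or 0)


def includes_fix(version: str) -> Optional[bool]:
    """Return True or False for 7.x and 8.x releases.

    Returns None when this article names no patched release for the
    major version, since nothing can be concluded in that case.
    """
    parsed = parse_version(version)
    minimum = PATCHED_MINIMUMS.get(parsed[0])
    if minimum is None:
        return None
    # Assumption: releases at or after the named minimum keep the fix.
    return parsed >= minimum
```

For example, includes_fix("8.0-5.8") evaluates to False and includes_fix("7.0-6.8") to True. The version string itself would typically be read from the running server, for instance with xdmp:version().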


  • For more information on the Lexmark security issues, see

  • Further details on Iceni issues can be found at:


