Text Extraction Security
08 November 2016 02:03 PM
Binary documents often carry associated metadata. For example, an image may record when and where it was taken. MarkLogic Server can extract this metadata from binary documents (e.g. images, MS Office and Adobe PDF files) using XQuery built-in functions and conversion pipelines that rely on third-party software.
This article describes the security vulnerabilities reported for text extraction and the MarkLogic releases that contain the resolution.
MarkLogic Server's built-in function xdmp:document-filter() extracts metadata and text from binary documents as XHTML. Additionally, the server's xdmp:pdf-convert() function and the Content Processing Framework (CPF) help convert HTML, Adobe PDF and Microsoft Office documents to XML.
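As a minimal sketch of the kind of call that exercises the filtering code path, the query below runs xdmp:document-filter() on a stored binary document and returns the metadata elements from the resulting XHTML. The document URI is hypothetical, and the query assumes it is run against a database that already holds a binary document at that URI.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI: replace with a binary document that exists in your database. :)
let $uri := "/example/report.pdf"

(: xdmp:document-filter takes a document node and returns the filtered result as XHTML. :)
let $filtered := xdmp:document-filter(fn:doc($uri))

(: Extracted metadata is carried in <meta> elements; the body holds the extracted text. :)
return $filtered//*:meta
```

Any application that passes untrusted uploads to a call like this hands the file to the third-party filter, which is why the vulnerabilities described here matter.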
However, these mechanisms rely on third-party software, namely Iceni's "Argus PDF converter" and "Perceptive Document Filters" from Lexmark, to extract text and metadata from a wide variety of document formats.
Recently, both Iceni and Lexmark have issued security alerts for vulnerabilities in these products and have incorporated fixes into their most recent releases. They have published the following CVEs:
The latest version of Iceni (v6.6.5) patches the security issues listed above.
These are considered to be vulnerabilities of "High" severity based on CVSS base scores in excess of 7.0. A carefully crafted PDF, CBFF, Bzip2, or XLS file could be used to cause a buffer overflow, which can result in arbitrary code execution.
The latest version of Lexmark Isys (v11.3) patches the security issues listed above.
MarkLogic has issued an update which includes these fixes.