Best practices around data modeling and data loading
30 June 2015 01:39 PM
MarkLogic Server can ingest and query many kinds of data, such as XML, text, JSON, binary, and generic data. There are some things to consider when choosing between simply loading data "as-is" and doing some degree of data modeling or data transformation prior to ingestion.
Loading data "as-is" can minimize time and complexity during ingest or document creation. However, that can sometimes mean more complex, slower-performing queries. It may also require indexing settings that consume more storage space.
In contrast, doing some degree of data transformation prior to ingestion can sometimes result in dramatic improvements in query performance and storage space utilization due to reduced indexing requirements.
A simple example demonstrates how a data model can affect performance. Consider the data model used by Apple's iTunes:
Note the multiple <key> sibling elements at multiple levels, where the enclosing element at both levels has the same name (in this case, <dict>). Suppose you wanted to query a document like this for "Application Version". In this case, time will be spent performing index resolution for the encompassing element (here,
<name>01-03 Good News</name>
Here, we only need to query, and therefore retrieve and evaluate, the single <app-version> element, instead of performing multiple retrievals and evaluations as in the previous example data model.
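As a rough illustration of the difference between the two models, the following Python sketch converts a key/value document into one with named elements and shows what a lookup looks like in each. It is not MarkLogic code and does not model index resolution; it only shows the shape of the work. The sample document, the version value, the "Track" key, the RENAMES mapping, and the function names are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the key/value style; the version value is a placeholder.
SAMPLE = """
<dict>
  <key>Application Version</key><string>0.0.0</string>
  <key>Track</key>
  <dict>
    <key>Name</key><string>01-03 Good News</string>
  </dict>
</dict>
"""

# Assumed mapping from original keys to preferred element names.
RENAMES = {"Application Version": "app-version"}


def element_name(key):
    """Map a key such as 'Track ID' to an element name such as 'track-id'."""
    if key in RENAMES:
        return RENAMES[key]
    name = re.sub(r"[^a-z0-9]+", "-", key.lower()).strip("-")
    # XML names may not start with a digit, so prefix those (and empty names).
    return name if name and not name[0].isdigit() else "_" + name


def pairs(dict_el):
    """Yield (key element, value element) pairs from alternating siblings."""
    children = list(dict_el)
    return zip(children[0::2], children[1::2])


def lookup(dict_el, key):
    """Key/value model: scan the <key> siblings, then read the value next to the match."""
    for key_el, value_el in pairs(dict_el):
        if key_el.text == key:
            return value_el.text
    return None


def flatten(dict_el, tag):
    """Rebuild a <dict> as an element whose children are named after the keys."""
    out = ET.Element(tag)
    for key_el, value_el in pairs(dict_el):
        name = element_name(key_el.text or "")
        if value_el.tag == "dict":
            out.append(flatten(value_el, name))
        else:
            ET.SubElement(out, name).text = value_el.text
    return out


original = ET.fromstring(SAMPLE.strip())
flat = flatten(original, "library")

print(lookup(original, "Application Version"))  # walks <key> siblings
print(flat.findtext("app-version"))             # addresses one named element
print(ET.tostring(flat, encoding="unicode"))
```

In the key/value model the lookup has to inspect each <key> and then step to its neighbouring value, while in the transformed model the element name itself identifies the value. A transformation like this would run before ingestion, so its cost is paid once per document rather than on every query.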
Although this is a simple example, when processing millions or even billions of records, eliminating small processing steps can have a significant impact on performance.