Knowledgebase:
Best practices around data modeling and data loading
08 February 2021 01:18 PM

Summary

MarkLogic Server can ingest and query many kinds of data, such as XML, text, JSON, binary, and generic documents. There are some things to consider when choosing between simply loading data "as-is" and doing some degree of data modeling or data transformation prior to ingestion.

Details

Loading data "as-is" can minimize time and complexity during ingest or document creation. However, it can sometimes mean more complex, slower-performing queries. It may also require indexing settings that consume more storage space.

In contrast, doing some degree of data transformation prior to ingestion can sometimes result in dramatic improvements in query performance and storage space utilization due to reduced indexing requirements.

An Example

A simple example will demonstrate how a data model can affect performance. Consider the data model used by Apple's iTunes:

<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
  <dict>
    <key>Track ID</key><integer>290</integer>
    <key>Name</key><string>01-03 Good News</string>
          …
  </dict>
</dict>
</plist>

Note the many sibling <key> elements, which appear at more than one level, inside containers that all share the same name (in this case, <dict>). Suppose you want to query a document like this for "Application Version". Time will be spent performing index resolution for the enclosing element (here, <key>). Unfortunately, because many sibling elements share that element name, all of them must be retrieved and then evaluated to see which ones actually match the query criteria. Consider a slightly revised data model instead:

<iTunesLibrary version="1.0">
<application>
  <major-version>10</major-version>
  <minor-version>1</minor-version>
  <app-version>10.1.1</app-version>
  <show-content-ratings>true</show-content-ratings>
  <tracks>
    <track-id>290</track-id>
    <name>01-03 Good News</name>
          …
  </tracks>
</application>
</iTunesLibrary>

Here, we only need to query, and therefore retrieve and evaluate, the single <app-version> element, rather than performing multiple retrievals and evaluations as with the previous data model.
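
The difference can be sketched outside MarkLogic with a few lines of Python. This is only an illustration using Python's standard xml.etree.ElementTree module, not a model of how MarkLogic resolves indexes. The function names query_plist and query_revised are made up for this sketch, and the sample documents are trimmed copies of the two models above. For each model it counts how many elements share the queried element name and therefore have to be checked.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two data models shown above.
PLIST_DOC = """
<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
</dict>
</plist>
"""

REVISED_DOC = """
<iTunesLibrary version="1.0">
<application>
  <major-version>10</major-version>
  <minor-version>1</minor-version>
  <app-version>10.1.1</app-version>
  <show-content-ratings>true</show-content-ratings>
</application>
</iTunesLibrary>
"""


def query_plist(root, wanted_key):
    """Return (value, candidates) for a key in the key/value model.

    Every <key> element is a candidate, because they all share one name.
    """
    candidates = 0
    value = None
    for container in root.iter("dict"):
        children = list(container)
        for position, child in enumerate(children):
            if child.tag != "key":
                continue
            candidates += 1  # each <key> has to be compared with the query
            is_match = child.text == wanted_key
            if value is None and is_match and position + 1 < len(children):
                # The value is the element that follows the matching <key>.
                value = children[position + 1].text
    return value, candidates


def query_revised(root, element_name):
    """Return (value, candidates) for a descriptively named element."""
    matches = root.findall(f".//{element_name}")
    return (matches[0].text if matches else None), len(matches)


plist_result = query_plist(ET.fromstring(PLIST_DOC.strip()), "Application Version")
revised_result = query_revised(ET.fromstring(REVISED_DOC.strip()), "app-version")
print("key/value model:", plist_result)
print("revised model:  ", revised_result)
```

Both lookups return the same value, but the key/value model has to consider every <key> element in the document, while the revised model has a single candidate.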

At Scale

Although this is a simple example, when processing millions or even billions of records, eliminating small processing steps can have a significant performance impact.
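
A transformation like this can be applied before ingestion. The following is a minimal sketch in Python using only the standard library. It assumes a flat <dict> in which <key> elements and value elements strictly alternate. The helper names element_name and remodel are hypothetical, and nested dictionaries, arrays, and keys that do not produce valid XML names are not handled. Because element names are derived mechanically from the key text, "Application Version" becomes <application-version> rather than the hand-picked <app-version> used above.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed sample input in the key/value model (flat dictionary only).
SOURCE = """
<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
</dict>
</plist>
"""


def element_name(key_text):
    """Turn key text such as 'Major Version' into 'major-version'."""
    return re.sub(r"[^a-z0-9]+", "-", key_text.strip().lower()).strip("-")


def remodel(plist_xml):
    """Build an <iTunesLibrary> tree with one named element per key."""
    plist = ET.fromstring(plist_xml.strip())
    library = ET.Element("iTunesLibrary", version=plist.get("version", ""))
    application = ET.SubElement(library, "application")

    children = list(plist.find("dict"))
    # Assumption: children alternate <key>, value, <key>, value, ...
    for key_el, value_el in zip(children[0::2], children[1::2]):
        child = ET.SubElement(application, element_name(key_el.text))
        if value_el.tag in ("true", "false"):
            # plist booleans are empty elements; store them as text.
            child.text = value_el.tag
        else:
            child.text = value_el.text
    return library


remodeled = remodel(SOURCE)
print(ET.tostring(remodeled, encoding="unicode"))
```

The resulting document can then be loaded in place of the original, so that queries can target a single, descriptively named element.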
