Knowledgebase:
Best practices around data modeling and data loading
30 June 2015 01:39 PM

Summary

MarkLogic Server can ingest and query all sorts of data, such as XML, text, JSON, binary, generic, etc. There are trade-offs to consider when choosing between simply loading data "as-is" and doing some degree of data modeling or data transformation prior to ingestion.

Details

Loading data "as-is" can minimize time and complexity during ingest or document creation. However, that can sometimes mean more complex, slower-performing queries. It may also require index settings that consume more storage space.

In contrast, doing some degree of data transformation prior to ingestion can sometimes result in dramatic improvements in query performance and storage space utilization due to reduced indexing requirements.

An Example

A simple example demonstrates how a data model can affect performance. Consider the data model used by Apple's iTunes:

<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
  <dict>
    <key>Track ID</key><integer>290</integer>
    <key>Name</key><string>01-03 Good News</string>
          …
  </dict>
</dict>
</plist>
 

Note the multiple <key> sibling elements at multiple levels, where the container element at both levels has the same name (in this case, <dict>). Suppose you wanted to query a document like this for "Application Version". In this case, time will be spent performing index resolution for the encompassing element (here, <key>). Unfortunately, because multiple sibling elements all share the same element name, every one of them must be retrieved and then evaluated to see which of them actually match the given query criteria. Consider a slightly revised data model instead:

 

<iTunesLibrary version="1.0">
<application>
  <major-version>10</major-version>
  <minor-version>1</minor-version>
  <app-version>10.1.1</app-version>
  <show-content-ratings>true</show-content-ratings>
  <tracks>
    <track-id>290</track-id>
    <name>01-03 Good News</name>
          …
  </tracks>
</application>
</iTunesLibrary>

Here, we only need to query, and therefore retrieve and evaluate, the single <app-version> element, instead of performing multiple retrievals and evaluations as in the previous example data model.
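To make the difference concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It does not model MarkLogic's index resolution. It only counts how many elements have to be inspected to answer the same question under each data model. The function names lookup_plist and lookup_revised are hypothetical, and the sample documents are trimmed copies of the two shown above with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two sample documents (the elided "…" parts are omitted).
PLIST_STYLE = """
<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
  <dict>
    <key>Track ID</key><integer>290</integer>
    <key>Name</key><string>01-03 Good News</string>
  </dict>
</dict>
</plist>
"""

REVISED_STYLE = """
<iTunesLibrary version="1.0">
<application>
  <major-version>10</major-version>
  <minor-version>1</minor-version>
  <app-version>10.1.1</app-version>
  <show-content-ratings>true</show-content-ratings>
  <tracks>
    <track-id>290</track-id>
    <name>01-03 Good News</name>
  </tracks>
</application>
</iTunesLibrary>
"""


def lookup_plist(root, wanted_key):
    """Return (value, number of <key> elements inspected).

    Every <key> looks the same by name, so each one has to be read
    before we know whether it is the one we want.
    """
    inspected = 0
    value = None
    for parent in root.iter("dict"):
        children = list(parent)
        for position, child in enumerate(children):
            if child.tag != "key":
                continue
            inspected += 1
            is_match = child.text == wanted_key and position + 1 < len(children)
            if is_match and value is None:
                # In this layout the value is the element right after its <key>.
                value = children[position + 1].text
    return value, inspected


def lookup_revised(root, element_name):
    """Return (value, number of candidate elements) for a uniquely named element."""
    matches = list(root.iter(element_name))
    value = matches[0].text if matches else None
    return value, len(matches)


print(lookup_plist(ET.fromstring(PLIST_STYLE), "Application Version"))  # ('10.1.1', 6)
print(lookup_revised(ET.fromstring(REVISED_STYLE), "app-version"))      # ('10.1.1', 1)
```

With these trimmed samples, the first model inspects six <key> elements to find one value, while the revised model has a single candidate. A real iTunes library holds far more keys per document, so the gap would presumably be wider.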

At Scale

Although this is a simple example, when processing millions or even billions of records, eliminating even small processing steps can have a significant impact on performance.
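The kind of transformation prior to ingestion discussed under Details could be sketched as follows, again in Python with xml.etree.ElementTree. This is only an illustration under stated assumptions, not a MarkLogic tool or API: the helper names (element_name, convert_dict, transform), the rule that turns key text into element names, and the fallback name "tracks" for a nested <dict> that has no preceding <key> are all hypothetical. The generated names follow the key text, so they differ slightly from the hand-picked names in the revised model above (for example, application-version rather than app-version).

```python
import re
import xml.etree.ElementTree as ET


def element_name(key_text):
    """Hypothetical naming rule: 'Application Version' -> 'application-version'.

    Assumes the key text starts with a letter, so the result is a valid XML name.
    """
    return re.sub(r"[^a-z0-9]+", "-", key_text.strip().lower()).strip("-")


def convert_dict(dict_elem, target, unnamed="tracks"):
    """Copy the key/value pairs of a plist-style <dict> into named child elements."""
    pending_key = None
    for child in dict_elem:
        if child.tag == "key":
            pending_key = child.text
        elif child.tag == "dict":
            # Assumption: a nested <dict> without its own <key> gets a fallback name.
            name = element_name(pending_key) if pending_key else unnamed
            convert_dict(child, ET.SubElement(target, name), unnamed)
            pending_key = None
        elif pending_key is not None:
            out = ET.SubElement(target, element_name(pending_key))
            # <true/> and <false/> carry their value in the tag name.
            out.text = child.tag if child.tag in ("true", "false") else child.text
            pending_key = None


def transform(plist_root):
    """Build an <iTunesLibrary> document from a plist-style document."""
    library = ET.Element("iTunesLibrary", version=plist_root.get("version", "1.0"))
    application = ET.SubElement(library, "application")
    top = plist_root.find("dict")
    if top is not None:
        convert_dict(top, application)
    return library


SAMPLE = """
<plist version="1.0">
<dict>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
  <dict>
    <key>Track ID</key><integer>290</integer>
  </dict>
</dict>
</plist>
"""

print(ET.tostring(transform(ET.fromstring(SAMPLE)), encoding="unicode"))
```

The cost of this step is paid once per document at load time, whereas the cost of evaluating many same-named siblings is paid on every query.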
