Best practices around data modeling and data loading | MarkLogic Support

Best practices around data modeling and data loading 08 February 2021 01:18 PM
Summary
MarkLogic Server can ingest and query many kinds of data, such as XML, text, JSON, and binary. There are some things to consider when choosing between simply loading data "as-is" and doing some degree of data modeling or data transformation prior to ingestion.

Details
Loading data "as-is" can minimize time and complexity during ingest or document creation. That can, however, sometimes mean more complex, slower-performing queries. It may also require indexing settings that consume more storage space. In contrast, doing some degree of data transformation prior to ingestion can sometimes result in dramatic improvements in query performance and storage space utilization, because fewer indexes are required.

An Example
A simple example demonstrates how a data model can affect performance. Consider the data model used by Apple's iTunes:

`<plist version="1.0">`
`<dict>`
`<key>Major Version</key><integer>10</integer>`
`<key>Minor Version</key><integer>1</integer>`
`<key>Application Version</key><string>10.1.1</string>`
`<key>Show Content Ratings</key><true/>`
`<dict>`
`<key>Track ID</key><integer>290</integer>`
`<key>Name</key><string>01-03 Good News</string>`
`…`
`</dict>`
`</dict>`
`</plist>`

Note the multiple `<key>` sibling elements at multiple levels, where both levels have the same element name (in this case, `<dict>`). Let's say you wanted to query a document like this for "Application Version." In this case, time will be spent performing index resolution for the encompassing element (here, `<key>`). Unfortunately, because multiple sibling elements share the same element name, all of those sibling elements must be retrieved and then evaluated to see which of them actually match the given query criteria.
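The extra work caused by the shared `<key>` name can be illustrated outside the server. The following is a minimal Python sketch using the standard library's `xml.etree.ElementTree`; it is only an in-memory analogy and does not reproduce how MarkLogic resolves indexes. The function name `find_plist_value` is hypothetical, and the sample is the fragment above with the elided portion omitted.

```python
import xml.etree.ElementTree as ET

# The plist-style fragment from the article (elided content left out).
PLIST = """<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
  <dict>
    <key>Track ID</key><integer>290</integer>
    <key>Name</key><string>01-03 Good News</string>
  </dict>
</dict>
</plist>"""


def find_plist_value(root, wanted):
    """Return (value, keys_examined) for the first <key> whose text is `wanted`.

    Every <key> looks the same by name, so each candidate has to be
    inspected by content until the right one is found.
    """
    examined = 0
    for parent in root.iter("dict"):
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag != "key":
                continue
            examined += 1
            # The value is the element that immediately follows its <key>.
            if child.text == wanted and i + 1 < len(children):
                return children[i + 1].text, examined
    return None, examined


root = ET.fromstring(PLIST)
value, examined = find_plist_value(root, "Application Version")
print(value, examined)
```

In this sketch the number of `<key>` elements inspected grows with the position of the wanted key in the document, which is the kind of per-document filtering the article describes.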
Consider a slightly revised data model instead:

`<iTunesLibrary version="1.0">`
`<application>`
`<major-version>10</major-version>`
`<minor-version>1</minor-version>`
`<app-version>10.1.1</app-version>`
`<show-content-ratings>true</show-content-ratings>`
`<tracks>`
`<track-id>290</track-id>`
`<name>01-03 Good News</name>`
`…`
`</tracks>`
`</application>`
`</iTunesLibrary>`

Here, we only need to query, and therefore retrieve and evaluate, the single `<app-version>` element, instead of performing multiple retrievals and evaluations as in the previous data model.

At Scale
Although this is a simple example, when processing millions or even billions of records, eliminating even small processing steps can have a significant performance impact.
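A transformation step before ingestion is one way to arrive at the revised model. The sketch below, again in Python with `xml.etree.ElementTree`, converts plist-style key/value pairs into descriptively named elements. It is an illustration under stated assumptions, not a MarkLogic-provided tool: the names `RENAMES`, `element_name`, and `convert_dict` are hypothetical, the rename table exists only to reproduce the article's `<app-version>` name, and unkeyed nested `<dict>` elements are skipped for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical overrides for keys whose target name is not a plain slug.
RENAMES = {"Application Version": "app-version"}

# Only the flat key/value pairs from the article's fragment.
PLIST = """<plist version="1.0">
<dict>
  <key>Major Version</key><integer>10</integer>
  <key>Minor Version</key><integer>1</integer>
  <key>Application Version</key><string>10.1.1</string>
  <key>Show Content Ratings</key><true/>
</dict>
</plist>"""


def element_name(key_text):
    """Turn a plist key such as 'Major Version' into 'major-version'."""
    if key_text in RENAMES:
        return RENAMES[key_text]
    return re.sub(r"[^a-z0-9]+", "-", key_text.lower()).strip("-")


def convert_dict(plist_dict, target):
    """Append one named child to `target` per <key>/value pair."""
    children = list(plist_dict)
    i = 0
    while i < len(children):
        child = children[i]
        if child.tag == "key" and i + 1 < len(children):
            value = children[i + 1]
            out = ET.SubElement(target, element_name(child.text))
            if value.tag in ("true", "false"):
                out.text = value.tag          # <true/> becomes the text "true"
            elif value.tag == "dict":
                convert_dict(value, out)      # keyed nested dict: recurse
            else:
                out.text = value.text
            i += 2
        else:
            i += 1                            # skip anything that is not a pair
    return target


source = ET.fromstring(PLIST)
application = convert_dict(source.find("dict"), ET.Element("application"))
print(ET.tostring(application, encoding="unicode"))
print(application.findtext("app-version"))
```

After the conversion, the application version is reachable by element name alone, with no need to inspect the content of sibling elements first.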