Best practices around data modeling and data loading
08 February 2021 01:18 PM
Summary

MarkLogic Server can ingest and query all sorts of data, such as XML, text, JSON, binary, generic, etc. There are some things to consider when choosing between simply loading data "as-is" and doing some degree of data modeling or data transformation prior to ingestion.

Details

Loading data "as-is" can minimize time and complexity during ingest or document creation. However, it can sometimes mean more complex, slower-performing queries. It may also mean indexing settings that consume more storage space. In contrast, doing some degree of data transformation prior to ingestion can sometimes result in dramatic improvements in query performance and storage space utilization due to reduced indexing requirements.

An Example

A simple example will demonstrate how a data model can affect performance. Consider the data model used by Apple's iTunes:

<plist version="1.0">
  <dict>
    <key>Major Version</key><integer>10</integer>
    <key>Minor Version</key><integer>1</integer>
    <key>Application Version</key><string>10.1.1</string>
    <key>Show Content Ratings</key><true/>
    <dict>
      <key>Track ID</key><integer>290</integer>
      <key>Name</key><string>01-03 Good News</string>
      …
    </dict>
  </dict>
</plist>

Note the multiple <key> sibling elements at multiple levels, where both levels use the same element name (in this case, <dict>). Let's say you wanted to query a document like this for "Application Version." In this case, time will be spent performing index resolution for the encompassing element (here,
<iTunesLibrary version="1.0">
  <application>
    <major-version>10</major-version>
    <minor-version>1</minor-version>
    <app-version>10.1.1</app-version>
    <show-content-ratings>true</show-content-ratings>
    <tracks>
      <track-id>290</track-id>
      <name>01-03 Good News</name>
      …
    </tracks>
  </application>
</iTunesLibrary>
Here, we only need to query, and therefore retrieve and evaluate, the single <app-version> element, instead of performing multiple retrievals and evaluations as in the previous data model.

At Scale

Although this is a simple example, when processing millions or even billions of records, eliminating small processing steps can have a significant performance impact.
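
To make the difference between the two data models concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not MarkLogic code and does not model MarkLogic's index resolution; it is only an in-memory analogy for the extra matching work the generic key/value layout requires. The function names plist_lookup and modeled_lookup are hypothetical, and the sample documents are trimmed copies of the two examples above with the elided content left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the generic key/value (plist-style) model.
PLIST = """<plist version="1.0">
  <dict>
    <key>Major Version</key><integer>10</integer>
    <key>Minor Version</key><integer>1</integer>
    <key>Application Version</key><string>10.1.1</string>
    <key>Show Content Ratings</key><true/>
    <dict>
      <key>Track ID</key><integer>290</integer>
      <key>Name</key><string>01-03 Good News</string>
    </dict>
  </dict>
</plist>"""

# Trimmed copy of the transformed model with descriptive element names.
MODELED = """<iTunesLibrary version="1.0">
  <application>
    <major-version>10</major-version>
    <minor-version>1</minor-version>
    <app-version>10.1.1</app-version>
    <show-content-ratings>true</show-content-ratings>
    <tracks>
      <track-id>290</track-id>
      <name>01-03 Good News</name>
    </tracks>
  </application>
</iTunesLibrary>"""


def plist_lookup(root, wanted):
    """Hypothetical helper: find the value paired with a <key> label.

    Every <dict> has to be visited, each <key> compared by its text,
    and the value read from the sibling element that follows it.
    """
    for dict_el in root.iter("dict"):
        children = list(dict_el)
        for i, child in enumerate(children[:-1]):
            if child.tag == "key" and child.text == wanted:
                return children[i + 1].text
    return None


def modeled_lookup(root):
    """Hypothetical helper: the element name alone identifies the value."""
    return root.findtext("application/app-version")


if __name__ == "__main__":
    print(plist_lookup(ET.fromstring(PLIST), "Application Version"))
    print(modeled_lookup(ET.fromstring(MODELED)))
```

Both lookups return the same version string, but the first has to test text content and sibling position across identically named elements, while the second addresses one uniquely named element directly.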