Best Practices for Using MarkLogic Semantics at Scale
Introduction
If you're looking to use any of the interfaces built on top of MarkLogic's semantics engine (Optic API, SQL, or SPARQL), you'll want to make sure you're following the best practices itemized in this knowledgebase article. It's not unusual to see performance improvements of one or even two orders of magnitude as a result. Note that this article is a distillation of the MarkLogic World presentation "Getting the Most from MarkLogic Semantics", available in both PDF and YouTube formats.
Best Practices for Using Semantics at Scale
1) Scope your query - more constrained queries will do less work, and will therefore take less time
- Trim result sets early
- Partition
- Query partitions or subsets of your data, instead of your entire database
- Define partitions with Collections
- Make use of your partitions with collection queries
- Use cts:query to partition even further
- Keep like triples together in the same document
- Use MarkLogic indexes to scope a query
- Use a collection query (or SPARQL FROM clause) to partition the RDF space
- Put ontologies and other lookup/mapping triples into their own graphs/collections
- Consider pushing down some SPARQL FILTERs to the document level, for example as a cts:query restriction (see the sketch after this list)
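The sketch below shows both forms of scoping in MarkLogic Server-Side JavaScript: restricting a SPARQL query to a graph/collection with FROM, and restricting it to a document partition with a cts.query passed through sem.store(). The graph IRI, collection name, and the region JSON property are hypothetical examples; the second option assumes your triples are embedded in documents that also carry a region property.

```javascript
'use strict';
// Sketch: scoping a SPARQL query to a partition of the database.
// Collection names, IRIs, and property names below are hypothetical.

// Option 1: query only the triples in one graph/collection via SPARQL FROM.
const scopedByGraph = sem.sparql(`
  SELECT ?customer ?order
  FROM <http://example.org/graphs/sales-2021>
  WHERE { ?order <http://example.org/boughtBy> ?customer . }
`);

// Option 2: constrain the triples searched with a cts.query over the
// containing documents (a collection query plus a "pushed-down" filter on a
// JSON property), instead of filtering the whole triple store in SPARQL.
const partition = sem.store([], cts.andQuery([
  cts.collectionQuery('sales-2021'),
  cts.jsonPropertyValueQuery('region', 'EMEA')
]));

const scopedByDocuments = sem.sparql(`
  SELECT ?customer ?order
  WHERE { ?order <http://example.org/boughtBy> ?customer . }
`, null, [], partition);
```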
2) Pay attention to your data model
- Documents for Entities; Triples for Facts or Relationships. Benefits:
- Keep entity information together
- Fewer joins on both query and retrieval
- Eliminate joins by materializing often-queried elements into documents (aka “denormalizing triples”); see the sketch after this list
- Spread a large number of columns across multiple TDE (Template Driven Extraction) templates, instead of a single TDE containing all of those columns
- For more detail, see the MLU (MarkLogic University) training materials on data modeling
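As an illustration of the "documents for entities, triples for facts" pattern and of denormalization, here is a minimal Server-Side JavaScript sketch. The document shape, URIs, property names, and collection names are hypothetical, and the embedded-triple serialization shown should be checked against the documentation for your MarkLogic version.

```javascript
'use strict';
// Sketch: an entity document that keeps the entity's data together, carries
// a relationship fact as an embedded triple, and also materializes
// ("denormalizes") an often-queried value onto the entity itself so that
// queries needing only that value avoid a join.

const personDoc = {
  person: {
    id: 'person-1001',
    name: 'Jane Doe',
    // Materialized copy of the employer fact, so document searches and
    // Optic/SQL views over this property need no triple join:
    employerName: 'Acme Corp',
    // The same fact as an embedded triple, available to SPARQL:
    triples: [
      { triple: {
          subject: 'http://example.org/person/1001',
          predicate: 'http://example.org/worksFor',
          object: { value: 'Acme Corp', datatype: 'xs:string' }
      } }
    ]
  }
};

declareUpdate();
xdmp.documentInsert('/person/1001.json', xdmp.toJSON(personDoc),
  { collections: ['person', 'http://example.org/graphs/hr'] });
```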
3) Result set size-specific tips
- For small result sets: from SPARQL, get the docs with a search
- For large result sets
- Get the docs in a single read, with no joins
- Large result sets may incur connection-churning overhead, so paginate to ensure connection reuse (see the sketch after this list)
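The following Server-Side JavaScript sketch illustrates both tips: fetching the backing documents with a single search after a small SPARQL result, and paging through a large result with LIMIT/OFFSET. The predicates, literal values, and page size are hypothetical, and the wildcard behaviour of cts.tripleRangeQuery with empty predicate/object positions should be confirmed against your version's documentation.

```javascript
'use strict';
// Sketch: handling small vs. large result sets. IRIs, predicates, and page
// sizes are hypothetical examples.

// Small result set: let SPARQL find the matching subjects, then fetch the
// backing documents with one search rather than with further SPARQL joins.
const rows = sem.sparql(`
  SELECT ?person
  WHERE { ?person <http://example.org/worksFor> "Acme Corp" . }
`);
const subjects = [];
for (const row of rows) {
  subjects.push(row.person);   // each row is keyed by variable name
}
// Documents containing any triple about those subjects, in a single read.
const docs = cts.search(cts.tripleRangeQuery(subjects, null, null));

// Large result set: page through the results so each request stays small
// and client connections can be reused, instead of one very large read.
const pageSize = 1000;
const page = 3;                // 1-based page number
const pagedRows = sem.sparql(`
  SELECT ?person ?order
  WHERE { ?order <http://example.org/boughtBy> ?person . }
  ORDER BY ?order
  LIMIT ${pageSize}
  OFFSET ${(page - 1) * pageSize}
`);
```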
4) Hardware tips
- Add more memory - allows the optimizer to choose faster plans
- Add more hardware - allows for increased parallelization
5) Avoid unnecessary work
- Re-use queries with bind variables; the query plan is cached for 5 minutes (see the sketch after this list)
- Dedup processing
- De-duplication has no effect on results if you have no duplicate triples or if you use DISTINCT
- Skipping dedup processing can result in substantial performance improvements
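Both tips can be combined in one call to sem.sparql(), as in the sketch below. The query, IRIs, and bindings are hypothetical examples, and the "dedup=off" option should be verified as available in your MarkLogic version before relying on it.

```javascript
'use strict';
// Sketch: re-using one parameterized query (so the cached query plan is
// reused) and skipping de-duplication when it cannot change the results.

// One query string, many executions: only the bindings change, so the plan
// compiled for the first call can be served from the plan cache.
const byEmployer = `
  SELECT DISTINCT ?person
  WHERE { ?person <http://example.org/worksFor> ?employer . }
`;
const acmeStaff   = sem.sparql(byEmployer, { employer: 'Acme Corp' });
const globexStaff = sem.sparql(byEmployer, { employer: 'Globex Ltd' });

// DISTINCT already removes duplicates from the result, so de-duplication of
// triples is redundant here and can be skipped for a faster run.
const fastStaff = sem.sparql(byEmployer, { employer: 'Acme Corp' }, ['dedup=off']);
```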