Best Practices for Using MarkLogic Semantics at Scale
Introduction
If you're looking to use any of the interfaces built on top of MarkLogic's semantics engine (Optic API, SQL, or SPARQL), you'll want to make sure you're following the best practices itemized in this knowledgebase article. It's not unusual to see performance improvements of one or even two orders of magnitude as a result. Note that this article is a distillation of the MarkLogic World presentation "Getting the Most from MarkLogic Semantics", available in both PDF and YouTube formats.
Best Practices for Using Semantics at Scale
1) Scope your query - more constrained queries will do less work, and will therefore take less time
- Trim result sets early
- Partition
- Query partitions or subsets of your data, instead of your entire database
- Define partitions with Collections
- Make use of your partitions with collection queries
- Use cts:query to partition even further
- Keep like triples together in the same document
- Use MarkLogic indexes to scope a query
- Use a collection query (or SPARQL FROM clause) to partition the RDF space
- Put ontologies and other lookup/mapping triples into their own graphs/collections
- Consider pushing down some SPARQL FILTERs to the document level, for example as a cts:query restriction (see the sketch after this list)
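The sketch below shows both forms of scoping in MarkLogic Server-Side JavaScript: restricting a SPARQL query to a graph/collection with FROM, and restricting it to a document partition with a cts.query passed through sem.store(). The graph IRI, collection name, and the region JSON property are hypothetical examples; the second option assumes your triples are embedded in documents that also carry a region property.

```javascript
'use strict';
// Sketch: scoping a SPARQL query to a partition of the database.
// Collection names, IRIs, and property names below are hypothetical.

// Option 1: query only the triples in one graph/collection via SPARQL FROM.
const scopedByGraph = sem.sparql(`
  SELECT ?customer ?order
  FROM <http://example.org/graphs/sales-2021>
  WHERE { ?order <http://example.org/boughtBy> ?customer . }
`);

// Option 2: constrain the triples searched with a cts.query over the
// containing documents (a collection query plus a "pushed-down" filter on a
// JSON property), instead of filtering the whole triple store in SPARQL.
const partition = sem.store([], cts.andQuery([
  cts.collectionQuery('sales-2021'),
  cts.jsonPropertyValueQuery('region', 'EMEA')
]));

const scopedByDocuments = sem.sparql(`
  SELECT ?customer ?order
  WHERE { ?order <http://example.org/boughtBy> ?customer . }
`, null, [], partition);
```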
2) Pay attention to your data model
- Documents for Entities; Triples for Facts or Relationships. Benefits:
- Keep entity information together
- Fewer joins on both query and retrieval
- Eliminate joins by materializing often-queried elements into documents (aka “denormalizing triples”); see the sketch after this list
- Spread a large number of columns across multiple TDE (Template Driven Extraction) templates, instead of a single TDE containing all of those columns
- For more detail, see the MLU (MarkLogic University) training materials on data modeling
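As an illustration of the "documents for entities, triples for facts" pattern and of denormalization, here is a minimal Server-Side JavaScript sketch. The document shape, URIs, property names, and collection names are hypothetical, and the embedded-triple serialization shown should be checked against the documentation for your MarkLogic version.

```javascript
'use strict';
// Sketch: an entity document that keeps the entity's data together, carries
// a relationship fact as an embedded triple, and also materializes
// ("denormalizes") an often-queried value onto the entity itself so that
// queries needing only that value avoid a join.

const personDoc = {
  person: {
    id: 'person-1001',
    name: 'Jane Doe',
    // Materialized copy of the employer fact, so document searches and
    // Optic/SQL views over this property need no triple join:
    employerName: 'Acme Corp',
    // The same fact as an embedded triple, available to SPARQL:
    triples: [
      { triple: {
          subject: 'http://example.org/person/1001',
          predicate: 'http://example.org/worksFor',
          object: { value: 'Acme Corp', datatype: 'xs:string' }
      } }
    ]
  }
};

declareUpdate();
xdmp.documentInsert('/person/1001.json', xdmp.toJSON(personDoc),
  { collections: ['person', 'http://example.org/graphs/hr'] });
```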
3) Result set size-specific tips
- For small result sets: from SPARQL, get the docs with a search
- For large result sets
- Get the docs in a single read, with no joins
- Large result sets may incur connection-churning overhead, so paginate to ensure connection reuse (see the sketch after this list)
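The following Server-Side JavaScript sketch illustrates both tips: fetching the backing documents with a single search after a small SPARQL result, and paging through a large result with LIMIT/OFFSET. The predicates, literal values, and page size are hypothetical, and the wildcard behaviour of cts.tripleRangeQuery with empty predicate/object positions should be confirmed against your version's documentation.

```javascript
'use strict';
// Sketch: handling small vs. large result sets. IRIs, predicates, and page
// sizes are hypothetical examples.

// Small result set: let SPARQL find the matching subjects, then fetch the
// backing documents with one search rather than with further SPARQL joins.
const rows = sem.sparql(`
  SELECT ?person
  WHERE { ?person <http://example.org/worksFor> "Acme Corp" . }
`);
const subjects = [];
for (const row of rows) {
  subjects.push(row.person);   // each row is keyed by variable name
}
// Documents containing any triple about those subjects, in a single read.
const docs = cts.search(cts.tripleRangeQuery(subjects, null, null));

// Large result set: page through the results so each request stays small
// and client connections can be reused, instead of one very large read.
const pageSize = 1000;
const page = 3;                // 1-based page number
const pagedRows = sem.sparql(`
  SELECT ?person ?order
  WHERE { ?order <http://example.org/boughtBy> ?person . }
  ORDER BY ?order
  LIMIT ${pageSize}
  OFFSET ${(page - 1) * pageSize}
`);
```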
4) Hardware tips
- Add more memory - allows the optimizer to choose faster plans
- Add more hardware - allows for increased parallelization
5) Avoid unnecessary work
- Re-use queries with bind variables; the query plan is cached for 5 minutes (see the sketch after this list)
- Dedup processing
- De-duplication has no effect on results if you have no duplicate triples or if you use DISTINCT
- Skipping dedup processing can result in substantial performance improvements
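Both tips can be combined in one call to sem.sparql(), as in the sketch below. The query, IRIs, and bindings are hypothetical examples, and the "dedup=off" option should be verified as available in your MarkLogic version before relying on it.

```javascript
'use strict';
// Sketch: re-using one parameterized query (so the cached query plan is
// reused) and skipping de-duplication when it cannot change the results.

// One query string, many executions: only the bindings change, so the plan
// compiled for the first call can be served from the plan cache.
const byEmployer = `
  SELECT DISTINCT ?person
  WHERE { ?person <http://example.org/worksFor> ?employer . }
`;
const acmeStaff   = sem.sparql(byEmployer, { employer: 'Acme Corp' });
const globexStaff = sem.sparql(byEmployer, { employer: 'Globex Ltd' });

// DISTINCT already removes duplicates from the result, so de-duplication of
// triples is redundant here and can be skipped for a faster run.
const fastStaff = sem.sparql(byEmployer, { employer: 'Acme Corp' }, ['dedup=off']);
```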