Knowledgebase:
Best Practices for Using MarkLogic Semantics at Scale
11 June 2020 03:44 AM

Introduction

If you plan to use any of the interfaces built on top of MarkLogic's semantics engine (Optic API, SQL, or SPARQL), follow the best practices itemized in this knowledgebase article. Performance improvements of one or even two orders of magnitude are not unusual as a result. This article is a distillation of the MarkLogic World presentation "Getting the Most from MarkLogic Semantics", which is available in both PDF and YouTube formats.

Best Practices for Using Semantics at Scale

1) Scope your query - more constrained queries will do less work, and will therefore take less time

  • Trim resultsets early
  • Partition
    • Query partitions or subsets of your data, instead of your entire database
    • Define partitions with Collections
    • Make use of your partitions with collection queries
    • Use cts:query to partition even further
  • Keep like-triples in the same document
  • Use MarkLogic indexes to scope a query
    • Collection query (or SPARQL FROM) to partition the RDF space
    • Put ontologies and other lookup/mapping triples into their own graphs/collections
    • Consider pushing some SPARQL FILTERs down to the document
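
As a rough illustration of partitioning the RDF space with SPARQL FROM, the Python sketch below assembles a SELECT query restricted to one or more graphs. The helper name `scoped_select`, the graph IRI, and the FOAF predicate are illustrative assumptions, not taken from the presentation. The code only builds query text and does not contact a server.

```python
from typing import Optional, Sequence


def scoped_select(variables: Sequence[str],
                  graph_iris: Sequence[str],
                  where_body: str,
                  limit: Optional[int] = None) -> str:
    """Build a SPARQL SELECT that is restricted to the given graphs.

    Hypothetical helper: each graph IRI becomes a FROM clause, so the
    query is evaluated against that partition rather than every triple.
    """
    if not graph_iris:
        # Refuse to build an unscoped query by accident.
        raise ValueError("scope the query: pass at least one graph IRI")

    lines = ["SELECT " + " ".join("?" + v for v in variables)]
    lines += [f"FROM <{iri}>" for iri in graph_iris]
    lines.append("WHERE {")
    lines.append("  " + where_body.strip())
    lines.append("}")
    if limit is not None:
        # Trimming the resultset early keeps later stages cheap.
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


query = scoped_select(
    ["person", "name"],
    ["http://example.org/graphs/people"],  # illustrative graph IRI
    "?person <http://xmlns.com/foaf/0.1/name> ?name .",
    limit=10,
)
print(query)
```

Keeping ontologies and lookup triples in their own graphs means they can be added to the list of graph IRIs only when a query actually needs them.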

2) Pay attention to your data model

3) Resultset size specific tips

  • For small resultsets – from SPARQL, get the docs with a search
  • For large resultsets
    • Get docs in a single read, no joins
    • Large resultsets may incur connection-churning overhead – paginate them to ensure connection reuse
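
One way to paginate a large resultset is sketched below in Python. The callable `fetch_page` is hypothetical: it stands in for whatever client call runs the query for a given window of results. The 1-based `start` index is an assumption that should be checked against the client or endpoint you actually use.

```python
from typing import Callable, Iterator, Sequence


def paginate(fetch_page: Callable[[int, int], Sequence[dict]],
             page_length: int = 1000) -> Iterator[dict]:
    """Yield rows page by page instead of pulling one huge response.

    fetch_page(start, page_length) is a hypothetical callable returning
    at most page_length rows beginning at the 1-based index start.
    """
    start = 1
    while True:
        page = fetch_page(start, page_length)
        for row in page:
            yield row
        if len(page) < page_length:
            # A short (or empty) page means there is nothing left to read.
            return
        start += page_length


# Usage with an in-memory stand-in for the server.
rows = [{"n": i} for i in range(25)]


def fake_fetch(start: int, page_length: int) -> Sequence[dict]:
    return rows[start - 1:start - 1 + page_length]


print(sum(1 for _ in paginate(fake_fetch, page_length=10)))
```

In a real client, `fetch_page` would typically issue every request over one persistent HTTP session so that the connection is reused between pages.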

4) Hardware tips

  • Add more memory - allows the optimizer to choose faster plans
  • Add more hardware - allows for increased parallelization

5) Avoid unnecessary work

  • Re-use queries with bind variables - the query plan is cached for 5 minutes
  • Dedup processing
    • De-duplication has no effect on results if you have no duplicate triples and/or you use DISTINCT
    • Skipping dedup processing can result in substantial performance improvements
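
The two tips above can be sketched together in Python. The idea is that the query text stays byte-for-byte identical between calls, so a cached plan can be reused, while the values travel separately as bindings. The `bind:` parameter prefix and the `dedup` parameter with an `off` value are assumptions modelled on MarkLogic's SPARQL interfaces, so verify both against the documentation for your version. The predicates and IRIs are illustrative only.

```python
from typing import Dict, Tuple

# Constant query text: only the bound value of ?employer changes per call.
QUERY_TEXT = """SELECT ?name
WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
  ?person <http://example.org/worksFor> ?employer .
}"""


def sparql_request(bindings: Dict[str, str],
                   dedup: bool = True) -> Tuple[str, Dict[str, str]]:
    """Return (query text, request parameters) for one execution.

    Hypothetical helper: parameter names are assumptions, not a
    confirmed API. Values are passed as bindings rather than being
    concatenated into the query string.
    """
    params = {f"bind:{name}": value for name, value in bindings.items()}
    if not dedup:
        # Only safe when there are no duplicate triples, or the query
        # already uses DISTINCT.
        params["dedup"] = "off"
    return QUERY_TEXT, params


q1, p1 = sparql_request({"employer": "http://example.org/acme"})
q2, p2 = sparql_request({"employer": "http://example.org/globex"}, dedup=False)
print(q1 == q2)
```

Building the query by string concatenation instead would produce a different query text for every value, which would defeat any cache keyed on that text.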