Knowledgebase:
Best Practices for exporting and importing data in bulk
24 September 2019 08:08 PM

BEST PRACTICES FOR EXPORTING AND IMPORTING DATA IN BULK

Handling large amounts of data can be expensive in terms of both computing resources and runtime. It can also sometimes result in application errors or partial execution. In general, if you’re dealing with large amounts of data as either output or input, the most scalable and robust approach is to break-up that workload into a series of smaller and more manageable batches.

Of course there are other available tactics. It should be noted, however, that most of those other tactics will have serious disadvantages compared to batching. For example:

  • Configuring time limit settings through Admin UI to allow for longer request timeouts - since you can only increase timeouts so much, this is best considered a short term tactic for only slightly larger workloads.
  • Eliminating resource bottlenecks by adding more resources – often easier to implement compared to modifying application code, though with the downside of additional hardware and software license expense. Like increased timeouts, there can be a point of diminishing returns when throwing hardware at a problem.
  • Tuning queries to improve your query efficiency – this is actually a very good tactic to pursue, in general. However, if workloads are sufficiently large, even the most efficient implementation of your request will eventually need to work over subset batches of your inputs or outputs.

For more detail on the above non-batching options, please refer to XDMP-CANCELED vs. XDMP-EXTIME.

WAYS TO EXPORT LARGE AMOUNTS OF DATA FROM MARKLOGIC SERVER

1.    If you can’t break-up the data into a series of smaller batches - use xdmp:save to write out the full results from query console to the desired folder, specified by the path on your file system. For details, see xdmp:save.

2.    If you can break-up the data into a series of smaller batches:

            a.    Use batch tools like MLCP, which can export bulk output from MarkLogic server to flat files, a compressed ZIP file, or an MLCP database archive. For details, see Exporting Content from MarkLogic Server.

            b.    Reduce the size of the desired result set until it saves successfully, then save the full output in a series of batches.

            c.    Page through result set:

                               i.     If dealing with documents, cts:uris is excellent for paging through a list of URIs. Take a look at cts:uris for more details.

                               ii.     If using Semantics

                                             1.    Consider exporting the triples from the database using the Semantics REST endpoints.

                                             2.    Take a look at the URL parameters start? and pageLength? – these parameters can be configured in your SPARQL query to return the results in batches.  See GET /v1/graphs/sparql for further details.

WAYS TO IMPORT LARGE AMOUNTS OF DATA INTO MARKLOGIC SERVER

1.    If you’re looking to update more than a few thousand fragments at a time, you'll definitely want to use some sort of batching.

             a.     For example, you could run a script in batches of say, 2000 fragments, by doing something like [1 to 2000], and filtering out fragments that already have your newly added element. You could also look into using batch tools like MLCP

             b.    Alternatively, you could split your input into smaller batches, then spawn each of those batches to jobs on the Task Server, which has a configurable queue. See:

                            i.     xdmp:spawn

                            ii.    xdmp:spawn-function

2.    Alternatively, you could use an external/community developed tool like CoRB to batch process your content. See Using Corb to Batch Process Your Content - A Getting Started Guide

3.    If using Semantics and querying triples with SPARQL:

              a.    You can make use of the LIMIT keyword to further restrict the result set size of your SPARQL query. See The LIMIT Keyword

              b.    You can also use the OFFSET keyword for pagination. This keyword can be used with the LIMIT and ORDER BY keywords to retrieve different slices of data from a dataset. For example, you can create pages of results with different offsets. See  The OFFSET Keyword

(3 vote(s))
Helpful
Not helpful

Comments (0)