Community

MarkLogic 10 and Data Hub 5.0

Latest MarkLogic releases provide a smarter, simpler, and more secure way to integrate data.

Read Blog →

Company

Stay On Top Of Everything MarkLogic

Be the first to know! News, product information, and events delivered straight to your inbox.

Sign Me Up →

 
Knowledgebase:
Best Practices for improving the performance of large collection deletes
14 May 2020 04:23 PM

Introduction

This article outlines various factors influencing the performance of xdmp:collection-delete function and furthermore provides general best practices for improving the performance of large collection deletes.

What are collections?

Collections in MarkLogic Server are used to organize documents in a database. Collections are a powerful and high-performance mechanism to define and manage subsets of documents.

How are collections different from directories?

Although both collections and directories can be used for organizing documents in a database, there are some key differences. For example:

  • Directories are hierarchical, where as collections are not. Consequently, collections do not require member documents to conform to any URI patterns. Additionally, any document can belong to any collection, and any document can also belong to multiple collections
  • You can delete all documents in a collection with the xdmp:collection-delete function. Similarly, you can delete all documents in a directory (as well as all recursive subdirectories and any documents in those directories) with a different function call - xdmp:directory-delete
  • You can set properties on a directory. You cannot set properties on a collection

For further details, see Collections versus Directories.

What is the use of the xdmp:collection-delete function?

xdmp:collection-delete is used to delete all documents in a database that belong to a given collection - regardless of their membership in other collections.

  • Use of this function always results in the specified unprotected collection disappearing. For details, see Implicitly Defining Unprotected Collections
  • Removing a document from a collection and using xdmp:collection-delete are similarly contingent on users having appropriate permissions to update the document(s) in question. For details, see Collections and Security
  • If there are no documents in the specified collection, then nothing is deleted, and the function still returns the empty sequence

What factors affect performance of xdmp:collection-delete?

The speed of xdmp:collection-delete depends on several factors:

Is there a fast operation mode available within the call xdmp:collection-delete?

Yes. The call xdmp:collection-delete("collection-uri") can potentially be fast in that it won't retrieve fragments. Be aware, however, that xdmp:collection-delete will retrieve fragments (and therefore perform much more slowly) when your database is configured with any of the following:

What are the general best practices in order to improve the performance of large collection deletes?

  • Batch your deletes
    • You could use an external/community developed tool like CoRB to batch process your content
    • Tools like CoRB allow you to create a "query module" (this could be a call to cts:uris to identify documents from a number of collections) and a "transform module" that works on each URI returned. CoRB will run the URI query and will use the results to feed a thread pool of worker threads. This can be very useful when dealing with large bulk processing. See: Using Corb to Batch Process Your Content - A Getting Started Guide
  • Alternatively, you could split your input (for example, URIs of documents inside a collection that you want to delete) into smaller batches
    • Spawn each of those batches to jobs on the Task Server instead of trying to delete an entire collection in a single transaction
    • Use xdmp:spawn-function to kick off deletions of one document at a time - be careful not to overflow the task server queue, however
      • Don't spawn single document deletes
      • Instead, make batches of size that work most efficiently in your specific use case
    • One of the restrictions on the Task Server is that there is a set queue size - you should be able increase the queue size as necessary
  • Scope deletes more narrowly with the use of cts:collection-query

Related knowledgebase articles:

 

(0 vote(s))
Helpful
Not helpful

Comments (0)