Knowledgebase: Administration
Using CoRB to batch process your content: a getting started guide
20 May 2020 01:24 AM

Introduction

Several customers have contacted support with questions about using a tool such as CoRB to "post-process" large amounts of data already stored in MarkLogic.

Here's a brief CoRB tutorial based on some of the questions the support team has been asked, made available as a KnowledgeBase article in the hope that it will be useful to other customers.

What is CoRB and when would I need to use it?

Start by looking at the CoRB README here.

CoRB is an open source Java application (available here on GitHub). It's a popular tool for anyone who wants to select a group of candidate documents matching specific criteria (a specific forest, a date range, a collection, any cts:query), pass each one to a second module that performs some transformation, and update the documents in place in their forests on disk.
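
For example, the candidate set can be restricted with any cts:query. Here's a minimal sketch of such a selection (the "invoices" collection is a hypothetical example, not part of the walkthrough below):

xquery version "1.0-ml";

(: list the URIs of all documents in a hypothetical "invoices" collection :)
cts:uris('', 'document', cts:collection-query('invoices'))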

As MarkLogic is a great repository for unstructured and semi-structured data, being able to revisit documents and perform bulk updates on them can be very useful.

This article offers a simple getting started guide so you can test CoRB out in a development environment and get an idea as to how it could help you to manage your data.

Prerequisites

As CoRB is a Java application, you'll need to ensure you have a JRE installed.

For your convenience, we have provided a zip file containing all the files you need to get up and running, but you may want to replace the bundled xcc.jar with the version that matches your server; look in our Maven repository for the one matching your server version.

Using CoRB: a step-by-step walkthrough

1. Create an XDBC Server with the following values:

  • root: /
  • port: 9999
  • modules: Modules
  • database: Documents
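
If you'd rather script this step than use the Admin UI, the Admin API can create the same XDBC server. This is a sketch only: it assumes the "Default" group, and the server name "corb-xdbc" is a hypothetical choice; adapt both to your environment.

xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: create an XDBC server matching the values above, then save the config :)
let $config := admin:get-configuration()
let $config := admin:xdbc-server-create(
    $config,
    admin:group-get-id($config, "Default"),
    "corb-xdbc",
    "/",
    9999,
    xdmp:database("Modules"),
    xdmp:database("Documents"))
return admin:save-configuration($config)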

2. Create some sample data in an empty database

I'm using the 'Documents' database for this example. Ensure the database has the URI lexicon enabled, make sure it's selected as the content source in Query Console, and run:

xquery version "1.0-ml";

(: insert 2000 small sample documents named 1.xml through 2000.xml :)
for $i in 1 to 2000
return
    xdmp:document-insert(
        fn:concat($i, ".xml"),
        element doc {
            element id { $i },
            element created { fn:current-dateTime() }
        })

3. CoRB requires two modules to function:

  • a module to select the candidate URIs to process
  • a module to process each doc with a matching candidate URI

4. A simple "select" module (get-uris.xqy):

xquery version "1.0-ml";

(: CoRB expects a URI module to return the total count first, followed
   by the candidate URIs themselves; cts:uris requires the URI lexicon :)
let $uris := cts:uris('', 'document')
return (fn:count($uris), $uris)

5. A simple processor module (transform-docs.xqy) that adds a timestamped <updated> element to each document:

xquery version "1.0-ml";

(: CoRB evaluates this module once per candidate document, binding each
   URI to the external $URI variable :)
declare variable $URI as xs:string external;

xdmp:node-insert-child(
    fn:doc($URI)/doc,
    element updated { fn:current-dateTime() })
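
Note that re-running the job against the same content will add a second <updated> element to each document. If re-runs need to be safe, a sketch of an idempotent variant (using the same external $URI contract) could look like this:

xquery version "1.0-ml";

declare variable $URI as xs:string external;

(: replace an existing <updated> element rather than adding a duplicate :)
let $doc := fn:doc($URI)/doc
let $stamp := element updated { fn:current-dateTime() }
return
    if ($doc/updated)
    then xdmp:node-replace(($doc/updated)[1], $stamp)
    else xdmp:node-insert-child($doc, $stamp)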

6. Download and unpack corb.zip (attached)

7. From a command prompt, cd to the folder where corb.zip was unpacked and run corb.bat (Windows users) or ./corb.sh (Linux/Solaris/OS X users)

8. You should see logging to stdout.

On completion, you should see a line like this:

INFO: completed all tasks 2000/2000, 159 tps, 0 active threads

9. Examine a document in the database to ensure you see the <updated> element:

<doc>
  <id>1038</id>
  <created>2012-06-28T12:16:10.739+01:00</created>
  <updated>2012-06-28T12:16:23.812+01:00</updated>
</doc>

CoRB Questions and Answers

Q: Can we call and run CoRB from an XQuery module? As it's command-line based, could we write some XQuery that executes the CoRB batch file via the command line? Is there some other way to invoke it?

A: Unfortunately there's no way to execute CoRB from an XQuery module. CoRB is a Java application and - at the time of writing - there's no mechanism in the server to create separate Java (or command line) processes.

If you need to run CoRB at intervals (hourly, daily, weekly, etc.), you could use the Windows Task Scheduler or, on Unix variants, cron. You could also adapt the URI query to select (for example) only documents altered within the last X hours and run a CoRB job against just those documents.
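
A URI module along these lines would select only recently updated documents. This is a sketch only; it assumes a dateTime element range index has been configured on the updated element:

xquery version "1.0-ml";

(: select documents whose <updated> element falls within the last 24 hours;
   requires a dateTime element range index on "updated" :)
let $since := fn:current-dateTime() - xs:dayTimeDuration("PT24H")
let $uris := cts:uris('', 'document',
    cts:element-range-query(xs:QName("updated"), ">=", $since))
return (fn:count($uris), $uris)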

However, you may be able to achieve the effect you need much more effectively with a combination of triggers and tasks spawned on MarkLogic's Task Server.
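
As a minimal illustration of the spawning side of that approach (assuming the transform module from step 5 is installed in the modules database at /transform-docs.xqy), each matching URI could be queued on the Task Server like this:

xquery version "1.0-ml";

(: queue one Task Server task per document; each task runs the same
   transform module used in the walkthrough above :)
for $uri in cts:uris('', 'document')
return xdmp:spawn("/transform-docs.xqy", (xs:QName("URI"), $uri))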

While a discussion of scheduled tasks is outside the scope of this article, you can read about the MarkLogic Task Server here; for some open code offering an example of how a task such as rebalancing data evenly across forests can be achieved, you can look here.

Q: Can we make CoRB run across a cluster? If so: how would we go about configuring this?

A: The CoRB README has a section called "Writing a Custom URI Module" which shows the general structure of the URI query, built around a call to the cts:uris function:

http://docs.marklogic.com/5.0doc/docapp.xqy#search.xqy?start=1&cat=all&query=cts:uris

The fifth argument to cts:uris is a list of forest IDs, so one way - possibly the simplest - would be to provide a modified URI module that only returns fragment URIs from forests on that particular (local) host.

Attached is an example custom URI module (local-cts-uris.xqy) that calls xdmp:host() to get the host ID for the current connection and then only returns URIs from forests on that host. As with all code provided in these articles, the usual disclaimer applies: test thoroughly on a development cluster or a test database to make sure it does what you need before running it against production.

You'll also need to provide a database name (currently it's set to xdmp:database("YOUR_DATABASE_HERE")) for safety.
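
If you'd rather write your own than use the attachment directly, a sketch of the same idea looks like this (replace YOUR_DATABASE_HERE with your content database name):

xquery version "1.0-ml";

(: return only URIs stored in forests that live on the local host :)
let $host := xdmp:host()
let $forests :=
    xdmp:database-forests(xdmp:database("YOUR_DATABASE_HERE"))
        [xdmp:forest-host(.) eq $host]
let $uris := cts:uris('', 'document', (), (), $forests)
return (fn:count($uris), $uris)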

The provided module can be used as a basis for your own URI module; you could run CoRB on every host in your cluster, making sure each CoRB instance was configured to connect only to its local host (xcc://user:pass@localhost:port). That way each instance of CoRB would only be responsible for the documents stored in forests local to that host.

Another way would be to hand-write specific URI modules so that each host only processes URIs for a given group of forest IDs. For example, you might want just two instances of CoRB running, each responsible for a specific set of forests (which could equally span multiple hosts).

Another way would be to write some XQuery to generate these separate URI modules for each host in the cluster: loop through all the hosts (xdmp:hosts()) and use xdmp:save to write out a URI module containing a generated cts:uris query with the corresponding forest IDs (a sketch follows). This is worth doing if you want to introduce complex URI queries (rather than the simple "catch all" process discussed earlier) while always targeting specific forest IDs.
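
A sketch of that generation approach; the output path under /tmp and the database name are placeholder assumptions:

xquery version "1.0-ml";

(: for each host, write out a URI module hard-coded to that host's forests :)
for $host in xdmp:hosts()
let $forests :=
    xdmp:database-forests(xdmp:database("YOUR_DATABASE_HERE"))
        [xdmp:forest-host(.) eq $host]
let $forest-calls :=
    fn:string-join(
        for $f in $forests
        return fn:concat('xdmp:forest("', xdmp:forest-name($f), '")'),
        ", ")
let $module :=
    fn:concat(
        'xquery version "1.0-ml";&#10;&#10;',
        'let $uris := cts:uris("", "document", (), (), (', $forest-calls, '))&#10;',
        'return (fn:count($uris), $uris)&#10;')
return
    xdmp:save(
        fn:concat("/tmp/get-uris-", xdmp:host-name($host), ".xqy"),
        text { $module })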



Attachments 
 
 corb.zip (281.52 KB)
 local-cts-uris.xqy (0.40 KB)