Understanding XDMP-DBDUPURI exceptions, how they can occur and how to prevent them
25 April 2019 11:14 AM

Summary

An XDMP-DBDUPURI error occurs when the same URI exists in multiple forests of the same database. This article explains how this condition can arise and describes a number of strategies to prevent and fix it.

Under normal operating conditions, duplicate URIs cannot occur, but there are ways that programmers and administrators can bypass the server's safeguards. Because duplicate URIs are considered a form of corruption, any query that encounters one will fail and report an error similar to the following:

XDMP-DBDUPURI: URI /foo.xml found in forests Library06 and Library07

We will begin by exploring the different ways that duplicate URIs can be created. Once we understand how this situation can occur, we will discuss how to prevent it from happening in the first place, as well as how to resolve the XDMP-DBDUPURI error when it does occur.

How Administrators Can Cause Duplicate URIs

There are several administrative actions that can result in duplicate URIs:

1. By detaching a forest from its parent database (for administrative purposes - e.g., backup, restore) while allowing updates to continue on the database. If an update is committed for a URI that exists on the detached forest, the database will create a new document with that URI in a different forest. When the original forest is re-attached, the database will contain duplicates of these URIs.

2. By detaching a forest from database-1 and then attaching it to database-2. Database-2 may already contain some of the URIs held by the newly attached forest, including directory URIs such as "/".

3. By doing a forest restore from forest-a to forest-b, where the database that contains forest-b already has some URIs that also exist on forest-a.

Prevention

To prevent case #1: Instead of detaching the forest to perform administrative operations, put the forest in read-only mode. You can do this by setting 'updates-allowed' to 'read-only' in the forest settings. The database will still know that a given URI exists, but updates against it will be disallowed, preventing any duplicates from being created.
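For example, a forest can be flipped to read-only with a short Admin API script (a sketch; "Library06" is a placeholder forest name - substitute your own):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
        at "/MarkLogic/admin.xqy";

    (: Sketch: set 'updates-allowed' to 'read-only' for one forest.
       "Library06" is a placeholder forest name. :)
    let $config := admin:get-configuration()
    let $config := admin:forest-set-updates-allowed(
        $config, xdmp:forest("Library06"), "read-only")
    return admin:save-configuration($config)

Set the value back to "all" after the administrative operation completes.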

Case #2 can be prevented by not using forest attach/detach to migrate content between databases. There are alternatives, such as replication.

The best way to avoid case #3 is to use database restore rather than forest restore. If you must use forest restore, use an Admin API script that double-checks that each forest backup is being restored to its corresponding restore target. Be sure to test your script thoroughly before deploying it to production.

How Programmers Can Create Duplicate URIs

There are several ways that programmers can create duplicate URIs:

1. By using xdmp:eval() to insert content with one or more forests specified in the database option. The server normally checks whether a URI exists in any forest before inserting, but this use of xdmp:eval() bypasses that safeguard.

2. By using the OUTPUT_FAST_LOAD option in the MapReduce connector.

3. By loading content with the database 'locking' option set to 'off.'

Prevention

To prevent case #1, avoid using 'place keys' (specifying a forest in the database option) during document inserts. This lets the database decide where the document goes, thereby preventing duplicates. Alternatively, you can use xdmp:document-assign() to determine where xdmp:document-insert() would place the URI, and then pass that forest to xdmp:eval(). In the if-eval function below, you can either use a hardcoded forest name:

    xquery version "1.0-ml";

    declare function local:if-eval(
        $xquery as xs:string,
        $vars as item()*,
        $forest as xs:unsignedLong
    ) {
        xdmp:eval(
            $xquery,
            $vars,
            <options xmlns="xdmp:eval">
                <isolation>different-transaction</isolation>
                <database>{$forest}</database>
            </options>
        )
    };

    local:if-eval("xdmp:document-insert('/foo1.xml', <foo>1</foo>)", (), xdmp:forest("Sciam"))

Or you can call it using the output of the xdmp:document-assign() function, which computes the forest the URI would normally be assigned to and so prevents duplicate URIs:

    let $forest :=
        let $forests := xdmp:database-forests(xdmp:database())
        let $index := xdmp:document-assign("/foo1.xml", count($forests))
        return $forests[$index]
    return
        local:if-eval("xdmp:document-insert('/foo1.xml', <foo>1</foo>)", (), xs:unsignedLong($forest))

To prevent case #2, use the default settings for ContentOutputFormat when using the MarkLogic Connector for Hadoop. Here is the explanation from the documentation:

To prevent duplicate URIs, the MarkLogic Connector for Hadoop defaults to a slower protocol for ContentOutputFormat when it detects the potential for updates to existing content. In this case, MarkLogic Server manages the forest selection, rather than the MarkLogic Connector for Hadoop. This behavior guarantees unique URIs at the cost of performance.

You may override this behavior and use direct forest updates by doing the following:

  • Set mapreduce.marklogic.output.content.directory. This guarantees all inserts will be new documents. If the output directory already exists, it will either be removed or cause an error, depending on the value of mapreduce.marklogic.output.content.cleandir.
  • Set mapreduce.marklogic.output.content.fastload to true. When fastload is true, the MarkLogic Connector for Hadoop always optimizes for performance, even if duplicate URIs are possible.

You can safely set mapreduce.marklogic.output.content.fastload to true if the number of forests in the database will not change while the job runs, and at least one of the following is true:

  • Your job only creates new documents. That is, you are certain that the URIs do not exist in any document or property fragments in the database.
  • The URIs output with ContentOutputFormat may already be in use, but both of these conditions are true:
      • The in-use URIs were not originally inserted using forest placement.
      • The number of forests in the database has not changed since initial insertion.
  • You set mapreduce.marklogic.output.content.directory.
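To illustrate, the connector properties above would typically be set in the Hadoop job's configuration. This is only a sketch of a configuration fragment; the directory path is an example value:

    <!-- Sketch: enable direct forest updates for a job that only
         creates new documents. The directory path is an example. -->
    <property>
      <name>mapreduce.marklogic.output.content.directory</name>
      <value>/incoming/</value>
    </property>
    <property>
      <name>mapreduce.marklogic.output.content.fastload</name>
      <value>true</value>
    </property>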

For case #3, be sure to use either the 'fast' or the 'strict' locking option on your target database when loading content. From the documentation:

[This option] Specifies how robust transaction locking should be. When set to strict, locking enforces mutual exclusion on existing documents and on new documents. When set to fast, locking enforces mutual exclusion on existing and new documents. Instead of locking all the forests on new documents, it uses a hash function to select one forest to lock. In general, this is faster than strict. However, for a short period of time after a new forest is added, some of the transactions need to be retried internally. When set to off, locking does not enforce mutual exclusion on existing documents or on new documents; only use this setting if you are sure all documents you are loading are new (a new bulk load, for example), otherwise you might create duplicate URIs in the database.

It is safe to use the 'off' setting only when performing a new bulk load into a fresh database.
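As a sketch, the locking option can be set through the Admin API (assuming a database named "Documents"; I believe admin:database-set-locking is the relevant call, but verify against your server version's Admin API documentation):

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
        at "/MarkLogic/admin.xqy";

    (: Sketch: set the database 'locking' option to 'strict'.
       "Documents" is a placeholder database name. :)
    let $config := admin:get-configuration()
    let $config := admin:database-set-locking(
        $config, xdmp:database("Documents"), "strict")
    return admin:save-configuration($config)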

Repairing Duplicate URIs

Once you encounter duplicate URIs, you should remove the duplicate copies as soon as possible. Here are scripts that will help you do the job:

1. The first script helps you to view the document singled out in the error message:

    (: Script for viewing duplicate document/properties fragments :)
    xquery version "1.0-ml";

    let $doc := "/" (: DUPLICATE URI :)
    let $forest-a-name := "forest_00"
    let $forest-b-name := "forest_01"
    let $query :=
        'xquery version "1.0-ml";
         declare variable $URI as xs:string external;
         (xdmp:document-properties($URI), fn:doc($URI))'
    let $options-a := <options xmlns="xdmp:eval"><database>{xdmp:forest($forest-a-name)}</database></options>
    let $options-b := <options xmlns="xdmp:eval"><database>{xdmp:forest($forest-b-name)}</database></options>
    let $results-a := xdmp:eval($query, (xs:QName("URI"), $doc), $options-a)
    let $results-b := xdmp:eval($query, (xs:QName("URI"), $doc), $options-b)
    return (
        fn:concat("RESULTS FROM : ", $forest-a-name), $results-a,
        fn:concat("RESULTS FROM : ", $forest-b-name), $results-b
    )

 

2. The second script allows you to delete a duplicate document or property:

    (: Script for deleting a duplicate document :)
    xquery version "1.0-ml";

    let $doc := "/" (: DUPLICATE URI :)
    let $forest-name := "forest_00" (: BAD FOREST :)
    let $query :=
        'xquery version "1.0-ml";
         declare variable $URI as xs:string external;
         xdmp:document-delete($URI)'
    let $options := <options xmlns="xdmp:eval"><database>{xdmp:forest($forest-name)}</database></options>
    return xdmp:eval($query, (xs:QName("URI"), $doc), $options)

 

3. This script helps you delete a duplicate directory:

    (: Script for deleting duplicate directory fragments :)
    xquery version "1.0-ml";

    let $doc := "/" (: DUPLICATE URI :)
    let $forest-name := "forest_00"
    let $query :=
        'xquery version "1.0-ml";
         declare variable $URI as xs:string external;
         xdmp:node-delete(xdmp:document-properties($URI))'
    let $options := <options xmlns="xdmp:eval"><database>{xdmp:forest($forest-name)}</database></options>
    return xdmp:eval($query, (xs:QName("URI"), $doc), $options)

 

4. If you need to find duplicate URIs, this script will list every URI that appears more than once:

    (: Script for finding duplicate documents :)
    xquery version "1.0-ml";

    for $uri in cts:uris((), ('frequency-order', 'descending', 'document'))
    let $freq := cts:frequency($uri)
    where $freq > 1
    return ($uri || ': ' || $freq)
