Understanding XDMP-DBDUPURI exceptions, how they can occur and how to prevent them
30 January 2023 11:23 AM

Summary

An XDMP-DBDUPURI error occurs when the same URI exists in more than one forest of the same database. This article describes some of the conditions under which this can happen and presents a number of strategies to help identify, prevent and fix duplicate URIs.

If you encounter multiple documents returned for one URI without an error, please see Duplicate Documents.

Under normal operating conditions, duplicate URIs are not allowed to occur, but there are ways that programmers and administrators can bypass the server safeguards.

Since duplicate URIs are considered a form of corruption, any query that encounters one will fail and post an error similar to the following:

XDMP-DBDUPURI: URI /foo.xml found in forests Library06 and Library07

We will begin by exploring the different ways that duplicate URIs can be created. Once we understand how this situation can occur, we will discuss how to prevent it from happening in the first place. We will also discuss ways to resolve the XDMP-DBDUPURI error when it does occur.

How Administrators Can Cause Duplicate URIs

There are several administrative actions that can result in duplicate URIs:

  1. By detaching a forest from its parent database (for administrative purposes - e.g., backup, restore) while allowing updates to continue on the database. If an update is committed to a URI that exists on the detached forest, the database will create a new document with that URI on a different forest. When the forest is re-attached to the database, you will have duplicates of these URIs.
  2. By detaching a forest from Database-1 and then attaching it to Database-2. Database-2 may already have some of the URIs that the new forest contains, including directory fragments covering common URI paths (such as "/").
  3. By performing a forest-level restore from forest-a to forest-b, where the database that contains forest-b already has some URIs that also exist in forest-a.

Prevention: the causes and our recommendations

To prevent case #1: Rather than detaching the forest to perform administrative operations, put the forest in read-only mode.

You can do this by setting 'updates-allowed' to 'read-only' in the forest settings. This will let the database know that a given URI exists, but will disallow updates on it, thus preventing any duplicates from being created.
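
As a minimal sketch, the setting could be applied with the Admin API along the following lines. The forest name Library06 is a placeholder borrowed from the sample error message above; substitute your own forest name, and try the script on a non-production cluster first.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Library06" is a placeholder forest name, not a required value. :)
let $forest-id := xdmp:forest("Library06")
let $config := admin:get-configuration()
(: Mark the forest read-only instead of detaching it. :)
let $config := admin:forest-set-updates-allowed($config, $forest-id, "read-only")
return admin:save-configuration($config)
```

Once the administrative work is finished, running the same script with the value "all" should return the forest to normal operation.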

Case #2 can be prevented by not using forest attach/detach for content migration between databases. Alternatives such as database replication are available for that purpose.

The best way to avoid case #3 is to use database-level restore rather than forest-level restore.

If you must use forest restore, make sure to use an Admin API script that double-checks that any given forest backup is being restored to the corresponding restore target. Be sure to test your script thoroughly before making changes in your production and other critical environments.
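
The following is only a rough sketch of what such a double-checking script might look like, not a vetted procedure. It assumes a hypothetical convention in which each backup directory ends with the name of the forest it was taken from; the forest names, paths, and the database name "Library" are all placeholders. Every entry is validated before any restore is started.

```xquery
xquery version "1.0-ml";

(: Hypothetical restore plan: forest name -> backup directory (placeholders). :)
declare variable $PLAN := map:new((
  map:entry("Library06", "/backups/Library06"),
  map:entry("Library07", "/backups/Library07")
));

(: Placeholder name of the database the target forests should belong to. :)
declare variable $TARGET-DB as xs:string := "Library";

let $problems :=
  for $forest-name in map:keys($PLAN)
  let $path := map:get($PLAN, $forest-name)
  let $forest-id := xdmp:forest($forest-name)
  return (
    (: The target forest must be attached to the expected database. :)
    if (xdmp:forest-databases($forest-id) = xdmp:database($TARGET-DB)) then ()
    else fn:concat($forest-name, " is not attached to ", $TARGET-DB),
    (: Assumed convention: the backup path ends with the forest name. :)
    if (fn:ends-with($path, fn:concat("/", $forest-name))) then ()
    else fn:concat($path, " does not look like a backup of ", $forest-name)
  )
return
  if (fn:exists($problems)) then
    fn:error(xs:QName("RESTORE-CHECK"), fn:string-join($problems, "; "))
  else
    for $forest-name in map:keys($PLAN)
    return xdmp:forest-restore(xdmp:forest($forest-name), map:get($PLAN, $forest-name))
```

Because the checks run over the whole plan first, a single mismatched entry stops the script before any forest is touched.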

How Programmers Can Create Duplicate URIs

There are several ways that programmers can create duplicate URIs:

1. By using an xdmp:eval() to insert content with one or more forests set in the database option. MarkLogic normally checks whether a URI exists in any forest of the database before inserting, but an in-forest eval bypasses that safeguard.

2. By using the OUTPUT_FAST_LOAD option in the MapReduce connector (see the mapreduce Javadoc for more details).

3. By loading content with the database 'locking' option set to 'off'.

Prevention: the causes and our recommendations

To prevent case #1, avoid using 'place keys' (specifying a forest in the database option) during document inserts. This allows the database to decide where the document goes and thereby prevents duplicates. You can also use the API xdmp:document-assign() to determine where xdmp:document-insert() would place that URI, and then pass that value in to the xdmp:eval().

In reality, while there can be minor performance gains from using in-forest evals ('place keys'), the practice of loading documents into specified forests is generally not advised, so the example code should be seen as an illustration of the process. We do not consider this to be a best practice.

In the in-forest-eval example function below, you can either use a hardcoded forest name:
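
The original listing is not reproduced here, so the following is a hedged sketch of what such a function could look like. The function name local:in-forest-insert, the URI, the document content and the forest name Library06 are all illustrative assumptions.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: insert $doc at $uri by evaluating
   xdmp:document-insert against a single named forest. :)
declare function local:in-forest-insert(
  $uri as xs:string,
  $doc as node(),
  $forest-name as xs:string
)
{
  xdmp:eval(
    'xquery version "1.0-ml";
     declare variable $URI as xs:string external;
     declare variable $DOC as node() external;
     xdmp:document-insert($URI, $DOC)',
    (xs:QName("URI"), $uri, xs:QName("DOC"), $doc),
    (: Putting a forest ID in the database option makes this an in-forest eval. :)
    <options xmlns="xdmp:eval">
      <database>{xdmp:forest($forest-name)}</database>
    </options>)
};

(: Hardcoded forest name: this form can create duplicate URIs. :)
local:in-forest-insert("/foo.xml", <foo/>, "Library06")
```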

Or you can call it using the output of the xdmp:document-assign() function, which prevents duplicate URIs:
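
A sketch of that second form follows. It declares a hypothetical local:in-forest-insert helper and then chooses the forest with xdmp:document-assign(). It assumes that the index returned by xdmp:document-assign() maps onto the order of forests returned by xdmp:database-forests(), and that the database uses the same assignment policy that xdmp:document-assign() applies; both assumptions should be confirmed for your configuration.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: in-forest insert into one named forest. :)
declare function local:in-forest-insert(
  $uri as xs:string,
  $doc as node(),
  $forest-name as xs:string
)
{
  xdmp:eval(
    'xquery version "1.0-ml";
     declare variable $URI as xs:string external;
     declare variable $DOC as node() external;
     xdmp:document-insert($URI, $DOC)',
    (xs:QName("URI"), $uri, xs:QName("DOC"), $doc),
    <options xmlns="xdmp:eval">
      <database>{xdmp:forest($forest-name)}</database>
    </options>)
};

let $uri := "/foo.xml"
let $forests := xdmp:database-forests(xdmp:database())
(: Ask the server which forest index it would pick for this URI. :)
let $index := xdmp:document-assign($uri, fn:count($forests))
let $forest-name := xdmp:forest-name($forests[$index])
return local:in-forest-insert($uri, <foo/>, $forest-name)
```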

It is important to note that there is generally no performance advantage in calling xdmp:document-assign() manually. If your code does this, consider using a plain xdmp:document-insert() instead, as that approach manages the forest assignment for you.

To prevent case #2, use the default settings for ContentOutputFormat when using the MarkLogic Connector for Hadoop. Here is the explanation from the documentation:

To prevent duplicate URIs, the MarkLogic Connector for Hadoop defaults to a slower protocol for ContentOutputFormat when it detects the potential for updates to existing content. In this case, MarkLogic Server manages the forest selection, rather than the MarkLogic Connector for Hadoop. This behavior guarantees unique URIs at the cost of performance.

You may override this behavior and use direct forest updates by doing the following:

  • Set mapreduce.marklogic.output.content.directory. This guarantees all inserts will be new documents. If the output directory already exists, it will either be removed or cause an error, depending on the value of the mapreduce.marklogic.output.content.cleandir setting.
  • Set mapreduce.marklogic.output.content.fastload to true. When fastload is true, the MarkLogic Connector for Hadoop always optimizes for performance, even if duplicate URIs are possible.

You can safely set mapreduce.marklogic.output.content.fastload to true if the number of forests in the database will not change while the job runs, and at least one of the following is true:

  • Your job only creates new documents. That is, you are certain that the URIs do not exist in any document or property fragments in the database.
  • The URIs output with ContentOutputFormat may already be in use, but both these conditions are true:
      • The in-use URIs were not originally inserted using forest placement.
      • The number of forests in the database has not changed since initial insertion.
  • You have set mapreduce.marklogic.output.content.directory.

For case #3, be sure to use either the 'fast' or the 'strict' locking option on your target database when loading content. From the documentation:

[This option] Specifies how robust transaction locking should be.

When set to strict, locking enforces mutual exclusion on existing documents and on new documents.

When set to fast, locking enforces mutual exclusion on existing and new documents. Instead of locking all the forests on new documents, it uses a hash function to select one forest to lock. In general, this is faster than strict. However, for a short period of time after a new forest is added, some of the transactions need to be retried internally.

When set to off, locking does not enforce mutual exclusion on existing documents or on new documents; only use this setting if you are sure all documents you are loading are new (a new bulk load, for example), otherwise you might create duplicate URIs in the database.

It is OK to use the 'off' setting only if performing a new bulk load onto a fresh database.
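
As an illustration, the locking option could be set through the Admin API roughly as follows. The database name "Documents" is a placeholder, and "fast" is used only as an example value.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name. :)
let $db-id := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Use "fast" or "strict"; reserve "off" for bulk loads into a fresh database. :)
let $config := admin:database-set-locking($config, $db-id, "fast")
return admin:save-configuration($config)
```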

Repairing Duplicate URIs

Once you encounter duplicate URIs, you will need to delete the extra copies as soon as possible in order to restore functionality to the affected database.

Here are some utility XQuery scripts that will help you to do this work:

1. Script to view the document singled out in the error message.

2. Script to allow you to delete a duplicate document or property.

3. Script to allow you to delete a duplicate directory.

4. Script to find duplicate URIs by showing duplicate documents.
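
The scripts themselves are not reproduced here. As a rough sketch of the idea behind the first one, the query below looks up a URI in each forest of the current database separately, using an in-forest eval so that each lookup sees only one forest. The helper name local:in-forest and the URI /foo.xml are illustrative assumptions, and the sketch should be tried on a test system before being relied on.

```xquery
xquery version "1.0-ml";

(: The URI named in the XDMP-DBDUPURI message (placeholder value). :)
declare variable $URI as xs:string := "/foo.xml";

(: Hypothetical helper: evaluate $query against a single forest. :)
declare function local:in-forest(
  $query as xs:string,
  $uri as xs:string,
  $forest-id as xs:unsignedLong
)
{
  xdmp:eval($query, (xs:QName("URI"), $uri),
    <options xmlns="xdmp:eval">
      <database>{$forest-id}</database>
    </options>)
};

(: List every forest of the current database that holds a copy of $URI. :)
for $forest-id in xdmp:database-forests(xdmp:database())
let $doc := local:in-forest(
  'xquery version "1.0-ml";
   declare variable $URI as xs:string external;
   fn:doc($URI)',
  $URI, $forest-id)
where fn:exists($doc)
return fn:concat(xdmp:forest-name($forest-id), ": ", xdmp:describe($doc))
```

The same in-forest pattern, with xdmp:document-delete($URI) in place of fn:doc($URI) and evaluated against only the forest that holds the unwanted copy, is one plausible way to remove a duplicate. Decide which copy to keep before deleting anything.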
