Understanding XDMP-DBDUPURI exceptions, how they can occur and how to prevent them
11 December 2020 10:16 AM
Under normal operating conditions, duplicate URIs are not allowed to occur, but there are ways that programmers and administrators can bypass the server safeguards.
Since duplicate URIs are considered a form of corruption, any query that encounters one will fail and post an error similar to the following:
We will begin by exploring the different ways that duplicate URIs can be created. Once we understand how this situation can occur, we will discuss how to prevent it from happening in the first place. We will also discuss ways to resolve the XDMP-DBDUPURI exception once it has occurred.
How Administrators Can Cause Duplicate URIs
There are several administrative actions that can result in duplicate URIs:
Prevention: the causes and our recommendations
To prevent case #1: rather than detaching the forest to perform administrative operations, put the forest in read-only mode.
You can do this by setting 'updates-allowed' to 'read-only' in the forest settings. This will let the database know that a given URI exists, but will disallow updates on it, thus preventing any duplicates from being created.
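The same setting can also be applied from a script. The following is a minimal sketch using the Admin API, assuming a hypothetical forest named "Documents-1"; adapt the name and test it outside production first.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents-1" is a hypothetical forest name used for illustration. :)
let $config := admin:get-configuration()
let $forest-id := xdmp:forest("Documents-1")
(: Keep the forest attached and queryable, but reject updates to it. :)
let $config := admin:forest-set-updates-allowed($config, $forest-id, "read-only")
return admin:save-configuration($config)
```

Once the administrative work is finished, running the same script with "all" in place of "read-only" should return the forest to normal operation.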
Case #2 can be prevented by not using forest attach/detach to migrate content between databases. Use an alternative such as database replication instead.
The best way to avoid case #3 is to use database-level restore rather than forest-level restore.
If you must use forest restore, make sure to use an Admin API script that double-checks that any given forest backup is being restored to the corresponding restore target. Be sure to test your script thoroughly before making changes in your production and other critical environments.
How Programmers Can Create Duplicate URIs
There are several ways that programmers can create duplicate URIs:
1. By using an
2. By using the
3. By loading content with the database 'locking' option set to 'off'.
Prevention: the causes and our recommendations
To prevent case #1, avoid using 'place keys' (specifying a forest in the database option) during document inserts. This will allow the database to decide where the document goes and thereby prevent duplicates. You can also use the API xdmp:document-assign() to figure out where xdmp:document-insert() would place that URI, and then pass that value in to the xdmp:eval() call.
In reality, while there can be minor performance gains from using in-forest evals ('place keys'), the practice of loading documents into specified forests is generally not advised, so the example code should be seen as an illustration of the process. We do not consider this to be a best practice.
In the in-forest-eval example function below, you can either use a hardcoded forest name:
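A minimal sketch of such a function and a hardcoded call is shown here. The helper name local:in-forest-insert, the URI and the forest name "Documents-1" are all hypothetical, and this is an illustration of the technique rather than a recommended loading pattern.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: insert $doc at $uri into the single forest $forest-id.
   Supplying a forest ID in the eval "database" option is what makes this an
   in-forest eval. Only that forest is consulted, which is why a duplicate
   can result if the URI already lives in a different forest. :)
declare function local:in-forest-insert(
  $uri as xs:string,
  $doc as node(),
  $forest-id as xs:unsignedLong
) as item()*
{
  xdmp:eval(
    'xquery version "1.0-ml";
     declare variable $uri as xs:string external;
     declare variable $doc as node() external;
     xdmp:document-insert($uri, $doc)',
    (xs:QName("uri"), $uri, xs:QName("doc"), $doc),
    <options xmlns="xdmp:eval">
      <database>{$forest-id}</database>
    </options>
  )
};

(: Call with a hardcoded (hypothetical) forest name. :)
local:in-forest-insert("/example/doc.xml", <doc/>, xdmp:forest("Documents-1"))
```

To target the forest the server itself would pick, the third argument could instead be taken from the list returned by xdmp:database-forests(xdmp:database()), indexed by xdmp:document-assign($uri, fn:count(...)) over that same list. This assumes every forest in the list is updatable and that the list order matches the order the server uses for assignment.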
Or you can call it using the output of the
It is important to note that there is generally no performance advantage in using the manual
To prevent case #2, use the default settings for ContentOutputFormat when using the MarkLogic Connector for Hadoop. Here is the explanation from the documentation:
To prevent duplicate URIs, the MarkLogic Connector for Hadoop defaults to a slower protocol for ContentOutputFormat when it detects the potential for updates to existing content. In this case, MarkLogic Server manages the forest selection, rather than the MarkLogic Connector for Hadoop. This behavior guarantees unique URIs at the cost of performance.
You may override this behavior and use direct forest updates by doing the following:
You can safely set
For case #3, be sure to use either the 'fast' or the 'strict' locking option on your target database when loading content. From the documentation:
[This option] Specifies how robust transaction locking should be.
When set to strict, locking enforces mutual exclusion on existing documents and on new documents.
When set to fast, locking enforces mutual exclusion on existing and new documents. Instead of locking all the forests on new documents, it uses a hash function to select one forest to lock. In general, this is faster than strict. However, for a short period of time after a new forest is added, some of the transactions need to be retried internally.
When set to off, locking does not enforce mutual exclusion on existing documents or on new documents; only use this setting if you are sure all documents you are loading are new (a new bulk load, for example), otherwise you might create duplicate URIs in the database.
It is OK to use the 'off' setting only if performing a new bulk load onto a fresh database.
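The locking option can be set in the database configuration or from a script. The following is a minimal sketch using the Admin API, assuming a hypothetical database named "Documents"; it is intended for restoring a safe setting after a bulk load that ran with locking set to 'off'.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a hypothetical database name used for illustration. :)
let $config := admin:get-configuration()
let $db-id := xdmp:database("Documents")
(: "fast" and "strict" both enforce mutual exclusion on new documents. :)
let $config := admin:database-set-locking($config, $db-id, "fast")
return admin:save-configuration($config)
```

Calling admin:database-get-locking($config, $db-id) on a fresh configuration is one way to confirm the value that is currently in effect.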
Repairing Duplicate URIs
Once you encounter duplicate URIs, you will need to delete them as soon as possible in order to restore functionality to the affected database.
Here are some utility XQuery scripts that will help you to do this work:
1. The first script helps you to view the document singled out in the error message:
2. The second script allows you to delete a duplicate document or property:
3. This script helps you delete a duplicate directory:
4. If you need to find duplicate URIs, this script will show duplicate documents:
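The repair steps above hinge on inspecting one forest at a time, because a query against the whole database fails as soon as it meets the duplicate. The following is a minimal sketch of that idea: it uses an in-forest eval to report which forests of the current database hold a document at a given URI. The URI "/example/duplicate.xml" is hypothetical and stands in for the one named in the error message.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI; replace with the URI reported in the error message. :)
declare variable $uri as xs:string := "/example/duplicate.xml";

(: Evaluate the existence check separately inside each forest, so that no
   single query ever sees both copies of the document. :)
for $forest-id in xdmp:database-forests(xdmp:database())
let $found :=
  xdmp:eval(
    'xquery version "1.0-ml";
     declare variable $uri as xs:string external;
     fn:doc-available($uri)',
    (xs:QName("uri"), $uri),
    <options xmlns="xdmp:eval">
      <database>{$forest-id}</database>
    </options>
  )
where $found
return xdmp:forest-name($forest-id)
```

If more than one forest name comes back, the URI is duplicated. Evaluating fn:doc($uri) the same way against one of those forests should return the copy stored there, and evaluating xdmp:document-delete($uri) against a single forest is a plausible way to remove only that copy. Treat both as assumptions to verify on a test system before touching production data.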