Understanding the role of Journals in relation to Backup and Restore (Journal Archiving)
05 December 2022 02:42 PM


This KnowledgeBase article covers how MarkLogic Server uses Journal files. The concept of Journalling is covered in many places in our documentation.

The aim of this article is to augment the online documentation and to offer a deeper look into what goes into the Journal files. It will also cover how MarkLogic Server uses Journals to maintain data. It is structured in the form of questions and answers based on real customer questions we have received in the past.

How does MarkLogic maintain Journals? How long are journals kept by MarkLogic before they are deleted?

At a high level, transactions work like this:

As soon as any transaction takes place, the first thing that happens is that an entry is written to the Journal file for each forest involved in that transaction.

The Journal file contains enough information for the document (also known as a fragment) to be written to disk. It's important to note that for the sake of brevity, no information pertaining to indexes is ever found in the Journal; all necessary index information is generated later in the transaction by MarkLogic Server.

An in-memory stand is created for that document. The in-memory stand contains the data as it would be within the forest, including indexes based on the index settings in effect at the time the transaction took place. At some point in time, in-memory stands are flushed to disk to become on-disk stands in the forest, and a checkpoint is made against the Journal to record the forest state at that point in time.

At the time of this writing, Journal sizes are set to 1024 MB by default and MarkLogic will maintain up to two of these files (per forest) at any given time. A journal file will be discarded when all data pertaining to transactions has successfully been checkpointed to disk as on-disk stands.

Note that only data that requires indexing is held in Journal files; binary files are managed outside Journals (aside from retaining a pointer to the location of the binary file).
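The sequence described above (journal entry first, then in-memory stand, then flush and checkpoint) can be pictured with a small toy model. This is only an illustrative sketch: the class and method names (ToyForest, commit, flush, discardable) are hypothetical and say nothing about how MarkLogic Server is actually implemented. It also tracks individual entries, whereas the server discards whole Journal files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyForest:
    """Illustrative model of one forest; not MarkLogic's real implementation."""
    journal: list = field(default_factory=list)          # (timestamp, uri) entries
    in_memory_stand: dict = field(default_factory=dict)  # uri -> timestamp
    on_disk_stands: list = field(default_factory=list)   # flushed stands
    checkpoint: int = 0                                  # last timestamp safely on disk

    def commit(self, timestamp, uri):
        # Step 1: the journal entry is written before anything else.
        self.journal.append((timestamp, uri))
        # Step 2: the fragment goes into the in-memory stand
        # (index data would be generated here; omitted in this toy).
        self.in_memory_stand[uri] = timestamp

    def flush(self):
        # Step 3: the in-memory stand becomes an on-disk stand and a
        # checkpoint records the forest state at that point.
        if not self.in_memory_stand:
            return
        self.checkpoint = max(self.in_memory_stand.values())
        self.on_disk_stands.append(dict(self.in_memory_stand))
        self.in_memory_stand.clear()

    def discardable(self):
        # Entries at or before the checkpoint are already on disk,
        # so they are no longer needed for recovery.
        return [entry for entry in self.journal if entry[0] <= self.checkpoint]
```

In this model, nothing is discardable until flush() has been called, which mirrors the rule that a journal is only discarded once its transactions have been checkpointed to on-disk stands.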

If we take a backup with Journal Archiving enabled, will that backup contain all the Journal files since the creation of the database?

No. The backup will contain only the Journal files that are available at the time the backup was taken (by default, this will be up to 2 GB for each forest in that database). Backing up a database with Journal Archiving enabled allows you to roll back and restore the data to any "safe" point that is within scope for the forests in that database. In other words, if your Journal files contain enough data to cover 24 hours' worth of transactions for every forest in that database, you can specify any point within that 24-hour window as the time to restore to.
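As a rough illustration of how much time a forest's Journals might cover, the arithmetic can be sketched as follows. The defaults (two Journal files of 1024 MB per forest) come from this article; the write rates and the function names are assumptions made up for the example, and the result is only an upper bound because Journals can be discarded sooner once their data has been checkpointed.

```python
JOURNAL_SIZE_MB = 1024     # default Journal size quoted in this article
JOURNALS_PER_FOREST = 2    # Journal files kept per forest, per this article


def coverage_hours(journal_mb_per_hour):
    """Rough upper bound on the hours of transactions one forest's Journals hold."""
    if journal_mb_per_hour <= 0:
        raise ValueError("journal write rate must be positive")
    return JOURNAL_SIZE_MB * JOURNALS_PER_FOREST / journal_mb_per_hour


def database_coverage_hours(rates_by_forest):
    """The usable window is limited by the busiest forest in the database."""
    return min(coverage_hours(rate) for rate in rates_by_forest.values())


# Hypothetical write rates in MB of Journal data per hour.
rates = {"content-1": 64, "content-2": 256}
print(database_coverage_hours(rates))
```

The minimum is taken across forests because the restore point has to be in scope for every forest in the database.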

Note: any backup taken without Journal Archiving enabled will not contain Journals.

The following KnowledgeBase article may offer some further insight into how this process would work:

Is any maintenance carried out to journals or the lifecycle of the journals during a backup and whilst journal archiving is enabled?

To explore this question, consider the following:

  • A backup is taken with Journal Archiving enabled on day 1. This is a full backup (including up to 2 GB of current Journal data) and throughout the day I continue to write journals to the backup location.
  • On day 2, I issue another backup command. Let's say it is the same root directory.
  • Does MarkLogic replace the current backup (depending on how many backups it is configured to keep) and remove the journals on the principle that they are now contained in the full backup, or does it create a new backup and begin to journal to this location? And then to restore, does one have to restore all backups in sequence, replaying the journals for each backup?

Each backup is entirely self-contained, so at the point where the backup is made, a directory will be created for the backup that looks like this:


Inside this directory there will be:

  • Forests (and there may be several)
  • All necessary XML files for MarkLogic to understand the topology of the cluster

The backup directory name is composed of:

{ the date } - { the timestamp for the transaction at the point where that backup took place }

So when a backup kicks off, that initial directory is created and named using the date and the "agreed" timestamp for that backup (as agreed by all participant forests in that database).

It helps to think of a backup as working much like any read query (we say "read" because it needs to calculate the safest timestamp at which all participant forests can perform the "query" which will comprise the set of data for that given backup).

From an operational point-of-view, the timestamp that gets written is essentially the equivalent of running xdmp:request-timestamp() within a query.

So the rule is:

  • If you perform the backup without Journal Archiving enabled, no Journals are written to the backup data.
  • If you perform the backup with Journal Archiving enabled, you get the backup plus the Journals.

At the point where the backup starts, you can think of the backup as being a long-running query, so all available Journal data is included up until the timestamp at which the backup begins; as the Journal is continually maintained for other queries, obviously none of the data for subsequent transactions would make it into the backup.

At the point where the restore takes place - for either day 1 or day 2 - you're working with a full set, so everything MarkLogic needs to do the restore is included in that backup (all stand data and Forest Journal files for that backup are in the backup). There is no "sequential restore" from a backup.

So all this to say: you can't take a backup on day 2 and roll back to any state before the timespan covered by the Journal files.

If you need a solution that requires the ability to roll back to any given point in time, you would need to think carefully about your design and configuration so you could always guarantee that Journal files would be captured to cover that time. If you want to be able to roll back the database to any specific point in the past, the only way to do this is to ensure you "archive" prior backups with Journal Archiving enabled. That way you can restore from that day's backup and then decide (if required) to restore to a given point in time (timestamp or wall-clock time).
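To make the idea of archiving prior backups concrete, here is a minimal sketch of how you might track which archived backup can serve a given restore time. Everything in it is hypothetical bookkeeping that you would maintain yourself: the ArchivedBackup record, its field names and the backup_for_restore helper are not part of any MarkLogic API, and the Journal window for each backup is assumed to be known.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ArchivedBackup:
    """Hypothetical record describing one self-contained backup."""
    directory: str            # illustrative location of the backup
    journal_start: datetime   # earliest point covered by its Journal files
    journal_end: datetime     # latest point covered by its Journal files


def backup_for_restore(backups, target: datetime) -> Optional[ArchivedBackup]:
    """Return a backup whose Journal window contains `target`, or None."""
    candidates = [
        b for b in backups
        if b.journal_start <= target <= b.journal_end
    ]
    # If several windows overlap the target, prefer the most recent backup.
    return max(candidates, key=lambda b: b.journal_end, default=None)


# Example: two daily backups kept in an archive (dates are made up).
day1 = ArchivedBackup("backup-day1", datetime(2022, 12, 1), datetime(2022, 12, 2))
day2 = ArchivedBackup("backup-day2", datetime(2022, 12, 2), datetime(2022, 12, 3))
chosen = backup_for_restore([day1, day2], datetime(2022, 12, 1, 12, 0))
print(chosen.directory if chosen else "no archived backup covers that time")
```

If only the day 2 backup had been kept, the lookup for a day 1 time would return None, which is the situation described above: there is no sequential restore, so a point in time is only reachable if a backup covering it was archived.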

Does MarkLogic offer anything that allows for incremental (partial) backups?

MarkLogic did not support incremental backups until version 8; if you're using the most up-to-date release, you may want to start by looking at the incremental backup feature in our documentation.

I want to understand transactions in more detail. Can you recommend a place to start?

The following KnowledgeBase article covers transaction timestamps, how transactions are managed and the difference between read queries and updates:

For backup, we need to use a separate filesystem or directory for each database on the cluster. Are there any implications with this approach?

Note that MarkLogic is a shared-nothing architecture, so much of the approach is really down to how you design your systems architecture.

For example, if you're performing backups at forest level and you have a cluster containing a database comprising forests on several hosts, one approach could be to make the backup "local" to each host (it would be important to note the ramifications for failover here if you plan to take this approach). Alternatively, each node in the cluster could perform the backup to one shared filesystem resource to keep all the backup data in one place.

Another approach would be to create fast flash backups of forest data; briefly quiescing forests so no updates can take place while performing a fast flash backup of all the data on disk. MarkLogic offers several backup solutions and is flexible based on your particular requirements.

Databases will be backed up with a directory structure similar to that found in /var/opt/MarkLogic (or C:\Program Files\MarkLogic\Data), so if you make a backup of multiple databases, these will have separate directories by design; if you look inside a backup directory, you should see a "Forests" directory. Each backup also contains all the necessary files for MarkLogic to figure out the complete configuration of the cluster (there will be 6 configuration XML files in the backup directory).

Are you aware of any downsides to creating your own directories for each database and then backing up into them -- I guess you are just duplicating folders, right?

An approach like this may not be necessary. Think of a backup as being almost like a minimal (self-contained) MarkLogic instance: the directories are structured in a way that the restore code can put the data back in the right place at the time the restore takes place. You could create additional directories if you like, but MarkLogic will create its own directory structure and add the necessary files each time, so it can build the topology in memory and restore the forest data. If you have 7 different backups in 7 different directories, that's fine, although each individual backup is going to have all the same directories from MarkLogic's perspective.

What certainly will cause problems is if you change anything in the directory that MarkLogic creates; doing this will almost certainly cause the restore to fail.

Are there any implications when restoring the Security and Schemas databases while using Journal Archiving?

From the documentation:

"If Journal Archiving is enabled, you cannot include auxiliary forests, as they should have their own separate backups."

As a general rule, we would recommend making the Security, Schemas and Modules backups last; perform all the big content database backups first, then at the end do the smaller backups of the auxiliary databases.

In the event of a disaster, you can recreate the cluster and restore the most recent Security, Schemas and Modules backups; anything else you restore thereafter will then have everything it needs already in place on the host or cluster.
