Understanding the role of Journals in relation to Backup and Restore (Journal Archiving)
03 July 2023 04:10 PM
This KnowledgeBase article covers the use of journal files by MarkLogic Server. The concept of journaling is covered in many places in our documentation, including the sections on Backing Up Databases with Journal Archiving and Restoring a Database from a Backup.
The aim of this article is to augment the online documentation and to offer a deeper look into what goes into the journal files. It will also cover how MarkLogic Server uses journals to maintain data. It is structured as a series of questions and answers based on real customer questions we have received in the past.
How does MarkLogic maintain journals? How long are journals kept by MarkLogic before they are deleted?
At a high level, transactions work like this:
1. As soon as a transaction takes place, the first thing that happens is that a frame is written to the journal file of each forest involved in that transaction.
2. The journal frame contains enough information for the document (also known as a fragment) to be written to disk. It is important to note that, to keep the journal compact, no index information is ever stored in the journal; all necessary index information is generated later in the transaction by MarkLogic Server.
3. An in-memory stand is created for that document. The in-memory stand contains the data as it will appear in the forest, including indexes based on the index settings in effect at the time of the transaction. At some point, in-memory stands are flushed to disk to become on-disk stands in the forest, and a checkpoint is made against the journal to record the forest state at that point in time.
At the time of this writing, journal sizes are set to 1024 MB by default and MarkLogic will maintain up to two of these files (per forest) at any given time. A journal file will be discarded when all data pertaining to transactions has successfully been checkpointed to disk as on-disk stands.
Note that only data that requires indexing is held in journal files; binary files are managed outside journals (aside from retaining a pointer to the location of the binary file).
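The journal-then-checkpoint cycle described above can be sketched as a toy write-ahead log. This is an illustrative simplification only, not MarkLogic's actual implementation; all class and attribute names are hypothetical:

```python
# Toy sketch of the journal/checkpoint cycle described above.
# NOT MarkLogic's implementation; all names here are hypothetical.

class Forest:
    def __init__(self):
        self.journal = []          # journal frames not yet checkpointed
        self.in_memory_stand = {}  # recent writes, indexed in memory
        self.on_disk_stands = []   # immutable stands flushed to disk

    def write(self, uri, doc):
        # 1. The frame is journaled first, so the write survives a crash.
        self.journal.append((uri, doc))
        # 2. Only then is the document added to the in-memory stand
        #    (where index entries would also be generated).
        self.in_memory_stand[uri] = doc

    def checkpoint(self):
        # Flush the in-memory stand to disk as a new on-disk stand.
        self.on_disk_stands.append(dict(self.in_memory_stand))
        self.in_memory_stand = {}
        # Journal frames covered by the checkpoint can now be discarded.
        self.journal.clear()

forest = Forest()
forest.write("/doc1.xml", "<a/>")
forest.write("/doc2.xml", "<b/>")
forest.checkpoint()
print(len(forest.journal))         # 0: frames discarded after checkpoint
print(len(forest.on_disk_stands))  # 1: one new on-disk stand
```

The key property the sketch shows is ordering: the journal frame is durable before the document reaches any stand, which is why a journal file can only be discarded once all of its frames are covered by a checkpoint.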
If we take a backup with journal archiving enabled, will that backup contain all the journal files since the creation of the database?
No, journal frames are archived going forward from the backup.
This is discussed at Backing Up Databases with Journal Archiving where it says:
Note: any backup taken without journal archiving enabled will not contain journals.
The following KnowledgeBase article may offer some further insight into how a restore would work:
Does MarkLogic offer anything that allows for incremental (partial) backups?
Yes, starting with Version 8. You may want to start by looking at the incremental backup feature in our documentation.
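As a rough mental model, an incremental backup copies only data created since the previous backup, rather than the whole database. The sketch below is a conceptual illustration only (the timestamps and "stand" tuples are hypothetical, and this is not MarkLogic's actual mechanism):

```python
# Conceptual sketch of incremental backup: copy only data created since
# the last backup. Illustrative only; not MarkLogic's actual mechanism.

def incremental_backup(stands, last_backup_time):
    # Each stand is (created_at, data); only newer stands are copied.
    return [s for s in stands if s[0] > last_backup_time]

stands = [(1, "stand-A"), (5, "stand-B"), (9, "stand-C")]
print(incremental_backup(stands, last_backup_time=5))  # [(9, 'stand-C')]
```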
I want to understand transactions in more detail, can you recommend a place to start?
The following KnowledgeBase article covers transaction timestamps, how transactions are managed and the difference between read queries and updates:
Can I use the same directory for all the databases on the cluster?
You should not use the same directory for the backups of more than one database. Each database should use a different backup directory.
When you create a scheduled backup in the Admin UI, the backup path field specifies the value as:
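One simple way to honour the one-directory-per-database rule is to derive each backup directory from the database name under a common root. The paths and database names below are hypothetical examples:

```python
# Sketch: one distinct backup directory per database, derived from the
# database name. Root path and database names are hypothetical examples.
import os

BACKUP_ROOT = "/space/backups"
databases = ["Documents", "Security", "Schemas", "Modules"]

backup_dirs = {db: os.path.join(BACKUP_ROOT, db) for db in databases}
for db, path in backup_dirs.items():
    print(db, "->", path)
```

Because each path embeds the database name, no two databases can ever share a backup directory.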
Are there any impacts when restoring the Security and Schemas databases when using journal archiving?
From the documentation: https://docs.marklogic.com/guide/admin/backup_restore#id_51602
"If journal archiving is enabled, you cannot include auxiliary forests, as they should have their own separate backups."
As a general rule, we would recommend making the Security, Schemas and Modules backups last; perform all the big content database backups first, then at the end do the smaller backups of the auxiliary databases.
In the event of a disaster, you can recreate the cluster and restore the most recent Security, Schemas and Modules backups, and then anything you restore thereafter will have everything it needs already in place on the host / cluster.
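The ordering recommended above can be sketched as follows: back up the large content databases first and the small auxiliary databases (Security, Schemas, Modules) last, and restore in the reverse order. The database names are examples only:

```python
# Sketch of the recommended ordering: content databases first for backup,
# auxiliary databases (Security, Schemas, Modules) last; restore reversed.
# Database names here are hypothetical examples.

AUXILIARY = {"Security", "Schemas", "Modules"}

def backup_order(databases):
    # Content databases first, auxiliary databases last.
    # sorted() is stable, so relative order within each group is kept.
    return sorted(databases, key=lambda db: db in AUXILIARY)

def restore_order(databases):
    # Auxiliary databases first, so later restores find them in place.
    return sorted(databases, key=lambda db: db not in AUXILIARY)

dbs = ["Security", "Documents", "Schemas", "Modules", "Content2"]
print(backup_order(dbs))   # content databases first, auxiliary last
print(restore_order(dbs))  # auxiliary databases first
```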