Knowledgebase:
Understanding MarkLogic Minimum Disk Space Requirements
17 October 2019 05:16 PM

Introduction

To simplify the calculation, the disk space requirements section of the MarkLogic Installation Guide states that, starting with MarkLogic 8 (and continuing in MarkLogic 9), the minimum disk space requirement is 1.5 times the total forest size for sufficiently large forests. Previously, the requirement was to maintain disk space equal to 3 times the forest data size.

In MarkLogic 8, we introduced (and continued in MarkLogic 9):

  1. The merge max size configuration parameter.
    • With merge-max-size set to 32 GB (32 GB was the default for versions 7 and 8; the recommended and default value for version 9 is 48 GB), a “sufficiently large forest” is defined as a forest of 128 GB or larger. That is, a fully merged forest with no deleted fragments is at least 128 GB. For a forest of this size, the disk space required is 192 GB (1.5 x 128 GB).
    • For smaller forests (or forests that do not set merge-max-size), the roughly 3x disk space requirement still applies, due to the merge size requirements.
  2. Searches across stands now use multiple threads to improve speed.

Stated most simply, the minimum disk space requirement for a forest is the greater of 192 GB or 1.5 times the forest data size.
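
The simplified rule above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the 192 GB floor is the figure quoted above for a merge-max-size of 32 GB.

```python
def simple_min_disk_gb(forest_data_gb: float) -> float:
    """Greater of 192 GB or 1.5 times the forest data size (simplified rule)."""
    floor_gb = 192.0  # 1.5 x 128 GB, as quoted for merge-max-size = 32 GB
    return max(floor_gb, 1.5 * forest_data_gb)


print(simple_min_disk_gb(100))  # 192.0 (small forest: the floor applies)
print(simple_min_disk_gb(200))  # 300.0 (large forest: 1.5x applies)
```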

This article explains the calculation of the minimum disk space requirement, but please keep in mind that sufficient disk space beyond the bare minimum should be available to handle the influx of data into your system for at least the amount of time it takes to provision more capacity.

Assumptions

Before we dive into how the minimum disk space requirement is calculated, let's briefly discuss some of the conditions that need to be met to make this calculation achievable:

  1. Assumption: "deleted" documents can be removed from a forest during a forest merge.
    • There are database configuration settings that prevent deleted documents from being removed. For example:
      • If you set a value for the database merge timestamp configuration, the forests will keep deleted document fragments for that period of time.
      • If you set the database retain until backup setting, deleted fragments will not be removed until a full backup or an incremental backup is completed.
    • There are circumstances where MarkLogic keeps a merge window (typically one hour) for deleted fragments which may result in larger forests during that time.
  2. Assumption: Merges are always allowed to occur
    • The database configuration allows for merge blackout periods.  During times of a merge blackout, deleted fragments are not removed.
  3. Assumption: Long running operations do not occur during times of heavy document inserts or updates.
    • Long running operations may require obsolete stands within a forest to hang around until the operation is complete.
    • A database backup can be a long running operation. It is recommended to schedule backups during times of low document updates.
  4. Assumption: If HA (High Availability) or DR (Disaster Recovery) solutions are configured, then there is sufficient network bandwidth and sufficient system stability for HA (forest replication) and DR (database replication) to stay in sync with minimum lag.
    • Storage requirements can increase significantly if HA and DR are configured; for example, if replication is paused, all of the un-shipped changes need to be retained on the master, so this can mean 2x to 4x the indexed data size.
  5. Assumption: MarkLogic Database restores to an Active database are not required.
    • For database restores where data needs to be restored to an active database, you will need at least 2x indexed data size + 64 GB per forest. You can avoid the 2x requirement if you can clear the forest/database before restoring. 

Details

Once all of the assumptions are met, let's look at how we can calculate the minimum disk space required, taking into account:

  • deleted fragments - i.e. forest size with maximum number of deleted fragments before a merge is kicked off
  • in-flight merging
  • concurrent document update (or reindexing) during a merge

This can be expressed as: 

    minimum-disk-space-required = forest-max-size + merge-space + concurrent-update

Forest Size

The actual forest size varies when deleted fragments are considered. The amount of variance depends on the merge-min-ratio setting (for example, a merge-min-ratio of 2 can result in a forest in which 1/3 of the fragments are deleted):

    forest-max-size = (1 + 1/merge-min-ratio) * minimum-forest-size

Where minimum-forest-size is the size of a fully merged forest with no deleted fragments. But even the minimum-forest-size can vary based on the raw document content size and index settings:

    minimum-forest-size = document-size * expected-index-expansion

Index expansion varies by index settings and document content. We have seen index expansion from 0.75X to 5X.  The only way to estimate this value is by experimentation – with a sufficiently large representative sample data set so that stand overhead is insignificant. 

When calculating the disk space required, we need to account for the maximum possible size (i.e. the forest size including index expansion of documents and the maximum number of deleted fragments):

    forest-max-size = (1 + 1/merge-min-ratio) * (document-size * expected-index-expansion)

(Note: although stated as a maximum forest size, this value may still be lower than the actual size if the merge assumptions are not met.)
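
As a rough illustration of the forest size formulas above, the following Python sketch computes forest-max-size for a few index expansion factors and shows where the "1/3 deleted" figure for a merge-min-ratio of 2 comes from. The function names and the 100 GB document size are hypothetical, and the deleted share is derived from sizes, which assumes that size is proportional to fragment count.

```python
def forest_max_size_gb(document_size_gb, expected_index_expansion, merge_min_ratio):
    """forest-max-size = (1 + 1/merge-min-ratio) * (document-size * expected-index-expansion)."""
    minimum_forest_size = document_size_gb * expected_index_expansion
    return (1 + 1 / merge_min_ratio) * minimum_forest_size


def max_deleted_share(merge_min_ratio):
    """Share of the maximum forest size that is deleted content.

    (max - min) / max = (1/r) / (1 + 1/r) = 1 / (r + 1)
    """
    return 1 / (merge_min_ratio + 1)


# Hypothetical 100 GB of raw documents, across the observed expansion range.
for expansion in (0.75, 2.0, 5.0):
    print(expansion, forest_max_size_gb(100, expansion, merge_min_ratio=2))

print(round(max_deleted_share(2), 3))  # 0.333, i.e. about 1/3
```

Because index expansion can only be estimated by experiment, the expansion factor is the input most worth measuring before relying on the result.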

Merge Space

During merging, there is a point in time where the old stands and the new stand coexist on disk - the write must succeed before the old stands can be removed. There may be multiple merges occurring, but with the merge-max-size configuration, the merges never require more than 1.33x the merge-max-size (the old stands are already taken into account in the forest size calculations):

    1.33 * merge-max-size
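
For a sense of scale, this small Python sketch applies the 1.33x factor to the two merge-max-size values mentioned in the introduction (32 GB and 48 GB). The function name is hypothetical.

```python
MERGE_SPACE_FACTOR = 1.33  # upper bound on merge space, from the formula above


def merge_space_gb(merge_max_size_gb):
    """Extra disk space that in-flight merges may need, in GB."""
    return MERGE_SPACE_FACTOR * merge_max_size_gb


for merge_max in (32, 48):
    print(merge_max, round(merge_space_gb(merge_max), 2))
```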

Concurrent Update

There is a time lag between when merging begins and when it ends, and you need space for the documents that can be updated or reindexed during that time (a reindex is equivalent to a delete and an insert):

    (merge-max-size / merge-rate) * update-rate

If you make the simplifying assumption that concurrent update (or reindexing) occurs 50% slower than merging (that is, the merge rate is 1.5 times the update rate), then this just becomes:

    0.66 * merge-max-size
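
The concurrent update term can be sketched in Python as follows. The function name and the rates are hypothetical (any consistent unit, such as GB per hour, works); they are chosen so that merging runs 1.5 times as fast as updates, which yields roughly the 0.66 factor above.

```python
def concurrent_update_gb(merge_max_size_gb, merge_rate, update_rate):
    """(merge-max-size / merge-rate) * update-rate, in GB."""
    merge_duration = merge_max_size_gb / merge_rate  # time the merge takes
    return merge_duration * update_rate              # data updated in that time


# Hypothetical rates: merge at 150 GB/hour, updates at 100 GB/hour.
space = concurrent_update_gb(32, merge_rate=150, update_rate=100)
print(round(space, 2))       # about 21.33 GB
print(round(space / 32, 2))  # about 0.67 of merge-max-size
```

If the actual update rate is closer to the merge rate than assumed here, the space needed grows accordingly.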

Putting it Together

So, putting it all together in terms of forest-max-size:

    minimum-disk-space-required = forest-max-size + 2 * merge-max-size

Remember that forest-max-size is a function of document content size, index expansion and retained deleted fragments (merge-min-ratio) per the equation presented earlier.
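
A minimal Python sketch of the combined formula is shown below. The function name and the 300 GB scenario are hypothetical. Note that with a forest-max-size of 128 GB and a merge-max-size of 32 GB, the formula gives 192 GB, the same figure quoted in the introduction.

```python
def minimum_disk_space_gb(forest_max_size_gb, merge_max_size_gb):
    """minimum-disk-space-required = forest-max-size + 2 * merge-max-size.

    The factor of 2 is the rounded sum of merge space (1.33x) and
    concurrent update space (0.66x).
    """
    return forest_max_size_gb + 2 * merge_max_size_gb


print(minimum_disk_space_gb(128, 32))  # 192
print(minimum_disk_space_gb(300, 48))  # 396
```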

Caveats

Again, per our assumptions, there are conditions where the calculated minimum disk space requirement may not be sufficient.

  • If deleted documents are configured to be retained across merges.
  • If merge blackout periods are configured.
  • If long running operations occur during times of document updates / inserts.
  • If HA or DR is configured and replication lag occurs.
  • If database restores are required.
  • If additional new content is loaded into the system, then the size of those additional documents needs to be included in the calculations.

Out of Space

What happens if MarkLogic Server does not have enough disk space?

The most likely outcome is that merges will begin to fail and you will see an XDMP-MERGESPACE error in the error log.  It is also possible that forests will go offline.  If a forest goes offline, the database will also be offline, halting all access to the database.  When this happens, you will need to take manual corrective action to either free up some disk space or add more.

Summary

The minimum disk space requirement is forest-max-size + 2 * merge-max-size. But there are many conditions, including merge policy configuration, the one hour merge window, and long running operations, that can cause deleted fragments and obsolete stands to be retained, resulting in larger than expected forest sizes and greater than expected disk space utilization. High Availability, Disaster Recovery, and database backup / restore solutions will also require additional disk space to be available.

It is always a good idea to give your system enough head room to avoid application or database outages and monitor your disk usage continuously to understand your trends in order to predict when your disk space allocation will be insufficient.
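
As one possible way to keep an eye on headroom, the sketch below uses Python's standard `shutil.disk_usage` to compare a volume's free space against a calculated requirement. It is not a MarkLogic tool: the function names are hypothetical, and it assumes the volume holds only the forest in question, so that the space available to the forest is its current on-disk size plus the volume's free space.

```python
import shutil

BYTES_PER_GB = 1024 ** 3


def free_space_gb(path):
    """Free space on the volume containing `path`, in GB."""
    return shutil.disk_usage(path).free / BYTES_PER_GB


def volume_can_hold(path, forest_on_disk_gb, required_gb):
    """True if current forest size plus free space covers the requirement.

    Assumption: the volume is dedicated to this forest.
    """
    return forest_on_disk_gb + free_space_gb(path) >= required_gb


# Hypothetical check: a 100 GB forest that needs 192 GB in total.
print(volume_can_hold(".", forest_on_disk_gb=100, required_gb=192))
```

Running a check like this on a schedule and recording the free space over time gives the trend data needed to predict when more capacity must be provisioned.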

Related articles:

Recovering from low disk space

Migrating to MarkLogic and understanding the 1.5x requirement

 
