Knowledgebase:
Migrating to MarkLogic 7 and understanding the 1.5x disk rule (rather than 3x)
28 July 2015 08:46 PM

Introduction

Those familiar with versions of MarkLogic Server prior to MarkLogic 7 may have heard the 3X disk space rule mentioned. At the time of writing, references to it can be found in the MarkLogic 5 documentation and the MarkLogic 6 documentation.

The Monitoring Metrics of Interest section in the Monitoring MarkLogic Guide refers to the 3X rule in a preparatory question on disk allocation for a database:

  • Is there enough disk space for forest data and merges? Merges require at least twice as much free disk space as used by the forest data (3X rule). If a merge runs out of disk space, it will fail.

If you have read the requirements guidelines for MarkLogic 7 (and above), you may have noticed a section suggesting that you plan for the following disk space to be available:

  • 1.5 times the disk space of the total forest size. Specifically, each forest on a filesystem requires its filesystem to have at least 1.5 times the forest size in disk space (or, for each forest less than 32GB, 3 times the forest size). This translates to 1.5 times the disk space of the source content after it is loaded.

    For example, if you plan on loading content that will result in a 100 GB database, reserve at least 150GB of disk space. The disk space reserve is required for merges.
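
The quoted guideline can be turned into a quick sizing calculation. The following is a minimal sketch in Python rather than MarkLogic tooling: the function name and the forest sizes are hypothetical, and it assumes that all of the forests passed in share one filesystem, so their individual requirements are simply added together.

```python
# Illustrative sizing helper; names are hypothetical, not part of MarkLogic.
SMALL_FOREST_GB = 32  # threshold quoted in the requirements guideline


def required_disk_gb(forest_sizes_gb):
    """Return the minimum disk space (GB) for forests sharing a filesystem.

    Applies the quoted rule: 1.5x the forest size, or 3x for each
    forest smaller than 32 GB.
    """
    total = 0.0
    for size in forest_sizes_gb:
        factor = 3.0 if size < SMALL_FOREST_GB else 1.5
        total += size * factor
    return total


if __name__ == "__main__":
    # A single 100 GB forest, as in the documentation example.
    print(required_disk_gb([100]))      # 150.0
    # A 100 GB forest plus a 10 GB forest on the same filesystem.
    print(required_disk_gb([100, 10]))  # 180.0
```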

This Knowledgebase article will cover both requirements and offer some further guidance as to how to plan and size your databases and - crucially - how you can take advantage of the newer 1.5X rule.

3X

The original logic behind the allocation of 3X disk space was to provide ample space to allow for a situation where a database is fully reindexed. The allocation would be in thirds according to the following measures:

  1. Your Data
  2. Space for reindexing
  3. Space for merges

The 3X disk provision rule was offered as a very general (and very safe for production) rule to cover the most extreme example where your data gets reindexed in its entirety and then merges have to take place on top of that.

... but why 3X?

To understand this, we need to briefly explore what happens when a document is updated in MarkLogic Server.

As an update is made to a document - and the same rule applies to an update to a document when index changes are concerned - the transaction takes place at a given timestamp (a given point in time). At that point, the original fragment is marked as deleted and a new fragment is written to an in-memory stand. Eventually, the in-memory stand is written to disk.

For a period of time - especially at times where a MarkLogic instance/cluster is busy performing a large number of updates - it's likely that there will be occasions where two versions of the same fragment exist in different stands on disk; one stand will contain the fragment now marked as deleted and the other stand will contain the newly written fragment - which will be used by any subsequent queries running at later timestamps.

... so that covers 2X - what about the other third?

When a merge takes place, merge candidate stands are identified and a new stand is created. As the candidate stands are read through, the active fragments are copied over to the new stand.

At the point where the merge takes place, the new stand coexists with the older stands because - like updates and reindexing - queries will still need to run against the candidate stands; the timestamp will only get moved on to accommodate the data in the new stand as soon as the process has completed in its entirety.

While all of this is taking place, other updates could be taking place to documents in other stands and the same rules apply to those fragments too.

So the 3X rule provides a true safeguard, allowing for a situation where forest sizes are likely to swell way above and beyond the size of the data they contain, to accommodate the fragments marked deleted for queries at earlier timestamps and to accommodate the additional headroom required by a merge of some very large stands.
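
As a rough illustration of the worst case described above, the sketch below adds up the three thirds for a fully reindexed forest. It is a simplified model in Python with a hypothetical function name, and it assumes that every fragment is rewritten exactly once and that all active content is then merged into new stands while the old stands are still on disk.

```python
def worst_case_footprint_gb(data_gb):
    """Simplified worst-case disk footprint behind the 3X rule."""
    original = data_gb      # 1. your data (fragments now marked as deleted)
    reindexed = data_gb     # 2. space for reindexing (newly written fragments)
    merge_output = data_gb  # 3. space for merges (new stand, old stands remain)
    return original + reindexed + merge_output


print(worst_case_footprint_gb(100))  # 300
```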

1.5X

Some changes were made in MarkLogic 7 which effectively reduce the footprint of your data on-disk. With some careful planning, you can take advantage of the lower sizing rule.

While the documentation still acknowledges the 3X rule (which is still true if you're performing an upgrade directly from MarkLogic 6 or earlier without making any other configuration changes), a new default configuration has been introduced to databases created under MarkLogic 7; this is the merge max size.

What does the merge max size do?

This setting enforces an upper limit of 32GB on the size of an individual stand.

With previous versions of the product, the expectation would be for the contents of a forest to merge down to one large stand. That is: given a quiesced database, on full completion of a merge, all content (all active fragments) should be in a single stand.

For databases on MarkLogic 7 (and later), you can now expect to see more stands - each with a maximum size of 32GB.

This means you should expect to see your data in more stands than you would have done on prior versions of the product, but it also means that you can lower the amount of disk space you need due to this size restriction.

From MarkLogic 7 onwards - with the merge max size correctly set - the largest amount of space a single merge operation should require would be 64GB.
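
The 64GB figure follows from the stand size cap. Below is a minimal sketch of the arithmetic, assuming the worst case is a merge whose candidate stands total roughly the merge max size and whose output stand is the same size, with both sets of stands on disk until the merge completes. The function name is hypothetical.

```python
def peak_merge_space_gb(merge_max_size_mb=32768):
    """Upper bound on disk used by one merge: input stands plus output stand."""
    cap_gb = merge_max_size_mb / 1024
    inputs_gb = cap_gb  # candidate stands, kept until the merge completes
    output_gb = cap_gb  # new stand, limited by the merge max size
    return inputs_gb + output_gb


print(peak_merge_space_gb())  # 64.0
```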

... but why 1.5X?

If we return to this line in the documentation:

  • For example, if you plan on loading content that will result in a 100 GB database, reserve at least 150GB of disk space. The disk space reserve is required for merges.

Given that we now have an upper limit on the size of a stand (32GB), as two smaller stands are being merged to create the new, larger stand and given the space required by other concurrent operations that may be taking place in other stands, a space limit of 1.5X should now cover any merges (and subsequent updates to documents).

For a further understanding of the 1.5X rule, read our knowledgebase article 'Explanation of the 1.5X Disk Space Requirement'.

How do I find out whether my database is configured for this new merge max size?

Open the admin interface at http://[yourhostname]:8001

Go to: Configure > Databases > [Your Database Name] > Merge Policy

On the right-hand panel, you should see the merge max size; the default should now be 32768.
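
The same value can also be read without the admin interface. The sketch below is one possible way to do this in Python. It assumes that the MarkLogic Management REST API is reachable on port 8002 with digest authentication, and that the database properties document exposes a `merge-max-size` field. The host, database name and credentials are placeholders, and both helper names are hypothetical.

```python
import json
import urllib.request

ML7_DEFAULT_MERGE_MAX_SIZE_MB = 32768  # default quoted in this article


def fetch_database_properties(host, database, user, password):
    """Fetch database properties as a dict (assumed endpoint, digest auth)."""
    url = f"http://{host}:8002/manage/v2/databases/{database}/properties?format=json"
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(password_mgr)
    )
    with opener.open(url) as response:
        return json.load(response)


def uses_default_merge_max_size(properties):
    """True if the properties report the 32768 MB merge max size."""
    return properties.get("merge-max-size") == ML7_DEFAULT_MERGE_MAX_SIZE_MB
```

With a reachable server, the two helpers could be combined as `uses_default_merge_max_size(fetch_database_properties("yourhostname", "Your-Database-Name", "user", "password"))`.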

Important caveats

MarkLogic 7 is designed to allow you to work with more stands. While it is still safe to say that a system with a very large number of small stands is a cause for concern, the new stand size limit requires a shift in thinking, and this has implications in particular when you start to think about applying the 1.5x disk space rule in your environment.

In releases prior to MarkLogic 7, the expectation (over time) was that all data in a forest would ultimately attempt to get merged into a single stand.

In MarkLogic 7, at least with the default merge-max-size setting (32768, that is, 32GB), it is understood that a reasonably large forest would now be divided into a number of 32GB stands.

If you are strictly following this rule for all reasonably large forests on your system, then the 1.5x rule can safely be used operationally in a production environment. Relying on the rule does, however, require careful management when migrating an existing system, as running out of disk space can have catastrophic consequences for a live system.

For very small forests, the 1.5X rule does not apply. Due to the 32GB stand size overhead, your forests need to be sufficiently large in order to use the 1.5X rule (the requirements guideline quoted above asks for 3 times the forest size for each forest smaller than 32GB).

You should treat the 1.5x rule as an absolute minimum requirement for disk space for a given database. If you are going to use it, we would recommend having a strategy in place for allocating more space until you are confident that the cluster can run safely within the lower (1.5x) boundaries.

I'm upgrading from an earlier version of MarkLogic to MarkLogic 7 - I have changed the merge max size to 32768. Can I reclaim the disk space?

It's important to note that the 1.5x guidelines will only work if your forests all contain stands that have the new maximum size of 32GB. If your forests still contain larger stands, you'll need to break these down before you can consider reclaiming disk space. 

... Breaking Large Stands Down

If your forests contain stands larger than 32 GB, you will want to break these stands down in order to take advantage of the lower disk space requirements.

Different techniques can be followed to break the stands and reclaim disk space:

  1. Re-ingesting the content of the forests with large stands - When documents are re-ingested in a forest, the old fragments will be marked as deleted and the new fragments will be written to a new stand. Once there are sufficient deleted fragments, the large stands will be merged down into smaller stands.
  2. Performing a re-index - A forced re-index will update every fragment in the database, effectively re-loading the content - the original fragments will be marked as deleted and the new fragments will be written to a new stand. Once there are sufficient deleted fragments, the large stands will be merged down into smaller stands.
  3. Forest rebalancing - With the merge max size configured, rebalance the active fragments from the existing forests into other forests and then retire the old forests. This merges out the deleted fragments in the old stands and keeps the active fragments in smaller stands in the rebalanced forests.
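
Before choosing one of these techniques, it helps to know which stands exceed the limit. The following is a small sketch in Python, assuming you have already collected each stand's on-disk size in MB from your monitoring of choice. The function name is hypothetical, and the stand names and sizes are made up for illustration.

```python
MERGE_MAX_SIZE_MB = 32768  # 32GB, the MarkLogic 7 default quoted above


def oversized_stands(stand_sizes_mb, limit_mb=MERGE_MAX_SIZE_MB):
    """Return the names of stands larger than the merge max size, sorted."""
    return sorted(name for name, size in stand_sizes_mb.items() if size > limit_mb)


# Hypothetical stand names and sizes (MB) for a single forest.
stands = {"00000a1": 81920, "00000a2": 32768, "00000a3": 4096}
print(oversized_stands(stands))  # ['00000a1']
```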

Conclusion

The major points for the 1.5X rule:

  • The estimated 1.5X disk space utilization is only true for databases where merge-max-size is correctly set and for forests that are sufficiently large. For databases created in MarkLogic Server v7 or later, the default merge-max-size is 32768 (32GB).
  • If you're upgrading from earlier releases, you would need to make sure you set this value as part of your upgrade process.
    • After upgrading from a version prior to MarkLogic 7, you will have to take explicit steps to decrease the size of any pre-existing large stands.

 
