Using database replication to copy all data from a 3-node cluster to a single node: a step-by-step guide
20 November 2017 04:41 AM
Database replication can be used to replicate the contents of a database spanning a cluster to forests on a similarly configured cluster. The most common use of database replication is to keep two identical MarkLogic clusters in sync with each other.
Database replication can also be thought of as a way to quickly and effectively make a backup of all your forest data for a given database. For example, it can be used in situations where you want to copy the contents of a database from a master cluster to a single (foreign) node.
In this Knowledgebase article, we will walk through the process of configuring database replication to safely replicate the contents of a live database (spanning a 3-node cluster) onto a single MarkLogic instance. Such a process could be used if you need to - for example - create a development environment that contains real application data.
Before you make any changes, please take note of the following points:
1. Enabling replication - and copying all your forest data over to another host - will have some overhead on network traffic and additional I/O overhead on each of the hosts in the master cluster. Please ensure that your system is able to cope with the additional overhead before attempting the work outlined in this article.
2. After the replication process has completed, the target forests will switch from the status of async replicating to sync replicating. If the I/O capacity of the replica cannot keep up with the master, this could affect performance on the primary cluster by forcing the lag limit to be observed. This is explained in detail in our documentation under the section on replication lag: Database Replication Guide - Replication Lag
As a result of the increased workload that replication will place on your primary cluster, please ensure that you have enough resources and - if necessary - arrange for the majority of the work to be completed at a time when traffic on the primary cluster is low.
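To make the lag limit point above more concrete, here is a minimal sketch of the idea in Python. It is an illustration only and not MarkLogic code: the function names, the timestamps and the lag limit value are all hypothetical, and the assumption is simply that the master is held back whenever the replica trails it by more than the configured limit.

```python
def replication_lag(master_commit_time, replica_applied_time):
    """Seconds by which the replica trails the master (never negative)."""
    return max(0.0, master_commit_time - replica_applied_time)


def exceeds_lag_limit(lag_seconds, lag_limit_seconds):
    """True when the replica is further behind than the lag limit allows."""
    return lag_seconds > lag_limit_seconds


# Hypothetical value for illustration; check the setting on your own database.
LAG_LIMIT_SECONDS = 15.0

# Hypothetical (master commit time, replica applied time) pairs, in seconds.
samples = [(100.0, 98.0), (200.0, 170.0)]

for master_time, replica_time in samples:
    lag = replication_lag(master_time, replica_time)
    if exceeds_lag_limit(lag, LAG_LIMIT_SECONDS):
        print(f"lag {lag:.1f}s: over the limit, updates on the master may be delayed")
    else:
        print(f"lag {lag:.1f}s: within the limit")
```

The practical consequence is the one described above: a destination host with slow disks can slow down updates on the primary cluster once the initial copy has finished.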
As a prerequisite, we are starting with a 3-node cluster that contains a specific database (in the context of this example, the database is called "application"). The database contains 12 forests; 6 of these are master forests (2 on each of the 3 nodes in the cluster) and the other 6 are replicas of those masters, which are used for forest-level failover.
In order to take our "backup" of this database, we will need to copy the contents of each of the six forests and we're going to use database replication to copy their contents over to a single host that has been prepared for this task.
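The forest arithmetic for this example can be sketched as follows. This is only an illustration: the host names are hypothetical, and the assignment of particular forests to particular nodes is an assumption (the article only states that each node holds 2 master forests). The forest names follow the application-01 to application-06 convention used in this walkthrough.

```python
from collections import Counter

DATABASE = "application"
HOSTS = ["node-1", "node-2", "node-3"]  # hypothetical host names
MASTERS_PER_HOST = 2


def master_layout(database, hosts, per_host):
    """Map each master forest name to a host, numbering forests sequentially.

    Which forest lives on which host is an assumption made for illustration.
    """
    layout = {}
    number = 1
    for host in hosts:
        for _ in range(per_host):
            layout[f"{database}-{number:02d}"] = host
            number += 1
    return layout


layout = master_layout(DATABASE, HOSTS, MASTERS_PER_HOST)
master_count = len(layout)
replica_count = master_count  # one failover replica per master forest
total_forests = master_count + replica_count

print("forests to replicate:", sorted(layout))
print("masters per host:", dict(Counter(layout.values())))
print("total forests in the database:", total_forests)
```

Only the 6 master forests need a counterpart on the destination host; the 6 failover replicas hold the same content and are not copied separately.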
Listed below are the steps required to perform this task:
1. Review the content of the Master database
This is the master "application" database (6 forests, 6 replicas on a 3-node cluster). As you can see, there are almost 19 million documents stored within this database (across 6 primary forests). The primary forests are identifiable by their naming convention: the database name as a prefix followed by a sequential number (in this case, application-01 to application-06):
2. Review the configuration of the single destination host
This is the single node host that we want to replicate to - here we have the same database (called "application") and a matching number of forests (6 forests) with matching names (application-01 to application-06). The forests are attached but the database currently remains empty:
3. Ensure that Database Replication is not currently configured for this database on the primary (3-node) cluster
On the master, we're going to select Configure > Databases > application > Database Replication and confirm that Database Replication is not currently set up:
4. Set up the new host as a "foreign cluster" (the target system that will receive all the data copied from the master)
On the master (Configure > Databases > application > Database Replication), select the Configure tab and use the "Select one here" link; this allows us to set up our foreign cluster (which is another way to describe our single host that we're going to replicate our data to):
5. Add the host details for the foreign cluster
Here we're entering the host name for the single host, so the master 3-node cluster can establish contact with it and start the process of configuring database replication.
6. Accept the remaining defaults by clicking on 'ok'
Accept all the default settings - for this walkthrough we're following the simplest path to configuring database replication (from one set of forests whose names match a corresponding set of forests on another host).
7. Confirm that your hostname is now listed as a configured foreign cluster
Confirm that the foreign cluster is now configured - if everything worked out, you should see something like this on-screen:
8. Set up Database replication on the master cluster
Now that we have the foreign cluster configured, we can set up the database to replicate the data over.
On the master, go back to Configure > Databases > application > Database Replication, click on the Configure tab and ensure that the foreign cluster is now correctly identified by the master; if it is, click 'ok' for the next part of the setup process:
9. Choose the default "Connect By Name" strategy.
In this example, the forest names are identical, so the Replicas can be matched up by name. MarkLogic has identified that the forest names match so it's generated a table to show the mapping that database replication will be using.
Allow MarkLogic to Connect By Name to the replica set and select ok:
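The name matching described in this step can be sketched in a few lines of Python. This is not how MarkLogic implements Connect By Name; `connect_by_name` is a hypothetical helper that only illustrates the idea of pairing each master forest with the foreign forest of the same name, and the forest lists are assumed to be typed in by hand.

```python
def connect_by_name(master_forests, foreign_forests):
    """Pair each master forest with the foreign forest of the same name.

    Returns (mapping, unmatched), where unmatched lists the master forests
    that have no identically named forest on the foreign host.
    """
    foreign = set(foreign_forests)
    mapping = {name: name for name in master_forests if name in foreign}
    unmatched = [name for name in master_forests if name not in foreign]
    return mapping, unmatched


# Forest names taken from this walkthrough.
master_forests = [f"application-{i:02d}" for i in range(1, 7)]
foreign_forests = [f"application-{i:02d}" for i in range(1, 7)]

mapping, unmatched = connect_by_name(master_forests, foreign_forests)

for master_name, foreign_name in mapping.items():
    print(f"{master_name} -> {foreign_name}")
if unmatched:
    print("no match on the foreign host for:", unmatched)
```

If any name appeared in the unmatched list, the simple Connect By Name path in this walkthrough would not cover that forest, so it is worth checking the forest names on both sides before starting.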
10. Check the database status of the foreign cluster
Confirm that replication is now taking place on the single host (the application database on the foreign cluster host); if it's worked, all forests will be listed with a state of syncing replica and you should see the number of documents starting to increase:
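If you want to keep a simple record of progress while the copy is running, the check in this step can be sketched as below. This is an illustration under stated assumptions: the forest states and document counts are assumed to be read from the database status page and entered by hand, the helper names are hypothetical, and the document counts shown are made-up example values.

```python
from collections import Counter


def summarize_states(forest_states):
    """Count how many forests are in each state."""
    return Counter(forest_states.values())


def still_syncing(forest_states, syncing_state="syncing replica"):
    """Return the sorted names of forests that report the syncing state."""
    return sorted(name for name, state in forest_states.items()
                  if state == syncing_state)


def percent_copied(replica_docs, master_docs):
    """Rough progress figure based on document counts on both sides."""
    if master_docs <= 0:
        return 100.0
    return min(100.0, 100.0 * replica_docs / master_docs)


# States as they would appear just after replication has been configured.
states = {f"application-{i:02d}": "syncing replica" for i in range(1, 7)}

print("state summary:", dict(summarize_states(states)))
print("forests still syncing:", still_syncing(states))

# Hypothetical document counts, for illustration only.
print(f"copied so far: {percent_copied(4_750_000, 19_000_000):.1f}%")
```

Document counts give only a rough indication of progress, since the master database may still be receiving updates while the copy is under way.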