"Repairing out of order string range index messages" in your Erro...

Introduction

After upgrading to MarkLogic 10.x from any previous version of MarkLogic, Warning and Notice level messages such as the following may be observed in the ErrorLogs:

Warning: Lexicon '/var/opt/MarkLogic/Forests/Documents/00000006/c4ea1b602ee84a34+Lexicon' collation='http://marklogic.com/collation/zh-Hant' out of order

Notice: Repairing out of order lexicon /var/opt/MarkLogic/Forests/Documents/00000006/c4ea1b602ee84a34+Lexicon collation 'http://marklogic.com/collation/zh-Hant' version 0 to 602

Warning: String range index /space/Forests/Documents/0006ef0e/c0dc932d1b4bcaae-37c6e3905909f64e+string collation 'http://marklogic.com/collation/' out of order.

Notice: Repairing out of order string range index /space/Forests/Documents/0006ef0e/c0dc932d1b4bcaae-37c6e3905909f64e+string collation 'http://marklogic.com/collation/' version 0 to 602

Starting with MarkLogic 10.0, the server now automatically checks for any lexicons or string range indexes that may be in need of repair. Lexicons and range indexes perform "self-healing" in non-read-only stands whenever a lexicon/range index is opened within the stand.

Reason

This is due to changes introduced to the behavior of MarkLogic's root collation.

Starting with MarkLogic 10.0, the root collation has been modified, along with all collations that derive from it, which means there may be some subtle differences in search ordering.

For more information on the specifics of these changes, please refer to http://www.unicode.org/Public/UCA/6.0.0/CollationAuxiliary.html

This helps the server to support newer collation features, such as reordering entire blocks of script characters (for example: Latin, Greek, and others) with respect to each other.

Implementing these changes has, under some circumstances, improved the performance of wildcard matching by more effectively limiting the character ranges that search scans (and returns) for wildcard-based matching.

Based on our testing, we believe this new ordering yields better performance in a number of circumstances, although it does create the need to perform full reindexing of any lexicon or string range index using the root collation.

MarkLogic Server will now check lexicons and string range indexes and will try to repair them where necessary. During the evaluation, MarkLogic Server will skip making further changes if any of the following conditions apply:

(a) They are already ordered according to the latest specification provided by ICU (1.8 at the time of writing)

(b) MarkLogic Server has already checked the stand and associated lexicons and indexes

Whenever MarkLogic performs any repairs, it will always log a message at Notice level to inform users of the changes made. If for any reason, MarkLogic Server is unable to make changes (e.g. a forest is mounted as read-only), MarkLogic will skip the repair process and nothing will be logged.

As these changes have been introduced from MarkLogic 10 onwards, you will most likely observe these messages in cases where recent upgrades (from prior releases of the product) have just taken place.

Repairs are performed on a stand by stand basis, so if a stand does not contain any values that require ordering changes, you will not see any messages logged for that stand.

Also, if any ordering issues are encountered during the process of a merge of multiple stands, there will only be one message logged for the merge, not one for each individual stand involved in that merge.

Summary

Repairs will take place for any stand found to have a lexicon or string range index whose collation is both out of order and out of date (e.g. using a collation described by an earlier version of ICU), unless that stand is mounted as read-only.
Any repair will generate Notice messages when the maintenance takes place.
The check/repair takes place whenever a lexicon or string range index is opened: for any lexicon call (e.g. cts:values), any range query (e.g. cts:element-range-query), and during merges.
The check looks for ICU version mismatches as well as items that are out of order, so for any lexicon or string range index with older ordering that requires no further changes, no further action will be taken for that stand.
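
To get a sense of how many repairs took place after an upgrade, the Notice messages can be tallied from an ErrorLog. The following is a minimal Python sketch; the regular expression is inferred from the two sample Notice lines shown in this article (an assumption, not a documented log format), and the function name summarize_repairs is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample Notice lines in this article.
# This is an assumption, not an official log format specification.
REPAIR_RE = re.compile(
    r"Notice: Repairing out of order (lexicon|string range index) (\S+) "
    r"collation '([^']*)' version (\d+) to (\d+)"
)

def summarize_repairs(lines):
    """Count repair notices per (index kind, collation)."""
    counts = Counter()
    for line in lines:
        m = REPAIR_RE.search(line)
        if m:
            kind, _path, collation, _old, _new = m.groups()
            counts[(kind, collation)] += 1
    return counts

# The two Notice lines quoted earlier in this article.
sample = [
    "Notice: Repairing out of order lexicon /var/opt/MarkLogic/Forests/Documents/00000006/c4ea1b602ee84a34+Lexicon collation 'http://marklogic.com/collation/zh-Hant' version 0 to 602",
    "Notice: Repairing out of order string range index /space/Forests/Documents/0006ef0e/c0dc932d1b4bcaae-37c6e3905909f64e+string collation 'http://marklogic.com/collation/' version 0 to 602",
]

for (kind, collation), n in summarize_repairs(sample).items():
    print(kind, collation, n)
```

In practice the lines would come from reading an ErrorLog file rather than from an in-memory list.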

Known side effects

If the string range index or lexicon is very large, repairing can cause some performance overhead and may impact search performance during the repair process.

Solution

These messages can be avoided by issuing a full reindex of your databases immediately after performing your upgrade to MarkLogic 10.

Adding RAM to your host

Summary

When changing the amount of RAM on your MarkLogic Server host, there are additional considerations such as cache settings and swap space.

Group Cache Settings

As a ‘Rule of Thumb’, the memory allocated to group caches (List, Compressed Tree and Expanded Tree) on a host should come out to be about 1/3 to 3/8 of main memory. Increasing the group caches beyond this ratio can result in excessive swapping which will adversely affect performance.

For E/D-nodes: this can be distributed as 1/8 of main memory for the list cache, 1/16 for the compressed tree cache, and 1/8 for the expanded tree cache.
For E-nodes: these can be configured with a larger Expanded Tree Cache; the List Cache and the Compressed Tree Cache can be set to 128MB each.
For D-nodes: these can be configured with a larger List Cache and Compressed Tree Cache; the Expanded Tree Cache can be set to 128MB.
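
As an illustration of the E/D-node split described above, the following Python sketch computes the three cache sizes from the amount of RAM. The function name and the choice of megabytes as the unit are assumptions made for this example; only the fractions come from this article.

```python
from fractions import Fraction

# Fractions of main memory for a combined E/D-node, as given in this article.
ED_NODE_FRACTIONS = {
    "list_cache": Fraction(1, 8),
    "compressed_tree_cache": Fraction(1, 16),
    "expanded_tree_cache": Fraction(1, 8),
}

def ed_node_cache_sizes_mb(ram_gb):
    """Return suggested cache sizes in MB for an E/D-node with ram_gb of RAM."""
    ram_mb = ram_gb * 1024
    return {name: int(ram_mb * frac) for name, frac in ED_NODE_FRACTIONS.items()}

sizes = ed_node_cache_sizes_mb(64)
print(sizes)
print("total fraction of RAM:", sum(ED_NODE_FRACTIONS.values()))
```

The three fractions add up to 5/16 of main memory, which sits at the lower end of the 1/3 to 3/8 rule of thumb.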

Swap Space (Linux)

Linux Huge Pages should be set to 3/8 the size of your physical memory.
Swap space should be set to the size of your physical memory minus the size of your Huge Pages (because Linux Huge Pages are not swapped), or 32GB, whichever is lower.

For example, with 96GB of RAM, Huge Pages should be set to 36GB (3/8 of 96GB) and swap space to 32GB (96GB minus 36GB is 60GB, which exceeds the 32GB cap).
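
The two Linux rules above can be sketched as a small Python calculation. The function name is hypothetical; the 3/8 ratio and the 32GB cap are taken from this article.

```python
def linux_memory_plan_gb(ram_gb):
    """Return (huge_pages_gb, swap_gb) following the rules in this article."""
    # Huge Pages: 3/8 of physical memory.
    huge_pages = ram_gb * 3 / 8
    # Swap: physical memory minus Huge Pages, capped at 32GB.
    swap = min(ram_gb - huge_pages, 32.0)
    return huge_pages, swap

print(linux_memory_plan_gb(96))
```

For 96GB of RAM this reproduces the example above: 36GB of Huge Pages and 32GB of swap.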

Swap Space (Solaris)

Swap space should be twice the size of physical memory.

Solaris technical note:

Why do we recommend 2x the size of main memory allocated to swap space? When MarkLogic allocates virtual memory on Solaris with mmap and the MAP_ANON option, we do not specify MAP_NORESERVE, and instead let the kernel reserve swap space at the time of allocation. We do this so that if we run out of swap space we find out by the memory allocation failing with an error, rather than the process getting killed at some inopportune time with SIGBUS. The process could be using nearly all of physical memory, which explains why you need at least 1X physical memory in swap space.

MarkLogic Server uses the standard fork() and exec() idiom to run the pstack program. The pstack program can be a critically important tool for the diagnosis of server problems. We can’t use vfork() to run pstack instead of fork() because it’s not safe for multithreaded programs. When a process calls fork(), the kernel makes a virtual copy of all the address space of the process, so it also reserves swap space for this anonymously mapped memory for the forked process. Of course, immediately after forking, the forked process calls exec() on the pstack program, which frees that reserved memory. Unlike Linux, Solaris doesn’t overbook swap space, so if the kernel cannot reserve the swap space, fork() fails. That's why you need 2X physical memory for swap space on Solaris.

Page File (Windows)

On a Windows system, the page file should be twice the size of physical memory. You can set the page file size in the virtual memory section in the advanced system settings from the Windows Control Panel.

Performance Solution?

Increasing the amount of RAM is not a "cure all" for performance issues on the server. Even when memory appears to be the resource bottleneck, increasing RAM may not be the required solution. However, here are a few scenarios where increasing RAM on your server may be appropriate:

You need to increase group cache sizes because your cache miss rate is too high or your queries are failing with cache full errors. In this case, increasing RAM can give you additional flexibility on how the group caches are configured. However, an alternative solution that could result in even greater performance improvements may involve reworking your queries and index settings so that the queries can be fully resolved during the index resolution phase of the query evaluation.
While monitoring your server for swap utilization, you notice that the server is swapping often, and you have already verified that your memory and cache settings are within the MarkLogic recommendations. The system should be configured so that swapping does not occur during normal operation, as swapping can severely degrade performance. If swapping persists, adding RAM may improve performance.

Increasing RAM on your server may only be a temporary fix. If your queries do not scale, then, as the data size in your forests grows, you may once again hit the issues that caused you to increase your RAM in the first place. If this is the case, evaluate your queries and indexes to make them more efficient.

Alternatives to Configuration Manager

Overview

The MarkLogic Server Configuration Manager provided a read-only user interface to the MarkLogic Admin UI and could be used for saving and restoring configuration settings. The Configuration Manager tool was deprecated starting with MarkLogic 9.0-5, and is no longer available in MarkLogic 10.

Alternatives

There are a number of alternatives to the Configuration Manager. Most of the options take advantage of the MarkLogic Admin API, either directly or behind the scenes. The following is a list of the most commonly used options:

Manual Configuration
ml-gradle
Configuration Management API

Manual Configuration

For a single environment, the Knowledgebase article Transporting Resources to a New Cluster covers the process.

ml-gradle

For a repeatable process, the most widely used approach is ml-gradle.

A project would be created in Gradle, with the desired configurations. The project can then be used to deploy to any environment - test, prod, qa etc - creating a known configuration that can be maintained under source control, which is a best practice.

Similar to Configuration Manager, ml-gradle also allows for exporting the configuration of an existing cluster. You can refer to transporting configuration using ml-gradle for more details.

While ml-gradle is an open source community project that is not directly supported, it enjoys very good community and developer support. The underlying APIs that ml-gradle uses are fully supported by MarkLogic.

Configuration Management API

An additional option is to use the Configuration Management API directly to export and import resources.

Summary

Both ml-gradle and the Configuration Management API use the MarkLogic Admin API behind the scenes but, for most use cases, our recommendation is to use ml-gradle rather than writing the same functionality from scratch.

Amazon's EC2 ports for MarkLogic Server

Summary

This article answers the question "What ports need to be open in my Security Group in order to run MarkLogic Server on Amazon's EC2?"

Details

To run MarkLogic Server on Amazon's EC2, you'll need to open a port range from 7998-8002 in the appropriate Security Group.
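
Once the Security Group has been updated, reachability of the range can be checked from a client machine. The following Python sketch is one way to do that using only the standard library; the helper names are hypothetical, and the host you pass in would be your own instance's address.

```python
import socket

# Port range quoted in this article for MarkLogic Server on EC2.
MARKLOGIC_PORTS = list(range(7998, 8003))  # 7998 through 8002 inclusive

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def closed_ports(host, ports=MARKLOGIC_PORTS):
    """Return the ports in the range that could not be reached."""
    return [p for p in ports if not port_is_open(host, p)]

print(MARKLOGIC_PORTS)
```

Calling closed_ports with your instance's public DNS name returns an empty list when every port in the range accepts connections. A port can also appear closed simply because nothing is listening on it yet, so this is a rough check rather than a verdict on the Security Group itself.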

Authenticating MarkLogic users with Kerberos

Introduction

MarkLogic Server can be configured so that users are authenticated using an external authentication protocol, such as Lightweight Directory Access Protocol (LDAP) or Kerberos. These external agents serve as centralised points of authentication or repositories for user information from which authorisation decisions can be made. In this article you will see the steps required to authenticate a user using Kerberos.

Authenticating MarkLogic users with Kerberos

Kerberos is a ticket-based authentication protocol for trusted hosts on untrusted networks which provides users with encrypted tickets that can be used to request access to particular servers.

Because Kerberos uses tickets, both the user and the server can verify each other's identity and user passwords do not have to pass through the network. This article will also show you how to configure MarkLogic Server to validate Kerberos user tickets and to map them to a MarkLogic database user on your cluster.

Kerberos user principals are defined in the format username@REALM.NAME. For this article we will use the Kerberos principal ml1@MLKRB.LOCAL and the MarkLogic userid krbuser1.

Configuring the MarkLogic cluster

Before MarkLogic can validate Kerberos user tickets the following requirements need to be met.

Configure the Kerberos client on each server on which a MarkLogic instance is running.
Create a Kerberos KeyTab to allow MarkLogic to authenticate users.
Create a MarkLogic External Security configuration for Kerberos.
Configure the MarkLogic AppServer to authenticate Kerberos tickets.
Add an External Kerberos principal to the MarkLogic user.

Configuring the Kerberos client

In order to authenticate Kerberos users, MarkLogic needs to know which host domains it will be authenticating and which Kerberos realms to authenticate against. This information is held in the /etc/krb5.conf file on Unix or the krb5.ini file on Windows (the location depends on the Windows Kerberos implementation being used, i.e. Active Directory or MIT Kerberos).

In this example, the MarkLogic servers are installed on the hosts mwca.marklogic.com, mwcb.marklogic.com and mwcc.marklogic.com, and the Kerberos Key Distribution Center (KDC) resides at kerberos.marklogic.com.

The [domain_realm] section is a series of host domains and realms mappings, in this case all our example MarkLogic hosts are using the .marklogic.com domain and will be using the MLKRB.LOCAL realm.

The [realms] section defines the realms that this client can access and the associated KDC that will be used.

A sample krb5.conf file:

[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log

[libdefaults]
dns_lookup_realm = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
rdns = false
default_realm = MLKRB.LOCAL
default_ccache_name = KEYRING:persistent:%{uid}

[realms]
MLKRB.LOCAL = {
kdc = kerberos.marklogic.com
}

[domain_realm]
.marklogic.com = MLKRB.LOCAL
marklogic.com = MLKRB.LOCAL

Note: If the server is already configured to use Kerberos you should simply merge your new realm and domain settings.

Creating a Kerberos Keytab

The Kerberos Domain Administrator must create a services.keytab file for each MarkLogic instance to permit it to authenticate Kerberos users. This is done by issuing the addprinc and ktpass commands on the Kerberos Domain Controller.

addprinc -randkey HTTP/mwca.marklogic.com

ktpass /princ HTTP/mwca.marklogic.com@MLKRB.LOCAL /mapuser mwca.marklogic.com@MLKRB.LOCAL /pass mysecret /out services.keytab

Example

[kadmin@kerberos ~]# kadmin.local
Authenticating as principal krbadmin/admin@MLKRB.LOCAL with password.

kadmin.local: addprinc -randkey HTTP/mwca.marklogic.com
WARNING: no policy specified for HTTP/mwca.marklogic.com@MLKRB.LOCAL; defaulting to no policy
Principal "HTTP/mwca.marklogic.com@MLKRB.LOCAL" created.

kadmin.local: ktadd -k services.keytab HTTP/mwca.marklogic.com
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type camellia256-cts-cmac added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type camellia128-cts-cmac added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type des-hmac-sha1 added to keytab WRFILE:services.keytab.
Entry for principal HTTP/mwca.marklogic.com with kvno 2, encryption type des-cbc-md5 added to keytab WRFILE:services.keytab.

Repeat the above steps for each MarkLogic host in the cluster and copy the resultant services.keytab file to the corresponding MarkLogic Server Data directory.

Creating a MarkLogic External Security configuration

On the MarkLogic Server "Configure->Security->External Security" panel create a new Kerberos External Security configuration as below:

Configuring the MarkLogic AppServer

On the MarkLogic Server "Configure->Groups->{group-name}->AppServers" panel configure the AppServer to use kerberos-ticket as the authentication method and specify the external security definition created in the previous step:

Add the External Kerberos principal to the MarkLogic server userid

On the MarkLogic Server "Configure->Security->Users" panel add the external Kerberos principal name to the user:

Verify everything is working as expected

From a Kerberos enabled client machine, create a new ticket for your Kerberos user principal using the kinit command:

[martin@local1]# kinit ml1@MLKRB.LOCAL

Password for ml1@MLKRB.LOCAL:

Check the status of the ticket using the klist command

[martin@local1]# klist

Ticket cache: KEYRING:persistent:0:0
Default principal: ml1@MLKRB.LOCAL

Valid starting Expires Service principal
25/09/16 13:22:59 26/09/16 13:12:49 HTTP/mwca.marklogic.com@MLKRB.LOCAL

Negotiate a user connection to the AppServer using curl; specifying the -u switch without a userid and password makes curl use the Kerberos ticket created previously.

[martin@local]# curl -v --negotiate -u : http://mwca.marklogic.com:8050

* About to connect() to mwca.marklogic.com port 8050 (#0)
* Trying 192.168.0.50...
* Connected to mwca.marklogic.com (192.168.0.50) port 8050 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: mwca.marklogic.com:8050
> Accept: */*
>
...
* Connected to mwca.marklogic.com (192.168.0.50) port 8050 (#0)
* Server auth using GSS-Negotiate with user ''
> GET / HTTP/1.1
> Authorization: Negotiate YIICTwYJKoZIhvcSAQICAQBuggI+MIICOqADAgEFoQMCAQ6iBwMFACAAAACjggFUYYIBUDCCAUygAwIBBaENGwtNTEtSQi5MT0NBTKIiMCCgAwIBA6EZMBcbBEhUVFAbD213Y2EuZHluZG5zLm9yZ6OCARAwggEMoAMCARKhAwIBAqKB/wSB/Fzt6twxVRCPEWVzLq/h6ZV0MQ95iu9sKgNc1Rg+K4EmDBK1z4IHuHGYuYyV42rGZIA4rmF0NJe398b/uzGf3ViY+UHxNSlyj+BKSD6Q2rjAcYzsGsbXeebnClIDd/+/hN7DLrfWZ7HtuXMrAl0ifqnXSnd045ACUGXz4FAKAuAdJYtDUqT3UZ8+K4ExuGWkyViRhLOuTxphS49vMJ+uaRPZo9jiNkfnjZIj2esNChpz/urXCnlTT1Frrg0gPVlS1unAH4pRWg5DLFnajjg722UXR0P6fb/U3kbRxCCu1F1bJNjjAlTLtyhO4ZNh1LQ+28sYf3DnbNPFQZT1j6SBzDCByaADAgESooHBBIG+cik1lxaeURclPAi9t7x8kFt043KnsE4rv7quBbIET6wPgSu60YwuHjBS8xchgoNbJKp4BHNBjoKEBvNVcU1iqU8cuhYGJqmYkiu/DMGQb/pF4AApR09Azj4fWDmZpEcMMCWZFW6idRc9zmk1a0kjM8tkuA5jEH3M1ggev60mLM33ZkZRI5QhrFlDtfwMvJhfsve9sTSdlJPG7nWYgwUcfZN7BmL96O1P8zQCwFeUuICJO9Edlv3RZgiKBXJmnw==
> User-Agent: curl/7.29.0
> Host: mwca.marklogic.com:8050
> Accept: */*
>
< HTTP/1.1 200 OK

If MarkLogic is able to successfully authenticate the Kerberos user, you will see an HTTP 200 code in response to the curl command, and the MarkLogic AppServer logs should show the details of the external user mapping:

External User(ml1@MLKRB.LOCAL) is Mapped to User(krbuser1)

192.168.0.50 - - [25/Sep/2016:13:48:30 +0100] "GET / HTTP/1.1" 200 2103 - "curl/7.29.0"

Troubleshooting

The following is a list of common problems encountered when authenticating with Kerberos.

Unable to generate a Kerberos token

Check that the kdc parameter in krb5.conf file points to a valid Kerberos Domain Controller:

[martin@local ~]# kinit ml1@MLKRB.LOCAL

kinit: Cannot contact any KDC for realm 'MLKRB.LOCAL' while getting initial credentials

Unauthorised 401 response due to gss_init_sec_context() failed: : No Kerberos credentials available error

This indicates that the Kerberos ticket is missing or invalid; use the klist command to check current ticket status and create a new ticket with kinit if required:

[martin@local ~]# curl -v --negotiate -u : http://mwca.marklogic.com:8050

* About to connect() to mwca.marklogic.com port 8050 (#0)
* Trying 192.168.0.50...
* Connected to mwca.marklogic.com (192.168.0.50) port 8050 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: mwca.marklogic.com:8050
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Server: MarkLogic
* gss_init_sec_context() failed: : No Kerberos credentials available
< WWW-Authenticate: Negotiate

MarkLogic server not able to validate Kerberos ticket

Check the following:

Ensure the host name specified when creating the services.keytab resolves to a valid IP address.
On servers that have multiple hostnames, use the hostname --fqdn command to determine the correct hostname to use for generating the services.keytab.
The services.keytab will only be used if the file permissions are restricted to read/write access for the MarkLogic daemon user on the host, e.g.

[admin@mwca MarkLogic]# ls -al services.keytab

-rw------- 1 daemon daemon 594 Sep 25 12:33 services.keytab

Debugging Kerberos connections in MarkLogic

On the MarkLogic Server "Configure->Groups->{group-name}->Diagnostics" panel add Kerberos GSS Negotiate to the list of trace events:

Introduction

When using Kerberos to authenticate to a MarkLogic server, the user must first obtain a Kerberos ticket, either by authenticating to a directory server such as Active Directory or by authenticating directly to the Kerberos Domain server using the kinit utility.

For interactive use, this does not pose a problem but for unattended application use such as an XCC/J application, problems can ensue if a previously generated Kerberos ticket has expired.

This article will outline the steps needed to use a "client-side" Kerberos Keytab that can authenticate an XCC/J application without requiring manual intervention to regenerate Kerberos tickets.

Prerequisites

MarkLogic Server and XCC/J 8.0.5 or later
Java 1.7 or later

Configuration steps

1. Create or update the existing services.keytab and add the User Principal that you want to use with XCC, e.g. ml1@MLKRB.LOCAL

[kadmin@mwca1 Data]# kadmin.local
Authenticating as principal mluser1/admin@MLKRB.LOCAL with password.

kadmin.local: listprincs
ml1@MLKRB.LOCAL

kadmin.local: ktadd -k services.keytab ml1@MLKRB.LOCAL

2. Copy the services.keytab file to a path on the Java XCC client machine.

Note: For security reasons ensure that the keytab is only readable by the XCC application userid.

3. Create a Java Authentication and Authorization Service (JAAS) login.conf file with the following contents; change “principal” and “keyTab” entries accordingly

com.sun.security.jgss.krb5.initiate {
com.sun.security.auth.module.Krb5LoginModule required
principal="ml1@MLKRB.LOCAL"
useKeyTab=true
keyTab="/home/ml1/Data/services.keytab"
storeKey=true
debug=true;
};

com.sun.security.jgss.krb5.accept {
com.sun.security.auth.module.Krb5LoginModule required
principal="ml1@MLKRB.LOCAL"
useKeyTab=true
keyTab="/home/ml1/Data/services.keytab"
storeKey=true
debug=true;
};

4. Set the following Java System properties either within the XCC Java application or from the command line, changing login.conf and krb5.conf entries as required.

javax.security.auth.useSubjectCredsOnly=false
java.security.auth.login.config=login.conf
java.security.krb5.conf=/etc/krb5.conf

5. Run the XCC Java application and it should use the Kerberos credentials from the services.keytab to authenticate to the MarkLogic XDBC Server

Example

A simple query to return the current timestamp from a MarkLogic XDBC server.

[ml1@mwca1 ~]$ java -Djavax.security.auth.useSubjectCredsOnly=false -Djava.security.auth.login.config=login.conf -Djava.security.krb5.conf=/etc/krb5.conf com.marklogic.xcc.examples.SimpleQueryRunner xcc://ml1.dyndns.org:8050 query

Debug is true storeKey true useTicketCache false useKeyTab true doNotPrompt false ticketCache is null isInitiator true KeyTab is /home/ml1/Data/services.keytab refreshKrb5Config is false principal is ml1@MLKRB.LOCAL tryFirstPass is false useFirstPass is false storePass is false clearPass is false
principal is ml1@MLKRB.LOCAL
Will use keytab
Commit Succeeded

2016-11-23T18:03:59.457055Z

6. In the MarkLogic AccessLogs you should see the following entries to show a successful Kerberos authentication from the Java XCC Client.

External User(ml1@MLKRB.LOCAL) is Mapped to User(krbuser1)
192.168.0.50 - - [23/Nov/2016:18:03:59 +0000] "POST /eval XDBC/1.0" 200 128 - "Java/1.8.0_66 MarkLogicXCC/8.0-6"

7. When authentication is successfully established "debug=false" can be set in the JAAS login.conf to reduce the verbose logging messages.

AWS Cluster Repair: Replacing a Missing EBS Volume

Summary

Customers using the MarkLogic AWS Cloud Formation Templates may encounter a situation where someone has deleted an EBS volume that stored MarkLogic data (mounted at /var/opt/MarkLogic). Because the volume and its associated data are no longer available, the host is unable to rejoin the cluster.

Getting the host to rejoin the cluster can be complicated, but it will typically be worth the effort if you are running an HA configuration with Primary and Replica forests.

This article details the procedures to get the host to rejoin the cluster.

Preparing the New Volume and New Host

The easiest way to create the new volume is using a snapshot of an existing host's MarkLogic data volume. This saves the work of manually copying configuration files between hosts, which is necessary to get the host to rejoin the cluster.

In the AWS EC2 Dashboard:Elastic Block Store:Volumes section, create a snapshot of the data volume from one of the operational hosts.

Next, in the AWS EC2 Dashboard:Elastic Block Store:Snapshots section, create a new volume from the snapshot in the correct zone and note the new volume id for use later.

(optional) Update the name of the new volume to match the format of the other data volumes

(optional) Delete the snapshot

Edit the Auto Scaling Group with the missing host to bring up a new instance, by increasing the Desired Capacity by 1

This will trigger the Auto Scaling Group to bring up a new instance.

Attaching the New Volume to the New Instance

Once the instance is online and startup is complete, connect to the new instance via ssh.

Ensure MarkLogic is not running, by stopping the service and checking for any remaining processes.

sudo service MarkLogic stop
pgrep -la MarkLogic

Remove /var/opt/MarkLogic if it exists, and is mounted on the root partition.

sudo rm -rf /var/opt/MarkLogic

Edit /var/local/mlcmd and update the volume id listed in the MARKLOGIC_EBS_VOLUME variable to the volume created above.

MARKLOGIC_EBS_VOLUME="[new volume id],:25::gp2::,*"

Run mlcmd to attach and mount the new volume to /var/opt/MarkLogic on the instance

sudo /opt/MarkLogic/mlcmd/bin/mlcmd init-volumes-from-system

Check that the volume has been correctly attached and mounted.

Remove contents of /var/opt/MarkLogic/Forests (if they exist)

sudo rm -rf /var/opt/MarkLogic/Forests/*

Run mlcmd to sync the new volume information to the DynamoDB table

sudo /opt/MarkLogic/mlcmd/bin/mlcmd sync-volumes-to-mdb

Configuring MarkLogic With Empty /var/opt/MarkLogic

If you did not create your volume from a snapshot as detailed above, complete the following steps. If you created your volume from a snapshot, then skip these steps, and continue with Configuring MarkLogic and Rejoining Existing Cluster

Start the MarkLogic service, wait for it to complete its initialization, then stop the MarkLogic service:
- sudo service MarkLogic start
- sudo service MarkLogic stop
Move the configuration files out of /var/opt/MarkLogic/
- sudo mv /var/opt/MarkLogic/*.xml /secure/place (using default settings; destination can be adjusted)
Copy the configuration files from one of the working instances to the new instance
- Configuration files are stored here: /var/opt/MarkLogic/*.xml
- Place a copy of the xml files on the new instance under /var/opt/MarkLogic

Configuring MarkLogic and Rejoining Existing Cluster

Note the host-id of the missing host found in /var/opt/MarkLogic/hosts.xml.

For example, if the missing host is ip-10-0-64-14.ec2.internal
- sudo grep "ip-10-0-64-14.ec2.internal" -B1 /var/opt/MarkLogic/hosts.xml
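
The same lookup can be scripted if the value needs to be captured for later steps. The Python sketch below mimics grep -B1 by returning the line that precedes the first line containing the host name, then pulls the numeric id out of it. The sample fragment is hypothetical: the element layout (a host-id element on the line directly before the host name) is assumed from the grep command above, and the id value is invented.

```python
import re

def line_before_match(lines, needle):
    """Mimic `grep -B1`: return the line preceding the first line containing needle."""
    prev = None
    for line in lines:
        if needle in line:
            return prev
        prev = line
    return None

def host_id_for(lines, host_name):
    """Extract the numeric host-id found on the line before the host name, if any."""
    prev = line_before_match(lines, host_name)
    if prev is None:
        return None
    m = re.search(r"<host-id>(\d+)</host-id>", prev)
    return m.group(1) if m else None

# Hypothetical fragment for illustration only; real files will differ.
sample = [
    "<host>",
    "  <host-id>1234567890</host-id>",
    "  <host-name>ip-10-0-64-14.ec2.internal</host-name>",
    "</host>",
]

print(host_id_for(sample, "ip-10-0-64-14.ec2.internal"))
```

Against the real file, the lines would be read from /var/opt/MarkLogic/hosts.xml, which typically requires elevated permissions.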

Edit /var/opt/MarkLogic/server.xml and update the value for host-id to match the value retrieved above

Start MarkLogic and view the ErrorLog for any issues

sudo service MarkLogic start; sudo tail -f /var/opt/MarkLogic/Logs/ErrorLog.txt

You should see messages about forests synchronizing (if you have local disk failover enabled, with replicas) and changing states from wait or async replication to sync replication. Once all the forests are either 'open' or 'sync replicating', then your cluster is fully operational with the correct number of hosts.

At this point you can fail back to the primary forests on the new instances to rebalance the workload for the cluster.

If you disabled the xdqp ssl enabled setting as part of these procedures, you can now re-enable it by setting its value to true on the Group Configuration page.

Update the Userdata In the Auto Scaling Group

To ensure that the correct volume will be attached if the instance is terminated, the Userdata needs to be updated in a Launch Configuration.

Copy the Launch Configuration associated with the missing host.

Edit the details

(optional) Update the name of the Launch Configuration
Update the User data variable MARKLOGIC_EBS_VOLUME and replace the old volume id with the id for the volume created above.
- MARKLOGIC_EBS_VOLUME="[new volume id],:25::gp2::,*"
Save the new Launch Configuration

Edit the Auto Scaling Group associated with the new node

Change the Launch Configuration to the one that was just created and save the Auto Scaling Group.

Next Steps

Now that normal operations have been restored, it's a good opportunity to ensure you have all the necessary database backups, and that your backup schedule has been reviewed to ensure it meets your requirements.

Basic MarkLogic Server Monitoring Guidelines

Summary

MarkLogic recommends that all production servers be monitored for system health.

Recommendations

For production MarkLogic Server clusters, the system monitoring solution should include the following features:

Enable monitoring history, which allows for the capture and viewing of critical performance data from your cluster. You can learn more about the Monitoring History features by following this link: http://docs.marklogic.com/guide/monitoring/history
- Monitor processes that are running on the system
- Monitor RAM & swap space utilization.
- Monitor I/O device service time, wait time, and queue size; any of these could be indications that the storage system is underpowered or poorly configured.
- Monitor the network for signs of problems that impact application performance. A misconfigured or poorly performing network can have drastic impacts on the performance of an application running on MarkLogic Server.
MarkLogic Error logs should be constantly monitored (and notifications sent) for the following keywords: 'exception', 'SVC-', 'XDMP-', and 'start'. Over time, you may want to refine the keywords, but these may indicate that something is wrong.
Log-file messages should also be monitored based on message level; see Understanding the Log Levels. It's good practice to investigate and resolve important messages promptly.
Switch to debug level logging. Not only will this provide additional information for monitoring system health, but it will also provide additional information to analyze in the event a problem does occur.
Monitor forest sizes, in particular the ratio of forest size to total available disk size (see Memory, Disk Space, and Swap Space Requirements). Alarms should sound if forest sizes increase significantly beyond the target available disk space.
Ensure that the server time is synchronized across all the hosts in the cluster. For example: Use NTP to manage system time across the cluster.
Monitor for host “hot spots.” Uneven host workload could be a symptom that there is an uneven distribution of data across the hosts which may result in performance issues.
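As a rough illustration of the keyword monitoring recommended above, the following Python sketch flags log lines that contain any of the listed keywords. The case-insensitive matching, the function name, and the sample log lines are assumptions made for illustration only; a real deployment would normally feed the ErrorLog files into an existing monitoring tool instead.

```python
import re

# Keywords taken from the recommendation above. Case-insensitive matching is
# an assumption of this sketch, not something the guideline prescribes.
KEYWORDS = ("exception", "SVC-", "XDMP-", "start")
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def flag_lines(lines):
    """Return (line_number, text) pairs for lines containing any keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


# Made-up sample lines, used only to exercise the function.
sample = [
    "2019-01-01 01:01:25.014 Info: Merged 2 MB in 1 sec\n",
    "2019-01-01 01:01:26.100 Error: XDMP-EXTIME: Time limit exceeded\n",
    "2019-01-01 01:02:00.000 Notice: Starting MarkLogic Server\n",
]

for number, text in flag_lines(sample):
    print(number, text)
```

The keyword list is kept in one tuple so that it can be refined over time, as suggested above.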

MarkLogic Server provides a rich set of monitoring features that include a pre-configured monitoring dashboard, and a Management API that allows you to integrate MarkLogic Server with existing monitoring applications or create your own custom monitoring applications.

For additional information regarding the monitoring support in MarkLogic, please refer to the Monitoring MarkLogic Guide available on the MarkLogic developer website.

Best Practice for Adding an Index in Production

Summary

It is sometimes necessary to remove or add an index to your production cluster. For a large database with more than a few GB of content, reindexing can be a time- and resource-intensive process that affects query performance while the server is reindexing. This article points out some strategies for avoiding some of the pain points associated with changing your database configuration on a production cluster.

Preparing your Server for Production

In general, high performance production search implementations run with tight controls on the automatic features of MarkLogic Server.

Re-indexer disabled by default
Format-compatibility set to the latest format
Index-detection set to none.
On a very large cluster (several dozen or more hosts), consider running with expunge-locks set to none
On large clusters with insufficient resources, consider bumping up the default group settings
- xdqp-timeout: from 10 to 30
- host-timeout: from 30 to 90

The xdqp and host timeouts will prevent the server from disconnecting prematurely when a data-node is busy, possibly triggering a false failover event. However, these changes will affect the legitimate time to failover in an HA configuration.

Preparing to Re-index

When an index configuration must be changed in production, you should:

First, index-detection should be set back to automatic
Then, the index configuration change should be made

When you have Database Replication Configured:

If you have to add or modify indexes on a database which has database replication configured, make sure the same changes are made on the Replica cluster as well. Starting with ML server version 9.0-7, index data is also replicated from the Master to the Replica, but it does not automatically check if both sides have the same index settings. Reindexing is disabled by default on a replica cluster. However, when database replication configuration is removed (such as after a disaster), the replica database will reindex as necessary. So it is important that the Replica database index configuration matches the Master’s to avoid unnecessary reindexing.

Note: If you are on a version prior to 9.0-7 - When adding/updating index settings, it is recommended that you update the settings on the Replica database before updating those on the Master database; this is because changes to the index settings on the Replica database only affect newly replicated documents and will not trigger reindexing on existing documents.

After the Re-index

After the re-index has completed, it is important to return to the old settings by disabling the reindexer and setting index-detection back to none.

If you're reindexing over several nights or weekends, be sure to allow some time for the merging to complete. So for example, if your regular busy time starts at 5AM, you may want to disable the reindexer at around midnight to make sure all your merging is completed before business hours.

By following the above recommendations, you should be able to complete a large re-index without any disruption to your production environment.

Best Practices for exporting and importing data in bulk


Handling large amounts of data can be expensive in terms of both computing resources and runtime. It can also sometimes result in application errors or partial execution. In general, if you’re dealing with large amounts of data as either output or input, the most scalable and robust approach is to break up that workload into a series of smaller and more manageable batches.

Of course there are other available tactics. It should be noted, however, that most of those other tactics will have serious disadvantages compared to batching. For example:

Configuring time limit settings through Admin UI to allow for longer request timeouts - since you can only increase timeouts so much, this is best considered a short term tactic for only slightly larger workloads.
Eliminating resource bottlenecks by adding more resources – often easier to implement compared to modifying application code, though with the downside of additional hardware and software license expense. Like increased timeouts, there can be a point of diminishing returns when throwing hardware at a problem.
Tuning queries to improve your query efficiency – this is actually a very good tactic to pursue, in general. However, if workloads are sufficiently large, even the most efficient implementation of your request will eventually need to work over subset batches of your inputs or outputs.

For more detail on the above non-batching options, please refer to XDMP-CANCELED vs. XDMP-EXTIME.

WAYS TO EXPORT LARGE AMOUNTS OF DATA FROM MARKLOGIC SERVER

1. If you can’t break up the data into a series of smaller batches, use xdmp:save to write out the full results from Query Console to the desired folder, specified by the path on your file system. For details, see xdmp:save.

2. If you can break up the data into a series of smaller batches:

a. Use batch tools like MLCP, which can export bulk output from MarkLogic server to flat files, a compressed ZIP file, or an MLCP database archive. For details, see Exporting Content from MarkLogic Server.

b. Reduce the size of the desired result set until it saves successfully, then save the full output in a series of batches.

c. Page through result set:

i. If dealing with documents, cts:uris is excellent for paging through a list of URIs. Take a look at cts:uris for more details.

ii. If using Semantics

1. Consider exporting the triples from the database using the Semantics REST endpoints.

2. Take a look at the URL parameters start and pageLength. These parameters can be set on the request so that your SPARQL query returns its results in batches. See GET /v1/graphs/sparql for further details.
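The paging idea above can be sketched as a small helper that produces the successive start and pageLength values to send with each request. This is only an illustration: the helper name is hypothetical, and the 1-based start index is an assumption that should be checked against the GET /v1/graphs/sparql documentation.

```python
def page_windows(total, page_length, first=1):
    """Yield (start, pageLength) pairs that together cover `total` results.

    `first` is the index of the first result; 1 is assumed here.
    """
    if page_length <= 0:
        raise ValueError("page_length must be positive")
    start = first
    while start < first + total:
        yield start, page_length
        start += page_length


# 2,500 results fetched 1,000 at a time need three requests.
for start, page_length in page_windows(2500, 1000):
    print(f"start={start}&pageLength={page_length}")
```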

WAYS TO IMPORT LARGE AMOUNTS OF DATA INTO MARKLOGIC SERVER

1. If you’re looking to update more than a few thousand fragments at a time, you'll definitely want to use some sort of batching.

a. For example, you could run a script in batches of say, 2000 fragments, by doing something like [1 to 2000], and filtering out fragments that already have your newly added element. You could also look into using batch tools like MLCP.

b. Alternatively, you could split your input into smaller batches, then spawn each of those batches to jobs on the Task Server, which has a configurable queue. See:

i. xdmp:spawn

ii. xdmp:spawn-function

2. Alternatively, you could use an external/community developed tool like CoRB to batch process your content. See Using Corb to Batch Process Your Content - A Getting Started Guide

3. If using Semantics and querying triples with SPARQL:

a. You can make use of the LIMIT keyword to further restrict the result set size of your SPARQL query. See The LIMIT Keyword

b. You can also use the OFFSET keyword for pagination. This keyword can be used with the LIMIT and ORDER BY keywords to retrieve different slices of data from a dataset. For example, you can create pages of results with different offsets. See The OFFSET Keyword
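To make the batching tactics above more concrete, here is a minimal Python sketch. The first helper splits a list of inputs into batches of 2000, matching the batch size mentioned earlier; the second appends LIMIT and OFFSET to a SPARQL query to select one page. Both function names and the sample URIs are hypothetical, and the SPARQL helper assumes the query already ends with an ORDER BY clause so that pages are stable.

```python
def batches(items, size=2000):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for offset in range(0, len(items), size):
        yield items[offset:offset + size]


def paged_sparql(query, page, page_size):
    """Append LIMIT/OFFSET for a zero-based `page` number.

    `query` is assumed to already contain an ORDER BY clause.
    """
    return f"{query} LIMIT {page_size} OFFSET {page * page_size}"


uris = [f"/doc/{n}.xml" for n in range(4500)]  # hypothetical document URIs
print([len(batch) for batch in batches(uris)])
print(paged_sparql("SELECT ?s WHERE { ?s ?p ?o } ORDER BY ?s", 2, 100))
```

Each batch produced this way can then be handed to a separate update request or spawned task, rather than processing the whole input in one transaction.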

Best Practices for improving the performance of large collection ...

Introduction

This article outlines the factors that influence the performance of the xdmp:collection-delete function and provides general best practices for improving the performance of large collection deletes.

What are collections?

Collections in MarkLogic Server are used to organize documents in a database. Collections are a powerful and high-performance mechanism to define and manage subsets of documents.

How are collections different from directories?

Although both collections and directories can be used for organizing documents in a database, there are some key differences. For example:

Directories are hierarchical, whereas collections are not. Consequently, collections do not require member documents to conform to any URI patterns. Additionally, any document can belong to any collection, and any document can also belong to multiple collections
You can delete all documents in a collection with the xdmp:collection-delete function. Similarly, you can delete all documents in a directory (as well as all recursive subdirectories and any documents in those directories) with a different function call - xdmp:directory-delete
You can set properties on a directory. You cannot set properties on a collection

For further details, see Collections versus Directories.

What is the use of the xdmp:collection-delete function?

xdmp:collection-delete is used to delete all documents in a database that belong to a given collection - regardless of their membership in other collections.

Use of this function always results in the specified unprotected collection disappearing. For details, see Implicitly Defining Unprotected Collections
Removing a document from a collection and using xdmp:collection-delete are similarly contingent on users having appropriate permissions to update the document(s) in question. For details, see Collections and Security
If there are no documents in the specified collection, then nothing is deleted, and the function still returns the empty sequence

What factors affect performance of xdmp:collection-delete?

The speed of xdmp:collection-delete depends on several factors:

Number of documents deleted in a given transaction
Lock contention – Check your application code to see if it is running in update mode. You can read more about lock contention at:
- Documentation - xdmp:transaction-locks
- KB - Understanding Locking in MarkLogic Using Examples
- KB - Understanding the "Lock Trace" diagnostic trace event
Any on-going performance issues in the environment. You can look into your environment's monitoring history to see if there are any resource contention or locking loads present at the time you were running the delete
General application slowness due to underlying code or infrastructure issues. For a deeper dive into MarkLogic architecture, see:
- White Paper - Inside MarkLogic Server
- KB - Performance Theory: Tales From MarkLogic Support

Is there a fast operation mode available within the call xdmp:collection-delete?

Yes. The call xdmp:collection-delete("collection-uri") can potentially be fast in that it won't retrieve fragments. Be aware, however, that xdmp:collection-delete will retrieve fragments (and therefore perform much more slowly) when your database is configured with any of the following:

Temporal collections
Lock fragments
Auditing Events
Delete triggers
Directory-creation set to “automatic” or “manual-enforced”

What are the general best practices in order to improve the performance of large collection deletes?

Batch your deletes
- You could use an external/community developed tool like CoRB to batch process your content
- Tools like CoRB allow you to create a "query module" (this could be a call to cts:uris to identify documents from a number of collections) and a "transform module" that works on each URI returned. CoRB will run the URI query and will use the results to feed a thread pool of worker threads. This can be very useful when dealing with large bulk processing. See: Using Corb to Batch Process Your Content - A Getting Started Guide
Alternatively, you could split your input (for example, URIs of documents inside a collection that you want to delete) into smaller batches
- Spawn each of those batches to jobs on the Task Server instead of trying to delete an entire collection in a single transaction
- Use xdmp:spawn-function to kick off the deletions, but be careful not to overflow the Task Server queue:
  - Don't spawn single document deletes
  - Instead, make batches of size that work most efficiently in your specific use case
- One of the restrictions on the Task Server is that there is a set queue size - you should be able to increase the queue size as necessary
Scope deletes more narrowly with the use of cts:collection-query
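The trade-off between batch size and Task Server queue size can be worked through with a little arithmetic. The sketch below picks a batch size large enough that the number of spawned tasks stays within a given queue limit. The function name, the preferred batch size, and the numbers used are illustrative assumptions, not MarkLogic defaults.

```python
import math


def plan_batches(total_docs, queue_limit, preferred_batch=500):
    """Return (batch_size, task_count) such that task_count <= queue_limit."""
    if total_docs < 0 or queue_limit <= 0 or preferred_batch <= 0:
        raise ValueError("arguments must be positive (total_docs may be zero)")
    if total_docs == 0:
        return preferred_batch, 0
    # Grow the batch beyond the preferred size only when the queue demands it.
    batch_size = max(preferred_batch, math.ceil(total_docs / queue_limit))
    return batch_size, math.ceil(total_docs / batch_size)


# Hypothetical numbers: one million documents, room for 1,000 queued tasks.
print(plan_batches(1_000_000, 1_000))
```

With these numbers the preferred batch of 500 would need 2,000 tasks, so the sketch raises the batch size to 1,000 and spawns 1,000 tasks instead.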

Introduction

Problems can occur when trying to explicitly search (or not search) parts of documents when using a global configuration approach to include and exclude elements.

Global Approach

Including and excluding elements in a document using a global configuration approach can lead to unexpected results that are complex to diagnose. The global approach will require positions to be enabled in your index settings, expanding the disk space requirements of your indexes and may result in greater processing time of your position dependent queries. It may also require adjustments to your data model to avoid unintended includes or excludes; and may require changes to your queries in order to limit the number of positions used.

If circumstances dictate that you must instead use the less preferred global configuration approach, you can read more about including/excluding elements in word queries here: http://docs.marklogic.com/guide/admin/wordquery#id_77008.

Recommended Approach

In general, it's better to define specific fields, which are a mechanism designed to restrict your query to portions of documents based on elements. You can read more about fields here: http://docs.marklogic.com/guide/admin/fields

Can I still use XQuery and XML with MarkLogic 8?

Introduction

In MarkLogic 8, support for native JSON and server side JavaScript was introduced. We discuss how this affects the support for XML and XQuery in MarkLogic 8.

Details

In MarkLogic 8, you can absolutely use XML and XQuery. XML and XQuery remain central to MarkLogic Server now and into the future. JavaScript and JSON are complementary to XQuery and XML. In fact, you can even work with XML from JavaScript or JSON from XQuery. This allows you to mix and match within an application—or even within an individual query—in order to use the best tool for the job.

Summary

Stemming in MarkLogic Server is a case-sensitive operation.

Stemmed, Case Insensitive

When you run a stemmed, case-insensitive search, MarkLogic will map all the words to lowercase and then calculate the stems.

In English, this works fairly well as words are generally lowercase. For other languages (such as German) this doesn't always work as well.

Stemmed, Case Sensitive

When a search is case-sensitive, the stems are different depending on the case of the word.

In English, case sensitive searches with stemming specified are not considered as stemmed searches because, in English, words with upper case letters stem to themselves. You would not expect proper names or acronyms to be stemmed to something else. For example, “Mr. Mark Cutting” should not match "marks cuts.”

For German and other languages where stems exist for mixed case words, case-sensitive with stemming is recommended.

Examples

These example queries demonstrate stemmed searches:

Documents
xquery version "1.0-ml";
xdmp:document-insert("1.xml", <a>This is test.</a>),
xdmp:document-insert("2.xml", <a>This is TESTING.</a>),
xdmp:document-insert("3.xml", <a>This is TESTS.</a>),
xdmp:document-insert("4.xml", <a>This is TEST.</a>)

Case insensitive with stemming

xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

search:search("TESTS",
    <options xmlns="http://marklogic.com/appservices/search">
      <term>
        <term-option>case-insensitive</term-option>
        <term-option>stemmed</term-option>
      </term>
    </options>)

Matches: test, TESTS, TESTING, and TEST.

Case sensitive with stemming

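xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";

search:search("TESTS",
    <options xmlns="http://marklogic.com/appservices/search">
      <term>
        <term-option>case-sensitive</term-option>
        <term-option>stemmed</term-option>
      </term>
    </options>)

Matches: TESTS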

Checking for the existence of an element

Introduction

A common use case in many business applications is to find out whether an element exists in any document. This article provides ways to find such documents and explains the points to keep in mind while designing a solution.

Solution

In general, the existence of an element in a document can be checked by using the XQuery below.

cts:element-query(xs:QName('myElement'),cts:and-query(()))

Note the empty cts:and-query construct here. An empty cts:and-query is used to fetch all fragments.

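Hence, running the search query below will bring back all the documents having element "myElement".

cts:search(fn:doc(), cts:element-query(xs:QName('myElement'), cts:and-query(())))

Wrapping the query in cts:not-query will bring back all the documents *not* having element "myElement".

cts:search(fn:doc(), cts:not-query(cts:element-query(xs:QName('myElement'), cts:and-query(()))))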

A search using cts:not-query is only guaranteed to be accurate if the underlying query being negated is accurate from its index resolution. To check for the existence of a specific XPath, you therefore need to index that XPath.
For example, if you want to find documents having /path/1/A (and not /path/2/A), you can create a field index for the path /path/1/A and then use it in your query instead.

Things to remember

1.) Use unique element names within a single document, i.e. try not to use the same element name in multiple places within a document if those elements have different meanings for your use case. Either give them different element names or put them under different namespaces to remove any ambiguity. For example, if you have an element "table" in two places in a single document, you can put them under different namespaces, such as html:table & furniture:table, or you can name them differently, such as html_table & furniture_table.

2.) If element names are unique within a document then you don't need to create additional indexes. If element names are not unique within a document and you are interested in only a specific XPath then create path(field) indexes on those XPaths and use the same in your not-query.

Clearing the Expanded Tree Cache on each host when redeploying XM...

Introduction

MarkLogic Server has shipped with full support for the W3C XML Schema specification and schema validation capabilities since version 4.1 (released in 2009).

These features allow for the validation of complete XML documents or elements within documents against an existing XML Schema (or group of Schemas), whose purpose is to define the structure, content, and typing of elements within XML documents.

You can read more about the concepts behind XML Schemas and MarkLogic's support for schema based validation in our documentation:

https://docs.marklogic.com/guide/admin/schemas

Caching XML Schema data

In order to ensure the best possible performance at scale, all user created XML Schemas are cached in memory on each individual node within the cluster using a portion of that node's Expanded Tree Cache.

Best practices when making changes to pre-existing XML Schemas: clearing the Expanded Tree Cache

In some cases, when you are redeploying a revised XML Schema to an existing schema database, MarkLogic can sometimes refer to an older, cached version of the schema data associated with a given document.

Therefore, it's important to note that whenever you plan to deploy a new or revised version of a Schema that you maintain, as a best practice, it may be necessary to clear the cache in order to ensure that you have evicted all cached data stored for older versions of your schemas.

If you don't clear the cache, you may sometimes get references to the old, cached schema and, as a result, you may get errors like:

XDMP-LEXVAL (...) Invalid lexical value

You can clear all data stored in the Expanded Tree Cache in two ways:

By restarting MarkLogic service on every host in the cluster. This will automatically clear the cache, but it may not be practical on production clusters.
By calling xdmp:expanded-tree-cache-clear() on each host in the cluster. You can run the function in Query Console or via a REST endpoint, and you will need a user with admin rights to actually clear the cache.

An example script has been provided that demonstrates the use of XQuery to execute the call to clear the Expanded Tree Cache against each host in the cluster:
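As a rough sketch of the per-host approach, the Python below sends xdmp:expanded-tree-cache-clear() to each host in turn. It makes several assumptions that are not stated in this article: that each host exposes a REST API instance on port 8000 offering the /v1/eval endpoint, that digest authentication is in use, and that the supplied user holds the privileges needed to evaluate code and clear the cache. The host names and function names are hypothetical.

```python
import urllib.parse
import urllib.request

XQUERY = "xdmp:expanded-tree-cache-clear()"


def build_requests(hosts, port=8000):
    """Return one (url, form_body) pair per host; nothing is sent here."""
    body = urllib.parse.urlencode({"xquery": XQUERY}).encode("ascii")
    return [(f"http://{host}:{port}/v1/eval", body) for host in hosts]


def clear_caches(hosts, user, password, port=8000):
    """POST the cache-clear call to every host (assumes digest authentication)."""
    for url, body in build_requests(hosts, port):
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, url, user, password)
        opener = urllib.request.build_opener(
            urllib.request.HTTPDigestAuthHandler(manager)
        )
        request = urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
        with opener.open(request, timeout=30) as response:
            print(url, response.status)


if __name__ == "__main__":
    # Hypothetical host names; this only prints the requests that would be sent.
    for url, body in build_requests(["node1.example.com", "node2.example.com"]):
        print(url, body.decode("ascii"))
```

The request is addressed to each host individually because the cache is held per node, so clearing it on one host does not clear it on the others.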

Please contact MarkLogic Support if you encounter any issues with this process.

Summary

XDMP-ODBCRCVMSGTOOBIG can occur when a non-ODBC process attempts to connect to an ODBC application server. Two common causes are an HTTP application that has been accidentally configured to point to the ODBC port, and a load balancer that is sending HTTP health checks to an ODBC port. There are a number of common error messages that can indicate whether this is the case.

Identifying Errors and Causes

One method of determining the cause of an XDMP-ODBCRCVMSGTOOBIG error is to take the size value and convert it to characters. For example, given the following error message:

2019-01-01 01:01:25.014 Error: ODBCConnectionTask::run: XDMP-ODBCRCVMSGTOOBIG size=1195725856, conn=10.0.0.101:8110-10.0.0.103:54736

The size, 1195725856, can be converted to the hexadecimal value 47 45 54 20, which can be converted to the ASCII value "GET ". So what we see is a GET request being run against the ODBC application server.

Common Errors and Values

Error	Hexadecimal	Characters
XDMP-ODBCRCVMSGTOOBIG size=1195725856	47 45 54 20	"GET "
XDMP-ODBCRCVMSGTOOBIG size=1347769376	50 55 54 20	"PUT "
XDMP-ODBCRCVMSGTOOBIG size=1347375956	50 4F 53 54	"POST"
XDMP-ODBCRCVMSGTOOBIG size=1212501072	48 45 4C 50	"HELP"
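
The conversion in the table can be reproduced with a few lines of Python. This sketch treats the size as a 4-byte big-endian integer, which is consistent with the values listed above; the function name is hypothetical.

```python
def decode_size(size):
    """Return (hex bytes, ASCII text) for a 32-bit size value."""
    raw = size.to_bytes(4, byteorder="big")
    hex_text = " ".join(f"{byte:02X}" for byte in raw)
    # Non-ASCII bytes are replaced rather than raising an error.
    return hex_text, raw.decode("ascii", errors="replace")


for size in (1195725856, 1347769376, 1347375956, 1212501072):
    print(size, *decode_size(size))
```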

Conclusion

XDMP-ODBCRCVMSGTOOBIG errors do not affect the operation of MarkLogic Server, but they can cause error logs to fill up with clutter. Determining that the errors are caused by an HTTP request to an ODBC port can help to identify the root cause so the issue can be resolved.

Configuration Migration of MarkLogic Server using Gradle and ml-g...

Introduction:

As the Configuration Manager has been deprecated starting with MarkLogic version 9.0-5, a common question is how to migrate the configuration of a database or an application server from an older MarkLogic instance to a newer one, or between any two MarkLogic Server versions after 9.0-4.

This article outlines the steps on how to migrate the resource configuration information from one server to other using Gradle and ml-gradle plugin.

Pre-Requisite

As a prerequisite, have a compatible Gradle version (6.x) and the latest ml-gradle plugin (the latest version is 4.1.1) installed and configured on the client machine (the local machine, or whichever machine the Gradle project will run from).

Solution:

The entire process is divided into two major parts: exporting the resource configuration from the source cluster, and importing the resource configuration onto the destination cluster.

1. Exporting resource configuration from the source cluster/host:

On the machine where gradle is installed and the plug-in is configured, create a project as suggested in https://github.com/marklogic-community/ml-gradle#start-using-ml-gradle

In the example steps below the source project is /Migration

1.1 Creating the new project with the source details:

While creating this new project, provide the source MarkLogic Server host, username, password, REST port, and multiple-environment details at the command line. Once the project has been created successfully, you can verify the source server details in the gradle.properties file.

macpro-user1:Migration user1$ gradle mlNewProject
Starting a Gradle Daemon (subsequent builds will be faster)
> Configure project :
For Jackson Kotlin classes support please add "com.fasterxml.jackson.module:jackson-module-kotlin" to the classpath
> Task :mlNewProject
Welcome to the new project wizard. Please answer the following questions to start a new project.
Note that this will overwrite your current build.gradle and gradle.properties files, and backup copies of each will be made.
[ant:input] Application name: [myApp]
[ant:input] Host to deploy to: [SOURCEHOST]
[ant:input] MarkLogic admin username: [admin]
[ant:input] MarkLogic admin password: [admin]
[ant:input] REST API port (leave blank for no REST API server):
[ant:input] Test REST API port (intended for running automated tests; leave blank for no server):
[ant:input] Do you want support for multiple environments? ([y], n)
[ant:input] Do you want resource files for a content database and set of users/roles created? ([y], n)
Writing: gradle.properties
Making directory: ~/Migration/src/main/ml-config
Making directory: ~/Migration/src/main/ml-modules
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.6.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1m 27s

1 actionable task: 1 executed

Once this build was successful, you can see the below directory structure created under the project directory:

1.2 Exporting the configuration of required resources:

Once the new project is created, export the required resources from the source host/cluster by creating a properties file (in a directory other than the project directory), as suggested in the documentation, listing all the resources that need to be exported to the destination cluster. In that properties file, specify the names of the resources (databases, forests, app servers, etc.) using the keys shown below with comma-delimited values:

For example, a sample properties file looks like below:

file.properties:

cpfConfigs=my-domain-1 
databases=my-database1,my-database2
domains=my-domain-1,my-domain-2 
groups=my-group 
pipelines=my-pipeline-1 
privilegesExecute=my-privilege-1
privilegesUri=my-privilege-2
roles=my-role-1,my-role-2
servers=my-server-1,my-server-2
tasks=/path/to/task.xqy,/path/to/other/task.xqy
triggers=my-trigger-1,my-trigger-2 
users=user1,user2

Once the file is created, run the below:

macpro-user1:Migration user1$ gradle -PpropertiesFile=~/file.properties mlExportResources

> Task :mlExportResources Exporting resources to: ~/Migration/src/main/ml-config
Exported files:
~/Migration/src/main/ml-config/databases/Documents.json
.
.
.
~/Migration/src/main/ml-config/security/users/miguser.json
Export messages:
The 'forest' key was removed from each exported database so that databases can be deployed before forests.
The 'range' key was removed from each exported forest, as the forest cannot be deployed when its value is null.
The exported user files each have a default password in them, as the real password cannot be exported for security reasons.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.6.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 1s

1 actionable task: 1 executed

Once this build was successful, the below directory structure is created under the project directory which includes the list of resources that have been exported and their config files (Example screenshot below):

With this step finished, the export of the required resources from the source cluster is complete. The exported configuration is now ready to be imported into the new/destination cluster.

2. Importing Resources and the configuration on new/Destination host/Cluster:

For importing resource configuration on to the destination host/cluster, again create a new project and use the export that has been created in step 1.2 Exporting the configuration of required resources. Once these configuration files are copied to the new project, make the necessary modification to reflect the new cluster (Like hosts and other dependencies) and then deploy the configuration into the new project.

2.1 Creating a new project for the import with the Destination Host/cluster details:

While creating this new project, provide the destination MarkLogic Server host, username, password, REST port, and multiple-environment details at the command line. Once the project has been created successfully, verify the destination server details in the gradle.properties file. In the example steps below, the destination project is /ml10pro

macpro-user1:ml10pro user1$ gradle mlNewProject
> Task :mlNewProject Welcome to the new project wizard. Please answer the following questions to start a new project.
Note that this will overwrite your current build.gradle and gradle.properties files, and backup copies of each will be made.
[ant:input] Application name: [myApp]
<-------------> 0% EXECUTING [11s] [ant:input] Host to deploy to: [destination host]
<-------------> 0% EXECUTING [25s] [ant:input] MarkLogic admin username: [admin]
<-------------> 0% EXECUTING [28s] [ant:input] MarkLogic admin password: [admin]
<-------------> 0% EXECUTING [36s] [ant:input] REST API port (leave blank for no REST API server):
<-------------> 0% EXECUTING [41s] [ant:input] Do you want support for multiple environments? ([y], n)
<-------------> 0% EXECUTING [44s] [ant:input] Do you want resource files for a content database and set of users/roles created? ([y], n)
<-------------> 0% EXECUTING [59s] Writing: gradle.properties
Making directory: /Users/rgunupur/Downloads/ml10pro/src/main/ml-config
Making directory: /Users/rgunupur/Downloads/ml10pro/src/main/ml-modules
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.6.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 59s

1 actionable task: 1 executed

Once the project is created, you can observe the below directory structure created:

2.2 Copying the required configuration files from Source project to destination project:

In this step, copy the configuration files that were created by exporting the resource configuration from the source server in step “1.2 Exporting the configuration of required resources”.

For example,

macpro-user1:ml10pro user1$ cp -R ~/Migration/src/main/ml-config/. ~/ml10pro/src/main/ml-config/

After copying, the directory structure in this project looks like below:

NOTE:

After copying the configuration files from source to destination, review every configuration file and make the necessary changes. For example, the host details should be updated to the destination server's host details. Similarly, make any other changes that your requirements call for.

For example, in the ~/ml10pro/src/main/ml-config/forests/<database>/<forestname>.json file you will see the entry:

"host" : "Sourceserver_IP_Address",

Change the host details to reflect the destination host. After the change, it should look like:

"host" : "Destination_IP_Address",

Similarly, for each forest, define the host details of the specific node on which that forest is required.
For example, if forest 1 has to be on node 1, define forest1.json with

"host" : "node1_host",

Likewise, any other configuration parameter that has to be updated should be changed in that specific resource's file under the destination ml-config directory.
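
When there are many forests, the host edits described above can be scripted. The following Python sketch rewrites the "host" value in each exported forest file. It assumes that the exported forest files are JSON documents with a top-level "host" key, as the entries shown above suggest, and that each file is named after its forest; the function name and the forest-to-host mapping are hypothetical.

```python
import json
from pathlib import Path


def set_forest_hosts(forests_dir, host_for):
    """Rewrite "host" in every *.json file under forests_dir.

    host_for maps a forest file name without extension (e.g. "forest1")
    to the destination host. Returns the names of the files that changed.
    """
    changed = []
    for path in sorted(Path(forests_dir).rglob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        new_host = host_for.get(path.stem)
        # Files without a mapping, or already pointing at the right host, are left alone.
        if new_host and config.get("host") != new_host:
            config["host"] = new_host
            path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
            changed.append(path.name)
    return changed


# Example call (paths and host names are placeholders):
# set_forest_hosts("src/main/ml-config/forests", {"forest1": "node1_host"})
```

Reviewing the resulting changes in version control before running the deployment helps catch any forest that was mapped to the wrong node.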

Best Practice:

As this involves modifying the configuration files, it is advisable to keep a backup and to use version control (such as GitHub or SVN) so that the modifications can be tracked.

If the same configuration must be deployed to multiple environments (such as PROD, QA, TEST), all that is needed is a gradle.properties file for each environment where the configuration will be deployed. As explained in step 2.1 Creating a new project for the import with the Destination Host/cluster details, the property values for each environment need to be provided while creating the project so that the gradle.properties files for the different environments are created.

2.3 Importing the configuration (Running mlDeploy):

In this step, import the configuration that was exported from the source and copied into this project. After making sure that all the configuration files have been copied from the source and modified for the correct host details and other required changes, run the following:

macpro-user1:ml10pro user1$ gradle mlDeploy
> Task :mlDeleteModuleTimestampsFile

Module timestamps file /Users/rgunupur/Downloads/ml10pro/build/ml-javaclient-util/module-timestamps.properties does not exist, so not deleting
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/6.6.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 44s

3 actionable tasks: 3 executed

Once the build is successful, go to the admin console of the destination server and verify that all the required configurations have been imported from the source server.

Further reading:

For more information, refer to our documentation and knowledge base articles:

https://help.marklogic.com/Knowledgebase/Article/View/686/0/transporting-configuration-to-a-new-cluster

https://help.marklogic.com/knowledgebase/article/View/alternatives-to-configuration-manager

https://github.com/marklogic-community/ml-gradle

https://github.com/marklogic-community/ml-gradle/wiki/Resource-reference

https://developer.marklogic.com/code/ml-gradle/

Configuring and using HAProxy with MarkLogic Server

Introduction

HAProxy (http://www.haproxy.org/) is a free, fast and reliable solution offering high availability, load balancing and proxying for TCP and HTTP-based applications.

MarkLogic 8 (8.0-8 and above) and MarkLogic 9 (9.0-4 and above) include improvements to allow you to use HAProxy to connect to MarkLogic Server.

MarkLogic Server supports balancing application requests using both the HAProxy TCP and HTTP balancing modes depending on the transaction mode being used by the MarkLogic application as detailed below:

For single-statement auto-commit transactions running on MarkLogic version 8.0-7 and earlier or MarkLogic version 9.0-3 and earlier, only TCP mode balancing is supported. This is because the SessionID cookie and transaction id (txid) are only generated as part of a multi-statement transaction.
For multi-statement transactions, or for single-statement auto-commit transactions running on MarkLogic version 8.0-8 and later or MarkLogic version 9.0-4 and later, both TCP and HTTP balancing modes can be configured.

Refer to Understanding Transactions in MarkLogic Server and Single vs. Multi-statement Transactions in the MarkLogic documentation to determine whether your application is using single or multi-statement transactions.

Note: Attempting to use HAProxy in HTTP mode with single-statement transactions prior to MarkLogic versions 8.0-8 or 9.0-4 can lead to unpredictable results.
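The version rules above can be summarized in a small decision helper. This is only a sketch of the rules as stated in this article: the function names are hypothetical, version strings are assumed to use the usual MarkLogic form (for example 9.0-4), and treating major versions after 9 as HTTP-capable is an assumption that this article does not cover.

```python
def parse_version(version):
    """Turn a MarkLogic version string such as "9.0-4" into a tuple of ints."""
    return tuple(int(part) for part in version.replace("-", ".").split("."))


def supported_balancing_modes(version, multi_statement):
    """Return the HAProxy balancing modes supported for this application."""
    if multi_statement:
        # Multi-statement transactions carry the SessionID cookie and txid.
        return ["tcp", "http"]
    parsed = parse_version(version)
    if parsed[0] == 8:
        http_ok = parsed >= (8, 0, 8)
    else:
        # Releases before 8 fail this check; releases after 9 pass it
        # (the latter is an assumption, not stated in the article).
        http_ok = parsed >= (9, 0, 4)
    return ["tcp", "http"] if http_ok else ["tcp"]
```

For instance, supported_balancing_modes("9.0-3", multi_statement=False) returns only TCP mode, which matches the note above about unpredictable results in HTTP mode.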

Example configurations

The following example configurations detail only the parameters relevant to enabling load balancing of a MarkLogic application. For details of all the parameters that can be used, please refer to the HAProxy documentation.

TCP mode balancing

The following configuration is an example of how to balance requests to a 3-node MarkLogic application using the "roundrobin" balance algorithm, with client stickiness based on the source IP address. The health of each node is checked by a TCP probe to the application server every 1 second.

backend app
mode tcp
balance roundrobin
stick-table type ip size 200k expire 30m
stick on src
default-server inter 1s
server app1 ml-node-1:8012 check id 1
server app2 ml-node-2:8012 check id 2
server app3 ml-node-3:8012 check id 3

HTTP mode balancing

The following configuration is an example of how to balance requests to a 3-node MarkLogic application using the "roundrobin" balance algorithm, with session affinity based on the "SessionID" cookie inserted by MarkLogic Server.

The health of each node is checked by issuing an HTTP GET request to the MarkLogic health check port and checking for the "Healthy" response.

backend app
mode http
balance roundrobin
cookie SessionID prefix nocache
option httpchk GET / HTTP/1.1\r\nHost:\ monitoring\r\nConnection:\ close
http-check expect string Healthy
server app1 ml-node-1:8012 check port 7997 cookie app1
server app2 ml-node-2:8012 check port 7997 cookie app2
server app3 ml-node-3:8012 check port 7997 cookie app3
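To see what the HTTP health check above looks for, the same probe can be reproduced outside HAProxy. The sketch below is a minimal Python equivalent under stated assumptions: the function names are hypothetical, port 7997 and the "Healthy" string are taken from the configuration above, and the substring match mirrors what "http-check expect string" does.

```python
from http.client import HTTPException
from urllib.request import Request, urlopen


def body_is_healthy(body):
    """Apply the same substring test as 'http-check expect string Healthy'."""
    return "Healthy" in body


def node_is_healthy(host, port=7997, timeout=2.0):
    """GET / on the health check port and look for the "Healthy" response."""
    request = Request(
        f"http://{host}:{port}/",
        headers={"Host": "monitoring", "Connection": "close"},
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            text = response.read().decode("utf-8", errors="replace")
    except (OSError, HTTPException):
        # Connection refused, timeout or HTTP error: treat the node as down.
        return False
    return body_is_healthy(text)
```

Calling node_is_healthy("ml-node-1") from the HAProxy host is a quick way to confirm that the health check port is reachable before enabling the backend.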

Considerations when scaling out your MarkLogic instance

Introduction

MarkLogic Server is engineered to scale out horizontally by easily adding forests and nodes. Be aware, however, that when adding resources horizontally, you may also be introducing additional demand on the underlying resources.

Details

On a single node, you will see some performance improvement from adding forests, due to increased parallelization. There is a point of diminishing returns, though, where the number of forests can overwhelm the available resources such as CPU, RAM, or I/O bandwidth. Internal MarkLogic research (as of April 2014) shows the sweet spot to be around six forests per host (assuming modern hardware). Note that there is a hard limit of 1024 primary forests per database, and it is a general recommendation that the total number of forests should not grow beyond 1024 per cluster.

At the cluster level, you should see performance improvements from adding hosts, but attention should be paid to any potentially shared resources. For example, since resources such as CPU, RAM, and I/O bandwidth would now be split across multiple nodes, overall performance is likely to decrease if additional nodes are provisioned virtually on a single underlying server. Similarly, when adding nodes to the same underlying SAN storage, you'll want to pay careful attention to making sure there's enough I/O bandwidth to accommodate the number of nodes you want to connect.

More generally, adding capacity above a bottleneck tends to exacerbate performance issues. If you find your performance has actually decreased after horizontally scaling out some part of your stack, it is likely that a part of your infrastructure below the part at which you made changes is being overwhelmed by the additional demand introduced by the added capacity.

Correcting Ownership of MarkLogic Server Data Directory and Files

Summary

There are circumstances in which the owner of the MarkLogic data directory gets changed. This can create problems where MarkLogic Server is unable to read and/or write its own files, but it is easily corrected.

MarkLogic Server user

The default location for the data directory on Linux is /var/opt/MarkLogic and the default owner is daemon.

If you are using a non-default (non-daemon) user to run MarkLogic, for example mlogic, you would usually have the following line in /etc/marklogic.conf:

export MARKLOGIC_USER=mlogic

Correct the data directory ownership

If the file ownership is incorrect, the way forward is to change the ownership back to the correct user. For example, if using the default user daemon:

1. Stop MarkLogic Server.

2. Make sure that the user you are using is correct and available on this machine.

3. Change the ownership of all the MarkLogic files (by default /var/opt/MarkLogic and any/all forests for this node) to daemon. The change needs to be made recursively below the directory to include all files. Assuming all nodes in the cluster run as daemon, you can use another unaffected node as a check. You may need to use root/sudo permissions to change owner. For example:

chown -R daemon:daemon /var/opt/MarkLogic

4. Start MarkLogic Server. It should now come up as the correct user and be able to manage its files.
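Before or after running chown, it can help to list exactly which files have the wrong owner. The following Python sketch walks a directory tree and reports every path whose owner differs from the expected one. The function name find_wrong_owner is hypothetical, and the numeric user id is passed in so that the caller decides which account (daemon by default, or a custom MARKLOGIC_USER) is correct.

```python
import os


def find_wrong_owner(root, expected_uid):
    """Return paths at or below root whose owner is not expected_uid."""
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself rather than its target.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched


# Example use (run with enough privileges to read the data directory):
#   import pwd
#   uid = pwd.getpwnam("daemon").pw_uid
#   for path in find_wrong_owner("/var/opt/MarkLogic", uid):
#       print(path)
```

An empty result after the chown in step 3 indicates that ownership is consistent below the data directory. Forests stored outside /var/opt/MarkLogic would need to be checked separately.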

References

MarkLogic Documentation: Installing MarkLogic

Creating a Custom MarkLogic AMI with Packer

Summary

Packer from HashiCorp is a provisioning tool that automates the creation of machine images, extending infrastructure management to the machine images themselves. Packer supports a number of different image types including AWS, Azure, Docker, VirtualBox and VMware.

Packer can be used to create a customized MarkLogic Amazon Machine Image (AMI) which can then be deployed to AWS and used in a Cluster. We recommend using the official MarkLogic AMIs whenever possible, and making the necessary customizations to the official images. This ensures that MarkLogic Support is able to quickly diagnose any issues that may occur, as well as reducing the risk of running MarkLogic in a way that is not fully supported.

The KB article, Customizing MarkLogic with Packer and Terraform, covers the process of customizing the official MarkLogic AMI using Packer.

Setting Up Packer

For the purpose of this example, I will assume that you have already installed the AWS CLI, with the correct credentials, and you have installed Packer.

Packer Templates

A Packer template is a JSON configuration file that is used to define the image that we want to build. Templates have a number of keys available for defining the machine image, but the most commonly used ones are builders, provisioners and post-processors.

builders are responsible for creating the images for various platforms.
provisioners is the section used to install and configure software running on machines before turning them into images.
post-processors are actions applied to the images after they are created.

Creating a Template

For our example, we are going to build from the official Amazon Linux 2 AMI, where we will install the required prerequisite packages, install MarkLogic, and apply some customizations before creating a new image.

Defining Variables

Variables help make the build more flexible, so we will utilize a separate variables file, marklogic_vars.json, to define parts of our build.

{
    "vpc_region": "us-east-1",
    "vpc_id": "vpc-06d3506111cea30d0",
    "vpc_public_sn_id": "subnet-03343e69ae5bed127",
    "vpc_public_sg_id": "sg-07693eb077acb8635",
    "instance_type": "t3.large",
    "ssh_username": "ec2-user",
    "ami_filter": "amzn2-ami-hvm-2.*-ebs",
    "ami_owner": "amazon",
    "binary_source": "./",
    "binary_dest": "/tmp/",
    "marklogic_binary": "MarkLogic-10.0-4.2.x86_64.rpm"
}

Here we've identified the instance details so our image can be launched, as well as the filter values, ami_filter and ami_owner, that will help us retrieve the correct base image for our AMI. We are also identifying the name of the MarkLogic binary, along with some path details on where to find it locally, and where to place it on the remote host.

Creating Our Template

Now that we have some of the specific build details defined, we can create our template, marklogic_ami.json. In this case we are going to use the builders and provisioners keys in our build.

{
    "builders": [
        {
            "type": "amazon-ebs",
            "region": "{{user `vpc_region`}}",
            "vpc_id": "{{user `vpc_id`}}",
            "subnet_id": "{{user `vpc_public_sn_id`}}",
            "associate_public_ip_address": true,
            "security_group_id": "{{user `vpc_public_sg_id`}}",
            "source_ami_filter": {
                "filters": {
                    "virtualization-type": "hvm",
                    "name": "{{user `ami_filter`}}",
                    "root-device-type": "ebs"
                },
                "owners": ["{{user `ami_owner`}}"],
                "most_recent": true
            },
            "instance_type": "{{user `instance_type`}}",
            "ssh_username": "{{user `ssh_username`}}",
            "ami_name": "ml-{{isotime \"2006-01-02-1504\"}}",
            "tags": {
                "Name": "ml-packer"
            }
        }
    ],
    "provisioners": [
        {
            "type": "shell",
            "script": "./marklogicInit.sh"
        },
        {
            "destination": "{{user `binary_dest`}}",
            "source": "{{user `binary_source`}}{{user `marklogic_binary`}}",
            "type": "file"
        },
        {
            "type": "shell",
            "inline": [ "sudo yum -y install /tmp/{{user `marklogic_binary`}}" ]
        }
    ]
}

In the builders section we have defined the network and security group configurations and the source AMI details. We have also defined the naming convention (ml-YYYY-MM-DD-HHMM) for our new AMI with ami_name and added a tag, ml-packer. Both of those will make it easier to find our AMI when it comes time to deploy it.
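Because the template takes every environment-specific value from marklogic_vars.json through {{user `name`}} references, a reference to a variable that is missing from the variables file is an easy mistake to make. The Python sketch below cross-checks the two files. The function name undefined_user_variables is hypothetical, and the regular expression only covers the plain {{user `name`}} form used in this template. Packer's own "packer validate" command remains the authoritative check.

```python
import json
import re

# Matches references of the form {{user `variable_name`}}.
USER_VAR = re.compile(r"\{\{\s*user\s+`([^`]+)`\s*\}\}")


def undefined_user_variables(template_text, variables):
    """Return the user variables referenced in the template but not defined."""
    referenced = set(USER_VAR.findall(template_text))
    return sorted(referenced - set(variables))


def check_files(template_path, vars_path):
    """Load both files from disk and report undefined variable names."""
    with open(template_path) as template_file:
        template_text = template_file.read()
    with open(vars_path) as vars_file:
        variables = json.load(vars_file)
    return undefined_user_variables(template_text, variables)
```

For example, check_files("marklogic_ami.json", "marklogic_vars.json") returning an empty list means every referenced variable has a value in the variables file.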

Provisioners

In our example, we are using the shell provisioner to execute a script against the machine, the file provisioner to copy the MarkLogic binary file to the machine, and the shell provisioner to install the MarkLogic binary, all of which will be run prior to creating the image. There are also provisioners available for Ansible, Salt, Puppet, Chef, and PowerShell, among others.

Provisioning Script

For our custom image, we've determined that we need to install Git, to create a symbolic link MarkLogic needs on Amazon Linux 2, and to set up /etc/marklogic.conf to disable the MarkLogic Managed Cluster feature, all of which we will do inside a script. We've named the script marklogicInit.sh, and it is stored in the same directory as our Packer template.

#!/bin/bash -x

echo "****   Starting marklogicInit.sh    ****"

echo "****   Creating LSB symbolic link   ****"
sudo ln -s /etc/system-lsb /etc/redhat-lsb

echo "****   Installing Git   ****"
sudo yum install -y git

echo "****   Setting Up /etc/marklogic.conf   ****"
echo "export MARKLOGIC_MANAGED_NODE=0" >> /tmp/marklogic.conf
sudo cp /tmp/marklogic.conf /etc/

echo "****   Finishing marklogicInit.sh    ****"

Executing Our Build

Now that we've completed setting up our build, it's time to use Packer to create the image.

packer build -debug -var-file=marklogic_vars.json marklogic_ami.json

Here you can see that we are telling Packer to do a build using marklogic_ami.json and referencing our variables file with the -var-file flag. We've also added the -debug flag which will disable parallelism and enable debug mode. In debug mode, Packer will stop after each step and prompt you to hit Enter to go to the next step.

The last part of the build output will print out the details of our new image:

Wrapping Up

We have now created a customized MarkLogic AMI using Packer, which can be used to deploy a self-managed cluster.

Creating a pstack movie for support

Introduction

Attached to this article is a bash script (pstack.sh), which will generate a zip file containing a lot of useful information for the support team in cases where assistance is required troubleshooting a problem in a development, QA/UAT or production environment.

When would I need to create a pstack movie for support

You would create a pstack movie when you are facing a particular performance problem: for example, a query that seems to be displaying unexpected behaviour, or a situation where the server appears to become unresponsive for a period of time.

The "pstack movie" generates a stack trace of the MarkLogic process at intervals of 5 seconds for a period of time specified by the user running the script; we generally recommend running this script for one minute (60 seconds).

Example Usage:

./pstack.sh 60

This will cause the script to execute for 1 minute. The output will be saved to a zip file in /tmp.

The script will generate a filename from the current server time and date. The zip file can then be attached to the case for the engineering team to review.
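The sampling behaviour described above (one stack trace every 5 seconds for the requested duration) can be sketched as follows. This is not the attached pstack.sh, only an illustration in Python of the loop it is described as performing. The function name capture_stack_movie is hypothetical, and it assumes the pstack command is installed and accepts a process id as its argument.

```python
import subprocess
import time


def capture_stack_movie(pid, duration_s=60, interval_s=5,
                        run=subprocess.run, sleep=time.sleep):
    """Collect one pstack trace of pid every interval_s seconds for duration_s."""
    frames = []
    count = max(1, duration_s // interval_s)
    for index in range(count):
        # Each "frame" of the movie is the text output of one pstack run.
        result = run(["pstack", str(pid)], capture_output=True, text=True)
        frames.append(result.stdout)
        if index < count - 1:
            sleep(interval_s)
    return frames
```

With the defaults, a 60 second run yields 12 frames. The run and sleep parameters exist only so the loop can be exercised without a real process or real delays.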

The script (pstack.sh)

The entire script is available for download from this page and the source is available from here:

A longer pstack script (ml-support-dump.sh)

This script also gets essential OS level information in addition to the pstack output:

Prerequisites

Before running this pstack script, please ensure you have the following packages installed on your host:

yum -y install pstack sysstat psutils gdb

Someone with root access to the host will need to execute this script.

Creating a support request

Summary

This article describes how to create a MarkLogic Support Request (commonly known as a Support Dump). To create a Support Request:

Creating a Support Request

1. Use a web browser to log in to the server's Admin interface, which is typically found on the server at port 8001, e.g. http://localhost:8001

2. Click Configure in the navigation frame.

3. Click the Support Tab

4. Choose the options based on the level of detail you want to provide.

For MarkLogic Server versions below 9, follow these steps:

1. In general, MarkLogic Support recommends choosing "cluster", "status and logs" and "file".
2. Make sure to zip the file when attaching to a support case.
3. If the resulting zip file is larger than 10MB, please upload the file through the portal using the HTTPS-based upload (preferred) or to the MarkLogic FTP server, as you will be unable to attach files greater than 10MB to the support case.

For MarkLogic Server version 9 and onwards:

1. MarkLogic Support recommends choosing "cluster", "status and system logs", "latest" and "upload to MarkLogic Secured Storage".

In MarkLogic 9, ErrorLog files have been split to separate PII (Personally Identifiable Information) from system information:

- "System ErrorLog", which contains only MarkLogic Server related information

- "Application ErrorLog", which contains all application-specific logging information, including PII

2. Send the file to MarkLogic Support

It is recommended to select "upload to MarkLogic Secured Storage", which will upload all collected data automatically. It only requires that MarkLogic Server can reach telemetry.services.marklogic.com over SSL. After a successful upload, provide us with the reported Cluster-ID in the support ticket.

For the other options please follow instructions for earlier MarkLogic versions above.

Data Hub Framework - FAQ

Question	Answer	Further Reading
What is Data Hub?	The MarkLogic Data Hub is an open-source software interface that works to: ingest data from multiple sources; harmonize that data; master that data; then search and analyze that data. It runs on MarkLogic Server, and together they provide a unified platform for mission-critical use cases.	Documentation: MarkLogic Data Hub
How do I install Data Hub?	Please see the referenced documentation Install Data Hub
What software is required for Data Hub installation?	Java JRE (OpenJDK) 8; MarkLogic Server (see Version Compatibility); Gradle	Documentation: Install Data Hub
What is MarkLogic Data Hub Central?	Hub Central is the Data Hub graphical user interface	Documentation: Hub Central Guided Tour of MarkLogic Data Hub Central Introducing MarkLogic Data Hub Central What is MarkLogic Data Hub Central?
What are the ways to ingest data in Data Hub?	Hub Central (note that Quick Start has been deprecated since Data Hub 5.5); Data Hub Gradle Plugin; Data Hub Client JAR; Data Hub Java APIs; Data Hub REST APIs; MarkLogic Content Pump (MLCP)	Documentation: On-Premises Tools; MarkLogic Data Hub 5.5 - Release Notes
What is the recommended batch size for matching steps?	The best batch size for a matching step varies with the average number of matches expected. A larger average number of matches calls for smaller batch sizes. A batch size of 100 is the recommended starting point.	Documentation: Batch size for Matching step
What is the recommended batch size for merging steps?	The merge batch size should always be 1	Documentation: Batch size for Merging step
How do I kill a long running flow in Data Hub?	At the moment, the feature to stop/kill a long running flow in Data Hub isn't available. If you encounter this issue, please provide Support with the following information to help us investigate further: (1) error logs and exception traces from the time the job was started; (2) the job document for the step in question. You can find that document in the "data-hub-JOBS" database using the job ID: open Query Console; select the data-hub-JOBS database from the dropdown; click Explore; enter the job ID in the search field and press Enter (e.g. *21d54818-28b2-4e56-bcfe-1b206dd3a10a*); you'll see the document in the results. Note: If you want to force it, you can cycle the Java program and stop the requests from the corresponding app server status page on the Admin UI.	KB Article: Killing a Long running Query and Request Time Limits
What do we do if we are receiving SVC-EXTIME error consistently while running the merging step?	"SVC-EXTIME" generally occurs when a query or other operation exceeds its processing time limit. There are various possible reasons behind this error, for example: lack of physical resources; infrastructure-level slowness; network issues; server overload; document locking issues. Additionally, review the step where you match documents to see how many URIs you are trying to merge in one go: reduce the batch size to a value that gives a balance between processing time and performance (the SVC-EXTIME timeout error); modify your matching step to work with fewer matches per run rather than a huge number of matches. Turning ON the SM-MATCH and SM-MERGE traces would give a good indication of what it is getting stuck on. Do remember, however, to turn them OFF once the issue has been detected/resolved.	Documentation: SVC-EXTIME
What are the best practices for performing Data Hub upgrades?	Note that Data Hub versions depend on MarkLogic Server versions: if your Data Hub version requires a different MarkLogic Server version, you MUST upgrade your MarkLogic Server installation before upgrading your Data Hub version. Take a backup. Perform extensive testing with all use cases on lower environments. Refer to the release notes (some Data Hub upgrades require reindexing), the upgrade documentation, and the version compatibility with MarkLogic Server.	KB Article: MarkLogic Server/Data Hub version compatibility and upgrade
How can I encrypt my password in Gradle files used for Data Hub?	You may need to store the password in encrypted Gradle properties and reference the property in the configuration file.	Documentation: Encrypting passwords Blog: Protecting passwords in ml-gradle projects
How can I create a Golden Record using Data Hub?	A golden record is a single, well-defined version of all the data entities in an organizational ecosystem. In the Data Hub Central, once you have gone through the process of ingest, map and master, the documents in the *sm-<EntityType>-mastered* collection would be considered as golden records	KB article: What is a Golden Record and how can you create one on DataHub?
What authentication method does Data Hub support?	DataHub primarily supports basic and digest authentication. The configuration for username/password authentication is provided when deploying your application.
How do I know the compatible MarkLogic server version with Data Hub version?	Refer to Version Compatibility matrix.
Can we deploy multiple DHF projects on the same cluster?	This operation is NOT supported.
Can we perform offline/disconnected Data Hub upgrades?	This is NOT supported, but you can refer to this example to see one potential approach
TDE Generation in Data Hub	For production purposes, you should configure your own TDE's instead of depending solely on TDE's generated by Data Hub (which may not be optimized for performance or scale)
Where does gradle download all the dependencies we need to install DHF from?	Below is the list of sites that Gradle will use in order to resolve dependencies: The DHF Gradle plugin will be fetched from: https://plugins.gradle.org/m2 All dependencies will be retrieved from: https://jcenter.bintray.com/ (or) https://search.maven.org/ JFrog This tool is helpful to figure out what the dependencies are: It provides a shareable and centralized record of a build that provides insights into what happened and why You can create build scans using this tool and even publish those results at https://scans.gradle.com to see where Gradle is trying to download each dependency from under the "Build Dependencies" section on the results page.

Database available without a quorum?

Introduction

In the Scalability, Availability & Failover Guide, the node communication section describes a quorum as more than 50% of the nodes in a cluster.

Is it possible for a database to be available for reads and writes, even if a quorum of nodes is not available in the cluster?

The answer is yes: there are configurations and sequences of events that can lead to forests remaining online even when a quorum of hosts is not online.

Details

If a single forest in a database is not available, the database is not accessible. It is also true that as long as all of a database's forests are available in the cluster, the database will be available for reads and writes regardless of any quorum issues.

Of course, the Security database must also be available in the cluster for the cluster to function.

Forest Availability: Simple Case

In the simplest case, if you have a forest that is not configured with either local disk failover or shared disk failover, then as long as the forest's host is online and exists in the cluster, the forest will be available regardless of any quorum issues.

To explain this case in more detail, consider a 3-node MarkLogic cluster containing 3 hosts (let's call them host-a, host-b and host-c). We initialize host-a as the primary host (the first host set up in the cluster, and the host containing the master Security database) and then join host-b and host-c to host-a to complete the cluster.

Shortly after that, if we shut down both joiner hosts (host-b and host-c) so that only host-a remains online, we would see a chain of messages in the primary host's ErrorLog indicating that there is no longer a quorum within the cluster:

2020-05-21 01:19:14.632 Info: Detected quorum (3 online, 1 suspect, 0 offline)
2020-05-21 01:19:18.570 Warning: Detected suspect quorum (3 online, 2 suspect, 0 offline)
2020-05-21 01:19:29.715 Info: Disconnecting from domestic host host-b.example.marklogic.com because it has not responded for 30 seconds.
2020-05-21 01:19:29.715 Info: Disconnected from domestic host host-b.example.marklogic.com
2020-05-21 01:19:29.715 Info: Detected suspect quorum (2 online, 1 suspect, 1 offline)
2020-05-21 01:19:33.668 Info: Disconnecting from domestic host host-c.example.marklogic.com because it has not responded for 30 seconds.
2020-05-21 01:19:33.668 Info: Disconnected from domestic host host-c.example.marklogic.com
2020-05-21 01:19:33.668 Warning: Detected no quorum (1 online, 0 suspect, 2 offline)

Under these circumstances, we would be able to access the host's admin GUI on port 8001 and it would respond without issue. We would be able to access Query Console on that host on port 8000 and would be able to inspect the primary host's databases. We would also be able to access the Monitoring History on port 8002 - all directly from the primary host.

In this scenario, because the primary host remains online and the joining hosts are offline; and because we have not yet set up failover anywhere, there is no requirement for quorum, so host-a remains accessible.

If host-a also happened to have a database with forests that only resided on that host, these would be available for queries at this time. However, this is a fairly limited use case because in general, if you have a 3-node cluster, you would have a database whose forests reside on all three hosts in the cluster with failover forests configured on alternating hosts.

As soon as you do this, losing one host has different consequences depending on your configuration. If you don't have failover configured, the database becomes unavailable (due to a crucial forest being offline). If you have failover forests configured, you would still be able to access the database on the remaining two hosts.

However, if you then shut down another host, you would lose quorum (which is a requirement for failover).
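The rules in this simple case can be written down as a small model. The Python sketch below covers only the no-failover situation described above: a quorum is more than 50% of hosts, and a database is available when every one of its forests sits on an online host. The function names and the forest-to-host mapping are hypothetical, and the model deliberately ignores failover, which additionally requires a quorum.

```python
def has_quorum(online_hosts, total_hosts):
    """A quorum is strictly more than 50% of the hosts in the cluster."""
    return online_hosts * 2 > total_hosts


def database_available(forest_hosts, online):
    """True when every forest of the database sits on a host that is online.

    forest_hosts maps forest name -> host name; online is a set of host names.
    """
    return all(host in online for host in forest_hosts.values())
```

With only host-a online in the 3-node example, has_quorum(1, 3) is False, yet database_available({"Security": "host-a"}, {"host-a"}) is True, which is why the Admin interface and Query Console on host-a keep responding.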

Forest Availability: Local Disk Failover

For forests configured for local disk failover, the sequence of events is important:

In response to a host failure that makes an "open" forest inaccessible, the forest will fail over to the configured forest replica as long as a quorum exists and the configured replica forest was in the "sync replicating" state. In this case, the configured replica forest will transition to the "open" state; the configured replica forest becomes the acting master forest and is available to the database for both reads and writes.

Additionally, an "open" forest will not go offline in response to another host being evicted from the cluster.

However, once cluster quorum is lost, forest failovers will no longer occur.

Conclusion

Depending on how your forests are distributed in the cluster and depending on the order of host failures, it is possible that a database can remain online even when there is no longer a quorum of hosts in the cluster.

Of course, databases with many forests spread across many hosts typically can't stay online if you lose quorum because some forest(s) will become unavailable.

Recommendation

Even though it is possible to have a functioning cluster with less than a quorum of hosts online, you should not architect your high availability solution to depend on it.

Debug level Error Log messages "Detected indexes for ..."

Introduction

If your MarkLogic Server has its logging level set to "Debug", it's common to see a chain of 'Detecting' and 'Detected' messages that look like this in your ErrorLogs:

2015-01-27 11:11:04.407 Debug: Detected indexes for database Documents: ss, fp, fcs, fds, few, fep, sln
2015-01-27 11:11:04.407 Debug: Detecting compatibility for database Documents
2015-01-27 11:11:04.407 Debug: Detected compatibility for database Documents

This message will appear immediately after forests are unmounted and subsequently remounted by MarkLogic Server. Detecting indexes is a relatively lightweight operation and usually has minimal impact on performance.

What would cause the forests to be unmounted and remounted

Forest failovers
Heavy network activity leading to a cluster (XDQP) "Heartbeat" timeout
Changes made to forest configuration or indexes
Any incident that may cause a "Hung" message

Apart from the forest state changes (unmount/mount), this message can also appear due to other events requiring index detection.

What are "Hung" messages?

Whenever you see a "Hung" message it's very often indicative of a loss of connection to the IO subsystem (especially the case when forests are mounted on network attached storage rather than local disk). Hung messages are explained in a little more detail in this Knowledgebase article:
https://help.marklogic.com/Knowledgebase/Article/View/35/0/hung-messages-in-the-errorlog

What do the "Detected" messages mean and what can I do about them?

Whenever you see a group of "Detecting" messages such as:

2015-01-14 13:06:26.016 Debug: Detecting indexes for database XYZ

it means there was an event where MarkLogic chose to (or was required to) unmount and remount forests (and the event may also be evident in your ErrorLogs).

The detecting index message will occur soon after a remount, indicating that MarkLogic Server is examining forest data to check whether any reindexing work is required for all databases available to the node which have Forests attached:

2015-01-14 13:06:26.687 Debug: Detected indexes for database XYZ: ss, wp, fp, fcs, fds, ewp, evp, few, fep

This "Detected indexes" line indicates that the scan has been completed and that the database has been identified as having been configured with a number of indexes. For the line above, these are:

ss: stemmed searches
wp: word positions
fp: fast phrase searches
fcs: fast case sensitive searches
fds: fast diacritic sensitive searches
ewp: element word positions
evp: element value positions
few: fast element word searches
fep: fast element phrase searches

From this list, we are able to determine which indexes were detected. These messages will occur after every remount if you have index detection set to automatic in the database configuration.
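When reviewing many ErrorLog files, it can help to expand these abbreviations automatically. The following is a minimal Python sketch that assumes the log line format shown above; the function name and the regular expression are illustrative only, and the lookup table contains just the abbreviations listed in this article (other codes may exist).

```python
import re

# Abbreviations exactly as listed in this article; other codes may exist.
INDEX_ABBREVIATIONS = {
    "ss": "stemmed searches",
    "wp": "word positions",
    "fp": "fast phrase searches",
    "fcs": "fast case sensitive searches",
    "fds": "fast diacritic sensitive searches",
    "ewp": "element word positions",
    "evp": "element value positions",
    "few": "fast element word searches",
    "fep": "fast element phrase searches",
}

# Assumed line shape: "<timestamp> Debug: Detected indexes for database <name>: <codes>"
DETECTED = re.compile(
    r"Debug: Detected indexes for database (?P<db>\S+): (?P<codes>.+)$"
)


def expand_detected_indexes(line):
    """Return (database, [descriptions]) for a 'Detected indexes' line, else None."""
    match = DETECTED.search(line.strip())
    if match is None:
        return None
    codes = [code.strip() for code in match.group("codes").split(",")]
    names = [INDEX_ABBREVIATIONS.get(code, f"unknown ({code})") for code in codes]
    return match.group("db"), names


line = (
    "2015-01-14 13:06:26.687 Debug: Detected indexes for database XYZ: "
    "ss, wp, fp, fcs, fds, ewp, evp, few, fep"
)
print(expand_detected_indexes(line))
```

Lines that do not match (for example the "Detecting indexes" line) return None, so the function can be applied to every line of a log file.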

Every time the forest is remounted, in addition to a recovery process (where the Journals are scanned to ensure that all transactions logged were safely committed to on-disk stands), there are a number of other tests the server will do. These are configured with three options at database level:

format compatibility
index detection
expunge locks

By default, these three settings are configured with the "automatic" setting (in MarkLogic 7), so if you have logging set to "Debug" level, you'll know that these options are being worked through on remount:

2015-01-14 13:06:26.016 Debug: Detecting indexes for database XYZ (represents the task for "automatic" index detection where the reindexer checks for configuration changes)
2015-01-14 13:06:26.687 Debug: Detecting compatibility for database XYZ (represents the task for "automatic" format compatibility where the on-disk stand format is detected)

These default values may change across releases of MarkLogic Server. In MarkLogic 8, expunge locks is set to none, but the other two are still set to automatic.

Can these values be changed safely and what happens if I change these?

Unmounting / remounting times can be made much shorter by configuring these settings away from automatic, but there are some caveats involved. Setting format compatibility to 5.0 (the on-disk stand format is still 5.0 as of MarkLogic 8) should cause the "Detecting compatibility" messages to disappear and speed up remount times. However, the on-disk stand format is likely to change in a future release of the product, so keep this in mind if you need to upgrade.

The same is true for disabling index detection, but it's important to note that changes to index settings on the database will then no longer cause the reindexer to perform any checks on remount; you would need to re-enable index detection for changes to database index settings to be reindexed.

Decimal division error XDMP-DECOVRFLW

Introduction

Division operations involving integer or long datatypes may generate XDMP-DECOVRFLW in MarkLogic 7. This is the expected behavior but it may not be obvious upon initial inspection.

For example, two similar queries with different input values, executed in Query Console on a Linux/Mac machine running MarkLogic 7, give the following results:

1. This query returns correct results

let $estimate := xs:unsignedLong("220")

let $total := xs:unsignedLong("1600")

return $estimate div $total * 100

==> 13.75

2. This query returns the XDMP-DECOVRFLW error

let $estimate := xs:unsignedLong("227")

let $total := xs:unsignedLong("1661")

return $estimate div $total * 100

==> ERROR : XDMP-DECOVRFLW: (err:FOAR0002)

Details

The following defines relevant behaviors in MarkLogic 7 and previous releases.

In MarkLogic 7, if all the operands involved in a div operation are integer, long or integer sub-types in XML, then the resulting value of the div operation is stored as xs:decimal.
In versions previous to MarkLogic 7, if an xs:decimal value was large and occupied all digits, it was implicitly cast to xs:double for further operations; beginning with MarkLogic 7, this implicit casting no longer occurs.
xs:decimal can accommodate 18 digits as a datatype.
In MarkLogic 7 on Linux & Mac, xs:decimal can occupy all digits depending upon the actual value ( 227 div 1661 = 0.1366646598434677905 ); here all 18 digits are occupied in xs:decimal.
MarkLogic 7 on Windows does not perform division with full decimal precision ( 227 div 1661 produces 0.136664659843468 ); as a result, not all 18 digits are occupied in xs:decimal.
MarkLogic 7 generates an overflow exception (FOAR0002) when an operation is performed on an xs:decimal that is already at full decimal precision.

In the example above, multiplying the result by 100 gives an error on Linux/Mac, while it's OK on Windows.
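One way to picture why only the full-precision value overflows is the toy model below. It is purely an assumption made for illustration and is not taken from MarkLogic's implementation: it pretends a decimal is held as an unsigned 64-bit integer mantissa at a fixed scale, so multiplying by 100 overflows once the mantissa no longer fits. The function name and the limit are hypothetical; the input values are the ones quoted above.

```python
from decimal import Decimal

# ASSUMPTION: toy model only, not MarkLogic's actual representation.
# A decimal is pictured as an unsigned 64-bit mantissa at a fixed scale.
MAX_MANTISSA = 2**64 - 1


def multiply_fits(value: str, factor: int) -> bool:
    """True if value * factor still fits the toy mantissa at the same scale."""
    _sign, digits, _exponent = Decimal(value).as_tuple()
    mantissa = int("".join(str(d) for d in digits))
    return mantissa * factor <= MAX_MANTISSA


print(multiply_fits("0.1375", 100))                 # 220 div 1600
print(multiply_fits("0.1366646598434677905", 100))  # 227 div 1661 on Linux/Mac
print(multiply_fits("0.136664659843468", 100))      # 227 div 1661 on Windows
```

In this model only the Linux/Mac result of 227 div 1661 fails to fit, which mirrors the behavior described above.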

Recommendations:

We recommend using xs:double for all division-related operations, so that the resulting value is explicitly cast to a larger datatype.

For example, both of these will return results:

xs:double($estimate) div $total * 100

$estimate div $total * xs:double(100)

Deploying MarkLogic in AWS with Terraform

Summary

Terraform from HashiCorp is a deployment tool that many organizations use to manage their infrastructure as code. It is platform agnostic, allowing for the deployment and configuration of on-site physical infrastructure, as well as cloud infrastructure such as AWS, Azure, VSphere and more.

Terraform uses the HashiCorp Configuration Language (HCL) to allow for concise descriptions of infrastructure. HCL is a JSON-compatible language, and was designed to be both human and machine friendly.

This powerful tool can be used to deploy a MarkLogic Cluster to AWS using the MarkLogic CloudFormation Template. The MarkLogic CloudFormation Template is the preferred method recommended by MarkLogic for building out MarkLogic clusters within AWS.

Setting Up Terraform

For the purpose of this example, I will assume that you have already installed Terraform, the AWS CLI and you have configured the credentials. You will also need to have a working directory that has been initialized using terraform init.

Terraform Providers

Terraform uses Providers to provide access to different resources. The Provider is responsible for understanding API interactions and exposing resources. The AWS Provider is used to provide access to AWS resources.

Terraform Resources

Resources are the most important part of the Terraform language. Resource blocks describe one or more infrastructure objects, like compute instances and virtual networks.

The aws_cloudformation_stack resource allows Terraform to create a stack from a CloudFormation template.

Choosing a Template

MarkLogic provides two templates for creating a managed cluster in AWS.

MarkLogic cluster in new VPC
MarkLogic cluster in an existing VPC

I've chosen to deploy my cluster to an existing VPC. When deploying to an existing VPC, you will need to gather the VPC ID, as well as the Subnet IDs for the public and private subnets.

Defining Variables

The MarkLogic CF Template takes a number of input variables, including the region, availability zones, instance types, EC2 keys, encryption keys, licenses and more. We have to define our variables so they can be used as part of the resource.

Variables in HCL can be declared in a separate file, which allows for deployment flexibility. For instance, you can create a Development resource and a Production resource from the same configuration, using different variable files.

Here is a snippet from our variables file:

variable "cloudform_resource_name" {
  type    = string
  default = "Dev-Cluster-CF"
}
variable "stack_name" {
  type    = string
  default = "Dev-Cluster"
}
variable "ml_version" {
  type    = string
  default = "10.0-4"
}
variable "availability_zone_names" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
...

In the snippet above, you'll notice that we've defined the availability_zone_names as a list. The MarkLogic CloudFormation template won't take a list as an input, so later we will join the list items into a string for the template to use.

This also applies to any of the other lists defined in the variable files.

Using the CloudFormation Resource

So now we need to define the resource in HCL that will allow us to deploy a CloudFormation template to create a new stack.

The first thing we need to do is tell Terraform which provider we will be using, defining some default options:

    provider "aws" {
      profile = "default"
      # Static credentials can be supplied instead of a named profile:
      # access_key = var.access_key
      # secret_key = var.secret_key
      region  = var.aws_region
    }

Next, we need to define the `aws_cloudformation_stack` configuration options, setting the variables that will be passed in when the stack is created:

    resource "aws_cloudformation_stack" "marklogic" {
      name         = var.cloudform_resource_name
      capabilities = ["CAPABILITY_IAM"]

      parameters = {
        IAMRole             = var.iam_role
        AdminUser           = var.ml_admin_user
        AdminPass           = var.ml_admin_password
        Licensee            = "My User - Development"
        LicenseKey          = "B581-REST-OF-LICENSE-KEY"
        VolumeSize          = var.volume_size
        VolumeType          = var.volume_type
        VolumeEncryption    = var.volume_encryption
        VolumeEncryptionKey = var.volume_encryption_key
        InstanceType        = var.instance_type
        SpotPrice           = var.spot_price
        # EC2 key pair name (the variable name key_name is assumed here)
        KeyName             = var.key_name
        # The template expects comma-separated strings, so lists are joined
        AZ                  = join(",", var.availability_zone_names)
        LogSNS              = var.log_sns
        NumberOfZones       = var.number_of_zones
        NodesPerZone        = var.nodes_per_zone
        VPC                 = var.vpc_id
        PublicSubnets       = join(",", var.public_subnets)
        PrivateSubnets      = join(",", var.private_subnets)
      }

      template_url = "${var.template_base_url}${var.ml_version}/${var.template_file_name}"
    }
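To make the two computed values in the resource easier to check, here is a small Python sketch of what the join and the template_url interpolation evaluate to. The availability zones and ml_version come from the variables snippet above; the template_base_url and template_file_name values are assumptions for illustration, since they are not shown in that snippet.

```python
# Values for ml_version and availability_zone_names mirror the variables
# snippet; the two template_* values are ASSUMED for illustration only.
variables = {
    "ml_version": "10.0-4",
    "availability_zone_names": ["us-east-1a", "us-east-1b", "us-east-1c"],
    "template_base_url": "https://marklogic-template-releases.s3.amazonaws.com/",
    "template_file_name": "mlcluster.template",
}


def join_list(separator, items):
    """Python equivalent of Terraform's join(separator, list)."""
    return separator.join(items)


# Equivalent of the AZ parameter: a single comma-separated string.
az_parameter = join_list(",", variables["availability_zone_names"])

# Equivalent of the template_url string interpolation.
template_url = (
    f"{variables['template_base_url']}{variables['ml_version']}/"
    f"{variables['template_file_name']}"
)

print(az_parameter)
print(template_url)
```

Printing these values before running Terraform is a quick way to confirm that the base URL ends with a slash and that the list parameters contain no stray spaces.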

Deploying the Cluster

Now that we have defined our variables and our resources, it's time for the actual deployment.

$> terraform apply

This will show us the work that Terraform is going to attempt to perform, along with the settings that have been defined so far.

Once we confirm that things look correct, we can go ahead and apply the resource.

Now we can check the AWS Console to see our stack.

And we can also use the ELB to log in to the Admin UI.

Wrapping Up

We have now deployed a 3 node cluster to an existing VPC using Terraform. The cluster is now ready to have our Data Hub, or other application installed.

Deploying MarkLogic to AWS Using Ansible

Deploying MarkLogic in AWS with Ansible

Summary

Ansible, owned by Red Hat, is an open source provisioning, configuration and application deployment tool that many organizations use to manage their infrastructure as code. Unlike options such as Chef and Puppet, it is agentless, utilizing SSH to communicate between servers. Ansible also does not need a central host for orchestration; it can run from nearly any server, desktop or laptop. It supports many different platforms and services, allowing for the deployment and configuration of on-site physical infrastructure, as well as cloud and virtual infrastructure such as AWS, Azure, VSphere, and more.

Ansible uses YAML as its configuration management language, making it easier to read than other formats. Ansible also uses Jinja2 for templating to enable dynamic expressions and access to variables.

Ansible is a flexible tool that can be used to deploy a MarkLogic Cluster to AWS using the MarkLogic CloudFormation Template. The MarkLogic CloudFormation Template is the preferred method recommended by MarkLogic for building out MarkLogic clusters within AWS.

Setting Up Ansible

For the purpose of this example, I will assume that you have already installed Ansible, the AWS CLI, and the necessary python packages needed for Ansible to talk to AWS. If you need some help getting started, Free Code Camp has a good tutorial on setting up Ansible with AWS.

Inventory Files

Ansible uses Inventory files to help determine which servers to perform work on. They can also be used to customize settings for individual servers or groups of servers. For our example, we have set up our local system with all the prerequisites, so we need to tell Ansible how to treat the local connections. For this demonstration, here is my inventory, which I've named hosts:

[local]

localhost              ansible_connection=local

Ansible Modules

Ansible modules are discrete units of code that are executed on a target. The target can be the local system, or a remote node. The modules can be executed from the command line, as an ad-hoc command, or as part of a playbook.

Ansible Playbooks

Playbooks are Ansible's configuration, deployment and orchestration language. Playbooks are how the power of Ansible and its modules is extended from basic configuration or management all the way to complex, multi-tier infrastructure deployments.

Choosing a Template

MarkLogic provides two templates for creating a managed cluster in AWS.

MarkLogic cluster in new VPC
MarkLogic cluster in an existing VPC

I've chosen to deploy my cluster to an existing VPC. When deploying to an existing VPC, you will need to gather the VPC ID, as well as the Subnet IDs for the public and private subnets.

Defining Variables

Variables in Ansible can be declared in a separate file, which allows for deployment flexibility.

Here is a snippet from our variables file:

# vars file for marklogic template and version
ml_version: '10.0-latest'
template_file_name: 'mlcluster.template'
template_base_url: 'https://marklogic-template-releases.s3.amazonaws.com/'

# CF Template Deployment Variables
aws_region: 'us-east-1'
stack_name: 'Dev-Cluster-An3'
IAMRole: 'MarkLogic'
AdminUser: 'admin'

...

Using the CloudFormation Module

So now we need to create our playbook, and choose the module that will allow us to deploy a CloudFormation template to create a new stack. The cloudformation module allows us to create a CloudFormation stack.

Next, we need to define the cloudformation configuration options, setting the variables that will be passed in when the stack is created.

# Use a template from a URL
- name: Ansible Test
  hosts: local
  vars_files:
    - ml-cluster-vars.yml
  tasks:
    - cloudformation:
        stack_name: "{{ stack_name }}"
        state: "present"
        region: "{{ aws_region }}"
        capabilities: "CAPABILITY_IAM"
        disable_rollback: true
        template_url: "{{ template_base_url+ml_version+'/'+ template_file_name }}"
      args:
        template_parameters:
          IAMRole: "{{ IAMRole }}"
          AdminUser: "{{ AdminUser }}"
          AdminPass: "{{ AdminPass }}"
          Licensee: "{{ Licensee }}"
          LicenseKey: "{{ LicenseKey }}"
          KeyName: "{{ KeyName }}"
          VolumeSize: "{{ VolumeSize }}"
          VolumeType: "{{ VolumeType }}"
          VolumeEncryption: "{{ VolumeEncryption }}"
          VolumeEncryptionKey: "{{ VolumeEncryptionKey }}"
          InstanceType: "{{ InstanceType }}"
          SpotPrice: "{{ SpotPrice }}"
          AZ: "{{ AZ | join(', ') }}"
          LogSNS: "{{ LogSNS }}"
          NumberOfZones: "{{ NumberOfZones }}"
          NodesPerZone: "{{ NodesPerZone }}"
          VPC: "{{ VPC }}"
          PrivateSubnets: "{{ PrivateSubnets | join(', ') }}"
          PublicSubnets: "{{ PublicSubnets | join(', ') }}"
        tags:
          Stack: "ansible-test"

Deploying the cluster

Now that we have defined our variables and created our playbook, it's time for the actual deployment.

ansible-playbook -i hosts ml-cluster-playbook.yml -vvv

The -i option allows us to reference the inventory file we created. As the playbook runs, it will output as it starts and finishes tasks in the playbook.

PLAY [Ansible Test] ************************************************************************************************************

TASK [Gathering Facts] *********************************************************************************************************

ok: [localhost]

TASK [cloudformation] **********************************************************************************************************

changed: [localhost]

When the playbook finishes running, it will print out a recap which shows the overall results of the play.

PLAY RECAP *********************************************************************************************************************

localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

This recap tells us that 2 tasks ran successfully, that they resulted in 1 change, and that no tasks failed, which is our sign that things worked.
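If the playbook is run from a script or a CI job, the recap line can also be checked programmatically. The following Python sketch assumes the recap format shown above; the function names are hypothetical, and in practice the exit code of ansible-playbook is the more robust signal.

```python
import re


def parse_recap_line(line):
    """Split a PLAY RECAP line into (host, {counter: value})."""
    host, _, counters = line.partition(":")
    stats = {
        key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", counters)
    }
    return host.strip(), stats


def play_succeeded(stats):
    """Treat a host as healthy when nothing failed and it was reachable."""
    return stats.get("failed", 0) == 0 and stats.get("unreachable", 0) == 0


recap = (
    "localhost                  : ok=2    changed=1    unreachable=0    "
    "failed=0    skipped=0    rescued=0    ignored=0"
)
host, stats = parse_recap_line(recap)
print(host, stats, play_succeeded(stats))
```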

If we want to see more information as the playbook runs, we can add one of the verbose flags (-v or -vvv) to provide more information about the parameters the script is running with, and the results.

Now we can check the AWS Console to see our stack.

And we can also use the ELB to log in to the Admin UI.

Wrapping Up

We have now deployed a 3 node cluster to an existing VPC using Ansible. The cluster is now ready to have our Data Hub, or other application installed. We can now use the git module to get our application code, and deploy our code using ml-gradle.

DevOps with MarkLogic Server

Introduction

According to Wikipedia, DevOps is a set of practices that combines software development (Dev) and IT operations (Ops) with the goal of shortening the Systems Development Lifecycle, and providing continuous delivery with high software quality. This KB will provide some guidance for system deployment and configuration, which can be integrated into an organization's DevOps processes.

For more information on using MarkLogic as part of a Continuous Integration/Continuous Delivery process, see the KB Deployment and Continuous Integration Tools.

Deploying a Cluster

Deploying a MarkLogic cluster that will act as the target environment for the application code being developed is one piece of the DevOps puzzle. The approach that is chosen will depend on many things, including the tooling already in use by an organization, as well as the infrastructure that will be used for the deployment. We will cover two of the most common environments, On-Premise and Cloud.

On-Premise Deployments

On-Premise deployments, which can include using bare metal servers, or Virtual Machine infrastructure (such as VMware), are one common environment. You can deploy a cluster to an on-premise environment using tools such as shell scripts, or Ansible. In the Scripting Administrative Tasks Guide, there is a section on Scripting Cluster Management, which provides some examples of how a cluster build can be automated.

Once the cluster is deployed, some of the specific configuration tasks that may need to be performed on the cluster can be done using the Management API.

Cloud Deployments

Cloud deployments utilize flexible compute resources provided by vendors such as Amazon Web Services (AWS), or Microsoft Azure.

For AWS, MarkLogic provides an example CloudFormation template that can be used to deploy a cluster to Amazon's AWS EC2 environment. Tools like the AWS Command Line Interface (CLI), Terraform or Ansible can be used to extend the MarkLogic CloudFormation template, and automate the process of creating a cluster in the AWS EC2 environment. The template can be used to deploy a cluster using the AWS CLI. The template can also be used to Deploy a Cluster Using Terraform, or it can be used to Deploy a Cluster Using Ansible.

For Azure, MarkLogic has provided Solution Templates for Azure which can be extended for automated deployments using the Azure CLI, Terraform or Ansible.

As with the on-premise deployments, configuration tasks can be performed on the cluster using the Management API.

Summary

This is just a brief introduction into some aspects of DevOps processes for deploying and configuring a MarkLogic Cluster.

Diagnosing Rebalancer issues after adding or removing a forest

Summary:

After adding or removing a forest and its corresponding replica forest in a database, we have seen instances where the Rebalancer does not properly distribute the documents amongst existing and newly added forests.

In this particular instance, XDMP-HASHLOCKINGRETRY debug level messages were reported repeatedly in the error logs. The messages would look something like:

2016-02-11 18:22:54.044 Debug: Retrying HTTPRequestTask::handleXDBCRequest 83 because XDMP-HASHLOCKINGRETRY: Retry hash locking. Forests config hash does not match.

2016-02-11 18:22:54.198 Debug: Retrying ForestRebalancerTask::run P_initial_p2_01 50 because XDMP-HASHLOCKINGRETRY: Retry hash locking. Forests config hash does not match.
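To get a quick sense of how often these retries occur and which tasks are affected, the ErrorLog can be scanned with a short script. This Python sketch assumes the message format shown above; the function name and regular expression are illustrative only.

```python
import re
from collections import Counter

# Assumed shape: "... Debug: Retrying <task> <details> because XDMP-HASHLOCKINGRETRY: ..."
RETRY = re.compile(
    r"Debug: Retrying (?P<task>\S+) .* because XDMP-HASHLOCKINGRETRY"
)


def count_hash_locking_retries(lines):
    """Count XDMP-HASHLOCKINGRETRY retries per task name."""
    counts = Counter()
    for line in lines:
        match = RETRY.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


sample = [
    "2016-02-11 18:22:54.044 Debug: Retrying HTTPRequestTask::handleXDBCRequest 83 "
    "because XDMP-HASHLOCKINGRETRY: Retry hash locking. Forests config hash does not match.",
    "2016-02-11 18:22:54.198 Debug: Retrying ForestRebalancerTask::run P_initial_p2_01 50 "
    "because XDMP-HASHLOCKINGRETRY: Retry hash locking. Forests config hash does not match.",
]
print(count_hash_locking_retries(sample))
```

To run it against a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines.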

Diagnosing

Gather statistics about the rebalancer to see the number of documents being scheduled. If you run attached script “rebalancer-preview.xqy” in the query console of your MarkLogic Server cluster, it will produce rebalancer statistics in tabular format.

Note that you must first change the database name (YourDatabaseNameOnWhichNewForestsHaveBeenAdded) on the 3rd line of the XQuery script “rebalancer-preview.xqy”:

declare variable $DATABASE as xs:string := xdmp:get-request-field("db", "YourDatabaseNameOnWhichNewForestsHaveBeenAdded");

If experiencing this issue, the newly added forests will show zero in the “Total to be moved” column in the generated html page.

Resolving

Perform a cluster-wide restart in order to get past this issue. The restart is required to reload all of the configuration files across the cluster. The rebalancer will also check to see if additional rebalancing work needs to occur. The rebalancer should now work as expected and the XDMP-HASHLOCKINGRETRY messages should no longer appear in the logs. If you run the rebalancer-preview.xqy script again, the statistics should now show the number of documents being scheduled to be moved.

You can also validate the rebalancer status from the Database Status page in the Admin UI.

The XDMP-HASHLOCKINGRETRY rebalancer issue has been fixed in the latest MarkLogic Server releases. However, the rebalancer-preview.xqy script can be used to help diagnose other perceived issues with the Rebalancer.

Difference between cts:contains and fn:contains

Search fundamentals

Difference between cts:contains and fn:contains

1) fn:contains is a substring match, whereas cts:contains performs query matching

2) cts:contains can therefore utilize general queries and stemming, whereas fn:contains cannot

For example:-

Example.xml

<test>daily running makes you fit</test>

fn:contains(fn:doc("Example.xml"),"ning")

True

cts:contains(fn:doc("Example.xml"),"ning")

False

fn:contains(fn:doc("Example.xml"),"ran")

False

cts:contains(fn:doc("Example.xml"),"ran")

True

Note:-

The cts:contains examples check the document against a cts:word-query. Stemming reduces words down to their root, allowing for smaller term lists.

1) Words from different languages are treated differently, and will not stem to the same root word entry from another language.

2) Note: Nouns will not stem to verbs and vice versa. For example, the word “runner” will not stem to “run”.
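The difference between the two functions can be pictured with a small Python analogy. This is only a sketch: the tiny stem table below is a hand-written stand-in for a real stemmer and is an assumption for illustration, as are the function names; it does not reproduce how MarkLogic indexes or stems text.

```python
import re

# ASSUMPTION: a tiny hand-written stem table standing in for a real stemmer.
TOY_STEMS = {"running": "run", "ran": "run", "runs": "run"}


def stem(word):
    """Map a word to its (toy) stem; unknown words stem to themselves."""
    word = word.lower()
    return TOY_STEMS.get(word, word)


def substring_contains(text, fragment):
    """Rough analogue of fn:contains: a plain substring test."""
    return fragment in text


def word_contains(text, term):
    """Rough analogue of a stemmed word query: compare whole-word stems."""
    words = re.findall(r"\w+", text)
    return stem(term) in {stem(word) for word in words}


text = "daily running makes you fit"
print(substring_contains(text, "ning"), word_contains(text, "ning"))
print(substring_contains(text, "ran"), word_contains(text, "ran"))
```

The substring test finds "ning" inside "running" but knows nothing about "ran", while the word-based test ignores partial words but matches "ran" because it shares a stem with "running".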


Duplicate Documents

Introduction

In the more recent versions of MarkLogic Server, there are checks in place to prevent the loading of invalid documents (such as documents with multiple root nodes). However, documents loaded in earlier versions of MarkLogic Server can now result in duplicate URI or duplicate document errors being reported.

Additionally, under normal operating conditions, a document/URI is saved in a single forest. If the load process is somehow compromised, the user may see issues such as duplicate URIs (i.e. the same URI in different forests) and duplicate documents (i.e. the same document/URI in the same forest).

Resolution

If the XDMP-DBDUPURI (duplicate URI) error is encountered, refer to our KB article "Handling XDMP-DBDUPURI errors" for procedures to resolve.

If one doesn't see XDMP-DBDUPURI errors but running fn:doc() on a document returns multiple nodes, then it could be a case of duplicate documents in the same forest.

To check that the problem is actually duplicate documents, one can run either xdmp:describe(fn:doc(...)) or fn:count(fn:doc(...)). If these return more than one document, e.g. xdmp:describe(fn:doc("/testdoc.xml")) returns (fn:doc("/testdoc.xml"), fn:doc("/testdoc.xml")) or fn:count(fn:doc("/testdoc.xml")) returns 2, then the problem is duplicate documents in the same forest (and not duplicate URIs).

To fix duplicate documents, the document will need to be reloaded.

Before reloading, you can take a look at the two versions to see if there is a difference. Compare fn:doc("/testdoc.xml")[1] with fn:doc("/testdoc.xml")[2] to see whether they differ, and decide which one you want to reload.

If there is a difference, that may also point to the operation that created the situation.

Effects of case sensitivity of search term on search score

Introduction

This article discusses the effect of a search term's case sensitivity on the search score, and thus on the final order of search results, for a secondary query that uses cts:boost-query and a weight. A case-insensitive word term is treated as the lowercase word term, so there is no difference in frequencies and scores between a case-insensitive search term (in any case) and a lowercase search term, whether the lowercase term is used with the "case-sensitive" option or with neither "case-sensitive" nor "case-insensitive" specified. If neither "case-sensitive" nor "case-insensitive" is present, the text of the search term is used to determine case sensitivity.

Understanding relevance score

In MarkLogic, search results are returned in relevance order: the most relevant results are first in the result sequence and the least relevant are last.
More details on the relevance score and its calculation are available at https://docs.marklogic.com/guide/search-dev/relevance

Of the many ways to control the relevance score, one is to use a secondary query to boost it: https://docs.marklogic.com/guide/search-dev/relevance#id_30927 . This article uses examples of such secondary queries to show the impact of the text case (upper, lower or unspecified) of search terms on relevance scores and on the order of results returned.

A few examples to understand this scenario

Consider a few scenarios in which the queries below try to boost certain search results using cts:boost-query and a weight for the word "washington".

Example 1: Search with lowercase search term and option for case not specified

Query1:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";

for $hit in
( cts:search(
fn:doc()/test,

cts:boost-query(cts:element-word-query(xs:QName("test"),"George" ),
cts:element-word-query(xs:QName("test"),"washington",(), 10.0) )
)
)

return element hit {
attribute score { cts:score($hit) },
attribute fit { cts:fitness($hit) },
attribute conf { cts:confidence($hit) },
$hit
}

Results for Query1:
<hit score="28276" fit="0.9393904" conf="0.2769644">
<test>Washington, George... </test>
</hit>
...
...
<hit score="16268" fit="0.7125317" conf="0.2100787">
<test>George washington was the first President of the United States of America...</test>
</hit>
...

Example 2: Search with lowercase search term and case-sensitive option

Query2:
xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";

for $hit in
( cts:search(
fn:doc()/test,

cts:boost-query(cts:element-word-query(xs:QName("test"),"George" ),
cts:element-word-query(xs:QName("test"),"washington",("case-sensitive"), 10.0) )
)
)

return element hit {
attribute score { cts:score($hit) },
attribute fit { cts:fitness($hit) },
attribute conf { cts:confidence($hit) },
$hit
}

Results for Query2:
<hit score="28276" fit="0.9393904" conf="0.2769644">
<test>Washington, George... </test>
</hit>
...
...
<hit score="16268" fit="0.7125317" conf="0.2100787">
<test>George washington was the first President of the United States of America...</test>
</hit>
...

Example 3: Search with uppercase search term and option case-insensitive, in cts:boost-query like below with rest of query similar to above queries

Query3:

cts:boost-query(cts:element-word-query(xs:QName("test"),"George" ),
cts:element-word-query(xs:QName("test"),"Washington",("case-insensitive"), 10.0) )

Results for Query3:
<hit score="28276" fit="0.9393904" conf="0.2769644">
<test>Washington, George... </test>
</hit>
...
...
<hit score="16268" fit="0.7125317" conf="0.2100787">
<test>George washington was the first President of the United States of America...</test>
</hit>
...

Clearly, the above queries produce the same scores, with the same fitness and confidence values. This is because the case-insensitive word term is treated as the lowercase word term, so there is no difference in the frequencies of those two terms (any-case/case-insensitive and lowercase/case-sensitive), and therefore no difference in scoring. Thus there is no difference in the scores of results for Query3 and Query2.
Where case sensitivity is not specified, the text of the search term is used to determine case sensitivity. For Query1, the text of the search term contains no uppercase characters, hence it is treated as "case-insensitive".

Now let us take a look at a query with an uppercase search term and the case-sensitive option.

Example 4: Search with uppercase search term and option case-sensitive, in cts:boost-query like below with rest of query similar to above queries

Query4:

cts:boost-query(cts:element-word-query(xs:QName("test"),"George" ),
cts:element-word-query(xs:QName("test"),"Washington",("case-sensitive"), 10.0) )

Results for Query4:
<hit score="44893" fit="0.9172696" conf="0.3489831">
<test>Washington, George was the first... </test>
</hit>
...
...
<hit score="256" fit="0.0692672" conf="0.0263533">
<test>George washington was the first President of the United States of America...</test>
</hit>
...

As we can clearly see, the scores have changed for the results of Query4, and thus the final order of results has changed as well.

Conclusion:

When using a secondary query with cts:boost-query and a weight to boost certain search results, it is important to understand the impact of the case of the search text on the result sequence. A case-insensitive word term is treated as the lowercase word term, so there is no difference in the frequencies of any-case/case-insensitive and lowercase/case-sensitive search terms, and therefore no difference in scoring. For a search term that contains uppercase characters and is used with the "case-sensitive" option, scores are boosted as expected in comparison with a "case-insensitive" search. If neither "case-sensitive" nor "case-insensitive" is present, the text of the search term is used to determine case sensitivity: if the text contains no uppercase characters, the search is treated as "case-insensitive"; if it contains uppercase characters, it is treated as "case-sensitive".
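The rule for resolving case sensitivity can be summarized in a few lines of Python. This is a sketch of the rule as described in this article, not of MarkLogic's implementation; the function names are hypothetical, and lookup_term is a simplification of "a case-insensitive term is treated as the lowercase term".

```python
def effective_case_option(term, options=()):
    """Resolve the case option the way this article describes it."""
    if "case-sensitive" in options:
        return "case-sensitive"
    if "case-insensitive" in options:
        return "case-insensitive"
    # No explicit option: the text of the term decides.
    if any(ch.isupper() for ch in term):
        return "case-sensitive"
    return "case-insensitive"


def lookup_term(term, options=()):
    """Simplification: case-insensitive terms are treated as lowercase."""
    if effective_case_option(term, options) == "case-insensitive":
        return term.lower()
    return term


# The four boosting terms used in Query1 to Query4
print(lookup_term("washington"))                         # Query1
print(lookup_term("washington", ("case-sensitive",)))    # Query2
print(lookup_term("Washington", ("case-insensitive",)))  # Query3
print(lookup_term("Washington", ("case-sensitive",)))    # Query4
```

The first three calls resolve to the same lowercase term, which is why Query1, Query2 and Query3 score identically, while only Query4 looks up a distinct, capitalized term.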

Element Level Security (ELS) vs. Document Level Security (DLS)

Background

MarkLogic Server includes element level security (ELS), an addition to the security model that allows you to specify security rules on specific elements within documents. Using ELS, parts of a document may be concealed from users who do not have the appropriate roles to view them. ELS can conceal the XML element (along with properties and attributes) or JSON property so that it does not appear in any searches, query plans, or indexes - unless accessed by a user with appropriate permissions.

ELS protects XML elements or JSON properties in a document using a protected path, where the path to an element or property within the document is protected so that only roles belonging to a specific query roleset can view the contents of that element or property. You specify that an element is part of a protected path by adding the path to the Security database. You also then add the appropriate role to a query roleset, which is also added to the Security database.

ELS uses query rolesets to determine which elements will appear in query results. If a query roleset does not exist with the associated role that has permissions on the path, the role cannot view the contents of that path.

Notes:

A user with admin privileges can access documents with protected elements by using fn:doc to retrieve documents (instead of using a query). However, to see protected elements as part of query results, even a user with admin privileges will need to have the appropriate role(s).
ELS applies to both XML elements and JSON properties; so unless spelled out explicitly, 'element' refers to both XML elements and JSON properties throughout this article.

You can read more about how to configure Element Level Security here, and can see how this all works at this Element Level Security Example.

Node-update

One of the commonly used document level capabilities is 'update'. Be aware, however, that document level update is too powerful to be used with ELS permissions as someone with document level update privileges could update not only a node, but also delete the whole document. Consequently, a new document-level capability - 'node-update' - has been introduced. 'node-update' offers finer control when combined with ELS through xdmp:node-replace and xdmp:node-delete functions as they can be used to update/delete only the specified nodes of a document (and not the document itself in its entirety).

Document-level vs Element-level security

Unlike at the document-level:

'update' and 'node-update' capabilities are equivalent at the element-level. At the document-level, however, a user who has only the 'node-update' capability on a document cannot delete that document, whereas the 'update' capability allows the user to delete it.
'Read', 'insert' and 'update' are checked separately at the element level i.e.:
- read operations - only permissions with 'read' capability are checked
- node update operations - only permissions with 'node-update' (update) capability are checked
- node insert operations - only permissions with 'insert' capability are checked

Note: read, insert, update and node-update can all be used at the element-level i.e., they can be part of the protected path definition.

Permissions:

Document-level:

update: A node can be updated by any user that has an 'update' capability at the document-level
node-update: A node can be updated by any user with a 'node-update' capability as long as they have sufficient privileges at the element-level

Element-level:

If a node is protected but no 'update/node-update' capabilities are explicitly granted to any user, that node can be updated by any user as long as they have 'update/node-update' capabilities at the document-level
If any user is explicitly granted 'update/node-update' capabilities to that node at the element level, only that specific user is allowed to update/delete that node. Other users who are expected to have that capability must be explicitly granted that permission at the element level

How does node-replace/node-delete work?

When a node-replace/node-delete is called on a specific node:

The user trying to update that node must have at least a 'node-update' (or 'update') capability to all the nodes up until (and including) the root node
None of the descendant nodes of the node being replaced/deleted can be protected by different roles. If they are protected:
1. 'node-delete' isn’t allowed as deleting this node would also delete the descendant node which is supposed to be protected
2. 'node-replace' can be used to update the value (text node) of the node but replacing the node itself isn’t allowed

Note: If a caller has the 'update' capability at the document level, there is no need to do element-level permission checks since such a caller can delete or overwrite the whole document anyway.
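The update rules above can be summarized as a small decision function. The following Python sketch is a simplified model of the rules as described in this article, not the server's implementation: the function name, the data shapes, and the role names are all hypothetical, and the additional check on protected descendant nodes is deliberately left out.

```python
def can_update_node(user_roles, doc_permissions, path_update_roles):
    """Simplified model of the node-replace/node-delete permission check.

    user_roles:        set of role names held by the caller
    doc_permissions:   dict mapping a document-level capability
                       ("update", "node-update") to a set of roles
    path_update_roles: one set per node, from the target node up to the
                       root; each set holds the roles explicitly granted
                       update/node-update on a protected path covering
                       that node (empty set = no explicit grant)
    """
    user_roles = set(user_roles)

    # Document-level 'update' short-circuits all element-level checks.
    if user_roles & doc_permissions.get("update", set()):
        return True

    # Without 'update', the caller needs at least 'node-update' on the document.
    if not user_roles & doc_permissions.get("node-update", set()):
        return False

    # Every node on the way to the root must either have no explicit
    # grant, or grant the capability to one of the caller's roles.
    return all(
        (not granted) or bool(user_roles & granted)
        for granted in path_update_roles
    )
```

With `doc_permissions = {"update": {"editor"}, "node-update": {"contributor"}}`, a caller holding only "contributor" can update a node with no explicit element-level grants, but is rejected as soon as one node on the path is explicitly granted to a different role.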

Takeaways:

'node-update' was introduced to offer finer control with ELS, in contrast to the document level 'update'
'update' and 'node-update' permissions behave the same at element-level, but differently at the document-level
1. At document-level, 'update' is more powerful as it gives the user the permission to delete the entire document
2. All permissions talk to each other at document-level. In contrast, permissions are checked independently at the element-level
  1. At the document level, an update permission allows you to read, insert to and update the document
  2. At the element level, however, read, insert and update (node-update) are checked separately
    1. For read operations, only permissions with the read capability are checked
    2. For node update operations, only permissions with the node-update capability are checked
    3. For node insert operations, only permissions with the insert capability are checked (this is true even when compartments are used).
Can I use ELS without document level security (DLS)?
1. ELS cannot be used without DLS
2. Consider DLS the outer layer of defense, whereas ELS is the inner layer - you cannot get to the inner layer without passing through the outer layer
When to use DLS vs ELS?
1. ELS offers finer control on the nodes of a document and whether to use it or not depends on your use-case. We recommend not using ELS unless it is absolutely necessary as its usage comes with serious performance implications
2. In contrast, DLS offers better performance and works better at scale - but is not an ideal choice when you need finer control as it doesn’t allow node-level operations
How does ELS performance scale with respect to different operations?
1. Ingestion - depends on the number of protected paths
  1. During ingestion, the server inspects every node for ELS to do a hash lookup against the names of the last steps from all protected paths
  2. For every protected path that matches the hash, the server does a full test of the node against the path - the higher the number of protected paths, the higher the performance penalty
  3. While the hash lookup is very fast, the full test is comparatively much slower - and the corresponding performance penalty increases when there are a large number of nodes that match the last steps of the protected paths
    1. Consequently, we strongly recommend avoiding the use of wildcards at the leaf-level in protected paths
    2. For example: /foo/bar/* has a huge performance penalty compared to /foo/*/bar
2. Updates - as with ingestion, ELS performance depends on the number of protected paths
3. Query/Search - in contrast to ELS ingestion or update, ELS query performance depends on the number of query rolesets
  1. Because ELS query performance depends on the number of query rolesets, the concept of Protected PathSet was introduced in 9.0-4
  2. A Protected PathSet allows OR relationships between permissions on multiple protected paths that cover the same element
  3. Because query performance depends on the number of relevant query rolesets, it is highly recommended to use helper functions to obtain the query rolesets of nodes configured with element-level security
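
To see why a wildcard in the last step of a protected path is costly at ingestion time, consider the following toy model in Python. It is a sketch under stated assumptions, not MarkLogic's implementation: protected paths are indexed by the name of their last step, each node name is looked up in that index, and every candidate path triggers a (slow) full test. The function names and the sample paths are hypothetical.

```python
from collections import defaultdict

WILDCARD = "*"


def index_by_last_step(protected_paths):
    """Group protected paths by the name of their last step."""
    index = defaultdict(list)
    for path in protected_paths:
        index[path.rstrip("/").split("/")[-1]].append(path)
    return index


def path_matches(protected_path, node_path):
    """Full test: compare the paths step by step, honouring wildcards."""
    p_steps = protected_path.strip("/").split("/")
    n_steps = node_path.strip("/").split("/")
    return len(p_steps) == len(n_steps) and all(
        p == WILDCARD or p == n for p, n in zip(p_steps, n_steps)
    )


def count_full_tests(protected_paths, node_paths):
    """Return (full tests performed, nodes actually protected)."""
    index = index_by_last_step(protected_paths)
    full_tests = 0
    protected = 0
    for node_path in node_paths:
        name = node_path.rsplit("/", 1)[-1]
        # Fast lookup by node name; a leaf wildcard is a candidate for every node.
        candidates = index.get(name, []) + index.get(WILDCARD, [])
        full_tests += len(candidates)
        if any(path_matches(c, node_path) for c in candidates):
            protected += 1
    return full_tests, protected
```

For the six nodes `/foo`, `/foo/bar`, `/foo/bar/a`, `/foo/bar/b`, `/foo/baz` and `/foo/baz/bar`, the protected path `/foo/bar/*` forces a full test on all six nodes, while `/foo/*/bar` forces a full test only on the two nodes named `bar`.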

Summary

Does MarkLogic provide encryption at rest?

MarkLogic 9

MarkLogic 9 introduces the ability to encrypt 'data at rest' - data that is on media (on disk or in the cloud), as opposed to data that is being used in a process. Encryption can be applied to newly created files, configuration files, or log files. Existing data files can be encrypted by triggering a merge or re-index of the data.

For more information about using Encryption at Rest, see Encryption at Rest in the MarkLogic Security Guide.

MarkLogic 8 and Earlier releases

MarkLogic 8 does not provide support for encryption at rest for its own forests.

Memory consumption

Memory consumption patterns will be different when encryption is used:

To access unencrypted forest data MarkLogic normally uses memory-mapped files. When files are encrypted, MarkLogic instead decrypts the entire index to anonymous memory.
As a result, encrypted MarkLogic forests use more anonymous memory and less file-mapped memory than unencrypted forests.
Without encryption at rest, when available memory is low, the operating system can throw out file pages from the working set and later page them in directly from files. But with encryption at rest, when memory is low, the operating system must write them to swap.

Using Amazon S3 Encryption For Backups

If you are hosting your data locally, would like to back up to S3 remotely, and your goal is that there cannot possibly exist unencrypted copies of your data outside your local environment, then you could backup locally and store the backups to S3 with AWS Client-Side encryption. MarkLogic does not support AWS Client-Side encryption, so this would need to be a solution outside MarkLogic.

Encryption at Rest with an External KMS in MarkLogic

Introduction

Encryption at Rest with an external Key Management System (KMS), or keystore, offers additional security for your encryption keys, along with key management capabilities like automatic key rotation, key revocation, and key deletion.

If you want the ability to perform these tasks, you will need an external KMS.

MarkLogic Encryption at Rest supports Key Management Interoperability Protocol (KMIP) compliant KMS servers and Amazon's KMS.

Configuring Encryption at Rest with an external KMS

The following points should be taken into consideration when configuring Encryption at Rest and when choosing and sizing an external KMS:

The KMS system should be able to generate KMIP 1.2 compatible keys.
- If the KMS is unable to generate the keys, a custom process to generate the keys must be developed using a 3rd party tool (such as PyKMIP)
Memory consumption patterns will be different when encryption is used.
- To access unencrypted forest data MarkLogic normally uses memory-mapped files. When files are encrypted, MarkLogic instead decrypts them to anonymous memory.
- As a result, encrypted MarkLogic forests use more anonymous memory and less file-mapped memory than unencrypted forests.
- Without encryption at rest, when available memory is low, the operating system can throw out file pages from the working set and later page them in directly from files. But with encryption at rest, when memory is low, the operating system must write them to swap.
The KMS has to be sized appropriately to handle peak requests of the systems that will be using it.
- For MarkLogic the number of requests will depend on the encryption level, ingest rate, and server workload.
- MarkLogic supports encryption at any or all of the following levels: Cluster, Database, Log and Configuration levels.
- For MarkLogic the peak requests typically occur when a cluster first starts up, where the number of requests will be approximately 3X the number of encrypted stands.
- During normal operation, we observed an average of 1 query to the KMS for every 100MB ingested (accounting for Journal Files and Labels).
The KMS does not need to be sized based on the number of Transactions Per Second (TPS) on the MarkLogic cluster.
- MarkLogic will cache keys used for encryption for up to one hour. Consequently, the calls to the KMS are minimal during ingestion.
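
The sizing figures above lend themselves to a quick back-of-the-envelope calculation. The Python sketch below simply applies the two ratios quoted in this article (about 3 requests per encrypted stand at startup, and about 1 request per 100MB ingested); the function names are hypothetical and the results are rough estimates only.

```python
import math


def kms_startup_requests(encrypted_stands, requests_per_stand=3):
    """Approximate KMS requests when the cluster first starts up."""
    return requests_per_stand * encrypted_stands


def kms_ingest_requests(ingested_mb, mb_per_request=100):
    """Approximate KMS requests for a given volume of ingested data."""
    return math.ceil(ingested_mb / mb_per_request)
```

For example, a cluster with 400 encrypted stands would issue roughly `kms_startup_requests(400)` = 1200 requests at startup, while ingesting 2500MB would add roughly `kms_ingest_requests(2500)` = 25 requests.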

Introduction

Here we compare XDBC servers and the Enhanced HTTP server in MarkLogic 8.

Details

XDBC servers are still fully supported in MarkLogic Server version 8. You can upgrade existing XDBC servers without making any changes and you can create new XDBC servers as you did in previous releases.

The Enhanced HTTP Server is an additional feature on HTTP servers which is protocol and binary transport compatible with XCC clients, as long as you use the xcc.httpcompliant=true system property.

The XCC protocol is actually just HTTP, but the details of how to handle body, headers, responses, etc., are "built in" to the XCC client libraries and the XDBC server. The HTTP server in MarkLogic 8 now shares the same low-level code and can dispatch XCC-like requests.

Exporting metering data

Summary

Here we discuss various methods for sharing metering data with Support: telemetry in MarkLogic 9 and exporting monitoring data.

Discussion

Telemetry

In MarkLogic 9, enabling telemetry collects, encrypts, packages, and sends diagnostic and system-level usage information about MarkLogic clusters, including metering, with minimal impact to performance. Telemetry sends information about your MarkLogic Servers to a protected and secure location where it can be accessed by the MarkLogic Technical Support Team to facilitate troubleshooting and monitor performance. For more information see Telemetry.

Meters database

If telemetry is not enabled, make sure that monitoring history is enabled and data has been collected covering the time of the incident. See Enabling Monitoring History on a Group for more details.

Exporting data

Either of the attached scripts can be used in lieu of a Meters database backup. Each provides the raw metering XML files for a defined period of time; these files can be reloaded into MarkLogic and used with the standard tools.

exportMeters.xqy

This XQuery export script needs to be executed in Query Console against the Meters database and will generate zip files stored in the defined folder for the defined period of time.

Variables for start and end times, batch size, and output directory are set at the top of the script.

get-raw.sh

This bash script uses MLCP to perform a similar export, but requires an XDBC server and an MLCP installation. By default the script creates output in a subdirectory called meters-export. See the attached script for details. An example command line is:

./get-raw.sh localhost admin admin "2018-04-12T00:00:00" "2018-04-14T00:00:00"

Backup of Meters database

A backup of the full Meters database will provide all the available raw data and is very useful, but is often very large and difficult to transfer, so an export of a defined time range is often requested.

File system errors during backup to NFS mounted drive and recomme...

Summary

There are situations where the SVC-DIRREM, SVC-DIROPEN and SVC-FILRD errors occur on backups to an NFS mounted drive. This article explains how this condition can occur and describes a number of recommendations to avoid such errors.

Under normal operating conditions, with proper mounting options for a remote drive, MarkLogic Server does not expect to report SVC-xxxx errors. Most likely, these errors are a result of improper NFS disk mounting or other IO issues.

We will begin by exploring methods to narrow down the server which has the disk issue and then list some things to look into in order to identify the cause.

Error Log and Sys Log Observation

The following errors are typical MarkLogic Error Log entries seen during an NFS Backup that indicate an IO subsystem error. The System Log files may include similar messages.

Error: SVC-DIRREM: Directory removal error: rmdir '/Backup/directory/path': {OS level error message}

Error: SVC-DIROPEN: Directory open error: opendir '/Backup/directory/path': {OS level error message}

Error: Backup of forest 'forest-name' to 'Backup path' SVC-FILRD: File read error: open '/Backup/directory/path': {OS level error message}

These SVC- error messages include the {OS level error message} retrieved from the underlying OS platform using the generic C runtime strerror() function. These messages are typically something like "Stale NFS file handle" or "No such file or directory".

If only a subset of hosts in the cluster are generating these types of errors ...

You should compare the problem host's NFS configuration with rest of the hosts in the cluster to make sure all of the configurations are consistent.

Compare nfs versions (rpm -qa | grep -i nfs)
Compare nfs configurations (mount -l -t nfs, cat /etc/mtab, nfsstat)
Compare platform version (uname -mrs, lsb_release -a)

NFS mount options

MarkLogic recommends the NFS Mount settings - 'rw,bg,hard,nointr,noac,tcp,vers=3,timeo=300,rsize=32768,wsize=32768,actimeo=0'

Vers=3 : Must have NFS client version v3 or above
TCP : NFS must be configured to use TCP instead of default UDP
NOAC : To improve performance, NFS clients cache file attributes. Every few seconds, an NFS client checks the server's version of each file's attributes for updates. Changes that occur on the server in those small intervals remain undetected until the client checks the server again. The noac option prevents clients from caching file attributes so that applications can more quickly detect file changes on the server.
- In addition to preventing the client from caching file attributes, the noac option forces application writes to become synchronous so that local changes to a file become visible on the server immediately. That way, other clients can quickly detect recent writes when they check the file's attributes.
- Using the noac option provides greater cache coherence among NFS clients accessing the same files, but it exacts a significant performance penalty. As such, judicious use of file locking is encouraged instead. The DATA AND METADATA COHERENCE section of the nfs(5) man page contains a detailed discussion of these trade-offs.
- NOTE: The noac option is a combination of the generic option sync, and the NFS-specific option actimeo=0.

ACTIMEO=0 : Using actimeo sets all of acregmin, acregmax, acdirmin, and acdirmax to the same value, here 0. If this option is not specified, the NFS client uses the defaults for each of the options listed above.

NOINTR : Selects whether to allow signals to interrupt file operations on this mount point. If neither option is specified (or if nointr is specified), signals do not interrupt NFS file operations. If intr is specified, system calls return EINTR if an in-progress NFS operation is interrupted by a signal.

Using the intr option is preferred to using the soft option because it is significantly less likely to result in data corruption.

The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL can interrupt a pending NFS operation on these kernels, and if specified, this mount option is ignored to provide backwards compatibility with older kernels.

BG : If the bg option is specified, a timeout or failure causes the mount command to fork a child which continues to attempt to mount the export. The parent immediately returns with a zero exit code. This is known as a "background" mount.

HARD (vs soft) : Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

Note: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.
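
When comparing hosts, it can help to diff each host's mount options against the recommended string. The following Python sketch does a literal comparison of comma-separated option strings as they would be written in /etc/fstab; the function names are hypothetical. Note that this is an assumption-laden helper: the kernel may report some options under different spellings (for example proto=tcp rather than tcp), so output copied from mount or /proc/mounts may need normalizing first.

```python
RECOMMENDED = (
    "rw,bg,hard,nointr,noac,tcp,vers=3,timeo=300,"
    "rsize=32768,wsize=32768,actimeo=0"
)


def parse_options(option_string):
    """Turn 'a,b=1' into {'a': '', 'b': '1'}."""
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            options[key] = value
    return options


def compare_mount_options(actual, recommended=RECOMMENDED):
    """Return (missing option names, {name: (actual, recommended)})."""
    want = parse_options(recommended)
    have = parse_options(actual)
    missing = sorted(name for name in want if name not in have)
    different = {
        name: (have[name], want[name])
        for name in want
        if name in have and have[name] != want[name]
    }
    return missing, different
```

A reported difference is a prompt for review rather than an error in itself; for instance, vers=4 differs from the recommended string but still satisfies the "v3 or above" requirement.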

Issue persists => Further debugging

If the issue persists after you have checked the NFS configuration and implemented the MarkLogic recommended NFS mount settings, you will need to debug the NFS connection while the issue is occurring. Enable rpcdebug for NFS on the hosts showing the NFS errors, and then analyze the resulting syslogs from a period in which the issue occurs:

rpcdebug -m nfs -s all

The resulting logs may give you additional information to help understand the source of the failures.

Generating unique IDs (GUIDs)

SUMMARY

This article will show you a way to create a GUID using the XQuery language.

What are GUIDs?

A GUID (Globally Unique IDentifier) is expressed as a string composed of hexadecimal characters arranged in five groups, which are separated by hyphens.

The format can be represented as:

XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Where X denotes a hexadecimal value.

Generating a GUID

Version 7 and above

In version 7 and higher, the function sem:uuid-string is available to help in creating a UUID/GUID.

Version 6

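xquery version "1.0-ml";

(: For each length in $seq, return a string of that many random hex digits :)
declare private function local:random-hex($seq as xs:integer*) as xs:string+ {
  for $i in $seq return 
    fn:string-join(for $n in 1 to $i
      (: xdmp:random(15) returns an integer from 0 to 15, i.e. one hex digit :)
      return xdmp:integer-to-hex(xdmp:random(15)), "")
};

(: Join the five groups (8-4-4-4-12 hex digits) with hyphens :)
declare function local:guid() as xs:string {
  fn:string-join(local:random-hex((8,4,4,4,12)),"-")
};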

Example use:

for $i in 1 to 250
return
local:guid()
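
For readers who need to produce or validate identifiers of the same shape outside MarkLogic, the grouping logic can be sketched in Python as follows. This mirrors the random-hex approach shown above and is not a standards-compliant UUID generator (it sets no version or variant bits); the function names are hypothetical.

```python
import re
import secrets

GROUP_LENGTHS = (8, 4, 4, 4, 12)
GUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def random_guid():
    """Build a GUID-shaped string from random hex digits."""
    # token_hex(n) returns 2*n hex digits, so halve each group length.
    return "-".join(secrets.token_hex(length // 2) for length in GROUP_LENGTHS)


def is_guid_format(value):
    """Check that a string has the 8-4-4-4-12 hexadecimal layout."""
    return GUID_PATTERN.fullmatch(value.lower()) is not None
```

When a standards-compliant identifier is required, a library routine such as Python's `uuid.uuid4()` (or `sem:uuid-string` in MarkLogic 7 and above) is the safer choice.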

Best Practice Tip

If you need to create thousands of GUIDs as part of your content processing architecture, be careful to create a separate URI to represent each unique identifier. When managing all GUIDs within a single URI, you may end up creating a bottleneck with multiple processes waiting on a write lock on this single URI. This is especially true if your working database enables the 'maintain last modified' or 'maintain directory last modified' options. Be sure to scan the server ErrorLog.txt file for any XDMP-DEADLOCK messages, which are indicative of a processing bottleneck.

Geospatial Cylindrical Queries

Introduction

At the time of this writing (MarkLogic 9), MarkLogic Server cannot perform spherical queries, as the geospatial indexes do not support a true 3D coordinate system. In situations where cylindrical queries are sufficient, you can create a 2D geospatial index and a separate range index on an altitude value. An "and-query" with these indexes would result in a cylindrical query.

Example

Consider the following sample document structure:

Configure these 2 indexes for your content database:

Geospatial Element Pair index with latitude localname ‘lat’, longitude localname ‘long’, and parent localname ‘location’
Range element index with localname ‘alt’ and scalar type int

Assuming you have data in your content database matching above document structure, this query:

will return all the documents whose location points fall within the cylinder centered at 37.655983, -122.425525, with a radius of 1000 miles and an altitude of less than 4 miles.
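
The geometry behind this "and-query" can be illustrated outside the server. The Python sketch below tests a single point against the same cylinder: a great-circle distance check for the 2D part combined with a simple altitude comparison. It is only the predicate, not an index-backed query; the haversine formula and the mean Earth radius of 3958.8 miles are assumptions of this sketch, and MarkLogic's own distance calculation may differ slightly.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_cylinder(point, center, radius_miles, max_altitude):
    """point is (lat, long, alt); center is (lat, long)."""
    lat, lon, alt = point
    within_circle = haversine_miles(lat, lon, center[0], center[1]) <= radius_miles
    below_ceiling = alt < max_altitude
    return within_circle and below_ceiling
```

With the center at (37.655983, -122.425525), a radius of 1000 and an altitude limit of 4, a point over Los Angeles at altitude 3 is inside the cylinder, while the same point at altitude 5, or a point over New York at any altitude, is outside.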

Note that in MarkLogic Server 9 geospatial region match was introduced, so the above technique can be extended beyond cylinders.

Groups and Ports - Why can't I see my application server?

Introduction

MarkLogic Server has a notion of groups, which are sets of similarly configured hosts within a cluster.

Application servers (and their respective ports) are scoped to their parent group.

Therefore, you need to make sure that the host and its exposed port to which you're trying to connect both exist in the group where the appropriate application server is defined. For example, if you attempt to connect to a host defined in a group made up of d-nodes, you'll only see application servers and ports defined in the d-nodes group. If the application server you actually want is in a different group (say, e-nodes), you'll get a connection error, instead.

Questions

Can I use any xdmp builtins to show which application servers are linked to particular groups?

The code example below should help with this:

Guarding MarkLogic Applications against attack

Introduction

OK, so you have written an amazing "killer App" using XQuery on MarkLogic Server and you are ready to make it available to the world. Before pressing the deploy button, you may want to verify that your application is not susceptible to hackers and other malicious users. There are many reliable scanners available to help find vulnerabilities in your MarkLogic installation. MarkLogic does not recommend any particular scanner.

This article presents recommendations for handling some of the issues that might be flagged by a vulnerability scan of your MarkLogic Server application.

Recommendations

Put ports 7998 - 8002 behind a firewall.

For vulnerabilities related to OpenSSH, TCP/IP attack, and other OS related known weaknesses, these can easily be warded off by taking the following steps:

Use a strong name/password.
Upgrade to the latest version of MarkLogic Server to get the most recently included OpenSSH library; or don’t use SSH and close port 22.
Place production behind a firewall and only expose ports required by public application.

It is important to guard against Denial Of Service attacks. Here are some ways you can harden against that:

Don’t open any MarkLogic administrative ports to the outside (7998 – 8002). The following utilities can be used to configure ports: inetd, xinetd. This article shows you how to use rc.d to control ports on Linux. Once configured you can then use netstat to check status.
Place your production server behind a firewall and harden that layer against flood attacks. Some products, such as HAProxy, are able to limit traffic on a per-IP basis and still allow for high performance.
This article is a good resource for preventing DOS attacks.
Read the “Securing Your Production Deployment” section of the MarkLogic Server's Understanding and Using Security Guide. It walks you through a set of guidelines for securing your MarkLogic Server.
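
One simple way to verify that the administrative ports are not exposed is to attempt TCP connections to them from a machine outside the firewall. The Python sketch below does exactly that using only the standard library; the function name is hypothetical, and the host you pass in is whatever public address your deployment uses.

```python
import socket

ADMIN_PORTS = range(7998, 8003)  # 7998 through 8002 inclusive


def reachable_ports(host, ports=ADMIN_PORTS, timeout=1.0):
    """Return the ports on host that accept a TCP connection."""
    open_ports = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            # Refused, filtered, or timed out: treat the port as not reachable.
            pass
    return open_ports
```

Run from outside the firewall, an empty list is the desired result. A non-empty list means at least one administrative port is reachable from that location and the firewall rules should be reviewed.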

How do I read and interpret QConsole profile output?

Introduction

This article is intended to give you enough information to enable you to understand the output from query console's profiler.

Details

Query

Consider the following XQuery:

xquery version '1.0-ml';
let $path := '/Users/chamlin/Downloads/medsamp2012.xml'
let $citations := xdmp:document-get($path)//MedlineCitation
for $citation at $i in $citations
return
   xdmp:document-insert(fn:concat("/",$i,$citation/PMID/fn:string(),".xml"), $citation)

This FLWOR expression will load an xml file into memory, then find each MedlineCitation element and insert it as a document in the database. Although this example is very simple, it should give us enough information to understand what the profiler does and how to understand the output.

Scenario / Walkthrough

Setup

Download the small dataset for medline at http://www.nlm.nih.gov/databases/dtd/medsamp2012.xml and save it to your MarkLogic machine
Open a buffer in Query Console
Load the XML fragments into your nominated database by executing the XQuery shown above, altering $path so it points to your downloaded medsamp2012.xml file
You should have loaded 156 Medline XML fragments in your database if everything worked correctly. If you receive an error, make sure that the file is available to MarkLogic and has the correct permissions to allow access

Profile the query

Now run the same query again, only this time, ensure "Profile" is selected before you hit the run button.

You should see something like this (click image to view it in a separate window):

The header shows overall statistics for the query:

Profile 627 Expressions PT0.286939S: The number of XQuery expression evaluations, along with the total query execution time expressed as an xs:dayTimeDuration (hence the PT prefix)

The table shows the profiler results for the expressions evaluated in the request, one row for each expression:

Module:Line No.:Col No.: The point in the code where the expression can be found.
Count: The number of times the expression was evaluated.
Shallow %: The percentage of time spent evaluating a particular expression compared to the entire query, excluding any time spent evaluating any sub-expressions.
Shallow µs: The time (in microseconds) taken for all the evaluations of a particular expression. This excludes time spent evaluating any sub-expressions.
Deep %: The percentage of time spent evaluating a particular expression compared to the entire query, including any time spent evaluating any sub-expressions.
Deep µs: The time (in microseconds) taken for all the evaluations of a particular expression. This includes time spent evaluating any sub-expressions.
Expression: The particular XQuery expression being profiled and timed. Expressions can represent FLWOR expressions, literal values, arithmetic operations, functions, function calls, and other expressions.

Shallow time vs deep time

In profiler output you will usually want to pay the most attention to expressions that have a large shallow time. These expressions are doing the most work, exclusive of work done in sub-expressions.

If an expression has a very large deep time, but almost no shallow time, then most of the time is being spent in sub-expressions.

For example, in the profiler output shown, the FLWOR expression at .main:2:0 has the most deep time since it includes the other expressions, but not a lot of shallow time since it doesn't do much work itself. The expression at .main:3:45 has a lot of deep time, but that all comes from the subexpression at .main:3:18, which takes the most time.
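
The relationship between shallow and deep time can be made concrete with a small model. In the Python sketch below, each expression carries only its own (shallow) time, and deep time is computed as the shallow time plus the deep time of all sub-expressions. The expression names and timings are invented for illustration and are not taken from the profiler output discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    name: str
    shallow_us: int
    children: list = field(default_factory=list)


def deep_us(expr):
    """Deep time = own shallow time + deep time of every sub-expression."""
    return expr.shallow_us + sum(deep_us(child) for child in expr.children)


def report(root):
    """Return rows (name, shallow us, shallow %, deep us, deep %),
    sorted by shallow % descending like the profiler's default view."""
    total = deep_us(root)
    rows = []

    def visit(expr):
        deep = deep_us(expr)
        rows.append((
            expr.name,
            expr.shallow_us,
            round(100 * expr.shallow_us / total, 1),
            deep,
            round(100 * deep / total, 1),
        ))
        for child in expr.children:
            visit(child)

    visit(root)
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical query: a FLWOR that wraps an insert, which wraps a file read.
read = Expr("document-get", 150)
insert = Expr("document-insert", 40, [read])
flwor = Expr("flwor", 10, [insert])
```

Here `report(flwor)` lists document-get first (75% shallow), even though the FLWOR has 100% of the deep time: the outer expressions inherit their deep time from the work done inside them.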

Sorting

The default sorting of the table is by Shallow % descending. This is generally a good view as it will bring the expressions taking the most shallow time to the top. You can sort on a different column by clicking on the column header.

Cold vs warm

Timings may change for a query if you execute it more than once, due to the caching performed by MarkLogic. A query will be slower if it needs data that is not available in the caches (cold) vs where much of the information is available from caches (warm). This is by design and gives better performance as the system runs and caches frequently used information.

Lazy evaluation

Another characteristic of MarkLogic Server is its use of lazy evaluation. A relatively expensive evaluation may return quickly without performing the work necessary to produce results. Then, when the results are needed, the work will actually be performed and the evaluation time will be assigned at that point. This can give surprising results.

Wrapping an expression in xdmp:eager() will evaluate it eagerly, giving a better idea of how much time it really takes because the time for evaluation will be better allocated in the profiler output.
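
As an analogy only (this is Python, not MarkLogic code), a generator behaves much like a lazily evaluated expression: creating it costs almost nothing, and the work is charged to whichever later step consumes the results. Forcing it into a list up front plays the role that xdmp:eager() plays above. All names in this sketch are hypothetical.

```python
def citations(log):
    """Generator standing in for a lazily evaluated expression."""
    for i in range(3):
        log.append(i)  # record that the "expensive" work happened
        yield i


def demo():
    log = []
    lazy = citations(log)        # returns at once; no work done yet
    work_before_use = len(log)
    total = sum(lazy)            # the work happens here, at the point of use
    work_after_use = len(log)

    eager_log = []
    list(citations(eager_log))   # forcing evaluation does the work up front
    return work_before_use, work_after_use, len(eager_log), total
```

`demo()` shows that no work is recorded when the lazy value is created, and that all of it is recorded once the value is consumed, which is the same effect that can make profiler timings appear against an unexpected expression.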

Introduction

This Knowledgebase article is a general guideline for backups that use the journal archiving feature. It covers both the free space requirements and the file sizes to expect in the journal archive repository when journal archiving is enabled and active.

The MarkLogic environment used here was an out-of-the-box version 9.x installation with one change: a new directory was added specifically for storing the archive journal backup files.

It is assumed that the reader of this article already has a basic understanding of the role of Journal Archiving in the Backup and Restore feature of MarkLogic Server. See the references below for further details.

How much free space is needed for the Archive Journal files in a backup?

MarkLogic Server uses the size of the active forest to confirm whether the journal archive repository has enough free space to accommodate that forest. Only one forest is used in this free-space calculation: if additional forests exist on the same volume, they are not included in the algorithm that calculates the free space available for the backup and/or archive journal repositories, so the result may be inaccurate.

In other words, if multiple forests exist on the same volume, that volume may not have enough free space because of the additional forests, especially during a high rate of ingestion. In that case, provide enough free space on the volume to accommodate the sizes of all the forests: Required Free Space (approximately) = (Number of Forests) x (Size of largest Forest).

What file sizes can we expect to see in the journal archiving repository for specific ingestion types and sizes? The next section addresses that question.

How is the Journal Archive repository filling up?

1 MByte of raw XML data loaded into the server (as either a new document ingestion or a document update) will result in approximately 5 to 6 MBytes of data being written to the corresponding Journal Archive files. Additionally, adding Range Indexes will contribute to a relatively small increase in consumed space.

Ingesting/updating RDF data results in slightly less data being written to the journal archive files.

In conclusion, for both new document ingestion and document updates, the typical expansion ratio of Journal Archive size to input file size is between 5 and 6, but it can be higher depending on the document structure and any added range indexes.
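
The two rules of thumb in this article (the free-space formula and the 5 to 6 times expansion ratio) can be applied with a short calculation. The Python sketch below is a planning aid under those stated assumptions only; the function names are hypothetical, and actual figures depend on document structure and index settings.

```python
def required_free_space(forest_sizes):
    """Approximate free space needed on a volume shared by several forests:
    (number of forests) x (size of the largest forest).
    The result is in the same unit as the input sizes."""
    if not forest_sizes:
        return 0
    return len(forest_sizes) * max(forest_sizes)


def journal_archive_range(raw_input_mb, low_ratio=5, high_ratio=6):
    """Typical range of journal archive growth, in MB, for a given
    volume of raw XML ingested or updated."""
    return raw_input_mb * low_ratio, raw_input_mb * high_ratio
```

For example, three forests of 120GB, 80GB and 200GB on one volume call for roughly 600GB of free space, and loading 1024MB of raw XML can be expected to write somewhere around 5120MB to 6144MB to the journal archive.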

References:

How reindexing/rebalancing works, and the impact on performance

Summary

While reindexing should be an infrequent operation in a production environment, it is important to understand how the process can impact a MarkLogic environment. This article describes the process of reindexing and explores how it may affect performance of the server.

Why reindex?

MarkLogic enables some default full-text search indexes and any inserted content populates these indexes. There are situations where these indexes may need to be changed, including:

to enable additional search functionality in a MarkLogic application
to increase accuracy of unfiltered searches
to create additional facets or lexicon-based functions
to recognize enhancements or bug fixes between MarkLogic versions
to remove unused indexes to reclaim disk space

When the indexes are changed, the server will begin the process of reindexing all affected content. In most cases, this will include all documents in a database, but the server does try to reindex only the fragments that contain content that would be populated in the added/removed index.

How reindexing works

When configuration changes are made in the admin interface or the Admin API, the server will write a new version of its configuration files. Directly after these configuration changes are made, or on startup, the server will automatically start reindexing forests. If no index changes have been made, the server will simply reindex zero fragments. If the changes include index settings, however, the server will find that some or all fragments may need to be reindexed. The server will query the content, pick up a few hundred fragments that have not been reindexed, and reinsert this content into the database with the new index settings. In this way, reindexing is very much like a simple document update, only the process is automated and the index settings are different. Once these fragments have been completed, the server will get the next batch of fragments, and this process continues until the query returns zero fragments to reindex.

Reindexing consumes additional disk space during the process itself. In particular, at any point in the reindexing process, the server can have up to three instances of a single fragment:

the original document (original indexes)
updated document (new indexes)
merged document (only if the stand where this document resides is currently being merged)

In a worst-case scenario, more likely to happen towards the end of reindexing, the disk footprint of all the forests could be 3x the original size. For this reason, MarkLogic requires extra disk space for reindexing. This design choice ensures integrity of the content and allows for zero downtime when reindexing.
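The worst case described above can be checked with simple arithmetic before a reindex is started. The Python sketch below assumes the 3x figure from this article; the function names and the sample sizes are hypothetical.

```python
def reindex_worst_case_gb(current_forest_gb, copies=3):
    """Worst-case on-disk footprint during reindexing (original + updated + merged)."""
    return current_forest_gb * copies


def reindex_shortfall_gb(current_forest_gb, free_gb):
    """Additional free space still needed to survive the worst case, or 0.0 if none."""
    extra_needed = reindex_worst_case_gb(current_forest_gb) - current_forest_gb
    return max(0.0, extra_needed - free_gb)


# Hypothetical example: 200 GB of forest data with 300 GB currently free.
print(reindex_worst_case_gb(200.0))        # 600.0 GB peak footprint
print(reindex_shortfall_gb(200.0, 300.0))  # 100.0 GB more free space needed
```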

Performance Impact

Reindexing is a resource-intensive operation, as it uses both CPU and disk bandwidth. The CPU will be busy parsing the content and generating index entries while the disk will be reading fragments for reindexing, writing new stands to disk, and running merges on these newly created stands. You can expect significant performance impact in environments that are normally heavily utilized. You can decrease the impact of reindexing by using the reindexer throttle setting in the database configuration page. Reducing the value from 5 will introduce a delay between completion of one batch of fragments and the next query for further fragments.

Recommendations

Here are some recommendations when considering reindexing:

Plan to make multiple index changes at once to avoid reindexing multiple times
Disable reindexing (database configuration) to avoid accidentally forcing a reindex, only re-enabling it when reindexing is explicitly planned
Only enable reindexing (database configuration) during off-peak hours. The duration to complete the reindex will increase, but performance during peak hours will be better.
Check for free disk space before the reindex process begins (see Understanding MarkLogic Minimum Disk Space Requirements)
Ensure the environment has sufficient I/O bandwidth
Disable application/user access if you can afford the downtime, as this may improve overall reindex performance

Rebalancing

Changes in the cluster configuration may require rebalancing content across forests. Rebalancing works similarly to reindexing: batches of documents are marked deleted in one forest and inserted into another forest. The performance impact and recommendations are thus the same as for reindexing.

How to handle XDQP-TIMEOUT on a busy cluster

Introduction

Sometimes, when a cluster is under heavy load, it may show a lot of XDQP-TIMEOUT messages in the error log. Often, a subset of hosts in the cluster may become so busy that the forests they host get unmounted and remounted repeatedly. Depending on your database and group settings, the act of remounting a forest may be very time-consuming, because all hosts in the cluster are forced to do the extra work of index detection.

Forest Remounts

Every time a forest remounts, the error log will show a lot of messages like these:

2012-08-27 06:50:33.146 Debug: Detecting indexes for database my-schemas
2012-08-27 06:50:33.146 Debug: Detecting indexes for database Triggers
2012-08-27 06:50:35.370 Debug: Detected indexes for database Last-Login: sln
2012-08-27 06:50:35.370 Debug: Detected indexes for database Triggers: sln
2012-08-27 06:50:35.370 Debug: Detected indexes for database Schemas: sln
2012-08-27 06:50:35.370 Debug: Detected indexes for database Modules: sln
2012-08-27 06:50:35.373 Debug: Detected indexes for database Security: sln
2012-08-27 06:50:35.485 Debug: Detected indexes for database my-modules: sln
2012-08-27 06:50:35.773 Debug: Detected indexes for database App-Services: sln
2012-08-27 06:50:35.773 Debug: Detected indexes for database Fab: sln
2012-08-27 06:50:35.805 Debug: Detected indexes for database Documents: ss, fp

... and so on ...

This can go on for several minutes and will cost you more down time than necessary, since you already know the indexes for each database.

Improving the situation

Here are some suggestions for improving this situation:

Browse to Admin UI -> Databases -> my-database-name
Set ‘index detection’ to ‘none’
Set ‘expunge locks’ to ‘none’

Repeat these steps for all active databases.

Now tweak the group settings to make the cluster less sensitive to an occasional busy host:

Browse to Admin UI -> Groups -> E-Nodes
Set ‘xdqp timeout’ to 30
Set ‘host timeout’ to 90
Click OK to make this change effective.

The database-level changes tell the server to speed up cluster startup time when a server node is perceived to be offline. The group changes will cause the hosts on that group to be a little more forgiving before declaring a host to be offline, thus preventing forest unmounting when it's not really needed.

If, after performing these changes, you find that you are still experiencing XDQP-TIMEOUT messages, the next step is to contact MarkLogic Support for assistance. You should also alert your Development team, in case there is a stray query that is causing the data nodes to gather too many results.

Related Reading

XML Data Query Protocol (XDQP)

How to Resolve "Host does not match origin" Error When Coupling C...

Summary

When configuring a server to add a foreign cluster you may encounter the following error:

Forbidden
Host does not match origin or inferred origin, or is otherwise untrusted.

This error will typically occur when using MarkLogic Server versions prior to 10.0-6, in combination with Chrome versions newer than 84.x.

Our recommendation to resolve this issue is to upgrade to MarkLogic Server 10.0-6 or newer. If that is not an option, then using a different browser, such as Mozilla Firefox, or downgrading to Chrome version 84.x may also resolve the error.

Changes to Chrome

Starting in version 85.x of Chrome, there was a change made to the default Referrer-Policy, which is what causes the error. The old default was no-referrer-when-downgrade, and the new value is strict-origin-when-cross-origin. When no policy is set, the browser's default setting is used. Websites are able to set their own policy, but it is common practice for websites to defer to the browser's default setting.

A more detailed description can be found at developers.google.com
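To make the difference between the two defaults concrete, here is a simplified Python sketch of what a browser would send as the referrer under each policy. It is a model for illustration only, not Chrome's implementation: a "downgrade" is reduced to an https-to-http navigation, and the host names and paths in the example are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme)


def referrer(policy, source, destination):
    """Simplified model of the referrer value sent when navigating source -> destination."""
    src, dst = urlsplit(source), urlsplit(destination)
    # Both policies send nothing on an https -> http downgrade.
    if src.scheme == "https" and dst.scheme == "http":
        return None
    full_url = f"{src.scheme}://{src.netloc}{src.path or '/'}"
    if src.query:
        full_url += f"?{src.query}"
    if policy == "no-referrer-when-downgrade":
        return full_url
    if policy == "strict-origin-when-cross-origin":
        if origin(source) == origin(destination):
            return full_url
        # Cross-origin requests only receive the origin, without path or query.
        return f"{src.scheme}://{src.netloc}/"
    raise ValueError(f"unsupported policy: {policy}")


# Hypothetical Admin UI pages on two different cluster hosts.
page = "http://host-a:8001/cluster-admin.xqy?section=foreign"
target = "http://host-b:8001/foreign-cluster.xqy"
print(referrer("no-referrer-when-downgrade", page, target))
print(referrer("strict-origin-when-cross-origin", page, target))
```

Under the old default the full page URL is sent to the other host; under the new default only the origin is sent for cross-origin requests.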

How to resolve an "Internal Server Error XDMP-LEXVAL: ss:request-...

Introduction

For hosts that don't use a standard US locale (en_US) there are instances where some lower level calls will return data that cannot be parsed by MarkLogic Server. An example of this is shown with a host configured with a different locale when making a call to the Cluster Status page (cluster-status.xqy):

The problem

The problem you have encountered is a known issue: MarkLogic Server uses a call to strtof() to parse the values as floats:

http://linux.die.net/man/3/strtof

Unfortunately, this uses a locale-specific decimal point. The issue in this environment is likely due to the Operating System using a numeric locale where the decimal point is a comma, rather than a period.

Resolving the issue

The workaround for this is as follows:

1. Create a file called /etc/marklogic.conf (unless one already exists)

2. Add the following line to /etc/marklogic.conf:

export LC_NUMERIC=en_US.UTF-8

After this is done, you can restart the MarkLogic process so the change is detected and try to access the cluster status again.
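The effect of a locale-specific decimal point can be illustrated with a small Python model. This is not strtof() itself and not MarkLogic code; it is a simplified sketch, under the assumption that the parser consumes the longest numeric prefix that is valid for the given decimal point and leaves the rest unparsed.

```python
import re


def parse_like_strtof(text, decimal_point="."):
    """Parse the longest leading number in text, using the given decimal point.

    Returns (value, unparsed_remainder), loosely mimicking how a
    locale-aware parser stops at the first character it does not expect.
    """
    pattern = r"[+-]?\d+(?:" + re.escape(decimal_point) + r"\d+)?"
    match = re.match(pattern, text)
    if match is None:
        return 0.0, text
    number = match.group(0).replace(decimal_point, ".")
    return float(number), text[match.end():]


# With a period as the decimal point, the whole value is consumed.
print(parse_like_strtof("0.25", "."))  # (0.25, '')
# With a comma as the decimal point, parsing stops at the period.
print(parse_like_strtof("0.25", ","))  # (0.0, '.25')
```

Setting LC_NUMERIC to en_US.UTF-8 corresponds to the first call: the period is once again treated as the decimal point.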

Hung Messages in the ErrorLog

Summary

Hung messages in the ErrorLog indicate that MarkLogic Server was blocked while waiting on host resources, typically I/O or CPU.

Debug Level

The presence of Debug-level Hung messages in the ErrorLog does not indicate a critical problem, but it does indicate that the server is under load and intermittently unresponsive for some period of time. A server that is logging Debug-level Hung messages should be closely monitored and the reason(s) for the hangs should be understood. A Debug-level message is logged if the hang time is greater than or equal to the Group's XDQP timeout.

Warning Level

When the duration of the Hung message is greater than or equal to two times the Group's XDQP timeout setting, the Hung message will appear at the Warning log level. Consequently, if the host is unresponsive to the rest of the cluster (that is, they have not received a heartbeat for the group's host timeout number of seconds), it may trigger a failover.
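The two thresholds described above can be summarized in a few lines of Python. This is only a sketch of the rule as stated in this article; the function name is hypothetical and the 10-second XDQP timeout in the example is an assumed value, not a recommendation.

```python
def hung_log_level(hung_seconds, xdqp_timeout_seconds):
    """Return the log level at which a Hung message would appear, or None.

    Debug:   hang time >= the group's XDQP timeout
    Warning: hang time >= two times the group's XDQP timeout
    """
    if hung_seconds >= 2 * xdqp_timeout_seconds:
        return "Warning"
    if hung_seconds >= xdqp_timeout_seconds:
        return "Debug"
    return None


# Assumed XDQP timeout of 10 seconds for the example.
for duration in (5, 10, 19, 20):
    print(duration, hung_log_level(duration, 10))
```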

Common Causes

Hung messages in the ErrorLog have been traced back to the following root causes:

MarkLogic Server is installed on a Virtual Machine (VM), and
- The VM does not have sufficient resources provisioned for peak use; or
- The underlying hardware is not provisioned with enough resources for peak use.
MarkLogic Server is using disk space on a Storage Area Network (SAN) or Network Attached Storage (NAS) device, and
- The SAN or NAS is not provisioned to handle peak load; or
- The network that connects the host to the storage system is not provisioned to handle peak load.
Other enterprise level software is running on the same hardware as MarkLogic Server. MarkLogic Server is designed with the assumption that it is running on dedicated hardware.
A file backup or a virus scan utility is running against the same disk where forest data is stored, overloading the I/O capabilities of the storage system.
There is insufficient I/O bandwidth for the merging of all forests simultaneously.
Re-indexing overloads the I/O capabilities of the storage system.
A query that performs extremely poorly, or a number of such queries, caused host resource exhaustion.

Forest Failover

If the cause of the Hung message further causes the server to be unresponsive to cluster heartbeat requests from other servers in the cluster, for a duration greater than the host timeout, then the host will be considered unavailable and will be voted out of the cluster by a quorum of its peers. If this happens, and failover is configured for forests stored on the unresponsive host, the forests will fail over.

Debugging Tips

Look at system statistics (such as SAR data) and system logs from your server for entries that occurred during the time-span of the Hung message. The goal is to pinpoint the resource bottleneck that is the root cause.

Provisioning Recommendation

The host on which MarkLogic Server runs needs to be correctly provisioned for peak load.

MarkLogic recommends that your storage subsystem simultaneously support:

20MB/s read throughput, per forest
20MB/s write throughput, per forest

We have found that customers who are able to sustain these throughput rates have not encountered operational problems related to storage resources.
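For capacity planning, the per-forest recommendation scales linearly with the number of forests on a host. The following Python sketch applies the 20MB/s figures above; the function name and the forest count in the example are hypothetical.

```python
RECOMMENDED_MB_PER_SEC_PER_FOREST = 20  # read and write, per this article


def recommended_throughput(forest_count):
    """Return the simultaneous read and write throughput (MB/s) to provision."""
    total = forest_count * RECOMMENDED_MB_PER_SEC_PER_FOREST
    return {"read_mb_per_sec": total, "write_mb_per_sec": total}


# Hypothetical host with 6 forests.
print(recommended_throughput(6))
```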

Configuration Tips

If the Hung message occurred during an I/O-intensive background task (such as a database backup, merge, or reindexing), consider setting or decreasing the background IO limit. This group-level configuration controls the I/O resources that background I/O tasks will consume.

If the Hung message occurred during a database merge, consider decreasing the merge priority in the database’s Merge Policy. For example, if the priority is set to "normal", then try decreasing it to "lower".

JSON in versions 6, 7, and 8

Introduction

This article compares JSON support in MarkLogic Server versions 6, 7, and 8, and the upgrade path for JSON in the database.

How is native JSON different than the previous JSON support?

Previous versions of MarkLogic Server provided XQuery APIs that converted between JSON and XML. This translation is lossy in the general case, meaning developers were forced to make compromises on either or both ends of the transformation. Even though the transformation was implemented in C++, it still added significant overhead to ingestion. All of these issues go away with JSON as a native document format.

How do I upgrade my JSON façade data to native JSON?

For applications that use the previous JSON translation façade (for example: through the Java or REST Client APIs), MarkLogic 8 comes with sample migration scripts to convert JSON stored as XML into native JSON.

The migration script will upgrade a database’s content and configuration from the XML format that was used in MarkLogic 6 and 7 to represent data to native JSON, specifically converting documents in the http://marklogic.com/xdmp/json/basic namespace.

If you are using the MarkLogic 7 JSON support, you will also need to migrate your code to use the native JSON support. The resulting application code is expected to be more efficient, but the migration will require application developers to make minor code changes to the application.

Summary

There are many different options when loading data into MarkLogic. The best option for your particular circumstances will depend on your use case.

Details

Version Compatibility

Not all features/programs are provided or are compatible with all versions of MarkLogic. Check the requirements given. Note that the MarkLogic documentation at docs.marklogic.com allows you to select the version of the documentation that you view.

For the most recent MarkLogic Server versions, there is a separate guide: Loading Content Into MarkLogic Server.

MLCP

MarkLogic Content Pump (mlcp) is an open-source, Java-based command-line tool. mlcp helps to import, export, and copy data to or from MarkLogic databases. It is designed for integration and automation in existing workflows and scripts.

See MarkLogic Content Pump for details.

REST API

In MarkLogic 6 and above, the MarkLogic REST API provides a set of RESTful services for creating, updating, retrieving, deleting, and querying documents and metadata. See REST Development for details.

Java API

The Java Client API is an open source API for creating applications that use MarkLogic Server for document and search operations.

Node.js

The Node.js Client API enables you to create Node.js applications that can read, write, and query documents and semantic data in a MarkLogic database. See Node.js Application Developer's Guide.

XQuery

You can load documents into the database using the XQuery load document functions, as described in Loading Content Using XQuery in the guide to loading content.

WebDAV

You can set up a WebDAV server and client to load documents. See WebDAV Servers for more information.

RecordLoader

RecordLoader is a Java-based command line tool, designed to load any number of arbitrary-sized input documents into a MarkLogic database.

Corb

Corb is a Java-based command line tool for content reprocessing in bulk.

XQSync

XQSync is a command-line, Java-based tool, useful for synchronizing MarkLogic databases to and from other databases, filesystems, and zip-files.

XCC Application

Documents can also be loaded into the database by an XCC application, as described in the XCC Developer’s Guide.

Logging HTTP Application Server Requests

Summary

MarkLogic Server maintains an access log that records each HTTP application server request. However, the access log contains only summary information. To log additional HTTP request detail, including request parameters, you can write it to the error log by using a URL rewriter.

Detail

A URL rewriter can be configured for each application server. The URL rewriter receives the request and can log the request details accordingly. Below is a sample URL rewriter that can be used to log HTTP request fields:

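xquery version "1.0-ml";
(: Log the request URL and every request field to the error log. :)
(
xdmp:log(fn:concat("Request URI: ", xdmp:get-request-url())),
for $field in xdmp:get-request-field-names()
return
(: A field can carry several values, so join them before concatenating. :)
xdmp:log(fn:concat("Request Field - [Name:] ", $field, " [Value:] ", fn:string-join(xdmp:get-request-field($field), ", "))),
(: Return the original URL so that the request is processed unchanged. :)
xdmp:get-request-url()
)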

To configure a URL rewriter on an application server using the MarkLogic Admin UI, navigate to -> Configure -> {group-name} -> App Servers -> {app-server-name} -> set 'url rewriter' value to the rewriter script URI.

Managing and Updating Temporal Documents

Introduction

Here we discuss management of temporal documents.

Details

In MarkLogic, a temporal document is managed as a series of versioned documents in a protected collection. The ‘original’ document inserted into the database is kept and never changes. Updates to the document are inserted as new documents with different valid and system times. A delete of the document is also inserted as a new document.

In this way, a temporal document always retains knowledge of when the information was known in the real world and when it was recorded in the database.

APIs

By default the normal xdmp:* document functions (e.g., xdmp:document-insert) are not permitted on temporal documents.

The temporal module (temporal:* functions; see Temporal API) contains the functions used to insert, delete, and manage temporal documents.

All temporal updates and deletes create new documents and in normal operations this is exactly what will be desired.

See also the documentation: Managing Temporal Documents.

Updates and deletes outside the temporal functions

Note: normal use of the temporal feature will not require this sort of operation.

The function temporal:collection-set-options can be used with the updates-admin-override option to specify that users with the admin role can change or delete temporal documents using non-temporal functions, such as xdmp:document-insert and xdmp:document-delete.

For example, you may need to run a corb job or another administrative transform (say, to change the values M/F to Male/Female) without updating the system dates on the documents.

Manual upgrade for MarkLogic AWS AMI

Introduction

If you have an existing MarkLogic Server cluster running on EC2, there may be circumstances where you need to upgrade the existing AMI with the latest MarkLogic rpm available. You can also add a custom OS configuration.

This article assumes that you have started your cluster using the CloudFormation templates with Managed Cluster feature provided by MarkLogic.

Procedure
To manually upgrade the MarkLogic AMI, follow these steps:

1. Launch a new small MarkLogic instance from the AWS MarketPlace, based on the latest available image. For example, t2.small based on MarkLogic Developer 9 (BYOL). The instance should be launched only with the root OS EBS volume.
Note: If you are planning to leverage the PAYG-PayAsYouGo model, you must choose MarkLogic Essential Enterprise.
a. Launch a MarkLogic instance from AWS MarketPlace, click Select and then click Continue:

b. Choose instance type. For example, one of the smallest available, t2.small
c. Configure instance details. For example, default VPC with a public IP for easy access
d. Remove the second EBS data volume (/dev/sdf)
e. Optional - Add Tags
f. Configure Security Group - only SSH access is needed for the upgrade procedure
g. Review and Launch

2. SSH into your new instance and switch the user to root in order to execute the commands in the following steps.

$ sudo su -

Note: As an option, you can also use "sudo ..." for each individual command.

3. Stop MarkLogic and uninstall MarkLogic rpm:

$ service MarkLogic stop
$ rpm -e MarkLogic

4. Update-patch the OS:

$ yum -y update

Note: If needed, restart the instance (for example, after a kernel or core-library upgrade).
Note: If you would like to add more custom options/configuration/..., they should be done between steps 4 and 5.

5. Install the new MarkLogic rpm
a. Upload ML's rpm to the instance. (For example, via "scp" or S3)
b. Install the rpm:

$ yum install [<path_to_MarkLogic_RPM>]/[MarkLogic_RPM]

Note: Do not start MarkLogic at any point of AMI's preparation.

6. Double check to be sure that the following files and log traces do not exist. If they do, they must be deleted.

$ rm -f /var/local/mlcmd.conf
$ rm -f /var/tmp/mlcmd.trace
$ rm -f /tmp/marklogic.host

7. Remove artifacts
Note: Performing the following actions will remove the ability to ssh back into the baseline image. New credentials are applied to the AMI when launched as an instance. If you need to add/change something, mount the root drive to another instance to make changes.

$ rm -f /root/.ssh/authorized_keys
$ rm -f /home/ec2-user/.ssh/authorized_keys
$ rm -f /home/ec2-user/.bash_history
$ rm -rf /var/spool/mail/*
$ rm -rf /tmp/userdata*
$ rm -f [<path_to_MarkLogic_RPM>]/[MarkLogic_RPM]
$ rm -f /root/.bash_history
$ rm -rf /var/log/*
$ sync

8. Optional - Create an AMI from the stopped instance.[1] The AMI can be created at the end of step 7.

$ init 0

[1] For more information: https://docs.aws.amazon.com/toolkit-for-visual-studio/latest/user-guide/tkv-create-ami-from-instance.html

At this point, your custom AMI should be ready and it can be used for your deployments. If you are using multiple AWS regions, you will have to copy the AMI as needed.

Additional references:
[2] Upgrading the MarkLogic AMI - https://docs.marklogic.com/8.0/guide/ec2/managing#id_69624

MarkLogic & the Log4j Remote Code Execution Vulnerability (CVE-20...

Updates

Tuesday, February 1, 2022 : Released Pega Connector 1.0.1 which uses MLCP 10.0-8.2 with forced dependencies to log4j 2.17.1.

Tuesday, January 25, 2022 : MarkLogic Server version 10.0-8.3 (CentOS 7.8 and 8) is now available on the Azure marketplace.

Monday, January 17, 2022 : MarkLogic Server 10.0-8.3 is now available on AWS marketplace;

Monday, January 10, 2022 : MarkLogic Server 10.0-8.3 released with Log4j 2.17.1. (ref: CVE-2021-44832 ).

Friday, January 7, 2022 : Fixed incorrect reference to log4j version included with MarkLogic 10.0-8.2 & 9.0-13.7.

Wednesday, January 05, 2022 : Updated workaround to reference Log4j 2.17.1. (ref: CVE-2021-44832 ).

Tuesday, December 28, 2021 : Add explicit note that for MarkLogic Server installations not on AWS, it is safe to remove the log4j files in the mlcmd/lib directory.

Saturday, December 25, 2021: MLCP update to resolve CVE-2019-17571 is now available for download;

Friday, December 24, 2021: AWS & Azure Marketplace update;

Wednesday, December 22, 2021: additional detail regarding SumoCollector files; AWS & Azure Marketplace update; & MLCP note regarding CVE-2019-17571.

Monday, December 20, 2021: Updated workaround to reference Log4j 2.17.0. (ref: CVE-2021-45105 )

Friday, December 17, 2021: Updated for the availability of MarkLogic Server versions 10.0-8.2 and 9.0-13.7;

Wednesday, December 15, 2021: Updated to include SumoLogic Controller reference for MarkLogic 10.0-6 through 10.0-7.3 on AWS;

Tuesday, December 14, 2021: This article had been updated to account for the new guidance and remediation steps in CVE-2021-45046;

"It was found that the fix to address CVE-2021-44228 in Apache Log4j 2.15.0 was incomplete in certain non-default configurations. This could allows attackers with control over Thread Context Map (MDC) input data when the logging configuration uses a non-default Pattern Layout with either a Context Lookup or a Thread Context Map pattern to craft malicious input data using a JNDI Lookup pattern resulting in a denial of service (DOS) attack. ..."

Monday, December 13, 2021: Original article published.

Subject

Important MarkLogic Security update on Log4j Remote Code Execution Vulnerability (CVE-2021-44228)

Summary

A flaw in Log4j, a Java library for logging error messages in applications, is the most high-profile security vulnerability on the internet right now and comes with a severity score of 10 out of 10. At MarkLogic, we take security very seriously and have been proactive in responding to all kinds of security threats. Recently a serious security vulnerability in the Java-based logging package Log4j was discovered. Log4j is developed by the Apache Foundation and is widely used by both enterprise apps and cloud services. The bug, now tracked as CVE-2021-44228 and dubbed Log4Shell or LogJam, is an unauthenticated RCE ( Remote Code Execution ) vulnerability allowing complete system takeover on systems with Log4j 2.0-beta9 up to 2.14.1.

As part of mitigation measures, Apache originally released Log4j 2.15.0 to address the maximum severity CVE-2021-44228 RCE vulnerability. However, that solution was found to be incomplete (CVE-2021-45046) and Apache has since released Log4j 2.16.0. This vulnerability can be mitigated in prior releases (<2.16.0) by removing the JndiLookup class from the classpath. Components/Products using the log4j library are advised to upgrade to the latest release ASAP seeing that attackers are already searching for exploitable targets.

MarkLogic Server

MarkLogic Server version 10.0-8.3 now includes Log4j 2.17.1. (ref: CVE-2021-44832 ).

MarkLogic Server versions 10.0-8.2 & 9.0-13.7 include log4j 2.16.0, replacing all previously included log4j modules affected by this vulnerability.

MarkLogic Server versions 10.0-8.3 & 9.0-13.7 are available for download from our developer site at https://developer.marklogic.com/products/marklogic-server .

MarkLogic Server versions 10.0-8.3 & 9.0-13.7 are available on the AWS Marketplace.

MarkLogic Server versions 10.0-8.3 (CentOS 7.8 and 8) & 9.0-13.7 (CentOS 8) VMs are available in the Azure marketplace.

MarkLogic Server does not use log4j2 within the core server product.

However, CVE-2021-44228 has been determined to impact the Managed Cluster System (MLCMD) in AWS.

Note: log4j is included in the MarkLogic Server installation, but it is only used by MLCMD on AWS. For MarkLogic Server installations not on AWS, you can simply remove the log4j files in the mlcmd/lib directory (sudo rm /opt/MarkLogic/mlcmd/lib/log4j*).

AWS Customers can use the following workaround to mitigate exposure to the CVE.

Impacted versions

The versions that are affected by the Log4Shell vulnerability are

10.0-6.3 through 10.0-8.1 on AWS

9.0-13.4 through 9.0-13.6 on AWS

Earlier versions of MLCMD use a log4j version that is not affected by this vulnerability.

How to check log4j version used by MarkLogic Managed Cluster System in AWS

Access the instance/VM via SSH.

Run the following command: ls /opt/MarkLogic/mlcmd/lib/ | grep "log4j"

If the versions of the log4j jar files returned are between 2.0-beta9 and 2.14.1 (inclusive), then the system contains this vulnerability.

An example response from a system containing the CVE:

log4j-1.2-api-2.14.1.jar

log4j-api-2.14.1.jar

log4j-core-2.14.1.jar

In the above case, the log4j dependencies are running version 2.14.1 which is affected.
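The version check above can also be scripted. The following Python sketch parses the version from each jar file name and flags anything in the affected range. It is an illustration, not a MarkLogic-provided tool: the function names are hypothetical, and pre-release names such as 2.0-beta9 are not parsed by this simplified version and would need to be checked by hand.

```python
import re

# Matches the trailing version in names such as log4j-core-2.14.1.jar.
_VERSION_RE = re.compile(r"-(\d+(?:\.\d+)*)\.jar$")


def jar_version(filename):
    """Return the jar's version as a tuple of ints, or None if it cannot be parsed."""
    match = _VERSION_RE.search(filename)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_log4shell_affected(filename):
    """True if the jar version falls in the 2.0 through 2.14.1 range."""
    version = jar_version(filename)
    if version is None:
        return False
    return (2, 0) <= version <= (2, 14, 1)


# File names as returned by the ls command in the example above.
for name in ("log4j-1.2-api-2.14.1.jar", "log4j-api-2.14.1.jar", "log4j-core-2.17.1.jar"):
    print(name, "affected" if is_log4shell_affected(name) else "not affected")
```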

Workaround

The following workaround can be executed on a running MarkLogic service, without stopping it.

AWS

1. SSH into your EC2 instance. You must have sudo access in order to make the changes necessary for the fix.

2. Download and extract the Log4j 2.17.1 dependency from Apache.

curl https://archive.apache.org/dist/logging/log4j/2.17.1/apache-log4j-2.17.1-bin.tar.gz --output log4j.tar.gz && tar -xf log4j.tar.gz

If your EC2 instance does not have outbound external internet access, download the dependency onto a machine that does, and then scp the file over to the relevant ec2 instance via a bastion host.

3. Move the relevant log4j dependencies to the /opt/MarkLogic/mlcmd/lib/ folder, for example:

sudo mv ./apache-log4j-2.17.1-bin/log4j-core-2.17.1.jar /opt/MarkLogic/mlcmd/lib/

sudo mv ./apache-log4j-2.17.1-bin/log4j-api-2.17.1.jar /opt/MarkLogic/mlcmd/lib/

sudo mv ./apache-log4j-2.17.1-bin/log4j-1.2-api-2.17.1.jar /opt/MarkLogic/mlcmd/lib/

4. Remove the old log4j dependencies

sudo rm /opt/MarkLogic/mlcmd/lib/log4j-core-2.14.1.jar

sudo rm /opt/MarkLogic/mlcmd/lib/log4j-1.2-api-2.14.1.jar

sudo rm /opt/MarkLogic/mlcmd/lib/log4j-api-2.14.1.jar

SumoLogic Collector

AMIs for MarkLogic versions 10.0-6 through 10.0-7.3 were shipped with the SumoCollector libraries. These libraries are not needed nor are they executed by MarkLogic Server. Starting with MarkLogic version 10.0-8, the SumoCollector libraries are no longer shipped with the MarkLogic AMIs.

It is safe to remove those libraries from all the instances that you have launched using any of the MarkLogic AMIs available in the Marketplace. You can remove the SumoCollector directory and all its files under /opt.

Additionally, if you have created any clusters using the CloudFormation templates (managed cluster feature), we suggest that you delete the SumoCollector directory under /opt if it exists. Once MarkLogic releases new AMIs, you can update the stack with the new AMI ID and perform a rolling restart of nodes so that the permanent fix is in place.

Other Platforms

For the impacted MarkLogic versions listed above running on platforms besides AWS, the log4j jars are included in the MarkLogic installation folder but are never used. The steps listed in the workaround above can still be applied to these systems even though the systems themselves are not impacted.

MarkLogic Java Client

The MarkLogic Java Client API has neither a direct nor indirect dependency on log4j. The MarkLogic Java Client API does use the industry-standard SLF4J abstract interface for logging. Any conformant logging library can provide the concrete implementation of the SLF4J interface. By default, MarkLogic uses the logback implementation of the SLF4J interface. The logback library doesn't have the vulnerability that exists in the log4j library. Customers who have chosen to override logback with log4j may have the vulnerability. Such customers should either revert to the default logback library or follow the guidance provided by log4j to address the vulnerability: https://logging.apache.org/log4j/2.x/security.html

MarkLogic Data Hub & Hub Central

MarkLogic Data Hub and Hub Central are not directly affected by the log4j vulnerability. Data Hub and Hub Central use Spring Boot, and Spring has an option to switch the default logging to log4j, but Data Hub does not use that option.
The log4j-to-slf4j and log4j-api jars that we include in spring-boot-starter-logging cannot be exploited on their own. By default, MarkLogic Data Hub uses the logback implementation of the SLF4J interface.
The logback library doesn't have the vulnerability that exists in the log4j library. Please refer: https://spring.io/blog/2021/12/10/log4j2-vulnerability-and-spring-boot

MarkLogic Data Hub Service

For MarkLogic Data Hub Service customers, no action is needed at this time. All systems have been thoroughly scanned and patched with the recommended fixes wherever needed.

MarkLogic Content Pump (MLCP)

MarkLogic Content Pump 10.0-8.2 & 9.0-13.7 are now available for download from developer.marklogic.com and GitHub. This release resolves the CVE-2019-17571 vulnerability.

MLCP versions 10.0-1 through 10.0-8.2 and versions prior to 9.0-13.6 used an older version of log4j (1.2.17) that is not affected by the primary vulnerability discussed in this article (CVE-2021-44228), but MLCP versions prior to 10.0-8.2 are affected by the critical vulnerability CVE-2019-17571.

MLCP v10.0-8.2 & MLCP v9.0-13.7 specific workaround for CVE-2021-44832

The following workaround can be executed on a host with MLCP:

1. Download and extract the Log4j 2.17.1 dependency from Apache:

curl https://archive.apache.org/dist/logging/log4j/2.17.1/apache-log4j-2.17.1-bin.tar.gz --output log4j.tar.gz && tar -xf log4j.tar.gz

2. Move the relevant log4j dependencies to the $MLCP_PATH/lib/ folder, i.e.:

sudo mv ./apache-log4j-2.17.1-bin/log4j-core-2.17.1.jar $MLCP_PATH/lib/

sudo mv ./apache-log4j-2.17.1-bin/log4j-api-2.17.1.jar $MLCP_PATH/lib/

sudo mv ./apache-log4j-2.17.1-bin/log4j-1.2-api-2.17.1.jar $MLCP_PATH/lib/

sudo mv ./apache-log4j-2.17.1-bin/log4j-jcl-2.17.1.jar $MLCP_PATH/lib/

sudo mv ./apache-log4j-2.17.1-bin/log4j-slf4j-impl-2.17.1.jar $MLCP_PATH/lib/

3. Remove the old log4j dependencies:

sudo rm $MLCP_PATH/lib/log4j-core-2.17.0.jar

sudo rm $MLCP_PATH/lib/log4j-1.2-api-2.17.0.jar

sudo rm $MLCP_PATH/lib/log4j-api-2.17.0.jar

sudo rm $MLCP_PATH/lib/log4j-jcl-2.17.0.jar

sudo rm $MLCP_PATH/lib/log4j-slf4j-impl-2.17.0.jar
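After the steps above, it can be useful to confirm that no older log4j jars remain in the lib folder. The following is only a sketch: the helper name outdated_log4j_jars is hypothetical, and it assumes the jars follow the log4j-<module>-<version>.jar naming seen in the commands above.

```python
import re
from pathlib import Path

# Matches names such as log4j-core-2.17.0.jar, log4j-1.2-api-2.17.1.jar
# and the legacy log4j-1.2.17.jar (assumed naming convention).
_JAR_RE = re.compile(r"^log4j-(?:(?P<module>.+)-)?(?P<version>\d+(?:\.\d+)+)\.jar$")

MINIMUM = (2, 17, 1)  # the version installed by the workaround above


def outdated_log4j_jars(lib_dir, minimum=MINIMUM):
    """Return the sorted names of log4j jars in lib_dir older than `minimum`."""
    outdated = []
    for jar in Path(lib_dir).glob("log4j-*.jar"):
        match = _JAR_RE.match(jar.name)
        if match is None:
            continue  # not a recognisable versioned log4j jar name
        version = tuple(int(part) for part in match.group("version").split("."))
        if version < minimum:
            outdated.append(jar.name)
    return sorted(outdated)
```

Calling outdated_log4j_jars with the path of your $MLCP_PATH/lib folder should return an empty list once the workaround has been applied.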

Pega Connector

The 1.0.0 Pega connector installer briefly runs MLCP 10.0-6.2 via gradle as part of the setup. MLCP 10.0-6.2 uses the old 1.2 log4j jar. The actual connector does not use log4j at runtime. We have released Pega Connector 1.0.1 which uses MLCP 10.0-8.2 with forced dependencies to log4j 2.17.1.

MarkLogic-supported client libraries, tools

All other MarkLogic-supported client libraries, tools, and products are not affected by this security vulnerability.

Verified Not Affected

The following MarkLogic projects, libraries and tools have been verified by the MarkLogic Engineering team as not being affected by this vulnerability:

Apache Spark Connector

AWS Glue Connector

Corb-2

Data Hub Central Community Edition

Data Hub QuickStart

Jena Client - Distro not affected, but some tests contain log4j;

Kafka Connector

MLCP - uses an older version of log4j that is not affected by CVE-2021-44228, but it is affected by CVE-2019-17571. See notes above.

ml-gradle

MuleSoft Connector - The MarkLogic Connector does not depend on log4j2, but it does leverage the MarkLogic Java Client API (see earlier comments);

Nifi Connector

XCC

MarkLogic Open Source and Community-owned projects

If you are using one of the MarkLogic open-source projects that has a direct or transitive dependency on Log4j 2 up to version 2.14.1, please either upgrade Log4j to version 2.16.0 or, for prior releases (<2.16.0), implement the workaround of removing the JndiLookup class from the classpath. Please refer to: https://logging.apache.org/log4j/2.x/security.html

Contact and Links

MarkLogic is dedicated to supporting our customers, partners, and developer community to ensure their safety. If you have a registered support account, feel free to contact support@marklogic.com with any additional questions.

More information about the log4j vulnerability can be found at:

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-44228

https://logging.apache.org/log4j/2.x/security.html

https://www.cisa.gov/uscert/ncas/current-activity/2021/12/13/cisa-creates-webpage-apache-log4j-vulnerability-cve-2021-44228

https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-45046

MarkLogic Cluster Requirements

Summary

All hosts in a MarkLogic cluster of two or more servers must run the same MarkLogic Server installation package.

Operating System Architecture

MarkLogic Server installation packages are created for each supported operating system architecture (e.g. Windows 64 bit, Linux 64 bit …). Consequently, all hosts in a MarkLogic cluster must employ the same operating system architecture.

Version

For MarkLogic 8 and previous releases, all hosts within a MarkLogic cluster must be running the same version of MarkLogic Server. Mixed version clusters are not recommended, are not tested by MarkLogic, and are not supported. The behavior of a mixed version cluster is not defined and could lead to corrupt or inconsistent data.

MarkLogic 9 and MarkLogic 10 include the "Rolling Upgrade" feature, which allows a cluster to run with mixed versions for a very short period of time. For additional details and restrictions, please refer to the Rolling Upgrade section of our Administrator's Guide.

Security and Schema Databases

In a MarkLogic cluster, if the Security database is configured with a Schemas database, the forests of both databases must be placed on the same host.

Cluster Upgrades

When upgrading a MarkLogic cluster to a new release, the upgrade should occur on all hosts in the cluster within a short period of time. The first server to be upgraded must be the server on which the Security database is mounted.

In addition to the MarkLogic Server’s Installation Guide, you will want to refer to the “Upgrading a Cluster to a New Maintenance Release of MarkLogic Server” section in the MarkLogic Server’s Scalability, Availability, and Failover Guide for details regarding the required procedure to upgrade a cluster.

MarkLogic Database Restore Across Clusters

Summary

There are scenarios where you may want to restore a database from a MarkLogic Server backup that was taken from a database on a different cluster.

Examples

Two example scenarios where this may be appropriate:

- For development or testing purposes - you may want to take the content from one system to perform development or testing on a different cluster.

- A system failed, and you need to recreate a cluster and restore the database to the last known good state.

Constraints

There are constraints on performing a database restore from a MarkLogic database backup across clusters:

The source and target servers must be the same Operating System. More specifically, they must be able to use the same MarkLogic Server installation package.

The backups must be accessible from all servers on which a forest in the target database resides.

The path to the backups must be identical on all of the servers.

The MarkLogic process must have sufficient access credentials to read the files in the backup.

If the number of hosts and/or forests is different, see Restoring a Reconfigured Database.

If you are running a MarkLogic version prior to 9.0-4, then the following conditions must also be met:

The forest names must be identical in both the source database and the target database.

The number of forests in both the source and target databases should be the same. If the source database has a forest that does not reside on the target, then that forest data will not be included in the target after the database restore is complete.

Note: Differences in index configuration and/or forest order may result in reindexing or rebalancing after the restore is complete

Debugging Problems

If you are experiencing difficulties restoring a database backup, you can validate the backup using xdmp:database-backup-validate, or xdmp:database-incremental-backup-validate:

1. In Query Console, execute a simple script that validates restoring the backup, for example:

xquery version "1.0-ml";

let $db-name := "Documents"

let $db-backup-path := "/my-backup-dir/test"

return xdmp:database-restore-validate(

    xdmp:database-forests( xdmp:database($db-name)),

    $db-backup-path)

Set $db-name and $db-backup-path appropriately for your environment. The result will be a backup plan in XML format. Look at both the ‘forest-status’ and ‘directory-status’ for each of the forests. Both should have the “okay” value.

A common error for the ‘directory-status’ is “non-existent”. If you get this error, check the following.

- Verify that the backup directory exists on each server in the cluster that has a forest in the database;

- Verify that the backup directory has a “Forests” subdirectory, and the “Forests” directory contains subdirectories for each of the forests that reside on the Server.

- For the above directories, subdirectories and file contents, verify that the MarkLogic process has the proper credentials to access them.

2. If xdmp:database-backup-validate, or xdmp:database-incremental-backup-validate does not indicate any errors, then look in the MarkLogic Server’s ErrorLog.txt for entries during the time of the restore for any errors reported. It is a good idea to set the MarkLogic Server group’s ‘File log level’ to ‘debug’ in order to get detailed error messages.
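The directory checks listed under step 1 can also be scripted. The sketch below assumes the layout described above (a "Forests" subdirectory containing one directory per forest); the function name check_backup_dir is hypothetical, and the access test only reflects the user running the script, so it should be run as the same user as the MarkLogic process.

```python
import os
from pathlib import Path


def check_backup_dir(backup_path, forest_names):
    """Return a list of human-readable problems found under backup_path."""
    root = Path(backup_path)
    if not root.is_dir():
        return [f"backup directory does not exist: {root}"]
    forests_dir = root / "Forests"
    if not forests_dir.is_dir():
        return [f"missing 'Forests' subdirectory: {forests_dir}"]
    problems = []
    for name in forest_names:
        forest_dir = forests_dir / name
        if not forest_dir.is_dir():
            problems.append(f"missing forest directory: {forest_dir}")
        elif not os.access(forest_dir, os.R_OK | os.X_OK):
            # Checked for the current user only, not for the MarkLogic process.
            problems.append(f"forest directory not readable: {forest_dir}")
    return problems
```

An empty list means that none of the checked problems were found on that host; the same check would need to be repeated on every host that has a forest in the target database.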

Helpful Commands:

On Unix Systems, the following commands may be useful in troubleshooting:

Check the 'file system access user ID' for the MarkLogic process

ps -A -o fuser,pid,comm | grep MarkLogic

View file/directory permissions, owner and group

ls -l

Change ownership recursively. In a default installation this should be daemon

chown -R daemon:daemon /path/to/Backup

Add read and write permissions recursively

chmod -R +rw /path/to/Backup

Further Reading

Transporting Resources to a New Cluster

Phases of Backup or Restore Operation

Restoring a Reconfigured Database

MarkLogic fails to start with Initialization: XDMP-ENCODING: (err...

Summary

MarkLogic may fail to start with an XDMP-ENCODING error: Initialization: XDMP-ENCODING: (err:XQST0087) Unsupported character encoding: ascii. This is caused by a mismatch between the Linux locale character set and the UTF-8 character set required by MarkLogic.

Solutions

There are two primary causes of this error. The first is using service instead of systemctl to start MarkLogic on some Linux distros. The second is related to the Linux language settings.

Starting MarkLogic Service

On an Azure MarkLogic VM, as well as some more recent Linux distros, you must use systemctl, and not service to start MarkLogic. To start the service, use the following command:

sudo systemctl start MarkLogic

Linux Language Settings

This issue occurs when the Linux locale LANG setting does not specify the UTF-8 character set. It can be resolved by changing the locale to "en_US.UTF-8". For default installations of MarkLogic, this should be done for the root user. To change the system-wide locale settings, /etc/locale.conf needs to be modified, which can be done using the localectl command:

sudo localectl set-locale LANG=en_US.UTF-8

If MarkLogic is configured to run as a non-root user, the locale can be set in that user's environment using the $HOME/.i18n file. If the file does not exist, please create it and ensure it contains the following:

export LANG="en_US.UTF-8"

If that does not resolve the issue in the user environment, then you may need to look at setting LC_CTYPE, or LC_ALL for the locale.

LC_CTYPE will override the character set part of the LANG setting, but will not change other locale settings.

LC_ALL will override both LC_CTYPE and all locale configurations of the LANG setting.
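The override order described above (LC_ALL, then LC_CTYPE, then LANG) can be illustrated with a small sketch. The function names are hypothetical, and the check only inspects environment variable values; it does not verify that the named locale is actually installed on the system.

```python
def effective_ctype(env):
    """Return the (variable, value) pair that decides the character set, or None."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset or empty variables are skipped
            return name, value
    return None


def uses_utf8(env):
    """True if the deciding locale setting names a UTF-8 codeset."""
    decided = effective_ctype(env)
    if decided is None:
        return False
    _, value = decided
    # "en_US.UTF-8@euro" -> "UTF-8"; a value without "." has no codeset part
    codeset = value.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"
```

For example, uses_utf8(os.environ), run in the environment of the user that starts MarkLogic, indicates whether the settings visible to that user select UTF-8.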

References

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-keyboard_configuration

https://access.redhat.com/solutions/974273

https://www.unix.com/man-page/centos/1/localectl/

http://man7.org/linux/man-pages/man1/locale.1.html

MarkLogic Fundamentals - How should I scale out my cluster?

Introduction

A MarkLogic cluster is a group of inter-connected individual machines (often called “nodes” or “hosts”) that work together to perform computationally intensive tasks. Clustering offers scalability and high-availability by avoiding single-points of failure. This knowledgebase article contains tips and best practices around clustering, especially in the context of scaling out.

How many nodes should I have in a cluster?

If you need high-availability, there should be a minimum of three nodes in a cluster to satisfy quorum requirements.

Anything special about deploying on AWS?

Quorum requirements hold true even in a cloud environment where you have Availability Zones (or AZs). In addition to possible node failure, you can also defend against possible AZ failure by splitting your d-Nodes and e-Nodes evenly across three availability zones.

Load distribution after failover events

If a d-node experiences a failover event, the remaining d-nodes pick up its workload so that the data stored in its forests remains available.

Failover forest topology is an important factor in both high-availability and load-distribution within a cluster. Consider the example below of a 3-node cluster where each node has two data forests (dfs) and two local disk-failover forests (ldfs):

Case 1: In the event of a failover, if both dfs (df1.1 and df1.2) from node1 fail over to node2, the load on node2 would double, from 100% to 200%: node2 would now be responsible for its own two forests (df2.1 and df2.2) as well as the two additional forests from node1 (ldf1.1 and ldf1.2).

Case 2: If we instead set up the replica forests so that, when node1 goes down, df1.1 fails over to node2 and df1.2 fails over to node3, the load increase per node is reduced. Instead of one node going from 100% to 200% load, two nodes go from 100% to 150%: node2 is now responsible for its two original forests (df2.1 and df2.2) plus one of node1's failover forests (ldf1.1), and node3 is responsible for its two original forests (df3.1 and df3.2) plus the other failover forest from node1 (ldf1.2).
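The arithmetic behind the two cases can be sketched as follows. This is an illustration only: the function name and the topology dictionaries are hypothetical, load is approximated by the number of forests a host serves, and every host is assumed to master at least one forest.

```python
from collections import Counter


def load_after_failover(topology, failed_host):
    """Map each surviving host to its load as a percentage of its normal load.

    topology maps a data forest name to (master_host, replica_host).
    """
    normal = Counter(master for master, _ in topology.values())
    after = Counter({h: n for h, n in normal.items() if h != failed_host})
    for master, replica in topology.values():
        if master == failed_host:
            after[replica] += 1  # the replica host picks up the failed forest
    return {h: round(100 * after[h] / normal[h]) for h in sorted(after)}


# Case 1: both of node1's forests fail over to node2.
case1 = {
    "df1.1": ("node1", "node2"), "df1.2": ("node1", "node2"),
    "df2.1": ("node2", "node3"), "df2.2": ("node2", "node3"),
    "df3.1": ("node3", "node1"), "df3.2": ("node3", "node1"),
}
# Case 2: node1's forests are spread across node2 and node3.
case2 = {
    "df1.1": ("node1", "node2"), "df1.2": ("node1", "node3"),
    "df2.1": ("node2", "node3"), "df2.2": ("node2", "node1"),
    "df3.1": ("node3", "node1"), "df3.2": ("node3", "node2"),
}

print(load_after_failover(case1, "node1"))  # {'node2': 200, 'node3': 100}
print(load_after_failover(case2, "node1"))  # {'node2': 150, 'node3': 150}
```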

Growing or scaling out your cluster

If you need to fold in additional capacity to your cluster, try to add nodes in "rings of three." Each ring of three can have its own independent failover topology, where nodes 1, 2, and 3 will fail over to each other as described above, and nodes 4, 5, and 6 will fail over to each other separate from the original ring of three. This results in minimal configuration changes for any nodes already in your cluster when adding capacity.

Important related takeaways

In addition to the standard MarkLogic Server clustering requirements, you'll also want to pay special attention to the hardware specification of individual nodes:

Although the hardware specification doesn’t have to be exactly the same across all nodes, it is highly recommended that all d-nodes be of the same specification, because cluster performance will ultimately be limited by the slowest d-node in the system.

You can read more about the effect of slow d-nodes in a cluster in the "Check the Slowest D-Node" section of our "Performance Testing With MarkLogic" whitepaper.

Automatic fail-back after a failover event is not supported in MarkLogic due to the risks of unintentional overwrites, which could potentially result in accidental data loss. Should a failover event occur, human intervention is typically required to manually fail-back. You can read more about the considerations involved in failing a forest back in the following knowledgebase article: Should I flip failed over forests back to their respective masters? What are the risks if I leave them?

Further reading

Documentation

Clustering in MarkLogic Server

Getting Started with Distributed Deployments

Knowledgebase articles

MarkLogic Fundamentals - High-availability & False Failovers

Considerations when scaling out your MarkLogic Instance

MarkLogic Linux Tuned Profile

Summary

The tuned tuning service can change operating system settings to improve performance for certain workloads. Different tuned profiles are available and choosing the profile that best fits your use case simplifies configuration management and system administration. You can also write your own profiles, or extend the existing profiles if further customization is needed. The tuned-adm command allows users to switch between different profiles.

RedHat Performance and Tuning Guide: tuned and tuned-adm

tuned-adm list will list the available profiles

tuned-adm active will list the active profile

Creating a MarkLogic Tuned Profile

Using the throughput-performance profile, we can create a custom tuned profile for MarkLogic Server. First create the directory for the MarkLogic profile:

sudo mkdir /usr/lib/tuned/MarkLogic/

Next, create the tuned.conf file that will include the throughput-performance profile, along with our recommended configuration:

#
# tuned configuration
#

[main]
summary=Optimize for MarkLogic Server on Bare Metal
include=throughput-performance

[sysctl]
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio=1

[vm]
transparent_hugepages=never
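Because tuned.conf is an INI-style file, a quick way to catch typos before activating the profile is to parse it and inspect the resulting sections. The sketch below uses Python's configparser purely as a stand-in INI parser (it is not necessarily how tuned itself reads the file), and the function name parse_tuned_conf is hypothetical.

```python
import configparser

# The same content as the tuned.conf shown above.
TUNED_CONF = """\
#
# tuned configuration
#

[main]
summary=Optimize for MarkLogic Server on Bare Metal
include=throughput-performance

[sysctl]
vm.swappiness = 1
vm.dirty_ratio = 40
vm.dirty_background_ratio=1

[vm]
transparent_hugepages=never
"""


def parse_tuned_conf(text):
    """Parse INI-style text into {section: {option: value}} (values are strings)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


settings = parse_tuned_conf(TUNED_CONF)
print(settings["sysctl"])
```

To check a file on disk, the text can be read from /usr/lib/tuned/MarkLogic/tuned.conf and passed to the same function.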

Activating the MarkLogic Tuned Profile

Now when we run tuned-adm list, it should show the default profiles as well as our new MarkLogic profile:

$ tuned-adm list
Available profiles:
- MarkLogic - Optimize for MarkLogic Server
- balanced - General non-specialized tuned profile
- desktop - Optimize for the desktop use-case
- hpc-compute - Optimize for HPC compute workloads
- latency-performance - Optimize for deterministic performance at the cost of increased power consumption
- network-latency - Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
- network-throughput - Optimize for streaming network throughput, generally only necessary on older CPUs or 40G+ networks
- powersave - Optimize for low power consumption
- throughput-performance - Broadly applicable tuning that provides excellent performance across a variety of common server workloads
- virtual-guest - Optimize for running inside a virtual guest
- virtual-host - Optimize for running KVM guests
Current active profile: virtual-guest

Now we can make MarkLogic the active profile:

$ sudo tuned-adm profile MarkLogic

And then check the active profile:

$ tuned-adm active
Current active profile: MarkLogic

Disabling the Tuned Daemon

The tuned daemon does have some overhead, and so MarkLogic recommends that it be disabled. When the daemon is disabled, tuned will only apply the profile settings and then exit. Update the /etc/tuned/tuned-main.conf and set the following value:

daemon = 0

References

Linux Performance Tuning for MarkLogic

Linux Swappiness

IO Schedulers

Linux Huge Pages and Transparent Huge Pages

RHEL Performance Tuning: tuned and tuned-adm

RHEL Performance Tuning: tuned

MarkLogic ODBC Setup and Quick start for Linux environments

Introduction

There is a lot of useful information in MarkLogic Server's documentation surrounding many of the new features of MarkLogic 9 - including the new SQL implementation, improvements made to the ODBC driver and the new system for generating SQL "view" templates for your data. This article attempts to pull it all together by showing all the measures needed to create a successful connection and to verify that everything is set up correctly and works as expected.

This guide presents a step-by-step walk through covering the installation of all the necessary components, the configuration of the ODBC driver and the loading of data into MarkLogic in order to create a Template View that will allow a SQL query to be rendered.

Prerequisites

We're starting with a clean install of Redhat Enterprise Linux 7:

$ uname -a
Linux engrlab-128-084.engrlab.marklogic.com 3.10.0-327.4.5.el7.x86_64 #1 SMP Thu Jan 21 04:10:29 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

In this example, I'm using yum to manage the additional dependencies (openssl-libs and unixODBC) required for the MarkLogic ODBC driver:

$ sudo yum install openssl-libs
Package 1:openssl-libs-1.0.2k-8.el7.x86_64 already installed and latest version
Nothing to do
$ sudo yum install unixODBC
Package unixODBC-2.3.1-11.el7.x86_64 already installed and latest version
Nothing to do

If you want to use the latest version of unixODBC (2.3.4 at the time of writing), you can get it using cURL by running curl -O ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz

$ curl -O ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1787k  100 1787k    0     0   235k      0  0:00:07  0:00:07 --:--:--  371k

Please note - as per the documentation, this method will require unixODBC to be compiled so additional dependencies may need to be met for this.

This article assumes that you have downloaded the ODBC driver for MarkLogic Server and the MarkLogic 9 install binary and have those available on your machine:

$ ll
total 310112
-r--r--r-- 1 support support 316795526 Nov 16 04:19 MarkLogic-9.0-3.x86_64.rpm
-r--r--r-- 1 support support    754596 Nov 16 04:18 mlsqlodbc-1.3-3.x86_64.rpm

Getting started: installing and configuring MarkLogic 9 with an ODBC Server

We will start by installing and starting MarkLogic 9:

$ sudo rpm -i MarkLogic-9.0-3.x86_64.rpm
$ sudo service MarkLogic start
Starting MarkLogic: [ OK ]

From there, we can point our browser at http://host:8001 and walk through the initial MarkLogic install process:

As soon as the install process has been completed and you have created an Administrator user for MarkLogic Server, we're ready to create an ODBC Application Server.

To do this, go to Configure > Groups > Default > App Servers and select the Create ODBC tab:

Next we're going to make the minimal configuration necessary by entering the required fields - the odbc server name, the Application Server module directory root and the port.

In this example we will configure the Application Server using the following values:

odbc server name    ml-odbc
root                /
port                5432

After this is done, confirm that the Application Server has been created by going to Configure > Groups > Default > App Servers and ensure that you can see the ODBC Server listed and configured on port 5432 as per the image below:

Getting started: Setting up the MarkLogic ODBC Driver

Use RPM to install the ODBC driver:

$ sudo rpm -i mlsqlodbc-1.3-3.x86_64.rpm
odbcinst: Driver installed. Usage count increased to 1.
    Target directory is /etc

Configure the base template as instructed in the installation guide:

$ odbcinst -i -s -f /opt/MarkLogic/templates/mlsql.template

Getting started: ensure unixODBC is configured

To ensure the unixODBC commandline client is configured, you can run isql -h to bring up the help options:

$ isql -h

**********************************************
* unixODBC - isql                            *
**********************************************
* Syntax                                     *
*                                            *
*      isql DSN [UID [PWD]] [options]        *
*                                            *
* Options                                    *
*                                            *
* -b         batch.(no prompting etc)        *
* -dx        delimit columns with x          *
* -x0xXX     delimit columns with XX, where  *
*            x is in hex, ie 0x09 is tab     *
* -w         wrap results in an HTML table   *
* -c         column names on first row.      *
*            (only used when -d)             *
* -mn        limit column display width to n *
* -v         verbose.                        *
* -lx        set locale to x                 *
* -q         wrap char fields in dquotes     *
* -3         Use ODBC 3 calls                *
* -n         Use new line processing         *
* -e         Use SQLExecDirect not Prepare   *
* -k         Use SQLDriverConnect            *
* --version  version                         *
*                                            *
* Commands                                   *
*                                            *
* help - list tables                         *
* help table - list columns in table         *
* help help - list all help options          *
*                                            *
* Examples                                   *
*                                            *
*      isql WebDB MyID MyPWD -w < My.sql     *
*                                            *
*      Each line in My.sql must contain      *
*      exactly 1 SQL command except for the  *
*      last line which must be blank (unless *
*      -n option specified).                 *
*                                            *
* Please visit;                              *
*                                            *
*      http://www.unixodbc.org               *
*      nick@lurcher.org                      *
*      pharvey@codebydesign.com              *
**********************************************

If you're not seeing the above message, it is possible that another application on your system is overriding this command. In this configuration, the isql command is found at /usr/bin/isql:

$ which isql
/usr/bin/isql

Getting started: initial connection test

If you're happy that isql is correctly installed, we're ready to test the connection using isql -v:

$ isql -v MarkLogicSQL admin admin
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL>

Let's confirm that it's really working by loading some data into MarkLogic and creating an SQL view around that data.

Loading sample data into MarkLogic

To load data, we're going to use Query Console to insert the same sample data that is created in the Quick Start Documentation:

To access Query Console, point your browser at http://host:8000 and make note of the following:

Ensure the database is set to Documents (or at least, matches the database specified by your ODBC Application Server) and ensure that the Query Type is set to JavaScript

When these are both set correctly, run the code to generate sample data (note that this data is taken from the quick start guide and reproduced here for convenience):

After that has run, you should see a null response back from the query:

To confirm that the data was loaded successfully, you can use the Explore button. You should now see that 22 employee documents (rows) are now in the database:

Create the template view

Now the documents are loaded, a tabular view for that data needs to be created.

Ensure the database is (still) set to Documents (or at least, matches the database specified by your ODBC Application Server) and ensure that the Query Type is now set to XQuery

As soon as this is set, you can run the code below to generate the template view (note that this data is taken from the quick start guide and reproduced here for convenience):

And to confirm this was loaded, Query Console should report an empty sequence was returned.

Test the template using a SQL Query

The database should remain set to Documents and ensure that the Query Type is now set to SQL:

Then you can run the following SQL Query:

SELECT * FROM employees

If everything has worked correctly, Query Console should render a view of the table in response to your query:

Test the SQL Query via the ODBC Driver

All that remains now is to go back to the shell and test the same connection over ODBC.

To do this, we're going to use the isql command again and run the same request there:

$ isql -v MarkLogicSQL admin admin
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL> select * from employees

<<< RESPONSE CUT >>>

SQLRowCount returns 7
7 rows fetched

Further reading

MarkLogic Documentation: SQL on MarkLogic Server Quick Start

MarkLogic Documentation: Configuring the ODBC Driver on Linux

MarkLogic OS Parameter Handling at Startup

There are various operating system settings that MarkLogic prescribes for best performance. During the startup of a MarkLogic Server instance, some of these parameters are set to the recommended values. These parameters include:

File descriptor limit

Number of processes per user

Swappiness

Dirty background ratio

Max sectors

Read ahead

For some settings, Info level error log messages are recorded to indicate that these values were changed. For example, the MarkLogic Server error log might include lines similar to:

2020-03-03 12:40:25.512 Info: Reduced Linux kernel swappiness to 1
2020-03-03 12:40:25.512 Info: Reduced Linux kernel dirty background ratio to 1
2020-03-03 12:40:25.513 Info: Reduced Linux kernel read ahead to 512KB for vda
2020-03-03 12:40:25.513 Info: Increased Linux kernel max sectors to 2048KB for vda

MarkLogic Server I/O Requirements Guide

SUMMARY

This article will help MarkLogic Administrators and System Architects who need to understand how to provision the I/O capacity of their MarkLogic installation.

MarkLogic Disk Usage

Databases in MarkLogic Server are made up of forests. Individual forests are made up of stands. In the interests of both read and write performance, MarkLogic Server doesn't update data already on disk. Instead, it simply writes to the current in-memory stand, which will then contain the latest version of any new or changed fragments, and old versions are marked as obsolete. The current in-memory stand will eventually become yet another on-disk stand in a particular forest.

Ultimately, however, the more stands or obsolete fragments there are in a forest, the more time it takes to resolve a query. Merges are a background process that reduce the number of stands and purge obsolete fragments in each forest in a database, thereby improving the time it takes to resolve queries. Because merges are so important to the optimal operation of MarkLogic Server, it's important to provision the appropriate amount of I/O bandwidth, where each forest will typically need 20MB/sec read and 20MB/sec write. For example, a machine hosting four forests will typically need sufficient I/O bandwidth for both 80MB/sec read and 80MB/sec write.
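The sizing rule above is simple multiplication, as the following sketch shows. The function name is hypothetical, and the 20MB/sec figure is the typical per-forest value quoted above, not a measured requirement for any particular workload.

```python
PER_FOREST_MB_PER_SEC = 20  # typical read and write bandwidth per forest, from the text above


def required_io_bandwidth(forests_on_host, per_forest=PER_FOREST_MB_PER_SEC):
    """Return the typical read and write bandwidth (MB/sec) a host should provide."""
    if forests_on_host < 0:
        raise ValueError("forests_on_host must not be negative")
    total = forests_on_host * per_forest
    return {"read_mb_per_sec": total, "write_mb_per_sec": total}


print(required_io_bandwidth(4))  # {'read_mb_per_sec': 80, 'write_mb_per_sec': 80}
```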

Determining I/O Bandwidth

One way to determine I/O bandwidth would be to use a synthetic benchmarking utility to return the available read and write bandwidth for the system as currently provisioned. While useful in terms of getting a ballpark sense of the I/O capacity, this approach unfortunately does not provide any information about the real world demand that will ultimately be placed on that capacity.

Another way would be to actually load test a candidate provisioning against the application you're going to deploy on this cluster. If you start from our general recommendations (from MarkLogic: Understanding System Resources) then do an application level load test (paying special attention to I/O heavy activities like ingestion or re-indexing, and the subsequent merging), the system metrics from that load test will then tell you what, if any, bottlenecks or extra capacity may exist on the system across not only your I/O subsystem, but for your CPU and RAM usage as well.

For both of these approaches (measuring capacity via synthetic benchmarks or measuring both capacity and demand vs. synthetic application load), it would also be useful to have some sense of the theoretical available I/O bandwidth before doing any testing. In other words, if you're provisioning shared storage like SAN or NAS, your storage admin should have some idea of the bandwidth available to each of the hosts. If you're provisioning local disk, you probably already have some performance guidance from the vendors of the I/O controllers or disks being used in your nodes. We've seen situations in the past where actual available bandwidth has been much different from expected, but at a minimum the expected values will provide a decent baseline for comparison against your eventual testing results.

Additional Resources

MarkLogic Data Management Tutorial

Monitoring Metrics of Interest to MarkLogic Server

MarkLogic Server IP Ports

Introduction

This article provides a list of IP ports that MarkLogic Server uses.

MarkLogic Server Ports

The following IP ports should be open and accessible on every host in the cluster:

Port 7997 (TCP/HTTP) is the default HealthCheck application server port and is required to check health/proper running of a MarkLogic instance.

Port 7998 (TCP/XDQP) is the default "foreign bind port" on which the server listens for foreign inter-host communication - required for the database replication feature. This port is configurable and can be set with the admin:host-set-foreign-port() function.

Port 7999 (TCP/XDQP) is the default "bind port" on which the server listens for inter-host communication within the cluster. The bind port is required for all MarkLogic Server Clusters. This port is configurable and can be set with the admin:host-set-port() function.

Port 8000 (TCP/HTTP) is the default App-Services application server port and is required by Query Console.

Port 8001 (TCP/HTTP) is the default Admin application server port and is required by the Admin UI.

Port 8002 (TCP/HTTP) is the default Manage application server port and is required by Configuration Manager and Monitoring Dashboard.
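As a rough illustration of how the port list above could be verified from another machine, the following Python sketch attempts a plain TCP connection to each default port. It is only a reachability check under stated assumptions: the port numbers are the defaults listed above, the helper names (port_is_open, check_host) are hypothetical, and a successful TCP connect does not prove that the application server behind the port is healthy.

```python
import socket

# Default MarkLogic ports as listed above; adjust if your cluster overrides them.
DEFAULT_PORTS = {
    7997: "HealthCheck",
    7998: "foreign bind port (XDQP)",
    7999: "bind port (XDQP)",
    8000: "App-Services",
    8001: "Admin",
    8002: "Manage",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_host(host, ports=DEFAULT_PORTS):
    """Map each port number to True/False depending on TCP reachability."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling check_host() with the name of one of your own hosts returns a dictionary keyed by port number, which can be compared against what your firewall rules are expected to allow.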

MarkLogic 9 and Telemetry

Port 443 (TCP/HTTPS): outbound connections on this port must be allowed in order to use the MarkLogic Telemetry feature introduced in MarkLogic 9 (and above).

Ops Director Ports

The following ports are the default ports used by Ops Director. These can be changed during the installation process.

Port 8003 (TCP/HTTP) is the "SecureManage" default port and must be open on the managed cluster, to allow the Ops Director cluster to monitor the cluster. If 8003 is already in use, it will choose the next open port above 8003.

Port 8008 (TCP/HTTP) is the "OpsDirectorApplication" default application server port, and allows access to the Ops Director UI.

Port 8009 (TCP/HTTP) is the "OpsDirectorSystem" default application server port, and allows access to the Ops Director APIs.

(Note: The Ops Director feature has been deprecated with MarkLogic 10.0-5.)

Data Hub Framework (DHF) Ports

The following ports are the default ports used by the Data Hub Framework. Both the ports and the database/app server names can be changed during the installation process.

Port 8010 (TCP/HTTP) is the "data-hub-STAGING" default application server port for accessing ingested data for further processing.

Port 8011 (TCP/HTTP) is the "data-hub-FINAL" default application server port for downstream applications to access harmonized data.

Port 8013 (TCP/HTTP) is the "data-hub-JOBS" default application server port for jobs (flow runs).

Recommendations

In production, the ports listed above should be hidden behind a firewall. Only your customer application ports should be accessible to outside users. We also recommend disabling Query Console and CQ instances in production to avoid an errant query that may run away with system resources.

Netstat Utility

The netstat utility is useful for checking open ports:

Linux:

netstat -an | egrep 'Proto|LISTEN'

Windows:

Open cmd as Administrator

C:\Windows\System32>netstat -abon -p tcp

Look for MarkLogic.exe entries in the list.

Commonly Used Ports

The following is a list of commonly used ports and services that may need to have access limited or be disabled/blocked, based on your local network security policies.

Port    General Service Ports

20 FTP Data Transfer Mode

21 FTP Control (command) Mode

22 SSH

23 Telnet

43 WHOIS

53 DNS

Port    Web Service Ports

80 HTTP

119 NNTP

3306 MySQL

Port    Control Panel Default Ports

2082 cPanel

2083 Secure cPanel

2086 WHM

2087 Secure WHM

2095 cPanel Webmail

2096 Secure cPanel Webmail

8443 Secure Plesk

8880 Plesk

10000 Webmin

Port    E-mail Service Ports

25 SMTP

465 SMTPS

109 POP2

110 POP3

143 IMAP

993 IMAPS

This Wikipedia article contains a more comprehensive list:

http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers

MarkLogic Server Isolation Levels

Introduction

MarkLogic automatically provides

ANSI REPEATABLE READ level of isolation for update transactions, and

Serializable isolation for read-only (query) transactions.

MarkLogic can be made to provide ANSI SERIALIZABLE isolation for update transactions, but doing so requires developers to manage their own predicate locks.

Isolation Levels - Background

There are many possible levels of isolation, and many different taxonomies of isolation levels. The most common taxonomy (familiar to those with a RDBMS background) is the one defined by ANSI SQL, which defines four levels of isolation based on read phenomena that are possible at each level. ANSI has a definition for each phenomenon, but these definitions are open to interpretation. Broad interpretation results in more rigorous criteria for each isolation level (and therefore better isolation at each level), whereas strict interpretation results in less rigorous isolation at each level. Here I’ll use a shorthand notation to describe these phenomena, and will use the broad rather than the strict interpretation. The notation specifies the operation, the transaction performing the operation, and the item or domain on which the operation is performed. Operations in my notation are:

Write (w)

Read (r)

Commit (c)

Abort/rollback (a)

An example of this shorthand: w1[x] means transaction1 writes to item x.

Now the phenomena:

A dirty read happens when a transaction T2 reads an item that is being written by concurrently running transaction T1. In other words: w1[x]…r2[x]…((c1 or a1) and (c2 or a2) in any order). This phenomenon could lead to an anomaly in the case where T1 later aborts, and T2 has then read a value that never existed in the database.

A non-repeatable read happens when a transaction T2 writes an item that was read by a transaction T1 prior to T1 completing. In other words: r1[x]…w2[x]…((c1 or a1) and (c2 or a2) in any order). Non-repeatable reads don’t produce the same anomalies as dirty reads, but can produce errors in cases where T1 relies on the value of x not changing between statements in a multi-statement transaction (e.g. reading and then updating a bank account balance).

A phantom read happens when a transaction T1 retrieves a set of data items matching some search condition and concurrently running transaction T2 makes a change that modifies the set of items that match that condition. In other words: (r1[P] and w2[x in P] in any order)…((c1 or a1) and (c2 or a2) in any order), where P is a set of results. Phantom reads are usually less serious than dirty or non-repeatable reads because it generally doesn’t matter if item x in P is written before or after T1 finishes unless T1 is itself explicitly reading x. And in this case the phenomenon would no longer be a phantom, but would instead be a dirty or non-repeatable read per the definitions above. That said, there are some cases where phantom reads are important.

The isolation levels ANSI defines are based on which of these three phenomena are possible at that isolation level. They are:

READ UNCOMMITTED – all three phenomena are possible at this isolation level.

READ COMMITTED – Dirty reads are not possible, but non-repeatable and phantom reads are.

REPEATABLE READ – Dirty and non-repeatable reads are not possible, but phantom reads are.

SERIALIZABLE – None of the three phenomena are possible at this isolation level.

Note that as defined above, ANSI SERIALIZABLE is not sufficient for transactions to be truly serializable (in the sense that running them concurrently and running them in series would in all cases produce the same result), so SERIALIZABLE is an unfortunate choice of names for this isolation level, but that’s what ANSI called it.
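The shorthand notation above lends itself to a small mechanical check. The following Python sketch parses a history written in that notation and reports whether it exhibits a dirty read or a non-repeatable read under the broad interpretation. It is an illustration only: the function names are hypothetical, and phantom reads are left out because they need a predicate (set membership) model rather than single items.

```python
import re

# Matches operations such as w1[x], r2[x], c1 or a2.
OP = re.compile(r"([wrca])(\d+)(?:\[(\w+)\])?")


def parse(history):
    """Turn 'w1[x] r2[x] c1 c2' into a list of (op, txn, item) tuples."""
    return [(m.group(1), int(m.group(2)), m.group(3)) for m in OP.finditer(history)]


def _find(ops, first, second):
    """True if txn A does `first` on an item and, before A commits or aborts,
    a different txn B does `second` on the same item."""
    ends = {txn: pos for pos, (op, txn, _) in enumerate(ops) if op in ("c", "a")}
    for p, (op1, t1, x1) in enumerate(ops):
        if op1 != first:
            continue
        for q in range(p + 1, len(ops)):
            op2, t2, x2 = ops[q]
            still_running = ends.get(t1, len(ops)) > q
            if op2 == second and t2 != t1 and x2 == x1 and still_running:
                return True
    return False


def has_dirty_read(history):
    # w1[x] ... r2[x] while T1 is still running
    return _find(parse(history), "w", "r")


def has_non_repeatable_read(history):
    # r1[x] ... w2[x] while T1 is still running
    return _find(parse(history), "r", "w")
```

For example, has_dirty_read("w1[x] r2[x] c1 c2") is true, while the same operations with the commit of T1 moved before the read ("w1[x] c1 r2[x] c2") are not flagged.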

Update Transaction Locks

Typically, a DBMS will avoid dirty and non-repeatable reads by taking locks on records (called item locks). Locks are either shared locks (which can be held by more than one transaction) or exclusive locks (which can be held by only one transaction at a time). In most DBMSes (including MarkLogic), locks taken when reading an item are shared and locks taken when writing an item are exclusive.

MarkLogic prevents dirty and non-repeatable reads in update transactions by taking item locks on items that are being read or written during a transaction and releasing those locks only on completion of the transaction (post-commit or post-abort). When a transaction needs to lock an item on which another transaction has an exclusive lock, that transaction waits until either the lock is released or the transaction times out. Deadlock detection prevents cases where two transactions are waiting on each other for exclusive locks. In this case one of the transactions will abort and restart.

In addition, MarkLogic prevents some types of phantom reads by taking item locks on the set of items in a search result. This prevents phantom reads involving T2 removing an item in a set that T1 previously searched, but does not prevent phantom reads involving T2 inserting an item in a set that T1 previously searched, or those involving T2 searching for items and seeing a deletion caused by T1.

Avoiding All Phantom Reads

To avoid all phantom reads via locking, it is necessary to take locks not just on items that currently match the search criteria, but also on all items that could match the search criteria, whether they currently exist in the database or not. Such locks are called predicate locks. Because you can search for pretty much anything in MarkLogic, guaranteeing a predicate lock for arbitrary searches would require locking the entire database. From a concurrency and throughput perspective, this is obviously not desirable. MarkLogic therefore leaves the decision to take predicate locks and the scope of those locks in the hands of application developers. Because the predicate domain can frequently be narrowed down with some application-specific knowledge, this provides the best balance between isolation and concurrency. To take a predicate lock, you lock a synthetic URI representing the predicate domain in every transaction that reads from or writes to that domain. You can take shared locks on a synthetic URI via fn:doc(URI). Exclusive locks are taken via xdmp:lock-for-update(URI).

Note that predicate locks should only be taken in situations where phantom reads are intolerable. If your application can get by with REPEATABLE READ isolation, you should not take predicate locks, because any additional locking results in additional serialization and will impact performance.
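To make the shared/exclusive and synthetic-URI ideas above more concrete, here is a toy lock table in Python. It is a conceptual model only, not MarkLogic's implementation: the LockTable class and the URI used in the usage note are hypothetical, and instead of blocking, try_lock() simply returns False where a real transaction would wait.

```python
class LockTable:
    """Toy shared/exclusive lock table keyed by URI (illustrative only)."""

    def __init__(self):
        self._locks = {}  # uri -> (mode, set of holding transaction ids)

    def try_lock(self, txn, uri, mode):
        """Grant the lock and return True, or return False if txn would have to wait."""
        held = self._locks.get(uri)
        if held is None:
            self._locks[uri] = (mode, {txn})
            return True
        held_mode, holders = held
        if holders == {txn}:
            # The sole holder may re-acquire, or upgrade shared -> exclusive.
            new_mode = "exclusive" if "exclusive" in (mode, held_mode) else "shared"
            self._locks[uri] = (new_mode, holders)
            return True
        if mode == "shared" and held_mode == "shared":
            # Shared locks are compatible with each other.
            holders.add(txn)
            return True
        return False

    def release_all(self, txn):
        """Release every lock held by txn (done on commit or abort)."""
        for uri in list(self._locks):
            _, holders = self._locks[uri]
            holders.discard(txn)
            if not holders:
                del self._locks[uri]
```

In this model, a searching transaction T1 that takes a shared lock on a synthetic URI such as "/locks/open-orders" causes try_lock("T2", "/locks/open-orders", "exclusive") to return False until T1 calls release_all(). That is the serialization a predicate lock buys, and also the concurrency cost the note above warns about.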

Summary

To summarize, MarkLogic automatically provides ANSI REPEATABLE READ level of isolation for update transactions and true serializable isolation for read-only (query) transactions. MarkLogic can be made to provide ANSI SERIALIZABLE isolation for update transactions, but doing so requires developers to manage their own predicate locks.

MarkLogic Server on MacOS Rosetta2 - Apple Silicon M1 / M[x]

Introduction

Rosetta 2 is a seamless, very efficient emulator designed to bridge the transition between Intel and Apple Silicon processors (e.g. M1[x]). The first time you launch a Mac app on an Apple Silicon computer, you might be asked to install the Rosetta component to open it.
Currently, when installing MarkLogic Server DMG (pkg) on Apple Silicon macOS, you will be blocked by the following error:
“MarkLogic Server ([version]) can’t be installed on this computer.
MarkLogic Server requires an Intel processor.”
The error above is caused by MarkLogic’s macOS system call to verify if it’s running on an Intel processor. This legacy check was required when Apple was transitioning from PowerPC to Intel CPUs (announced in June 2005, Rosetta 1 emulation). MarkLogic Server has never been available for PowerPC-based Apple Computers. In order to install MarkLogic’s Intel package on Apple Silicon, the legacy check has to be removed from the installation script.

Procedure

*1. Open a Terminal [0] and install Rosetta2 emulation software.

$ softwareupdate --install-rosetta

Note: For additional information, please check the official Apple Rosetta 2 article. [1]
[1] https://support.apple.com/en-us/HT211861
* Step not required if Rosetta 2 is already installed for other Intel-based applications.

2. Download a MarkLogic Server DMG from the MarkLogic developer website [2]
[2] https://developer.marklogic.com/products/marklogic-server

3. Mount the DMG and copy the install package to a writable temporary location in the local filesystem

$ cp -R /Volumes/MarkLogic/ /Users/[your_username]/tmp

4. In a Terminal window, edit Contents/Resources/InstallationCheck in a text editor (e.g. vim or nano)

$ vim /Users/[your_username]/tmp/MarkLogic-[downloaded_package_version].pkg/Contents/Resources/InstallationCheck

Note: As an alternative, in Finder, right-click the package and choose “Show Package Contents”. Navigate to “Contents/Resources/” and edit the file “InstallationCheck” with a GUI text editor.

5. Delete or comment out the following block (lines 46-52):

46 echo "Checking for Intel CPU"
47 if [[ $CPU_TYPE != "7" ]] ;
48 then
49 echo "MarkLogic Server requires a CPU with an Intel instruction set."
50 exit 114; # displays message 18
51 fi
52 echo "$CPU_NAME is an Intel CPU."

Save the file “InstallationCheck” and back out of the folder.

6. Install the MarkLogic package from the GUI Finder or CLI as intended. [3]
[3] https://docs.marklogic.com/guide/installation/procedures#id_28962

Conclusions
• The procedure in this knowledge base article makes it possible to install MarkLogic Server on macOS Rosetta 2 - Apple Silicon M1 / M[x].
• MacOS is supported for development only. Conversion (Office and PDF) and entity enrichment are not available on macOS. [4]
• The legacy installation check has been removed in MarkLogic 10.0-10 and later releases.
• Once the legacy check is removed, Rosetta 2 emulation software will still be required until an official native M1 / M[x] MarkLogic Server package is available.

References
[0] https://support.apple.com/guide/terminal/open-or-quit-terminal-apd5265185d-f365-44cb-8b09-71a064a42125/
[1] https://support.apple.com/en-us/HT211861
[2] https://developer.marklogic.com/products/marklogic-server
[3] https://docs.marklogic.com/guide/installation/procedures#id_28962
[4] https://docs.marklogic.com/guide/installation/intro#id_63469

MarkLogic Server stores text in Unicode NFC normalized form

Summary

Text is stored in MarkLogic Server in Unicode NFC normalized form.

Discussion

In MarkLogic Server, all text is converted into Unicode NFC normalized form before tokenization and storage.

Unicode considers NFC-compatible characters to be essentially equivalent. See the Unicode normalization FAQ and Conformance Requirements in the Unicode Standard.

Example

For example, consider the NFC equivalence of the codepoints U+2126 (OHM SIGN) and U+03A9 (GREEK CAPITAL LETTER OMEGA). This is shown for the U+2126 entry in the Unicode code chart for the U+2100 block.

You can see the effects of normalization alone, and during tokenization, by running the following in MarkLogic Server's Query Console:

xquery version "1.0-ml";
(: equivalence of Ω forms :)
let $s := fn:codepoints-to-string (xdmp:hex-to-integer ('2126'))
let $token := cts:tokenize ($s)
return (
  'original: '||xdmp:integer-to-hex (fn:string-to-codepoints ($s)),
  'normalized: '||xdmp:integer-to-hex (fn:string-to-codepoints (fn:normalize-unicode ($s, 'NFC'))),
  'tokenized: '||xdmp:describe ($token, (), ())
)

The results show the original value, the normalized value, and the resulting token:

original: 2126
normalized: 3a9
tokenized: cts:word("Ω")
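The same normalization can be reproduced outside MarkLogic with any Unicode-aware library, which can be handy when preparing test data or comparing values on the client side. The following Python sketch uses the standard unicodedata module; the helper name hex_codepoints is hypothetical and the sketch covers only the normalization step, not MarkLogic tokenization.

```python
import unicodedata

ohm_sign = "\u2126"     # OHM SIGN
greek_omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

# NFC maps the OHM SIGN to its canonical equivalent, the Greek capital omega.
normalized = unicodedata.normalize("NFC", ohm_sign)


def hex_codepoints(text):
    """Return the codepoints of text as lowercase hex strings."""
    return [format(ord(ch), "x") for ch in text]


print("original:  ", hex_codepoints(ohm_sign))
print("normalized:", hex_codepoints(normalized))
print("equal after NFC:", normalized == greek_omega)
```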

MarkLogic Server v9 Tokenization and Stemming

Abstract

In MarkLogic Server version 9, the default tokenization and stemming code has been changed for all languages (except English tokenization). Some tokenization and stemming behavior will change between MarkLogic 8 and MarkLogic 9. We expect that, in most cases, results will be better in MarkLogic 9.

Information is given for managing this change in the Release Notes at Default Stemming and Tokenization Libraries Changed for Most Languages, and for further related features at New Stemming and Tokenization.

In-depth discussion is provided below for those interested in details.

General Comments on Incompatibilities

General implications of tokenization incompatibilities

If you do not reindex, old content may no longer match the same searches, even for unstemmed searches.

General tokenization incompatibilities

There are some edge-case changes in the handling of apostrophes in some languages; in general this is not a problem, but some specific words may include/break at apostrophes.

Tokenization is generally faster for all languages except English and Norwegian (which use the same tokenization as before).

General implications of stemming incompatibilities

Where there is only one stem, and it is now different: Old data will not match stemmed searches without reindexing, even for the same word.

Where the new stems are more precise: Content that used to match a query may not match any more, even with reindexing.

Where there are new stems, but the primary stem is unchanged: Content that used to not match a query may now match it with advanced stemming or above. With basic stemming there should be no change.

Where the decompounding is different, but the concatenation of the components is the same: Under decompounding, content may match a query when it used to not match, or may not match a query when it used to match, when the query or content involves something with one of the old/new components. Matching under advanced or basic stemming would be generally the same.

General stemming incompatibilities

MarkLogic now has general algorithms backing up explicit stemming dictionaries. Words not found in the default dictionaries will sometimes be stemmed when they previously were not.

Diminutives/augmentatives are not usually stemmed to base form.

Comparatives/superlatives are not usually stemmed to base form.

There are differences in the exact stems for pronoun case variants.

Stemming is more precise and restricted by common usage. For example, if the past participle of a verb is not usually used as an adjective, then the past participle will not be included as an alternative stem. Similarly, plural forms that only have technical or obscure usages might not stem to the singular form.

Past participles will typically include the past participle as an alternative stem.

The preferred order of stems is not always the same: this will affect search under basic stemming.

Reindexing

It is advisable to reindex to be sure there are no incompatibilities. Where the data in the forests (tokens or stems) does not match the current behavior, reindexing is recommended. This will have to be a forced reindex or a reload of specific documents containing the offending data. For many languages this can be avoided if queries do not touch on specific cases. For certain languages (see below) the incompatibility is great enough that it is essential to reindex.

Language Notes

Below we give some specific information and recommendations for various languages.

Arabic

stemming

The Arabic dictionaries are much larger than before. Implications: (1) better precision, but (2) slower stemming.

Chinese (Simplified)

tokenization

Tokenization is broadly incompatible.

The new tokenizer uses a corpus-based language model. Better precision can be expected.

recommendation

Reindex all Chinese (simplified).

Chinese (Traditional)

tokenization

Tokenization is broadly incompatible.

The new tokenizer uses a corpus-based language model. Better precision can be expected.

recommendation

Reindex all Chinese (traditional).

Danish

tokenization

This language now has algorithmic stemming, and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all Danish content if you are using stemming.

Dutch

stemming

There will be much more decompounding in general, but MarkLogic will not decompound certain known lexical items (e.g., "bastaardwoorden").

recommendation

Reindex Dutch if you want to query with decompounding.

English

stemming

British variants may include the British variant as an additional stem, although the first stem will still be the US variant.

Stemming produces more alternative stems. Implications are (1) stemming is slightly slower and (2) index sizes are slightly larger (with advanced stemming).

Finnish

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

French

See general comments above.

German

stemming

Decompounding now applies to more than just pure noun combinations. For example, it applies to "noun plus adjectives" compound terms. Decompounding is more aggressive, which can result in identification of more false compounds. Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) for compound terms, search gives better recall, with some loss of precision.

recommendation

Reindex all German.

Hungarian

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

Italian

See general comments above.

Japanese

tokenization

Tokenization is broadly incompatible.

The tokenizer provides internal flags that the stemmer requires. This means that (1) tokenization is incompatible for all words at the storage level due to the extra information and (2) if you install a custom tokenizer for Japanese, you must also install a custom stemmer.

stemming

Stemming is broadly incompatible.

recommendation

Reindex all Japanese content.

Korean

stemming

Particles (e.g., 이다) are dropped from stems; they used to be treated as components for decompounding.

There is different stemming of various honorific verb forms.

North Korean variants are not in the dictionary, though they may be handled by the algorithmic stemmer.

recommendation

Reindex Korean unless you use decompounding.

Norwegian (Bokmal)

stemming

Previously, hardly any decompounding was in evidence; now it is pervasive.

Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where it comes to compounds.

recommendation

Reindex Bokmal if you want to query with decompounding.

Norwegian (Nynorsk)

stemming

Previously hardly any decompounding was in evidence; now it is pervasive.

Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where it comes to compounds.

recommendation

Reindex Nynorsk if you want to query with decompounding.

Norwegian (generic 'no')

stemming

Previously 'no' was treated as an unsupported language; now it is treated as both Bokmal and Nynorsk: for a word present in both dialects, all stem variants from both will be present.

recommendation

Do not use 'no' unless you really must; reindex if you want to query it.

Persian

See general comments above.

Portuguese

stemming

More precision with respect to feminine variants (e.g., ator vs atriz).

Romanian

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

Russian

stemming

Inflectional variants of cardinal or ordinal numbers are no longer stemmed to a base form.

Inflectional variants of proper nouns may stem together due to the backing algorithm, but it will be via affix-stripping, not to the nominal form.

Stems for many verb forms used to be the perfective form; they are now the simple infinitive.

Stems used to drop ё but now preserve it.

recommendation

Reindex all Russian.

Spanish

See general comments above.

Swedish

stemming

Previously hardly any decompounding was in evidence; now it is pervasive.

Implications: (1) stemming is slower, (2) decompounding takes more space, and (3) search gives better recall, with some loss of precision, at least where it comes to compounds.

recommendation

Reindex Swedish if you want to query with decompounding.

Tamil

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

Turkish

tokenization

This language now has algorithmic stemming and may have slight tokenization differences around certain edge cases.

recommendation

Reindex all content in this language if you are using stemming.

MarkLogic Server Version Downgrades are Not Supported

Summary

Version downgrades are not supported by MarkLogic Server.

Back up your configuration files before you do anything else

Please ensure you have all your current configuration files backed up.

Each host in a MarkLogic cluster is configured using parameters which are stored in XML Documents that are available on each host. These are usually relatively small files and will zip up to a manageable size.

If you cd to your "Data" directory (on Linux this is /var/opt/MarkLogic; on Windows this is C:\Program Files\MarkLogic\Data and on OS X this is /Users/{username}/Library/Application Support/MarkLogic), you should see several xml files (assignments, clusters, databases, groups, hosts, server).

Whenever MarkLogic updates any of these files, it creates a backup using the same naming convention used for older ErrorLog files (_1, _2 etc). We recommend backing up all configuration files before following the steps under the next heading.
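One possible way to capture these files in a single archive is sketched below in Python. It is a minimal example under stated assumptions: the function name backup_config is hypothetical, it simply zips every top-level .xml file found in the Data directory you pass in (which also picks up the rotated _1, _2 copies), and it does not back up forest data.

```python
import zipfile
from pathlib import Path


def backup_config(data_dir, archive_path):
    """Zip every top-level *.xml file in data_dir and return the archived names."""
    files = sorted(Path(data_dir).glob("*.xml"))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store only the file name so the archive unpacks into one flat folder.
            archive.write(path, arcname=path.name)
    return [path.name for path in files]
```

On Linux this might be called as backup_config("/var/opt/MarkLogic", "/tmp/marklogic-config.zip"), run as a user with read access to the Data directory; the archive location is only an example.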

Upgrade Guidelines:

If you follow these upgrade guidelines and recommendations, then it is unlikely that you will need to downgrade your MarkLogic Server instance.

Stay current with upgrades. MarkLogic Server typically gets faster and more stable with each release.

Upgrade to the latest maintenance release of your current version before upgrading to the next feature version (for example: when upgrading from v6.0-2 to v7.0-2.3, first upgrade to the latest 6.0-x release). This may be required for older maintenance releases.

[Note] Upgrading across 2 feature releases in a single step may not be supported - please refer to the Installation and Upgrade Guide for supported upgrade paths.

When planning an upgrade across feature releases of MarkLogic Server, plan for a full test cycle of your application.

In addition to testing your application on the new version, test the upgrade process.

Always read the release notes before performing an upgrade. Pay particular attention to the "Known Incompatibilities" section.

Back-out Plan

Although it is not expected that you will ever need to back out a version upgrade of MarkLogic Server, it is always prudent to have a contingency plan for the worst case scenario.

Before an upgrade, you will want to

Create a backup of all of your forests in all of your databases.

Save your deployment scripts (used for creating your system configuration)

In the unlikely event you want to restore your system to a previous version level, you will need to first make a decision regarding the data in your databases.

MarkLogic does not support restoring a backup made on a newer version of MarkLogic Server onto an older version of MarkLogic Server. Your Back-Out Plan will need to take this into consideration.

If it is sufficient to roll back to the data as it existed previous to the upgrade, you will be able to just restore the data from the backup you made prior to the upgrade;

If you can recreate / reload all of the data that was inserted after the upgrade, you can restore the data from the pre-upgrade backup and then reload / recreate the new data;

If you need to capture the current data as it now exists in the database, you can use mlcp to export from the database, and then use mlcp import to load this back into the database.

Once you have decided how to handle your data, you will need to recreate your MarkLogic Instance from a fresh install. This can be done on fresh hardware, or in place if you are careful with your data. The steps will include

Uninstall MarkLogic on each host

Remove any previous instance of MarkLogic data and configuration files from each host.

Install MarkLogic Server.

Recreate your configuration

Restore your data using the method you decided on previously.

In addition to testing an upgrade, you should also test your Back-Out Plan.

MarkLogic Server/Data Hub version compatibility and upgrade

What is MarkLogic Data Hub?

MarkLogic’s Data Hub increases data integration agility, in contrast to time consuming upfront data modeling and ETL. Grouping all of an entity’s data into one consolidated record with that data’s context and history, a MarkLogic Data Hub provides a 360° view of data across silos. You can ingest your data from various sources into the Data Hub, standardize your data - then more easily consume that data in downstream applications. For more details, please see our Data Hub documentation.

Note: Prior to version 5.x, Data Hub was known as Data Hub Framework (DHF)

Takeaways:

In contrast to previous versions, Data Hub 5 is largely configuration-based. Upgrading to Data Hub 5 will require either:

Conversion of legacy flows from the code-based approach of previous versions to the configuration-based format of Data Hub 5

Executing your legacy flows with the “hubRunLegacyFlow” Gradle task

It’s very important to verify the “Version Support” information on the Data Hub GitHub README.md before installing or upgrading to any major Data Hub release

Pre-requisites:

One of the pre-requisites for installing Data Hub is to check for the supported/compatible MarkLogic Server version. For details, see our version compatibility matrix. Other pre-requisites can be seen here.

New installations of Data Hub

We always recommend installing the latest Data Hub version compatible with your current MarkLogic Server version. For example:

- If a customer is running MarkLogic Server 9.0-7, they should install the most recent compatible Data Hub version (5.0.2), even if the previous Data Hub versions (such as 5.0.1, 5.0.0, 4.x and 3.x) also work with server version 9.0-7.

- Similarly, if a customer is running 9.0-6, the recommended Data Hub version would be 4.3.1 instead of previous versions 4.0.0, 4.1.x, 4.2.x and 3.x.

Note: A specific MarkLogic server version can be compatible with multiple Data Hub versions and vice versa, which allows independent upgrades of either Data Hub or MarkLogic Server.

Upgrading from a previous version

To determine your upgrade path, first find your current Data Hub version in the “Can upgrade from” column in the version compatibility matrix.

While Data Hub should generally work with future server versions, it’s always best to run the latest Data Hub version that's also explicitly listed as compatible with your installed MarkLogic Server version.

If required, make sure to upgrade your MarkLogic Server version to be compatible with your desired Data Hub version. You can upgrade MarkLogic Server and Data Hub independently of each other as long as you are running a version of MarkLogic Server that is compatible with the Data Hub version you plan to install. If you are running an older version of MarkLogic Server, then you must upgrade MarkLogic Server first, before upgrading Data Hub.

Note: Data Hub is not designed to be 'backwards' compatible with any version before the MarkLogic Server version listed with the release. For example, you can’t use Data Hub 3.0.0 on 9.0-4 – you’ll need to either downgrade to Data Hub 2.0.6 while staying on MarkLogic Server 9.0-4, or alternatively upgrade MarkLogic Server to version 9.0-5 while staying on Data Hub 3.0.0.

Example 1 - Scenario where you DO NOT NEED to upgrade MarkLogic Server:

Current Data Hub version: 4.0.0

Target Data Hub version: 4.1.x

ML server version: 9.0-9

The “Can upgrade from” value for the target version shows 2.x, which means you need to be on at least Data Hub 2.x. Since the current Data Hub version is 4.0.0, this requirement has been met.

Unless there is a strong reason for choosing 4.1.x, we highly recommend upgrading to the latest 4.x version compatible with MarkLogic Server 9.0-9, which in this example is 4.3.2. Consequently, the recommended upgrade path here becomes 4.0.0-->4.3.2 instead of 4.0.0-->4.1.x.

Since 9.0-9 is supported by the recommended Data Hub version 4.3.2, there is no need to upgrade ML server.

Hence, the recommended path is Data Hub 4.0.0-->4.3.2

Example 2 - Scenario where you NEED to upgrade MarkLogic Server:

Current Data Hub version: 3.0.0

Target Data Hub version: 5.0.2

ML server version: 9.0-6

The “Can upgrade from” value for the target version shows Data Hub version 4.3.1, which means you need to be on at least 4.3.x (4.3.1 or 4.3.2 depending on your MarkLogic Server version). Since the current Data Hub version 3.0.0 doesn’t satisfy this requirement, the upgrade path after this step becomes Data Hub 3.0.0-->4.3.x

As per the matrix, the latest compatible Data Hub version for 9.0-6 is 4.3.1, so the path becomes 3.0.0-->4.3.1

From the matrix, the minimum supported MarkLogic Server version for 5.0.2 is 9.0-7, so you will have to upgrade your MarkLogic Server version before upgrading your Data Hub version to 5.0.2.

Because 9.0-7 is supported by all 3 versions under consideration (3.0.0, 4.3.1 and 5.0.2), the recommended path can be either of the following:

Upgrade Data Hub from 3.0.0 to 4.3.1-->upgrade MarkLogic Server to at least 9.0-7-->upgrade Data Hub to 5.0.2

Upgrade MarkLogic Server to at least 9.0-7-->upgrade Data Hub from 3.0.0 to 4.3.1-->upgrade Data Hub to 5.0.2
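The server-version check used in Example 2 can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, and the version numbers are the ones quoted in the example, not a copy of the compatibility matrix.

```python
def parse_ml_version(version: str) -> tuple:
    """Turn a MarkLogic Server version string such as '9.0-6' or '10.0-4.2'
    into a tuple of integers, e.g. (9, 0, 6) or (10, 0, 4, 2)."""
    normalized = version.replace("-", ".")
    return tuple(int(part) for part in normalized.split("."))


def needs_server_upgrade(current: str, minimum: str) -> bool:
    """True when the running server is older than the minimum version
    supported by the target Data Hub release."""
    return parse_ml_version(current) < parse_ml_version(minimum)


# Example 2: the server is on 9.0-6 and Data Hub 5.0.2 needs at least 9.0-7.
print(needs_server_upgrade("9.0-6", "9.0-7"))  # True
# A server already on 9.0-9 satisfies a 9.0-7 minimum.
print(needs_server_upgrade("9.0-9", "9.0-7"))  # False
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "9.0-10" as older than "9.0-9".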

Recall that Data Hub 5 moved to a configuration-based approach from previous versions’ code-based approach. Upgrading to Data Hub 5 from a previous major version will require either:

Conversion of legacy flows from the code-based approach of previous versions to the configuration-based format of Data Hub 5

Executing your legacy flows with the “hubRunLegacyFlow” Gradle task

Links for Reference:

https://docs.marklogic.com/datahub/upgrade.html

MarkLogic Support FAQ

Question

Answer

Further Reading

What are the maximum and minimum number of nodes a MarkLogic Cluster can have?

Minimum: 1 node (3 nodes if you want high availability)

Optimum: ~64 nodes

Maximum: 256 nodes

KB Articles:

MarkLogic Fundamentals - How should I scale out my cluster?

Limits on the number of MarkLogic Servers in a cluster

Documentation:

Scalability considerations in MarkLogic Server

Are all nodes created equal in MarkLogic?

In MarkLogic, how a node is configured, provisioned, and scaled depends on the type of that node and what roles it might serve:

A single node can act as an e-node, d-node, or both ("e/d-node")

With respect to high availability/failover, any one node serves as both primary host (for its assigned data forests) and failover host (for its assigned failover forests)

With respect to disaster recovery/replication, nodes can serve as either hosts for primary data forests in the primary cluster, or as hosts for replica forests in the replica cluster

Bootstrap hosts are used to establish an initial connection to foreign clusters during database replication. Only the nodes hosting your security forests (both primary security forests as well as their local disk failover copies) need to be bootstrap hosts

KB Articles:

How should I scale out my cluster?

Should my primary and replica write to the same shared storage?

Documentation:

Evaluator/Data node Architecture

Bootstrap hosts

Can I have nodes with mixed specifications within a cluster?

Queries in MarkLogic Server use every node in the cluster

Fast nodes will wait for slow nodes - especially slow d-nodes

Therefore, all nodes - especially all d-nodes - should be of the same hardware specification

KB Articles:

How should I scale-out my cluster?

Documentation:

"Check the slowest d-node" section of our Performance Testing With MarkLogic white-paper

MarkLogic Server Clustering Requirements

Does MarkLogic support Horizontal Scaling or Vertical Scaling?

Both horizontal (more nodes) and vertical scaling (bigger nodes) are possible with MarkLogic Server

Do note that high availability (HA) in MarkLogic Server requires at least some degree of horizontal scaling with a minimum of three nodes in a cluster

Given the choice between one big node and three smaller nodes, most deployments would be better off with three smaller nodes to take advantage of HA

Documentation:

Scalability considerations in MarkLogic Server

I'm confused about high availability (HA) vs. disaster recovery (DR) - How does MarkLogic do HA? - How does MarkLogic do DR?

High Availability (HA) in MarkLogic Server involves automatic forest failover, which maintains database availability in the face of host failure. Failing back is a manual operation

Disaster Recovery (DR) in MarkLogic Server involves a separate copy - with smaller data deltas (database replication) or larger (backup/restore). Switching to and back from DR copies are both manual operations

Documentation:

Failover and Database Replication

How many forests can a MarkLogic cluster have?

There is a design limit of 1024 forests (including Local Disk Failover forests)

If you need more than 1024 forests, look into super-clusters and super-databases

KB Articles

Considerations when scaling out MarkLogic Cluster

Documentation:

Super Databases and Clusters

How do I calculate the I/O bandwidth of a MarkLogic node?

The I/O bandwidth of a node can be calculated with the following formula:

(# of forests per node * I/O bandwidth per forest)

For example, if your node has 10 TB of disk capacity:

# of forests per node: (disk space / max forest size)

Disk space: 10 TB

Recommended max forest size in MarkLogic: 512 GB

Recommended # of forests for this node: 20 (disk space / max forest size)

I/O bandwidth per forest: 20 MB/sec read, 20 MB/sec write

Total I/O bandwidth: 20 * 20 MB/sec (# of forests * I/O bandwidth per forest)

So, if your disk capacity is 10 TB, the I/O bandwidth will be:

400 MB/sec read, 400 MB/sec write

Similarly, if your disk capacity is 20 TB, the I/O bandwidth will be:

800 MB/sec read, 800 MB/sec write
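The calculation above can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes 1 TB = 1024 GB and rounds the forest count up; the 512 GB and 20 MB/sec figures are the ones quoted in this answer.

```python
import math

MAX_FOREST_SIZE_GB = 512    # rule-of-thumb maximum forest size
IO_PER_FOREST_MB_S = 20     # per-forest read (and write) bandwidth


def forests_per_node(disk_tb: float) -> int:
    """Recommended forest count: disk space / max forest size."""
    disk_gb = disk_tb * 1024  # assumption: binary units
    return math.ceil(disk_gb / MAX_FOREST_SIZE_GB)


def node_io_bandwidth_mb_s(disk_tb: float) -> int:
    """# of forests per node * I/O bandwidth per forest.
    The same figure applies to reads and to writes."""
    return forests_per_node(disk_tb) * IO_PER_FOREST_MB_S


for disk_tb in (10, 20):
    print(disk_tb, "TB ->", node_io_bandwidth_mb_s(disk_tb), "MB/sec read and write")
```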

KB Articles:

MarkLogic Server I/O Requirements Guide

What is the maximum size for a forest in MarkLogic?

The rule-of-thumb maximum size for a forest is 512GB

It's almost always better to have more small forests instead of one very large forest

It's important to keep in mind that forests have hard maximums for:

Number of stands

Number of fragments

KB Articles:

Reaching stand limit frequently?

"Emergency: Stand has 'n' fragments" message

Documentation:

Forest sizes per data node host

How many documents per forest/database?

While MarkLogic Server does not have a practical or effective limit on the number of documents in a forest or database, you'll want to watch out for:

Size of forests - as bigger forests require more time and computational resources to maintain

Maximum number of stands per forest (64) is a hard stop and difficult to unwind - so it's important that your database is merging often enough to stay well under that limit. Most deployments don't come close to this maximum unless they're underprovisioned and therefore merging too slowly or too infrequently

Maximum number of fragments per stand (on the order of tens or hundreds of millions). Most deployments typically scale horizontally to more forests (and therefore more stands) well before needing to worry about the number of fragments in a specific stand

KB Articles:

Considerations when scaling out MarkLogic Cluster

MarkLogic Fundamentals - How should I scale out my cluster?

High Availability & Failover in MarkLogic FAQ

Reaching stand limit frequently?

"Emergency: Stand has 'n' fragments" message

Performance Theory: Tales From MarkLogic Support

Documentation:

Forest sizes per data node host

How should I configure my default databases (like security)?

The required number of master forests for default databases is one

For example - each default database (including Security) is recommended to have one data forest and one LDF forest

More LDF copies are not recommended as they're almost never worth the additional administrative complexity and dedicated hardware resources

Documentation

Understanding Databases

KB Articles:

How many forests should my security database have?

Multiple forests for security database

High Availability and Failover in MarkLogic FAQ

What is the recommended record or document size?

100 KB +/- two orders of magnitude (1 KB - 10 MB)

KB Articles:

Performance Theory: Tales From MarkLogic Support

What is the recommended number of range indexes for a database?

On the order of 100 or so

If you need many more, revise your data model to take advantage of Template Driven Extraction (TDE)

KB Articles

How Many Range Indexes?

Range Indexes and Mapped File Initialization Errors

Documentation

TDE FAQ

10000 Range Indexes

Does it help to do concurrent MLCP jobs in terms of performance?

Each MLCP job, starting in version 10.0-4.2, uses the maximum number of threads available on the server as the default thread count

Since a single job already uses all the available threads, concurrent MLCP jobs won't be helpful in terms of performance

KB Articles:

Does MLCP support concurrent jobs?

Documentation:

MLCP import command line options (thread_count_number)

Should we backup default databases?

We recommend regular backups for the Security database

If actively used, regular backups are recommended for Schemas, Modules, Triggers and other default databases

KB Articles:

What are the databases installed by default and do I need to back them up?

Backup/restore best practices?

Backups can be CPU/RAM intensive

Incremental backups minimize storage, not necessarily time

Unless your cluster is over-provisioned compared to most, concurrent backup jobs are not recommended

The "Include Replica" setting allows for backup if failed over - but also doubles your backup footprint in terms of storage

The "Max Backups" setting is applicable only for full backups

KB Articles:

Understanding the role of Journals in relation to Backup and Restore

Database backup, restore and local disk failover

Documentation:

Backing Up and Restoring a Database

Do we need to mirror configuration between primary and replica databases? If so, how do we do it?

Yes - primary and replica databases should have mirrored configurations. If the replica database's configuration is different, query results from the replica database will also be different

Configurations can be mirrored with Configuration Manager (deprecated in 10.0-3), or with ml-gradle/Configuration Management API (CMA)

KB Articles:

Alternatives to Configuration Manager

Configuration Migration of MarkLogic Server using Gradle and ml-gradle plug in

What to consider when configuring the thread_count option for MLCP export?

By default the -thread_count is 4 (if -thread_count is not specified)

For best performance, you can configure this option to use the maximum number of threads supported by the app server in the group (maximum number of server threads allowed on each host in the group * the number of hosts in the group)

E.g.: For a 3-node cluster, this number will be 96 (32*3) where:

32 is the max number of threads allowed on each host

3 is the number of hosts in the cluster

Note: If the -thread_count is configured to use the maximum number of server threads, running concurrent jobs is strongly discouraged
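As a rough illustration of the arithmetic above, the following Python sketch computes the thread count for the 3-node example and assembles an MLCP export command line. The host, port, credentials, and output path are placeholders, not values from this article.

```python
def max_export_thread_count(threads_per_host: int, hosts_in_group: int) -> int:
    """Maximum server threads per host * number of hosts in the group."""
    return threads_per_host * hosts_in_group


# 3-node cluster with 32 server threads allowed on each host.
thread_count = max_export_thread_count(32, 3)

# Placeholder connection details; replace with your own.
command = [
    "mlcp.sh", "export",
    "-host", "localhost",
    "-port", "8000",
    "-username", "user",
    "-password", "password",
    "-output_file_path", "/tmp/export",
    "-thread_count", str(thread_count),
]
print(" ".join(command))
```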

KB Articles:

Does MLCP support concurrent jobs?

Documentation:

Export Command Line Options

MarkLogic supported ISO Codes for the representation of language ...

Summary

In addition to its multiple language support, MarkLogic Server supports the ISO codes listed below for representing the names of these languages.

MarkLogic supported ISO codes

MarkLogic supports the following ISO codes for the representation of language names:
1. ISO 639-1
2. ISO 639-2/T, and
3. ISO 639-2/B

Further, NOTE:
a. MarkLogic uses the 2-letter ISO 639-1 codes, including zh's zh_Hant variant, and
b. MarkLogic uses the 3-letter ISO 639-2 codes. To get a more specific list of ISO 639-2 codes go to http://www.loc.gov/standards/iso639-2/php/code_list.php

Note that MarkLogic supports only the languages listed below (see http://docs.marklogic.com/guide/search-dev/languages#id_64343):
English
French
Italian
German
Russian
Spanish
Arabic
Chinese (Simplified and Traditional)
Korean
Persian (Farsi)
Dutch
Japanese
Portuguese
Norwegian (Nynorsk and Bokmål)
Swedish

Suggestion

The function cdict:get-languages() can be used to get ISO Codes for all supported languages. Here is an example of the usage:

xquery version "1.0-ml";
import module namespace cdict = "http://marklogic.com/xdmp/custom-dictionary" at "/MarkLogic/custom-dictionary.xqy";
cdict:get-languages()
==> ("en", "ja", "zh", "zh_Hant")

MarkLogic Transaction Locks (and Transaction Mode) vs Document an...

Summary

There are many different kinds of locks present in MarkLogic Server.

Transaction locks are obtained when MarkLogic Server detects the potential of a transaction to change the database, at which point the server considers it to be an update transaction. Once a lock is acquired, it is held until the transaction ends. Transaction locks are set by MarkLogic Server either explicitly or implicitly depending on the configured commit mode. Because it's very common to see poorly performing application code written against MarkLogic Server due to unintentional locking, the two concepts of transaction type and commit mode have been combined into a single, simpler control - transaction mode

MarkLogic Server also has the notion of document and directory locks. Unlike transaction locks, document and directory locks must be set explicitly and are persistent in the database - they are not tied to a transaction. Document locks also apply to temporal documents. Any version of a temporal document can be locked in the same way as a regular document.

Cache partition locks are used by threads that can make changes to a cache. A thread needs to acquire a write lock for both the relevant cache and cache partition before it makes the change.

Transaction Locks and Commit Mode vs. Transaction Mode

Transaction lock types are associated with transaction types. Query type transactions do not use locks to obtain a consistent view of data, but rather the state of the data at a particular timestamp. Update type transactions have the potential to change the database and therefore require locks on documents to ensure transactional integrity.

So - if an update transaction type is run in explicit commit mode, then locks are acquired for all statements in an update transaction - whether or not those statements perform updates. Once a lock is acquired, it is held until the transaction ends. If an update transaction type is run in auto commit mode, by default MarkLogic Server detects the transaction type through static analysis of the first statement in that transaction. If the server detects the potential for updates during static analysis, then the transaction is considered an update transaction - which results in a write lock being acquired.

In multi-statement transactions, if an update transaction type is run in explicit commit mode, then the transaction is an update transaction and locks are acquired for all statements in an update transaction - even if no update occurs. In auto commit mode MarkLogic Server determines the transaction type through static analysis of the first statement. If in auto commit mode, and the first statement is a query, and an update occurs later in that transaction, MarkLogic Server will throw an exception. In multi-statement transactions, the transaction ends only when it is explicitly committed or rolled back. Failure to explicitly commit or roll back a multi-statement transaction might retain locks until the transaction times out or reaches the end of the session - at which point the transaction rolls back.

Best practices:

1) Avoid unnecessary transaction locks or holding on to transaction locks for too long. For single-statement transactions, do not explicitly set the transaction type to update if running a query. For multi-statement transactions, always explicitly commit or rollback the relevant transaction to free transaction locks as soon as possible.

2) It's very common for users to write code that unintentionally takes write locks. One of the best ways to avoid unintentional locks is to use transaction mode instead of transaction types/commit modes. Transaction mode combines transaction type and commit mode into a single configurable value. You can read more about transaction mode in our documentation at: Transaction Mode Overview.

3) Be aware that when setting transaction mode, the xdmp:commit and xdmp:update XQuery prolog options affect only the next transaction created after their declaration; they do not affect an entire session. Use xdmp:set-transaction-mode or xdmp.setTransactionMode if you need to change the transaction mode settings at the session level.

Document and Directory Locks

Document and directory locks are not tied to a transaction. The locks must be explicitly set and stored as a lock document in a MarkLogic Server database. So the locks can last a specified time period or be persistent until explicitly unlocked.

Each document and directory can have a lock. The lock can be used as part of an application's update strategy. MarkLogic Server gives clients the flexibility to set up a locking policy that is suitable for their environment. For example, if only one user is allowed to update specific database objects, you can set the lock to be "exclusive." In contrast, if you have multiple users updating the same database object, you can set the lock to be "shared."

Unlike transaction locks, document and directory locks are persistent in the database and are consequently searchable.

Temporal Document Locks

A temporal collection contains bi-temporal or uni-temporal documents. Each version of a temporal document can be locked in the same way as a regular, non-temporal document.

Cache and Cache Partition Locks

If a thread attempts to make a change to a database cache, it needs to acquire a write lock for the relevant cache and cache partition. This cache or cache partition write lock serializes write access, which keeps data in the relevant cache or cache partition thread-safe. While cache and cache partition locks are short-lived, be aware that in the case of a single cache partition, all of the threads needing to access that partition would need to serialize through a single cache partition write lock. For multiple cache partitions, multiple write locks can be acquired with one lock per partition, which allows multiple threads to make concurrent cache partition updates.

References and Additional Reading:

1) Understanding Transactions in MarkLogic Server

2) Cache Partitions

3) Document and Directory Locks

4) Understanding Locking in MarkLogic Server Using Examples

5) Understanding XDMP-DEADLOCK

6) Understanding the Lock Trace Diagnostic Trace Event

7) How MarkLogic Server Supports ACID Transactions

MarkLogic Upgrade FAQ

Question

Answer

Further Reading

How can we download the latest version of MarkLogic server?

The latest available versions are published on our Support portal, and they are available for download from our Developer portal

Documentation:

Developer portal

Installation Guide

How can we download older versions of MarkLogic Server in case it's required?

Please request the specific version you need in a support ticket and we’ll try to provide it to you

Where can we find the EOL (end of life) information for a MarkLogic version?

You can visit the MarkLogic Support page for current versions and information on the release lifecycle.

Where can I see the list of bug fixes between two versions of MarkLogic Server?

You can view all of the bug fixes between MarkLogic Server versions by using our Fixed Bugs reporting tool, available on our Support site

Link:

Fixed Bugs Reporting Tool

What is MarkLogic's recommended procedure for performing major version upgrades (for instance, MarkLogic 9 to MarkLogic 10)?

It is good practice to upgrade to the latest release of whichever version you are currently running (e.g., MarkLogic 9.x to 9.latest) and then to the desired major release version (e.g., MarkLogic 10.x)

Documentation:

Upgrades and Database Compatibility

Upgrade Support

What are the best practices to be followed while planning/performing upgrades?

Refer to our Release Notes, New Features, Known Incompatibilities and Fixed Bugs report

Perform thorough testing of any operational procedures on non-production systems

Run application sanity checks in lower environments against the MarkLogic version to which you’re upgrading before upgrading your production environment. This may require making application code changes to resolve any unforeseen issues (see our documentation on “Known Incompatibilities”)

Ensure you have relevant backups in case you need to rebuild after a failed upgrade

Documentation:

Planning Future Upgrades

Upgrades and Database Compatibility

New Features

Product Release Notes

Known Incompatibilities with Previous Releases

Fixed Bugs Reporting Tool

Upgrade Support

Upgrading from Previous Releases

Upgrading a Cluster to a New Maintenance Release of MarkLogic Server

KB Articles:

Before executing significant operational procedures on production systems

How do I upgrade from one version of MarkLogic Server to another?

How do I upgrade from one version of MarkLogic Server to another?

Please refer to our knowledge base article for details. You can check our documentation for additional details.

Documentation:

Upgrading a Cluster to a New Maintenance Release of MarkLogic Server

Upgrading from Previous Releases

KB Article:

How do I upgrade from one version of MarkLogic Server to another?

Can we downgrade or rollback upgrades to previously installed versions?

No, version downgrades are NOT supported. Ensure you have sufficient backups for your system’s MarkLogic Configuration and Data in the event you need to rebuild your environment

Documentation:

Upgrading from Previous Releases

KB Articles:

MarkLogic Server Version Downgrades are Not Supported

"Wait Replication" scenarios and their resolutions

What is the recommended back out plan in case of emergency upgrade failure?

Although it is not expected that you will ever need to back out a version upgrade of MarkLogic Server, it is always prudent to have contingency plans in place. This knowledgebase article includes the preparation steps needed to back out of an upgrade.

KB Articles:

Back-out Plan sections in

MarkLogic Server Version Downgrades are Not Supported

How do I upgrade from one version of MarkLogic Server to another?

Is there a particular order we need to follow while upgrading a multi node cluster?

The forests for the Security and Schemas databases must be on the same host, and that host must be the first host you upgrade when upgrading a cluster.

Documentation:

Upgrading a Cluster to a New Maintenance Release of MarkLogic Server

Important points to note before performing Rolling Upgrades

Upgrade Support

Upgrading from Previous Releases

KB Article:

How do I upgrade from one version of MarkLogic Server to another?

What is the difference between conventional and rolling upgrades?

Conventional Upgrades:

A cluster can be upgraded with a minimal amount of transactional downtime.

Rolling Upgrades:

The goal in performing a rolling upgrade is to have zero downtime of your server availability or transactional data.

Note: if a cluster’s high availability (HA) scheme is designed to only allow for one host to fail, that cluster becomes vulnerable to failure when a host is taken offline for a rolling upgrade. To maintain HA while a host is offline for a rolling upgrade, your cluster’s HA scheme must be designed to allow two hosts to fail simultaneously

Documentation:

Rolling Upgrades

Installation and Upgrade

Upgrading a Cluster to a New Maintenance Release of MarkLogic Server

What are the important points to note before performing rolling upgrades?

Please refer to our documentation for more details.

Documentation:

Important points to note before performing Rolling Upgrades

How do I roll back a partial upgrade while performing Rolling Upgrades?

If the upgrade is not yet committed, you can reinstall the previously used version of MarkLogic on the affected nodes

Documentation:

Rolling Back a Partial Upgrade

Are there APIs available to perform and manage rolling upgrades?

Yes, there are APIs to perform and manage rolling upgrades. Please refer to our documentation.

Documentation:

APIs for performing Rolling Upgrades

APIs for Managing Rolling Upgrades

How do I upgrade replicated environments?

Replica first! If your Security database is replicated then your replica cluster must be upgraded before your primary cluster. This is also mentioned in our Database Replication FAQ.

Documentation:

Upgrading Clusters Configured with Database Replication

Rolling Upgrades on Both Production and DR Clusters

KB Article:

How do I upgrade from one version of MarkLogic Server to another?

What is the procedure for performing OS patches/upgrades?

There are two choices, each with their own advantages and disadvantages:

Creating brand new nodes with desired OS version and setting them up from scratch (Steps are listed in this KB)

Advantage: You can switch back to your existing cluster if there’s something wrong with your new cluster

Disadvantage: You’ll need a backup cluster until you’re satisfied with the upgraded environment

Upgrading OS on the existing nodes

Advantage: You won’t need two environments as you’ll be upgrading in place

Disadvantage: Larger maintenance window, since your existing nodes will be unavailable during the upgrade/patching process

KB Article:

Recreating a Node into an Existing Cluster

In what order should the client libraries (such as ODBC Driver, MLCP/XCC, Node.js Client API etc) be upgraded?

While the order of client library upgrades doesn't matter, it’s important that the client library versions you’re using are compatible with your MarkLogic Server installation

Documentation:

Product Support Matrix

Can we perform in-place upgrades or do we need to commission new servers? Which among the two is recommended?

Both approaches can work. However, it’s important to understand that in on-premise deployments, you’d generally stay on the same machines, then change/upgrade the MarkLogic binary. In contrast, in AWS and other virtualized environments, you’d generally keep your data/configuration, and instead attach them to a new machine instance, which itself is running the more recent binary

When is the Security database updated?

Security database upgrades happen if the security-version in clusters.xml is different after the upgrade. In between minor release upgrades, there is typically no upgrade of the Security database. You can also check the status in the 'Upgrade' tab in the Admin UI.

What is the procedure for transferring data between MarkLogic clusters?

There are multiple options - these include:

Database Backup and Restore

Database Replication

Tools like MLCP

Documentation:

Backing Up and Restoring a Database

Database Replication

MLCP

KB Articles:

Transferring data between MarkLogic Server clusters

Best Practices for exporting and importing data in bulk

What is the procedure for copying configuration between clusters?

Configurations can be copied with Configuration Manager (deprecated in 10.0-3), or with ml-gradle/Configuration Management API (CMA)

Documentation:

Scripting Additional Administrative Tasks

KB Articles:

Alternatives to Configuration Manager

Configuration Migration of MarkLogic Server using Gradle and ml-gradle plug in

Transporting Configuration to a New Cluster

How do I upgrade MarkLogic on AWS?

Please refer to our MarkLogic on AWS FAQ for more details

What are the best practices for performing Data Hub upgrades?

Please refer to our Data Hub Framework FAQ for more details

Memory Consumption Logging and Status

With the release of MarkLogic Server versions 8.0-8 and 9.0-4, detailed memory use, broken out by major areas, is periodically recorded to the error log. These diagnostic messages can be useful for quickly identifying memory resource consumption at a glance and aid in determining where to investigate memory-related issues.

Error Log Message and Description of Details

At one hour intervals, an Info level log message will be written to the server error log in the following format:

Info: Memory 46% phys=255137 size=136452(53%) rss=20426(8%) huge=97490(38%) anon=1284(0%) swap=1(0%) file=37323(14%) forest=49883(19%) cache=81920(32%) registry=1(0%)

The error log entry contains memory-related figures for non-zero statistics: raw figures are in megabytes; percentages are relative to the amount of physical memory reported by the operating system. Except for phys, all values are for the MarkLogic Server process alone. The figures include

Memory: percentage of physical memory consumed by the MarkLogic Server process

phys: size of physical memory in the machine

size: total process memory for the MarkLogic process; basically huge+anon+swap+file on Linux. This includes memory-mapped files, even if they are not currently in physical memory.

swap: swap consumed by the MarkLogic Server process

rss: Resident Set Size reported by the operating system

anon: anonymous mapped memory used by the MarkLogic Server

file: total amount of RAM for memory-mapped data files used by MarkLogic Server. The MarkLogic Server executable itself, for example, is memory-mapped by the operating system, but is not included in this figure

forest: forest-related memory allocated by the MarkLogic Server process

cache: user-configured cache memory (list cache, expanded tree cache, etc.) consumed by the MarkLogic Server process

registry: memory consumed by registered queries

huge: huge page memory reserved by the operating system

join: memory consumed by joins for active running queries within the MarkLogic Server process

unclosed: unclosed memory, signifying memory consumed by unclosed or obsolete stands still held by the MarkLogic Server process
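For scripted monitoring, the name=value pairs in such a log line can be extracted with a regular expression. The following Python sketch is one possible approach, using the sample line shown above; the function name and the pattern are illustrative and not part of MarkLogic Server.

```python
import re

SAMPLE = (
    "Info: Memory 46% phys=255137 size=136452(53%) rss=20426(8%) "
    "huge=97490(38%) anon=1284(0%) swap=1(0%) file=37323(14%) "
    "forest=49883(19%) cache=81920(32%) registry=1(0%)"
)

# name=megabytes, optionally followed by (percent%)
FIELD_RE = re.compile(r"(\w+)=(\d+)(?:\((\d+)%\))?")


def parse_memory_line(line: str) -> dict:
    """Return {name: megabytes} for every name=value pair in the line."""
    return {name: int(mb) for name, mb, _pct in FIELD_RE.findall(line)}


stats = parse_memory_line(SAMPLE)
# Percentages are relative to physical memory (phys).
size_pct = 100 * stats["size"] / stats["phys"]
print(stats["phys"], stats["size"], round(size_pct))
```

Statistics that are zero are omitted from the log line, so code consuming the dictionary should treat a missing key (for example join or unclosed) as zero.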

In addition to reporting once an hour, the Info level error log entry is written whenever the amount of main memory used by MarkLogic Server changes by more than five percent from one check to the next. MarkLogic Server will check the raw metering data obtained from the operating system once per minute. If metering is disabled, the check will not occur and no log entries will be made.

With the release of MarkLogic Server versions 8.0-8 and 9.0-5, this same information will be available in the output from the function xdmp:host-status().

<host-status xmlns="http://marklogic.com/xdmp/status/host">

. . .

<memory-process-size>246162</memory-process-size>

<memory-process-rss>27412</memory-process-rss>

<memory-process-anon>54208</memory-process-anon>

<memory-process-rss-hwm>73706</memory-process-rss-hwm>

<memory-process-swap-size>0</memory-process-swap-size>

<memory-system-pagein-rate>0</memory-system-pagein-rate>

<memory-system-pageout-rate>14.6835</memory-system-pageout-rate>

<memory-system-swapin-rate>0</memory-system-swapin-rate>

<memory-system-swapout-rate>0</memory-system-swapout-rate>

<memory-size>147456</memory-size>

<memory-file-size>279</memory-file-size>

<memory-forest-size>1791</memory-forest-size>

<memory-unclosed-size>0</memory-unclosed-size>

<memory-cache-size>40960</memory-cache-size>

<memory-registry-size>1</memory-registry-size>

. . .

</host-status>
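If the xdmp:host-status() output is saved or fetched as XML, the memory figures can be pulled out with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on an abbreviated copy of the sample above; how the XML is obtained from the server is left out, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample output shown above.
SAMPLE_XML = """<host-status xmlns="http://marklogic.com/xdmp/status/host">
  <memory-process-size>246162</memory-process-size>
  <memory-process-rss>27412</memory-process-rss>
  <memory-system-pageout-rate>14.6835</memory-system-pageout-rate>
  <memory-size>147456</memory-size>
  <memory-forest-size>1791</memory-forest-size>
  <memory-cache-size>40960</memory-cache-size>
</host-status>"""


def memory_fields(xml_text: str) -> dict:
    """Return {element-name: value} for every memory-* child element."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        # Strip the '{namespace}' prefix that ElementTree adds to tag names.
        local_name = element.tag.split("}", 1)[-1]
        if local_name.startswith("memory-"):
            fields[local_name] = float(element.text)
    return fields


for name, value in memory_fields(SAMPLE_XML).items():
    print(name, value)
```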

Additionally, with the release of MarkLogic Server 8.0-9.3 and 9.0-7, Warning-level log messages will be reported when the host may be low on memory. The messages will indicate the areas involved, for example:

Warning: Memory low: forest+cache=97%phys

Warning: Memory low: huge+anon+swap+file=128%phys

The messages are reported if the total memory used by the mentioned areas is greater than 90% of physical memory (phys). As best practice for most use cases, the total of the areas should not be more than around 80% of physical memory, and should be even less if you are using the host for query processing.
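The 90% rule described above can be approximated outside the server with a small check like the following Python sketch. It only mirrors the description in this article; the function name is hypothetical and the exact logic MarkLogic Server uses internally is not shown here.

```python
def memory_low(stats: dict, areas: list, threshold_pct: float = 90.0):
    """Sum the given areas (in MB) and compare against physical memory.
    Returns (is_low, percent_of_phys)."""
    used_mb = sum(stats.get(area, 0) for area in areas)
    pct_of_phys = 100.0 * used_mb / stats["phys"]
    return pct_of_phys > threshold_pct, pct_of_phys


# Figures (in MB) taken from the sample Info line earlier in this article.
sample = {
    "phys": 255137, "huge": 97490, "anon": 1284, "swap": 1,
    "file": 37323, "forest": 49883, "cache": 81920,
}

for areas in (["forest", "cache"], ["huge", "anon", "swap", "file"]):
    low, pct = memory_low(sample, areas)
    label = "+".join(areas)
    print(f"{label}={pct:.0f}%phys", "LOW" if low else "ok")
```

Passing threshold_pct=80.0 instead would flag hosts that exceed the best-practice ceiling mentioned above, before the server itself starts logging warnings.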

Both forest and file include memory-mapped files; for example, range indexes. Since the OS manages the paging in/out of the files, it knows and reports the actual RAM in use; MarkLogic reports the amount of RAM needed if all the mapped files were in memory at once. That's why MarkLogic can even report >100% of RAM in use: if all the memory-mapped files were required at once, the machine would be out of memory.

Data Encryption Scenario: An encrypted file cannot be memory-mapped and is instead decrypted and read into anon memory. Since the file that is decrypted in memory is not file-backed it cannot be paged out. Therefore, even though encrypted files do not require more memory than unencrypted files, they become memory-resident and require physical memory to be allocated when they are read.

If the hosts are encountering these warnings, memory use should be monitored closely.

Remedial action to support memory requirements might include:

Adding more physical memory to each of the hosts;

Adding additional hosts to the cluster to spread the data across;

Adding additional forests to any under-utilized hosts.

Other action might include:

Archiving/dropping any older forest data that is no longer used;

Reviewing the group level cache settings to ensure they are not set too high, as they make up the cache part of the total. For reference, default (and recommended) group level cache settings based on common RAM configurations may be found in our Group Level Cache Settings based on RAM Knowledge base article.

Summary

This enhancement to MarkLogic Server allows for easy periodic monitoring of memory consumption over time, and records it in a summary fashion in the same place as other data pertaining to the operation of a running node in a cluster. Since all these figures have at their source raw Meters data, more in-depth investigation should start with the Meters history. However, having this information available at a glance can aid in identifying whether memory-related resources need to be explored when investigating performance, scale, or other like issues during testing or operation.

Additional Reading

Knowledgebase: RAMblings - Opinions on Scaling Memory in MarkLogic Server

Metering database disk space requirements

Introduction

The MarkLogic Monitoring History feature allows you to capture and view critical performance data from your cluster. By default, this performance data is stored in the Meters database. This article explains how you can plan for the additional disk space required for the Meters database.

Meters Database Disk Usage

Just like any other database, the Meters database is made up of forests, which in turn are made up of stands that reside physically on disk. Because Monitoring History uses the Meters database to store critical performance data for your cluster, the amount of information can grow significantly as the number of hosts, forests, databases, and so on increases. This is why you need to plan and manage the disk space required by the Meters database.

Recommendation

The Meters database stores critical performance data for your cluster. The size of the data is proportional to the number of hosts, app servers, forests, databases, and so on. Typically, the raw retention settings have the largest impact on size.

MarkLogic's recommendation for a new install is to start with the default settings and monitor usage over the first two weeks of an install. The performance history charts, constrained to just show the Meters database, will show an increasing storage utilization over the first week, then leveling off for the second week. This would give you a decent idea of space utilization going forward.

You can then adjust the number of days of raw measurements that are retained.

You can also add additional forests to spread the Meters database over more hosts if needed.

Mimetype Definitions Not Upgraded in MarkLogic Server v8.0-1

Summary

New and updated mimetypes were added for MarkLogic 8. If your MarkLogic Server instance has customized mimetypes, the upgrade to MarkLogic Server v8.0-1 will not update the mimetypes table.

Details

MarkLogic 8 includes the following new mimetype values:

Name | Extension | Format

application/json | json | json

application/rdf+json | rj | json

application/sparql-results+json | srj | json

application/xml | xml xsd xvs sch | xml

text/json | (none) | json

text/xml | (none) | xml

application/vnd.marklogic-javascript | sjs | text

application/vnd.marklogic-ruleset | rules | text

If you upgraded to 8.0 from a previous version of MarkLogic Server and if you have ever customized your mimetypes (for example, using the MIME Types Configuration page of the Admin Interface), the upgrade will not automatically add the new mimetypes to your configuration. If you have not added any mimetypes, then the new mimetypes will be automatically added during the upgrade. You can check if you have these mimetypes configured by going to the Mimetype page of the Admin Interface and checking if the above mimetypes exist. If they exist, then there is nothing you need to do.

Effect

Not having these mimetypes may lead to application-level failures; for example, running JavaScript code via Query Console will fail.

Resolving Manually

If you do not have the above mimetypes after upgrading to 8.0, you can manually add the mimetypes to your configuration using the Admin Interface. To manually add the configuration, perform the following steps:

Open the Admin Interface in a browser (for example, open http://localhost:8001).

Navigate to the Mimetypes page, near the bottom of the tree menu.

Click the Create tab.

Enter the name, the extension, and the format for the mimetype (see the table above).

Click OK.

Repeat the preceding steps for each mimetype in the above table.

Please be aware that updating the mimetype table results in a MarkLogic Server restart. You will want to execute this procedure when MarkLogic Server is idle or during a maintenance window.

Resolve by Script

Alternatively, if you do not have the above mimetypes after upgrading to 8.0, you can add the mimetypes to your configuration by executing the following script in Query Console:

xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin" at "/MarkLogic/admin.xqy";
declare namespace mt = "http://marklogic.com/xdmp/mimetypes";

let $config := admin:get-configuration()
let $all-mimetypes := admin:mimetypes-get($config) (: existing mimetypes defined :)
let $new-mimetypes := (
  admin:mimetype("application/json", "json", "json"),
  admin:mimetype("application/rdf+json", "rj", "json"),
  admin:mimetype("application/sparql-results+json", "srj", "json"),
  admin:mimetype("application/xml", "xml xsd xvs sch", "xml"),
  admin:mimetype("text/json", "", "json"),
  admin:mimetype("text/xml", "", "xml"),
  admin:mimetype("application/vnd.marklogic-javascript", "sjs", "text"),
  admin:mimetype("application/vnd.marklogic-ruleset", "rules", "text")
)
(: remove intersection to avoid conflicts :)
let $delete-mimetypes :=
  for $mimetype in $all-mimetypes
  return if ($mimetype//mt:name/data() = $new-mimetypes//mt:name/data()) then $mimetype else ()
let $config := admin:mimetypes-delete($config, $delete-mimetypes)
(: save new mimetype definitions :)
return admin:save-configuration(admin:mimetypes-add($config, $new-mimetypes))
(: executing this query will result in a restart of MarkLogic Server :)

Please be aware that updating the mimetype table results in a MarkLogic Server restart. You will want to execute this script when MarkLogic Server is idle or during a maintenance window.

Fixes

At the time of this writing, it is expected that the upgrade scripts will be improved in a maintenance release of MarkLogic Server so that these updates occur automatically.

Monitoring cache status with xdmp:cache-status

Introduction

In this article, we discuss use of xdmp:cache-status in monitoring cache status, and explain the values returned.

Details

Note that this is a relatively expensive operation, so it’s not something to run every minute, but it may be valuable to run it occasionally for information on current cache usage.

Output format

The values returned by xdmp:cache-status are per host, defaulting to the current host. It takes an optional host-id to allow you to gather values from a specific host in the cluster.

The output of xdmp:cache-status will look something like this:

<cache-status xmlns="http://marklogic.com/xdmp/status/cache">
  <host-id>18349804367231394552</host-id>
  <host-name>macpro-2113.local</host-name>
  <compressed-tree-cache-partitions>
    <compressed-tree-cache-partition>
      <partition-size>512</partition-size>
      <partition-table>0.2</partition-table>
      <partition-used>0.8</partition-used>
      <partition-free>99.2</partition-free>
      <partition-overhead>0</partition-overhead>
    </compressed-tree-cache-partition>
  </compressed-tree-cache-partitions>
  <expanded-tree-cache-partitions>
    <expanded-tree-cache-partition>
      <partition-size>1024</partition-size>
      <partition-table>0.7</partition-table>
      <partition-busy>0</partition-busy>
      <partition-used>30.4</partition-used>
      <partition-free>69.6</partition-free>
      <partition-overhead>0</partition-overhead>
    </expanded-tree-cache-partition>
  </expanded-tree-cache-partitions>
  <list-cache-partitions>
    <list-cache-partition>
      <partition-size>1024</partition-size>
      <partition-table>0.2</partition-table>
      <partition-busy>0</partition-busy>
      <partition-used>0</partition-used>
      <partition-free>100</partition-free>
      <partition-overhead>0</partition-overhead>
    </list-cache-partition>
  </list-cache-partitions>
  <triple-cache-partitions>
    <triple-cache-partition>
      <partition-size>1024</partition-size>
      <partition-busy>0</partition-busy>
      <partition-used>0</partition-used>
      <partition-free>100</partition-free>
    </triple-cache-partition>
  </triple-cache-partitions>
  <triple-value-cache-partitions>
    <triple-value-cache-partition>
      <partition-size>512</partition-size>
      <partition-busy>0</partition-busy>
      <partition-used>0</partition-used>
      <partition-free>100</partition-free>
    </triple-value-cache-partition>
  </triple-value-cache-partitions>
</cache-status>

Values

cache-status contains information for each partition of the caches:

The list cache holds search term lists in memory and helps optimize XPath expressions and text searches.

The compressed tree cache holds compressed XML tree data in memory. The data is cached in memory in the same compressed format that is stored on disk.

The expanded tree cache holds the uncompressed XML data in memory (in its expanded format).

The triple cache holds triple data.

The triple value cache holds triple values.

The following are descriptions of the values returned:

partition-size: The size of a cache partition, in MB.

partition-table: The percentage of the table for a cache partition that is currently used. The table is a data structure that has a fixed overhead per cache entry, used for cache administration. This fixes the number of entries that can be resident in the cache. If the partition table is full, an entry must be removed before another entry can be added to the cache.

partition-busy: The percentage of the space in a cache partition that is currently used and cannot be freed.

partition-used: The percentage of the space in a cache partition that is currently used.

partition-free: The percentage of the space in a cache partition that is currently free.

partition-overhead: The percentage of the space in a cache partition that is currently overhead.

When do I get errors?

You will get a cache-full error when nothing can be removed from the cache to make room for a new entry.

The "partition-busy" value is the most useful indicator of getting a cache-full error. It tells you what percent of the cache partition is locked down and cannot be freed to make room for a new entry.
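As a rough sketch, the Python below parses saved xdmp:cache-status output and lists the cache partitions whose partition-busy value meets a chosen threshold. The 50% threshold and the function name are assumptions for illustration; the element names and namespace are taken from the sample output above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the xdmp:cache-status output shown in this article.
NS = "{http://marklogic.com/xdmp/status/cache}"


def busy_partitions(cache_status_xml, threshold=50.0):
    """Return [(cache_name, busy_percent)] for partitions at or above threshold.

    Partitions that do not report partition-busy (for example, the compressed
    tree cache in the sample output) are skipped.
    """
    root = ET.fromstring(cache_status_xml)
    flagged = []
    for element in root.iter():
        name = element.tag.replace(NS, "")
        if not name.endswith("-partition"):
            continue
        busy = element.findtext(NS + "partition-busy")
        if busy is None:
            continue
        if float(busy) >= threshold:
            flagged.append((name[: -len("-partition")], float(busy)))
    return flagged


# Hypothetical values, chosen only to exercise the function.
SAMPLE = """
<cache-status xmlns="http://marklogic.com/xdmp/status/cache">
  <expanded-tree-cache-partitions>
    <expanded-tree-cache-partition>
      <partition-size>1024</partition-size>
      <partition-busy>75</partition-busy>
    </expanded-tree-cache-partition>
  </expanded-tree-cache-partitions>
  <list-cache-partitions>
    <list-cache-partition>
      <partition-size>1024</partition-size>
      <partition-busy>0</partition-busy>
    </list-cache-partition>
  </list-cache-partitions>
</cache-status>
"""

if __name__ == "__main__":
    print(busy_partitions(SAMPLE))
```

Because xdmp:cache-status is relatively expensive, a check like this is better suited to occasional sampling than to continuous polling.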

Moving Forests Across Storage Devices

Update:

Since this article was originally written, MarkLogic has added the Forest Rebalancing and Forest Retiring features to more recent versions of MarkLogic Server. For zero-downtime movement of forests, please refer to our documentation for these features - http://docs.marklogic.com/guide/admin/database-rebalancing.

The legacy Article follows:

Summary

There are many reasons why you may need to move a forest from one storage device to another. For example:

Transition from shared storage to dedicated storage (or vice versa);

Replace a small storage device with a larger one;

Reorganize forest placement.

No matter what the reason, the action of moving forests should be well planned and deliberate, while the procedure should be well tested. This article lists both the steps that should be followed as well as issues to be considered when planning a move.

We will present two different techniques for moving a forest. The first is appropriate for databases that can be restricted from updates for the duration of the forest move. The second is appropriate for production databases where database downtime needs to be minimized.

Simple Procedure to Move a Forest

The simple procedure to move a forest can be used on any forest whose database can be restricted from updates for the duration of the process. This will typically be for test, development and staging systems, but may also include production environments that can be disabled for extended maintenance windows.

To retain data integrity, this procedure requires that the associated database is restricted from updates. The update restriction can be enforced in a variety of ways:

By setting all forests in the database to “read-only”;

By disabling all application servers that reference the database. You will also need to verify that there are no tasks in the task queue that can update the database.

By restricting access at the application level.

By restricting access procedurally – this is a common approach in test, development and staging environments.

The following steps can be used to move a forest:

Step 1: Begin enforcement of update restriction;

Step 2: Create a backup of the forest you would like to move;

Step 3: Create a new forest, specifying the new location for the forest;

Step 4: Restore the forest data from step 2 to the newly created forest;

Step 5: Verify that the forest data is restored successfully;

Step 6: Switch forests attached to the database;

a. Detach the original forest from the database;

b. Attach the new forest to the database;

WARNING: When moving a forest in the Security database, this step must occur in a single transaction (i.e. detaching original security forest and attaching a new security forest in a single transaction). The MarkLogic Server must have an operational Security database to function properly.

Step 7: Remove update restriction (from step 1);

Step 8: (Optional) Remove/delete the original forest.

Moving a Forest Minimizing Downtime

If the forest to be moved resides on a production system whose content databases are continually being updated, and if you cannot afford the database to be restricted from updates for the duration of a backup and a restore, then you can use the local disk failover feature to synchronize your forests before switching them. This approach will minimize the required downtime of the database.

The following steps can be used to move a forest while minimizing downtime:

Step 1: Create a new forest, specifying the new location for the forest.

Step 2: (Optional) Seed the new forest from backup. Although we will be using the local disk failover feature to synchronize forest content, seeding the new forest from a recent backup will result in faster synchronization and will use fewer resources (i.e. it is less disruptive to the production system).

Step 3: If you do not have a recent forest backup of the forest you would like to move, create one.

Step 4: Perform a forest level restore to the newly created forest.

Step 5: Configure the new forest as a forest replica of the original forest.

Step 6: Wait until the forest is in the “sync replicating” state. You can use the Admin UI Forest status page to check for sync replicating.

Step 7: Switch forests: This step requires that the database is OFFLINE for a short period of time.

a. Detach the original forest from the database;

b. Remove the forest replica configuration created in step 5;

c. Attach the new forest to the database;

WARNING: When moving a forest in the Security database, this step must occur in a single transaction (i.e. detaching original Security forest and attaching a new Security forest in a single transaction). The MarkLogic Server must have an operational Security database to function properly.

Step 8: (Optional) Remove/delete the original forest.

Retaining Forest Name

Both forest move procedures presented require the new forest to have a different name than the original because forest names must be unique within a MarkLogic Server cluster and both procedures have the original and new forests existing in the system at the same time. Although rare, some applications have forest name dependencies (i.e. applications that perform in-forest query evaluations or in-forest placement of document inserts). If this is the case, you will either need to update your application, or change the method used to move the forest (since MarkLogic Server does not provide a mechanism to change the name of a forest).

You can modify the “Simple Forest Move” procedure by performing the forest delete after (step 2) ‘creating a successful forest backup’, and before (step 3) ‘creating a new forest’. This way, in step 3, you can create the new forest with the same name as the forest that was deleted.

To retain the forest name while minimizing database downtime, you can perform the “Moving a Forest Minimizing Downtime” procedure twice – the first time to a temporary forest and the second time to the final destination.

Forest Replicas and Failover Hosts

If the original forest has ‘forest replicas’ or ‘failover hosts’ configured, you will need to detach these configurations before you can delete the original forest.

If you would like the new forest to be configured with ‘forest replicas’ or ‘failover hosts’, you must first detach these configurations from the original forest before reattaching them to the new forest.

Estimate Time

The majority of the time will be spent transferring content from the original forest to the new forest. You can estimate the amount of time this will take from:

The size of the forest on disk (forest-size in MB);

The I/O read rate available for the device where the original forest resides (read-rate in MB/second); and

The I/O write rate available for the device where the new forest resides (write-rate in MB/second).

Estimate time = (Forest-size / read-rate) + (Forest-size / write-rate)
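The formula above can be expressed as a small Python helper. This is only a sketch: the function name is made up for this example, and the 200 GB forest with 100 MB/second reads and 80 MB/second writes is a hypothetical input, not a measured figure.

```python
def estimate_move_seconds(forest_size_mb, read_rate_mb_s, write_rate_mb_s):
    """Estimate time = (forest-size / read-rate) + (forest-size / write-rate)."""
    if read_rate_mb_s <= 0 or write_rate_mb_s <= 0:
        raise ValueError("I/O rates must be positive")
    return forest_size_mb / read_rate_mb_s + forest_size_mb / write_rate_mb_s


if __name__ == "__main__":
    # Hypothetical example: 200 GB forest (204800 MB), 100 MB/s read, 80 MB/s write.
    seconds = estimate_move_seconds(204800, 100, 80)
    print(f"{seconds:.0f} seconds, about {seconds / 3600:.2f} hours")
```

With these example inputs the estimate is 2048 + 2560 = 4608 seconds, or roughly 1.28 hours. Treat the result as a lower bound, since it assumes both devices sustain their rated throughput for the whole transfer.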

Sizing Rules and Recommendations

When determining the resources allocated to forest data, it is recommended that you stay within the following guidelines:

[MarkLogic Recommendation] “The I/O subsystem should have capacity for sustained I/O at 20-MB/sec per content forest in each direction (i.e., 20-MB/sec reads and 20-MB/sec writes at the same time).”

[MarkLogic Recommendation] “The size of all forest data on a server should not exceed 1/3 of the available disk space. The other 2/3rds should be available for forest merges and reindexing, otherwise you will risk merge or reindex failures.”

(The 3x disk space requirement was always true for MarkLogic 6 and earlier releases. However, beginning in MarkLogic 7, the 3x disk space requirement can be reduced if configured and managed.)

[MarkLogic Rule of thumb] “Provision at least 2 CPU cores per active forest. This facilitates concurrent operations.”

[MarkLogic Rule of thumb] “Forests should not grow beyond 200GB or 64-million fragments. These thresholds do not guarantee a particular level of performance and may need to be lowered depending on the application.”

Additional Related Knowledgebase articles

Knowledgebase Article: Understand the Logs during rebalancer and reindex activity

Knowledgebase Article: Data Balancing in MarkLogic

Knowledgebase Article: Rebalancing, replication and forest reordering

Knowledgebase Article: Diagnosing Rebalancer issues after adding or removing a forest

OpenSSL Security Advisory - Cross-protocol attack on TLS using SS...

Summary

On March 1, 2016, a vulnerability in OpenSSL named DROWN ("Decrypting RSA with Obsolete and Weakened eNcryption"), a man-in-the-middle attack, was announced. MarkLogic Server versions 5.0 and later are not affected by this vulnerability.

Advisory

The Advisory reported by OpenSSL.org states

CVE-2016-0800 (OpenSSL advisory) [High severity] 1st March 2016:

A cross-protocol attack was discovered that could lead to decryption of TLS sessions by using a server supporting SSLv2 and EXPORT cipher suites as a Bleichenbacher RSA padding oracle. Note that traffic between clients and non-vulnerable servers can be decrypted provided another server supporting SSLv2 and EXPORT ciphers (even with a different protocol such as SMTP, IMAP or POP) shares the RSA keys of the non-vulnerable server. This vulnerability is known as DROWN (CVE-2016-0800). Recovering one session key requires the attacker to perform approximately 2^50 computation, as well as thousands of connections to the affected server. A more efficient variant of the DROWN attack exists against unpatched OpenSSL servers using versions that predate 1.0.2a, 1.0.1m, 1.0.0r and 0.9.8zf released on 19/Mar/2015 (see CVE-2016-0703 below). Users can avoid this issue by disabling the SSLv2 protocol in all their SSL/TLS servers, if they've not done so already. Disabling all SSLv2 ciphers is also sufficient, provided the patches for CVE-2015-3197 (fixed in OpenSSL 1.0.1r and 1.0.2f) have been deployed. Servers that have not disabled the SSLv2 protocol, and are not patched for CVE-2015-3197 are vulnerable to DROWN even if all SSLv2 ciphers are nominally disabled, because malicious clients can force the use of SSLv2 with EXPORT ciphers. OpenSSL 1.0.2g and 1.0.1s deploy the following mitigation against DROWN: SSLv2 is now by default disabled at build-time. Builds that are not configured with "enable-ssl2" will not support SSLv2. Even if "enable-ssl2" is used, users who want to negotiate SSLv2 via the version-flexible SSLv23_method() will need to explicitly call either of: SSL_CTX_clear_options(ctx, SSL_OP_NO_SSLv2); or SSL_clear_options(ssl, SSL_OP_NO_SSLv2); as appropriate. 
Even if either of those is used, or the application explicitly uses the version-specific SSLv2_method() or its client or server variants, SSLv2 ciphers vulnerable to exhaustive search key recovery have been removed. Specifically, the SSLv2 40-bit EXPORT ciphers, and SSLv2 56-bit DES are no longer available. In addition, weak ciphers in SSLv3 and up are now disabled in default builds of OpenSSL. Builds that are not configured with "enable-weak-ssl-ciphers" will not provide any "EXPORT" or "LOW" strength ciphers. Reported by Nimrod Aviram and Sebastian Schinzel.

Fixed in OpenSSL 1.0.1s (Affected 1.0.1r, 1.0.1q, 1.0.1p, 1.0.1o, 1.0.1n, 1.0.1m, 1.0.1l, 1.0.1k, 1.0.1j, 1.0.1i, 1.0.1h, 1.0.1g, 1.0.1f, 1.0.1e, 1.0.1d, 1.0.1c, 1.0.1b, 1.0.1a, 1.0.1)

Fixed in OpenSSL 1.0.2g (Affected 1.0.2f, 1.0.2e, 1.0.2d, 1.0.2c, 1.0.2b, 1.0.2a, 1.0.2)

MarkLogic Server Details

MarkLogic Server disallows SSLv2 and weak ciphers in all supported versions. As a result, MarkLogic Server is not affected by this vulnerability.

Whenever MarkLogic releases a new version of MarkLogic Server, OpenSSL versions are reviewed and updated.

Partial or Incomplete Upgrade

Introduction

There have been incidents where upgrades have yielded messages like this in the ErrorLog:

2014-08-26 12:20:16.353 Notice: Admin: Beginning upgrading configuration
2014-08-26 12:20:16.533 Warning: Metering database is not configured - Temporarily disabling usage metering
2014-08-26 12:20:16.585 Notice: Admin: Checking prerequisites for Application Services upgrade.
2014-08-26 12:20:16.650 Notice: Admin: Configuring Manage appserver on port 8002
2014-08-26 12:20:16.675 Notice: Admin: Creating Manage appserver
2014-08-26 12:20:16.680 Error: Admin: ADMIN-NOSUCHDATABASE - unable to create Manage server on port 8002, contact support for assistance.
2014-08-26 12:20:16.680 Notice: Admin: Created Manage appserver
2014-08-26 12:20:16.685 Notice: Admin: Configuring App Services appserver on port 8000
2014-08-26 12:20:16.715 Notice: Admin: Updating App Services appserver
2014-08-26 12:20:17.299 Error: Admin: ADMIN-NOSUCHDATABASE - unable to update App-Services server on port 8000, contact support for assistance.
2014-08-26 12:20:17.299 Notice: Admin: Updated App Services appserver
2014-08-26 12:20:17.299 Notice: Admin: Completed Application Services upgrades
2014-08-26 12:20:17.441 Warning: Metering database is not configured - Temporarily disabling monitoring
2014-08-26 12:20:17.469 Info: Mounted forest Extensions locally on /var/opt/MarkLogic/Forests/Extensions read write
2014-08-26 12:20:18.091 Info: Unmounted forest Extensions
2014-08-26 12:20:18.091 Notice: Admin: 6.0 Upgrade completed
2014-08-26 12:20:18.445 Info: Mounted forest Extensions locally on /var/opt/MarkLogic/Forests/Extensions read write
2014-08-26 12:20:19.496 Info: Mounted forest Meters locally on /var/opt/MarkLogic/Forests/Meters read write
2014-08-26 12:20:20.002 Notice: Admin: Creating Healthcheck appserver
2014-08-26 12:20:20.007 Error: Admin: XDMP-CAST - unable to upgrade for 7.0, please contact support for assistance.

In scenarios like this, we have seen an upgrade from earlier versions run to completion but fail to create the necessary application servers and databases:

App Servers
8000, 8002, 7997
Databases
Fab, App-Services

Checklist: how to identify missing components

A successful upgrade from MarkLogic 4 to MarkLogic 7 should yield the following additional App Servers and Databases - if any of these are missing, you'll need to run some of the attached scripts. If you have any concerns, please raise a case with the support team to further discuss any issues you encounter during an upgrade:

- App Servers:

Default :: App-Services : 8000 [HTTP]
Default :: Manage : 8002 [HTTP]
Default :: HealthCheck : 7997 [HTTP]

- Databases:

App-Services
Fab
Meters
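The checklist above can be turned into a simple comparison. The Python sketch below assumes you have already collected the App Server names and ports and the database names from your cluster (for example, by reading them off the Admin Interface); the function name and input shapes are illustrative assumptions.

```python
# Components expected after a successful upgrade, as listed in the checklist above.
EXPECTED_APP_SERVERS = {"App-Services": 8000, "Manage": 8002, "HealthCheck": 7997}
EXPECTED_DATABASES = {"App-Services", "Fab", "Meters"}


def missing_components(app_servers, databases):
    """Return (missing_app_servers, missing_databases), each sorted by name.

    app_servers: dict mapping App Server name to port, as observed on the cluster.
    databases:   iterable of database names observed on the cluster.
    """
    missing_servers = sorted(
        name
        for name, port in EXPECTED_APP_SERVERS.items()
        if app_servers.get(name) != port
    )
    missing_databases = sorted(EXPECTED_DATABASES - set(databases))
    return missing_servers, missing_databases


if __name__ == "__main__":
    # Hypothetical cluster state resembling a partial upgrade.
    observed_servers = {"Admin": 8001, "Docs": 8000}
    observed_databases = ["Security", "Documents", "Meters"]
    print(missing_components(observed_servers, observed_databases))
```

Anything reported as missing indicates that the corresponding creation script may need to be run.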

Observations

This issue has been observed when installing MarkLogic 4.0-9.1, updating to 4.1-11.1 and subsequently upgrading to 7.0-3 but there may be other paths that also produce similar results. An upgrade from 4.1-11.1 to 7.0-3 will complete as expected.

Running scripts to create the missing components:

The attached zip file (upgrade-scripts.zip) contains 5 separate XQuery Modules which can be executed against the server.

The example below assumes that you may not have access to the Query Console application on port 8000 (http://localhost:8000/qconsole/), so this example demonstrates the use of the "CQ" tool so the modules can be run against the server. You can also run these as ad-hoc queries against an XDBC server as long as you ensure that the two scripts marked with an asterisk (create-users.xqy and create-7997-appserver-and-user.xqy) are executed against an XDBC URI that targets the security database (/Security).

If you are unable to access Query Console, you can temporarily install CQ to allow you to easily run the scripts:

- http://developer.marklogic.com/download/code/cq/releases/mark-logic-cq-4.1-1.zip
- Unpack to somewhere (e.g. /tmp/cq)
- Create an HTTP application server on an unused port (e.g. 8003) and set Modules to (filesystem) and the path to /tmp/cq
- Open CQ by browsing to that application server's port on any node in your cluster (for example, http://localhost:8003/)

If you are unfamiliar with CQ and would like to discuss the configuration of this tool and the execution of these scripts, please contact the support team first.

Identifying the issue:

If you have upgraded to a currently supported version of MarkLogic from an older version and you see only the following App Servers after upgrading:
- Admin (8001)
- Docs (8000)

then you will need to take corrective action.

Resolving the issue:

Start by following these steps to prepare the system:
- Install CQ
- Delete the "Docs" App Server on 8000
- You can copy and paste the contents of each of the modules into a cq buffer and use one of the buttons ("TEXT" / "XML" / "HTML") to evaluate the module code.

Execute the following modules in the order specified below:

(*) - ensure you evaluate these modules against the Security database; you can do this by selecting "Security (Admin)" from the "content-source" dropdown selector. If you have any concerns about this, please contact the support team first to discuss the process:

- create-users.xqy*
- create-ml-42-databases.xqy
- create-8000-appserver.xqy (if you have a "Docs" App Server, please remove it before executing this module)

The above are all included in MarkLogic 4.2. If you have upgraded to MarkLogic 5, you will need to run:

- create-8002-appserver.xqy

If you've upgraded to MarkLogic 7, you will need to run:

- create-7997-appserver-and-user.xqy*

Post upgrade checks:

Test applications on port 7997:

http://localhost:7997/

Test applications on port 8000:

http://localhost:8000/
http://localhost:8000/appservices/
http://localhost:8000/qconsole/

Test applications on port 8002:

http://localhost:8002/nav/
http://localhost:8002/dashboard/
http://localhost:8002/history/ (note that this will require the enabling of performance metrics)

To enable MarkLogic 7 metering - go to 8001 and configure as a "Group level" setting (Configure -> Groups -> Default)
- Set "metering" to enabled
- Set "performance metering" to enabled

Performance implications of ad hoc queries versus parameterized m...

Summary

This article briefly looks at the performance implications of ad hoc queries versus passing external variables to a query in a module.

Details

Programmatically, you can achieve similar results by dynamically generating ad hoc queries on the client as you can by defining your queries in modules and passing in external variable values as necessary.

Dynamically generating ad hoc queries on the client side results in each of your queries being compiled and linked with library modules before they can be evaluated - for every query you submit. In contrast, queries in modules only experience that performance overhead the first time they're invoked.

While it's possible to submit queries to MarkLogic Server in any number of ways, in terms of performance, it's far better to define your queries in modules, passing in external variable values as necessary.
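To make the contrast concrete, here is a hedged Python sketch that only builds (and does not send) the two kinds of request using the REST Client API's /v1/eval and /v1/invoke endpoints. The module path /count-matches.xqy, the variable name term, and the query itself are hypothetical; the module is assumed to declare `declare variable $term as xs:string external;`.

```python
import json
from urllib.parse import urlencode


def adhoc_request(term):
    """Build an ad hoc request: the query text itself changes with every value,
    so the server has to compile each submitted query."""
    query = 'fn:count(cts:search(fn:doc(), cts:word-query("%s")))' % term
    return "/v1/eval", urlencode({"xquery": query})


def module_request(term):
    """Build a parameterized request: the module text stays the same and only
    the external variable value changes between calls."""
    form = {
        "module": "/count-matches.xqy",        # hypothetical installed module
        "vars": json.dumps({"term": term}),    # external variable values
    }
    return "/v1/invoke", urlencode(form)


if __name__ == "__main__":
    for term in ("alpha", "beta"):
        print(adhoc_request(term))
        print(module_request(term))
```

Note that the ad hoc form also splices the value directly into the query text, which is a second reason to prefer external variables when the value comes from user input.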

Performance implications of updating Module and Schema databases

This article briefly looks at the performance implications of adding or modifying modules or schemas to live (production) databases.

Details

When XQuery modules or schemas are referenced for the first time after upload, they are parsed and then cached in memory so that subsequent access is faster.

When a module is added or updated, the modules cache is invalidated and every module (for all Modules databases within the cluster) will need to be parsed again before it can be evaluated by MarkLogic Server.

Special consideration should be made when updating modules or schemas in a production environment as reparsing can impact the performance of MarkLogic server for the duration that the cache is being rebuilt.

MarkLogic was designed with the assumption that modules and schemas are rarely updated. As such, the recommendation is that updates to modules or schemas in production environments are made during periods of low activity or out of hours.

Further reading

MarkLogic Documentation: Module Caching Notes

Tips and Hints for Debugging Module Resolution in MarkLogic

Performance Settings Checklist

Summary

This article lists some common system and MarkLogic Server settings that can affect the performance of a MarkLogic cluster.

Details

From MarkLogic System Requirements:

I/O Schedulers

The deadline I/O scheduler is required on Red Hat Linux platforms. The deadline scheduler is optimized to ensure efficient disk I/O for multi-threaded processes, and MarkLogic Server can have many simultaneous threads. For information on the deadline scheduler, see the Red Hat documentation.

Note that on VMware hosted servers, the noop scheduler is recommended.

You can read more about I/O schedulers in the following MarkLogic knowledgebase article:

Notes on IO schedulers.
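As a quick way to see which scheduler each block device is using, the following Python sketch reads the Linux sysfs files under /sys/block. It relies on the kernel convention of marking the active scheduler with square brackets (for example, "noop [deadline] cfq"); the function names are assumptions for this example.

```python
from pathlib import Path


def active_scheduler(scheduler_text):
    """Return the bracketed scheduler name, e.g. 'deadline' from
    'noop [deadline] cfq'. Returns None if no entry is bracketed."""
    for token in scheduler_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def report(block_root="/sys/block"):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in sorted(Path(block_root).glob("*/queue/scheduler")):
        device = path.parts[-3]  # .../<device>/queue/scheduler
        result[device] = active_scheduler(path.read_text())
    return result


if __name__ == "__main__":
    for device, scheduler in report().items():
        print(device, scheduler)
```

On hosts where the output shows a scheduler other than the one recommended above for your platform, consult your operating system documentation for how to change it persistently.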

Huge Pages

At system startup on Linux machines, MarkLogic Server logs a message to the ErrorLog.txt file showing the Huge Page size, and the message indicates if the size is below the recommended level.

If you are using Red Hat 6, you must turn off Transparent Huge Pages (Transparent Huge Pages are configured automatically by the operating system).

You can also read more about huge pages, transparent huge pages, and group cache settings at the following MarkLogic knowledgebase articles:

Linux Huge Pages and Transparent Huge Pages

Group Caches and Linux Huge Pages

MarkLogic Server Configurations

The following items are related to default MarkLogic Server configurations and their relationship to indexes, either index population during ingest or index reads during query time, especially in the context of avoiding thread locking when requests are executed in parallel.

There’s a collection of settings that are enabled by default in the server, whose values we often recommend changing from their defaults when users run into performance issues. Those are:

If not needed, directory creation should be set to manual

If not needed, maintain last modified should be set to false

If not needed, maintain directory last modified should be set to false

If not needed, inherit permissions should be set to false

If not needed, inherit collections should be set to false

If not needed, inherit quality should be set to false

If you’re likely to use URI or collection lexicon functions, both URI lexicon and collection lexicon should be set to true

You can read more about these settings and how they relate to overall multi-thread/multi-request system performance in the following knowledgebase articles:

                - When submitting lots of parallel queries, some subset of those queries take much longer - Why?;

                - Read only queries run at a timestamp - Update transactions use locks;

                - https://help.marklogic.com/Knowledgebase/Article/View/113/0/indexing-best-practices

                - https://help.marklogic.com/Knowledgebase/Article/View/73/0/what-is-a-directory-in-marklogic

                - https://help.marklogic.com/Knowledgebase/Article/View/17/0/understanding-xdmp-deadlock

Pitfalls Running MarkLogic Process as non-root user

Introduction

Some customers choose to run MarkLogic without the watchdog process running as root. As this is increasingly becoming a popular topic, there is an additional Knowledgebase article that discusses this in further detail:

Knowledgebase: Start and Stop MarkLogic Server as Non-Root User

The aim of this Knowledgebase article is to recommend some of the modifications you should consider making for the user that takes over the responsibilities the root process would otherwise have handled.

MarkLogic server's root process makes a number of OS-specific settings to allow the product to run optimally. If you choose to make these modifications, this article aims to provide you with enough information to ensure you can match the settings that the server changes.

Points to consider

We do not recommend changing the root user.

Future upgrades to MarkLogic Server are likely to change what our root process sets up before starting the daemon process.

The Linux kernel Out of Memory (OOM) killer is less likely to attempt to terminate a process running as root, so if you're doing this, you should consider having additional monitoring in place to ensure you can react quickly in the event that your watchdog process is killed.

The root MarkLogic process is simply a restarter process, waiting for the non-root (daemon) process to exit - and if the daemon process exits abnormally, for any reason, the root process will fork and exec another process under the daemon process. The root process runs no XQuery scripts, opens no sockets, and accesses no database files.

We strongly recommend starting MarkLogic as root, and to let it switch to the non-root user on its own.

When the server initializes, if it initializes with the default root process, it performs some privileged kernel calls to configure sockets, memory, and threads. For example:

it allocates huge pages if any are available,

increases the number of file descriptors it can use,

binds any configured low-numbered socket ports, and

requests the capability to run some of its threads at high priority.

MarkLogic Server will function if it isn't started as root, but it may not perform as well.

Problems Seen by Customers running MarkLogic as a non-root user

1. If the non-root user account isn't able to authenticate due to any underlying system issue, MarkLogic can't start up properly. This can result in an endless restart loop of MarkLogic Server.

Getting started

You should check the following settings which are configured by the root process when MarkLogic first starts.

1. maxproc soft limit

The maxproc soft limit is set to 1024 by default. In /etc/init.d/MarkLogic the following line raises the soft limit to match the hard limit for the current process hierarchy:

ulimit -u `ulimit -Hu`
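If you start MarkLogic from your own launcher rather than /etc/init.d/MarkLogic, the same adjustment has to happen in the launching process so that it is inherited by its children. The following is a minimal sketch of the equivalent call, assuming a hypothetical launcher written in Python on Linux; the function name is illustrative only and is not part of MarkLogic.

```python
import resource


def raise_nproc_soft_limit():
    """Raise the soft max-processes limit to the current hard limit.

    Mirrors the shell line: ulimit -u `ulimit -Hu`
    The new limit applies to this process and anything it starts afterwards.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    # Raising the soft limit up to the existing hard limit needs no privileges.
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NPROC)
```

A launcher would call raise_nproc_soft_limit() once, before starting the MarkLogic process.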

2. Ensure Huge Pages are assigned correctly

If you see something like this in /var/log/messages:

MarkLogic: Linux Huge Pages: shmget(1): Operation not permitted

check /etc/sysctl.conf, where you should see (or add) a line:

vm.hugetlb_shm_group = {gid}

Here the {gid} is the group id of the user that runs MarkLogic. Again, it would make sense to ensure that both users (whatever you're using in place of root and daemon) are able to do this.

3. Server HugePages calculations

lower value range

A calculated total of Group Level caches (List Cache + Compressed Cache + Expanded Cache)

upper value range

Take the total from the lower value range and then for each database, add the following:

in memory list size

in memory tree size

in memory range index size * number of defined range indexes

in memory reverse index size (if reverse query is enabled)

in memory triple index size (if triple positions are enabled)

Multiply these by the number of assigned local forests [exclude AppServices, Fab, Extensions, Modules, Schemas, Security, Triggers, Last-Login, Meters] + small buffer
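The lower and upper value ranges described above can be sketched as a small calculation. This is only one reading of the steps, under stated assumptions: all sizes are in MB, the "small buffer" is added once at the end, and every class, field, and function name below is hypothetical (none of them is a MarkLogic API). The numbers you pass in should come from your own group and database configuration.

```python
from dataclasses import dataclass


@dataclass
class DatabaseMemory:
    """In-memory settings for one database, in MB (hypothetical field names)."""
    in_memory_list_mb: int
    in_memory_tree_mb: int
    in_memory_range_index_mb: int
    range_index_count: int
    in_memory_reverse_index_mb: int = 0  # count only if reverse query is enabled
    in_memory_triple_index_mb: int = 0   # count only if triple positions are enabled
    local_forests: int = 1               # assigned local forests on this host


def lower_bound_mb(list_cache_mb, compressed_cache_mb, expanded_cache_mb):
    """Lower value range: the total of the group-level caches."""
    return list_cache_mb + compressed_cache_mb + expanded_cache_mb


def upper_bound_mb(lower_mb, databases, buffer_mb=0):
    """Upper value range: lower bound plus per-forest in-memory sizes."""
    total = lower_mb
    for db in databases:
        per_forest = (
            db.in_memory_list_mb
            + db.in_memory_tree_mb
            + db.in_memory_range_index_mb * db.range_index_count
            + db.in_memory_reverse_index_mb
            + db.in_memory_triple_index_mb
        )
        total += per_forest * db.local_forests
    return total + buffer_mb
```

For example, with illustrative (not default) values of 4096 + 2048 + 4096 MB for the group caches, the lower bound is 10240 MB; adding one database with three local forests at 660 MB each and a 256 MB buffer gives an upper bound of 12476 MB.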

4. Additional kernel parameters to be defined in /etc/sysctl.conf

shmall

shmmax

shmmni

The above values influence shared memory handling and these values are set automatically if MarkLogic runs with the default root/daemon settings.

On Red Hat (RHEL) these values are pre-defined but not on SuSE. We recommend updating these values in sysctl.conf anyway.

First step: get the current PAGE_SIZE with the following command:

getconf PAGE_SIZE

With the PAGE_SIZE you can calculate kernel.shmall as per the instructions below:

kernel.shmall = (HugePages_Total * hugepagesize) / PAGE_SIZE

And you can set kernel.shmmax and kernel.shmmni accordingly:

kernel.shmmax = 17179869184 // 16GB MarkLogic default settings
kernel.shmmni = 32768 // default is 4096 not enough on big RAM systems
[this will change the page size returned above but doesn't change the calculation above]
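As a worked example of the kernel.shmall formula above, the following sketch computes the value from the number of huge pages, the huge page size, and the page size. It assumes both sizes are expressed in bytes (note that /proc/meminfo reports Hugepagesize in kB, so convert first). The function names are illustrative only.

```python
import os


def current_page_size():
    """Return the kernel page size in bytes (same value as `getconf PAGE_SIZE`)."""
    return os.sysconf("SC_PAGE_SIZE")


def kernel_shmall(hugepages_total, hugepage_size_bytes, page_size_bytes):
    """kernel.shmall = (HugePages_Total * hugepagesize) / PAGE_SIZE."""
    return (hugepages_total * hugepage_size_bytes) // page_size_bytes
```

With 8192 huge pages of 2 MiB each and a 4096-byte page size, kernel_shmall(8192, 2 * 1024 * 1024, 4096) evaluates to 4194304.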

5. Configure vm.hugetlb_shm_group

In case MarkLogic runs under a different user ID, some more parameters need to be added to /etc/sysctl.conf:

vm.hugetlb_shm_group=gid of hugetlb group [group of user id]

6. Configuring limits

You can also set memory limits in /etc/security/limits.conf

username soft memlock (1024*1024*Huge Pages in MB)
username hard memlock (1024*1024*Huge Pages in MB)

7. Configure / increase the vm.max_map_count

The vm.max_map_count allows for the restriction of the number of individual VMAs (Virtual Memory Areas) that a particular process can use. A Virtual Memory Area is a contiguous area of virtual address space.

The number of VMAs a process is allowed to create is specified by the OS. By default, there are usually around 65530 memory map entries allowed per process.

From the kernel documentation for max_map_count:

This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries. While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation. The default value is 65536.

See: https://kernel.org/doc/Documentation/sysctl/vm.txt

Our recommendation is that this value can be safely doubled or even quadrupled where modern hardware is taken into consideration:

sysctl vm.max_map_count=262120

This step is more important for hosts that have a larger amount of RAM. If you are setting up hosts with 256GB RAM or greater, this change is really worth considering.

8. Configure SOMAXCONN

The Linux SOMAXCONN parameter defines the maximum backlog value that the MarkLogic process is allowed to pass to a socket listen call. Different Linux platforms (RHEL/CentOS), or even different versions of Linux, may have different default SOMAXCONN values.

The MarkLogic default backlog value for application servers is 512. However, if the Linux platform has a lower SOMAXCONN value than the backlog MarkLogic requests, the requested backlog will not be respected by Linux. When started as the root user, MarkLogic goes through each application server, finds the highest backlog value, and sets the Linux SOMAXCONN value to match it.

You can manually set SOMAXCONN to the highest backlog value of any application server as follows:

sysctl -w net.core.somaxconn=512

9. Socket buffer rmem_max and wmem_max

These Linux parameters define the maximum send buffer size (wmem) and receive buffer size (rmem) for TCP sockets. In other words, they set the amount of memory that is allocated for each TCP socket when it is opened or created while transferring files. For more efficient parallel job performance, MarkLogic, when started as root, sets the buffer values during startup based on the platform's RAM size, as follows:

RAM > 32GB: Read/Write buffer size 2048 KB (262144 bytes)

RAM > 8GB: Read/Write buffer size 1024 KB

RAM > 4GB: Read/Write buffer size 512 KB

RAM <= 4GB: Read/Write buffer size 128 KB

You can manually set the rmem_max and wmem_max values for a platform with RAM > 32GB as follows:

sysctl -w net.core.rmem_max=262144
sysctl -w net.core.wmem_max=262144

10. Linux swappiness and Dirty background ratio

MarkLogic sets Linux swappiness and dirty background ratio parameters during startup. When starting as non-root, Linux swappiness and dirty background ratio should be set as per KB "Linux Swappiness".

Further reading

It is recommended that you also read this Knowledgebase article which covers running MarkLogic as a non-root user:

Knowledgebase - Start and Stop MarkLogic Server as Non-root User

Our documentation also covers running the main MarkLogic process (daemon by default) as a different user:

Documentation: Configuring MarkLogic Server on UNIX Systems to Run as a Non-daemon User

Queries constrained to elements

Introduction

In this Knowledgebase article, we will discuss a technique which will allow you to scope queries in such a way to ensure that they occur only contained within a parent element.

Details

cts:element-query

Consider a containment scenario where you have an XML document structured in this way:

<rootElement>
<id>7635940284725382398</id>
<parentElement>
<childElement1>valuea</childElement1>
<childElement2>false</childElement2>
</parentElement>
<parentElement>
<childElement1>valuea</childElement1>
<childElement2>truthy</childElement2>
</parentElement>
<parentElement>
<childElement1>valueb</childElement1>
<childElement2>true</childElement2>
</parentElement>
<childElement1>valuec</childElement1>
</rootElement>

And you want to find the document where a parentElement has a childElement1 with a value of 'valuec'.

A search like

cts:search (/,
cts:element-value-query(xs:QName('childElement1'), 'valuec', 'exact')
)

will give you the above document, but doesn't consider where the childElement1 value is. This isn't what you want. Search queries perform matching per fragment, so there is no constraint that childElement1 be in any particular spot in the fragment.

Wrapping a cts:element-query around a subquery will constrain the subquery to exist within an instance of the named element. Therefore,

cts:search (/,
  cts:element-query (
      xs:QName ('parentElement'),
      cts:element-value-query(xs:QName('childElement1'), 'valuec', 'exact')
  )
)

will not return the above document since there is no childElement1 with a value of 'valuec' inside a parentElement.

This applies to more-complicated subqueries too. For example, looking for a document that has a childElement1 with a value of 'valuea' AND a childElement2 with a value of 'true' as

cts:search (/,
cts:and-query ((
cts:element-value-query(xs:QName('childElement1'), 'valuea', 'exact'),
cts:element-value-query(xs:QName('childElement2'), 'true', 'exact')
))
)

will return the above document. But you may want these two child element-values both inside the same parentElement. This can be accomplished with

cts:search (/,
  cts:element-query (
      xs:QName ('parentElement'),
      cts:and-query ((
          cts:element-value-query(xs:QName('childElement1'), 'valuea', 'exact'),
          cts:element-value-query(xs:QName('childElement2'), 'true', 'exact')
      ))
  )
)

This should give you the expected results, as it won't return the above document since the two child element-value queries do not match inside the same parentElement instance.
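To make the difference between the two kinds of match concrete, here is a small stand-alone model of the containment rule, run against the sample document above. It is plain Python using the standard library XML parser, not MarkLogic code, and it says nothing about how the server resolves these queries internally; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<rootElement>
<id>7635940284725382398</id>
<parentElement>
<childElement1>valuea</childElement1>
<childElement2>false</childElement2>
</parentElement>
<parentElement>
<childElement1>valuea</childElement1>
<childElement2>truthy</childElement2>
</parentElement>
<parentElement>
<childElement1>valueb</childElement1>
<childElement2>true</childElement2>
</parentElement>
<childElement1>valuec</childElement1>
</rootElement>"""


def matches_anywhere(node, conditions):
    """Every (element name, exact value) pair occurs somewhere under node."""
    return all(
        any(el.text == value for el in node.iter(name))
        for name, value in conditions
    )


def matches_within(root, container, conditions):
    """A single instance of `container` satisfies every pair on its own."""
    return any(
        matches_anywhere(instance, conditions)
        for instance in root.iter(container)
    )


root = ET.fromstring(DOC)
both = [("childElement1", "valuea"), ("childElement2", "true")]
print(matches_anywhere(root, both))                 # True
print(matches_within(root, "parentElement", both))  # False
```

The first call corresponds to the plain and-query, which matches because both values appear somewhere in the fragment. The second corresponds to wrapping the and-query in an element-query on parentElement, which does not match because no single parentElement holds both values.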

Filtering and indexes

Investigating a bit further, if you run the query with xdmp:query-meters you will see (depending on your database settings)

<qm:filter-hits>0</qm:filter-hits>
<qm:filter-misses>1</qm:filter-misses>

What is happening is that the query can only determine from the current indexes that there is a fragment with a parentElement, and a childElement1 with a value of 'valuea', and a childElement2 with a value of 'true'. Then, after retrieving the document and filtering, it finds that the document is not a complete match and so does not return it (thus filter-misses = 1).

(To learn more about filtering, refer to Understanding the Search Process section in our Query Performance and Tuning Guide.)

At scale you may find this filtering slow, or the query may hit Expanded Tree Cache limits if it retrieves many false positives to filter through.

If you have the correct positions enabled, the indexes can resolve this query without retrieving the document and filtering. In this case, after setting both

element-word-positions

and

element-value-positions

to true on the database and reindexing, xdmp:query-meters now shows

<qm:filter-hits>0</qm:filter-hits>
<qm:filter-misses>0</qm:filter-misses>

(To track element-value-queries inside element-queries you need element-word-positions and element-value-positions enabled. The former is for element-query and the latter is for element-value-query.)

Now this query can be run without filtering. However, if you have a lot of relationship instances in a document, the calculations using positions can become quite expensive to compute.

Position details

Further details: Empty-element positions are problematic. Positions are word positions, and the position of an element is the word position of the first word after the element starts to the word position of the first word after the element ends. Positions of attributes are the positions of their element. If everything is an empty element, you have no words and everything has the same position and so positions cannot discriminate between elements.

Reindexing

Note that if you change these settings you will need to reindex your database, and the usual tradeoffs apply (larger indexes and slower indexing). Please see the following for guidance on adding an index and reindexing in general:

Reindexing impact
Adding an index in production

Query Registry limits

Summary

There is a limit to the number of registered queries held in the forest registry. If your application does not account for that fact, you may get unexpected results.

Where is it?

A cts:search operation that uses a cts:registered-query whose query cannot be found in a forest query registry throws an XDMP-UNREGISTERED exception. If a query that had been previously registered cannot be found, it may have been discarded automatically. In the most recent versions of MarkLogic Server at the time of this writing, the forest query registry only contains up to about 48,000 of the most recently used registered queries. If you register more than that, the least recently used ones get discarded.
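The discard behavior described above can be pictured with a toy least-recently-used registry. This is only a model written in Python to illustrate why an application must be prepared for a previously registered query to be missing; it is not MarkLogic's implementation, and the class and function names are hypothetical. The KeyError here stands in for XDMP-UNREGISTERED.

```python
from collections import OrderedDict


class QueryRegistryModel:
    """Toy registry that keeps only the most recently used entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def register(self, query_id, query):
        self._entries[query_id] = query
        self._entries.move_to_end(query_id)
        # Discard least recently used entries once over capacity.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def lookup(self, query_id):
        if query_id not in self._entries:
            raise KeyError(f"unregistered query: {query_id}")
        self._entries.move_to_end(query_id)  # mark as recently used
        return self._entries[query_id]


def lookup_or_reregister(registry, query_id, build_query):
    """Application-side pattern: register again when the entry is gone."""
    try:
        return registry.lookup(query_id)
    except KeyError:
        registry.register(query_id, build_query())
        return registry.lookup(query_id)
```

An application that follows the lookup_or_reregister pattern keeps working when an entry has been discarded, at the cost of registering the query again.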

Recommendation

To avoid registered queries being dropped, it’s a good idea to unregister queries when you know they aren’t needed any more.

RAMblings - Opinions on Scaling Memory in MarkLogic Server

This not-too-technical article covers a number of questions about MarkLogic Server and its use of memory:

How MarkLogic uses memory;

Why you might need more memory;

When you might need more memory;

How you can add more memory.

Let’s say you have an existing MarkLogic environment that’s running acceptably well. You have made sure that it does not abuse the infrastructure on which it’s running. It meets your SLA (maybe expressed as something like “99% of searches return within 2 seconds, with availability of 99.99%”).   Several things about your applications have helped achieve this success:

Your queries run as queries; there are no queries mistakenly running in update mode.

The result sets from queries are of reasonable size, not “boiling the ocean”.

Your queries are largely satisfied from indexes, typically able to run unfiltered.

Updates affect a relatively small number of documents at any one time.

Plus, your current infrastructure and practices are up to the task.

As such, your application’s performance is largely determined by the number of disk accesses required to satisfy any given query. Most of the processing involved is related to our major data structures:

term lists;

range indexes;

lexicons;

triples (for SQL, Optic, SPARQL …)

Fulfilling a query can involve tens, hundreds or even thousands of accesses to these data structures, which reside on disk in files within stand directories.   (The triple world especially tends to exhibit the greatest variability and computational burden.)

Of course, MarkLogic is designed so that the great majority of these accesses do not need to access the on-disk structures. Instead, the server caches termlists, range indexes, triples, etc. which are kept in RAM in the following places:

termlists are cached in the List Cache, which is allocated at startup time (according to values found in config files) and managed by MarkLogic Server. When a termlist is needed, the cache is first consulted to see whether the termlist in question is present.   If so, no disk access is required. Otherwise, the termlist is read from disk involving files in the stand such as ListIndex and ListData.

range indexes are held in memory-mapped areas of RAM and managed by the operating system’s virtual memory management system. MarkLogic allocates the space for the in-memory version of the range index, causes the file to be loaded in (either on-demand or via pre-load option), and thereafter treats it as an in-memory array structure. Any re-reading of previously paged-out data is performed transparently by the OS. Needless to say, this last activity slows down operation of the server and should be kept to a minimum.

One key notion to keep in mind is that the in-memory operations (the “hit” cases above) operate at speeds of about a microsecond or so of computation. The go-to-disk penalty (the “miss” cases) costs at least one disk access, which takes a handful of milliseconds, plus even more computation than a hit case. This represents a difference on the order of 10,000 times slower.

Nonetheless, you are running acceptably. Your business is succeeding and growing. However, there are a number of forces stealthily working against your enterprise continuing in this happy state.

Your database is getting larger (more and perhaps larger documents).

More users are accessing your applications.

Your applications are gaining new or expanded capabilities.

Your software is being updated on a regular basis.

You are thinking about new operational procedures (e.g. encryption).

In the best of all worlds, you have been measuring your system diligently and can sense when your response time is starting to degrade. In the worst of all worlds, you perform some kind of operational / application / server / operating system upgrade and performance falls off a cliff.

Let’s look under the hood and see how pressure is building on your infrastructure. Specifically, let’s look at consumption of memory and effectiveness of the key caching structures in the server.

Recall that the response time of a MarkLogic application is driven predominantly by how many disk operations are needed to complete a query. This, in turn, is driven by how many termlist and range index requests are initiated by the application through MarkLogic Server and how many of those do not “hit” in the List Cache and in-memory Range Indexes. Each one of those “misses” generates disk activity, as well as a significant amount of additional computation.

All the forces listed above contribute to decreasing cache efficiency, in large part because they all use more RAM. A fixed size cache can hold only a fraction of the on-disk structure that it attempts to optimize. If the on-disk size keeps growing (a good thing, right?) then the existing cache will be less effective at satisfying requests. If more users are accessing the system, they will ask in total for a wider range of data. As applications are enriched, new on-disk structures will be needed (additional range indexes, additional index types, etc.) And when did any software upgrade use LESS memory?

There’s a caching concept from the early days of modern computing (the Sixties, before many of you were born) called “folding ratio”. You take the total size of a data structure and divide it by the size of the “cache” that sits in front of it. This yields a dimensionless number that serves as a rough indicator of cache efficiency (and lets you track changes to it). A way to compute this for your environment is to take the total on-disk size of your database and divide it by the total amount of RAM in your cluster. Let’s say each of your nodes has 128GB of RAM and 10 disks of about 1TB each that are about half full. Because the shared-nothing approach of MarkLogic allows us to consider each node individually, the folding ratio of each node of this configuration at this moment is (10 x 1TB x 50%) / 128GB or about 40 to 1.
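The arithmetic in this example is simple enough to script so that you can track it over time. The sketch below assumes sizes in GB and treats 1TB as 1024GB; the function name is illustrative.

```python
def folding_ratio(data_size_gb, cache_size_gb):
    """Size of the on-disk structure divided by the memory in front of it."""
    if cache_size_gb <= 0:
        raise ValueError("cache size must be positive")
    return data_size_gb / cache_size_gb


# One node from the example: 10 disks of about 1TB, half full, 128GB of RAM.
on_disk_gb = 10 * 1024 * 0.5
print(folding_ratio(on_disk_gb, 128))  # 40.0
```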

This number by itself is neither good nor bad. It’s just a way to track changes in load. As the ratio gets larger, cache hit ratio will decrease (or, more to the point, the cache miss ratio will increase) and response time will grow.   Remember, the difference between a hit ratio of 98% versus a hit ratio of 92% (both seem pretty good, you say) is a factor of four in resulting disk accesses! That’s because one is a 2% miss ratio and the other is an 8% miss ratio.
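The effect of a few points of hit ratio can be checked with a short calculation. The sketch below uses the rough figures quoted earlier in this article (about a microsecond for a hit, and a miss on the order of 10,000 times slower) as default parameters; they are order-of-magnitude assumptions, not measurements, and the function names are illustrative.

```python
def relative_disk_accesses(hit_ratio_a, hit_ratio_b):
    """How many times more misses occur at hit_ratio_b than at hit_ratio_a."""
    return (1.0 - hit_ratio_b) / (1.0 - hit_ratio_a)


def average_access_cost_us(hit_ratio, hit_cost_us=1.0, miss_cost_us=10_000.0):
    """Expected cost per access in microseconds for a given hit ratio."""
    return hit_ratio * hit_cost_us + (1.0 - hit_ratio) * miss_cost_us


print(round(relative_disk_accesses(0.98, 0.92), 2))  # 4.0
print(round(average_access_cost_us(0.98), 2))        # 200.98
print(round(average_access_cost_us(0.92), 2))        # 800.92
```

Under these assumptions the average cost per access is dominated almost entirely by the misses, which is why the miss ratio is the number to watch.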

Consider the guidelines that MarkLogic provides regarding provisioning: 2 VCPUs and 8GB RAM to support a primary forest that is being updated and queried. The maximum recommended size of a single forest is about 400 GB, so the folding ratio of such a forest is 400GB / 8GB or about 50 to 1. This suggests that the configuration outlined a couple of paragraphs back is at about 80% of capacity. It would be time to think about growing RAM before too long. What will happen if you delay?

Since MarkLogic is a shared-nothing architecture, the caches on any given node will behave independently from those on the other nodes. Each node will therefore exhibit its own measure of cache efficiency. Since a distributed system operates at the speed of its slowest component, it is likely that the node with the most misses will govern the response time of the cluster as a whole.

At some point, response time degradation will become noticeable and it will become time to remedy the situation. The miss ratios on your List Cache and your page-in rate for your Range Indexes will grow to the point at which your SLA might no longer be met.

Many installations get surprised by the rapidity of this degradation. But recall, the various forces mentioned above are all happening in parallel, and their effect is compounding. The load on your caches will grow more than linearly over time. So be vigilant and measure, measure, and measure!

In the best of all possible worlds, you have a test system that mirrors your production environment that exhibits this behavior in advance of production. One approach is to experiment with reducing the memory on the test system by, say, configuring VMs for a given memory size (and adjusting huge pages and cache sizes proportionately) to see where things degrade unacceptably. You could measure:

Response time: where does it degrade by 2x, say?

List cache miss ratio: at what point does it double, say?

Page-in rate: at what point does it increase by 2x, say?

When you find the memory size at which things degraded unacceptably, use that to project the largest folding ratio that your workload can tolerate. Or you can be a bit clever and do the same additional calculations for ListCache and Anonymous memory:

Compute the sum of the sizes of all ListIndex + ListData files in all stands and divide by size of ListCache. This gives the folding ratio for this host of the termlist world.

Similarly, compute the sum of the sizes of all RangeIndex files and divide by the size of anonymous memory. This gives the folding ratio for the range index world on this host. This is where encryption can bite you. At least for a period of time, both the encrypted and the un-encrypted versions of a range index must be present in memory. This effectively doubles your folding ratio and can send you over the edge in a hurry. [Note: depending on your application, there may be additional in-memory derivatives of range indexes built to optimize for facets, sorting of results, … all taking up additional RAM.]
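The termlist calculation above can be sketched as a directory walk. This assumes the stand files are named ListIndex and ListData as mentioned earlier in this article, that you pass in the top-level forest data directory for the host, and that you supply the List Cache size yourself; the function names are hypothetical. Range index files are left out because their on-disk naming is not described here.

```python
import os


def total_size_by_name(root_dir, file_names):
    """Sum the sizes in bytes of all files under root_dir with a listed name."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name in file_names:
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


def termlist_folding_ratio(forest_root, list_cache_bytes):
    """On-disk termlist size for this host divided by the List Cache size."""
    on_disk = total_size_by_name(forest_root, {"ListIndex", "ListData"})
    return on_disk / list_cache_bytes
```

Run this at a quiet time, since stands come and go as merges complete, and record the result alongside your other metrics so you can see the trend.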

[To be fair, on occasion a resource other than RAM can become oversubscribed (beyond the scope of this discussion):

IOPs and I/O bandwidth (both at the host and storage level);

Disk capacity (too full leads to slowness on some storage devices, or to inability to merge);

Wide-area network bandwidth / latency / consistency (causes DR to push back and stall primary);

CPU saturation (this is rare for traditional search-style applications, but showing up more in the world of SQL, SPARQL and Optic, often accompanied by memory pressure due to very large Join Tables. Check your query plans!);

Intra-cluster network bandwidth (both at host and switch/backbone layer, also rare)].

Alternatively, you may know you need to add RAM because you have an emergency on your hands: you observe that MarkLogic is issuing Low Memory warnings, you have evidence of heavy swap usage, your performance is often abysmal, and/or the operating system’s OOM (out of memory) killer is often taking down your MarkLogic instance. It is important to pay attention to the warnings that MarkLogic issues, above and beyond any that come from the OS.

You need to tune your queries so as to avoid bad practices (see the discussion in the beginning of this article) that waste memory and other resources, and almost certainly add RAM to your installation. The tuning exercise can be labor-intensive and time-consuming; it is often best to throw lots of RAM at the problem to get past the emergency at hand.

So, how to add more RAM to your cluster? There are three distinct techniques:

Scale vertically: Just add more RAM to the hosts you already have.

Scale horizontally: Add more nodes to your cluster and re-distribute the data

Scale functionally: Convert your existing e/d-nodes into d-nodes and add new e-nodes

Each of these options has its pros and cons. Various considerations:

Granularity:   Say you want to increase RAM by 20%. Is there an option to do just this?

Scope: Do you upgrade all nodes? Upgrade some nodes? Add additional nodes?

Cost: Will there be unanticipated costs beyond just adding RAM (or nodes)?

Operational impact: What downtime is needed? Will you need to re-balance?

Timeliness: How can you get back to acceptable operation as quickly as possible?

Option 1: Scale Vertically

On the surface, this is the simplest way to go. Adding more RAM to each node requires upgrading all nodes. If you already have separate e- and d-nodes, then it is likely that just the d-nodes should get the increased RAM.

In an on-prem (or, more properly, non-cloud) environment this is a bunch of procurement and IT work. In the worst case, your RAM is already maxed out so scaling vertically is not an option.

In a cloud deployment, the cloud provider dictates what options you have. Adding RAM may drag along additional CPUs to all nodes also, which requires added MarkLogic licenses as well as larger payment to the cloud provider. The increased RAM tends to come in big chunks (only 1.5x or 2x options). It’s generally not easy to get just the 20% more RAM (say) that you want. But this may be premature cost optimization; it may be best just to add heaps of RAM, stabilize the situation, and then scale RAM back as feasible. Once you are past the emergency, you should begin to implement longer-term strategies.

In most cases this approach also does not add any network bandwidth, storage bandwidth, or capacity, and runs the small risk of just moving the bottleneck away from RAM and onto something else.

Option 2: Scale Horizontally

This approach adds whole nodes to the existing complex. It has the net effect of adding RAM, CPU, bandwidth and capacity. It requires added licenses, and payment to the cloud provider (or a capital procurement if on-prem). The granularity of expansion can be controlled; if you have an existing cluster of (2n+1) nodes, the smallest increment that makes sense in an HA context is 2 more nodes (to preserve quorum determination) giving (2n+3) nodes. In order to make use of the RAM in the new nodes, rebalancing will be required. When the rebalancing is complete, the new RAM will be utilized.

This option tends to be optimal in terms of granularity, especially in already larger clusters. To add 20% of aggregate RAM to a 25-node cluster, you would add 6 nodes to make a 31-node cluster (maintaining the odd number of nodes for HA). You would be adding 24%, which is better than having to add 50% if you had to scale all 25 nodes by 50% because that was what your cloud provider offered.
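The node-count arithmetic in this example can be written down as a small helper. It assumes every node has the same amount of RAM and that you want to end up with an odd number of nodes, as in the example; the function name is illustrative.

```python
def nodes_to_add(current_nodes, ram_increase_percent):
    """Smallest number of equal-sized nodes that adds at least the requested
    percentage of aggregate RAM and leaves an odd total node count."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    needed = -(-current_nodes * ram_increase_percent // 100)
    if (current_nodes + needed) % 2 == 0:
        needed += 1
    return needed


added = nodes_to_add(25, 20)
print(added, added * 100 / 25)  # 6 24.0
```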

Option 3: Scale Functionally

Scaling functionally means adding new nodes as e-nodes to the cluster and reconfiguring existing e/d-nodes to be d-nodes. This frees up RAM on the d-side (specifically by dramatically reducing the need for Expanded Tree Cache and memory for query evaluation) which will go towards restoring a good folding ratio. Recent experience says about 15% of RAM could be affected in this manner.

More licenses are again required, plus installation and admin work to reconfigure the cluster. You need to make sure that network can handle increases in XDMP traffic from e-nodes to d-nodes, but this is not typically a problem. The resulting cluster tends to run more predictably. One of our largest production clusters typically runs its d-nodes at nearly 95% memory usage as reported by MarkLogic as the first number in an error log line. It can get away with running so full because it is a classical search application whose d-node RAM usage does not fluctuate much. Memory usage on e-nodes is a different story, especially when the application uses SQL or Optic. In such a situation, on-demand allocation of large Join Tables can cause abrupt increase in memory usage. That’s why our advice on combined e/d nodes is to run below 80% to allow for query processing.

Thereafter, the two groups of nodes can be scaled independently depending on how the workload evolves.

Here are a few key takeaways from this discussion:

Measure performance when it is acceptable, not just when it is poor.

Do whatever it takes to stabilize in an emergency situation.

Correlate metrics with acceptable / marginal performance to determine a usable folding ratio.

If you have to make a guess, try to achieve no worse than a 50:1 ratio and go from there.

Measure and project the growth rate of your database.

Figure out how much RAM needs to be added to accommodate projected growth.

Test this hypothesis if you can on your performance cluster.

Range index type casting and invalid values

Range indexes and invalid values

We will discuss range index type casting and the behavior based on the invalid-values setting.

Casting values

We can cast a string to an unsignedLong as

xs:unsignedLong('4235234')

and the return is 4235234 as an unsignedLong. However, if we try

xs:unsignedLong('4235234x')

it returns an error

XDMP-CAST: (err:FORG0001) xs:unsignedLong("4235234x") -- Invalid cast: "4235234x" cast as xs:unsignedLong

Similarly,

xs:unsignedLong('')

returns an error

XDMP-CAST: (err:FORG0001) xs:unsignedLong("") -- Invalid cast: "" cast as xs:unsignedLong

This same situation can arise when a document contains invalid values. The invalid-values setting on the range index determines what happens in the case of a value that can't be cast to the type of the range index.
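One way to anticipate these cast errors is to test a value with the standard XQuery castable as expression before casting it. The following is a minimal sketch using the three sample strings from above; the message wording is just for illustration.

```xquery
xquery version "1.0-ml";

(: Check each candidate before casting, instead of letting the cast throw. :)
for $candidate in ('4235234', '4235234x', '')
return
  if ($candidate castable as xs:unsignedLong)
  then fn:concat('"', $candidate, '" casts to ', xs:unsignedLong($candidate))
  else fn:concat('"', $candidate, '" cannot be cast as xs:unsignedLong')
```

Only the first string is castable; the other two are reported rather than raising XDMP-CAST.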

Range indexes: values and types

Understanding Range Indexes discusses range indexes in general, and Defining Element Range Indexes discusses typed values.

Regarding the invalid-values parameter of a range index:

In the invalid values field, choose whether to allow insertion of documents that contain elements or JSON properties on which range index is configured, but the value of those elements cannot be coerced to the index data type. You can choose either ignore or reject. By default, the server rejects insertion of such documents. However, if you choose ignore, these documents can be inserted. This setting does not change the behavior of queries on invalid values after documents are inserted into the database. Performing an operation on an invalid value at query time can still result in an error.

Behavior with invalid values

Create a range index

First, create a range index of type unsignedLong on the id element in the Documents database:

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Define an unsignedLong element range index on id (no namespace) :)
let $config := admin:get-configuration()
let $dbid := xdmp:database('Documents')
let $rangespec := admin:database-range-element-index('unsignedLong', '', 'id', (), fn:false())
return
  admin:save-configuration (admin:database-add-range-element-index($config, $dbid, $rangespec))

Insert a document with a valid id value

We can insert a document with a valid value:

xdmp:document-insert ('test.xml', <doc><id>4235234</id></doc>)

Now if we check the values in the index as

cts:values (cts:element-reference (xs:QName ('id')))

we get the value 4235234 with type unsignedLong. We can search for the document with that value as

cts:search (/, cts:element-range-query (xs:QName ('id'), '=', 4235234), 'filtered')

and the document is correctly returned.

Insert a document with an invalid id value

With the range index still set to reject invalid values, we can try to insert a document with a bad value:

xdmp:document-insert ('test.xml', <doc><id>4235234x</id></doc>)

That gives an error as expected:

XDMP-RANGEINDEX: xdmp:eval("xquery version "1.0-ml";
xdmp:document-insert ('te...", (), <options xmlns="xdmp:eval"><database>16363513930830498097</database>...</options>) -- Range index error: unsignedLong fn:doc("test.xml")/doc/id: XDMP-LEXVAL: Invalid lexical value "4235234x"

and the document is not inserted.

Setting invalid-values to ignore and inserting an invalid value

Now we use the Admin UI to set the invalid-values setting on the range index to ignore. Inserting a document with a bad value as

xdmp:document-insert ('test.xml', <doc><id>4235234x</id></doc>)

now succeeds. But remember, as mentioned above, "... if you choose ignore, these documents can be inserted. This setting does not change the behavior of queries on invalid values after documents are inserted into the database. Performing an operation on an invalid value at query time can still result in an error."
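If you prefer to script the change instead of using the Admin UI, one possible approach is to replace the index definition through the Admin API. This is an untested sketch under the assumption that the unsignedLong index on id created earlier is the one to change; it looks up the existing definition, deletes it, and adds the same index again with the optional invalid-values argument set to 'ignore'. Replacing an index this way may cause the database to reindex.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";
declare namespace db = "http://marklogic.com/xdmp/database";

let $config := admin:get-configuration()
let $dbid := xdmp:database('Documents')
(: Fetch the existing definition so the delete matches it exactly. :)
let $old := admin:database-get-range-element-indexes($config, $dbid)
  [db:localname eq 'id'][db:scalar-type eq 'unsignedLong']
(: Same index, but with invalid-values set to 'ignore'. :)
let $new := admin:database-range-element-index(
  'unsignedLong', '', 'id', (), fn:false(), 'ignore')
let $config := admin:database-delete-range-element-index($config, $dbid, $old)
let $config := admin:database-add-range-element-index($config, $dbid, $new)
return admin:save-configuration($config)
```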

Values. Checking the values in the index

cts:values (cts:element-reference (xs:QName ('id')))

does not return anything.

Unfiltered search. Searching unfiltered for a value of 7 as

cts:search (/, cts:element-range-query (xs:QName ('id'), '=', xs:unsignedLong (7)), 'unfiltered')

returns our document (<doc><id>4235234x</id></doc>). This is a false positive. When you insert a document with an invalid value, that document is returned for any search using the index.

Filtered search. We can search filtered for a value of 7 to see if the false positive can be removed from the results:

cts:search (/, cts:element-range-query (xs:QName ('id'), '=', xs:unsignedLong (7)), 'filtered')

This throws an error:

XDMP-CAST: (err:FORG0001) cts:search(fn:collection(), cts:element-range-query(fn:QName("","id"), "=", xs:unsignedLong("7")), "filtered") -- Invalid cast: xs:untypedAtomic("4235234x") cast as xs:unsignedLong

That's because when the document is used in filtering, the invalid value is cast to match the query and it throws an error as in the earlier cast test.

Adding a new index and reindexing

If you have documents already in the database, and add an index, the reindexer will automatically reindex the documents.

If there are invalid values for one of your indexes, the reindexer will still reindex the document but will issue a Debug-level message about the problem:

2023-06-26 16:44:28.646 Debug: IndexerEnv::putRangeIndex: XDMP-RANGEINDEX: Range index error: unsignedLong fn:doc("/test.xml")/doc/id: XDMP-LEXVAL: Invalid lexical value "4235234x"

The reindexer will not reject or delete the document. You can use the URI given in the message to find the document and correct the issue.

Finding documents with invalid values

Since documents with invalid values always are returned by searches, you can use this to find the documents by doing an and-query of two searches that are normally mutually exclusive. For the document with the invalid value,

cts:uris ((), (),
cts:and-query ((
cts:element-range-query (xs:QName ('id'), '=', 7),
cts:element-range-query (xs:QName ('id'), '=', 8)
))
)

returns /test.xml.
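Building on the query above, the following sketch lists each suspect URI together with the offending value. It assumes the same id element and unsignedLong index used throughout this article, and that the matching documents are small enough to load and inspect directly.

```xquery
xquery version "1.0-ml";

(: URIs that match two mutually exclusive range queries can only be
   documents whose id value is missing from the index as a valid value. :)
let $suspects := cts:uris((), (),
  cts:and-query((
    cts:element-range-query(xs:QName('id'), '=', xs:unsignedLong(7)),
    cts:element-range-query(xs:QName('id'), '=', xs:unsignedLong(8))
  )))
for $uri in $suspects
(: Report only the id values that really cannot be cast. :)
for $id in fn:doc($uri)//id[fn:not(. castable as xs:unsignedLong)]
return fn:concat($uri, ': invalid id value "', fn:string($id), '"')
```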

Reaching "stand limit" frequently?

Introduction

Are you frequently seeing "stand limit" messages in your logs? This article explains what this message means for your application and what actions you should take.

What are stands and how can their number increase?

A stand holds a subset of the forest data and exists as a physical subdirectory under the forest directory. This directory contains a set of compressed binary files with names like TreeData, IndexData, Frequencies, Qualities, and such. This is where the actual compressed XML data (in TreeData) and indexes (in IndexData) can be found.

At any given time, a forest can have multiple stands. To keep the number of stands to a manageable level MarkLogic runs merges in the background. A merge takes some of the stands on disk and creates a new singular stand out of them, coalescing and optimizing the indexes and data, as well as removing any previously deleted fragments.

MarkLogic Server has a fixed limit on the maximum number of stands in a forest (64). When that limit is reached, you will no longer be able to update your system. While MarkLogic automatically manages merges and it is unlikely that you will reach this limit, there are a few configurations under user control that may impact merges and lead to this issue. For example:

1.) You can manage merges using Merge Policy Controls. For example, setting a low merge max size stops merges beyond the configured size, and hence the overall number of stands keeps growing.

2.) A low value of background-io-limit means less I/O is available for background tasks such as merges. This may adversely affect the merge rate, and hence the number of stands may grow.

3.) Low in-memory settings may not keep up with an aggressive data load. For example, if you are bulk loading large documents and have a low in-memory tree size, stands may accumulate and reach the hard limit.

What can you do to keep the number of stands within a manageable limit?

While MarkLogic automatically manages merges to keep the number of stands at a manageable level, it adds a Warning-level entry to the logs when it sees the number of stands growing alarmingly, for example: Warning: Forest XXXXX is at 92% of stand limit

If you see such messages in your logs, you should take some action as reaching the hard limit of 64 would mean you will no longer be able to update your system.
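To see how close each forest is to the limit, you can count the stands reported by xdmp:forest-status. The following is a rough sketch for the Documents database (substitute your own database name); the percentage is computed here simply as stands divided by 64, which is an assumption and may not match exactly how the server derives the figure in its log message.

```xquery
xquery version "1.0-ml";
declare namespace fs = "http://marklogic.com/xdmp/status/forest";

(: The fixed stand limit quoted in this article. :)
let $stand-limit := 64
for $forest in xdmp:database-forests(xdmp:database('Documents'))
let $stands := fn:count(xdmp:forest-status($forest)/fs:stands/fs:stand)
order by $stands descending
return fn:concat(
  xdmp:forest-name($forest), ': ', $stands, ' stands (',
  fn:round(100 * $stands div $stand-limit), '% of stand limit)')
```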

Here's what you can check and do to lower the number of stands.

1.) If you have configured merge policy controls, check whether they actually match your application usage, and change the settings as needed. For instance:

a.) There should be no merge blackouts during ingestion, or at any time there is heavy updating of your content.

b.) Beginning with MarkLogic version 7, the server is able to manage merges with less free space required on your drives (1.5 times the size of your content). This is accomplished by setting the merge max size to 32768 (32GB). Although this does create more stands, this is OK on newer systems, since the server is able to use extra CPU cores in parallel.

2.) If you have configured background-io-limit, check whether it is sufficient for your application usage. If needed, increase the value so that merges can make use of more IO. You should only use this setting on systems that have limited disk IO. In general you want to first set it to 200, and if the disk IO still seems to be overwhelmed, set it to 150, and so on. A setting of 100 may be too low for systems that are doing ingestion, since the merge process needs to be able to keep up with stand creation.

3.) If you are performing bulk loads, check whether the in-memory settings are sufficient or should be increased. If needed, increase the relevant values so that in-memory stands (and as a result on-disk stands) accommodate more data, thereby decreasing the number of stands. If you do grow the in-memory caches, make sure to grow the database journal files by a corresponding amount. This will ensure that a single large transaction will be able to fit in the journals.
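Before changing any of these settings, it can help to record their current values. The following read-only sketch reports them through the Admin API for the Documents database and the current host's group; substitute your own database name.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid := xdmp:database('Documents')
return (
  fn:concat('merge max size (MB): ',
    admin:database-get-merge-max-size($config, $dbid)),
  fn:concat('in memory tree size (MB): ',
    admin:database-get-in-memory-tree-size($config, $dbid)),
  fn:concat('in memory list size (MB): ',
    admin:database-get-in-memory-list-size($config, $dbid)),
  fn:concat('journal size (MB): ',
    admin:database-get-journal-size($config, $dbid)),
  (: background-io-limit is a group-level setting; 0 means no limit :)
  fn:concat('background io limit (MB/s): ',
    admin:group-get-background-io-limit($config, xdmp:group()))
)
```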

Conclusion

If you decide to control MarkLogic's merge process, you should monitor the system for any adverse effect that it may cause and take actions accordingly. MarkLogic Server continuously assesses the state of each database and the default merge settings and the dynamic nature of merges will keep the database tuned optimally at all times. So if you are unsure - let MarkLogic handle the merges for you!

Reclaiming Space from a Disabled Triple Index

On a MarkLogic 7 cluster or a MarkLogic 8 cluster that was previously upgraded from MarkLogic Server version 6, reindexing of the triple index does not always get triggered when the triple index is turned off. Reindexing is performed after turning off an index in order to reclaim space that the index was using.

The workaround is to force a manual reindexing.

Renaming hosts in ML cluster

Introduction

Using MarkLogic Server's Admin UI, it is possible to modify the name of a single host: go to Admin UI -> Configure -> Hosts, select the host in question, update the name, and click ok.

However, if you want to change the hostnames across the cluster, we recommend that you follow the steps below:

1) Renaming hosts in a cluster

Add the new hostnames to the DNS or /etc/hosts on all hosts.

Make sure all new hostnames can be resolved from the nodes in the cluster.

Rename all host-names using one of the following:

Admin UI

The Admin API function admin:host-set-name(), passing in the new name for each host.

/manage/v2/hosts/{id|name}/properties (PUT) REST endpoint that is part of the Management REST API

Note: changing the hostname will require a restart.

Host/cluster should come up if the DNS entries have been set up correctly.

Remove old host names.
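As an illustration of the Admin API option in the steps above, the following sketch renames one host. Both hostnames are hypothetical placeholders, and the sketch assumes the new name already resolves from every node in the cluster.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: 'old-host-1.example.com' and 'new-host-1.example.com' are placeholders. :)
let $host-id := admin:host-get-id($config, 'old-host-1.example.com')
let $config := admin:host-set-name($config, $host-id, 'new-host-1.example.com')
return admin:save-configuration($config)
```

Repeat the call for each host to be renamed; as noted above, the change requires a restart.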

2) Once the hostnames are updated, we recommend you verify the items below that may be affected by hostname changes:

Application Servers

PKI Certificates

Database replication

Flexible replication

Application code

Replacing a failed MarkLogic node in a cluster: a step by step wa...

Introduction

In this knowledgebase article, we are working on the premise that a host in your cluster has been completely destroyed, that primary forests on the failed host have failed over to their replicas - and that steps need to be taken to introduce a new replacement host to get the cluster back up and running.

We start with some general assumptions for this scenario:

We have a 3-node cluster already configured for High Availability (including the necessary auxiliary databases)

The data is all contained in one database (for the sake of simplicity)

Each host in the cluster contains 2 primary forests and 2 replica forests

Cluster topology

Here is an overview of the cluster topology:

Host Name | Primary Forest 1 | Primary Forest 2 | Replica Forest 1 | Replica Forest 2
Host A | Data-1 | Data-2 | Data-5-R | Data-6-R
Host B | Data-3 | Data-4 | Data-1-R | Data-2-R
Host C | Data-5 | Data-6 | Data-3-R | Data-4-R

In addition, Host B will also contain replicas for the vital auxiliary forests: Schemas-1-R and Security-1-R. Host C will contain Schemas-2-R and Security-2-R.

Failure Scenario

Host B will be unexpectedly terminated. For the application, these Forests will need to be detached and removed:

Host Name | Primary Forest 1 | Primary Forest 2 | Replica Forest 1 | Replica Forest 2
Host B | Data-3 | Data-4 | Data-1-R | Data-2-R

As Host B also contains the replica auxiliary forests (for the Security and Schemas databases), these will also need to be removed before Host B can be taken out of the cluster.

Walkthrough: a step-by-step guide

The attached Query Console workspace (KB-607.xml) runs through all the necessary steps to set up a newly configured 3-node cluster for this scenario; feel free to review all 5 tabs in the workspace to gain insight into how everything is set up for this scenario.

1. Overview

The cluster status before Host B is removed from the cluster is as follows; note that the Forests for Host B are all highlighted in the images below:

The Schemas Database

The Security Database

The "Data" Database

2. Create the Failover Scenario

Host B will be stopped. You'll need to give MarkLogic some time to perform the failover. To illustrate the failure in this scenario, we're going to issue sudo service MarkLogic stop at the command prompt on this host.

This is what you should see after the failover has taken place:

The Schemas Database

The Security Database

The "Data" Database

After failover has taken place, you should see:

That the Data database is still online

That the Data database contains the same number of documents as it did prior to failover (200,000)

That the four Forests that were mounted on Host B are now all listed as being in an error state

That the replica forests for the two primary forests are now showing an open state

Recovery - Step 1: Detach and remove the Host B Auxiliary Forests

The first task is to ensure the two auxiliary forests for the Schemas and Security databases are removed.

Detach the Schemas Replica Forest

In the Admin GUI go to: Configure > Forests > Schemas > Configure Tab > Forest Replicas and uncheck Schemas-1-R and click ok

Note: these changes will not be applied until you have clicked on the ok button

Detach the Security Replica Forest

In the Admin GUI go to: Configure > Forests > Security > Configure Tab > Forest Replicas and uncheck Security-1-R and click ok

Note: these changes will not be applied until you have clicked on the ok button

The above steps are scripted in the first tab of the attached Query Console workspace (KB-607-Failover.xml)
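For readers without the workspace to hand, a script for the two detach steps might look something like the sketch below. It is not the attached script itself; it assumes the primary forests are named Schemas and Security, as in the Admin GUI paths above.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: Detach the Host B replica from the Schemas forest. :)
let $config := admin:forest-remove-replica($config,
  admin:forest-get-id($config, 'Schemas'),
  admin:forest-get-id($config, 'Schemas-1-R'))
(: Detach the Host B replica from the Security forest. :)
let $config := admin:forest-remove-replica($config,
  admin:forest-get-id($config, 'Security'),
  admin:forest-get-id($config, 'Security-1-R'))
return admin:save-configuration($config)
```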

Delete the Schemas Replica Forest

In the Admin GUI go to: Configure > Forests > Schemas-1-R > Configure Tab and click delete and follow the on-screen prompts to delete the forest configuration

Delete the Security Replica Forest

In the Admin GUI go to: Configure > Forests > Security-1-R > Configure Tab and click delete and follow the on-screen prompts to delete the forest configuration

Note: while the above steps are scripted in the second tab of the attached Query Console workspace (KB-607-Failover.xml) please note that the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; instead the call will fail with an XDMP-FORESTMNT exception.

Recovery - Step 2: Remove 'dead' primary forests and replicas and reinstate failed over forests as master forests

Start by disabling the rebalancer on the database until the problem has been completely resolved; to do this go to Configure > Databases > Data > Configure Tab and set enable rebalancer to false. This will stop any documents from being moved around until the maintenance work has been completed:

The above step is scripted in the third tab of the attached Query Console workspace (KB-607-Failover.xml)
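The equivalent Admin API call is short. A minimal sketch, assuming the database is named Data as in this walkthrough:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: Turn the rebalancer off for the duration of the maintenance work. :)
let $config := admin:database-set-rebalancer-enable(
  $config, admin:database-get-id($config, 'Data'), fn:false())
return admin:save-configuration($config)
```

Passing fn:true() instead re-enables the rebalancer once the work is complete.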

Detach and delete the 'dead' replicas

We're going to start by removing the Data-1-R and the Data-2-R replica forests from the database.

Go to Configure > Forests > Data-1 > Configure Tab and uncheck the entry under forest replicas to remove the Data-1-R replica from the Data-1 forest:

Go to Configure > Forests > Data-2 > Configure Tab and uncheck the entry under forest replicas to remove the Data-2-R replica from the Data-2 forest:

The above step is scripted in the fourth tab of the attached Query Console workspace (KB-607-Failover.xml)

Go to Configure > Forests > Data-1-R > Configure Tab and use the delete button to remove the forest:

Note that the confirmation screen will force you to perform a configuration only delete as the original forest data is no longer available. Click ok to confirm:

Go to Configure > Forests > Data-2-R > Configure Tab and use the delete button to remove the forest:

Again, the confirmation screen will force you to perform a configuration only delete as the original forest data is no longer available. Click ok to confirm:

Note: while the above steps are scripted in the fifth tab of the attached Query Console workspace (KB-607-Failover.xml) please note that the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; instead the call will fail with an XDMP-FORESTMNT exception.

At this stage, the database should still be completely available and you should now see 2 error messages reported on the database status page (Configure > Databases > Data > Status Tab):

Detach forests Data-3 and Data-4, detach the replicas and re-attach the replicas as master forests

The next step will cause a small outage while the configuration changes are being made.

First, we need to remove the replicas (Data-3-R and Data-4-R) from their respective master forests so we can add them back to the database as primary forests. To do this:

Using the Admin GUI go to Configure > Forests > Data-3 > Configure Tab and under the forest replicas section, uncheck Data-3-R to remove it as a replica:

Go to Configure > Forests > Data-4 > Configure Tab and under the forest replicas section, uncheck Data-4-R to remove it as a replica:

The above step is scripted in the sixth tab of the attached Query Console workspace (KB-607-Failover.xml)

Now go to Configure > Databases > Data > Forests > Configure Tab:

Uncheck Data-3 and Data-4 to remove them from the database

Check Data-3-R and Data-4-R to add them to the database

Click ok to save the changes

The above step is scripted in the seventh tab of the attached Query Console workspace (KB-607-Failover.xml)
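As a rough sketch of what this swap involves in Admin API terms (not a copy of the attached script), the four changes can be chained and saved in a single configuration update. It assumes Data-3-R and Data-4-R have already been removed as replicas, as described in the preceding step.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
let $dbid := admin:database-get-id($config, 'Data')
(: Remove the two forests that lived on the failed host. :)
let $config := admin:database-detach-forest($config, $dbid,
  admin:forest-get-id($config, 'Data-3'))
let $config := admin:database-detach-forest($config, $dbid,
  admin:forest-get-id($config, 'Data-4'))
(: Attach the former replicas as regular forests of the database. :)
let $config := admin:database-attach-forest($config, $dbid,
  admin:forest-get-id($config, 'Data-3-R'))
let $config := admin:database-attach-forest($config, $dbid,
  admin:forest-get-id($config, 'Data-4-R'))
return admin:save-configuration($config)
```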

You should now see that there are no further errors reported on the database status page for the Data database:

Delete the configuration for Data-3 and Data-4

We now need to delete the configuration for the Data-3 and Data-4 forests before we can safely remove the 'dead' host from the cluster.

Go to Configure > Forests > Data-3 > Configure Tab and use the delete button to remove the forest:

Click ok to confirm the deletion of the configuration information:

Go to Configure > Forests > Data-4 > Configure Tab and use the delete button to remove the forest:

Click ok to confirm the deletion of the configuration information:

Note: while the above steps are scripted in the eighth tab of the attached Query Console workspace (KB-607-Failover.xml) please note that the admin:forest-delete builtin will not allow you to delete a forest that is currently unavailable; instead the call will fail with an XDMP-FORESTMNT exception.

You can now safely remove the host from the cluster.

Recovery - Step 3: Remove 'dead' host configuration

Using the Admin GUI go to Configure > Hosts to view the current cluster topology:

Note that the status for the 'dead' host is disconnected and there are no Forests listed for that host. Click on the hostname for that host to get to the configuration.

From there you can use the remove button taking care to ensure that you're editing the configuration for the correct host (the host name field will tell you):

Read the warning and confirm the action using the ok button:

After the restart, you should verify that there are only two hosts available in the cluster:

Recovery - Step 4: Adding the replacement host to the cluster

Install MarkLogic Server on your new host, initialize it and join it to the existing cluster

Adding the missing forests to the new host

From the Admin GUI on your newly added host go to: Configure > Forests > Create Tab and manually add the 6 forests that were deleted in earlier steps:

Important Note: the hostname listed next to host indicates the host on which these forests will be created.

This step is scripted in the first tab of the attached Query Console workspace (KB-607-Recovery.xml)
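A sketch of how the six forests could be created through the Admin API is shown below. The hostname is a hypothetical placeholder for your replacement host, local:create-forests is a helper written for this example, and passing an empty data directory is assumed to select the default location.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Hypothetical helper: add one forest per name to the configuration. :)
declare function local:create-forests(
  $config as element(configuration),
  $names as xs:string*,
  $host-id as xs:unsignedLong
) as element(configuration)
{
  if (fn:empty($names)) then $config
  else local:create-forests(
    admin:forest-create($config, $names[1], $host-id, ()),
    fn:subsequence($names, 2),
    $host-id)
};

(: 'host-b-new.example.com' is a placeholder for the replacement host. :)
let $host-id := xdmp:host('host-b-new.example.com')
let $names := ('Data-3', 'Data-4', 'Data-1-R', 'Data-2-R',
               'Schemas-1-R', 'Security-1-R')
return admin:save-configuration(
  local:create-forests(admin:get-configuration(), $names, $host-id))
```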

Attach the missing replica forests for the Security and Schemas database

From the Admin GUI go to: Configure > Forests > Security > Configure Tab and add Security-1-R as a forest replica:

From the Admin GUI go to: Configure > Forests > Schemas > Configure Tab and add Schemas-1-R as a forest replica:

This step is scripted in the second tab of the attached Query Console workspace (KB-607-Recovery.xml)

Attach the replicas for the 4 forests for the Data database

From the Admin GUI go to: Configure > Forests > Data-1 > Configure Tab and add Data-1-R as a forest replica and use the ok button to save the changes:

Go to: Configure > Forests > Data-2 > Configure Tab and add Data-2-R as a forest replica and use the ok button to save the changes:

Go to: Configure > Forests > Data-3-R > Configure Tab and add Data-3 as a forest replica and use the ok button to save the changes:

Go to: Configure > Forests > Data-4-R > Configure Tab and add Data-4 as a forest replica and use the ok button to save the changes:

This step is scripted in the third tab of the attached Query Console workspace (KB-607-Recovery.xml)
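In Admin API terms, the four attachments above might be sketched as follows. Note the reversed roles for the last two pairs: at this point Data-3-R and Data-4-R are the forests attached to the database, so the newly created Data-3 and Data-4 are added as their replicas.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: Arguments are: configuration, master forest ID, replica forest ID. :)
let $config := admin:forest-add-replica($config,
  admin:forest-get-id($config, 'Data-1'),
  admin:forest-get-id($config, 'Data-1-R'))
let $config := admin:forest-add-replica($config,
  admin:forest-get-id($config, 'Data-2'),
  admin:forest-get-id($config, 'Data-2-R'))
let $config := admin:forest-add-replica($config,
  admin:forest-get-id($config, 'Data-3-R'),
  admin:forest-get-id($config, 'Data-3'))
let $config := admin:forest-add-replica($config,
  admin:forest-get-id($config, 'Data-4-R'),
  admin:forest-get-id($config, 'Data-4'))
return admin:save-configuration($config)
```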

Conclusion

At the end of the process, your database status should look like this:

The only task that remains (after the new replicas have caught up) is to establish Data-3 and Data-4 as the master forests.

To do this you'd need to detach them as replica forests, remove Data-3-R and Data-4-R from the database, attach Data-3-R as a replica for Data-3 and Data-4-R as a replica for Data-4 and then attach Data-3 and Data-4 back to the database.

After doing this, your final database status should look like this:

And the cluster host status should look like this:

Remember to re-enable the rebalancer if you wish to continue using it.

Resetting Wallet password after losing the existing password.

Introduction and Pre-requisites

MarkLogic provides and manages a PKCS #11 secured wallet, which can be used as the KMS (also known as the keystore) for encryption at rest. When MarkLogic Server starts for the first time, the server prompts you to configure the wallet password. This article describes how to reset the wallet password if you forget the one that was set at the time of initial launch.

If encryption at rest is enabled for any databases, you must first decrypt all of the encrypted data; otherwise you will lose access to it.

To disable encryption at the cluster level, change the cluster's Data Encryption setting from 'force' to 'default-off' under the key store tab of the cluster. Then disable encryption on every database that has it enabled, and disable log encryption if it is enabled. Once these changes are complete, all the databases will need to be reindexed, which decrypts them. Make sure all the databases are decrypted and reindexed before resetting the password.

Steps to reset the wallet password:

1. Stop MarkLogic server on all hosts

2. On all of the nodes,

move the following files/directories to a secure location in case they need to be restored

/var/opt/MarkLogic/keystore*.xml

/var/opt/MarkLogic/kms

Please make sure you have backup of the above.

3. Once those files have been moved, copy the new/clean bootstrap keystore.xml from the MarkLogic install directory on all the nodes

cp /opt/MarkLogic/Config/keystore.xml /var/opt/MarkLogic/

4. Make sure step 2 and 3 are performed on all the nodes and then start MarkLogic server on all nodes.

5. Reset your wallet password from the Cluster->Keystore->password change page; refer to https://docs.marklogic.com/guide/security/encryption#id_61056

Note: In the place of current password, you can provide any random password or even leave it blank.

Once complete, your wallet password should be set to the new value. Then you can configure your encryption at rest for data again.

(NOTE: AS WE ARE CHANGING THE ENCRYPTION CONFIGURATION AND RESETTING WALLET PASSWORDS, IT IS HIGHLY RECOMMENDED THAT YOU HAVE A PROPER BACKUP OF YOUR DATA AND CONFIGURATION. Please try the above-mentioned steps in a lower environment before implementing them in production.)

Retrieve MARKLOGIC_ADMIN_PASSWORD from an Amazon S3 Bucket

Introduction

While launching the CloudFormation Templates to create a managed cluster on AWS, the variables MARKLOGIC_ADMIN_USERNAME and MARKLOGIC_ADMIN_PASSWORD need to be provided as part of the AMI user data and these values are used to create the initial admin MarkLogic user.

This user creation is needed for the initial cluster setup process and in case a node restarts and rejoins the cluster. The password that is provided when launching the template is not exported to the MarkLogic process and it is not stored anywhere on the AMI.

If we wish to provide an administrator password, it is not recommended practice to provide a clear text password through /etc/marklogic.conf.

Alternatives

A best practice is to use a secure S3 bucket with encryption configured and data transmission in combination with an IAM role assigned to the EC2 instances in the cluster to access the S3 bucket. This approach is discussed in our documentation and the aim of this Knowledgebase article is to cover the approach in further detail.

We can use AWS CLI as suggested below to securely retrieve the password from an object stored in an S3 bucket and then pass that into /etc/marklogic.conf file as the MARKLOGIC_ADMIN_PASSWORD variable.

Solution

We recommend storing the MarkLogic admin password in an object (e.g. a text file) in a secured/encrypted S3 bucket which can only be retrieved by an authorized user who has access to the specific S3 bucket.

As a pre-requisite, create a file (for example, password.txt) with the required value for MARKLOGIC_ADMIN_PASSWORD and place it in a secure S3 bucket (for example, a bucket named "mlpassword").

To modify the CloudFormation Template

1. Locate the Launch configurations in the template

2. Within LaunchConfig1, add the following line at the beginning

#!/bin/bash

3. Add the following at the end of the launch configuration block

- >
  echo 'export MARKLOGIC_ADMIN_PASSWORD=$(aws s3 --region us-west-2 cp s3://mlpassword/password.txt -)' >
  /etc/marklogic.conf # create marklogic.conf

4. Delete the entries referring to MARKLOGIC_ADMIN_PASSWORD:

- MARKLOGIC_ADMIN_PASSWORD=
- !Ref AdminPass
- |+

5. After these modifications, LaunchConfig1 would look like this:

LaunchConfig1:
  Type: 'AWS::AutoScaling::LaunchConfiguration'
  DependsOn:
    - InstanceSecurityGroup
  Properties:
    BlockDeviceMappings:
      - DeviceName: /dev/xvda
        Ebs:
          VolumeSize: 40
      - DeviceName: /dev/sdf
        NoDevice: true
        Ebs: {}
    KeyName: !Ref KeyName
    ImageId: !If [EssentialEnterprise, !FindInMap [LicenseRegion2AMI,!Ref 'AWS::Region',"Enterprise"], !FindInMap [LicenseRegion2AMI, !Ref 'AWS::Region', "BYOL"]]
    UserData: !Base64
      'Fn::Join':
        - ''
        - - |
            #!/bin/bash
          - MARKLOGIC_CLUSTER_NAME=
          - !Ref MarkLogicDDBTable
          - |+

          - MARKLOGIC_EBS_VOLUME=
          - !Ref MarklogicVolume1
          - ',:'
          - !Ref VolumeSize
          - '::'
          - !Ref VolumeType
          - |
            ::,*
          - |
            MARKLOGIC_NODE_NAME=NodeA#
          - MARKLOGIC_ADMIN_USERNAME=
          - !Ref AdminUser
          - |+

          - |
            MARKLOGIC_CLUSTER_MASTER=1
          - MARKLOGIC_LICENSEE=
          - !Ref Licensee
          - |+

          - MARKLOGIC_LICENSE_KEY=
          - !Ref LicenseKey
          - |+

          - MARKLOGIC_LOG_SNS=
          - !Ref LogSNS
          - |+

          - MARKLOGIC_AWS_SWAP_SIZE=
          - 32
          - |+

          - >
            echo 'export MARKLOGIC_ADMIN_PASSWORD=$(aws s3 --region us-west-2 cp s3://mlpassword/password.txt -)' >
            /etc/marklogic.conf # create marklogic.conf

          - !If
            - UseVolumeEncryption
            - !Join
              - ''
              - - 'MARKLOGIC_EBS_KEY='
                - !If
                  - HasCustomEBSKey
                  - !Ref VolumeEncryptionKey
                  - 'default'
            - ''

    SecurityGroups:
      - !Ref InstanceSecurityGroup
    InstanceType: !Ref InstanceType
    IamInstanceProfile: !Ref IAMRole
    SpotPrice: !If
      - UseSpot
      - !Ref SpotPrice
      - !Ref 'AWS::NoValue'
  Metadata:
    'AWS::CloudFormation::Designer':
      id: 2efb8cfb-df53-401d-8ff2-34af0dd25993

6. Repeat the steps 2,3,4 for all the other LaunchConfig groups and save the template and launch the stack.

With this, there is no need to provide the Admin Password while launching the stack using CloudFormation templates.

Note: Please make sure that the IAM role that you are assigning has access to the S3 bucket where the password file is stored.

NOTE: The CloudFormation templates are written in YAML; be cautious when editing, as YAML is whitespace-sensitive.

Search and Fragmentation

Summary

This article explores fragmentation policy decisions for a MarkLogic database, and how search results may be influenced by your fragmentation settings.

Discussion

Fragments versus Documents

Consider the below example.

1) Load 20 test documents in your database by running

let $doc := <test>{
for $i in 1 to 20 return <node>foo {$i}</node>
}</test>
for $i in 1 to 20
return xdmp:document-insert ('/'||$i||'.xml', $doc)

Each of the 20 documents will have a structure like so:

<test> <node>foo 1</node> <node>foo 2</node> . . . <node>foo 20</node> </test>

2) Observe the database status: 20 documents and 20 fragments.

3) Create a fragment root on 'node' and allow the database to reindex.

4) Observe the database status: 20 documents and 420 fragments. There are now 400 extra fragments for the 'node' elements.

We will use the data with fragmentation in the examples below.

Fragments and cts:search counts

Searches in MarkLogic work against fragments (not documents). In fact, MarkLogic indexes, retrieves, and stores everything as fragments.

While the terms fragments and documents are often used interchangeably, all the search-related operations happen at fragment level. Without any fragmentation policy defined, one fragment is the same as one document. However, with a fragmentation policy defined (e.g., a fragment root), the picture changes. Every fragment acts as its own self-contained unit and is the unit of indexing. A term list doesn't truly reference documents; it references fragments. The filtering and retrieval process doesn't actually load documents; it loads fragments. This means a single document can be split internally into multiple fragments but they are accessed by a single URI for the document.

Since the indexes only work at the fragment level, operations that work at the level of indexing can only know about fragments.

Thus, xdmp:estimate returns the number of matching fragments:

xdmp:estimate (cts:search (/, 'foo')) (: returns 400 :)

while fn:count counts the actual number of items in the returned sequence:

fn:count (cts:search (/, 'foo')) (: returns 20 :)
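The difference between the two counts can be sketched with a small model in plain JavaScript. This is only an illustration under stated assumptions: it does not call any MarkLogic API, and the function names (buildDocuments, estimateMatches, countMatchingDocuments) are hypothetical. It assumes each document consists of one root fragment plus one fragment per 'node' element, as in the fragmented test data above.

```javascript
// Illustrative model only: plain JavaScript, not a MarkLogic API.
// Each document has one root fragment plus one fragment per <node> element.
function buildDocuments(docCount, nodesPerDoc) {
  const docs = [];
  for (let d = 1; d <= docCount; d++) {
    const fragments = [{ kind: "root", text: "" }];
    for (let n = 1; n <= nodesPerDoc; n++) {
      fragments.push({ kind: "node", text: "foo " + n });
    }
    docs.push({ uri: "/" + d + ".xml", fragments: fragments });
  }
  return docs;
}

// True when the fragment's text contains the given word.
function fragmentMatches(fragment, word) {
  return fragment.text.split(" ").includes(word);
}

// Index-level view (in the spirit of xdmp:estimate): count matching fragments.
function estimateMatches(docs, word) {
  return docs.reduce(
    (sum, doc) => sum + doc.fragments.filter(f => fragmentMatches(f, word)).length,
    0
  );
}

// Result-level view (in the spirit of fn:count over the returned sequence):
// count documents that contain at least one matching fragment.
function countMatchingDocuments(docs, word) {
  return docs.filter(doc => doc.fragments.some(f => fragmentMatches(f, word))).length;
}

const docs = buildDocuments(20, 20);
const totalFragments = docs.reduce((sum, doc) => sum + doc.fragments.length, 0);

console.log(totalFragments);                      // 420
console.log(estimateMatches(docs, "foo"));        // 400
console.log(countMatchingDocuments(docs, "foo")); // 20
```

The model reproduces the figures above: 420 fragments in total, 400 matching fragments at the index level, and 20 matching documents.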

Fragments and search:search counts

When using search:search, "... the total attribute is an estimate, based on the index resolution of the query, and it is not filtered for accuracy." This can be seen in the following example:

import module namespace search = "http://marklogic.com/appservices/search" at "/MarkLogic/appservices/search/search.xqy";
search:search("foo",
<options xmlns="http://marklogic.com/appservices/search">
<transform-results apply="empty-snippet"/>
</options>
)

returns

<search:response snippet-format="empty-snippet" total="400" start="1" page-length="10" xmlns:search="http://marklogic.com/appservices/search">
<search:result index="1" uri="/3.xml" path="fn:doc(&quot;/3.xml&quot;)" score="2048" confidence="0.09590387" fitness="1">
<search:snippet/>
</search:result>
<search:result index="2" uri="/5.xml" path="fn:doc(&quot;/5.xml&quot;)" score="2048" confidence="0.09590387" fitness="1">
<search:snippet/>
</search:result>
.
.
.
<search:result index="10" uri="/2.xml" path="fn:doc(&quot;/2.xml&quot;)" score="2048" confidence="0.09590387" fitness="1">
<search:snippet/>
</search:result>

Notice that the total attribute gives the estimate of the results, starting from the first result in the page, similar to the xdmp:estimate result above, and is based on unfiltered index (fragment-level) information. Thus the value of 400 is returned.

When using search:search:

Each result in the report provided by the Search API reflects a document -- not a fragment. That is, the units in the Search API are documents. For instance, the report above has 10 results/documents.

Search has to estimate the number of result documents based on the indexes.

Indexes are based on fragments and not documents.

If no filtering is required to produce an accurate result set and if each fragment is a separate document, the document estimate based on the indexes will be accurate.

If filtering is required or if documents aggregate multiple matching fragments, the estimate will be inaccurate. The only way to get an accurate document total in these cases would be to retrieve each document, which would not scale.

Fragmentation and relevance

Fragmentation also has an effect on relevance. See Fragments.

Should I use fragmentation?

Fragmentation can be useful at times, but generally it should not be used unless you are sure you need it and understand all the tradeoffs. Alternatively, you can break your document into subdocuments instead. In general, the search API is designed to work better without fragmentation in play.

Server-side JavaScript and JSON vs XQuery and XML in MarkLogic Se...

Introduction

This article discusses the capabilities of JavaScript and XQuery, and the use of JSON and XML, in MarkLogic Server, and when to use one vs the other.

Details

Can I do everything in JavaScript that I can do in XQuery? And vice-versa?

Yes, eventually. Server-side JavaScript builds upon the same C++ foundation that the XQuery runtime uses. MarkLogic 8.0-1 provides bindings for just about every one of the hundreds of built-ins. In addition, it provides wrappers to allow JavaScript developers to work with JSON instead of XML for options parameters and return values. In the very few places where XQuery functionality is not available in JavaScript you can always drop into XQuery with xdmp.xqueryEval(...).

When should I use XQuery vs JavaScript? XML vs JSON? When shouldn’t I use one or the other?

This decision will likely depend on skills and aspirations of your development team more than the actual capabilities of XML vs JSON or XQuery vs JavaScript. You should also consider the type of data that you’re managing. If you receive the data in XML, it might be more straightforward to keep the data in its original format, even if you’re accessing it from JavaScript.

JSON

JSON is best for representing data structures and object serialization. It maps closely to the data structures in many programming languages. If your application communicates directly with a modern browser app, it’s likely that you’ll need to consume and produce JSON.

XML

XML is ideal for mark-up and human text. XML provides built-in semantics for declaring human language (xml:lang) that MarkLogic uses to provide language-specific indexing. XML also supports mixed content (e.g., text with intermingled mark-up), allowing you to "embed" structures into the flow of text.

Triples

Triples are best for representing atomic facts and relationships. MarkLogic indexes triples embedded in either XML or JSON documents, for example to capture metadata within a document.

JavaScript

JavaScript is the most natural language to work with JSON data. However, MarkLogic’s JavaScript environment also provides tools for working with XML. NodeBuilder provides a pure JavaScript interface for constructing XML nodes.

XQuery

XQuery can also work with JSON. MarkLogic 8 extends the XQuery and XPath Data Model (XDM) with new JSON node tests: object-node(), array-node(), number-node(), boolean-node(), and null-node(). One implication of this is that you can use XPath on JSON nodes just like you would with XML. XML nodes also implement a DOM interface for traversal and read-only access.

Summary

If you’re working with data that is already XML or you need to model rich text and mark-up, an XML-centric workflow is the best choice. If you’re working with JSON, for example, coming from the browser, or you need to model typed data structures, JSON is probably your best choice.

Server-side JavaScript implementation and module reuse

Introduction

This article discusses how JavaScript is implemented in MarkLogic Server, and how modules can be reused.

Is Node.js embedded in the server?

MarkLogic 8 embeds Google's V8 JavaScript engine, just like Node.js does, but not Node.js itself. Both environments use JavaScript and share the core set of types, functions, and objects that are defined in the language. However, they provide completely different contexts.

Can I reuse code written for Node in Server-Side JavaScript?

Not all JavaScript that runs in the browser will work in Node.js; similarly, not all JavaScript that runs in Node.js will work in MarkLogic. JavaScript that doesn’t depend on the specific environment is portable between MarkLogic, Node.js, and even the browser.

For example, the utility lodash library can run in any environment because it only depends on features of JavaScript, not the particular environment in which it’s running.

Conversely, Node’s HTTP library is not available in MarkLogic because that library is particular to JavaScript running in Node.js, not built into the language. (To get the body of an HTTP request in MarkLogic, for example, you’d use the xdmp.getRequestBody() function, part of MarkLogic’s built-in HTTP server library.) If you’re looking to use Node with MarkLogic, we provide a full-featured, open-source client API.

Will you allow npm modules on MarkLogic?

JavaScript libraries that don’t depend on Node.js should work just fine, but you cannot use npm directly today to manage server-side JavaScript modules in MarkLogic. (This is something that we’re looking at for a future release.)

To use external JavaScript libraries in MarkLogic, you need to copy the source to a directory under an app server’s modules root and point to them with a require() invocation in the importing module.

What can you import?

JavaScript modules

Server-side JavaScript in MarkLogic implements a module system similar to CommonJS. A library module exports its public types, variables, and functions. A main module requires a library module, binding the exported types, variables, and functions to local “namespace” global variables. The syntax is very similar to the way Node.js manages modules. One key difference is that modules are only scoped for a single request and do not maintain state beyond that request. In Node, if you change the state of a module export, that change is reflected globally for the life of the application. In MarkLogic, it’s possible to change the state of a library module, but that state will only exist in the scope of a single request.

For example:

// *********************************************
// X.sjs

module.exports.blah = function() {
  return "Not Math.random";
};

// *********************************************
// B.sjs

var x = require("X.sjs");

function bTest() {
  return x.blah === Math.random;
}

module.exports.test = bTest;

// *********************************************
// A.sjs

var x = require("X.sjs");
var b = require("B.sjs");

x.blah = Math.random;

b.test();

// *********************************************
// A-prime.sjs

var x = require("X.sjs");
var b = require("B.sjs");

b.test();

Invoking A.sjs returns true, but subsequently invoking A-prime.sjs still returns false.
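The request-scoped behavior can be imitated outside MarkLogic with a short, plain JavaScript sketch. This is a hypothetical simulation, not MarkLogic's actual module loader: moduleSources and createRequestScope are invented names, and the sketch simply assumes that each request starts with an empty module cache.

```javascript
// Hypothetical simulation in plain JavaScript; it does not use MarkLogic's real module loader.
const moduleSources = {
  "X.sjs": function (module) {
    module.exports.blah = function () {
      return "Not Math.random";
    };
  }
};

// Each simulated request gets its own module cache, so a change made to a
// module's exports is visible only within that request.
function createRequestScope() {
  const cache = {};
  return function requireModule(name) {
    if (!cache[name]) {
      const module = { exports: {} };
      moduleSources[name](module);
      cache[name] = module;
    }
    return cache[name].exports;
  };
}

// Request 1 behaves like A.sjs: it overwrites x.blah, then checks it.
const requireA = createRequestScope();
requireA("X.sjs").blah = Math.random;
const resultA = requireA("X.sjs").blah === Math.random;

// Request 2 behaves like A-prime.sjs: it only checks x.blah.
const requireAPrime = createRequestScope();
const resultAPrime = requireAPrime("X.sjs").blah === Math.random;

console.log(resultA, resultAPrime); // true false
```

In a long-running Node.js process the module cache is shared, so the second check would also see the modified export; the per-request cache is what keeps the change from leaking.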

XQuery modules

MarkLogic also allows server-side JavaScript modules to import library modules written in XQuery and call the exported variables and functions as if they were JavaScript.

See also: Server-Side JavaScript in MarkLogic.

Sizing and Scaling MarkLogic Server Environments

Summary

If you have already optimized your queries and data (removing unused indexes, dropping older data, etc.), you might be looking to size or scale your environment to ensure it meets your current and/or future requirements. This article is intended to provide high-level guidance around some of the main areas to consider when thinking about sizing or scaling a MarkLogic Server environment.

While scaling is now easier, thanks to the flexibility of virtualization and cloud technologies, we would still recommend that customers work with MarkLogic Sales and/or Professional Services teams to review and advise on any changes whenever possible. Precise sizing and scaling advice is outside the scope of the MarkLogic Support team.

MarkLogic Server Resource Requirements

MarkLogic Server is just one part of an environment – the health of a cluster depends on the health of the underlying infrastructure, such as disk I/O, network bandwidth, memory, and CPU. Therefore, as a first step, we would recommend reviewing and considering MarkLogic Server's resource needs, which are available within its Installation Guide:

Memory, Disk Space, and Swap Space Requirements
https://docs.marklogic.com/guide/installation-guide/en/requirements-and-database-compatibility/memory,-disk-space,-and-swap-space-requirements.html

Identifying Resource Contention/Starvation

You are, no doubt, already tracking the performance of queries, whether that be in your current or candidate environment, but it is also important to check for and track resource bottlenecks. Some high I/O and CPU activity, as well as increased memory utilization may not necessarily be a cause for concern and can just indicate the system is operating properly. However, you will want to look for evidence of resource contention/starvation, which might impact cluster performance, if not now, then potentially in the near future.

The MarkLogic Server hosts will indicate issues encountered with resources in their ErrorLogs, and such messages could include details on slow infrastructure or background tasks, lagging operations, hosts low on memory (RAM), disk space and/or other areas.

The Monitoring Dashboard and Monitoring History can be useful MarkLogic Server features to help you understand bottlenecks and what to do next. Some key areas to look for resource contention/starvation include:

Memory

Check the ErrorLogs for any Warning-level memory-related messages, which will indicate the areas involved, for example:

Warning: Memory low: forest+cache=97%phys
Warning: Memory low: huge+anon+swap+file=128%phys

Nearby "Info" level messages on the host can provide further information on the areas involved. Some potential paths for remediation for low memory situations are outlined within the following knowledgebase article:

Memory Consumption Logging and Status
https://help.marklogic.com/Knowledgebase/Article/View/memory-consumption-logging-and-status

For D/E-nodes, also check that the memory situation on each host is well-balanced between the group-level caches; in-memory content; App Server work and the Operating System. A "Rule of Thirds" provides a conceptual explanation on this, which is covered in the following knowledgebase article:

Group caches and Linux huge pages
https://help.marklogic.com/Knowledgebase/Article/View/15/0/group-caches-and-linux-huge-pages

A number of questions specifically on the scaling of memory are also covered in the following knowledgebase article:

RAMblings - Opinions on Scaling Memory in MarkLogic Server
https://help.marklogic.com/knowledgebase/article/View/ramblings---opinions-on-scaling-memory-in-marklogic-server

Caches

If you intend to scale physical memory, then you should consider any re-configuration of MarkLogic Server's group-level caches. During the installation process, MarkLogic sets memory and other settings based on the characteristics of the computer in which it is running. For the group-level caches, automatic sizing is usually recommended. However, for RAM size greater than 256GB, group cache settings are configured the same as for 256GB with automatic cache sizing. These can be changed using manual cache sizing.

Group Level Cache Settings based on RAM
https://help.marklogic.com/Knowledgebase/Article/View/group-level-cache-settings-based-on-ram

Check for queries that are contending for the caches. If the caches are not efficiently used, you will also see high I/O utilization on D-nodes. Cache hits are good, and indicate the query is running in an optimized fashion. Cache misses indicate that the query could not retrieve its results directly from the cache and had to read the data from disk. Disk I/O is expensive relative to reading from memory. Cache misses indicate that the query might be able to be optimized, either by rewriting the parts of the query that have cache misses to better take advantage of the indexes, or by adding indexes that the query can use.
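As a rough illustration of how hit and miss counts can be compared, the following plain JavaScript sketch computes a hit ratio. The function name and the sample counts are hypothetical; the counts would come from your own monitoring data, and no threshold for a "good" ratio is implied here.

```javascript
// hits and misses are hypothetical counters taken from your own monitoring data.
// Returns the fraction of lookups served from the cache, or null if there were none.
function cacheHitRatio(hits, misses) {
  const total = hits + misses;
  return total === 0 ? null : hits / total;
}

const ratio = cacheHitRatio(900, 100);
console.log(ratio); // 0.9
```

A ratio that falls over time for the same workload suggests more reads are going to disk, which is worth investigating alongside I/O utilization on the D-nodes.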

A simple way to review cache hit/miss data is via the "Databases" section in the Monitoring History, which shows details for the List Cache, Expanded Tree Cache and Compressed Tree Cache. Also shown are the triple-related caches (Triple Cache and Triple Value Cache); however, unlike other MarkLogic caches, these can shrink and grow, only taking up memory when entries need to be added to them. Further information on sizing caches and understanding cache statistics may be found via the following resources:

Semantic Graph Developer's Guide: Sizing Caches
https://docs.marklogic.com/guide/semantics/indexes#id_28957

Tuning Queries with query-meters and query-trace
https://docs.marklogic.com/guide/performance/query_meters

I/O Bandwidth

It is important to provision the appropriate amount of I/O bandwidth, where each forest will typically need a minimum of 20MB/sec read and 20MB/sec write. Further information on MarkLogic Server’s I/O requirements may be found within the following knowledgebase article:

MarkLogic Server I/O Requirements Guide:
https://help.marklogic.com/knowledgebase/article/View/11/0/marklogic-server-io-requirements-guide
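The per-forest minimum translates into a per-host figure by simple multiplication. The sketch below, in plain JavaScript, assumes the 20MB/sec read and 20MB/sec write minimums stated above; the function name and the example of 6 forests on a host are hypothetical.

```javascript
// Minimum I/O bandwidth per forest, taken from the guidance above (MB/sec).
const MIN_READ_MB_PER_SEC_PER_FOREST = 20;
const MIN_WRITE_MB_PER_SEC_PER_FOREST = 20;

// forestsOnHost is a hypothetical input: the number of forests a host serves.
function minimumHostBandwidth(forestsOnHost) {
  return {
    readMBPerSec: forestsOnHost * MIN_READ_MB_PER_SEC_PER_FOREST,
    writeMBPerSec: forestsOnHost * MIN_WRITE_MB_PER_SEC_PER_FOREST
  };
}

const bandwidth = minimumHostBandwidth(6);
console.log(bandwidth.readMBPerSec, bandwidth.writeMBPerSec); // 120 120
```

These are minimums only; the result is a baseline to compare against vendor figures and your own test results, not a sizing recommendation.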

Generally, when provisioning local disk, there is already some awareness of performance guidance from the vendors of the I/O controllers or disks being used on hosts. We have seen situations in the past where actual available bandwidth has been much different from expected, but at a minimum the expected values will provide a decent baseline for comparison against eventual testing results. If not already known, we would recommend contacting the vendors of the disk I/O related hardware used by the hosts before testing.

Look out for evidence of I/O Wait, which is the percentage of CPU time spent waiting for I/O operations to complete on a host. Some common causes of I/O Wait include slow storage devices and disk congestion (also faulty hardware and file system issues). I/O Wait can be monitored via technologies such as:

MarkLogic Server Monitoring History
https://docs.marklogic.com/guide/monitoring/history

Sar, from the sysstat package (external link)
https://github.com/sysstat/sysstat/

Network

Network should be monitored. Depending upon the size of the cluster, network traffic can be substantial (in the case of 50 or more hosts) or small (1-3 hosts). Query workload can also impact the network: if queries are requesting large numbers of documents, network traffic will increase.

CPU

To recap, some high CPU utilization may not be a cause for concern, as there are workloads and tasks that are known to be CPU intensive, such as certain queries, filtering, ingestion, reindexing, rebalancing and merging (note that merge activity will show up as nice % in CPU statistics).

Remediation for high CPU might include tuning code to see if there is a way to make better use of MarkLogic caches and reduce E-node operations. Otherwise, for sizing, adding additional capacity can alleviate a CPU bottleneck, so you might look into the option of adding E-nodes/cores.

Disk Space

Disk utilization is an important part of the host's ecosystem. The results of filling the file system can have disastrous effects on server performance and data integrity. It is very important to ensure that your host always has an appropriate amount of free disk space. Sufficient disk space beyond the bare minimum requirement should be available in order to handle influx of data into your system for at least the amount of time it takes to provision more capacity. Further information on MarkLogic's disk space requirements may be found in the following knowledgebase article:

Understanding MarkLogic Minimum Disk Space Requirements
https://help.marklogic.com/Knowledgebase/Article/View/284/0/understanding-marklogic-minimum-disk-space-requirements

Other Areas to Consider

Have You Planned for Failover Situations?

Host resource utilization may vary greatly after a failover event, and such situations should be sized and tested accordingly. Remember that memory utilization on the D-node might vary greatly after a failover and you should size accordingly. For example, if preload is turned off for range indexes, a host that properly served 6 primary forests and 6 failover forests could find itself with inadequate memory when it is serving 9 primary forests and 3 failover forests after a node failure. Likewise, those failover forests might not have impacted cache utilization on that host before the failover, but once active, are consuming cache resources.
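The shift in forest roles described above can be written down as a small calculation. This plain JavaScript sketch is only an illustration: the function name is hypothetical, and the number of forests that fail over to the surviving host (3 here) is taken from the example in the text.

```javascript
// host: counts of primary and failover (replica) forests on a surviving host.
// forestsFailingOver: how many of its failover forests become active after a node failure.
function forestsAfterFailover(host, forestsFailingOver) {
  return {
    primary: host.primary + forestsFailingOver,
    failover: host.failover - forestsFailingOver
  };
}

const beforeFailure = { primary: 6, failover: 6 };
const afterFailure = forestsAfterFailover(beforeFailure, 3);

console.log(afterFailure.primary, afterFailure.failover); // 9 3
```

The total number of forests on the host is unchanged, but the number actively serving requests rises from 6 to 9, which is the figure to size memory and cache for.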

Will You Be Changing the Number of Data Nodes?

If scaling data nodes horizontally, you will likely want to take advantage of the new node arrangement by redistributing your database data across all the data nodes in a well-balanced way. The following knowledgebase articles contain information on best practice on how this can be achieved:

MarkLogic Fundamentals - How should I scale out my cluster?
https://help.marklogic.com/Knowledgebase/Article/View/how-should-i-build-out-my-cluster

Considerations when scaling out your MarkLogic instance
https://help.marklogic.com/Knowledgebase/Article/View/162/0/considerations-when-scaling-out-your-marklogic-instance

Are You Running MarkLogic Server as Non-root User?

MarkLogic Server's root process makes a number of OS-specific settings to allow the product to run optimally. However, some customers choose to run MarkLogic Server without the watchdog process running as root. If, as part of scaling, you will be using systems with different specifications than before, there are modifications that you should consider making for the user that takes over the responsibilities the root process would otherwise have had. These modifications are detailed within the following knowledgebase article:

Pitfalls Running MarkLogic Process as non-root user
https://help.marklogic.com/Knowledgebase/Article/View/306/0/pitfalls-running-marklogic-process-as-non-root-user

Are There Any Licensing Implications?

If you are scaling your environment, consider if there will be any licensing implications as part of any change. If you have any questions in this area, you are welcome to open a Support ticket or simply fill out the form on the following page and we will be in touch with you:

How Can We Help?
https://www.progress.com/company/contact?s=marklogic

Test Any Changes

As always, we would recommend thoroughly testing any potential changes in a lower environment that is representative of Production (including while it is under a representative Production load) before applying them in Production, to identify any issues or changes in performance.

Enable Request Monitoring

The Request Monitoring feature enables you to configure logging of information related to requests, including metrics collected during request execution. This feature lets you enable logging of internal preset metrics for requests on specific endpoints. You can also log custom request data by calling the provided Request Logging APIs. This logged information may help you evaluate server performance.

Endpoints and Request Monitoring
https://docs.marklogic.com/guide/performance/request_monitoring

Issues After Scaling

If you run into issues after making changes to your infrastructure, you are welcome to contact MarkLogic Support for assistance. You may also find the following resource useful:

Performance Issues in MarkLogic Server: what they look like - and what you should do about them
https://help.marklogic.com/Knowledgebase/Article/View/performance-issues-in-marklogic-server-what-they-look-like---and-what-you-should-do-about-them

References & Further Reading

Performance: Understanding System Resources
https://developer.marklogic.com/learn/understanding-system-resources/

Sizing E-nodes

Introduction

The performance and resource consumption of E-nodes is determined by the kind of queries executed in addition to the distribution and amount of data. For example, if there are 4 forests in the cluster and the query is asking for only the top-10 results, then the E-node would receive a total of 4 x 10 results in order to determine the top-10 among these 40. If there are 8 forests, then the E-node would have to sort through 8 x 10 results.
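The top-10 example can be sketched in plain JavaScript. This is a simplified model rather than MarkLogic's actual implementation: the function names, URIs, and scores are hypothetical, and it assumes each forest returns its own top results, which the E-node then merges.

```javascript
// Merge the per-forest result lists and keep the overall top k by score.
function mergeTopK(perForestResults, k) {
  const candidates = perForestResults.flat();
  candidates.sort((a, b) => b.score - a.score);
  return { examined: candidates.length, top: candidates.slice(0, k) };
}

// Hypothetical data: each forest returns k results with made-up scores.
function fakeForestResults(forestId, k) {
  const results = [];
  for (let i = 0; i < k; i++) {
    results.push({
      uri: "/forest" + forestId + "/doc" + i + ".xml",
      score: 100 - i * 4 - forestId
    });
  }
  return results;
}

const perForest = [1, 2, 3, 4].map(id => fakeForestResults(id, 10));
const merged = mergeTopK(perForest, 10);

console.log(merged.examined);   // 40
console.log(merged.top.length); // 10
```

With 8 forests the same call would examine 80 candidates to return the same 10 results, which is why the number of forests affects E-node work even when the page size is fixed.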

Performance Test for Sizing E-Nodes:

To size E-nodes, it’s best to determine first how much workload a single E-node can handle, and then scale up accordingly.

Set up your performance test so it is at scale and so that it only talks to a single E-node. Start the Application Server settings with something like

threads = 32

backlog = 512

keep alive = 0

Increase the number of threads for the test from low to high, and observe the amount of resources being used on the E-node (CPU, memory, network). Measure both response time and throughput during these tests.

When the number of threads is low, you should be getting the best response time. This is what the end user would experience when the site is not busy.

When the number of threads is high, you will see longer response time, but you should be getting more throughput.

As you increase the number of threads, you will eventually run out of resources on the E-node - most likely memory. The idea is to identify the number of active threads when the system's memory is exceeded, because that is the maximum number of threads that your E-node can handle.
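One way to read the results of such a test is to scan the measurements for the first thread count at which memory is exceeded. The plain JavaScript sketch below assumes you have recorded memory use at each thread count; the sample figures, the 64 GB limit, and the function name are all hypothetical.

```javascript
// Hypothetical load-test samples for one E-node, ordered by increasing thread count.
const samples = [
  { threads: 8, memoryUsedGB: 20 },
  { threads: 16, memoryUsedGB: 34 },
  { threads: 32, memoryUsedGB: 58 },
  { threads: 64, memoryUsedGB: 70 }
];

// Returns the first sample whose memory use exceeds the host's available memory,
// or null if memory was never exceeded during the test.
function firstSampleExceedingMemory(testSamples, availableMemoryGB) {
  for (const sample of testSamples) {
    if (sample.memoryUsedGB > availableMemoryGB) {
      return sample;
    }
  }
  return null;
}

const ceiling = firstSampleExceedingMemory(samples, 64);
console.log(ceiling.threads); // 64
```

If the function returns null, the test did not push the E-node far enough to find its limit and should be repeated with higher thread counts.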

Additional Tuning of E-nodes

Thrashing

If you notice thrashing before MarkLogic is able to reach a memory consumption equilibrium, you will need to continue decreasing the threads so that the RAM/thread ratio is near the 'pmap total memory'/thread.

The backlog setting can be used to queue up requests without consuming significant resources.

Adjusting backlog along with some of the timeout settings might give a reasonable user experience comparable to, or even better than, what you may see with high thread counts.

As you continue to decrease the thread count and make other adjustments, the mean time to failure will likely increase until the settings are such that equilibrium is reached before all the memory resources are consumed - at which time we do not expect to see any additional memory failures.

Swap, RAM & Cache for E-nodes

Make sure that the E-nodes have swap space equal to the size of RAM (if the node has less than 32 GB of RAM) or 32 GB (if the node has 32 GB or more of RAM).

For E-nodes, you can minimize the List Cache and Compressed Tree Cache - set to 1GB each - in your group level configurations.

Your Expanded Tree Cache (group level parameter) should be at least equal to 1/8 of RAM, but you can further increase the Expanded Tree Cache so that all three caches (List, Compressed, Expanded) in combination are up to 1/3 of RAM.

Another important group configuration parameter is Expanded Tree Cache Partitions. A good starting point is 2-3 GB per partition, but it should not be more than 12 GB per partition. The greater the number of partitions, the greater the capacity of handling concurrent query loads.
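The swap and cache guidance above can be collected into a small calculator. This plain JavaScript sketch only restates the rules of thumb in this section; the function and property names are hypothetical, it does not read or write any MarkLogic configuration, and the results are starting points to validate with performance tests.

```javascript
// Starting-point sizes for an E-node, following the rules of thumb above.
function eNodeSizing(ramGB) {
  const swapGB = ramGB < 32 ? ramGB : 32;    // swap equals RAM below 32 GB, else 32 GB
  const listCacheGB = 1;                     // minimized on E-nodes
  const compressedTreeCacheGB = 1;           // minimized on E-nodes
  const expandedTreeCacheMinGB = ramGB / 8;  // at least 1/8 of RAM
  // Upper bound: all three caches together stay within 1/3 of RAM.
  const expandedTreeCacheMaxGB = Math.max(
    expandedTreeCacheMinGB,
    ramGB / 3 - listCacheGB - compressedTreeCacheGB
  );
  return {
    swapGB: swapGB,
    listCacheGB: listCacheGB,
    compressedTreeCacheGB: compressedTreeCacheGB,
    expandedTreeCacheMinGB: expandedTreeCacheMinGB,
    expandedTreeCacheMaxGB: expandedTreeCacheMaxGB
  };
}

// Partition counts that keep each partition at 2-3 GB, plus the smallest
// count that keeps each partition at or below 12 GB.
function partitionCountRange(expandedTreeCacheGB) {
  const fewest = Math.ceil(expandedTreeCacheGB / 3);
  const most = Math.max(fewest, Math.floor(expandedTreeCacheGB / 2));
  const hardMinimum = Math.ceil(expandedTreeCacheGB / 12);
  return { fewest: fewest, most: most, hardMinimum: hardMinimum };
}

const sizing = eNodeSizing(96);
const partitions = partitionCountRange(sizing.expandedTreeCacheMinGB);

console.log(sizing.swapGB, sizing.expandedTreeCacheMinGB, sizing.expandedTreeCacheMaxGB); // 32 12 30
console.log(partitions.fewest, partitions.most, partitions.hardMinimum);                  // 4 6 1
```

For a host with 96 GB of RAM, this gives 32 GB of swap, an Expanded Tree Cache between 12 GB and 30 GB, and 4 to 6 partitions for a 12 GB cache.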

Growing your Cluster

As your application, data and usage change over time, it is important to periodically revisit your cluster sizings and re-run your performance tests.

Solving AWS Cloud Formation Template failures due to deprecated s...

Introduction

This article is intended to address the impact of AWS deprecation of Python 3.6 (Lambda runtime dependency) and Classic Load Balancer (CLB) on MarkLogic Cloud Formation Templates (CFT).

Background

AWS announced deprecation of Python 3.6 and Classic Load Balancer (CLB).

For Python 3.6, please refer to 'Runtime end of support dates'.
For Classic Load Balancer, please refer to 'Migrate Classic Load Balancer'.

The CFTs provided for MarkLogic 10 prior to 10.0-9.2 are impacted by the Python 3.6 deprecation, as MarkLogic uses custom lambdas. CFTs prior to 10.0-9.2 are also impacted by the CLB deprecation, since the MarkLogic single-host deployment uses CLB.

Solutions

1. Upgrade to latest MarkLogic CFT templates:

Starting with the release of 10.0-9.2, the MarkLogic CFT uses Python 3.9 and has removed CLB for single-host deployments.

The fully-qualified domain name (FQDN) of the node is based on the internal IP address from the persistent reusable ENI. In a single-host cluster without CLB, the FQDN for the node is referenced in the list of outputs as the endpoint to access the Admin UI. For example, http://ip-10.x.x.x.ap-southeast-2.compute.internal:8001.

For a single-host cluster in a private subnet, a client residing in the public domain will not be able to connect to the single host directly. Your AWS Administrator will be required to set up a bastion host (jump box) or a reverse proxy, which acts as an addressable middle-tier to route traffic to the MarkLogic host. Alternatively, your Administrator can assign an Elastic IP to the single host, which makes the host publicly accessible.

2. Running with MarkLogic prior to 10.0-9.2

2.1: Modify MarkLogic's most current CFT.

You can use the latest version of the MarkLogic CFT, and then change the MarkLogic AMI version inside that CFT to refer to a specific prior version of the MarkLogic AMI.

2.2: Customized CFT (derived from MarkLogic CFT but with specific modification).

You can modify your copy of template to upgrade to Python 3.9 and remove the use of CLB.

a) To upgrade the Python version: Please refer to the custom lambda templates (ml-managedeni.template, ml-nodemanager.template), search for "python3.6" and replace it with "python3.9".

Format to build the URL: https://marklogic-db-template-releases.s3.<<AWS region>>.amazonaws.com/<<ml-version>>/ml-nodemanager.template

Download v10.0-7.1 custom lambda templates for upgrade using below links:

https://marklogic-db-template-releases.s3.us-west-2.amazonaws.com/10.0-7.1/ml-managedeni.template

https://marklogic-db-template-releases.s3.us-west-2.amazonaws.com/10.0-7.1/ml-nodemanager.template

After the changes are done, the modified templates should be uploaded to the s3 bucket. Also, the 'TemplateURL' should be updated in the main CFTs (mlcluster-vpc.template, mlcluster.template) under 'Resources' -> ManagedEniStack, 'Resources' -> NodeMgrLambdaStack.
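The URL format and the runtime replacement in step a) can be sketched in plain JavaScript. This is an illustration only: the function names are hypothetical, the sample template text is made up rather than copied from a real MarkLogic template, and the sketch neither downloads from nor uploads to S3.

```javascript
// Builds a template URL using the format given above.
function templateUrl(region, mlVersion, templateName) {
  return "https://marklogic-db-template-releases.s3." + region +
    ".amazonaws.com/" + mlVersion + "/" + templateName;
}

// Replaces every occurrence of the deprecated runtime identifier in the template text.
function upgradeRuntime(templateText) {
  return templateText.split("python3.6").join("python3.9");
}

const url = templateUrl("us-west-2", "10.0-7.1", "ml-nodemanager.template");

// Hypothetical template fragment, for illustration only.
const sampleTemplate = "Runtime: python3.6\nHandler: index.handler\n";
const upgradedTemplate = upgradeRuntime(sampleTemplate);

console.log(url);
console.log(upgradedTemplate);
```

After a replacement like this, it is worth diffing the modified template against the original to confirm that only the runtime lines changed, since the templates are whitespace-sensitive YAML.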

b) To remove the use of CLB: Please refer to the latest CFT version (mlcluster-vpc.template, mlcluster.template) and compare/modify the templates accordingly.

c) To upgrade the Python version of an existing stack without redeployment: Please navigate to the AWS Lambda console (Lambda->Functions->ActivityManager-Dev->Edit runtime setting) and update the runtime to use "Python 3.9".

The AWS deprecation does not impact an already deployed stack, since the Lambda functions are created during service creation (and only deleted when the service is terminated). Similarly, updating the cluster capacity does not have an impact on an existing deployed stack.

MarkLogic Cloud Services (DHS)

The issue is already addressed by the MarkLogic Cloud Services team with an upgrade of underlying dependency to "Python 3.9".

MarkLogic 9

Please note that this Knowledgebase article refers to MarkLogic 10 Cloud Formation Template changes alone. For MarkLogic 9 Cloud Formation templates, work on recommended Solutions is still in progress.

References

MarkLogic 10.0-9.2 Release Notes Addendum

Latest MarkLogic CFT

SSH to AWS MarkLogic Managed Cluster using a bastion host

The recommended way to run MarkLogic on AWS is to use the "managed" Cloud Formation template provided by MarkLogic:

https://developer.marklogic.com/products/cloud/aws

The documentation for it is here:

https://docs.marklogic.com/guide/ec2/CloudFormation

By default, the MarkLogic nodes are hidden in Private Subnets of a VPC and the only way to access them from the Internet is via the Elastic Load Balancer.

This is optimal as it distributes the load and shields the nodes from common attack vectors.

However, for some types of maintenance it may be useful, or even necessary to SSH directly into individual MarkLogic nodes.

Examples where this is necessary:

1. Configuring Huge Pages size so that it is correct for the instance size/amount of RAM: https://help.marklogic.com/Knowledgebase/Article/View/420/0/group-level-cache-settings-based-on-ram

2. Manual MarkLogic upgrade where a new AMI is not yet available (for example for emergency hotfix): https://help.marklogic.com/Knowledgebase/Article/View/561/0/manual-upgrade-for-marklogic-aws-ami

To enable SSH access to MarkLogic nodes you need to:

I. Create an intermediate EC2 host, commonly known as 'bastion' or 'jump' host.

II. Put it in the correct VPC and correct (public) subnet and ensure that it has a public / Internet-facing IP address.

III. Adjust security settings so that SSH connections to the bastion host, as well as SSH connections from the bastion host to MarkLogic nodes, are allowed, and launch the bastion instance.

IV. Additionally, you will need to configure SSH key forwarding or a similar solution so that you don't need to store your private key on the bastion host.

I. Creating the EC2 instance in AWS Console:

1. The EC2 instance needs to be in the same region as the MarkLogic Cluster so the starting console URL will be something like this (depending on the region and your account):

https://eu-west-1.console.aws.amazon.com/ec2/home?region=eu-west-1#LaunchInstanceWizard:

2. The instance OS can be any Linux of your choice and the default Amazon Linux 2 AMI is fine for this. For most scenarios the jump host does not need to be powerful, so any OS that is free tier eligible is recommended.

3. Choose instance size. For most scenarios (including SSH for admin access), the free tier t2.micro is the most cost-effective instance.

4. Don't launch the instance just yet - go to Step 3 of the Launch Wizard ("Step 3: Configure Instance Details").

II. Put the bastion host in the correct VPC and subnet and configure public IP:

The crucial steps here are:

1. Choose the same VPC that your cluster is in. You can find the correct VPC by reviewing the resources under the CloudFormation template section of the AWS console or by checking the details of the MarkLogic EC2 nodes.

2. Choose the correct subnet: navigate to the VPC section of the AWS Console and see which of the subnets of the MarkLogic cluster has an Internet Gateway in its route table.

3. Ensure that the "Auto-assign Public IP" setting is set to "enable". This automatically configures a number of AWS settings so that you won't have to assign an Elastic IP, routing, etc. manually.

4. Ensure that you have sufficient IAM permissions to create the EC2 instance and update security rules (to allow SSH traffic).

III. Configure security settings so that SSH connections are allowed and launch:

1. Go to "Step 6: Configure Security Group" of the AWS Launch Wizard. By default, AWS will suggest creating a "launch" security group that opens incoming SSH to any IP address. You can adjust this as necessary, for example to allow only a certain IP address range.

Additionally, you may need to review the security group settings for your MarkLogic cluster so that SSH connections from the bastion host are allowed.

2. Go to "Step 7: Review Instance Launch" and press "Launch". At this step you need to choose an existing SSH key pair for the region or create a new one. You will need this SSH key to connect to the bastion host.

3. Once the EC2 instance launches, review its details to find out the public IP address.

IV. Configure SSH key forwarding so that you don't have to permanently store your private SSH key on the bastion host. Please review your options and alternatives here (for example, using ProxyCommand): agent forwarding makes your SSH agent reachable from the bastion host for as long as you are connected, so anyone with root access to the bastion host could use your MarkLogic private key (when logged in at the same time as you).

1. Add the private key to the SSH agent:

ssh-add -K myPrivateKey.pem

2. Test the connection (with SSH agent forwarding) to the bastion host using:

ssh -A ec2-user@<bastion-IP-address>

3. Once you're connected, SSH from the bastion to a MarkLogic node:

ssh ec2-user@<MarkLogic-instance-IP-address or DNS-entry>
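As one alternative to agent forwarding, the ProxyCommand approach mentioned in step IV can be sketched as follows. This is a minimal, hedged example rather than a recommended procedure: the helper name proxy_ssh_command and the IP addresses are hypothetical, and the code only builds a command line (using the standard OpenSSH -o ProxyCommand option together with ssh -W) without running it. With this approach the private key stays on your workstation and the bastion host only relays the TCP connection.

```python
import shlex


def proxy_ssh_command(bastion, target, user="ec2-user"):
    """Build an ssh argv that reaches `target` by tunnelling through `bastion`.

    `ssh -W %h:%p` asks the bastion to forward stdin/stdout to the final
    host and port, so authentication to both hosts happens from the local
    machine. All names and addresses are illustrative assumptions.
    """
    proxy = f"ssh -W %h:%p {user}@{bastion}"
    return ["ssh", "-o", f"ProxyCommand={proxy}", f"{user}@{target}"]


if __name__ == "__main__":
    # Hypothetical addresses: a public bastion IP and a private node IP.
    print(shlex.join(proxy_ssh_command("203.0.113.10", "10.0.1.25")))
```

The printed command can be pasted into a terminal once the placeholder addresses are replaced with the bastion's public IP and the MarkLogic node's private IP or DNS entry.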

For strictly AWS infrastructure issues (VPC, subnets, security groups) please contact AWS support. For any MarkLogic related issues please contact MarkLogic support via:

help.marklogic.com

Start and stop MarkLogic Server as non-root user

Introduction

This article discusses why MarkLogic Server should be started with root privileges.

Details

It is possible to install MarkLogic Server in a directory that does not require root privileges.

There's also a section in our Installation Guide (Configuring MarkLogic Server on UNIX Systems to Run as a Non-daemon User) that talks at some length about how to run MarkLogic Server as a user other than daemon on UNIX systems. While that will allow you to configure permissions for non-root and non-daemon users in terms of file ownership and actual runtime, you'll still want to be the root user to start and stop the server.

It is possible to start MarkLogic without su privileges, but this is strongly discouraged.

The parent (root) MarkLogic process is simply a restarter process. It only waits for the non-root process to exit; if the non-root process exits abnormally for some reason, the root process will fork and exec another non-root process. The root process runs no XQuery scripts, opens no sockets, and accesses no database files.

We strongly recommend starting MarkLogic as root and letting it switch to the non-root user on its own. When the server initializes, if it is root it makes some privileged kernel calls to configure sockets, memory, and threads. For example, it allocates huge pages if any are available, increases the number of file descriptors it can use, binds any configured low-numbered socket ports, and requests the capability to run some of its threads at high priority. MarkLogic Server will function if it isn’t started as root, but it will not perform as well.

You can work around the root-user requirements for starting/stopping (and even installation/uninstallation) by creating wrapper scripts that call the appropriate script (startup, shutdown, etc.), providing sudo privileges to just the wrapper. This helps to control and debug execution.

Further reading

Knowledgebase - Pitfalls Running Marklogic Process as Non-root User

Stemming and element-value-query

Introduction

Stemming is handled differently by a word-query and a value-query: values are indexed using basic stemming only.

Discussion

A word may have more than one stem. For example,

cts:stem ('placing')

returns

place

placing

To see how this works with a word-query we can use xdmp:plan. Running

xdmp:plan (cts:search (/, cts:word-query ('placing')))

on a database with basic stemming returns

<qry:final-plan>
<qry:and-query>
<qry:term-query weight="1">
<qry:key>17061320528361807541</qry:key>
<qry:annotation>word("placing")</qry:annotation>
</qry:term-query>
</qry:and-query>
</qry:final-plan>

Since basic stemming uses only the first/shortest stem, this is searching just for the stem 'place'.

Searching with

cts:search (/, cts:word-query ('placing'))

will match 'a place of my own' ('placing' and 'place' both stem to 'place') but not 'new placings' ('placings' stems to just 'placing').

However, on a database with advanced stemming the plan is

<qry:final-plan>
<qry:and-query>
<qry:or-two-queries>
<qry:term-query weight="1">
<qry:key>17061320528361807541</qry:key>
<qry:annotation>word("placing")</qry:annotation>
</qry:term-query>
<qry:term-query weight="1">
<qry:key>17769756368104569500</qry:key>
<qry:annotation>word("placing")</qry:annotation>
</qry:term-query>
</qry:or-two-queries>
</qry:and-query>
</qry:final-plan>

Here you can see that there are two term queries OR-ed together (note the two different key values). The result is that the same cts:word-query('placing') now also matches 'new placings' because it queries using both stems for 'placing' ('place' and 'placing') and so matches the stemmed version of 'placings' ('placing').

However, the plan for a search with

cts:element-value-query(xs:QName('title'), 'new placing')

is

<qry:final-plan>
<qry:and-query>
<qry:term-query weight="1">
<qry:key>10377808623468699463</qry:key>
<qry:annotation>element(title,value("new","placing"))</qry:annotation>
</qry:term-query>
</qry:and-query>
</qry:final-plan>

whether the database has basic or advanced stemming, showing that multiple stems are not used.

The reason for this is that MarkLogic will only do basic stemming when indexing the keys for a value, so there is a single key for each value. If MarkLogic Server were designed to support multiple stems for values (which it does not), this would expand the indexes dramatically and slow down indexing, merging, and querying. Consider: if each word had two stems, there would be 2^N keys for a value of N words, and the number of keys would grow further with each additional stem per word (k^N keys for k stems per word).
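The 2^N growth can be illustrated with a short Python sketch. It is not how MarkLogic builds its indexes; it simply enumerates every combination of stems that a hypothetical multi-stem value index would need a key for. The stems for 'placing' come from the cts:stem example earlier in this article; the helper name value_keys and the assumption that 'new' has a single stem are illustrative only.

```python
from itertools import product


def value_keys(words, stems_of):
    """Enumerate every stem combination for a multi-word value.

    `stems_of` maps a word to its list of stems; words that are not in
    the map are assumed (for illustration) to stem only to themselves.
    """
    per_word = [stems_of.get(word, [word]) for word in words]
    return [" ".join(combo) for combo in product(*per_word)]


# Stems of 'placing' as returned by cts:stem in the example above.
STEMS = {"placing": ["place", "placing"]}

if __name__ == "__main__":
    print(value_keys(["new", "placing"], STEMS))
    # One word with two stems doubles the key count each time it repeats.
    for n in range(1, 6):
        print(n, len(value_keys(["placing"] * n, STEMS)))
```

Indexing a single key per value, as basic stemming does, keeps the count at one regardless of how many words the value contains.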

More information on value-queries is available at Understanding Search: value queries.

Steps to renew SSL certificates for MarkLogic

Summary

When an SSL certificate is expired or out of date, it is necessary to renew the SSL certificates applied to a MarkLogic application server.

The following general steps are required to apply an SSL certificate.

Create a certificate request for a server in MarkLogic

Download certificate request and send it to certificate authority

Import signed certificate into MarkLogic

Detailed Steps

Before proceeding, please note that you don't need to create a new template to renew an expired certificate as the existing template will work.

1. Creating a certificate request - A fresh certificate signing request (CSR) can be generated from the MarkLogic Admin UI by navigating to Security -> Certificate Templates -> click [your_template] -> click the Request tab -> select the radio button applicable to an expired/out-of-date certificate. For additional information, refer to the Generating and Downloading Certificate Requests section of our Security Guide.

2. Download and send to certificate authority - The certificate template status page will display the newly generated request. You can download it and send it to your certificate authority for signing.

3. Import signed certificate into MarkLogic - After receiving the signed certificate back from the certificate authority, you can import it from our Admin UI by navigating to Security -> Certificate Templates -> click [your_template] -> Import tab. For additional information, refer to the Importing a Signed Certificate into MarkLogic Server section of our Security Guide.

4. Verify - To verify whether the certificate has been renewed, please look at the summary of your certificate authority. The newly added certificate should appear under the certificate authority. Detailed instructions for this are available at Viewing Trusted Certificate Authorities.

If you are not able to view the certificate authority, then you may need to add the certificate as if it were a new CA. This can happen if there was a change in the CA certificate chain.

Click on the certificate template name and then import the certificate. You should already have this CA listed (as it was already there and only the certificate expired). However, if there is a change in certificate authority then you will need to import it. You can do this by navigating in the Admin UI to Configure -> Security -> Certificate Authorities -> click on the Import tab. This is equivalent to adding a new CA certificate into MarkLogic. The CA certificate name will now appear in the list.

SVC-MAPINI: Mapped file initialization error occurs randomly

Summary

Some disk related errors, such as SVC-MAPINI, seen on MarkLogic Servers running on the Linux platform can sometimes be attributed to background services attempting to read or monitor MarkLogic data files.

SVC-MAPINI Errors

In some cases when background services are attempting to access MarkLogic data files, you may encounter an error similar to the following:

SVC-MAPINI: Mapped file initialization error: open '/var/opt/MarkLogic/Forests/my-forest-02/0000145a/Timestamps': Operation not permitted

The most common cause of this issue is Anti-Virus software.

Resolution

To avoid file access conflicts, MarkLogic recommends that all MarkLogic data files, typically under /var/opt/MarkLogic/, be excluded from access by any background services, including AV software. As a general rule, ONLY MarkLogic Server should be maintaining MarkLogic Server data files. If those directories MUST be scanned, then MarkLogic should be shut down, or the forests fully quiesced, to prevent issues.

Further Reading

SVC-MAPINI Errors

SVC-FILEOPN Errors

Troubleshooting Windows File System Errors

System clock synchronization and XDMP-CLOCKSKEW

Summary

MarkLogic Server expects the system clocks to be synchronized across all the nodes in a cluster, as well as between Primary and Replica clusters. The acceptable level of clock skew (or drift) between hosts is less than 0.5 seconds, and values greater than 30 seconds will trigger XDMP-CLOCKSKEW errors, and could impact cluster availability.

Cluster Hosts should use NTP to maintain proper clock synchronization.

Inside MarkLogic Clock Time usage

MarkLogic hosts include a precise time of day in the XDQP heartbeat messages they send to each other. When a host processes an incoming XDQP heartbeat message, it compares the time of day in the message against its own clock. If the difference is large enough, the host will report a CLOCKSKEW in the ErrorLog.
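The comparison described above can be sketched in a few lines of Python. This is an illustration only, not MarkLogic's implementation: the function name classify_skew and the middle "elevated" category are assumptions, while the 0.5 second and 30 second thresholds are the ones quoted in the Summary.

```python
ACCEPTABLE_SKEW_SECONDS = 0.5   # acceptable skew between hosts (from the Summary)
CLOCKSKEW_SECONDS = 30.0        # above this, XDMP-CLOCKSKEW is expected


def classify_skew(local_time, heartbeat_time):
    """Compare a heartbeat timestamp with the local clock (both in seconds).

    Returns "acceptable", "elevated" (a hypothetical label for the range
    the article does not name), or "clockskew".
    """
    skew = abs(local_time - heartbeat_time)
    if skew < ACCEPTABLE_SKEW_SECONDS:
        return "acceptable"
    if skew > CLOCKSKEW_SECONDS:
        return "clockskew"
    return "elevated"


if __name__ == "__main__":
    print(classify_skew(100.0, 100.2))
    print(classify_skew(100.0, 110.0))
    print(classify_skew(100.0, 145.0))
```

Note that the measured difference includes any delay between encoding and decoding the heartbeat, which is why an overloaded host can appear skewed even when its clock is correct.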

Clock Skew

MarkLogic does not thoroughly test clusters in a clock skewed configuration, as it is not a valid configuration. As a result, we do not know all of the ways that a MarkLogic Server Cluster would fail. However, there are some areas where we have noticed issues:

Local disk failover may not perform properly as the inter-forest negotiations regarding which forest has the most up to date content may not produce the correct results.

Database replication can hang

SSL certificate verification may fail on the time range.

If MarkLogic Server detects a clock skew, it will write a message to the error log such as one of the following:

Warning: Heartbeat: XDMP-CLOCKSKEW: Detected clock skew: host hostname.domain.com skewed by NN seconds

Warning: XDQPServerConnection::init: nnn.nnn.nnn.nnn XDMP-CLOCKSKEW: Detected clock skew: host host.domain.local skewed by NN seconds

Warning: Excessive clock skew detected; suggest using NTP (NN seconds skew with hostname)

If one of these lines appears in the error log, or you see repeated XDMP-CLOCKSKEW errors over an extended time period, the clock skew between the hosts in the cluster should be verified. However, do not be alarmed if this warning appears even if there is no clock skew. This message may appear on a system under load, or at the same time as a failed host comes back online. In these cases the errors will typically clear within a short amount of time, once the load on the system is reduced.
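To judge whether such warnings are transient or persistent, it can help to extract the host and skew from each log line. The following Python sketch assumes the message formats shown above; the function name parse_skew and the sample values (45 and 31 seconds) are hypothetical, and real ErrorLog lines carry additional prefixes such as timestamps, which the search-based matching tolerates.

```python
import re

# Patterns modelled on the sample ErrorLog messages in this article.
SKEW_PATTERNS = [
    re.compile(
        r"XDMP-CLOCKSKEW: Detected clock skew: host (?P<host>\S+) "
        r"skewed by (?P<seconds>-?\d+) seconds"
    ),
    re.compile(
        r"Excessive clock skew detected; suggest using NTP "
        r"\((?P<seconds>-?\d+) seconds skew with (?P<host>[^)]+)\)"
    ),
]


def parse_skew(line):
    """Return (host, seconds) for a clock skew message, or None otherwise."""
    for pattern in SKEW_PATTERNS:
        match = pattern.search(line)
        if match:
            return match.group("host"), int(match.group("seconds"))
    return None


if __name__ == "__main__":
    sample = ("Warning: Heartbeat: XDMP-CLOCKSKEW: Detected clock skew: "
              "host hostname.domain.com skewed by 45 seconds")
    print(parse_skew(sample))
```

Grouping the parsed results by host and time makes it easier to see whether the reports clear once load drops or continue over an extended period.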

Time Sync Config

NTP is the recommended solution for maintaining system clock synchronization.

(1) NTP clients on Linux

The most common Linux NTP clients are ntpd and chrony. Either of these can be used to ensure your hosts stay synchronized to a central NTP time source. You can check the settings for NTP, and manually update the date if needed.

The instructions in the link below go over the process of checking the ntpd service and updating the date manually using the ntpdate command.

https://www.golinuxhub.com/2017/12/how-to-forcefully-sync-date-and-time.html

The following Server Fault article goes over the process of forcing chrony to manually update and step the time using the chronyc command.

https://serverfault.com/questions/930747/force-chrony-time-check

Running the applicable command on the affected servers should resolve the CLOCKSKEW errors for the short term.

If the ntpd or chrony service is not running, you can still use the ntpdate or chronyc command to update the system clock, but you will need to configure a time service to ensure accurate time is maintained and to avoid future CLOCKSKEW errors. For more information on setting up a time synchronization service, see the MarkLogic KB article "NTP Configuration with Ntpd and Chrony" listed under Further Reading.

(2) NTP clients on Windows

Windows servers can be configured to retrieve time directly from an NTP server, or from a Primary Domain Controller (PDC) in the root of an Active Directory forest that is configured as an NTP server. The following link includes information on configuring NTP on a Windows server, as well as configuring a PDC as an NTP server.

https://support.microsoft.com/en-us/help/816042/how-to-configure-an-authoritative-time-server-in-windows-server

(3) VMWare time synchronization

If your systems are VMWare virtual machines then you may need to take the additional step of disabling time synchronization of the virtual machine. By default the VMWare daemon will synchronize the Guest OS to the Host OS once per minute, which may interfere with ntpd settings. Through the vSphere Admin UI, you can disable time synchronization between the Guest OS and Host OS in the virtual machine settings.

Configuring Virtual Machine Options

This will prevent regular time synchronization, but synchronization will still occur during some VMWare operations, such as Guest OS boots/reboots and resuming a virtual machine, among others. To disable VMWare clock sync completely, you need to edit the .vmx file for the virtual machine and set several synchronization properties to false. Details can be found in the following VMWare Blog:

Completely Disable Time Synchronization for your VM

(4) AWS EC2 time synchronization

For AWS EC2 instances, if you are noticing CLOCKSKEW in a MarkLogic cluster, you may benefit from changing the clock source from the default xen to tsc.

https://aws.amazon.com/premiumsupport/knowledge-center/manage-ec2-linux-clock-source/

Other sources for Clock Skew

(1) Overloaded Host leading to Clock Skew

If for some reason there is a long delay between when an XDQP heartbeat message is encoded on the sending host and when it is decoded on the receiving host, it will be interpreted as a CLOCKSKEW. Below are some of the situations that can lead to CLOCKSKEW.

If a sending host is overloaded enough that heartbeat messages are taking a long time to be sent, it could be reported as a transient CLOCKSKEW by the receiver.

If a receiving host is overloaded enough that a long time elapsed between sending time and processing time, it can be reported as a transient CLOCKSKEW.

If you see a CLOCKSKEW message in the ErrorLog combined with other messages (Hung messages, Slow Warning), then the server is likely overloaded and thrashing. Messages reporting broken XDQP connections (Stopping XDQPServerConnection) are a good indication that a host was overloaded and hung for long enough that other hosts disconnected.

(2) XDQP Thread start fail leading to Clock Skew

When MarkLogic starts up, it tries to raise the limit on the number of processes per user on the system to at least 16384. If MarkLogic is not started as root, it will only be able to raise the soft limit (for the number of processes per user) up to the hard limit, which could cause XDQP thread startup to fail. You can get the current setting with the shell command ulimit -u; make sure the number of processes per user is at least 16384.
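The soft/hard limit relationship can be checked with a short Python sketch using the standard resource module (Unix only). It is a hedged illustration, not a MarkLogic tool: the function names and the "ok" / "raisable" / "blocked" labels are assumptions, and the 16384 figure is the one quoted above.

```python
import resource

REQUIRED_NPROC = 16384               # minimum quoted in this article
UNLIMITED = resource.RLIM_INFINITY   # value Python reports for "unlimited"


def limit_at_least(limit, required):
    """True if a limit is unlimited or at least `required`."""
    return limit == UNLIMITED or limit >= required


def describe_nproc(soft, hard, required=REQUIRED_NPROC):
    """Classify the per-user process limits for a non-root process.

    "ok":       the soft limit already meets the requirement.
    "raisable": a non-root process can raise the soft limit far enough.
    "blocked":  the hard limit is too low; only root can raise it.
    """
    if limit_at_least(soft, required):
        return "ok"
    if limit_at_least(hard, required):
        return "raisable"
    return "blocked"


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print(soft, hard, describe_nproc(soft, hard))
```

Run it as the user that MarkLogic runs as; a "blocked" result corresponds to the case where the XDQP thread start could fail.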

Further Reading

VMWare KB: Disabling Time Synchronization

MarkLogic KB: NTP Configuration with Ntpd and Chrony

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ntp_using_ntpd

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sect-using_chrony

MarkLogic Knowledgebase: XML Data Query Protocol (XDQP)

Telemetry

Introduction

This article gives a brief summary of the MarkLogic Telemetry feature available in MarkLogic Server version 9.

What is Telemetry used for?

Telemetry is a communication channel between a customer's MarkLogic Server instances and the MarkLogic Support team. When enabled, historical configuration, system log, and performance data is efficiently collected for immediate access by the Support team, who can begin working on your support incident. Having immediate access to this critical system data will often lead to quicker diagnostics and resolution of your logged support incidents.

When Telemetry is enabled, MarkLogic Server collects data locally and periodically uploads it encrypted and anonymised to a secure cloud storage. Data collected locally follows MarkLogic encryption settings and can be reviewed at any time. Telemetry has very low impact on the server performance as it does not require any communication between nodes and it does not depend on any database or index settings. Telemetry does require some local disk space and an SSL connection (Port 443) to access *telemetry.services.marklogic.com.

What is Captured and What is Not

Telemetry data is only collected from:

System Error Logs

Metering Data

Configuration Data

Telemetry neither collects nor sends application specific logs or customer data.

How to enable

Telemetry can be enabled at any time after MarkLogic 9 is installed through the Admin UI, Admin API, or REST interfaces. It is recommended that you enable Telemetry in order to have data uploaded and available before an incident is reported to MarkLogic Technical Support. The following script is an example of how to enable Telemetry from Query Console with recommended settings for all nodes in a cluster:

Telemetry will be enabled during run time (doesn't require a restart) and starts uploading as soon as some data is collected and a connection to *telemetry.services.marklogic.com is established. All configuration settings can be changed at any time and are not dependent on other log level settings. Currently the following data types are configurable:

Configuration files reflect MarkLogic cluster settings over time

ErrorLog will only contain system-related information; application-level logging, which may contain Personally Identifiable Information, is not included in the system ErrorLog files captured by the Telemetry feature.

Metering (performance) data holds information about cluster, host, and forest status and application feature metrics

In addition, Telemetry supports uploading a Support Request on demand to the secure cloud storage. Uploading a Support Request is independent of all configured Telemetry settings as long as a connection to *.telemetry.services.marklogic.com over SSL can be established.

Who has access

Telemetry data is stored in secured cloud storage using the Cluster-ID as identifier. A Cluster-ID is a number randomly generated during a MarkLogic installation. Access to the data is restricted and requires an open Support Ticket with a provided Cluster-ID. Data will be accessed and downloaded only by the Support Team for the period of time a Support Ticket is open. As soon as the ticket is closed, all downloaded data will be destroyed. Data uploaded to the cloud storage will be held for a few months before it is deleted.

Further reading

More details can be found in the Telemetry (Monitoring MarkLogic Guide) in our documentation.



Temporal documents with other MarkLogic features

Introduction

Interoperation of Temporal support with other MarkLogic features.

Features that support Temporal collections

MarkLogic’s Temporal feature is built into the server and is supported by many of MarkLogic’s power features: Search API, Semantics, Tiered Storage, and Flexible Replication. Temporal queries can be written in either JSON or XQuery.

Collections

How are collections used to implement Temporal documents?

Temporality is defined on a protected collection, known as a temporal collection. When a document is inserted into a temporal collection, a URI collection is created for that document. Additionally, the latest version of each document will reside in a latest collection.

Why are collections used to group all revisions of a particular document vs storing it in the properties?

This was done to avoid unnecessary fragmentation, enhance performance, and make best use of existing infrastructure.

Does the Temporal implementation use the collection lexicon or just collections?

It uses only collections. The collection lexicon can be turned on and utilized for applications.

Won’t Temporal collections also be in the collection lexicon if the lexicon is enabled?

Yes.

See also: Temporal, URI, and Latest Collections.

Timezones

The Temporal axes are based on standard MarkLogic dateTime range indexes.

All timezone information is handled in the standard way, as for any other dateTime range index in MarkLogic.

DLS (Library Services API)

Temporal and DLS are aimed at solving different sorts of problems, so do not replace each other. They will coexist.

Tiered Storage

Temporal documents can be leveraged with our Tiered Storage capabilities.

The typical use case is where companies need to store years of historical information for various purposes, such as:

Compliance. Either internal or external auditing can occur (up to seven years based on Dodd-Frank Legislation). This data can be deployed on commodity hardware at lower cost, and can be remounted when needed.

Analytics. Many years of historical information can be cheaply stored on commodity hardware to allow data scientists to perform analysis for future patterns and backtesting against previous assumptions.

JSON/JavaScript

Temporal documents work with XML/XQuery as well as JSON/JavaScript.

Java/search/REST/Node API

Temporal is supported by all of our existing server-side APIs.

MLCP

You can specify a Temporal collection with the -temporal_collection option in MLCP.

Normal document management APIs (xdmp:*)

By default, updating temporal documents with the normal document management APIs is not allowed and an error will be returned. Normally the temporal:* API should be used. For more information, see also Managing and Updating Temporal Documents.

Triples

MarkLogic supports non-managed triples in a Temporal document.

Temporal documents—finding all versions

Introduction

How do you find all versions of a temporal document?

Details

In MarkLogic Server, a temporal document is managed as a series of versioned documents in a protected temporal collection. In addition, each temporal document added creates another collection based on its URI, and all versions of the document will be in that collection.

For example, if you have stored a temporal document at URI /orders/koolorder.xml then you can find all the versions of that document by using a collection query as

cts:search (/, cts:collection-query ('/orders/koolorder.xml'))

and the URIs of all the versions of the document as

cts:uris ((), (), cts:collection-query ('/orders/koolorder.xml'))

Temporal queries—Allen and ISO operators

Introduction

Allen and ISO operators are comparison operators that can be used in temporal queries.

Details

Both operator sets are used to represent relations between two intervals. ISO operators are more general and usually can be represented by a combination of Allen operators. For example: iso_succeeds = aln_met_by || aln_after.
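The relationship iso_succeeds = aln_met_by || aln_after can be made concrete with a small Python sketch. This is not MarkLogic code: intervals are modelled as (start, end) tuples, and the definitions of the two Allen relations follow the usual interval-algebra meaning ("met by" means one interval starts exactly where the other ends; "after" means it starts strictly later), which is an assumption about how the MarkLogic operators behave.

```python
def aln_met_by(a, b):
    """Allen 'met by': interval a starts exactly where interval b ends."""
    return a[0] == b[1]


def aln_after(a, b):
    """Allen 'after': interval a starts strictly after interval b ends."""
    return a[0] > b[1]


def iso_succeeds(a, b):
    """ISO 'succeeds', written as the combination given in the text."""
    return aln_met_by(a, b) or aln_after(a, b)


if __name__ == "__main__":
    earlier = (1, 5)
    print(iso_succeeds((5, 8), earlier))  # starts exactly at the end of `earlier`
    print(iso_succeeds((6, 8), earlier))  # starts after a gap
    print(iso_succeeds((4, 8), earlier))  # overlaps `earlier`
```

Under these assumptions the combined operator reduces to a single comparison: a succeeds b whenever a starts at or after the end of b.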

Period Comparison Operators are discussed in more detail in Searching Temporal Documents.

Timezones and Indexes

Timezone information and MarkLogic

Summary

This article discusses the effect of the implicit timezone on date/time values as indexed and retrieved.

Discussion

Timezone information and indexes

Values are stored in the index effectively in UTC, without any timezone information. When indexed, the value is adjusted to UTC from either the explicit timezone of the data or implicitly from the host timezone, and then the timezone is forgotten. The index data does not save information regarding the source timezone.

When queried, values from the index are adjusted to the timezone specified in the query, or to the host's implicit timezone if none is specified.

Therefore, dates and times in the implicit timezone do what would be expected in calculations, unless you have a particular reason for actually knowing the offset from UTC.

Implicit timezone

The definition of an implicit timezone is given at https://www.w3.org/TR/xpath20/#dt-timezone.

The MarkLogic host implicit timezone comes into play when the document is indexed and when values are returned from the indexes.

fn:implicit-timezone() can be used to show the implicit timezone for a host.

Changing implicit timezone

If you change the implicit timezone without reindexing, the implicit timezone at query time will differ from the implicit timezone that was in effect at indexing time. Values that were indexed using the old implicit timezone will then appear "wrong", because they were adjusted to UTC with a different offset from the one now applied when they are returned.

If you specify a timezone for the data when it is indexed and when it is queried, the implicit timezone will not be a factor.

Examples

First we create a dateTime element range index on element <dt>, then insert a document without timezone information:

xdmp:document-insert ('/test.xml', <doc><dt>2018-01-01T12:00:00</dt></doc>)

Using a server located in New York (timezone now -05:00), retrieving the value from the index via

cts:element-values (xs:QName ('dt'), ())

gives

2018-01-01T12:00:00

showing that the implicit timezone works as described above. To see the value stored in the index (as adjusted to UTC) you can specify the timezone on the value retrieved:

cts:element-values (xs:QName ('dt'), (), 'timezone=+00:00')

returns

2018-01-01T17:00:00Z

so 2018-01-01T17:00:00 is the value coming from the index.

When the implicit timezone is -5 hours then the call without a timezone returns 12:00. However, if the implicit timezone changed, then the value returned for the query without a timezone would also change, even though the value stored in the index has not changed.
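The arithmetic in this example can be reproduced outside MarkLogic with Python's datetime module. The sketch below is only an analogy for the index behaviour described above: the function names are hypothetical, and the -8 hour offset is an illustrative choice for a changed implicit timezone.

```python
from datetime import datetime, timedelta, timezone


def to_index_value(local_text, implicit_offset_hours):
    """Adjust a timezone-less dateTime to UTC using the implicit timezone."""
    implicit = timezone(timedelta(hours=implicit_offset_hours))
    local = datetime.fromisoformat(local_text).replace(tzinfo=implicit)
    return local.astimezone(timezone.utc)


def from_index_value(utc_value, offset_hours):
    """Adjust a stored UTC value to the timezone in effect at query time."""
    return utc_value.astimezone(timezone(timedelta(hours=offset_hours)))


if __name__ == "__main__":
    stored = to_index_value("2018-01-01T12:00:00", -5)
    print(stored.isoformat())                        # value held in UTC
    print(from_index_value(stored, -5).isoformat())  # same implicit timezone
    print(from_index_value(stored, -8).isoformat())  # implicit timezone changed
```

The stored value never changes; only the offset applied on the way out does, which is why a changed implicit timezone shifts the returned time.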

Tips and Hints for Debugging Module Resolution in MarkLogic

Introduction

XQuery modules can be imported from other XQuery modules in MarkLogic Server. This article describes how modules are resolved in MarkLogic when they are imported in XQuery.

Details

How modules are imported in code

Modules can be imported using two approaches:

By providing a relative path:

import module namespace m = "http://example.edu/example" at "example.xqy";

Or by providing an absolute path:

import module namespace m = "http://example.edu/example" at "/example.xqy";

How MarkLogic resolves the path and loads the module

If the path starts with a slash, it is a non-relative path and MarkLogic takes it as is. If it does not, it is a relative path, and it is first resolved relative to the URI of the current module to obtain a non-relative path.

Path in hand, MarkLogic always starts by looking in the Modules directory. This is done for security reasons, as we want to make sure that the MarkLogic-created modules are the ones chosen. In general, users should NOT be putting their modules there. It creates issues on upgrade, and if they open up permissions on the directory to ease deployment it creates a security hole.

Then, depending on whether the appserver is configured to use a modules database or the filesystem, we interpret the non-relative path in terms of the appserver root either on the file system or in the Modules database.
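The resolution rules above can be sketched in Python. This is a simplified model of the described behaviour, not MarkLogic's implementation: the function names, the example URIs, and the textual labels for each lookup location are all hypothetical.

```python
import posixpath


def resolve_import(at_path, importing_module_uri):
    """Turn the 'at' hint of an import into a non-relative path.

    A leading slash means the path is taken as is; otherwise it is
    resolved relative to the URI of the importing module.
    """
    if at_path.startswith("/"):
        return at_path
    base_dir = posixpath.dirname(importing_module_uri)
    return posixpath.join(base_dir, at_path)


def lookup_order(path, uses_modules_database):
    """List where the resolved path is looked up, in order."""
    root_kind = "modules database" if uses_modules_database else "filesystem"
    return [
        ("MarkLogic Modules directory", path),
        (f"app server root ({root_kind})", path),
    ]


if __name__ == "__main__":
    resolved = resolve_import("example.xqy", "/app/lib/main.xqy")
    for location, candidate in lookup_order(resolved, uses_modules_database=True):
        print(location, candidate)
```

Working through an import by hand in this way is often enough to spot a path that resolves somewhere other than where the module was deployed.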

Debugging module path issue

To debug this, you can also enable the module caching trace events, which show how the paths are resolved. Enter "module" as the name of the event under Diagnostics > Events and you should have a list of module caching events added. These will give you the working details of how module resolution is happening, and should provide enough information to resolve the issue.

Be aware that diagnostic traces can fill up your ErrorLog.txt file very fast, so be sure to turn them off as soon as you no longer need them.

Performance Hints

1. Be sure that your code does not rely on dynamically-created modules. Although these may be convenient at times, they will make overall performance suffer. This is because every time a module changes, the internal modules cache is invalidated and must be re-loaded from scratch -- which will tend to hurt performance.

2. If you are noticing a lot of XDMP-DEADLOCK messages in your log, be sure your modules are not mixing any update statements within what should be a read-only query. The XQuery parser looks for updates anywhere in the modules stack, including imports, and if it finds one, it assumes that any URI gathered by the queries might potentially be updated. Thus, if the query matches 10 URIs, it will put a write lock on them, and if it matches 100000 URIs, it will lock all of them as well, and performance will suffer. To prevent this, be sure to isolate updates in their own transactions via xdmp:eval() or xdmp:spawn().

Transferring data between MarkLogic Server clusters

Summary

There are a number of options for transferring data between MarkLogic Server clusters. The best option for your particular circumstances will depend on your use case.

Details

Database Backup and Restore

To transfer the data between two independent clusters, you may use a database backup and restore procedure, taking advantage of MarkLogic Server's facility to make a consistent backup of a database.

Note: the backup directory path that you use must exist on all hosts that serve any forests in the database. The directory you specify can be an operating system mounted directory path, it can be an HDFS path, or it can be an S3 path. Further information on using HDFS and S3 storage with MarkLogic is available in our documentation:

Disk Storage Considerations

Further information regarding backup and restore may be found in our documentation and Knowledgebase:

Backing Up and Restoring a Database

MarkLogic Database Restore Across Clusters

Restoring Backups Across Feature Releases of MarkLogic

Database Replication

Database Replication is another method you might choose to use to transfer content between environments. Database Replication will allow you to maintain copies of forests on databases in multiple MarkLogic Server clusters. Once the replica database in the replica cluster is fully synchronized with its master, you may break replication between the two and then go on to use the replica cluster/database as the master.

Note: to enable Database Replication, a license key that includes Database Replication is required. You also need to ensure that all hosts run the same maintenance release of MarkLogic Server and use the same type of operating system, and that Database Replication is correctly configured.

Also note that before MarkLogic Server version 9.0-7, indexing information was not replicated over the network between the Master and Replica databases; it was instead regenerated by the Replica database.

Starting with MarkLogic Server version 9.0-7, index data is also replicated from the Master to the Replica, but the server does not automatically check whether both sides have the same index settings. The following Knowledgebase article contains further information on this:

Database Replication Indexing on Replica Explained

Further details on Database Replication and how it can be configured, may be found in our documentation:

Database Replication Guide

MarkLogic Content Pump (mlcp)

Depending on your specific requirements, you may also like to make use of the MarkLogic Content Pump (mlcp), which is a command line tool for getting data out of and into a MarkLogic Server database. Using mlcp, you can export documents and metadata from a database, import documents and metadata to a database, or copy documents and metadata from one database to another.

If required, you may use mlcp to extract a consistent database snapshot, forcing all documents to be read from the database at a consistent point in time:

Extracting a Consistent Database Snapshot

Note: the version of mlcp you use should be the same as the most recent version of MarkLogic Server that will be used in the transfer.

Also note that mlcp should not be run on a host that is currently running MarkLogic Server, as the Server assumes it has the entire machine available to it, including the CPU and disk I/O capacity.

Further information regarding mlcp is available in our documentation:

MarkLogic Content Pump (mlcp)

mlcp User Guide

Further Information

Related Knowledgebase articles that you may also find useful:

Cloning a MarkLogic instance or cluster

Loading Data Into MarkLogic Server

Transporting Configuration to a New Cluster

Problem Statement

You have an application running on a particular cluster (the source cluster), devcluster, and you wish to port that application to a new cluster (the target cluster), testcluster. Porting the application can be divided into two tasks: configuring the target cluster and copying the code and data. This article is only about porting the configuration.

In an ideal world, the application is managed in an "infrastructure as code" manner: all of the configuration information about that cluster is codified in scripts and payloads stored in version control and able to be "replayed" at will. (One way to assure that this is the case is to configure testing for the application in a CI environment that begins by using the deployment scripts to configure the cluster.)

But in the real world, it's all too common for some amount of "tinkering" to have been performed in the Admin UI or via ad hoc calls to the REST Management API (RMA). And even if that hasn't happened, it's not generally possible to be certain that's the case, so you still have to worry that it might have happened.

Migrating the application

The central theme in doing this "by hand" is that RMA payloads are re-playable. That is, the payload you GET for the properties of a resource is the same as the payload that you PUT to update the properties of that resource.

If you were going to migrate an application by hand, you'd proceed along these lines.

Determine what needs to be migrated

An application consists (more or less by definition) of one or more application servers. Application servers have databases associated with them (those databases may have additional database associations). Databases have forests.

A sufficiently complex application might have application servers divided into different groups of hosts.

Applications may also have users (for example, each application server has a default user; often, but not always, "nobody").

Users, in turn, have roles, and roles may have roles and privileges. Code may have amps that use privileges.

That covers most of the bases, but beware that apps can have additional configuration that should be reviewed: security artifacts (certificates, external securities, protected paths or collections, etc.), mime types, etc.

Get Source Configuration

Using RMA, you can get the properties of all of these resources:

Application servers

Hypothetically, the App-Services application server.

curl --anyauth -u admin:admin "http://localhost:8002/manage/v2/servers/App-Services/properties?group-id=Default"

Groups

Hypothetically, the Default group.

curl --anyauth -u admin:admin http://localhost:8002/manage/v2/groups/Default/properties

Databases

Hypothetically, the Documents database.

curl --anyauth -u admin:admin http://localhost:8002/manage/v2/databases/Documents/properties

Users

Hypothetically, the ndw user.

curl --anyauth -u admin:admin http://localhost:8002/manage/v2/users/ndw/properties

Roles

Hypothetically, the app-admin role.

curl --anyauth -u admin:admin http://localhost:8002/manage/v2/roles/app-admin/properties

Privileges

Hypothetically, the app-writer execute privilege.

curl --anyauth -u admin:admin "http://localhost:8002/manage/v2/privileges/app-writer/properties?kind=execute"

And the create-document URI privilege.

curl --anyauth -u admin:admin "http://localhost:8002/manage/v2/privileges/create-document/properties?kind=uri"

Amps

Hypothetically, my-amped-function in /foo.xqy in the Modules database using the namespace http://example.com/.

curl --anyauth -u admin:admin "http://localhost:8002/manage/v2/amps/my-amped-function/properties?modules-database=Modules&document-uri=/foo.xqy&namespace=http://example.com"

Create Target Configuration

Some of the properties of a MarkLogic resource may be references to other resources. For example, an application server refers to databases and a role can refer to a privilege. Consequently, if you just attempt to POST all of the property payloads, you may not succeed. The references can, in fact, be circular so that no sequence will succeed.

The easiest way to get around this problem is to simply create all of the resources using minimal configurations: Create the forests (make sure you put them on the right hosts and configure them appropriately). Create the databases, application servers, roles, and privileges. Create the amps. If you need to create other resources (security artifacts, mime types, etc.) create those.

Finally, PUT the property payloads you collected from the source cluster onto the target cluster. This will update the properties of each application server, database, etc. to be the same as the source cluster.
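The two-phase approach described above (create every resource minimally, then PUT the collected properties) can be sketched as a simple plan generator. This is only an illustration: the host and port come from the curl examples, the resource list is hypothetical, and the creation payloads (and any parameters a creation request needs, such as a group for an application server) are not shown.

```python
from urllib.parse import urlencode

# Manage app server from the curl examples above.
BASE = "http://localhost:8002/manage/v2"


def properties_url(kind, name, params=None):
    """Build the /properties URL for one resource, e.g. kind='databases'."""
    url = f"{BASE}/{kind}/{name}/properties"
    return f"{url}?{urlencode(params)}" if params else url


def migration_plan(resources):
    """Return (method, url) steps: all creations first, then all property updates.

    Creating everything before updating anything sidesteps circular
    references between resources.
    """
    creates = [("POST", f"{BASE}/{kind}") for kind, _name, _params in resources]
    updates = [("PUT", properties_url(kind, name, params))
               for kind, name, params in resources]
    return creates + updates


# Hypothetical resource list: (kind, name, query parameters).
resources = [
    ("databases", "Documents", None),
    ("servers", "App-Services", {"group-id": "Default"}),
    ("roles", "app-admin", None),
]

plan = migration_plan(resources)
for method, url in plan:
    print(method, url)
```

Each PUT would carry the payload previously saved from the matching GET on the source cluster.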

Related Reading

MarkLogic Documentation - Scripting Cluster Management

MarkLogic Knowledgebase - Transferring data between MarkLogic Server clusters

MarkLogic Knowledgebase - Best Practices for exporting and importing data in bulk

MarkLogic Knowledgebase - Deployment and Continuous Integration Tools

Unable to Merge All Deleted Fragments on forest with 32GB max mer...

Summary

Sometimes, following a manual merge, a number of deleted fragments (usually a small number) are left behind after the merge completes. In a system that is undergoing steady updates, one will observe that the number of deleted fragments will go up and down, but never go down to zero.

Options

There are a couple of approaches to resolve this issue:

1. If you have access to the Query Console, you should run xdmp:merge() with an explicit timestamp (e.g. the return value of xdmp:request-timestamp()). This will cause the server to discard all deleted fragments.

2. If you do not have access to the Query Console, just wait an hour and do the merge again from the Admin GUI.

Explanation

The hour window was added to avoid XDMP-OLDSTAMP errors that had cropped up in some of our internal stress testing, most commonly for replica databases, but also causing transaction retries for non-replica databases.

We've done some tuning of the change since then (e.g. not holding on to the last hour of deleted fragments after a reindex), and we may do some further tuning so this is less surprising to people.

Note

The explanation above is for new MarkLogic 7 installations. After an upgrade from a release prior to MarkLogic 7, this solution might not work, as a different approach is required to split single big stands into 32GB stands. Please read more in the following Knowledgebase article: Migrating to MarkLogic 7 and understanding the 1.5x disk rule (rather than 3x).

Unclosed/Obsolete Stands

Summary

Obsolete stands (also referred to as "unclosed-stands") occur during the normal operation of MarkLogic Server. Stands in a forest are marked as obsolete so that MarkLogic Server can recover the forest from an unexpected outage. There are many reasons a stand can be marked as obsolete:

It is the output stand of a merge that has not yet completed. Once the merge finishes, the stand is unmarked and is available to the forest;

It is an in-memory stand that has not yet been completely saved on disk. Once the in-memory stand is completely saved to disk, the stand is unmarked and is available to the forest;

It has been replaced by a merge but is still in use by a query. Once the last query that is using the stand completes, the stand is deleted;

It has been replaced by a merge but is still in use by a forest or database backup. Once all backups that are using the stand complete, the stand is deleted.

NOTE: An obsolete stand will not be deleted as long as there is any query or transaction running that has a handle to a node within that stand. This means that we should avoid server fields that store any kind of database node in the value parameter.

If you must store document nodes in a server field, then the server-field backup documents should either reside in a separate database, or you should use xdmp:quote() to store a string representation of the node into the server-field, and then use xdmp:unquote() to turn the value back into a node when you are ready to use it. Be sure to add some error handling around the xdmp:unquote() in case the base-uri for that node no longer exists.

Forest Startup

When a forest is enabled, any stand in the forest that is marked as obsolete will be deleted.

Obsolete stands should only exist at startup in the situation where the stand is disabled unexpectedly. This may occur:

If MarkLogic Server was stopped unexpectedly;

If the stand resides on a network attached device and the device became unreachable;

If there were in-flight queries or backups when MarkLogic Server was stopped.

A forest can be started by:

(Re)starting MarkLogic Server

(Re)starting (Disabling and Enabling) the Forest

Detecting Obsolete Stands

The files that make up a stand reside under a directory on the file system. The directory for a stand that has been marked as obsolete contains a file with the name Obsolete.
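As a minimal sketch of the file-system check described above, the following Python lists the stand directories in a forest directory that contain a file named Obsolete. The directory layout used in the demonstration is a temporary, made-up forest; on a real host you would pass the forest's actual data directory.

```python
import os
import tempfile


def obsolete_stands(forest_dir):
    """Return the sorted names of stand directories that contain an 'Obsolete' file."""
    found = []
    for entry in os.listdir(forest_dir):
        stand_path = os.path.join(forest_dir, entry)
        marker = os.path.join(stand_path, "Obsolete")
        if os.path.isdir(stand_path) and os.path.isfile(marker):
            found.append(entry)
    return sorted(found)


# Demonstration against a throwaway directory with hypothetical stand names.
with tempfile.TemporaryDirectory() as forest_dir:
    for stand, is_obsolete in (("000127ee", True), ("000127ef", False)):
        os.mkdir(os.path.join(forest_dir, stand))
        if is_obsolete:
            open(os.path.join(forest_dir, stand, "Obsolete"), "w").close()
    demo_result = obsolete_stands(forest_dir)

print(demo_result)  # ['000127ee']
```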

Additionally, a forest's status will indicate when an “unclosed-stand” exists for that forest. xdmp:forest-status() will include an 'unclosed-stand' element that gives an indication of the query reference timelines. The results will include an element that looks something like:

<unclosed-stand>
   <stand-id>539300098042005337</stand-id>
   <path>/data/Forests/forest-name/000127ee</path>
   <disk-size>51478</disk-size>
   <memory-size>2786</memory-size>
   <reference>
      <count>1</count>
      <earliest>2017-02-24T02:17:25-05:00</earliest>
      <latest>2017-02-26T23:17:30-05:00</latest>
   </reference>
</unclosed-stand>
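If you capture that element as text, a short script can pull out the reference count and the time span between the earliest and latest references. The sketch below parses the sample exactly as shown above; real xdmp:forest-status() output is namespaced XML, so the element lookups would need the appropriate namespace there. The function name and the "reference_span" key are illustrative choices.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

SAMPLE = """<unclosed-stand>
   <stand-id>539300098042005337</stand-id>
   <path>/data/Forests/forest-name/000127ee</path>
   <disk-size>51478</disk-size>
   <memory-size>2786</memory-size>
   <reference>
      <count>1</count>
      <earliest>2017-02-24T02:17:25-05:00</earliest>
      <latest>2017-02-26T23:17:30-05:00</latest>
   </reference>
</unclosed-stand>"""


def summarize_unclosed_stand(xml_text):
    """Extract the stand path, reference count, and earliest-to-latest span."""
    stand = ET.fromstring(xml_text)
    ref = stand.find("reference")
    earliest = datetime.fromisoformat(ref.findtext("earliest"))
    latest = datetime.fromisoformat(ref.findtext("latest"))
    return {
        "path": stand.findtext("path"),
        "references": int(ref.findtext("count")),
        "reference_span": latest - earliest,
    }


summary = summarize_unclosed_stand(SAMPLE)
print(summary["path"], summary["references"], summary["reference_span"])
```

For the sample above, the span between the earliest and latest reference is 2 days, 21 hours and 5 seconds.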

Understanding differences in Monitoring Dashboard Disk Space usag...

Summary

The MarkLogic Server Monitoring Dashboard provides a way to monitor disk usage, which is a key monitoring metric. Comparing the disk usage shown on the Monitoring Dashboard with the disk space reported by the filesystem (for example, using df -h) reveals a difference between the two. This article describes these differences and the reasons behind them.

Details

To understand how to use Monitoring dashboard Disk Usage, see our documentation at https://docs.marklogic.com/guide/monitoring/dashboard#id_60621

If you add up all the disk usage metrics (Fast Data, Large Data, Forest Data, Forest Reserve, Free) and compare the total with the space on your disk (using df -h or other commands), you will see a difference between the two values.

This difference exists mainly for two reasons:
1. The Monitoring History dashboard displays disk space usage (in MB and GB) excluding forest journal sizes.
2. On Linux, by default around 5% of the filesystem is reserved, both to prevent serious problems when the filesystem fills up and for the filesystem's own purposes, for example keeping backups of its internal data structures.

An example

Consider the example below of a host running RHEL 7 with 100GB of disk space on the filesystem, one database, and one forest.

Disk usage as shown by Monitoring dashboard:
Free              92.46 GB    98.17%
Forest Reserve     1.14 GB     1.21%
Forest Data        0.57 GB     0.60%
Large Data         0.02 GB     0.02%

The total from the Monitoring dashboard is around 94.19 GB. When we add the size of the journals (around 1GB in this case) and the OS reserve space (5%), the total comes out at approximately 100GB, which is the total capacity of the disk in this example.
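The reconciliation above is simple arithmetic; the following sketch reproduces it with the figures from this example. The variable names are illustrative, and the 1 GB journal size and 5% reserve are the approximate values quoted in the text.

```python
# Figures from the Monitoring dashboard example above, in GB.
dashboard_gb = {
    "Free": 92.46,
    "Forest Reserve": 1.14,
    "Forest Data": 0.57,
    "Large Data": 0.02,
}

disk_capacity_gb = 100.0                    # total filesystem size in this example
journals_gb = 1.0                           # approximate journal size (not shown by the dashboard)
os_reserve_gb = 0.05 * disk_capacity_gb     # default Linux reserve of around 5%

dashboard_total = sum(dashboard_gb.values())
reconciled_total = dashboard_total + journals_gb + os_reserve_gb

print(f"Dashboard total:  {dashboard_total:.2f} GB")   # 94.19 GB
print(f"Reconciled total: {reconciled_total:.2f} GB")  # 100.19 GB, i.e. roughly the 100 GB disk
```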

On the other hand, consider the disk usage as shown by the df -h command for the filesystem:

Filesystem              Size  Used  Avail  Use%  Mounted on
/dev/mapper/Data1-Vol1  99G   2.1G  92G    3%    /myspace

Adding the 5% default OS reserve for Linux gives a total size for this filesystem of more than 99GB, i.e. approximately 100GB.

Items of Note

The Dashboard:Disk Space uses KB/MB/GB, which means 1 KB = 1000 B, not KiB/MiB/GiB where 1 KiB = 1024 B.

The actual disk usage for forests (including journal sizes) can be confirmed by checking the output of the command below from the file system:
du -h --si /MarkLogic_Data/Forests/*

The -h flag is for human readable format.

The --si flag is for using KB/MB/GB instead of the default KiB/MiB/GiB. With GNU du, the later of -h and --si takes effect, so --si is given last.
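To see how much the choice of unit matters, this small sketch converts the same byte count into decimal gigabytes (1 GB = 1000^3 bytes, as used by the dashboard and --si) and binary gibibytes (1 GiB = 1024^3 bytes). The helper names are illustrative.

```python
def to_gb(n_bytes):
    """Decimal gigabytes: 1 GB = 1000**3 bytes."""
    return n_bytes / 1000**3


def to_gib(n_bytes):
    """Binary gibibytes: 1 GiB = 1024**3 bytes."""
    return n_bytes / 1024**3


size_bytes = 100 * 1000**3  # a disk marketed as 100 GB
print(f"{to_gb(size_bytes):.2f} GB = {to_gib(size_bytes):.2f} GiB")  # 100.00 GB = 93.13 GiB
```

A difference of nearly 7% between the two readings comes purely from the unit convention, so compare like with like before looking for other causes.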

Conclusion

The metrics on the Monitoring dashboard differ from the filesystem's disk usage because the monitoring history does not include journal sizes or the OS reserve space in its report.

Useful Links:

https://docs.marklogic.com/guide/monitoring/dashboard#id_60621

http://serverfault.com/questions/315181/df-says-disk-is-full-but-it-is-not

http://www.walkernews.net/2011/01/22/why-the-linux-df-command-shows-lesser-free-disk-space/

Understanding Forest State Transitions While Putting Forest in Flash-backup mode

When we transition a forest into flash-backup mode, the forest is unmounted and then remounted in read-only mode so no updates can be made. During that process, the forest goes into "start closing" state for a short while (less than a second). During this time, new queries/updates are rejected with a retry exception and running queries are allowed to continue running.

After "start closing", the process enters a "finish closing" state. At this point, all currently running queries will throw a retry exception. Transactions that are in-flight when a forest enters flash backup mode will be retried until they either succeed (when the forest remounts as read-only in the case of read transactions, or when it remounts with read/write in the case of update transactions), or they hit the timeout limit.

New transactions are continually retried until they either succeed (when the forest comes back up read-only in the case of read transactions, or when it comes back up read/write in the case of update transactions), or timeout.

During flash-backup, if nested transactions are taking place, MarkLogic will attempt to retry only the transaction that receives the exception, because it's possible that the exception applies only to that transaction. In the case where a forest is closing, MarkLogic will throw one of the following exceptions to indicate the state of the forest at the time:

XDMP-FORESTNOT

XDMP-FORESTMNT

XDMP-UPDATESNOTALLOWED

In the case of nested transactions, and for the above three exceptions, the transaction will not process the exception but instead will pass it up the stack. The net result is that the three exceptions will cause the outer transaction to retry rather than the inner transaction, releasing its hold on the forest and allowing it to close. The product has been designed to work in this way to prevent the flash-backup process from being held up by any nested transactions that could be in-flight at the time.
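The rule above can be summarized as a toy decision function. This is only a model of the behavior described in this article, not MarkLogic code: the function name, the depth convention (0 is the outermost transaction) and the non-forest error code used in the demonstration are all illustrative.

```python
# Exceptions that a nested transaction passes up the stack instead of handling.
PASS_UP_CODES = {"XDMP-FORESTNOT", "XDMP-FORESTMNT", "XDMP-UPDATESNOTALLOWED"}


def transaction_to_retry(error_code, depth):
    """Return the nesting depth of the transaction that retries.

    depth 0 is the outermost transaction; larger values are nested transactions.
    """
    if depth > 0 and error_code in PASS_UP_CODES:
        return 0      # passed up the stack: the outer transaction retries
    return depth      # otherwise only the transaction that received it retries


print(transaction_to_retry("XDMP-FORESTMNT", 2))   # 0
print(transaction_to_retry("SOME-OTHER-RETRY", 2)) # 2 (hypothetical error code)
```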

If you can recreate the same condition, enabling the following diagnostic trace events should provide a wealth of useful information for deeper analysis of the underlying issue:

Forest State

Forest Label

Forest Constructor

Forest Destructor

Forest Startup

Forest Shutdown

Forest Mount

Forest Unmount

Forest Open

Forest Close

If you are unfamiliar with diagnostic trace events, more information is available in this Knowledgebase article

Understanding Search: value queries

Value queries

Summary

Here we summarize some characteristics of value queries and compare to other approaches.

Discussion

Characteristics

Punctuation and space tokens are not indexed as words in the universal index. Therefore, word-queries involving whitespace or punctuation will not make use of whitespace or punctuation in index resolution, regardless of space or punctuation sensitivity.

For value queries, punctuation and space tokens are generally not indexed as words in the universal index either. However, as a special exception there are terms in the universal index for "exact" value queries ("exact" is shorthand for "case-sensitive", "diacritic-sensitive", "punctuation-sensitive", "whitespace-sensitive", "unstemmed", and "unwildcarded"). "exact" value queries should be resolvable properly from the index, but only if you have fast-case-sensitive-searches and fast-diacritic-sensitive-searches enabled in the database.

For field-word or field-value queries you can modify what counts as punctuation or whitespace via tokenizer overrides. This can turn what would have been a phrase into a single word.

Outside of the special case given for exact value queries, all queries involving space or punctuation are phrase queries. Word and value search is not string matching.

Space insensitive and punctuation insensitive do not mean tokenization insensitive. "foo-bar" will not match "foobar" as a value query or a word query, regardless of your punctuation sensitivity. Word and value search is not string matching.

Stemming is handled differently between a word-query and value-query; a value-query only indexes using basic stemming.

String range queries are about string matching. Whether there is a match depends on the collation, but there is no tokenization and no stemming happening.
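To make the "tokenization, not string matching" point concrete, here is a deliberately simplified tokenizer. It is not MarkLogic's tokenizer (which is language-aware and configurable through tokenizer overrides); it only illustrates why "foo-bar" yields two word tokens while "foobar" yields one, so ignoring the punctuation token still cannot make them equal.

```python
import re

# Simplified model: runs of word characters, runs of whitespace, or one punctuation character.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    """Split text into word, whitespace, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def word_tokens(text):
    """Keep only the word tokens, as a punctuation- and space-insensitive comparison would."""
    return [t for t in tokenize(text) if re.fullmatch(r"\w+", t)]


print(tokenize("foo-bar"))     # ['foo', '-', 'bar']
print(word_tokens("foo-bar"))  # ['foo', 'bar']
print(word_tokens("foobar"))   # ['foobar']
```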

Exact matches

If you want to do exact queries you can

Enable fast-case-sensitive-searches and fast-diacritic-sensitive-searches on your database and run them as value queries.

or

Create a field with custom overrides for the significant punctuation or whitespace and run them as field word or field value queries.

or

Create a string range index with the appropriate collation (codepoint, most likely) and run them as string range equality queries.

Looking deeper

As with all queries, xdmp:plan can be helpful: it will show you the questions asked of the indexes. If there is information from a query that is not reflected in the plan, that will be a case where there might be false positives from index resolution (i.e., unfiltered search).

For example, the plan for cts:search(/, cts:element-value-query(xs:QName("x"), "value-1", "exact")) should include the hyphen if you do have fast-case-sensitive-searches and fast-diacritic-sensitive-searches enabled in the database.

JSON

For purposes of indexing, a JSON property (name-value pair) is roughly equivalent to an XML element. See the following for more details:

Creating Indexes and Lexicons Over JSON Documents

How Field Queries Differ Between JSON and XML

References

Stemming and element-value-query

cts:field-value-query

cts:element-value-query

Using xdmp:plan to View the Evaluation Plan

Understanding slow 'journal frame' entries in the ErrorLog

Introduction

Slow journal frame log entries will be logged at Warning level in your ErrorLog file and will mention something like this:

.....journal frame took 28158 ms to journal...

Examples

2016-11-17 18:38:28.476 Warning: forest Documents journal frame took 28152 ms to journal (sem=0 disk=28152 ja=0 dbrep=0 ld=0): {{fsn=121519836, chksum=0xd79a4bd0, words=33}, op=commit, time=1479425880, mfor=18383617934651757356, mtim=14445621353792290, mfsn=121519625, fmcl=16964678471847070106, fmf=18383617934651757356, fmt=14445621353792290, fmfsn=121519625, sk=10604213488372914348, pfo=116961308}

2016-11-17 18:38:28.482 Warning: forest Documents journal frame took 26308 ms to journal (sem=0 disk=26308 ja=0 dbrep=0 ld=0): {{fsn=113883463, chksum=0x10b1bd40, words=23}, op=fastQueryTimestamp, time=1479425882, mfor=959797732298875593, mtim=14701896887337160, mfsn=113882912, fmcl=16964678471847070106, fmf=959797732298875593, fmt=14701896887337160, fmfsn=113882912, sk=4596785426549375418, pfo=54687472}

2016-11-17 18:38:28.482 Warning: forest Documents journal frame took 28155 ms to journal (sem=0 disk=28155 ja=0 dbrep=0 ld=0): {{fsn=121740077, chksum=0xfd950360, words=31}, op=prepare, time=1479425880, mfor=10258363344370988969, mtim=14784083780681960, mfsn=121740077, fmcl=16964678471847070106, fmf=10258363344370988969, fmt=14784083780681960, fmfsn=121740077, sk=12062047643091825183, pfo=14672600}

Understanding the messages in further detail

These messages give you further hints on what is causing the delay. In most cases, you will probably want to involve the MarkLogic Support team in diagnosing the root cause of the problem, although the table below should help with interpreting these messages:

Item     Description
sem      time waiting on semaphore
disk     time waiting on disk
ja       time waiting if journal archive is lagged
dbrep    time waiting if DR replication is lagged
ld       time waiting to replicate the journal frame to a HA replica
fsn      frame sequence number
chksum   frame checksum
words    length in words of the frame
op       the type of frame
time     UNIX time
mfor     ID of master forest (if replica)
mtim     when master became master
mfsn     master forest fsn
fmcl     foreign master cluster id
fmf      foreign master forest id
fmt      when foreign master became HA master
fmfsn    foreign master fsn
sk       sequence key (frame unique id)
pfo      previous frame offset

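When many of these warnings appear, it can help to extract the timing breakdown programmatically. The sketch below parses one of the example log lines and reports which wait component (sem, disk, ja, dbrep, ld) accounts for the most time. The regular expressions are assumptions based solely on the sample lines in this article, and they assume forest names without spaces.

```python
import re

FRAME_RE = re.compile(
    r"forest (?P<forest>\S+) journal frame took (?P<total>\d+) ms to journal "
    r"\((?P<parts>[^)]*)\)"
)
OP_RE = re.compile(r"\bop=(\w+)")


def parse_slow_frame(line):
    """Return the forest, total time, per-component waits, and slowest component."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    # "sem=0 disk=28152 ..." -> {"sem": 0, "disk": 28152, ...}
    parts = {key: int(value)
             for key, value in (item.split("=") for item in match.group("parts").split())}
    op = OP_RE.search(line)
    return {
        "forest": match.group("forest"),
        "total_ms": int(match.group("total")),
        "parts": parts,
        "slowest": max(parts, key=parts.get),
        "op": op.group(1) if op else None,
    }


line = (
    "2016-11-17 18:38:28.476 Warning: forest Documents journal frame took 28152 ms "
    "to journal (sem=0 disk=28152 ja=0 dbrep=0 ld=0): {{fsn=121519836, "
    "chksum=0xd79a4bd0, words=33}, op=commit, time=1479425880, "
    "mfor=18383617934651757356, mtim=14445621353792290, mfsn=121519625, "
    "fmcl=16964678471847070106, fmf=18383617934651757356, fmt=14445621353792290, "
    "fmfsn=121519625, sk=10604213488372914348, pfo=116961308}"
)

frame = parse_slow_frame(line)
print(frame["forest"], frame["total_ms"], frame["slowest"], frame["op"])
# Documents 28152 disk commit
```

In all three example lines the whole delay is attributed to disk, which points at the storage layer rather than replication or semaphores.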
Further reading / related articles

Knowledgebase: Warning messages for lagging operations

Knowledgebase: IO Statistics: New performance trace events

Understanding XDMP-INMM*FULL messages

Summary

The XDMP-INMMTREEFULL, XDMP-INMMLISTFULL, XDMP-INMMINDXFULL, XDMP-INMREVIDXFULL, XDMP-INMMTRPLFULL & XDMP-INMMGEOREGIONIDXFULL messages are informational only. These messages indicate that in-memory storage is full, resulting in the forest stands being written out to disk. There is no error as MarkLogic Server is working as expected.

XDMP-INMMTREEFULL indicates the in memory tree storage is full

XDMP-INMMLISTFULL indicates the in memory list storage is full

XDMP-INMMINDXFULL indicates the in memory range index storage is full.

XDMP-INMREVIDXFULL indicates the in memory reverse index storage is full.

XDMP-INMMTRPLFULL indicates the in memory triple index storage is full.

XDMP-INMMGEOREGIONIDXFULL indicates the in memory geospatial region index storage is full.

Configuration Settings

If these messages consistently appear more frequently than once per minute, increasing the ‘in-memory’ settings in the affected database may be appropriate.

XDMP-INMMTREEFULL corresponds to the “in memory tree size” setting. "in memory tree size" specifies the amount of cache and buffer memory to be allocated for managing fragment data for an in-memory stand.

XDMP-INMMLISTFULL corresponds to the “in memory list size” setting. "in memory list size" specifies the amount of cache and buffer memory to be allocated for managing termlist data for an in-memory stand.

XDMP-INMMINDXFULL corresponds to the “in memory range index size” setting. "in memory range index size" specifies the amount of cache and buffer memory to be allocated for managing range index data for an in-memory stand.

XDMP-INMREVIDXFULL corresponds to the “in memory reverse index size” setting. "in memory reverse index size" specifies the amount of cache and buffer memory to be allocated for managing reverse index data for an in-memory stand.

XDMP-INMMTRPLFULL corresponds to the “in memory triple index size” setting. "in memory triple index size" specifies the amount of cache and buffer memory to be allocated for managing triple index data for an in-memory stand.

XDMP-INMMGEOREGIONIDXFULL corresponds to the “in memory geospatial region index size” setting. "in memory geospatial region index size" specifies the amount of cache and buffer memory to be allocated for managing geo region index data for an in-memory stand.

Increasing the in memory settings has implications for the ‘journal size’ setting. The default value of journal size should be sufficient for most systems; it is calculated at database configuration time based on the size of your system. If you change the other memory settings, however, the journal size should equal the sum of the in memory list size and the in memory tree size. Additionally, you should add space to the journal size if you use range indexes (particularly if you use a lot of range indexes or have extremely large range indexes), as range index data can take up journal space.
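The sizing rule above can be written down as a one-line calculation. The figures in the example call are hypothetical values chosen for illustration, not MarkLogic defaults, and the range index allowance is a judgment call for which this article gives no formula.

```python
def suggested_journal_size_mb(in_memory_list_mb, in_memory_tree_mb,
                              range_index_allowance_mb=0):
    """Journal size = in memory list size + in memory tree size (+ optional range index headroom)."""
    return in_memory_list_mb + in_memory_tree_mb + range_index_allowance_mb


# Hypothetical settings, in MB.
print(suggested_journal_size_mb(512, 128))        # 640
print(suggested_journal_size_mb(512, 128, 64))    # 704
```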

Upgrading Data Hub Version in DHS AWS Service: DHF 4.x/5.0/5.1 to...

Introduction

This KB article is for customers who want to upgrade their DHS (Data Hub Service) Data Hub version from Data Hub 5.1.0 (or earlier) to Data Hub 5.2.x+ on AWS.

Note: This process only applies for requests to MarkLogic Support to upgrade the Data Hub version on a DHS AWS service.

Details

Customers who want to upgrade their DHS Data Hub version from Data Hub 5.1.0 (or earlier) to Data Hub 5.2.x in DHS AWS should be aware of the following.

The user can still upgrade to Data Hub 5.2.x but with the following caveats:

The Data Hub Explorer will not be available in the upgraded Data Hub Service. The upgrade only updates the DHF 5.2.x framework.

Only the old DHS roles will be available. These are documented at https://docs.marklogic.com/cloudservices/aws/security/security-roles-service.html. These roles will be mapped to the Data Hub 5.2 roles documented at http://docs.marklogic.com/datahub/security/users-and-roles.html#users-and-roles__section-general-roles. The mapping for the roles is documented in this table.

Old DHS Roles            DH 5.2 Roles
Flow Developer           data-hub-developer
Flow Operator            data-hub-operator, data-hub-monitor
Endpoint Developer       data-hub-developer
Endpoint User            data-hub-operator
Service Security Admin   data-hub-security-admin, data-hub-admin, pii-reader

When customers deploy their Data Hub project to their upgraded service, they will encounter Explorer-related errors. The Explorer errors can be ignored.

The Gradle task that will return an Explorer-related error is “hubDeploy”. Details at http://docs.marklogic.com/datahub/tools/gradle/gradle-tasks.html#gradle-tasks__marklogic-data-hub-setup-tasks

To determine which Data Hub version customers can upgrade to, see Version Compatibility in the DHS AWS documentation.
- AWS https://docs.marklogic.com/cloudservices/aws/refs/version-compatibility.html

URI Keys

Summary

Internally, MarkLogic Server maps URIs to hash values. Hash values are just numbers. For internal operations, numbers are easier to process and are more performant than strings. We refer to the URI hash as a URI Key.

Details

Where would I see a URI key?

Sometimes, URI Keys will appear in the MarkLogic Error Logs. For example, the MarkLogic Lock Manager manages document locks. Internally, the lock manager in each forest doesn't deal with URIs; it only deals with URI keys. When logging messages, the lock manager helpfully tries to turn the URI key into a URI to be more human readable. It does that by looking up and retrieving the document URI matching that URI key. If the reporting forest doesn't have a document to match the URI key, it will report a URI key instead of a URI.

For example, if the 'Lock Trace' trace event is enabled, you may see events logged that look like either of the following lines:

2015-03-18 01:53:17.576 Info: [Event:id=Lock Trace] forest=content-f1 uri=/cache/151516917/state.xml waiting=11744114292967458924 holding=15120765280191786041

2015-03-18 01:53:17.576 Info: [Event:id=Lock Trace] forest=content-f1 uri=#7734249069814007397 waiting=11744114292967458924 holding=15120765280191786041

The first line shows a URI (/cache/151516917/state.xml), and the second gives instead a URI key (7734249069814007397). When a URI key is reported as in this example, one of the following two will be true:

The reporting action may be restricted to a single forest and the referenced document for the URI (key) may be in a different forest; or

The document may not exist at all. For example, a lock may be acquired by an update before the document is actually inserted, or xdmp:lock-for-update may lock a URI that isn’t in the database without a document ever being created.
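When scanning an ErrorLog for Lock Trace events, the two forms can be told apart mechanically. The sketch below is based only on the two sample lines above: it assumes the logged value after uri= contains no spaces and that a leading # marks a URI key, which could misclassify a document whose URI genuinely begins with #.

```python
import re

LOCK_RE = re.compile(
    r"\[Event:id=Lock Trace\] forest=(?P<forest>\S+) uri=(?P<uri>\S+) "
    r"waiting=(?P<waiting>\d+) holding=(?P<holding>\d+)"
)


def parse_lock_trace(line):
    """Split a Lock Trace event into forest, URI or URI key, and the waiting/holding ids."""
    match = LOCK_RE.search(line)
    if match is None:
        return None
    value = match.group("uri")
    is_key = value.startswith("#")
    return {
        "forest": match.group("forest"),
        "uri": None if is_key else value,
        "uri_key": int(value[1:]) if is_key else None,
        "waiting": int(match.group("waiting")),
        "holding": int(match.group("holding")),
    }


with_uri = parse_lock_trace(
    "2015-03-18 01:53:17.576 Info: [Event:id=Lock Trace] forest=content-f1 "
    "uri=/cache/151516917/state.xml waiting=11744114292967458924 holding=15120765280191786041"
)
with_key = parse_lock_trace(
    "2015-03-18 01:53:17.576 Info: [Event:id=Lock Trace] forest=content-f1 "
    "uri=#7734249069814007397 waiting=11744114292967458924 holding=15120765280191786041"
)
print(with_uri["uri"], with_key["uri_key"])
```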

How can I find a URI key for a URI?

You can turn a URI ($uri) into a URI key using the following XQuery code:

xdmp:add64(xdmp:mul64(xdmp:hash64($uri),5),xdmp:hash64("uri()"))

You may want to generate the URI key in order to scan an Error Log file for reference to that key.

How can I find the URI or document for a URI key?

You can check the entire database by using cts:uris with a cts:term-query and that URI key. As an example, the following XQuery code

xquery version '1.0-ml';
let $uri := '/foo.xml'
let $uri-key := xdmp:add64(xdmp:mul64(xdmp:hash64($uri),5),xdmp:hash64("uri()"))
return cts:uris ((), (), cts:term-query ($uri-key))

returns /foo.xml
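The arithmetic in that expression is plain unsigned 64-bit math, assuming (as their names suggest) that xdmp:mul64 and xdmp:add64 discard overflow. The sketch below mirrors only the combining step; it does not reimplement xdmp:hash64, so the two hash inputs must come from MarkLogic itself, and the values used here are placeholders.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, discarding overflow


def add64(a, b):
    return (a + b) & MASK64


def mul64(a, b):
    return (a * b) & MASK64


def uri_key(uri_hash, marker_hash):
    """Combine xdmp:hash64($uri) and xdmp:hash64("uri()") the way the XQuery above does."""
    return add64(mul64(uri_hash, 5), marker_hash)


# Placeholder inputs, not real hash values.
print(uri_key(1, 2))        # 7
print(uri_key(MASK64, 0))   # 18446744073709551611 (wraps around 2**64)
```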

Valid characters in a MarkLogic Document URI

Introduction

A document uniform resource identifier (URI) is a string of characters used to identify a name of a document stored in MarkLogic Server. This article describes which characters are supported by MarkLogic 8 to represent a document URI.

ASCII

MarkLogic 8 allows all printable ASCII characters to be used in a document URI (i.e. decimal range 32-126).

List of allowed special characters within ASCII range

<space> ! " # $ % & ' () * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~

Please note that the ASCII character for space (decimal 32) can be used; however, it should not be used as a prefix or a suffix.

Other Character Sets

MarkLogic Server supports UTF-8 encoding. Apart from the valid ASCII character set mentioned above, any valid UTF-8 character can be used within a document URI in MarkLogic Server.

Examples include: decimal range 384-591 for representing Latin Extended-B; and decimal range 880-1023 for representing Greek and Coptic.

External Considerations

A few interfaces (such as XCC/J) and datatypes might place more restrictions on the characters allowed in a MarkLogic document URI. For example, the xs:anyURI datatype places more restrictions on a URI and restricts the use of & (decimal code 38) and < (decimal code 60). Consider the following scenario.

A schema is loaded into the database and validation is applied before an XML document is inserted into the database,

Now below query will fail to insert a document with URI having a

The above code fails with the error listed below:

[1.0-ml] XDMP-DOCENTITYREF: xdmp:unquote("<?xml version="1.0" encoding="UTF-8"?>
<...") -- Invalid entity reference "." at line 2

To resolve this issue, the xdmp:url-encode function can be used, for example:

let $node := xdmp:unquote(fn:concat('<?xml version="1.0" encoding="UTF-8"?>
<tns:simpleuri xmlns:tns="http://www.example.org/uri" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.example.org/uri uri.xsd ">',

xdmp:url-encode(fn:codepoints-to-string($n)), '.org
</tns:simpleuri>'))

The MarkLogic knowledge base article, Using URL encoding to handle special characters in a document URI, explains a recommended approach for safely handling special characters (using URL encoding). A document URI containing special characters, as described in that article, should be encoded before it is inserted into MarkLogic 8.

Summary

While it is possible to load documents into MarkLogic Server where the document URI contains special characters not encoded, it is recommended to follow best practices by URL encoding document URIs as it will help you design robust applications, free from the side effects caused by such special characters in other areas of your application stack.
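
The character rules described in this article can be sketched as a small client-side check. This is only an illustration in Python, not MarkLogic code: the helper names uri_warnings and encode_uri are hypothetical, and urllib.parse.quote is used as a stand-in for the server-side xdmp:url-encode function, which may differ in details such as how a space is encoded.

```python
from urllib.parse import quote

# Characters that the article says xs:anyURI restricts.
RESTRICTED_BY_ANYURI = {"&", "<"}


def uri_warnings(uri):
    """Return a list of human-readable warnings for a candidate document URI."""
    warnings = []
    # A space is allowed, but not as a prefix or a suffix.
    if uri != uri.strip(" "):
        warnings.append("leading or trailing space")
    found = sorted(ch for ch in RESTRICTED_BY_ANYURI if ch in uri)
    if found:
        warnings.append("characters restricted by xs:anyURI: " + " ".join(found))
    return warnings


def encode_uri(uri):
    """Percent-encode everything except unreserved characters and '/'."""
    return quote(uri, safe="/")
```

For example, uri_warnings("/docs/a&b.xml") reports the ampersand, and encode_uri("/docs/a&b<c.xml") gives "/docs/a%26b%3Cc.xml".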

Additional References

ISO/IEC 8859-1

w3 school: HTML Unicode (UTF-8) Reference

Warning messages for lagging operations

Summary

In releases 8.0-5.3 and 7.0-6.4, we've added code to detect lagged operations and log warnings.

Fast Query Timestamp

Every forest has a notion of a "fast query timestamp", also sometimes referred to as a "nonblocking timestamp". This is the maximum timestamp at which a query can run without waiting for the forest's timestamp to advance; it indicates the most current time at which the forest has complete state to answer a query. There are several reasons for forests to have this timestamp.

The first has to do with transaction commits, during which the forest places a finger on the commit timestamp for the duration of the commit. The point of this is to ensure that queries perceive committed transactions to be atomic. There can be multiple (even many) transactions with a finger on various timestamps at any given point in time.

The second has to do with asynchronous database replication, in which case each replicated journal frame is accompanied by an appropriate fast query timestamp from the master forest, sampled when the frame was journaled. The forest in the replica database will advance its fast query timestamp to track the journal stream. If replication is interrupted for some reason, the timestamp will stay fixed until replication resumes.

There is now code to detect and warn that a forest's fast query timestamp has lagged the cluster commit timestamp by an excessive amount. For forests in a master database, this means 30 seconds. For forests in a replica database, the warning starts at 60 seconds. The complaint frequency automatically backs off to once every 5 minutes for each forest, with one final warning when the lag is no longer excessive to properly identify when the issue was resolved. The text of the warning looks like

2016-09-09 10:37:01.225 Warning: Forest content-db-001-1 fast query timestamp (14734353609140210) lags commit timestamp (14734354209281070) by 60014 ms

This warning will help flag any problems with overly long transactions that can hold up queries. The warning helps flag the lag issue earlier, rather than later.
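
When scanning an ErrorLog for these warnings, the lag can be pulled out with a regular expression. The sketch below is Python rather than MarkLogic code, and the helper name parse_lag_warning is hypothetical. It also recomputes the lag from the two timestamps, assuming they are expressed in units of 100 nanoseconds; that assumption is inferred from the sample line above (where the difference works out to the reported 60014 ms) and is not something this article states.

```python
import re

LAG_RE = re.compile(
    r"Warning: Forest (?P<forest>\S+) fast query timestamp \((?P<fast>\d+)\) "
    r"lags commit timestamp \((?P<commit>\d+)\) by (?P<lag_ms>\d+) ms"
)

# Assumption: timestamps are in 100 ns units, so 10,000 units make 1 ms.
UNITS_PER_MS = 10_000


def parse_lag_warning(line):
    """Return forest name, reported lag and derived lag, or None if no match."""
    m = LAG_RE.search(line)
    if m is None:
        return None
    fast = int(m.group("fast"))
    commit = int(m.group("commit"))
    return {
        "forest": m.group("forest"),
        "lag_ms": int(m.group("lag_ms")),
        "derived_lag_ms": (commit - fast) // UNITS_PER_MS,
    }
```

Lines that are not lag warnings return None, so the function can be applied to every line of a log file.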

Journaling

There are times when it takes a very long time to write a journal frame, which may result in a lagged timestamp. Reasons can include underprovisioned disk, oversubscribed VM environments, VM migration, etc. These incidents will now get flagged by a new warning message like the following whenever writing a journal frame exceeds 30 seconds:

2016-08-22 21:52:18.636 Warning: forest content-db-f1 journal frame took 38882 ms to journal: {{fsn=99181221, chksum=0xbc959270, words=33}, op=commit, time=1471917138, mfor=15947305669564640543, mtim=14719107644243560, mfsn=99181221, fmcl=16964678471847070106, fmf=7272939609350931075, fmt=14445621385518980, fmfsn=103616323, sk=13614815415239633478, pfo=233342552}

Canary Thread

Another addition is a canary thread that wakes up each second, checks how long it was asleep, and warns if it was longer than 10 seconds. That message looks like

2016-09-09 10:37:01.225 Warning: Canary thread sleep was 12345 ms
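
The journal frame and canary thread warnings can be collected from an ErrorLog in a similar way. This is a minimal Python sketch, not MarkLogic code; the function name slow_events is hypothetical, and the patterns are based only on the two sample messages shown in this article, so other releases may word them differently.

```python
import re

JOURNAL_RE = re.compile(
    r"Warning: forest (?P<forest>\S+) journal frame took (?P<ms>\d+) ms to journal"
)
CANARY_RE = re.compile(r"Warning: Canary thread sleep was (?P<ms>\d+) ms")


def slow_events(lines):
    """Return (kind, forest_or_None, milliseconds) for each matching log line."""
    events = []
    for line in lines:
        m = JOURNAL_RE.search(line)
        if m:
            events.append(("journal", m.group("forest"), int(m.group("ms"))))
            continue
        m = CANARY_RE.search(line)
        if m:
            events.append(("canary", None, int(m.group("ms"))))
    return events
```

Passing an open log file to slow_events yields one tuple per slow journal write or long canary sleep, which can then be sorted or grouped by forest.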

Further Reading

Information on these and other events, including how to control the time limits:

MarkLogic Knowledgebase - IO Statistics: New performance trace events

Information on database replication lag:

MarkLogic Knowledgebase - Database Replication Lag Limit Explained

What are the compatible combinations of failover and replication ...

In MarkLogic Server 5.0, database replication is compatible with local-disk failover, while flexible replication is compatible with both local- and shared-disk failover.

In MarkLogic Server 4.2, flexible replication is compatible with both local- and shared-disk failover.

What are the databases installed by default and do I need to back...

Below is a brief description of each of the databases that ship with MarkLogic Server:

App-Services

Contains data for the "Application Services" suite of apps that are available on port 8000 after installing the product. The database also contains executed "Query Console" history and "Workspace" metadata so that queries and code written in the Query Console buffers are retained. (read more about the App-Services database)

Documents

A default database for documents (content loaded into MarkLogic Server); empty on first install

Extensions

Used for any installed MarkLogic Server Extensions (plugins); empty on first install (you can read more on plugins and view the API documentation)

Fab

Used to store state information for the "Information Studio" application that ships with MarkLogic Server; empty on first install (you can read more about this database in our documentation). Note that the "Information Studio" application has been deprecated beginning in MarkLogic 8.

Last-Login

Used to store metadata about the last user to have logged into an application server; not enabled on any application server by default and empty on first install (you can read more about monitoring user activity in our documentation)

Meters

Used to store historical performance information for a MarkLogic instance (you can read more about monitoring MarkLogic performance history in our documentation)

The other key ("auxiliary") databases include:

Modules

A default database for XQuery library and main module code; empty on first install (you can read more about the Modules database in our documentation)

Schemas

A default database for XML Schemas used for validating content loaded into MarkLogic Server; empty on first install (you can read more about the Schemas database in our documentation)

Security

Contains information for all users, roles and privileges that ship with MarkLogic Server; contains data on first install and is routinely updated in maintenance releases of the product (you can read more about the Security database in our documentation)

Triggers

Contains metadata for all configured database triggers; empty on first install (you can read more about the triggers database in our documentation)

We recommend regular backups of the Security database and, if they are actively used, of the Schemas, Modules and Triggers databases.

What do I do about XDMP-LISTCACHEFULL errors?

Introduction

MarkLogic Server uses its list cache to hold search term lists in memory. If you're attempting to execute a particularly non-selective or inefficient query, your query will fail due to the size of the search term lists exceeding the allocated list cache.

What do I do about XDMP-LISTCACHEFULL errors?

This error really boils down to the amount of an available resource (in this case, the list cache) vs. resource usage (here, the size of the fetched search term lists). The available options are:

1. Increasing the amount of available resource - in this case, increasing the list cache size.

2. Decreasing the amount of resource usage - here, any of:

   a. Finding and preventing inefficient queries. That is, tuning the queries in order to select fewer terms during index resolution;

   b. Ensuring appropriate indexes are enabled. For example, enabling positions may improve multi-term and proximity queries;

   c. Reducing the size of forests, as smaller forests will have less data and consequently smaller term lists for a given query.

Note that option #1 (increasing the list cache size) is at best a temporary solution. There is a maximum size for the list cache of 32768 MB (73728 MB as of v7.0-5 and v8.0-2), and each partition should be no more than 8192 MB. Ultimately, if you're often seeing XDMP-LISTCACHEFULL errors, you're likely running insufficiently selective queries, and/or your forests are too big.
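
The partition guidance above is simple arithmetic: for a given list cache size, the smallest number of partitions that keeps each one at or below 8192 MB. The Python sketch below only illustrates that calculation; the helper name min_partitions is hypothetical and the function does not read or change any server setting.

```python
# Per-partition ceiling quoted in this article.
MAX_PARTITION_MB = 8192


def min_partitions(list_cache_mb):
    """Smallest partition count that keeps every partition <= 8192 MB."""
    if list_cache_mb <= 0:
        raise ValueError("list cache size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-list_cache_mb // MAX_PARTITION_MB)
```

With the two maximum sizes mentioned above, min_partitions(32768) is 4 and min_partitions(73728) is 9.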

What does session.commit() do in XCC/J

Introduction

We have seen some confusion around the use of the commit() method when working with XCC/J or XCC .NET. In this article, we will walk through a scenario where exceptions are thrown if it is used in an unexpected way and we will discuss managing transactions in general. This article will attempt to give a clearer picture regarding how all the parts work in unison.

Walkthrough

We'll start by taking a look at the JavaDoc for XCC/J's Session.commit() at https://docs.marklogic.com/javadoc/xcc/com/marklogic/xcc/Session.html#commit()

Under the "Throws" heading, it states that you should expect to see an IllegalStateException if the TransactionMode is set to AUTO.

Consider the following code:

Note that this is a slightly adapted example of the ad-hoc query in the XCC Developer Guide (http://docs.marklogic.com/guide/xcc/concepts#id_65804).

The code itself is fairly simplistic - we're running a newAdHocQuery and calling xdmp:document-delete and passing in the URI of the first doc in a given database. As xdmp:document-delete returns an empty sequence, we will not be using a ResultSequence Object to work with the result set after the request has been submitted.

If you were to run the code as-is, you should see that there are no Exceptions caught and you should be able to verify that the code ran to the end. Most importantly, examination of the database should show one less document than before.

If we were to add a new line just below line 22 and add a commit():

You should see two things happening

You should now see an Exception being thrown: SEVERE [1] (SessionImpl.throwIllegalState): Cannot commit without an active transaction

In spite of this, the adHocQuery has executed and has deleted another document in the database.

In this example, the call to s.commit() appears to do little more than just throw back an exception. Why was that?

Remembering that the JavaDoc stated that you would see an IllegalStateException if you had the TransactionMode set to AUTO, you should be able to confirm the state of the TransactionMode by getting XCC to tell you. To do this, you can add a line just below cs.newSession() to print the current transaction mode to stdout:

Running the code this time, you should see AUTO printed out in the console.

Adding the line:

Should now allow you to make a call to s.commit() without getting that exception thrown. Note that in this example, such usage is okay; as you're issuing a single xdmp:document-delete() and this must run as an update.

If you change the transaction mode to QUERY and run the code again, you will see an error message that looks like:

com.marklogic.xcc.exceptions.XQueryException: XDMP-UPDATEFUNCTIONFROMQUERY: xdmp:document-delete([YOUR_URI_HERE]) -- Cannot apply an update function from a query

In essence, XCC will manage transactions on your behalf; you can issue adHocQueries in this manner and the server should do the right thing. As the query is sent to MarkLogic Server, it will be evaluated and the correct transaction mode will be set for you, so for simple transactions you don't need to handle it yourself.

Further reading

Inside MarkLogic Server explains transactions in significantly more detail - you can download it at http://developer.marklogic.com/inside-marklogic

The Developer's Guide also covers transactions in detail - particularly the section called "Understanding Transactions in MarkLogic Server" - which can be found at http://docs.marklogic.com/guide/app-dev/transactions#chapter

Now it may be the case that you don't really need to use an explicit call to commit() at all.

Understanding when commit() may be useful

You can get a better idea of what newAdHocQuery does by looking at the JavaDoc for the method http://docs.marklogic.com/javadoc/xcc/com/marklogic/xcc/AdhocQuery.html

Note this part of the description:

A specialization of Request which contains an ad-hoc query (XQuery code as a literal String) to be submitted and evaluated by the MarkLogic Server.

The key word in the description is "evaluated" - what your newAdHocQuery is doing at the server level is passing the enclosed string to one of MarkLogic Server's builtins - xdmp:eval (http://docs.marklogic.com/xdmp:eval)

The xdmp:eval builtin creates a transaction on your behalf. You can test this on your own system if you enable the eval audit event; to do this from the Admin UI, navigate to Groups > [Group Name] > Auditing - ensure "audit enabled" is set to true and tick the checkbox next to 'eval'.

If you run the original code again and then check the audit log, you should see something like:

2013-05-02 14:27:46.937 event=eval; expr=fn:doc(); database=[your-database-name-here]; success=true; user=[your-user-name-here]; roles=admin;

In the scenario earlier, why was an IllegalStateException being thrown?

Here's a quick recap of the events that took place:

We created a session.

We executed an adHocQuery which, in turn, created its own transaction.

After that completed, we called s.commit(); the resulting exception was caught by our try/catch block.

The previous query has already returned and has performed the delete as an enclosed transaction. The message is telling you that you're trying to commit a transaction where no transaction exists.
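
The sequence above can be mimicked with a toy model. The Python sketch below is not the XCC API: the class ToySession, its "auto" and "update" mode labels, and its methods are hypothetical and exist only to show why committing after a self-contained request fails, while committing inside an open multi-statement transaction succeeds. It assumes that a commit with nothing pending always fails, which is a simplification.

```python
class ToySession:
    """Toy model of the behaviour described above; not the XCC API."""

    def __init__(self, mode="auto"):
        self.mode = mode        # "auto" or "update" (hypothetical labels)
        self.pending = []       # work held inside an open transaction
        self.committed = []     # work already applied to the "database"

    def submit(self, statement):
        if self.mode == "auto":
            # The request runs in its own transaction and commits itself.
            self.committed.append(statement)
        else:
            # Multi-statement mode: hold the work until commit() or rollback().
            self.pending.append(statement)

    def commit(self):
        if not self.pending:
            raise RuntimeError("Cannot commit without an active transaction")
        self.committed.extend(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()
```

In "auto" mode, submit() followed by commit() raises even though the statement has already taken effect, which matches what was observed in the walkthrough. In "update" mode, nothing is applied until commit() is called.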

When would you use commit()?

The best case for this is when you want to compose Multi Statement Transactions. For a good overview, you can read the documentation in the XCC Developer's Guide:

http://docs.marklogic.com/guide/xcc/concepts#id_23310

In the code example in that section of the documentation, you should see commit() used in a situation where two separate requests are submitted and where either both need to run to completion, or the whole transaction needs to be rolled back completely in the event of failure.

What is a directory in MarkLogic?

Summary

It’s a lot easier to think of what a directory is useful for rather than to state what it is. Directories are a powerful and efficient way to group documents in a database. While collections are also powerful and efficient, directories have always seemed more natural because they have an obvious analog in filesystem directories and because they can effectively be serialized through the document URI.

Details

Like documents, directories can have properties. You may have run into this while performing bulk loads, as the server tries to keep the last modified date for the directory reflective of the most recent child documents. You can also put your own properties on the directory, which can be quite handy for assigning properties that are common to a group of documents.

Like documents, directories have permissions. You can control the documents users can “see” through WebDAV by controlling access at the directory level, and you can also assign a default permission level on a directory that all of its child documents will inherit. This is especially useful if you are using permissions on your stored modules and editing them – you can simply load with the appropriate URI, and all the right permissions will be assigned at load time.

Directories seem like documents in some regards but, when you create a directory in MarkLogic, it is reported by the database status screen as a fragment rather than a document. Furthermore, input() and doc() do not return directories; they only return document nodes. You could have a million directories in the database and doc() will return an empty sequence.

Directory Properties

The properties at a directory's URI identify it as a directory. For example, the properties of the root directory, xdmp:document-properties("/"), will report

<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">

<prop:directory/>

</prop:properties>

You can see how many directories you have by executing

xdmp:estimate(xdmp:document-properties(cts:uris())[.//prop:directory])

You can list all the directory URIs in the database by executing

for $uri-prop in xdmp:document-properties(cts:uris())[.//prop:directory] return base-uri($uri-prop)

Directory Creation

If the database is configured to create directories automatically, then a document insert will result in directories being created if they do not already exist (based on the URI of the document).

Warning: automatic directory creation has a performance penalty, as document insert causes additional write locks to be acquired for all directories implied by the URI; this may have the effect of serializing inserts. Unless there is a need to have automatic directory creation turned on (such as for WebDAV), it is recommended that the directory creation setting on the databases be set to manual.

You can manually create a directory by calling

xdmp:directory-create( $uri )
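
To get a feel for how many directories a single insert can touch when automatic creation is on, the directories implied by a URI can be listed. The Python sketch below is an illustration only: the helper name implied_directories is hypothetical, and it assumes that every prefix of the URI ending in "/" counts as an implied directory, including the root "/".

```python
def implied_directories(uri):
    """List every '/'-terminated prefix of a document URI, shortest first."""
    parts = uri.split("/")
    # parts[:-1] are the path segments before the document name.
    return ["/".join(parts[:i]) + "/" for i in range(1, len(parts))]
```

For a URI such as "/a/b/c.xml" this yields "/", "/a/" and "/a/b/", so three directories would be locked in addition to the document itself under the assumption above.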

Directory Deletion

WARNING: xdmp:directory-delete( $uri ) deletes not only the directory, but also deletes all of its child and descendant documents and directories from the database.

Use caution when calling this function. Bulk delete calls like xdmp:directory-delete and xdmp:collection-delete delete the relevant documents' term lists without knowledge of those documents' URIs. Without the URIs, the server can't determine the mimetypes of the corresponding documents. Without the mimetypes, it cannot prove that the corresponding documents are or are not modules. Since the server doesn't know if modules are being deleted, module caches will consequently always be invalidated when calling bulk delete calls like xdmp:directory-delete and xdmp:collection-delete, regardless of the contents of either the relevant directory or collection. If your application requests cannot afford the response time impact of module cache re-warming, you should call xdmp:document-delete for each document in the relevant directory or collection instead of calling bulk delete operations like xdmp:directory-delete and xdmp:collection-delete.

If all you want to do is delete a directory fragment, you just need to remove the node from its properties.

xdmp:node-delete(xdmp:document-properties( $uri ));

What triggers failover in MarkLogic Server?

Summary

Each node in a cluster communicates with all of the other nodes in the cluster at periodic intervals. This periodic communication, known as a heartbeat, circulates key information about host status and availability between the nodes in a cluster. Through this mechanism, the cluster determines which nodes are available and communicates configuration changes with other nodes in the cluster. If a node goes down for some reason, it will stop sending heartbeat packets to the other nodes in the cluster.

Cluster Heartbeat

The cluster uses the heartbeat to determine if a node in the cluster is down. A heartbeat message from a given node contains its view of the current state of the cluster at the moment the heartbeat was generated. The determination of a down node is based on a vote from each node in the cluster. In order to vote a node out of the cluster, there must be a quorum of nodes voting to remove a node.

A quorum occurs if more than 50% of the total number of nodes in the cluster (including any nodes that are down) vote the same way. The voting that each host performs is done based on how long it has been since it last had a heartbeat from the other node. If a quorum of the nodes in the cluster determines that a node is down, then that node is disconnected from the cluster. The wait time for a host to be disconnected from the cluster is typically considerably longer than the time for restarting a host, so restarts should not cause hosts to be disconnected from the cluster (and therefore they should not cause forests to fail over).
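
The "more than 50% of the total number of nodes" rule can be written as a one-line check. The Python sketch below is illustrative only; the function name has_quorum is hypothetical. Note that the total includes nodes that are down, and that exactly half is not enough.

```python
def has_quorum(votes, total_nodes):
    """True when strictly more than half of all nodes (up or down) agree."""
    if total_nodes <= 0 or not 0 <= votes <= total_nodes:
        raise ValueError("votes must be between 0 and total_nodes")
    # Compare 2 * votes with the total to avoid fractional arithmetic.
    return 2 * votes > total_nodes
```

In a three-node cluster, two agreeing nodes form a quorum; in a four-node cluster, two do not, and three are needed.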

There are group configuration parameters to determine how long to wait before removing a node (for details, see XDQP Timeout, Host Timeout, and Host Initial Timeout Parameters).

Each node in the cluster continues listening for the heartbeat from the disconnected node to see if it has come back up, and if a quorum of nodes in the cluster are getting heartbeats from the node, then it automatically rejoins the cluster.

The heartbeat mechanism allows the cluster to recover gracefully from things like hardware failures or other events that might make a host unresponsive. This occurs automatically, without any human intervention; machines can go down and automatically come back up without requiring intervention from an administrator.

Hosts with Content Forests

If the node that goes down hosts content in a forest, then the database to which that forest belongs will go offline until the forest either comes back up or is detached from the database.

If you have failover enabled and configured for the forest whose host is removed from the cluster, the forest will attempt to fail over to a secondary host (that is, one of the secondary hosts will attempt to mount the forest). Once that occurs, the database will come back online.

For shared disk failover, there is an additional failover criterion that could prevent a forest from failing over. The forest's label file is updated regularly by the host that is managing the forest. To avoid corruption of the data on the shared file system, the forest will not fail over when the forest is being actively managed - i.e. the forest's label file time stamp is checked to ensure that the forest is not currently being actively managed. This could occur in the situation where a host is isolated from the other nodes in the cluster, but can still access the forest data (on shared disk).

Tips for Handling Failover on A Busy Cluster

This Knowledgebase Article contains a good discussion about how to handle failover that is occurring frequently when your cluster hosts sometimes are too busy to respond in a timely manner. The section on "Improving the situation" contains step-by-step instructions for group and database settings that are tuned for a very busy cluster.

XDMP-CANCELED vs. XDMP-EXTIME

Summary

XDMP-CANCELED indicates that a query or operation was cancelled either explicitly or as a result of a system event. XDMP-EXTIME also indicates that a query or operation was cancelled, but the reason for the cancellation is the result of the elapsed processing time exceeding a timeout setting.

XDMP-CANCELED: Canceled request

The XDMP-CANCELED error message usually indicates that an operation such as a merge, backup or query was explicitly canceled. The message includes information about what operation was canceled. Cancellation may occur through the Admin Interface or by calling an explicit cancellation function, such as xdmp:request-cancel().

An XDMP-CANCELED error message can also occur when a client breaks the network socket connection to the server while a query is running (i.e. if the client abandons the request), resulting in the query being canceled.

try/catch:

XDMP-CANCELED exception will not be caught in a try/catch block.

XDMP-EXTIME: Time limit exceeded

An XDMP-EXTIME error will occur if a query or other operation exceeded its processing time limit. Surrounding messages in the ErrorLog.txt file may pinpoint the operation which timed out.

Inefficient Queries

If the cause of the timeout is an inefficient or incorrect query, you should tune the query. This may involve tuning your query to minimize the amount of filtering required. Tuning queries in MarkLogic often includes maintaining the proper indexes for the database so that the queries can be resolved during the index resolution phase of query evaluation. If a query requires filtering of many documents, then the performance will be adversely affected. To learn more about query evaluation, refer to Section 2.1 'Understanding the Search Process' of the MarkLogic Server's Query Performance and Tuning Guide available in our documentation at https://docs.marklogic.com/guide/performance.pdf.

MarkLogic has tools that can be used to help evaluate the characteristics of your queries. The best way to analyze a single query is to instrument the query with query trace, query meters and query profiling API calls: Query trace can be used to determine if the queries are resolvable in the index, or if filtering is involved; Query meters gives statistics from a query execution; and Query profiling will provide information regarding how long each statement in your query took. Information regarding these APIs is available in the Query Performance and Tuning Guide.

The Query Console makes it easy to profile a query in order to view sub-statement execution times. Once you have identified the poorly performing statements, you can focus on optimizing that part of the code.

Inadequate Processing Limit

If the cause of the timeout is an inadequate processing limit, you may be able to configure a more generous limit through the Admin Interface.

A common setting which can contribute to the XDMP-EXTIME error message is the 'default time limit' setting for an Application Server or the Task Server. An alternative to increasing the 'default time limit' is to use xdmp:set-request-time-limit() within your query. Note that neither the 'default time limit' nor the request time limit can be larger than the "max time limit".

Resource Bottlenecks

If the cause of the timeout is the result of a resource bottleneck where the query or operation was not being serviced adequately, you will need to tune your system to eliminate the resource bottleneck. MarkLogic recommends that all systems where MarkLogic Server is installed should monitor the resource usage of its system components (i.e. CPU, memory, I/O, swap, network, ...) so that resource bottlenecks can easily be detected.

try/catch

XDMP-EXTIME can be caught in a try/catch block.

xdmp:value, xdmp:eval, and inline functions

xdmp:value() vs. xdmp:eval():

Both xdmp:value() and xdmp:eval() are used for executing strings of code dynamically. However, there are fundamental differences between the two:

The code in the xdmp:value() is evaluated against the current context - if variables are defined in the current scope, they may be referenced without re-declaring them

xdmp:eval() creates an entirely new context that has no knowledge of the context calling it - which means one must define a new XQuery prolog and variables from the main context. Those newly defined variables are then passed to the xdmp:eval() call as parameters and declared as external variables in the eval script

Function behavior when used inline:

Although both of these functions seem to fulfill the same purpose, it is very important to note that their behavior changes when used inline. Consider the following example:

declare namespace db = "http://marklogic.com/xdmp/database";

let $t :=
  <database xmlns="http://marklogic.com/xdmp/database" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <database-name>aptp-dev-modules</database-name>
  </database>
return
  fn:fold-left(function($a, $b){ xdmp:value(fn:concat("$t/db:", "database-name")) }, (), (1,2,3))

(or)

  fn:fold-left(function($a, $b){ $t/xdmp:value(fn:concat("db:", "database-name")) }, (), (1,2,3))

When a function is called inline, the expressions inside the function cannot be statically compiled to function items because the values of the closed-over variables are not yet available. Therefore, the query parser would have to look for any variable bindings during dynamic analysis to be able to evaluate the expression. Ideally, variables from the main context are passed to the function call as parameters. However, in the case of xdmp:value(), the function is expected to have the needed context to evaluate the expression and therefore the expression is evaluated without looking for any variable bindings - which can ultimately lead to unexpected behavior. This explains why the first return statement in the above example returns an ‘empty sequence’ and the second one returns the correct results because the variable is being referenced outside of the xdmp:value call. In other words, when used inline - xdmp:value() cannot reference variables declared in the current scope.

In contrast, in the case of xdmp:eval, the parser would know to look for variable bindings during dynamic analysis as this function is not expected to have the knowledge of the calling context. Consequently, when using xdmp:eval the context needs to be explicitly created and the variables explicitly passed to the call as parameters and declared as external variables.

XML serialization and output options

XML as stored in MarkLogic Server

MarkLogic Server starts by parsing and indexing the document contents, converting the document from serialized XML (what you see in a file) to a compressed binary fragment representation of the XML data model—strictly, the XQuery Data Model (XDM). The data model differs from the serialized format. For example, when serialized in XML an attribute value may be surrounded by single or double quotes; in the data model that difference is not recorded.  Character references are stored internally as codepoints.

Therefore, when XML is returned from the database, the content will be the same, but serialization details may vary. If a file must be returned byte-for-byte, it can be stored in MarkLogic Server in its binary form. However, binary documents are not indexed by MarkLogic, which means they cannot be directly searched.

XML as returned by MarkLogic Server

Though the original serialization information may not be stored in MarkLogic Server, there are a number of ways that output can be controlled when returning serialized XML from MarkLogic Server.

The XQuery xdmp:output option can be used at the code level: xdmp:output.

Output options may be used with xdmp:save when writing a file.

Output options may be specified at the app-server level: Controlling App Server Access, Output, and Errors.

XPath not() with path fields

Introduction

Fields are a great way of restricting what parts of your documents to search, based on XML element QNames or JSON property names. Fields are extremely useful when you have content in one or more elements or JSON properties that you want to query simply and efficiently as a single unit. But can you use field names you've created with XPath's fn:not()? In other words, given a field name "test-field-name", can you do something like fn:not(//test-field-name)? Unfortunately, you cannot, as the server will return an XDMP-UNINDEXABLEPATH error. There is, however, a workaround.

Workaround

The workaround is to create two fields, then to query across those two fields using cts:not-in-query (http://docs.marklogic.com/cts:not-in-query). Consider two documents:

Document 1

xdmp:document-insert(
  "/test/fields-001.xml",
  <doc>
    <content>
      <courtcase>
        <metadata>
          <docinfo>
            <hier>
              <hierlev>
                <heading>
                  <title>1900</title>
                </heading>
                <hierlev>
                  <heading>
                    <title>Volume 10</title>
                  </heading>
                  <hierlev>
                    <heading>
                      <title>test title - (1900) 10 Ch.D. 900</title>
                    </heading>
                  </hierlev>
                </hierlev>
              </hierlev>
            </hier>
          </docinfo>
        </metadata>
      </courtcase>
    </content>
  </doc>,
  xdmp:default-permissions(),
  ("test", "fields")
)

Document 2

xdmp:document-insert(
  "/test/fields-002.xml",
  <doc>
    <content>
      <courtcase>
        <metadata>
          <docinfo>
            <hier>
              <hierlev>
                <heading>
                  <title>1879</title>
                </heading>
                <hierlev>
                  <heading>
                    <title>John had a little lamb</title>
                  </heading>
                  <hierlev>
                    <heading>
                      <title>Mary had a little lamb</title>
                    </heading>
                  </hierlev>
                </hierlev>
              </hierlev>
            </hier>
          </docinfo>
        </metadata>
      </courtcase>
    </content>
  </doc>,
  xdmp:default-permissions(),
  ("test", "fields")
)

Say you're interested in three different paths:

1) All titles, which should be defined as fn:collection()//heading/title

2) Titles with lower-level titles, which should be defined as fn:collection()//hierlev[.//hierlev/heading/title]/heading/title

3) Titles with NO lower-level titles, which should be defined as fn:collection()//hierlev[fn:not(.//hierlev/heading/title)]/heading/title

Unfortunately, while we can express #3 in full XPath, we cannot express it in the subset of XPath used to describe path fields. However, you can emulate #3 by defining fields corresponding to #1 and #2, then combining them in a cts:not-in-query.

Create the path fields

All titles

Create a Path Field with name "titles-all" and path "//heading/title"

Titles with lower-level titles

Create a Path Field with name "titles-with-lower-level-titles" and path "//hierlev[.//hierlev/heading/title]/heading/title"
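
The two path fields can be created in the Admin Interface, or scripted. The following is a minimal sketch using the Admin API, assuming the target database is named "Documents" and that a field path weight of 1.0 is acceptable; both are assumptions for illustration, not values taken from the original article. A database that receives a new field definition may need to reindex before the field can be queried.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: the documents live in the "Documents" database. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()

(: Field 1: all titles :)
let $all := admin:database-path-field(
  "titles-all",
  admin:database-field-path("//heading/title", 1.0))

(: Field 2: titles that have lower-level titles beneath them :)
let $with-lower := admin:database-path-field(
  "titles-with-lower-level-titles",
  admin:database-field-path(
    "//hierlev[.//hierlev/heading/title]/heading/title", 1.0))

let $config := admin:database-add-field($config, $db, ($all, $with-lower))
return admin:save-configuration($config)
```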

Emulate the XPath you want by combining these two newly created path fields in a cts:not-in-query()

(: $term is assumed to be bound to the word or phrase being searched for :)
for $doc in cts:search(
  fn:collection("fields"),
  cts:not-in-query(
    cts:field-word-query("titles-all", $term),
    cts:field-word-query("titles-with-lower-level-titles", $term)
  )
)
return xdmp:node-uri($doc)

XQuery 3.0 and MarkLogic 5 XQuery Dialects

Summary

MarkLogic 5 does not support a dialect that conforms to XQuery 3.0.

Details

At the time of this writing, XQuery 3.0 is a "Working Draft" (http://www.w3.org/TR/xquery-30/). MarkLogic stays abreast of and contributes to the XQuery standards: we have employees who are members of the XQuery 3.0 Working Group, and one who is an editor of the XQuery 3.0 Working Draft.

You can find a list and description of the XQuery dialects that MarkLogic Server implements in the "XQuery Dialects in MarkLogic Server" section of the XQuery and XSLT Reference Guide, available on our developer website at http://developer.marklogic.com/pubs/5.0/books/xquery.pdf. The "MarkLogic Server Enhanced (XQuery 1.0-ml)" dialect implements some of the functionality that you will find in the XQuery 3.0 Working Draft, such as the try/catch expression.
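
As an illustration of the try/catch expression in the 1.0-ml dialect, here is a minimal sketch. The error name and message are made up for the example; the catch variable is bound to an error:error element, and error:code is assumed here to be one of its child elements worth reporting.

```xquery
xquery version "1.0-ml";

try {
  (: Raise an error deliberately; the QName and message are hypothetical. :)
  fn:error(xs:QName("EXAMPLE-ERROR"), "Something went wrong")
}
catch ($e) {
  (: $e holds an error:error element describing the failure. :)
  fn:concat("Caught: ", $e/error:code/fn:string())
}
```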

XQuery ampersand in string

Summary

The ampersand is a special character used to denote a predefined entity reference in a string literal.

XQuery W3C Recommendation

The recommendation can be found at http://www.w3.org/TR/xquery-30/.

Section 2.4.5 'URI Literals' states "Certain characters, notably the ampersand, can only be represented using a 'predefined entity reference' or a 'character reference'."

Section 3.1.1 'Literals' defines the predefined entity reference for ampersand as "&amp;".

Issues with the ampersand character

The ampersand character can be tricky to construct in an XQuery string, as it is an escape character to the XQuery parser. The ways to construct the ampersand character in XQuery are:

Use the XML entity syntax (for example, &amp;).

Use a CDATA section (<![CDATA[element content here]]>), which tells the XQuery parser to read the content as character data.

Use the repair option on xdmp:document-load, xdmp:document-get, or xdmp:unquote.

For additional details and examples, please refer to XML Data Model Versus Serialized XML in the MarkLogic Server's XQuery and XSLT Reference Guide.
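
The first two approaches above can be compared side by side. This is a small sketch with made-up sample text; it also uses a numeric character reference, which the W3C text quoted earlier mentions as an alternative. All three values should be the same 11-character string, so the two comparisons are expected to return true.

```xquery
xquery version "1.0-ml";

(: Predefined entity reference inside a string literal :)
let $entity := "Tom &amp; Jerry"
(: Numeric character reference for U+0026 :)
let $char-ref := "Tom &#38; Jerry"
(: CDATA section inside an element constructor :)
let $cdata := fn:string(<x><![CDATA[Tom & Jerry]]></x>)
return (
  $entity,
  $entity eq $char-ref,
  $entity eq $cdata,
  fn:string-length($entity)
)
```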

XQuery and JavaScript interoperability

Introduction

This article discusses the use of XQuery in JavaScript and vice versa.

Using XQuery in JavaScript

A JavaScript module in MarkLogic can also import an XQuery library and access its functions and variables as if they were JavaScript. If you’re working primarily in JavaScript, but you have an existing library in XQuery or a specialized task that would be better suited to XQuery, you can write just that library in XQuery and import it into a JavaScript module.

The calling JavaScript module doesn’t need to even know that the library was implemented in XQuery. MarkLogic automatically makes all of the public functions and variables of the XQuery library available as native JavaScript functions and variables in the calling JavaScript module. (This is what’s happening when you import one of MarkLogic’s many libraries that come bundled with the server, such as Admin or Security.)

This capability will be key for those developers with existing investments in XQuery that want to start using JavaScript without having to rewrite all of their libraries.

Using JavaScript in XQuery

You can't import JavaScript libraries to XQuery, but you can call xdmp:invoke with a JavaScript main module or evaluate a string of JavaScript with xdmp:javascript-eval.
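
A minimal sketch of evaluating JavaScript from XQuery follows. The variable name x and its value are arbitrary; the JavaScript declares x with var so that the value supplied from XQuery can bind to it, and the call is expected to return 42.

```xquery
xquery version "1.0-ml";

(: Evaluate a string of JavaScript, passing x in as an external variable. :)
xdmp:javascript-eval("var x; x * 2", ("x", 21))
```

Calling xdmp:invoke with the path of a JavaScript main module (for example, a hypothetical "/example.sjs" in the app server's modules location) follows the same pattern as invoking an XQuery main module.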

Port	General Service Ports
20	FTP Data Transfer Mode
21	FTP Control(command) Mode
22	SSH
23	Telnet
43	WHOIS
53	DNS
Port	Web Service Ports
119	NNTP
80	HTTP
3306	MySQL
Port	Control Panel Default Ports
2082	cPanel
2083	Secure cPanel
2086	WHM
2087	Secure WHM
2095	cPanel Webmail
2096	Secure cPanel Webmail
8443	Secure Plesk
8880	Plesk
10000	Webmin
Port	E-mail Service Ports
25	SMTP
465	SMTPS
109	POP2
110	POP3
143	IMAP
993	IMAPS

Name	Extension	Format
application/json	json	json
application/rdf+json	rj	json
application/sparql-results+json	srj	json
application/xml	xml xsd xvs sch	xml
text/json		json
text/xml		xml
application/vnd.marklogic-javascript	sjs	text
application/vnd.marklogic-ruleset	rules	text

Host Name	Primary Forest 1	Primary Forest 2	Replica Forest 1	Replica Forest 2
Host A	Data-1	Data-2	Data-5-R	Data-6-R
Host B	Data-3	Data-4	Data-1-R	Data-2-R
Host C	Data-5	Data-6	Data-3-R	Data-4-R

Item	Description
sem	time waiting on semaphore
disk	time waiting on disk
ja	time waiting if journal archive is lagged
dbrep	time waiting if DR replication is lagged
ld	time waiting to replicate the journal frame to a HA replica
fsn	frame sequence number
chksum	frame checksum
words	length in words of the frame
op	the type of frame
time	UNIX time
mfor	ID of master forest (if replica)
mtim	when master became master
mfsn	master forest fsn
fmcl	foreign master cluster id
fmf	foreign master forest id
fmt	when foreign master became HA master
fmfsn	foreign master fsn
sk	sequence key (frame unique id)
pfo	previous frame offset

Old DHS Roles	DH 5.2 Roles
Flow Developer	data-hub-developer
Flow Operator	data-hub-operator data-hub-monitor
Endpoint Developer	data-hub-developer
Endpoint User	data-hub-operator
Service Security Admin	data-hub-security-admin data-hub-admin pii-reader

