MarkLogic Content Pump (MLCP) Thread Tuning
29 June 2022 03:03 PM
The performance of data extraction and ingestion with mlcp depends on multiple factors, including the hardware capacity of the client node running mlcp. This article focuses solely on how to adjust the mlcp -thread_count and -thread_count_per_split options for better import and export performance on the given hardware and data set size.
For mlcp import jobs, there are two options for tuning threads:
-thread_count is the number of threads to spawn for concurrent loading. The total number of threads, however, is controlled by the thread count that mlcp calculates at job initialization, or by -thread_count if it is specified.
-thread_count_per_split is the maximum number of threads that can be assigned to each split. If you specify -thread_count_per_split, each input split will run with the specified number of threads.
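As a minimal sketch of how the two import options might be passed, the Python helper below assembles an mlcp import command line. The helper name build_mlcp_import_args and the host, port and input path values are hypothetical and chosen only for illustration; credentials (-username, -password) are left out for brevity. The tuning options are added only when set, so mlcp falls back to its own defaults otherwise.

```python
import shlex
from typing import List, Optional


def build_mlcp_import_args(
    host: str,
    port: int,
    input_path: str,
    thread_count: Optional[int] = None,
    thread_count_per_split: Optional[int] = None,
) -> List[str]:
    """Assemble an mlcp import command line as an argument list.

    Hypothetical helper for illustration; it only builds the arguments
    and does not run mlcp.
    """
    args = [
        "mlcp.sh", "import",
        "-host", host,
        "-port", str(port),
        "-input_file_path", input_path,
    ]
    # Pass the tuning options only when the caller sets them, so that
    # mlcp uses its own default thread count otherwise.
    if thread_count is not None:
        if thread_count < 1:
            raise ValueError("thread_count must be a positive integer")
        args += ["-thread_count", str(thread_count)]
    if thread_count_per_split is not None:
        if thread_count_per_split < 1:
            raise ValueError("thread_count_per_split must be a positive integer")
        args += ["-thread_count_per_split", str(thread_count_per_split)]
    return args


if __name__ == "__main__":
    # Example values are assumptions, not recommendations.
    cmd = build_mlcp_import_args(
        "localhost", 8000, "/data/in",
        thread_count=16, thread_count_per_split=2,
    )
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the connection details match your environment.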
What if neither option is specified?
Prior to 10.0-4.2, mlcp import uses a default thread count of 4 for concurrent loading.
In mlcp 10.0-4.2 and later, a thread polling mechanism was introduced. During job initialization, mlcp polls the server to identify the maximum number of app server or XDBC server threads on the port that handles mlcp requests. mlcp then uses this number as the default thread count.
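The default selection described above can be modeled in a few lines of Python. This is an illustrative sketch of the rule as stated in this article, not mlcp's actual code: the function names are hypothetical, and polled_server_threads stands in for whatever value the polling step returns.

```python
import re
from typing import Optional, Tuple

POLLING_INTRODUCED = (10, 0, 4, 2)  # mlcp 10.0-4.2
LEGACY_DEFAULT_THREADS = 4


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '10.0-4.2' into (10, 0, 4, 2)."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def effective_import_threads(
    version: str,
    polled_server_threads: int,
    thread_count: Optional[int] = None,
) -> int:
    """Model the import thread count mlcp would use.

    polled_server_threads is an assumed input: the maximum server
    threads reported for the port that handles mlcp requests.
    """
    # An explicit -thread_count always wins.
    if thread_count is not None:
        return thread_count
    # Before 10.0-4.2 there is no polling, so the fixed default applies.
    if parse_version(version) < POLLING_INTRODUCED:
        return LEGACY_DEFAULT_THREADS
    # From 10.0-4.2 on, the polled value becomes the default.
    return polled_server_threads
```

For example, with a polled value of 32, version 10.0-4.1 yields 4, while 10.0-4.2 yields 32 unless -thread_count overrides it.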
For mlcp export jobs, the only option for thread tuning is -thread_count.
What if -thread_count is not specified?
If it is not specified, the default thread count for concurrent exporting is 4.
For import: It is recommended to align the mlcp concurrent thread count with the maximum server threads allowed on all hosts in the group (preferably all the E-nodes) to achieve better performance. However, this may not help if your MarkLogic server is I/O bound, because increasing the concurrency of writes will not necessarily improve performance. Because the polling mechanism already maxes out the concurrency of the current app server/XDBC server, running multiple mlcp jobs at the same time is not recommended.
For export: It is good practice to start with smaller thread counts, such as 8, 16, 24, 32, 40 or 48, and increase them until the environment becomes I/O bound. Since mlcp exports content from multiple MarkLogic servers and writes to the local file system on a single node, performance is largely restricted by the I/O capability of the machine that runs the mlcp job. Increasing the thread count further may harm performance, since the client consumes data much more slowly than the server serves it. It may also result in long-running requests, which may time out (SVC-EXTIME exception) on the app server/XDBC server, depending on the request timeout setting.
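One way to apply the export guidance is to time a trial export at each candidate thread count and stop at the point where throughput no longer improves. The sketch below assumes you have already collected such measurements yourself; the function name, the 5% gain threshold and the sample figures are all hypothetical and not taken from mlcp or from any real benchmark.

```python
from typing import Dict


def pick_export_thread_count(
    throughput_by_threads: Dict[int, float],
    min_gain: float = 0.05,
) -> int:
    """Return the last thread count that still gave a worthwhile gain.

    throughput_by_threads maps a tested -thread_count value to the
    measured throughput (for example, documents per second).
    min_gain is the minimum relative improvement required to justify
    moving to the next, larger thread count.
    """
    if not throughput_by_threads:
        raise ValueError("at least one measurement is required")
    if any(rate <= 0 for rate in throughput_by_threads.values()):
        raise ValueError("throughput values must be positive")

    counts = sorted(throughput_by_threads)
    best = counts[0]
    for candidate in counts[1:]:
        current = throughput_by_threads[best]
        gain = (throughput_by_threads[candidate] - current) / current
        # Stop once extra threads no longer pay off; the client is
        # likely I/O bound at this point.
        if gain < min_gain:
            break
        best = candidate
    return best


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    sample = {8: 1000.0, 16: 1800.0, 24: 2300.0, 32: 2350.0, 40: 2200.0, 48: 2000.0}
    print(pick_export_thread_count(sample))
```

With the made-up figures above, the gain from 24 to 32 threads is about 2%, so the sketch settles on 24.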
For more information on troubleshooting MLCP, see the following resources.