Data Hub Framework - FAQ | MarkLogic Support

15 March 2022 12:31 PM

Question	Answer	Further Reading
What is Data Hub?	The MarkLogic Data Hub is an open-source software interface that works to: (1) ingest data from multiple sources, (2) harmonize that data, (3) master that data, and then (4) search and analyze that data. It runs on MarkLogic Server, and together they provide a unified platform for mission-critical use cases.	Documentation: MarkLogic Data Hub
How do I install Data Hub?	Please see the referenced documentation.	Documentation: Install Data Hub
What software is required for Data Hub installation?	Java JRE (OpenJDK) 8; MarkLogic Server (see Version Compatibility); Gradle.	Documentation: Install Data Hub
What is MarkLogic Data Hub Central?	Hub Central is the Data Hub graphical user interface.	Documentation: Hub Central; Guided Tour of MarkLogic Data Hub Central; Introducing MarkLogic Data Hub Central; What is MarkLogic Data Hub Central?
What are the ways to ingest data in Data Hub?	Hub Central (note that Quick Start has been deprecated since Data Hub 5.5); Data Hub Gradle Plugin; Data Hub Client JAR; Data Hub Java APIs; Data Hub REST APIs; MarkLogic Content Pump (MLCP).	Documentation: On-Premises Tools; MarkLogic Data Hub 5.5 - Release Notes
What is the recommended batch size for matching steps?	The best batch size for a matching step varies with the average number of matches expected. The larger the average number of matches, the smaller the batch size should be. A batch size of 100 is the recommended starting point.	Documentation: Batch size for Matching step
What is the recommended batch size for merging steps?	The merge batch size should always be 1.	Documentation: Batch size for Merging step
How do I kill a long-running flow in Data Hub?	At the moment, Data Hub has no feature to stop or kill a long-running flow. If you encounter this issue, please provide Support with the following information to help us investigate further: (a) error logs and exception traces from the time the job was started; (b) the job document for the step in question. You can find that document in the "data-hub-JOBS" database using the job ID: (1) open Query Console; (2) select the data-hub-JOBS database from the dropdown; (3) click Explore; (4) enter the job ID in the search field and press Enter, e.g. *21d54818-28b2-4e56-bcfe-1b206dd3a10a*; (5) the document appears in the results. Note: If you want to force the flow to stop, you can cycle (restart) the Java program and stop the requests from the corresponding app server status page in the Admin UI.	KB Article: Killing a Long running Query and Request Time Limits
What do we do if we consistently receive SVC-EXTIME errors while running the merging step?	“SVC-EXTIME” generally occurs when a query or other operation exceeds its processing time limit. There are various reasons behind this error, for example: lack of physical resources; infrastructure-level slowness; network issues; server overload; document locking issues. Additionally, review the matching step to see how many URIs you are trying to merge in one go: (1) reduce the batch size to a value that balances processing time against the risk of the SVC-EXTIME timeout error; (2) modify your matching step to work with fewer matches per run rather than a huge number of matches. Turning ON the SM-MATCH and SM-MERGE traces gives a good indication of where processing is getting stuck. Remember to turn them OFF once the issue has been detected or resolved.	Documentation: SVC-EXTIME
What are the best practices for performing Data Hub upgrades?	(1) Note that Data Hub versions depend on MarkLogic Server versions: if your Data Hub version requires a different MarkLogic Server version, you MUST upgrade your MarkLogic Server installation before upgrading your Data Hub version. (2) Take a backup. (3) Perform extensive testing with all use cases on lower environments. (4) Refer to the release notes (some Data Hub upgrades require reindexing), the upgrade documentation, and the version compatibility with MarkLogic Server.	KB Article: MarkLogic Server/Data Hub version compatibility and upgrade
How can I encrypt my password in Gradle files used for Data Hub?	Store the password in encrypted Gradle properties and reference the property in the configuration file.	Documentation: Encrypting passwords; Blog: Protecting passwords in ml-gradle projects
How can I create a Golden Record using Data Hub?	A golden record is a single, well-defined version of all the data entities in an organizational ecosystem. In Hub Central, once you have gone through the ingest, map, and master steps, the documents in the *sm-<EntityType>-mastered* collection are considered golden records.	KB Article: What is a Golden Record and how can you create one on DataHub?
What authentication methods does Data Hub support?	Data Hub primarily supports basic and digest authentication. The configuration for username/password authentication is provided when deploying your application.
How do I know which MarkLogic Server version is compatible with my Data Hub version?	Refer to the Version Compatibility matrix.
Can we deploy multiple DHF projects on the same cluster?	This operation is NOT supported.
Can we perform offline/disconnected Data Hub upgrades?	This is NOT supported, but you can refer to this example to see one potential approach.
TDE Generation in Data Hub	For production purposes, you should configure your own TDEs instead of depending solely on the TDEs generated by Data Hub (which may not be optimized for performance or scale).
Where does Gradle download the dependencies needed to install DHF from?	Gradle uses the following sites to resolve dependencies: the DHF Gradle plugin is fetched from https://plugins.gradle.org/m2; all dependencies are retrieved from https://jcenter.bintray.com/ or https://search.maven.org/; JFrog. A build scan is helpful for figuring out what the dependencies are: it provides a shareable and centralized record of a build, with insights into what happened and why. You can create build scans and publish the results at https://scans.gradle.com to see where Gradle is trying to download each dependency from, under the "Build Dependencies" section of the results page.
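
The batch-size guidance for matching steps in the table above (start at 100 and go smaller when runs exceed the time limit) can be sketched as a simple back-off loop. This is only an illustration under stated assumptions: `run_step` is a hypothetical callable standing in for however you launch the step, Python's `TimeoutError` stands in for an SVC-EXTIME failure, and halving the batch size is an assumed strategy rather than one prescribed by Data Hub.

```python
from typing import Callable


def find_workable_batch_size(
    run_step: Callable[[int], None],
    start: int = 100,  # recommended starting point for matching steps
    minimum: int = 1,
) -> int:
    """Halve the batch size until run_step completes without timing out.

    run_step is a hypothetical callable: it should run the step with the
    given batch size and raise TimeoutError if the time limit is exceeded.
    """
    batch_size = start
    while True:
        try:
            run_step(batch_size)
            return batch_size
        except TimeoutError:
            if batch_size <= minimum:
                # Even the smallest batch timed out; batch size alone
                # is not the cause, so surface the error to the caller.
                raise
            batch_size = max(minimum, batch_size // 2)


# Stand-in for a real step run (hypothetical): any batch above 30 "times out".
def fake_run_step(batch_size: int) -> None:
    if batch_size > 30:
        raise TimeoutError(f"batch of {batch_size} exceeded the time limit")


print(find_workable_batch_size(fake_run_step))  # tries 100, 50, then 25
```

If the smallest batch size still times out, the loop re-raises the error, which matches the advice above to look at other causes such as resources, network, server load, or document locking.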
