What is Data Hub?
The MarkLogic Data Hub is an open-source software interface used to:
- ingest data from multiple sources
- harmonize that data
- master that data
- then search and analyze that data
It runs on MarkLogic Server, and together, they provide a unified platform for mission-critical use cases.
Documentation:

How do I install Data Hub?
Please see the referenced Install Data Hub documentation.
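As a rough illustration, a Gradle-based installation typically follows the sketch below. The plugin version, project name, and connection details are placeholders, not prescriptive values; always follow the install documentation for your release.

```sh
# Minimal sketch of a Gradle-based Data Hub install (version/credentials are placeholders)
mkdir my-data-hub && cd my-data-hub

# build.gradle declaring the Data Hub Gradle plugin
cat > build.gradle <<'EOF'
plugins {
    id 'com.marklogic.ml-data-hub' version '5.5.0'
}
EOF

gradle hubInit      # scaffold the Data Hub project structure
# ...edit gradle.properties (host, mlUsername, mlPassword, ports)...
gradle mlDeploy     # deploy the Data Hub application to MarkLogic
```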
What software is required for Data Hub installation?
At a minimum you need a compatible MarkLogic Server version, a supported Java JDK, and a supported Gradle version (for the Gradle-based installation); see the referenced documentation for the exact versions required by your Data Hub release.
Documentation:

What is MarkLogic Data Hub Central?
Hub Central is the Data Hub graphical user interface, used to load, model, curate, and explore your data.
Documentation:

What are the ways to ingest data in Data Hub?
- Hub Central (note that QuickStart has been deprecated since Data Hub 5.5)
- Data Hub Gradle Plugin
- Data Hub Client JAR
- Data Hub Java APIs
- Data Hub REST APIs
- MarkLogic Content Pump (MLCP) (see the example below)
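As an illustration of the MLCP option, a minimal staging ingest might look like this sketch. Host, credentials, paths, and the collection name are placeholders; 8010 is the default Data Hub STAGING app server port.

```sh
# Sketch: ingest JSON files into the STAGING database with MLCP
# (placeholder host/credentials/paths; 8010 is the default staging port)
mlcp.sh import \
  -host localhost -port 8010 \
  -username flow-operator -password '*****' \
  -input_file_path /data/customers \
  -input_file_type documents \
  -output_collections raw-customers \
  -output_uri_replace "/data/customers,'/customer'"
```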
Documentation:

What is the recommended batch size for matching steps?
- The optimal batch size for a matching step varies with the average number of matches expected
- The larger the average number of matches, the smaller the batch size should be
- A batch size of 100 is the recommended starting point (see the run-time example below)
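Batch size is normally configured on the step itself, but if you run flows with the Data Hub Gradle plugin, recent DHF 5.x releases also let you override it at run time. A sketch with hypothetical flow and step names:

```sh
# Sketch: override the matching-step batch size at run time
# (flow name "MasterCustomers" and step number are hypothetical;
# -PflowName/-Psteps/-PbatchSize are hubRunFlow options in recent DHF 5.x)
gradle hubRunFlow -PflowName=MasterCustomers -Psteps="1" -PbatchSize=100
```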
Documentation:

What is the recommended batch size for merging steps?
The merge batch size should always be 1.
Documentation:

How do I kill a long-running flow in Data Hub?
Data Hub does not currently provide a feature to stop or kill a long-running flow.
If you encounter this issue, please provide support with the following information to help us investigate further:
- Error logs and exception traces from the time the job was started
- The job document for the step in question; you can find that document in the data-hub-JOBS database using the job ID:
  - Open Query Console
  - Select the data-hub-JOBS database from the dropdown
  - Click Explore
  - Enter the job ID in the search field and hit enter (e.g., 21d54818-28b2-4e56-bcfe-1b206dd3a10a)
  - You'll see the document in the results
Note: If you want to force a stop, you can restart the Java client process that is running the flow and cancel its outstanding requests from the corresponding app server Status page in the Admin UI.
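Alternatively, if the default data-hub-JOBS app server (port 8013) is exposed as a REST instance, a search sketch like the following can locate the job document. Credentials are placeholders.

```sh
# Sketch: locate a job document by job ID via the Jobs REST instance
# (8013 is the default data-hub-JOBS port; credentials are placeholders)
curl --anyauth -u flow-operator:'*****' \
  "http://localhost:8013/v1/search?q=21d54818-28b2-4e56-bcfe-1b206dd3a10a&format=json"
```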
KB Article:

What do we do if we are receiving SVC-EXTIME errors consistently while running the merging step?
“SVC-EXTIME” generally occurs when a query or other operation exceeds its processing time limit. This error can have various causes, for example:
- Lack of physical resources
- Infrastructure-level slowness
- Network issues
- Server overload
- Document locking issues
Additionally, review your matching step to see how many URIs you are trying to merge in one go.
- Reduce the batch size to a value that balances processing time against the timeout limit
- Modify your matching step to work with fewer matches per run rather than a huge number of matches
- Turning ON the SM-MATCH and SM-MERGE trace events will give a good indication of where the step is getting stuck (see the sketch below). Remember to turn them OFF once the issue has been diagnosed/resolved.
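The trace events can be toggled in the Admin UI (Configure > Groups > Diagnostics), or through the Management API as in the sketch below. This assumes the Management API on port 8002 and the Default group; the PUT payload shape is an assumption, so verify it against the GET output for your MarkLogic version before applying it.

```sh
# Inspect the current group properties, including trace event settings
# (assumes Management API on port 8002, group "Default"; credentials are placeholders)
curl --anyauth -u admin:'*****' \
  "http://localhost:8002/manage/v2/groups/Default/properties?format=json"

# Sketch: enable the mastering trace events -- the payload shape below is an
# assumption; confirm it against the GET output for your version first
curl --anyauth -u admin:'*****' -X PUT -H "Content-Type: application/json" \
  -d '{"events-activated": true, "events": {"event": ["SM-MATCH", "SM-MERGE"]}}' \
  "http://localhost:8002/manage/v2/groups/Default/properties"
```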
Documentation:

What are the best practices for performing Data Hub upgrades?
- Data Hub versions depend on MarkLogic Server versions: if the target Data Hub version requires a different MarkLogic Server version, you MUST upgrade your MarkLogic Server installation before upgrading Data Hub
- Take a backup before upgrading
- Perform extensive testing with all use cases on lower environments
- Review the release notes (some Data Hub upgrades require reindexing), the upgrade documentation, and version compatibility with MarkLogic Server
KB Article:

How can I encrypt my password in Gradle files used for Data Hub?
Store the password as an encrypted Gradle property and reference that property in your configuration files, as shown in the sketch below.
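One common approach is a sketch assuming the third-party Gradle credentials plugin (nu.studer.credentials) is applied in your build; the key name is illustrative.

```sh
# Sketch: encrypt a password with the Gradle credentials plugin
# (assumes the nu.studer.credentials plugin is applied in build.gradle;
# the key name "mlPassword" is illustrative)
gradle addCredentials --key mlPassword --value 'the-real-password'
```

After running addCredentials, the encrypted value lives under your Gradle user home, and build.gradle can resolve the property from the credentials container instead of a plain-text gradle.properties entry.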
Documentation:
Blog:

How can I create a Golden Record using Data Hub?
A golden record is a single, well-defined version of all the data entities in an organizational ecosystem.
- In Hub Central, once you have ingested, mapped, and mastered your data, the documents in the sm-<EntityType>-mastered collection are considered golden records (see the example below)
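For example, the golden records for a hypothetical Customer entity can be pulled from the FINAL database with a collection-scoped search. This is a sketch: 8011 is the default FINAL app server port, and the entity name and credentials are placeholders.

```sh
# Sketch: list mastered ("golden") Customer documents in the FINAL database
# (8011 is the default final port; entity name "Customer" is a placeholder)
curl --anyauth -u flow-operator:'*****' \
  "http://localhost:8011/v1/search?collection=sm-Customer-mastered&format=json"
```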
KB Article:

What authentication methods does Data Hub support?
Data Hub primarily supports basic and digest authentication. The username/password configuration is provided when you deploy your application.

How do I know which MarkLogic Server version is compatible with my Data Hub version?
Refer to the Version Compatibility matrix.
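Recent Data Hub Gradle plugin releases also include a hubVersion task that prints the installed version information; a sketch, assuming your release has the task (run gradle tasks first to confirm):

```sh
# Sketch: print installed Data Hub / MarkLogic version information
# (hubVersion exists in recent DHF releases; run "gradle tasks" to confirm)
gradle hubVersion
```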
Can we deploy multiple DHF projects on the same cluster?
This operation is NOT supported.

Can we perform offline/disconnected Data Hub upgrades?
This is NOT supported, but you can refer to this example to see one potential approach.

TDE Generation in Data Hub
For production purposes, you should configure your own TDEs instead of depending solely on the TDEs generated by Data Hub (which may not be optimized for performance or scale).

Where does Gradle download the dependencies needed to install DHF from?
Below is the list of sites that Gradle will use to resolve dependencies:
- The DHF Gradle plugin is fetched from the Gradle Plugin Portal (https://plugins.gradle.org) by default
- The remaining dependencies are retrieved from Maven Central, plus any additional repositories declared in your build
Gradle build scans are helpful for figuring out what the dependencies are:
- A build scan provides a shareable, centralized record of a build with insights into what happened and why
- You can create build scans and publish the results at https://scans.gradle.com to see where Gradle is trying to download each dependency from, under the "Build Dependencies" section on the results page (see the example below)
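For example, appending Gradle's --scan flag to any task publishes a build scan; mlDeploy here is just an illustrative task.

```sh
# Sketch: publish a build scan to scans.gradle.com to inspect dependency resolution
# (--scan works with any Gradle task; mlDeploy is illustrative)
gradle mlDeploy --scan
```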