Data Hub Framework - FAQ
15 March 2022 12:31 PM

Question

Answer

Further Reading

What is Data Hub?

The MarkLogic Data Hub is an open-source software interface that works to:

  1. ingest data from multiple sources
  2. harmonize that data
  3. master that data
  4. then search and analyze that data

It runs on MarkLogic Server, and together, they provide a unified platform for mission-critical use cases.

Documentation:

How do I install Data Hub?

Please see the referenced documentation: Install Data Hub.

What software is required for Data Hub installation?

Documentation:

What is MarkLogic Data Hub Central?

Hub Central is the Data Hub graphical user interface.

Documentation:

What are the ways to ingest data in Data Hub?

  • Hub Central (note that Quick Start has been deprecated since Data Hub 5.5)
  • Data Hub Gradle Plugin
  • Data Hub Client JAR
  • Data Hub Java APIs
  • Data Hub REST APIs
  • MarkLogic Content Pump (MLCP)

Documentation:

What is the recommended batch size for matching steps?

  • The best batch size for a matching step varies with the average number of matches expected
  • The larger the average number of matches, the smaller the batch size should be
  • A batch size of 100 is the recommended starting point

Documentation:

What is the recommended batch size for merging steps?

The merge batch size should always be 1.

Documentation:

How do I kill a long-running flow in Data Hub?

At the moment, Data Hub has no feature to stop or kill a long-running flow.

If you encounter this issue, please provide support with the following information to help us investigate further:

  • Error logs and exception traces from the time the job was started
  • The job document for the step in question
    • You can find that document in the "data-hub-JOBS" database using the job ID
      • Open Query Console
      • Select the data-hub-JOBS database from the dropdown
      • Click Explore
      • Enter the job ID in the search field and press Enter:
        • E.g.: *21d54818-28b2-4e56-bcfe-1b206dd3a10a*
      • The job document will appear in the results
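
If you would rather fetch the job document over HTTP than through Query Console, the lookup could be sketched in Python as below. This is only an illustration under assumptions: the host, the port (8013), and the /jobs/<job ID>.json URI pattern are not stated in this article, so confirm them in your own environment before relying on them.

```python
from urllib.parse import urlencode


def job_document_url(job_id, host="localhost", port=8013,
                     database="data-hub-JOBS"):
    """Build a MarkLogic REST /v1/documents URL for a job document.

    Assumptions (not taken from this FAQ): the host and port of a REST
    app server you can reach, and the "/jobs/<job ID>.json" URI pattern.
    """
    uri = f"/jobs/{job_id}.json"  # assumed URI pattern, verify locally
    query = urlencode({"uri": uri, "database": database})
    return f"http://{host}:{port}/v1/documents?{query}"


# Job ID taken from the example above
print(job_document_url("21d54818-28b2-4e56-bcfe-1b206dd3a10a"))
```

The function only builds the URL; send the request with whatever HTTP client and credentials your deployment uses.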

Note: To force a flow to stop, you can restart the Java program that is running it and cancel its requests from the corresponding app server status page in the Admin UI.

KB Article:

What do we do if we consistently receive SVC-EXTIME errors while running the merging step?

“SVC-EXTIME” generally occurs when a query or other operation exceeds its processing time limit. This error can have various causes, for example:

  • Lack of physical resources
  • Infrastructure level slowness
  • Network issues
  • Server overload 
  • Document locking issues

Additionally, review your matching step to see how many URIs you are trying to merge at once.

  • Reduce the batch size to a value that balances throughput against the risk of hitting the SVC-EXTIME timeout
  • Modify your matching step to produce fewer matches per run rather than a huge number of matches
  • Turn on the SM-MATCH and SM-MERGE traces to get a good indication of where processing is getting stuck. Remember to turn them off once the issue has been identified or resolved.
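
To see why a smaller batch size helps, the effect can be illustrated with a few lines of Python. This is not how Data Hub batches work internally; it is a generic sketch, and the function name and URIs are hypothetical. It only shows that a smaller batch size splits the same set of URIs into more, smaller units of work, each of which has less to do before the time limit is reached.

```python
def split_into_batches(uris, batch_size):
    """Split a list of URIs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [uris[i:i + batch_size] for i in range(0, len(uris), batch_size)]


# Hypothetical URIs, for illustration only
uris = [f"/customer/{n}.json" for n in range(250)]

large = split_into_batches(uris, 100)  # few batches, more work per batch
small = split_into_batches(uris, 25)   # more batches, less work per batch
print(len(large), len(small))
```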

Documentation:

What are the best practices for performing Data Hub upgrades?

  • Data Hub versions depend on MarkLogic Server versions. If your target Data Hub version requires a different MarkLogic Server version, you MUST upgrade your MarkLogic Server installation before upgrading Data Hub
  • Take a backup
  • Perform extensive testing of all use cases in lower environments
  • Refer to the release notes (some Data Hub upgrades require reindexing), the upgrade documentation, and the version compatibility matrix for MarkLogic Server

KB Article:

How can I encrypt my password in Gradle files used for Data Hub?

You can store the password as an encrypted Gradle property and reference that property in the configuration file.

Documentation:

Blog:

How can I create a Golden Record using Data Hub?

A golden record is a single, well-defined version of all the data entities in an organizational ecosystem.

  • In Hub Central, once you have gone through the ingest, map, and master process, the documents in the sm-<EntityType>-mastered collection are considered golden records
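
As a rough sketch, the collection name for a given entity type, and a REST search URL restricted to that collection, could be built like this in Python. The entity type "Customer", the host, and the port (8011) are assumptions made for illustration; only the sm-<EntityType>-mastered naming comes from the answer above.

```python
from urllib.parse import urlencode


def mastered_collection(entity_type):
    """Return the name of the collection that holds mastered documents."""
    return f"sm-{entity_type}-mastered"


def golden_record_search_url(entity_type, host="localhost", port=8011):
    """Build a MarkLogic REST /v1/search URL limited to the mastered collection.

    Host and port are assumptions; point them at a REST app server that
    serves the database containing your mastered documents.
    """
    query = urlencode({
        "collection": mastered_collection(entity_type),
        "format": "json",
    })
    return f"http://{host}:{port}/v1/search?{query}"


# "Customer" is a hypothetical entity type
print(golden_record_search_url("Customer"))
```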

KB Article:

What authentication method does Data Hub support?

Data Hub primarily supports basic and digest authentication. The username/password configuration is provided when you deploy your application.
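
From a client's point of view, the two schemes differ only in which HTTP authentication handler is used. A minimal Python sketch using the standard library follows; the URL and credentials are placeholders, and nothing here is specific to Data Hub's own tooling.

```python
import urllib.request


def make_opener(base_url, username, password, scheme="digest"):
    """Return a urllib opener that authenticates with basic or digest auth."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None means "use these credentials for any realm at base_url"
    password_mgr.add_password(None, base_url, username, password)

    if scheme == "digest":
        handler = urllib.request.HTTPDigestAuthHandler(password_mgr)
    elif scheme == "basic":
        handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    else:
        raise ValueError(f"unsupported scheme: {scheme}")
    return urllib.request.build_opener(handler)


# Placeholder URL and credentials; no request is sent here
opener = make_opener("http://localhost:8011/", "my-user", "my-password")
```

Call opener.open(url) to send an authenticated request. Pick the scheme that matches the authentication setting of the app server you are calling.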

How do I know which MarkLogic Server version is compatible with my Data Hub version?

Refer to the Version Compatibility matrix.

Can we deploy multiple DHF projects on the same cluster?

This operation is NOT supported.

Can we perform offline/disconnected Data Hub upgrades?

This is NOT supported, but you can refer to this example to see one potential approach.

TDE Generation in Data Hub

For production purposes, you should configure your own TDEs instead of depending solely on the TDEs generated by Data Hub (which may not be optimized for performance or scale).

Where does Gradle download the dependencies needed to install DHF from?

Below is the list of sites that Gradle will use in order to resolve dependencies:

This tool is helpful to figure out what the dependencies are:

  • It provides a shareable, centralized record of a build, with insights into what happened and why
  • You can create build scans with this tool and publish the results at https://scans.gradle.com; the "Build Dependencies" section of the results page shows where Gradle tries to download each dependency from.


