MarkLogic from a Relational Perspective: Part 1
Posted by David Kaaret on 20 January 2020 08:00 AM
MarkLogic in the Technology Landscape
This is the first in a series of blogs for people coming from a relational background, to help you understand the differences in how MarkLogic handles data integration and access.
TL;DR: A number of technologies have been introduced in the past 20 years to address the problems with relational databases. As a true multi-model database, MarkLogic solves the issues of data-integration complexity and scalability while providing data-management capabilities equivalent to what relational provides, and goes even further with an integrated technology stack that includes search, alerting and geospatial.
The Evolving Data Landscape
From the late 1970s or early 1980s to around 2000, relational ruled the database world. There were some object-oriented databases and some holdovers from the hierarchical era, but these were minor.
Around 2000, relational’s hold began to develop cracks, in part because of an inability to scale and the difficulty of integrating multiple data sources and complex data. Scalability suffered because of the extensive, and expensive, joins that relational required: breaking complex entities into rectangles of rows and columns for storage meant joins were needed to pull them back together. Complexity grew both because the underlying data itself was becoming more complex and because firms were merging many overlapping data sets into unified wholes.
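To make the cost of that decomposition concrete, here is a toy Python sketch. All table names, fields and data are invented for illustration; the point is only that the relational layout needs a join to reassemble the entity, while a document layout keeps it whole.

```python
# Relational style: one customer entity split across two "tables" (lists of
# rows), so reconstructing it requires a join on customer_id.
customers = [{"customer_id": 1, "name": "Acme Corp"}]
orders = [
    {"order_id": 100, "customer_id": 1, "total": 250.0},
    {"order_id": 101, "customer_id": 1, "total": 75.0},
]

def join_customer_orders(customers, orders):
    """A naive nested-loop join: the kind of reassembly work a relational
    engine must do, and must do efficiently, for every such query."""
    result = []
    for c in customers:
        for o in orders:
            if o["customer_id"] == c["customer_id"]:
                result.append({**c, **o})
    return result

# Document style: the whole entity lives in one record; no join is needed
# to read the customer together with its orders.
customer_doc = {
    "customer_id": 1,
    "name": "Acme Corp",
    "orders": [
        {"order_id": 100, "total": 250.0},
        {"order_id": 101, "total": 75.0},
    ],
}

joined = join_customer_orders(customers, orders)
```

In a real system the join is far more sophisticated than this nested loop, but the structural point stands: the rows-and-columns layout forces the reassembly step that the document layout avoids.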
In part to handle this, there was a renaissance in data technologies. Many new kinds of technologies (collectively called NoSQL databases) came into common use, including document, search and graph databases. Approaches based on complex event processing (CEP), which can both push notifications to users and be queried, also gained a niche. Each of these technologies solved problems that relational struggled with, and each gained a foothold in the market.
Hadoop also became a powerhouse, offering extremely large-scale, distributed data access and processing.
However, while each new technology provided new capabilities, none could replicate the critical core functionality of relational systems. As a result, relational has remained the primary database technology to this day.
The fundamental issue was that most of the new technologies had difficulties in performing structured queries that combined different data types while filtering within an entity type—the primary relational use case.
Document databases excel at filtering data within a data type. This goes a long way towards solving the issue of data complexity and integration of diverse data sets.
However, document databases are poor at linking data from different entity types (such as queries that pull together all the orders for a customer).
Semantic databases excel at relationships and can easily perform queries joining dissimilar data types.
However, pure semantic solutions are not as scalable as relational databases. They also introduce enormous complexity because each field in every row of data must be converted into its own triple before it can be used. Reassembling the information is complicated, as data originating in a single row or document must be gathered back together from many triples.
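A toy Python sketch shows why the triple count balloons. The rows, predicate names and URIs here are invented; the mechanics, where every non-key field of every row becomes its own (subject, predicate, object) triple, follow the standard RDF shredding pattern.

```python
# Two small "rows" of invented customer data.
rows = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "Globex", "city": "Chicago"},
]

def row_to_triples(row, entity_prefix="customer"):
    """Shred one row: each non-key field becomes its own triple whose
    subject identifies the row."""
    subject = f"{entity_prefix}/{row['id']}"
    return [(subject, predicate, value)
            for predicate, value in row.items() if predicate != "id"]

triples = [t for row in rows for t in row_to_triples(row)]
# 2 rows x 2 non-key fields = 4 triples. A table with 50 columns and a
# million rows would shred into ~50 million triples.

# Reassembling a row means gathering every triple that shares its subject.
by_subject = {}
for s, p, o in triples:
    by_subject.setdefault(s, {})[p] = o
```

Even this tiny example quadruples the record count; the reassembly loop at the end is the per-query price of reading the data back as rows.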
Search and CEP
Search and CEP added valuable capabilities that were fundamentally different from those of traditional relational systems. While they could do things that are difficult in relational, they were in no sense substitutes for it. Search allowed users to access and query data without needing to understand any underlying data structures, or even to have structured data at all. CEP allowed data to be pushed to users when appropriate instead of requiring them to ask for it.
Hadoop provided high levels of scalability, but did so at the cost of pushing functionality from the underlying database technology onto application developers. “Schema-agnostic” and “schemaless” became popular buzzwords as a result. The benefits of a data dictionary were sacrificed in pursuit of maximum scalability.
MarkLogic: True Multi-model
MarkLogic incorporates all of the technologies discussed above, including SQL, into a single, integrated code base. This makes it perhaps the only true “multi-model” database. Others provide “multi-product” offerings where different types of data are stored in separate databases and must mostly be integrated by individual developers. The ability to incorporate multiple technologies into a single, integrated environment allows MarkLogic to solve the complexity and scalability issues of relational while maintaining—and enhancing—its ability to search and query.
This is difficult to do with other NoSQL products. Most are built from open-source components, each developed separately for its technology type. To use multiple technologies in a single application, it is necessary to maintain multiple databases and integrate them yourself. This causes issues in data reliability, disaster recovery and ease of development.
With MarkLogic, there is a single database. A single query can include SQL, SPARQL, structured document filtering and search. MarkLogic’s query planner will parse the query and determine how to implement it. Alerts can be defined to send out notifications when a document is ingested, changed or deleted.
Through the integration of documents and semantics, MarkLogic can filter within an entity type even when documents do not fully share the same underlying schema. With semantics, MarkLogic can join documents of different types, giving it the power of relational joins.
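The documents-plus-triples combination can be sketched in a few lines of Python. This is a conceptual illustration only, not MarkLogic code: the document URIs, the invented `placedBy` predicate and the data are all hypothetical, but they show how a handful of relationship triples let whole documents of different types be joined without a shared schema.

```python
# Whole documents of two different types, keyed by an invented URI.
documents = {
    "customer/1": {"type": "customer", "name": "Acme Corp"},
    "order/100": {"type": "order", "total": 250.0},
    "order/101": {"type": "order", "total": 75.0},
}

# A few triples record only the relationships between documents, rather
# than shredding every field of every document into triples.
triples = [
    ("order/100", "placedBy", "customer/1"),
    ("order/101", "placedBy", "customer/1"),
]

def orders_for(customer_uri):
    """Follow 'placedBy' triples back from orders to a customer,
    returning the full order documents: a relational-style join."""
    return [documents[s] for s, p, o in triples
            if p == "placedBy" and o == customer_uri]

acme_orders = orders_for("customer/1")
```

The design point is that the triples carry only the cross-entity links, so the documents stay intact and the triple store stays small, avoiding the full-shredding cost described for pure semantic solutions above.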
This combination allows MarkLogic to solve the issues of complexity and scalability while providing capabilities equivalent to what relational provides. Search, alerting and geospatial allow it to go even further.
One reason large, complex relational systems are difficult to develop is the waterfall approach to data modeling. The waterfall model requires schemas to be defined before data can be loaded, potentially front-loading months or even years of modeling before other development can fully take place. In the era of complex, overlapping data models, it is important to recognize that the database does not need a full, formal model combining all of the overlapping data sources you use before development begins.
With MarkLogic, you load all of the data as is and then only do the modeling and transforms needed to achieve the objectives of your first deliverable.
Once your first deliverable is done, you build on it to do the modeling/ETL needed for your next one. You are not building a data silo for each application. You are gradually filling out a richer but consolidated data environment.
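The load-as-is workflow can be sketched in Python. The raw records, field names and harmonization rules below are invented for the example; the pattern is what matters: keep every record in its original shape, and harmonize only the fields the first deliverable actually needs.

```python
# Raw records loaded as is from two invented source systems, with
# inconsistent field names and extra fields we don't need yet.
raw_records = [
    {"cust_nm": "Acme Corp", "cust_city": "Boston", "legacy_code": "A17"},
    {"CustomerName": "Globex", "City": "Chicago", "region": "MW"},
]

def first_deliverable_view(record):
    """Harmonize only the two fields the first deliverable needs,
    keeping the untouched raw record alongside for later passes."""
    name = record.get("cust_nm") or record.get("CustomerName")
    city = record.get("cust_city") or record.get("City")
    return {"name": name, "city": city, "raw": record}

views = [first_deliverable_view(r) for r in raw_records]
```

When the next deliverable needs `legacy_code` or `region`, a second small transform is added; nothing loaded earlier has to be remodeled, because the raw data was never thrown away.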
This is how to deliver applications quickly while building a central, enterprise data repository or hub.
I’ll discuss the modeling question in more depth in my next post in this series, “Data Modeling – From Relational to MarkLogic.”
Document technology gives MarkLogic high-performance queries and lets it easily handle data complexity and the integration of overlapping data sets.
Graph technology allows MarkLogic to integrate dissimilar data sets with a power greater than that of relational joins.
With its integrated technology stack, MarkLogic provides capabilities that are superior to relational while overcoming its limitations.
It is the next step beyond relational.