Avoiding the Franken-beast: Polyglot Persistence Done Right
Posted by Damon Feldman on 25 August 2015 07:03 PM
Why should people store many kinds of data in one place? Why not put the many kinds in many places? A customer of mine has learned the answer the hard way by doing the latter, so I thought I’d write it up as a case study so you don’t have to suffer the pain that it had.
Honestly, I am writing this for my benefit too. Watching a team try to integrate three or four different persistence and data processing products (the polyglot part) into some kind of “Franken-beast” is like watching a train full of ageing software architects drive off a bridge in slow motion.
Building the Franken-beast is difficult to do – and difficult to watch
Sometimes you do need separate, discrete systems – but with MarkLogic you can avoid most of that integration, so I’m going to write here about how to choose and why it matters.
Martin Fowler first wrote about “Polyglot Persistence” in a blog post in 2011, and the term has caught on. Today there are 85,000 Google hits on the phrase, and people who think and write about the database world acknowledge that we are all dealing with many different forms of data and that we need many different tools to make that work. Text data, binary data, structured data, semantic data. The term is “rising terminology” and we can all expect to hear a lot about it in the next few years.
The conventional understanding is that to achieve Polyglot Persistence you store each discrete data type in its own discrete technology. But polyglot means “able to speak many languages,” not “able to integrate many components.” This is where my customer got into trouble. The team took the traditional route of using multiple stores for multiple data types. This post covers the impact of storing two types – structured data and binary data – in two different stores instead of one integrated store.
MarkLogic enables polyglot persistence with a multi-model database instead, and at the end of this post I’ll cover how this team is now re-working the system into a simpler form using this capability.
The Example: Excel Worksheets
Back to this particular customer. To protect the innocent, I’m going to pretend this was a mortgage application system that takes in Excel worksheet submissions documenting assets and expenses as someone applies for a mortgage. The Excel worksheet comes in, something extracts the various incomes and expenses to XML, then both get stored to support the mortgage application process.
Once the Excel binary and extracted XML are stored, a workflow engine helps users review and approve (or reject) the submission. This basic flow can be found in many computer systems – from applying for a loan to submitting a proposal to do a big project or to apply to become a preferred vendor.
The part I’ll focus on here is that this class of system stores both the original, “un-structured”, binary data (.xslx, .pdf, .jpg) as well as the structured XML data extracted from the Excel.
Here’s the mindset that got my customer into trouble:
The resulting architecture looked something like this:
The resulting architecture looks something like the above. Simple, right?
Getting the Data Back Out
The first problem showed up as soon as we started to do something with the data. We needed to send it to an external, independent reviewer every week for quality control and to check on some complex business rules outside the scope of my customer’s implementation. These reviewers needed the XML records matching some criteria (let’s pretend it was super-jumbo loans, to fit with our example) and all the extracted data needed to be in zip files limited to 1GB each, uncompressed.
Easy, right? Get the new super-jumbo loans from last week, and stop when you hit 1 GB. Keep going until you have them all. We had a couple of problems:
This is a lot of data at busy times, so they used Spring Batch to coordinate long-running batch processing jobs. Suddenly we have a third component, adding a little more complexity.
This diagram shows how a batch processor grabbed content from MarkLogic and wrote it to the file-system and also grabbed content from Alfresco and wrote that to the file-system. Both were zipped to limit the number of files in one directory (an alternative is to hash the filenames and use the first two digits of the hash to create 100 sub-directories of reasonable size. No free lunch – more complexity either way).
But recall there was a size limit on how much data could be moved and transferred. So we needed another post-process to determine file sizes on disk, and chunk the data into zip files with groups of matching XML and Zip files, totaling about 1GB each.
But wait, there’s more… The first Spring Batch process has to complete before the external python chunking script runs. So we need to connect both to a job scheduler (they used an enterprise Job Control product, but a cron job would do as well). Here is a diagram showing the additional components added to the overall architecture (new items circled in red):
A lot more complex, isn’t it? My customer’s enthusiasm for the original solution was already waning.
Let’s Talk Physical Architecture
Alfresco uses two different stores, and does not manage transactions in them, much less XA transactions with other sub-systems; a relational database for some bookkeeping data and a file system for the actual binaries (because BLObs don’t work well). For the file-system to be reliable we need a clustered, highly-available file-system like GlusterFS. Alfresco internally bundles a number of separate internal components with a core cobbled together from a relational database adapter and an internal build of Lucene for text indexing, plus direct file-system access. Here is an updated diagram with the internal physical and software components introduced by Alfresco:
Remember the overall purpose or use case was “Not That Complicated!” We need to accept binaries and structured data, store it, and send it to a downstream system for verification! That’s it.
As those of you know who run enterprise software, every component in the infrastructure needs operations people and processes. This was a Highly-available (HA) system with Disaster Recovery (DR) requirements.
DR basically means double everything, and configuring some process to constantly move all data to that duplicate infrastructure at a disaster site. MarkLogic was the primary database for the entire system and was replicated. Now that more components have been added to support Alfresco (Oracle, Alfresco, GlusterFS) they had to be configured to replicate so copies of their data was in the DR site at all times.
Is DR even possible with multiple stores?
To make matters (still) worse, we spent a lot of time considering how data would be kept in synch. Every sub-system replicates differently and with different time lags and “Recovery Point Objective” amounts of non-replicated data. When they all come up at the DR site, in the event of a disaster, they will be seconds or even minutes out of synch. We spend a surprising amount of time trying to add designs and governance about this. The time spent had a large opportunity cost – we should have been focused on other things but were distracted by managing data replication across multiple persistent stores.
For HA, we needed to configure product-specific strategies for each major component (Alfresco Oracle/Lucene data, Alfresco binary file data, MarkLogic data).
These “HA Config” boxes on the diagrams were worse than they look. They required duplicate infrastructure to be provisioned and configured for each of the (un-necessary) persistence technologies used. Each required meetings, vendor consultations, design documents and approvals. And testing. Separate High-Availability failover testing for each.
Each of these needs its own trained staff, provisioned hardware, run-book (instructions for what do for normal operations as well as emergencies), test plans, upgrade plans, vendor or expert consultant relationships, and support contracts.
Here is the diagram showing the Disaster Recovery site and the data flows that need to be managed and configured. The new data flows are shown as red arrows, and the new people and roles are inside red circles:
Unbelievable isn’t it? The people needed, processes needed, infrastructure needed, architectural components needed are not usually put on one diagram – for the very reason that humans can’t deal effectively with this level of complexity.
And it only gets worse.
I also spent many months stabilizing and improving another project – HealthCare.gov (Obamacare, as it is usually called). When it went live in poor shape, with bad performance and many errors, the White House sent in a group called the Ad-hoc Team to help fix it. The first thing they did when they came on the scene was buy and install monitoring at every level, in every component. You can’t run a critical enterprise system effectively without knowing what is happening inside your own system, and as experienced operations engineers, they prioritized monitoring at the top of the list.
And our Mortgage Processing application described here also needs monitoring to work well. Unfortunately, by splitting data into multiple systems we had many more sub-systems to monitor. Products like the ones diagrammed so far often have APIs to help connect their internal monitoring and statistics feeds or logs to consolidated monitoring tools. Each of them needs to be monitored to form a common operating picture.
Again, here is a diagram with red circles to show the added monitoring agents that must be developed, maintained and integrated to keep all this stuff working that should not have been necessary. The Enterprise Monitoring product and MarkLogic agent are not circled, because they were going to be necessary even if the data was all stored in MarkLogic, as it should have been.
Deployment, Dev Ops, Automation
Each and every component requires some deployment, version control, testing. Those that have any code or customization must have that code integrated into the build. Experienced dear readers know that the overall maven/ant/gulp or similar system that knits all the components together into a known, testable and valid configuration are hugely complex. They are complex in part due to the number of components involved.
I will spare you another diagram, and just show the dev ops guy who has to build and maintain a pushbutton build with a Continuous Integration and Continuous Deployment capability:
How Not to Suffer Like We Did
So this is a fairly long, detailed story about how a simple requirement became an operational nightmare. I promise you your time has been better spent reading my history of what happened than if you had spent months or years living through it.
Everything (everything!) in the diagrams above is needed to run a system that does this simple submit, review and export task at scale in a mature, HA/DR, enterprise environment. It all started with a need to handle mortgage submissions, and a simple, naïve, dangerous mindset:
Let’s look at what happens if we remove Alfresco and STORE THE #*&!%^!* DATA IN MARKLOGIC, AS THE PRODUCT WAS DESIGNED TO SUPPORT.
This one simplification is not a panacea. Operating an enterprise system is inherently complex and some of that complexity is fundamental complexity. Here is what we would have built and supported using MarkLogic to integrate and manage disparate data types together (binary and XML in this case):
As we can see, reducing the number of components vastly simplifies the overall complexity and amount of work. This is not surprising, and we can consider it an axiom of software integration
The reason this diagram is so much simpler because removing the extra data store (Alfresco in this case) removed all the integration, coordination, deployment, monitoring, configuration and supporting components that went with it.
The Right Answer
MarkLogic is specifically engineered to handle many types of data: XML, Binary, RDF, Text and JSON. The system will be far simpler (as we see above) if MarkLogic is used as an integrated, multi-model store that manages all the data together in one place.
By putting part of the data in one store (MarkLogic) and other parts of the data in another store (Alfresco) the system was vastly over-complicated. The additional complexity was hard to predict at the time, but it is obvious in hindsight.
The Big Picture
Here is where we ended up.
What they wanted to accomplish
What they initially built
What MarkLogic did for them
MarkLogic Corp’s software development philosophy is rare. We use a single C++ core to solve the data management problem in once place, cutting across many data types from Text to RDF and XML to Binary. Most others deliver a “solution” that hides an amalgam of open-source or acquired separate components, which masks and increases complexity rather than eliminating it.
Put another way, MarkLogic database achieves the goal of Polyglot Persistence via the technique of multi-model storage. With MarkLogic you get Polyglot persistence without poly lot of headaches.
But the “philosophy” above is abstract and it is hard to know why it matters without living through concrete implementations. The example recounted above is real. It was not really Mortgages, but all the integrations and impacts actually happened, and you can avoid the frustration that I experienced – and more importantly deliver value to your users and customers without being distracted by myriad integration headaches that really don’t need to happen in the first place.
Read more »
New MarkLogic® Semantics Speeds Application Development and Eases Data Integration
Posted by Paul Hughes on 18 August 2015 02:30 PM
San Carlos, Calif. – August 18, 2015 – MarkLogic Corporation, the leading Enterprise NoSQL database platform provider, today announced new MarkLogic® Semantics features in MarkLogic® 8+, the new-generation database that helps organizations achieve faster time to results, operationalize heterogeneous data, and mitigate risk. With new support for Jena and Sesame APIs, MarkLogic Semantics makes it easier and faster for developers to integrate data and build smarter applications in order to maximize the value of their organizations’ data.
“Enterprises are continuing to struggle with extracting and unlocking the value of unstructured and semi-structured information and to date, very few vendors have addressed this. Semantic technologies can overcome these obstacles by linking and relating unstructured information with the structured data in ways that current database technology just can’t handle,” said David Schubmehl, research director, Content Analytics, Discovery and Cognitive Systems, IDC. “IDC sees semantic capabilities as being crucial to build next generation knowledge bases and solutions that will drive knowledge worker productivity and enterprise value.”
Smart applications prove to be elusive for many organizations today, as 85 percent of companies fail to exploit data for a competitive advantage and data silos are listed as a significant impediment to achieving big data goals. Companies are struggling to integrate and manage data that reside in multiple silos as their relational database technology is often too limiting and inflexible to manage today’s complex data types. Organizations require a different approach to modeling data that focuses on relationships and context and breaks through silos, giving enterprises the ability to better organize data and build smarter applications faster.
The MarkLogic database is the only schema-agnostic Enterprise NoSQL platform that integrates search, semantics, and rich queries, with enterprise features customers require for production applications. MarkLogic ingests data as-is, driving rapid data integration by pulling data from various silos and incorporating all data into one platform for a comprehensive view of information.
“With MarkLogic Semantics, we can link together items for more impactful data discovery and sharing,” said Andrea Powell, CIO, CABI. “For example, our Plantwise KnowledgeBank contains a database of plant disease distribution. Using MarkLogic will help to determine the likelihood that disease present in one country will occur in another. Helping prevent crop losses allows us to better fulfill our mission of solving problems in agriculture around the world.”
“Semantics is a game-changer and our forward-thinking customers and partners understand the that. Organizations ranging from government agencies to banks to media and publishers throughout the world are using MarkLogic Semantics to enrich their data, and ultimately enrich their businesses,” said Joe Pasqua, EVP, products, MarkLogic. “Semantics gives our customers a competitive advantage, creating smart data that helps achieve business goals.”
Read more »