Supported Server Versions
Version                Original Release Date   Current Release   End of Life Date
MarkLogic Server 8.0   February 6, 2015        8.0-3             In Circulation
MarkLogic Server 7.0   November 14, 2013       7.0-5.2           In Circulation
MarkLogic Server 6.0   September 12, 2012      6.0-6             June 26, 2016
Latest Updates
Aug
25
Avoiding the Franken-beast: Polyglot Persistence Done Right
Posted by Damon Feldman on 25 August 2015 07:03 PM

Why should people store many kinds of data in one place? Why not put the many kinds in many places? A customer of mine learned the answer the hard way by doing the latter, so I thought I’d write it up as a case study so you don’t have to suffer the pain that they did.

Honestly, I am writing this for my benefit too. Watching a team try to integrate three or four different persistence and data processing products into some kind of “Franken-beast” is like watching a train full of ageing software architects drive off a bridge in slow motion.

Building the Franken-beast is difficult to do – and difficult to watch

Sometimes you do need separate, discrete systems – but with MarkLogic you can avoid most of that integration, so I’m going to write here about how to choose and why it matters.


Polyglot Persistence

Martin Fowler first wrote about “Polyglot Persistence” in a blog post in 2011[1], and the term has caught on. Today there are 85,000 Google hits on the phrase, and people who think and write about the database world acknowledge that we are all dealing with many different forms of data and that we need many different tools to make that work. Text data, binary data, structured data, semantic data. The term is “rising terminology” and we can all expect to hear a lot about it in the next few years.

The conventional understanding is that to achieve Polyglot Persistence you store each discrete data type in its own discrete technology. But polyglot means “able to speak many languages,” not “able to integrate many components.” This is where my customer got into trouble. The team took the traditional route of using multiple stores for multiple data types. This post covers the impact of storing two types – structured data and binary data – in two different stores instead of one integrated store.

MarkLogic enables multi-model persistence instead, and at the end of this post I’ll cover how this team is now re-working the system into a simpler form using this capability.


The Example: Excel Worksheets

Back to this particular customer. To protect the innocent, I’m going to pretend this was a mortgage application system that takes in Excel worksheet submissions documenting assets and expenses as someone applies for a mortgage. The Excel worksheet comes in, something extracts the various incomes and expenses to XML, then both get stored to support the mortgage application process.

Once the Excel binary and extracted XML are stored, a workflow engine helps users review and approve (or reject) the submission. This basic flow can be found in many computer systems – from applying for a loan to submitting a proposal to do a big project or to apply to become a preferred vendor.

The part I’ll focus on here is that this class of system stores both the original, “unstructured” binary data (.xlsx, .pdf, .jpg) and the structured XML data extracted from the Excel.

Here’s the mindset that got my customer into trouble:

“MarkLogic is the best XML/JSON document store in the world – we get that! But binaries should go into a ‘content system.’ A CMS or DAM. Let’s use Alfresco to store the Excel and some generated PDF notices, and put the XML in MarkLogic – that way we use a best-of-breed system for each type of data!”

The resulting architecture looked something like this:

Simple, right?
High fives all around to celebrate our collective genius.


Getting the Data Back Out

The first problem showed up as soon as we started to do something with the data. We needed to send it to an external, independent reviewer every week for quality control and to check on some complex business rules outside the scope of my customer’s implementation. These reviewers needed the XML records matching some criteria (let’s pretend it was super-jumbo loans, to fit with our example) and all the extracted data needed to be in zip files limited to 1GB each, uncompressed.

Easy, right? Get the new super-jumbo loans from last week, and stop when you hit 1 GB. Keep going until you have them all. We had a couple of problems:

  • Only MarkLogic knows which loans were super-jumbos last week. Alfresco has limited metadata, and the super-jumbo criteria (loan size, jumbo thresholds per zip code) are in the database, not in Alfresco.
  • It’s hard to know where to stop because you need to get the binary to know how big your zip file is.

This is a lot of data at busy times, so they used Spring Batch to coordinate long-running batch processing jobs. Suddenly we have a third component, adding a little more complexity.

This diagram shows how a batch processor grabbed content from MarkLogic and wrote it to the file-system and also grabbed content from Alfresco and wrote that to the file-system. Both were zipped to limit the number of files in one directory (an alternative is to hash the filenames and use the first two digits of the hash to create 100 sub-directories of reasonable size. No free lunch – more complexity either way).
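The hashed-directory alternative mentioned above can be sketched in a few lines of Python. This is only an illustration, not the customer's code: the function name `bucketed_path`, the choice of SHA-256, and the reduction of the hash to a two-digit decimal bucket (so that there are exactly 100 sub-directories) are all assumptions.

```python
import hashlib
from pathlib import PurePosixPath


def bucketed_path(root: str, filename: str) -> PurePosixPath:
    """Return root/<NN>/filename, where NN is a bucket from 00 to 99.

    Hypothetical helper: the hash function and bucket scheme are
    assumptions chosen to give 100 evenly filled sub-directories.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a number in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return PurePosixPath(root) / f"{bucket:02d}" / filename


if __name__ == "__main__":
    # The same filename always lands in the same sub-directory.
    print(bucketed_path("/export", "loan-0001.xml"))
```

Either way, the directory layout becomes one more convention that every reader and writer of the export has to agree on.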

But recall there was a size limit on how much data could be moved and transferred. So we needed another post-process to determine file sizes on disk, and chunk the data into zip files with groups of matching XML and Zip files, totaling about 1GB each.
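To make the chunking step concrete, here is a minimal Python sketch of grouping submissions so that each group stays under a size limit while every XML record stays with its matching binary. It is not the script the team wrote: the `Submission` type, its field names, the greedy strategy, and treating 1 GB as 2**30 bytes are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List

GB = 1024 ** 3  # assumption: 1 GB counted as 2**30 bytes


@dataclass(frozen=True)
class Submission:
    """Hypothetical record: one XML extract plus its matching binary."""
    record_id: str
    xml_bytes: int
    binary_bytes: int

    @property
    def total_bytes(self) -> int:
        return self.xml_bytes + self.binary_bytes


def chunk_submissions(items: Iterable[Submission],
                      limit: int = GB) -> List[List[Submission]]:
    """Greedily group submissions so each chunk totals at most `limit` bytes.

    A pair is never split across chunks. A single submission larger than
    `limit` still gets a chunk of its own.
    """
    chunks: List[List[Submission]] = []
    current: List[Submission] = []
    current_size = 0
    for item in items:
        # Close the current chunk if adding this item would exceed the limit.
        if current and current_size + item.total_bytes > limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += item.total_bytes
    if current:
        chunks.append(current)
    return chunks
```

Note that the sizes have to be known before any grouping can happen, which is exactly why the binaries had to be fetched and written to disk first.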

But wait, there’s more… The first Spring Batch process has to complete before the external Python chunking script runs. So we need to connect both to a job scheduler (they used an enterprise Job Control product, but a cron job would do as well). Here is a diagram showing the additional components added to the overall architecture (new items circled in red):

A lot more complex, isn’t it? My customer’s enthusiasm for the original solution was already waning.


Let’s Talk Physical Architecture

Alfresco uses two different stores: a relational database for some bookkeeping data and a file system for the actual binaries (because BLOBs don’t work well). It does not manage transactions across them, much less XA transactions with other sub-systems. For the file-system to be reliable we need a clustered, highly-available file-system like GlusterFS. Alfresco internally bundles a number of separate internal components with a core cobbled together from a relational database adapter and an internal build of Lucene for text indexing, plus direct file-system access. Here is an updated diagram with the internal physical and software components introduced by Alfresco:

Remember the overall purpose or use case was “Not That Complicated!” We need to accept binaries and structured data, store it, and send it to a downstream system for verification! That’s it.


Operations

As those of you who run enterprise software know, every component in the infrastructure needs operations people and processes. This was a highly available (HA) system with Disaster Recovery (DR) requirements.

Disaster Recovery

DR basically means doubling everything and configuring some process to constantly move all data to that duplicate infrastructure at a disaster site. MarkLogic was the primary database for the entire system and was replicated. Now that more components had been added to support Alfresco (Oracle, Alfresco, GlusterFS), those had to be configured to replicate as well, so that there are copies of their data in the DR site at all times.

Is DR even possible with multiple stores?

To make matters (still) worse, we spent a lot of time considering how data would be kept in synch. Every sub-system replicates differently and with different time lags and “Recovery Point Objective” amounts of non-replicated data. When they all come up at the DR site, in the event of a disaster, they will be seconds or even minutes out of synch. We spent a surprising amount of time trying to add designs and governance around this. The time spent had a large opportunity cost – we should have been focused on other things but were distracted by managing data replication across multiple persistent stores.
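A small Python sketch shows why this is awkward. Given the last moment each store had replicated to the DR site, the newest point at which all stores agree is the oldest of those moments, and everything after it exists in some stores but not others. The function name, store names and timestamps below are hypothetical, not measurements from the project.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple


def recovery_gap(last_replicated: Dict[str, datetime]) -> Tuple[datetime, timedelta]:
    """Return (latest point all stores have reached, skew between stores).

    `last_replicated` maps a store name to the timestamp of the last
    change that reached the DR site for that store.
    """
    oldest = min(last_replicated.values())
    newest = max(last_replicated.values())
    return oldest, newest - oldest


if __name__ == "__main__":
    disaster = datetime(2015, 8, 25, 12, 0, 0)  # hypothetical failure time
    state = {
        "marklogic": disaster - timedelta(seconds=5),
        "oracle": disaster - timedelta(seconds=45),
        "glusterfs": disaster - timedelta(minutes=3),
    }
    point, skew = recovery_gap(state)
    print("all stores agree up to:", point)
    print("stores disagree over a window of:", skew)
```

With a single store there is only one timestamp, so the skew is zero by construction.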

High Availability

For HA, we needed to configure product-specific strategies for each major component (Alfresco Oracle/Lucene data, Alfresco binary file data, MarkLogic data).

These “HA Config” boxes on the diagrams are worse than they look. They require duplicate infrastructure to be provisioned and configured for each of the (un-necessary) persistence technologies used. Each required meetings, vendor consultations, design documents and approvals. And testing. Separate High-Availability failover testing for each.

Process

Each of these needs its own trained staff, provisioned hardware, run-book (instructions for what to do for normal operations as well as emergencies), test plans, upgrade plans, vendor or expert consultant relationships, and support contracts.

Here is the diagram showing the Disaster Recovery site and the data flows that need to be managed and configured. The new data flows are shown as red arrows, and the new people and roles are inside red circles:

Unbelievable, isn’t it? The people needed, processes needed, infrastructure needed, architectural components needed are not usually put on one diagram – for the very reason that humans can’t deal effectively with this level of complexity.

And it only gets worse.


Monitoring

I also spent many months stabilizing and improving another project – HealthCare.gov (Obamacare, as it is usually called). When it went live in poor shape, with bad performance and many errors, the White House sent in a group called the Ad-hoc Team to help fix it. The first thing they did when they came on the scene was buy and install monitoring at every level, in every component. You can’t run a critical enterprise system effectively without knowing what is happening inside your own system, and as experienced operations engineers, they prioritized monitoring at the top of the list.

And our Mortgage Processing application described here also needs monitoring to work well. Unfortunately, by splitting data into multiple systems we had many more sub-systems to monitor. Products like the ones diagrammed so far often have APIs to help connect their internal monitoring and statistics feeds or logs to consolidated monitoring tools. Each of them needs to be monitored to form a common operating picture.

Again, here is a diagram with red circles to show the added monitoring agents that must be developed, maintained and integrated to keep all this stuff working that should not have been necessary. The Enterprise Monitoring product and MarkLogic agent are not circled, because they were going to be necessary even if the data was all stored in MarkLogic, as it should have been.


Deployment, Dev Ops, Automation

Each and every component requires some deployment, version control and testing. Those that have any code or customization must have that code integrated into the build. Experienced readers know that the overall maven/ant/gulp or similar system that knits all the components together into a known, testable and valid configuration is hugely complex. It is complex in part due to the number of components involved.

I will spare you another diagram, and just show the dev ops guy who has to build and maintain a pushbutton build with a Continuous Integration and Continuous Deployment capability:


How Not to Suffer Like We Did

So this is a fairly long, detailed story about how a simple requirement became an operational nightmare. I promise you your time has been better spent reading my history of what happened than if you had spent months or years living through it.

Everything (everything!) in the diagrams above is needed to run a system that does this simple submit, review and export task at scale in a mature, HA/DR, enterprise environment. It all started with a need to handle mortgage submissions, and a simple, naïve, dangerous mindset:

“Let’s use Alfresco to store the Excel and PDFs and put the XML in MarkLogic – that way we use a best-of-breed system for each type of data!”

Let’s look at what happens if we remove Alfresco and STORE THE #*&!%^!* DATA IN MARKLOGIC, AS THE PRODUCT WAS DESIGNED TO SUPPORT.

This one simplification is not a panacea. Operating an enterprise system is inherently complex and some of that complexity is fundamental complexity. Here is what we would have built and supported using MarkLogic to integrate and manage disparate data types together (binary and XML in this case):

As we can see, reducing the number of components vastly reduces the overall complexity and amount of work. This is not surprising, and we can consider it an axiom of software integration:

The complexity of a software integration project varies dramatically with the number of components that need to be integrated.

The reason this diagram is so much simpler is that removing the extra data store (Alfresco in this case) removed all the integration, coordination, deployment, monitoring, configuration and supporting components that went with it.


The Right Answer

MarkLogic is specifically engineered to handle many types of data: XML, Binary, RDF, Text and JSON. The system will be far simpler (as we see above) if MarkLogic is used as an integrated, multi-model store that manages all the data together in one place.

By putting part of the data in one store (MarkLogic) and other parts of the data in another store (Alfresco) the system was vastly over-complicated. The additional complexity was hard to predict at the time, but it is obvious in hindsight.


The Big Picture

Here is where we ended up.

What they wanted to accomplish

What they initially built

What MarkLogic did for them


MarkLogic’s software development philosophy is rare. We use a single C++ core to solve the data management problem in one place, cutting across many data types from Text to RDF and XML to Binary. Most others deliver a “solution” that hides an amalgam of open-source or acquired separate components, which masks and increases complexity rather than eliminating it.

Put another way, MarkLogic achieves the goal of Polyglot Persistence via the technique of multi-model storage.

But the “philosophy” above is abstract and it is hard to know why it matters without living through concrete implementations. The example recounted above is real. It was not really Mortgages, but all the integrations and impacts actually happened, and you can avoid the frustration that I experienced – and more importantly deliver value to your users and customers without being distracted by myriad integration headaches that really don’t need to happen in the first place.




Aug
18

MarkLogic 8 Enhancements Help Developers Create Smarter Applications Faster

San Carlos, Calif. – August 18, 2015 – MarkLogic Corporation, the leading Enterprise NoSQL database platform provider, today announced new MarkLogic® Semantics features in MarkLogic® 8+, the new-generation database that helps organizations achieve faster time to results, operationalize heterogeneous data, and mitigate risk. With new support for Jena and Sesame APIs, MarkLogic Semantics makes it easier and faster for developers to integrate data and build smarter applications in order to maximize the value of their organizations’ data.

“Enterprises are continuing to struggle with extracting and unlocking the value of unstructured and semi-structured information and to date, very few vendors have addressed this. Semantic technologies can overcome these obstacles by linking and relating unstructured information with the structured data in ways that current database technology just can’t handle,” said David Schubmehl, research director, Content Analytics, Discovery and Cognitive Systems, IDC. “IDC sees semantic capabilities as being crucial to build next generation knowledge bases and solutions that will drive knowledge worker productivity and enterprise value.”

Smart applications prove to be elusive for many organizations today, as 85 percent of companies fail to exploit data for a competitive advantage and data silos are listed as a significant impediment to achieving big data goals. Companies are struggling to integrate and manage data that reside in multiple silos as their relational database technology is often too limiting and inflexible to manage today’s complex data types. Organizations require a different approach to modeling data that focuses on relationships and context and breaks through silos, giving enterprises the ability to better organize data and build smarter applications faster.

The MarkLogic database is the only schema-agnostic Enterprise NoSQL platform that integrates search, semantics, and rich queries, with enterprise features customers require for production applications. MarkLogic ingests data as-is, driving rapid data integration by pulling data from various silos and incorporating all data into one platform for a comprehensive view of information.

MarkLogic Semantics adds additional value by linking the data together and providing context for better insight into the business. Because MarkLogic built semantics into the database, MarkLogic can connect all this disparate data together in one platform, eliminating the need to integrate multiple systems and perform costly ETL processes between them. It is the only platform that can store and query a combination of documents, data, and RDF triples (the language of semantics, sometimes called linked data) as the data resides in a single unified platform. This makes search more effective and faster than traversing across various data silos, and delivers information within context, resulting in better insights. With a single platform, users have flexibility in choosing the document data model, RDF model or mix of models that works best to store their data, and they can query across everything holistically. And, in MarkLogic, users can choose how to query the data, using JavaScript, XQuery, or SPARQL, or even a combination of languages. The flexible data model and easy-to-develop query ability speed data integration and application development.

The new MarkLogic 8+ database adds additional semantics and search value to the platform by supporting the standard Jena and Sesame APIs. These building blocks simplify and speed semantics application development. The support also makes it easier for developers, customers and partners to migrate existing semantics applications to MarkLogic so they can gain access to all of the other MarkLogic capabilities. A new rich search API also makes it easier for JavaScript developers to create sophisticated apps faster using a style that is very efficient, familiar, and natural. Application development is further accelerated by eliminating the need to use ETL tools, as data is ingested “as is” into MarkLogic.

“With MarkLogic Semantics, we can link together items for more impactful data discovery and sharing,” said Andrea Powell, CIO, CABI. “For example, our Plantwise KnowledgeBank contains a database of plant disease distribution. Using MarkLogic will help to determine the likelihood that disease present in one country will occur in another. Helping prevent crop losses allows us to better fulfill our mission of solving problems in agriculture around the world.”

“Semantics is a game-changer and our forward-thinking customers and partners understand that. Organizations ranging from government agencies to banks to media and publishers throughout the world are using MarkLogic Semantics to enrich their data, and ultimately enrich their businesses,” said Joe Pasqua, EVP, products, MarkLogic. “Semantics gives our customers a competitive advantage, creating smart data that helps achieve business goals.”






Copyright © 2015 MarkLogic Corporation. All Rights Reserved. MARKLOGIC® is a registered trademark of MarkLogic Corporation.