Design Patterns: The Envelope Design Pattern
Posted by Damon Feldman on 10 August 2018 01:25 AM
MarkLogic design patterns are reusable solutions for many of the commonly occurring problems encountered when designing MarkLogic applications. These patterns may be unique to applications on MarkLogic or may be industry patterns that have MarkLogic-specific considerations. Unlike recipes, MarkLogic design patterns are generally more abstract and applicable in multiple scenarios.

Envelope Design Pattern

Intent

Separate data intended for consumption by external processes from data intended to make the MarkLogic database system more powerful and flexible. Create an overall envelope parent element or object that contains a “headers” section and an “instance” data section, which are separate within the document. This aligns with how the MarkLogic Data Hub Framework and MarkLogic Entity Services create their envelopes.

Motivation

Additional, Richer Indexing

In MarkLogic, the JSON or XML document that is stored becomes the “interface to the indexes.” This means that when you add an element to an XML document, or add a nested object to a JSON structure, you are also causing that data and its relationship to parents and values to be indexed. Thus, the primary mechanism for adding indexed data is simply adding elements and nested objects.

It is useful to separate the data that MarkLogic itself uses from the data that is stored because your data services API needs it. Data services may want data that:
In contrast, MarkLogic processing and indexing may improve if data:
While the same goal can be achieved using extensive transforms on the data as it is ingested, and then again as it is accessed by the data services, it is much more efficient and clearer to have the externally-accessed “core” data stored as-is.

For the sake of simplicity, we will discuss this pattern in XML terms (documents, elements and nodes) going forward, but all concepts also apply to JSON.

In summary, we address two conflicting goals by using the envelope pattern:
Some systems have additional goals as well, such as multiple APIs that consume data in radically different formats (e.g., CSV vs. JSON, XML vs. RDF). In that case, there may be more sections than the headers and instance sections (such as “triples” or “html-preformatted”).

Integrating Heterogeneous Data (“Silo Busting”)

This pattern is often used to quickly integrate large sets of data into MarkLogic. With this approach, raw or “good enough” data is directly ingested into MarkLogic, and a relatively small number of elements are initially included in the “headers” section to maintain uniform indexing, retrieval, and analysis across many data sets. All data in the “instance” section can be accessed, rendered using default rendering, exported, and managed, and the system accessing the data can be developed very quickly using the most valuable data first. When used with the Data Hub Framework, where raw content is initially ingested into a Staging database, the “instance” section would include more uniform or harmonized data.

Preserve System Flexibility

Keeping data used purely by MarkLogic processes separate from data accessed by data services allows developers to add data to the “headers” section as needed without breaking external layers or sub-systems. This can reduce the time needed to analyze, re-code, test, and coordinate on large projects.

Applicability

The envelope design pattern should generally be used in all designs. You should have a specific and compelling reason not to use this pattern before omitting it from your design. We recommend using this pattern when:
Participants
Collaborations

“All access through a service” is a pattern that ensures that all updates add the “headers” section and that all queries remove it. This makes the “headers” section invisible to callers, preserving flexibility within the MarkLogic data layer (within .js and .xqy code inside MarkLogic itself).

Consequences

Adding indexable data is decoupled from the format of the data returned to callers. A change to headers will not be externally visible to clients depending on the “instance” data.
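The “all access through a service” collaboration can be sketched in a few lines of plain JavaScript. This is only an illustration over JSON-style objects, not MarkLogic API code: the in-memory `store`, the function names, the `revised` field, and its MM/DD/YYYY format are all assumptions made for the example. The point it shows is that writes add a “headers” section, queries use it, and reads hand back only the untouched “instance” data.

```javascript
// Hypothetical in-memory stand-in for the database: URI -> envelope document.
const store = new Map();

// Convert an assumed caller format "MM/DD/YYYY" into ISO "YYYY-MM-DD".
// ISO date strings sort lexicographically, which makes range checks simple.
function toIsoDate(usDate) {
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(usDate || "");
  return m ? `${m[3]}-${m[1]}-${m[2]}` : null;
}

// Write path: wrap the caller's document and derive header data from it.
// The instance is stored exactly as the caller supplied it.
function insertDocument(uri, instance) {
  const headers = {};
  const iso = toIsoDate(instance.revised);
  if (iso !== null) {
    headers.revisionDate = iso;
  }
  store.set(uri, { envelope: { headers, instance } });
}

// Read path: strip the envelope so callers never see the headers.
function getDocument(uri) {
  const doc = store.get(uri);
  return doc ? doc.envelope.instance : undefined;
}

// Query path: match on header data, return instance data only.
function queryByRevisionDate(from, to) {
  const results = [];
  for (const { envelope } of store.values()) {
    const d = envelope.headers.revisionDate;
    if (d !== undefined && d >= from && d <= to) {
      results.push(envelope.instance);
    }
  }
  return results;
}

insertDocument("/articles/a1.json", { title: "Envelope Pattern", revised: "01/15/2002" });
insertDocument("/articles/a2.json", { title: "Another Article", revised: "03/02/2002" });

// One match: the "Envelope Pattern" article, returned without its headers.
const jan2002 = queryByRevisionDate("2002-01-01", "2002-01-31");
console.log(jan2002);
```

Because callers only ever touch `insertDocument`, `getDocument`, and `queryByRevisionDate`, a new header field could be added inside `insertDocument` without any caller noticing, which is the flexibility the pattern is meant to preserve.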
Implementation

Consider the following issues when implementing the envelope pattern:
Sample Code

Article Repository

Consider a set of articles like this one in XML format that need to be stored, searched, and accessed:

Figure 1 shows a simplified approximation of the DocBook schema. Let’s assume that callers need this data in this exact format or it will be considered invalid.

There are two problems you should consider if you want to search or facet using a range index on the revision date. First, the desired data is in a non-specific [...]

Code in Figure 2 extracts a transformed/formatted version of the date and creates a more specifically-named element in another namespace, meta:revisionDate.

Now, to search for all articles in January of 2002, we would add a date range index to meta:revisionDate and run a query such as this:

    xquery version "1.0-ml";

    declare namespace es = "http://marklogic.com/entity-services";
    declare namespace meta = "http://marklogic.com/patternExample/meta";

    (: generic function to query documents, including headers,
       but return only the instance data :)
    declare function es:queryData($q)
    {
      for $envelope in cts:search(/es:envelope, $q)
      return $envelope/es:instance/element()
    };

    (: lower and upper bounds on the derived revision date held in the headers :)
    let $fromQ := cts:element-range-query(xs:QName("meta:revisionDate"), ">=", xs:date("2002-01-01"))
    let $toQ := cts:element-range-query(xs:QName("meta:revisionDate"), "<=", xs:date("2002-01-31"))
    let $jan2002Q := cts:and-query(($fromQ, $toQ))
    return es:queryData($jan2002Q)

Note that the function es:queryData searches whole envelopes, headers included, but returns only the contents of the instance section.

Social Network Relationships

For data representing profiles in a social network, such as LinkedIn or Facebook, we may store a person’s profile as XML, but their relationships as RDF. The RDF may go in the “triples” section.

Here is a hypothetical person profile in a social network application:

Each user is ideally modeled as a document, because it is self-contained and hierarchical.
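To make the document view concrete, here is a minimal JavaScript sketch of such a profile as a JSON-style object. Alfred's values mirror the person record used in this example; the other two profiles, the field names, and both helper functions are hypothetical. It contrasts a question that a single self-contained document can answer with a relationship question that has to look across documents.

```javascript
// Alfred's values follow the person record in this example.
// The other two profiles are hypothetical, added so there is something to scan.
const profiles = [
  {
    name: "Alfred",
    uniqueUserName: "Alfred_Jones_1974",
    interests: [
      { topic: "Semantics", levelOfInterest: 7 },
      { topic: "MarkLogic", levelOfInterest: 10 },
      { topic: "Polyglot Persistence", levelOfInterest: 3 },
    ],
    friends: ["Sally2227", "MargaretTheProgrammer", "Neeraj"],
  },
  { name: "Sally", uniqueUserName: "Sally2227", interests: [], friends: ["Neeraj"] },
  { name: "Neeraj", uniqueUserName: "Neeraj", interests: [], friends: [] },
];

// Document-style question: answered entirely from one profile.
// Returns the topic with the highest level of interest, or null if none.
function topInterest(profile) {
  let best = null;
  for (const interest of profile.interests) {
    if (best === null || interest.levelOfInterest > best.levelOfInterest) {
      best = interest;
    }
  }
  return best ? best.topic : null;
}

// Relationship-style question: requires checking every profile's friends list.
function whoLists(allProfiles, userName) {
  return allProfiles
    .filter((p) => p.friends.includes(userName))
    .map((p) => p.uniqueUserName);
}

console.log(topInterest(profiles[0]));
console.log(whoLists(profiles, "Neeraj"));
```

The second function is the kind of lookup that gets awkward as relationships multiply, since the answer is spread across many documents rather than held in one.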
However, the social network itself is a graph, so the relationship data is ideally modeled using RDF triples:

    Alfred <foaf:knows> Sally
    Alfred <foaf:knows> Margaret
    Alfred <foaf:knows> Neeraj

To augment the profile in Figure 3 with semantic triple information about the social network “Alfred” is part of, run this code when each document is inserted or updated:

Running the code in Figure 4 results in the structure we want: the “person” record is left as-is, bundled into an envelope with semantic triples that describe the social network derived from this profile:

    <es:envelope xmlns:es="http://marklogic.com/entity-services">
      <es:triples>
        <sem:triple xmlns:sem="http://marklogic.com/semantics">
          <sem:subject>Alfred_Jones_1974</sem:subject>
          <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
          <sem:object>Sally2227</sem:object>
        </sem:triple>
        <sem:triple xmlns:sem="http://marklogic.com/semantics">
          <sem:subject>Alfred_Jones_1974</sem:subject>
          <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
          <sem:object>MargaretTheProgrammer</sem:object>
        </sem:triple>
        <sem:triple xmlns:sem="http://marklogic.com/semantics">
          <sem:subject>Alfred_Jones_1974</sem:subject>
          <sem:predicate>http://xmlns.com/foaf/0.1/knows</sem:predicate>
          <sem:object>Neeraj</sem:object>
        </sem:triple>
      </es:triples>
      <es:instance>
        <sn:person xmlns:sn="http://marklogic.com/patterns/example/social-network">
          <sn:name>Alfred</sn:name>
          <sn:uniqueUserName>Alfred_Jones_1974</sn:uniqueUserName>
          <sn:interests>
            <sn:interest levelofinterest="7">Semantics</sn:interest>
            <sn:interest levelofinterest="10">MarkLogic</sn:interest>
            <sn:interest levelofinterest="3">Polyglot Persistence</sn:interest>
          </sn:interests>
          <sn:friends>
            <sn:friend>Sally2227</sn:friend>
            <sn:friend>MargaretTheProgrammer</sn:friend>
            <sn:friend>Neeraj</sn:friend>
          </sn:friends>
        </sn:person>
      </es:instance>
    </es:envelope>

This example is slightly different from the article repository example in that we introduce a triples section to highlight its
purpose. The instance section is simply the original “person” record.

Related Patterns

Related patterns (TBD) include all patterns to add data outside of the actual documents being inserted and returned. These include patterns to store additional information in the URI scheme, collections, properties fragments, or RDF triples.

Uses

The envelope pattern has become ubiquitous in MarkLogic implementations. The pattern is leveraged heavily in the MarkLogic Data Hub Framework, and is likely found in any MarkLogic implementation that involves data integration.

The post Design Patterns: The Envelope Design Pattern appeared first on MarkLogic.