Knowledgebase:
cts:highlight with overlapping matches
06 December 2016 07:18 PM

 

Problem:

When a search combines word queries with cts:or-query and the matches overlap (that is, the text of one query contains the text of another), cts:highlight does not produce the results you might expect.

 

For example:

 

let $p := <p>From the memoirs of an accomplished artist</p>
let $query :=
  cts:or-query((
    cts:word-query("accomplished artist"),
    cts:word-query("memoirs of an accomplished artist")))
return cts:highlight($p, $query, <m>{$cts:text}</m>)

 

The desired outcome of this would be:

    <p>From the <m>memoirs of an accomplished artist</m></p>

Whereas the actual results are:

    <p>From the <m>memoirs of an </m><m>accomplished artist</m></p>

 

This behavior is by design, and the results are expected: cts:highlight breaks overlapping areas into separate matches.

Two of the cts:highlight built-in variables, $cts:queries and $cts:action, help explain how this works and provide a way to work around the problem.

  $cts:queries --> returns the queries that matched the current piece of matched text.

  $cts:action --> can be set with xdmp:set to specify what should happen next:

  • "continue" - (default) Walk the next match. If there are no more matches, return all evaluation results.
  • "skip" - Skip walking any more matches and return all evaluation results.
  • "break" - Stop walking matches and return all evaluation results.
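
As an illustration of $cts:action, the following minimal sketch highlights the first match and then asks cts:highlight to stop walking. The sample paragraph and the single-word query are assumptions made for this illustration and are not part of the original query:

```xquery
xquery version "1.0-ml";

(: Hypothetical sample text in which the word "artist" occurs twice. :)
let $p := <p>An artist admires another artist</p>
return
  cts:highlight($p, cts:word-query("artist"),
    (
      (: Wrap the current match in a highlight element... :)
      <m>{$cts:text}</m>,
      (: ...then tell cts:highlight to stop walking further matches. :)
      xdmp:set($cts:action, "break")
    ))
```

Because "break" is set while the first match is being processed, the second occurrence of "artist" should be left unhighlighted.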

For example, replacing the return statement in the original query with the following:

return
  cts:highlight($p, $query,
    <m>{
      $cts:text,
      <number-of-matches>{count($cts:queries)}</number-of-matches>,
      <matched-by>{$cts:queries}</matched-by>
    }</m>)

 

==>

 

<p>From the
  <m>memoirs of an
    <number-of-matches>1</number-of-matches>
    <matched-by>
      <cts:word-query xmlns:cts="http://marklogic.com/cts">
        <cts:text xml:lang="en">memoirs of an accomplished artist</cts:text>
      </cts:word-query>
    </matched-by>
  </m>
  <m>accomplished artist
    <number-of-matches>2</number-of-matches>
    <matched-by>
      <cts:word-query xmlns:cts="http://marklogic.com/cts">
        <cts:text xml:lang="en">memoirs of an accomplished artist</cts:text>
      </cts:word-query>
      <cts:word-query xmlns:cts="http://marklogic.com/cts">
        <cts:text xml:lang="en">accomplished artist</cts:text>
      </cts:word-query>
    </matched-by>
  </m>
</p>

 

These results show how the text is being matched. The text "accomplished artist" is matched by both word queries ("accomplished artist" and "memoirs of an accomplished artist"), whereas "memoirs of an " is matched only by the longer query. This is why cts:highlight emits two separate <m> elements instead of one.
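
For a more compact view of the same information, the matching phrases can be written into an attribute instead of serializing the full query elements. This is only a sketch: the attribute name "by" is a hypothetical choice, and the code assumes that every entry in $cts:queries is a cts:word-query, so that cts:word-query-text can be applied to it:

```xquery
xquery version "1.0-ml";

let $p := <p>From the memoirs of an accomplished artist</p>
let $query :=
  cts:or-query((
    cts:word-query("accomplished artist"),
    cts:word-query("memoirs of an accomplished artist")))
return
  cts:highlight($p, $query,
    (: "by" (a hypothetical attribute name) lists the phrase of every query
       that matched this piece of text, separated by " | ". :)
    <m by="{ string-join(
               for $q in $cts:queries return cts:word-query-text($q),
               ' | ') }">{$cts:text}</m>)
```

Each <m> element then carries the phrases that matched it, which makes the overlapping area easy to spot.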

To work around this problem, we can add a small piece of logic to the highlight expression:

 

let $p := <p>From the memoirs of an accomplished artist</p>
let $query :=
  cts:or-query((
    cts:word-query("accomplished artist"),
    cts:word-query("memoirs of an accomplished artist")))
return
  cts:highlight($p, $query,
    if (count($cts:queries) gt 1)
    then
      (: Overlapping area matched by both queries: emit nothing and move on. :)
      xdmp:set($cts:action, "continue")
    else
      (: Area matched by a single query: emit that query's full phrase. :)
      let $matched-text := <x>{$cts:queries}</x>/cts:word-query/cts:text/data(.)
      return <m>{$matched-text}</m>)

 

==>

 

<p>From the <m>memoirs of an accomplished artist</m></p>

 

 

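Please note that this solution relies on assumptions about what is inside the or-query: here, that any area matched by a single query belongs to the longer phrase, and that the area matched by both queries can be dropped because the longer phrase already contains it. The example can be modified to handle other overlapping situations.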

 

   

 



