Knowledgebase:
Understanding Term Frequency rules for relevance calculations
02 June 2023 03:27 PM

Introduction

In the past, we have heard concerns from customers, including questions raised on Stack Overflow, with regard to exactly how our scoring works.

As this area of the product has been a source of confusion in the past, the goal of this Knowledgebase article is to collate several additional resources on MarkLogic's scoring algorithm into one article and, in doing so, to offer some additional pointers to make search scoring less opaque to our users.

Understanding relevance scoring

The default relevance scoring mechanism in MarkLogic for a call to cts:search is logtfidf (Term Frequency / Inverse Document Frequency). From our documentation:

The logtfidf method of relevance calculation is the default relevance calculation, and it is the option score-logtfidf of cts:search. The logtfidf method takes into account term frequency (how often a term occurs in a single fragment) and document frequency (in how many documents does the term occur) when calculating the score.

See: http://docs.marklogic.com/guide/search-dev/relevance#id_66768
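For example, the following sketch requests the default method explicitly and returns each matching document's URI and score. Passing "score-logtfidf" here is equivalent to omitting the option altogether; the search term "dog" is just a placeholder:

xquery version "1.0-ml";
(: "score-logtfidf" is the default, so passing it explicitly
   is the same as omitting the option entirely :)
for $result in cts:search(fn:doc(), "dog", "score-logtfidf")
return <hit uri="{xdmp:node-uri($result)}" score="{cts:score($result)}"/>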

This can lead to an assumption that MarkLogic Server uses the following algorithm to define its relevance scoring (log is natural logarithm, base e):

log(1/term frequency) * log(1/document frequency)

However, this is an over-simplified view of how the scoring really works. According to the documentation at https://docs.marklogic.com/guide/search-dev/relevance#id_74166, the logtfidf method (the default scoring method) uses the following formula to calculate relevance:

log(term frequency) * (inverse document frequency)
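To make the formula concrete, here is a minimal sketch of that calculation in XQuery. The counts are hypothetical, and (as described below) the server itself computes scores with scaled, stepped integer arithmetic, so this raw floating-point value will not match a real cts:score:

xquery version "1.0-ml";
(: Hypothetical counts; MarkLogic's actual scores are scaled,
   stepped integers, so this raw value will not match cts:score :)
let $term-frequency := 3       (: occurrences of the term in one fragment :)
let $total-docs := 1000        (: total documents in the database :)
let $docs-with-term := 10      (: documents containing the term :)
let $idf := math:log($total-docs div $docs-with-term)
return math:log($term-frequency) * $idf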

MarkLogic calculates its scores using scaled, stepped integer arithmetic. If you look at the database status page for a given database, you may notice that one of the configuration options is called "tf normalization"; by default, this is set to scaled-log.

In practice this means that, for small data sets and documents, you may not see much difference in the scores the server computes: small differences in term density can collapse into the same stepped integer score.

Our documentation describes the effect that tf normalization would have on scoring:

The scoring methods that take into account term frequency (score-logtfidf and score-logtf) will, by default, normalize the term frequency (how many search term matches there are for a document) based on the size of the document. The idea of this normalization is to take into account how frequently a term occurs in the document, relative to the other documents in the database. You can think of this as the density of terms in a document, as opposed to simply the frequency of the terms. The term frequency normalization makes a document that has, for example, 10 occurrences of the word "dog" in a 10,000,000-word document have a lower relevance than a document that has 10 occurrences of the word "dog" in a 100-word document. With the default term frequency normalization of scaled-log, the smaller document would have a higher score (and therefore be more relevant to the search), because it has a greater 'term density' of the word "dog". For most search applications, this behavior is desirable.

Source: https://docs.marklogic.com/guide/search-dev/relevance#id_40969
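If you want to experiment with this setting programmatically rather than through the Admin UI, a sketch along the following lines should work. Note the assumptions: it assumes the Admin API setter for this option is admin:database-set-tf-normalization, that "unscaled-log" is among the legal values, and it uses "Documents" as a placeholder database name; verify all of these against the Admin API documentation for your server version.

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";
(: Assumption: admin:database-set-tf-normalization is the setter for the
   "tf normalization" option; "Documents" is a placeholder database name :)
let $config := admin:get-configuration()
let $config := admin:database-set-tf-normalization(
                 $config, xdmp:database("Documents"), "unscaled-log")
return admin:save-configuration($config)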

Example: putting it to the test

Consider the following XQuery example. (This is a minimal sketch: the document contents are illustrative, chosen so that the counts and densities of the word "fun" match those discussed below.)
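xquery version "1.0-ml";
(: Create five small documents with varying counts and densities of the
   word "fun". In Query Console, the ";" below separates this insert from
   the search so that they run as two transactions. :)
(
  xdmp:document-insert("/doc1.xml",
    <doc>The quick brown fox thought jumping over the lazy dog was great fun today</doc>),
  xdmp:document-insert("/doc2.xml",
    <doc>this is fun fun and more fun</doc>),
  xdmp:document-insert("/doc3.xml",
    <doc>that was fun</doc>),
  xdmp:document-insert("/doc4.xml",
    <doc>fun</doc>),
  xdmp:document-insert("/doc5.xml",
    <doc>fun times mean more fun times</doc>)
)
;
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
(: Run the query; results come back in relevance order :)
search:search("fun")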

If you view the example documents above, the respective densities of the word "fun" (as it appears within each given document) are:

URI        Density
/doc1.xml  1/14
/doc2.xml  3/7
/doc3.xml  1/3
/doc4.xml  1/1
/doc5.xml  1/3

If you were to run this code, you could therefore expect the documents to be ordered as follows (when ordered by relevance):

  1. /doc4.xml
  2. /doc2.xml
  3. /doc3.xml and /doc5.xml (tied)
  4. /doc1.xml

Instead, what you actually see in the search:search output is:

URI        Score
/doc2.xml  3072
/doc5.xml  2816
/doc4.xml  2048
/doc1.xml  2048
/doc3.xml  2048

This result suggests that the ordering is driven not by density but by the raw count of the term in question (for example, the term "fun" occurs 3 times in /doc2.xml, 2 times in /doc5.xml, and once each in /doc4.xml, /doc1.xml, and /doc3.xml).

What is really happening?

Here, the "relevance-trace" search option, combined with cts:relevance-info, is useful for seeing what is really happening in the score calculations:

xquery version "1.0-ml";
(: the "relevance-trace" option enables cts:relevance-info on results :)
for $x in cts:search(fn:doc(), "fun", "relevance-trace")
return cts:relevance-info($x)

Running the above will return, for each result, a detailed breakdown of how MarkLogic Server derived the score.

Notes on the Inverse Document Frequency calculation

Term frequency concerns itself with how often a term appears in a single document.

Inverse document frequency then weights that by how rare the term is across the database: the smaller the fraction of documents in which the term occurs, the higher the weight.

The IDF portion of the equation attempts to deal with the relative importance of terms. Additionally, it:

    • Only matters when there are multiple terms in a query
    • Depends on statistics across an entire specific collection
    • Testing on small collections may give misleading answers
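To see that weighting numerically, here is a minimal sketch with hypothetical document counts: a term that occurs in nearly every document contributes almost nothing, while a rare term is weighted heavily:

xquery version "1.0-ml";
(: Hypothetical counts for a 1,000-document database :)
let $total-docs := 1000
let $common-term-docs := 990   (: e.g. "the": found in almost every document :)
let $rare-term-docs := 5       (: a rare, discriminating term :)
return (
  math:log($total-docs div $common-term-docs),  (: ~0.01 - near-zero weight :)
  math:log($total-docs div $rare-term-docs)     (: ~5.3 - strong weight :)
)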

Further reading

The documentation pages cited above cover MarkLogic's relevance calculation in more depth:

    • http://docs.marklogic.com/guide/search-dev/relevance#id_66768
    • https://docs.marklogic.com/guide/search-dev/relevance#id_74166
    • https://docs.marklogic.com/guide/search-dev/relevance#id_40969
