Knowledgebase:
Understanding Term Frequency rules for relevance calculations
08 March 2018 06:31 PM

Introduction

In the past, we have heard concerns from customers about exactly how our scoring works, including a number of questions raised on Stack Overflow.

As this area of the product has been a source of confusion, the goal of this Knowledgebase article is to collate several additional resources on MarkLogic's scoring algorithm into one place and, in doing so, to offer some pointers that will hopefully make search scoring less opaque to our users.

Understanding relevance scoring

The default relevance scoring mechanism in MarkLogic for a call to cts:search is logtfidf (Term Frequency / Inverse Document Frequency). From our documentation:

The logtfidf method of relevance calculation is the default relevance calculation, and it is the option score-logtfidf of cts:search. The logtfidf method takes into account term frequency (how often a term occurs in a single fragment) and document frequency (in how many documents does the term occur) when calculating the score.

See: http://docs.marklogic.com/guide/search-dev/relevance#id_66768
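
As a minimal, hypothetical sketch (the search word "dog" and the element name are placeholders, not part of any particular data set), the default method can be requested explicitly and the resulting scores inspected with cts:score:

(: Search for a word with the default scoring method stated explicitly,
   then report each matching document's URI and computed score. :)
for $result in cts:search(fn:doc(), cts:word-query("dog"), "score-logtfidf")
return
  <match uri="{xdmp:node-uri($result)}" score="{cts:score($result)}"/>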

This can lead to the assumption that MarkLogic Server uses the following formula to define its relevance scores:

log(term frequency) * log(1/document frequency)
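
Taken at face value, a back-of-the-envelope version of that formula might look like the sketch below, where every number is made up purely for illustration and nothing here reflects the server's actual arithmetic:

(: Hypothetical inputs, for illustration only :)
let $term-frequency     := 10        (: occurrences of the term in one document :)
let $total-documents    := 1000000   (: documents in the database :)
let $matching-documents := 100       (: documents containing the term :)
let $document-frequency := $matching-documents div $total-documents
return math:log($term-frequency) * math:log(1 div $document-frequency)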

However, this is an over-simplified view of how the scoring really works.

MarkLogic calculates its scores using scaled, stepped integer arithmetic. If you look at the configuration page for a given database, you may notice that one of the options is called "tf normalization"; by default, this is set to scaled-log.

In practice, this means that for small data sets and small documents you may not see much difference in how scores are computed by the server.
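
If you want to check or change this setting programmatically rather than through the Admin Interface, something along the lines of the following sketch should work (the database name "Documents" is only a placeholder, and the query needs to run with administrative privileges):

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Read the current tf normalization setting for a database :)
let $config := admin:get-configuration()
let $db-id  := xdmp:database("Documents")
return admin:database-get-tf-normalization($config, $db-id)

(: Switching to, say, unscaled-log would look like:
   admin:save-configuration(
     admin:database-set-tf-normalization($config, $db-id, "unscaled-log")) :)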

Our documentation describes the effect that tf normalization would have on scoring:

The scoring methods that take into account term frequency (score-logtfidf and score-logtf) will, by default, normalize the term frequency (how many search term matches there are for a document) based on the size of the document. The idea of this normalization is to take into account how frequently a term occurs in the document, relative to the other documents in the database. You can think of this as the density of terms in a document, as opposed to simply the frequency of the terms. The term frequency normalization makes a document that has, for example, 10 occurrences of the word "dog" in a 10,000,000 word document have a lower relevance than a document that has 10 occurrences of the word "dog" in a 100 word document. With the default term frequency normalization of scaled-log, the smaller document would have a higher score (and therefore be more relevant to the search), because it has a greater 'term density' of the word "dog". For most search applications, this behavior is desirable.

Source: https://docs.marklogic.com/guide/search-dev/relevance#id_40969

Example: putting it to the test

Consider the following XQuery example: