Understanding Term Frequency rules for relevance calculations
08 March 2018 06:31 PM
In the past, we have heard concerns from customers about exactly how our scoring works. A couple of examples from Stack Overflow include:
As this area of the product has been a source of confusion in the past, the goal of this Knowledgebase article is to collate a few additional resources on MarkLogic's scoring algorithm into one place and, in doing so, to offer some pointers that will (hopefully) make search scoring less opaque to our users.
Understanding relevance scoring
The default relevance scoring mechanism in MarkLogic for a call to cts:search is logtfidf (Term Frequency / Inverse Document Frequency). From our documentation:
The logtfidf method of relevance calculation is the default relevance calculation, and it is the option score-logtfidf of cts:search. The logtfidf method takes into account term frequency (how often a term occurs in a single fragment) and document frequency (in how many documents does the term occur) when calculating the score.
This can lead to an assumption that MarkLogic Server uses the following algorithm to define its relevance scoring:
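The formula people typically assume is the textbook log-tf/idf: score proportional to log(tf) multiplied by log(N/df). The following Python sketch illustrates that assumed formula only; the function name, corpus size, and constants are hypothetical, and this is not MarkLogic's actual internal arithmetic.

```python
import math

def naive_logtfidf(tf, total_docs, docs_with_term):
    """Textbook log-tf/idf -- the commonly assumed formula,
    NOT MarkLogic's actual scaled, stepped integer arithmetic."""
    if tf == 0 or docs_with_term == 0:
        return 0.0
    idf = math.log(total_docs / docs_with_term)  # rarer term -> larger idf
    return math.log(1 + tf) * idf

# Hypothetical corpus of 1,000 documents; the term occurs in 50 of them.
score = naive_logtfidf(tf=3, total_docs=1000, docs_with_term=50)
print(round(score, 3))
```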
However, this is an over-simplified view of how the scoring really works.
MarkLogic calculates its scores using scaled, stepped integer arithmetic. If you look at the database status page for a given database, you may notice that one of the configuration options is called "tf normalization"; by default, this is set to scaled-log.
What this means in practice is that, particularly for small data sets and small documents, you may see little or no difference between the scores the server computes.
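To see why stepped integer arithmetic can mask differences on small data, consider this illustrative sketch. The scaling constant here is made up purely for demonstration; MarkLogic's actual constants and steps are internal.

```python
import math

def stepped_log(tf, scale=4):
    """Round a scaled log(tf) down to an integer 'step'.
    The scale factor is hypothetical, purely for illustration."""
    return int(scale * math.log(1 + tf))

# Nearby term frequencies can collapse onto the same step, so two
# documents with slightly different match counts get identical scores.
for tf in range(1, 8):
    print(tf, stepped_log(tf))
```

Note, for example, that term frequencies of 5 and 6 land on the same integer step in this sketch, which is the kind of effect that makes scores on small data sets look identical.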
Our documentation describes the effect that tf normalization would have on scoring:
The scoring methods that take into account term frequency (score-logtfidf and score-logtf) will, by default, normalize the term frequency (how many search term matches there are for a document) based on the size of the document. The idea of this normalization is to take into account how frequently a term occurs in the document, relative to the other documents in the database. You can think of this as the density of terms in a document, as opposed to simply the frequency of the terms. The term frequency normalization makes a document that has, for example, 10 occurrences of the word "dog" in a 10,000,000-word document have a lower relevance than a document that has 10 occurrences of the word "dog" in a 100-word document. With the default term frequency normalization of scaled-log, the smaller document would have a higher score (and therefore be more relevant to the search), because it has a greater 'term density' of the word "dog". For most search applications, this behavior is desirable.
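The "term density" idea in the passage above can be sketched directly. This captures the intuition only, not MarkLogic's internal scaled-log normalization formula.

```python
def term_density(tf, doc_word_count):
    """Matches per word: the intuition behind tf normalization.
    Not MarkLogic's actual scaled-log formula."""
    return tf / doc_word_count

# 10 occurrences of "dog" in a 100-word document vs a 10,000,000-word one.
small_doc = term_density(10, 100)
huge_doc = term_density(10, 10_000_000)
print(small_doc > huge_doc)  # prints True: the smaller document is denser
```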
Example: putting it to the test
Consider the following XQuery example:
Looking at the example documents above, the respective densities of the word "fun" (as it appears within each given document) are:
If you were to run this code, you might therefore expect the documents to be returned in the following order (when ordered by relevance):
Instead, what you see in the search:search output is:
This result suggests that the ordering follows not the density but the raw count of the term in question: the term "fun" occurs 3 times in /doc2.xml, 2 times in /doc5.xml, and once each in /doc4.xml, /doc1.xml and /doc3.xml.
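Using the raw counts observed above, the ordering can be reproduced with a simple sort. The counts and URIs are taken from the search output discussed in this article; the sorting code itself is just an illustration of "order by raw count".

```python
# Raw occurrence counts of "fun" per document, as observed in the output.
counts = {"/doc2.xml": 3, "/doc5.xml": 2, "/doc4.xml": 1,
          "/doc1.xml": 1, "/doc3.xml": 1}

# Sorting by raw count (descending) reproduces the observed ordering;
# the single-occurrence documents are tied and not distinguished by count.
order = sorted(counts, key=counts.get, reverse=True)
print(order)
```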
What is really happening?
Here, using relevance trace is useful to see what is really happening in the score calculations:
Running the above will give you a little more detail on how MarkLogic Server derives the score. The formatted output for the first result looks like this:
Notes on the Inverse Document Frequency calculation
Term frequency concerns itself with how often a term appears in the document.
Inverse document frequency divides that by the fraction of documents in which the term occurs, so rarer terms contribute more to the score.
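The IDF piece can be sketched with the textbook formulation, log(N/df). This is illustrative only; MarkLogic's internal computation uses its own scaled arithmetic, and the corpus numbers below are hypothetical.

```python
import math

def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency: log(N / df).
    Illustrative only; MarkLogic's internal computation differs."""
    return math.log(total_docs / docs_with_term)

# In a hypothetical 1,000-document database, a term found in 900
# documents is weighted far less than a term found in only 5.
common = idf(1000, 900)
rare = idf(1000, 5)
print(round(common, 3), round(rare, 3))
```

A term present in every document gets an IDF of zero, which is why very common terms contribute little to relative ranking.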
The IDF portion of the equation attempts to capture the relative importance of terms. Additionally, it: