Scoring documents

Full-text scoring functions

When searching, documents are scored based on their relevance to the query. The score is a floating-point number between 0.0 and 1.0, where 1.0 is the highest score. Scores are returned as part of the search results and can be used to sort the results.

Redis Stack comes with a few very basic scoring functions that evaluate document relevance. They are all based on document scores and term frequency. This is unrelated to the ability to use sortable fields. Scoring functions are added to a search query via the SCORER {scorer_name} argument.

If you prefer a custom scoring function, more functions can be added using the extension API.

The following is a list of the pre-bundled scoring functions available in Redis Stack, with a short explanation of how each works. Each function is referred to by its registered name, which can be passed as the SCORER argument to FT.SEARCH.

TFIDF (default)

Basic TF-IDF scoring with a few extra features:

  1. For each term in each result, the TF-IDF score of that term for that document is calculated. Frequencies are weighted according to predetermined field weights, and each term's frequency is normalized by the highest term frequency in each document.

  2. The total TF-IDF of the query terms is multiplied by the presumptive document score given on FT.CREATE via SCORE_FIELD.

  3. A penalty is assigned to each result based on the "slop," or cumulative distance, between the search terms. Exact matches incur no penalty, but matches where the search terms are far apart have their scores reduced significantly. For each bigram of consecutive terms, the minimum distance between them is determined. The score is divided by the square root of the sum of the squared distances; e.g., 1/sqrt(d(t2-t1)^2 + d(t3-t2)^2 + ...).

Given N terms in document D, T1...Tn, the resulting score can be described with this Python function:

from math import log2, sqrt

def get_score(terms, doc):
    # the sum of tf-idf
    score = 0

    # the distance penalty for all terms
    dist_penalty = 0

    for i, term in enumerate(terms):
        # tf normalized by maximum frequency
        tf = doc.freq(term) / doc.max_freq

        # idf is global for the index, and not calculated each time in real life
        idf = log2(1 + total_docs / docs_with_term(term))

        score += tf*idf

        # sum up the distance penalty
        if i > 0:
            dist_penalty += min_distance(term, terms[i-1])**2

    # multiply the score by the document score
    score *= doc.score

    # divide the score by the root of the cumulative distance
    if len(terms) > 1:
        score /= sqrt(dist_penalty)

    return score
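The function above leans on index internals (doc.freq, docs_with_term, min_distance). As a runnable sketch, here is the same logic against a toy two-document corpus; the corpus, helper functions, and the min_distances parameter are illustrative stand-ins, not the actual RediSearch implementation:

```python
from math import log2, sqrt

# Toy corpus: document -> {term: raw frequency}. Illustrative only.
docs = {
    "d1": {"redis": 3, "search": 1},
    "d2": {"redis": 1, "cache": 2},
}
total_docs = len(docs)

def docs_with_term(term):
    # Number of documents in the toy corpus containing the term
    return sum(1 for freqs in docs.values() if term in freqs)

def get_score(terms, doc_freqs, doc_score=1.0, min_distances=None):
    max_freq = max(doc_freqs.values())
    score = 0.0
    dist_penalty = 0.0
    for i, term in enumerate(terms):
        # tf normalized by the document's maximum term frequency
        tf = doc_freqs.get(term, 0) / max_freq
        idf = log2(1 + total_docs / docs_with_term(term))
        score += tf * idf
        # accumulate the squared slop distance between consecutive terms
        if i > 0 and min_distances is not None:
            dist_penalty += min_distances[i - 1] ** 2
    score *= doc_score
    if len(terms) > 1 and dist_penalty > 0:
        score /= sqrt(dist_penalty)
    return score

# Adjacent terms (distance 1): sqrt(1) == 1, so no effective slop penalty
score = get_score(["redis", "search"], docs["d1"], min_distances=[1])
```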

TFIDF.DOCNORM

Identical to the default TFIDF scorer, with one important distinction:

Term frequencies are normalized by the length of the document, expressed as the total number of terms. The length is weighted, so that if a document contains two terms, one in a field that has a weight 1 and one in a field with a weight of 5, the total frequency is 6, not 2.

FT.SEARCH myIndex "foo" SCORER TFIDF.DOCNORM
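The weighted-length normalization can be sketched as follows; the term names and field weights mirror the two-term example above and are purely illustrative:

```python
# TFIDF.DOCNORM sketch: tf is normalized by the weighted document length
# rather than by the maximum term frequency.
freqs = {"title_term": 1, "body_term": 1}        # raw term frequencies
weights = {"title_term": 5.0, "body_term": 1.0}  # field weights

# Weighted length: 1*5 + 1*1 = 6, not 2
weighted_len = sum(f * weights[t] for t, f in freqs.items())

tf_docnorm = freqs["body_term"] / weighted_len        # 1/6
tf_default = freqs["body_term"] / max(freqs.values()) # plain TFIDF: 1/1
```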

BM25

A variation on the basic TFIDF scorer; see the Wikipedia article on Okapi BM25 for more info.

The relevance score for each document is multiplied by the presumptive document score and a penalty is applied based on slop as in TFIDF.

FT.SEARCH myIndex "foo" SCORER BM25
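As a rough sketch, the textbook per-term BM25 weight (as described in the Wikipedia article) looks like this; the exact constants and IDF variant used by Redis Stack may differ:

```python
from math import log

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    # Textbook BM25 term weight; k1 and b are the usual free parameters.
    # tf: term frequency in the document, df: documents containing the term,
    # doc_len / avg_len: this document's length vs. the corpus average.
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

Unlike plain TF-IDF, the term-frequency component saturates: raising tf five-fold increases the weight by less than five-fold.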

DISMAX

A simple scorer that sums up the frequencies of matched terms. In the case of union clauses, it will give the maximum value of those matches. No other penalties or factors are applied.

It is not a one-to-one implementation of Solr's DISMAX algorithm, but it follows it in broad terms.

FT.SEARCH myIndex "foo" SCORER DISMAX
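The sum/max behavior can be sketched as follows; the query is given here as a plain list of clauses, which is a hypothetical shape, not RediSearch's internal query representation:

```python
def dismax_score(clauses, doc_freqs):
    # Sum matched-term frequencies; for a union clause (represented here
    # as a list of alternatives), take the max frequency among its members.
    score = 0
    for clause in clauses:
        if isinstance(clause, list):
            score += max(doc_freqs.get(t, 0) for t in clause)
        else:
            score += doc_freqs.get(clause, 0)
    return score

# "hello" AND ("world" | "earth"): 2 + max(1, 3) == 5
score = dismax_score(["hello", ["world", "earth"]],
                     {"hello": 2, "world": 1, "earth": 3})
```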

DOCSCORE

A scoring function that just returns the presumptive score of the document without applying any calculations to it. Since document scores can be updated, this can be useful if you'd like to use an external score and nothing further.

FT.SEARCH myIndex "foo" SCORER DOCSCORE

HAMMING

Scoring is done by the inverse Hamming distance between the document's payload and the query payload. Since the nearest neighbors are of interest, the inverse Hamming distance (1/(1+d)) is used, so that a distance of 0 gives a perfect score of 1 and ranks highest.

This only works if:

  1. The document has a payload.
  2. The query has a payload.
  3. Both are exactly the same length.

Payloads are binary-safe, and having payloads with a length that is a multiple of 64 bits yields slightly faster results.

Example:

> HSET key:1 foo hello payload aaaabbbb
(integer) 2

> HSET key:2 foo bar payload aaaacccc 
(integer) 2

> FT.CREATE idx ON HASH PREFIX 1 key: PAYLOAD_FIELD payload SCHEMA foo TEXT
"OK"

> FT.SEARCH idx "*" PAYLOAD "aaaabbbc" SCORER HAMMING WITHSCORES
1) "2"
2) "key:1"
3) "0.5"
4) 1) "foo"
   2) "hello"
5) "key:2"
6) "0.25"
7) 1) "foo"
   2) "bar"
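The scores in the example can be reproduced with a short sketch, assuming a bitwise Hamming distance over the raw payload bytes (for these particular payloads, a byte-level distance yields the same scores):

```python
def hamming_score(doc_payload: bytes, query_payload: bytes) -> float:
    # Inverse Hamming distance 1/(1+d); payloads must be the same length.
    assert len(doc_payload) == len(query_payload)
    d = sum(bin(x ^ y).count("1") for x, y in zip(doc_payload, query_payload))
    return 1 / (1 + d)

hamming_score(b"aaaabbbb", b"aaaabbbc")  # key:1 -> 0.5  ('b' vs 'c': 1 bit)
hamming_score(b"aaaacccc", b"aaaabbbc")  # key:2 -> 0.25 (3 bytes, 1 bit each)
```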