Stemming
Stemming support
RediSearch supports stemming - that is, adding the base form of a word to the index. This allows a query for "hiring" to also return results for "hire" and "hired".
The current stemming support is based on the Snowball stemmer library, which supports most European languages, as well as Arabic and others. See the "Supported languages" section below. We hope to include more languages soon (if you need a specific language supported, please open an issue).
For more details, see the Snowball Stemmer website.
How it works
Stemming maps different forms of the same word to a common root - the "stem" - for example, the English stemmer maps study, studies, and studying to studi. So a search for studying will also find documents that contain only the other forms.
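The idea can be illustrated with a toy suffix-stripping function. This is only a sketch of the concept, not the Snowball algorithm RediSearch actually uses - the suffix rules below are made up for the example:

```python
# Toy illustration of stemming (NOT the Snowball algorithm): a few
# naive English suffix rules that map different forms of a word to
# one shared stem, the way the index stores terms.
def toy_stem(word: str) -> str:
    # Check longer suffixes first so "studying" matches "ying", not "y".
    for suffix, replacement in (("ying", "i"), ("ies", "i"), ("ing", ""), ("y", "i")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for w in ("study", "studies", "studying"):
    print(w, "->", toy_stem(w))  # all three print "-> studi"
```

Because all three forms share the stem "studi", indexing any one of them makes the document findable by a query for any other form.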
To define which language the stemmer should apply when building the index, you need to specify the LANGUAGE parameter. For more details, see the FT.CREATE syntax.
Create an index with a language definition
Create an index for German words with the prefix "wort:" and a single TEXT field "wort":
redis> FT.CREATE idx:german ON HASH PREFIX 1 "wort:" LANGUAGE GERMAN SCHEMA wort TEXT
Adding words
Add some German words that share the same stem - all variations of the word stück (piece in English): stück stücke stuck stucke
=> stuck
redis> HSET wort:1 wort stück
(integer) 1
redis> HSET wort:2 wort stücke
(integer) 1
redis> HSET wort:3 wort stuck
(integer) 1
redis> HSET wort:4 wort stucke
(integer) 1
Searching for a common stem
Search for "stuck" (German for "piece"). As of v2.10, it's only necessary to specify the LANGUAGE argument when it wasn't specified at creation time for the index being searched.
Note that the results for words containing "ü" are encoded in UTF-8.
redis> FT.SEARCH idx:german '@wort:(stuck)'
1) (integer) 4
2) "wort:3"
3) 1) "wort"
2) "stuck"
4) "wort:4"
5) 1) "wort"
2) "stucke"
6) "wort:1"
7) 1) "wort"
2) "st\xc3\xbcck"
8) "wort:2"
9) 1) "wort"
2) "st\xc3\xbccke"
Supported languages
The following languages are supported and can be passed to the engine (in lowercase) when indexing or querying:
- arabic
- armenian
- danish
- dutch
- english
- finnish
- french
- german
- hungarian
- italian
- norwegian
- portuguese
- romanian
- russian
- serbian
- spanish
- swedish
- tamil
- turkish
- yiddish
- chinese (see below)
Chinese support
Indexing a Chinese document is different from indexing a document in most other languages because of how tokens are extracted. While in most languages tokens can be distinguished by separator characters and whitespace, this is not common in Chinese.
Chinese tokenization is done by scanning the input text and checking every character or sequence of characters against a dictionary of predefined terms and determining the most likely match based on the surrounding terms and characters.
Redis Stack makes use of the Friso Chinese tokenization library for this purpose. This is largely transparent to the user, and often no additional configuration is required.
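The dictionary-scanning idea described above can be sketched with a greedy forward-maximum-matching segmenter. This is a deliberately simplified model - the dictionary and sentence are illustrative, and Friso's real algorithm also weighs surrounding terms and characters rather than always taking the longest match:

```python
# Minimal sketch of dictionary-based Chinese tokenization using forward
# maximum matching: at each position, take the longest dictionary entry
# that matches; fall back to a single character when nothing matches.
def segment(text: str, dictionary: set, max_len: int = 4) -> list:
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy dictionary: "自然语言" (natural language) and "处理" (processing).
dictionary = {"自然", "语言", "自然语言", "处理"}
print(segment("自然语言处理", dictionary))  # ['自然语言', '处理']
```

Note how the segmenter prefers the four-character entry "自然语言" over the two shorter entries it contains - that longest-match preference is what resolves ambiguity in this simple model.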
Using custom dictionaries
If you wish to use a custom dictionary, you can do so at the module level when loading the module. The FRISOINI
setting can point to the location of a friso.ini
file which contains the relevant settings and paths to the dictionary files.
Note that there is no default friso.ini
file location. RediSearch comes with its own friso.ini
and dictionary files that are compiled into the module binary at build time.
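For example, the module could be loaded with a custom Friso configuration via the server's configuration file. The paths below are placeholders for illustration, not defaults:

```
# redis.conf - load the module with a custom Friso configuration.
# Both paths are placeholders; there is no default friso.ini location.
loadmodule /opt/redis/modules/redisearch.so FRISOINI /etc/friso/friso.ini
```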