最小计数草图
Count-min sketch 是一种概率数据结构,用于估计数据流中某个元素的频率。
Count-Min Sketch 是 Redis Stack 中的一种概率数据结构,可用于估计数据流中事件/元素的频率。
它使用次线性空间,但代价是由于碰撞而过度计算某些事件。它使用事件 / 元素流,并保留其频率的估计计数器。
重要的是要知道,来自低于特定阈值(由error_rate确定)的 Count-Min 草图的结果应该被忽略,甚至通常近似为零。所以 Count-Min sketch 确实是一种用于计算流中元素频率的数据结构,但它只对更高的计数有用。非常低的计数应作为噪音忽略。
使用案例
产品(零售、网上商店)
此应用程序回答了以下问题: 商品的销量(在某一天)是多少?
使用每天(周期)创建的一个 Count-Min 草图。每笔产品销售都进入 CMS。CMS 为对销售额贡献最大的产品提供相当准确的结果。占总销售额百分比较低的产品将被忽略。
例子
假设您选择的错误率为 0.1% (0.001),确定性为 99.8% (0.998)。这意味着您的错误概率为 0.02% (0.002)。您的草图努力将误差保持在您添加的所有元素总数的 0.1% 以内。误差超过此范围的可能性为 0.02%,例如,低于阈值的元素与高于阈值的元素重叠时。当您向 CMS 添加一些项目并评估其频率时,请记住,在如此小的样本中,冲突很少见,就像其他概率数据结构一样。
Example 1:
If we had a uniform distribution of 1000 elements where each has a count of around 500 the threshold would be 500:
threshold = error * total_count = 0.001 * (1000*500) = 500
This shows that a CMS is maybe not the best data structure to count frequency of a uniformly distributed stream.
Let's try decreasing the error to 0.01%:
threshold = error * total_count = 0.0001 * (1000*500) = 100
This threshold looks more acceptable already, but it means we will need a bigger sketch width w = 2/error = 20 000
and consequently - more memory.
Example 2:
In another example let's imagine a normal (gaussian) distribution where we have 1000 elements, out of which 800 will have a summed count of 400K (with an average count of 500) and 200 elements will have a much higher summed count of 1.6M (with an average count of 8000), making them the heavy hitters (elephant flow). The threshold after "populating" the sketch with all the 1000 elements would be:
threshold = error * total_count = 0.001 * 2M = 2000
This threshold seems to be sitting comfortably between the 2 average counts 500 and 8000 so the initial chosen error rate should be working well for this case.
Sizing
Even though the Count-Min sketch is similar to Bloom filter in many ways, its sizing is considerably more complex. The initialisation command receives only two sizing parameters, but you have to understand them thoroughly if you want to have a usable sketch.
CMS.INITBYPROB key error probability
1. Error
The error
parameter will determine the width w
of your sketch and the probability will determine the number of hash functions (depth d
). The error rate we choose will determine the threshold above which we can trust the result from the sketch. The correlation is:
threshold = error * total_count
or
error = threshold/total_count
where total_count
is the sum of the count of all elements that can be obtained from the count
key of the result of the CMS.INFO
command and is of course dynamic - it changes with every new increment in the sketch. At creation time you can approximate the total_count
ratio as a product of the average count you'll be expecting in the sketch and the average number of elements.
Since the threshold is a function of the total count in the filter it's very important to note that it will grow as the count grows, but knowing the total count we can always dynamically calculate the threshold. If a result is below it - it can be discarded.
2. Probability
probability
in this data structure represents the chance of an element that has a count below the threshold to collide with elements that had a count above the threshold on all sketches/depths thus returning a min-count of a frequently occurring element instead of its own.
Performance
Adding, updating and querying for elements in a CMS has a time complexity O(1).
Academic sources
References
On this page