Time series

Redis Stack

You can use Redis Stack to manage time series data in Redis Enterprise.

Features

  • Query by start time and end time
  • Query by label set
  • Aggregated queries (min, max, avg, sum, range, count, first, last) for any time bucket
  • Configurable maximum retention period
  • Compaction/roll-ups: automatically updated aggregated time series
  • Label index: every key has labels, allowing queries by label

Memory model

A time series is a linked list of memory chunks. Each chunk has a predefined sample capacity. Each sample is a 128-bit tuple: a 64-bit timestamp and a 64-bit value.
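As a back-of-the-envelope check, the per-sample footprint follows directly from the tuple layout above (the 4 KB chunk size below is a hypothetical figure for illustration only, not a documented default):

```python
# Each sample is a 128-bit tuple: a 64-bit timestamp plus a 64-bit value.
TIMESTAMP_BYTES = 8
VALUE_BYTES = 8
BYTES_PER_SAMPLE = TIMESTAMP_BYTES + VALUE_BYTES  # 16 bytes per sample

# Hypothetical chunk size, used only to illustrate chunk capacity.
CHUNK_BYTES = 4096
samples_per_chunk = CHUNK_BYTES // BYTES_PER_SAMPLE

print(BYTES_PER_SAMPLE)   # 16
print(samples_per_chunk)  # 256
```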

Time series functionality

Redis Stack provides a new data type that uses fixed-size chunks of memory for time series samples, indexed by the same radix tree implementation as Redis streams. With streams, you can create a capped stream, effectively limiting the number of messages by count. For time series, you can apply a retention policy in milliseconds. This is better suited to time series use cases, which are typically interested in the data within a given time window rather than in a fixed number of samples.
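The difference between the two trimming policies can be sketched in a few lines of Python (a simulation of the behavior, not the actual Redis implementation):

```python
def cap_by_count(entries, maxlen):
    """Stream-style trimming: keep only the newest `maxlen` entries."""
    return entries[-maxlen:]

def cap_by_retention(samples, now_ms, retention_ms):
    """Time-series-style trimming: drop samples older than the window."""
    return [(ts, val) for (ts, val) in samples if now_ms - ts <= retention_ms]

samples = [(1000, 1.0), (2000, 2.0), (3000, 3.0), (4000, 4.0)]
print(cap_by_count(samples, 2))               # [(3000, 3.0), (4000, 4.0)]
print(cap_by_retention(samples, 4000, 1500))  # [(3000, 3.0), (4000, 4.0)]
```

With a count cap, a burst of writes can silently push out data you still need; with a retention window, what you keep is defined in the time domain, matching how time series are queried.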

Downsampling/compaction

[Figure: the same series before and after downsampling]

If you want to keep all raw data points indefinitely, your data set grows linearly over time. However, if your use case allows the data to be less fine-grained the further back in time it goes, you can apply downsampling. This lets you aggregate the raw data for a given time window with a given aggregation function, retaining fewer historical data points. Time series support downsampling with the following aggregations: avg, sum, min, max, range, count, first, and last.
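The bucketing logic behind downsampling can be sketched as follows (a minimal Python simulation; RedisTimeSeries performs this server-side via compaction rules):

```python
def downsample(samples, bucket_ms, agg):
    """Group (timestamp, value) samples into fixed time buckets and
    reduce each bucket with the given aggregation function."""
    buckets = {}
    for ts, val in sorted(samples):
        # Align each timestamp to the start of its bucket.
        buckets.setdefault(ts - ts % bucket_ms, []).append(val)
    return [(start, agg(vals)) for start, vals in sorted(buckets.items())]

avg = lambda vals: sum(vals) / len(vals)

raw = [(1000, 10.0), (1400, 20.0), (2100, 30.0)]
print(downsample(raw, 1000, avg))  # [(1000, 15.0), (2000, 30.0)]
print(downsample(raw, 1000, max))  # [(1000, 20.0), (2000, 30.0)]
```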

Secondary indexing

When you use Redis' core data structures, you can only retrieve a time series by knowing the exact key that holds it. Unfortunately, for many time series use cases (such as root cause analysis or monitoring), your application won't know the exact key it's looking for. These use cases typically want to query a set of time series that relate to each other in several dimensions in order to extract the insights you need. You could build your own secondary index with core Redis data structures to help with this, but it would come at a high development cost and require you to manage edge cases to keep the index correct.

Redis calls these field-value pairs labels. You can add labels to each time series and use them for filtering at query time.

Here's an example of creating a time series with two labels (sensor_id and area_id are the fields, with values 2 and 32 respectively) and a retention window of 60,000 milliseconds:

    TS.CREATE temperature RETENTION 60000 LABELS sensor_id 2 area_id 32
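To see what the label index buys you, here is a rough Python sketch of the idea (purely illustrative; RedisTimeSeries maintains this index internally, and the key names below are hypothetical):

```python
def build_label_index(series_labels):
    """Map each (label, value) pair to the set of series keys carrying it."""
    index = {}
    for key, labels in series_labels.items():
        for label, value in labels.items():
            index.setdefault((label, value), set()).add(key)
    return index

def query(index, *filters):
    """Return keys matching ALL (label, value) filters,
    similar in spirit to TS.MRANGE ... FILTER."""
    sets = [index.get(f, set()) for f in filters]
    return set.intersection(*sets) if sets else set()

series = {
    "temperature:2:32": {"sensor_id": "2", "area_id": "32"},
    "temperature:3:32": {"sensor_id": "3", "area_id": "32"},
}
idx = build_label_index(series)
print(sorted(query(idx, ("area_id", "32"))))
# ['temperature:2:32', 'temperature:3:32']
print(sorted(query(idx, ("sensor_id", "2"), ("area_id", "32"))))
# ['temperature:2:32']
```

The application never needs to know the exact keys; it describes the series it wants in terms of labels.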

Aggregation at read time

When you need to query a time series, it's cumbersome to stream all raw data points if you're only interested in, say, an average over a given time interval. Time series aggregate on the server side and transfer only the minimum required data, ensuring the lowest latency.

Here's an example of aggregation over time buckets of 5,000 milliseconds:

    127.0.0.1:12543> TS.RANGE temperature:3:32 1548149180000 1548149210000 AGGREGATION avg 5000
    1) 1) (integer) 1548149180000
       2) "26.199999999999999"
    2) 1) (integer) 1548149185000
       2) "27.399999999999999"
    3) 1) (integer) 1548149190000
       2) "24.800000000000001"
    4) 1) (integer) 1548149195000
       2) "23.199999999999999"
    5) 1) (integer) 1548149200000
       2) "25.199999999999999"
    6) 1) (integer) 1548149205000
       2) "28"
    7) 1) (integer) 1548149210000
       2) "20"

Integrations

Redis Stack comes with several integrations into existing time series tools. One such integration is our RedisTimeSeries adapter for Prometheus, which keeps all your monitoring metrics inside time series while leveraging the entire Prometheus ecosystem.

Furthermore, we also created direct integrations for Grafana. This repository contains a docker-compose setup of RedisTimeSeries, its remote write adaptor, Prometheus and Grafana. It also comes with a set of data generators and pre-built Grafana dashboards.

Time series modeling approaches with Redis

Data modeling approaches

Redis streams allow you to add several field value pairs in a message for a given timestamp. For each device, we collected 10 metrics that were modeled as 10 separate fields in a single stream message.

For sorted sets, we modeled the data in two different ways. For “Sorted set per device”, we concatenated the metrics and separated them out by colons, e.g. “<timestamp>:<metric1>:<metric2>: … :<metric10>”.

Of course, this consumes less memory but needs more CPU cycles to get the correct metric at read time. It also implies that changing the number of metrics per device isn’t straightforward, which is why we also benchmarked a second sorted set approach. In “Sorted set per metric,” we kept each metric in its own sorted set and had 10 sorted sets per device. We logged values in the format “<timestamp>:<metric>”.
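The "sorted set per device" encoding and its read-time cost can be sketched like this (the helper functions are hypothetical; the colon-separated layout is the one described above):

```python
def encode_per_device(timestamp, metrics):
    """'Sorted set per device': one member packs all metrics for a timestamp,
    as '<timestamp>:<metric1>:<metric2>:...:<metricN>'."""
    return ":".join([str(timestamp)] + [str(m) for m in metrics])

def read_metric(member, index):
    """Extracting a single metric requires splitting the whole member,
    which is the extra read-time CPU cost mentioned above."""
    return float(member.split(":")[1 + index])

member = encode_per_device(1548149180000, [26.2, 27.4, 24.8])
print(member)                  # 1548149180000:26.2:27.4:24.8
print(read_metric(member, 1))  # 27.4
```

It also shows why changing the number of metrics is awkward: every reader must agree on the field order baked into the string.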

Another alternative approach would be to normalize the data by creating a hash with a unique key to track all measurements for a given device for a given timestamp. This key would then be the value in the sorted set. However, having to access many hashes to read a time series would come at a huge cost during read time, so we abandoned this path.

Each time series holds a single metric. We chose this design to maintain the Redis principle that a larger number of small keys is better than a fewer number of large keys.

Our benchmark did not utilize the out-of-the-box secondary indexing capabilities of time series. Redis keeps a partial secondary index in each shard, and since the index inherits the same hash-slot of the key it indexes, it is always hosted on the same shard. This approach would make the setup for native data structures even more complex to model, so for the sake of simplicity, we decided not to include it in our benchmarks. Additionally, while Redis Enterprise can use the proxy to fan out requests for commands like TS.MGET and TS.MRANGE to all the shards and aggregate the results, we chose not to exploit this advantage in the benchmark either.

Data ingestion

For the data ingestion part of our benchmark, we compared the four approaches by measuring how many devices’ data we could ingest per second. Our client side had 8 worker threads with 50 connections each, and a pipeline of 50 commands per request.

Ingestion details of each approach:

    Approach              Redis streams   Time series   Sorted set per device   Sorted set per metric
    Command               XADD            TS.MADD       ZADD                    ZADD
    Pipeline              50              50            50                      50
    Metrics per request   5000            5000          5000                    500
    # keys                4000            40000         4000                    40000
All our ingestion operations were executed at sub-millisecond latency. Although both use the same Rax (radix tree) data structure, the time series approach achieved slightly higher throughput than Redis streams.

Each approach yields different results, which shows the value of prototyping against your specific use case. As we'll see with query performance, the sorted set per device approach comes with improved write throughput, but at the expense of query performance. It's a trade-off between ingestion, query performance, and flexibility (remember the earlier data modeling remark).

Read performance

The read query we used in this benchmark queried a single time series and aggregated it in one-hour time buckets by keeping the maximum observed CPU percentage in each bucket. The time range we considered in the query was exactly one hour, so a single maximum value was returned. For time series, this is out of the box functionality.

    127.0.0.1:12543> TS.RANGE cpu_usage_user{1340993056} 1451606390000 1451609990000 AGGREGATION max 3600000

For the Redis streams and sorted set approaches, we created Lua scripts to perform the equivalent aggregation. The client once again had 8 threads with 50 connections each. Since we executed the same query, only a single shard was hit, and in all four cases this shard maxed out at 100% CPU.

This is where you can see the real power of having a dedicated data structure for a given use case, with a toolbox that runs alongside it. Time series outperforms all the other approaches and is the only one to achieve sub-millisecond response times.

Memory utilization

For both the Redis streams and sorted set approaches, the samples were stored as strings, while time series stored them as doubles. In this specific data set, we chose a CPU measurement with rounded integer values between 0-100, which thus consumes two bytes of memory as a string. With time series, each metric had 64-bit precision.

Time series can dramatically reduce memory consumption compared with both sorted set approaches. Given the unbounded nature of time series data, the overall data set size that needs to be retained in memory is typically a critical criterion to evaluate. Redis streams reduce memory consumption even further, but would be equal to or higher than time series if more digits were needed for higher precision.
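The precision trade-off is easy to quantify (a sketch under the encodings described above: decimal strings for streams/sorted sets, fixed 64-bit doubles for time series):

```python
import struct

def string_sample_bytes(value):
    """Bytes needed to store the value as a decimal string
    (the streams / sorted set encoding)."""
    return len(str(value))

def double_sample_bytes(value):
    """Bytes needed to store the value as a fixed 64-bit double
    (the time series encoding)."""
    return len(struct.pack("d", value))

print(string_sample_bytes(42))        # 2  (a two-digit CPU percentage)
print(double_sample_bytes(42.0))      # 8  (constant, regardless of value)
print(string_sample_bytes(42.19847))  # 8  (higher precision costs more as text)
```

The string encoding wins only while values stay short; once you need several digits of precision, the fixed 8-byte double is as compact or more so.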

More info
