This topic describes the indexing mechanism for time series tables and provides best practices for writing efficient queries.
Background
For more information about the data model for time series data, see Design a time series table. Data is organized by time series. In a time series table, all columns marked as TAG (tag columns) are indexed. These columns specify the time series to which each data point belongs and indicate the data source.
The following figure shows an example of data that belongs to the same time series in a time series table. For more information, see Design a time series table.
If the value in a tag column changes, the data belongs to a different time series. Typically, a time series identifies a specific data source that generates time series data.
LindormTSDB creates an inverted index for each tag column. The index uses the column name and value as a key to index all time series that contain that key-value pair. This allows the system to quickly locate the set of time series that match a specific tag. As data is written, the inverted index for the time series table evolves, as shown in the following figure.
For the timestamp column, LindormTSDB automatically creates a Block Range Index (BRIN) to quickly locate the required data range within large volumes of time series data. However, LindormTSDB does not create indexes on field columns.
Recommendations
Based on the indexing mechanism of LindormTSDB, consider the following points when querying a time series table:
Include equality filters on tag columns and a time range filter in your queries. Avoid queries that filter only on field columns.
If a query includes filters on multiple tag columns and the time series sets they map to have an inclusion relationship, retain only the most selective tag filter. For example, in the sample data, the id column is more selective than the city column.
A query that covers a wide time range may scan a large amount of data, which degrades query performance. To improve performance, narrow the time range in your filter.
Example queries
To reproduce the results in the following examples, download the SQL script and use it to load the sample data.
Query raw data in a time range
Use the following statement to query all SO2 metrics reported by devices in the Yuhang district from 2019-04-18 10:00:00 to 2019-04-18 10:30:00.
SELECT id, so2 FROM aqm WHERE district='yuhang' AND time >= '2019-04-18 10:00:00' AND time < '2019-04-18 10:30:00';The following result is returned:
+---------+------+
| id | so2 |
+---------+------+
| HY00003 | 10 |
| HY00003 | 10.3 |
| HY00003 | 10.3 |
| HY00003 | 10.1 |
| HY00003 | 10.1 |
| HY00003 | 10.1 |
| HY00003 | 10.1 |
| HY00003 | 10.1 |
| HY00003 | 10.3 |
| HY00003 | 10.2 |
| HY00001 | 10 |
| HY00001 | 10.5 |
| HY00001 | 10 |
| HY00001 | 10.3 |
+---------+------+
14 rows in set (444 ms)Query tag column values
After you ingest data from many similar devices into a single time series data table, you may need to query the values of a specific tag for those devices. For example, you can use the following SQL statement to query the device id of all air quality monitoring devices that send data to the aqm time series data table:
SELECT DISTINCT(id) FROM aqm;The following result is returned:
lindorm> SELECT DISTINCT(id) FROM aqm;
+----------+
| |
+----------+
| HY00001 |
| HY00002 |
| HY00003 |
| HY00004 |
+----------+
4 rows in set (61 ms)Queries on tag columns use the tag index by default at the storage engine level, which avoids performance issues caused by full table scans. However, if a tag has high cardinality (a large number of unique values), even an indexed query can be slow.
Downsampling query
Query the average PM2.5 and SO2 metrics reported by devices in the Yuhang district from 2019-04-18 10:00:00 to 2019-04-18 10:30:00 at a 5-minute granularity.
SELECT id, time, avg(pm2_5) AS avg_pm2_5, avg(so2) AS avg_so2 FROM aqm WHERE district='yuhang' AND time >= '2019-04-18 10:00:00' AND time < '2019-04-18 10:30:00' SAMPLE BY 5m;The following result is returned:
+----------+-----------------------------+------------+---------+
| id | time | avg_pm2_5 | avg_so2 |
+----------+-----------------------------+------------+---------+
| HY00003 | 2019-04-18T10:00:00+08:00 | 31.26 | 10.16 |
| HY00003 | 2019-04-18T10:05:00+08:00 | 31.24 | 10.16 |
| HY00001 | 2019-04-18T10:00:00+08:00 | 31.175 | 10.2 |
+----------+-----------------------------+------------+---------+
3 rows in set (170 ms)Downsampling is a time-based aggregation performed on each time series. For more information about downsampling queries, see Downsampling queries.
Query aggregate data across devices
Calculate the maximum 5-minute average of PM2.5 and SO2 metrics from devices in the Yuhang district between 2019-04-18 10:00:00 and 2019-04-18 10:30:00.
SELECT max(avg_pm2_5) AS max_avg_pm25, max(avg_so2) AS max_avg_so2 FROM (SELECT district, id, time, avg(pm2_5) AS avg_pm2_5, avg(so2) AS avg_so2 FROM aqm WHERE district='yuhang' AND time >= '2019-04-18 10:00:00' AND time < '2019-04-18 10:30:00' SAMPLE BY 5m) GROUP BY district;The following result is returned:
+--------------+------------+
| max_avg_pm25 | max_avg_so2 |
+--------------+------------+
| 31.26 | 10.2 |
+--------------+------------+
1 rows in set (103 ms)