Full-text search performance test (Hologres)

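Starting with V4.0, Hologres supports full-text inverted indexes, which enable high-performance full-text search. This topic describes how Hologres full-text search performance was tested on the http_logs dataset and reports the results.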

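The http_logs dataset is derived from the server access logs of the official 1998 World Cup website. It contains 247 million records, and the raw data is about 32 GB. Each record includes fields such as @timestamp (timestamp), clientip (client IP address), request (HTTP request), status (status code), and size (response size). The dataset is widely used as a benchmark for evaluating the full-text search and analytics performance of search engines and databases.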

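Test environment preparation

Test resources:

Hologres:
- Compute resources: 48 CU
- Version: V4.1.6
- Shard count: 6. If you add compute nodes, increase the shard count proportionally.
ECS:
- Instance type: ecs.c9i.16xlarge or ecs.g9i.16xlarge
- Operating system: Debian 13.2 64-bit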

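Environment setup:

Prepare the Hologres instance
- Purchase a Hologres V4.1 instance and create a database.
- Create a user. For details, see User Management.

Prepare the ECS instance

Purchase an ECS instance. For details, see the quick-start guide for purchasing and using an ECS instance in the console.

Install dependencies

# Refresh the apt cache
sudo apt update
# Install the PostgreSQL client, used to connect to the database
sudo apt install -y postgresql-client

Dataset preparation: download the http_logs dataset from the official source and decompress it: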

mkdir -p ~/data && cd ~/data
wget https://rally-tracks.elastic.co/http_logs/documents-181998.json.bz2 && bunzip2 documents-181998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-191998.json.bz2 && bunzip2 documents-191998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-201998.json.bz2 && bunzip2 documents-201998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-211998.json.bz2 && bunzip2 documents-211998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-221998.json.bz2 && bunzip2 documents-221998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-231998.json.bz2 && bunzip2 documents-231998.json.bz2
wget https://rally-tracks.elastic.co/http_logs/documents-241998.json.bz2 && bunzip2 documents-241998.json.bz2
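
The seven download commands above differ only in the file number (18 through 24), so the URL list can be generated instead of typed by hand. The following Python sketch only builds and prints those URLs; `BASE_URL`, `FILE_IDS`, and `dataset_urls` are names chosen for this illustration, and the actual download and decompression are still left to wget and bunzip2.

```python
# Build the download URLs for the seven http_logs archives listed above.
BASE_URL = "https://rally-tracks.elastic.co/http_logs"
FILE_IDS = range(18, 25)  # file numbers 18 through 24, inclusive


def dataset_urls():
    """Return the archive URLs in the same order as the wget commands."""
    return [f"{BASE_URL}/documents-{n}1998.json.bz2" for n in FILE_IDS]


for url in dataset_urls():
    print(url)
```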

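Performance test

The entire test workflow described here, including table creation, data import, and index building in Hologres, is handled by an open-source test tool developed by the Hologres team, so no manual steps are required. The tool is available in the Git project alibabacloud-hologres-benchmark. For a sample of the table creation statements, see the appendix.

Install the test tool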

# Create an isolated virtual environment
sudo apt install -y python3-venv
python3 -m venv .venv

# Activate the virtual environment
source .venv/bin/activate
python3 -m pip install -U pip

# Install the tool and its dependencies
git clone https://github.com/aliyun/alibabacloud-hologres-benchmark
cd alibabacloud-hologres-benchmark/fulltext_search/http_logs
pip3 install -r requirements.txt

Edit the configuration file config.json

{
  "host": "<hologres_endpoint>",
  "port": <hologres_port>,
  "database": "<database_name>",
  "username": "<user_name>",
  "password": "<password>",
  "table_name": "http_logs"
}
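
Before running the benchmark, it can help to confirm that every placeholder in the configuration has been replaced. The following Python sketch is one possible check, not part of the benchmark tool: `validate_config` is a hypothetical helper, the sample values are invented, and the integer check on `port` is an assumption based on the template leaving that value unquoted.

```python
import json

# Keys taken from the configuration template above.
REQUIRED_KEYS = ("host", "port", "database", "username", "password", "table_name")


def validate_config(cfg):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    port = cfg.get("port")
    # Assumption: port should be a JSON number. bool is excluded because it subclasses int.
    if "port" in cfg and (isinstance(port, bool) or not isinstance(port, int)):
        problems.append("port is not an integer")
    for key, value in cfg.items():
        # A value that still looks like <something> was never filled in.
        if isinstance(value, str) and value.startswith("<") and value.endswith(">"):
            problems.append(f"placeholder not replaced: {key}")
    return problems


# Invented sample values, for illustration only.
example = json.loads("""
{
  "host": "hologres.example.internal",
  "port": 80,
  "database": "benchmark_db",
  "username": "benchmark_user",
  "password": "example-password",
  "table_name": "http_logs"
}
""")
print(validate_config(example))
```

To check a real file, load it with `json.load(open("config.json"))` and pass the result to `validate_config`. A template whose `port` placeholder is still unquoted is not valid JSON, so it fails at the loading step.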

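Run the test script

# Skip this cd if you are still in this directory from the installation step
cd alibabacloud-hologres-benchmark/fulltext_search/http_logs

# Runs the full workflow, including data import and the query benchmark; the import step is skipped if the data already exists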
python3 hologres_benchmark.py \
    --config config.json \
    --queries-config benchmark_queries.yaml \
    --data-dir ~/data

Test results

Results overview

Metric	Unit	Hologres result
Data import time	Seconds	203.583
Data + index storage	GB	6.105
Total query time	Seconds	36.392

Note: The total query time is the combined duration of running each of the 20 queries 10 times consecutively.

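Performance details: The following table lists the average response time of each query, in milliseconds. Simple queries such as term and range average about 10 ms (the seven fastest queries range from 7 ms to 12 ms), the hourly_agg aggregation averages 197 ms, and most sorting queries complete within a few hundred milliseconds. The slowest queries are sort_status_asc (1442 ms) and sort_size_asc (727 ms).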

Query name	Average time (ms)	Query name	Average time (ms)
`sort_status_asc`	1442	`desc_sort_timestamp`	37
`sort_size_asc`	727	`desc_sort_timestamp_can_match_shortcut`	35
`sort_numeric_no_can_match_shortcut`	251	`desc_sort_timestamp_no_can_match_shortcut`	33
`terms_enum`	251	`term`	12
`sort_numeric_can_match_shortcut`	240	`range`	11
`hourly_agg`	197	`200s-in-range`	10
`sort_size_desc`	139	`400s-in-range`	9
`sort_status_desc`	103	`asc_sort_with_after_timestamp`	9
`desc_sort_with_after_timestamp`	63	`default`	7
`scroll`	40	`asc_sort_timestamp`	7
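
As a consistency check, the per-query averages can be related back to the reported total query time. The Python sketch below sums the 20 averages from the table and multiplies by the 10 repetitions; `AVG_MS` and `REPEATS` are names chosen for this illustration.

```python
# Average response time per query in milliseconds, copied from the table above.
AVG_MS = {
    "sort_status_asc": 1442, "sort_size_asc": 727,
    "sort_numeric_no_can_match_shortcut": 251, "terms_enum": 251,
    "sort_numeric_can_match_shortcut": 240, "hourly_agg": 197,
    "sort_size_desc": 139, "sort_status_desc": 103,
    "desc_sort_with_after_timestamp": 63, "scroll": 40,
    "desc_sort_timestamp": 37, "desc_sort_timestamp_can_match_shortcut": 35,
    "desc_sort_timestamp_no_can_match_shortcut": 33, "term": 12, "range": 11,
    "200s-in-range": 10, "400s-in-range": 9,
    "asc_sort_with_after_timestamp": 9, "default": 7, "asc_sort_timestamp": 7,
}
REPEATS = 10  # each query is executed 10 times

# Estimated total in seconds: sum of averages times the number of repetitions.
total_s = sum(AVG_MS.values()) * REPEATS / 1000
# Queries averaging 12 ms or less.
fast = sorted(name for name, ms in AVG_MS.items() if ms <= 12)

print(len(AVG_MS), total_s, len(fast))  # prints: 20 36.23 7
```

The estimate of 36.23 seconds is close to the reported 36.392 seconds. The small gap may come from rounding the averages to whole milliseconds, although the source does not say.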

Appendix: Creating the table and building the index in Hologres

Create the test table in Hologres

-- Create a new table group with 6 shards
CALL HG_CREATE_TABLE_GROUP ('tg_6', 6);

-- Create the main table
CREATE TABLE http_logs (
  id BIGINT,
  "@timestamp" BIGINT NOT NULL,
  clientip TEXT,
  request TEXT,
  status INTEGER,
  size INTEGER
) WITH (
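  table_group = 'tg_6',           -- Assign the table group
  bitmap_columns = 'status',      -- Bitmap index on status to speed up equality and range filters
  segment_key = '"@timestamp"',   -- Segment by timestamp to make time-range queries more efficient
  clustering_key = '"@timestamp"' -- Cluster storage by timestamp to further optimize range scans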
);

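Convert the raw data files on the ECS instance: Hologres uses the standard COPY protocol for high-speed data import. Because the raw data is in NDJSON format, convert it to CSV before importing it into Hologres.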
```
python3 ndjson_to_csv.py ~/data ~/csv
```
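
To illustrate what such a conversion involves, here is a minimal Python sketch. It is not the `ndjson_to_csv.py` script invoked above, whose contents are not shown here. The column order follows the CREATE TABLE statement in this appendix, generating `id` as a running counter is an assumption, and the sample record is invented for illustration.

```python
import csv
import io
import json

# Dataset fields, in the order of the table columns that follow id.
COLUMNS = ("@timestamp", "clientip", "request", "status", "size")


def ndjson_to_csv(lines, out, start_id=1):
    """Write one CSV row per NDJSON line to out; return the number of rows."""
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Assumption: id is a running counter; missing fields become empty CSV fields.
        writer.writerow([start_id + count] + [record.get(col, "") for col in COLUMNS])
        count += 1
    return count


# Invented sample record, for illustration only.
sample = ['{"@timestamp": 893964617, "clientip": "40.135.0.0", '
          '"request": "GET /images/hm_bg.jpg HTTP/1.0", "status": 200, "size": 24736}']
buf = io.StringIO()
rows = ndjson_to_csv(sample, buf)
print(rows, buf.getvalue(), end="")
```

The `csv` module quotes a field only when it contains a comma, quote, or newline, so request strings with such characters are still written as valid CSV.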

After the conversion, import the data from the ECS instance with the psql COPY command

# Set the connection environment variables
export PGHOST=<hologres_endpoint>
export PGPORT=<hologres_port>
export PGUSER=<user_name>
export PGPASSWORD='<password>'
export PGDATABASE=<database_name>

# Import the data into Hologres with COPY
cd ~/csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-181998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-191998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-201998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-211998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-221998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-231998.csv
psql -c "COPY http_logs FROM STDIN WITH (FORMAT CSV)" < documents-241998.csv
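
The seven COPY commands above repeat the same psql invocation with a different input file. As a rough sketch, the same sequence could be driven from Python with `subprocess`; `psql_argv`, `copy_file`, and `copy_all` are hypothetical helper names, and the connection settings are assumed to come from the PG* environment variables set above.

```python
import subprocess
from pathlib import Path

COPY_SQL = "COPY http_logs FROM STDIN WITH (FORMAT CSV)"


def psql_argv():
    """Return the psql command line used for every file."""
    return ["psql", "-c", COPY_SQL]


def copy_file(path):
    """Stream one CSV file to psql on stdin; raise if psql exits with an error."""
    with open(path, "rb") as src:
        subprocess.run(psql_argv(), stdin=src, check=True)


def copy_all(csv_dir):
    """Import every .csv file in csv_dir, in file-name order."""
    for path in sorted(Path(csv_dir).glob("*.csv")):
        copy_file(path)
```

Calling `copy_all(Path.home() / "csv")` would import the files one after another. Because `check=True` raises on a non-zero exit code, the loop stops at the first failed file.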

Build the full-text index in Hologres: create a full-text inverted index on the request column

-- Create the full-text index
CREATE INDEX http_logs_request_idx
  ON http_logs
  USING FULLTEXT (request)
  WITH (tokenizer = 'keyword');

-- Run a full build of the index
VACUUM http_logs;