PolarDB 开源版通过 duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP

本文涉及的产品
云原生数据库 PolarDB MySQL 版,Serverless 5000PCU 100GB
简介: 背景PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力.本文将介绍PolarDB 开源版通过duckdb_fdw 支持 parquet 列存...

背景

PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力.

本文将介绍PolarDB 开源版通过duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP.

测试环境为macos+docker, polardb部署请参考如何用 PolarDB 证明巴菲特的投资理念 - 包括PolarDB简单部署

duckdb_fdw for PolarDB

参考

《用duckdb_fdw加速PostgreSQL分析计算, 提速40倍, 真香.》

  1. 部署

需要一个 高版本 cmake .

https://cmake.org/download  
  
wget https://github.com/Kitware/CMake/releases/download/v3.25.1/cmake-3.25.1.tar.gz  
tar -zxvf cmake-3.25.1.tar.gz  
cd cmake-3.25.1  
./configure  
make -j 4  
make install  
  
  
[postgres@67e1eed1b4b6 duckdb]$ export PATH=/usr/local/bin:$PATH  
[postgres@67e1eed1b4b6 duckdb]$ which cmake  
/usr/local/bin/cmake  
git clone --depth 1 https://github.com/cwida/duckdb  
cd duckdb  
make -j 4  
cd ~/duckdb  
  
cp build/release/tools/sqlite3_api_wrapper/libsqlite3_api_wrapper.so ~/tmp_basedir_polardb_pg_1100_bld/lib/    
cp tools/sqlite3_api_wrapper/include/sqlite3.h ~/tmp_basedir_polardb_pg_1100_bld/include/server/    
cp build/release/src/libduckdb.so ~/tmp_basedir_polardb_pg_1100_bld/lib/    
cd ~    
git clone --depth 1 https://github.com/alitrack/duckdb_fdw    
  
cd duckdb_fdw    
USE_PGXS=1 make    
USE_PGXS=1 make install    
[postgres@67e1eed1b4b6 duckdb_fdw]$ USE_PGXS=1 make    
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation   -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o connection.o connection.c  
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation   -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o option.o option.c  
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation   -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o deparse.o deparse.c  
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation   -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o sqlite_query.o sqlite_query.c  
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation   -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2   -c -o duckdb_fdw.o duckdb_fdw.c  
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation   -g -pipe -Wall -grecord-gcc-switches -I/usr/include/et -O3 -Wp,-D_FORTIFY_SOURCE=2 -fPIC -shared -o duckdb_fdw.so connection.o option.o deparse.o sqlite_query.o duckdb_fdw.o -L/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib  -Wl,-rpath,'$ORIGIN/../lib' -L/opt/rh/llvm-toolset-7.0/root/usr/lib64  -Wl,--as-needed -Wl,-rpath,'/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib',--enable-new-dtags  -lsqlite3_api_wrapper   
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o connection.bc connection.c  
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o option.bc option.c  
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o deparse.bc deparse.c  
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o sqlite_query.bc sqlite_query.c  
/opt/rh/llvm-toolset-7.0/root/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  -I. -I./ -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/server -I/home/postgres/tmp_basedir_polardb_pg_1100_bld/include/internal  -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o duckdb_fdw.bc duckdb_fdw.c  
  
  
[postgres@67e1eed1b4b6 duckdb_fdw]$ USE_PGXS=1 make  install  
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib'  
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension'  
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension'  
/usr/bin/install -c -m 755  duckdb_fdw.so '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/duckdb_fdw.so'  
/usr/bin/install -c -m 644 .//duckdb_fdw.control '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension/'  
/usr/bin/install -c -m 644 .//duckdb_fdw--1.0.sql .//duckdb_fdw--1.0--1.1.sql  '/home/postgres/tmp_basedir_polardb_pg_1100_bld/share/extension/'  
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode/duckdb_fdw'  
/usr/bin/mkdir -p '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/  
/usr/bin/install -c -m 644 connection.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./  
/usr/bin/install -c -m 644 option.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./  
/usr/bin/install -c -m 644 deparse.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./  
/usr/bin/install -c -m 644 sqlite_query.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./  
/usr/bin/install -c -m 644 duckdb_fdw.bc '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode'/duckdb_fdw/./  
cd '/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/bitcode' && /opt/rh/llvm-toolset-7.0/root/usr/bin/llvm-lto -thinlto -thinlto-action=thinlink -o duckdb_fdw.index.bc duckdb_fdw/connection.bc duckdb_fdw/option.bc duckdb_fdw/deparse.bc duckdb_fdw/sqlite_query.bc duckdb_fdw/duckdb_fdw.bc  
  1. 加载插件

create extension duckdb_fdw;    
  1. 测试使用

cd ~/duckdb/build/release  
./duckdb /home/postgres/db    
  
COPY (select generate_series as id, md5(random()::text) as info, now()::timestamp+(generate_series||' second')::interval as crt_time from generate_series(1,100)) TO '/home/postgres/t1.parquet' (FORMAT 'PARQUET');    
  
COPY (select generate_series as cid, md5(random()::text) as info, now()::timestamp+(generate_series||' second')::interval as crt_time from generate_series(1,100)) TO '/home/postgres/t2.parquet' (FORMAT 'PARQUET');    
  
COPY (select (floor(random()*100)+1)::int as gid, (floor(random()*100)+1)::int as cid, (random()*10)::int as c1, (random()*100)::int as c2, (random()*1000)::int as c3, (random()*10000)::int as c4, (random()*100000)::int as c5 from generate_series(1,1000000)) TO '/home/postgres/t3.parquet' (FORMAT 'PARQUET');    
    
    
create view t1 as select * from read_parquet('/home/postgres/t1.parquet');    
create view t2 as select * from read_parquet('/home/postgres/t2.parquet');    
create view t3 as select * from read_parquet('/home/postgres/t3.parquet');    
    
checkpoint;    
  1. 性能怎么样

psql  
  
  
CREATE SERVER DuckDB_server FOREIGN DATA WRAPPER duckdb_fdw OPTIONS (database '/home/postgres/db');    
    
IMPORT FOREIGN SCHEMA public FROM SERVER DuckDB_server INTO public;    
    
postgres=# \det    
     List of foreign tables    
 Schema | Table |    Server         
--------+-------+---------------    
 public | t1    | duckdb_server    
 public | t2    | duckdb_server    
 public | t3    | duckdb_server    
(3 rows)    
    
postgres=# explain verbose select count(distinct gid) from t3;    
                         QUERY PLAN                              
-------------------------------------------------------------    
 Foreign Scan  (cost=1.00..1.00 rows=1 width=8)    
   Output: (count(DISTINCT gid))    
   SQLite query: SELECT count(DISTINCT "gid") FROM main."t3"    
(3 rows)    
    
postgres=# \timing    
Timing is on.    
  
select count(distinct gid) from t3;    
  
select count(distinct gid),count(*) from t3;    
  
explain verbose select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid;    
  
  
select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid;    
  
select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid join t2 on t3.cid=t2.cid;    
  
explain verbose select count(distinct t3.gid),count(*) from t3 join t1 on t3.gid=t1.gid join t2 on t3.cid=t2.cid;    
  
  1. 将数据导入PolarDB本地, 做同样的查询看一下时间.

postgresql 本地计算

postgres=# create unlogged table lt1 as select * from t1;    
SELECT 100    
postgres=# create unlogged table lt2 as select * from t2;    
SELECT 100    
postgres=# create unlogged table lt3 as select * from t3;    
SELECT 10000000    
postgres=# \timing    
Timing is on.    
postgres=# select count(distinct gid) from lt3;    
 count     
-------    
   100    
(1 row)    
    
Time: 14891.356 ms (00:14.891)    

参考

相关实践学习
数据库实验室挑战任务-初级任务
本场景介绍如何开通属于你的免费云数据库,在RDS-MySQL中完成对学生成绩的详情查询,执行指定类型SQL。
阿里云云原生数据仓库AnalyticDB MySQL版 使用教程
云原生数据仓库AnalyticDB MySQL版是一种支持高并发低延时查询的新一代云原生数据仓库,高度兼容MySQL协议以及SQL:92、SQL:99、SQL:2003标准,可以对海量数据进行即时的多维分析透视和业务探索,快速构建企业云上数据仓库。 了解产品 https://www.aliyun.com/product/ApsaraDB/ads
相关文章
|
5月前
|
存储 关系型数据库 分布式数据库
PolarDB-X HTAP新特性 ~ 列存索引
随着数据爆炸式的增长,传统的OLTP和OLAP解决方案基于简单的读写分离或ETL模型,将在线库的数据以T+1的方式抽取到数据仓库中进行计算,这种方案存在存储成本高、实时性差、链路和维护成本高等缺陷。 为应对数据爆炸式增长的挑战,PolarDB分布式版本基于对象存储设计了一套列存索引(Clustered Columnar Index,CCI)功能,支持将行存数据实时同步到列存存储上
76008 148
|
10月前
|
存储 文件存储 索引
GreenPlum列存解密
GreenPlum列存解密
138 0
|
存储 关系型数据库 对象存储
PolarDB-PG | PostgreSQL + 阿里云OSS 实现高效低价的海量数据冷热存储分离
数据库里的历史数据越来越多, 占用空间大, 备份慢, 恢复慢, 查询少但是很费钱, 迁移慢 怎么办? 冷热分离方案: - 使用PostgreSQL 或者 PolarDB-PG 存成parquet文件格式, 放到aliyun OSS存储里面. 使用duckdb_fdw对parquet文件进行查询. - duckdb 存储元数据(parquet 映射) 方案特点: - 内网oss不收取网络费用, 只收取存储费用, 非常便宜 - oss分几个档, 可以根据性能需求选择 - parquet为列存储, 一般历史数据的分析需求多,性能不错 - duckdb 支持 parquet下推过滤, 数据过滤性能不错
6256 5
PolarDB-PG | PostgreSQL + 阿里云OSS 实现高效低价的海量数据冷热存储分离
|
存储 并行计算 算法
PolarDB 开源版通过 duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP
PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力. 本文将介绍PolarDB 开源版通过duckdb_fdw 支持 parquet 列存数据文件以及高效OLAP.
1218 0
|
存储 算法 关系型数据库
一文剖析PolarDB HTAP的列存数据压缩
PolarDB MySQL是阿里云自研的OLTP云原生数据库,为满足客户HTAP的需求,PolarDB技术团队提出了基于In-Memory Column Index的技术方案,本文介绍数据的压缩方法以及PolarDB HTAP在列存数据压缩上的工作。通过压缩,PolarDB HTAP在大部分业务场景将列存存储空间减少到十分之一,大幅减少客户的存储成本。在计算性能上,PolarDB技术团队探究将计算下推到压缩数据从而加速分析,进一步为客户提供更好的OLAP查询性能,实现HTAP系统性能和成本的兼得。
346 0
|
存储 SQL 算法
列式存储?OLAP?ClickHouse究竟是何方神圣
列式存储?OLAP?ClickHouse究竟是何方神圣
列式存储?OLAP?ClickHouse究竟是何方神圣
|
SQL 存储 druid
开源OLAP迁移至Hologres
本文将会为您介绍开源OLAP如ClickHouse/Druid/Presto等架构同Hologres的深度对比,并介绍如何如何平滑迁移到Hologres技术体系,实现更稳定,更可扩展,更多功能的HSAP架构。
10976 0
开源OLAP迁移至Hologres
|
SQL 存储 数据库
MonetDB 列存数据库架构初探
### 前言 对于越来越多的分析型场景,例如数据仓库,科学计算等, 经典的数据库DBMS的检索性能颇显乏力。 相反的,最近出现了很多面向列存的数据库DBMS,像ClickHouse,Vertica, MonetDB等,因其充分利用了现代计算机的一些硬件优势,同时舍弃了一些DBMS特性,得到了非常好的检索性能。本文就MonetDB, 整理一些资料和代码,简单介绍其核心设计。 通过简单Benc
5611 0