
Iceberg data source


This topic describes how to connect ApsaraDB for SelectDB to an Iceberg data source so that you can run federated analytics on Iceberg data.

Usage notes

  • Iceberg V1 and V2 table formats are supported.

  • For the V2 format, only Position Delete is supported. Equality Delete is not supported.

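Create a catalog

Create a catalog based on Hive Metastore

The procedure is basically the same as for a Hive catalog, so only a simple example is provided here. For more information, see Hive data source.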

CREATE CATALOG iceberg PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
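Once a catalog such as the one above has been created, its tables can be queried with ordinary SQL. The following is a minimal sketch only: `your_db` and `your_table` are hypothetical placeholders, not names from this topic, and must be replaced with a database and table that exist in your Iceberg catalog.

```sql
-- Switch the session to the catalog created above.
SWITCH iceberg;

-- List the databases that the catalog exposes.
SHOW DATABASES;

-- Query a table by its fully qualified name (catalog.database.table).
-- your_db and your_table are placeholders (assumption).
SELECT * FROM iceberg.your_db.your_table LIMIT 10;
```

Using the fully qualified name lets a query reference Iceberg tables without switching the session catalog first.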

Create a catalog based on the Iceberg API

This method accesses metadata through the Iceberg API. Services such as Hadoop File System, Hive, REST, and DLF can serve as the Iceberg catalog.

Hadoop Catalog

-- Non-HA cluster
CREATE CATALOG iceberg_hadoop PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-host:8020/dir/key'
);
-- HA cluster
CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-nameservice/dir/key',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

Hive Metastore

CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

REST Catalog

This method requires a REST service to be available in advance. You must implement the REST API that returns Iceberg metadata.

CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181'
);

If the data is stored in HDFS with high availability (HA) enabled, you must also add the HDFS HA configuration to the catalog:

CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

Iceberg On Object Storage

If the data is stored in object storage, for example OSS, configure the following parameters in properties:

"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
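As an illustration of where these parameters go, the sketch below adds them to a Hive Metastore based Iceberg catalog. The catalog name `iceberg_oss` is hypothetical, the Metastore address is reused from the earlier examples, and `ak` and `sk` stand for your own AccessKey pair. Combining these particular properties in one statement is an assumption made for illustration, not a configuration confirmed by this topic.

```sql
-- Sketch: Iceberg catalog whose data files are stored in OSS.
-- iceberg_oss is a hypothetical catalog name (assumption).
CREATE CATALOG iceberg_oss PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    -- Object storage parameters from the list above.
    'oss.access_key' = 'ak',
    'oss.secret_key' = 'sk',
    'oss.endpoint' = 'oss-cn-beijing-internal.aliyuncs.com',
    'oss.region' = 'oss-cn-beijing'
);
```

Inside a PROPERTIES clause, the entries are separated by commas.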

Type mapping

Type mapping is the same as for a Hive catalog. See the column type mapping section of Hive data source.

Time Travel

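In Iceberg, each write operation on a table creates a new snapshot.

By default, read requests from SelectDB read only the latest snapshot of an Iceberg table. You can use the FOR TIME AS OF and FOR VERSION AS OF clauses to read historical data based on the time a snapshot was created or on a snapshot ID. Examples: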

-- Query the data as of a specified time.
SELECT * FROM iceberg_tbl FOR TIME AS OF "2022-10-07 17:20:37";
-- Query the data of a specified snapshot ID.
SELECT * FROM iceberg_tbl FOR VERSION AS OF 868895038****72;

You can also use the iceberg_meta table-valued function to query the snapshot information of a specified table.
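A call to iceberg_meta might look like the following sketch. It assumes the parameter form used by Apache Doris, on which SelectDB is built, where `table` takes a fully qualified `catalog.database.table` name and `query_type` is set to `snapshots`. The names `your_db` and `iceberg_tbl` are placeholders; verify the exact parameters against the table function reference for your SelectDB version.

```sql
-- List the snapshots of an Iceberg table.
-- iceberg.your_db.iceberg_tbl is a placeholder name (assumption).
SELECT * FROM iceberg_meta(
    "table" = "iceberg.your_db.iceberg_tbl",
    "query_type" = "snapshots"
);
```

A snapshot ID taken from the result can then be used in a FOR VERSION AS OF query.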
