PolarDB 开源版通过 postgresql_hll 实现高效率 UV滑动分析、实时推荐已读列表过滤-阿里云开发者社区

PolarDB 开源版通过 postgresql_hll 实现高效率 UV滑动分析、实时推荐已读列表过滤

2023-02-03 25

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

本文涉及的产品

云原生数据库 PolarDB MySQL 版，Serverless 5000PCU 100GB

简介： 背景PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力.本文将介绍PolarDB 开源版通过 postgresql_hll 实现高效率 UV...

背景

PolarDB 的云原生存算分离架构, 具备低廉的数据存储、高效扩展弹性、高速多机并行计算能力、高速数据搜索和处理; PolarDB与计算算法结合, 将实现双剑合璧, 推动业务数据的价值产出, 将数据变成生产力.

本文将介绍PolarDB 开源版通过 postgresql_hll 实现高效率 UV滑动分析、实时推荐已读列表过滤

测试环境为macos+docker, polardb部署请参考如何用 PolarDB 证明巴菲特的投资理念 - 包括PolarDB简单部署。

postgresql_hll 简单介绍

postgresql_hll是高效率存储一堆唯一值的"hash value"的插件：可以

往这个"hash value"里面追加内容.
有多少唯一值
两个hash value 的差值个数
两个hash value 的union
多个hash value 的union

一个hll可能使用十几KB存储数亿唯一值.

常见场景:

UV
滑动窗口UV
新增UV
同比, 环比

短视频推荐业务, 只推荐未读的短视频. 使用postgresql_hll可以高效率记录和过滤已读列表.

hll也有点类似bloom filter:

如果判断结果为val在hll里面, 实际val可能不在hll里面. 因为是失真存储, 那么多个val的占位bitmask可能覆盖其他val的bitmask.
如果判断结果为val不在hll里面, 则一定不在.

postgresql_hll for PolarDB

安装部署postgresql_hll for polardb

git clone --depth 1 https://github.com/citusdata/postgresql-hll  
  
export PGHOST=localhost  
[postgres@67e1eed1b4b6 ~]$ psql  
psql (11.9)  
Type "help" for help.  
  
postgres=# \q  
  
  
cd postgresql-hll/  
  
USE_PGXS=1 make  
  
USE_PGXS=1 make install  
  
USE_PGXS=1 make installcheck

使用例子

建表, 写入大量UID的行为数据. 生成按天的UV数据, 使用hll存储uid hash.

create table t1 (id int, uid int, info text, crt_time timestamp);  
create table t1_hll (dt date, hllval hll);  
insert into t1 select id, random()*100000, random()::text, now() from generate_series(1,1000000) id;  
insert into t1 select id, random()*100000, random()::text, now()+interval '1 day' from generate_series(1,1000000) id;  
insert into t1_hll select date(crt_time), hll_add_agg(hll_hash_integer(uid)) from t1 group by 1;

判断UID是否在hll hash内, 检查hll精确性.

postgres=# select t1.uid, t2.hllval=hll_add(t2.hllval, hll_hash_integer(t1.uid)) from t1 , t1_hll t2 where t2.dt=date(now()) and t1.crt_time < date(now())+1 limit 10;  
  uid  | ?column?   
-------+----------  
 95912 | t  
 69657 | t  
 53722 | t  
 95821 | t  
  2836 | t  
 66298 | t  
 68466 | t  
 10122 | t  
 27861 | t  
  6824 | t  
(10 rows)  
  
  
select * from   
  (select t1.uid, t2.hllval=hll_add(t2.hllval, hll_hash_integer(t1.uid)) as yesorno from t1 , t1_hll t2 where t2.dt=date(now()) and t1.crt_time < date(now())+1) t   
where t.yesorno=false;  
  
 uid | yesorno   
-----+---------  
(0 rows)  
  
-- 完全正确.

划窗分析, 例如直接在hll的统计表中, 统计任意7天的划窗口uv. 如果没有HLL, 划窗分析必须去基表进行统计, 性能极差. 而有了hll, 只需要访问7条记录, 聚合即可.

## What if you wanted to this week's uniques?

SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-02'::date AND date <= '2012-01-08'::date;

## Or the monthly uniques for this year?

SELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01' AND
      date <  '2013-01-01'
GROUP BY 1;

## Or how about a sliding window of uniques over the past 6 days?

SELECT date, #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);

## Or the number of uniques you saw yesterday that you didn't see today?

SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);

PolarDB 开源版通过 postgresql_hll 实现高效率 UV滑动分析、实时推荐已读列表过滤

背景

postgresql_hll 简单介绍

postgresql_hll for PolarDB

参考

热门文章

最新文章

相关课程

相关电子书

相关实验场景

推荐镜像