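In-Database AI/ML overview
AnalyticDB PostgreSQL 7.0 supports In-Database AI/ML. You can apply the algorithms and models that this feature provides to process data inside the database, which reduces the cost of moving data between systems. The In-Database AI/ML framework is compatible with the interfaces of the PostgresML open source community and adds substantial optimizations to functionality, performance, and usability. It uses GPUs and CPUs to train, fine-tune, deploy, and run inference on models. The framework is implemented on top of the pgml extension, a component of the PostgresML open source project that integrates classic machine learning libraries such as XGBoost, LightGBM, and scikit-learn.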
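Version limits
The instance is an AnalyticDB PostgreSQL 7.0 instance with kernel version v7.1.1.0 or later.
The instance resource type is elastic storage mode.
The pgml extension is installed.
Note: pgml cannot currently be installed from the console. To install it, submit a ticket and support staff will assist. To uninstall the extension, submit a ticket as well.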
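Metadata overview
In AnalyticDB PostgreSQL 7.0, the In-Database AI/ML framework is implemented on top of the pgml extension. After pgml is installed on an instance that meets the requirements above, the system automatically creates a schema named pgml. This schema contains the following metadata tables.
Metadata table | Description |
projects | Project information for training tasks. |
models | Information about trained models. |
files | Storage information for model files. |
snapshots | Snapshots of the datasets used for training. |
logs | Log output produced during training. |
deployments | Deployment information for trained models. |
When training starts, the training information is automatically written to the metadata tables above.
For a description of the pgml custom types that appear in the metadata tables (such as task, runtime, and sampling), see the machine learning usage documentation.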
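To confirm that the extension created all six metadata tables, one option is to list the tables in the pgml schema and compare the result with the names above. The following is a minimal sketch: the query uses the standard information_schema.tables view, and the helper name missing_metadata_tables is hypothetical, introduced only for illustration.

```python
# Query text only; run it with any PostgreSQL client or driver.
LIST_PGML_TABLES_SQL = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'pgml'
"""

# The six metadata tables listed in the overview above.
EXPECTED_TABLES = ("projects", "models", "files", "snapshots", "logs", "deployments")


def missing_metadata_tables(found_names):
    """Return the expected pgml tables that are absent from found_names.

    found_names is any iterable of table names, for example the first
    column of the rows returned by LIST_PGML_TABLES_SQL.
    """
    found = set(found_names)
    return [name for name in EXPECTED_TABLES if name not in found]
```

An empty return value means every metadata table is present. A non-empty list suggests the extension is not fully installed in the current database.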
projects
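The projects table records the project ID, project name, task type, creation time, and update time of each training task. The table structure and indexes are as follows.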
Table "pgml.projects"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------
id | bigint | | not null | nextval('projects_id_seq'::regclass)
name | text | | not null |
task | task | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"projects_pkey" PRIMARY KEY, btree (id)
"projects_name_idx" btree (name)
Triggers:
projects_auto_updated_at BEFORE UPDATE ON projects FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_projects BEFORE INSERT ON projects FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_projects()
Distributed Replicated
models
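The models table records the parameters specified for model training, together with the associated project ID, snapshot ID, and related information. The table structure and indexes are as follows.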
Table "pgml.models"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+------------------------------------
id | bigint | | not null | nextval('models_id_seq'::regclass)
project_id | bigint | | not null |
snapshot_id | bigint | | |
num_features | integer | | not null |
algorithm | text | | not null |
runtime | runtime | | | 'python'::runtime
hyperparams | jsonb | | not null |
status | text | | not null |
metrics | jsonb | | |
search | text | | |
search_params | jsonb | | not null |
search_args | jsonb | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"models_pkey" PRIMARY KEY, btree (id)
"models_project_id_idx" btree (project_id)
"models_snapshot_id_idx" btree (snapshot_id)
Triggers:
models_auto_updated_at BEFORE UPDATE ON models FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_models BEFORE INSERT ON models FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_models_fk()
Distributed Replicated
files
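After training finishes, each file in the model directory is saved in binary form to the data column of the files table. The binary stream of a file is split into slices of 100 MB each for storage. The table structure and indexes are as follows.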
Table "pgml.files"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+-----------------------------------
id | bigint | | not null | nextval('files_id_seq'::regclass)
model_id | bigint | | not null |
path | text | | not null |
part | integer | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
data | bytea | | not null |
Indexes:
"files_pkey" PRIMARY KEY, btree (id)
"files_model_id_path_part_idx" btree (model_id, path, part)
Triggers:
files_auto_updated_at BEFORE UPDATE ON files FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_files BEFORE INSERT ON files FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_files()
Distributed Replicated
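The slicing scheme can be illustrated with a short sketch of how a file might be split into parts and rebuilt from the rows that share one model_id and path. This is not the extension's actual implementation. Several details are assumptions: that part numbers start at 0, that parts are concatenated in ascending part order, and that 100 MB means 100 × 1024 × 1024 bytes. The function names are hypothetical.

```python
# Assumed slice size: 100 MB taken as 100 * 1024 * 1024 bytes.
PART_SIZE = 100 * 1024 * 1024


def split_into_parts(data, part_size=PART_SIZE):
    """Yield (part, chunk) pairs for one file.

    Numbering parts from 0 is an assumption made for this sketch.
    """
    for part, offset in enumerate(range(0, len(data), part_size)):
        yield part, data[offset:offset + part_size]


def join_parts(rows):
    """Rebuild a file from (part, data) rows with the same model_id and path.

    Rows may arrive in any order; they are sorted by part before joining.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    return b"".join(chunk for _, chunk in ordered)
```

When reading a model file back from pgml.files, the same idea applies: select the part and data columns for one model_id and path, then concatenate the chunks in part order.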
snapshots
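The snapshots table records snapshot information about the dataset used for training, such as the data table name and the test set split settings. The table structure and indexes are as follows.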
Table "pgml.snapshots"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------------------------------------
id | bigint | | not null | nextval('snapshots_id_seq'::regclass)
relation_name | text | | not null |
y_column_name | text[] | | |
test_size | real | | not null |
test_sampling | sampling | | not null |
status | text | | not null |
columns | jsonb | | |
analysis | jsonb | | |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
materialized | boolean | | | false
Indexes:
"snapshots_pkey" PRIMARY KEY, btree (id)
Triggers:
snapshots_auto_updated_at BEFORE UPDATE ON snapshots FOR EACH ROW EXECUTE FUNCTION set_updated_at()
Distributed Replicated
logs
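The logs table records the information output during training. A single training task can produce multiple log entries. To read them in chronological order, sort by the created_at column in ascending order. The table structure and indexes are as follows.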
Table "pgml.logs"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+----------------------------------
id | integer | | not null | nextval('logs_id_seq'::regclass)
model_id | bigint | | |
project_id | bigint | | |
created_at | timestamp without time zone | | | CURRENT_TIMESTAMP
logs | jsonb | | |
Indexes:
"logs_pkey" PRIMARY KEY, btree (id)
Distributed Replicated
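Reading the log entries of one model in chronological order could look like the sketch below. It assumes a Python DB-API 2.0 connection whose driver uses the %s parameter style (psycopg2 is one such driver). The function name fetch_training_logs is hypothetical. Only columns shown in the table definition above are used.

```python
# Log entries for one model, oldest first.
# The %s placeholder assumes a driver with the "format" parameter style.
LOGS_SQL = """
SELECT created_at, logs
FROM pgml.logs
WHERE model_id = %s
ORDER BY created_at ASC
"""


def fetch_training_logs(conn, model_id):
    """Return (created_at, logs) rows for model_id in ascending time order.

    conn is any DB-API 2.0 connection to the instance (an assumption).
    """
    cur = conn.cursor()
    try:
        cur.execute(LOGS_SQL, (model_id,))
        return cur.fetchall()
    finally:
        cur.close()
```

To follow a whole project instead of one model, the same query can filter on project_id, which the logs table also stores.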
deployments
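When a model needs to be deployed, the system creates a deployment record that links the project ID, deployment ID, and model ID. The deployments table also records the deployment strategy. The table structure and indexes are as follows.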
Table "pgml.deployments"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+-----------------------------------------
id | bigint | | not null | nextval('deployments_id_seq'::regclass)
project_id | bigint | | not null |
model_id | bigint | | not null |
strategy | strategy | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"deployments_pkey" PRIMARY KEY, btree (id)
"deployments_model_id_created_at_idx" btree (model_id)
"deployments_project_id_created_at_idx" btree (project_id)
Triggers:
deployments_auto_updated_at BEFORE UPDATE ON deployments FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_deployments BEFORE INSERT ON deployments FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_deployments_fk()
Distributed Replicated
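Because deployments links a project to a model, the most recent deployment record of each project can be found by joining it to projects. The sketch below assumes a Python DB-API 2.0 connection and that the PostgreSQL DISTINCT ON clause is available on the instance. The function name latest_deployments is hypothetical. Whether the newest record is the one currently serving predictions is not stated in this document, so treat the result only as "latest record per project".

```python
# Newest deployment row per project, with the project name attached.
# DISTINCT ON is PostgreSQL syntax; its availability here is an assumption.
LATEST_DEPLOYMENTS_SQL = """
SELECT DISTINCT ON (d.project_id)
       p.name, d.model_id, d.strategy, d.created_at
FROM pgml.deployments AS d
JOIN pgml.projects AS p ON p.id = d.project_id
ORDER BY d.project_id, d.created_at DESC
"""


def latest_deployments(conn):
    """Return (name, model_id, strategy, created_at), one row per project."""
    cur = conn.cursor()
    try:
        cur.execute(LATEST_DEPLOYMENTS_SQL)
        return cur.fetchall()
    finally:
        cur.close()
```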