Build and deploy machine learning models in the database - In-Database AI/ML - AnalyticDB cloud-native data warehouse - Alibaba Cloud

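AnalyticDB for PostgreSQL V7.0 supports In-Database AI/ML, which performs data processing and model computation directly inside the database and so reduces the cost of moving data between systems. The feature is implemented by the pgml extension, which is compatible with the interfaces of the open source PostgresML community and has been optimized for performance, functionality, and ease of use. It supports model training, fine-tuning, deployment, and inference with GPU or CPU acceleration. Mainstream machine learning libraries such as XGBoost, LightGBM, and SciKit-Learn are built in, so that enterprises can build intelligent analytics applications efficiently.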

Prerequisites

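An AnalyticDB for PostgreSQL V7.0 instance whose kernel version is V7.1.1.0 or later.
Note
You can view the minor kernel version on the Basic Information page of the instance in the console. If the instance does not meet this version requirement, upgrade the minor kernel version.
The instance resource type is elastic storage mode.
The pgml extension is installed.
Note
- pgml cannot currently be installed from the console. To install it, submit a ticket to request assistance. To uninstall the extension, also submit a ticket.
- The pgml extension cannot currently be installed or used on AnalyticDB for PostgreSQL V7.0 Economy Edition.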
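
The kernel version requirement above can be checked programmatically. The following is a minimal sketch, assuming the version is available as a string in the form shown above (for example "V7.1.1.0"); the function names are hypothetical and are not part of pgml or of any AnalyticDB API. It compares versions numerically, because a plain string comparison would rank "V7.10.0.0" below "V7.2.0.0".

```python
import re

# Minimum kernel version from the prerequisites: V7.1.1.0.
MIN_KERNEL_VERSION = (7, 1, 1, 0)


def parse_kernel_version(text):
    """Turn a string such as 'V7.1.1.0' into a tuple of ints, e.g. (7, 1, 1, 0)."""
    match = re.fullmatch(r"[vV]?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized kernel version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(text, minimum=MIN_KERNEL_VERSION):
    """Return True if the version is at least the minimum, compared numerically."""
    version = parse_kernel_version(text)
    # Pad the shorter tuple with zeros so that 'V7.2' is treated as 7.2.0.0.
    width = max(len(version), len(minimum))
    padded_version = version + (0,) * (width - len(version))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_version >= padded_minimum


if __name__ == "__main__":
    for candidate in ("V7.0.6.3", "V7.1.1.0", "V7.10.0.0"):
        print(candidate, meets_minimum(candidate))
```

The other prerequisites (resource type, installed extension) still have to be confirmed in the console or by the support staff; this sketch covers only the version comparison.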

Metadata overview

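The In-Database AI/ML framework in AnalyticDB for PostgreSQL V7.0 is built on the pgml extension. After pgml is installed on an instance that meets the version requirement, the system automatically creates a schema named pgml. This schema contains the following metadata tables.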

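Metadata table	Description
projects	The project information for training tasks.
models	Information about trained models.
files	Storage information for model files.
snapshots	Snapshots of the datasets used for training.
logs	Log output produced during training.
deployments	Deployment information for trained models.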

When a training task is started, its information is automatically written to these metadata tables.

Note

For a description of the custom pgml types used in the metadata tables (such as task, runtime, and sampling), see the "Machine learning" topic.

projects

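The projects table records the project ID, project name, task type, creation time, and update time of each training task. The table structure and indexes are as follows.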

                                         Table "pgml.projects"
   Column   |            Type             | Collation | Nullable |               Default                
------------+-----------------------------+-----------+----------+--------------------------------------
 id         | bigint                      |           | not null | nextval('projects_id_seq'::regclass)
 name       | text                        |           | not null | 
 task       | task                        |           | not null | 
 created_at | timestamp without time zone |           | not null | clock_timestamp()
 updated_at | timestamp without time zone |           | not null | clock_timestamp()
Indexes:
    "projects_pkey" PRIMARY KEY, btree (id)
    "projects_name_idx" btree (name)
Triggers:
    projects_auto_updated_at BEFORE UPDATE ON projects FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_projects BEFORE INSERT ON projects FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_projects()
Distributed Replicated

models

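The models table records the parameters specified when a model is trained, together with the associated project ID and snapshot ID. The table structure and indexes are as follows.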

                                           Table "pgml.models"
    Column     |            Type             | Collation | Nullable |              Default               
---------------+-----------------------------+-----------+----------+------------------------------------
 id            | bigint                      |           | not null | nextval('models_id_seq'::regclass)
 project_id    | bigint                      |           | not null | 
 snapshot_id   | bigint                      |           |          | 
 num_features  | integer                     |           | not null | 
 algorithm     | text                        |           | not null | 
 runtime       | runtime                     |           |          | 'python'::runtime
 hyperparams   | jsonb                       |           | not null | 
 status        | text                        |           | not null | 
 metrics       | jsonb                       |           |          | 
 search        | text                        |           |          | 
 search_params | jsonb                       |           | not null | 
 search_args   | jsonb                       |           | not null | 
 created_at    | timestamp without time zone |           | not null | clock_timestamp()
 updated_at    | timestamp without time zone |           | not null | clock_timestamp()
Indexes:
    "models_pkey" PRIMARY KEY, btree (id)
    "models_project_id_idx" btree (project_id)
    "models_snapshot_id_idx" btree (snapshot_id)
Triggers:
    models_auto_updated_at BEFORE UPDATE ON models FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_models BEFORE INSERT ON models FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_models_fk()
Distributed Replicated

files

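After training finishes, each file in the model directory is saved in binary form to the data column of the files table. The binary stream of each file is split into slices of 100 MB, and each slice is stored separately. The table structure and indexes are as follows.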

                                         Table "pgml.files"
   Column   |            Type             | Collation | Nullable |              Default              
------------+-----------------------------+-----------+----------+-----------------------------------
 id         | bigint                      |           | not null | nextval('files_id_seq'::regclass)
 model_id   | bigint                      |           | not null | 
 path       | text                        |           | not null | 
 part       | integer                     |           | not null | 
 created_at | timestamp without time zone |           | not null | clock_timestamp()
 updated_at | timestamp without time zone |           | not null | clock_timestamp()
 data       | bytea                       |           | not null | 
Indexes:
    "files_pkey" PRIMARY KEY, btree (id)
    "files_model_id_path_part_idx" btree (model_id, path, part)
Triggers:
    files_auto_updated_at BEFORE UPDATE ON files FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_files BEFORE INSERT ON files FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_files()
Distributed Replicated
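
The slicing described for pgml.files can be illustrated with a short sketch. This is not the pgml implementation: it assumes that the part column holds the slice order, that numbering starts at 0, and that "100MB" means 100 * 1024 * 1024 bytes. None of these details are stated in the source, and the function names are hypothetical.

```python
# Assumed slice size: the text says 100 MB; the exact byte count is an assumption.
PART_SIZE = 100 * 1024 * 1024


def split_into_parts(data, part_size=PART_SIZE):
    """Yield (part, chunk) pairs; each chunk holds at most part_size bytes.

    The part index starts at 0 here, which is an assumption about pgml.files.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    for part, offset in enumerate(range(0, len(data), part_size)):
        yield part, data[offset:offset + part_size]


def reassemble(rows):
    """Rebuild the original bytes from (part, chunk) rows in any order."""
    ordered = sorted(rows, key=lambda row: row[0])
    return b"".join(chunk for _, chunk in ordered)


if __name__ == "__main__":
    payload = bytes(range(10))
    # A tiny part_size keeps the demonstration small.
    parts = list(split_into_parts(payload, part_size=4))
    print([(part, len(chunk)) for part, chunk in parts])
    print(reassemble(reversed(parts)) == payload)
```

Reading a model file back from the table would follow the same pattern: fetch the rows for one model_id and path, order them by part, and concatenate the data values.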

snapshots

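The snapshots table records snapshot information about the dataset used for training, such as the name of the data table and how the test set is split. The table structure and indexes are as follows.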

                                           Table "pgml.snapshots"
    Column     |            Type             | Collation | Nullable |                Default                
---------------+-----------------------------+-----------+----------+---------------------------------------
 id            | bigint                      |           | not null | nextval('snapshots_id_seq'::regclass)
 relation_name | text                        |           | not null | 
 y_column_name | text[]                      |           |          | 
 test_size     | real                        |           | not null | 
 test_sampling | sampling                    |           | not null | 
 status        | text                        |           | not null | 
 columns       | jsonb                       |           |          | 
 analysis      | jsonb                       |           |          | 
 created_at    | timestamp without time zone |           | not null | clock_timestamp()
 updated_at    | timestamp without time zone |           | not null | clock_timestamp()
 materialized  | boolean                     |           |          | false
Indexes:
    "snapshots_pkey" PRIMARY KEY, btree (id)
Triggers:
    snapshots_auto_updated_at BEFORE UPDATE ON snapshots FOR EACH ROW EXECUTE FUNCTION set_updated_at()
Distributed Replicated

logs

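The logs table records the information output during training. A single training task may produce multiple log entries; sort by the created_at column in ascending order to read them chronologically. The table structure and indexes are as follows.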

                                         Table "pgml.logs"
   Column   |            Type             | Collation | Nullable |             Default              
------------+-----------------------------+-----------+----------+----------------------------------
 id         | integer                     |           | not null | nextval('logs_id_seq'::regclass)
 model_id   | bigint                      |           |          | 
 project_id | bigint                      |           |          | 
 created_at | timestamp without time zone |           |          | CURRENT_TIMESTAMP
 logs       | jsonb                       |           |          | 
Indexes:
    "logs_pkey" PRIMARY KEY, btree (id)
Distributed Replicated
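
As an illustration of reading the logs chronologically, the sketch below builds a parameterized query against pgml.logs using only the columns listed above. The helper name is hypothetical, and the %s placeholders assume a DB-API driver with the "format" parameter style (psycopg2, for example); the sketch only constructs the statement and does not connect to a database.

```python
def build_log_query(model_id=None, project_id=None):
    """Return (sql, params) that select pgml.logs rows, oldest first."""
    conditions = []
    params = []
    if model_id is not None:
        conditions.append("model_id = %s")
        params.append(model_id)
    if project_id is not None:
        conditions.append("project_id = %s")
        params.append(project_id)

    sql = "SELECT id, created_at, logs FROM pgml.logs"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    # Ascending created_at gives chronological order; id breaks timestamp ties.
    sql += " ORDER BY created_at ASC, id ASC"
    return sql, tuple(params)


if __name__ == "__main__":
    sql, params = build_log_query(model_id=3)
    print(sql)
    print(params)
```

With a driver such as psycopg2, the returned pair would typically be passed as cursor.execute(sql, params), which keeps the IDs out of the SQL text itself.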

deployments

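When a model is deployed, the system creates a deployment record that ties a deployment ID to the project ID and the model ID. The deployments table also records the deployment strategy. The table structure and indexes are as follows.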

                                         Table "pgml.deployments"
   Column   |            Type             | Collation | Nullable |                 Default                 
------------+-----------------------------+-----------+----------+-----------------------------------------
 id         | bigint                      |           | not null | nextval('deployments_id_seq'::regclass)
 project_id | bigint                      |           | not null | 
 model_id   | bigint                      |           | not null | 
 strategy   | strategy                    |           | not null | 
 created_at | timestamp without time zone |           | not null | clock_timestamp()
Indexes:
    "deployments_pkey" PRIMARY KEY, btree (id)
    "deployments_model_id_created_at_idx" btree (model_id)
    "deployments_project_id_created_at_idx" btree (project_id)
Triggers:
    deployments_auto_updated_at BEFORE UPDATE ON deployments FOR EACH ROW EXECUTE FUNCTION set_updated_at()
    trigger_before_insert_pgml_deployments BEFORE INSERT ON deployments FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_deployments_fk()
Distributed Replicated