Machine Learning Usage Guide

AnalyticDB for PostgreSQL 7.0 supports in-database AI/ML. The AI side integrates Hugging Face models, and the ML side integrates algorithms from XGBoost, LightGBM, CatBoost, Linfa, and all of Scikit-Learn. With this feature you can train, fine-tune, deploy, and run inference on AI/ML models directly against data in tables or views. This topic describes how to train, deploy, and run inference for machine learning in AnalyticDB for PostgreSQL 7.0.

Training

The training interface is provided as a UDF named pgml.train. Training is currently supported only on the coordinator (master) node.

Syntax

CREATE FUNCTION pgml.train(
    "project_name" TEXT,
    "task" TEXT DEFAULT NULL,
    "relation_name" TEXT DEFAULT NULL,
    "y_column_name" TEXT DEFAULT NULL,
    "algorithm" pgml.Algorithm DEFAULT 'linear',
    "hyperparams" jsonb DEFAULT '{}',
    "search" pgml.Search DEFAULT NULL,
    "search_params" jsonb DEFAULT '{}',
    "search_args" jsonb DEFAULT '{}',
    "test_size" real DEFAULT 0.25,
    "test_sampling" pgml.Sampling DEFAULT 'stratified',
    "runtime" pgml.Runtime DEFAULT NULL,
    "automatic_deploy" bool DEFAULT true,
    "materialize_snapshot" bool DEFAULT false, 
    "preprocess" jsonb DEFAULT '{}'
) RETURNS TABLE (
    "project" TEXT,
    "task" TEXT,
    "algorithm" TEXT,
    "deployed" bool
)
EXECUTE ON COORDINATOR
LANGUAGE c /* Rust */
AS 'MODULE_PATHNAME', 'train_wrapper';

Parameters

  • project_name: Project name. Example: 'Breast Cancer Detection'
  • task: Task name. Example: 'classification'
  • relation_name: Table name; a table in another schema can be referenced as '$SCHEMA.$TABLE'. Example: 'pgml.breast_cancer'
  • y_column_name: Name of the label column in the table to train against. Example: 'malignant'
  • algorithm: Algorithm name; see "pgml custom types" below for supported values. Example: 'xgboost'
  • hyperparams: Training hyperparameters. Example: '{"n_estimators": 25, "nthread": 4}'
  • search: Search strategy for finding the best hyperparameters; see "Hyperparameter search" below. Example: 'grid'
  • search_params: Ranges or values of the hyperparameters explored during the search. For details, see the official Scikit-Learn/XGBoost documentation. Example: '{"max_depth": [1, 2, 3, 4]}'
  • search_args: Additional arguments for the search strategy. For details, see the official Scikit-Learn/XGBoost documentation. Example: '{"n_iter": 10}'
  • test_size: Fraction of the data split off as the test set. Example: 0.25
  • test_sampling: Method used to split off the test set; see "pgml custom types" below. Example: 'random'
  • runtime: Runtime of the algorithm implementation, either Python or Rust. Example: 'python'
  • automatic_deploy: Whether to deploy automatically after training. Example: true
  • materialize_snapshot: Whether to materialize the training data snapshot. Example: true
  • preprocess: Per-column preprocessing configuration. Example: '{"col_name": {"impute": "mean", "scale": "standard"}}'

Note

Unless otherwise specified, training uses a single core by default. You can control this by passing hyperparameters through hyperparams; see the parameter documentation of the specific model. For example, XGBoost uses the nthread hyperparameter to control the number of cores used for training.
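For instance, a minimal training call that sets nthread through hyperparams could look like the following sketch; the project name, table, and label column are illustrative placeholders.

-- A minimal sketch: train an XGBoost classifier on 4 cores.
-- 'My Project', pgml.my_table, and 'label' are placeholder names.
SELECT * FROM pgml.train(
    'My Project',
    'classification',
    'pgml.my_table',
    'label',
    algorithm => 'xgboost',
    hyperparams => '{"n_estimators": 25, "nthread": 4}'
);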

pgml custom types

The following describes the values currently supported by the three custom parameter types pgml.algorithm, pgml.sampling, and pgml.runtime.

  • pgml.algorithm is a custom enum type that supports the following algorithms.

    linear: Linear model for regression analysis.
    xgboost: An optimized distributed gradient boosting library.
    xgboost_random_forest: A variant of XGBoost tuned specifically for random forests.
    svm: Support vector machine for classification and regression.
    lasso: A regression analysis algorithm used in statistics and machine learning.
    elastic_net: Elastic net regression, combining L1 and L2 regularization.
    ridge: Ridge regression, which uses L2 regularization to prevent overfitting.
    kmeans: Clustering algorithm.
    dbscan: Density-based clustering algorithm that can find clusters of arbitrary shape.
    knn: k-nearest neighbors algorithm for regression.
    random_forest: Ensemble learning algorithm for classification and regression.
    least_angle: Least angle regression, compatible with lasso.
    lasso_least_angle: Lasso least angle regression, combining the characteristics of LARS and lasso.
    orthogonal_matching_pursuit: Greedy algorithm for sparse signal recovery.
    bayesian_ridge: Bayesian ridge regression, which estimates regression parameters through prior distributions.
    automatic_relevance_determination: A Bayesian regression algorithm.
    stochastic_gradient_descent: Simple and efficient optimization algorithm for regression and classification.
    perceptron: A linear model algorithm for binary classification.
    passive_aggressive: Online learning algorithm suited to large-scale learning.
    ransac: Random sample consensus method for robust linear regression.
    theil_sen: Theil-Sen estimator for robust linear regression.
    huber: Huber regression, robust to outliers.
    quantile: Quantile regression, which predicts conditional quantiles such as the median.
    kernel_ridge: Kernel ridge regression, which learns nonlinear relationships through the kernel trick.
    gaussian_process: Gaussian processes for probabilistic regression and classification.
    nu_svm: Support vector machine variant for classification and regression.
    ada_boost: Boosting algorithm for classification problems.
    bagging: Ensemble algorithm that reduces overfitting.
    extra_trees: An extension of the random forest algorithm.
    gradient_boosting_trees: Boosting algorithm for regression and classification problems.
    hist_gradient_boosting: Gradient boosting method suited to large datasets.
    linear_svm: Linear support vector machine suited to classification problems.
    lightgbm: Gradient boosting framework developed by Microsoft for large-scale data.
    affinity_propagation: Message-passing clustering algorithm based on similarity between data points.
    birch: Clustering algorithm particularly suited to large datasets.
    feature_agglomeration: Clustering method used for feature selection.
    mini_batch_kmeans: Variant of k-means clustering suited to large or online datasets.
    mean_shift: Centroid-based clustering algorithm that can find clusters of any shape.
    optics: Algorithm that identifies clusters of varying density.
    spectral: Clustering based on the spectral properties of the data.
    spectral_bi: Clusters both dimensions of the data (for example rows and columns) simultaneously.
    spectral_co: Co-clustering of two dimensions using spectral methods.
    catboost: Gradient boosting machine effective for classification/regression problems.
    pca: Technique for dimensionality reduction.

  • pgml.sampling is a custom enum type that supports the following values.

    random: The test set is sampled randomly.
    last: The test set is taken from the tail of a sequential scan.
    stratified: The test set is sampled with stratification, preserving class proportions.

  • pgml.runtime is a custom enum type that supports the following values.

    Python: The algorithm is implemented in Python.
    Rust: The algorithm is implemented in Rust.
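As a sketch of how these enum-typed options are passed together in one call (the project and table names are placeholders):

-- Sketch: explicit runtime and test-split options; names are placeholders
SELECT * FROM pgml.train(
    'My Project',
    'regression',
    'pgml.my_table',
    'target',
    algorithm => 'linear',
    runtime => 'python',
    test_size => 0.2,
    test_sampling => 'random'
);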

Hyperparameter search

During model training, hyperparameter search and cross-validation can be used to find an optimal set of training hyperparameters. The training interface parameters involved are search, search_params, and search_args.

  • pgml.search is a custom enum type that supports the following strategies.

    grid: Trains every combination of the parameters in search_params (their Cartesian product).
    random: Randomly samples combinations of the parameters in search_params, for at most the n_iter iterations given in search_args.

  • search_params: Defines the ranges or values of the hyperparameters explored during the search. For hyperparameter details, see the official Scikit-Learn/XGBoost documentation.

  • search_args supports the following formats.

    '{ "n_iter": K }': Applies to the random search strategy.
    '{ "cv": K }': The K used for k-fold cross-validation.

    Example

    In the following example, grid search trains len(max_depth) * len(n_estimators) * k_folds = 3 * 3 * 3 = 27 combinations, comparing every possible parameter permutation in search_params.

    Note
    • For regression, the R2 score is used by default to select the best hyperparameters.

    • For classification, the F1 score is used by default to select the best hyperparameters.

    SELECT * FROM pgml.train(
        'Handwritten Digit Image Classifier', 
        'classification',
        'pgml.digits',
        'target',
        algorithm => 'xgboost', 
        search => 'grid', 
        search_params => '{
            "max_depth": [1, 3, 5], 
            "n_estimators": [20, 80, 160]
        }',
        search_args => '{ "cv": 3 }'
    );

Preprocessing

The training function provides the preprocess parameter for preprocessing data. Preprocessors can be configured per column and are fit on the training data. pgml supports the following three kinds of preprocessing; a combined example follows the three lists below.

Note

The preprocessor configuration is saved in the pgml.snapshots metadata table after training, and the same preprocessing is applied to the input data in subsequent calls to pgml.predict().
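To inspect what was saved, you can query the metadata table directly. This is a sketch; the column list assumes the stock pgml schema and may differ across versions:

-- Sketch: list recent training snapshots (assumes stock pgml.snapshots columns)
SELECT id, relation_name, y_column_name, test_size, test_sampling
FROM pgml.snapshots
ORDER BY id DESC
LIMIT 5;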

Categorical encoding

Encodes categorical values as numeric types.

  • native: Default. Non-text types are cast to 32-bit floats.
  • target: Text values are encoded as the mean of the target for that category.
  • one_hot: Converts categorical values to one-hot encoding.
  • ordinal: Converts each value to its index in the array given by ordinal; NULL maps to index 0 by default.

Missing value imputation

Replaces NULL or NaN with a numeric value.

  • error: Default. Raises an error when a NULL or NaN value is detected during training or inference.
  • mean: Imputes the mean of the training data.
  • median: Imputes the median of the training data.
  • mode: Imputes the most frequent value in the training data.
  • min: Imputes the minimum of the training data.
  • max: Imputes the maximum of the training data.
  • zero: Imputes 0.

Scaling

Scales all variables into a standard range, which helps the model generalize.

  • preserve: Default. No scaling is applied.
  • standard: Standardizes features by removing the mean and scaling to unit variance.
  • min-max: Scales features to a given minimum and maximum (typically 0 and 1).
  • max-abs: Scales data by dividing each feature by its maximum absolute value, so that the maximum absolute value of each feature in the training set is 1.
  • robust: Scales data based on the interquartile range (first to third quartiles) of each column. This method is more robust to outliers than the others.
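As referenced above, here is a sketch that combines the three preprocessor kinds on hypothetical columns ("color" categorical, "weight" numeric with missing values); all names are placeholders:

-- Sketch: encode, impute, and scale; table and column names are placeholders
SELECT * FROM pgml.train(
    'My Project',
    'regression',
    'pgml.my_table',
    'target',
    algorithm => 'xgboost',
    preprocess => '{
        "color":  {"encode": "one_hot"},
        "weight": {"impute": "mean", "scale": "standard"}
    }'
);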

Deployment

Deployment stores the trained project information, preprocessing configuration, and model parameters in metadata tables, so that subsequent inference only needs a project name or model ID. This simplifies the inference workflow.

The deployment strategy parameter type is pgml.strategy, with the following values.

  • best_score: The model with the best metric is deployed.
  • most_recent: The most recently trained model is deployed, regardless of metrics.
  • rollback: Reverts to the deployment prior to the current model.

Training provides the automatic_deploy parameter (default true) for automatic deployment: across multiple training runs, the best model is selected and deployed, by highest F1 score for classification tasks and highest R2 score for regression tasks. You can also deploy a specific trained model manually with either of the following two UDFs; a short usage sketch follows the list.

  • Force-deploy the model with the given model_id.

    CREATE FUNCTION pgml."deploy"(
            "model_id" bigint
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    STRICT EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_model_wrapper';
  • Deploy a model selected from all models of the current project by the given strategy and, optionally, algorithm.

    CREATE FUNCTION pgml."deploy"(
            "project_name" TEXT,
            "strategy" pgml.Strategy,
            "algorithm" pgml.Algorithm DEFAULT NULL
    ) RETURNS TABLE (
            "project" TEXT,
            "strategy" TEXT,
            "algorithm" TEXT
    )
    EXECUTE ON COORDINATOR
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'deploy_strategy_wrapper';
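As referenced above, manual deployment calls look like the following sketch; the model ID and project name are illustrative:

-- Sketch: deploy a specific model by (illustrative) ID
SELECT * FROM pgml.deploy(42);
-- Sketch: deploy the best-scoring xgboost model in a project
SELECT * FROM pgml.deploy('My Project', 'best_score', 'xgboost');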

Inference

Single-row inference

Single-row inference is implemented with predict, which runs inference on one set of features by project name or model ID. The return value is a float4: the predicted class for classification, or the predicted value for regression. The variants below cover the different usage patterns; a short usage sketch follows the list.

Note

The model resolved by project name is the most recently deployed model of that project.

  • Run inference on a table row by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_row_wrapper';
  • Run inference on a table row by model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "row" anyelement
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_row_wrapper';
  • Run inference on float4 features by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f32_wrapper';
  • Run inference on float4 features by model ID.

    CREATE FUNCTION pgml."predict"(
            "model_id" bigint,
            "features" real[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_wrapper';
  • Run inference on bigint features by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bigint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i64_wrapper';
  • Run inference on int features by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" INT[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i32_wrapper';
  • Run inference on smallint features by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" smallint[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_i16_wrapper';
  • Run inference on float8 features by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" double precision[]
    ) RETURNS real
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_f64_wrapper';
  • Run inference on bool features by project name.

    CREATE FUNCTION pgml."predict"(
            "project_name" TEXT,
            "features" bool[]
    ) RETURNS real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_bool_wrapper';
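As referenced above, a sketch of both input styles; the project, table, and column names are placeholders:

-- Sketch: row (tuple) input by project name
SELECT pgml.predict('My Project', (col_a, col_b, col_c)) AS prediction
FROM pgml.my_table
LIMIT 10;

-- Sketch: float4 array input by project name
SELECT pgml.predict('My Project', ARRAY[1.0, 2.5, 0.0]::real[]) AS prediction;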

Batch inference

Batch inference runs inference on multiple sets of float4 features by project name or model ID. It is implemented with predict_batch, predict_proba, and predict_joint.

Note

The input feature sets must be flattened into a one-dimensional array; the return value is an array of float4.

predict_batch

In most cases, use predict_batch for batch inference. The two variants below cover the different usage patterns; a sketch follows the list.

  • Run batch inference on flattened features by model ID.

    CREATE FUNCTION pgml."predict_batch"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_batch_wrapper';
  • Run batch inference on flattened features by project name.

    CREATE FUNCTION pgml."predict_batch"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS SETOF real /* f32 */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_batch_wrapper';
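As referenced above, a sketch with two 3-feature rows flattened into a single array; the project name is a placeholder:

-- Sketch: two rows of three features each, flattened to one real[]
SELECT pgml.predict_batch(
    'My Project',
    ARRAY[0.1, 0.2, 0.3,
          0.4, 0.5, 0.6]::real[]
) AS prediction;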

predict_proba

When you need confidence probabilities, predict_proba is the right choice. Its usage is similar to predict_batch, except that it returns the raw model output, that is, the confidence probability of each class. A sketch follows the list below.

  • Run batch inference on flattened features by model ID.

    CREATE FUNCTION pgml."predict_proba"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_proba_wrapper';
  • Run batch inference on flattened features by project name.

    CREATE FUNCTION pgml."predict_proba"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_proba_wrapper';
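As referenced above, a sketch; for a binary classifier the result contains one probability per class for each input row (the project name is a placeholder):

-- Sketch: class probabilities for a single flattened input row
SELECT pgml.predict_proba('My Project', ARRAY[0.1, 0.2, 0.3]::real[]) AS probabilities;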

predict_joint

When you need to predict multiple label columns at once, use predict_joint; classification scenarios are currently not supported. predict_joint returns a multi-dimensional prediction for each set of features. A sketch follows the list below.

  • Run inference on multiple sets of float4 features by model ID.

     CREATE FUNCTION pgml."predict_joint"(
        "model_id" bigint, /* i64 */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_model_joint_wrapper';
    
  • Run inference on multiple sets of float4 features by project name.

    CREATE FUNCTION pgml."predict_joint"(
        "project_name" TEXT, /* &str */
        "features" real[] /* alloc::vec::Vec<f32> */
    ) RETURNS real[] /* alloc::vec::Vec<f32> */
    IMMUTABLE STRICT PARALLEL SAFE 
    LANGUAGE c /* Rust */
    AS 'MODULE_PATHNAME', 'predict_joint_wrapper';
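As referenced above, a sketch for a regression model trained on multiple label columns; the project name is a placeholder:

-- Sketch: returns one prediction per label column for the input features
SELECT pgml.predict_joint('My Project', ARRAY[0.1, 0.2, 0.3]::real[]) AS predictions;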

Usage examples

Complete SQL examples for classification, regression, clustering, dimensionality reduction, and preprocessing follow.

Classification

Full classification example

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('breast_cancer');

-- view the dataset
SELECT * FROM pgml.breast_cancer LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Breast Cancer Detection', 'classification', 'pgml.breast_cancer', 'malignant');

-- check out the predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

-- view raw class probabilities
SELECT malignant, pgml.predict_proba(
    'Breast Cancer Detection',
    ARRAY[
        "mean radius",
        "mean texture",
        "mean perimeter",
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    ]
) AS prediction
FROM pgml.breast_cancer
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ridge');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'perceptron');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'passive_aggressive');

-- support vector machines
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'nu_svm');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'linear_svm');

-- ensembles
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'ada_boost');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'bagging');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'random_forest', hyperparams => '{"n_estimators": 10}');

-- other
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'gaussian_process', hyperparams => '{"max_iter_predict": 100, "warm_start": true}');

-- Gradient Boosting
SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Breast Cancer Detection', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 2}');


-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'f1' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Breast Cancer Detection', 
    algorithm => 'xgboost', 
    search => 'grid', 
    search_params => '{
        "n_estimators": [2, 4],
        "max_depth": [1, 2, 3]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'most_recent');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'rollback');
SELECT * FROM pgml.deploy('Breast Cancer Detection', 'best_score', 'svm');

-- check out the improved predictions
SELECT malignant, pgml.predict(
    'Breast Cancer Detection', 
    (
        "mean radius", 
        "mean texture", 
        "mean perimeter", 
        "mean area",
        "mean smoothness",
        "mean compactness",
        "mean concavity",
        "mean concave points",
        "mean symmetry",
        "mean fractal dimension",
        "radius error",
        "texture error",
        "perimeter error",
        "area error",
        "smoothness error",
        "compactness error",
        "concavity error",
        "concave points error",
        "symmetry error",
        "fractal dimension error",
        "worst radius",
        "worst texture",
        "worst perimeter",
        "worst area",
        "worst smoothness",
        "worst compactness",
        "worst concavity",
        "worst concave points",
        "worst symmetry",
        "worst fractal dimension"
    )
) AS prediction
FROM pgml.breast_cancer 
LIMIT 10;

Regression

Full regression example

SELECT pgml.load_dataset('diabetes');

-- view the dataset
SELECT * FROM pgml.diabetes LIMIT 10;

-- train a simple model on the data
SELECT * FROM pgml.train('Diabetes Progression', 'regression', 'pgml.diabetes', 'target');

-- check out the predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

-- Check predictions against a specific model id
SELECT model_id, target, pgml.predict(model_id, (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes
CROSS JOIN LATERAL (
    SELECT pgml.models.id AS model_id FROM pgml.models
    INNER JOIN pgml.projects
    ON pgml.models.project_id = pgml.projects.id
    WHERE pgml.projects.name = 'Diabetes Progression'
    LIMIT 1
) models
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- linear models
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'elastic_net');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lasso_least_angle');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'orthogonal_matching_pursuit');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bayesian_ridge');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'automatic_relevance_determination');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'stochastic_gradient_descent');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'passive_aggressive');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ransac');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'theil_sen', hyperparams => '{"max_iter": 10, "max_subpopulation": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'huber');
-- Quantile Regression too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');

-- support vector machines
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'svm', hyperparams => '{"max_iter": 100}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'nu_svm', hyperparams => '{"max_iter": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hyperparams => '{"max_iter": 100}');

-- ensembles
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'ada_boost', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'bagging', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'extra_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gradient_boosting_trees', hyperparams => '{"n_estimators": 5}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'random_forest', hyperparams => '{"n_estimators": 5}');

-- other
-- Kernel Ridge is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'kernel_ridge');
-- Gaussian Process is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'gaussian_process');

-- gradient boosting
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', hyperparams => '{"n_estimators": 10}');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost_random_forest', hyperparams => '{"n_estimators": 10}');
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'lightgbm', hyperparams => '{"n_estimators": 1}');
-- Histogram Gradient Boosting is too expensive for normal tests on even a toy dataset
-- SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boosting', hyperparams => '{"max_iter": 10}');

-- runtimes
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'python');
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear', runtime => 'rust');

--SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'python', hyperparams => '{"n_estimators": 1}'); -- too slow
SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'xgboost', runtime => 'rust', hyperparams => '{"n_estimators": 10}');

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models 
JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'mean_squared_error' DESC LIMIT 5;

-- deploy the random_forest model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent', 'random_forest');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- do a hyperparam search on your favorite algorithm
SELECT pgml.train(
    'Diabetes Progression', 
    algorithm => 'xgboost',
    hyperparams => '{"eval_metric": "rmse"}'::JSONB,
    search => 'grid', 
    search_params => '{
        "max_depth": [1, 2], 
        "n_estimators": [20, 40]
    }'
);

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score');
SELECT * FROM pgml.deploy('Diabetes Progression', 'most_recent');
SELECT * FROM pgml.deploy('Diabetes Progression', 'rollback');
SELECT * FROM pgml.deploy('Diabetes Progression', 'best_score', 'svm');

-- check out the improved predictions
SELECT target, pgml.predict('Diabetes Progression', (age, sex, bmi, bp, s1, s2, s3, s4, s5, s6)) AS prediction
FROM pgml.diabetes 
LIMIT 10;

Clustering

Full clustering example

-- This example trains models on the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for clustering. You could do something similar with a vector column
--

-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- create an unlabeled table of the images for unsupervised learning
CREATE VIEW pgml.digit_vectors AS
SELECT image FROM pgml.digits;

-- view the dataset
SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;

-- train a simple model to classify the data
SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');

-- check out the predictions
SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
FROM pgml.digits
LIMIT 10;

SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'affinity_propagation');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'birch', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'kmeans', hyperparams => '{"n_clusters": 10}');
SELECT * FROM pgml.train('Handwritten Digit Clusters', algorithm => 'mini_batch_kmeans', hyperparams => '{"n_clusters": 10}');

Dimensionality reduction

Full dimensionality reduction example

-- This example reduces the dimensionality of images in the sklearn digits dataset
-- which is a copy of the test set of the UCI ML hand-written digits datasets
-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
--
-- This demonstrates using a table with a single array feature column
-- for decomposition to reduce dimensionality.
--
-- Exit on error (psql)
-- \set ON_ERROR_STOP true
\timing on

SELECT pgml.load_dataset('digits');

-- view the dataset
SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;

-- create a view of just the vectors for decomposition, without any labels
CREATE VIEW digit_vectors AS
SELECT image FROM pgml.digits;

SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');

-- check out the decomposed vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

--
-- After a project has been trained, omitted parameters will be reused from previous training runs
-- In these examples we'll reuse the training data snapshots from the initial call.
--

-- We can reduce the image vectors from 64 dimensions to 3 components
SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');

-- check out the reduced vectors
SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
FROM pgml.digits
LIMIT 10;

-- check out all that hard work
SELECT trained_models.* FROM pgml.trained_models
                                 JOIN pgml.models on models.id = trained_models.id
ORDER BY models.metrics->>'cumulative_explained_variance' DESC LIMIT 5;

-- deploy the PCA model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
-- check out that throughput
SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;

-- deploy the "best" model for prediction use
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');

-- check out the improved predictions
SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
FROM pgml.digits
LIMIT 10;

Preprocessing

Full preprocessing example

-- load the diamonds dataset, that contains text categorical variables
SELECT pgml.load_dataset('jdxcosta/diamonds');

-- view the data
SELECT * FROM pgml."jdxcosta/diamonds" LIMIT 10;

-- drop the "Unnamed: 0" column, since it's not useful for training (you could create a view instead)
ALTER TABLE pgml."jdxcosta/diamonds" DROP COLUMN "Unnamed: 0";

-- train a model using preprocessors to scale the numeric variables, and target encode the categoricals
SELECT pgml.train(
       project_name => 'Diamond prices',
       task => 'regression',
       relation_name => 'pgml.jdxcosta/diamonds',
       y_column_name => 'price',
       algorithm => 'lightgbm',
       preprocess => '{
                      "carat": {"scale": "standard"},
                      "depth": {"scale": "standard"},
                      "table": {"scale": "standard"},
                      "cut": {"encode": "target", "scale": "standard"},
                      "color": {"encode": "target", "scale": "standard"},
                      "clarity": {"encode": "target", "scale": "standard"}
                  }'
);

-- run some predictions, notice we're passing a heterogeneous row (tuple) as input, rather than a homogeneous ARRAY[].
SELECT price, pgml.predict('Diamond prices', (carat, cut, color, clarity, depth, "table", x, y, z)) AS prediction
FROM pgml."jdxcosta/diamonds"
LIMIT 10;

-- This is a difficult dataset for most algorithms, which makes it a good challenge for preprocessing, and additional
-- feature engineering. What's next?