Model Creation


The Lindorm AI engine lets you import pre-trained AI models to analyze and process data in your database or to model time series data for time series analysis tasks.

Syntax

CREATE MODEL model_name 
FROM {table_name | (select_statement) | model_file_path | huggingface_repo | modelscope_repo}
[ TARGET column_name ]
TASK { FEATURE_EXTRACTION | SEMANTIC_RETRIEVAL | TEXT_TO_IMAGE | ... }
ALGORITHM { BGE_LARGE_ZH | CHATGLM2_6B | CHINESE_STABLE_DIFFUSION | ... }
[ PREPROCESSORS 'string' ]
SETTINGS (
    text_splitter 'off',
    ...
)

Parameter description

The parameters that you need to configure vary depending on the task type (TASK).

  • model_name: The model name.

  • FROM: Specifies the model source. The source can be a table in the database (for retrieval tasks), a query statement that returns training data (for time series tasks), or the path to a pre-trained model. Supported options:

    | Option | Description |
    |---|---|
    | table_name | A table in the database. When TASK is a retrieval task (SEMANTIC_RETRIEVAL or RETRIEVAL_QA), this is the document table whose data is used for retrieval. |
    | select_statement | A query statement enclosed in parentheses. When TASK is a time series task (TIME_SERIES_FORECAST or TIME_SERIES_ANOMALY_DETECTION), the result is the training data. The specified table or query result must contain at least two columns, one of which must be a time column. |
    | model_file_path | The path to the model uploaded to LDFS, formatted as ldfs://model_file_path. Example: ldfs://x.x.x.x/CLIP-ViT-B-32-IMAGE.zip. |
    | huggingface_repo | The model path on the Hugging Face platform, formatted as huggingface://repository_id. Example: huggingface://lllyasviel/ControlNet. |
    | modelscope_repo | The model path on the ModelScope platform, formatted as modelscope://repository_id. Example: modelscope://damo/multi-modal_chinese_stable_diffusion_v1.0. |
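The source formats above can be told apart by their prefix. The following is a minimal Python sketch of that distinction; `classify_model_source` is a hypothetical helper written for illustration and is not part of Lindorm.

```python
def classify_model_source(source: str) -> str:
    """Return the FROM option that a source string appears to match.

    Hypothetical helper: it only inspects the string and does not
    contact Lindorm, LDFS, Hugging Face, or ModelScope.
    """
    prefixes = {
        "ldfs://": "model_file_path",
        "huggingface://": "huggingface_repo",
        "modelscope://": "modelscope_repo",
    }
    for prefix, option in prefixes.items():
        if source.startswith(prefix):
            return option
    stripped = source.strip()
    # A parenthesized string is treated as a query statement.
    if stripped.startswith("(") and stripped.endswith(")"):
        return "select_statement"
    # Anything else is assumed to be a table name.
    return "table_name"


print(classify_model_source("huggingface://lllyasviel/ControlNet"))
print(classify_model_source("(SELECT * FROM fresh_sales)"))
```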

  • TARGET column_name: Specifies the target column for model analysis and processing. This parameter is required when TASK is one of the following:

    • Semantic retrieval (SEMANTIC_RETRIEVAL)

    • Retrieval-based QA (RETRIEVAL_QA)

    • Time series forecasting (TIME_SERIES_FORECAST)

    • Time series anomaly detection (TIME_SERIES_ANOMALY_DETECTION)
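If you generate CREATE MODEL statements from application code, the TARGET rule above can be checked before the statement is sent. This is a minimal sketch in Python; `check_target` is a hypothetical client-side helper, and the case-insensitive comparison is an assumption based on the lowercase task keyword used in Example 6.

```python
from typing import Optional

# Task types for which the documentation marks TARGET as required.
TASKS_REQUIRING_TARGET = {
    "SEMANTIC_RETRIEVAL",
    "RETRIEVAL_QA",
    "TIME_SERIES_FORECAST",
    "TIME_SERIES_ANOMALY_DETECTION",
}


def check_target(task: str, target: Optional[str]) -> None:
    """Raise ValueError if the task needs a TARGET column and none is given."""
    if task.upper() in TASKS_REQUIRING_TARGET and not target:
        raise ValueError(f"TASK {task} requires a TARGET column")


check_target("TEXT_TO_IMAGE", None)           # passes: TARGET not required
check_target("time_series_forecast", "sales")  # passes: TARGET supplied
```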

  • TASK: Specifies the model task type. Supported task types include the following:

    | Task type | Keyword | Description |
    |---|---|---|
    | Feature extraction | FEATURE_EXTRACTION | Uses an embedding model to extract feature vectors from data such as text or images. |
    | Text-to-image | TEXT_TO_IMAGE | An AIGC task that generates images from text descriptions. |
    | Semantic retrieval | SEMANTIC_RETRIEVAL | Retrieves semantically similar text from a specified data table based on a descriptive query. |
    | Basic question answering | QUESTION_ANSWERING | Performs question answering using a large language model. |
    | Retrieval-based QA | RETRIEVAL_QA | Builds a retrieval-augmented generation (RAG) application by combining a knowledge base from a specified data table with a large language model. |
    | Time series forecasting | TIME_SERIES_FORECAST | A time series forecasting task. |
    | Time series anomaly detection | TIME_SERIES_ANOMALY_DETECTION | A time series anomaly detection task. |

  • ALGORITHM: Specifies the algorithm that the model uses. The supported algorithms are listed below:

    | Task type | Algorithm | Description |
    |---|---|---|
    | Feature extraction, semantic retrieval | TEXT2VEC_BASE_CHINESE | An embedding model that converts Chinese text into vectors. Model path: huggingface://shibing624/text2vec-base-chinese. For more information, see the Hugging Face model page. |
    | Feature extraction, semantic retrieval | BGE_LARGE_ZH | An embedding model trained by BAAI (Zhiyuan) that converts Chinese text into vectors. Model path: huggingface://BAAI/bge-large-zh-v1.5. For more information, see the Hugging Face model page. |
    | Feature extraction, semantic retrieval | M3E_BASE | An embedding model trained by MoKaAI that converts Chinese text into vectors. Model path: huggingface://moka-ai/m3e-base. For more information, see the Hugging Face model page. |
    | Feature extraction, semantic retrieval | GTE_LARGE_ZH | An embedding model trained by Alibaba DAMO Academy that converts Chinese text into vectors. Model path: huggingface://thenlper/gte-large-zh. For more information, see the Hugging Face model page. |
    | Feature extraction, semantic retrieval | BGE_M3 | A multilingual embedding model trained by BAAI (Zhiyuan) that converts text into vectors. Model path: huggingface://BAAI/bge-m3. For more information, see the Hugging Face model page. |
    | Feature extraction, semantic retrieval | JINA_V2_BASE_ZH | An embedding model trained by Jina AI that converts both Chinese and English text into vectors. Model path: modelscope://jinaai/jina-embeddings-v2-base-zh. For more information, see the ModelScope model page. |
    | Text-to-image | CHINESE_STABLE_DIFFUSION | A Chinese Stable Diffusion text-to-image model that returns 2D images matching the input description. Model path: modelscope://damo/multi-modal_chinese_stable_diffusion_v1.0. For more information, see the ModelScope model page. |
    | Basic question answering, retrieval-based QA | CHATGLM3_6B | The third-generation version of ChatGLM-6B, a bilingual (Chinese and English) conversational language model. Model path: huggingface://THUDM/chatglm3-6b. For more information, see the Hugging Face model page. |
    | Basic question answering, retrieval-based QA | CHATGLM2_6B_INT4 | The second-generation INT4 quantized version of ChatGLM-6B, a bilingual (Chinese and English) conversational language model. Model path: huggingface://THUDM/chatglm2-6b-int4. For more information, see the Hugging Face model page. |
    | Basic question answering, retrieval-based QA | CHATGLM2_6B | The second-generation version of ChatGLM-6B, a bilingual (Chinese and English) conversational language model. Model path: huggingface://THUDM/chatglm2-6b. For more information, see the Hugging Face model page. |
    | Basic question answering, retrieval-based QA | QWEN_7B_CHAT_INT4 | The 7-billion-parameter INT4 quantized version of Qwen, a large language model developed by Alibaba Cloud. Model path: modelscope://qwen/Qwen-7B-Chat-Int4. For more information, see the ModelScope model page. |
    | Basic question answering, retrieval-based QA | QWEN_14B_CHAT_INT4 | The 14-billion-parameter INT4 quantized version of Qwen, a large language model developed by Alibaba Cloud. Model path: modelscope://qwen/Qwen-14B-Chat-Int4. For more information, see the ModelScope model page. |
    | Time series forecasting | DeepAR | An RNN-based deep neural network algorithm. For more information, see the DeepAR paper. |
    | Time series forecasting | TFT | TFT (Temporal Fusion Transformer) is a deep neural network algorithm based on the Transformer architecture. For more information, see the TFT paper. |
    | Time series anomaly detection | esd | A proprietary algorithm from Alibaba DAMO Academy designed for spike-type anomalies (such as spikes in monitoring curves). It delivers accurate results when the data contains a small number of significant outliers. For more information, see Time series anomaly detection. |
    | Time series anomaly detection | nsigma | A proprietary algorithm from Alibaba DAMO Academy with a simple principle that makes anomaly root cause analysis straightforward. For more information, see Time series anomaly detection. |
    | Time series anomaly detection | ttest | A proprietary algorithm from Alibaba DAMO Academy designed to detect anomalies caused by mean shifts within a time window. For more information, see Time series anomaly detection. |

  • PREPROCESSORS 'string': Optional. This parameter applies only to time series tasks (TIME_SERIES_FORECAST or TIME_SERIES_ANOMALY_DETECTION). It specifies preprocessing operations for certain columns and is typically defined as a JSON-formatted string.

    The PREPROCESSORS parameter includes a set of columns to process (Columns) and a list of preprocessing operations (Transformers). The Transformers form a pipeline and execute in the specified order. Each transformer includes an operation name (Name) and its parameters (Parameters). Example:

    PREPROCESSORS '[
    {
      "Columns":[
        "c1"
      ],
      "Transformers":[
        {
          "Name": "Imputer",
          "Parameters": {"value": 0}
        },
        {
          "Name": "StandardScaler"
        }
      ]
    },
    {
      "Columns":[
        "c2",
        "c3"
      ],
      "Transformers":[
        {
          "Name": "OrdinalEncoder"
        }
      ]
    }
    ]'
    Note

    Both the Columns and Parameters fields are optional.
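Writing the JSON by hand is error-prone, so the PREPROCESSORS clause can also be assembled in application code. The following Python sketch rebuilds the example above with the standard `json` module; `build_preprocessors` is a hypothetical helper, and it assumes that no value contains a single quote, because the result is wrapped in a single-quoted SQL string.

```python
import json


def build_preprocessors(steps):
    """Build a PREPROCESSORS clause.

    steps: list of (columns, transformers) pairs, where transformers is
    a list of (name, parameters_or_None) pairs in execution order.
    """
    spec = []
    for columns, transformers in steps:
        entry = {}
        if columns:  # Columns is optional
            entry["Columns"] = list(columns)
        entry["Transformers"] = []
        for name, params in transformers:
            transformer = {"Name": name}
            if params:  # Parameters is optional
                transformer["Parameters"] = params
            entry["Transformers"].append(transformer)
        spec.append(entry)
    return "PREPROCESSORS '" + json.dumps(spec) + "'"


clause = build_preprocessors([
    (["c1"], [("Imputer", {"value": 0}), ("StandardScaler", None)]),
    (["c2", "c3"], [("OrdinalEncoder", None)]),
])
print(clause)
```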

    Preprocessing operations that you specify during model training are automatically applied during model inference. Lindorm AI supports the following preprocessing operations:

    | Preprocessing operation | Parameters | Description |
    |---|---|---|
    | OneHotEncoder | None | Encodes categorical features using binary values. Suitable for categorical features without ordinal relationships. |
    | OrdinalEncoder | None | Encodes categorical features as integers starting from 0. Suitable for categorical features with ordinal relationships. |
    | Imputer | method: String. Valid values: dummy, mean, median, most_frequent, roll7, last. Default: dummy. value: Integer. Optional. Default: 0. | Imputes missing values using various strategies. |
    | StandardScaler | None | Transforms data to follow a standard normal distribution (mean = 0, standard deviation = 1), also known as z-score normalization. |
    | MinMaxScaler | min: Integer. max: Integer. Optional. | Scales data to the [min, max] range. Defaults to [0, 1]. |
    | LogTransformer | None | Transforms data using the logarithm function. |
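To make the numeric operations concrete, here is a minimal Python sketch of what an Imputer followed by a StandardScaler computes, plus a MinMaxScaler. It illustrates the standard definitions only and is not Lindorm's implementation. It assumes that the dummy method fills missing values with the constant `value`, and that the scaler uses the population standard deviation; the function names are hypothetical.

```python
from statistics import mean, pstdev


def impute(values, value=0):
    """Replace missing entries (None) with a constant."""
    return [value if v is None else v for v in values]


def standard_scale(values):
    """Z-score normalization: subtract the mean, divide by the std dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, lo=0, hi=1):
    """Linearly rescale values into the [lo, hi] range."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [float(lo) for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


# Transformers run in the order listed: impute first, then scale.
raw = [1.0, None, 3.0, 4.0]
scaled = standard_scale(impute(raw, value=0))
print(scaled)
print(min_max_scale([2, 4, 6]))
```

The order matters: imputing first means the filled-in value takes part in the mean and standard deviation that the scaler uses.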

  • SETTINGS: Specifies additional parameters. The supported parameters vary by task type, as listed below:

    • Parameters for semantic retrieval and retrieval-based QA tasks

      | Parameter name | Parameter type | Description | Required |
      |---|---|---|---|
      | embedding_model | VARCHAR | This parameter serves two purposes. For semantic retrieval tasks, it specifies an embedding model imported via the BYOM feature: set it to the name of the imported embedding model. For retrieval-based QA tasks, it replaces the default embedding model. Valid values: the name of an embedding model imported via the BYOM feature, or the platform model path of any embedding model supported by feature extraction tasks, such as huggingface://BAAI/bge-large-zh-v1.5. | No |
      | hybrid_retrieval | VARCHAR | Enables hybrid vector + full-text retrieval (applies only to semantic retrieval tasks). When enabled, results from both retrieval paths are re-ranked using the RRF (Reciprocal Rank Fusion) algorithm. Valid values: on (default, enabled), off (disabled). | No |
      | text_splitter | VARCHAR | Enables document segmentation. Valid values: on (default, enabled), off (disabled). | No |
      | incremental_train | VARCHAR | Enables incremental processing. Valid values: on (enabled), off (default, disabled). Note: To use incremental processing, you must meet the following requirements: | No |
      | retrievel_algo | VARCHAR | The index algorithm for vector retrieval. Valid values: HNSW (default), which builds a vector index using the HNSW graph structure and performs queries with this algorithm, suitable for large datasets; FLAT, which does not build a separate index and uses brute-force search for queries, suitable for small datasets with fewer than 10,000 entries. | No |
      | retrieval_distance_method | VARCHAR | The distance function for vector retrieval. Valid values: IP (default, inner product), COSINE (cosine distance), L2 (squared Euclidean distance). | No |
      | retrieval_ef_construct | INTEGER | When using the HNSW vector retrieval algorithm, this sets the length of the dynamic candidate list during index construction. Default: 400. Range: [1, 1000]. Higher values increase ANN query accuracy but also increase performance overhead. | No |
      | retrieval_maximum_degree | INTEGER | When using the HNSW vector retrieval algorithm, this sets the maximum number of outgoing edges per node in each layer. Default: 80. Range: [1, 100]. Higher values increase ANN query accuracy but also increase performance overhead. | No |
      | retrieval_num_shards | INTEGER | The number of shards used for the vector index. During fused retrieval of vector and structured data, each index table shard first retrieves topK results based on vector similarity, then applies structured data filtering to the merged topK results. Default: 4. | No |
      | text_analyzer | VARCHAR | Takes effect only when hybrid retrieval (hybrid_retrieval) is enabled. Specifies the tokenizer used for full-text retrieval. Valid values: ik (default), standard, english, whitespace, comma. | No |
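For readers unfamiliar with the RRF re-ranking that hybrid_retrieval applies, the following Python sketch shows the standard Reciprocal Rank Fusion formula, in which each document scores the sum of 1 / (k + rank) over the result lists that contain it. It illustrates the general technique, not Lindorm's internal code: the constant k = 60 is the value commonly used in the literature and is an assumption here, and `rrf_fuse` and the document IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered from best to worst match.
    Returns document IDs ordered by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]    # hypothetical vector retrieval results
fulltext_hits = ["d3", "d1", "d4"]  # hypothetical full-text retrieval results
print(rrf_fuse([vector_hits, fulltext_hits]))
```

Documents that appear near the top of both lists (d1 and d3 here) rise above documents that appear in only one list.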

    • Parameters for time series forecasting tasks

      | Parameter name | Parameter type | Description | Required |
      |---|---|---|---|
      | epochs | INTEGER | Number of training epochs. Applies only to time series forecasting. Default: 80. | Yes |
      | time_column | VARCHAR | The time column. | Yes |
      | group_columns | VARCHAR | Grouping columns (TAG columns). Each distinct combination of values identifies one time series. | Yes |
      | freq | VARCHAR | The frequency of the time series data. Example: '1D'. | Yes |
      | prediction_length | INTEGER | The prediction length, that is, the number of time steps to forecast. | Yes |
      | feat_static_columns | VARCHAR | Static feature columns (TAGs), separated by commas. | No |
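The freq and prediction_length settings together determine which future timestamps a forecast covers. The Python sketch below shows that relationship under two assumptions that the text above does not confirm: that prediction_length counts future steps spaced by freq, and that freq strings have the form of a number followed by D (days) or H (hours). `forecast_timestamps` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Assumed unit letters; only the '1D' form appears in this document.
UNIT_TO_DELTA = {"D": timedelta(days=1), "H": timedelta(hours=1)}


def forecast_timestamps(last_observed, freq, prediction_length):
    """List the timestamps a forecast would cover after last_observed."""
    count, unit = int(freq[:-1]), freq[-1]
    step = count * UNIT_TO_DELTA[unit]
    return [last_observed + step * i for i in range(1, prediction_length + 1)]


# With freq '1D' and prediction_length 6, the forecast spans six days.
for ts in forecast_timestamps(datetime(2021, 2, 8), "1D", 6):
    print(ts.date())
```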

Examples

Example 1: Text-to-image

CREATE MODEL t2i_model
FROM 'modelscope://damo/multi-modal_chinese_stable_diffusion_v1.0'
TASK TEXT_TO_IMAGE
ALGORITHM CHINESE_STABLE_DIFFUSION;

Example 2: Feature extraction

CREATE MODEL fe_model
FROM 'huggingface://BAAI/bge-large-zh-v1.5'
TASK FEATURE_EXTRACTION
ALGORITHM BGE_LARGE_ZH;

Example 3: Semantic retrieval

CREATE MODEL sr_model
FROM doc_table
TARGET doc_field
TASK SEMANTIC_RETRIEVAL
ALGORITHM BGE_LARGE_ZH;

Example 4: Basic question answering

CREATE MODEL qa_model
FROM 'huggingface://THUDM/chatglm2-6b-int4'
TASK QUESTION_ANSWERING
ALGORITHM CHATGLM2_6B_INT4;

Example 5: Retrieval-based QA

CREATE MODEL rqa_model
FROM doc_table
TARGET doc_field
TASK RETRIEVAL_QA
ALGORITHM CHATGLM2_6B_INT4;

Example 6: Time series forecasting

CREATE MODEL tft_model
FROM (SELECT * FROM fresh_sales WHERE `time` > '2021-02-08T00:00:00+08:00')
TARGET sales
TASK time_series_forecast
ALGORITHM tft
SETTINGS
(
  time_column 'time',
  group_columns 'id_code',
  feat_static_columns 'cate1_id,cate2_id,brand_id',
  context_length '28',
  prediction_length '6',
  epochs '5',
  freq '1D'
);
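When statements like these are generated by an application, they can be assembled from their parts in the clause order shown in the Syntax section. The following Python sketch only builds the statement text; it does not connect to Lindorm or execute anything. `build_create_model` is a hypothetical helper, and because it performs no escaping it should only be used with trusted values.

```python
def build_create_model(name, source, task, algorithm, target=None, settings=None):
    """Assemble a CREATE MODEL statement as a string.

    source is inserted verbatim, so quote model paths and parenthesize
    query statements before passing them in.
    """
    lines = [f"CREATE MODEL {name}", f"FROM {source}"]
    if target:
        lines.append(f"TARGET {target}")
    lines.append(f"TASK {task}")
    lines.append(f"ALGORITHM {algorithm}")
    if settings:
        body = ",\n".join(f"  {key} '{value}'" for key, value in settings.items())
        lines.append("SETTINGS (\n" + body + "\n)")
    return "\n".join(lines) + ";"


print(build_create_model(
    name="tft_model",
    source="(SELECT * FROM fresh_sales)",
    task="TIME_SERIES_FORECAST",
    algorithm="TFT",
    target="sales",
    settings={"time_column": "time", "freq": "1D", "prediction_length": 6},
))
```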