The Lindorm AI engine lets you import pre-trained AI models to analyze and process data in your database or to model time series data for time series analysis tasks.
Syntax
CREATE MODEL model_name
FROM {table_name | (select_statement) | model_file_path | huggingface_repo | modelscope_repo}
[ TARGET column_name ]
TASK { FEATURE_EXTRACTION | SEMANTIC_RETRIEVAL | TEXT_TO_IMAGE | ... }
ALGORITHM { BGE_LARGE_ZH | CHATGLM2_6B | CHINESE_STABLE_DIFFUSION | ... }
[ PREPROCESSORS 'string' ]
SETTINGS (
text_splitter 'off',
...
)
Parameter description
The parameters that you need to configure vary depending on the task type (TASK).
-
model_name: The model name.
-
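FROM: Specifies the source of the model. The source can be a table or a query statement in the database (the document data for retrieval tasks, or the training data for time series tasks), a model file uploaded to LDFS, or a model repository on Hugging Face or ModelScope. Supported options include the following: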
Option
Description
table_name
When TASK is a time series task (TIME_SERIES_FORECAST or TIME_SERIES_ANOMALY_DETECTION), this specifies the training data. The specified table or query result must contain at least two columns, one of which must be a time column.
select_statement
When TASK is a retrieval task (SEMANTIC_RETRIEVAL or RETRIEVAL_QA), this specifies the data from the document table used for retrieval.
model_file_path
The path to the model uploaded to LDFS, formatted as ldfs://model_file_path. Example: ldfs://x.x.x.x/CLIP-ViT-B-32-IMAGE.zip.
huggingface_repo
The model path on the Hugging Face platform, formatted as huggingface://repository_id. Example: huggingface://lllyasviel/ControlNet.
modelscope_repo
The model path on the ModelScope platform, formatted as modelscope://repository_id. Example: modelscope://damo/multi-modal_chinese_stable_diffusion_v1.0.
-
TARGET column_name: Specifies the target column for model analysis and processing. This parameter is required when TASK is one of the following:
-
Semantic retrieval (SEMANTIC_RETRIEVAL)
-
Retrieval-based QA (RETRIEVAL_QA)
-
Time series forecasting (TIME_SERIES_FORECAST)
-
Time series anomaly detection (TIME_SERIES_ANOMALY_DETECTION)
-
TASK: Specifies the model task type. Supported task types include the following:
Task type
Keyword
Description
Feature extraction
FEATURE_EXTRACTION
Uses an embedding model to extract feature vectors from data such as text or images.
Text-to-image
TEXT_TO_IMAGE
An AIGC task that generates images from text descriptions.
Semantic retrieval
SEMANTIC_RETRIEVAL
Retrieves semantically similar text from a specified data table based on a descriptive query.
Basic question answering
QUESTION_ANSWERING
Performs question answering using a large language model.
Retrieval-based QA
RETRIEVAL_QA
Builds a retrieval-augmented generation (RAG) application by combining a knowledge base from a specified data table with a large language model.
Time series forecasting
TIME_SERIES_FORECAST
A time series forecasting task.
Time series anomaly detection
TIME_SERIES_ANOMALY_DETECTION
A time series anomaly detection task.
-
ALGORITHM: Specifies the algorithm that the model uses. The supported algorithms are listed below:
Task type
Algorithm
Description
Feature extraction, semantic retrieval
TEXT2VEC_BASE_CHINESE
-
An embedding model that converts Chinese text into vectors.
-
Corresponding model path on the platform:
huggingface://shibing624/text2vec-base-chinese. For more information, see Hugging Face model.
BGE_LARGE_ZH
-
An embedding model trained by BAAI (Zhiyuan) that converts Chinese text into vectors.
-
Corresponding model path on the platform:
huggingface://BAAI/bge-large-zh-v1.5. For more information, see Hugging Face model.
M3E_BASE
-
An embedding model trained by MoKaAI that converts Chinese text into vectors.
-
Corresponding model path on the platform:
huggingface://moka-ai/m3e-base. For more information, see Hugging Face model.
GTE_LARGE_ZH
-
An embedding model trained by Alibaba DAMO Academy that converts Chinese text into vectors.
-
Corresponding model path on the platform:
huggingface://thenlper/gte-large-zh. For more information, see Hugging Face model.
BGE_M3
-
A multilingual embedding model trained by BAAI (Zhiyuan) that converts text into vectors.
-
Corresponding model path on the platform:
huggingface://BAAI/bge-m3. For more information, see Hugging Face model.
JINA_V2_BASE_ZH
-
An embedding model trained by Jina AI that supports both Chinese and English text-to-vector conversion.
-
Corresponding model path on the platform:
modelscope://jinaai/jina-embeddings-v2-base-zh. For more information, see ModelScope model.
Text-to-image
CHINESE_STABLE_DIFFUSION
-
A Chinese Stable Diffusion text-to-image model that returns 2D images matching the input description.
-
Corresponding model path on the platform:
modelscope://damo/multi-modal_chinese_stable_diffusion_v1.0. For more information, see ModelScope model.
Basic question answering, retrieval-based QA
CHATGLM3_6B
-
The third-generation version of ChatGLM-6B, a bilingual (Chinese and English) conversational language model.
-
Corresponding model path on the platform:
huggingface://THUDM/chatglm3-6b. For more information, see Hugging Face model.
CHATGLM2_6B_INT4
-
The second-generation INT4 quantized version of ChatGLM-6B, a bilingual (Chinese and English) conversational language model.
-
Corresponding model path on the platform:
huggingface://THUDM/chatglm2-6b-int4. For more information, see Hugging Face model.
CHATGLM2_6B
-
The second-generation version of ChatGLM-6B, a bilingual (Chinese and English) conversational language model.
-
Corresponding model path on the platform:
huggingface://THUDM/chatglm2-6b. For more information, see Hugging Face model.
QWEN_7B_CHAT_INT4
-
The 7-billion-parameter INT4 quantized version of Qwen, a large language model developed by Alibaba Cloud.
-
Corresponding model path on the platform:
modelscope://qwen/Qwen-7B-Chat-Int4. For more information, see ModelScope model.
QWEN_14B_CHAT_INT4
-
The 14-billion-parameter INT4 quantized version of Qwen, a large language model developed by Alibaba Cloud.
-
Corresponding model path on the platform:
modelscope://qwen/Qwen-14B-Chat-Int4. For more information, see ModelScope model.
Time series forecasting
DeepAR
DeepAR is an RNN-based deep neural network algorithm. For more information, see DeepAR paper.
TFT
TFT (Temporal Fusion Transformer) is a deep neural network algorithm based on the Transformer architecture. For more information, see TFT paper.
Time series anomaly detection
esd
A proprietary algorithm from Alibaba DAMO Academy designed for spike-type anomalies (such as spikes in monitoring curves). It delivers accurate results when the data contains a small number of significant outliers. For more information, see Time series anomaly detection.
nsigma
A proprietary algorithm from Alibaba DAMO Academy. It is based on a simple principle, which makes it straightforward to analyze the cause of a detected anomaly. For more information, see Time series anomaly detection.
ttest
A proprietary algorithm from Alibaba DAMO Academy designed to detect anomalies caused by mean shifts within a time window. For more information, see Time series anomaly detection.
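The name nsigma suggests the familiar n-sigma rule, so a generic illustration may help. The sketch below is not the DAMO Academy implementation, whose details are not described here. It only shows the textbook rule of flagging points that lie more than n standard deviations from the mean. The function name and the default threshold n = 3 are assumptions.

```python
from statistics import mean, pstdev

def nsigma_anomalies(values, n=3.0):
    """Return the indices of points farther than n standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # a constant series has no deviating points
    return [i for i, v in enumerate(values) if abs(v - mu) > n * sigma]

# Toy series: nineteen ordinary readings followed by one spike.
series = [10.0] * 19 + [100.0]
print(nsigma_anomalies(series))
```

Because the rule depends only on a mean and a standard deviation, it is easy to see why a given point was flagged.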
-
PREPROCESSORS 'string': Optional. This parameter applies only to time series tasks (TIME_SERIES_FORECAST or TIME_SERIES_ANOMALY_DETECTION). It specifies preprocessing operations for certain columns and is typically defined as a JSON-formatted string.
The PREPROCESSORS parameter includes a set of columns to process (Columns) and a list of preprocessing operations (Transformers). The Transformers form a pipeline and execute in the specified order. Each transformer includes an operation name (Name) and its parameters (Parameters). Example:
PREPROCESSORS '[
  {
    "Columns": ["c1"],
    "Transformers": [
      {"Name": "Imputer", "Parameters": {"value": 0}},
      {"Name": "StandardScaler"}
    ]
  },
  {
    "Columns": ["c2", "c3"],
    "Transformers": [
      {"Name": "OrdinalEncoder"}
    ]
  }
]'
Note: Both the Columns and Parameters fields are optional.
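Because the PREPROCESSORS value is a JSON string, it can be easier to generate it than to write it by hand. The following Python sketch builds the same structure as the example above. The helper name build_preprocessors and its input shape are hypothetical; only the Columns, Transformers, Name, and Parameters keys come from the description.

```python
import json

def build_preprocessors(steps):
    """Serialize preprocessing steps into the JSON text used after PREPROCESSORS.

    steps is a list of (columns, transformers) pairs. Each transformer is a
    (name, parameters) pair, and parameters may be None.
    """
    spec = []
    for columns, transformers in steps:
        entry = {}
        if columns:  # Columns is optional
            entry["Columns"] = list(columns)
        entry["Transformers"] = []
        for name, params in transformers:
            transformer = {"Name": name}
            if params:  # Parameters is optional
                transformer["Parameters"] = dict(params)
            entry["Transformers"].append(transformer)
        spec.append(entry)
    return json.dumps(spec)

clause = "PREPROCESSORS '" + build_preprocessors([
    (["c1"], [("Imputer", {"value": 0}), ("StandardScaler", None)]),
    (["c2", "c3"], [("OrdinalEncoder", None)]),
]) + "'"
print(clause)
```

The order of the transformers in each list is preserved, which matters because they run as a pipeline.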
Preprocessing operations that you specify during model training are automatically applied during model inference. Lindorm AI supports the following preprocessing operations:
Preprocessing operation
Parameters
Description
OneHotEncoder
None
Encodes categorical features using binary values. Suitable for categorical features without ordinal relationships.
OrdinalEncoder
None
Encodes categorical features as integers starting from 0. Suitable for categorical features with ordinal relationships.
Imputer
-
method: String. Valid values: dummy, mean, median, most_frequent, roll7, last. Default: dummy.
-
value: Integer. Optional. Default: 0.
Imputes missing values using various strategies.
StandardScaler
None
Transforms data to follow a standard normal distribution (mean = 0, standard deviation = 1), also known as z-score normalization.
MinMaxScaler
-
min: Integer.
-
max: Integer. Optional.
Scales data to the [min, max] range. Defaults to [0, 1].
LogTransformer
None
Transforms data using the logarithm function.
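To make the operations above concrete, here is a minimal pure-Python sketch of four of them on toy data. It illustrates the general techniques and is not the engine's implementation. Details such as using the population standard deviation, representing missing values as None, and assigning ordinal codes in sorted order are assumptions.

```python
from statistics import mean, pstdev

def impute_constant(xs, value=0):
    """Imputer with a fixed value: replace missing entries (None) with value."""
    return [value if x is None else x for x in xs]

def standard_scale(xs):
    """StandardScaler: subtract the mean, then divide by the standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)  # sigma must be non-zero
    return [(x - mu) / sigma for x in xs]

def min_max_scale(xs, lo=0, hi=1):
    """MinMaxScaler: linearly rescale values into the [lo, hi] range."""
    x_min, x_max = min(xs), max(xs)  # x_max must differ from x_min
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]

def ordinal_encode(xs):
    """OrdinalEncoder: map each category to an integer starting from 0."""
    codes = {c: i for i, c in enumerate(sorted(set(xs)))}  # sorted order is assumed
    return [codes[x] for x in xs]

# Transformers run as a pipeline in the listed order: Imputer, then StandardScaler.
c1 = standard_scale(impute_constant([2, None, 4]))
print(c1)
print(min_max_scale([10, 15, 20]))
print(ordinal_encode(["low", "high", "low"]))
```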
-
SETTINGS: Specifies additional parameters. The supported parameters vary by task type, as listed below:
-
Parameters for semantic retrieval and retrieval-based QA tasks
Parameter name
Parameter type
Description
Required
embedding_model
VARCHAR
This parameter serves two purposes:
-
For semantic retrieval tasks, specify an embedding model imported via the BYOM feature. Set this to the name of the imported embedding model.
-
For retrieval-based QA tasks, replace the default embedding model. Valid values include the following:
-
The name of an embedding model imported via the BYOM feature.
-
The platform model path of any embedding model supported for feature extraction tasks, such as huggingface://BAAI/bge-large-zh-v1.5.
No
hybrid_retrieval
VARCHAR
Enables hybrid vector + full-text retrieval (applies only to semantic retrieval tasks). When enabled, results from both retrieval paths are re-ranked using the RRF (Reciprocal Rank Fusion) algorithm. Valid values:
-
on: Default. Enabled.
-
off: Disabled.
No
text_splitter
VARCHAR
Enables document segmentation. Valid values:
-
on: Default. Enabled.
-
off: Disabled.
No
incremental_train
VARCHAR
Enables incremental processing. Valid values:
-
on: Enabled.
-
off: Default. Disabled.
Note: To use incremental processing, the following requirements must be met:
-
Data subscription is enabled. For instructions, see Enable data subscription.
-
Stream engine is enabled. For instructions, see Enable stream engine.
No
retrievel_algo
VARCHAR
The index algorithm for vector retrieval. Valid values:
-
HNSW: Default. Builds a vector index using the HNSW graph structure and performs queries with this algorithm. Suitable for large datasets.
-
FLAT: Does not build a separate index. Uses brute-force search for queries. Suitable for small datasets with fewer than 10,000 entries.
No
retrieval_distance_method
VARCHAR
The distance function for vector retrieval. Valid values:
-
IP: Default. Inner product.
-
COSINE: Cosine distance.
-
L2: Squared Euclidean distance.
No
retrieval_ef_construct
INTEGER
When using the HNSW vector retrieval algorithm, this sets the length of the dynamic candidate list during index construction. Default: 400. Range: [1, 1000]. Higher values increase ANN query accuracy but also increase performance overhead.
No
retrieval_maximum_degree
INTEGER
When using the HNSW vector retrieval algorithm, this sets the maximum number of outgoing edges per node in each layer. Default: 80. Range: [1, 100]. Higher values increase ANN query accuracy but also increase performance overhead.
No
retrieval_num_shards
INTEGER
The number of shards used for the vector index. During fused retrieval of vector and structured data, each index table shard first retrieves topK results based on vector similarity, then applies structured data filtering to the merged topK results. Default: 4.
No
text_analyzer
VARCHAR
This setting takes effect only when hybrid retrieval (hybrid_retrieval) is enabled. It specifies the tokenizer used for full-text retrieval. Valid values:
-
ik: Default.
-
standard
-
english
-
whitespace
-
comma
No
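The retrieval settings above name several techniques: brute-force FLAT search, the IP, COSINE, and L2 distance functions, and RRF re-ranking of vector and full-text results. The following Python sketch shows the textbook form of each on toy data. It is not Lindorm's implementation. The function names, the sample vectors, the full-text ranking, and the RRF constant k = 60 (a commonly used value) are all assumptions for illustration.

```python
import math

def score(method, q, v):
    """Return a value where smaller means closer, for the three documented methods."""
    dot = sum(a * b for a, b in zip(q, v))
    if method == "IP":
        return -dot  # a larger inner product means more similar, so negate it
    if method == "COSINE":
        norm_q = math.sqrt(sum(a * a for a in q))
        norm_v = math.sqrt(sum(b * b for b in v))
        return 1.0 - dot / (norm_q * norm_v)
    if method == "L2":
        return sum((a - b) ** 2 for a, b in zip(q, v))  # squared Euclidean
    raise ValueError(f"unknown distance method: {method}")

def flat_search(method, query, vectors, top_k):
    """Brute-force search: score every vector and keep the top_k closest ids."""
    ranked = sorted(vectors, key=lambda doc_id: score(method, query, vectors[doc_id]))
    return ranked[:top_k]

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to a document's score."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(fused, key=lambda d: (-fused[d], d))

vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
vector_hits = flat_search("L2", [1.0, 0.1], vectors, top_k=3)
text_hits = ["c", "b"]  # hypothetical full-text ranking
print(vector_hits)
print(rrf_fuse([vector_hits, text_hits]))
```

A document that ranks reasonably well in both lists can overtake one that ranks first in only one of them, which is the point of fusing the two retrieval paths.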
-
Parameters for time series forecasting tasks
Parameter name
Parameter type
Description
Required
epochs
INTEGER
Number of training epochs. Applies only to time series forecasting. Default: 80.
Yes
time_column
VARCHAR
The time column.
Yes
group_columns
VARCHAR
The grouping columns (TAG columns) that identify each time series.
Yes
freq
VARCHAR
The frequency of the time series data. Example: '1D'.
Yes
prediction_length
INTEGER
The prediction length, that is, the number of time steps to forecast.
Yes
feat_static_columns
VARCHAR
Static feature columns (TAGs), separated by commas.
No
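When statements are generated by a program, it can help to check the required settings before sending the statement. The following Python sketch renders a SETTINGS clause from a dictionary and rejects input that omits any parameter marked as required in the table above. The helper name render_settings is hypothetical, and refusing values that contain a quote character is a conservative assumption, because quoting rules are not described here.

```python
REQUIRED_FORECAST_SETTINGS = (
    "time_column", "group_columns", "freq", "prediction_length", "epochs",
)

def render_settings(settings):
    """Render a dict as a SETTINGS ( key 'value', ... ) clause."""
    missing = [k for k in REQUIRED_FORECAST_SETTINGS if k not in settings]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    parts = []
    for key, value in settings.items():
        text = str(value)
        if "'" in text:
            raise ValueError(f"quote character not supported in {key}")
        parts.append(f"  {key} '{text}'")
    return "SETTINGS (\n" + ",\n".join(parts) + "\n)"

print(render_settings({
    "time_column": "time",
    "group_columns": "id_code",
    "freq": "1D",
    "prediction_length": 6,
    "epochs": 5,
}))
```

Optional settings such as feat_static_columns can be added to the dictionary and are rendered in insertion order.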
Examples
Example 1: Text-to-image
CREATE MODEL t2i_model
FROM 'modelscope://damo/multi-modal_chinese_stable_diffusion_v1.0'
TASK TEXT_TO_IMAGE
ALGORITHM CHINESE_STABLE_DIFFUSION;
Example 2: Feature extraction
CREATE MODEL fe_model
FROM 'huggingface://BAAI/bge-large-zh-v1.5'
TASK FEATURE_EXTRACTION
ALGORITHM BGE_LARGE_ZH;
Example 3: Semantic retrieval
CREATE MODEL sr_model
FROM doc_table
TARGET doc_field
TASK SEMANTIC_RETRIEVAL
ALGORITHM BGE_LARGE_ZH;
Example 4: Basic question answering
CREATE MODEL qa_model
FROM 'huggingface://THUDM/chatglm2-6b-int4'
TASK QUESTION_ANSWERING
ALGORITHM CHATGLM2_6B_INT4;
Example 5: Retrieval-based QA
CREATE MODEL rqa_model
FROM doc_table
TARGET doc_field
TASK RETRIEVAL_QA
ALGORITHM CHATGLM2_6B_INT4;
Example 6: Time series forecasting
CREATE MODEL tft_model
FROM (SELECT * FROM fresh_sales WHERE `time` > '2021-02-08T00:00:00+08:00')
TARGET sales
TASK time_series_forecast
ALGORITHM tft
SETTINGS
(
time_column 'time',
group_columns 'id_code',
feat_static_columns 'cate1_id,cate2_id,brand_id',
context_length '28',
prediction_length '6',
epochs '5',
freq '1D'
);
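To show how freq and prediction_length fit together in Example 6, the following Python sketch lists the timestamps a forecast would cover. It assumes that a forecast starts one step after the last observed timestamp and advances by freq for prediction_length steps, which is the usual convention but is not stated here. The function name, the supported unit letters, and the sample date are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical subset of frequency units; only '1D' appears in the text above.
UNIT_TO_TIMEDELTA_ARG = {"D": "days", "H": "hours"}

def forecast_timestamps(last_observed, freq, prediction_length):
    """List the timestamps a forecast would cover under the stated assumptions."""
    count = int(freq[:-1] or 1)  # '1D' -> 1, 'D' -> 1
    unit = UNIT_TO_TIMEDELTA_ARG[freq[-1]]
    step = timedelta(**{unit: count})
    return [last_observed + i * step for i in range(1, prediction_length + 1)]

# With freq '1D' and prediction_length 6, the horizon is the next six days.
horizon = forecast_timestamps(datetime(2021, 3, 1), "1D", 6)
print(horizon[0], "to", horizon[-1])
```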