To use the vector search feature, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare your input data.
Prerequisites
Make sure you have completed the environment preparation.
Download the Proxima CE package
Click Proxima CE package to download the installation package.
The installation package contains the executable JAR file for Proxima CE. Upload this file to your MaxCompute project as a resource. You can then call the JAR file to run Proxima CE tasks.
Upload as a MaxCompute resource
You can upload the downloaded package to your MaxCompute project by using either the MaxCompute client (odpscmd) or DataWorks. This topic uses DataWorks as an example to show how to upload and publish the resource. For information about how to upload a resource by using odpscmd, see Add resources.
- On the Data Development page in DataWorks, upload the installation package as a JAR resource in the visual interface.
  Note: When you create or upload a resource in the DataWorks console, keep the following in mind:
  - If the resource has not been uploaded to MaxCompute, select Upload to MaxCompute. If the resource has already been uploaded to MaxCompute, deselect Upload to MaxCompute. Otherwise, the upload fails.
  - If you select Upload to MaxCompute during the upload, the resource is stored in both DataWorks and MaxCompute. If you later delete the resource from MaxCompute by using a command, the resource still exists in DataWorks and remains visible.
  - The resource name does not need to match the name of the uploaded file.
  In the Create Resource dialog box, set File Source to Local, and click Click to upload to upload the package file. If the resource type is JAR, you must add the .jar suffix to the resource name.
- Commit and publish the resource.
  After you create the resource, click the commit icon in the toolbar of the resource editor to commit the resource to the scheduling development server.
  Note: If a production task needs to use this resource, you must also deploy the resource to the production environment. For more information, see Deploy tasks.
Prepare input tables
Before running the task, you must prepare two input tables:
- doc table: The table that contains the base vectors to search.
- query table: The table that contains the query vectors used to find the nearest neighbors.
CREATE TABLE statements
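-- Create a doc table.
-- For multi-category search, also add a category column: (pk STRING, vector STRING, category BIGINT).
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
-- Create a query table.
-- For multi-category search, also add a category column: (pk STRING, vector STRING, category BIGINT).
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);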
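If you create these tables from a script, the optional category column can be switched on with a flag. The following Python sketch only builds the DDL text and does not run it. The helper name `build_create_table_sql` and its parameters are hypothetical and are not part of Proxima CE.

```python
def build_create_table_sql(table_name: str, with_category: bool = False) -> str:
    """Build the CREATE TABLE statement for a doc or query table.

    Hypothetical helper for illustration. The columns follow the fixed
    fields described in this topic: pk, vector, optional category, and pt.
    """
    columns = ["pk STRING", "vector STRING"]
    if with_category:
        # Only multi-category search needs the category column.
        columns.append("category BIGINT")
    column_list = ", ".join(columns)
    return f"CREATE TABLE {table_name}({column_list}) PARTITIONED BY (pt STRING);"


# Table names are taken from the statements above.
doc_sql = build_create_table_sql("doc_table_float_smoke")
query_sql = build_create_table_sql("query_table_float_smoke", with_category=True)
print(doc_sql)
print(query_sql)
```

The generated text can then be submitted through whichever MaxCompute SQL entry point you already use.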
Input table format
- Table names
  - Input table names cannot contain the string tmp_. Otherwise, the task fails.
  - An input table name and its partition value cannot exceed 64 characters in length. Otherwise, the task fails.
- Fields
  Note: Both input tables must contain the following fixed fields, and the field names must match exactly.
  | Field | Description | Type |
  | --- | --- | --- |
  | pk | The primary key (pk) of the record. | STRING by default. The pk value can be a number or a string, such as 1.nid, 2.nid, 3.nid, ... or an INT64 number like 123, 456, 789, .... If all values in the pk column are INT64 numbers, you can specify the column data type as BIGINT. You can also set the -pk_type startup parameter to INT64 to improve performance. |
  | vector | The vector field. | STRING by default. |
  | category | The category field for multi-category search. This field is required only for multi-category search. | BIGINT by default. |
  | pt | The partition field. | STRING by default. |
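- Input table examples
  - doc table
    | pk | vector | pt |
    | --- | --- | --- |
    | id1 | 0~1~1~5 | 20190322 |
    | id2 | 0~1~1~2 | 20190322 |
    | id3 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |
  - query table
    | pk | vector | pt |
    | --- | --- | --- |
    | id8 | 0~1~1~5 | 20190322 |
    | id9 | 0~1~1~2 | 20190322 |
    | id10 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |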
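Before you load data, a quick local check of a few sample rows can catch format problems early. The Python sketch below parses vector strings in the format shown in the examples and reports whether every pk value is an INT64 number, which is the case where BIGINT and the -pk_type parameter apply. It assumes that `~` is the element separator (as in the example rows) and that all vectors in a table share one dimension. The function names are hypothetical.

```python
import re

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def parse_vector(text: str, separator: str = "~") -> list:
    """Split a vector string such as '0~1~1~5' into a list of floats."""
    return [float(part) for part in text.split(separator)]


def is_int64(value: str) -> bool:
    """Return True if the string is a plain integer that fits in INT64."""
    if re.fullmatch(r"-?[0-9]+", value) is None:
        return False
    return INT64_MIN <= int(value) <= INT64_MAX


def summarize_rows(rows: list) -> dict:
    """Check (pk, vector, pt) rows and report the dimension and a pk type hint."""
    dimensions = {len(parse_vector(vector)) for _pk, vector, _pt in rows}
    if len(dimensions) != 1:
        # Assumption: a single table should not mix vector dimensions.
        raise ValueError(f"inconsistent vector dimensions: {sorted(dimensions)}")
    all_int64 = all(is_int64(pk) for pk, _vector, _pt in rows)
    return {
        "dimension": dimensions.pop(),
        "pk_type_hint": "INT64" if all_int64 else "STRING",
    }


# Rows copied from the doc table example above.
doc_rows = [
    ("id1", "0~1~1~5", "20190322"),
    ("id2", "0~1~1~2", "20190322"),
    ("id3", "3~2~1~1", "20190322"),
]
summary = summarize_rows(doc_rows)
print(summary)
```

For the sample rows, the pk values are strings such as id1, so the hint stays at STRING. A table whose pk values are all numbers like 123 would yield the INT64 hint instead.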
Next steps: Use vector search
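| Search scenario | Key features | Guide |
| --- | --- | --- |
| Basic vector search | Supports top-K queries at the million scale. | |
| Multi-category search | Supports multi-category scenarios, including scenarios where both the query and the doc belong to multiple categories and scenarios where a single query belongs to multiple categories. | |
| Cluster sharding | Supports the cluster sharding method of index building, which reduces computation and speeds up subsequent index queries. | |
| Inner product and cosine distance | Supports search by inner product and cosine distance. | |
| Quantizer usage | Supports quantizers. Configuring a quantizer generally improves performance and reduces index size, at the cost of some recall depending on the case. | |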
After a vector search task runs, an output table is automatically generated and stored in MaxCompute. You do not need to create this table. Specify the table name by using the -output_table parameter in your Proxima CE code. For more information about the output table format, see Output table format.
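Because the task fails on an invalid output table name, it can help to check the name before you pass it to -output_table. The Python sketch below encodes the naming rules listed under Output table format. The function name is hypothetical, and applying the 64-character limit to the table name on its own is an assumption.

```python
def check_output_table_name(name: str) -> list:
    """Return the naming rules that the output table name violates."""
    problems = []
    if "." in name:
        # MaxCompute treats the period as a special character.
        problems.append("contains a period")
    if "tmp_" in name:
        problems.append("contains 'tmp_'")
    if len(name) > 64:
        # Assumption: the limit is checked against the table name alone.
        problems.append("longer than 64 characters")
    return problems


# Both names below are made up for illustration.
good = check_output_table_name("output_table_float_smoke")
bad = check_output_table_name("my_project.tmp_result")
print(good)
print(bad)
```

An empty list means that the name passes all three checks.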
Output table format
The output table has the following format.
- Table name: The name of the output table that you specified in the Proxima CE code.
  - The output table name cannot contain a period (.). The period is a special character in MaxCompute, and MaxCompute fails to parse a table name that contains one.
  - The output table name cannot contain the string tmp_. Otherwise, the task fails.
  - An output table name and its partition name cannot exceed 64 characters in length. Otherwise, the task fails.
- Fields
  | Field | Description | Type |
  | --- | --- | --- |
  | pk | The pk value for each query in the query table. | STRING by default. Values in the pk column can be numbers or strings, such as the string values 1.nid, 2.nid, 3.nid, ... or the INT64 numeric values 123, 456, 789, .... If the pk column stores only INT64 numbers, you can set the column type to BIGINT. If you also set the -pk_type startup parameter to INT64, you can improve performance. |
  | knn_result | The pk value of the matched record in the doc table. | STRING by default. |
  | score | The similarity score of the retrieved doc. | STRING by default. |
  | category | The category field for multi-category search. This field is required only for multi-category search. | BIGINT by default. |
  | pt | The partition field. | STRING by default. |
  Note: The meaning of the score field depends on the distance algorithm. Regardless of the algorithm, Proxima CE always returns results ordered from most similar to least similar.
  - For inner_product and mips_squared_euclidean distance, the score represents similarity: a higher value indicates greater similarity.
  - For other distance algorithms, the score is a distance: a lower value indicates greater similarity.
- Output table example
  | pk | knn_result | score | pt |
  | --- | --- | --- | --- |
  | id8 | id1 | 0.1 | 20190322 |
  | id8 | id2 | 0.2 | 20190322 |
  | id9 | id1 | 0.1 | 20190322 |
  | id9 | id3 | 0.3 | 20190322 |
  | ... | ... | ... | ... |
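When you read the output table back, you usually want the neighbors of each query grouped together and ordered from most similar to least similar. The Python sketch below shows one way to do that on rows that have already been fetched. It follows the score rules above: a higher score is better for inner_product and mips_squared_euclidean, and a lower score is better otherwise. The function name and the `squared_euclidean` label in the usage lines are assumptions for illustration.

```python
from collections import defaultdict

# Distance algorithms for which a higher score means greater similarity.
SIMILARITY_METHODS = {"inner_product", "mips_squared_euclidean"}


def rank_neighbors(rows: list, distance_method: str) -> dict:
    """Group (pk, knn_result, score) rows by query pk, most similar first."""
    higher_is_better = distance_method in SIMILARITY_METHODS
    grouped = defaultdict(list)
    for pk, knn_result, score in rows:
        # The score column is STRING by default, so convert it before sorting.
        grouped[pk].append((knn_result, float(score)))
    return {
        pk: sorted(hits, key=lambda hit: hit[1], reverse=higher_is_better)
        for pk, hits in grouped.items()
    }


# Rows copied from the output table example above (pt column omitted).
output_rows = [
    ("id8", "id1", "0.1"),
    ("id8", "id2", "0.2"),
    ("id9", "id1", "0.1"),
    ("id9", "id3", "0.3"),
]
by_distance = rank_neighbors(output_rows, "squared_euclidean")
by_similarity = rank_neighbors(output_rows, "inner_product")
print(by_distance["id8"])
print(by_similarity["id8"])
```

With a distance-based algorithm, id1 ranks ahead of id2 for query id8 because its score is lower. With inner_product, the same scores would put id2 first.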