Install the Proxima CE package

To use the vector search feature, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare your input data.

Prerequisites

Make sure you have completed the environment preparation.

Download the Proxima CE package

Click Proxima CE package to download the installation package.

The installation package contains the executable JAR file for Proxima CE. Upload this file to your MaxCompute project as a resource. You can then call the JAR file to run Proxima CE tasks.

Upload as a MaxCompute resource

You can upload the downloaded package to your MaxCompute project by using either the MaxCompute client (odpscmd) or DataWorks. This topic uses DataWorks as an example to show how to upload and publish the resource. For information about how to upload a resource by using odpscmd, see Add resources.

  1. On the Data Development page in DataWorks, upload the installation package as a JAR resource by using the visual interface.

    Note

    When you create or upload a resource using the DataWorks console, note the following:

    • If the resource has not been uploaded to MaxCompute, select Upload to MaxCompute. If the resource has already been uploaded to MaxCompute, deselect Upload to MaxCompute. Otherwise, the upload will fail.

    • If you select Upload to MaxCompute during the upload, the resource is stored in both DataWorks and MaxCompute. If you later delete the resource from MaxCompute by using a command, the resource still exists in DataWorks and remains visible.

    • The resource name does not need to match the name of the uploaded file.

    In the Create Resource dialog box, set File Source to Local, and click Click to upload to upload the package file. If the resource type is JAR, you must add the .jar suffix to the resource name.

  2. Commit and publish the resource.

    After you create the resource, click the Commit icon in the toolbar of the resource editor to commit the resource to the scheduling development server.

    Note

    If a production task needs to use this resource, you must also deploy the resource to the production environment. For more information, see Deploy tasks.

Prepare input tables

Before running the task, you must prepare two input tables:

  • doc table: The table that contains the base vectors to search.

  • query table: The table containing the query vectors used to find the nearest neighbors.

CREATE TABLE statements

-- The part in angle brackets is optional: add the category column only for multi-category search.
-- Create a doc table
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING <, category BIGINT>) PARTITIONED BY (pt STRING);
-- Create a query table
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING <, category BIGINT>) PARTITIONED BY (pt STRING);
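
The optional category column in the statements above can be made explicit by generating the DDL. The following is a minimal sketch in Python; the helper name `create_table_sql` and its `with_category` flag are hypothetical and are not part of Proxima CE or MaxCompute.

```python
def create_table_sql(table_name: str, with_category: bool = False) -> str:
    """Build the CREATE TABLE statement for a doc or query table.

    with_category adds the optional category column, which is needed
    only for multi-category search. (Hypothetical helper for illustration.)
    """
    columns = ["pk STRING", "vector STRING"]
    if with_category:
        columns.append("category BIGINT")
    return (
        f"CREATE TABLE {table_name}({', '.join(columns)}) "
        "PARTITIONED BY (pt STRING);"
    )


# Plain vector search: no category column.
print(create_table_sql("doc_table_float_smoke"))
# Multi-category search: category column included.
print(create_table_sql("query_table_float_smoke", with_category=True))
```

The generated statement can then be run in MaxCompute in the same way as the hand-written statements.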

Input table format

  • Table names

    • Input table names cannot contain the tmp_ string. Otherwise, the task fails.

    • An input table name and its partition value cannot exceed 64 characters in length. Otherwise, the task fails.

  • Fields

    Note

    Both input tables must contain the following fixed fields, and the field names must match exactly.

    | Field | Description | Type |
    | --- | --- | --- |
    | pk | The primary key (pk) of the record. | STRING by default. The pk value can be a string, such as 1.nid, 2.nid, 3.nid, ..., or an INT64 number, such as 123, 456, 789, .... If all values in the pk column are INT64 numbers, you can specify the column data type as BIGINT. You can also set the -pk_type startup parameter to INT64 to improve performance. |
    | vector | The vector field. | STRING by default. |
    | category | The category field for multi-category search. This field is required only for multi-category search. | BIGINT by default. |
    | pt | The partition field. | STRING by default. |
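
The naming rules and the pk type rule above can be checked before a task is submitted. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 64-character limit applies to the table name and the partition value separately, which the text does not state explicitly.

```python
import re

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
MAX_LENGTH = 64
_INTEGER = re.compile(r"[+-]?[0-9]+")


def check_input_table(table_name: str, partition_value: str) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if "tmp_" in table_name:
        problems.append("table name contains 'tmp_'")
    # Assumption: the limit is applied to each value on its own.
    if len(table_name) > MAX_LENGTH:
        problems.append("table name is longer than 64 characters")
    if len(partition_value) > MAX_LENGTH:
        problems.append("partition value is longer than 64 characters")
    return problems


def all_int64(pk_values) -> bool:
    """True if every pk is an integer literal within the INT64 range."""
    for value in pk_values:
        text = str(value)
        if not _INTEGER.fullmatch(text):
            return False
        if not INT64_MIN <= int(text) <= INT64_MAX:
            return False
    return True


print(check_input_table("doc_table_float_smoke", "20190322"))
print(all_int64(["123", "456", "789"]))
print(all_int64(["1.nid", "2.nid", "3.nid"]))
```

If `all_int64` returns True for the whole pk column, the column is a candidate for BIGINT together with the -pk_type startup parameter set to INT64.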

Input table examples

  • doc table

    | pk | vector | pt |
    | --- | --- | --- |
    | id1 | 0~1~1~5 | 20190322 |
    | id2 | 0~1~1~2 | 20190322 |
    | id3 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |

  • query table

    | pk | vector | pt |
    | --- | --- | --- |
    | id8 | 0~1~1~5 | 20190322 |
    | id9 | 0~1~1~2 | 20190322 |
    | id10 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |
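
In the examples above, each vector is stored as a string whose components are separated by a tilde (~). The sketch below shows one way to produce and parse such strings in Python. It assumes the tilde-separated form shown in the examples; the function names are hypothetical.

```python
def encode_vector(values, separator="~"):
    """Join vector components into the string form shown in the examples."""
    return separator.join(str(v) for v in values)


def decode_vector(text, separator="~"):
    """Split a vector string back into a list of floats."""
    return [float(part) for part in text.split(separator)]


# Rebuild the doc table example rows as (pk, vector, pt) tuples.
partition = "20190322"
docs = {"id1": [0, 1, 1, 5], "id2": [0, 1, 1, 2], "id3": [3, 2, 1, 1]}
rows = [(pk, encode_vector(vec), partition) for pk, vec in docs.items()]

for row in rows:
    print(row)
```

Writing the rows into the partitioned table is left to whichever upload method you use, such as odpscmd or DataWorks.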

Next steps: Use vector search

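| Search scenario | Key features | Guide |
| --- | --- | --- |
| Basic vector search | Supports million-scale top-K queries. | Basic vector search |
| Multi-category search | Supports multi-category scenarios, including cases where queries and docs belong to multiple categories and cases where a single query belongs to multiple categories. | Multi-category search |
| Cluster sharding | Supports building indexes by cluster sharding, which reduces the amount of computation and speeds up subsequent index queries. | Cluster sharding |
| Inner product and cosine distance | Supports search by inner product and cosine distance. | Inner product and cosine distance |
| Quantizer usage | Supports quantizers. Configuring a quantizer generally improves performance and reduces index size, with some possible loss of recall. | Quantizer usage |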

After a vector search task runs, an output table is automatically generated and stored in MaxCompute. You do not need to create this table. Specify the table name by using the -output_table parameter in your Proxima CE code. For more information about the output table format, see Output table format.

Output table format

The output table that a vector search task generates in MaxCompute has the following format.

  • Table name: The name of the output table that you specified in the Proxima CE code.

    • The output table name cannot contain a period (.). This is a special character in MaxCompute and causes MaxCompute to fail to parse the table name.

    • The output table name cannot contain the tmp_ string. Otherwise, the task fails.

    • An output table name and its partition name cannot exceed 64 characters in length. Otherwise, the task fails.

  • Fields

    | Field | Description | Type |
    | --- | --- | --- |
    | pk | The pk value for each query in the query table. | STRING by default. Values in the pk column can be strings, such as 1.nid, 2.nid, 3.nid, ..., or INT64 numbers, such as 123, 456, 789, .... If the pk column stores only INT64 numbers, you can set the column type to BIGINT. If you also set the -pk_type startup parameter to INT64, you can improve performance. |
    | knn_result | The pk value of the matched record in the doc table. | STRING by default. |
    | score | The similarity score of the retrieved doc. | STRING by default. |
    | category | The category field for multi-category search. This field is required only for multi-category search. | BIGINT by default. |
    | pt | The partition field. | STRING by default. |

    Note

    The meaning of the score field depends on the distance algorithm. Regardless of the algorithm, Proxima CE always returns results ordered from most similar to least similar.

    • For inner_product and mips_squared_euclidean distance, the score represents similarity, so a higher value indicates greater similarity.

    • For other distance algorithms, the score represents distance, so a lower value indicates greater similarity.
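
Because the direction of the score depends on the distance algorithm, any client-side re-sorting has to take the algorithm into account. The following sketch illustrates the rule described above. The function name is hypothetical, and the string "other" is only a stand-in for any distance algorithm other than the two named in the text.

```python
# Algorithms for which a higher score means greater similarity (from the text).
SIMILARITY_SCORED = {"inner_product", "mips_squared_euclidean"}


def sort_most_similar_first(results, distance_method):
    """Order (knn_result, score) pairs from most to least similar.

    Scores are STRING by default, so they are converted to float first.
    """
    higher_is_better = distance_method in SIMILARITY_SCORED
    return sorted(results, key=lambda r: float(r[1]), reverse=higher_is_better)


results = [("id2", "0.2"), ("id1", "0.1")]

# Similarity-based score: the larger value comes first.
print(sort_most_similar_first(results, "inner_product"))
# Distance-based score ("other" is a placeholder name): the smaller value comes first.
print(sort_most_similar_first(results, "other"))
```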

Output table example

| pk | knn_result | score | pt |
| --- | --- | --- | --- |
| id8 | id1 | 0.1 | 20190322 |
| id8 | id2 | 0.2 | 20190322 |
| id9 | id1 | 0.1 | 20190322 |
| id9 | id3 | 0.3 | 20190322 |
| ... | ... | ... | ... |
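
The output table holds one row per query and matched doc, so a query with several neighbors spans several rows. As an illustration, the sketch below groups the example rows by query pk. It assumes the rows have already been read into Python as tuples; the function name is hypothetical.

```python
from collections import defaultdict

# The example rows above, as (pk, knn_result, score, pt) tuples.
rows = [
    ("id8", "id1", "0.1", "20190322"),
    ("id8", "id2", "0.2", "20190322"),
    ("id9", "id1", "0.1", "20190322"),
    ("id9", "id3", "0.3", "20190322"),
]


def group_neighbors(rows):
    """Map each query pk to its list of (doc pk, score) pairs, in row order."""
    grouped = defaultdict(list)
    for pk, knn_result, score, _pt in rows:
        grouped[pk].append((knn_result, float(score)))
    return dict(grouped)


for query_pk, neighbors in group_neighbors(rows).items():
    print(query_pk, neighbors)
```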