How to install the Proxima CE package

To use the vector search feature, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare your input data.

Prerequisites

Make sure you have completed the environment preparation.

Download the Proxima CE package

Click Proxima CE package to download the installation package.

The installation package contains the executable JAR file for Proxima CE. Upload this file to your MaxCompute project as a resource. You can then call the JAR file to run Proxima CE tasks.

Upload as a MaxCompute resource

You can upload the downloaded package to your MaxCompute project by using either the MaxCompute client (odpscmd) or DataWorks. This topic uses DataWorks as an example to show how to upload and publish the resource. For information about how to upload a resource by using odpscmd, see Add resources.

On the Data Development page in DataWorks, visually upload the installation package as a JAR resource.
Note
When you create or upload a resource using the DataWorks console, note the following:
- If the resource has not been uploaded to MaxCompute, select Upload to MaxCompute. If the resource has already been uploaded to MaxCompute, deselect Upload to MaxCompute. Otherwise, the upload will fail.
- If you select Upload to MaxCompute during the upload, the resource is stored in both DataWorks and MaxCompute. If you later delete the resource from MaxCompute by using a command, the resource still exists in DataWorks and remains visible.
- The resource name does not need to match the name of the uploaded file.
In the Create Resource dialog box, set File Source to Local, and click Click to upload to upload the package file. If the resource type is JAR, you must add the .jar suffix to the resource name.
Commit and publish the resource.
After you create the resource, click the icon in the toolbar of the resource editor to commit the resource to the scheduling development server.
Note
If a production task needs to use this resource, you must also deploy the resource to the production environment. For more information, see Deploy tasks.

Prepare input tables

Before running the task, you must prepare two input tables:

doc table: The table that contains the base vectors to search.
query table: The table containing the query vectors used to find the nearest neighbors.

CREATE TABLE statements

-- Create a doc table. For multi-category search only, add a category BIGINT column after vector.
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
-- Create a query table. For multi-category search only, add a category BIGINT column after vector.
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);

Input table format

Table names
- Input table names cannot contain the tmp_ string. Otherwise, the task fails.
- An input table name and its partition value cannot exceed 64 characters in length. Otherwise, the task fails.
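The two naming rules above can be checked before a task is submitted. The following is a minimal sketch, not part of Proxima CE: the function name `check_input_table` is hypothetical, and it assumes that the 64-character limit applies to the table name and the partition value separately.

```python
MAX_LEN = 64  # limit stated above for table names and partition values


def check_input_table(table_name: str, partition_value: str) -> list[str]:
    """Return a list of rule violations; an empty list means the names pass."""
    problems = []
    if "tmp_" in table_name:
        problems.append("table name contains 'tmp_'")
    # Assumption: the 64-character limit is applied to each string separately.
    if len(table_name) > MAX_LEN:
        problems.append("table name is longer than 64 characters")
    if len(partition_value) > MAX_LEN:
        problems.append("partition value is longer than 64 characters")
    return problems


print(check_input_table("doc_table_float_smoke", "20190322"))  # []
print(check_input_table("tmp_doc_table", "20190322"))
```

If the limit turns out to apply to the combined length, tighten the two length checks accordingly.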

Fields

Note

Both input tables must contain the following fixed fields, and the field names must match exactly.

Field	Description	Type
pk	The primary key (pk) of the record.	STRING by default. The pk value can be a number or a string, such as `1.nid,2.nid,3.nid,...` or an INT64 number like `123,456,789,...`. If all values in the pk column are INT64 numbers, you can specify the column data type as BIGINT. You can also set the `-pk_type` startup parameter to INT64 to improve performance.
vector	The vector field.	STRING by default.
category	The category field for multi-category search. This field is required only for multi-category search.	BIGINT by default.
pt	The partition field.	STRING by default.
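To decide whether the pk column can be declared as BIGINT (and `-pk_type` set to INT64), you can scan a sample of pk values first. The sketch below is illustrative only: `all_pks_are_int64` is a hypothetical helper, and it assumes that a valid INT64 pk is a plain decimal integer with no whitespace or separators.

```python
import re

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
_INTEGER = re.compile(r"-?[0-9]+")  # plain decimal integers only


def all_pks_are_int64(pks) -> bool:
    """True only if every pk is a decimal integer within the INT64 range."""
    for pk in pks:
        text = str(pk)
        if not _INTEGER.fullmatch(text):
            return False
        if not INT64_MIN <= int(text) <= INT64_MAX:
            return False
    return True


print(all_pks_are_int64(["123", "456", "789"]))   # True
print(all_pks_are_int64(["1.nid", "2.nid"]))      # False
```

A single non-numeric pk is enough to require the default STRING type.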

Input table examples

doc table
pk	vector	pt
id1	0~1~1~5	20190322
id2	0~1~1~2	20190322
id3	3~2~1~1	20190322
...	...	...

query table
pk	vector	pt
id8	0~1~1~5	20190322
id9	0~1~1~2	20190322
id10	3~2~1~1	20190322
...	...	...
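In the sample rows above, each vector is stored as a single STRING whose components are joined with `~`. The following sketch shows one way to produce and read that representation. The helper names are hypothetical, and the separator is inferred from the examples rather than from a format specification.

```python
SEPARATOR = "~"  # taken from the sample rows above


def encode_vector(values) -> str:
    """Join vector components into the '~'-separated string form."""
    return SEPARATOR.join(str(v) for v in values)


def decode_vector(text: str) -> list[float]:
    """Split a '~'-separated string back into a list of floats."""
    return [float(part) for part in text.split(SEPARATOR)]


print(encode_vector([0, 1, 1, 5]))   # 0~1~1~5
print(decode_vector("3~2~1~1"))      # [3.0, 2.0, 1.0, 1.0]
```

Note that Python's `str()` renders very small or very large floats in exponent notation (for example `1e-07`). Whether that form is accepted as input is not covered here, so consider formatting such values explicitly.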

Next steps: Use vector search

Scenario	Key capability	Reference
Basic vector search	Top K retrieval from millions of records	Basic vector search
Multi-category search	Supports different-category query/doc tables and single-query-multiple-category scenarios	Multi-category search
Cluster sharding	Index by cluster shard to reduce compute and accelerate queries	Cluster sharding
Inner product and cosine distance	Inner-product and cosine distance search	Inner product and cosine distance
Converters	Improve performance and reduce index size (retrieval loss varies)	Converters

After a vector search task runs, an output table is automatically generated and stored in MaxCompute. You do not need to create this table. Specify the table name by using the -output_table parameter in your Proxima CE code. For more information about the output table format, see Output table format.

Output table format

After a vector search task runs, an output table is automatically generated and stored in MaxCompute. The format of the output table is described as follows.

Table name: The name of the output table that you specified in the Proxima CE code.
- The output table name cannot contain a period (.). This is a special character in MaxCompute and causes MaxCompute to fail to parse the table name.
- The output table name cannot contain the tmp_ string. Otherwise, the task fails.
- An output table name and its partition name cannot exceed 64 characters in length. Otherwise, the task fails.

Fields

Field	Description	Type
pk	The pk value for each query in the query table.	STRING by default. Values in the pk column can be numbers or strings, such as string values `1.nid,2.nid,3.nid,...` or INT64 numeric values `123,456,789,...`. If the pk column stores only INT64 numbers, you can set the column type to BIGINT. If you also set the `-pk_type` startup parameter to INT64, you can improve performance.
knn_result	The pk value of the matched record in the doc table.	STRING by default.
score	The similarity score of the retrieved doc.	STRING by default. The meaning of `score` depends on the distance algorithm. For `inner_product` and `mips_squared_euclidean`, `score` represents similarity, and a higher value indicates greater similarity. For other distance algorithms, `score` represents distance, and a lower value indicates greater similarity. Regardless of the algorithm, Proxima CE returns results sorted from most similar to least similar.
category	The category field for multi-category search. This field is required only for multi-category search.	BIGINT by default.
pt	The partition field.	STRING by default.

Output table example

pk	knn_result	score	pt
id8	id1	0.1	20190322
id8	id2	0.2	20190322
id9	id1	0.1	20190322
id9	id3	0.3	20190322
...	...	...	...
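When you read the output table in downstream code, remember that `score` is a STRING by default and that its direction depends on the distance algorithm. The sketch below groups rows by query pk and orders each group from most similar to least similar. The function name is hypothetical, and `"squared_euclidean"` is used only as a placeholder label for a distance-based algorithm: any name outside the two similarity-based algorithms named in the field table is treated as distance-based.

```python
from collections import defaultdict

# Algorithms for which a higher score means greater similarity (from the field table).
SIMILARITY_METRICS = {"inner_product", "mips_squared_euclidean"}


def neighbors_by_query(rows, distance_method):
    """Group (pk, knn_result, score) rows by query pk, most similar first."""
    higher_is_better = distance_method in SIMILARITY_METRICS
    grouped = defaultdict(list)
    for pk, knn_result, score in rows:
        grouped[pk].append((knn_result, float(score)))  # score is STRING by default
    for matches in grouped.values():
        matches.sort(key=lambda m: m[1], reverse=higher_is_better)
    return dict(grouped)


rows = [
    ("id8", "id2", "0.2"),
    ("id8", "id1", "0.1"),
    ("id9", "id1", "0.1"),
    ("id9", "id3", "0.3"),
]
# Placeholder label for a distance-based algorithm: lower score ranks first.
print(neighbors_by_query(rows, "squared_euclidean"))
```

Converting `score` to a float before sorting avoids lexicographic ordering of the string values.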
