Proxima CE supports vector search tasks, including basic vector search and million-scale top-K search. This topic shows how to run a basic vector search offline task in Proxima CE and provides examples.
Prerequisites
You have installed the Proxima CE package and prepared the input table. For more information, see Install the Proxima CE package.
Limitations
You cannot run vector search tasks in a project where the tenant-level schema syntax switch is enabled. If you do, the task fails with an error similar to Schema xxx does not exist. Run the code examples in this topic in a project where this switch is disabled.
Notes
- If you have permissions for an External Volume, you can run the task using the Volume method. Otherwise, you must provide an AccessKey or a RAM role ARN to allow the Proxima CE task to programmatically create and use the required resources.
- When you run a Proxima CE task in DataWorks, you must use the Shared Resource Groups for Scheduling for Smoke Testing. For more information, see Perform smoke testing.
Populate the input table
You must populate the doc and query tables with your own vector data. If you only want to test the entire process, you can run the following commands on a SQL node in DataWorks to generate small data tables for a simple search example.
ALTER TABLE doc_table_float_smoke ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE doc_table_float_smoke PARTITION (pt='20221111') VALUES
('1.nid','1~1~1~1~1~1~1~1'),
('2.nid','2~2~2~2~2~2~2~2'),
('3.nid','3~3~3~3~3~3~3~3'),
('4.nid','4~4~4~4~4~4~4~4'),
('5.nid','5~5~5~5~5~5~5~5'),
('6.nid','6~6~6~6~6~6~6~6'),
('7.nid','7~7~7~7~7~7~7~7'),
('8.nid','8~8~8~8~8~8~8~8'),
('9.nid','9~9~9~9~9~9~9~9'),
('10.nid','10~10~10~10~10~10~10~10');
ALTER TABLE query_table_float_smoke ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE query_table_float_smoke PARTITION (pt='20221111') VALUES
('q1.nid','1~1~1~1~2~2~2~2'),
('q2.nid','4~4~4~4~3~3~3~3'),
('q3.nid','9~9~9~9~5~5~5~5');
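For a larger smoke test, rows in this shape can be generated instead of typed by hand. The following Python sketch is a hypothetical helper and not part of Proxima CE. It assumes only what the example above shows: each row is a primary key plus a vector whose elements are joined with the ~ delimiter. The function names format_vector and build_insert are made up for illustration.

```python
def format_vector(values, delimiter="~"):
    """Join vector elements with the delimiter used in the sample tables."""
    return delimiter.join(str(v) for v in values)


def build_insert(table, partition, rows):
    """Build an INSERT OVERWRITE statement shaped like the sample above.

    rows is a list of (primary_key, vector) pairs. Hypothetical helper for
    illustration only; keys are not escaped, so pass trusted input.
    """
    values = ",\n".join(
        "('{}','{}')".format(pk, format_vector(vec)) for pk, vec in rows
    )
    return "INSERT OVERWRITE TABLE {} PARTITION (pt='{}') VALUES\n{};".format(
        table, partition, values
    )


# Reproduce the first three doc rows of the smoke-test table.
doc_rows = [("{}.nid".format(i), [i] * 8) for i in range(1, 4)]
statement = build_insert("doc_table_float_smoke", "20221111", doc_rows)
print(statement)
```

The generated statement can be pasted into the same SQL node after the corresponding ALTER TABLE ... ADD PARTITION statement.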
Run the task
You can run the task in DataWorks or odpscmd. Choose the tool that best suits your needs.
For more information about the parameters used in the following code examples, see Reference: Proxima CE parameters.
Run in DataWorks
In DataWorks, create a MaxCompute ODPS MR node and run a Proxima CE task using an ODPS SQL script.
- Volume method
--@resource_reference{"<proxima_ce_jar>"} -- To generate this reference, on the Data Development page, right-click the uploaded JAR package and select Reference Resource.
jar -resources <proxima_ce_jar> -- The uploaded Proxima CE JAR package.
-classpath <proxima_ce_jar> com.alibaba.proxima2.ce.ProximaCERunner -- The classpath that specifies the entry class of the main function.
-doc_table doc_table_float_smoke -- The input doc table.
-doc_table_partition 20221111 -- The partition of the input doc table.
-query_table query_table_float_smoke -- The input query table.
-query_table_partition 20221111 -- The partition of the input query table.
-output_table output_table_float_smoke -- The output table.
-output_table_partition 20221111 -- The partition of the output table.
-data_type float -- The data type of vectors.
-dimension 8 -- The vector dimension.
-topk 1 -- The top-k value for vector search.
-job_mode train:build:seek:recall -- The mode of the search task. The default is train:build:seek. Add :recall to calculate the recall rate.
-external_volume_name <ext_volume> -- The External Volume on OSS. The underlying OSS directory must also be created. Otherwise, the task fails.
-owner_id <oid> -- A unique ID to identify the user.
; -- Do not omit the semicolon. It marks the end of an ODPS SQL statement.
Replace the following placeholders with your actual values:
Parameter
Description
proxima_ce_jar
The name of the uploaded Proxima CE installation package, for example, proxima-ce-aliyun-1.0.1.jar. For more information, see Install the Proxima CE package.
ext_volume
The name of the External Volume. For more information about creating an External Volume, see External Volume operations.
oid
A unique numeric user identifier up to 32 digits long, such as 123456. We recommend using your Alibaba Cloud account ID.
- RAM role ARN method
--@resource_reference{"<proxima_ce_jar>"} -- To generate this reference, on the Data Development page, right-click the uploaded JAR package and select Reference Resource.
jar -resources <proxima_ce_jar> -- The uploaded Proxima CE JAR package.
-classpath <proxima_ce_jar> com.alibaba.proxima2.ce.ProximaCERunner -- The classpath that specifies the entry class of the main function.
-doc_table doc_table_float_smoke -- The input doc table.
-doc_table_partition 20221111 -- The partition of the input doc table.
-query_table query_table_float_smoke -- The input query table.
-query_table_partition 20221111 -- The partition of the input query table.
-output_table output_table_float_smoke -- The output table.
-output_table_partition 20221111 -- The partition of the output table.
-data_type float -- The data type of vectors.
-dimension 8 -- The vector dimension.
-topk 1 -- The top-k value for vector search.
-job_mode train:build:seek:recall -- The mode of the search task. The default is train:build:seek. Add :recall to calculate the recall rate.
-oss_role_arn <rolearn> -- The RAM role ARN with permissions to access OSS. Example: acs:ram::1234xxx5678:role/xxx-role.
-oss_endpoint <endpoint> -- The endpoint of the destination region.
-oss_bucket <bucket> -- The OSS bucket.
-owner_id <oid> -- A unique ID to identify the user.
; -- Do not omit the semicolon. It marks the end of an ODPS SQL statement.
Replace the following placeholders with your actual values:
Parameter
Description
proxima_ce_jar
The name of the uploaded Proxima CE installation package, for example, proxima-ce-aliyun-1.0.1.jar. For more information, see Install the Proxima CE package.
rolearn
The Alibaba Cloud Resource Name (ARN) of the RAM role, for example, acs:ram::1234xxx5678:role/xxx-role. You can obtain the ARN from the RAM console. In the navigation pane on the left, choose Identities > Role.
endpoint
The internal endpoint of OSS in the region where the MaxCompute project is located. For more information, see Regions and Endpoints.
bucket
The name of an OSS bucket in the same region as your MaxCompute project. For information about how to view bucket names, see List buckets.
oid
A unique numeric user identifier up to 32 digits long, such as 123456. We recommend using your Alibaba Cloud account ID.
Run in odpscmd
Run one of the following scripts in the MaxCompute client (odpscmd):
- Volume method
jar -resources <proxima_ce_jar>
-classpath <proxima_ce_jar_path> com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_float_smoke
-doc_table_partition 20221111
-query_table query_table_float_smoke
-query_table_partition 20221111
-output_table output_table_float_smoke
-output_table_partition 20221111
-data_type float
-dimension 8
-topk 1
-job_mode train:build:seek:recall
-external_volume_name <ext_volume>
-owner_id <oid>
;
Replace the following placeholders with your actual values:
Parameter
Description
proxima_ce_jar
The name of the uploaded Proxima CE installation package, for example, proxima-ce-aliyun-1.0.1.jar. For more information, see Install the Proxima CE package.
proxima_ce_jar_path
The local path of the Proxima CE JAR package. If the JAR package is in the same directory as the script, you can use the package name alone.
ext_volume
The name of the External Volume. For more information about creating an External Volume, see External Volume operations.
oid
A unique numeric user identifier up to 32 digits long, such as 123456. We recommend using your Alibaba Cloud account ID.
- RAM role ARN method
jar -resources <proxima_ce_jar>
-classpath <proxima_ce_jar_path> com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_float_smoke
-doc_table_partition 20221111
-query_table query_table_float_smoke
-query_table_partition 20221111
-output_table output_table_float_smoke
-output_table_partition 20221111
-data_type float
-dimension 8
-topk 1
-job_mode train:build:seek:recall
-oss_role_arn <rolearn>
-oss_endpoint <endpoint>
-oss_bucket <bucket>
-owner_id <oid>
;
Replace the following placeholders with your actual values:
Parameter
Description
proxima_ce_jar
The name of the uploaded Proxima CE installation package, for example, proxima-ce-aliyun-1.0.1.jar. For more information, see Install the Proxima CE package.
proxima_ce_jar_path
The local path of the Proxima CE JAR package. If the JAR package is in the same directory as the script, you can use the package name alone.
rolearn
The Alibaba Cloud Resource Name (ARN) of the RAM role, for example, acs:ram::1234xxx5678:role/xxx-role. You can obtain the ARN from the RAM console. In the navigation pane on the left, choose Identities > Role.
endpoint
The OSS internal endpoint for the region where the MaxCompute project is located. For more information, see Regions and Endpoints.
bucket
The name of an OSS bucket in the same region as your MaxCompute project. For information about how to view bucket names, see List buckets.
oid
A unique numeric user identifier up to 32 digits long, such as 123456. We recommend using your Alibaba Cloud account ID.
Results
- Sample console output
Vector Search
Data Type:4 , Vector Dimension:8 , Search Method:HNSW , Distance Metric:SquaredEuclidean , Build Mode:train:build:seek:recall
Doc Table Details
Table Name: doc_table_float_smoke , Partition:20221111 , Doc Count:10 , Vector Delimiter:~
Query Table Details
Table Name: query_table_float_smoke , Partition:20221111 , Query Count:3 , Vector Delimiter:~
Output Table Details
Table Name: output_table_float_smoke , Partition:20221111
Row and Column Details
Rows: 1 , Columns:1 , Docs per Column Index:1000000
Clear Volume Index:false
Time spent by each worker (seconds):
SegmentationWorker: 1
TmpTableWorker: 0
KmeansGraphWorker: 0
BuildJobWorker: 120
SeekJobWorker: 60
TmpResultJoinWorker: 0
RecallWorker: 60
CleanUpWorker: 1
Total time (minutes):
Actual recall rate
Recall@1: 1.0
- Sample output table
+------------+------------+------------+------------+
| pk         | knn_result | score      | pt         |
+------------+------------+------------+------------+
| q1.nid     | 1.nid      | 4.0        | 20221111   |
| q2.nid     | 3.nid      | 4.0        | 20221111   |
| q3.nid     | 7.nid      | 32.0       | 20221111   |
+------------+------------+------------+------------+
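The scores in the sample output table can be checked by hand. The console output reports the SquaredEuclidean distance metric, so a brute-force pass over the ten smoke-test doc vectors should give the same nearest neighbors. The following Python sketch is an independent check, not Proxima CE code. It assumes that score is the squared Euclidean distance between the query and the returned doc, and it breaks ties by taking the earlier doc.

```python
def parse_vector(text, delimiter="~"):
    """Parse a '~'-delimited vector string as stored in the sample tables."""
    return [float(x) for x in text.split(delimiter)]


def squared_euclidean(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# The ten doc vectors from the smoke-test doc table: i.nid -> [i] * 8.
docs = {"{}.nid".format(i): [float(i)] * 8 for i in range(1, 11)}

queries = {
    "q1.nid": parse_vector("1~1~1~1~2~2~2~2"),
    "q2.nid": parse_vector("4~4~4~4~3~3~3~3"),
    "q3.nid": parse_vector("9~9~9~9~5~5~5~5"),
}


def top1(query):
    """Exact nearest neighbor; min() keeps the first doc when distances tie."""
    pk = min(docs, key=lambda k: squared_euclidean(docs[k], query))
    return pk, squared_euclidean(docs[pk], query)


results = {name: top1(vec) for name, vec in queries.items()}
for name, (pk, score) in results.items():
    print(name, pk, score)
```

Note that q1 is equally close to 1.nid and 2.nid (distance 4.0), and q2 is equally close to 3.nid and 4.nid. The sketch resolves these ties by doc order, which happens to match the sample table; which of two tied docs an actual run returns is not specified here.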
Million-scale top-K search
Powered by the Proxima 2.x kernel, Proxima CE offers enhanced performance for million-scale top-K search. This allows you to quickly retrieve the top K most similar results for a query vector from a dataset of millions of vectors. To use this feature, set the -topk startup parameter.
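To make the meaning of top K concrete, the following Python sketch performs exact (brute-force) top-K selection over the smoke-test vectors and computes a recall value against it. It is not how Proxima CE searches internally (the sample console output reports HNSW, an approximate method). The recall formula used here, the fraction of exact top-K results that also appear in the approximate result, is the common definition and is an assumption about what the Recall@K line reports. The names exact_topk and recall_at_k are hypothetical.

```python
import heapq


def squared_euclidean(a, b):
    """Sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_topk(docs, query, k):
    """Return the k (pk, score) pairs with the smallest distance to query."""
    scored = ((pk, squared_euclidean(vec, query)) for pk, vec in docs.items())
    return heapq.nsmallest(k, scored, key=lambda item: item[1])


def recall_at_k(approx_pks, exact_pks):
    """Fraction of the exact top-K keys that the approximate result found."""
    exact = set(exact_pks)
    return len(exact & set(approx_pks)) / len(exact)


# Same doc vectors as the smoke-test table: i.nid -> [i] * 8.
docs = {"{}.nid".format(i): [float(i)] * 8 for i in range(1, 11)}
query = [9.0] * 4 + [5.0] * 4  # the q3.nid vector

top3 = exact_topk(docs, query, 3)
print(top3)

# A made-up approximate result that misses one of the exact neighbors.
hypothetical_approx = ["7.nid", "6.nid", "5.nid"]
print(recall_at_k(hypothetical_approx, [pk for pk, _ in top3]))
```

Brute force compares the query against every doc, so its cost grows linearly with the dataset size. That is workable for ten rows but is the reason an index-based search is used at million scale.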