Reference: Complete list of Proxima CE parameters

This topic describes the required and optional parameters for running Proxima CE.

Required parameters

Parameter Name

Description

doc_table

The input base table, which is a MaxCompute table. Prepare this table to use as the candidate set for retrieval.

Important

Do not use a period (.) within the table name itself. A period is a special character in MaxCompute, so a period inside the name causes table parsing to fail. The only supported use of a period is to reference a table from another project, in the project_name.table_name format.

doc_table_partition

The MaxCompute partition of the base table.

query_table

The input query table, which is a MaxCompute table. Prepare this table to use as the retrieval set.

Important

Do not use a period (.) within the table name itself. A period is a special character in MaxCompute, so a period inside the name causes table parsing to fail. The only supported use of a period is to reference a table from another project, in the project_name.table_name format.

query_table_partition

The MaxCompute partition of the query table.

output_table

The output table. You do not need to create this table. Specify a table name to store the retrieval results.

output_table_partition

The MaxCompute partition of the output table.

data_type

The data type of the input data table. The supported types are FLOAT/INT8/BINARY.

dimension

The dimension of the feature vector. If data_type is set to BINARY, the dimension must be an integer multiple of 32.
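
The data_type and dimension rules above can be checked before a job is submitted. The following Python sketch is illustrative only: check_dimension is a hypothetical helper, not part of Proxima CE, and the requirement that dimension be positive is an assumption.

```python
SUPPORTED_DATA_TYPES = {"FLOAT", "INT8", "BINARY"}


def check_dimension(data_type: str, dimension: int) -> None:
    """Raise ValueError if data_type or dimension breaks the documented rules."""
    if data_type not in SUPPORTED_DATA_TYPES:
        raise ValueError(f"unsupported data_type: {data_type}")
    # Assumption: a dimension must be a positive integer.
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    # Documented rule: BINARY vectors need a dimension that is a multiple of 32.
    if data_type == "BINARY" and dimension % 32 != 0:
        raise ValueError("BINARY dimension must be an integer multiple of 32")
```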

Optional parameters

Parameter Name

Description

Default value

h (--help)

Displays help information.

None

topk

The number of similar results to retrieve. You can specify multiple values, such as 10,20,30. The number of results retrieved to the output table is the maximum value specified.

200
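
As a rough illustration of how a multi-value topk setting relates to the output size, the following Python sketch parses the comma-separated value and reports the maximum. parse_topk is a hypothetical helper and does not reflect how Proxima CE parses the parameter internally.

```python
from typing import List, Tuple


def parse_topk(value: str) -> Tuple[List[int], int]:
    """Return the requested topk values and the number of results written per query."""
    ks = [int(part) for part in value.split(",")]
    # Per the description above, the output table holds max(ks) results.
    return ks, max(ks)
```

For example, parse_topk("10,20,30") yields the list [10, 20, 30] and an output size of 30.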

pk_type

Specifies the data type of the pk column in the input table. Valid values are INT64 and STRING. The default is STRING. If the pk column is not of the INT64 type (for example, it is of the STRING type), Proxima CE creates a temporary input table in which the pk column is mapped to a tmp_pk column of the INT64 type. The final result is then obtained using a MaxCompute table JOIN operation. This process adds about 30 minutes to the runtime for 100 million documents.

string

vector_separator

The separator between vector elements. The default is the tilde (~), and you can specify a different separator. Spaces are supported: to use a space, specify blank. The value is read literally as a string, so enter only the character itself, without single or double quotation marks. For example, ',' (with the quotation marks) is treated as the string ',' rather than as a comma.

~
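
The following Python sketch shows how a vector string might be split under these rules, including the blank keyword and the quotation mark pitfall. split_vector is a hypothetical helper written for this illustration. It is not the Proxima CE parser.

```python
from typing import List


def split_vector(raw: str, vector_separator: str = "~") -> List[float]:
    """Split a vector string using the separator semantics described above."""
    # "blank" is the documented keyword for a space separator.
    sep = " " if vector_separator == "blank" else vector_separator
    return [float(item) for item in raw.split(sep)]
```

With the default separator, split_vector("0.1~0.2~0.3") returns three floats. Passing the quoted value "','" as the separator does not split "1,2" at the comma, so the conversion to float fails.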

binary_to_int

Specifies whether to use INT32 to represent BINARY data. This parameter is valid only for data of the BINARY type. The dimension parameter still represents the dimension of the binary feature. For example, assume the separator is a comma. If binary_to_int is set to false, the user input is similar to "1,1,1,1,1,1,....". If binary_to_int is set to true, the user input is similar to "12345,13423,13325,....". This converts N 0s or 1s into N/32 integers and reduces the index size.

false
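
To make the N to N/32 reduction concrete, the following Python sketch packs a list of bits into 32-bit words. It is an illustration under assumptions: pack_bits is a hypothetical helper, and the bit order (most significant bit first) and unsigned representation are choices made for this example. This topic does not state the encoding that Proxima CE expects.

```python
from typing import List


def pack_bits(bits: List[int], word_size: int = 32) -> List[int]:
    """Pack a list of 0/1 values into unsigned integers, word_size bits each."""
    if len(bits) % word_size != 0:
        raise ValueError("bit count must be a multiple of the word size")
    words = []
    for start in range(0, len(bits), word_size):
        word = 0
        # Assumption: the first bit of each group is the most significant bit.
        for bit in bits[start:start + word_size]:
            word = (word << 1) | bit
        words.append(word)
    return words
```

A 64-dimension binary vector therefore becomes two integers, which can be joined with the configured separator, for example ",".join(str(w) for w in pack_bits(bits)).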

job_mode

The supported modes are combinations of the following:

  • train:build:seek (default)

  • build:seek

  • seek

  • train:build:seek:recall

  • build:seek:recall

  • seek:recall

train:build:seek
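
A job_mode value can be validated against the six listed combinations before use. The following Python sketch is a minimal illustration. job_stages is a hypothetical helper, not a Proxima CE function.

```python
from typing import List

# The six combinations listed above.
VALID_JOB_MODES = {
    "train:build:seek",
    "build:seek",
    "seek",
    "train:build:seek:recall",
    "build:seek:recall",
    "seek:recall",
}


def job_stages(job_mode: str = "train:build:seek") -> List[str]:
    """Return the ordered stages of a job_mode, rejecting unlisted combinations."""
    if job_mode not in VALID_JOB_MODES:
        raise ValueError(f"unsupported job_mode: {job_mode}")
    return job_mode.split(":")
```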

clean_build_volume

Specifies whether to delete the index. After the build job completes index building, it writes the index to a MaxCompute volume. The seek job then loads this index. After the seek job is executed, the index is deleted by default.

Note

If this parameter is set to true, the index is also cleared when the task fails.

true

algo_model

The index building method. Based on the proxima2.x kernel, the following six index building methods are supported: hnsw/ssg/hc/gc/qc/linear. This parameter determines the builder for constructing the index and the searcher for querying. The mappings are as follows:

  • hnsw: HnswBuilder/HnswSearcher

  • ssg: SsgBuilder/SsgSearcher

  • hc: ClusteringBuilder/ClusteringSearcher

  • gc: GcBuilder/GcSearcher

  • qc: QcBuilder/QcSearcher

  • linear: LinearBuilder/LinearSearcher (brute-force search)

hnsw

builder_params

The parameters for index building. The default value is empty. These parameters must correspond to the index type specified by algo_model. Provide the parameters as a single-line JSON string. Do not escape the double quotation marks or include spaces. For example, {"proxima.hnsw.builder.efconstruction":400,"proxima.hnsw.builder.max_neighbor_count":100} specifies the ef value and the maximum number of neighbors for a node in the HnswBuilder. For more information, see IndexBuilder parameter settings.

None
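
One way to produce a single-line JSON string without spaces is to serialize the parameters with compact separators. The following Python sketch uses the standard json module. to_param_string is a hypothetical helper name, and the parameter keys are the ones from the example above.

```python
import json


def to_param_string(params: dict) -> str:
    """Serialize parameters as single-line JSON with no spaces."""
    # separators=(",", ":") removes the spaces json.dumps adds by default.
    return json.dumps(params, separators=(",", ":"))


builder_params = to_param_string({
    "proxima.hnsw.builder.efconstruction": 400,
    "proxima.hnsw.builder.max_neighbor_count": 100,
})
```

The same approach applies to the other JSON-valued parameters, such as searcher_params, converter_params, and measure_params. Python booleans are written as the JSON literals true and false.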

searcher_params

The parameters for index searching. The default value is empty. These parameters must correspond to the index type specified by algo_model. Provide the parameters as a single-line JSON string. Do not escape the double quotation marks or include spaces. For example, {"proxima.hnsw.searcher.ef":400} specifies the ef value for an HNSW search. For more information, see IndexSearcher parameter settings.

None

converter

The name of the converter for index building. Index Converter is a Proxima 2.x module that transforms feature vectors. For example, it can perform dimensionality reduction, half-float conversion, or INT8 quantization on features. It can be used independently or as part of the retrieval flow. For more information, see Index Converter.

None

converter_params

The parameters for the converter. Provide the parameters as a single-line JSON string. Do not escape the double quotation marks or include spaces. For example, to specify the parameters for MipsConverter, use {"proxima.mips.converter.m_value":4,"proxima.mips.converter.u_value":0.38196601,"proxima.mips.converter.forced_half_float":false,"proxima.mips.converter.spherical_injection":false}. For more information, see IndexConverter parameter settings.

None

distance_method

The formula for calculating the distance between features. The following methods are supported:

  • squared_euclidean (squared Euclidean distance)

  • euclidean (Euclidean distance)

  • mips_squared_euclidean

  • inner_product (inner product)

  • hamming (used for the BINARY type)

  • manhattan (L1 distance)

  • chebyshev (Chebyshev distance)

  • canberra (Canberra distance)

  • geo_distance (geographical distance)

  • rogers_tanimoto (used for the BINARY type)

  • russell_rao (used for the BINARY type)

  • matching (used for the BINARY type)

squared_euclidean
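
For reference, the following Python sketch gives the textbook definitions of several of these measures. It is not the Proxima implementation, and the scores that Proxima CE reports may use different sign or scaling conventions.

```python
import math


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    # Number of positions at which two binary vectors differ.
    return sum(x != y for x, y in zip(a, b))
```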

measure_params

The parameters for the distance method specified by -distance_method. Provide the parameters as a single-line JSON string. Do not escape the double quotation marks or include spaces. For example, to specify the parameters for MipsSquaredEuclidean, use {"proxima.mips_euclidean.measure.injection_type":0}. For more information, see IndexMeasure parameter settings.

None

column_num

The number of columns for index building. The default value is 0, in which case the system calculates the value.

  • Automatic calculation: The system calculates this value based on the data volume of doc_table and the data_type. If the total size is less than 50 GB, 2 GB of data is allocated to each column. If the total size is greater than 50 GB, 2.5 GB is allocated to each column.

  • Manual configuration: You can typically calculate and configure this value based on the preceding method. You can increase or decrease the value based on your cluster resources.

Both column_num and row_num must be set to positive values to take effect.

0
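
The sizing rule for column_num can be approximated as follows. This Python sketch rests on several assumptions: estimate_column_num is a hypothetical helper, 1 GB is taken as 1024^3 bytes, the result is rounded up, and a total of exactly 50 GB is treated like the larger case because this topic does not specify it.

```python
import math

GB = 1024 ** 3  # Assumption: binary gigabytes.


def estimate_column_num(total_bytes: int) -> int:
    """Estimate column_num from the total size of the doc_table vector data."""
    # Below 50 GB: 2 GB per column. Otherwise: 2.5 GB per column.
    per_column = 2 * GB if total_bytes < 50 * GB else 2.5 * GB
    return max(1, math.ceil(total_bytes / per_column))
```

Under these assumptions, 10 GB of data maps to 5 columns and 100 GB maps to 40 columns.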

row_num

The number of rows for retrieval queries. The default value is 0, in which case the system calculates the value.

  • Automatic calculation: The system calculates this value based on the data volume of doc_table and the data_type. If the total number of queries is less than 100 million, 2 million queries are allocated to each row. If the total number of queries is greater than 100 million, 10 million queries are allocated to each row.

  • Manual configuration: You can typically calculate and configure this value based on the preceding method. You can increase or decrease the value based on your cluster resources.

Both column_num and row_num must be set to positive values to take effect.

0
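
The sizing rule for row_num can be sketched in the same way. estimate_row_num is a hypothetical helper. Rounding up and the handling of exactly 100 million queries are assumptions, because this topic does not specify them.

```python
def estimate_row_num(query_count: int) -> int:
    """Estimate row_num from the number of queries in query_table."""
    # Below 100 million queries: 2 million per row. Otherwise: 10 million per row.
    per_row = 2_000_000 if query_count < 100_000_000 else 10_000_000
    # Integer ceiling division, with a minimum of one row.
    return max(1, -(-query_count // per_row))
```

Under these assumptions, 10 million queries map to 5 rows and 200 million queries map to 20 rows.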

category_threshold

In multi-category retrieval scenarios, this parameter specifies the threshold for large-category retrieval. When the number of documents in a category exceeds this threshold, the category is processed using large-category retrieval. Otherwise, it is processed using small-category retrieval. Small-category retrieval uses the linear retrieval method by default, and data from multiple small categories is merged for retrieval.

1000000
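
The split between large-category and small-category retrieval can be pictured with the following Python sketch. split_categories is a hypothetical helper, and the per-category document counts are assumed to be already available. How Proxima CE computes them is not covered in this topic.

```python
from typing import Dict, List, Tuple


def split_categories(
    doc_counts: Dict[str, int], category_threshold: int = 1_000_000
) -> Tuple[List[str], List[str]]:
    """Separate categories into large and small by document count."""
    # A category is large only when its count exceeds the threshold.
    large = sorted(c for c, n in doc_counts.items() if n > category_threshold)
    small = sorted(c for c, n in doc_counts.items() if n <= category_threshold)
    return large, small
```

A category with exactly category_threshold documents does not exceed the threshold, so it falls into the small group.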

category_col_num

When you query by category, this parameter specifies the number of columns for building indexes for small categories (fewer than 1 million documents). For more information, see the description of the column_num parameter.

0

category_row_num

When you query by category, this parameter specifies the number of rows for querying indexes for small categories (fewer than 1 million documents). For more information, see the description of the row_num parameter.

0

category_thread_num

When you query by category, this parameter sets the concurrency (thread pool size) for tasks that process large categories (more than 1 million documents).

10

query_multi_label

Specifies whether a single query can have multiple categories. If this parameter is set to true, the doc table must also contain a category column. For more information, see Multi-category Retrieval.

false

threshold_score

The score threshold for filtering retrieval results. For distance methods other than inner_product and mips_squared_euclidean, a smaller score value indicates greater similarity. Results with a score greater than this threshold are filtered out. For the inner_product and mips_squared_euclidean distance methods, a larger value indicates greater similarity. Results with a score less than this threshold are filtered out.

None
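
The direction of the filter depends on the distance method. The following Python sketch illustrates the rule as described. filter_by_score is a hypothetical helper, and results are assumed to be (pk, score) pairs.

```python
# Methods for which a larger score means greater similarity.
LARGER_IS_BETTER = {"inner_product", "mips_squared_euclidean"}


def filter_by_score(results, threshold_score=None, distance_method="squared_euclidean"):
    """Keep only the (pk, score) pairs that pass the threshold."""
    if threshold_score is None:
        return list(results)
    if distance_method in LARGER_IS_BETTER:
        # Scores less than the threshold are filtered out.
        return [(pk, s) for pk, s in results if s >= threshold_score]
    # Scores greater than the threshold are filtered out.
    return [(pk, s) for pk, s in results if s <= threshold_score]
```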

tunnel_endpoint

The tunnel endpoint for MaxCompute. The default value is empty. Set this parameter to prevent download session creation failures when data tables are accessed across networks. For more information, see MaxCompute Tunnel Endpoint issues.

None

memory_load

Specifies the index loading method for the seek phase. The default value is true, which indicates that the index is loaded entirely into memory. If cluster memory resources are limited, you can set this to false as needed.

true

sharding_mode

The index sharding method. The hash and cluster modes are supported. The hash mode partitions the index by taking the hash value modulo a number. The cluster mode partitions the index using k-means clustering. The cluster mode can reduce the computational workload in the subsequent retrieval (seek) phase.

hash
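
The difference between the two modes can be sketched as follows. This Python example is conceptual only: assign_hash and assign_cluster are hypothetical helpers, the hash mode is simplified to an INT64 pk taken modulo the shard count, and the centroids are assumed to be already computed. The actual hash function and clustering job in Proxima CE are not described in this topic.

```python
def assign_hash(pk: int, shard_count: int) -> int:
    """Hash mode (simplified): shard index is the key modulo the shard count."""
    return pk % shard_count


def assign_cluster(vector, centroids) -> int:
    """Cluster mode: shard index is the index of the nearest k-means centroid."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda i: squared_distance(vector, centroids[i]))
```

Because cluster mode groups nearby vectors into the same shard, a query only needs to search the shards whose centroids are closest to it, which is why it can reduce the workload in the seek phase.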

kmeans_resource_name

This parameter is used for the cluster index sharding mode. The cluster mode first starts a MaxCompute graph computing task to perform k-means clustering on the raw data. This parameter specifies the name of the k-means centroids.

kmeans_resource_name

kmeans_sample_ratio

This parameter is used for the cluster index sharding mode. It specifies the sample rate for k-means centroids. The value ranges from 0 to 1.

0.05

kmeans_seek_ratio

This parameter is used for the cluster index sharding mode. It specifies the selection rate for the nearest centroids during retrieval. The value ranges from 0 to 1.

0.1

kmeans_iter_num

This parameter is used for the cluster index sharding mode. It specifies the number of iterations for the k-means task.

30

kmeans_cluster_num

This parameter is used for the cluster index sharding mode. It specifies the number of k-means centroids.

1000

kmeans_init_center_method

This parameter is used for the cluster index sharding mode. It specifies the initialization method for k-means centroids.

""

kmeans_worker_num

This parameter is used for the cluster index sharding mode. It specifies the number of k-means clustering worker instances.

0

mapper_split_size

Exposes the mapper.split.size option, which specifies the amount of data, in MB, that each internal mapper instance processes. If not specified, the default size for MapReduce (MR) on the MaxCompute platform, which is 256 MB, is used.

256

odps_task_priority

The priority of the Proxima CE task. This is set by configuring the priority for all internal MaxCompute tasks in Proxima CE, such as SQL, MapReduce (MR), and Graph tasks. The value can be an integer from 0 to 9. A smaller value indicates a higher priority. The default value is -1, which follows the baseline priority of MaxCompute.

-1

oss_access_id

The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user. You can obtain the AccessKey ID on the AccessKey Management page.

None

oss_access_key

The AccessKey secret that corresponds to the AccessKey ID.

You can obtain the AccessKey secret on the AccessKey Management page.

None

oss_endpoint

The endpoint of the MaxCompute service.

You need to configure the endpoint based on the region and network connectivity type you selected when you created the MaxCompute project. For the endpoints of different regions and networks, see Endpoints.

None

oss_bucket

The name of the Object Storage Service (OSS) bucket. For information about how to view bucket names, see List buckets.

None