MaxCompute + API data source

This topic describes how to add a table that uses MaxCompute as its data source.

Prerequisites

  • To learn about MaxCompute, see What is MaxCompute.

  • Make sure the required table permissions are in place before you configure MaxCompute data tables. The account that you use to log in to OpenSearch must have the describe, select, download, and field label permissions on the table. The authorization statements are as follows:

-- Add account
add user ****@aliyun.com;

-- Grant the corresponding table permissions to the account
GRANT describe, select, download ON TABLE table_xxx TO USER ****@aliyun.com;

-- When field-level permission verification is enabled in ODPS (MaxCompute), high-permission fields cannot be read when data is pulled, and the index build for the table fails. In this case, grant field-level access permissions to the account.
-- Set the label level of the account for the entire project
SET LABEL 3 TO USER bs*******@aliyun.com;
-- Grant label permissions on specific fields of a single table
GRANT LABEL 3 ON TABLE table_xxx(col1, col2) TO USER ****@aliyun.com;

  • The retrieval engine version supports the following MaxCompute table field types: STRING, BOOLEAN, DOUBLE, BIGINT, and DATETIME.
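
The supported-type list can be checked before a table is configured. The following is a minimal Python sketch, not part of the OpenSearch tooling: the function name, the schema mapping, and the example column names are all hypothetical, and only the five type names come from this document.

```python
# Field types this document lists as supported for MaxCompute tables.
SUPPORTED_TYPES = {"STRING", "BOOLEAN", "DOUBLE", "BIGINT", "DATETIME"}


def unsupported_fields(schema):
    """Return (name, type) pairs whose type is not in SUPPORTED_TYPES.

    `schema` is a hypothetical mapping of column name -> MaxCompute type name.
    """
    return [
        (name, type_name)
        for name, type_name in schema.items()
        if type_name.strip().upper() not in SUPPORTED_TYPES
    ]


# Illustrative schema; the column names are made up for this example.
example_schema = {"id": "bigint", "title": "string", "price": "decimal(10,2)"}
print(unsupported_fields(example_schema))
```

Any column reported by this check would need to be converted or dropped in MaxCompute before the table is used as a data source.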

Configure MaxCompute data source

1. Navigate to the OpenSearch console, switch to OpenSearch Retrieval Engine Version in the upper left corner, locate the corresponding instance in the instance management page list, and click Manage in the Actions column.

Click the 'Go to configure' button to initiate instance configuration.

2. In the Table Management interface, begin configuring the basic information of the table, including the table name, the number of shards, and the number of data update resources.

Configuration Description:

  • Table Name: Customizable, but does not support the format project.tablename.

  • Data Shards: All index tables must use the same number of shards. As an exception, index tables with a single shard are allowed as long as the remaining tables share the same shard count. The number of shards must be a positive integer no greater than 256 (a value of at most three times the number of data nodes in the instance is recommended).

  • Number of Resources for Data Update: The number of resources allocated for data updates. By default, each index includes two free update resources (4 cores, 8 GB each). Resources beyond the free quota are billed.
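
The table name and shard rules above can be expressed as a small pre-flight check. This Python sketch reflects one reading of those rules (single-shard tables are exempt from the consistency requirement, and the 256 limit applies per table); the function name and the input mapping are hypothetical and do not correspond to any OpenSearch API.

```python
MAX_SHARDS = 256  # upper bound stated in the configuration description


def check_tables(shards_by_table):
    """Return a list of problems for a hypothetical {table_name: shard_count} mapping."""
    problems = []
    for name, shards in shards_by_table.items():
        if "." in name:
            problems.append(f"{name}: table name must not use the project.tablename form")
        if not isinstance(shards, int) or not 1 <= shards <= MAX_SHARDS:
            problems.append(f"{name}: shard count must be an integer from 1 to {MAX_SHARDS}")
    # Tables with a single shard are exempt; all other tables must agree.
    multi = {s for s in shards_by_table.values() if isinstance(s, int) and s > 1}
    if len(multi) > 1:
        problems.append(f"shard counts differ across tables: {sorted(multi)}")
    return problems


print(check_tables({"main": 4, "aux": 1, "extra": 4}))
print(check_tables({"a": 4, "b": 8}))
```

An empty list means the configuration passes this sketch's checks; the console remains the authority on what is actually accepted.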

3. Data Synchronization: Configure the data source and, after verification, click Next.

Configuration Parameter Description:

  • Full Data Source: Select MaxCompute + API

  • Project Name: The name of the target MaxCompute project to access

  • AccessKey: The AccessKey ID of the Alibaba Cloud account or RAM user

  • AccessKey Secret: The AccessKey Secret corresponding to the AccessKey ID

  • Table name: The name of the target MaxCompute table to access

  • Partition: Required. The partition of the MaxCompute table to read, specified as key=value. Example: ds=20170626

  • Automatic Reindexing: Indicates whether to enable automatic reindexing tasks. If enabled, the index table referencing the current data source will be automatically rebuilt when changes to the data source are detected

Note

If automatic reindexing is enabled, a done table must be created. For the creation method, see Automatic reindexing below.
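
To keep these parameters together outside the console, for example in a deployment script, they can be modelled as a small record. The Python sketch below is illustrative only: the class, its field names, and the helper are hypothetical and are not an OpenSearch API. The done-table name follows the mytable_done convention described under Automatic reindexing.

```python
from dataclasses import dataclass


def parse_partition(spec):
    """Split a partition spec such as 'ds=20170626' into (key, value)."""
    key, sep, value = spec.partition("=")
    key, value = key.strip(), value.strip().strip("'\"")
    if not sep or not key or not value:
        raise ValueError(f"expected key=value, got {spec!r}")
    return key, value


@dataclass
class MaxComputeSource:
    """Hypothetical container mirroring the console fields listed above."""
    project: str
    access_key_id: str
    access_key_secret: str
    table: str
    partition: str
    auto_reindex: bool = False

    def done_table(self):
        """Name of the done table that is required when automatic reindexing is on."""
        return f"{self.table}_done" if self.auto_reindex else None


# Placeholder values; substitute real credentials from a secure store.
source = MaxComputeSource(
    "my_project", "<AccessKey ID>", "<AccessKey Secret>",
    "mytable", "ds=20170626", auto_reindex=True,
)
print(parse_partition(source.partition), source.done_table())
```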

4. Field Configuration: After completion, click Next.

5. Index Configuration: After completion, click Next.

6. Confirm Creation: After clicking Confirm Creation, the system will automatically create the configured table.

7. You can monitor the progress of table creation in the change history.

Note:

  • Full Data Source: The user's data source, named in the form instance name_user-defined name.

  • Project Name, AccessKey, AccessKey Secret, Table name, Partition: The parameters required to access the user's MaxCompute data source.

Automatic reindexing

If automatic reindexing is enabled when the data source is configured, the retrieval engine version instance automatically rebuilds the index based on changes in the user's done table.

Example: Suppose the user's MaxCompute data table is mytable, partitioned by ds (for example, ds=20220113), and a new partition containing the full table data is produced every day. For the retrieval engine version instance to pick up each new partition and rebuild the index from it, automatic reindexing must be used together with a done table. The steps are as follows:

  1. Enable automatic reindexing when adding a data source.

  2. Create the done table in MaxCompute. Assuming the source table name is mytable and the partition key name is ds, the done table name would be mytable_done, with the partition key name also being ds. The two tables appear in MaxCompute as:

odps:sql:xxx> show tables;
InstanceId: xxx  
SQL: .                                                                                              

ALIYUN$xxx@aliyun.com:mytable          # Full data source table
ALIYUN$xxx@aliyun.com:mytable_done     # Done table that triggers automatic full reindexing

Statement to create the done table:

create table mytable_done (attribute string) partitioned by (ds string);

  3. After the mytable partition ds=20220114 has been fully produced, write a signal row to the done table to trigger reindexing on the retrieval engine version instance:

-- Add partition
alter table mytable_done add if not exists partition (ds="20220114");

-- Insert the signal row that triggers the automatic full rebuild
insert into table mytable_done partition (ds="20220114") select '{"swift_start_timestamp":1642003200}';

The final content of the done table is as follows:

odps:sql:xxx> select * from mytable_done where ds=20220114 limit 1;
InstanceId: xxx
SQL: .
+--------------------------------------+----------+
| attribute                            | ds       |
+--------------------------------------+----------+
| {"swift_start_timestamp":1642003200} | 20220114 |
+--------------------------------------+----------+

Once the signal row has been written to the done table, the retrieval engine version instance detects it when it scans the done table and triggers the reindexing process.
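
When the signal has to be published every day, the two statements can be generated rather than typed by hand. The following Python sketch only builds the SQL text shown in this section; the function name is hypothetical, the inputs are assumed to be trusted (no escaping is performed), and this document does not explain how the swift_start_timestamp value should be chosen, so it is left to the caller.

```python
import json


def done_signal_sql(source_table, partition_key, partition_value, start_timestamp):
    """Return the two statements that publish a done signal for one partition."""
    done_table = f"{source_table}_done"
    # Compact JSON, matching the attribute value shown in this section.
    payload = json.dumps({"swift_start_timestamp": start_timestamp}, separators=(",", ":"))
    spec = f'{partition_key}="{partition_value}"'
    return [
        f"alter table {done_table} add if not exists partition ({spec});",
        f"insert into table {done_table} partition ({spec}) select '{payload}';",
    ]


for statement in done_signal_sql("mytable", "ds", "20220114", 1642003200):
    print(statement)
```

The generated statements can then be run in the MaxCompute client after the source partition has been fully written.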

Note:

  • The done table must have at least one partition key, and the partition key name must match the partition key name of the source table.

  • Every partition added to the done table must already exist in the source table. For example, if the source table has the partitions ds="20220114", ds="20220115", and ds="20220116", a new partition added to the done table must be one of these.
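
The two rules above can be checked before a signal row is written. This is a minimal Python sketch with a hypothetical function name; it assumes the partition key names and partition values of both tables have already been obtained by other means, for example from the MaxCompute client.

```python
def check_done_partitions(source_key, source_values, done_key, done_values):
    """Return a list of rule violations; an empty list means the done table is consistent."""
    problems = []
    if done_key != source_key:
        problems.append(
            f"partition key mismatch: done table uses {done_key!r}, source uses {source_key!r}"
        )
    # Every done-table partition must already exist in the source table.
    missing = sorted(set(done_values) - set(source_values))
    if missing:
        problems.append(f"partitions missing from source table: {missing}")
    return problems


source_values = ["20220114", "20220115", "20220116"]
print(check_done_partitions("ds", source_values, "ds", ["20220115"]))
print(check_done_partitions("ds", source_values, "ds", ["20220117"]))
```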

Precautions

  • The MaxCompute table name must not include the project name; otherwise, the build fails because the table name cannot be recognized.

  • The table name cannot be altered once the table has been edited.

  • The table designated for the MaxCompute data source must be a partitioned table.

  • The table that the user creates in MaxCompute serves as the full data input, while the API data source is used to push real-time data.