StarRocks Connector

更新时间:
复制 MD 格式

The StarRocks Connector uses the External Catalog mechanism to seamlessly access MaxCompute data sources—without importing data or creating external tables—and run complex SQL queries. This improves data analytics efficiency and reduces operational and maintenance (O&M) complexity and cost. This topic describes how to access MaxCompute using the StarRocks Connector.

Background information

StarRocks is a next-generation, high-speed Massively Parallel Processing (MPP) database that supports multiple enterprise-level analytical workloads, such as online analytical processing (OLAP)-based multidimensional analysis, custom report creation, real-time data analysis, and ad hoc queries. For details about StarRocks, see StarRocks Introduction. StarRocks supports the Catalog (data catalog) feature, which lets you manage both internal and external data within a single system and easily access and query data stored in various external sources. For more information, see Catalog Overview.

Prerequisites

  • You have deployed a StarRocks cluster of version 3.2.3 or later and ensured that the cluster can properly access the MaxCompute service. For more information about how to deploy a StarRocks cluster, see Deploy a StarRocks shared-nothing cluster using Docker Compose .

  • Purchase a dedicated Data Transmission Service resource group (subscription) or activate open storage (pay-as-you-go) resources.

Limits

  • Supports reading standard tables, partitioned tables, clustered tables, Delta Tables, and materialized views. Does not support reading external tables or logical views.

  • Does not support reading JSON data types.

  • Only supports MaxCompute projects with Schema disabled. For details about Schema, see Schema operations.

Create a MaxCompute Catalog

Log on to StarRocks and create a MaxCompute Catalog in StarRocks. For more information about External Catalogs, see Create External Catalog.

Syntax

CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
    "type" = "odps",
    CatalogParams,
    ScanParams,
    CachingMetaParams
)

Parameter descriptions

  • Common parameters

    Parameter

    Required

    Description

    catalog_name

    Yes

    Name of the MaxCompute Catalog. Follow these naming rules:

    • Must consist of letters (a–z or A–Z), digits (0–9), or underscores (_), and must start with a letter.

    • Maximum length is 1023 characters.

    • Catalog names are case-sensitive.

    type

    Yes

    Type of the data source. Set this parameter to odps.

    comment

    No

    Description of the MaxCompute Catalog.

    CatalogParams

    Yes

    Parameters for StarRocks to access MaxCompute.

    ScanParams

    No

    Parameters for StarRocks to access MaxCompute file storage.

    CachingMetaParams

    No

    Parameters that define the metadata caching policy.

  • CatalogParams parameters

    Parameter

    Required

    Description

    odps.endpoint

    Yes

    Connection address of the MaxCompute service. Configure the endpoint based on the region and network connectivity method selected when you created the MaxCompute project. For endpoint values by region and network type, see Endpoint.

    Important

    Currently, only Alibaba Cloud VPC networks are supported.

    odps.project

    Yes

    Name of the MaxCompute project. Log on to the MaxCompute console and go to the Workspace > Projects page to obtain the project name.

    Note

    If you created a workspace in standard mode, distinguish between the production environment and development environment (_dev) project names when configuring this parameter.

    odps.access.id

    Yes

    AccessKey ID of your Alibaba Cloud account or Resource Access Management (RAM) user. Go to the AccessKey management page to obtain the AccessKey ID.

    odps.access.key

    Yes

    AccessKey secret corresponding to the AccessKey ID.

    odps.tunnel.quota

    Yes

  • ScanParams parameters

    Parameter

    Required

    Description

    odps.split.policy

    No

    Sharding strategy used when scanning data. Valid values:

    • size (default): shards by data size. Default value: 256 MB.

    • row_offset: shards by row count.

    Note

    The sharding strategy significantly affects data scan concurrency in the compute engine. If the default strategy does not meet your needs, choose the row_offset strategy and adjust it using the odps.split.row.count parameter.

    odps.split.row.count

    No

    Maximum number of rows per shard. Default value: 4194304. Value range: greater than 0.

    Note

    Configure this parameter only when odps.split.policy is set to row_offset.

  • CachingMetaParams parameters

    Parameter

    Required

    Description

    odps.cache.table.enable

    No

    Whether to cache table metadata. Valid values:

    • true (default): enables table metadata caching. This improves metadata retrieval and update speed and is suitable for scenarios involving frequent DDL operations, large multi-user environments, or complex query optimization.

    • false: disables table metadata caching. Because cache resources are limited and excessive reliance on caching may consume too much memory, disable caching for infrequent queries or tables whose metadata changes frequently.

    odps.cache.table.expire

    No

    Time interval, in seconds (s), after which cached table metadata is automatically evicted. Default value: 86400 (24 hours). Value range: greater than or equal to 0.

    odps.cache.table.size

    No

    Number of table metadata entries to cache. Default value: 1000.

    odps.cache.partition.enable

    No

    Whether to cache metadata for all partitions under the target table. Valid values:

    • true (default): enables caching of metadata for all partitions under the target table.

    • false: disables caching. Metadata for all partitions is not cached.

    Note

    If the table is not partitioned, the odps.cache.partition.enable parameter has no effect.

    odps.cache.partition.expire

    No

    Time interval, in seconds (s), after which cached partition metadata is automatically evicted. Default value: 86400 (24 hours). Value range: greater than or equal to 0.

    odps.cache.partition.size

    No

    Number of partitioned tables whose metadata can be cached. Default value: 1000.

    odps.cache.table-name.enable

    No

    Whether to cache all table names in the MaxCompute project. Valid values:

    • true: enables caching of all table names in the MaxCompute project.

    • false (default): disables caching of all table names in the MaxCompute project.

    odps.cache.table-name.expire

    No

    Time interval, in seconds (s), after which cached table name information in the MaxCompute project is automatically evicted. Default value: 86400 (24 hours). Value range: greater than or equal to 0.

Example

The following example creates a MaxCompute Catalog named odps_catalog, using the MaxCompute project mf_mc_bj as an example.

-- Create catalog
CREATE EXTERNAL CATALOG odps_catalog PROPERTIES(
    "type"="odps",
    "odps.access.id"="<yourAccessKeyId>",
    "odps.access.key"="<yourAccessKeySecret>",
    "odps.endpoint"="http://service.cn-beijing.maxcompute.aliyun.com/api",
    "odps.tunnel.quota"="pay-as-you-go",
    "odps.project"="mf_mc_bj"
);

Access MaxCompute

  1. Switch to the target Catalog and MaxCompute project. This example uses odps_catalog and mf_mc_bj.

    • Syntax

      • Method 1

        -- Specify Catalog
        SET CATALOG <catalog_name>;
        -- Specify MaxCompute project
        USE <project_name>;
      • Method 2

        USE <catalog_name>.<project_name>;
    • Example

      SET CATALOG odps_catalog;
      USE mf_mc_bj;
  2. Query data. This example uses the src table.

SELECT * FROM src LIMIT 10;

The result is as follows.

+------+-------+
| key  | value |
+------+-------+
| 1    | 1     |
| 3    | 3     |
| 2    | 2     |
| 4    | 100   |
| 5    | 200   |
| 6    | 300   |
| 3    | 400   |
+------+-------+

Other operations

View MaxCompute Catalog

  • List all Catalogs in the current StarRocks cluster.

    SHOW CATALOGS;

    The result is as follows.

    +-----------------+----------+------------------------------------------------------------------+
    | Catalog         | Type     | Comment                                                          |
    +-----------------+----------+------------------------------------------------------------------+
    | default_catalog | Internal | An internal catalog contains this cluster‘s self-managed tables. |
    | odps_catalog    | Odps     | NULL                                                             |
    +-----------------+----------+------------------------------------------------------------------+
  • Show the CREATE statement for a specific Catalog.

    SHOW CREATE CATALOG odps_catalog;

    The result is as follows.

    +--------------+--------------------------------------------------------------------------+
    | Catalog      | Create Catalog                                                           |
    +--------------+--------------------------------------------------------------------------+
    | odps_catalog | CREATE EXTERNAL CATALOG `odps_catalog`
                     PROPERTIES ("odps.endpoint"  =  "http://service.cn-beijing.maxcompute.aliyun.com/api",
                     "odps.access.id"  =  "<yourAccessKeyId>",
                     "odps.access.key"  =  "<yourAccessKeySecret>",
                     "odps.project"  =  "odps_project",
                     "type"  =  "odps"
                    )|
    +-----------------------------------------------------------------------------------------+

View MaxCompute table schema

  • Describe a MaxCompute table schema.

    • Syntax

      DESC[RIBE] <catalog_name>.<database_name>.<table_name>;
    • Parameter descriptions

      • catalog_name: required. Name of the Catalog.

      • database_name: required. Name of the MaxCompute project under the target Catalog.

      • table_name: required. Name of the table under the target MaxCompute project.

    • Example

      DESC odps_catalog.mf_mc_bj.src;

      The result is as follows.

      +-------+------+------+-------+---------+-------+
      | Field | Type | Null | Key   | Default | Extra |
      +-------+------+------+-------+---------+-------+
      | id    | INT  | Yes  | false | NULL    |       |
      | a     | INT  | Yes  | false | NULL    |       |
      +-------+------+------+-------+---------+-------+
  • Show the MaxCompute table schema and its CREATE TABLE statement.

    • Syntax

      SHOW CREATE TABLE <catalog_name>.<database_name>.<table_name>;
    • Parameter descriptions

      • catalog_name: required. Name of the Catalog.

      • database_name: required. Name of the MaxCompute project under the target Catalog.

      • table_name: required. Name of the table under the target MaxCompute project.

    • Example

      SHOW CREATE TABLE odps_catalog.mf_mc_bj.src;

      The result is as follows.

      +-------+--------------------------------------------------------------------------------+
      | Table | Create Table                                                                   |
      +-------+--------------------------------------------------------------------------------+
      | src   | CREATE TABLE `src` (
        `id` int(11) DEFAULT NULL,
        `a` int(11) DEFAULT NULL
      ) |
      +-------+--------------------------------------------------------------------------------+

Delete a MaxCompute Catalog

Run the following command to delete a MaxCompute Catalog.

DROP CATALOG odps_catalog;

Cost-Based Optimizer (CBO) statistics collection

The current version of StarRocks does not support automatic collection of MaxCompute table statistics. Run the following command to manually create a statistics collection task.

-- Replace <table_name> with the MaxCompute table name
ANALYZE TABLE <table_name>;
Note

StarRocks CBO can perform cost estimation based on various statistics and select the execution plan with the lowest cost from tens of thousands of execution plans, improving the efficiency and performance of complex queries. For more information, see CBO statistics.

Manually refresh metadata cache

By default, StarRocks caches MaxCompute metadata to improve query performance. After changing a table schema or updating table data, manually refresh the table's metadata to ensure StarRocks retrieves the latest metadata immediately.

-- Replace <table_name> with the MaxCompute table name
REFRESH EXTERNAL TABLE <table_name>;

FAQ

Error when reading data: Your slot quota is exceeded.

  • Problem Analysis

    This error occurs when the data read volume exceeds the Data Transmission Service quota. During reads, StarRocks splits the table into multiple shards and distributes them to BE nodes, which use thread pools to read data concurrently. The number of slots occupied during the read is typically: Max(number of shards, number of BEs × BE thread pool concurrency).

  • Solution

    • Purchase a dedicated Data Transmission Service resource group with higher concurrency. For instructions, see Purchase and use dedicated Data Transmission Service resource groups.

    • Adjust the number of shards by configuring ScanParams parameters. If you use the row_offset sharding strategy, also increase the odps.split.row.count parameter.

    • Modify the StarRocks BE thread pool parameters scanner_thread_pool_queue_size and scanner_thread_pool_thread_num. For details, see Introduction to StarRocks.

Appendix: Type mapping

MaxCompute field type

StarRocks field type

BOOLEAN

BOOLEAN

TINYINT

TINYINT

SMALLINT

SMALLINT

INT

INT

BIGINT

BIGINT

FLOAT

FLOAT

DOUBLE

DOUBLE

DECIMAL(p, s)

DECIMAL(p, s)

STRING

VARCHAR(1073741824)

VARCHAR(n)

VARCHAR(n)

CHAR(n)

CHAR(n)

JSON

VARCHAR(1073741824)

BINARY

VARBINARY

DATE

DATE

DATETIME

DATETIME

TIMESTAMP

DATETIME

Important

StarRocks does not have a TIMESTAMP type. When reading TIMESTAMP data, it is treated as DATETIME, which results in loss of precision.

ARRAY

ARRAY

MAP

MAP

STRUCT

STRUCT