Overview of the streaming data tunnel

MaxCompute Streaming Tunnel lets you write data to MaxCompute in streaming mode through a dedicated set of APIs and backend services. These APIs reduce the cost of developing distributed synchronization services and remove the problems that MaxCompute Tunnel runs into in high-concurrency, high-QPS (queries per second) scenarios: partition locking conflicts, small-file fragmentation, and complex synchronization code.

MaxCompute Streaming Tunnel has been in public preview since January 1, 2021, and is free of charge during the preview period. Follow Service notices to stay informed about future billing changes.

When to use Streaming Tunnel

MaxCompute Streaming Tunnel complements MaxCompute Tunnel rather than replacing it. Use this table to decide which channel fits your workload:

| Dimension | MaxCompute Streaming Tunnel | MaxCompute Tunnel |
| --- | --- | --- |
| Data form | Streaming rows | Batched files |
| Concurrency | High concurrency supported; no partition locking contention | Concurrent writes can cause partition locking conflicts |
| Write throughput | Optimized for high QPS; prevents small-file fragmentation | Small batch size at high QPS generates many small files |
| Incremental data | Asynchronously merged in the background without service interruption | No built-in async merge; data is written as-is |
| Partitioning | Automatic partitioning across concurrent jobs | Manual partition management required |
| Best for | Real-time log ingestion, stream processing results, message queue sync | Large-batch ETL, periodic bulk loads |
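
To make the small-file row of the comparison concrete, the following back-of-the-envelope sketch estimates how many files a high-QPS batch writer could produce. It assumes that every committed batch yields at least one file, which this page does not state; the function names and the workload numbers are illustrative only.

```python
def files_per_hour(commits_per_sec: float, files_per_commit: int = 1) -> int:
    """Files created in one hour, assuming each commit yields `files_per_commit` files."""
    return int(commits_per_sec * 3600 * files_per_commit)


def avg_file_bytes(rows_per_commit: int, bytes_per_row: int,
                   files_per_commit: int = 1) -> float:
    """Average file size when one commit's rows are spread over its files."""
    return rows_per_commit * bytes_per_row / files_per_commit


if __name__ == "__main__":
    # Illustrative workload: 200 commits/s, 100 rows per commit, 512 bytes per row.
    print(files_per_hour(200))                 # 720000
    print(avg_file_bytes(100, 512) / 1024)     # 50.0 (KiB per file)
```

Under these assumptions, one hour of writes leaves hundreds of thousands of files of about 50 KiB each. Background merging is intended to avoid this kind of accumulation.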

Key capabilities

  • Streaming semantic APIs: Simplify the development of distributed data synchronization services, which reduces development costs.

  • Automatic partitioning: Eliminates concurrent partition locking when multiple synchronization jobs write to the same table simultaneously.

  • Asynchronous data merging: Merges incremental data in the background without interrupting active write operations, improving storage efficiency and preventing small-file accumulation.
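
A rough way to picture the streaming write pattern these capabilities support: a client appends rows as they arrive and flushes them in small packs, with no batch files or partition locks to manage itself. The sketch below models only the client-side buffering. `BufferedRowWriter` and `flush_fn` are hypothetical names introduced for illustration and are not part of any MaxCompute SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class BufferedRowWriter:
    """Buffers rows in memory and hands them to `flush_fn` in packs.

    `flush_fn` is a hypothetical stand-in for whatever call sends one pack
    of rows to the streaming endpoint; it is not a MaxCompute SDK function.
    """

    flush_fn: Callable[[List[Row]], None]
    max_rows: int = 1000
    _buffer: List[Row] = field(default_factory=list, init=False)

    def append(self, row: Row) -> None:
        """Add one row and flush automatically once the pack is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        """Send buffered rows; the buffer is cleared only if the send succeeds."""
        if not self._buffer:
            return
        self.flush_fn(list(self._buffer))
        self._buffer.clear()


if __name__ == "__main__":
    sent: List[List[Row]] = []
    writer = BufferedRowWriter(flush_fn=sent.append, max_rows=3)
    for i in range(7):
        writer.append({"id": i, "msg": f"event-{i}"})
    writer.flush()  # send the final partial pack
    print([len(pack) for pack in sent])  # [3, 3, 1]
```

Clearing the buffer only after `flush_fn` returns means that rows are kept for a retry if the send raises. Whether a retry can produce duplicate rows depends on the real endpoint and is outside the scope of this sketch.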

Use cases

| Scenario | Description |
| --- | --- |
| Real-time event log ingestion | Write log data directly into MaxCompute for downstream batch processing. No intermediate storage service is needed, which reduces pipeline costs. |
| Stream processing result storage | Persist Flink or other stream computing results into MaxCompute without concurrency or batch size limits, avoiding small-file accumulation from high-frequency writes. Streaming Tunnel keeps streaming services available in scenarios that involve high-concurrency locking. |
| Message queue synchronization | Sync data from DataHub or ApsaraMQ for Kafka into MaxCompute at high concurrency and large batch volumes, replacing workarounds previously needed with the Simple Message Queue connector. |

Integrate with upstream services

By default, Realtime Compute for Apache Flink, DataWorks, and ApsaraMQ for Kafka write to MaxCompute via MaxCompute Tunnel. To switch to Streaming Tunnel:

| Service | How to enable Streaming Tunnel |
| --- | --- |
| Realtime Compute for Apache Flink | Use the built-in Streaming Tunnel plug-in provided by Realtime Compute for Apache Flink. |
| DataWorks | Contact the DataWorks engineer on duty to enable Streaming Tunnel in the background. |
| ApsaraMQ for Kafka | Contact the Kafka engineer on duty to enable Streaming Tunnel in the background. |
| Logstash log collector | Use Logstash (stream). |

Limitations

  • Table or partition locking during writes: MaxCompute Tunnel Service locks the target table or partition for the duration of a streaming write. All DML operations that modify data, such as INSERT INTO and INSERT OVERWRITE, are blocked until the write completes and the lock is released.

  • Schema modification not supported: If the schema of the target table is modified while Streaming Tunnel is active, streaming data cannot be written to the table.

  • Temporary storage overhead for hot data: When asynchronous data merging or ZORDER BY is enabled, Streaming Tunnel retains two copies of data written within the previous hour: the original ingested data and the asynchronously merged copy. This redundant storage is automatically cleaned up after the default retention period of 1 hour. Plan storage capacity accordingly if your workload has a high ingestion rate during the merge window.
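
The hot-data overhead can be sized with simple arithmetic. The sketch below is an upper-bound estimate, not an official sizing formula: it assumes a steady ingestion rate and that the merged copy is as large as the original (in practice it may be smaller). The function name and the 50 MiB/s rate are illustrative.

```python
def peak_hot_data_bytes(ingest_bytes_per_sec: float,
                        retention_seconds: int = 3600,
                        copies: int = 2) -> float:
    """Upper-bound storage held for data still inside the retention window.

    Assumes a constant ingestion rate and that every retained copy is the
    same size as the ingested data (an assumption made for this sketch).
    """
    return ingest_bytes_per_sec * retention_seconds * copies


if __name__ == "__main__":
    rate = 50 * 1024 * 1024  # 50 MiB/s, illustrative
    total = peak_hot_data_bytes(rate)
    redundant = total - peak_hot_data_bytes(rate, copies=1)
    print(f"hot data held: {total / 1024**3:.1f} GiB")        # hot data held: 351.6 GiB
    print(f"of which redundant: {redundant / 1024**3:.1f} GiB")  # of which redundant: 175.8 GiB
```

At this rate, about 176 GiB of the storage in use at any moment is the temporary second copy, which is released once the 1-hour retention period passes.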