Data upload scenarios and tools

This topic describes how to upload data to and download data from MaxCompute. It covers service connections, software development kits (SDKs), tools, and common operations such as data import and export.

Background information

MaxCompute provides multiple channels for uploading and downloading data. This lets you select the right technical solution for your scenario.

  • Batch data channel: Supports batch data uploads and downloads.

  • Streaming data channel (Streaming Tunnel): Supports writing data to MaxCompute as a stream.

  • Real-time data channel: DataHub, a streaming data processing platform, lets you publish, subscribe to, and distribute streaming data, and archive it to MaxCompute.

Function introduction

  • Upload data using the batch data channel

    You can use the batch data channel to upload data to MaxCompute in batches. Supported data sources include external files, external databases, Object Storage Service, and log files. The following solutions are available for batch data uploads.

    • Tunnel SDK: You can upload data to MaxCompute using the Tunnel SDK.

    • Data synchronization service: You can use Data Integration tasks in DataWorks to extract, transform, and load (ETL) data into MaxCompute.

    • Data shipping: You can ship data to MaxCompute using DataHub, Simple Log Service (SLS), the MaxCompute Sink Connector for Kafka, or Blink.

    • Open source tools and plugins: You can upload data to MaxCompute using Sqoop, Kettle, Flume, the Fluentd plugin, OGG, or MMA.

    • Product tools: The MaxCompute client provides built-in Tunnel commands that are based on the SDK for the batch data channel. You can use these commands to upload data. For more information about how to use Tunnel commands, see Tunnel commands.

    Note

    For offline data synchronization, you can use Data Integration. For more information, see Data Integration.

  • Write data using the streaming data channel

    The MaxCompute streaming data channel service lets you write data to MaxCompute as a stream. It uses its own set of APIs and backend services, separate from those of the batch data channel service. The following solutions are available for writing streaming data to MaxCompute.

    • SDK interfaces: These interfaces provide APIs with streaming semantics. You can use the APIs of the streaming service to easily develop distributed data synchronization services.

    • Data synchronization service: You can use real-time sync tasks (StreamX) in Data Integration to write streaming data.

    • Data shipping: You can use data shipping services that integrate the streaming write APIs. This method supports SLS and Message Queue for Kafka.

    • Data collection: Log data collected by open source Logstash can be written to MaxCompute as a stream.

    • Real-time writes with Flink: You can use the Flink platform to write streaming data in real time.

Reliability of solutions

MaxCompute provides a Service-Level Agreement (SLA). However, the batch and streaming data channels use free shared resources by default. Therefore, you must consider the reliability of your specific solution. The Tunnel service allocates available service resources (slots) based on the order in which access requests are received.

  • If no service resources are available, new access requests are rejected until resources are released.

  • A 5-minute period with fewer than 100 valid requests is not counted as service unavailability. For more information about valid requests, see Valid status codes for Data Transmission Service.

  • Request latency and throttled requests are not covered by the SLA. For more information about service limits, see Limits on Data Transmission Service.
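Because requests are rejected while no slots are free, a client may need to wait and try again. The following Python sketch shows one common way to do that, retrying with exponential backoff and jitter. It is not part of any MaxCompute SDK: `SlotsExhausted`, `call_with_backoff`, and all parameter values are hypothetical names and defaults chosen for illustration, and the error actually raised by your client library is likely to differ.

```python
import random
import time


class SlotsExhausted(Exception):
    """Hypothetical error: stands in for a request rejected because no slots are free."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request() and retry with exponential backoff while it is rejected.

    request      -- zero-argument callable that raises SlotsExhausted on rejection
    max_attempts -- total number of tries before giving up (illustrative default)
    base_delay   -- delay in seconds before the first retry (illustrative default)
    max_delay    -- upper bound in seconds for the exponential part of the delay
    sleep        -- injectable sleep function, so the logic can be tested without waiting
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except SlotsExhausted:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller see the rejection.
                raise
            # Double the delay on each retry, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add jitter so that many clients do not retry at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

A retry loop of this kind only smooths over short resource shortages. It does not reduce the number of slots that a workload needs.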

Note

To meet the resource requirements of specific solutions with high resource usage, you can purchase dedicated resources. For more information, see Purchase and use a dedicated resource group for Data Transmission Service.

Notes

Network conditions greatly affect Tunnel upload and download speeds. The speed typically ranges from 1 MB/s to 10 MB/s. If you upload a large amount of data, you can set the Tunnel Endpoint to the internal network endpoint or the VPC endpoint. To use these endpoints, you must connect to the internal network or VPC through an Alibaba Cloud ECS instance or a leased line. If the upload speed is still too slow, you can upload data with multiple threads.

For more information about Tunnel Endpoints, see Endpoint.
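The multi-threaded upload mentioned above can be sketched as follows. This minimal Python example splits rows into numbered blocks, writes the blocks concurrently, and commits once at the end. It assumes an upload interface that accepts independently numbered blocks followed by a single commit. `write_block` and `commit` are hypothetical callables that stand in for the actual SDK calls, and `block_size` and `workers` are illustrative defaults, not recommended values.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(rows, block_size):
    """Split a list of rows into consecutive blocks of at most block_size rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def upload_in_parallel(rows, write_block, commit, block_size=10000, workers=4):
    """Write blocks concurrently, then commit all block IDs in one call.

    write_block -- hypothetical callable (block_id, block_rows) that uploads one block
    commit      -- hypothetical callable (block_ids) that finalizes the upload
    """
    blocks = split_into_blocks(rows, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block gets its own ID so that threads never write to the same block.
        futures = [
            pool.submit(write_block, block_id, block)
            for block_id, block in enumerate(blocks)
        ]
        for future in futures:
            # result() re-raises any exception from the worker thread,
            # so a failed block prevents the commit below.
            future.result()
    block_ids = list(range(len(blocks)))
    commit(block_ids)
    return block_ids
```

Committing only after every block has been written means that a failed block stops the upload before commit is called. Whether more threads actually help depends on the network path and on the resources available to the Tunnel service.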

References

For more information about Data Transmission Service, see Data Transmission Service overview.