DataHub is a real-time data streaming service for ingesting, caching, and routing high-throughput data to downstream analytics and storage systems.
Benefits
Stability: Built on Alibaba's internal real-time data transfer infrastructure, DataHub has proven stable and reliable at scale, including under the peak load of the annual Double 11 shopping festival.
High throughput: A single topic can accept terabytes of writes per day, and each shard can handle hundreds of gigabytes per day.
Pay-as-you-go pricing: DataHub is billed on demand, so you pay only for what you use.
Ecosystem integration: Built on the Apsara distributed system, DataHub integrates with the Alibaba Cloud big data ecosystem, including MaxCompute, Realtime Compute for Apache Flink, and Hologres, to form a unified data architecture.
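The throughput figures above imply a simple capacity calculation: divide the expected daily write volume by what one shard can absorb, then round up. The sketch below is only an illustration. The function name `estimate_shards`, the per-shard figure of 200 GB per day, and the 20% headroom are assumptions chosen for the example, not DataHub quotas or part of any DataHub SDK.

```python
import math


def estimate_shards(daily_volume_gb, per_shard_gb_per_day, headroom=0.2):
    """Estimate how many shards a topic needs for a given daily write volume.

    All inputs are planning assumptions supplied by the caller; none of them
    are published DataHub limits.
    """
    if per_shard_gb_per_day <= 0:
        raise ValueError("per_shard_gb_per_day must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    # Reserve a fraction of each shard's capacity for traffic spikes.
    usable_gb = per_shard_gb_per_day * (1 - headroom)
    return max(1, math.ceil(daily_volume_gb / usable_gb))


# Hypothetical workload: 2 TB (2048 GB) per day, 200 GB per shard per day.
shards = estimate_shards(daily_volume_gb=2048, per_shard_gb_per_day=200)
print(shards)
```

Real sizing should be based on the limits published for your region and on measured peak traffic rather than on daily averages.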
Features
Data ingestion: Ingest data into DataHub using SDKs, APIs, or third-party connectors such as Flume and Logstash.
Data shipping: The DataConnector module syncs ingested data in real time to downstream storage and analytics systems — including MaxCompute, Object Storage Service (OSS), and Tablestore — with minimal configuration.
Data caching: Configurable retention periods let multiple independent consumers replay and re-read the same stream. For example, one application can compute real-time aggregates while another archives the raw data, both reading from the same topic simultaneously. Automatic multi-copy replication ensures data reliability.
Multiple interfaces: Access DataHub through the web console for quick operations, or use APIs and SDKs for programmatic access.
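To make the data caching behavior concrete, the following toy model shows how a retained stream lets independent consumers read, and re-read, the same records. It is a conceptual sketch only: `RetainedTopic`, `read`, and `seek` are hypothetical names invented for this example and are not the DataHub SDK. Retention expiry and replication are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class RetainedTopic:
    """In-memory stand-in for a topic whose records remain readable after consumption."""

    records: list = field(default_factory=list)
    cursors: dict = field(default_factory=dict)  # consumer name -> next offset to read

    def write(self, record):
        self.records.append(record)

    def read(self, consumer, max_records=100):
        # Each consumer advances its own cursor; reading never removes records.
        start = self.cursors.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.cursors[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        # Move a consumer's cursor back to replay earlier records.
        self.cursors[consumer] = offset


topic = RetainedTopic()
for amount in (10, 25, 7):
    topic.write({"amount": amount})

# Consumer 1 computes a real-time aggregate.
total = sum(r["amount"] for r in topic.read("aggregator"))

# Consumer 2 archives the raw records from the same topic, unaffected by consumer 1.
archive = list(topic.read("archiver"))

# Consumer 1 rewinds and replays the stream from the beginning.
topic.seek("aggregator", 0)
replayed = topic.read("aggregator")
```

Because each consumer keeps its own cursor, the aggregator and the archiver never compete for records, which is the property the data caching feature describes.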