EMR provides open source and self-developed components that cover data development, compute engines, data services, resource management, data storage, and data integration. You can select and configure components based on your needs.
If a component you want to use is unavailable when you create a cluster, or if an open source component is available only to existing users, you can install and manage it yourself.
EMR consists of open source components, self-developed components, integrated Alibaba Cloud products, and cluster management. Refer to the service architecture diagram for the full list of big data components and their use cases.
Data development
The data development layer provides visual tools and code management for data collection, cleansing, modeling, analysis, and task scheduling. It helps enterprises efficiently manage and use their data assets.
For data development in EMR, you can use the Alibaba Cloud products DataWorks and EMR Workflow. The details are as follows:
|
Product Name |
Description |
General Documents |
|
DataWorks |
DataWorks provides end-to-end data integration, development, governance, quality management, O&M, and security control. It is suitable for scenarios that require complex data integration and governance. |
|
|
EMR Workflow |
EMR Workflow focuses on workflow scheduling and management. It is 100% compatible with the open source Apache DolphinScheduler. |
To use open source data development components, you can choose Hue and Superset:
|
Component type |
Component Name |
Description |
Common Documentation |
|
Open source |
Hue |
Hue is available only to existing users. Hue is an open source web interface for interacting with the Apache Hadoop ecosystem. |
|
|
Superset |
Superset is available only to existing users. Superset is a data visualization tool that provides rich visualization and dashboard features. |
Compute engines
EMR supports mainstream compute engines for batch processing, interactive analysis, stream processing, and machine learning. These engines transform data structure and logic to meet the requirements of different big data scenarios.
|
Component type |
Component name |
Description |
Related documentation |
|
Open source |
Spark |
Spark is a fast, general-purpose big data processing engine that provides in-memory computing and supports batch processing, real-time processing, machine learning, and graph computing. |
|
|
Hive |
Hive is a data warehouse tool based on Hadoop that provides HiveQL, a query language similar to SQL, for storing, querying, and analyzing large-scale data on Hadoop. |
||
|
StarRocks |
StarRocks is a next-generation, high-speed Massively Parallel Processing (MPP) database that supports Online Analytical Processing (OLAP) multidimensional analysis, high-concurrency queries, and real-time analytics. |
||
|
Doris |
Doris is a high-performance, real-time analytical database well-suited for report analysis, ad hoc queries, and federated query acceleration for data lakes. |
||
|
ClickHouse |
ClickHouse is an open source, column-oriented database management system designed for efficient Online Analytical Processing (OLAP) and fast queries on massive datasets. |
||
|
Trino |
Trino, formerly PrestoSQL, is an open source, distributed SQL query engine suitable for interactive analytical queries. |
||
|
Flink |
Flink is a stream processing engine that supports large-scale, real-time data stream processing. |
||
|
Presto |
Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine suitable for interactive analytical queries. |
||
|
Tez |
Apache Tez is a distributed big data processing framework that provides an efficient and flexible directed acyclic graph (DAG) execution model. It primarily serves as a replacement for MapReduce to optimize query and batch processing performance. |
||
|
Phoenix |
Phoenix is a SQL middle layer built on HBase that lets you use standard SQL syntax to query and manage data stored in HBase. |
||
|
Impala |
Impala is available only to existing users. Impala provides high-performance, low-latency SQL queries for data stored in Apache Hadoop. |
||
|
Kudu |
Kudu is available only to existing users. Kudu is a distributed, scalable, column-oriented storage manager that provides low-latency random read/write operations and efficient data analytics. |
||
|
Druid |
Druid is available only to existing users. Druid is a distributed, in-memory, real-time analytics system for fast, interactive queries on large-scale datasets. |
Data services
The data services layer provides data encryption, access control, data query, data access, and API capabilities to improve data security, operational performance, and analysis efficiency in big data environments.
|
Component type |
Component Name |
Description |
Related documentation |
|
Open source |
Ranger |
Ranger is a centralized security management framework for permission management and auditing in the Hadoop ecosystem. |
|
|
Kerberos |
Kerberos is an identity authentication protocol that uses symmetric key technology to provide identity verification for other services and supports SSO. |
||
|
OpenLDAP |
OpenLDAP is an open source implementation of the LDAP protocol for managing and storing user and resource information. It provides user management and identity authentication features. |
||
|
Kyuubi |
Kyuubi is a distributed, multi-tenant SQL gateway that simplifies data analysis and query processing by providing SQL and other query services for data lake query engines. |
||
|
Zookeeper |
ZooKeeper is a distributed coordination service for managing key tasks in distributed applications, such as configuration, synchronization, and naming. It provides a consistent, high-performance, and reliable cluster management solution. |
||
|
Knox |
Knox is a REST API gateway that simplifies secure access to Hadoop and related components while providing unified identity authentication and access control. |
||
|
Livy |
Livy is a service that interacts with Spark through a REST interface or an RPC client library. |
||
|
Kafka Manager |
Kafka Manager is available only to existing users. Kafka Manager is a cluster management tool for Kafka that provides a web interface to manage and monitor Kafka clusters. |
||
|
Self-developed |
DLF-Auth |
DLF-Auth is provided by Data Lake Formation (DLF) and allows fine-grained access control over databases, tables, columns, and functions managed by DLF, enabling unified data permission management on the data lake. |
Resource management
The resource management layer provides efficient resource scheduling and management, enabling automated task scheduling, intelligent resource allocation, and elastic scaling of clusters to improve the efficiency and reliability of big data processing.
|
Component type |
Component Name |
Description |
Related documentation |
|
Open source |
YARN |
YARN is the resource management system for Hadoop that schedules and manages cluster resources. It supports running different types of distributed computing tasks on shared cluster resources. |
Data storage
The data storage layer supports distributed storage for structured and unstructured data. You can choose a storage method that suits the requirements of your compute engine.
|
Component type |
Component Name |
Description |
Common Documents |
|
Self-developed |
OSS-HDFS |
OSS-HDFS is an object storage solution compatible with the Hadoop Distributed File System (HDFS) interface that lets big data computing tasks access data in Alibaba Cloud OSS directly using the standard HDFS protocol. |
|
|
JindoCache |
JindoCache is a distributed cache solution that accelerates large-scale data access by caching data blocks in memory to improve read performance and reduce pressure on the underlying storage system. |
||
|
ESS |
ESS is available only to existing users. New users should use the Celeborn component. ESS is an extension component based on Shuffle that optimizes Shuffle read and write issues. |
||
|
JindoData |
JindoData is available only to existing users. New users should use the JindoCache component. JindoData is a self-developed data lake storage acceleration suite for the big data and AI ecosystems. It provides a comprehensive access acceleration solution for major data lake storage systems from Alibaba Cloud and the industry. |
||
|
SmartData |
SmartData is available only to existing users. New users should use the OSS-HDFS component. SmartData is a self-developed EMR component that provides unified storage optimization, cache optimization, computation acceleration, and storage feature extensions for EMR compute engines. It covers data access, data governance, and data security. |
||
|
Open source |
Paimon |
Paimon is a unified stream and batch lake storage format that supports high-throughput writes and low-latency queries. |
|
|
Hudi |
Hudi is a data lake storage format that supports updating and deleting data on a Hadoop file system and consuming data changes. |
||
|
Iceberg |
Iceberg is an open data lake table format that provides high-performance read/write operations and metadata management. |
||
|
DeltaLake |
DeltaLake is an open source data storage layer. It provides atomicity, consistency, isolation, and durability (ACID) transactions, scalable metadata processing, and unified stream and batch processing. |
||
|
HDFS |
Hadoop Distributed File System (HDFS) is a distributed file system for storing large-scale datasets. It features high fault tolerance and high throughput and stores data redundantly across multiple cluster nodes. |
||
|
HBase |
HBase is a distributed, column-oriented open source database built on the Hadoop file system. It provides low-latency random read/write access and highly reliable storage for large-scale datasets. |
||
|
Celeborn |
Celeborn is a service that processes intermediate data to improve the stability, flexibility, and performance of big data engines. |
||
|
HBASE-HDFS |
HBASE-HDFS is HDFS. In compute-storage separation scenarios, it uses local HBASE-HDFS to store Write-Ahead Logging (WAL) data. |
||
|
Alluxio |
Alluxio is available only to existing users. Alluxio is an open source data orchestration technology for cloud-based data analytics and AI that provides a unified data access layer supporting multiple underlying storage systems. |
Data integration
The data integration layer provides batch data transfer, real-time message stream processing, and distributed log collection to improve data transfer efficiency and collection reliability.
|
Component type |
Component name |
Description |
General Documents |
|
Open source |
Flume |
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large streams of log data to a centralized data store. |
|
|
Sqoop |
Sqoop is a tool for efficiently transferring data between Hadoop and relational databases. It supports large-scale data import and export operations. |
||
|
Kafka |
Kafka is available only to existing users. Kafka is an open source distributed event streaming platform with high throughput, low latency, and persistence, widely used for real-time data stream processing and data pipeline applications. |
References
-
For the overall architecture of EMR, see Service architecture.
-
For the components and their versions supported by each EMR release, see Components supported by each version.
-
For the scenarios supported by EMR and the components to use for each scenario, see Big data use cases.