Component overview - E-MapReduce (EMR) - Alibaba Cloud Help Center

EMR provides open source and self-developed components that cover data development, compute engines, data services, resource management, data storage, and data integration. You can select and configure components based on your needs.

Note

If a component you want to use is unavailable when you create a cluster, or if an open source component is available only to existing users, you can install and manage it yourself.

EMR consists of open source components, self-developed components, integrated Alibaba Cloud products, and cluster management. Refer to the service architecture diagram for the full list of big data components and their use cases.

Data development

The data development layer provides visual tools and code management for data collection, cleansing, modeling, analysis, and task scheduling. It helps enterprises efficiently manage and use their data assets.

For data development in EMR, you can use the Alibaba Cloud products DataWorks and EMR Workflow. The details are as follows:

Product name	Description
DataWorks	DataWorks provides end-to-end data integration, development, governance, quality management, O&M, and security control. It is suitable for scenarios that require complex data integration and governance.
EMR Workflow	EMR Workflow focuses on workflow scheduling and management. It is 100% compatible with the open source Apache DolphinScheduler.

If you prefer open source data development components, you can use Hue or Superset:

Component type	Component name	Description	Related documentation
Open source	Hue	Hue is available only to existing users. Hue is an open source web interface for interacting with the Apache Hadoop ecosystem.	Hue
	Superset	Superset is available only to existing users. Superset is a data visualization tool that provides rich visualization and dashboard features.	Superset

Compute engines

EMR supports mainstream compute engines for batch processing, interactive analysis, stream processing, and machine learning. These engines transform the structure and logic of your data to meet the requirements of different big data scenarios.

Component type	Component name	Description	Related documentation
Open source	Spark	Spark is a fast, general-purpose big data processing engine that provides in-memory computing and supports batch processing, real-time processing, machine learning, and graph computing.	Spark Shell and basic RDD operations Connect Spark to OSS FAQ and troubleshooting
	Hive	Hive is a data warehouse tool based on Hadoop that provides HiveQL, a query language similar to SQL, for storing, querying, and analyzing large-scale data on Hadoop.	Hive connection methods User-defined functions (UDFs) FAQ and troubleshooting
	StarRocks	StarRocks is a next-generation, high-speed Massively Parallel Processing (MPP) database that supports Online Analytical Processing (OLAP) multidimensional analysis, high-concurrency queries, and real-time analytics.	StarRocks overview Create a StarRocks cluster FAQ
	Doris	Doris is a high-performance, real-time analytical database well-suited for report analysis, ad hoc queries, and federated query acceleration for data lakes.	Doris overview Create a Doris cluster Quick Start
	ClickHouse	ClickHouse is an open source, column-oriented database management system designed for efficient Online Analytical Processing (OLAP) and fast queries on massive datasets.	Quickly use ClickHouse Import and export data between OSS and ClickHouse FAQ
	Trino	Trino, formerly PrestoSQL, is an open source, distributed SQL query engine suitable for interactive analytical queries.	Trino Connect to Trino from the command line FAQ
	Flink	Flink is a stream processing engine that supports large-scale, real-time data stream processing.	Basic usage Use Flink to stream Kafka data to Alibaba Cloud OSS FAQ
	Presto	Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine suitable for interactive analytical queries.	Presto Access Presto from the command line Access Presto using JDBC
	Tez	Apache Tez is a distributed big data processing framework that provides an efficient and flexible directed acyclic graph (DAG) execution model. It primarily serves as a replacement for MapReduce to optimize query and batch processing performance.	Tez
	Phoenix	Phoenix is a SQL middle layer built on HBase that lets you use standard SQL syntax to query and manage data stored in HBase.	Phoenix
	Impala	Impala is available only to existing users. Impala provides high-performance, low-latency SQL queries for data stored in Apache Hadoop.	Impala overview Connect to Impala FAQ
	Kudu	Kudu is available only to existing users. Kudu is a distributed, scalable, column-oriented storage manager that provides low-latency random read/write operations and efficient data analytics.	Overview Integrate Impala with Kudu FAQ
	Druid	Druid is available only to existing users. Druid is a distributed, in-memory, real-time analytics system for fast, interactive queries on large-scale datasets.	Druid
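The Tez entry above refers to a directed acyclic graph (DAG) execution model. The following minimal sketch shows the general idea of ordering job stages by their dependencies. It uses Python's standard `graphlib` module rather than any Tez API, and the stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job stages. Each key can run only after every stage
# in its dependency set has finished.
stages = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}


def execution_order(dag):
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(stages)
print(order)
```

Both scan stages have no dependencies, so an engine is free to run them in parallel. The sketch only produces one valid sequential order.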

Data services

The data services layer provides data encryption, access control, data query, data access, and API capabilities to improve data security, operational performance, and analysis efficiency in big data environments.

Component type	Component name	Description	Related documentation
Open source	Ranger	Ranger is a centralized security management framework for permission management and auditing in the Hadoop ecosystem.	Ranger Configure Hive to enable Ranger for access control FAQ
	Kerberos	Kerberos is an identity authentication protocol that uses symmetric-key cryptography to verify identities for other services and supports single sign-on (SSO).	Kerberos Basic usage of Kerberos Cross-domain mutual trust
	OpenLDAP	OpenLDAP is an open source implementation of the LDAP protocol for managing and storing user and resource information. It provides user management and identity authentication features.	OpenLDAP
	Kyuubi	Kyuubi is a distributed, multi-tenant SQL gateway that simplifies data analysis and query processing by providing SQL and other query services for data lake query engines.	Kyuubi overview Connect to Kyuubi Kyuubi compute engine management
	ZooKeeper	ZooKeeper is a distributed coordination service for managing key tasks in distributed applications, such as configuration, synchronization, and naming. It provides a consistent, high-performance, and reliable cluster management solution.	Overview Basic usage FAQ
	Knox	Knox is a REST API gateway that simplifies secure access to Hadoop and related components while providing unified identity authentication and access control.	Knox
	Livy	Livy is a service that interacts with Spark through a REST interface or an RPC client library.	Livy
	Kafka Manager	Kafka Manager is available only to existing users. Kafka Manager is a cluster management tool for Kafka that provides a web interface to manage and monitor Kafka clusters.	Kafka Manager
Self-developed	DLF-Auth	DLF-Auth is provided by Data Lake Formation (DLF) and allows fine-grained access control over databases, tables, columns, and functions managed by DLF, enabling unified data permission management on the data lake.	DLF-Auth
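As an illustration of the REST interface mentioned in the Livy entry above, the following sketch builds a batch submission request for the open source Livy `POST /batches` endpoint without sending it. The host name, port 8998 (the open source default, which may differ on your cluster), JAR path, and class name are all assumptions for illustration.

```python
import json
from urllib.request import Request


def build_livy_batch_request(livy_url, app_file, class_name, args):
    """Build (but do not send) a POST /batches request for Livy."""
    payload = {"file": app_file, "className": class_name, "args": args}
    return Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, JAR path, and class name.
request = build_livy_batch_request(
    "http://emr-master-host:8998",
    "hdfs:///jobs/example-job.jar",
    "com.example.ExampleJob",
    ["2024-01-01"],
)
print(request.get_method(), request.full_url)
```

To submit the job, you would pass `request` to `urllib.request.urlopen` from a machine that can reach the Livy server. Clusters with Kerberos or Knox enabled typically require additional authentication.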

Resource management

The resource management layer provides efficient resource scheduling and management, enabling automated task scheduling, intelligent resource allocation, and elastic scaling of clusters to improve the efficiency and reliability of big data processing.

Component type	Component name	Description	Related documentation
Open source	YARN	YARN is the resource management system for Hadoop that schedules and manages cluster resources. It supports running different types of distributed computing tasks on shared cluster resources.	YARN resource configuration YARN scheduler FAQ

Data storage

The data storage layer supports distributed storage for structured and unstructured data. You can choose a storage method that suits the requirements of your compute engine.

Component type	Component name	Description	Related documentation
Self-developed	OSS-HDFS	OSS-HDFS is an object storage solution compatible with the Hadoop Distributed File System (HDFS) interface that lets big data computing tasks access data in Alibaba Cloud OSS directly using the standard HDFS protocol.	OSS/OSS-HDFS overview OSS/OSS-HDFS Quick Start AccessDenied error when accessing OSS
	JindoCache	JindoCache is a distributed cache solution that accelerates large-scale data access by caching data blocks in memory to improve read performance and reduce pressure on the underlying storage system.	JindoCache overview Use JindoCache to accelerate transparent caching for OSS-HDFS Use JindoCache to accelerate transparent caching for OSS
	ESS	ESS is available only to existing users. New users should use the Celeborn component. ESS is a shuffle extension component that optimizes shuffle read and write performance.	ESS
	JindoData	JindoData is available only to existing users. New users should use the JindoCache component. JindoData is a self-developed data lake storage acceleration suite for the big data and AI ecosystems. It provides a comprehensive access acceleration solution for major data lake storage systems from Alibaba Cloud and the industry.	JindoData
	SmartData	SmartData is available only to existing users. New users should use the OSS-HDFS component. SmartData is a self-developed EMR component that provides unified storage optimization, cache optimization, computation acceleration, and storage feature extensions for EMR compute engines. It covers data access, data governance, and data security.	SmartData (available only to existing users)
Open source	Paimon	Paimon is a unified stream and batch lake storage format that supports high-throughput writes and low-latency queries.	Paimon overview Integrate Paimon with Spark Integrate Paimon with Flink
	Hudi	Hudi is a data lake storage format that supports updating and deleting data on a Hadoop file system and consuming data changes.	Hudi overview Integrate Hudi with Spark SQL FAQ
	Iceberg	Iceberg is an open data lake table format that provides high-performance read/write operations and metadata management.	Iceberg Basic usage Batch read and write Iceberg with Spark
	DeltaLake	DeltaLake is an open source data storage layer. It provides atomicity, consistency, isolation, and durability (ACID) transactions, scalable metadata processing, and unified stream and batch processing.	DeltaLake Basic usage FAQ
	HDFS	Hadoop Distributed File System (HDFS) is a distributed file system for storing large-scale datasets. It features high fault tolerance and high throughput and stores data redundantly across multiple cluster nodes.	HDFS overview Common HDFS commands JVM memory tuning
	HBase	HBase is a distributed, column-oriented open source database built on the Hadoop file system. It provides low-latency random read/write access and highly reliable storage for large-scale datasets.	Use HBase snapshots Use the HBase Shell FAQ and troubleshooting
	Celeborn	Celeborn is a service that processes intermediate data to improve the stability, flexibility, and performance of big data engines.	Celeborn
	HBASE-HDFS	HBASE-HDFS is an HDFS service. In compute-storage separation scenarios, HBase stores write-ahead log (WAL) data in the local HBASE-HDFS.	HBASE-HDFS
	Alluxio	Alluxio is available only to existing users. Alluxio is an open source data orchestration technology for cloud-based data analytics and AI that provides a unified data access layer supporting multiple underlying storage systems.	Alluxio
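The table above names a successor for several storage components that are limited to existing users. The following sketch captures that guidance as a lookup you might use when planning a new cluster. The function and variable names are hypothetical. Only the component pairings come from the table.

```python
# Successors named in the table for components limited to existing users.
REPLACEMENTS = {
    "ESS": "Celeborn",
    "JindoData": "JindoCache",
    "SmartData": "OSS-HDFS",
}

# Alluxio is also limited to existing users, but the table names no successor.
EXISTING_USERS_ONLY = set(REPLACEMENTS) | {"Alluxio"}


def suggest_component(name):
    """Return the component a new cluster should use in place of `name`."""
    if name in REPLACEMENTS:
        return REPLACEMENTS[name]
    if name in EXISTING_USERS_ONLY:
        return None  # no documented successor
    return name  # not restricted in the table, use as-is


for requested in ("ESS", "Alluxio", "Paimon"):
    print(requested, "->", suggest_component(requested))
```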

Data integration

The data integration layer provides batch data transfer, real-time message stream processing, and distributed log collection to improve data transfer efficiency and collection reliability.

Component type	Component name	Description	Related documentation
Open source	Flume	Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large streams of log data to a centralized data store.	Common parameter tuning Sync HDFS audit logs to HDFS FAQ
	Sqoop	Sqoop is a tool for efficiently transferring data between Hadoop and relational databases. It supports large-scale data import and export operations.	Use Sqoop Sqoop FAQ
	Kafka	Kafka is available only to existing users. Kafka is an open source distributed event streaming platform with high throughput, low latency, and persistence, widely used for real-time data stream processing and data pipeline applications.	Use SASL to authenticate Kafka services Use SSL to encrypt Kafka connections Kafka FAQ

References

For the overall architecture of EMR, see Service architecture.
For the components and their versions supported by each EMR release, see Components supported by each version.
For the scenarios supported by EMR and the components to use for each scenario, see Big data use cases.