Component overview

更新时间:
复制 MD 格式

EMR provides open source and self-developed components that cover data development, compute engines, data services, resource management, data storage, and data integration. You can select and configure components based on your needs.

Note

If a component you want to use is unavailable when you create a cluster, or if an open source component is available only to existing users, you can install and manage it yourself.

EMR consists of open source components, self-developed components, integrated Alibaba Cloud products, and cluster management. Refer to the service architecture diagram for the full list of big data components and their use cases.

image

Data development

The data development layer provides visual tools and code management for data collection, cleansing, modeling, analysis, and task scheduling. It helps enterprises efficiently manage and use their data assets.

For data development in EMR, you can use the Alibaba Cloud products DataWorks and EMR Workflow. The details are as follows:

Product Name

Description

General Documents

DataWorks

DataWorks provides end-to-end data integration, development, governance, quality management, O&M, and security control. It is suitable for scenarios that require complex data integration and governance.

EMR Workflow

EMR Workflow focuses on workflow scheduling and management. It is 100% compatible with the open source Apache DolphinScheduler.

To use open source data development components, you can choose Hue and Superset:

Component type

Component Name

Description

Common Documentation

Open source

Hue

Hue is available only to existing users.

Hue is an open source web interface for interacting with the Apache Hadoop ecosystem.

Hue

Superset

Superset is available only to existing users.

Superset is a data visualization tool that provides rich visualization and dashboard features.

Superset

Compute engines

EMR supports mainstream compute engines for batch processing, interactive analysis, stream processing, and machine learning. These engines transform data structure and logic to meet the requirements of different big data scenarios.

Component type

Component name

Description

Related documentation

Open source

Spark

Spark is a fast, general-purpose big data processing engine that provides in-memory computing and supports batch processing, real-time processing, machine learning, and graph computing.

Hive

Hive is a data warehouse tool based on Hadoop that provides HiveQL, a query language similar to SQL, for storing, querying, and analyzing large-scale data on Hadoop.

StarRocks

StarRocks is a next-generation, high-speed Massively Parallel Processing (MPP) database that supports Online Analytical Processing (OLAP) multidimensional analysis, high-concurrency queries, and real-time analytics.

Doris

Doris is a high-performance, real-time analytical database well-suited for report analysis, ad hoc queries, and federated query acceleration for data lakes.

ClickHouse

ClickHouse is an open source, column-oriented database management system designed for efficient Online Analytical Processing (OLAP) and fast queries on massive datasets.

Trino

Trino, formerly PrestoSQL, is an open source, distributed SQL query engine suitable for interactive analytical queries.

Flink

Flink is a stream processing engine that supports large-scale, real-time data stream processing.

Presto

Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine suitable for interactive analytical queries.

Tez

Apache Tez is a distributed big data processing framework that provides an efficient and flexible directed acyclic graph (DAG) execution model. It primarily serves as a replacement for MapReduce to optimize query and batch processing performance.

Tez

Phoenix

Phoenix is a SQL middle layer built on HBase that lets you use standard SQL syntax to query and manage data stored in HBase.

Phoenix

Impala

Impala is available only to existing users.

Impala provides high-performance, low-latency SQL queries for data stored in Apache Hadoop.

Kudu

Kudu is available only to existing users.

Kudu is a distributed, scalable, column-oriented storage manager that provides low-latency random read/write operations and efficient data analytics.

Druid

Druid is available only to existing users.

Druid is a distributed, in-memory, real-time analytics system for fast, interactive queries on large-scale datasets.

Druid

Data services

The data services layer provides data encryption, access control, data query, data access, and API capabilities to improve data security, operational performance, and analysis efficiency in big data environments.

Component type

Component Name

Description

Related documentation

Open source

Ranger

Ranger is a centralized security management framework for permission management and auditing in the Hadoop ecosystem.

Kerberos

Kerberos is an identity authentication protocol that uses symmetric key technology to provide identity verification for other services and supports SSO.

OpenLDAP

OpenLDAP is an open source implementation of the LDAP protocol for managing and storing user and resource information. It provides user management and identity authentication features.

OpenLDAP

Kyuubi

Kyuubi is a distributed, multi-tenant SQL gateway that simplifies data analysis and query processing by providing SQL and other query services for data lake query engines.

Zookeeper

ZooKeeper is a distributed coordination service for managing key tasks in distributed applications, such as configuration, synchronization, and naming. It provides a consistent, high-performance, and reliable cluster management solution.

Knox

Knox is a REST API gateway that simplifies secure access to Hadoop and related components while providing unified identity authentication and access control.

Knox

Livy

Livy is a service that interacts with Spark through a REST interface or an RPC client library.

Livy

Kafka Manager

Kafka Manager is available only to existing users.

Kafka Manager is a cluster management tool for Kafka that provides a web interface to manage and monitor Kafka clusters.

Kafka Manager

Self-developed

DLF-Auth

DLF-Auth is provided by Data Lake Formation (DLF) and allows fine-grained access control over databases, tables, columns, and functions managed by DLF, enabling unified data permission management on the data lake.

DLF-Auth

Resource management

The resource management layer provides efficient resource scheduling and management, enabling automated task scheduling, intelligent resource allocation, and elastic scaling of clusters to improve the efficiency and reliability of big data processing.

Component type

Component Name

Description

Related documentation

Open source

YARN

YARN is the resource management system for Hadoop that schedules and manages cluster resources. It supports running different types of distributed computing tasks on shared cluster resources.

Data storage

The data storage layer supports distributed storage for structured and unstructured data. You can choose a storage method that suits the requirements of your compute engine.

Component type

Component Name

Description

Common Documents

Self-developed

OSS-HDFS

OSS-HDFS is an object storage solution compatible with the Hadoop Distributed File System (HDFS) interface that lets big data computing tasks access data in Alibaba Cloud OSS directly using the standard HDFS protocol.

JindoCache

JindoCache is a distributed cache solution that accelerates large-scale data access by caching data blocks in memory to improve read performance and reduce pressure on the underlying storage system.

ESS

ESS is available only to existing users. New users should use the Celeborn component.

ESS is an extension component based on Shuffle that optimizes Shuffle read and write issues.

ESS

JindoData

JindoData is available only to existing users. New users should use the JindoCache component.

JindoData is a self-developed data lake storage acceleration suite for the big data and AI ecosystems. It provides a comprehensive access acceleration solution for major data lake storage systems from Alibaba Cloud and the industry.

JindoData

SmartData

SmartData is available only to existing users. New users should use the OSS-HDFS component.

SmartData is a self-developed EMR component that provides unified storage optimization, cache optimization, computation acceleration, and storage feature extensions for EMR compute engines. It covers data access, data governance, and data security.

SmartData (available only to existing users)

Open source

Paimon

Paimon is a unified stream and batch lake storage format that supports high-throughput writes and low-latency queries.

Hudi

Hudi is a data lake storage format that supports updating and deleting data on a Hadoop file system and consuming data changes.

Iceberg

Iceberg is an open data lake table format that provides high-performance read/write operations and metadata management.

DeltaLake

DeltaLake is an open source data storage layer. It provides atomicity, consistency, isolation, and durability (ACID) transactions, scalable metadata processing, and unified stream and batch processing.

HDFS

Hadoop Distributed File System (HDFS) is a distributed file system for storing large-scale datasets. It features high fault tolerance and high throughput and stores data redundantly across multiple cluster nodes.

HBase

HBase is a distributed, column-oriented open source database built on the Hadoop file system. It provides low-latency random read/write access and highly reliable storage for large-scale datasets.

Celeborn

Celeborn is a service that processes intermediate data to improve the stability, flexibility, and performance of big data engines.

Celeborn

HBASE-HDFS

HBASE-HDFS is HDFS. In compute-storage separation scenarios, it uses local HBASE-HDFS to store Write-Ahead Logging (WAL) data.

HBASE-HDFS

Alluxio

Alluxio is available only to existing users.

Alluxio is an open source data orchestration technology for cloud-based data analytics and AI that provides a unified data access layer supporting multiple underlying storage systems.

Alluxio

Data integration

The data integration layer provides batch data transfer, real-time message stream processing, and distributed log collection to improve data transfer efficiency and collection reliability.

Component type

Component name

Description

General Documents

Open source

Flume

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large streams of log data to a centralized data store.

Sqoop

Sqoop is a tool for efficiently transferring data between Hadoop and relational databases. It supports large-scale data import and export operations.

Kafka

Kafka is available only to existing users.

Kafka is an open source distributed event streaming platform with high throughput, low latency, and persistence, widely used for real-time data stream processing and data pipeline applications.

References