MaxCompute data lakehouse
The MaxCompute data lakehouse solution integrates a MaxCompute data warehouse with a data lake. This solution combines the flexibility and rich ecosystem of a data lake with the enterprise-grade capabilities of a data warehouse to help you build an integrated data management platform. This topic describes how to use Dataphin to manage data assets in a data lakehouse built with MaxCompute and Data Lake Formation (DLF).
Background information
You can build a MaxCompute data lakehouse in one of two ways:
Build a data lakehouse with MaxCompute, Data Lake Formation (DLF), and Object Storage Service (OSS): All data lake metadata (schema) is stored in DLF. MaxCompute can use DLF to manage OSS metadata. This improves the ability of MaxCompute to process semi-structured data in OSS, such as data in Delta Lake, Hudi, AVRO, CSV, JSON, PARQUET, and ORC formats. For more information about DLF and OSS, see Data Lake Formation and Object Storage Service (OSS).
Build a data lakehouse with MaxCompute and Hadoop: This method supports on-premises data centers, setups that use cloud virtual machines, and Alibaba Cloud E-MapReduce. After network connectivity is established through the VPC in the region where MaxCompute and the Hadoop cluster are located, MaxCompute can directly access the Hive Metastore service and map its metadata to an external project in MaxCompute.
Prerequisites
Before you use Dataphin to manage a data lakehouse built with MaxCompute, DLF, and OSS, complete the following preparations:
Activate DLF. You can activate the service on the DLF activation page.
Activate OSS. For more information, see Activate OSS.
Activate MaxCompute and create a MaxCompute project. For more information, see MaxCompute project.
Create a MaxCompute external project that maps to the DLF metadatabase. The following command is an example:
create externalproject -source dlf
    -name external_project -- Required. The name of the external project to create.
    -ref maxcompute_project -- The name of the existing MaxCompute project.
    -comment "DLF"
    -region "cn-hangzhou" -- The region ID where DLF is located. For information about region IDs, see Obtain a region ID and a VPC ID.
    -db metadat_store -- The name of the DLF metadatabase.
    -endpoint "dlf-share.cn-hangzhou.aliyuncs.com" -- The endpoint of DLF.
    -ossEndpoint "oss-cn-hangzhou-internal.aliyuncs.com"; -- The endpoint of the region where OSS is located.
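If you create external projects for several regions or metadatabases, it can help to assemble the command from parameters instead of editing it by hand. The following Python sketch only builds the command text shown above; it does not call MaxCompute. The function name build_create_external_project is hypothetical, and treating -comment as optional is an assumption.

```python
def build_create_external_project(name, ref_project, region, db,
                                  endpoint, oss_endpoint, comment=None):
    """Assemble a `create externalproject` command for a DLF-backed project.

    Hypothetical helper: it only formats text and does not contact MaxCompute.
    """
    required = {
        "name": name,
        "ref_project": ref_project,
        "region": region,
        "db": db,
        "endpoint": endpoint,
        "oss_endpoint": oss_endpoint,
    }
    for label, value in required.items():
        if not value:
            raise ValueError(f"{label} must not be empty")

    parts = [
        "create externalproject -source dlf",
        f"-name {name}",        # external project to create
        f"-ref {ref_project}",  # existing MaxCompute project
    ]
    if comment is not None:     # assumption: the comment may be omitted
        parts.append(f'-comment "{comment}"')
    parts += [
        f'-region "{region}"',             # region ID where DLF is located
        f"-db {db}",                       # DLF metadatabase name
        f'-endpoint "{endpoint}"',         # DLF endpoint
        f'-ossEndpoint "{oss_endpoint}"',  # OSS endpoint for the region
    ]
    return " ".join(parts) + ";"


if __name__ == "__main__":
    print(build_create_external_project(
        name="external_project",
        ref_project="maxcompute_project",
        region="cn-hangzhou",
        db="metadat_store",
        endpoint="dlf-share.cn-hangzhou.aliyuncs.com",
        oss_endpoint="oss-cn-hangzhou-internal.aliyuncs.com",
        comment="DLF",
    ))
```

The printed text is the same command as the example, without the inline comments, and can be pasted into the MaxCompute client.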
Grant permissions to MaxCompute
If you build a data lakehouse with MaxCompute and Hadoop, grant permissions as follows.
You must grant MaxCompute the permission to create a network interface controller (NIC) in your VPC. This enables network connectivity between MaxCompute and your VPC. To grant the permission, log on to the Alibaba Cloud console using the VPC owner's account and click Authorize.
If you build a data lakehouse with MaxCompute, DLF, and OSS, grant permissions as follows.
The account that owns the MaxCompute project cannot access DLF without authorization, so you must grant it the required permissions in one of the following two ways:
One-click authorization: If the account used to create the MaxCompute project is the same as the account used to deploy DLF, click Authorize DLF.
Custom authorization: You can use this method regardless of whether the account used to create the MaxCompute project is the same as the account used to deploy DLF. For more information, see Custom authorization for DLF.
Manage a MaxCompute data lakehouse with Dataphin
DLF discovers and manages the metadata of data stored in OSS. MaxCompute can create an external project based on DLF to register that metadata. Dataphin can then use MaxCompute and DLF to process data in the data lakehouse, including offline development and standard modeling. Dataphin also handles metadata management, access permissions, security management, data quality checks, and compute resource administration.
Create a MaxCompute compute source and attach it to a Dataphin project
You can create a MaxCompute compute source to register the MaxCompute external project. Because a MaxCompute external project does not contain compute resources, you must also specify an internal MaxCompute project that provides them. This internal project is used to execute nodes, run quality rules, scan security rules, and install security policies. For more information about how to create a MaxCompute compute source, see Create a MaxCompute compute source.
After you create the compute source, you can create a Dataphin project and attach the compute source to it.
Perform standard modeling and data processing on data from the external project
After you create a MaxCompute compute source and attach it to a Dataphin project, you can perform standard modeling. You can create logical tables based on the source tables in the external project. MaxCompute SQL nodes can use the compute resources of the mapped internal project for execution. These nodes also support reading data from and writing data to the tables of the external project.
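As an illustration of a node that runs in the internal project and reads from the external project, the following Python sketch composes a MaxCompute SQL statement that uses project-qualified table names. The function names, the table names ods_orders and dwd_orders, and the column names are hypothetical. Any session settings that your environment requires for external project access are not shown.

```python
def qualified(project, table):
    """Return a project-qualified table name such as `my_project.my_table`."""
    return f"{project}.{table}"


def build_copy_statement(external_project, source_table,
                         internal_project, target_table, columns):
    """Compose SQL that copies columns from an external-project table
    into a table of the internal project that supplies compute resources.

    Hypothetical helper: it only formats text and does not run the SQL.
    """
    if not columns:
        raise ValueError("columns must not be empty")
    column_list = ", ".join(columns)
    return (
        f"INSERT OVERWRITE TABLE {qualified(internal_project, target_table)}\n"
        f"SELECT {column_list}\n"
        f"FROM {qualified(external_project, source_table)};"
    )


if __name__ == "__main__":
    print(build_copy_statement(
        external_project="external_project",
        source_table="ods_orders",   # hypothetical source table
        internal_project="maxcompute_project",
        target_table="dwd_orders",   # hypothetical target table
        columns=["order_id", "amount"],
    ))
```

The generated statement can be pasted into a MaxCompute SQL node in the Dataphin project that the compute source is attached to.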
View metadata and manage permissions for the data lakehouse
View metadata.
Search for and query assets, such as data tables and fields in the external project.
Preview data.
Generate SELECT statements and DDL statements.
Request permissions for tables and fields in the external project.
Perform data quality checks and security control for the data lakehouse
Configure data quality rules for physical tables in the external project.
Execute quality rule checks on MaxCompute SQL nodes.
Scan security rules and install security policies.