OpenLake Studio overview
OpenLake Studio is a data lakehouse development platform that unifies ETL, data catalog management, and cross-engine workflow orchestration across Alibaba Cloud compute services such as MaxCompute, Hologres, and Serverless Spark.
What is OpenLake Studio
OpenLake Studio incorporates 17 years of Alibaba's big data methodologies and integrates with Alibaba Cloud compute services including MaxCompute, Hologres, Serverless Spark, Serverless StarRocks, DLF, Flink, and PAI. It provides intelligent ETL for the OpenLake data lakehouse architecture with the following features:
- Data catalog: A metadata catalog for an integrated data lakehouse.
- Workflow: Orchestrates real-time, offline, and AI nodes across engines such as MaxCompute and Serverless Spark.
- Personal development environment: Develop and debug Python nodes, run interactive Notebook analysis, and integrate with Git, NAS, or OSS.
- Notebook: Interactive SQL and Python analysis across data engines. Run or debug code instantly and visualize results.
Enable OpenLake Studio
OpenLake Studio is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), and China (Shenzhen).
When you create a workspace, select OpenLake for the workspace template.
Access OpenLake Studio
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and open OpenLake Studio from its entry in the Actions column.
- This entry point is visible only for workspaces in which the Use Data Studio (New Version) feature is enabled.
- OpenLake Studio is supported only in Chrome 69 or later on a PC.
OpenLake Studio features
The following table summarizes the data development features. Related terminology is explained in Appendix: Data development concepts.
| Type | Description |
| --- | --- |
| Flow management | The Workflow development mode uses a visual, business-oriented DAG interface to manage complex task projects. Supported workflow types: periodic workflow orchestration, event-triggered workflows, and manual business flows. Note: Limits apply per workspace. No new workflows or objects can be created after the limit is reached. |
| Task development | Supported node types are documented in Node development. |
| Task scheduling | For scheduling details, see Node scheduling configuration. |
| Quality control | Quality control mechanisms for task deployment. |
| Other | |
Task development process
OpenLake Studio supports real-time synchronization, offline scheduled tasks (including offline synchronization and processing tasks), and manually triggered tasks across multiple engines. Data synchronization capabilities are provided by Data Integration.
DataWorks workspaces support standard mode and basic mode, each with a different development process:
Standard mode
- Basic process: In standard mode, a scheduled task goes through development, debugging, scheduling configuration, deployment, and O&M.
- Process control: Use built-in code review, Data Governance check items, and Open Platform extension validation to enforce development standards.
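The standard-mode process can be pictured as an ordered sequence of stages with a quality gate. The sketch below is a mental model only, not a DataWorks API: the `Stage` enum, the `advance` function, and the check names in `REQUIRED_CHECKS` are hypothetical, and placing the gate immediately before deployment is an assumption made for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of a scheduled task in standard mode, in order."""
    DEVELOPMENT = 1
    DEBUGGING = 2
    SCHEDULING_CONFIGURATION = 3
    DEPLOYMENT = 4
    OPERATIONS_AND_MAINTENANCE = 5


# Hypothetical check names mirroring the three process controls listed above.
REQUIRED_CHECKS = ("code_review", "governance_check_items", "extension_validation")


def advance(current: Stage, check_results: dict) -> Stage:
    """Return the next stage; entering DEPLOYMENT requires every check to pass."""
    if current is Stage.OPERATIONS_AND_MAINTENANCE:
        return current  # already at the final stage
    nxt = Stage(current + 1)
    if nxt is Stage.DEPLOYMENT:
        # Assumption: process controls act as a gate right before deployment.
        failed = [c for c in REQUIRED_CHECKS if not check_results.get(c, False)]
        if failed:
            raise PermissionError(f"deployment blocked by: {', '.join(failed)}")
    return nxt
```

Calling `advance(Stage.SCHEDULING_CONFIGURATION, {"code_review": True})` raises `PermissionError` in this model because the other two checks have not passed.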
Data development methods
OpenLake Studio supports two development approaches: build data processing flows with Workflow, or create individual task nodes and configure dependencies manually.
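Whichever approach is used, the result is a directed acyclic graph of nodes. As a rough illustration of what manually configured dependencies imply for execution order, the following sketch uses Python's standard `graphlib` module. The node names are hypothetical, and the code does not call any DataWorks API.

```python
from graphlib import TopologicalSorter

# Hypothetical node names. Each key maps to the set of nodes it depends on.
dependencies = {
    "sync_orders": set(),
    "clean_orders": {"sync_orders"},
    "daily_revenue": {"clean_orders"},
    "revenue_report": {"daily_revenue", "clean_orders"},
}


def execution_order(deps):
    """Return one valid run order in which every node follows its upstreams.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A cycle, such as two nodes that depend on each other, makes `execution_order` raise `graphlib.CycleError`, which is why dependencies must form a DAG.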
Supported node types
OpenLake Studio supports dozens of node types for MaxCompute, Hologres, Serverless Spark, Serverless StarRocks, and Flink, many with periodic scheduling support. For the full list, see Supported node types.
Appendix: Data development concepts
Task development concepts
| Concept | Description |
| --- | --- |
| Workflow | A visual DAG interface for managing complex task projects. Supports orchestration of dozens of node types across MaxCompute, Hologres, Serverless Spark, Serverless StarRocks, and Flink, with workflow-level scheduling for periodic and event-triggered workflows. |
| Node | The basic execution unit in DataWorks. Node types include Data Integration nodes for synchronization, engine compute nodes (such as MaxCompute SQL and EMR Hive) for data processing, and general-purpose nodes (virtual nodes, do-while loops) for logic control. |
Task scheduling concepts
| Concept | Description |
| --- | --- |
| Dependency | Defines the execution order of tasks. If node A must run before node B, A is an upstream dependency of B. In a DAG, dependencies appear as arrows between nodes. |
| Output name | A virtual entity that connects upstream and downstream tasks when configuring dependencies within a single tenant (Alibaba Cloud account). Configure dependencies using output names, not node names or IDs. A task's output name becomes the input name for its downstream nodes. |
| Output table name | Set the output table name to the current task's output table so downstream nodes can verify the data source. Do not modify auto-parsed output table names. The output table name is an identifier only and does not affect the actual table generated by the SQL script. Note: The Node output of a node must be globally unique, whereas the Output table name does not have this restriction. |
| Scheduling resource group | A resource group used for task scheduling. |
| Scheduling parameter | Scheduling parameters are code variables that dynamically obtain values such as date or time at runtime. Use them to inject runtime context into repeatedly executed code. |
| Business date | The date when business data is generated, as opposed to when it is processed. For example, revenue data for 20241010 is typically computed in the early hours of 20241011. The business date is 20241010. |
| Scheduled time | The expected run time of a periodic task, configurable to the minute. Important: A task may not run at the exact scheduled time. DataWorks starts a task only after upstream tasks succeed, the scheduled time arrives, and scheduling resources are available. |
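The scheduling concepts above fit together as shown in the following minimal sketch. It is an illustration under stated assumptions, not the DataWorks scheduler: the `Task` class and all function names are hypothetical, the business date is assumed to be the day before the scheduled time (the common daily case from the revenue example), and the `${bizdate}` placeholder is handled with Python's `string.Template` purely for demonstration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta
from string import Template


@dataclass
class Task:
    name: str
    output_name: str            # must be unique within the tenant
    scheduled_time: datetime    # expected run time, minute precision
    input_names: list[str] = field(default_factory=list)  # upstream output names


def resolve_upstreams(tasks):
    """Map each task to its upstream tasks by matching output names."""
    by_output = {}
    for t in tasks:
        if t.output_name in by_output:
            raise ValueError(f"duplicate output name: {t.output_name}")
        by_output[t.output_name] = t.name
    return {t.name: [by_output[n] for n in t.input_names] for t in tasks}


def business_date(scheduled_time):
    """Assumption: daily task, so the data belongs to the previous day."""
    return (scheduled_time.date() - timedelta(days=1)).strftime("%Y%m%d")


def render(code, scheduled_time):
    """Substitute a scheduling parameter into repeatedly executed code."""
    return Template(code).substitute(bizdate=business_date(scheduled_time))


def can_start(task, upstreams, succeeded, now, resources_free):
    """All three conditions must hold before a task starts."""
    return (
        all(u in succeeded for u in upstreams[task.name])
        and now >= task.scheduled_time
        and resources_free
    )


ods = Task("ods_load", "proj.ods_load_out", datetime(2024, 10, 11, 0, 30))
dws = Task("dws_revenue", "proj.dws_revenue_out",
           datetime(2024, 10, 11, 1, 0), ["proj.ods_load_out"])
upstreams = resolve_upstreams([ods, dws])
sql = render("SELECT SUM(amount) FROM orders WHERE ds = '${bizdate}'",
             dws.scheduled_time)
```

In this model, `dws_revenue` stays blocked at 01:05 until `ods_load` has succeeded, even though its scheduled time has already passed, which matches the note that a task may not run at its exact scheduled time.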