OpenLake Studio overview

OpenLake Studio is a data lakehouse development platform that unifies ETL, data catalog management, and cross-engine workflow orchestration across Alibaba Cloud compute services such as MaxCompute, Hologres, and Serverless Spark.

What is OpenLake Studio

OpenLake Studio incorporates 17 years of Alibaba's big data methodologies and integrates with Alibaba Cloud compute services including MaxCompute, Hologres, Serverless Spark, Serverless StarRocks, DLF, Flink, and PAI. It provides intelligent ETL for the OpenLake data lakehouse architecture with the following features:

  • Data catalog: A metadata catalog for an integrated data lakehouse.

  • Workflow: Orchestrates real-time, offline, and AI nodes across engines such as MaxCompute and Serverless Spark.

  • Personal development environment: Develop and debug Python nodes, run interactive Notebook analysis, and integrate with Git, NAS, or OSS.

  • Notebook: Interactive SQL and Python analysis across data engines. Run or debug code instantly and visualize results.

Enable OpenLake Studio

Important

OpenLake Studio is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), and China (Shenzhen).

When you create a workspace, select OpenLake for the workspace template.

Access OpenLake Studio

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the region where the workspace resides. Find the workspace and choose Shortcuts > Data Studio in the Actions column.

Note
  • This entry point is visible only for workspaces in which the Use Data Studio (New Version) feature is enabled.

  • OpenLake Studio is supported only in Chrome 69 or later on a PC.

OpenLake Studio features

The following table summarizes the data development features. Related terminology is explained in Appendix: Data development concepts.

Type

Description

Flow management

The Workflow development mode uses a visual, business-oriented DAG interface to manage complex task projects.

Supported workflow types: Periodic workflow orchestration, Event-triggered workflows, and Manual business flows.

Note

The following limits apply:

  • Nodes per workflow: Each workflow can contain a maximum of 400 internal nodes.

  • Objects per workspace (workflows, nodes, files, tables, resources, and functions): Up to 200,000 for Enterprise Edition, or 100,000 for Professional, Standard, or Basic Edition.

No new workflows or objects can be created after the limit is reached.
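The limits in this note can be expressed as a simple pre-check. The following is a minimal sketch, not a DataWorks API: the constant and function names are hypothetical, and only the numbers come from the note above.

```python
# Limits quoted in the note above. All names here are hypothetical.
MAX_NODES_PER_WORKFLOW = 400
MAX_OBJECTS_BY_EDITION = {
    "Enterprise": 200_000,
    "Professional": 100_000,
    "Standard": 100_000,
    "Basic": 100_000,
}


def can_add_node(nodes_in_workflow: int) -> bool:
    """Return True if a workflow with this many nodes can accept one more."""
    return nodes_in_workflow < MAX_NODES_PER_WORKFLOW


def can_create_object(edition: str, objects_in_workspace: int) -> bool:
    """Return True if the workspace is still below its object limit."""
    return objects_in_workspace < MAX_OBJECTS_BY_EDITION[edition]
```

For example, under these assumptions `can_add_node(400)` is false, because a workflow that already holds 400 nodes has reached its limit.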

Task development

  • Richer capabilities:

    • Engine nodes that expose the full capabilities of each engine.

    • General-purpose nodes for complex logic: external triggers, file checks, conditional branches, loops, and output passing.

    • Supports MaxCompute, Hologres, Serverless Spark, and Serverless StarRocks for multi-engine lakehouse development.

  • Simpler operations:

    • Visual drag-and-drop workflow editor for orchestrating multi-engine tasks.

    • Intelligent SQL editor with code suggestions, visual operator structures, and permission checks.

Supported node types are documented in Node development.

Task scheduling

  • Trigger methods: Supports triggers from external systems, event-based triggers, and triggers based on upstream tasks parsed from internal data lineage.

  • Dependency types: Supports same-cycle and cross-cycle dependencies, as well as dependencies among different scheduling cycles and task types.

  • Execution control: Configure rerun behavior, downstream scheduling time, effective dates, and scheduling types such as dry run (skips execution without blocking downstream tasks) and freeze (skips execution and blocks downstream tasks).

  • Idempotence: Rerun tasks under custom conditions, with a configurable number of retries.

For scheduling details, see Node scheduling configuration.
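The difference between dry run and freeze is easiest to see in a toy scheduler. The following is a minimal sketch of the semantics described above, not the DataWorks scheduler itself: the `SchedulingType` enum, the `schedule` function, and the node names are all hypothetical.

```python
from enum import Enum


class SchedulingType(Enum):
    NORMAL = "normal"
    DRY_RUN = "dry run"
    FREEZE = "freeze"


def schedule(order, upstream, types, run):
    """Walk nodes in dependency order and return each node's outcome.

    order:    node names, upstream nodes first
    upstream: node name -> list of upstream node names
    types:    node name -> SchedulingType (defaults to NORMAL)
    run:      callable that executes a node's code
    """
    outcome = {}
    for node in order:
        # A node is blocked if any upstream node is frozen or itself blocked.
        if any(outcome[u] in ("frozen", "blocked") for u in upstream.get(node, [])):
            outcome[node] = "blocked"
            continue
        kind = types.get(node, SchedulingType.NORMAL)
        if kind is SchedulingType.FREEZE:
            outcome[node] = "frozen"    # skipped, and downstream is blocked
        elif kind is SchedulingType.DRY_RUN:
            outcome[node] = "dry-run"   # skipped, but downstream still runs
        else:
            run(node)
            outcome[node] = "ran"
    return outcome


order = ["extract", "clean", "report"]
upstream = {"clean": ["extract"], "report": ["clean"]}

dry_executed = []
dry_result = schedule(order, upstream, {"clean": SchedulingType.DRY_RUN}, dry_executed.append)

frozen_executed = []
frozen_result = schedule(order, upstream, {"clean": SchedulingType.FREEZE}, frozen_executed.append)
```

With `clean` set to dry run, `report` still runs even though `clean` executes nothing. With `clean` frozen, `report` is blocked.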

Quality control

Quality control mechanisms for task deployment:

  • Code review: Manual code review before deployment, with the ability to block production scheduling flows.

  • Checks and validation: Integrate governance item checks and custom extension validation for automated flow control on task submission and deployment.

  • Data Quality: Associate Data Quality monitoring with scheduling nodes to trigger quality rule checks after task execution.

Other

Task development process

OpenLake Studio supports real-time synchronization, offline scheduled tasks (including offline synchronization and processing tasks), and manually triggered tasks across multiple engines. Data synchronization capabilities are provided by Data Integration.

DataWorks workspaces support standard mode and basic mode, each with a different development process:

Standard mode

Basic mode

  • Basic process: In standard mode, a scheduled task goes through development, debugging, scheduling configuration, deployment, and O&M.

  • Process control: Use built-in code review, Data Governance check items, and Open Platform extension validation to enforce development standards.

Data development methods

OpenLake Studio supports two development approaches: build data processing flows with Workflow, or create individual task nodes and configure dependencies manually.

For details, see Workflow.

Supported node types

OpenLake Studio supports dozens of node types for MaxCompute, Hologres, Serverless Spark, Serverless StarRocks, and Flink, many with periodic scheduling support. For the full list, see Supported node types.

Appendix: Data development concepts

Task development concepts

Concept

Description

Workflow

A visual DAG interface for managing complex task projects. Supports orchestration of dozens of node types across MaxCompute, Hologres, Serverless Spark, Serverless StarRocks, and Flink, with workflow-level scheduling for periodic and event-triggered workflows.

Node

The basic execution unit in DataWorks. Node types include Data Integration nodes for synchronization, engine compute nodes (such as MaxCompute SQL and EMR Hive) for data processing, and general-purpose nodes (virtual nodes, do-while loops) for logic control.

Task scheduling concepts

Concept

Description

Dependency

Defines the execution order of tasks. If node A must run before node B, A is an upstream dependency of B. In a DAG, dependencies appear as arrows between nodes.
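As an illustration of how dependencies determine execution order, the following minimal sketch uses Python's standard `graphlib` module to order a small DAG. The node names A through D are hypothetical, and this is not how DataWorks itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: each key maps to the set of nodes it depends on (its upstream).
upstream = {
    "B": {"A"},        # A must run before B, so A is upstream of B
    "C": {"A"},
    "D": {"B", "C"},   # D waits for both B and C
}

# static_order() yields every node after all of its upstream nodes.
run_order = list(TopologicalSorter(upstream).static_order())
```

In the resulting order, A always comes first and D last. B and C have no dependency on each other, so their relative order is not fixed.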

Output name

A virtual entity that connects upstream and downstream tasks when configuring dependencies within a single tenant (Alibaba Cloud account).

Configure dependencies using output names, not node names or IDs. A task's output name becomes the input name for its downstream nodes.

Output table name

Set the output table name to the current task's output table so downstream nodes can verify the data source. Do not modify auto-parsed output table names. The output table name is an identifier only and does not affect the actual table generated by the SQL script.

Note

The output name of a node must be globally unique, whereas the output table name does not have this restriction.
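The following minimal sketch shows the idea behind output names: dependencies are resolved by matching a node's input names against other nodes' output names, and a duplicate output name is rejected. The `resolve_upstream` function, the dictionary layout, and the names such as `demo.load_orders_out` are hypothetical and do not reflect a DataWorks naming rule.

```python
def resolve_upstream(nodes):
    """Map each node to its upstream nodes by matching input names to output names.

    nodes: node name -> {"outputs": [...], "inputs": [...]}
    Raises ValueError if two nodes declare the same output name.
    Raises KeyError if an input name matches no output name.
    """
    producer = {}
    for name, spec in nodes.items():
        for output in spec["outputs"]:
            if output in producer:
                raise ValueError(
                    f"output name {output!r} is already used by {producer[output]!r}"
                )
            producer[output] = name
    return {
        name: sorted({producer[i] for i in spec["inputs"]})
        for name, spec in nodes.items()
    }


nodes = {
    "load_orders": {"outputs": ["demo.load_orders_out"], "inputs": []},
    "daily_report": {
        "outputs": ["demo.daily_report_out"],
        "inputs": ["demo.load_orders_out"],
    },
}
deps = resolve_upstream(nodes)
```

Here `daily_report` depends on `load_orders` only because its input name equals the output name of `load_orders`. Neither node refers to the other by node name or ID.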

Scheduling resource group

A resource group used for task scheduling.

Scheduling parameter

Scheduling parameters are code variables that dynamically obtain values such as date or time at runtime. Use them to inject runtime context into repeatedly executed code.
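To illustrate the idea, the following minimal sketch renders the same code text with different values on each run. It uses Python's `string.Template` placeholders, which are not DataWorks scheduling parameter syntax. The placeholder names `run_date` and `run_hour`, the table `orders`, and the `render` function are hypothetical.

```python
from datetime import datetime
from string import Template

# The code is written once; the placeholders are filled in at each run.
CODE = Template("SELECT * FROM orders WHERE ds = '${run_date}' AND hh = '${run_hour}'")


def render(template: Template, scheduled_time: datetime) -> str:
    """Replace the placeholders with values derived from the scheduled time."""
    values = {
        "run_date": scheduled_time.strftime("%Y%m%d"),
        "run_hour": scheduled_time.strftime("%H"),
    }
    return template.substitute(values)


first = render(CODE, datetime(2024, 10, 11, 2, 0))
second = render(CODE, datetime(2024, 10, 12, 2, 0))
```

The two runs produce different statements from identical source code, which is what makes a periodic task reusable across cycles.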

Business date

The date when business data is generated, as opposed to when it is processed. For example, revenue data for 20241010 is typically computed in the early hours of 20241011; in that case, the business date is 20241010.
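The date arithmetic in this example can be sketched as follows, assuming a daily task that processes the previous day's data. The `business_date` function name is hypothetical.

```python
from datetime import date, timedelta


def business_date(run_date: date) -> str:
    """Return the business date (yyyymmdd) for a daily task, assuming a one-day lag."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


example = business_date(date(2024, 10, 11))    # "20241010"
month_start = business_date(date(2024, 3, 1))  # "20240229" (2024 is a leap year)
```

Using `timedelta` rather than subtracting 1 from the integer 20241011 keeps the result correct across month and year boundaries.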

Scheduled time

The expected run time of a periodic task, configurable to the minute.

Important

A task may not run at the exact scheduled time. DataWorks starts a task only after upstream tasks succeed, the scheduled time arrives, and scheduling resources are available.
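The three start conditions can be summarized as a single check. The following is a minimal sketch only: the `can_start` function, the `"success"` state label, and the `free_slots` counter are hypothetical stand-ins for however the scheduler tracks this state.

```python
from datetime import datetime


def can_start(now, scheduled_time, upstream_states, free_slots):
    """Return True only when all three start conditions hold."""
    upstream_ok = all(state == "success" for state in upstream_states)
    time_ok = now >= scheduled_time
    resources_ok = free_slots > 0
    return upstream_ok and time_ok and resources_ok


scheduled = datetime(2024, 10, 11, 2, 0)

# The scheduled time has passed, but one upstream task is still running.
waiting = can_start(datetime(2024, 10, 11, 2, 5), scheduled, ["success", "running"], 3)

# The upstream tasks finish at 02:20 and a resource slot is free.
ready = can_start(datetime(2024, 10, 11, 2, 20), scheduled, ["success", "success"], 3)
```

In this example the task is scheduled for 02:00 but becomes eligible to start only at 02:20, once every upstream task has succeeded.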