Data Studio (new version)

Data Studio is an intelligent data lakehouse development platform that Alibaba Cloud built on 15 years of big data experience. It integrates with multiple Alibaba Cloud compute services and provides intelligent extract, transform, and load (ETL) development, data catalog management, and cross-engine workflow orchestration. Personal development environment instances support Python development, Notebook analytics, and Git integration. A rich plugin ecosystem unifies real-time and offline processing, data lake and data warehouse workloads, and big data with AI, enabling full-lifecycle “Data+AI” management.

What is Data Studio?

Data Studio is an intelligent data lakehouse development platform that incorporates Alibaba Cloud’s 15-year big data methodology. It deeply integrates with dozens of Alibaba Cloud big data and AI compute services—including MaxCompute, E-MapReduce, Hologres, Flink, and PAI—to deliver intelligent ETL development for data warehouses, data lakes, and OpenLake lakehouse architectures. It supports:

  • Data catalog: A metadata management system designed for lakehouse environments.

  • Workflow: A development model that orchestrates real-time and offline data processing nodes and AI nodes across dozens of engine types.

  • Personal development environment: Supports Python node development and debugging, interactive Notebook analytics, and Git-based code management integrated with NAS/OSS storage.

  • Notebook: An intelligent, interactive tool for data development and analysis. Run or debug SQL or Python code against multiple data engines and instantly visualize results.

Enable the new Data Studio

You can enable the new Data Studio in either of the following ways:

  • When creating a workspace, select Use Data Studio (New Version). For details, see Create a workspace.

  • In the legacy DataStudio, click the DataStudio button at the top of the page, click Upgrade to New Version, and then follow the prompts to migrate your data to the new Data Studio.

The new Data Studio is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).

Important
  • If you encounter issues while using the new Data Studio, join the dedicated DingTalk support group for upgrading to the new Data Studio.

  • Data in the new Data Studio and data in the legacy DataStudio are kept completely separate and cannot be shared.

  • Upgrading from the legacy DataStudio to the new version is irreversible. After a successful upgrade, you cannot revert to the legacy version. Before switching, create a test workspace with the new Data Studio enabled to verify it meets your business needs.

  • Starting February 19, 2025, when a root account creates a DataWorks workspace for the first time in a supported region, the new Data Studio will be enabled by default. The legacy DataStudio will no longer be available.

Access Data Studio

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the region. Find the workspace and, in the Actions column, choose Shortcuts > Data Studio.

Note
  • This entry point appears only for workspaces where Use Data Studio (New Version) is enabled. For details, see Enable the new Data Studio.

  • Data Studio supports only the desktop Chrome browser, version 69 or later.

Key features of Data Studio

Data Studio offers the following key features. For more information, see Appendix: Key concepts in data development.

Type

Description

Flow management

DataWorks provides a workflow-based development model. A workflow presents a task pipeline as a visual directed acyclic graph (DAG) organized from a business perspective, which simplifies the management of complex task pipelines.

For more information, see Recurring workflow orchestration, Event-triggered workflows, and Manually triggered workflows.

Note

In Data Studio, each workspace has the following limits on internal workflow nodes and objects:

  • Inner nodes: Each workflow supports up to 400 nodes.

  • Objects (workflows, nodes, files, tables, resources, and functions): Users with DataWorks Enterprise Edition can create up to 200,000 objects. Users with DataWorks Professional, Standard, or Basic Edition can create up to 100,000 objects.

If your workspace reaches these limits, you cannot create new workflows or objects.
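The limits in this note can be expressed as a simple pre-check. The following is a minimal sketch, not a DataWorks API: the constant and function names are hypothetical, and only the numbers come from the note above.

```python
# Limits quoted in the note above. All names here are illustrative only.
MAX_NODES_PER_WORKFLOW = 400
MAX_OBJECTS_BY_EDITION = {
    "enterprise": 200_000,
    "professional": 100_000,
    "standard": 100_000,
    "basic": 100_000,
}


def can_add_node(current_nodes: int) -> bool:
    """Return True if a workflow with `current_nodes` inner nodes can take one more."""
    return current_nodes < MAX_NODES_PER_WORKFLOW


def can_create_object(edition: str, current_objects: int) -> bool:
    """Return True if a workspace on `edition` can hold one more object."""
    return current_objects < MAX_OBJECTS_BY_EDITION[edition.lower()]


print(can_add_node(399))                      # True
print(can_add_node(400))                      # False
print(can_create_object("Basic", 100_000))    # False
```

A check like this could run before a batch import, so that the import fails early instead of partway through.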

Task development

  • Richer capabilities:

    • Offers diverse engine nodes with fully encapsulated engine functionality.

    • Provides general-purpose nodes that, combined with engine nodes, handle complex logic such as external system triggers, file object checks, conditional branches, loop execution, and result passing.

    • Supports Flink stream processing tasks using Realtime Compute for Apache Flink, and enables collaborative development between Flink and engines like MaxCompute and Hologres.

  • Simpler operations:

    • Features a visual workflow builder—drag and drop components to quickly orchestrate multi-engine tasks.

    • Includes an intelligent SQL editor with code hinting, visual operator structure display, and permission verification.

For supported node types, see Node development.

Task scheduling

  • Trigger methods: Supports external system triggers, event triggers, and upstream-triggered scheduling based on dependencies that are parsed automatically from data lineage.

  • Dependency types: Supports same-cycle and cross-cycle dependencies, as well as dependencies between tasks that have different scheduling cycles or task types.

  • Execution control: Lets you configure whether a task can rerun, control downstream scheduling timing based on upstream tasks, set effective dates for scheduled tasks, and define scheduling behaviors—for example, dry-run (skip execution without blocking downstream tasks) or freeze (skip execution and block downstream tasks).

  • Idempotency guarantee: Offers a rerun mechanism with customizable conditions and retry counts.

For more scheduling details, see Node scheduling configuration.
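The difference between the dry-run and freeze behaviors described above can be summarized in a small lookup. This is a minimal sketch under stated assumptions: the `Behavior`, `Outcome`, and `outcome_for` names are hypothetical, and the normal case assumes the task run succeeds.

```python
from enum import Enum
from typing import NamedTuple


class Outcome(NamedTuple):
    runs_code: bool          # whether the task's code is executed
    blocks_downstream: bool  # whether downstream tasks are held back


class Behavior(Enum):
    NORMAL = "normal"
    DRY_RUN = "dry-run"
    FREEZE = "freeze"


def outcome_for(behavior: Behavior) -> Outcome:
    """Map a scheduling behavior to its effect, as described in the text."""
    if behavior is Behavior.DRY_RUN:
        # Skip execution without blocking downstream tasks.
        return Outcome(runs_code=False, blocks_downstream=False)
    if behavior is Behavior.FREEZE:
        # Skip execution and block downstream tasks.
        return Outcome(runs_code=False, blocks_downstream=True)
    # Normal run: the code executes; assumes the run succeeds.
    return Outcome(runs_code=True, blocks_downstream=False)


for b in Behavior:
    print(b.value, outcome_for(b))
```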

Quality control

Provides standardized task publishing and multiple quality control mechanisms, including the following:

  • Code review: Requires manual code review before publishing and can block problematic production schedules.

  • Validation checks: Integrates governance item checks from data governance and custom validation logic from extensions to automate and customize submission and publishing controls.

  • Data Quality: Links quality monitoring to scheduling nodes, triggering rule validation after task completion to help detect data issues immediately.

Other

  • Open capabilities: Offers an extensive set of OpenAPI operations through Open Platform, with built-in extension points for subscribing to DataWorks development events.

  • Access control: Manages UI feature permissions and data access permissions. For details, see Workspace-level module permission control.

Data Studio interface

See Data Studio feature guide to learn about the interface layout and how to use each module.

Task development workflow

DataWorks supports creating real-time sync tasks, offline scheduled tasks (including offline sync and transformation jobs), and manually triggered tasks. For data synchronization capabilities, see the Data Integration module.

DataWorks workspaces operate in either standard mode or basic mode. Task development workflows differ slightly between modes, as shown below.

[Figure: Standard mode workspace development workflow]

[Figure: Basic mode workspace development workflow]

  • General process: In a standard mode workspace, scheduled task development goes through stages such as development, debugging, scheduling configuration, publishing, and O&M. For the general development process, see Overview of the new Data Studio.

  • Process control: During development, combine built-in code review, predefined data governance checks, and custom validation logic from Open Platform extensions to ensure that tasks comply with your standards.

Data development approaches

Data Studio lets you customize your development process. Build data pipelines quickly using workflows, or manually create task nodes and configure their dependencies.

For details, see Workflows.

Supported node types in Data Studio

Data Studio supports dozens of node types—including data integration, MaxCompute, Hologres, EMR, Flink, Python, Notebook, and ADB—and many support recurring scheduling. Choose the right node type based on your business needs. For the complete list, see Supported node types.

Appendix: Key concepts in data development

Task development

Concept

Description

Workflow

A new development approach that presents task pipelines as a visual DAG organized from a business perspective, simplifying complex task pipelines. Workflows can orchestrate dozens of node types, including data integration, MaxCompute, Hologres, EMR, Flink, Python, Notebook, and ADB, and support workflow-level scheduling. A workflow can be recurring or event-triggered.

Manually triggered workflow

A collection of tasks, tables, resources, and functions for a specific business need.

Unlike recurring workflows, tasks in manually triggered workflows must be started manually rather than on a schedule.

Task node

A task node is the basic execution unit in DataWorks. Data Studio provides multiple node types: data integration nodes for data sync, engine compute nodes for data cleansing (such as ODPS SQL, Hologres SQL, EMR Hive), and general-purpose nodes for complex logic (such as zero load nodes for managing multiple nodes or do-while nodes for looping). Combining these nodes meets diverse data processing needs.

Task scheduling

Concept

Description

Dependency

Dependencies define the execution order between tasks. If node B runs only after node A completes, A is B’s upstream dependency (or B depends on A). In a DAG, dependencies appear as arrows between nodes.
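The A-before-B relationship described above amounts to a topological ordering of the DAG. The following minimal sketch uses Python's standard `graphlib` module (Python 3.9 or later); node `C` is a hypothetical third node added for illustration, and this is not how DataWorks itself computes run order.

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on (its upstream nodes).
dependencies = {
    "B": {"A"},  # B runs only after A completes
    "C": {"B"},  # hypothetical third node, downstream of B
}


def execution_order(deps: dict) -> list:
    """Return one valid run order in which every node follows its upstream nodes."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))  # ['A', 'B', 'C']
```

If the dependencies formed a cycle, `static_order()` would raise `graphlib.CycleError`, which is why dependencies must form a DAG.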

Output name

The name of a task’s output point. Within a single tenant (Alibaba Cloud account), this virtual entity connects upstream and downstream tasks when defining dependencies.

When setting dependencies between tasks, use the output name—not the node name or ID. Once configured, this output name becomes the input name for the downstream node.

Output table name

We recommend setting the output table name to match the actual output table of the current task. Correctly specifying this name helps downstream tasks confirm they’re using the expected upstream data. Avoid manually changing auto-generated output table names—they serve only as identifiers and do not affect the actual table produced by your SQL script, which is determined by the SQL logic itself.

Note

A node’s output name must be globally unique within the Alibaba Cloud account, but its output table name is not subject to this restriction.
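To illustrate why output names must be unique, the sketch below resolves a downstream node's input names to upstream nodes through a registry keyed by output name. All node names, output names, and functions are hypothetical; this models the concept only, not the DataWorks implementation.

```python
def register_outputs(nodes: dict) -> dict:
    """Build an output-name -> node-name registry; reject duplicate output names."""
    registry = {}
    for node, outputs in nodes.items():
        for name in outputs:
            if name in registry:
                raise ValueError(
                    f"output name {name!r} is already used by {registry[name]!r}"
                )
            registry[name] = node
    return registry


def upstream_of(input_names: list, registry: dict) -> set:
    """Resolve a downstream node's input names to its upstream nodes."""
    return {registry[name] for name in input_names}


registry = register_outputs({
    "load_orders": ["proj.load_orders_out"],
    "clean_orders": ["proj.clean_orders_out"],
})
print(upstream_of(["proj.load_orders_out"], registry))  # {'load_orders'}
```

Because each input name maps to exactly one node, a duplicate output name would make the dependency ambiguous, so the registry rejects it.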

Schedule resource group

The resource group used for task scheduling.

Scheduling parameters

Scheduling parameters are variables in your code that receive dynamic values at runtime. To access contextual information—such as date or time—during repeated executions, define variables using DataWorks’ scheduling parameter syntax.

Data timestamp

A data timestamp is the date associated with a business activity. It indicates when the corresponding business data was generated. This concept is especially important in offline computing scenarios. For example, in the retail industry, the sales turnover for 20241010 is often calculated in the early morning of 20241011. The result is the actual sales turnover for 20241010. In this case, 20241010 is the data timestamp.
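The retail example above can be reproduced with plain date arithmetic. This is a minimal sketch that assumes the data timestamp is the day before the run date, as in the example; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def data_timestamp(run_time: datetime) -> str:
    """Return the data timestamp (yyyymmdd), assumed to be the day before the run date."""
    return (run_time - timedelta(days=1)).strftime("%Y%m%d")


# A task that runs in the early morning of 20241011 processes data for 20241010.
print(data_timestamp(datetime(2024, 10, 11, 0, 30)))  # 20241010
```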

Scheduled time

The time, accurate to the minute, at which you schedule a recurring task to run.

Important

Reaching the scheduled time does not guarantee immediate execution. DataWorks checks that upstream tasks succeeded, the scheduled time has passed, and sufficient scheduling resources are available. Only when all conditions are met does it trigger the task.
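The three conditions above can be read as a single conjunction. The sketch below is illustrative only: the function, its parameters, and the "success" status string are assumptions, not DataWorks identifiers.

```python
from datetime import datetime


def ready_to_run(
    upstream_states: list,
    scheduled_time: datetime,
    now: datetime,
    free_resource_slots: int,
) -> bool:
    """Return True only when all three trigger conditions hold."""
    upstream_ok = all(state == "success" for state in upstream_states)
    time_reached = now >= scheduled_time
    resources_ok = free_resource_slots > 0
    return upstream_ok and time_reached and resources_ok


scheduled = datetime(2024, 10, 11, 0, 30)
# The scheduled time has passed, but one upstream task is still running.
print(ready_to_run(["success", "running"], scheduled, datetime(2024, 10, 11, 0, 45), 2))  # False
# All upstream tasks succeeded and resources are available.
print(ready_to_run(["success", "success"], scheduled, datetime(2024, 10, 11, 0, 45), 2))  # True
```

This is why a task can start later than its scheduled time: any one unmet condition delays the run.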