Overview of the new Data Studio

What is Data Studio?

Data Studio integrates with dozens of Alibaba Cloud big data and AI compute services—including MaxCompute, E-MapReduce, Hologres, Flink, and PAI—to deliver intelligent ETL development for data warehouses, data lakes, and OpenLake lakehouse architectures. Key capabilities include:

Data catalog: Metadata management designed for lakehouse environments.
Workflow: Orchestrates real-time and offline data processing nodes and AI nodes across dozens of engine types.
Personal development environment: Python node development and debugging, interactive Notebook analytics, and Git-based code management integrated with NAS/OSS storage.
Notebook: An interactive tool for data development and analysis. Run or debug SQL or Python code against multiple data engines and instantly visualize results.

Enable the new Data Studio

To enable the new Data Studio:

When creating a workspace, select Use Data Studio (New Version). For details, see Create a workspace.
In the legacy DataStudio, click the DataStudio button at the top of the page, then click Upgrade to New Version and follow the prompts to migrate your data to the new Data Studio.
The new Data Studio is available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia).

Important

If you encounter issues while using the new Data Studio, join the dedicated DingTalk support group for upgrading to the new Data Studio.
Data in the new Data Studio and data in the legacy DataStudio are completely separate and cannot be shared.
Upgrading from the legacy DataStudio to the new version is irreversible. After a successful upgrade, you cannot revert to the legacy version. Before switching, create a test workspace with the new Data Studio enabled to verify it meets your business needs.
Starting February 19, 2025, when a root account creates a DataWorks workspace for the first time in a supported region, the new Data Studio will be enabled by default. The legacy DataStudio will no longer be available.

Access Data Studio

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the region where your workspace resides. Find the workspace and choose Shortcuts > Data Studio in the Actions column.

Note

This entry is visible only for workspaces where Use Data Studio (New Version) is enabled. For details, see Enable the new Data Studio.
Data Studio supports only the desktop Chrome browser, version 69 or later.

Key features of Data Studio

Data Studio offers the following key features. For more information, see Appendix: Key concepts in data development.

Type	Description
Flow management	DataWorks provides a workflow-based development model. Workflows use a DAG-based visual interface to simplify management of complex task pipelines from a business perspective. For more information, see Recurring workflow orchestration, Event-triggered workflows, and Manually triggered workflows. Note: In Data Studio, each workspace is subject to the following limits. Inner nodes: each workflow supports up to `400` nodes. Objects (workflows, nodes, files, tables, resources, and functions): up to `200,000` objects with DataWorks Enterprise Edition, and up to `100,000` objects with DataWorks Professional, Standard, or Basic Edition. Once a workspace reaches these limits, you cannot create new workflows or objects.
Task development	Richer capabilities: Diverse engine nodes with fully encapsulated engine functionality. General-purpose nodes for complex logic when combined with engine nodes—for example, external system triggers, file object checks, conditional branches, loop execution, and result passing. Supports Flink stream processing tasks using Realtime Compute for Apache Flink, and enables collaborative development between Flink and engines like MaxCompute and Hologres. Simpler operations: Visual workflow builder—drag and drop components to quickly orchestrate multi-engine tasks. Intelligent SQL editor with code hinting, visual operator structure display, and permission verification. For supported node types, see Node development.
Task scheduling	Trigger methods: Supports external system triggers, event triggers, and upstream-triggered scheduling via auto-captured lineage parsing. Dependency types: Supports same-cycle and cross-cycle dependencies, plus mutual dependencies across different schedule cycles and task types. Execution control: Lets you configure whether a task can rerun, control downstream scheduling timing based on upstream tasks, set effective dates for scheduled tasks, and define scheduling behaviors—for example, dry-run (skip execution without blocking downstream tasks) or freeze (skip execution and block downstream tasks). Idempotency guarantee: Offers a rerun mechanism with customizable conditions and retry counts. For more scheduling details, see Node scheduling configuration.
Quality control	Standardized task publishing with multiple quality control mechanisms: Code review: Requires manual code review before publishing to block problematic production schedules. Validation checks: Integrates governance item checks from data governance and custom validation logic from extensions to automate and customize submission and publishing controls. Data Quality: Links quality monitoring to scheduling nodes, triggering rule validation after task completion to help detect data issues immediately.
Other	Open capabilities: Extensive OpenAPI through the Open Platform, with built-in extension points to subscribe to DataWorks development events. Access control: Manages UI feature permissions and data access permissions. See Workspace-level module permission control.
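
The difference between the dry-run and freeze scheduling behaviors described in the Task scheduling row can be sketched as follows. This is only an illustrative model, not DataWorks code: the names Behavior and resolve, and the rule that a normally executed task unblocks downstream tasks only when it succeeds, are assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Tuple


class Behavior(Enum):
    """Hypothetical labels for the scheduling behaviors named in the table."""
    NORMAL = "normal"
    DRY_RUN = "dry_run"   # skip execution without blocking downstream tasks
    FREEZE = "freeze"     # skip execution and block downstream tasks


def resolve(behavior: Behavior, run_task: Callable[[], bool]) -> Tuple[bool, bool]:
    """Return (executed, downstream_unblocked) for one scheduled instance."""
    if behavior is Behavior.DRY_RUN:
        # The task body never runs, but downstream tasks may proceed.
        return (False, True)
    if behavior is Behavior.FREEZE:
        # The task body never runs, and downstream tasks stay blocked.
        return (False, False)
    # Assumption: a normal run unblocks downstream tasks only on success.
    succeeded = run_task()
    return (True, succeeded)


if __name__ == "__main__":
    for behavior in Behavior:
        print(behavior.value, resolve(behavior, lambda: True))
```

In this model both behaviors skip the task body; they differ only in whether downstream tasks are allowed to continue.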

Data Studio interface

See Data Studio feature guide to learn about the interface layout and how to use each module.

Task development workflow

DataWorks supports real-time sync tasks, offline scheduled tasks (including offline sync and transformation jobs), and manually triggered tasks. For data synchronization capabilities, see the Data Integration module.

DataWorks workspaces operate in either standard mode or basic mode. Task development workflows differ between modes as shown below.

Standard mode workflow

Basic mode workflow

Basic workflow: In standard mode, scheduled task development includes stages such as development, debugging, scheduling configuration, publishing, and O&M. For the general development process, see Overview of the new Data Studio.
Workflow control: During development, combine built-in code review, predefined data governance checks, and custom validation logic from Open Platform extensions to ensure tasks comply with standards.

Data development approaches

Data Studio lets you customize your development process. Build data pipelines using workflows, or manually create task nodes and configure dependencies.

For details, see Workflows.

Supported node types in Data Studio

Data Studio supports dozens of node types—including data integration, MaxCompute, Hologres, EMR, Flink, Python, Notebook, and ADB—many of which support recurring scheduling. Choose the right node type based on your business needs. For the complete list, see Supported node types.

Appendix: Key concepts in data development

Task development

Concept	Description
Workflow	A DAG-based visual development approach that simplifies complex task pipelines from a business perspective. Workflows support orchestrating dozens of node types—including data integration, MaxCompute, Hologres, EMR, Flink, Python, Notebook, and ADB—and offer workflow-level scheduling. They support both recurring and event-triggered workflows.
Manually triggered workflow	A collection of tasks, tables, resources, and functions for a specific business need. Unlike recurring workflows, tasks in manually triggered workflows must be started manually rather than on a schedule.
Task node	The basic execution unit in DataWorks. Data Studio provides multiple node types: data integration nodes for data sync, engine compute nodes for data cleansing (such as ODPS SQL, Hologres SQL, EMR Hive), and general-purpose nodes for complex logic (such as zero load nodes for managing multiple nodes or do-while nodes for looping). Combining these nodes meets diverse data processing needs.
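
Because a workflow is a DAG of task nodes, a valid run order is any order in which each node comes after the nodes it depends on. The sketch below illustrates that idea with Python's standard graphlib module. The node names and the WORKFLOW mapping are hypothetical, and the code is not how DataWorks itself schedules nodes.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes it depends on.
WORKFLOW = {
    "sync_orders": set(),
    "clean_orders": {"sync_orders"},
    "clean_refunds": {"sync_orders"},
    "daily_report": {"clean_orders", "clean_refunds"},
}


def execution_order(dag):
    """Return one valid run order in which every node follows its upstream nodes."""
    try:
        return list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        # A dependency cycle means the graph is not a DAG and cannot be ordered.
        raise ValueError(f"workflow contains a dependency cycle: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

The two cleansing nodes have no dependency on each other, so either may come first; only their position relative to sync_orders and daily_report is fixed.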

Task scheduling

Concept	Description
Dependency	Dependencies define the execution order between tasks. If node B runs only after node A completes, A is B’s upstream dependency (or B depends on A). In a DAG, dependencies appear as arrows between nodes.
Output name	The name of a task’s output point. Within a single tenant (Alibaba Cloud account), this virtual entity connects upstream and downstream tasks when you define dependencies. When setting dependencies between tasks, use the output name—not the node name or ID. Once configured, this output name becomes the input name for the downstream node.
Output table name	Set the output table name to match the actual output table of the current task. This helps downstream tasks confirm they are using the expected upstream data. Do not manually change auto-generated output table names. They serve only as identifiers and do not affect the actual table produced by your SQL script, which is determined by the SQL logic itself. Note: A node's output name must be globally unique, but its output table name has no such restriction.
Schedule resource group	The resource group used for task scheduling.
Scheduling parameters	Scheduling parameters are variables in your code that receive dynamic values at runtime. To access contextual information such as date or time during repeated executions, define variables using DataWorks scheduling parameter syntax.
Data timestamp	The date associated with a business activity, indicating when the corresponding business data was generated. This concept is especially important in offline computing scenarios. For example, in the retail industry, the sales turnover for 20241010 is often calculated in the early morning of 20241011. The result is the actual sales turnover for 20241010. In this case, 20241010 is the data timestamp.
Scheduled time	The time, accurate to the minute, at which you set a recurring task to run. Important: Reaching the scheduled time does not guarantee immediate execution. DataWorks checks that upstream tasks have succeeded, that the scheduled time has passed, and that sufficient scheduling resources are available. It triggers the task only when all three conditions are met.
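
The Data timestamp and Scheduled time rows can be illustrated with a short sketch. It assumes a daily task whose data timestamp is the day before the run date, as in the retail example, and it models the three trigger conditions as plain booleans. The names data_timestamp, TaskState, and can_trigger are hypothetical and are not part of any DataWorks API.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta


def data_timestamp(run_date: date) -> str:
    """Return the data timestamp (yyyymmdd), assuming it is the day before the run."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


@dataclass
class TaskState:
    """Hypothetical snapshot of what is checked before a task is triggered."""
    upstream_succeeded: bool
    scheduled_time: datetime
    resources_available: bool


def can_trigger(state: TaskState, now: datetime) -> bool:
    """A task is triggered only when all three conditions hold."""
    return (
        state.upstream_succeeded
        and now >= state.scheduled_time
        and state.resources_available
    )


if __name__ == "__main__":
    # A run in the early morning of 2024-10-11 processes data for 2024-10-10.
    print(data_timestamp(date(2024, 10, 11)))
    state = TaskState(True, datetime(2024, 10, 11, 0, 30), False)
    # The scheduled time has passed, but resources are not yet available.
    print(can_trigger(state, datetime(2024, 10, 11, 0, 45)))
```

The second call shows why a task can remain waiting after its scheduled time: one unmet condition is enough to hold it back.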