Features and preparations for legacy DataStudio

Go to DataStudio

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

Note

DataStudio is supported only on Google Chrome 69 or later on a PC.

Overview

Features

The following table describes the main DataStudio features. Key terms are defined in Appendix: Concepts.

Type	Description
Object organization and management	Object organization: DataStudio uses two-level management (Solution > workflow). Objects are organized by workflow in a directory tree and a visual panel. Create objects in the tree, or drag and drop components on the panel to build data flows. Solutions group related workflows. Object management: Create and manage nodes, tables, resources, and functions through a visual interface. See Create a workflow and Management modes. Note: The following limits apply to the number of workflows and objects that you can create in a workspace. Workflows: a maximum of `10,000`. Objects (nodes, files, tables, resources, and functions): a maximum of `200,000` in DataWorks Enterprise Edition, and a maximum of `100,000` in DataWorks Professional, Standard, or Basic Edition. After the workspace reaches a limit, you cannot create more workflows or objects.
Task development	Extensive capabilities: Supports a wide range of compute engine nodes. Provides general-purpose nodes for complex logic: external triggers, file checks, conditional branches, loops, and output passing. Simplified operations: Visual workflow editor for drag-and-drop orchestration of multi-engine tasks. Intelligent SQL editor with smart hints, visual operator structures, and permission checks. All available node types are listed in Supported node types.
Task scheduling	Trigger methods: Supports external triggers, event-based triggers, and dependency-based triggers parsed from data lineage. Dependency types: Supports same-cycle and cross-cycle dependencies across different task types and scheduling cycles. Execution control: Configure rerun policies, downstream scheduling control, effective dates, and scheduling types such as dry run (skips execution, does not block downstream) or freeze (skips execution, blocks downstream). Idempotence: Supports custom rerun conditions and rerun counts. See Configure time properties and Guide for configuring scheduling dependencies.
Task debugging	DataStudio supports debugging individual tasks and entire workflows. See Task debugging process.
Process control	Provides standardized task publishing and process control: Manual code review and smoke testing before publishing. DataStudio can block problematic workflows. Custom and automated submission controls through the governance item check from Data Governance Center and extension-based validation.
Other features	Extensibility: Integrates with Open Platform for OpenAPI operations, built-in extension points, and event subscriptions. Permission control: Controls UI feature and data access permissions. See Manage permissions on workspace-level services. View operation records: Integrated with Alibaba Cloud ActionTrail for viewing and searching operation logs.
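
The difference between the dry run and freeze scheduling types in the Task scheduling row can be illustrated with a small simulation. This is a simplified model for a linear chain of tasks, not DataWorks code: the `SchedulingType` enum, the `simulate_chain` function, and the task names are all hypothetical and exist only for illustration.

```python
from enum import Enum


class SchedulingType(Enum):
    NORMAL = "normal"    # runs the task code
    DRY_RUN = "dry_run"  # skips execution, does not block downstream
    FREEZE = "freeze"    # skips execution, blocks downstream


def simulate_chain(chain):
    """Walk a linear chain of (task_name, SchedulingType) pairs in order.

    Returns a dict mapping each task name to "executed", "skipped",
    or "blocked" under the simplified rules described above.
    """
    outcome = {}
    blocked = False
    for name, sched_type in chain:
        if blocked:
            # An upstream frozen task stops everything after it.
            outcome[name] = "blocked"
        elif sched_type is SchedulingType.NORMAL:
            outcome[name] = "executed"
        elif sched_type is SchedulingType.DRY_RUN:
            outcome[name] = "skipped"
        else:  # SchedulingType.FREEZE
            outcome[name] = "skipped"
            blocked = True
    return outcome


# Hypothetical three-task chain: extract -> clean -> report
dry = simulate_chain([
    ("extract", SchedulingType.NORMAL),
    ("clean", SchedulingType.DRY_RUN),
    ("report", SchedulingType.NORMAL),
])
frozen = simulate_chain([
    ("extract", SchedulingType.NORMAL),
    ("clean", SchedulingType.FREEZE),
    ("report", SchedulingType.NORMAL),
])
print(dry)
print(frozen)
```

In the first chain the downstream `report` task still runs after the dry-run task. In the second it does not, because the frozen task blocks it.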

UI overview

Follow the DataStudio feature guide to learn about the data development interface and how to use the features of each module.

Development process

DataStudio supports real-time sync tasks, offline scheduled tasks (including sync and processing), and manually triggered tasks for various compute engines. For data synchronization capabilities, see Data Integration. Before you begin, understand the development requirements for each compute engine and choose the appropriate task type.

Compute engine development guides: DataWorks supports various data sources and compute engines. Configuration requirements vary by engine. Major compute engine guides:
General development process: DataWorks workspaces run in standard mode and basic mode. The development process differs slightly between modes.

Development process in a workspace in standard mode.

Development process in a workspace in basic mode.
- Basic process: In standard mode, the scheduled task lifecycle includes development, debugging, scheduling configuration, committing, publishing, and O&M. See Guide to the data development process.
- Process control: Use built-in code review and smoke testing, preset checks in Data Governance Center, and custom validation through Open Platform extensions to ensure standards compliance.
  
  Note
  Process control options vary by workspace mode. The options shown in the console prevail.

Organization

In DataStudio, a workflow is the basic unit for code development and resource organization. Workflows and task nodes are developed independently in each workspace and do not affect each other. See Create a workflow.

Workflows are presented as a directory tree and an operation panel, helping you organize code from a business perspective.

Directory tree structure: Provides a way to organize code based on task types.
Workflow panel: Provides a process-oriented display of business logic.

Development organization structure

Get started

Prerequisites

To develop tasks, model data, or schedule periodic tasks in DataWorks, you must associate your data sources or clusters as computing resources in DataStudio. Without this association, you cannot create data development nodes.

Create the data sources or clusters required for your planned task types.

Data source or cluster	Description
Associate a MaxCompute computing resource	DataWorks automatically associates the first MaxCompute data source. Associate any later ones manually.
Associate a Hologres computing resource	Manually associate these data sources after creation.
Associate an AnalyticDB for PostgreSQL computing resource	Manually associate these data sources after creation.
Associate an AnalyticDB for MySQL 3.0 computing resource	Manually associate these data sources after creation.
Associate a ClickHouse computing resource	Manually associate these data sources after creation.
Register an E-MapReduce cluster with DataWorks	DataWorks automatically associates registered clusters. No manual association is needed.
Register a CDH or CDP cluster with DataWorks	DataWorks automatically associates registered clusters. No manual association is needed.

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
In the left-side navigation pane, click Computing Resources.

If the Computing Resource module is not in the left-side navigation pane, add it from Personal Settings. See Module Management.
Associate a computing resource.

On the Computing Resource page, search for the target data source or cluster by Computing Resource Name or Computing Resource Type and click Associate. After association, you can use the data source for development.

Note
- If data source information changes, refresh the page to view the updates.
- In some cases, a data source or cluster may fail to be associated with DataStudio:
  - Association depends on configuration. For example, data sources using an AccessKey pair cannot be associated. Check the association page for limitations.
  - The data source is missing a development or production environment.
  - A MaxCompute computing resource cannot be associated with multiple DataWorks workspaces at the same time.
  Note
  The platform displays the reason for any association failure.
- Only MaxCompute, E-MapReduce, Hologres, AnalyticDB for MySQL, ClickHouse, CDH/CDP, and AnalyticDB for PostgreSQL can be associated with DataStudio.
- Associable data source types and limits vary by DataWorks edition. See Features of different editions of DataWorks.
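
The conditions in the note above can be read as a checklist to run before you click Associate. The following sketch encodes them under stated assumptions: the `DataSource` record, its field names, and the `association_blockers` function are hypothetical and are not part of any DataWorks API. The console remains the authority on why an association fails.

```python
from dataclasses import dataclass

# Types listed in the note above as associable with DataStudio.
ASSOCIABLE_TYPES = {
    "MaxCompute", "E-MapReduce", "Hologres", "AnalyticDB for MySQL",
    "ClickHouse", "CDH/CDP", "AnalyticDB for PostgreSQL",
}


@dataclass
class DataSource:
    """Hypothetical description of a data source or cluster."""
    name: str
    type: str
    uses_accesskey_pair: bool = False
    has_dev_env: bool = True
    has_prod_env: bool = True
    associated_workspaces: int = 0


def association_blockers(src: DataSource) -> list[str]:
    """Return the reasons from the note that would block association."""
    reasons = []
    if src.type not in ASSOCIABLE_TYPES:
        reasons.append("type cannot be associated with DataStudio")
    if src.uses_accesskey_pair:
        reasons.append("configured with an AccessKey pair")
    if not (src.has_dev_env and src.has_prod_env):
        reasons.append("missing a development or production environment")
    if src.type == "MaxCompute" and src.associated_workspaces > 0:
        reasons.append("already associated with another workspace")
    return reasons


ok_source = DataSource(name="holo_example", type="Hologres")
bad_source = DataSource(
    name="mc_example",
    type="MaxCompute",
    uses_accesskey_pair=True,
    associated_workspaces=1,
)
print(association_blockers(ok_source))
print(association_blockers(bad_source))
```

An empty list means none of the listed conditions apply. It does not guarantee that the association succeeds, because limits also vary by DataWorks edition.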

Tutorial

The Get started with data development tutorial covers basic operations and the development process.

Supported node types

DataStudio provides various node types, many of which support periodic scheduling. Select nodes based on your business needs. See Supported node types.

Appendix: Concepts

Task development

Term	Description
Solution	A collection of workflows managed collectively. Workflows can be reused across multiple Solutions for collaboration.
Workflow	A collection of tasks, tables, resources, and functions for a business requirement. Tasks run on a schedule.
Manually triggered workflow	A collection of tasks, tables, resources, and functions for a specific business requirement. Unlike regular workflows, tasks in a manually triggered workflow must be triggered manually rather than running on a schedule.
DAG	An abbreviation for `directed acyclic graph`. Displays nodes and their dependencies. In DataStudio, all tasks within a workflow share one DAG.
Task	The basic execution unit in DataWorks. Tasks run sequentially based on dependencies.
Node	Represents a task in a DAG. Nodes run sequentially based on dependencies.
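
To make the DAG and node terms concrete, the sketch below derives one valid execution order from a set of dependencies using Python's standard `graphlib` module. It only illustrates the concept and does not show how DataWorks schedules tasks internally. The node names and the `run_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each key is a node, and each value is the set of
# upstream nodes that must finish before it can run.
dependencies = {
    "load_orders": set(),
    "load_users": set(),
    "join_orders_users": {"load_orders", "load_users"},
    "daily_report": {"join_orders_users"},
}


def run_order(deps):
    """Return one valid execution order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"not a DAG, cycle found: {exc.args[1]}") from exc


order = run_order(dependencies)
print(order)
```

A graph in which two nodes depend on each other has no valid order, which is why the graph must be acyclic. In that case `run_order` raises `ValueError`.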

Task scheduling

Term	Description
Dependency	Defines execution order between tasks. If task B runs only after task A completes, A is an upstream dependency of B. Shown as arrows in a DAG.
Output name	A globally unique identifier for a node. A node can have multiple output names. DataWorks uses output names to define scheduling dependencies.
Output table name	The name of the task's output table, helping downstream tasks confirm the correct data source. Do not modify auto-generated output table names. This identifier does not affect the actual table name, which is determined by SQL logic. Note A node's Output Name must be globally unique, whereas an Output Table Name does not have this restriction.
Scheduling resource group	The resource group used for task scheduling. See Overview of DataWorks resource groups.
Scheduling parameter	Variables in code that dynamically retrieve runtime values such as date and time. Define scheduling parameters in DataWorks to assign values to code variables at execution.
Business date	The date on which a business transaction occurred. In offline computing, this is typically the day before the task runs, and DataWorks uses that day as the default, accurate to the day. For example, when a task generates statistics for yesterday's sales, yesterday is the business date.
Scheduling time	The expected time a task is scheduled to run, accurate to the second. The actual start time may differ due to various factors.
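
The relationship between scheduling time, business date, and a scheduling parameter can be sketched as follows. This is a minimal illustration of the default rule described above (business date is the day before the task runs), not DataWorks code. The `business_date` function, the `bizdate` placeholder name, the date format, and the `sales` table are assumptions made for the example.

```python
from datetime import date, datetime, timedelta
from string import Template


def business_date(scheduling_time: datetime) -> date:
    """Apply the default rule: the day before the task is scheduled to run."""
    return scheduling_time.date() - timedelta(days=1)


# Hypothetical SQL with a placeholder that is filled in at run time.
sql_template = Template(
    "SELECT SUM(amount) FROM sales WHERE ds = '$bizdate'"
)

# Scheduling time is accurate to the second; business date only to the day.
scheduled = datetime(2024, 3, 1, 0, 30, 0)
bizdate = business_date(scheduled)
sql = sql_template.substitute(bizdate=bizdate.strftime("%Y%m%d"))
print(sql)
```

Because the value is computed from the scheduling time rather than the actual start time, a task that starts late still processes the same business date.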