How to use MaxCompute on DataWorks-DataWorks(DataWorks)-阿里云帮助中心

DataWorks allows you to easily build an offline data warehouse and analytics system based on MaxCompute. You can use the visual interface in DataWorks to configure task workflows, schedule and run tasks periodically, and manage metadata to ensure efficient and stable data production and management. This topic describes the basic process, billing details, environment preparation, and permission management for developing MaxCompute tasks in DataWorks.

Prerequisites

DataWorks is activated. For more information, see Activate DataWorks.
MaxCompute is activated. For more information, see Activate MaxCompute.
A DataWorks workspace is created. For more information, see Create a workspace.

Usage notes

The following table describes key topics for developing tasks.

Category	Description
Billing	When you develop MaxCompute tasks in DataWorks, fees are incurred for both DataWorks and other related services.
Environment preparation	Before you develop MaxCompute tasks in DataWorks, you must purchase the required DataWorks edition and resource groups for your business needs and associate the MaxCompute data source.
Permission management	DataWorks provides product-level and module-level permission management, and supports visual permission requests and approvals for MaxCompute data sources.
Get started: Data Integration	Data Integration in DataWorks allows you to read data from and write data to MaxCompute. It supports various data synchronization scenarios, including batch synchronization, real-time synchronization, and full and incremental synchronization.
Get started: Data development and O&M	DataWorks provides the Data Modeling service to structure and manage large amounts of complex, disorganized data. It also provides the DataStudio feature to develop scheduling tasks. DataStudio works with Operation Center to monitor and maintain these tasks.
Get started: Data governance	DataWorks provides metadata management and data governance capabilities for MaxCompute.
Get started: Data analysis and services	DataAnalysis in DataWorks provides data analysis and service sharing capabilities for MaxCompute.
Get started: Open Platform	DataWorks provides open capabilities that let you quickly integrate various application systems. This facilitates development process control, data governance, and O&M, and helps you quickly respond to business status changes from integrated systems.

Billing

DataWorks DataStudio and Operation Center let you periodically schedule MaxCompute data synchronization and data processing tasks. Using these features incurs fees for DataWorks and other related services, as described in the following sections.

1. DataWorks charges

These charges appear on your DataWorks bill. For more information about DataWorks billing, see Billable items of DataWorks.

Charge	Description
DataWorks edition fee	Before you can develop tasks, you must activate DataWorks. If you activate DataWorks Standard Edition, Professional Edition, or Enterprise Edition, you must pay the fee for the selected edition.
Scheduling resource fee	After you develop a task, it requires scheduling resources to run on a schedule. You can use a serverless resource group (recommended) or an old-version exclusive resource group and pay the corresponding fees. Note A serverless resource group can be used for both task scheduling and data synchronization.
Synchronization resource fee	When you run data synchronization tasks, you need data synchronization resources in addition to scheduling resources. You can use a serverless resource group (recommended) or an old-version exclusive resource group for data integration and pay the corresponding fees.

Note

No scheduling fees are charged for tasks that are run using the Run or Run with Parameters feature on the DataStudio page.
No scheduling fees are charged for tasks that fail to run or are run as dry runs.

For information about how DataWorks dispatches scheduling tasks to help you better understand the billing details, see How DataWorks dispatches scheduling tasks.

2. Non-DataWorks-related fees

These charges do not appear on your DataWorks bill. You may incur the following charges when you develop and run tasks.

Important

The fees for other services are determined by the billing rules of those services. For more information, see the billing documentation for each service. For example, for information about MaxCompute billing, see Billable items of MaxCompute.

Fee	Description
Database fees	Database fees may be incurred when you read data from or write data to upstream and downstream databases during data synchronization.
Computing and storage fees	When you run a task on a compute engine, computing and storage fees may be incurred for the compute engine. For example, if you run a MaxCompute SQL task to create a table and write data to it, computing and storage fees may be incurred for MaxCompute.
Network service fees	Network service fees may be incurred when you establish network connections between DataWorks and other related services. For example, using services such as Express Connect, Internet Shared Bandwidth, or Elastic IP Address (EIP) to establish network connections incurs fees from the corresponding services.

Environment preparation

1. Resource preparation

DataWorks provides Standard, Professional, and Enterprise feature-rich editions, and also provides tenant-specific serverless resource groups. Select an edition and a resource group type that meets your business needs.

Category	Description	Related documentation
Edition selection	DataWorks Basic Edition can meet the basic requirements for data migration to the cloud, data development, scheduled production, and simple data governance in MaxCompute development. If you require professional data governance and data security solutions, you can select Standard, Professional, or Enterprise Edition.	Compare DataWorks editions and upgrade
Resource group selection	Serverless resource group (Recommended): A single resource group can be used by all features, such as Data Integration, task scheduling, and DataService Studio. You can plan and allocate resources based on your business requirements. Old-version resource groups (old-version exclusive resource groups and shared resource groups): Meet basic scheduling requirements. These resource groups will be discontinued in the future.	Overview of DataWorks resource groups

2. Development environment preparation

You must create a MaxCompute project as a data source of a DataWorks workspace and associate it with DataStudio before you can start development. You can also manage workspace members to collaborate on development.

Category	Description	Related documentation
Data synchronization environment	Before you run MaxCompute synchronization tasks in DataWorks, you must add the MaxCompute project as a data source to the workspace. After you add the data source, you can use it to run synchronization tasks.	Associate a MaxCompute compute engine with a workspace
Data development and analysis environment	Before you run MaxCompute scheduling tasks in DataWorks, you must add the MaxCompute project as a data source to the workspace and associate it with DataStudio. Once associated, you can use the data source for data development, data analysis, and periodic task scheduling.	Associate a MaxCompute compute engine with a workspace Preparations: Associate a compute engine or cluster
Collaborative development environment	To enable RAM users to collaborate on development in a workspace, you must add them as members to the workspace and grant them a development-related role.	Add members to a workspace

Permission management

DataWorks provides product-level and module-level permission management, and supports visual permission requests and approvals for MaxCompute data sources. The following sections describe how to manage permissions.

1. Data access and permission management

You can use a MaxCompute SQL node or an ad hoc query node to query MaxCompute table data. A workspace in basic mode does not support fine-grained permission management or isolation between the development and production environments. This topic uses a workspace in standard mode as an example.

Default permissions

After a RAM user is added to a workspace as a member, their default data access permissions are as follows:

Permission type

Description

MaxCompute development project permissions

DataWorks maps predefined workspace-level roles to MaxCompute data source roles in the development environment. A RAM user with a workspace role inherits the permissions of the mapped data source role. By default, the RAM user has permissions on the development project, but not on the production project.

MaxCompute production project permissions

Only RAM users that are granted the scheduling access identity have broad permissions on the MaxCompute project in the production environment. Other RAM users do not have permissions on the production project. To operate on production tables, go to Security Center to request permissions.

DataWorks provides a default approval process, and also allows administrators to customize approval workflows.

For more information about MaxCompute data access control, see MaxCompute data permission management.

Data access behavior

MaxCompute supports cross-project table queries. In DataStudio, you can query production data in a DataWorks workspace across projects by specifying the project name. The following table describes how to access tables across projects and the accounts used for execution.

Note

You can view the data sources created for different environments and the execution accounts configured for them in the workspace's data source information. For more information, see Associate a MaxCompute compute engine with a workspace.
In a DataWorks workspace in standard mode, tasks in the development environment are run using the personal identity of a task executor by default. Tasks in the production environment are run by using a specific Alibaba Cloud account, which is the scheduling access identity. For more information, see Associate a MaxCompute compute engine with a workspace.

Code example	Execution in development	Execution in production
Access a development table in a development project: `select col1 from projectname_dev.tablename;`	The task executor's personal Alibaba Cloud account accesses the development table. If a RAM user runs the task, the RAM user's personal account is used. If the primary Alibaba Cloud account runs the task, that account is used.	The scheduling access identity accesses the development table.
Access a production table in a production project: `select col1 from projectname.tablename;`	The task executor's personal Alibaba Cloud account accesses the production table. Note By default, personal accounts do not have permission to access production tables due to security controls. You must go to Security Center to request permission. DataWorks provides a default approval process and also supports custom approval workflows.	The scheduling access identity accesses the production table.
Access a table in the current environment (for example, a development table in the development environment): `select col1 from tablename;`	When run in the development environment, the task uses the task executor's personal Alibaba Cloud account to access the target table in the development data source.	When run in the production environment, the task uses the scheduling access identity to access the target table in the production data source.

2. Feature module permission management

Before you start data development, you can assign different workspace roles to users to grant them different operation permissions. The following permission models are available:

The RAM policy-based authorization model manages permissions on DataWorks feature modules (for example, denying access to Data Map) and console operations (for example, allowing users to delete workspaces).
The RBAC model manages permissions on workspace-level modules (for example, allowing users to access DataStudio for development) and global modules (for example, denying access to the Data Security Guard module).

开发流程

Get started

DataWorks provides multiple feature modules. You can develop scheduling tasks in DataStudio and then go to Operation Center to monitor and maintain the tasks. DataWorks also provides development process control for task development and deployment to help you standardize development operations and ensure security.

1. Data Integration

The Data Integration module in DataWorks lets you read data from and write data to MaxCompute. You can synchronize data from other data sources to a MaxCompute data source, or from a MaxCompute data source to other data sources. You can also run data synchronization tasks in different scenarios, such as batch synchronization, real-time synchronization, and full and incremental synchronization, as needed. For more information, see Data Integration.

2. Data development and O&M

Module	Description	Related documentation
Data Modeling	Using Alibaba's data middle platform methodology, it helps you define your data warehouse plan, data standards, dimensional models, and data metrics. This helps enterprises understand and use data quickly and consistently.	Data Modeling
DataStudio	DataWorks encapsulates the capabilities of the MaxCompute compute engine and allows you to run MaxCompute data synchronization and data development tasks. Data synchronization: You can synchronize data from other data sources to a MaxCompute data source or from a MaxCompute data source to other data sources. DataStudio supports only some batch synchronization and real-time synchronization scenarios. For more data synchronization applications, go to the Data Integration page. Data development: You can develop various types of tasks and periodically schedule the tasks in DataWorks without using complex command-line tools.	Node code definition: MaxCompute SQL Create and use MaxCompute tables Develop a MaxCompute SQL task Reference SQL components Develop a MaxCompute Spark task Develop a PyODPS 2 task Develop a PyODPS 3 task Develop a MaxCompute Script task Develop a MaxCompute MR task Create and use MaxCompute resources Create and use a user-defined function
	You can combine general-purpose nodes and compute engine nodes in DataWorks to process complex logic. Major nodes include: Virtual nodes to control business processes. HTTP trigger, OSS Object Check, and FTP Check nodes to trigger scheduling from external systems. Assignment and parameter nodes to pass parameters between upstream and downstream nodes. Do-while nodes to loop code, for-each nodes to iterate and evaluate the assignment results of upstream nodes, and branch nodes. Other nodes, such as general-purpose Shell nodes and MySQL database nodes.	Virtual nodes HTTP trigger nodes OSS Object Check nodes FTP Check nodes Assignment nodes Parameter nodes do-while node logic for-each node logic Branch nodes
	After you develop a node task, you can perform the following operations as needed: Scheduling configuration If the node task needs to run periodically, you can define its scheduling properties, such as scheduling dependencies and parameters. Node debugging To ensure that production tasks run efficiently and to prevent waste of computing resources, debug a task before you deploy it. Node deployment A task must be deployed to the production environment to be scheduled periodically. After deployment, you can go to Operation and Maintenance Center > Auto Triggered Node to view and manage the scheduled task. Node management You can deploy or undeploy node tasks, and batch-modify scheduling properties. Development process control DataWorks provides development process control for task development and deployment, such as code review, mandatory smoke testing, and custom check logic, to ensure the correctness and security of task-related operations.	Overview of task scheduling configuration Task debugging Deploy tasks Batch operations Development process control
Operation Center	Operation Center is a one-stop big data O&M and monitoring platform. It allows you to view the running status of tasks in real time and provides O&M operations such as intelligent diagnosis and reruns for abnormal tasks. It also provides an intelligent baseline feature to help you solve issues such as unpredictable completion times for important tasks and the difficulty of monitoring many tasks, ensuring timely task output.	Basic O&M of cycle tasks
Data Quality	Data Quality ensures data availability throughout the entire data development lifecycle. By efficiently validating data quality rules and tightly integrating with the task scheduling process, it helps you detect quality issues as early as possible and prevent data quality problems from spreading. This provides an efficient, reliable, and trustworthy data source for your business.	Overview of Data Quality

3. Data governance

After you associate a MaxCompute data source, DataWorks automatically collects its metadata. You can go to Data Map to view the metadata. You can also go to Data Governance Center to view the issues that DataWorks detected and resolve them.

Module	Description	Related documentation
Data Map	Using a unified metadata foundation, it helps you manage, inventory, find, and understand data objects.	Overview of Data Map
Security Center Data Security Guard Approval Center	Security Center is a one-stop data security governance platform that integrates data asset classification, sensitive data identification, data authorization, sensitive data masking, access auditing, and risk identification and response.	Overview of Security Center Overview of Data Security Guard Overview of Approval Center
Data Governance Center	Data Governance Center helps you proactively and systematically perform data governance. It uses data domain rule templates, automatically identifies assets that require optimization, and provides both pre-event and post-event governance strategies.	Overview of Data Governance Center

4. Data analysis and services

DataWorks provides data processing, analysis, and service features, supporting efficient data sharing and access through centrally managed APIs.

Module	Description	Related documentation
DataAnalysis	Lets you run online SQL queries, gain business insights, and edit and share data. You can also save query results as chart cards to quickly build visual data reports.	DataAnalysis
DataService Studio	DataService Studio provides comprehensive data service and sharing capabilities, helping enterprises centrally manage API services for both internal and external use.	DataService Studio

5. Open Platform

DataWorks supports open capabilities to help you quickly integrate various application systems with DataWorks. This facilitates development process control, data governance, and O&M, and helps you respond to business status changes in integrated application systems.

Category	Description	Related documentation
OpenAPI	The OpenAPI feature of Open Platform provides APIs that enable interaction between on-premises services and DataWorks services. This improves the efficiency of big data processing, reduces manual operations and O&M, and lowers data risks and costs.	OpenAPI
OpenEvent	The OpenEvent feature of Open Platform provides a message subscription service. By subscribing to DataWorks event statuses, your application systems can receive real-time status changes, helping you respond to events in a timely manner and meet personalized decision-making requirements.	Overview of OpenEvent
Extensions	DataWorks provides a message push subscription feature based on OpenEvent. You can register a service program as a DataWorks extension to intercept and respond to subscribed event messages. This allows you to implement message notifications and process control for specific events through the extension.	Overview of extensions

Appendix: Relationship between DataWorks and MaxCompute

Note

A workspace in basic mode has only one production environment and corresponds to a single MaxCompute project. This topic uses a workspace in standard mode as an example.

DataWorks provides capabilities such as task scheduling, metadata management, data governance, and data security control for MaxCompute. However, task computation and data storage still occur within MaxCompute. In a workspace in standard mode, DataWorks associates different MaxCompute projects with the development and production environments to isolate storage and resources.

For information about how to create a MaxCompute data source in a DataWorks workspace, associate it with DataStudio, and view the MaxCompute projects used in each environment, see Associate a MaxCompute compute engine with a workspace.
For more information about how DataWorks dispatches scheduling tasks, see How DataWorks dispatches scheduling tasks.

引擎绑定