DataWorks on MaxCompute


DataWorks allows you to easily build an offline data warehouse and analytics system based on MaxCompute. You can use the visual interface in DataWorks to configure task workflows, schedule and run tasks periodically, and manage metadata to ensure efficient and stable data production and management. This topic describes the basic process, billing details, environment preparation, and permission management for developing MaxCompute tasks in DataWorks.

Prerequisites

  • DataWorks is activated. For more information, see Activate DataWorks.

  • MaxCompute is activated. For more information, see Activate MaxCompute.

  • A DataWorks workspace is created. For more information, see Create a workspace.

Usage notes

The following list describes key topics for developing tasks.

  • Billing: When you develop MaxCompute tasks in DataWorks, fees are incurred for both DataWorks and other related services.

  • Environment preparation: Before you develop MaxCompute tasks in DataWorks, you must purchase the required DataWorks edition and resource groups for your business needs and associate the MaxCompute data source.

  • Permission management: DataWorks provides product-level and module-level permission management, and supports visual permission requests and approvals for MaxCompute data sources.

  • Get started (Data Integration): Data Integration in DataWorks allows you to read data from and write data to MaxCompute. It supports various data synchronization scenarios, including batch synchronization, real-time synchronization, and full and incremental synchronization.

  • Get started (Data development and O&M): DataWorks provides the Data Modeling service to structure and manage large amounts of complex, disorganized data. It also provides the DataStudio feature to develop scheduling tasks. DataStudio works with Operation Center to monitor and maintain these tasks.

  • Get started (Data governance): DataWorks provides metadata management and data governance capabilities for MaxCompute.

  • Get started (Data analysis and services): DataAnalysis in DataWorks provides data analysis and service sharing capabilities for MaxCompute.

  • Get started (Open Platform): DataWorks provides open capabilities that let you quickly integrate various application systems. This facilitates development process control, data governance, and O&M, and helps you quickly respond to business status changes from integrated systems.

Billing

DataWorks DataStudio and Operation Center let you periodically schedule MaxCompute data synchronization and data processing tasks. Using these features incurs fees for DataWorks and other related services, as described in the following sections.

1. DataWorks charges

These charges appear on your DataWorks bill. For more information about DataWorks billing, see Billable items of DataWorks.

  • DataWorks edition fee: Before you can develop tasks, you must activate DataWorks. If you activate DataWorks Standard Edition, Professional Edition, or Enterprise Edition, you must pay the fee for the selected edition.

  • Scheduling resource fee: After you develop a task, it requires scheduling resources to run on a schedule. You can use a serverless resource group (recommended) or an old-version exclusive resource group and pay the corresponding fees.

    Note

    A serverless resource group can be used for both task scheduling and data synchronization.

  • Synchronization resource fee: When you run data synchronization tasks, you need data synchronization resources in addition to scheduling resources. You can use a serverless resource group (recommended) or an old-version exclusive resource group for data integration and pay the corresponding fees.

Note
  • No scheduling fees are charged for tasks that are run using the Run or Run with Parameters feature on the DataStudio page.

  • No scheduling fees are charged for tasks that fail to run or are run as dry runs.

For information about how DataWorks dispatches scheduling tasks to help you better understand the billing details, see How DataWorks dispatches scheduling tasks.
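The scheduling-fee rules in the preceding note can be read as a simple predicate. The following is a minimal sketch for illustration only: the `TaskRun` type, the trigger labels, and the function names are hypothetical and are not part of any DataWorks API, and the actual charges are determined by the DataWorks billing rules.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRun:
    """One execution of a task (hypothetical model, not a DataWorks type)."""
    trigger: str          # "scheduled", "run", or "run_with_parameters" (assumed labels)
    succeeded: bool       # False if the task failed to run
    dry_run: bool = False


# Runs started with Run or Run with Parameters on the DataStudio page.
MANUAL_TRIGGERS = {"run", "run_with_parameters"}


def incurs_scheduling_fee(run: TaskRun) -> bool:
    """Apply the three exemptions stated in the note above."""
    if run.trigger in MANUAL_TRIGGERS:
        return False      # manual runs from DataStudio are not charged
    if run.dry_run or not run.succeeded:
        return False      # dry runs and failed runs are not charged
    return True


def billable_runs(runs: Iterable[TaskRun]) -> int:
    """Count the runs that would incur a scheduling fee under this model."""
    return sum(1 for run in runs if incurs_scheduling_fee(run))
```

Under this model, only a scheduled run that succeeds and is not a dry run counts toward scheduling fees.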

2. Non-DataWorks-related fees

These charges do not appear on your DataWorks bill. You may incur the following charges when you develop and run tasks.

Important

The fees for other services are determined by the billing rules of those services. For more information, see the billing documentation for each service. For example, for information about MaxCompute billing, see Billable items of MaxCompute.

  • Database fees: Database fees may be incurred when you read data from or write data to upstream and downstream databases during data synchronization.

  • Computing and storage fees: When you run a task on a compute engine, computing and storage fees may be incurred for the compute engine. For example, if you run a MaxCompute SQL task to create a table and write data to it, computing and storage fees may be incurred for MaxCompute.

  • Network service fees: Network service fees may be incurred when you establish network connections between DataWorks and other related services. For example, using services such as Express Connect, Internet Shared Bandwidth, or Elastic IP Address (EIP) to establish network connections incurs fees from the corresponding services.

Environment preparation

1. Resource preparation

In addition to Basic Edition, DataWorks provides the feature-rich Standard, Professional, and Enterprise editions, as well as tenant-specific serverless resource groups. Select the edition and resource group type that meet your business needs.

  • Edition selection: DataWorks Basic Edition can meet the basic requirements for data migration to the cloud, data development, scheduled production, and simple data governance in MaxCompute development. If you require professional data governance and data security solutions, you can select Standard, Professional, or Enterprise Edition.

    Related documentation: Compare DataWorks editions and upgrade

  • Resource group selection:

    • Serverless resource group (recommended): A single resource group can be used by all features, such as Data Integration, task scheduling, and DataService Studio. You can plan and allocate resources based on your business requirements.

    • Old-version resource groups (old-version exclusive resource groups and shared resource groups): These resource groups meet basic scheduling requirements and will be discontinued in the future.

    Related documentation: Overview of DataWorks resource groups

2. Development environment preparation

You must create a MaxCompute project as a data source of a DataWorks workspace and associate it with DataStudio before you can start development. You can also manage workspace members to collaborate on development.

  • Data synchronization environment: Before you run MaxCompute synchronization tasks in DataWorks, you must add the MaxCompute project as a data source to the workspace. After you add the data source, you can use it to run synchronization tasks.

    Related documentation: Associate a MaxCompute compute engine with a workspace

  • Data development and analysis environment: Before you run MaxCompute scheduling tasks in DataWorks, you must add the MaxCompute project as a data source to the workspace and associate it with DataStudio. Once associated, you can use the data source for data development, data analysis, and periodic task scheduling.

  • Collaborative development environment: To enable RAM users to collaborate on development in a workspace, you must add them as members to the workspace and grant them a development-related role.

    Related documentation: Add members to a workspace

Permission management

DataWorks provides product-level and module-level permission management, and supports visual permission requests and approvals for MaxCompute data sources. The following sections describe how to manage permissions.

1. Data access and permission management

You can use a MaxCompute SQL node or an ad hoc query node to query MaxCompute table data. A workspace in basic mode does not support fine-grained permission management or isolation between the development and production environments. This topic uses a workspace in standard mode as an example.

  • Default permissions

    After a RAM user is added to a workspace as a member, their default data access permissions are as follows:

    • MaxCompute development project permissions: DataWorks maps predefined workspace-level roles to MaxCompute data source roles in the development environment. A RAM user with a workspace role inherits the permissions of the mapped data source role. By default, the RAM user has permissions on the development project, but not on the production project.

    • MaxCompute production project permissions: Only RAM users that are granted the scheduling access identity have broad permissions on the MaxCompute project in the production environment. Other RAM users do not have permissions on the production project. To operate on production tables, go to Security Center to request permissions.

      DataWorks provides a default approval process, and also allows administrators to customize approval workflows.

      For more information about MaxCompute data access control, see MaxCompute data permission management.

  • Data access behavior

    MaxCompute supports cross-project table queries. In DataStudio, you can query production data in a DataWorks workspace across projects by specifying the project name. The following examples describe how to access tables across projects and the accounts used for execution.

    Note
    • You can view the data sources created for different environments and the execution accounts configured for them in the workspace's data source information. For more information, see Associate a MaxCompute compute engine with a workspace.

    • In a DataWorks workspace in standard mode, tasks in the development environment are run using the personal identity of a task executor by default. Tasks in the production environment are run by using a specific Alibaba Cloud account, which is the scheduling access identity. For more information, see Associate a MaxCompute compute engine with a workspace.

    Access a development table in a development project:

    select col1 from projectname_dev.tablename;

    • Execution in development: The task executor's personal Alibaba Cloud account accesses the development table. If a RAM user runs the task, the RAM user's personal account is used. If the primary Alibaba Cloud account runs the task, that account is used.

    • Execution in production: The scheduling access identity accesses the development table.

    Access a production table in a production project:

    select col1 from projectname.tablename;

    • Execution in development: The task executor's personal Alibaba Cloud account accesses the production table.

      Note

      By default, personal accounts do not have permission to access production tables due to security controls. You must go to Security Center to request permission. DataWorks provides a default approval process and also supports custom approval workflows.

    • Execution in production: The scheduling access identity accesses the production table.

    Access a table in the current environment (for example, a development table in the development environment):

    select col1 from tablename;

    • Execution in development: The task uses the task executor's personal Alibaba Cloud account to access the target table in the development data source.

    • Execution in production: The task uses the scheduling access identity to access the target table in the production data source.
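The access behavior described above follows two rules: the project is either named explicitly in the query or taken from the current environment, and the executing identity depends only on the environment. The following is a minimal sketch of those rules. The `resolve_access` function, the `Access` type, and the `_dev` suffix convention (taken from the `projectname_dev` example) are assumptions for illustration. In a real workspace, the project and execution account come from the data source configuration.

```python
from dataclasses import dataclass

# Assumption: the development project is named "<production project>_dev",
# as in the projectname / projectname_dev examples above.
DEV_SUFFIX = "_dev"


@dataclass(frozen=True)
class Access:
    project: str   # MaxCompute project that holds the table
    identity: str  # account that executes the query


def resolve_access(table_ref: str, env: str, base_project: str, executor: str,
                   scheduling_identity: str = "scheduling access identity") -> Access:
    """Return the project and identity used for a query (hypothetical helper).

    table_ref is either "project.table" or a bare "table".
    env is "development" or "production".
    """
    if env not in ("development", "production"):
        raise ValueError(f"unknown environment: {env}")

    if "." in table_ref:
        # Explicit project name: the query targets that project in any environment.
        project = table_ref.split(".", 1)[0]
    elif env == "development":
        project = base_project + DEV_SUFFIX
    else:
        project = base_project

    # Development runs use the executor's personal account;
    # production runs use the scheduling access identity.
    identity = executor if env == "development" else scheduling_identity
    return Access(project, identity)
```

For example, `resolve_access("projectname.tablename", "development", "projectname", "ram_user_a")` models a RAM user querying a production table from the development environment, which by default requires a permission request in Security Center.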

2. Feature module permission management

Before you start data development, you can assign different workspace roles to users to grant them different operation permissions. The following permission models are available:

  • The RAM policy-based authorization model manages permissions on DataWorks feature modules (for example, denying access to Data Map) and console operations (for example, allowing users to delete workspaces).

  • The RBAC model manages permissions on workspace-level modules (for example, allowing users to access DataStudio for development) and global modules (for example, denying access to the Data Security Guard module).

Figure: Development process

Get started

DataWorks provides multiple feature modules. You can develop scheduling tasks in DataStudio and then go to Operation Center to monitor and maintain the tasks. DataWorks also provides development process control for task development and deployment to help you standardize development operations and ensure security.

1. Data Integration

The Data Integration module in DataWorks lets you read data from and write data to MaxCompute. You can synchronize data from other data sources to a MaxCompute data source, or from a MaxCompute data source to other data sources. You can also run data synchronization tasks in different scenarios, such as batch synchronization, real-time synchronization, and full and incremental synchronization, as needed. For more information, see Data Integration.

2. Data development and O&M

Module

Description

Related documentation

Data Modeling

Based on Alibaba's data middle platform methodology, Data Modeling helps you define your data warehouse plan, data standards, dimensional models, and data metrics. This helps enterprises understand and use data quickly and consistently.

Data Modeling

DataStudio

DataWorks encapsulates the capabilities of the MaxCompute compute engine and allows you to run MaxCompute data synchronization and data development tasks.

  • Data synchronization: You can synchronize data from other data sources to a MaxCompute data source or from a MaxCompute data source to other data sources. DataStudio supports only some batch synchronization and real-time synchronization scenarios. For more data synchronization applications, go to the Data Integration page.

  • Data development: You can develop various types of tasks and periodically schedule the tasks in DataWorks without using complex command-line tools.

You can combine general-purpose nodes and compute engine nodes in DataWorks to process complex logic.

Major nodes include:

  • Virtual nodes to control business processes.

  • HTTP trigger, OSS Object Check, and FTP Check nodes to trigger scheduling from external systems.

  • Assignment and parameter nodes to pass parameters between upstream and downstream nodes.

  • Do-while nodes to loop code, for-each nodes to iterate and evaluate the assignment results of upstream nodes, and branch nodes.

  • Other nodes, such as general-purpose Shell nodes and MySQL database nodes.

After you develop a node task, you can perform the following operations as needed:

  • Scheduling configuration

    If the node task needs to run periodically, you can define its scheduling properties, such as scheduling dependencies and parameters.

  • Node debugging

    To ensure that production tasks run efficiently and to prevent waste of computing resources, debug a task before you deploy it.

  • Node deployment

    A task must be deployed to the production environment to be scheduled periodically. After deployment, you can go to Operation Center > Auto Triggered Node to view and manage the scheduled task.

  • Node management

    You can deploy or undeploy node tasks, and batch-modify scheduling properties.

  • Development process control

    DataWorks provides development process control for task development and deployment, such as code review, mandatory smoke testing, and custom check logic, to ensure the correctness and security of task-related operations.

Operation Center

Operation Center is a one-stop big data O&M and monitoring platform. It allows you to view the running status of tasks in real time and provides O&M operations such as intelligent diagnosis and reruns for abnormal tasks. It also provides an intelligent baseline feature to help you solve issues such as unpredictable completion times for important tasks and the difficulty of monitoring many tasks, ensuring timely task output.

Basic O&M of cycle tasks

Data Quality

Data Quality ensures data availability throughout the entire data development lifecycle. By efficiently validating data quality rules and tightly integrating with the task scheduling process, it helps you detect quality issues as early as possible and prevent data quality problems from spreading. This provides an efficient, reliable, and trustworthy data source for your business.

Overview of Data Quality
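The development process control described for DataStudio above (code review, mandatory smoke testing, and custom check logic) can be pictured as a set of gates that a task must pass before deployment. The following is a conceptual sketch only: the task fields, the check functions, and `can_deploy` are hypothetical and do not correspond to DataWorks APIs or settings.

```python
from typing import Callable, Dict, List, Tuple

Task = Dict[str, object]
Check = Callable[[Task], bool]


def code_reviewed(task: Task) -> bool:
    """Gate 1: the task code has been reviewed (assumed 'reviewed' field)."""
    return task.get("reviewed") is True


def smoke_test_passed(task: Task) -> bool:
    """Gate 2: the mandatory smoke test passed (assumed 'smoke_test' field)."""
    return task.get("smoke_test") == "passed"


def has_owner(task: Task) -> bool:
    """Gate 3: an example of custom check logic (assumed 'owner' field)."""
    return bool(task.get("owner"))


CHECKS: List[Tuple[str, Check]] = [
    ("code review", code_reviewed),
    ("smoke test", smoke_test_passed),
    ("owner set", has_owner),
]


def failed_checks(task: Task) -> List[str]:
    """Return the names of all gates that the task does not pass."""
    return [name for name, check in CHECKS if not check(task)]


def can_deploy(task: Task) -> bool:
    """A task may be deployed only if every gate passes."""
    return not failed_checks(task)
```

Returning the list of failed gates, instead of a single yes or no, lets a developer see what to fix before requesting deployment again.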

3. Data governance

After you associate a MaxCompute data source, DataWorks automatically collects its metadata. You can go to Data Map to view the metadata. You can also go to Data Governance Center to view the issues that DataWorks detected and resolve them.

  • Data Map: Built on a unified metadata foundation, Data Map helps you manage, inventory, find, and understand data objects.

    Related documentation: Overview of Data Map

  • Security Center, Data Security Guard, and Approval Center: Security Center is a one-stop data security governance platform that integrates data asset classification, sensitive data identification, data authorization, sensitive data masking, access auditing, and risk identification and response.

  • Data Governance Center: Data Governance Center helps you proactively and systematically perform data governance. It uses data domain rule templates, automatically identifies assets that require optimization, and provides both pre-event and post-event governance strategies.

    Related documentation: Overview of Data Governance Center

4. Data analysis and services

DataWorks provides data processing, analysis, and service features, supporting efficient data sharing and access through centrally managed APIs.

  • DataAnalysis: DataAnalysis lets you run online SQL queries, gain business insights, and edit and share data. You can also save query results as chart cards to quickly build visual data reports.

    Related documentation: DataAnalysis

  • DataService Studio: DataService Studio provides comprehensive data service and sharing capabilities, helping enterprises centrally manage API services for both internal and external use.

    Related documentation: DataService Studio

5. Open Platform

DataWorks supports open capabilities to help you quickly integrate various application systems with DataWorks. This facilitates development process control, data governance, and O&M, and helps you respond to business status changes in integrated application systems.

  • OpenAPI: The OpenAPI feature of Open Platform provides APIs that enable interaction between on-premises services and DataWorks services. This improves the efficiency of big data processing, reduces manual operations and O&M, and lowers data risks and costs.

    Related documentation: OpenAPI

  • OpenEvent: The OpenEvent feature of Open Platform provides a message subscription service. By subscribing to DataWorks event statuses, your application systems can receive real-time status changes, helping you respond to events in a timely manner and meet personalized decision-making requirements.

    Related documentation: Overview of OpenEvent

  • Extensions: DataWorks provides a message push subscription feature based on OpenEvent. You can register a service program as a DataWorks extension to intercept and respond to subscribed event messages. This allows you to implement message notifications and process control for specific events through the extension.

    Related documentation: Overview of extensions
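The extension mechanism described above follows a subscribe-and-dispatch pattern: a service program registers interest in certain events and returns a decision for each message it receives. The following is a conceptual sketch of that pattern. The `Extension` class, the message keys (`event_type`, `payload`), and the `"pass"` and `"block"` results are all hypothetical. The real message format and callback interface are defined in the OpenEvent and extensions documentation.

```python
from typing import Callable, Dict

# A handler receives one event message and returns "pass" or "block" (assumed values).
Handler = Callable[[dict], str]


class Extension:
    """Minimal dispatcher that routes event messages to subscribed handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def subscribe(self, event_type: str, handler: Handler) -> None:
        """Register a handler for one event type."""
        self._handlers[event_type] = handler

    def on_message(self, message: dict) -> str:
        """Dispatch a message; events without a subscription are not intercepted."""
        handler = self._handlers.get(message.get("event_type", ""))
        if handler is None:
            return "pass"
        return handler(message)


def block_unreviewed_deploy(message: dict) -> str:
    """Example process control: block a deployment that has not been reviewed."""
    reviewed = message.get("payload", {}).get("reviewed")
    return "pass" if reviewed else "block"


extension = Extension()
extension.subscribe("node_deploy", block_unreviewed_deploy)  # hypothetical event name
```

Keeping each rule in its own handler makes it possible to add notifications or further checks for other event types without changing the dispatcher.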

Appendix: Relationship between DataWorks and MaxCompute

Note

A workspace in basic mode has only one production environment and corresponds to a single MaxCompute project. This topic uses a workspace in standard mode as an example.

DataWorks provides capabilities such as task scheduling, metadata management, data governance, and data security control for MaxCompute. However, task computation and data storage still occur within MaxCompute. In a workspace in standard mode, DataWorks associates different MaxCompute projects with the development and production environments to isolate storage and resources.

Figure: Engine binding