DataStudio


The DataStudio module in DataWorks is used to develop periodically scheduled tasks and configure their scheduling properties. In conjunction with Operation Center, DataStudio provides a visual integrated development environment (IDE) for various compute engines, such as MaxCompute, Hologres, and E-MapReduce. DataStudio provides features such as intelligent code development, multi-engine hybrid workflows, and standardized task publishing. These features help you efficiently build stable offline data warehouses, real-time data warehouses, and ad hoc analytics systems. This topic describes the concepts and features of DataStudio and the prerequisites for developing in DataStudio.

Go to DataStudio

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

Note

DataStudio requires a PC with Chrome 69 or later.

Introduction

Features

The main features of DataStudio are described in the following table. For more information, see Appendix: Key concepts.

Type

Description

Object organization and management

DataStudio provides the following mechanisms for organizing and managing objects:

  • Object organization: DataStudio provides a two-level management model that uses Solution > Workflow. DataWorks provides a directory tree and a visual pane for you to organize objects in workflows. You can create required objects in the directory tree or drag and drop components on the visual workflow pane to build data processing flows. You can use solutions to manage workflows.

  • Object management: You can create and manage nodes, tables, resources, and functions visually.

For more information, see Create a workflow and Management modes.

Note

In DataStudio, the following limits apply to the numbers of workflows and objects that you can create in a workspace:

  • Workflows: A maximum of 10,000 workflows can be created.

  • Objects (nodes, files, tables, resources, and functions): The maximum number of objects you can create is 200,000 for DataWorks Enterprise Edition and 100,000 for the Professional, Standard, or Basic editions.

If the number of workflows or objects in your workspace reaches the upper limit, you cannot create new workflows or objects.

Task development

  • Rich features:

    • DataStudio provides a wide range of engine nodes that fully encapsulate engine capabilities.

    • General-purpose nodes can be used with engine nodes to process complex logic. For example, you can trigger scheduling from external systems, check file objects, implement conditional branching and loops, and pass output results.

  • Ease of use:

    • DataStudio provides a visual workflow development mechanism. You can drag and drop components to orchestrate multi-engine tasks.

    • The intelligent SQL editor provides features such as intelligent suggestions, visual display of SQL operator structures, and permission verification.

For more information about the node types supported by DataWorks, see Supported node types.

Task scheduling

  • Trigger modes: Tasks can be triggered by external systems, by events, or by upstream tasks whose dependencies are derived from internal lineage analysis.

  • Dependency types: You can configure same-cycle and cross-cycle dependencies. You can also configure dependencies among tasks of different types or with different scheduling cycles.

  • Execution control: You can specify whether a task can be rerun, control the overall scheduling time of downstream tasks based on upstream tasks, set the effective date for a scheduled task, and define the scheduling type of a task. For example, a task set to dry run is not executed and does not block downstream tasks, whereas a frozen task is not executed and blocks downstream tasks.

  • Idempotence: DataStudio provides a task rerun mechanism that supports custom rerun conditions and a configurable number of reruns.

For more information about scheduling, see Configure time properties and Configure scheduling dependencies.

Task debugging

DataStudio provides mechanisms to debug a single task or an entire workflow. For more information, see Debug tasks.

Process control

DataStudio provides a standardized task publishing mechanism and multiple process control methods. The following scenarios are supported:

  • Before you publish a task, you can perform a manual code review and smoke testing on the task. This prevents problematic production scheduling processes from being published.

  • You can use governance rule checks of the Data Governance module and the custom check logic of extensions to implement custom and automated process control for task submission and publishing to the production environment.

Other

  • Open capabilities: In conjunction with Open Platform, DataStudio provides a rich set of OpenAPI operations and includes numerous built-in extension points. You can subscribe to event messages related to DataStudio by using DataWorks Open Platform.

  • Permission control: DataStudio supports permission controls for UI features and data access. For more information, see Manage permissions on modules in a workspace.

  • View operation records: DataWorks integrates with Alibaba Cloud ActionTrail, allowing you to view and search logs of recent DataWorks operations performed by your Alibaba Cloud account. For more information, see View operation records.
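The dry-run and freeze behaviors listed under Execution control in the preceding table differ only in whether downstream tasks are blocked. The following Python sketch models that distinction. The `ScheduleType` enum and the `run_instance` function are hypothetical names used for illustration only, not part of any DataWorks API, and the sketch assumes that a normally scheduled task succeeds.

```python
from enum import Enum


class ScheduleType(Enum):
    NORMAL = "normal"    # task code is executed
    DRY_RUN = "dry_run"  # task code is skipped, downstream tasks proceed
    FROZEN = "frozen"    # task code is skipped, downstream tasks are blocked


def run_instance(schedule_type, action):
    """Return (executed, blocks_downstream) for one task instance.

    `action` is a zero-argument callable that stands in for the task code.
    """
    if schedule_type is ScheduleType.NORMAL:
        action()  # assumed to succeed in this simplified model
        return True, False
    if schedule_type is ScheduleType.DRY_RUN:
        return False, False
    return False, True
```

In this model, a downstream task would check the second value of the returned tuple before it starts.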

UI

For more information about the UI of DataStudio and the features of each module, see Features on the DataStudio page.

Development process

In DataStudio, you can create the following types of tasks for various engines: real-time data synchronization tasks, offline scheduling tasks (offline synchronization tasks and offline processing tasks), and manually triggered tasks. For more information about data synchronization, see Data Integration. The configuration requirements for scheduled tasks vary based on the engine type. Before you develop a task, make sure that you understand the limits and instructions that apply to the engine type that you use.

  • Developing tasks for different engines: DataWorks allows you to create various data sources and develop tasks for different engines. The configuration requirements vary based on the engine. For information about how to develop tasks for major engines, see the following topics:

  • General development process: DataWorks provides workspaces in standard mode and basic mode. The development process for scheduled tasks differs between the two modes.

    The following figure shows the development process in a workspace in standard mode.

    [Figure: Development process in a standard mode workspace]

    The following figure shows the development process in a workspace in basic mode.

    [Figure: Development process in a basic mode workspace]

    • Basic process: Developing a scheduled task in a workspace in standard mode involves development, debugging, scheduling configuration, submission, deployment, and O&M. For more information about the general development process of tasks, see General development process.

    • Process control: During development, you can use the built-in features of DataStudio, such as code review and smoke testing, the checks that are preset in Data Governance Center, and the custom check logic that is implemented based on extensions of Open Platform to ensure that the developed task meets your business requirements.

      Note

      The process control operations vary based on the workspace mode. Refer to the UI for the specific features available.
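The combination of built-in checks, governance rules, and extension-based checks described above can be pictured as a gate that a task must pass before it is published. The following Python sketch is a hypothetical model of such a gate. The `Task` fields, the individual check functions, and `failed_checks` are illustrative names and do not correspond to DataWorks APIs.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    code_reviewed: bool
    smoke_test_passed: bool
    sql: str


def check_code_review(task):
    return task.code_reviewed


def check_smoke_test(task):
    return task.smoke_test_passed


def check_no_select_star(task):
    # Example of a custom rule that an extension might enforce (assumption).
    return "select *" not in task.sql.lower()


CHECKS = [check_code_review, check_smoke_test, check_no_select_star]


def failed_checks(task, checks=CHECKS):
    """Return the names of the checks that fail; an empty list means the gate passes."""
    return [check.__name__ for check in checks if not check(task)]
```

A task would be published only when `failed_checks(task)` returns an empty list. Otherwise, the returned names indicate which checks to address.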

Management modes

In DataStudio, a workflow is the basic unit for code development and resource organization. It is an abstract business entity and helps you organize data development from a business perspective. Workflows and task nodes are developed independently in each workspace and do not affect each other. For more information about how to use workflows, see Create a workflow.

Workflows are presented in a directory tree and on a pane. This helps you organize code from a business perspective, which clarifies resource categories and business logic.

  • Directory tree structure: organizes code based on the task type.

  • Workflow pane: displays business logic in a process-oriented manner.

A solution groups and manages workflows of a specific type. You can double-click a workflow name to open the workflow pane. The directory tree organizes nodes by category, including Data Integration, Data Development, Table, Resource, Function, Algorithm, Data Service, and Control. The workflow pane displays the dependencies between nodes on a directed acyclic graph (DAG) canvas.
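To make the DAG idea concrete, the following Python sketch models a small workflow as a mapping from each node to its upstream nodes and derives a valid run order by using the standard `graphlib` module (Python 3.9 or later). The node names are invented for illustration. The sketch only models dependency ordering and does not represent how DataWorks schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a node, and each value is the set of
# upstream nodes that must finish before the key can run.
workflow = {
    "sync_orders": set(),
    "clean_orders": {"sync_orders"},
    "daily_sales": {"clean_orders"},
    "sales_report": {"daily_sales"},
}

# static_order() yields upstream nodes before the nodes that depend on them
# and raises graphlib.CycleError if the graph contains a cycle.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
# ['sync_orders', 'clean_orders', 'daily_sales', 'sales_report']
```

Because a workflow must be acyclic, a dependency that points back to an upstream node would cause `static_order()` to raise an error instead of returning an order.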

Get started

Prerequisites

To perform data modeling or data development, or to use Operation Center to periodically schedule tasks in DataWorks, you must associate your data sources or clusters with the DataStudio module as computing resources. After a data source or cluster is associated, you can read its data and perform related development operations. If no data source or cluster is associated, you cannot create nodes in DataStudio.

  1. Create the required data sources or clusters in advance based on the types of tasks that you want to develop and schedule.

    Data source or cluster

    Description

    Associate a MaxCompute computing resource

    When you create your first MaxCompute data source, DataWorks automatically associates the data source with DataStudio. You must manually associate any subsequent MaxCompute data sources.

    Associate a Hologres computing resource
    Associate an AnalyticDB for PostgreSQL computing resource
    Associate an AnalyticDB for MySQL 3.0 computing resource
    Associate a ClickHouse computing resource

    After you create these data sources, you must manually associate them with DataStudio as described in this topic.

    Register an E-MapReduce cluster with DataWorks
    Register a CDH or CDP cluster with DataWorks

    After you register a cluster, DataWorks automatically associates the cluster with DataStudio.

  2. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  3. In the left-side navigation pane, click Computing Resource to go to the page for creating computing resources.

    If Computing Resource is not displayed in the left-side navigation pane, go to the Personal Settings page to add the module to the navigation pane. For more information, see DataStudio Modules.

  4. Associate a computing resource.

    On the Computing Resource page, you can search for the target data source or cluster by Computing Resource Name or Computing Resource Type, and then click Associate. Once associated, you can use the connection information of the data source to read its data and perform related development operations.

    Note

    If data source information changes but the current page does not reflect the updates, refresh the page to load the latest data.

    You can also filter by Status (All, Associated, or Not Associated) to locate the computing resource that you want to associate. Click the target resource to view the configurations of its production environment, such as Project Name and Access Identity.

    • A data source or cluster may fail to be associated with DataStudio in cases such as the following:

      • The configuration of the data source or cluster does not allow association. For example, data sources that are created in AccessKey ID and AccessKey secret mode cannot be associated with DataStudio. For more information about other limits, see the prompts on the association page.

      • The data source does not have a development environment or a production environment.

      • The MaxCompute computing resource is already associated with another DataWorks workspace. A MaxCompute computing resource cannot be associated with multiple DataWorks workspaces at the same time.

      Note

      Association can fail for various reasons. The platform displays the specific cause, which you can use to troubleshoot the issue.

    • You can associate only computing resources of MaxCompute, E-MapReduce, Hologres, AnalyticDB for MySQL, ClickHouse, CDH/CDP, and AnalyticDB for PostgreSQL with DataStudio.

    • The types and number of data sources or clusters that can be associated with DataStudio vary based on the DataWorks edition. For more information, see Editions.
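The association limits listed above can be summarized as a simple eligibility check. The following Python sketch is a hypothetical model. The `ComputingResource` fields and the `association_blockers` function are illustrative names, not DataWorks APIs, and the sketch covers only the limits named in this topic. The association page may report other causes.

```python
from dataclasses import dataclass

# Computing resource types that this topic lists as eligible for association.
SUPPORTED_TYPES = {
    "MaxCompute", "E-MapReduce", "Hologres", "AnalyticDB for MySQL",
    "ClickHouse", "CDH/CDP", "AnalyticDB for PostgreSQL",
}


@dataclass
class ComputingResource:
    name: str
    resource_type: str
    uses_access_key: bool = False   # created in AccessKey ID and AccessKey secret mode
    has_dev_env: bool = True
    has_prod_env: bool = True
    associated_workspaces: int = 0  # number of workspaces that already use it


def association_blockers(resource):
    """Return the reasons a resource cannot be associated (empty if none)."""
    reasons = []
    if resource.resource_type not in SUPPORTED_TYPES:
        reasons.append("unsupported computing resource type")
    if resource.uses_access_key:
        reasons.append("created in AccessKey ID and AccessKey secret mode")
    if not (resource.has_dev_env and resource.has_prod_env):
        reasons.append("missing development or production environment")
    if resource.resource_type == "MaxCompute" and resource.associated_workspaces > 0:
        reasons.append("MaxCompute resource already associated with another workspace")
    return reasons
```

For example, `association_blockers(ComputingResource("holo_dw", "Hologres"))` returns an empty list, which means that none of the modeled limits apply.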

Tutorial

To learn about the basic operations and development process of DataStudio, see Get started with DataStudio.

Supported node types

DataStudio provides various types of nodes. These nodes support periodic task scheduling. You can select a node type based on your business requirements to perform development operations. For more information about the node types supported by DataWorks, see Supported node types.

Appendix: Key concepts

  • Task development

    Concept

    Description

    solution

    A collection of workflows. You can group workflows of a specific type into a solution for centralized management. A workflow can be reused across multiple solutions. This structure allows for collaborative development, where users in different solutions can edit the same referenced workflow.

    workflow

    An abstract business entity that is a collection of tasks, tables, resources, and functions for a specific business requirement. Tasks in a workflow can be triggered to run as scheduled.

    manually triggered workflow

    A collection of tasks, tables, resources, and functions for a specific business requirement.

    Unlike a standard workflow that runs on a schedule, tasks in a manually triggered workflow must be started manually.

    DAG

    An acronym for Directed Acyclic Graph. A DAG is used to display nodes and their dependencies. In DataStudio, all tasks in a workflow are displayed in the same DAG to facilitate task development and dependency configuration.

    task

    A task is the basic execution unit in DataWorks. DataWorks runs tasks in sequence based on their dependencies.

    node

    A node represents a task in a DAG. DataWorks runs nodes in sequence based on their dependencies.

  • Task scheduling

    Concept

    Description

    dependency

    Dependencies define the execution order of tasks. If Node B can run only after Node A is complete, Node A is an upstream dependency of Node B. In a DAG, dependencies are represented by arrows between nodes.

    output name

    An identifier that distinguishes a node from other nodes. The output name is globally unique. A node can have multiple output names. DataWorks uses output names to configure scheduling dependencies between nodes.

    output table name

    Set this parameter to the name of the table that is generated by the current task. A correctly specified output table name helps a downstream task confirm whether data is from the expected upstream table when the downstream task sets a dependency. We recommend that you do not manually modify an automatically parsed output table name. The output table name serves only as an identifier. Modifying the output table name does not affect the name of the table that is actually generated by the SQL script. The name of the table that is actually generated is determined by the SQL logic.

    Note

    The Output Name of a node must be globally unique, whereas the Output Table Name does not have this restriction.

    resource group for scheduling

    The resource group that is used for task scheduling. For more information about resource groups, see Overview of resource groups.

    scheduling parameter

    A variable in code whose value is dynamically assigned at runtime. If you want to obtain information about the runtime environment, such as the date and time, during repeated runs of your code, you can define scheduling parameters in the DataWorks scheduling system to dynamically assign values to the variables in your code.

    data timestamp

    The date on which the data being processed was generated. In offline computing, this is the date on which a transaction occurs. By default, DataWorks uses the day before the task's scheduled runtime (yesterday) as the data timestamp, accurate to the day. For example, if you run a task today to calculate yesterday's sales, yesterday is the date on which the transactions occurred and is therefore the data timestamp.

    scheduling time

    The expected execution time of a data processing task. By default, DataWorks uses the scheduled runtime of the task (today) as the scheduling time, accurate to the second. The actual start time may differ from the expected execution time due to various factors.
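The relationship between the scheduling time, the data timestamp, and a scheduling parameter can be illustrated with a short Python sketch. The function name, the placeholder name `data_timestamp`, the table and column names, and the date formats are assumptions made for illustration. The sketch does not reproduce the DataWorks parameter syntax.

```python
from datetime import datetime, timedelta
from string import Template


def scheduling_values(scheduled_at):
    """Derive the two default time values for one scheduled run.

    scheduling time: the expected run time, accurate to the second.
    data timestamp:  the day before the scheduled run, accurate to the day.
    """
    data_timestamp = (scheduled_at - timedelta(days=1)).date()
    return {
        "data_timestamp": data_timestamp.strftime("%Y%m%d"),
        "scheduling_time": scheduled_at.strftime("%Y%m%d%H%M%S"),
    }


# Hypothetical task code with a placeholder that is filled in at run time.
sql = Template("SELECT SUM(amount) FROM sales WHERE ds = '${data_timestamp}'")

values = scheduling_values(datetime(2024, 5, 2, 0, 30, 0))
rendered = sql.substitute(data_timestamp=values["data_timestamp"])
print(rendered)
# SELECT SUM(amount) FROM sales WHERE ds = '20240501'
```

A run scheduled for May 2 therefore reads the partition for May 1, which matches the example of calculating yesterday's sales today.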