DataStudio (legacy)

DataStudio is the DataWorks module for developing and scheduling periodic tasks. It integrates with Operation Center and provides a visual development interface for compute engines such as MaxCompute, Hologres, and E-MapReduce, with intelligent code development, multi-engine hybrid workflows, and standardized task publishing. DataStudio helps you build offline data warehouses, real-time data warehouses, and ad hoc query systems.

Go to DataStudio

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

Note

DataStudio is supported only on Google Chrome 69 or later on a PC.

Overview

Features

The following table describes the main DataStudio features. Key terms are defined in Appendix: Concepts.

Type

Description

Object organization and management

DataStudio organizes and manages objects in the following ways:

  • Object organization: Two-level management with Solution > workflow. Objects are organized by workflow in a directory tree and a visual panel. Create objects in the tree or drag and drop components on the panel to build data flows. Solutions group related workflows.

  • Object management: Create and manage nodes, tables, resources, and functions through a visual interface.

For more information, see Create a workflow and Management modes.

Note

In DataStudio, the following limits apply to the number of workflows and objects that you can create in a workspace:

  • Workflows: You can create a maximum of 10,000 workflows.

  • Objects (nodes, files, tables, resources, and functions): If you have DataWorks Enterprise Edition, you can create a maximum of 200,000 objects. If you have DataWorks Professional, Standard, or Basic Edition, you can create a maximum of 100,000 objects.

If the number of workflows or objects in the current workspace reaches the upper limit, you cannot create new ones.
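The limits in this note can be read as a simple lookup. The following Python sketch is illustrative only: the edition keys and the `can_create` function are hypothetical names, not part of any DataWorks API. Only the numeric limits come from the note.

```python
# Illustrative only: names are hypothetical, limits are taken from the note above.
WORKFLOW_LIMIT = 10_000

OBJECT_LIMITS = {
    "enterprise": 200_000,
    "professional": 100_000,
    "standard": 100_000,
    "basic": 100_000,
}


def can_create(edition: str, workflows: int, objects: int, kind: str) -> bool:
    """Return True if one more workflow or object fits under the workspace limits."""
    if kind == "workflow":
        return workflows < WORKFLOW_LIMIT
    if kind == "object":
        return objects < OBJECT_LIMITS[edition]
    raise ValueError(f"unknown kind: {kind!r}")


# A Standard Edition workspace that is full of objects can still add a workflow.
print(can_create("standard", workflows=9_999, objects=100_000, kind="workflow"))  # True
print(can_create("standard", workflows=9_999, objects=100_000, kind="object"))    # False
```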

Task development

  • Extensive capabilities:

    • Supports a wide range of compute engine nodes.

    • Provides general-purpose nodes for complex logic: external triggers, file checks, conditional branches, loops, and output passing.

  • Simplified operations:

    • Visual workflow editor for drag-and-drop orchestration of multi-engine tasks.

    • Intelligent SQL editor with smart hints, visual operator structures, and permission checks.

All available node types are listed in Supported node types.

Task scheduling

  • Trigger methods: Supports external triggers, event-based triggers, and dependency-based triggers parsed from data lineage.

  • Dependency types: Supports same-cycle and cross-cycle dependencies across different task types and scheduling cycles.

  • Execution control: Configure rerun policies, downstream scheduling control, effective dates, and scheduling types such as dry run (skips execution, does not block downstream) or freeze (skips execution, blocks downstream).

  • Idempotence: Supports custom rerun conditions and rerun counts.

For more information, see Configure time properties and Guide for configuring scheduling dependencies.
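The difference between the dry run and freeze scheduling types is easiest to see on a small chain of tasks. The following Python sketch is a simplified model, not DataWorks behavior in full: the `SchedulingType` enum, the `simulate_chain` function, and the task names are hypothetical, and the chain assumes that each task depends only on the task before it.

```python
from enum import Enum


class SchedulingType(Enum):
    NORMAL = "normal"    # runs the task code
    DRY_RUN = "dry_run"  # skips execution, does not block downstream
    FREEZE = "freeze"    # skips execution, blocks downstream


def simulate_chain(chain: list[tuple[str, SchedulingType]]) -> dict[str, str]:
    """Walk a linear chain in which each task depends on the previous one."""
    outcome: dict[str, str] = {}
    blocked = False
    for name, sched_type in chain:
        if blocked:
            outcome[name] = "blocked"   # an upstream task is frozen
        elif sched_type is SchedulingType.FREEZE:
            outcome[name] = "frozen"    # skipped, and everything after it waits
            blocked = True
        elif sched_type is SchedulingType.DRY_RUN:
            outcome[name] = "skipped"   # skipped, but downstream continues
        else:
            outcome[name] = "ran"
    return outcome


result = simulate_chain([
    ("extract", SchedulingType.NORMAL),
    ("clean", SchedulingType.DRY_RUN),
    ("load", SchedulingType.NORMAL),
    ("report", SchedulingType.FREEZE),
    ("notify", SchedulingType.NORMAL),
])
print(result)
```

In this model, `clean` is skipped but `load` still runs, whereas freezing `report` leaves `notify` blocked.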

Task debugging

DataStudio supports debugging individual tasks and entire workflows. For more information, see Task debugging process.

Process control

Provides standardized task publishing and process control:

  • Manual code review and smoke testing before publishing. DataStudio can block problematic workflows.

  • Custom and automated submission controls through the governance item check from Data Governance Center and extension-based validation.

Other features

UI overview

Follow the DataStudio feature guide to learn about the data development interface and how to use the features of each module.

Development process

DataStudio supports real-time sync tasks, offline scheduled tasks (including sync and processing), and manually triggered tasks for various compute engines. For data synchronization capabilities, see Data Integration. Before you begin, understand the development requirements for each compute engine and choose the appropriate task type.

Organization

In DataStudio, a workflow is the basic unit for code development and resource organization. Workflows and task nodes are developed independently in each workspace and do not affect each other. For more information, see Create a workflow.

Workflows are presented as a directory tree and an operation panel, helping you organize code from a business perspective.

  • Directory tree structure: Provides a way to organize code based on task types.

  • Workflow panel: Provides a process-oriented display of business logic.

Development organization structure

Get started

Prerequisites

To develop tasks, model data, or schedule periodic tasks in DataWorks, you must associate your data sources or clusters as computing resources in DataStudio. Without this association, you cannot create data development nodes.

  1. Create the data sources or clusters required for your planned task types.

    Data source or cluster and how it is associated:

    • Associate a MaxCompute computing resource: DataWorks auto-associates the first MaxCompute data source. Manually associate subsequent ones.

    • Associate a Hologres computing resource, Associate an AnalyticDB for PostgreSQL computing resource, Associate an AnalyticDB for MySQL 3.0 computing resource, Associate a ClickHouse computing resource: Manually associate these data sources after creation.

    • Register an E-MapReduce cluster with DataWorks, Register a CDH or CDP cluster with DataWorks: DataWorks auto-associates registered clusters. No manual association needed.

  2. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  3. In the left-side navigation pane, click Computing Resources.

    If the Computing Resource module is not in the left-side navigation pane, add it from Personal Settings. For more information, see Module Management.

  4. Associate a computing resource.

    On the Computing Resource page, search for the target data source or cluster by Computing Resource Name or Computing Resource Type and click Associate. After association, you can use the data source for development.

    Note

    If data source information changes, refresh the page to view the updates.

    • In some cases, a data source or cluster may fail to be associated with DataStudio:

      • Association depends on configuration. For example, data sources using an AccessKey pair cannot be associated. Check the association page for limitations.

      • The data source is missing a development or production environment.

      • A MaxCompute computing resource cannot be associated with multiple DataWorks workspaces at the same time.

      Note

      The platform displays the reason for any association failure.

    • Only MaxCompute, E-MapReduce, Hologres, AnalyticDB for MySQL, ClickHouse, CDH/CDP, and AnalyticDB for PostgreSQL can be associated with DataStudio.

    • Associable data source types and limits vary by DataWorks edition. For more information, see Features of different editions of DataWorks.
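The association conditions listed in step 4 can be summarized as a pre-check. The following Python sketch is a rough illustration under stated assumptions: the `DataSource` class, its fields, and `association_blockers` are hypothetical and do not correspond to a DataWorks API. The AccessKey pair rule is only the example given in the text, and the association page remains the authoritative source for limitations.

```python
from dataclasses import dataclass

# Engine types listed in the text as associable with DataStudio.
SUPPORTED_TYPES = {
    "MaxCompute", "E-MapReduce", "Hologres", "AnalyticDB for MySQL",
    "ClickHouse", "CDH/CDP", "AnalyticDB for PostgreSQL",
}


@dataclass
class DataSource:  # hypothetical model, not a DataWorks type
    name: str
    type: str
    uses_access_key: bool = False
    has_dev_env: bool = True
    has_prod_env: bool = True
    other_workspaces: int = 0  # workspaces this resource is already associated with


def association_blockers(ds: DataSource) -> list[str]:
    """Return the reasons (possibly none) why association would likely fail."""
    blockers = []
    if ds.type not in SUPPORTED_TYPES:
        blockers.append("unsupported type")
    if ds.uses_access_key:
        blockers.append("AccessKey pair authentication")
    if not (ds.has_dev_env and ds.has_prod_env):
        blockers.append("missing development or production environment")
    if ds.type == "MaxCompute" and ds.other_workspaces > 0:
        blockers.append("already associated with another workspace")
    return blockers


print(association_blockers(DataSource("mc_prod", "MaxCompute", other_workspaces=1)))
print(association_blockers(DataSource("holo_dw", "Hologres")))
```

An empty list means that none of the listed conditions applies. It does not guarantee that the association succeeds.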

Tutorial

The Get started with data development tutorial covers basic operations and the development process.

Supported node types

DataStudio provides various node types, many of which support periodic scheduling. Select nodes based on your business needs. For more information, see Supported node types.

Appendix: Concepts

  • Task development

    Term

    Description

    Solution

    A collection of workflows managed collectively. Workflows can be reused across multiple Solutions for collaboration.

    Workflow

    A collection of tasks, tables, resources, and functions for a business requirement. Tasks run on a schedule.

    Manually triggered workflow

    A collection of tasks, tables, resources, and functions for a specific business requirement.

    Unlike regular workflows, tasks in a manually triggered workflow must be triggered manually rather than running on a schedule.

    DAG

    An abbreviation for directed acyclic graph. Displays nodes and their dependencies. In DataStudio, all tasks within a workflow share one DAG.

    Task

    The basic execution unit in DataWorks. Tasks run sequentially based on dependencies.

    Node

    Represents a task in a DAG. Nodes run sequentially based on dependencies.
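To illustrate how nodes in a DAG run in dependency order, the following Python sketch derives a valid execution order with the standard library's `graphlib` module (Python 3.9 or later). The node names are hypothetical, and this is only a model of the concept, not how DataWorks schedules tasks internally.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a node; its value is the set of upstream nodes it depends on.
upstreams = {
    "clean_orders": {"load_orders"},
    "daily_report": {"clean_orders", "load_users"},
}

# static_order() yields every node after all of its upstream nodes.
order = list(TopologicalSorter(upstreams).static_order())
print(order)

# A cycle is not a valid DAG, so no execution order exists.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

Independent nodes such as `load_orders` and `load_users` can appear in either order, because nothing in the graph constrains them relative to each other.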

  • Task scheduling

    Term

    Description

    Dependency

    Defines execution order between tasks. If task B runs only after task A completes, A is an upstream dependency of B. Shown as arrows in a DAG.

    Output name

    A globally unique identifier for a node. A node can have multiple output names. DataWorks uses output names to define scheduling dependencies.

    Output table name

    The name of the task's output table, helping downstream tasks confirm the correct data source. Do not modify auto-generated output table names. This identifier does not affect the actual table name, which is determined by SQL logic.

    Note

    A node's Output Name must be globally unique, whereas an Output Table Name does not have this restriction.
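The role of output names in defining dependencies can be sketched as follows. This Python example is a simplified model: the function names, node names, and output name strings are hypothetical and do not reflect the actual format that DataWorks generates. It shows why output names must be globally unique: each one has to resolve to exactly one upstream node.

```python
def build_output_index(outputs: dict[str, list[str]]) -> dict[str, str]:
    """Map each output name to the node that owns it, rejecting duplicates."""
    index: dict[str, str] = {}
    for node, names in outputs.items():
        for name in names:
            if name in index:
                raise ValueError(
                    f"output name {name!r} is used by both {index[name]!r} and {node!r}"
                )
            index[name] = node
    return index


def resolve_upstreams(
    depends_on: dict[str, list[str]], index: dict[str, str]
) -> dict[str, set[str]]:
    """Turn the output names a node depends on into its upstream nodes."""
    return {node: {index[name] for name in names} for node, names in depends_on.items()}


# A node can have multiple output names (hypothetical values).
outputs = {
    "load_orders": ["demo.load_orders_out", "demo.ods_orders"],
    "clean_orders": ["demo.clean_orders_out"],
}
index = build_output_index(outputs)

upstream_nodes = resolve_upstreams(
    {"clean_orders": ["demo.ods_orders"], "daily_report": ["demo.clean_orders_out"]},
    index,
)
print(upstream_nodes)
```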

    Scheduling resource group

    The resource group used for task scheduling. For more information, see Overview of DataWorks resource groups.

    Scheduling parameter

    Variables in code that dynamically retrieve runtime values such as date and time. Define scheduling parameters in DataWorks to assign values to code variables at execution.

    Business date

    The date a business transaction occurred, accurate to the day. In offline computing, tasks typically process the previous day's data, so DataWorks sets the business date to the day before the task runs by default. For example, when a task generates statistics for yesterday's sales, yesterday is the business date.

    Scheduling time

    The expected time a task is scheduled to run, accurate to the second. The actual start time may differ due to various factors.
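The relationship between scheduling time, business date, and scheduling parameters can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DataWorks implementation: the `business_date` function, the `${bizdate}` placeholder, the `yyyymmdd` output format, and the `sales` table with its `ds` and `amount` columns are all hypothetical choices for the example.

```python
from datetime import datetime, timedelta
from string import Template


def business_date(scheduling_time: datetime) -> str:
    """Default rule from the text: the day before the task runs, accurate to the day."""
    return (scheduling_time - timedelta(days=1)).strftime("%Y%m%d")


# Code with a variable that is assigned a value at execution time.
sql = Template("SELECT SUM(amount) FROM sales WHERE ds = '${bizdate}'")

# Scheduling time is accurate to the second; the business date only to the day.
scheduled = datetime(2024, 3, 1, 0, 30, 0)
rendered = sql.substitute(bizdate=business_date(scheduled))
print(rendered)
# SELECT SUM(amount) FROM sales WHERE ds = '20240229'
```

A task scheduled for 00:30:00 on March 1, 2024 therefore processes data for February 29, 2024, regardless of when the task actually starts.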