Data Studio (legacy version) tutorial

Build a user persona analysis pipeline with DataWorks and EMR Serverless Spark, covering data integration, transformation, quality monitoring, and visualization in Data Studio (legacy version).

Overview

This tutorial extracts user profile data, such as geographic and social attributes, from website behavior data to inform business strategy. It uses DataWorks and EMR Serverless Spark to synchronize, transform, manage, and consume the data on a recurring schedule.

Note

Before you begin, read Tutorial objectives and design for a workflow overview.

Data development platform

This tutorial uses DataWorks Data Studio (legacy version). Make sure your workspace is not set to Use Data Studio (New Version).

  • When you create a workspace, do not select the Use Data Studio (New Version) option.

  • After February 18, 2025, Data Studio (new version) is enabled by default for new workspaces in the following regions. If your account already uses the new version, follow Use Data Studio (new version) instead.

    China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)

Procedure

  1. Prepare the environment

    Create the Spark project and DataWorks workspace, then configure the network for the resource group.

  2. Synchronize data

    Configure a DataWorks synchronization pipeline to load the sample user and website log data into Spark, then verify the results.

  3. Transform data

    Use an EMR Spark SQL node in DataWorks to transform the user information and access log data and produce the target persona dataset.

  4. Monitor data quality

    Set up Data Quality monitoring rules on the transformed tables to detect and block dirty data early.

  5. Manage data

    After the analysis tasks complete, view the generated tables and their lineage in Data Map.

  6. Consume data

    Visualize the transformed data in DataAnalysis to extract key insights and identify business trends.
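
To make the transform and quality-monitoring steps more concrete, the following is a minimal plain-Python sketch of the idea, not the tutorial's actual Spark SQL. The sample rows, the field names (uid, gender, region, url, pv), and the functions build_personas and check_quality are all hypothetical and chosen only for illustration. In the tutorial itself this logic runs in an EMR Spark SQL node, and the checks are configured as Data Quality rules.

```python
from collections import Counter

# Hypothetical sample rows; the real tutorial tables and schemas may differ.
users = [
    {"uid": "u1", "gender": "F", "region": "Hangzhou"},
    {"uid": "u2", "gender": "M", "region": "Singapore"},
]
access_logs = [
    {"uid": "u1", "url": "/home"},
    {"uid": "u1", "url": "/cart"},
    {"uid": "u2", "url": "/home"},
    {"uid": "u3", "url": "/home"},  # log entry with no matching user record
]


def build_personas(users, access_logs):
    """Transform step: attach a page-view count (pv) to each user record."""
    page_views = Counter(row["uid"] for row in access_logs)
    return [{**user, "pv": page_views.get(user["uid"], 0)} for user in users]


def check_quality(rows):
    """Quality step: stop the pipeline if the output is empty or lacks a uid."""
    if not rows:
        raise ValueError("persona table is empty")
    if any(not row.get("uid") for row in rows):
        raise ValueError("persona row with missing uid")
    return rows


# Downstream steps only see the data if the quality check passes.
personas = check_quality(build_personas(users, access_logs))
```

Raising an exception in check_quality mirrors the intent of a blocking rule: if the transformed output fails a check, downstream consumers never receive the dirty data.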