Build a user persona analysis pipeline with DataWorks and EMR Serverless Spark, covering data integration, transformation, quality monitoring, and visualization in Data Studio (legacy version).
Overview
Extract user profile data—such as geographic and social attributes—from website behavior to drive business strategy. Use DataWorks and EMR Serverless Spark to synchronize, transform, manage, and consume the data on a recurring schedule.
Before you begin, read Tutorial objectives and design for a workflow overview.
Data development platform
This tutorial uses DataWorks Data Studio (legacy version). Make sure your workspace is not set to Use Data Studio (New Version).
When you create a workspace, do not select the Use Data Studio (New Version) option.
After February 18, 2025, Data Studio (new version) is enabled by default for new workspaces in the following regions:
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Thailand (Bangkok), Germany (Frankfurt), UK (London), US (Silicon Valley), US (Virginia)
If your account already uses the new version, follow Use Data Studio (new version) instead.
Procedure
- Create the Spark project and DataWorks workspace, then configure the network for the resource group.
- Configure a DataWorks synchronization pipeline to load the sample user and website log data into Spark, then verify the results.
- Use an EMR Spark SQL node in DataWorks to transform the user information and access log data and produce the target persona dataset.
- Set up Data Quality monitoring rules on the transformed tables to detect and block dirty data early.
- After the analysis tasks complete, view the generated tables and their lineage in Data Map.
- Consume data: Visualize the transformed data in DataAnalysis to extract key insights and identify business trends.
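As a rough illustration of the transform and quality-check steps in the procedure, the following plain-Python sketch joins user attributes with access-log aggregates and then applies two blocking checks. It is not the tutorial's Spark SQL or a DataWorks API: the sample rows, the column names (`uid`, `gender`, `age_range`, `region`, `pv`), and the functions `build_personas` and `check_quality` are all hypothetical and exist only to show the shape of the logic.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; the tutorial's real tables and columns may differ.
users = [
    {"uid": "u1", "gender": "F", "age_range": "20-30"},
    {"uid": "u2", "gender": "M", "age_range": "30-40"},
]
access_logs = [
    {"uid": "u1", "region": "Hangzhou"},
    {"uid": "u1", "region": "Hangzhou"},
    {"uid": "u1", "region": "Shanghai"},
    {"uid": "u2", "region": "Beijing"},
]


def build_personas(users, access_logs):
    """Join user attributes with per-user log aggregates (transform step)."""
    regions = defaultdict(Counter)
    for row in access_logs:
        regions[row["uid"]][row["region"]] += 1

    personas = []
    for user in users:
        counts = regions.get(user["uid"], Counter())
        # Most frequently seen region, or None when the user has no log rows.
        top_region = counts.most_common(1)[0][0] if counts else None
        personas.append({**user, "region": top_region, "pv": sum(counts.values())})
    return personas


def check_quality(personas):
    """Raise to block downstream use when a rule fails (monitoring step)."""
    if not personas:
        raise ValueError("persona table is empty")
    uids = [p["uid"] for p in personas]
    if len(uids) != len(set(uids)):
        raise ValueError("duplicate uid in persona table")
    return True


personas = build_personas(users, access_logs)
check_quality(personas)
```

In the actual pipeline, the join and aggregation would run in an EMR Spark SQL node and the checks would be configured as Data Quality rules; the sketch only mirrors the idea that a failed rule stops dirty data from reaching downstream consumers.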