Develop Spark jobs with a Notebook
The Notebook AnalyticDB for MySQL feature is an interactive platform for data analytics and development, offering features like job editing, data analysis, and data visualization. This topic explains how to use the Notebook feature in Data Management (DMS) to develop Spark SQL jobs.
Prerequisites
-
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you need to only create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
-
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources.
The log storage path of Spark applications is configured for the AnalyticDB for MySQL cluster.
NoteLog on to the AnalyticDB for MySQL console. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose . Click Log Settings. In the dialog box that appears, select the default path or specify a custom storage path. You cannot set the custom storage path to the root directory of OSS. Make sure that the custom storage path contains at least one layer of folders.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
Limitations
The Notebook feature is available only in the following regions: China (Hangzhou), China (Beijing), China (Shanghai), and China (Shenzhen) .
Procedure
-
Create a project and go to the workspace.
Log in to DMS 5.0.
-
Move the pointer over the
icon in the upper-left corner and choose . NoteIf you use the DMS console in normal mode, choose in the top navigation bar.
-
Click Create Workspace. In the dialog box that appears, enter a Workspace Name and select a Region. Then, click OK.
-
In the Actions column of the workspace, click Go to Workspace.
-
(Optional) Add project members. If multiple users work in the workspace, you must complete this step to assign different workspace roles to them.
-
Configure the code storage space.
-
On the Project Settings
tab, click
Storage Management
.
-
Configure the OSS path for Code Storage.
-
-
Add a resource.
-
On the Project Settings
tab, click Resource Settings. -
Click Add Resource and configure the parameters for the resource.
Parameter
Required
Description
Resource Name
Yes
The name of the resource. You can specify a custom name.
Resource Introduction
Yes
A description of the resource's intended use. This value is customizable.
Image
Yes
Only the Spark3.5+Python3.9 image is supported.
AnalyticDB Instance
Yes
The ID of the AnalyticDB for MySQL cluster.
NoteIf the cluster that you want to manage is not displayed, check whether the cluster is added to DMS.
AnalyticDB Resource Group
Yes
Select the destination job resource group.
Executor Specifications
Yes
The resource specification for Spark executors. This example uses the default value, medium.
For more information about the specifications that correspond to different models, see the Model column in Spark application configuration parameters.
Max Executors
Min Executors
Yes
The number of Spark executors.
After you select the Spark3.5+Python3.9 image, Min Executors is set to 2 and Max Executors is set to 8 by default.
Notebook Spec
Yes
Select a Notebook specification. This example uses General_Tiny_v1 (1-core and 4 GB of memory).
VPC ID
Yes
Select the Virtual Private Cloud (VPC) where the AnalyticDB for MySQL cluster resides to ensure that the Notebook can communicate with the cluster.
ImportantIf you later switch the AnalyticDB for MySQL cluster to a new Virtual Private Cloud (VPC) and vSwitch, you must update the VPC ID and vSwitch ID of the resource. Otherwise, job submissions fail.
Zone ID
Yes
Select the zone where the AnalyticDB for MySQL cluster resides.
vSwitch ID
Yes
Select the vSwitch to which the AnalyticDB for MySQL cluster belongs.
Security group ID
Yes
Select an available security group to ensure that the Notebook can communicate with the AnalyticDB for MySQL cluster.
Release resource
Yes
The resource is automatically released after a specified idle period.
Dependent Jars
No
The OSS storage path of the JAR package. This parameter is required only for Python jobs that use a JAR package.
SparkConf
No
The configuration items are mostly the same as those in open source Spark. The parameters are in the
key: valueformat. For more information about configuration parameters that are different from open source Spark and parameters that are unique to AnalyticDB for MySQL, see Spark application configuration parameters. -
Click Save.
-
In the Actions column of the resource, click Enable.
-
-
Initialize data.
-
In the upper-left corner of the console, click the
icon and choose . -
Click +New. In the Add Instance dialog box, configure the following parameters:
Parameter
Description
Data Source
On the Alibaba Cloud tab, select OSS.
Basic Information
Files and logs
Select OSS.
Instance Region
Select the region where the AnalyticDB for MySQL cluster resides.
Import Method
Select Connection String.
Connection String
Enter oss-cn-hangzhou.aliyuncs.com.
Bucket
Select the bucket name.
Access Mode
The access mode. This example uses Security Hosting - Manual.
AccessKey ID
The AccessKey ID of the Alibaba Cloud account or a RAM user that has the permissions to access OSS.
For more information about how to obtain an AccessKey ID and an AccessKey Secret, see Accounts and permissions.
AccessKey Secret
The AccessKey Secret of the Alibaba Cloud account or a RAM user that has the permissions to access OSS.
For more information about how to obtain an AccessKey ID and an AccessKey Secret, see Accounts and permissions.
Advanced Information
Optional parameters. For more information, see Advanced Information.
-
After you configure the parameters, click Test Connection in the lower-left corner.
NoteIf the connectivity test fails, check the instance information that you entered based on the error message.
-
After the Successful connection message appears, click Submit.
-
Go to the workspace and click the
tab. -
On the Data Lake Data tab, click Add OSS and select the bucket that you added in substep b.
-
-
Create a Notebook.
On the Files
tab, click
and select Notebook. -
Develop a Spark SQL job on the Notebook page.
NoteFor more information about the buttons on the Notebook page, see Notebook interface.
-
Run the following command to download Python dependencies:
pip install delta -
Change the cell type to SQL and run the following statement to create a database.
NoteThe db_delta database created in this step and the
sample_dataexternal table created in the next step are automatically displayed in AnalyticDB for MySQL. You can then analyze thesample_datatable on the AnalyticDB for MySQL console.CREATE DATABASE db_delta LOCATION 'oss://testBucketName/db_delta/'; -- Specify the storage path for data in the db_delta database. -
Change the cell type to Code, run the following code to create the
sample_dataexternal table, and insert data into the table. This action stores the data for thesample_dataexternal table at the OSS path specified in the previous step.# -*- coding: utf-8 -*- import pyspark from delta import * from pyspark.sql.types import * from pyspark.sql.functions import * print("Starting Delta table creation") data = [ ("Robert", "Baratheon", "Baratheon", "Storms End", 48), ("Eddard", "Stark", "Stark", "Winterfell", 46), ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29), ("Robert", "Baratheon", "Baratheon", "Storms End", 48), ("Eddard", "Stark", "Stark", "Winterfell", 46), ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29), ("Robert", "Baratheon", "Baratheon", "Storms End", 48), ("Eddard", "Stark", "Stark", "Winterfell", 46), ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29) ] schema = StructType([ StructField("firstname", StringType(), True), StructField("lastname", StringType(), True), StructField("house", StringType(), True), StructField("location", StringType(), True), StructField("age", IntegerType(), True) ]) sample_dataframe = spark.createDataFrame(data=data, schema=schema) sample_dataframe.write.format('delta').mode("overwrite").option('mergeSchema','true').saveAsTable("db_delta.sample_data") -
Change the cell type to SQL and run the following statement to query the data in the
sample_datatable.SELECT * FROM db_delta.sample_data;
-
-
To analyze the
sample_datatable with Spark SQL in the AnalyticDB for MySQL console, follow these steps.Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
-
In the left-side navigation pane, choose . Select the Spark engine and an Interactive resource group.
-
Query the
sample_datatable.SELECT * FROM db_delta.sample_data LIMIT 1000;
Related documentation
Notebook (legacy): Learn more about the Notebook feature.
tab, click
and select Notebook.