Develop Spark jobs with a Notebook-AnalyticDB(AnalyticDB)-阿里云帮助中心

The Notebook AnalyticDB for MySQL feature is an interactive platform for data analytics and development, offering features like job editing, data analysis, and data visualization. This topic explains how to use the Notebook feature in Data Management (DMS) to develop Spark SQL jobs.

Prerequisites

An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A job resource group is created for the AnalyticDB for MySQL cluster.
A database account is created for the AnalyticDB for MySQL cluster.
- If you use an Alibaba Cloud account, you need to only create a privileged account.
- If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
AnalyticDB for MySQL is authorized to assume the AliyunADBSparkProcessingDataRole role to access other cloud resources.
The log storage path of Spark applications is configured for the AnalyticDB for MySQL cluster.
Note
Log on to the AnalyticDB for MySQL console. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose Job Development > Spark JAR Development. Click Log Settings. In the dialog box that appears, select the default path or specify a custom storage path. You cannot set the custom storage path to the root directory of OSS. Make sure that the custom storage path contains at least one layer of folders.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

Limitations

The Notebook feature is available only in the following regions: China (Hangzhou), China (Beijing), China (Shanghai), and China (Shenzhen) .

Procedure

Create a project and go to the workspace.
1. Log in to DMS 5.0.
2. Move the pointer over the icon in the upper-left corner and choose All Features > Data+AI > Notebook.
  
  Note
  If you use the DMS console in normal mode, choose Data+AI > Notebook in the top navigation bar.
3. Click Create Workspace. In the dialog box that appears, enter a Workspace Name and select a Region. Then, click OK.
4. In the Actions column of the workspace, click Go to Workspace.
(Optional) Add project members. If multiple users work in the workspace, you must complete this step to assign different workspace roles to them.
Configure the code storage space.
1. On the Project Settings tab, click Storage Management .
2. Configure the OSS path for Code Storage.

Add a resource.

On the Project Settings tab, click Resource Settings.

Click Add Resource and configure the parameters for the resource.

Parameter	Required	Description
Resource Name	Yes	The name of the resource. You can specify a custom name.
Resource Introduction	Yes	A description of the resource's intended use. This value is customizable.
Image	Yes	Only the Spark3.5+Python3.9 image is supported.
AnalyticDB Instance	Yes	The ID of the AnalyticDB for MySQL cluster. Note If the cluster that you want to manage is not displayed, check whether the cluster is added to DMS.
AnalyticDB Resource Group	Yes	Select the destination job resource group.
Executor Specifications	Yes	The resource specification for Spark executors. This example uses the default value, medium. For more information about the specifications that correspond to different models, see the Model column in Spark application configuration parameters.
Max Executors Min Executors	Yes	The number of Spark executors. After you select the Spark3.5+Python3.9 image, Min Executors is set to 2 and Max Executors is set to 8 by default.
Notebook Spec	Yes	Select a Notebook specification. This example uses General_Tiny_v1 (1-core and 4 GB of memory).
VPC ID	Yes	Select the Virtual Private Cloud (VPC) where the AnalyticDB for MySQL cluster resides to ensure that the Notebook can communicate with the cluster. Important If you later switch the AnalyticDB for MySQL cluster to a new Virtual Private Cloud (VPC) and vSwitch, you must update the VPC ID and vSwitch ID of the resource. Otherwise, job submissions fail.
Zone ID	Yes	Select the zone where the AnalyticDB for MySQL cluster resides.
vSwitch ID	Yes	Select the vSwitch to which the AnalyticDB for MySQL cluster belongs.
Security group ID	Yes	Select an available security group to ensure that the Notebook can communicate with the AnalyticDB for MySQL cluster.
Release resource	Yes	The resource is automatically released after a specified idle period.
Dependent Jars	No	The OSS storage path of the JAR package. This parameter is required only for Python jobs that use a JAR package.
SparkConf	No	The configuration items are mostly the same as those in open source Spark. The parameters are in the `key: value` format. For more information about configuration parameters that are different from open source Spark and parameters that are unique to AnalyticDB for MySQL, see Spark application configuration parameters.

Click Save.
In the Actions column of the resource, click Enable.

Initialize data.

In the upper-left corner of the console, click the icon and choose All features > Instance Management > Data Assets > Instance Management > Instance Management.

Click +New. In the Add Instance dialog box, configure the following parameters:

Parameter		Description
Data Source		On the Alibaba Cloud tab, select OSS.
Basic Information	Files and logs	Select OSS.
	Instance Region	Select the region where the AnalyticDB for MySQL cluster resides.
	Import Method	Select Connection String.
	Connection String	Enter oss-cn-hangzhou.aliyuncs.com.
	Bucket	Select the bucket name.
	Access Mode	The access mode. This example uses Security Hosting - Manual.
	AccessKey ID	The AccessKey ID of the Alibaba Cloud account or a RAM user that has the permissions to access OSS. For more information about how to obtain an AccessKey ID and an AccessKey Secret, see Accounts and permissions.
	AccessKey Secret	The AccessKey Secret of the Alibaba Cloud account or a RAM user that has the permissions to access OSS. For more information about how to obtain an AccessKey ID and an AccessKey Secret, see Accounts and permissions.
Advanced Information		Optional parameters. For more information, see Advanced Information.

After you configure the parameters, click Test Connection in the lower-left corner.

Note
If the connectivity test fails, check the instance information that you entered based on the error message.
After the Successful connection message appears, click Submit.
Go to the workspace and click the tab.
On the Data Lake Data tab, click Add OSS and select the bucket that you added in substep b.

Create a Notebook.

On the Files tab, click and select Notebook.

Develop a Spark SQL job on the Notebook page.

Note

For more information about the buttons on the Notebook page, see Notebook interface.

Run the following command to download Python dependencies:
```
pip install delta
```
Change the cell type to SQL and run the following statement to create a database.

Note
The db_delta database created in this step and the sample_data external table created in the next step are automatically displayed in AnalyticDB for MySQL. You can then analyze the sample_data table on the AnalyticDB for MySQL console.
```
CREATE DATABASE db_delta 
LOCATION 'oss://testBucketName/db_delta/';    -- Specify the storage path for data in the db_delta database.
```

Change the cell type to Code, run the following code to create the sample_data external table, and insert data into the table. This action stores the data for the sample_data external table at the OSS path specified in the previous step.

# -*- coding: utf-8 -*-
import pyspark
from delta import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
print("Starting Delta table creation")
data = [
    ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
    ("Eddard", "Stark", "Stark", "Winterfell", 46),
    ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29),
    ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
    ("Eddard", "Stark", "Stark", "Winterfell", 46),
    ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29),
    ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
    ("Eddard", "Stark", "Stark", "Winterfell", 46),
    ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29)
        ]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("house", StringType(), True),
    StructField("location", StringType(), True),
    StructField("age", IntegerType(), True)
])
sample_dataframe = spark.createDataFrame(data=data, schema=schema)
sample_dataframe.write.format('delta').mode("overwrite").option('mergeSchema','true').saveAsTable("db_delta.sample_data")

Change the cell type to SQL and run the following statement to query the data in the sample_data table.
```
SELECT * FROM db_delta.sample_data;
```

To analyze the sample_data table with Spark SQL in the AnalyticDB for MySQL console, follow these steps.
1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
2. In the left-side navigation pane, choose Job Development > SQL Development > SQL Development. Select the Spark engine and an Interactive resource group.
3. Query the sample_data table.
```
SELECT * FROM db_delta.sample_data LIMIT 1000;
```

Prerequisites

Limitations

Procedure

Related documentation