Develop Spark jobs with a Notebook

更新时间:
复制 MD 格式

The Notebook AnalyticDB for MySQL feature is an interactive platform for data analytics and development, offering features like job editing, data analysis, and data visualization. This topic explains how to use the Notebook feature in Data Management (DMS) to develop Spark SQL jobs.

Prerequisites

Limitations

The Notebook feature is available only in the following regions: China (Hangzhou), China (Beijing), China (Shanghai), and China (Shenzhen) .

Procedure

  1. Create a project and go to the workspace.

    1. Log in to DMS 5.0.

    2. Move the pointer over the 2023-01-28_15-57-17.png icon in the upper-left corner and choose All Features > Data+AI > Notebook.

      Note

      If you use the DMS console in normal mode, choose Data+AI > Notebook in the top navigation bar.

    3. Click Create Workspace. In the dialog box that appears, enter a Workspace Name and select a Region. Then, click OK.

    4. In the Actions column of the workspace, click Go to Workspace.

  2. (Optional) Add project members. If multiple users work in the workspace, you must complete this step to assign different workspace roles to them.

  3. Configure the code storage space.

    1. On the Project Settings image tab, click Storage Management .

    2. Configure the OSS path for Code Storage.

  4. Add a resource.

    1. On the Project Settings image tab, click Resource Settings.

    2. Click Add Resource and configure the parameters for the resource.

      Parameter

      Required

      Description

      Resource Name

      Yes

      The name of the resource. You can specify a custom name.

      Resource Introduction

      Yes

      A description of the resource's intended use. This value is customizable.

      Image

      Yes

      Only the Spark3.5+Python3.9 image is supported.

      AnalyticDB Instance

      Yes

      The ID of the AnalyticDB for MySQL cluster.

      Note

      If the cluster that you want to manage is not displayed, check whether the cluster is added to DMS.

      AnalyticDB Resource Group

      Yes

      Select the destination job resource group.

      Executor Specifications

      Yes

      The resource specification for Spark executors. This example uses the default value, medium.

      For more information about the specifications that correspond to different models, see the Model column in Spark application configuration parameters.

      Max Executors

      Min Executors

      Yes

      The number of Spark executors.

      After you select the Spark3.5+Python3.9 image, Min Executors is set to 2 and Max Executors is set to 8 by default.

      Notebook Spec

      Yes

      Select a Notebook specification. This example uses General_Tiny_v1 (1-core and 4 GB of memory).

      VPC ID

      Yes

      Select the Virtual Private Cloud (VPC) where the AnalyticDB for MySQL cluster resides to ensure that the Notebook can communicate with the cluster.

      Important

      If you later switch the AnalyticDB for MySQL cluster to a new Virtual Private Cloud (VPC) and vSwitch, you must update the VPC ID and vSwitch ID of the resource. Otherwise, job submissions fail.

      Zone ID

      Yes

      Select the zone where the AnalyticDB for MySQL cluster resides.

      vSwitch ID

      Yes

      Select the vSwitch to which the AnalyticDB for MySQL cluster belongs.

      Security group ID

      Yes

      Select an available security group to ensure that the Notebook can communicate with the AnalyticDB for MySQL cluster.

      Release resource

      Yes

      The resource is automatically released after a specified idle period.

      Dependent Jars

      No

      The OSS storage path of the JAR package. This parameter is required only for Python jobs that use a JAR package.

      SparkConf

      No

      The configuration items are mostly the same as those in open source Spark. The parameters are in the key: value format. For more information about configuration parameters that are different from open source Spark and parameters that are unique to AnalyticDB for MySQL, see Spark application configuration parameters.

    3. Click Save.

    4. In the Actions column of the resource, click Enable.

  5. Initialize data.

    1. In the upper-left corner of the console, click the 2023-01-28_15-57-17.png icon and choose All features > Instance Management > Data Assets > Instance Management > Instance Management.

    2. Click +New. In the Add Instance dialog box, configure the following parameters:

      Parameter

      Description

      Data Source

      On the Alibaba Cloud tab, select OSS.

      Basic Information

      Files and logs

      Select OSS.

      Instance Region

      Select the region where the AnalyticDB for MySQL cluster resides.

      Import Method

      Select Connection String.

      Connection String

      Enter oss-cn-hangzhou.aliyuncs.com.

      Bucket

      Select the bucket name.

      Access Mode

      The access mode. This example uses Security Hosting - Manual.

      AccessKey ID

      The AccessKey ID of the Alibaba Cloud account or a RAM user that has the permissions to access OSS.

      For more information about how to obtain an AccessKey ID and an AccessKey Secret, see Accounts and permissions.

      AccessKey Secret

      The AccessKey Secret of the Alibaba Cloud account or a RAM user that has the permissions to access OSS.

      For more information about how to obtain an AccessKey ID and an AccessKey Secret, see Accounts and permissions.

      Advanced Information

      Optional parameters. For more information, see Advanced Information.

    3. After you configure the parameters, click Test Connection in the lower-left corner.

      Note

      If the connectivity test fails, check the instance information that you entered based on the error message.

    4. After the Successful connection message appears, click Submit.

    5. Go to the workspace and click the image tab.

    6. On the Data Lake Data tab, click Add OSS and select the bucket that you added in substep b.

  6. Create a Notebook.

    On the Files image tab, click image and select Notebook.

  7. Develop a Spark SQL job on the Notebook page.

    Note

    For more information about the buttons on the Notebook page, see Notebook interface.

    1. Run the following command to download Python dependencies:

      pip install delta
    2. Change the cell type to SQL and run the following statement to create a database.

      Note

      The db_delta database created in this step and the sample_data external table created in the next step are automatically displayed in AnalyticDB for MySQL. You can then analyze the sample_data table on the AnalyticDB for MySQL console.

      CREATE DATABASE db_delta 
      LOCATION 'oss://testBucketName/db_delta/';    -- Specify the storage path for data in the db_delta database.
    3. Change the cell type to Code, run the following code to create the sample_data external table, and insert data into the table. This action stores the data for the sample_data external table at the OSS path specified in the previous step.

      # -*- coding: utf-8 -*-
      import pyspark
      from delta import *
      from pyspark.sql.types import *
      from pyspark.sql.functions import *
      print("Starting Delta table creation")
      data = [
          ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
          ("Eddard", "Stark", "Stark", "Winterfell", 46),
          ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29),
          ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
          ("Eddard", "Stark", "Stark", "Winterfell", 46),
          ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29),
          ("Robert", "Baratheon", "Baratheon", "Storms End", 48),
          ("Eddard", "Stark", "Stark", "Winterfell", 46),
          ("Jamie", "Lannister", "Lannister", "Casterly Rock", 29)
              ]
      schema = StructType([
          StructField("firstname", StringType(), True),
          StructField("lastname", StringType(), True),
          StructField("house", StringType(), True),
          StructField("location", StringType(), True),
          StructField("age", IntegerType(), True)
      ])
      sample_dataframe = spark.createDataFrame(data=data, schema=schema)
      sample_dataframe.write.format('delta').mode("overwrite").option('mergeSchema','true').saveAsTable("db_delta.sample_data")
    4. Change the cell type to SQL and run the following statement to query the data in the sample_data table.

      SELECT * FROM db_delta.sample_data;
  8. To analyze the sample_data table with Spark SQL in the AnalyticDB for MySQL console, follow these steps.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development > SQL Development. Select the Spark engine and an Interactive resource group.

    3. Query the sample_data table.

      SELECT * FROM db_delta.sample_data LIMIT 1000;

Related documentation

Notebook (legacy): Learn more about the Notebook feature.