Quick start for SparkSQL development

更新时间:
复制 MD 格式

EMR Serverless Spark lets you edit and run tasks using SQL code. This topic guides you through creating, running, and managing SQL tasks.

Prerequisites

Step 1: Create and publish a development task

Important

A task must be published before it can be used for task orchestration.

  1. Navigate to the Data Development page.

    1. Log on to the EMR console.

    2. In the navigation pane on the left, choose EMR Serverless > Spark.

    3. On the Spark page, click the target workspace.

    4. On the EMR Serverless Spark page, click Data Development in the navigation pane on the left.

  2. Create a task named users_task.

    1. On the Development tab, click the image icon.

    2. In the Create dialog box, enter a name such as users_task, set the Type to the default SparkSQL, and then click OK.

    3. Copy the following code to the new Spark SQL tab (users_task).

      CREATE TABLE IF NOT EXISTS students (
        name VARCHAR(64),          
        address VARCHAR(64)       
      )
      USING PARQUET  
      PARTITIONED BY (data_date STRING)              
      OPTIONS (
        'path'='oss://<bucketname>/path/'
      );
      INSERT OVERWRITE TABLE students PARTITION (data_date = '${ds}') VALUES
        ('Ashua Hill', '456 Erica Ct, Cupertino'),
        ('Brian Reed', '723 Kern Ave, Palo Alto');

      The following basic date variables are supported. The default value is the previous day.

      Variable

      Data type

      Description

      {data_date}

      str

      A variable that indicates date information. The format is YYYY-MM-DD.

      For example, 2023-09-18.

      {ds}

      str

      {dt}

      str

      {data_date_nodash}

      str

      A variable that indicates date information. The format is YYYYMMDD.

      For example, 20230918.

      {ds_nodash}

      str

      {dt_nodash}

      str

      {ts}

      str

      A variable that indicates a timestamp. The format is YYYY-MM-DDTHH:MM:SS.

      For example, 2023-09-18T16:07:43.

      {ts_nodash}

      str

      A variable that indicates a timestamp. The format is YYYYMMDDHHMMSS.

      For example, 20230918160743.

    4. Select a database from the database drop-down list and a running session instance from the session drop-down list.

      You can also create a new session instance by selecting Create SQL Session from the drop-down list. For more information about session management, see Manage SQL sessions.

    5. Click Run to execute the task.

      The Execution Results tab displays the results. If an error occurs, you can view the details on the Execution Issues tab.

  3. Publish the users_task task.

    Note

    The parameters specified for the task are published with the task and are used when the task runs in production. Tasks executed in the SQL editor use the parameters from the session.

    1. On the new Spark SQL tab, click Publish.

    2. In the dialog box, enter the publishing information and click OK.

  4. Create a users_count task.

    1. On the Development tab, click the image icon.

    2. In the Create dialog box, enter a name such as users_count, use the default Type, SparkSQL, and click OK.

    3. Copy the following code to the new Spark SQL task tab (users_count).

      SELECT COUNT(1) FROM students;
    4. Select a database from the database drop-down list and a running session instance from the session drop-down list.

      You can also select Create SQL Session from the drop-down list to create a new session. For more information about session management, see Manage SQL sessions.

    5. Click Run to execute the task.

      The results are displayed on the Execution Results tab. If an error occurs, you can view the details on the Execution Issues tab.

  5. Publish the users_count task.

    Note

    The parameters specified for the task are published with the task and are used when the task runs in production. Tasks executed in the SQL editor use the parameters from the session.

    1. On the new Spark SQL task tab, you can click Publish.

    2. In the dialog box, enter the publishing information and click OK.

Step 2: Create a workflow and its nodes

  1. In the left navigation pane, click Workflows.

  2. On the Workflows page, click Create Workflow.

  3. In the Create Workflow panel, for Name, enter a name such as spark_workflow_task, and click Next.

    You can configure the parameters in the Other Settings section as needed. For more information about the parameters, see Manage workflows.

  4. Add the users_task node.

    1. On the new node canvas, click Add Node.

    2. In the Add Node panel, select the published task users_task from the Source File Path drop-down list, and then click Save.

  5. Add the users_count node.

    1. Click Add Node.

    2. In the Add Node panel, select the published task, users_count, from the Source File Path drop-down list, select users_task from the Upstream Node drop-down list, and then click Save.

  6. On the new node canvas, click Publish Workflow.

    The canvas now contains two SQL nodes, users_task and users_count, with an arrow connecting them to indicate the execution dependency.

  7. In the Publish dialog box, enter the publication details and click OK.

Step 3: Run the workflow

  1. On the Workflows page, click the new workflow's Name, such as spark_workflow_task.

  2. On the Workflow Runs page, click Run.

    Note

    After you configure a scheduling cycle, you can also trigger the scheduling on the Workflows page using the switch on the left.

  3. In the Run Workflow dialog box, click OK.

Step 4: View the instance status

  1. On the Workflows page, click the target workflow, such as spark_workflow_task.

  2. On the Workflow Runs page, you can view a list of all workflow instances with their runtime, status, and other details.

    A column chart at the top of the page shows the distribution of workflow run statuses by date. The table below includes columns such as Workflow Run ID, Version, Run Mode, Number of Nodes, Scheduled Time, Start Time, and End Time.

  3. To view the instance graph, in the Workflow Runs section, click a Workflow Run ID, or click the Workflow Run Graph tab at the top.

  4. Click a node instance. In the node information box that appears, you can perform operations or view information as needed.

    For more information about related operations and details, see View node instances.

    The node information box displays run details such as the Start Time, End Time, Status, and Runtime of the task. It also provides buttons for operations such as Rerun, Stop, and Set to Success.

    For example, click Spark UI to open the Spark Jobs page and view real-time information about the Spark task.

    The Spark Jobs page displays a list of completed jobs with columns such as Job Id, Description, Submitted, Duration, Stages: Succeeded/Total, and Tasks: Succeeded/Total. You can check the execution status and duration of each job.

  5. Click a Job Run ID to open the Job History page, where you can view metrics, diagnostics, and log information.

    On the Overview tab, in the Basic Information section, you can view information such as the run status, task name, task run ID, task type, database engine version, and Fusion acceleration status.

Step 5: Workflow O&M

On the Workflows page, click the name of a workflow to open the Workflow Runs page. You can:

  • In the Workflow Information section, you can edit the parameters.

  • In the Workflow Runs section, you can view all workflows. Click a Workflow Run ID to open the instance graph.

    The workflow instance graph page shows the dependencies and run statuses of the task nodes. In the example, users_task (SQL) is the ancestor node and users_count (SQL) is the descendant node. A connector between them indicates the dependency. Both nodes show a successful run status (green check mark).

Step 6: View data

  1. In the left navigation pane, click Data Development.

  2. Create a new SparkSQL development task. Then, enter and run the following command to view the table details.

    SELECT * FROM students;

    The following information is returned.

    The result contains three columns: name, address, and data_date. There are two records: Ashua Hill (address: 456 Erica Ct, Cupertino, date: 2024-01-01) and Brian Reed (address: 723 Kern Ave, Palo Alto, date: 2024-01-01).

References