Quick start for PySpark development


You can write Python scripts with custom business logic and upload them to EMR Serverless Spark. This topic guides you through the PySpark development process with an example.

Prerequisites

Procedure

Step 1: Prepare the sample files

With EMR Serverless Spark, you can develop Python files locally or on a separate platform and then submit them as jobs. To help you get started quickly, this guide provides sample files that you can download and use.

Click DataFrame.py and employee.csv to download the sample files.

Note
  • DataFrame.py is a Python script that uses Apache Spark to process data stored in OSS.

  • employee.csv is a data file that contains employee names, departments, and salaries.
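The contents of DataFrame.py are not reproduced in this topic. As a rough illustration only, a script that matches the description above might look like the following sketch. It assumes that employee.csv has no header row, that its columns are employee_name, department, and salary in that order, and that the aggregation is a per-department salary total. The names resolve_input_path, main, and total_salary are hypothetical, and the data file path is assumed to arrive as the first command-line argument.

```python
import sys


def resolve_input_path(argv):
    """Return the data file path passed as the first job argument, or None."""
    return argv[1] if len(argv) > 1 else None


def main(argv):
    input_path = resolve_input_path(argv)
    if input_path is None:
        print("usage: DataFrame.py oss://<yourBucketName>/employee.csv")
        return

    # Imported here so the helper above can be used without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("EmployeeSalaryExample").getOrCreate()

    # Assumption: no header row; three columns in this order.
    df = spark.read.csv(
        input_path,
        schema="employee_name STRING, department STRING, salary DOUBLE",
        header=False,
    )

    # Print the employee salary details to the driver Stdout.
    df.show()

    # Print the salary total per department (assumed aggregation).
    df.groupBy("department").agg(F.sum("salary").alias("total_salary")).show()

    spark.stop()


if __name__ == "__main__":
    main(sys.argv)
```

The sample file that you download remains the authoritative version. This sketch only shows the general shape of a PySpark batch script that reads a CSV file and prints query results.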

Step 2: Upload the sample files

  1. Upload the Python file to EMR Serverless Spark.

    1. Go to the resource upload page.

      1. Log on to the EMR console.

      2. In the left-side navigation pane, choose EMR Serverless > Spark.

      3. On the Spark page, click the name of your target workspace.

      4. On the EMR Serverless Spark page, click Files in the left-side navigation pane.

    2. On the Files page, click Upload File.

    3. In the Upload File dialog box, click the upload area to select a Python file, or drag the file into the upload area.

      In this example, upload DataFrame.py.

  2. Upload the data file (employee.csv) to OSS. For more information, see Upload files.
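If you prefer to upload the data file from a script instead of the console, the following is a minimal sketch that uses the OSS Python SDK (oss2). It is not part of the original procedure. The environment variable names OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET, the helper names oss_uri and upload_to_oss, and the endpoint placeholder are assumptions that you need to adapt to your account and region.

```python
import os


def oss_uri(bucket_name, object_key):
    """Build the oss:// URI of an object from its bucket name and key."""
    return f"oss://{bucket_name}/{object_key.lstrip('/')}"


def upload_to_oss(local_path, bucket_name, object_key, endpoint):
    """Upload a local file to OSS and return its oss:// URI."""
    # Imported lazily so that oss_uri() works without the SDK installed.
    import oss2

    # Assumption: credentials are supplied through these environment variables.
    auth = oss2.Auth(
        os.environ["OSS_ACCESS_KEY_ID"],
        os.environ["OSS_ACCESS_KEY_SECRET"],
    )
    bucket = oss2.Bucket(auth, endpoint, bucket_name)
    bucket.put_object_from_file(object_key, local_path)
    return oss_uri(bucket_name, object_key)


# Example call (placeholders must be replaced before running):
# upload_to_oss("employee.csv", "<yourBucketName>", "employee.csv",
#               "https://oss-<yourRegion>.aliyuncs.com")
```

The returned URI has the same form as the OSS path that the job expects for the data file.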

Step 3: Develop and run the job

  1. On the EMR Serverless Spark page, click Data Development in the left-side navigation pane.

  2. On the Development tab, click the icon for creating a new job.

  3. In the dialog box, enter a name, select Batch Job > PySpark as the type, and click OK.

  4. In the upper-right corner, select a queue.

    For more information about how to add a queue, see Manage resource queues.

  5. On the new development tab, configure the following parameters, leave the rest at their default values, and click Run.

    Parameter | Description
    Main Python Resources | Select the Python file that you uploaded on the Files page in the previous step. In this example, select DataFrame.py.
    Execution Parameters | Enter the OSS path to the employee.csv data file. Example: oss://<yourBucketName>/employee.csv.

  6. After the job runs, in the Execution Records section below, click Log Exploration in the Actions column of the job.

  7. On the Log Exploration tab, you can view the log information.

    This tab includes the driver log, executor log, and startup log sub-tabs. Each sub-tab supports three output types: Stdout, Stderr, and Log4j. In this example, the Stdout output of the driver log shows the Spark DataFrame query results, including employee salary details (employee_name, department, salary) and salary data aggregated by department.

Step 4: Publish the job

Important

A published job can be used as a task in a workflow node.

  1. After the job completes, click Publish on the right side of the Development page.

  2. In the job publishing dialog box, enter the publishing information and click OK.

Step 5: View the Spark UI

After the job runs successfully, you can view its execution details in the Spark UI.

  1. In the left-side navigation pane, click Job History.

  2. On the Application page, click Spark UI in the Actions column of the target job.

  3. On the Spark Jobs page, you can view the job details.

    The page displays basic application information (such as User: root, Total Uptime, and Scheduling Mode: FIFO) and the Completed Jobs list. The table includes columns such as Job Id, Description, Submitted, Duration, Stages: Succeeded/Total, and Tasks. For each job, you can view its description, duration, and stage and task completion details.

Related topics