You can write Python scripts with custom business logic and upload them to EMR Serverless Spark. This topic guides you through the PySpark development process with an example.
Prerequisites
-
You have an Alibaba Cloud account. For more information, see Account registration.
-
You have been granted the required roles. For more information, see Role authorization for an Alibaba Cloud account.
-
You have created a workspace. For more information, see Create a workspace.
Procedure
Step 1: Prepare the sample files
In EMR Serverless Spark, you can develop Python files on your local machine or on a separate development platform and then submit them as jobs. To help you get started quickly, this guide provides sample files that you can download and use.
Click DataFrame.py and employee.csv to download the sample files.
-
DataFrame.py is a PySpark script that uses Apache Spark to process data stored in OSS.
-
employee.csv is a data file that contains employee names, departments, and salaries.
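The contents of DataFrame.py are not reproduced in this topic. As a rough illustration only, a script of this kind might look like the sketch below. It assumes that the data file path arrives as the first command-line argument, that the file has no header row, that its three columns appear in the order employee_name, department, salary, and that "aggregated by department" means a total and an average. The application name and the output column names are hypothetical.

```python
import sys


def main(argv):
    """Read the CSV at argv[1], print its rows, then print per-department aggregates."""
    if len(argv) < 2:
        print("Usage: DataFrame.py <path-to-employee.csv>", file=sys.stderr)
        return 2

    # Imported here so the argument check above also works where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Assumption: three columns in this order and no header row.
    schema = StructType([
        StructField("employee_name", StringType(), True),
        StructField("department", StringType(), True),
        StructField("salary", DoubleType(), True),
    ])

    spark = SparkSession.builder.appName("employee-salary-sketch").getOrCreate()
    try:
        df = spark.read.csv(argv[1], schema=schema, header=False)
        df.show()  # employee salary details

        # Assumption: the aggregation is a total and an average per department.
        (df.groupBy("department")
           .agg(F.sum("salary").alias("total_salary"),
                F.avg("salary").alias("avg_salary"))
           .orderBy("department")
           .show())
    finally:
        spark.stop()
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Reading the path from the command line keeps the script independent of any particular bucket, so the same file can be reused with a different data location.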
Step 2: Upload the sample files
-
Upload the Python file to EMR Serverless Spark.
-
Go to the resource upload page.
-
Log on to the EMR console.
-
In the left-side navigation pane, choose EMR Serverless > Spark.
-
On the Spark page, click the name of your target workspace.
-
On the EMR Serverless Spark page, click Files in the left-side navigation pane.
-
On the Files page, click Upload File.
-
In the Upload File dialog box, click the upload area to select a Python file, or drag the file into the upload area.
In this example, upload DataFrame.py.
-
Upload the data file (employee.csv) to OSS. For more information, see Upload files.
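The job later refers to the uploaded data file by a path of the form oss://<yourBucketName>/employee.csv. The following minimal sketch builds and checks such a path with the Python standard library. The helper names `oss_uri` and `split_oss_uri` and the bucket name `my-example-bucket` are hypothetical, and the check covers only the general shape of the path, not the bucket naming rules of OSS.

```python
from urllib.parse import urlparse


def oss_uri(bucket, key):
    """Build an oss://<bucket>/<key> path from a bucket name and an object key."""
    if not bucket or "/" in bucket:
        raise ValueError(f"invalid bucket name: {bucket!r}")
    return f"oss://{bucket}/{key.lstrip('/')}"


def split_oss_uri(uri):
    """Split an oss:// path into (bucket, key); raise ValueError for anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "oss" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"expected oss://<bucket>/<key>, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name; replace it with the bucket you uploaded the file to.
path = oss_uri("my-example-bucket", "employee.csv")
print(path)
print(split_oss_uri(path))
```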
Step 3: Develop and run the job
-
On the EMR Serverless Spark page, click Data Development in the left-side navigation pane.
-
On the Development tab, click the icon to create a job.
-
In the dialog box, enter a name, select PySpark as the type, and click OK.
-
In the upper-right corner, select a queue.
For more information about how to add a queue, see Manage resource queues.
-
On the new development tab, configure the following parameters, leave the rest at their default values, and click Run.
| Parameter | Description |
| --- | --- |
| Main Python Resources | Select the Python file that you uploaded on the Files page in the previous step. In this example, select DataFrame.py. |
| Execution Parameters | Enter the OSS path to the employee.csv data file. Example: oss://<yourBucketName>/employee.csv. |
-
After the job runs, in the Execution Records section below, click Log Exploration in the Actions column of the job.
-
On the Log Exploration tab, you can view the log information.
This tab includes the driver log, executor log, and startup log sub-tabs. Each sub-tab supports three output types: Stdout, Stderr, and Log4j. In this example, the Stdout output of the driver log shows the Spark DataFrame query results, including employee salary details (employee_name, department, salary) and salary data aggregated by department.
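To cross-check the aggregated figures in the driver Stdout, you could compute a per-department summary locally with the Python standard library. The sketch below is only an illustration. The rows in `SAMPLE` are hypothetical and are not the contents of the sample employee.csv, the column order name, department, salary is assumed, and total and average are assumed to be the aggregates of interest.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; not the contents of the sample file.
SAMPLE = """Alice,Engineering,100
Bob,Engineering,200
Carol,Sales,150
"""


def aggregate_by_department(lines):
    """Return {department: (total_salary, average_salary)} for name,department,salary rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _name, department, salary = row
        totals[department] += float(salary)
        counts[department] += 1
    return {dept: (totals[dept], totals[dept] / counts[dept]) for dept in totals}


result = aggregate_by_department(io.StringIO(SAMPLE))
for dept in sorted(result):
    total, average = result[dept]
    print(f"{dept}: total={total:.1f} avg={average:.1f}")
```

To run the same check against a downloaded copy of the data file, pass an open file object instead of the `io.StringIO` wrapper, for example `aggregate_by_department(open("employee.csv", newline=""))`.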
Step 4: Publish the job
A published job can be used as a task in a workflow node.
-
After the job completes, click Publish on the right side of the Development page.
-
In the job publishing dialog box, enter the publishing information and click OK.
Step 5: View the Spark UI
After the job runs successfully, you can view its execution details in the Spark UI.
-
In the left-side navigation pane, click Job History.
-
On the Application page, click Spark UI in the Actions column of the target job.
-
On the Spark Jobs page, you can view the job details.
The page displays basic application information (such as User: root, Total Uptime, and Scheduling Mode: FIFO) and the Completed Jobs list. The table includes columns such as Job Id, Description, Submitted, Duration, Stages: Succeeded/Total, and Tasks. For each job, you can view its description, duration, and stage and task completion details.
Related topics
-
After a job is published, you can schedule it in a workflow. For more information, see Manage workflows. For a complete example of the job orchestration development process, see Get started with Spark SQL development.
-
For an example of developing a PySpark streaming job, see Submit a PySpark streaming job using Serverless Spark.