To develop a Spark JAR job, write your business logic, build it into a JAR package, and upload the package to EMR Serverless Spark. This topic uses two examples to demonstrate how to develop and deploy these jobs.
Prerequisites
-
You have created a workspace. For more information, see Manage workspaces.
-
You have developed your business application and built it into a JAR package.
Procedure
Step 1: Develop the JAR package
EMR Serverless Spark does not provide an integrated development environment (IDE) for JAR development. You must code your Spark application and package it into a JAR file in a local or standalone development environment.
In your Maven project's pom.xml file, add the necessary Spark dependencies. Because the EMR Serverless Spark runtime environment already includes these dependencies, setting the scope to provided prevents duplicate packaging and version conflicts while keeping the dependencies available for compilation and testing.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
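One consequence of the provided scope is that the Spark classes are not bundled in your JAR, so they exist at run time only where the environment supplies them. The following is a minimal sketch of a classpath probe you could run locally to see whether Spark is available before launching a main class outside EMR Serverless Spark. The class name ClasspathCheck and the method isOnClasspath are hypothetical helpers introduced for illustration; they are not part of Spark or EMR.

```java
/** Hypothetical helper: reports whether a class can be loaded from the current classpath. */
final class ClasspathCheck {

    // Returns true if the named class can be loaded (without initializing it).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With provided scope, this is typically false when you run the bare JAR locally.
        String sparkSession = "org.apache.spark.sql.SparkSession";
        System.out.println(sparkSession + " on classpath: " + isOnClasspath(sparkSession));
    }
}
```

If the probe reports false on your workstation, that is expected for a JAR built with provided dependencies; the classes are supplied by the runtime environment instead.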
Query DLF table
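// Package matches the Main Class value org.example.HiveTableAccess used in Step 3.
package org.example;

import org.apache.spark.sql.SparkSession;

public class HiveTableAccess {
    public static void main(String[] args) {
        // Hive support lets Spark SQL resolve tables registered in the metastore.
        SparkSession spark = SparkSession.builder()
                .appName("DlfTableAccessExample")
                .enableHiveSupport()
                .getOrCreate();
        // Print the contents of test_table to the driver's stdout.
        spark.sql("SELECT * FROM test_table").show();
        spark.stop();
    }
}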
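The example above hard-codes test_table. If you would rather pass the table name as a program argument, one possible approach is sketched below. TableQuery and buildQuery are hypothetical names, and the validation pattern is an assumption about which identifiers you want to allow; the returned string would be handed to spark.sql(...) in place of the literal query.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: builds the SELECT statement from an optional table-name argument. */
final class TableQuery {

    // Accepts names such as test_table or db.test_table; anything else is rejected.
    private static final Pattern TABLE_NAME =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?");

    static String buildQuery(String[] args) {
        // Fall back to the table used in the original example when no argument is given.
        String table = (args.length >= 1) ? args[0] : "test_table";
        if (!TABLE_NAME.matcher(table).matches()) {
            throw new IllegalArgumentException("Unexpected table name: " + table);
        }
        return "SELECT * FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(args));
    }
}
```

Validating the name before concatenating it keeps arbitrary SQL fragments out of the statement.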
Calculate pi (π)
// Package matches the Main Class value org.example.JavaSparkPi used in Step 3.
package org.example;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation to pi
 * Usage: JavaSparkPi [partitions]
 */
public final class JavaSparkPi {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession
                .builder()
                .appName("JavaSparkPi")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Number of partitions; defaults to 2 when no argument is passed.
        int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
        int n = 100000 * slices;
        List<Integer> l = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            l.add(i);
        }
        JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

        // Sample one random point per element and count those inside the unit circle.
        int count = dataSet.map(integer -> {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            return (x * x + y * y <= 1) ? 1 : 0;
        }).reduce((integer, integer2) -> integer + integer2);

        System.out.println("Pi is roughly " + 4.0 * count / n);
        spark.stop();
    }
}
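The estimate works because the fraction of uniformly random points in the square [-1, 1] x [-1, 1] that fall inside the unit circle approaches the ratio of the two areas, π/4. As a rough illustration, the same sampling logic can be run without Spark to sanity-check the arithmetic on a workstation. LocalPiCheck and estimatePi are hypothetical names, and the seeded java.util.Random is an assumption made here for repeatability; the Spark example uses Math.random().

```java
import java.util.Random;

/** Hypothetical single-JVM version of the sampling loop, for local sanity checks only. */
final class LocalPiCheck {

    // Estimates pi from `samples` random points, using a seeded generator for repeatability.
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        // inside / samples approximates pi / 4.
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        // 200000 samples matches the Spark example's default of 100000 * 2 slices.
        System.out.println("Pi is roughly " + estimatePi(200_000, 42L));
    }
}
```

The Spark job distributes this same loop across partitions and sums the per-partition counts with reduce.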
Click SparkExample-1.0-SNAPSHOT.jar to download the test JAR package.
Step 2: Upload the JAR package
-
Navigate to the artifact upload page.
-
Log on to the EMR console.
-
In the left-side navigation pane, choose EMR Serverless > Spark.
-
On the Spark page, click the name of your target workspace.
-
In the left-side navigation pane of the EMR Serverless Spark page, click Files.
-
On the Files page, click Upload File.
-
In the Upload File dialog box, click the upload area to select your local JAR package, or drag the package into the upload area.
This example uses SparkExample-1.0-SNAPSHOT.jar.
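Before uploading, it can be useful to confirm that the JAR actually contains the class you plan to enter as the main class later. The sketch below uses the standard java.util.jar API to look for the corresponding entry. JarEntryCheck, toEntryName, and containsClass are hypothetical helper names; the check only confirms that the entry exists, not that the class has a valid main method.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarFile;

/** Hypothetical pre-upload check: does the JAR contain the given class? */
final class JarEntryCheck {

    // Converts org.example.JavaSparkPi to org/example/JavaSparkPi.class.
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Opens the JAR and looks up the entry for the fully qualified class name.
    static boolean containsClass(Path jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            return jar.getJarEntry(toEntryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: JarEntryCheck <path-to-jar> <fully.qualified.ClassName>");
            return;
        }
        System.out.println(containsClass(Paths.get(args[0]), args[1]));
    }
}
```

For the JAR in this example, you might run it with SparkExample-1.0-SNAPSHOT.jar and org.example.JavaSparkPi as the two arguments.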
Step 3: Develop and run the job
-
In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.
-
On the Development tab, click the icon.
-
Enter a name, select as the type, and then click OK.
-
In the upper-right corner, select a resource queue.
For more information about how to add a queue, see Manage resource queues.
-
In the new job editor, configure the following parameters. Leave the rest at their default settings, and then click Run.
Parameter
Description
Main JAR Resource
Select the JAR package you uploaded. This example uses SparkExample-1.0-SNAPSHOT.jar.
Main Class
The main class to run for the Spark job.
-
To calculate the approximate value of pi, enter org.example.JavaSparkPi.
-
To query the Hive table, enter org.example.HiveTableAccess.
-
After the job completes, go to the Execution Records section at the bottom and click Logs in the Actions column.
On the Logs panel, choose Driver log > Stdout. The output Pi is roughly 3.1403 confirms that the Spark job ran successfully. The example output for the table query is a formatted table containing one record: id=1, name=jay.
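If you copy the Stdout text out of the Logs panel and want to check the pi result programmatically instead of by eye, something like the following sketch could work. PiLogCheck, extractEstimate, and looksPlausible are hypothetical names, and the tolerance is an arbitrary assumption; the only thing taken from the job is the "Pi is roughly" prefix it prints.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls the pi estimate out of copied driver Stdout text. */
final class PiLogCheck {

    // Matches the line printed by the example job, for instance "Pi is roughly 3.1403".
    private static final Pattern PI_LINE = Pattern.compile("Pi is roughly (\\d+(?:\\.\\d+)?)");

    // Returns the first estimate found in the text, or empty if the line is absent.
    static OptionalDouble extractEstimate(String stdout) {
        Matcher m = PI_LINE.matcher(stdout);
        return m.find()
                ? OptionalDouble.of(Double.parseDouble(m.group(1)))
                : OptionalDouble.empty();
    }

    // True when an estimate is present and within `tolerance` of Math.PI.
    static boolean looksPlausible(String stdout, double tolerance) {
        OptionalDouble estimate = extractEstimate(stdout);
        return estimate.isPresent()
                && Math.abs(estimate.getAsDouble() - Math.PI) <= tolerance;
    }

    public static void main(String[] args) {
        System.out.println(looksPlausible("Pi is roughly 3.1403", 0.05));
    }
}
```

Because the job samples random points, the printed value differs slightly between runs, which is why the check compares against a tolerance rather than an exact number.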
Step 4: Publish the job
Published jobs can be used as tasks in a workflow.
-
After the job completes, click Publish on the right.
-
In the dialog box that appears, enter the publishing information and click OK.
(Optional) Step 5: View the Spark UI
After a job runs, you can view its execution details in the Spark UI.
-
In the left-side navigation pane, click Job History.
-
On the Application page, find the target job and click Spark UI in the Actions column.
-
The page shows details such as User: root, Scheduling Mode: FIFO, and Completed Jobs: 1. The job entry displays an ID of 0, a description of reduce at SparkPi.scala:38, a duration of 4s, and progress for stages (1/1) and tasks (2/2). These details confirm that the Spark Pi job executed successfully.
Related documents
After you publish the job, you can use it as a task in a workflow. For more information, see Manage workflows. For a complete job orchestration example, see Get Started with SparkSQL Development.