The Python environment for EMR Serverless Spark includes matplotlib, NumPy, and pandas by default. To use other third-party libraries, you must create a runtime environment that packages the required libraries.
Prerequisites
You have created a workspace. For more information, see Manage workspaces.
Create a runtime environment
-
Go to the runtime environment management page.
-
Log on to the E-MapReduce console.
-
In the left-side navigation pane, choose EMR Serverless > Spark.
-
On the Spark page, click the name of the target workspace.
-
On the EMR Serverless Spark page, click Runtime Environments in the left-side navigation pane.
-
-
Click Create Runtime Environment.
-
On the Create Runtime Environment page, configure the following parameters.
Parameter | Required | Description
Name | Yes | Enter a name for the runtime environment.
Description | No | Enter a description for the runtime environment.
Queue for Environment Initialization | Yes | Select a resource queue for initialization. Creating the runtime environment consumes 1 Core and 4 GB of resources from this queue. These resources are released automatically after initialization.
Normal Network Connection | No | If you need to add PyPI libraries from a source other than the Alibaba Cloud source, select an appropriate network connection. The runtime environment uses this connection to access the source address during creation. For more information about how to create a network connection, see Establish network connectivity between EMR Serverless Spark and other VPCs.
Python version | Yes | It defaults to Python 3.8. You can select another version based on your business requirements. Ensure that the selected Python version is compatible with your target Python libraries to prevent packaging failures or runtime errors caused by version mismatches.
-
Add library information.
-
Click Add Library.
-
In the Create Library dialog box, select a Source Type, configure the related parameters, and then click OK.
Parameter
Description
PyPI
-
PyPI Package: Enter the library name and, optionally, the version. If you omit the version, the system installs the latest version. The Alibaba Cloud source is used by default.
For example, Plotly or Plotly==4.9.0.
-
Package Source: Specify a custom PyPI source URL. If you leave this field blank, it defaults to the Alibaba Cloud source. If you use a custom source, ensure that you have selected an appropriate network connection.
Workspace
From the Workspace drop-down list, select a file resource from the current workspace. If no resources are available, upload one on the Files page.
Supported file types: .zip, .tar, .whl, .tar.gz, .jar, and .txt.
Note: If you select a .txt file, the system treats it as a requirements file and installs the Python libraries and versions listed in the file.
OSS Resource
In the OSS Resource field, enter the path of a file stored in Object Storage Service (OSS).
Supported file types: .zip, .tar, .whl, .tar.gz, .jar, and .txt.
Note: If you specify a .txt file, the system treats it as a requirements file and installs the Python libraries and versions listed in the file.
-
-
Click Create.
The environment begins initializing after creation.
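A .txt file selected as a Workspace or OSS resource is treated as a requirements file, so it can help to generate that file from a pinned list of libraries. The following is a minimal sketch, assuming the usual pip-style line format (`name` or `name==version`) shown in the PyPI Package example. The helper names and the `scikit-learn` entry are illustrative and do not come from the product.

```python
from typing import Dict, Optional


def format_requirement(name: str, version: Optional[str] = None) -> str:
    """Return one requirements line: 'name' or 'name==version'."""
    name = name.strip()
    if not name:
        raise ValueError("library name must not be empty")
    # Omitting the version leaves the choice of version to the installer.
    return f"{name}=={version.strip()}" if version else name


def build_requirements(libraries: Dict[str, Optional[str]]) -> str:
    """Join requirement lines into the text of a requirements file."""
    lines = [format_requirement(name, version) for name, version in libraries.items()]
    return "\n".join(lines) + "\n"


# Illustrative library list: one pinned entry, one unpinned entry.
requirements_text = build_requirements({"Plotly": "4.9.0", "scikit-learn": None})
print(requirements_text)
```

Save the printed text to a .txt file, then upload it on the Files page or to OSS before selecting it in the Create Library dialog box.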
Edit a runtime environment
You can edit a runtime environment to update the libraries it contains.
-
On the Runtime Environments page, find the target runtime environment and click Edit in the Actions column.
-
On the Modify Runtime Environment page, update the configuration of the runtime environment.
-
Click Save Changes.
Saving the changes re-initializes the environment based on the new configuration.
Note: After an environment is re-initialized, the changes do not apply to active Notebook sessions immediately. To use the latest runtime environment in a Notebook session, you must restart the Notebook session resources.
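After restarting a Notebook session, one way to confirm that the session sees the updated libraries is to list the installed versions from a notebook cell. This is a sketch that uses only the Python standard library (`importlib.metadata`, available in Python 3.8 and later). The package names are examples, and the function name is hypothetical.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example package names; replace them with the libraries in your environment.
report = installed_versions(["pandas", "plotly"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If a library still reports an old version or is missing, the session is probably still running on the environment as it was before re-initialization.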
Use a runtime environment
Once a runtime environment is in the Ready state, you can use it for data development or in corresponding sessions.
-
PySpark batch job: When a job starts, the system pre-installs the necessary libraries from the selected runtime environment.
-
Job orchestration: When you add a Notebook node to a workflow, you can select the corresponding runtime environment.
-
Notebook session: When a Notebook session starts, the system pre-installs libraries based on the selected environment.
-
Livy Gateway: When you submit a job through Livy Gateway, the system pre-configures the resources required to run the job based on the selected environment.
-
Spark Submit, Apache Airflow, or Livy: When you submit a job with one of these tools, specify the runtime environment by passing the environment ID as a configuration parameter:
--conf spark.emr.serverless.environmentId=<environment_id>
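When jobs are submitted from a script, the environment ID setting can be added to the argument list programmatically. The sketch below only assembles the arguments. The launcher name `spark-submit`, the script name `job.py`, and the ID `YOUR_ENVIRONMENT_ID` are placeholders, and the function name is hypothetical; only the configuration key comes from this document.

```python
import shlex
from typing import List

# Configuration key documented for selecting a runtime environment.
ENV_CONF_KEY = "spark.emr.serverless.environmentId"


def build_submit_command(app_path: str, environment_id: str,
                         launcher: str = "spark-submit") -> List[str]:
    """Build a submit command that passes the runtime environment ID via --conf."""
    if not environment_id:
        raise ValueError("environment_id must not be empty")
    # Options go before the application file so the launcher parses them
    # instead of handing them to the application as arguments.
    return [launcher, "--conf", f"{ENV_CONF_KEY}={environment_id}", app_path]


# Placeholder values; substitute your own script and environment ID.
command = build_submit_command("job.py", "YOUR_ENVIRONMENT_ID")
print(shlex.join(command))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting issues.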