Develop Spark applications with PySpark (AnalyticDB for MySQL)

Prerequisites

An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
A job resource group is created for the AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.
A database account is created for the AnalyticDB for MySQL cluster.
- If you use an Alibaba Cloud account, you only need to create a privileged account.
- If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.

Basic PySpark usage

Write the following sample code and save it as example.py.

from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.sql("SELECT 1+1")
    df.printSchema()
    df.show()

Upload example.py to OSS. For more information, see Upload a file.
Go to the Spark development editor.
1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
2. In the left-side navigation pane, choose Job Development > Spark JAR Development.
At the top of the editor window, select a job resource group and a Spark job type. This topic uses the Batch type as an example.

Enter the following job configuration in the editor.

{
 "name": "Spark Python Test",
 "file": "oss://testBucketName/example.py",
 "conf": {
 "spark.driver.resourceSpec": "small",
 "spark.executor.instances": 1,
 "spark.executor.resourceSpec": "small"
 }
}

For parameter details, see Parameter description.
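
The job configuration is plain JSON, so it can also be generated from a script instead of typed by hand. The following is a minimal sketch that reproduces only the fields shown above. The helper name `build_batch_job` is hypothetical and is not part of any AnalyticDB SDK.

```python
import json


def build_batch_job(name, file, resource_spec="small", executors=1):
    """Return a job configuration dict with the fields used in this topic.

    build_batch_job is a hypothetical helper, not an AnalyticDB API.
    """
    return {
        "name": name,
        "file": file,
        "conf": {
            "spark.driver.resourceSpec": resource_spec,
            "spark.executor.instances": executors,
            "spark.executor.resourceSpec": resource_spec,
        },
    }


if __name__ == "__main__":
    job = build_batch_job("Spark Python Test", "oss://testBucketName/example.py")
    # Paste the printed JSON into the Spark editor.
    print(json.dumps(job, indent=1))
```

Generating the configuration this way keeps the driver and executor specifications in sync when the same values are reused across jobs.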

Use Python dependencies

This topic provides two solutions for building a cloud-based environment without requiring local packaging.

Solution type	Use cases	Pros and cons
Real-time installation	Debugging and temporary jobs. Few dependencies and fast download.	Pros: Simple to configure and ready to use. Cons: Dependencies are re-downloaded and installed each time the job runs.
Cloud build	Production environments. Many or large dependencies. Environments that require long-term reuse.	Pros: Package once, reuse indefinitely. Ensures fast startup and high stability. Cons: Requires an extra job to package dependencies.

Prerequisites

Before you configure the cloud environment, ensure the following requirements are met:

Cluster Status: The cluster is initialized and can successfully run the basic Spark Pi example.
Version Requirements:
- Spark version: 3.5.1 is supported.
- Python version: 3.9 or 3.11 is supported.
Key Constraint (NumPy version):
- Apache Spark must run in an environment with numpy < 2.0.0.
- The system enforces the installation of numpy==1.26.0.
- Important: Ensure that other dependencies you install, such as Pandas and SciPy, are compatible with NumPy 1.26.0. Otherwise, the job will fail.
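
The NumPy constraint can be checked from inside a job before any heavy work starts. The following is a minimal sketch, assuming a simple comparison of the major version number is sufficient. The function names are illustrative and are not provided by AnalyticDB or Spark.

```python
from importlib import metadata


def numpy_major(version):
    """Return the major version number from a string such as '1.26.0'."""
    return int(version.split(".")[0])


def is_spark_compatible_numpy(version):
    """Return True when the version satisfies the numpy < 2.0.0 constraint."""
    return numpy_major(version) < 2


def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_numpy_version()
    if found is None:
        print("numpy is not installed")
    elif is_spark_compatible_numpy(found):
        print(f"numpy {found} satisfies numpy < 2.0.0")
    else:
        print(f"numpy {found} violates numpy < 2.0.0")
```

Printing the result at the top of a job makes a version conflict visible in the log instead of surfacing later as an unrelated import error.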

Examples

Real-time installation

Prepare your application code

Write a Python script, for example, job.py, and upload it to an OSS path, such as oss://your-bucket/scripts/job.py.

# This sample script prints every importable module in the current Python environment.
import pkgutil
if __name__ == "__main__":
    for module_info in pkgutil.iter_modules():
        print(module_info.name)

Configure job parameters

## Sample job
{
    "file": "oss://your-bucket/scripts/job.py",  // Path to your code file
    "name": "Real-time Env Demo",
    "conf": {
        "spark.adb.version": "3.5",
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        // --- Start of core configuration ---
        // 1. Specify the Python version
        "spark.kubernetes.driverEnv.PYTHON_BIN": "python3.11",
        "spark.executorEnv.PYTHON_BIN": "python3.11",
        // 2. Specify the dependencies to install (for both Driver and Executor)
        "spark.kubernetes.driverEnv.PYTHON_MODULES": "chinesecalendar>=1.10.0,pandas>=1.5.3,lunar_python",
        "spark.executorEnv.PYTHON_MODULES": "chinesecalendar>=1.10.0,pandas>=1.5.3,lunar_python"
        // --- End of core configuration ---
    }
}

Important

Spark consists of a Driver (control node) and Executors (execution nodes). To ensure a consistent environment, you must configure the same environment variables for both.

Parameter	Description	Required	Default	Notes
spark.kubernetes.driverEnv.PYTHON_MODULES, spark.executorEnv.PYTHON_MODULES	A list of Python packages to install.	Yes	None	Separate multiple Python dependencies with commas (,). The format of the Python dependencies must fully comply with the requirements of the `PyPI` community. Python dependencies without version constraints must be placed at the end of the list. Example: `chinesecalendar>=1.10.0,dynaconf>=3.2.10,pandas>=1.5.3,lunar_python`
spark.kubernetes.driverEnv.PYTHON_BIN, spark.executorEnv.PYTHON_BIN	The Python version to use for the job.	No	python3.11	Valid values: python3.11, python3.9.
spark.kubernetes.driverEnv.INDEX_URL, spark.executorEnv.INDEX_URL	The URL of the `PyPI` repository.	No	http://mirrors.cloud.aliyuncs.com/pypi/simple/	The default value is the URL of a mirror hosted within Alibaba Cloud, which can be accessed over the internal network. If you specify an address that is only accessible over the public network, such as the PyPI mirror from Tsinghua University, you must enable public network access. For more information, see Configure public network access for a Spark application.
spark.kubernetes.driverEnv.TRUSTED_HOST, spark.executorEnv.TRUSTED_HOST	The domain of the `PyPI` repository to add as a trusted host.	No	mirrors.cloud.aliyuncs.com	Python verifies the SSL certificate of the PyPI repository during installation. If the repository's certificate is not from a trusted certificate authority (CA), use this parameter to mark the repository's domain as a trusted host. Important: Use this parameter with caution. Ensure the configured PyPI source is trustworthy, as dependency confusion attacks are a common threat.
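
Two rules from the table above are easy to get wrong by hand: packages without version constraints go last, and the driver and executor settings must match. The following is a minimal sketch that applies both rules. The helper names are hypothetical, and the constraint detection is a simple character check rather than a full requirement parser.

```python
# Characters that introduce a version constraint in a requirement specifier.
_CONSTRAINT_CHARS = "<>=!~"


def has_version_constraint(requirement):
    """Return True if the requirement string carries a version constraint."""
    return any(ch in requirement for ch in _CONSTRAINT_CHARS)


def python_modules_value(requirements):
    """Join requirements with commas, placing unconstrained names last."""
    pinned = [r for r in requirements if has_version_constraint(r)]
    unpinned = [r for r in requirements if not has_version_constraint(r)]
    return ",".join(pinned + unpinned)


def python_env_conf(requirements, python_bin="python3.11"):
    """Return identical PYTHON_BIN and PYTHON_MODULES settings for both roles."""
    modules = python_modules_value(requirements)
    conf = {}
    for prefix in ("spark.kubernetes.driverEnv.", "spark.executorEnv."):
        conf[prefix + "PYTHON_BIN"] = python_bin
        conf[prefix + "PYTHON_MODULES"] = modules
    return conf


if __name__ == "__main__":
    for key, value in python_env_conf(
        ["lunar_python", "chinesecalendar>=1.10.0", "pandas>=1.5.3"]
    ).items():
        print(f'"{key}": "{value}"')
```

The returned dictionary can be merged into the `conf` object of the job configuration.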

Run the sample job. You can view the log to see that the Python environment contains the declared dependencies and their transitive dependencies.

xxlimited_35
zlib
numpy
pandas
_distutils_hack
_virtualenv
chinese_calendar
dateutil
lunar_python
pip
pkg_resources
pytz
setuptools
six
tzdata
wheel
>>>>>>>> stderr:
25/12/25 17:01:56 INFO ShutdownHookManager: Shutdown hook called
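
The log above lists import names, which can differ from the PyPI package names: the package `chinesecalendar` appears as the module `chinese_calendar`. A job can check for the modules it needs before doing any work. The following is a minimal sketch using only the standard library. The function name `missing_modules` is illustrative.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Import names taken from the sample log, not PyPI package names.
    required = ["chinese_calendar", "pandas", "lunar_python"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable")
```

The check runs only where the script runs, which is the Driver. It does not confirm that the Executors have the same modules.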

Cloud build

This solution uses a dedicated job to package dependencies into a compressed archive, which is then uploaded to OSS for reuse in subsequent jobs.

Plan an OSS path

Specify an OSS path to store the packaged environment, for example, oss://your-bucket/envs/my_custom_env.

Submit a packaging job

Important

Do not modify the path of the built-in packaging script: local:///opt/tools/build_venv.py.
In args, specify the dependencies to install.

## Sample job
{
    // 1. Specify all dependencies to be packaged.
    "args": [
        "chinesecalendar>=1.10.0",
        "pandas>=1.5.3",
        "pyarrow>=19.0.1",
        "lunar_python"
    ],
    // 2. Call the built-in packaging script (do not modify).
    "file": "local:///opt/tools/build_venv.py",
    "name": "Build VirtualEnv Job",
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "medium",
        // 3. Specify the Python version.
        "spark.kubernetes.driverEnv.PYTHON_BIN": "python3.11",
        // 4. Specify the OSS path to upload the packaged environment (modify as needed).
        "spark.kubernetes.driverEnv.VENV_OSS_PATH": "oss://your-bucket/envs/my_custom_env",
        // 5. Specify the temporary directory for the build.
        "spark.kubernetes.driverEnv.VENV_DIR": "/tmp/build_test"
    }
}

Parameter	Description	Required	Default	Notes
spark.kubernetes.driverEnv.VENV_OSS_PATH	The storage path for the environment package.	Yes	None	Example: `oss://your-bucket/envs/my_custom_env`.
spark.kubernetes.driverEnv.VENV_DIR	The temporary build directory.	No	/tmp/venv	If the environment package is large, mount a data disk and change this path to `/user_data_dir`.
spark.kubernetes.driverEnv.PYTHON_BIN	The Python version to use for the job.	No	python3.11	Valid values: python3.11, python3.9.
spark.kubernetes.driverEnv.INDEX_URL	The URL of the `PyPI` repository.	No	http://mirrors.cloud.aliyuncs.com/pypi/simple/	The default value is the URL of a mirror hosted within Alibaba Cloud, which can be accessed over the internal network. If you specify an address that is only accessible over the public network, such as the PyPI mirror from Tsinghua University, you must enable public network access. For more information, see Configure public network access for a Spark application.
spark.kubernetes.driverEnv.TRUSTED_HOST	The domain of the `PyPI` repository to add as a trusted host.	No	mirrors.cloud.aliyuncs.com	Python verifies the SSL certificate of the PyPI repository during installation. If the repository's certificate is not from a trusted certificate authority (CA), use this parameter to mark the repository's domain as a trusted host. Important: Use this parameter with caution. Ensure the configured PyPI source is trustworthy, as dependency confusion attacks are a common threat.

Run the job

The logs show the archive upload details and a complete list of packages and their versions installed in the virtual environment.

------------------ --------------
chinesecalendar    1.11.0
lunar_python       1.4.8
numpy              1.26.0
pandas             2.3.3
pip                24.2
pyarrow            22.0.0
python-dateutil    2.9.0.post0
pytz               2025.2
setuptools         75.1.0
six                1.17.0
tzdata             2025.3
wheel              0.44.0
Uploading archive to oss://xxx/envs/my_custom_env/venv_20251225174458.tar.gz
+ ossutil cp /tmp/venv_20251225174458.tar.gz xxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
Succeed: Total num: 1, size: 115,655,687. OK num: 1(upload 1 files).
0.788886(s) elapsed
Upload completed: oss://xxx/envs/my_custom_env/venv_20251225174458.tar.gz
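
The archive name contains a timestamp, so the full path is known only after the packaging job finishes. The following is a minimal sketch that extracts the path from saved log text. It assumes the log keeps the `Upload completed:` line format shown in the sample output, which is not a documented contract and may change.

```python
import re

# Matches the last line of the sample log. The format is an assumption
# based on the sample output, not a documented interface.
_UPLOAD_LINE = re.compile(
    r"^Upload completed:\s*(oss://\S+\.tar\.gz)\s*$", re.MULTILINE
)


def uploaded_archive_path(log_text):
    """Return the archive path from the build log, or None if not found."""
    match = _UPLOAD_LINE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "0.788886(s) elapsed\n"
        "Upload completed: oss://xxx/envs/my_custom_env/venv_20251225174458.tar.gz\n"
    )
    print(uploaded_archive_path(sample))
```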

Use the environment package

For subsequent PySpark jobs, you can reference the archive in oss://your-bucket/envs/my_custom_env.

To use the packaged environment, set the pyFiles parameter in your job configuration.

## Sample usage
{
 "name": "Spark Python",
 "file": "oss://testBucketName/example.py",
 "pyFiles": ["oss://your-bucket/envs/my_custom_env/venv_*****.tar.gz"],
 "args": [
 "oss://testBucketName/staff.csv"
 ],
 "conf": {
 "spark.driver.resourceSpec": "small",
 "spark.executor.instances": 2,
 "spark.executor.resourceSpec": "small"
 }
}

Troubleshooting

ModuleNotFoundError:
- Verify that dependencies are configured for both driverEnv and executorEnv.
- Verify that the package names are spelled correctly and match the names on PyPI.
NumPy-related errors:
- Check whether your dependencies require numpy >= 2.0.0. If so, downgrade those dependencies to versions that are compatible with numpy 1.26.0.
Download timeouts:
- The default mirror is accessed over the internal network. If you specified a public mirror, verify that public network access is enabled for your job.
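
The first ModuleNotFoundError check can be automated for real-time installation jobs by comparing the driverEnv and executorEnv entries in the `conf` object. The following is a minimal sketch. The function name `env_mismatches` is hypothetical. The check does not apply to packaging jobs, which set only driverEnv variables.

```python
DRIVER_PREFIX = "spark.kubernetes.driverEnv."
EXECUTOR_PREFIX = "spark.executorEnv."


def env_mismatches(conf):
    """Return variable names whose driver and executor values differ.

    A variable set for only one of the two roles is also reported.
    """
    driver = {
        key[len(DRIVER_PREFIX):]: value
        for key, value in conf.items()
        if key.startswith(DRIVER_PREFIX)
    }
    executor = {
        key[len(EXECUTOR_PREFIX):]: value
        for key, value in conf.items()
        if key.startswith(EXECUTOR_PREFIX)
    }
    names = sorted(set(driver) | set(executor))
    return [name for name in names if driver.get(name) != executor.get(name)]


if __name__ == "__main__":
    conf = {
        "spark.kubernetes.driverEnv.PYTHON_MODULES": "pandas>=1.5.3,lunar_python",
        "spark.executorEnv.PYTHON_MODULES": "pandas>=1.5.3",
    }
    print(env_mismatches(conf))
```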