Develop Spark applications with PySpark
This article describes how to develop AnalyticDB for MySQL Spark Python jobs and provides a method to build a cloud environment.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
A job resource group is created for the AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user.
Basic PySpark usage
- Write the following sample code and save it as example.py.

      from pyspark.sql import SparkSession

      if __name__ == "__main__":
          spark = SparkSession.builder.getOrCreate()
          df = spark.sql("SELECT 1+1")
          df.printSchema()
          df.show()

- Upload example.py to OSS. For more information, see Upload a file.
- Go to the Spark development editor.
  - Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
  - In the left-side navigation pane, choose .
- At the top of the editor window, select a job resource group and a Spark job type. This topic uses the Batch type as an example.
- Enter the following job configuration in the editor.

      {
        "name": "Spark Python Test",
        "file": "oss://testBucketName/example.py",
        "conf": {
          "spark.driver.resourceSpec": "small",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "small"
        }
      }

  For parameter details, see Parameter description.
Use Python dependencies
This topic provides two solutions for building a cloud-based environment without requiring local packaging.
| Solution type | Use cases | Pros and cons |
| --- | --- | --- |
| Real-time installation | | Pros: Simple to configure and ready to use. Cons: Dependencies are re-downloaded and installed each time the job runs. |
| Cloud build | | Pros: Package once, reuse indefinitely. Ensures fast startup and high stability. Cons: Requires an extra job to package dependencies. |
Prerequisites
Before you configure the cloud environment, ensure the following requirements are met:
- Cluster Status: The cluster is initialized and can successfully run the basic Spark Pi example.
- Version Requirements:
  - Spark version: 3.5.1 is supported.
  - Python version: 3.9 or 3.11 is supported.
- Key Constraint (NumPy version):
  - Apache Spark must run in an environment with numpy < 2.0.0.
  - The system enforces the installation of numpy==1.26.0.
  - Important: Ensure that other dependencies you install, such as Pandas and SciPy, are compatible with NumPy 1.26.0. Otherwise, the job will fail.
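The NumPy constraint above can be checked from inside a job before heavier imports run. The following is a minimal sketch and not part of the product: the helper names is_supported_numpy and check_installed_numpy are hypothetical, and the check assumes that the version string starts with a numeric major component.

```python
from importlib import metadata


def is_supported_numpy(version: str) -> bool:
    """Return True if the version string denotes a NumPy release below 2.0.0."""
    major = int(version.split(".")[0])
    return major < 2


def check_installed_numpy() -> str:
    """Describe the installed NumPy version, or note that NumPy is absent."""
    try:
        version = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return "numpy is not installed"
    if is_supported_numpy(version):
        return f"numpy {version}: supported"
    return f"numpy {version}: unsupported (numpy < 2.0.0 is required)"


if __name__ == "__main__":
    print(check_installed_numpy())
```

Calling check_installed_numpy() at the top of a job script would surface an incompatible NumPy in the driver log early, instead of as an import error deep inside another library.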
Examples
Real-time installation
- Prepare your application code

  Write a Python script, for example, job.py, and upload it to an OSS path, such as oss://your-bucket/scripts/job.py.

      # This sample script prints all dependencies in the current Python environment.
      # Print all modules in the Python environment
      import pkgutil

      if __name__ == "__main__":
          for module_info in pkgutil.iter_modules():
              print(module_info.name)

- Configure job parameters

      ## Sample job
      {
        "file": "oss://your-bucket/scripts/job.py",  // Path to your code file
        "name": "Real-time Env Demo",
        "conf": {
          "spark.adb.version": "3.5",
          "spark.driver.resourceSpec": "medium",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "medium",

          // --- Start of core configuration ---
          // 1. Specify the Python version
          "spark.kubernetes.driverEnv.PYTHON_BIN": "python3.11",
          "spark.executorEnv.PYTHON_BIN": "python3.11",

          // 2. Specify the dependencies to install (for both Driver and Executor)
          "spark.kubernetes.driverEnv.PYTHON_MODULES": "chinesecalendar>=1.10.0,pandas>=1.5.3,lunar_python",
          "spark.executorEnv.PYTHON_MODULES": "chinesecalendar>=1.10.0,pandas>=1.5.3,lunar_python"
          // --- End of core configuration ---
        }
      }

  Important: Spark consists of a Driver (control node) and Executors (execution nodes). To ensure a consistent environment, you must configure the same environment variables for both.
| Parameter | Description | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| spark.kubernetes.driverEnv.PYTHON_MODULES, spark.executorEnv.PYTHON_MODULES | A list of Python packages to install. | Yes | None | Separate multiple Python dependencies with commas (,). The format of the Python dependencies must fully comply with the requirements of the PyPI community. Python dependencies without version constraints must be placed at the end of the list. Example: chinesecalendar>=1.10.0,dynaconf>=3.2.10,pandas>=1.5.3,lunar_python |
| spark.kubernetes.driverEnv.PYTHON_BIN, spark.executorEnv.PYTHON_BIN | The Python version to use for the job. | No | python3.11 | Valid values: python3.11, python3.9 |
| spark.kubernetes.driverEnv.INDEX_URL, spark.executorEnv.INDEX_URL | The URL of the PyPI repository. | No | http://mirrors.cloud.aliyuncs.com/pypi/simple/ | The default value is the URL of a mirror hosted within Alibaba Cloud, which can be accessed over the internal network. If you specify an address that is only accessible over the public network, such as the PyPI mirror from Tsinghua University, you must enable public network access. For more information, see Configure public network access for a Spark application. |
| spark.kubernetes.driverEnv.TRUSTED_HOST, spark.executorEnv.TRUSTED_HOST | The domain of the PyPI repository to add as a trusted host. | No | mirrors.cloud.aliyuncs.com | Python verifies the SSL certificate of the PyPI repository during installation. If the repository's certificate is not from a trusted certificate authority (CA), use this parameter to mark the repository's domain as a trusted host. Important: Use this parameter with caution. Ensure the configured PyPI source is trustworthy, as dependency confusion attacks are a common threat. |
- Run the sample job. You can view the log to see that the Python environment contains the declared dependencies and their transitive dependencies.

      xxlimited_35
      zlib
      numpy
      pandas
      _distutils_hack
      _virtualenv
      chinese_calendar
      dateutil
      lunar_python
      pip
      pkg_resources
      pytz
      setuptools
      six
      tzdata
      wheel
      >>>>>>>> stderr:
      25/12/25 17:01:56 INFO ShutdownHookManager: Shutdown hook called
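The formatting rules for PYTHON_MODULES (comma-separated, unconstrained packages last, identical values for the driver and the executors) are easy to get wrong when the list is edited by hand. The following is a minimal sketch that generates these settings. The helper names build_python_modules and build_env_conf are hypothetical, and detecting a version constraint by looking for a comparison character is a simplifying assumption.

```python
import json
import re

# Characters that begin a version constraint, such as >=, ==, or ~=.
_CONSTRAINT = re.compile(r"[<>=!~]")


def build_python_modules(requirements):
    """Join requirements with commas, moving unconstrained ones to the end."""
    cleaned = [r.strip() for r in requirements]
    constrained = [r for r in cleaned if _CONSTRAINT.search(r)]
    unconstrained = [r for r in cleaned if not _CONSTRAINT.search(r)]
    return ",".join(constrained + unconstrained)


def build_env_conf(requirements, python_bin="python3.11"):
    """Return identical driver and executor settings for the job's conf section."""
    modules = build_python_modules(requirements)
    return {
        "spark.kubernetes.driverEnv.PYTHON_BIN": python_bin,
        "spark.executorEnv.PYTHON_BIN": python_bin,
        "spark.kubernetes.driverEnv.PYTHON_MODULES": modules,
        "spark.executorEnv.PYTHON_MODULES": modules,
    }


if __name__ == "__main__":
    conf = build_env_conf(["lunar_python", "chinesecalendar>=1.10.0", "pandas>=1.5.3"])
    print(json.dumps(conf, indent=2))
```

Because both sets of keys come from the same value, the driver and executor settings cannot drift apart. The printed entries can be pasted into the conf section of a job configuration.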
Cloud build
This solution uses a dedicated job to package dependencies into a compressed archive, which is then uploaded to OSS for reuse in subsequent jobs.
- Plan an OSS path

  Specify an OSS path to store the packaged environment, for example, oss://your-bucket/envs/my_custom_env.

- Submit a packaging job

  Important:
  - Do not modify the path of the built-in packaging script: local:///opt/tools/build_venv.py.
  - In args, specify the dependencies to install.

      ## Sample job
      {
        // 1. Specify all dependencies to be packaged.
        "args": [
          "chinesecalendar>=1.10.0",
          "pandas>=1.5.3",
          "pyarrow>=19.0.1",
          "lunar_python"
        ],
        // 2. Call the built-in packaging script (do not modify).
        "file": "local:///opt/tools/build_venv.py",
        "name": "Build VirtualEnv Job",
        "conf": {
          "spark.driver.resourceSpec": "medium",
          "spark.executor.instances": 1,
          "spark.executor.resourceSpec": "medium",
          // 3. Specify the Python version.
          "spark.kubernetes.driverEnv.PYTHON_BIN": "python3.11",
          // 4. Specify the OSS path to upload the packaged environment (modify as needed).
          "spark.kubernetes.driverEnv.VENV_OSS_PATH": "oss://your-bucket/envs/my_custom_env",
          // 5. Specify the temporary directory for the build.
          "spark.kubernetes.driverEnv.VENV_DIR": "/tmp/build_test"
        }
      }

| Parameter | Description | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| spark.kubernetes.driverEnv.VENV_OSS_PATH | The storage path for the environment package. | Yes | None | Example: oss://your-bucket/envs/my_custom_env. |
| spark.kubernetes.driverEnv.VENV_DIR | The temporary build directory. | No | /tmp/venv | If the environment package is large, mount a data disk and change this path to /user_data_dir. |
| spark.kubernetes.driverEnv.PYTHON_BIN | The Python version to use for the job. | No | python3.11 | Valid values: python3.11, python3.9 |
| spark.kubernetes.driverEnv.INDEX_URL | The URL of the PyPI repository. | No | http://mirrors.cloud.aliyuncs.com/pypi/simple/ | The default value is the URL of a mirror hosted within Alibaba Cloud, which can be accessed over the internal network. If you specify an address that is only accessible over the public network, such as the PyPI mirror from Tsinghua University, you must enable public network access. For more information, see Configure public network access for a Spark application. |
| spark.kubernetes.driverEnv.TRUSTED_HOST | The domain of the PyPI repository to add as a trusted host. | No | mirrors.cloud.aliyuncs.com | Python verifies the SSL certificate of the PyPI repository during installation. If the repository's certificate is not from a trusted certificate authority (CA), use this parameter to mark the repository's domain as a trusted host. Important: Use this parameter with caution. Ensure the configured PyPI source is trustworthy, as dependency confusion attacks are a common threat. |
- Run the job

  The logs show the archive upload details and a complete list of packages and their versions installed in the virtual environment.

      ------------------ --------------
      chinesecalendar    1.11.0
      lunar_python       1.4.8
      numpy              1.26.0
      pandas             2.3.3
      pip                24.2
      pyarrow            22.0.0
      python-dateutil    2.9.0.post0
      pytz               2025.2
      setuptools         75.1.0
      six                1.17.0
      tzdata             2025.3
      wheel              0.44.0
      Uploading archive to oss://xxx/envs/my_custom_env/venv_20251225174458.tar.gz
      + ossutil cp /tmp/venv_20251225174458.tar.gz xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx
      Succeed: Total num: 1, size: 115,655,687. OK num: 1(upload 1 files).
      0.788886(s) elapsed
      Upload completed: oss://xxx/envs/my_custom_env/venv_20251225174458.tar.gz

- Use the environment package

  For subsequent PySpark jobs, you can reference the archive in oss://your-bucket/envs/my_custom_env. To use the packaged environment, set the pyFiles parameter in your job configuration.

      ## Sample usage
      {
        "name": "Spark Python",
        "file": "oss://testBucketName/example.py",
        "pyFiles": ["oss://your-bucket/envs/my_custom_env/venv_*****.tar.gz"],
        "args": [
          "oss://testBucketName/staff.csv"
        ],
        "conf": {
          "spark.driver.resourceSpec": "small",
          "spark.executor.instances": 2,
          "spark.executor.resourceSpec": "small"
        }
      }
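Each packaging run in the sample log writes a new archive whose name carries a timestamp (venv_20251225174458.tar.gz), so a job that reuses the environment has to name one specific archive. The following is a minimal sketch that picks the most recent archive from a list of object names and fills in pyFiles. It assumes that the venv_<14 digits>.tar.gz pattern seen in the log holds for every archive. The helper names latest_archive and build_job are hypothetical, and obtaining the list of object names from OSS is outside the scope of the sketch.

```python
import re

# Archive names in the sample log look like venv_20251225174458.tar.gz.
_ARCHIVE = re.compile(r"venv_(\d{14})\.tar\.gz$")


def latest_archive(object_names):
    """Return the name with the most recent timestamp, or None if none match."""
    candidates = []
    for name in object_names:
        match = _ARCHIVE.search(name)
        if match:
            # Fixed-width timestamps sort correctly as plain strings.
            candidates.append((match.group(1), name))
    return max(candidates)[1] if candidates else None


def build_job(script, archive, args=()):
    """Build a job configuration that references the packaged environment."""
    return {
        "name": "Spark Python",
        "file": script,
        "pyFiles": [archive],
        "args": list(args),
        "conf": {
            "spark.driver.resourceSpec": "small",
            "spark.executor.instances": 2,
            "spark.executor.resourceSpec": "small",
        },
    }
```

The resulting dictionary can be serialized with json.dumps and pasted into the editor in place of a hand-written configuration.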
Troubleshooting
- ModuleNotFoundError:
  - Verify that dependencies are configured for both driverEnv and executorEnv.
  - Verify that the package names are spelled correctly and match the names on PyPI.
- NumPy-related errors: Check whether your dependencies require numpy >= 2.0.0. If so, downgrade your dependency versions to be compatible with numpy 1.26.0.
- Download timeouts: The default mirror is accessed over the internal network. If you have specified a public mirror instead, verify that public network access is enabled for your job.
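When a job fails with ModuleNotFoundError, it can help to list the missing modules at the start of the script instead of waiting for the first failing import. The following is a minimal sketch using the standard library. The helper name missing_modules and the list of required modules are hypothetical. Note that the import name can differ from the PyPI package name: in the sample log above, the chinesecalendar package appears as the module chinese_calendar.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Import names, not PyPI package names.
    required = ["pandas", "chinese_calendar", "lunar_python"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable")
```

Run at the top of a job script, this reports on the driver's environment only. Executors have their own environment, which is why the dependencies must also be configured under executorEnv.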