GPU-accelerated instance FAQ - Function Compute (FC) - Alibaba Cloud Help Center

Common questions and solutions for GPU-accelerated instances in Function Compute.

What are the driver and CUDA versions for Function Compute GPU-accelerated instances?
What do I do if I encounter a CUFFT_INTERNAL_ERROR during execution?
How do I resolve a CUDA GPG error that occurs when I build an image?
Why is my GPU-accelerated instance type displayed as g1?
Why does my instance fail to start?
What do I do if elastic GPU instances cannot be created and a "ResourceExhausted" or "ResourceThrottled" error is reported?
What is the size limit for a GPU image?
What do I do if GPU image acceleration fails?
Should the model be packaged in the image or separated from it?
How do I perform a model warm-up, and are there any best practices?
What do I do if a GPU image fails to start and reports "FunctionNotStarted: Function Instance health check failed on port xxx in 120 seconds"?
My function has high and fluctuating end-to-end latency. How do I handle this?
What do I do if the NVIDIA driver cannot be found?

What are the driver and CUDA versions for Function Compute GPU-accelerated instances?

tl;dr: Function Compute currently runs driver version 580.95.05 (CUDA user mode driver 13.0). Use CUDA Toolkit 11.8 or later (but no later than 13.0) in your image. The platform manages the driver components.

GPU-accelerated instances have two separate version layers:

Platform-managed (you cannot change these):

Driver version — the kernel mode driver (nvidia.ko) and the CUDA user mode driver (libcuda.so). The platform injects these components into each container at instance creation. Do not include them in your image.

You manage this:

CUDA Toolkit version — CUDA Runtime, cuDNN, and cuFFT. You specify the CUDA Toolkit version when you build your container image. For best compatibility, use CUDA Toolkit 11.8 or later, but not later than the CUDA user mode driver version (13.0) provided by the platform.

The driver version may change due to feature updates, new GPU card models, bug fixes, or driver lifecycle expiration. For the full version compatibility matrix, see the CUDA Toolkit Release Notes.
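The sketch below encodes the version rule above. It compares only the major.minor part of each version string. The constants mirror the values stated in this FAQ: an 11.8 minimum and a platform CUDA user mode driver of 13.0. The platform value may change, so treat it as a parameter rather than a fixed fact. `parse_version` and `toolkit_is_compatible` are illustrative names, not part of any Function Compute API.

```python
from typing import Tuple

MIN_TOOLKIT = (11, 8)            # lowest recommended CUDA Toolkit version
PLATFORM_DRIVER_CUDA = (13, 0)   # CUDA user mode driver version stated above (may change)


def parse_version(text: str) -> Tuple[int, int]:
    """Turn '12.1' or '12.1.105' into (12, 1)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_is_compatible(toolkit: str,
                          driver_cuda: Tuple[int, int] = PLATFORM_DRIVER_CUDA) -> bool:
    """True if MIN_TOOLKIT <= toolkit <= driver_cuda, comparing (major, minor)."""
    return MIN_TOOLKIT <= parse_version(toolkit) <= driver_cuda


if __name__ == "__main__":
    for candidate in ("11.7", "11.8", "12.4", "13.0", "13.1"):
        print(candidate, toolkit_is_compatible(candidate))
```

In a PyTorch image, `torch.version.cuda` reports the CUDA version the build was compiled against, as a string such as "12.1". For CPU-only builds it returns None, so check for None before passing the value in.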

What do I do if I encounter a CUFFT_INTERNAL_ERROR during execution?

Cause: The cuFFT library in CUDA 11.7 has a known forward compatibility issue that causes this error on newer GPU card models.

Fix: Upgrade to CUDA 11.8 or later.

After upgrading, verify the fix with this PyTorch snippet:

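import torch
# Real-to-complex FFT on the GPU: the call that raises CUFFT_INTERNAL_ERROR with CUDA 11.7 on newer GPU models.
out = torch.fft.rfft(torch.randn(1000).cuda())
torch.cuda.synchronize()  # make sure any asynchronous CUDA error is raised here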

If no error is reported, the upgrade was successful. For supported GPU card models, see Instance types and specifications.
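The check above raises an exception when it fails. If you would rather get a status report, for example as a startup self-test, the sketch below wraps the same `torch.fft.rfft` call. It makes three assumptions:

- PyTorch is installed in the image.
- A `RuntimeError` from the call is treated as the cuFFT failure.
- `needs_cufft_upgrade` and `cufft_smoke_test` are hypothetical helper names.

```python
from typing import Optional


def needs_cufft_upgrade(cuda_version: Optional[str]) -> bool:
    """True if the CUDA version is older than 11.8.

    None (what torch.version.cuda returns for CPU-only builds) cannot hit the
    GPU cuFFT error, so it is reported as not needing an upgrade.
    """
    if cuda_version is None:
        return False
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return (major, minor) < (11, 8)


def cufft_smoke_test(n: int = 1000) -> str:
    """Run the rfft check from this FAQ and report the outcome as a string."""
    import torch  # imported lazily so needs_cufft_upgrade works without PyTorch

    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"
    try:
        torch.fft.rfft(torch.randn(n).cuda())
        torch.cuda.synchronize()  # surface asynchronous CUDA errors inside the try
    except RuntimeError as exc:
        status = f"failed: {exc}"
    else:
        status = "ok"
    if needs_cufft_upgrade(torch.version.cuda):
        status += f" (CUDA {torch.version.cuda} is older than the recommended 11.8)"
    return status
```

The FFT runs even on older CUDA versions because, per this FAQ, the error appears only on newer GPU card models. The version note is appended so the upgrade recommendation stays visible.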

How do I resolve a CUDA GPG error that occurs when I build an image?

Error:

W: GPG error: https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu2004/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease' is not signed.

Cause: The NVIDIA CUDA repository public key is missing from your build environment.

Fix: Add the following line to your Dockerfile after the RUN rm command and before any apt-get update step, then rebuild:

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC
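The key must be imported before apt reads the NVIDIA repository, so the order of these steps matters. Below is a rough pre-build check. It assumes a plain-text scan is good enough for your Dockerfiles; it does not parse Dockerfile syntax. `key_imported_before_update` is a hypothetical helper.

```python
from typing import Optional, Tuple

KEY_ID = "A4B469963BF863CC"  # key ID taken from the NO_PUBKEY error above


def key_imported_before_update(dockerfile: str, key_id: str = KEY_ID) -> bool:
    """Check that `apt-key ... <key_id>` appears before the first `apt-get update`.

    Positions are compared as (line, column), so a single
    `RUN apt-key ... && apt-get update` line also counts as correct.
    """
    key_pos: Optional[Tuple[int, int]] = None
    update_pos: Optional[Tuple[int, int]] = None
    for line_no, line in enumerate(dockerfile.splitlines()):
        if key_pos is None and "apt-key" in line and key_id in line:
            key_pos = (line_no, line.index("apt-key"))
        if update_pos is None and "apt-get update" in line:
            update_pos = (line_no, line.index("apt-get update"))
    if key_pos is None:
        return False
    return update_pos is None or key_pos < update_pos
```

You could run this against `open("Dockerfile").read()` in CI. Keep in mind that an `apt-get update` inherited from a base image is invisible to this scan.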

Why is my GPU-accelerated instance type displayed as g1?

g1 is an alias for fc.gpu.tesla.1. They are equivalent. For the full list of instance types and their specifications, see Instance types and specifications.

Why does my instance fail to start?

There are two common causes:

Startup timeout

Error code: FunctionNotStarted
Error message: Function instance health check failed on port XXX in 120 seconds
Cause: The application takes too long to start — usually because it loads a large model (over 10 GB) from the public network before the web server is ready.
Fix: Start your web server first, then load models in the background or in the /initialize method. Avoid blocking server startup with large downloads from the public network.
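Here is a minimal sketch of "start your web server first, then load models in the background", using only the Python standard library. It makes these assumptions:

- `/invoke` is a hypothetical invocation route.
- `load_model` stands in for your real loading code.
- The port you pass to `serve` must match the port configured for your function.
- The health check is assumed to succeed once the port answers, which is why a plain GET returns 200 immediately.

Invocations receive 503 until the model is ready.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MODEL = None                     # set by load_model() once loading finishes
MODEL_READY = threading.Event()  # signals that MODEL can be used


def toy_model(text: str) -> str:
    return text.upper()


def load_model(delay_seconds: float) -> None:
    """Hypothetical stand-in for slow loading (e.g. reading weights from NAS)."""
    global MODEL
    time.sleep(delay_seconds)
    MODEL = toy_model
    MODEL_READY.set()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer right away so the port responds while the model is loading.
        self._reply(200, b"ok")

    def do_POST(self):
        if self.path != "/invoke":  # hypothetical invocation route
            self._reply(404, b"not found")
        elif not MODEL_READY.is_set():
            self._reply(503, b"model still loading")
        else:
            length = int(self.headers.get("Content-Length", 0))
            text = self.rfile.read(length).decode()
            self._reply(200, MODEL(text).encode())

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this sketch


def make_server(port: int) -> ThreadingHTTPServer:
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)


def serve(port: int, load_delay_seconds: float = 5.0) -> None:
    """Bind the port first, then load the model in a background thread."""
    server = make_server(port)
    threading.Thread(target=load_model, args=(load_delay_seconds,), daemon=True).start()
    server.serve_forever()
```

The container entry point would call `serve(<your configured port>)`. Binding happens before the load thread starts, so the port is reachable within seconds regardless of model size.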

Quota exceeded

Error code: ResourceThrottled
Error message: Reserve resource exceeded limit
Cause: The requested GPU resources exceed your account quota. By default, each Alibaba Cloud account can use up to 30 physical GPUs per region.
Fix: Check your current quota in Quota Center and, if needed, submit a quota increase request there.

What do I do if elastic GPU instances cannot be created and a "ResourceExhausted" or "ResourceThrottled" error is reported?

Cause: GPU resources are shared and subject to pool fluctuations. When demand spikes, elastic GPU instances may not be provisioned in time to serve invocation requests.

Fix: Configure a minimum number of instances for your function. This reserves GPU resources in advance rather than waiting for on-demand provisioning. For setup instructions, see Configure an elastic policy with a minimum number of instances.

What is the size limit for a GPU image?

The size limit applies to the compressed image. Images smaller than 20 GB before compression can typically be deployed to Function Compute.

To check image sizes:

Compressed size: View in the Container Registry console.
Uncompressed size: Run docker images locally.
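The helper below performs the local check. It assumes the Docker CLI is installed, reads the uncompressed size in bytes from `docker image inspect --format '{{.Size}}'`, and compares it with the 20 GB guideline above. Reading "20 GB" as decimal gigabytes (10^9 bytes, the unit `docker images` displays) is an assumption. The compressed size, which the limit actually applies to, still has to be read from the Container Registry console.

```python
import subprocess

GUIDELINE_BYTES = 20 * 10**9  # assumption: "20 GB" read as decimal gigabytes


def uncompressed_size(image: str) -> int:
    """Return the local, uncompressed image size in bytes."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def within_guideline(size_bytes: int, limit: int = GUIDELINE_BYTES) -> bool:
    """True if the uncompressed size is below the deployment guideline."""
    return size_bytes < limit
```

For example, `within_guideline(uncompressed_size("my-gpu-app:latest"))` checks a locally built image. The tag here is hypothetical.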

What do I do if GPU image acceleration fails?

Cause: As image size grows, the accelerated image conversion process takes longer and may time out.

Fix: To retrigger the conversion, open the Function Compute console, edit the function configuration, and save — no parameter changes are needed.

Should the model be packaged in the image or separated from it?

Separate the model from the image if any of the following apply:

The model file is large.
The model is updated frequently.
Including the model would push the image over the platform's size limit.

When separating, store the model in a File Storage NAS (NAS) file system or Object Storage Service (OSS). For detailed guidance, see Best practices for model storage in GPU-accelerated instances.
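When the model lives outside the image, it can help to resolve its location explicitly and fail fast if it is missing, rather than silently downloading it over the public network at startup. Here is a minimal sketch. The candidate paths and the `MODEL_DIR` override variable are hypothetical; adjust them to your own image layout and NAS mount configuration:

- `/opt/models` stands for a directory baked into the image.
- `/mnt/nas/models` stands for a NAS mount.

```python
import os
from pathlib import Path
from typing import Iterable

# Hypothetical locations: a directory baked into the image and a NAS mount point.
DEFAULT_CANDIDATES = ("/opt/models", "/mnt/nas/models")


def resolve_model_dir(name: str, candidates: Iterable[str] = DEFAULT_CANDIDATES) -> Path:
    """Return the first local directory that contains the model `name`.

    Raises instead of falling back to a public-network download, so a missing
    model shows up as a clear error rather than as a startup timeout.
    """
    override = os.environ.get("MODEL_DIR")  # hypothetical override variable
    search = [override] if override else list(candidates)
    for base in search:
        path = Path(base) / name
        if path.is_dir():
            return path
    raise FileNotFoundError(f"model {name!r} not found in {search}")
```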

How do I perform a model warm-up, and are there any best practices?

Put model initialization logic in the /initialize method. Function Compute holds all incoming traffic until the /initialize method completes, so the instance only starts serving production requests after the model is fully loaded.
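The sketch below shows the /initialize pattern in a framework-agnostic form. The model is loaded once, under a lock, when /initialize is called, and one dummy inference runs as a warm-up before the handler returns. Running a dummy inference is a common warm-up practice rather than something this FAQ prescribes. `/invoke` is a hypothetical route, and the loader you pass in stands in for your real model-loading code. You would wire `handle` into whatever web framework your container already uses.

```python
import threading
from typing import Callable, Optional, Tuple

Model = Callable[[str], str]


class ModelService:
    """Loads and warms up a model once, when /initialize is called."""

    def __init__(self, loader: Callable[[], Model]):
        self._loader = loader           # stand-in for your real model-loading code
        self._model: Optional[Model] = None
        self._lock = threading.Lock()   # guards against concurrent /initialize calls

    def initialize(self) -> None:
        with self._lock:
            if self._model is None:
                model = self._loader()
                model("warm-up")        # one dummy inference before serving traffic
                self._model = model

    def handle(self, path: str, body: str) -> Tuple[int, str]:
        """Map a request path and body to (HTTP status, response body)."""
        if path == "/initialize":
            self.initialize()
            return 200, "initialized"
        if path == "/invoke":           # hypothetical invocation route
            if self._model is None:
                return 503, "not initialized"
            return 200, self._model(body)
        return 404, "not found"
```

The platform holds traffic until /initialize returns, so the 503 branch should rarely be hit in production. It mainly guards against calls that arrive before initialization, for example during local testing.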

For details on how the /initialize method fits into the instance lifecycle, see Configure the instance lifecycle.

What do I do if a GPU image fails to start and reports "FunctionNotStarted: Function Instance health check failed on port xxx in 120 seconds"?

Cause: The web server takes too long to start — usually because the application loads a large model before the server is ready to accept connections.

Fix:

Do not load models from the public network at startup. Place models inside the image or in a NAS file system for faster access.
Move model initialization into the /initialize method. This lets the web server start first and pass the health check, while the model loads in the background.

Note

For details on how the /initialize method fits into the instance lifecycle, see Configure the instance lifecycle.

My function has high and fluctuating end-to-end latency. How do I handle this?

High latency in GPU functions typically comes from two sources: resource wait time and initialization time. Identify which one applies before you choose a fix.

Is the latency happening before any invocation work starts (resource wait)?

If invocations spend too long in a pending state waiting for a GPU container to become available, the problem is resource availability, not initialization speed. Fix: configure a minimum number of instances to keep warm containers ready. See Configure an elastic policy with a minimum number of instances.

Is the latency concentrated in the first invocation on a new instance (initialization)?

If some invocations are much slower than others and those invocations are always the first on a freshly started container, the problem is initialization time. Check the following:

Confirm image acceleration is active. In the Function Compute console, check that the image acceleration status shows Available. If it does not, retrigger the conversion by editing and saving the function configuration — no parameter changes needed.
Check your NAS file system type. If your function reads a model from NAS, use a compute-optimized General-purpose NAS file system. Storage-optimized file systems have lower throughput and slow down model loading. For details, see General-purpose NAS file system.
Move model loading into /initialize. The instance is not considered warm until /initialize completes, so no traffic is routed to it during model loading. This eliminates model-load latency from production requests.
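You can automate the first-invocation check described above if you can export per-invocation records, for example from your request logs. The sketch makes two assumptions that you should adapt to your data:

- The record fields (serving instance ID, start time, end-to-end latency) are hypothetical names.
- Any call slower than three times the median counts as "slow".

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Invocation:
    instance_id: str   # hypothetical field: the container that served the call
    started_at: float  # hypothetical field: start time in seconds
    latency_ms: float  # end-to-end latency in milliseconds


def slow_calls_are_first_on_instance(calls: List[Invocation], factor: float = 3.0) -> bool:
    """True if every call slower than factor * median latency was the first call
    its instance served, which points to initialization time as the cause."""
    if not calls:
        return False
    first_start: Dict[str, float] = {}
    for call in sorted(calls, key=lambda c: c.started_at):
        first_start.setdefault(call.instance_id, call.started_at)
    threshold = factor * median(c.latency_ms for c in calls)
    slow = [c for c in calls if c.latency_ms > threshold]
    return bool(slow) and all(first_start[c.instance_id] == c.started_at for c in slow)
```

A False result while slow calls still occur means the slowness is not tied to fresh instances. In that case, look at pending time and the resource-wait guidance instead.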

What do I do if the NVIDIA driver cannot be found?

Cause: You built your container image using docker run --gpus all followed by docker commit. This captures the local NVIDIA driver state in the image, which conflicts with the driver the Function Compute platform injects at runtime.

Fix: Always build your application image from a Dockerfile. For reference, see Dockerfile.

Additionally, keep your image free of driver-specific components:

Do not include libcuda.so in the image. This library is tightly coupled to the host kernel driver version. If it does not match what the platform provides, your application may behave unexpectedly.
Do not make your application dependent on a specific driver version.
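To catch a stray libcuda.so before deployment, you can scan the image filesystem at build time. Run the scan in a final build step or against an exported root filesystem, not inside a running instance, where the platform has already injected its own copy. The sketch below skips directories named `stubs`. This assumes the link-time stub libraries shipped by CUDA base images are not on the runtime loader path.

```python
import os
from pathlib import Path
from typing import List


def find_libcuda(root: str) -> List[Path]:
    """List files named libcuda.so* under `root`, skipping `stubs` directories."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != "stubs"]  # prune stub directories
        for name in filenames:
            if name.startswith("libcuda.so"):  # matches libcuda.so, libcuda.so.1, ...
                hits.append(Path(dirpath) / name)
    return sorted(hits)
```

A build step could call `find_libcuda("/")` and fail the build if the returned list is not empty. Note that CUDA Runtime libraries such as libcudart.so do not match the prefix and are not reported.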

When a function instance starts, the Function Compute platform automatically injects the correct user mode driver components into the container. This is the same mechanism used by GPU container virtualization technologies like NVIDIA Container Runtime — it keeps driver management on the platform side so your image stays portable across driver updates.

If you are already using NVIDIA Container Runtime or a similar technology, avoid docker commit. Images created this way contain injected driver components that may version-mismatch with what the Function Compute platform provides, causing undefined behavior.