Elastic GPU Service gives you GPU-accelerated instances with high compute density, low-latency networking, and flexible billing. It also includes DeepGPU, a free toolkit that accelerates training and inference workloads without requiring code changes.
Elastic GPU Service
Global deployment across 17 regions. Deploy GPU-accelerated instances at scale in 17 regions worldwide. Auto provisioning and auto scaling handle sudden demand spikes without manual intervention.
Up to 1,000 TFLOPS of mixed-precision compute. When combined with a high-performance CPU platform, GPU-accelerated instances deliver up to 1,000 TFLOPS (1,000 trillion floating-point operations per second) of mixed-precision computing performance, enough for training large language models (LLMs) at scale and for high-throughput inference.
Two-tier networking for every workload type. Each GPU-accelerated instance connects to a virtual private cloud (VPC) with up to 32 Gbit/s of internal bandwidth and 4.5 million packets per second (Mpps) for general workloads. For distributed training tasks, pair instances with Super Computing Cluster (SCC) to add up to 50 Gbit/s of Remote Direct Memory Access (RDMA) bandwidth between nodes, so that gradient synchronization does not become the bottleneck.
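To put the two bandwidth figures in perspective, the following is a rough back-of-the-envelope sketch rather than a benchmark. It assumes a ring all-reduce, in which each node transfers 2(N-1)/N times the gradient payload, and it ignores latency, protocol overhead, and overlap with computation. The model size, gradient precision, and node count are hypothetical, and each link is treated on its own even though RDMA bandwidth is added on top of the VPC bandwidth.

```python
def ring_allreduce_seconds(payload_bytes: float, nodes: int, link_gbit_s: float) -> float:
    """Lower-bound transfer time for one ring all-reduce (no latency, no overlap)."""
    if nodes < 2:
        return 0.0  # nothing to synchronize on a single node
    # In a ring all-reduce each node sends 2 * (N - 1) / N times the payload.
    bytes_on_wire = 2 * (nodes - 1) / nodes * payload_bytes
    link_bytes_per_s = link_gbit_s * 1e9 / 8  # Gbit/s -> bytes/s
    return bytes_on_wire / link_bytes_per_s


# Hypothetical workload: 1 billion parameters, 2-byte (FP16) gradients, 8 nodes.
grad_bytes = 1_000_000_000 * 2
vpc_s = ring_allreduce_seconds(grad_bytes, nodes=8, link_gbit_s=32)
rdma_s = ring_allreduce_seconds(grad_bytes, nodes=8, link_gbit_s=50)

print(f"32 Gbit/s link: {vpc_s:.3f} s per all-reduce")
print(f"50 Gbit/s link: {rdma_s:.3f} s per all-reduce")
```

Under these assumptions the transfer time scales inversely with link bandwidth, which is why the inter-node link matters more as gradient size and step frequency grow.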
Billing that matches your usage pattern. Choose from subscription, pay-as-you-go, preemptible instances, reserved instances, and storage capacity units (SCUs) to fit your cost model. Combine billing methods to cover both steady-state and burst workloads.
DeepGPU
DeepGPU is a free toolkit that accelerates GPU workloads on Elastic GPU Service. It includes Deepytorch, DeepNCCL, DeepGPU-LLM, AIACC-ACSpeed (ACSpeed), AIACC-AGSpeed (AGSpeed), FastGPU, and cGPU.
Deepytorch
Deepytorch accelerates training and inference for generative AI and LLM workloads. It includes two packages: Deepytorch Training and Deepytorch Inference.
End-to-end training throughput, no framework rewrites. Deepytorch Training integrates distributed communication and computational graph compilation to improve end-to-end training performance. It is fully compatible with mainstream PyTorch versions and distributed training frameworks including DeepSpeed, PyTorch Fully Sharded Data Parallel (FSDP), and Megatron-LM.
Lower inference latency, no precision or input-size configuration. Deepytorch Inference uses compilation-based acceleration to reduce model inference latency. It supports just-in-time (JIT) compilation, so you do not need to manually specify precision or input size, which reduces code complexity and maintenance overhead.
DeepNCCL
DeepNCCL is a communication acceleration library for multi-GPU workloads on Alibaba Cloud's SHENLONG architecture.
20%+ higher throughput than cloud-native NCCL. DeepNCCL optimizes both single-machine and cross-machine communication and delivers more than 20% higher throughput than cloud-native NCCL.
Drop-in acceleration for distributed training and multi-GPU inference. DeepNCCL accelerates communication for distributed training and multi-GPU inference tasks without interrupting running workloads.
DeepGPU-LLM
DeepGPU-LLM is an LLM inference engine built on Elastic GPU Service for high-performance large language model serving.
Tensor parallelism and cross-GPU communication optimization. DeepGPU-LLM distributes inference across multiple GPUs using tensor parallelism and optimized GPU-to-GPU communication to maximize throughput and minimize latency.
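As a generic illustration of tensor parallelism (not DeepGPU-LLM's implementation or API), the sketch below splits a weight matrix column-wise across simulated GPUs. Each shard computes its slice of the output, and a simulated all-gather concatenates the slices. All function names are hypothetical, and plain Python lists stand in for device tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain dense matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w: Matrix, shards: int) -> List[Matrix]:
    """Split a weight matrix column-wise into equal shards, one per simulated GPU."""
    n = len(w[0])
    if n % shards != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = n // shards
    return [[row[i:i + step] for row in w] for i in range(0, n, step)]


def column_parallel_matmul(x: Matrix, w: Matrix, shards: int) -> Matrix:
    """Every shard multiplies the full input by its own columns of the weights."""
    partials = [matmul(x, shard) for shard in split_columns(w, shards)]
    # Simulated all-gather: concatenate each shard's output columns row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

# The sharded result matches the single-device result.
assert column_parallel_matmul(x, w, shards=2) == matmul(x, w)
```

On real hardware the concatenation step is a collective operation between GPUs, which is why communication optimization and tensor parallelism are paired.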
Mainstream LLMs supported out of the box. DeepGPU-LLM supports Tongyi Qianwen, Llama, ChatGLM, and Baichuan, covering both proprietary and open-source model families.
FastGPU
FastGPU is a cluster deployment tool that automates infrastructure setup so you can run AI training and inference tasks without manually provisioning compute, storage, or network resources.
Clusters ready in 5 minutes. FastGPU deploys a fully configured cluster in 5 minutes. All resources are provisioned at the infrastructure layer and are directly accessible for debugging.
Resource lifecycle tied to task lifecycle. FastGPU releases GPU-accelerated instances automatically when a training or inference task ends. Preemptible instances are supported to reduce costs.
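The idea of tying resource lifetime to task lifetime can be sketched with a Python context manager. This is a conceptual illustration only and does not use FastGPU's actual commands or SDK. The provisioner, instance IDs, and task function are hypothetical stand-ins.

```python
from contextlib import contextmanager

released = []  # records which hypothetical instances were released


@contextmanager
def gpu_instances(count: int):
    """Hypothetical provisioner: yields instance IDs and always releases them on exit."""
    ids = [f"gpu-{i}" for i in range(count)]  # stand-in for a real provisioning call
    try:
        yield ids
    finally:
        # Stand-in for a real release call; runs even if the task raises.
        released.extend(ids)


def run_task(instance_ids):
    """Placeholder for a training or inference task."""
    return f"ran on {len(instance_ids)} instances"


with gpu_instances(2) as ids:
    result = run_task(ids)

print(result, "| released:", released)
```

Because the release step sits in a `finally` clause, the instances are freed whether the task succeeds or fails, which is the behavior the paragraph above describes.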
Full observability. FastGPU provides visualization, log management, and task tracing so every run is auditable.
cGPU
cGPU lets multiple containers share a single physical GPU with strict resource isolation, improving GPU utilization, reducing costs, and improving security.
Share one GPU across containers without exposing business data. Most AI workloads do not need an entire GPU. cGPU allocates GPU resources across containers while isolating each container's data, so you pay only for the GPU memory and computing power each workload actually uses.
Flexible allocation by GPU memory or computing power. Allocate resources based on GPU memory or computing power ratios to match each container's requirements.
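The sketch below shows the kind of bookkeeping that allocation by GPU memory and computing power ratio implies. It is not cGPU's configuration interface. The class, function, container names, and the 16 GiB GPU are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ContainerRequest:
    name: str
    memory_gib: float     # GPU memory requested by the container
    compute_share: float  # fraction of the GPU's computing power, 0.0 to 1.0


def plan_shared_gpu(requests: List[ContainerRequest], gpu_memory_gib: float) -> Dict[str, dict]:
    """Check that all requests fit on one GPU and return each container's slice."""
    total_mem = sum(r.memory_gib for r in requests)
    total_compute = sum(r.compute_share for r in requests)
    if total_mem > gpu_memory_gib:
        raise ValueError(f"requested {total_mem} GiB exceeds {gpu_memory_gib} GiB")
    if total_compute > 1.0 + 1e-9:
        raise ValueError("requested computing power exceeds 100% of the GPU")
    return {
        r.name: {"memory_gib": r.memory_gib, "compute_pct": round(r.compute_share * 100)}
        for r in requests
    }


# Two hypothetical inference containers sharing one 16 GiB GPU.
plan = plan_shared_gpu(
    [ContainerRequest("inference-a", 6, 0.25), ContainerRequest("inference-b", 8, 0.5)],
    gpu_memory_gib=16,
)
print(plan)
```

In this example 2 GiB of memory and 25% of the computing power remain unallocated and could serve a third container.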

Three switchable scheduling policies for peak and off-peak hours. Switch between three computing power allocation policies in real time to match workload intensity, with no restarts required.
