PAI-Lingjun, also known as PAI-Lingjun AI Computing Service, is a large-scale, high-density computing service. It provides heterogeneous computing power for high-performance AI training and high-performance computing (HPC). PAI-Lingjun is designed for large-scale, distributed AI research and development scenarios, such as image recognition, natural language processing, search and ad recommendations, and large language models (LLMs). It is suitable for industries such as autonomous driving, financial risk control, drug discovery, AI for Science, the metaverse, the internet, and independent software vendors (ISVs). You pay only for the resources that you consume during AI training. This lets you use a highly scalable, high-performance, and cost-effective intelligent computing infrastructure without the need to build, tune, or maintain complex compute nodes, storage, and RDMA networks.
Service architecture
PAI-Lingjun is a computing cluster service that integrates software and hardware. The hardware includes servers, networks, storage, and overall cluster delivery and management. The software includes resource management and O&M for computing power, AI acceleration suites, cloud-native task management, and a complete AI development platform. It supports common AI frameworks such as PyTorch and TensorFlow.
The core hardware components of PAI-Lingjun are Panjiu servers and a high-performance RDMA network:
The service uses Panjiu servers developed by Alibaba Cloud. These servers feature multiple core configuration optimizations to ensure optimal hardware performance.
The network supports common Fat-Tree network topologies and multiple communication protocols, such as TCP/IP and RDMA. The PAI-Lingjun 25 Gbps and 100 Gbps networks are built independently. The 25 Gbps network is used for in-band server management. The 100 Gbps network uses multiple network interface controllers (NICs) for efficient communication in AI training. To improve network availability, PAI-Lingjun supports dual-uplink networking: each NIC has two ports connected to two separate switches. If one connection fails, traffic automatically fails over to the other.
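The dual-uplink behavior described above can be illustrated with a small sketch. This is only a conceptual model, not how PAI-Lingjun implements failover: the class `DualPortNic`, the port names, and the "first healthy port wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DualPortNic:
    """Hypothetical model of one NIC whose two ports connect to two separate switches."""
    name: str
    # Maps a port name to True (link up) or False (link down).
    link_up: dict = field(default_factory=lambda: {"port0": True, "port1": True})

    def active_port(self):
        """Return the first port whose link is up, or None if both links are down."""
        for port, up in self.link_up.items():
            if up:
                return port
        return None


nic = DualPortNic("nic0")
before = nic.active_port()      # both links up, so port0 carries traffic
nic.link_up["port0"] = False    # simulate losing the link to the first switch
after = nic.active_port()       # traffic moves to port1
nic.link_up["port1"] = False    # simulate losing the second uplink as well
down = nic.active_port()        # no healthy uplink remains
```

In this model the NIC stays reachable as long as either uplink is healthy, which is the availability property that dual-uplink networking is meant to provide.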
The software architecture consists of multiple components from the bottom up, including resource management, a computing acceleration library, machine learning and deep learning frameworks, a development environment, and task management.
For resource management, PAI-Lingjun uses container technology (Docker) to partition and schedule resources. It is compatible with orchestration tools such as Kubernetes (K8s).
For system O&M and monitoring, the Apsara Infrastructure Management Framework from Alibaba performs real-time monitoring of the cluster's underlying resources and status.
For computing acceleration, PAI-Lingjun supports acceleration libraries that are deeply customized and optimized for communication within PAI-Lingjun clusters.
The computing system supports submitting tasks and viewing task logs through a user interface. It also supports mainstream AI computing frameworks, such as PyTorch and TensorFlow.
Why choose PAI-Lingjun
PAI-Lingjun helps you build intelligent clusters that provide the following advantages:
Computing power as a service. Provides high-performance, highly elastic heterogeneous computing power as a service. Supports elastic resources for tens of thousands of GPUs, a single-cluster network capacity of up to 4 Pbps, and latency as low as 2 microseconds.
High resource efficiency. Increases resource utilization by 3 times and parallel computing efficiency by over 90%.
Integrated computing power pool. Supports unified allocation and integrated scheduling of computing power for AI and HPC scenarios with seamless connectivity.
Computing power management and monitoring. Provides an IT O&M platform deeply customized for heterogeneous computing power. This platform enables end-to-end monitoring and management, from heterogeneous computing power and pooled resources to usage efficiency.
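To make the term "parallel computing efficiency" concrete, the following sketch uses a common definition: measured cluster throughput divided by the ideal throughput of perfect linear scaling. The function name and all numbers are made up for illustration; they are not PAI-Lingjun measurements, and the source does not state which definition its figures use.

```python
def parallel_efficiency(single_gpu_throughput, num_gpus, cluster_throughput):
    """Return measured cluster throughput as a fraction of ideal linear scaling."""
    if single_gpu_throughput <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    # Ideal case: every added GPU contributes its full single-GPU throughput.
    ideal = single_gpu_throughput * num_gpus
    return cluster_throughput / ideal


# Hypothetical example: one GPU processes 1,000 samples/s,
# and a 128-GPU job processes 117,760 samples/s in total.
eff = parallel_efficiency(1000.0, 128, 117_760.0)
print(f"parallel efficiency: {eff:.0%}")
```

Under this definition, an efficiency of 100% means throughput grows in exact proportion to the number of GPUs. Communication and I/O overhead are what push real values lower.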
Benefits
Accelerate AI innovation. Full-stack performance acceleration can improve the iteration efficiency of compute-intensive projects by more than two times.
Maximize ROI. Efficient scheduling technology for pooled heterogeneous computing power ensures that all computing power is fully utilized, increasing resource utilization by up to 3 times.
Handle challenges at any scale. Easily meet the computing power demands of large models and large-scale engineering simulations, ensuring that innovation is not limited by computing power.
Visibility and control. Easily manage the allocation of heterogeneous computing power and continuously monitor and optimize it.
Scenarios
PAI-Lingjun is designed for large-scale, distributed AI research and development scenarios, such as image recognition, natural language processing, search and ad recommendations, and large language models (LLMs). It is suitable for industries such as autonomous driving, financial risk control, drug discovery, AI for Science, the metaverse, the internet, and ISVs.
Large-scale distributed training.
Ultra-large-scale GPU computing power system.
A full peer-to-peer networking architecture and complete resource pooling can be used with Platform for AI (PAI). It supports multiple training frameworks, such as PyTorch, TensorFlow, Caffe, Keras, XGBoost, and MXNet, and can meet the needs of AI training and inference services of various scales.
AI infrastructure.
Smooth scale-out. It meets GPU computing power requirements of different scales with smooth scale-out and linear performance scaling.
Intelligent data acceleration. Provides intelligent data acceleration for AI training scenarios by actively prefetching the required training data to improve training efficiency.
Higher resource utilization. Supports fine-grained control of heterogeneous resources to improve resource turnover efficiency.
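The idea of prefetching training data, mentioned in the list above, can be sketched in plain Python. This is a generic illustration of the technique and not the PAI-Lingjun implementation: `prefetch`, `load_batches`, and the `depth` parameter are hypothetical names, and a background thread with a bounded queue stands in for whatever mechanism the service actually uses.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread loads up to `depth` ahead."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel that marks the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the next batch is ready
        if item is done:
            return
        yield item


def load_batches(n):
    """Stand-in for a slow read; a real loader would fetch from remote storage."""
    for i in range(n):
        yield [i, i + 1]


result = list(prefetch(load_batches(3)))
```

Because the producer runs ahead of the consumer, the time spent loading the next batch overlaps with the time spent computing on the current one. This sketch does not forward errors raised by the loader, which a production version would need to handle.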
Autonomous driving.
Rich deployment and scheduling policies.
Multiple GPU resource scheduling policies ensure the efficient execution of training tasks. Cloud Parallel File Storage (CPFS) combined with an RDMA network architecture ensures training data supply and computing I/O. You can also use the tiered storage feature of Object Storage Service (OSS) to reduce storage costs for archived data.
Supports both training and simulation scenarios.
It intelligently provides integrated computing power, supporting both training and simulation scenarios. This improves iteration efficiency and reduces data migration costs in a collaborative model.
AI for Science.
Push the limits of innovation.
Based on the ultra-large-scale RDMA "high-speed network" and communication flow control technology in data centers, PAI-Lingjun achieves microsecond-level end-to-end communication latency. Ultra-large-scale linear scaling can deliver parallel computing power across tens of thousands of cards.
Integrate ecosystems and expand the boundaries of innovation.
Supports integrated scheduling of HPC and AI tasks, providing a unified and collaborative foundation for scientific research and AI, and promoting the integration of technology ecosystems.
Cloud-based research, inclusive computing power.
Supports cloud-native and containerized AI and HPC application ecosystems and deep resource sharing, making inclusive intelligent computing power readily available.
Features
High-speed RDMA network architecture. Alibaba began dedicated research into Remote Direct Memory Access (RDMA) in 2016. Alibaba Cloud has now built a large-scale "high-speed network" within its data centers. Based on extensive experience in RDMA network deployment, Alibaba Cloud has independently developed a high-performance RDMA network protocol based on end-to-end network collaboration and the HPCC congestion control algorithm. It has also implemented protocol hardware offloading through intelligent NICs. This reduces end-to-end network latency, increases network I/O throughput, and effectively avoids or mitigates performance loss for upper-layer applications caused by traditional network issues such as network failures or black holes.
High-performance collective communication library ACCL. PAI-Lingjun supports the high-performance Alibaba Collective Communication Library (ACCL). Combined with hardware such as network switches, ACCL provides congestion-free, high-performance cluster communication for AI clusters with tens of thousands of GPUs. Using ACCL, Alibaba Cloud has implemented intelligent matching of GPUs and NICs, automatic identification of physical topologies inside and outside nodes, and topology-aware, congestion-free communication algorithms. This completely eliminates network congestion, improves network communication efficiency, and enhances the scalability of distributed training systems. At a scale of tens of thousands of cards, it can achieve over 80% linear cluster efficiency. At a scale of hundreds of cards, the effective (computational) performance can exceed 95%, meeting the needs of more than 80% of business scenarios.
High-performance data preloading acceleration software KSpeed. Built on the high-performance RDMA network and ACCL, KSpeed is high-performance data preloading acceleration software developed for PAI-Lingjun to optimize data I/O intelligently. The storage-compute decoupled architecture is common in AI, HPC, and big data scenarios, but loading large amounts of training data can easily become an efficiency bottleneck. Alibaba Cloud uses KSpeed to improve data I/O performance by orders of magnitude.
GPU container virtualization solution eGPU. To address common issues such as the demands of large-scale AI jobs, the high cost of GPU hardware resources, and low GPU utilization in clusters, PAI-Lingjun supports the GPU virtualization technology eGPU. eGPU can effectively improve the GPU utilization of AI clusters with the following features:
Flexible partitioning based on both VRAM and computing power.
Support for multiple specifications.
Dynamic creation and destruction.
Hot upgrades.
User-mode technology for higher reliability.
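As a rough illustration of partitioning a GPU by both VRAM and computing power, with slices created and destroyed dynamically, consider the following sketch. It is not the eGPU API: `VirtualGpuPool`, its methods, and the 80 GiB capacity are hypothetical, and the sketch only tracks bookkeeping without touching real hardware.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualGpuPool:
    """Hypothetical bookkeeping for slicing one physical GPU."""
    vram_gib: int
    compute_pct: int = 100
    # Maps a slice name to its (vram_gib, compute_pct) reservation.
    slices: dict = field(default_factory=dict)

    def free(self):
        """Return the unreserved (vram_gib, compute_pct)."""
        used_vram = sum(v for v, _ in self.slices.values())
        used_compute = sum(c for _, c in self.slices.values())
        return self.vram_gib - used_vram, self.compute_pct - used_compute

    def create(self, name, vram_gib, compute_pct):
        """Reserve a slice if both VRAM and compute are available."""
        free_vram, free_compute = self.free()
        if name in self.slices or vram_gib > free_vram or compute_pct > free_compute:
            return False
        self.slices[name] = (vram_gib, compute_pct)
        return True

    def destroy(self, name):
        """Release a slice; return False if it did not exist."""
        return self.slices.pop(name, None) is not None


gpu = VirtualGpuPool(vram_gib=80)
a = gpu.create("job-a", vram_gib=40, compute_pct=50)  # fits
b = gpu.create("job-b", vram_gib=60, compute_pct=25)  # rejected: only 40 GiB free
gpu.destroy("job-a")                                  # frees 40 GiB and 50% compute
c = gpu.create("job-b", vram_gib=60, compute_pct=25)  # now fits
```

Tracking VRAM and compute as separate budgets lets a memory-heavy job and a compute-heavy job share one card, which is one way that partitioning can raise utilization.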
Limits on PAI-Lingjun networks
Item | Limit | Method to increase quota |
Maximum number of Lingjun CIDR blocks that can be created by a single account in the same region | 8 | For more information, see Manage quotas. |
Maximum number of Lingjun subnets that can be created in a single Lingjun CIDR block | 16 | For more information, see Manage quotas. |
Maximum number of Lingjun nodes in a single Lingjun subnet | 1000 | Not applicable |
Maximum number of Lingjun nodes in a single Lingjun CIDR block | 1000 | Not applicable |
CIDR blocks that can be configured for Lingjun CIDR blocks and Lingjun subnets | You can use custom CIDR blocks other than | Not applicable |
Maximum number of Lingjun connection instances that can be created by a single account in the same region | 16 | Not applicable |
Maximum number of IPv4 routes that a single Lingjun connection instance can learn from the public cloud | 50 | Not applicable |
Maximum number of IPv6 routes that a single Lingjun connection instance can learn from the public cloud | 25 | Not applicable |
Maximum number of Lingjun Hub instances that can be created by a single account in the same region | 4 | For more information, see Manage quotas. |
Maximum number of Lingjun Hub instances that can be connected to a single Lingjun CIDR block | 1 | For more information, see Manage quotas. |
Maximum number of Lingjun Hub instances that can be connected to a single Lingjun connection instance | 1 | For more information, see Manage quotas. |
Maximum number of Lingjun connection instances that can be connected to a single Lingjun Hub instance | 32 | For more information, see Manage quotas. |
Maximum number of Lingjun nodes in all Lingjun CIDR blocks within the same region that a single Lingjun Hub instance can support | 2000 | Not applicable |
Maximum number of routing policy entries that can be configured for a single Lingjun Hub instance | 100 | Not applicable |
Maximum number of secondary private IP addresses supported by a single Lingjun NIC | 3 | For more information, see Manage quotas. |
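When planning a network layout, it can help to check the plan against the limits in the table above before creating any resources. The following sketch encodes four of the default limits and validates a plan held in a plain dictionary. The plan format and the `check_plan` function are assumptions for illustration; they do not correspond to any PAI-Lingjun API.

```python
# Default values copied from the limits table; the first two can be raised through quotas.
LIMITS = {
    "cidr_blocks_per_region": 8,
    "subnets_per_cidr_block": 16,
    "nodes_per_subnet": 1000,
    "nodes_per_cidr_block": 1000,
}


def check_plan(plan):
    """Return a list of limit violations.

    `plan` maps a Lingjun CIDR block name to {subnet name: node count}.
    """
    problems = []
    if len(plan) > LIMITS["cidr_blocks_per_region"]:
        problems.append("too many CIDR blocks: %d" % len(plan))
    for block, subnets in plan.items():
        if len(subnets) > LIMITS["subnets_per_cidr_block"]:
            problems.append("%s: too many subnets: %d" % (block, len(subnets)))
        for subnet, nodes in subnets.items():
            if nodes > LIMITS["nodes_per_subnet"]:
                problems.append("%s/%s: too many nodes: %d" % (block, subnet, nodes))
        total = sum(subnets.values())
        if total > LIMITS["nodes_per_cidr_block"]:
            problems.append("%s: too many nodes in block: %d" % (block, total))
    return problems


ok_plan = {"block-a": {"subnet-1": 600, "subnet-2": 400}}
bad_plan = {"block-a": {"subnet-1": 600, "subnet-2": 600}}
ok = check_plan(ok_plan)    # no violations
bad = check_plan(bad_plan)  # 1,200 nodes exceed the per-block limit
```

Note that the per-subnet and per-block node limits are both 1,000, so a block with several subnets is bounded by the block-level total rather than by the sum of its subnet limits.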
Product specifications and activation
Activation: PAI-Lingjun AI Computing Service is currently in public preview. Alibaba Cloud sales representatives provide purchase links and administrator accounts for the console to eligible users. For more information about how to activate PAI-Lingjun, see Activate Lingjun.
Billing: PAI-Lingjun AI Computing Service supports installment and subscription billing methods. For more information about billing, see Billing.