Lingjun Intelligent Computing Overview

Lingjun Intelligent Computing is a public cloud AI computing service built for artificial intelligence workloads, designed to be stable, efficient, and highly scalable. Its high-performance network scales to as many as 100,000 GPU cards, and a single job can scale to 10,000 cards. The product includes out-of-the-box O&M management and cluster job management. It delivers large-scale, high-density computing power for LLM/AIGC, autonomous driving, search and recommendation, scientific intelligence, quantitative finance, and other industries and domains. Combined with comprehensive monitoring and automatic fault recovery, it gives your business a high-performance, highly scalable, and highly stable foundation for innovation, helping you gain a first-mover advantage in the AI era.

Core Advantages

Ultra-Large-Scale Cluster for Integrated Training and Inference
Designed for current and future large AI models, Lingjun supports rapid deployment and management of clusters with up to 100,000 GPUs and is compatible with mainstream deep learning frameworks. To address scalability challenges, Lingjun adopts a High Performance Network (HPN) architecture: a single-layer network scales to thousands of GPUs, and a two-layer network scales to tens of thousands of GPUs. This scalability and stability provide the computing power needed for LLM development in the AGI era.
Stable and Reliable
Building on job parallelism policies, Lingjun improves stability through self-healing interaction among cloud services and systematic optimization across the hardware architecture and platform-layer software. It provides fine-grained failure event detection, with real-time monitoring and exception handling at multiple levels, including nodes, GPU cards, and scheduling services. This reduces both the probability and the impact scope of failures and makes jobs more robust. The system delivers minute-level, fault-aware recovery with a failure detection rate exceeding 98%, which supports training of trillion-parameter MoE LLMs with an effective training time ratio above 99%.
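To see how minute-level recovery relates to an effective training time above 99%, the following minimal Python sketch computes the ratio of useful training time to wall-clock time. It assumes the ratio is defined as one minus the time lost to failures divided by total time. The function name and all numbers are hypothetical illustrations, not Lingjun measurements.

```python
def effective_training_ratio(total_hours: float, failures: int, recovery_minutes: float) -> float:
    """Fraction of wall-clock time spent on useful training (assumed definition)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    # Time lost is approximated as (number of failures) x (recovery time per failure).
    lost_hours = failures * recovery_minutes / 60.0
    return max(0.0, 1.0 - lost_hours / total_hours)


# Hypothetical run: 30 days (720 hours), 20 interruptions, 15 minutes lost per interruption.
ratio = effective_training_ratio(total_hours=720, failures=20, recovery_minutes=15)
print(f"{ratio:.4f}")  # 0.9931
```

Under these assumed numbers, 5 hours are lost out of 720, so the ratio stays above 99%. If each recovery took hours instead of minutes, the same failure count would push the ratio well below that threshold.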
Flexible storage combinations in multiple forms
Using RDMA's high-throughput, low-latency data path and the distributed parallel file system CPFS, clusters with tens of thousands of cards can read training samples continuously and efficiently throughout training. High-performance, low-latency caching meets the storage performance requirements of scenarios such as data preloading and checkpoint storage. Combined with products such as EBS and OSS, Lingjun also covers block, object, and other storage needs across multiple specifications and types. You can switch quickly between storage access modes by changing specifications, without unsubscribing from the resource. Storage traffic is separated from compute traffic, and asynchronous checkpointing with cluster-level communication optimization is supported, which prevents storage traffic from interfering with compute traffic and causing jitter in parallel communication.
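The general idea behind asynchronous checkpointing can be sketched in a few lines of Python. This is a generic single-process illustration, not Lingjun's implementation: the function name `save_checkpoint_async`, the JSON format, and the temporary path are all assumptions made for the example. The state is snapshotted synchronously, then written in a background thread so the training loop does not wait on storage.

```python
import copy
import json
import os
import tempfile
import threading


def save_checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Snapshot the state now, then write it to disk in a background thread."""
    snapshot = copy.deepcopy(state)  # taken synchronously so later updates are not captured

    def _write() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(snapshot, f)
        os.replace(tmp_path, path)  # rename so readers never see a partial file

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


# Hypothetical training state and checkpoint location.
state = {"step": 100, "weights": [0.1, 0.2]}
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")

writer = save_checkpoint_async(state, ckpt_path)
state["step"] = 101  # training continues while the write is in flight
writer.join()

with open(ckpt_path, encoding="utf-8") as f:
    saved = json.load(f)
print(saved["step"])  # 100
```

The snapshot copy is the only part that blocks training. In a real cluster the write would go to shared storage over a path separate from the compute network, which is the separation the paragraph above describes.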
Computing Power Foundation
Lingjun provides compute resource types including bare metal instances (clusters) and container instances, which standardize and scale AI computing resources. Bare metal instances provide peer-to-peer communication and interconnect capabilities. Paired with latest-generation CPUs, they offer ample local disk capacity and support GPU Direct Storage. They support standard Kubernetes interfaces and container network solutions, and can integrate secure containers to virtualize compute, storage, and network resources. This delivers pooled intelligent computing power to upper-layer platform applications, with resource utilization improvements of up to 3x in some scenarios.

Features

Supporting Continuous AI Innovation with Ten-Thousand-Card Parallel Computing Power

Common Scenarios
Intelligent Computing Lingjun is designed for future-oriented AI parallel computing scenarios, including large-scale foundation model training, MoE multimodal model training and fine-tuning, and parallel inference for models with up to trillions of parameters. It delivers hundreds of EFLOPS of floating-point performance, and its integrated hardware-software design provides higher performance and stability within the same footprint. It supports continuous innovation across foundation models, autonomous driving, scientific intelligence, fintech, search and recommendation, and other industries.
Ultra-large-scale non-blocking architecture
Designed around AI application performance and job stability, Lingjun implements network congestion control based on physical bandwidth and topology, and works with communication libraries to optimize traffic load. End-to-end latency is as low as 2 μs. Through host-network co-design, the cluster network is systematically optimized across architecture design, network topology, path selection, and flow control to achieve near-linear scaling at ultra-large scale. This delivers over 96% linear scaling efficiency at 10,000-GPU scale with stable steady-state network operation, and supports converged scale-up and scale-out expansion at the 100,000-GPU cluster level. Data exchange and interconnect resources within the cluster are coordinated, and systematic tuning pushes throughput toward its limits while improving system robustness.
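The 96% figure can be read as a scaling efficiency, assuming it means measured throughput at scale divided by the throughput that perfect linear scaling would give. The Python sketch below shows that calculation. The function name and the throughput numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def scaling_efficiency(base_gpus: int, base_throughput: float,
                       scaled_gpus: int, scaled_throughput: float) -> float:
    """Measured throughput at scale divided by ideal linear throughput."""
    if base_gpus <= 0 or scaled_gpus <= 0 or base_throughput <= 0:
        raise ValueError("GPU counts and base throughput must be positive")
    # Ideal case: throughput grows in proportion to the number of GPUs.
    ideal_throughput = base_throughput * (scaled_gpus / base_gpus)
    return scaled_throughput / ideal_throughput


# Hypothetical measurements: 1,000 GPUs reach 1,000 samples/s,
# and 10,000 GPUs reach 9,650 samples/s.
efficiency = scaling_efficiency(1_000, 1_000.0, 10_000, 9_650.0)
print(f"{efficiency:.1%}")  # 96.5%
```

Anything that makes communication cost grow with cluster size, such as congestion or uneven path selection, lowers the numerator, which is why the network optimizations listed above bear directly on this metric.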
Available regions
Intelligent Computing Lingjun is available in the following regions: China (Hangzhou), China (Beijing), China (Shanghai), China (Shenzhen), China (Guangzhou), China (Zhongwei), China (Hohhot), China (Zhangjiakou), China (Ulanqab), China (Heyuan), Hong Kong (China), Singapore, Japan (Tokyo), US (Atlanta), Germany (Frankfurt), Malaysia (Johor), Malaysia (Kuala Lumpur), UAE (Dubai), and Thailand (Bangkok). Both subscription (fixed period) and pay-as-you-go billing are supported. To use resources, submit a request to the Alibaba Cloud sales team.

High-performance intelligent computing instances

Support for diverse GPU card types in Lingjun accelerated computing instances
Intelligent Computing Lingjun is a bare metal cluster service for accelerated computing scenarios. Lingjun heterogeneous computing instances combine GPUs (graphics processing units), high-speed inter-chip interconnects, and high-speed network interconnects. The architecture is designed around the scale-up domain: all GPUs within a Lingjun accelerated computing instance share high-speed inter-chip interconnects, and at the cluster level each GPU card supports peer-to-peer RDMA communication. Instances also provide balanced GPU placement, local NVMe storage, and high-performance RDMA network interface cards, enabling features such as NUMA affinity.
Standardized Kubernetes and Serverless container instance support with AI scenario optimization
Once a cluster is deployed, you can use containerized GPU computing power on demand through ACS Serverless container instances and Kubernetes cluster services. Kubernetes handles the abstraction and orchestration of heterogeneous resources such as CPUs and GPUs. For AI scenarios, the product adds enhancements including operator optimization, prefill-decode (PD) disaggregation, expert-parallel (EP) deployment and load balancing, and cache-aware scheduling. Storage mounting and network access are also available on demand.
High-performance data storage access
To meet the demanding data loading requirements of high-speed checkpoint storage and intelligent computing workloads in training scenarios, Intelligent Computing Lingjun supports flexible mounting of RDMA-based high-performance distributed file storage (CPFS). It also supports VPC/EBS endpoint offloading to interoperate with standard cloud products.

Automated cluster management

Multi-dimensional filter interaction and automated cluster management
Intelligent Computing Lingjun supports allocation of Lingjun bare metal nodes and pooled resource scheduling. A cluster can use Alibaba Cloud-native Kubernetes products such as ACK for container cluster management, CPFS for high-performance parallel file storage, EBS for system disks and data disks, PAI as the AI platform service, and ARMS Prometheus for monitoring. Cluster deployment is completed with one click, and a high-performance intelligent computing cluster is ready within minutes.
Automatic fault recovery system ensures parallel computing job stability
The Lingjun automatic fault recovery system is built on a comprehensive monitoring framework. It protects customer workloads and provides automated self-healing for failures across multiple dimensions, including the OS, GPUs, the high-performance network, and DPUs. Fine-grained fault handling policies and automatic cold migration on failure minimize user downtime. Lingjun also provides an automated inspection platform that checks the overall health of intelligent computing clusters. It runs periodic, multi-dimensional health checks covering GPU heterogeneous computing power and inter-node collective communication, and supports continuous detection in single-node, dual-node, and multi-node modes. This combines proactive prevention (reducing failure probability) with post-failure fault tolerance (resolving problems faster and reducing lost training compute), progressively advancing a shift-left failure strategy.

Comprehensive Cluster Monitoring

Observability and O&M Capabilities
Lingjun provides a monitoring dashboard in the console that covers GPU heterogeneous computing resources, RDMA high-performance network monitoring and alerting at per-second granularity, and fault localization. It also displays operation logs, O&M jobs, and resource and environment information, giving users visibility into resource usage and running status across the entire cluster. Built on this monitoring data, Lingjun also supports basic GUI-based O&M operations, such as system restart, cluster and node diagnostics, and a web terminal.

Purchase and Billing

Product Purchase: Log on to the Lingjun console to purchase node resources. For more information, see Purchase a Lingjun Node.
Product Billing: Intelligent Computing Lingjun supports the subscription (fixed period) and pay-as-you-go billing methods. For more information, see Product Billing.

Usage

After purchasing Lingjun compute nodes, you must create a cluster before you can use the resources. For detailed instructions, see Create a Lingjun Basic Cluster.