As AI models continue to evolve and parameter counts soar, intelligent computing centers urgently need more capacity. However, network communication performance remains a bottleneck: the computing efficiency of large-scale distributed
GPU clusters still does not scale linearly with cluster size. Building intelligent computing centers therefore faces several challenges.
Large Network Scale
AI training relies on large-scale GPU cluster networking and distributed parallel computing, which must balance cluster scale against GPU efficiency. The network needs to support clusters
with thousands or even tens of thousands of GPUs.
High Network Performance Requirements
As models grow, inter-machine communication accounts for a larger share of training traffic, so access bandwidth and bandwidth utilization become the key network indicators affecting training efficiency.
Tight Construction and Deployment Timeline
Project timelines are tight and services must be deployed rapidly, which places higher demands on how quickly the network can be deployed.
Difficult O&M
If network instability occurs during training, the progress of the entire training task will be affected.
Ruijie AI-Fabric Intelligent Computing Center Network Solution
Meet Training Requirements of AI Models
Ultra-Large-Scale Networking
Network with Extremely High Throughput
Quick Deployment and Rollout
AI-based Intelligent O&M
Ultra-Large-Scale Networking
GPU servers are typically equipped with multiple NICs for model training. To improve GPU training efficiency and ensure low-latency, lossless cluster communication, Ruijie Networks' AI-Fabric Intelligent Computing Center Network
Solution uses a multi-rail networking architecture. This architecture connects NICs with the same ID to the same PoD, which keeps training traffic within the same PoD or ToR switch, reduces forwarding hops,
and significantly lowers network latency. To build a large-scale GPU cluster with high computing power, the solution employs three-layer networking with a 1:1 oversubscription ratio at each layer. This design allows for
a maximum of 32,768 400G ports and supports clusters of up to 32,000 GPUs.
AI-Fabric Three-Layer Multi-Rail Networking
Three-Layer Networking: Carry Large-Scale GPU Clusters and Realize High-Speed Communication Between Servers
Multi-Rail Architecture: Reduce Forwarding Hops, Lower Network Latency, and Improve Service Affinity
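The multi-rail wiring rule described above (NICs with the same ID attach to the same switch) can be sketched in a few lines. This is only an illustration: the function names, the eight-NICs-per-server figure, and the layout of one ToR switch per rail inside each PoD are assumptions made for the example, not details published for AI-Fabric.

```python
from __future__ import annotations

from collections import defaultdict

NICS_PER_SERVER = 8  # assumption: eight NICs (rails) per GPU server


def rail_switch(server_id: int, nic_id: int, servers_per_pod: int) -> tuple[int, int]:
    """Return (pod, rail), identifying the ToR switch a NIC attaches to.

    Assumed multi-rail rule: inside a PoD, NIC k of every server
    connects to the ToR switch that serves rail k.
    """
    pod = server_id // servers_per_pod
    return (pod, nic_id)


def wiring(num_servers: int, servers_per_pod: int) -> dict[tuple[int, int], list[tuple[int, int]]]:
    """Map each ToR switch to the (server, nic) endpoints attached to it."""
    plan = defaultdict(list)
    for server in range(num_servers):
        for nic in range(NICS_PER_SERVER):
            plan[rail_switch(server, nic, servers_per_pod)].append((server, nic))
    return dict(plan)


def same_tor(a: tuple[int, int], b: tuple[int, int], servers_per_pod: int) -> bool:
    """True when two (server, nic) endpoints share a ToR switch."""
    return rail_switch(*a, servers_per_pod) == rail_switch(*b, servers_per_pod)


if __name__ == "__main__":
    plan = wiring(num_servers=4, servers_per_pod=2)
    print(plan[(0, 3)])                    # endpoints on PoD 0, rail 3
    print(same_tor((0, 3), (1, 3), 2))     # same rail, same PoD
    print(same_tor((0, 3), (1, 4), 2))     # different rails
```

In this toy layout, traffic between same-ID NICs of two servers in one PoD crosses a single ToR switch, while traffic between different rails or different PoDs has to climb to a higher layer, which is the hop reduction the paragraph describes.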
Network with Extremely High Throughput
The solution adopts 400G RoCE to achieve low-latency, lossless network communication. RDLB detects link quality and performs per-packet global dynamic load balancing, raising network bandwidth utilization
to 97.6%.
Bandwidth Utilization 97.6%
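To see why per-packet balancing can use links more evenly than per-flow hashing, consider the toy comparison below. It is not the RDLB algorithm, whose internals are not described here: current link load stands in for "link quality", the function names are hypothetical, and the sketch makes no attempt to reproduce the 97.6% figure.

```python
from __future__ import annotations

import zlib


def per_flow_hash(flows: dict[str, int], num_links: int) -> list[int]:
    """Pin every packet of a flow to one link chosen by hashing the flow ID."""
    load = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        load[link] += packets
    return load


def per_packet_least_loaded(flows: dict[str, int], num_links: int) -> list[int]:
    """Send each packet on whichever link currently carries the least load."""
    load = [0] * num_links
    for packets in flows.values():
        for _ in range(packets):
            link = load.index(min(load))
            load[link] += 1
    return load


def balance(load: list[int]) -> float:
    """Average link load divided by the busiest link's load (1.0 means perfectly even)."""
    return sum(load) / (len(load) * max(load))


if __name__ == "__main__":
    # Made-up workload: four large flows of equal size over four uplinks.
    flows = {"job-a": 1000, "job-b": 1000, "job-c": 1000, "job-d": 1000}
    print(balance(per_flow_hash(flows, 4)))
    print(balance(per_packet_least_loaded(flows, 4)))
```

AI training traffic tends to consist of a small number of large flows, so hash collisions can leave some links idle while others are overloaded. Spraying packets individually avoids that, at the cost of possible packet reordering that a real implementation has to handle.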
Quick Deployment and Rollout
One-click RoCE configuration imports ECN, PFC, and other complex RDMA-related settings for quick deployment. Drawing on extensive deployment experience with leading Internet companies in large-scale AI networking,
the solution achieves rapid network rollout. It is compatible with mainstream cloud platforms, which eliminates additional adaptation work and allows services to be provisioned quickly.
One-Click RoCE Configuration
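Part of what makes ECN and PFC settings "complex" is that they must be mutually consistent. The sketch below shows the kind of check a templated rollout could run before pushing a profile. The `RoceProfile` fields, the `validate` function, and the threshold numbers are all hypothetical and do not reflect Ruijie's configuration model or recommended values. The one rule it encodes from general RoCE practice is that ECN marking should begin before the queue reaches the PFC pause threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RoceProfile:
    """Hypothetical bundle of lossless-Ethernet settings for one port."""
    roce_priority: int              # priority carrying RDMA traffic (0-7)
    pfc_priorities: frozenset[int]  # priorities with PFC enabled
    ecn_min_kb: int                 # queue depth where ECN marking starts
    ecn_max_kb: int                 # queue depth where marking reaches its maximum
    pfc_xoff_kb: int                # queue depth where a PFC pause is sent


def validate(profile: RoceProfile) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if not 0 <= profile.roce_priority <= 7:
        problems.append("RoCE priority must be between 0 and 7")
    if profile.roce_priority not in profile.pfc_priorities:
        problems.append("PFC is not enabled on the RoCE priority")
    if not 0 < profile.ecn_min_kb <= profile.ecn_max_kb:
        problems.append("ECN thresholds must satisfy 0 < min <= max")
    if profile.ecn_max_kb >= profile.pfc_xoff_kb:
        problems.append("ECN must start marking below the PFC pause threshold")
    return problems


if __name__ == "__main__":
    # Illustrative numbers only, not tuning advice.
    profile = RoceProfile(
        roce_priority=3,
        pfc_priorities=frozenset({3}),
        ecn_min_kb=150,
        ecn_max_kb=1500,
        pfc_xoff_kb=3000,
    )
    print(validate(profile))
```

Validating one shared profile and applying it to every port is one plausible way to keep thousands of interfaces consistent, which is the problem a one-click workflow is meant to remove.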
AI-based Intelligent O&M
The solution leverages big data and AI algorithms to comprehensively analyze network-wide data. It proactively discovers exceptions and accurately presents link issues based on network indicators, then quickly isolates the affected devices
and interfaces to narrow the fault scope.
Precise Analysis of Network Health
Proactively detect exceptions and present network indicators accurately
Deep Telemetry of Service Indicators
Leverage big data and AI algorithms to comprehensively analyze network-wide data
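As a rough illustration of proactive exception detection on telemetry, the sketch below flags samples that deviate sharply from their recent baseline. It deliberately swaps in a plain rolling z-score for the big data and AI algorithms mentioned above, which are not described in detail. The function name, window size, threshold, and sample counter values are all assumptions.

```python
from __future__ import annotations

from statistics import mean, pstdev


def find_anomalies(samples: list[float], window: int = 5, threshold: float = 3.0) -> list[int]:
    """Return indices of samples far from the mean of the preceding window.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` samples.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is a deviation.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up per-interval counter values for one interface.
    pause_frames = [10, 11, 10, 12, 11, 10, 95, 11, 10]
    print(find_anomalies(pause_frames))
```

A production system would correlate many indicators across devices and links rather than examine one counter in isolation, but the flow is the same: establish a baseline, detect deviation, then narrow down the affected interface.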
Ultra-Large-Scale Networking
The multi-rail networking architecture is adopted to support on-demand flexible deployment. The three-layer networking supports clusters of up to 32,000 GPUs.
Network with Extremely High Throughput
The RoCE lossless network is designed to achieve network communication with high bandwidth and low latency. RDLB ensures high bandwidth utilization of the network.
Quick Deployment and Rollout
One-click RoCE deployment improves onboarding efficiency. The solution is backed by multiple application cases and experience tuning RoCE at large scale.
AI-based Intelligent O&M
Real-time telemetry of key RoCE network indicators is visualized. Multi-dimensional monitoring and analysis prevent potential risks.