DC Fabric Architecture
Scalable leaf-spine designs aligned with GPU workload behaviour, east-west traffic patterns, and long-term growth requirements.
PalC designs and delivers open, AI-ready data center fabrics for high-throughput GPU communication, predictable latency, and operational stability across training and inference environments.
Comprehensive fabric engineering -- from leaf-spine design to storage access and operational telemetry.
Network designs engineered for the east-west traffic patterns, heavy data movement, and performance sensitivity of training and inference platforms.
SONiC-based open networking on multi-vendor hardware -- full operational control, zero proprietary lock-in.
NVMe-oF over RoCE transport: low-latency, high-IOPS tiered storage architectures for AI workloads.
Telemetry, monitoring, and diagnostics embedded at fabric design time -- gNMI streaming, Grafana dashboards, and real-time visibility from day zero.
IntelliSuite-driven validation against real traffic, scale limits, and failure scenarios before any production cutover.
Leaf-spine at the core, RoCE transport for GPU communication, NVMe-oF for storage, and full observability from Day 0.
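A quick back-of-envelope check on any leaf-spine design is the oversubscription ratio of a leaf: host-facing bandwidth divided by fabric-facing bandwidth. AI training fabrics typically target 1:1 (non-blocking). The sketch below uses illustrative port counts and speeds, not a PalC reference design:

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> float:
    """Leaf oversubscription = host-facing bandwidth / fabric-facing bandwidth."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Example: 32 x 100G GPU-facing ports, 8 x 400G uplinks to the spines.
ratio = oversubscription_ratio(32, 100, 8, 400)
print(ratio)  # 1.0 -> non-blocking, as training fabrics usually require
```

A ratio above 1.0 means uplinks can saturate under all-to-all collective traffic, which is exactly the pattern distributed training generates.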
Components
Open-source network operating system for disaggregated infrastructure.
Used across spine, leaf, and fabric switches.
RDMA over Converged Ethernet for lossless GPU-to-GPU communication.
Critical for GPU pods and AI clusters.
High-performance storage fabric with NVMe-oF for AI workloads.
Used across GPU pods, NVMe-oF clusters, and fabric switches.
High-performance network fabrics optimized for AI/ML workloads.
Spans GPU pods, DPU offload, and fabric.
Flexible, vendor-neutral network architectures that scale with your needs.
Used across spine and leaf layers.
Real-time observability across compute, network, and storage.
Used across GPU pods, NVMe-oF clusters, and fabric switches.
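To make the observability component concrete, here is a hedged sketch of the kind of check a streaming-telemetry pipeline can run continuously: flag any port that drops packets on a queue configured as lossless. The counter field names are hypothetical placeholders, not a specific gNMI or vendor schema:

```python
# PFC-enabled queues carrying RoCE traffic (matches the 3,4 convention
# used in the QoS configuration later in this page).
LOSSLESS_QUEUES = {3, 4}

def lossless_violations(counters: dict) -> list:
    """Return (port, queue) pairs where a lossless queue dropped packets."""
    bad = []
    for port, queues in counters.items():
        for q, stats in queues.items():
            if q in LOSSLESS_QUEUES and stats.get("dropped_pkts", 0) > 0:
                bad.append((port, q))
    return bad

# Illustrative counter snapshot, e.g. decoded from a telemetry stream.
snapshot = {
    "Ethernet0": {3: {"dropped_pkts": 0}, 4: {"dropped_pkts": 17}},
    "Ethernet4": {3: {"dropped_pkts": 0}, 4: {"dropped_pkts": 0}},
}
print(lossless_violations(snapshot))  # [('Ethernet0', 4)]
```

Any hit on a lossless queue indicates the PFC/buffer configuration is not holding, which should page an operator before it surfaces as a stalled training job.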
Lossless RoCE transport and high-performance NVMe-oF storage, tuned to meet AI traffic patterns and production-scale workloads.
Zero packet loss for RoCE transport across configured queues and lossless buffer pools.
```json
{
  "PORT_QOS_MAP": {
    "Ethernet0": {
      "pfc_enable": "3,4"
    }
  },
  "BUFFER_POOL": {
    "ingress_lossless_pool": {
      "type": "ingress",
      "size": "139458560",
      "xoff": "20971520"
    }
  }
}
```
| Parameter | Value |
|---|---|
| PFC queues | 3 & 4 (RoCE) |
| ECN marking | DCQCN |
| Packet loss | Zero |
| GPU latency | <1 µs |
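The `xoff` value in the buffer pool above is the PFC headroom: buffer reserved to absorb traffic that is still in flight after a pause frame is sent. A common approximation is round-trip propagation delay plus the peer's PFC response time, at line rate, plus a couple of MTUs. The sketch below uses illustrative delay figures and is not a validated SONiC headroom calculation:

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float,
                       mtu: int = 9216, pfc_response_us: float = 5.0) -> int:
    """Estimate per-port PFC headroom needed for lossless operation."""
    # Propagation delay: roughly 5 ns per metre of fibre, each way.
    prop_us = cable_m * 0.005
    # Data keeps arriving for the round-trip propagation time plus the
    # peer's PFC response time after XOFF is sent, all at line rate.
    in_flight_us = 2 * prop_us + pfc_response_us
    bytes_per_us = link_gbps * 1e9 / 8 / 1e6
    # Two MTUs cover the frame in flight and the frame being serialised.
    return int(in_flight_us * bytes_per_us + 2 * mtu)

# Example: 100G link over 100 m of fibre with jumbo frames.
print(pfc_headroom_bytes(100, 100))  # 93432
```

Summing this per-port headroom across all lossless ports is what sizes an ingress lossless pool like the one configured above.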
High-IOPS distributed storage tuned for AI workloads with striped pools.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvmeof
provisioner: nvmeof.csi.openebs.io
parameters:
  replicas: "3"
  poolType: "striped"
allowVolumeExpansion: true
```
| Parameter | Value |
|---|---|
| Protocol | RDMA / TCP |
| IOPS per pod | >1 million |
| Replicas / stripe | 3-way stripe |
| Latency class | Memory-class |
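Little's law gives a quick sanity check on whether an IOPS target like the one above is reachable: in-flight requests = arrival rate x average completion time. A small sketch, where the 100 µs figure is an assumed average NVMe-oF completion time rather than a measured value:

```python
import math

def required_inflight(target_iops: int, avg_latency_us: float) -> int:
    """Little's law: concurrent requests needed to sustain target_iops."""
    return math.ceil(target_iops * avg_latency_us / 1_000_000)

# Sustaining 1M IOPS at an assumed 100 us average completion time needs
# ~100 requests outstanding across the pod's NVMe-oF queue pairs.
print(required_inflight(1_000_000, 100))  # 100
```

This is why queue depth and queue-pair count matter as much as raw media speed: if the workload cannot keep enough requests outstanding, the advertised IOPS ceiling is unreachable regardless of transport.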
Designed for AI fabrics, cloud interconnects, and enterprise data centres under real production load.
Network latency between GPU nodes via RoCE v2 lossless fabric.
Sustained during large model training workloads at full scale.
Per GPU pod via NVMe-oF with RDMA transport and tiered storage.
Achieved during training — network no longer the bottleneck.
Purpose-designed infrastructure for the environments where AI performance, reliability, and scale are non-negotiable.
Data center networks supporting large-scale distributed training, inference pipelines, and GPU-dense environments—where network latency directly impacts model iteration speed.
Highly available, observable data center networks for transaction systems, analytics platforms, and compliance-driven environments—where audit trails and predictable performance are required.
Modern data center networks designed to integrate cleanly with public cloud—consistent networking policies, automation-first design, and DC interconnect for hybrid workload placement.
Environments where rapid scale and frequent infrastructure change demand stable, predictable network behavior—open disaggregated architectures that grow without re-architecture.
A proven engineering methodology that delivers production-grade results with operational excellence from day one.
Business goals, workload profiles, and scale requirements translated into architecture and fabric designs.
Engineering open fabric configurations, integration, and deployment tooling for production environments.
IntelliSuite-driven testing against real traffic, scale limits, and failure scenarios before cutover.
Ongoing support, telemetry monitoring, and continuous optimization so the fabric remains healthy long term.
Deploy and scale AI workloads without infrastructure bottlenecks.
Consistent latency and throughput under varying load conditions.
Real-time telemetry for proactive issue detection and resolution.
Open architectures and multi-vendor hardware preserve long-term flexibility.
Observability-first design keeps operations manageable as networks grow.
Deployments across AI fabrics, multi-cloud, automation, and security.
Next steps
Share your SLO targets for latency, bandwidth utilization, storage IOPS, and visibility. PalC will help design an open, production-ready RoCE + NVMe-oF architecture.