Dheeraj Ramasahayam -

Explainable AI for Root Cause Analysis in Large-Scale Datacenter Networks

Dheeraj Ramasahayam

March 30, 2026

Black-box RCA models are difficult to trust in datacenter operations. This paper presents a real-data-driven explainable RCA framework that combines topology-aware temporal attention with operator-facing explanations. The benchmark is built from public CAIDA passive 100G statistics and MAWI samplepoint-F summaries, yielding 162 captures from January 1, 2024 through March 25, 2026. Because public traces do not expose switch-level RCA labels, controlled localized incidents are injected on top of real traffic windows in a Clos topology. On a 9-node benchmark with 500 windows of length 6, the temporal XAI model achieves 100% failure accuracy and 100% root-cause accuracy, matches Random Forest and LSTM baselines, outperforms a non-attention GNN on failure detection, and concentrates 76.1% of explanation mass in the top three nodes. Estimated operator RCA time drops from 10.15 to 5.33 minutes. Across 9-, 52-, and 104-node fabrics, the model maintains 100% failure and root-cause accuracy while explanation compactness declines from 0.66 to 0.15, motivating hierarchical pod-and rack-level aggregation.

Autonomous Self-Healing Datacenter Networks: A Unified AI System for Prediction, Dete...

Dheeraj Ramasahayam

March 20, 2026

Modern datacenter networks still operate through fragmented workflows in which predictive maintenance, intrusion detection, root cause analysis (RCA), and remediation are studied separately and deployed through loosely coupled tooling. This paper presents a unified AI system for autonomous self-healing datacenter networks that connects four stages: temporal failure prediction, drift-adaptive intrusion detection, topology-aware RCA, and safety-gated recovery with counterfactual validation. The architecture combines streaming telemetry, network-flow analytics, graph reasoning, and a topology digital twin inside a single operational loop. The system is formalized as a constrained sequential decision problem over telemetry, flows, topology, and policy constraints, and is evaluated through staged module validation plus a trace-driven closed-loop emulation. Because no public benchmark spans all four stages jointly, the empirical evidence combines public telemetry and flow datasets, streaming emulation, packet-capture replay, topology-grounded recovery traces, and a synthetic end-to-end incident timeline that makes the module handoff contract explicit. Across failure-prediction benchmarks, the temporal sequence model reaches F1 scores of 0.3737 on optical zero-shot hard-failure evaluation and 0.4677 on Cisco BGP failure prediction within a 60-second warning window. In intrusion detection, the drift-adaptive hybrid improves weighted F1 from 61.35% to 68.69% on full CICIDS2017 cross-dataset transfer without retraining the base detectors and reaches 98.05% weighted F1 in a packet-capture replay case study. For RCA, topology-aware reasoning reaches 0.8380 target-localization F1 with 1.0000 hidden-target accuracy and 0.9394 temporal RCA accuracy at 5.2 s mean detection delay. In the recovery twin, gated actions improve mean reachability from 0.9740 to 1.0000, achieve 0.8182 recovery success, and block 100% of mismatched unsafe actions. The results show that prediction, detection, diagnosis, and remediation can be organized into a reproducible closed loop for next-generation self-healing datacenter networks.

ClosRCA-Bench: An Open Topology-Grounded Benchmark and Counterfactual Recovery Framew...

Dheeraj Ramasahayam

March 20, 2026

Open research on datacenter-network root cause analysis (RCA) is limited by two recurring problems: many studies rely on private production traces, and public studies often omit the topology and remediation context needed for self-healing systems. This paper introduces ClosRCA-Bench, a reproducible topology-grounded benchmark constructed from Cisco's public Clos-topology telemetry repository by combining event files, CDP maps, and per-device YANG telemetry into fixed graph windows. The resulting benchmark contains 311 windows over 11 topology nodes with 30 features per node and four cause families: BFD outage, blackhole, ECMP change, and interface shutdown. Two fault classes localize to hidden target devices that are not directly monitored, making topology-aware localization a first-class task. The paper evaluates rule-based, correlation-based, graph-only, and spatio-temporal graph RCA methods, then measures remediation with a safety gate and a counterfactual topology digital twin. On the held-out split, Random Forest achieves the strongest anomaly F1 at 0.9688 and weighted RCA cause F1 at 0.9707. The full STGNN reaches 0.8380 weighted F1 for target-device localization and 1.0000 hidden-target accuracy, while the no-topology ablation collapses to 0.0000 hidden-target accuracy. In temporal tracking, the full STGNN attains 0.9394 RCA accuracy with 5.2 s mean detection delay, improving over the graph-only baseline's 6.3 s delay at the same RCA accuracy. On the compound-failure slice, the full STGNN retains 0.9130 cause accuracy compared with 1.0000 on single-failure windows. In the counterfactual recovery twin, gated actions improve mean reachability from 0.9740 under fault to 1.0000 after recovery and achieve 0.8182 recovery-success rate while blocking 100% of mismatched unsafe actions. The main contribution is therefore an open benchmark and evaluation protocol that makes topology-aware, temporal, and recoveryaware RCA measurable.

Drift-Adaptive Intrusion Detection for Enterprise Networks

Dheeraj Ramasahayam

March 20, 2026

Enterprise intrusion detection remains a moving target because traditional rule-based systems are fast but narrow, while learned detectors often report strong in-dataset performance without demonstrating how they adapt to live distribution shift. This paper presents a reproducible benchmark over four evaluation settings: the full official UNSW-NB15 split, the full official NSL-KDD split, the full external CICIDS2017 corpus, and a cleaned CSE-CIC-IDS2018 external corpus. A transparent flow-signature IDS baseline is compared against Random Forest, LSTM, Transformer, a static Drift-Aware Hybrid, and an online Drift-Adaptive Hybrid controller that reweights the ensemble under detected shift. On UNSW-NB15, the static Drift-Aware Hybrid achieves the strongest weighted F1 score of 90.72%, slightly above Random Forest at 90.65%. On NSL-KDD, LSTM achieves the best weighted F1 score of 81.01%. Under cross-dataset transfer into the full CICIDS2017 corpus, LSTM remains best overall with a weighted F1 score of 71.53%, but the online Drift-Adaptive Hybrid improves the static hybrid from 61.35% to 68.69% without retraining the base detectors. A formal driftdetector study comparing Isolation Forest, ADWIN, DDM, and Page-Hinkley finds that Isolation Forest is strongest in this benchmark, reaching 70.58% post-adaptation weighted F1 with zero source-domain false positives. On CSE-CIC-IDS2018, the flow-signature baseline is unexpectedly strongest at 66.37% weighted F1, exposing the limits of source-only learned transfer. Family-level failure analysis on CICIDS2017 further shows that the adaptive hybrid partially recovers FTP-Patator traffic with F1 0.1793 where the source-trained LSTM collapses. Finally, a replayed packet-capture case study over 221,253 real packets from a local service attack trace detects the attack immediately at onset and achieves 98.05% weighted F1 on one-second bidirectional flow windows. The repository also reports latency-under-load, real-time streaming traces, deployment architecture, and explainability-by-ablation artifacts, making the benchmark useful both as an academic paper and as an open-source systems study.

Temporal Attention-Guided Sequence Learning for Zero-Shot Generalization in Datacente...

Dheeraj Ramasahayam

March 16, 2026

As hyperscale datacenter topologies grow increasingly complex, static Z-Score thresholding and classical stateless anomaly algorithms (like Random Forests) struggle to map the chaotic, multi-dimensional signatures preceding catastrophic network outages. Compounding this challenge is the extreme class imbalance inherent in network telemetry: millions of stable communication sequences are generated for every single failure sequence. We present a highly novel, deep learning approach for early-warning preventative alerting. By applying a Long Short-Term Memory (LSTM) Autoencoder infused with a Self-Attention Mechanism natively weighted against soft-failure occurrences, our model dynamically isolates critical cascading latency signatures to predict hard outages up to 60 seconds before they manifest. We demonstrate zero-shot generalization across disparate optical hardware topologies, proving robust cross-validation against single-dataset hardware bias.