Xueyong Tan

and 2 more

As tasks grow more complex, ensuring the reliability of big data task scheduling systems has become increasingly critical. Despite significant advances in this field, achieving the necessary level of reliability remains a challenge. A robust anomaly detection mechanism, integrated with an effective root cause localization process, is essential for maintaining system reliability. However, current methods face two major challenges when applied to troubleshooting big data task scheduling systems. First, although these systems generate several types of data, such as traces, system logs, and key performance indicators (KPIs), most current approaches focus mainly on traces. A trace-centric perspective may lack a comprehensive view of the system and can therefore miss specific abnormal conditions. Second, troubleshooting big data task scheduling systems typically involves two key phases: anomaly detection and root cause localization. Traditional approaches often treat these phases independently and neglect their intrinsic interdependencies; in particular, inaccuracies in anomaly detection can substantially compromise the effectiveness of localization. To address these issues, we propose DAGLoc, an end-to-end framework that integrates anomaly detection and root cause localization for big data task scheduling systems. DAGLoc uses graph neural networks (GNNs) to unify both processes, enabling more comprehensive and accurate troubleshooting. Experimental results on several widely recognized benchmark datasets show that DAGLoc consistently outperforms existing state-of-the-art methods, improving the reliability and efficiency of troubleshooting complex systems.