The World Health Organization (WHO) has established the Caregiver Skill Training (CST) program to support families of children diagnosed with Autism Spectrum Disorder. The Joint Engagement Rating Inventory (JERI) protocol is used to evaluate participants’ engagement levels within the CST program. Traditionally, JERI assessments rely on retrospective video analysis by qualified professionals, which incurs substantial labor costs. This study aims to make evaluation of the Expressive Language Level and Use (EXLA) criterion within JERI more efficient while remaining consistent with human scoring. To this end, we introduce a multimodal behavioral signal-processing framework that analyzes both child and caregiver behaviors and offers grading recommendations as an alternative to ratings by medical professionals. First, raw audio and video signals are segmented into short intervals via voice activity detection, speaker diarization, and speaker age classification, which both removes non-speech content and tags each segment with its speaker. Next, we extract a set of audio-visual features comprising our proposed interpretable, hand-crafted textual features, end-to-end audio embeddings, and end-to-end video embeddings. Finally, these features are fused at the feature level to train a linear regression model that predicts EXLA scores. We evaluate the framework on the largest in-the-wild database currently available under the CST program. Experimental results show that the proposed system achieves a Pearson correlation coefficient of 0.713 against expert ratings, a level of performance comparable to that of human experts. This approach not only provides immediate feedback to CST participants but also helps allocate scarce professional resources more efficiently in less developed regions.
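
The last stages of the pipeline described above (feature-level fusion, linear regression, and evaluation by Pearson correlation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, feature dimensions, and data below are hypothetical, the features are random synthetic numbers rather than real textual features or embeddings, and the regression is a plain ordinary-least-squares fit assumed here for illustration.

```python
import math
import random


def fuse(text_feats, audio_emb, video_emb):
    """Feature-level fusion: concatenate the per-session feature vectors."""
    return list(text_feats) + list(audio_emb) + list(video_emb)


def fit_linear_regression(X, y):
    """Ordinary least squares with a bias term, via the normal equations.

    Solves (X^T X) w = X^T y by Gauss-Jordan elimination with pivoting.
    Returns [bias, w_1, ..., w_d].
    """
    rows = [[1.0] + list(x) for x in X]  # prepend a constant bias column
    d = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(d)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(d)
    ]
    for col in range(d):
        pivot = max(range(col, d), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(d):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]


def predict(weights, x):
    """Apply the fitted linear model to one fused feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Synthetic stand-in data (hypothetical): 40 sessions, 2 features per modality.
random.seed(0)
TRUE_W = [1.0, 2.0, -1.0, 0.5, 1.5, -0.5]


def make_session():
    text = [random.random() for _ in range(2)]
    audio = [random.random() for _ in range(2)]
    video = [random.random() for _ in range(2)]
    x = fuse(text, audio, video)
    score = sum(w * v for w, v in zip(TRUE_W, x)) + random.gauss(0, 0.1)
    return x, score


sessions = [make_session() for _ in range(40)]
train, test = sessions[:30], sessions[30:]

weights = fit_linear_regression([x for x, _ in train], [s for _, s in train])
predicted = [predict(weights, x) for x, _ in test]
r = pearson(predicted, [s for _, s in test])
print(f"Pearson r on held-out synthetic sessions: {r:.3f}")
```

Concatenation is the simplest form of feature-level fusion: each modality contributes its own block of dimensions, and the linear model learns one weight per dimension. The correlation printed here reflects only the synthetic data and says nothing about the reported result of 0.713.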