Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition characterized by a spectrum of difficulties in social interaction and communication, together with repetitive behaviors expressed across different situations. The increase in ASD cases is a public health concern and necessitates more effective early detection methods. In this paper, we introduce a novel hierarchical feature fusion method for early ASD detection in children through the analysis of CoSAm, a code-switched (English and Hindi) speech corpus. Our approach integrates acoustic, linguistic, and paralinguistic features using advanced audio processing techniques and Transformer encoders. Our experiment highlights the importance of modality order in feature fusion, demonstrating enhanced diagnostic capabilities when specific sequences are used. The hierarchical fusion technique performs best, reaching an accuracy of 98.75% when acoustic and linguistic features are fused first and paralinguistic features are added afterwards. Additionally, by analysing Mel-frequency cepstral coefficients (MFCCs) and statistical features of the audio corpus, we were able to map the variability and complexity of speech patterns.