The accurate perception of audiovisual stimuli relies heavily on the spatial and temporal alignment of the sensory cues, with multisensory enhancement occurring only when those cues are presented in spatiotemporal congruency. While the spatial localization and temporal binding of audiovisual information have each been investigated in depth separately, many of the neural correlates underlying audiovisual interactions under spatiotemporally varying conditions remain unclear. Empirically evaluating the respective contributions of spatial and temporal discrepancies to behavioral responses is challenging when both vary simultaneously. Here, we sought to investigate how temporal and spatial offsets in cue presentation jointly shape the neural processing of audiovisual cues. To this end, we developed a biologically inspired neurocomputational model that reproduces behavioral evidence of perceptual phenomena observed in audiovisual tasks, namely the modality switch effect (temporal domain) and the ventriloquist effect (spatial domain). Tested against the race model, our network also proved able to successfully simulate the multisensory enhancement of reaction times produced by the concurrent presentation of audiovisual cues. Further investigation of the mechanisms implemented in the network upheld the centrality of cross-sensory inhibition in explaining the modality switch effect, and of cross-modal and lateral intra-area connections in regulating spatial localization. Finally, the model predicts an improvement in the temporal detection of stimuli of different modalities with increasing between-stimuli eccentricity, and indicates a plausible reduction in auditory localization bias with increasing inter-stimulus interval between spatially disparate cues.