Accurate multivariate meteorological time series forecasting, particularly of wind speed, is crucial for effective renewable energy integration. However, existing deep learning models often struggle to capture long-range temporal dependencies and inter-variable relationships at the same time. To address this limitation, we introduce and evaluate a hybrid architecture that combines Spatio-Temporal Convolutional Sequence-to-Sequence models with Transformer encoders. We investigate both serial and parallel configurations; the parallel design employs cross-attention for feature fusion. Experiments were conducted on regionally aggregated multivariate time series data from Southeast Asia, with the input spatial dimensions set to H=1 and W=1 so that the models focus on temporal and inter-variable dynamics. The parallel Spatio-Temporal Convolutional Sequence-to-Sequence-Transformer model achieved a Root Mean Squared Error (RMSE) of 0.1064 and a Mean Absolute Error (MAE) of 0.0858 in wind speed prediction, outperforming the baseline models. These results indicate that explicitly modeling long-range temporal dependencies and fusing features across variables benefits multivariate time series forecasting. Although this study demonstrates the architecture's efficacy on regionally aggregated time series, the design can in principle be applied to broader spatio-temporal problems on grid-based data.
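As an illustration of the cross-attention fusion mentioned for the parallel configuration, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the implementation used in this work. The choice of which branch supplies the queries, the names `conv_features` and `transformer_features`, the toy values, and the omission of learned projection matrices and multiple heads are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: T_q x d list of lists (features from one branch)
    keys:    T_k x d list of lists (features from the other branch)
    values:  T_k x d_v list of lists (features from the other branch)
    Returns (fused, weights) with shapes T_q x d_v and T_q x T_k.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in queries:
        # Similarity of this query step to every key step.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each fused vector is a convex combination of the value vectors.
        fused.append(
            [sum(wj * v[c] for wj, v in zip(w, values))
             for c in range(len(values[0]))]
        )
    return fused, weights


# Hypothetical toy features: 3 time steps from a convolutional branch and
# 4 time steps from a Transformer branch, each with feature size 2.
conv_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
transformer_features = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0], [2.0, 0.0]]

# Assumption: the convolutional branch queries the Transformer branch.
fused, weights = cross_attention(
    conv_features, transformer_features, transformer_features
)
```

Here each row of `weights` sums to one, so every fused vector is a weighted average of the other branch's features. A trained model would typically apply learned linear projections to the queries, keys, and values before this step.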