MATHEMATICAL DETAILS

Derivation of the Model Equations

The goal is to estimate the probability distribution of a species across geographic coordinates \((X, Y)\), given that the species is \(S = s\) and that it occurs \((O_s = 1)\), i.e., \(P(X, Y \mid S = s, O_s = 1)\). For simplicity, we use the expression \(S = s\) as shorthand for \(S = s, O_s = 1\). To include the environment in this probability, we write:
\[
P(X, Y \mid S = s) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} P(X, Y, \mathbf{E} \mid S = s) \, de_1 \cdots de_n.
\]
This expression introduces the environmental variables \(\mathbf{E} = (E_1, \ldots, E_n)\) and integrates over all possible environmental conditions.

1. APPLYING THE LAW OF TOTAL PROBABILITY: We expand the joint probability using the Law of Total Probability:
\[
P(X, Y \mid S = s) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} P(X, Y \mid S = s, \mathbf{E}) \, P(\mathbf{E} \mid S = s) \, de_1 \cdots de_n.
\]
This step decomposes the probability into two components: one that describes how environmental conditions affect the geographic distribution, and another that captures the species' niche, i.e., how environmental conditions influence species occurrence.

2. ASSUMPTION OF CONDITIONAL INDEPENDENCE: At this point, we make a key biological assumption: the occurrence of a species is driven entirely by environmental conditions, not by the geographic coordinates themselves. In statistical terms, we assume that species \(S\) is conditionally independent of the coordinates \((X, Y)\) given the environmental conditions \(\mathbf{E}\). Mathematically, this is expressed as:
\[
P(X, Y \mid S = s, \mathbf{E}) = P(X, Y \mid \mathbf{E}).
\]
This substitution reflects the idea that any effect of geographic coordinates on species occurrence acts only through their relationship with the environment. Biologically, this means that the environment determines where a species can occur, and coordinates influence species distributions only indirectly via their environmental characteristics. This is related to the standard statistical assumption of "no unmeasured confounders," implying that the environmental variables capture all relevant factors affecting species occurrence. Substituting this assumption, we get:
\[
P(X, Y \mid S = s) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} P(X, Y \mid \mathbf{E}) \, P(\mathbf{E} \mid S = s) \, de_1 \cdots de_n. \tag{1}
\]

3. ENVIRONMENTAL NICHE AS A CONDITIONAL DENSITY WITH SPECIES EMBEDDINGS: The term \(P(\mathbf{E} \mid S = s)\) captures the environmental niche of the species, representing a high-dimensional probability distribution over the environmental conditions where the species is likely to occur (often referred to as a "hypervolume" in ecological terms). We further decompose this distribution by introducing a latent variable \(\mathbf{z}_s\), which serves as a lower-dimensional vector representation of the species' complex niche. This is a form of representation learning, where \(\mathbf{z}_s\) is optimized to encapsulate the essential ecological characteristics of the species. Specifically:
\[
P(\mathbf{E} \mid S = s) = \int_Z P(\mathbf{E} \mid Z = \mathbf{z}_s) \, P(Z = \mathbf{z}_s) \, d\mathbf{z}_s,
\]
where \(\mathbf{z}_s\) represents the species' niche in a lower-dimensional space, allowing a more efficient and flexible representation of complex environmental dependencies.

4. COMBINING THE INTEGRALS: Substituting this into Equation (1), we arrive at the final expression:
\[
P(X, Y \mid S = s) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} P(X, Y \mid \mathbf{E}) \left( \int_Z P(\mathbf{E} \mid Z = \mathbf{z}_s) \, P(Z = \mathbf{z}_s) \, d\mathbf{z}_s \right) de_1 \cdots de_n. \tag{2}
\]
This final equation (2) combines two probability distributions that are independent of one another:

1. \(P(X, Y \mid \mathbf{E})\): the probability of geographic coordinates given environmental conditions.
2. \(P(\mathbf{E} \mid S = s)\): the environmental niche of the species, describing how likely the species is to occur under different environmental conditions.

Biologically, \(P(\mathbf{E} \mid S = s)\) captures the species' niche, detailing the range of environmental conditions under which the species can thrive. Meanwhile, \(P(X, Y \mid \mathbf{E})\) maps these environmental conditions to geographic locations, indicating where such suitable conditions are found on Earth.
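To make the factorization concrete, Equation (2) can be read as a two-stage generative recipe: draw environments from the species' niche, then map each environment to a location. The following minimal sketch illustrates this ancestral-sampling view; `niche_model`, `geo_model`, and their `sample` methods are hypothetical placeholders, not the actual implementation.

```python
def sample_range(z_s, niche_model, geo_model, n_samples=1000):
    """Monte Carlo view of Equation (2): ancestral sampling from the
    factorized model. Both model objects are hypothetical stand-ins."""
    # Draw environmental vectors from the species' niche, E ~ P(E | Z = z_s).
    E = niche_model.sample(z_s, n=n_samples)   # shape: (n_samples, n_env)
    # Map each sampled environment to coordinates, (X, Y) ~ P(X, Y | E).
    xy = geo_model.sample(E)                   # shape: (n_samples, 2)
    # The returned points are an empirical sample from P(X, Y | S = s).
    return xy
```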
Distance Metrics for Comparing High Dimensional Distributions

Let \(\mathbf{E}_{\mathrm{pred}}\) and \(\mathbf{E}_{\mathrm{obs}}\) be two matrices whose rows are the environmental vectors associated with predicted and observed occurrences, respectively. These matrices can have different numbers of rows, reflecting the flexibility of the distance measures used.

Energy Distance: The energy distance measures the expected difference between pairs of samples from two distributions and provides a smooth, computationally efficient metric for comparing distributions:
\[
E(\mathbf{E}_1, \mathbf{E}_2) = 2\,\mathbb{E}[\|\mathbf{e}_1 - \mathbf{e}_2\|] - \mathbb{E}[\|\mathbf{e}_1 - \mathbf{e}_1'\|] - \mathbb{E}[\|\mathbf{e}_2 - \mathbf{e}_2'\|],
\]
where \(\mathbf{e}_1\) and \(\mathbf{e}_2\) are vectors drawn from \(\mathbf{E}_1\) and \(\mathbf{E}_2\), and \(\mathbf{e}_1'\) and \(\mathbf{e}_2'\) are independent copies of \(\mathbf{e}_1\) and \(\mathbf{e}_2\). The energy distance primarily matches the marginal characteristics of the distributions, such as mean and variance, without explicitly aligning their spatial structures. This makes it computationally efficient and ideal for quickly moving the optimization towards higher-probability regions in the latent space, owing to its smoother loss surface (Székely & Rizzo, 2013).

Sinkhorn Distance: The Sinkhorn distance is a computationally efficient approximation of the Wasserstein distance (also known as the Earth Mover's Distance), which measures the cost of optimally transporting one probability distribution to match another. The classic Wasserstein distance is defined as:
\[
W(\mathbf{E}_1, \mathbf{E}_2) = \min_{\pi \in \Pi(\mu, \nu)} \sum_{i,j} \pi_{ij} \, c(\mathbf{e}_{1,i}, \mathbf{e}_{2,j}),
\]
where \(\Pi(\mu, \nu)\) represents the set of all transport plans between the distributions of \(\mathbf{E}_1\) and \(\mathbf{E}_2\), and \(c(\mathbf{e}_{1,i}, \mathbf{e}_{2,j})\) is typically the squared Euclidean distance between vectors \(\mathbf{e}_{1,i}\) and \(\mathbf{e}_{2,j}\). The Wasserstein distance captures detailed structural characteristics such as covariance and the spatial distribution of the data, making it particularly valuable for ecological applications (Villani, 2008). However, it is computationally expensive, especially in the high-dimensional settings typical of environmental data.

The Sinkhorn distance introduces an entropic regularization term to the Wasserstein objective, which both smooths the optimization landscape and makes the transport plan differentiable, allowing gradient-based optimization. The Sinkhorn distance is defined as:
\[
S(\mathbf{E}_1, \mathbf{E}_2) = \min_{\pi \in \Pi(\mu, \nu)} \sum_{i,j} \pi_{ij} \, c(\mathbf{e}_{1,i}, \mathbf{e}_{2,j}) + \epsilon \sum_{i,j} \pi_{ij} \log \pi_{ij},
\]
where \(\epsilon\) is a regularization parameter that controls the trade-off between transport cost and entropy. The Sinkhorn algorithm iteratively adjusts dual variables to find a stable transport plan; this procedure is computationally efficient and differentiable, unlike traditional optimal transport solutions. The differentiability enables the use of gradient descent for embedding optimization, making the Sinkhorn distance particularly suitable for complex, high-dimensional comparisons (Cuturi, 2013).
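For illustration, both distances can be computed directly on sample matrices. The following is a minimal PyTorch sketch assuming uniform weights over rows and a squared Euclidean cost; the Sinkhorn loop follows the standard log-domain formulation, and a production implementation would typically use a library such as POT or GeomLoss.

```python
import math
import torch

def energy_distance(E1, E2):
    """Energy distance between two sample matrices (rows = env. vectors)."""
    d12 = torch.cdist(E1, E2).mean()  # E[||e1 - e2||]
    d11 = torch.cdist(E1, E1).mean()  # E[||e1 - e1'||]
    d22 = torch.cdist(E2, E2).mean()  # E[||e2 - e2'||]
    return 2 * d12 - d11 - d22

def sinkhorn_distance(E1, E2, eps=0.1, n_iters=100):
    """Entropy-regularized optimal transport cost between two sample
    matrices, via log-domain Sinkhorn iterations for numerical stability."""
    C = torch.cdist(E1, E2) ** 2              # squared Euclidean cost matrix
    n, m = C.shape
    log_mu = torch.full((n,), -math.log(n))   # uniform weights on rows of E1
    log_nu = torch.full((m,), -math.log(m))   # uniform weights on rows of E2
    f = torch.zeros(n)
    g = torch.zeros(m)
    for _ in range(n_iters):                  # alternating dual updates
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_mu[:, None], dim=0)
    # Transport plan implied by the dual potentials; we report the transport
    # cost under this plan (the entropic term itself is often dropped).
    log_pi = log_mu[:, None] + log_nu[None, :] + (f[:, None] + g[None, :] - C) / eps
    return (log_pi.exp() * C).sum()

# Example: sample sets of different sizes over 20 environmental variables.
E_pred, E_obs = torch.randn(1000, 20), torch.randn(100, 20)
print(energy_distance(E_pred, E_obs), sinkhorn_distance(E_pred, E_obs))
```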
Zero-shot Optimization Details

Monte Carlo Sampling for Loss Estimation: A key aspect of the optimization approach for zero-shot species is the Monte Carlo sampling of environmental vectors. At each optimization step, \(N_{\mathrm{pred}}\) predicted environmental vectors are drawn from the generative model conditioned on the current species embedding \(\mathbf{z}_{s^*}\). This yields a matrix of predicted environmental vectors, \(\mathbf{E}_{\mathrm{pred}}\), which is compared to the matrix of observed environmental vectors, \(\mathbf{E}_{\mathrm{obs}}\). The flexibility of the energy and Sinkhorn distances allows them to operate on matrices with differing numbers of rows, meaning \(N_{\mathrm{pred}}\) does not have to match the number of observed points \(N_{\mathrm{obs}}\). This flexibility is advantageous because it permits balancing computational efficiency against optimization effectiveness: a larger \(N_{\mathrm{pred}}\) gives more accurate distance estimates but increases computational cost, while a smaller \(N_{\mathrm{pred}}\) adds stochasticity, which can help escape local minima during stochastic gradient descent. In practice, I found that \(N_{\mathrm{pred}} = 1000\) samples per optimization iteration provided a good balance, offering sufficient stochasticity without excessively slowing computation, particularly for test species with approximately 100 observed occurrence points.

A good approach to the optimization was to start the parameter \(\alpha\) at 1 and gradually decrease it to 0 throughout the optimization, using the blended loss \(\mathcal{L} = \alpha \, E(\mathbf{E}_{\mathrm{pred}}, \mathbf{E}_{\mathrm{obs}}) + (1 - \alpha) \, S(\mathbf{E}_{\mathrm{pred}}, \mathbf{E}_{\mathrm{obs}})\). Initially, the energy distance is emphasized due to its computational efficiency and smoother loss surface, which facilitates rapid convergence toward higher-probability regions in the latent space. As \(\alpha\) decreases, the focus shifts towards the Sinkhorn distance, which allows fine-tuning by capturing intricate distributional details, albeit with a more complex loss landscape. The differentiability of both the energy distance and the Sinkhorn distance is crucial, as it allows gradients to be backpropagated efficiently through the loss function, enabling gradient-based optimization of \(\mathbf{z}_{s^*}\), as sketched below.
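A minimal sketch of this optimization loop is shown here, reusing the distance functions sketched earlier. The `niche_model.sample_env` call is a hypothetical stand-in for the trained generative model, assumed to be reparameterized so that gradients flow from the sampled environmental vectors back to the embedding.

```python
import torch

def optimize_embedding(E_obs, niche_model, emb_dim=32, n_steps=500,
                       n_pred=1000, lr=1e-2):
    """Zero-shot embedding search (sketch). `niche_model.sample_env` is a
    hypothetical, reparameterized sampler for P(E | z)."""
    z = torch.zeros(emb_dim, requires_grad=True)  # embedding for species s*
    opt = torch.optim.Adam([z], lr=lr)
    for step in range(n_steps):
        alpha = 1.0 - step / (n_steps - 1)        # anneal alpha from 1 to 0
        E_pred = niche_model.sample_env(z, n=n_pred)   # (n_pred, n_env)
        # Blend the distances: energy early (smooth, fast convergence),
        # Sinkhorn late (finer distributional detail).
        loss = (alpha * energy_distance(E_pred, E_obs)
                + (1 - alpha) * sinkhorn_distance(E_pred, E_obs))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```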
MODEL ARCHITECTURE DETAILS

NichEncoder Stage 1: Conditional Variational Autoencoder (CVAE)

The first stage of NichEncoder is a Conditional Variational Autoencoder (CVAE) that generates environmental variables \(\mathbf{E}\) from species-specific embeddings \(\mathbf{z}_s\). The CVAE conditions on \(\mathbf{z}_s\) by concatenating these embeddings to each layer of both the encoder and decoder networks, allowing the model to learn species-specific environmental niches. The architecture consists of multiple layers of Multi-Layer Perceptrons (MLPs) that encode and decode the input environmental variables.

ENCODER AND DECODER STRUCTURE
The encoder network takes environmental variables concatenated with species embeddings as input and processes them through three fully connected layers, each with 1024 neurons and ReLU activations. The encoder outputs the mean (\(\mu\)) and log variance (\(\log \sigma^2\)) of the latent variable \(\mathbf{z}_E\). The decoder network mirrors this structure, taking the latent variable \(\mathbf{z}_E\) and species embeddings as input to reconstruct the environmental variables \(\hat{\mathbf{E}}\). Species embeddings are learned using an embedding layer that maps species identifiers to \(\mathbf{z}_s\) vectors, which are used throughout the encoder and decoder networks.

LOSS FUNCTION AND OPTIMIZATION STRATEGY
The CVAE is optimized using a combined loss function that includes the reconstruction loss, the KL divergence, and a species latent space regularization. The \(\gamma\) parameter plays a crucial role in balancing the contributions of the likelihood and the KL divergence during training, progressively adjusting the importance of these terms to ensure the latent variables capture the data manifold effectively. The total loss function is given by:
\[
\mathcal{L} = \gamma \cdot \mathbb{E}_{q(\mathbf{z}_E \mid \mathbf{E}, \mathbf{z}_s)}[\log p(\mathbf{E} \mid \mathbf{z}_E, \mathbf{z}_s)] - D_{\mathrm{KL}}\big(q(\mathbf{z}_E \mid \mathbf{E}, \mathbf{z}_s) \parallel p(\mathbf{z}_E)\big) + \lambda \cdot \mathcal{L}_{\mathrm{species}},
\]
where \(\gamma\) is initially set to down-weight the likelihood term and is gradually increased during training to enhance the model's focus on data reconstruction. The species regularization term, \(\mathcal{L}_{\mathrm{species}}\), is designed to keep the species latent space compact by penalizing a combination of squared (L2) and absolute (L1) values of the latent variables, effectively balancing between Gaussian (ridge regression) and Laplace (lasso regression) priors:
\[
\mathcal{L}_{\mathrm{species}} = (1 - \alpha) \cdot \sum \mathbf{z}_s^2 + \alpha \cdot \sum |\mathbf{z}_s|,
\]
where \(\alpha\) is a tunable parameter controlling the weighting between the two regularization terms. The CVAE is trained with the AdamW optimizer at a learning rate of 0.002, and a one-cycle learning rate policy is employed over 2500 epochs to dynamically adjust the learning rate during training.

Stage 2: Rectified Flow Model

The second stage of NichEncoder uses a Rectified Flow model, inspired by Liu et al. (2022), to address the challenge of modeling the non-Gaussian posterior distribution of \(\mathbf{z}_E\). This model estimates an Ordinary Differential Equation (ODE) that transforms noise into the complex distribution observed in \(\mathbf{z}_E\).

RECTIFIED FLOW MODEL DETAILS
The Rectified Flow model is based on a U-net style architecture adapted with MLP layers for vector field estimation (Figure 2, main manuscript). The architecture consists of three encoding and three decoding layers, with progressively decreasing and then increasing neuron counts (512, 256, and 128). Inputs to the model include the coordinates of the latent variables, a time-variable encoding, and the species embeddings.

TRAINING DATA CREATION
Training data for the Rectified Flow model is generated by interpolating between Gaussian noise samples and target samples from the latent distribution, forming a path that the model learns to approximate. The model is trained to predict the direction vectors of these interpolated samples, effectively estimating a vector field that points towards areas of high density in the target distribution.

TRAINING PROCESS
The training process of the Rectified Flow model consists of two stages:
1. In the first stage, the model learns to estimate the vector field that guides noise samples toward the target latent distribution. The model is trained on paths created by the interpolation, capturing the flow dynamics through the latent space.
2. In the second stage, the model refines the learned ODE by training on noise samples and their corresponding points after the initial ODE transformation. This rectification stage adjusts the ODE to approximate a linear transformation from noise to the target distribution, enhancing sampling efficiency.

LOSS FUNCTION AND OPTIMIZATION SETTINGS
The loss function minimizes the mean squared error (MSE) between the predicted and target vectors during training:
\[
\mathcal{L} = \mathrm{MSE}(\hat{\mathbf{v}}, \mathbf{v}).
\]
The optimizer is AdamW with a learning rate of 0.001 and a weight decay of 0.01. A one-cycle learning rate policy is applied over 6000 epochs, adjusting learning rates dynamically to ensure effective training.
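The core of stage 1 reduces to a few lines. The sketch below shows one training step under the usual rectified flow objective; the `model(xt, t)` signature is an assumption, and conditioning inputs (e.g., species embeddings) are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, x1, opt):
    """One stage-1 training step. The straight path x_t = (1 - t) x0 + t x1
    from noise x0 to a data sample x1 has constant velocity x1 - x0, which
    the network is trained to predict from (x_t, t)."""
    x0 = torch.randn_like(x1)          # Gaussian noise endpoints
    t = torch.rand(x1.shape[0], 1)     # random interpolation times in [0, 1]
    xt = (1 - t) * x0 + t * x1         # points along the straight paths
    v_pred = model(xt, t)              # predicted vector field (signature assumed)
    loss = F.mse_loss(v_pred, x1 - x0) # MSE against the path velocity
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```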
TRAJECTORY SAMPLING
Trajectories are sampled using an ODE solver (ode45), integrating the learned vector field to transform noise into samples from the target distribution efficiently. This approach allows the model to generate high-quality samples with a minimal number of integration steps.

GeODE Model Architecture

GeODE uses a modified rectified flow architecture, similar to that used in NichEncoder but tailored specifically to geographic data. The model takes 2-dimensional noise vectors as inputs, which are transformed through the rectified flow mechanism into the desired geographic coordinates. The inputs are random noise vectors representing initial guesses in 2D space, while the conditioning input \(\mathbf{E}\) comprises the environmental variables associated with each geographic location. Each environmental vector is normalized using means and standard deviations calculated from the data, ensuring numerical stability during training.

U-NET ARCHITECTURE
The core of GeODE is a U-net style structure implemented with Multi-Layer Perceptrons (MLPs) instead of convolutional layers. The U-net consists of two primary paths: downsampling and upsampling. In the downsampling path, the input noise vectors and environmental conditioning are passed through three fully connected layers with progressively smaller neuron counts (512, 256, and 128). These layers reduce the dimensionality while learning broad, high-level representations of the relationship between geographic locations and environmental factors. The upsampling path reconstructs the geographic coordinates by reversing the dimensionality reduction, using three corresponding fully connected layers to produce the final outputs. Skip connections between the downsampling and upsampling paths retain and propagate finer details, leading to more accurate predictions.

INPUT CONDITIONING AND ENCODING
In addition to the U-net structure, GeODE includes specialized encoding layers for the time variable \(t\) and the environmental conditioning vectors. A linear layer encodes the time step, representing the interpolation factor between noise and target coordinates. Another linear layer processes the environmental vectors, embedding them into a latent space that informs the transformation from noise to geographic coordinates. These encoded time and environmental vectors are concatenated with the latent representations from the U-net, allowing the model to incorporate both spatial and environmental dependencies into its predictions.
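Sampling from the trained model then amounts to integrating the conditional vector field. The sketch below uses a fixed-step Euler scheme for brevity (the trajectory sampling described above uses an adaptive ode45 solver), and the `model(xy, t, env)` signature is an assumption of the sketch.

```python
import torch

@torch.no_grad()
def sample_coordinates(model, env, n_steps=50):
    """GeODE-style conditional sampling: integrate the learned vector field
    from 2-D Gaussian noise to geographic coordinates, conditioned on the
    environmental vectors `env` (one row per desired sample)."""
    xy = torch.randn(env.shape[0], 2)        # one 2-D noise vector per row
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((env.shape[0], 1), i * dt)
        xy = xy + model(xy, t, env) * dt     # follow the conditional flow
    return xy                                # predicted (X, Y) samples
```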
Training Data Creation

Training data for GeODE is generated through a Monte Carlo sampling process. Gaussian noise samples are drawn for both the latitude and longitude dimensions, creating initial random coordinate sets. These coordinates are linearly interpolated with the target coordinates (actual occurrence points), and the interpolation paths form the inputs for training, allowing the model to learn how to evolve from noise to realistic geographic distributions.

Training Process

STAGE 1: VECTOR FIELD ESTIMATION
In the first stage, the model learns to estimate a vector field that guides the initial noise samples toward the target coordinates, which represent actual geographic occurrence points. The U-net predicts the transformation vectors that align the noise vectors with the target distribution: for each pair of a Gaussian noise sample and an occurrence point, the model is trained on points along the linear interpolation path between them, learning to follow the path from noise to the target coordinates.

STAGE 2: ODE RECTIFICATION
The second stage refines the transformation by rectifying the ODE. In this step, the ODE is adjusted so that it evolves the noise vectors along a near-linear path toward the target coordinates, minimizing the number of integration steps required to generate realistic samples during inference.

Loss Function and Optimization

The loss function used to train GeODE is the mean squared error (MSE) between the predicted and target coordinates:
\[
\mathcal{L} = \mathrm{MSE}(\hat{\mathbf{c}}, \mathbf{c}).
\]
Optimization is performed using the AdamW optimizer with a learning rate of 0.001 and a weight decay of 0.01. A one-cycle learning rate policy is employed over 5000 epochs, dynamically adjusting the learning rate to improve training efficiency.
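The rectification stage can be summarized as generating (noise, endpoint) pairs with the stage-1 ODE and retraining on the straight paths between them. The following is a minimal sketch under the same assumed `model(xy, t, env)` signature as above.

```python
import torch

@torch.no_grad()
def make_reflow_pairs(model, env, n_steps=50):
    """Build stage-2 rectification pairs: couple each noise sample with the
    point the stage-1 ODE transports it to. Retraining on straight lines
    between these pairs straightens the learned flow, so inference needs
    fewer integration steps."""
    x0 = torch.randn(env.shape[0], 2)        # fresh 2-D noise samples
    x1 = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):                 # run the learned stage-1 ODE
        t = torch.full((env.shape[0], 1), i * dt)
        x1 = x1 + model(x1, t, env) * dt
    return x0, x1                            # training pairs for rectification
```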