As shown in Figure 1, the first step in a simulation workflow typically involves defining the configuration of the atoms (or, more generally, particles) in the system. The mBuild Python library42,43 has been developed as a general, customizable tool for constructing arbitrarily complex system configurations programmatically (i.e., in a scriptable fashion). Key to the mBuild library is its underlying Compound data structure. A Compound is a general “container” that can describe effectively anything: an atom, a collection of atoms, a molecule, a generic point particle, a collection of Compounds, operations on the underlying Compounds and/or data, etc. Compounds can be duplicated, rotated, translated, scaled, etc. to construct a system. Compounds can also contain information regarding connections between the atoms, either by defining fixed Bonds within a Compound or by adding Ports that allow connections to be made between separate Compounds. Ports define both the location and the orientation of a connection; in atomistic systems, the number of Ports and their locations are typically representative of the underlying chemistry. For example, Figure 2 shows Python code that defines a CH2 moiety with two C-H Bonds and two Ports. To create a connection between two Compounds, a user simply states which Ports should connect, and mBuild automatically performs the required translations and reorientations, creating a new (composite) Compound (see Klein et al.42 for more details). This allows complex systems to be built up from smaller, interchangeable pieces that can be connected, following the concept of generative modeling.42 This design approach allows repetitive structures, such as polymer chains and planar tilings (as used in Figure 2), to be expressed declaratively, and also allows significant modifications to system structure/chemistry to be made with only minimal changes to the initialization routines.
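
The hierarchical container idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the mBuild API: the class and function names (`Compound`, `make_ch2`, `connect`, `make_chain`), the coordinates, and the port offsets are all assumptions made for illustration, and, unlike mBuild, the sketch only translates the moved unit and does not reorient it.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Compound:
    """Toy hierarchical container (hypothetical, NOT the mBuild API).

    A Compound without children acts as a particle; a Compound with
    children is a composite that also stores bonds and named ports.
    """
    name: str
    pos: tuple = (0.0, 0.0, 0.0)
    children: list = field(default_factory=list)
    bonds: list = field(default_factory=list)   # pairs of particles
    ports: dict = field(default_factory=dict)   # label -> attachment point

    def add(self, child):
        self.children.append(child)
        return child

    def particles(self):
        """Yield every leaf particle, however deeply it is nested."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.particles()

    def translate(self, shift):
        """Rigidly shift this Compound, its ports, and all of its children."""
        self.pos = tuple(a + s for a, s in zip(self.pos, shift))
        self.ports = {label: tuple(a + s for a, s in zip(point, shift))
                      for label, point in self.ports.items()}
        for child in self.children:
            child.translate(shift)


def make_ch2():
    """CH2 unit with two C-H bonds and two ports (illustrative coordinates)."""
    ch2 = Compound("CH2")
    c = ch2.add(Compound("C", (0.0, 0.0, 0.0)))
    h1 = ch2.add(Compound("H", (0.0, 0.1, 0.0)))
    h2 = ch2.add(Compound("H", (0.0, -0.1, 0.0)))
    ch2.bonds += [(c, h1), (c, h2)]
    ch2.ports = {"up": (0.075, 0.0, 0.0), "down": (-0.075, 0.0, 0.0)}
    return ch2


def connect(a, port_a, b, port_b):
    """Move b so the two ports coincide; return the new C-C bond.

    Both ports are consumed by the connection. Only a translation is
    applied here; reorientation is deliberately left out of the sketch.
    """
    target = a.ports.pop(port_a)
    source = b.ports.pop(port_b)
    b.translate(tuple(t - s for t, s in zip(target, source)))
    return (next(a.particles()), next(b.particles()))


def make_chain(n):
    """Build a linear chain of n CH2 units by repeated port connections."""
    chain = Compound("chain")
    previous = None
    for _ in range(n):
        unit = chain.add(make_ch2())
        if previous is not None:
            chain.bonds.append(connect(previous, "up", unit, "down"))
        previous = unit
    return chain


chain = make_chain(4)
print(len(list(chain.particles())))  # 12 particles: 4 C and 8 H
```

Changing the repeat unit only requires swapping `make_ch2` for another builder that exposes the same two port labels, which is the sense in which the pieces are interchangeable.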

3.2. Foyer

After a system configuration is initialized, the interactions between all constituents must be defined before the system can be simulated (as shown in Figure 1), i.e., the force field must be applied to the system. The Foyer library44 has been developed as a general tool for applying force fields to molecular systems (i.e., atom-typing) that provides a standardized approach to defining chemical context and atom-typing rules22,45. In Foyer, the force field parameters and the rules that dictate parameter usage are stored together in a standardized XML file, separate from the code used to evaluate them. Usage rules are encoded using a combination of a SMARTS-based annotation scheme, which defines the chemical context associated with a given parameter, and overrides that define rule precedence. SMARTS is a language designed for describing molecular patterns,46 thus allowing information about the bonded environment of an atom to be efficiently and clearly encoded in a format that is both human and machine readable. For example, the chemical context of a terminal methyl group (-CH3) in an alkane can be expressed as [C;X4](C)(H)(H)H. In this annotation, [C;X4] indicates that the atom of interest is a carbon (C) with 4 total bonds (X4), and (C)(H)(H)H provides the identity of the 4 bonded atoms (1 carbon, 3 hydrogens). Figure 3 shows a snippet from a Foyer XML force field file demonstrating how these usage rules can be encoded, using select parameters from the OPLS-AA force field (see Klein et al.22 for more details). Because the usage rules and parameters are separated from the software used to evaluate them, the Foyer library does not need to change when a force field file changes.
This separation allows novel and “custom” force fields to be implemented without writing new software, which simplifies the process of disseminating and evolving force fields and increases the reproducibility of work by making clear not just what force field was used, but how it was applied to the system. A complementary approach that requires neither SMARTS nor overrides is to make molecule-specific XML files available (e.g., via webpages such as http://trappe.oit.umn.edu).
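
To make the interplay between chemical context and overrides concrete, the following sketch assigns an atom type from a small rule table. It is a simplified stand-in, not Foyer's implementation or XML schema: rules are matched on element and bonded-neighbor counts rather than by parsing SMARTS, and the rule names (`C_sp3`, `C_methyl`) and the helpers `matches` and `assign_type` are hypothetical. The `C_methyl` rule plays the role of [C;X4](C)(H)(H)H and overrides the more general four-coordinate carbon rule.

```python
from collections import Counter

# Hypothetical rule table. Each rule gives the element of the atom of interest,
# its total number of bonds, optionally the exact set of bonded neighbors, and
# the names of less specific rules that it takes precedence over.
RULES = {
    "C_sp3": {
        "element": "C", "n_bonds": 4,
        "neighbors": None,                   # any four neighbors
        "overrides": (),
    },
    "C_methyl": {
        "element": "C", "n_bonds": 4,
        "neighbors": Counter(C=1, H=3),      # analogous to [C;X4](C)(H)(H)H
        "overrides": ("C_sp3",),
    },
}


def matches(rule, element, neighbors):
    """Return True if the atom and its bonded neighbors satisfy the rule."""
    if element != rule["element"] or len(neighbors) != rule["n_bonds"]:
        return False
    return rule["neighbors"] is None or Counter(neighbors) == rule["neighbors"]


def assign_type(element, neighbors):
    """Collect all matching rules, drop the overridden ones, return the winner."""
    candidates = {name for name, rule in RULES.items()
                  if matches(rule, element, neighbors)}
    overridden = {o for name in candidates for o in RULES[name]["overrides"]}
    remaining = candidates - overridden
    if len(remaining) != 1:
        raise ValueError(f"expected exactly one atom type, found {sorted(remaining)}")
    return remaining.pop()


print(assign_type("C", ["C", "H", "H", "H"]))  # C_methyl (overrides C_sp3)
print(assign_type("C", ["C", "C", "H", "H"]))  # C_sp3 (only the general rule matches)
```

Because the rules live in data rather than in the matching code, adding or refining an atom type means editing the table only, which mirrors the separation of force field files from software described above.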

3.3. General Molecular Simulation Object (GMSO)

With a system initialized and parameterized, the information in the system topology must be written to a file for a simulation engine. While the information required by different simulation engines is, generally speaking, the same, the structure and format of the data file(s) passed to a simulation engine are typically unique to that engine. Generating these files accurately, especially for a wide range of simulation engines, can be non-trivial. The current version of MoSDeF relies upon the open-source utilities parmed47 and OpenMM48,49 to store this information; these tools, along with native MoSDeF code, include parsers to generate syntactically correct data files. In this approach, a single simulation topology can be used to generate input files for a variety of simulation engines, allowing different engines and methodologies (e.g., MC and MD) to be applied to the same system. While effective, these backend codes do not have general support for the breadth of simulation engines and force fields we aim to include. To this end, the General Molecular Simulation Object (GMSO) has been under development with the goal of becoming the de facto backend data structure of MoSDeF. The goal of GMSO is to serve as a general container for all of the relevant system information (e.g., the fully parameterized system), stored in a simulation-engine-agnostic way. GMSO is designed with interoperability and support for various functional forms as first-class features. For example, GMSO builds upon the idea of the Foyer XML data file, shown in Figure 3, but provides further metadata; this includes encoding the functional forms of the potentials in the force field (those that can be expressed as computer-algebra expressions) using the sympy Python library. GMSO is also structured to make it easier to add data file writers, allowing GMSO support to be extended and customized.
Because GMSO supports user-defined analytic equations for force field components, it is future-proofed against new developments in force fields, such as those being pursued by several of the authors.
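
The idea of one engine-agnostic container feeding several file writers can be sketched with a small writer registry. This is not GMSO code: `Site`, `Topology`, `register_writer`, and `export` are hypothetical names, the water coordinates are approximate and purely illustrative, and the two output formats (the simple XYZ coordinate format and a JSON dump) stand in for real engine input files.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Site:
    element: str
    position: tuple  # (x, y, z) in angstroms


@dataclass
class Topology:
    """Engine-agnostic container: it knows nothing about any file format."""
    name: str
    sites: list


WRITERS = {}  # format name -> function(Topology) -> str


def register_writer(fmt):
    """Decorator that adds a writer to the registry under the given name."""
    def decorator(func):
        WRITERS[fmt] = func
        return func
    return decorator


@register_writer("xyz")
def write_xyz(top):
    # XYZ layout: atom count, comment line, then one "element x y z" per atom.
    lines = [str(len(top.sites)), top.name]
    for site in top.sites:
        x, y, z = site.position
        lines.append(f"{site.element} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"


@register_writer("json")
def write_json(top):
    return json.dumps(asdict(top), sort_keys=True)


def export(top, fmt):
    """Serialize the same topology with whichever registered writer is requested."""
    if fmt not in WRITERS:
        raise ValueError(f"no writer registered for {fmt!r}")
    return WRITERS[fmt](top)


water = Topology("water", [
    Site("O", (0.000, 0.000, 0.000)),
    Site("H", (0.757, 0.586, 0.000)),
    Site("H", (-0.757, 0.586, 0.000)),
])
print(export(water, "xyz"))
print(export(water, "json"))
```

Supporting an additional engine in this design means registering one more function; the container and the existing writers are left untouched.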

3.4. Computational Screening and Automation using MoSDeF

Since all the functions of MoSDeF are scriptable, when combined with a workflow management tool such as signac/signac-flow21, it is relatively straightforward to perform computational screening of the properties of systems by looping over chemistries and/or conditions and calculating relevant properties from the simulations. The MoSDeF/signac combination has been used to screen the impact of the end-group chemistry of self-assembled alkylsilane tethers on amorphous silica surfaces on nanolubrication properties23, leading to a machine-learning-derived model connecting end-group cheminformatic descriptors with tribological properties of interest. In another example50, the diffusivities of ions in organic solvents were screened for 22 different solvents, revealing a pattern in this large data set (ion diffusivity proportional to solvent diffusivity) that was in contrast with previous, primarily experimental findings (ion diffusivity proportional to solvent dipole moment). The computational screening findings were confirmed in subsequent experimental studies utilizing quasi-elastic neutron scattering51 and NMR52.
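
The looping structure of such a screening study can be sketched in a few lines. The sketch below uses only the standard library rather than signac, and everything in it is assumed for illustration: the `screen` helper, the `describe_job` placeholder (which stands in for the build, atom-type, simulate, and analyze steps and returns a label instead of a computed property), and the example end groups and temperatures.

```python
from itertools import product


def describe_job(end_group, temperature):
    """Placeholder for build -> atom-type -> simulate -> analyze.

    A real workflow would return a computed property; this stub only
    returns a label so that the sketch runs without a simulation engine.
    """
    return f"{end_group} monolayer at {temperature} K"


def screen(end_groups, temperatures, simulate):
    """Run `simulate` once for every (chemistry, condition) state point."""
    results = {}
    for end_group, temperature in product(end_groups, temperatures):
        results[(end_group, temperature)] = simulate(end_group, temperature)
    return results


results = screen(["methyl", "hydroxyl", "carboxyl"], [300, 350], describe_job)
for state_point, outcome in results.items():
    print(state_point, "->", outcome)
```

Keying the results by state point keeps each outcome tied to the parameters that produced it, which is the bookkeeping that a workflow manager handles at scale.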

3.5. Expanding MoSDeF

As noted earlier, the genesis of MoSDeF was a series of NSF grants to Vanderbilt PIs Cummings, McCabe, Iacovella, and Ledeczi34–36. A recent collaborative NSF grant53 has funded groups from Michigan (Glotzer and Anderson), Notre Dame (Maginn), Minnesota (Siepmann), Delaware (Jayaraman), Houston (Palmer), Wayne State (Potoff), and Boise State (Jankowski) to work together to expand MoSDeF’s capabilities, including the collaborative design and development of the aforementioned GMSO backend. This collaboration is resulting in closer integration with HOOMD-blue, as well as integration with the MC codes Cassandra and GOMC and the first-principles MD/MC code CP2K; additionally, MoSDeF has been integrated more closely with Michigan’s signac workflow management tools. In the case of Cassandra, for example, MoSDeF’s existing utilities, together with additional capabilities resulting from the Vanderbilt/Notre Dame collaboration, have reduced the setup of a simulation from 9 steps (including 3 requiring user editing of files) to a single Python script using MoSDeF; this, in turn, has enabled computational screening with Cassandra. Other groups, including Houston, Boise State, and Delaware, are focusing on developing modules to implement the complex workflows and analyses involved in phase equilibrium calculations and the construction of intricate molecular models. Building the modules around the MoSDeF framework will enable these workflows to be performed in a reproducible fashion with a variety of widely used simulation engines.
An example of the capabilities enabled by this collaboration is given in the Supplementary Information (SI). Inspired by the honoree of this special issue, Keith Gubbins, in the SI we report the use of five different simulation codes (the open-source MC codes Cassandra and GOMC, the open-source MD codes LAMMPS and GROMACS, and the open-source first-principles MD code CP2K) to repeat calculations reported by Striolo et al.54 on the adsorption of water in carbon slit pores. The latter were groundbreaking simulations for their time and the paper has been cited ~200 times (Google Scholar). The paper reported adsorption/desorption isotherms, demonstrating the hysteresis seen in experiment, as well as density profiles and the orientational structure of the water adsorbed in the carbon slit pores. The Striolo et al. simulations were performed using in-house codes; thus, they are almost impossible to reproduce in detail. In the SI, we show that we can reproduce the adsorption/desorption isotherms reported by Striolo et al. to within an acceptable degree using Cassandra and GOMC; more importantly, we show that by using the MoSDeF tools to create the simulations, we can easily test multiple engines, and we obtain excellent agreement between the two different MC codes. Using GEMC in both Cassandra and GOMC, we establish the number of water molecules in the pore at a given external pressure. We then perform NVT (constant number of molecules, volume, and temperature) simulations using multiple codes. We find remarkable agreement for the water structure inside the pore between the MC engines Cassandra and GOMC and the MD engines LAMMPS and GROMACS. The use of MoSDeF (mBuild to build the simulation systems and Foyer to apply the force fields) is essential to obtaining consistency between these calculations.
The first-principles MD code CP2K, with interactions described on the fly via Kohn-Sham density functional theory, produces similar, but not identical, results for water structure, thereby allowing us to identify differences in water-substrate interactions. The fact that one can move the simulated system between all of these codes with little effort, thanks to the MoSDeF tools and their meta-level abstraction of the concept of molecular simulation, is a very significant step forward for the simulation community. Moreover, the SI contains all the instructions needed for the reader to download and run all the utilities and codes needed to reproduce the reported calculations exactly, hence qualifying these as TRUE simulations.33

4. Conclusions

For several decades, the open-source software movement has been making its presence felt in the chemical engineering community. Open-source software offers many advantages over proprietary codes. First, open-source codes are universally available and do not contain any hidden parameters. This makes verification of results published using these codes much more feasible than for proprietary codes. Indeed, some scholarly journals have taken the position of considering for publication only manuscripts in which molecular modeling calculations were performed using open-source codes or source code that is made available to reviewers. Second, open-source codes are available at no cost, which means that the codes can be downloaded and used by researchers throughout the world, removing barriers to scientific progress. Third, open-source codes typically attract a community of users and/or developers, so that bugs are discovered and eliminated quickly, often overnight; in the case of proprietary software, bugs are typically only fixed during update cycles, which may be months apart, or may even go unnoticed, since the code cannot be inspected by users. The downside of open-source software is that, since there is no revenue stream in the usual sense (sale of software), the sustainability of an open-source code over decades can be questionable. However, codes can reach a level of usage such that the effort to maintain and improve the code is taken on by the user community; LAMMPS has arguably reached this position. Also, for some open-source codes there is an alternative revenue stream. For example, Red Hat is the biggest contributor to and supporter of the open-source Linux operating system. It makes money by writing, selling, and supporting business-oriented middleware that runs within Linux, as well as by selling consulting services to companies switching to Linux for their enterprise software.
The commercial Scienomics MAPS platform for materials and process simulations embeds some of the open-source MD and MC codes, such as LAMMPS, Cassandra, and MCCCS-Towhee. Enthought, Inc. is a software company based in Austin, Texas, that develops and markets scientific and analytic computing solutions using primarily the Python programming language; its commercial activities underwrite the widely used open-source SciPy (Scientific Python) package.
We dedicate this Perspective to our colleague, mentor, and friend, Keith Gubbins. The authors of this article wish to express their deep gratitude to Keith for all he has done for our community. We wish him many more years of productive science.
Acknowledgements
The preparation of this Perspective article has been supported by National Science Foundation grants OAC-1835874 to Vanderbilt University, OAC-1835612 to the University of Michigan, OAC-1835630 to the University of Notre Dame, OAC-1835067 to the University of Minnesota, OAC-1835613 to the University of Delaware, OAC-1835593 to Boise State University, OAC-1835713 to Wayne State University, and OAC-1835560 to the University of Houston.