Data Connect
Data Connect is a GA4GH standard for the discovery and search of biomedical data (https://github.com/ga4gh-discovery/data-connect/blob/master/SPEC.md). It gives data custodians a mechanism to organize and semantically describe their data and its data model, and gives data consumers a mechanism to construct flexible queries and search the described data. Unlike other data-sharing technologies, Data Connect does not prescribe a data model, so arbitrary data can be discovered and searched “as is”, without potentially expensive transformations. It relies on the JSON Schema standard (https://json-schema.org/) for describing data models and on the SQL standard for querying.
Through Data Connect, databases holding variant-level data, with or without phenotypic feature data, will be able to connect in a federated network, answer more complex questions, and communicate while preserving their respective data models.
Databases can connect in the network by implementing the Data Connect application programming interface (API). The API consists of three parts:
1) Table API, through which each database describes its data models, enabling their discovery and the fetching of associated data;
2) Service Info API, for discovery of metadata about the database; and
3) Search API, which allows other databases to search the database for similar variants using rich and flexible queries.
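As a rough illustration of the three parts, the following Python sketch builds the corresponding HTTP requests and walks a paginated response. The endpoint paths (`/tables`, `/table/{name}/info`, `/table/{name}/data`, `/service-info`, `/search`), the `query` / `parameters` request fields, and the `data` / `pagination.next_page_url` response fields follow our reading of the Data Connect specification and should be checked against it. The base URL, the table name, and the `fetch_json` callable are hypothetical placeholders.

```python
import json
from urllib.parse import quote

BASE_URL = "https://db.example.org/data-connect"  # hypothetical deployment


def table_requests(table_name):
    """Table API: list tables, then fetch one table's data model and rows."""
    name = quote(table_name, safe="")
    return [
        ("GET", f"{BASE_URL}/tables", None),
        ("GET", f"{BASE_URL}/table/{name}/info", None),
        ("GET", f"{BASE_URL}/table/{name}/data", None),
    ]


def service_info_request():
    """Service Info API: metadata about the database itself."""
    return ("GET", f"{BASE_URL}/service-info", None)


def search_request(sql, parameters=None):
    """Search API: a SQL query sent as a JSON body (field names assumed)."""
    body = {"query": sql}
    if parameters:
        body["parameters"] = parameters
    return ("POST", f"{BASE_URL}/search", json.dumps(body))


def collect_rows(first_page, fetch_json):
    """Follow pagination links until the server stops returning one."""
    rows, page = [], first_page
    while True:
        rows.extend(page.get("data", []))
        next_url = (page.get("pagination") or {}).get("next_page_url")
        if not next_url:
            return rows
        page = fetch_json(next_url)
```

The builders return plain `(method, url, body)` tuples so that any HTTP client can send them. `collect_rows` takes the fetching function as an argument, which keeps the pagination logic independent of the transport and of any authentication scheme.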
The algorithm that determines similarity is defined by the database being queried. That database evaluates the query, applies its matching function, and replies with a list of the similar cases it hosts.
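This division of labor can be sketched in a few lines of Python. The sketch is purely illustrative: the record fields (`case_id`, `gene`, `position`), the 10-base window, and the function names are hypothetical, and each database remains free to define similarity in an entirely different way.

```python
def is_similar(query_variant, stored_variant, window=10):
    """Hypothetical matching function: same gene, positions within `window` bases."""
    return (
        query_variant["gene"] == stored_variant["gene"]
        and abs(query_variant["position"] - stored_variant["position"]) <= window
    )


def answer_query(query_variant, hosted_cases, match=is_similar):
    """Evaluate the query against hosted cases and return the IDs of similar ones."""
    return [case["case_id"] for case in hosted_cases if match(query_variant, case)]


# Made-up example records, not real data.
hosted = [
    {"case_id": "A", "gene": "BRCA2", "position": 1000},
    {"case_id": "B", "gene": "BRCA2", "position": 5000},
    {"case_id": "C", "gene": "TP53", "position": 1003},
]
matches = answer_query({"gene": "BRCA2", "position": 1004}, hosted)
```

Because the matching function is passed in as a parameter, the querying database never needs to know which rule was applied; it only receives the resulting list.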
We plan to establish a peer-to-peer federated network based on Data Connect, in which each database connects to one or more other databases within the network. Because of the sensitivity of the information being shared, most databases will require requests from other databases to be authenticated with a pre-shared key (PSK). These keys are usually shared via encrypted email messages. Connecting to other databases in this way can be time-consuming, but it gives each database full control over whom it shares data with.
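A minimal sketch of how a database might check a pre-shared key on an incoming request is given below. The header name `X-Auth-Token`, the peer names, and the key values are assumptions made for illustration only; none of them is taken from the Data Connect specification.

```python
import hmac

# Hypothetical registry: one key per trusted peer, exchanged out of band.
PEER_KEYS = {
    "peer-db-1": "k3y-shared-with-peer-1",
    "peer-db-2": "k3y-shared-with-peer-2",
}


def authenticate(headers):
    """Return the name of the peer whose key matches, or None."""
    presented = headers.get("X-Auth-Token", "")  # header name is an assumption
    for peer, key in PEER_KEYS.items():
        # compare_digest avoids leaking key contents through timing differences.
        if hmac.compare_digest(presented.encode(), key.encode()):
            return peer
    return None
```

Keeping one key per peer means that access can be withdrawn from a single database by deleting its entry, without affecting the other connections.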