Data submission, use, and access
Policies and Plan for Data Submitted to the DCC and Shared on the Common Metabolic Diseases Knowledge Portal
The Broad Institute will serve as the Data Coordinating Center (DCC) for the Common Metabolic Diseases Knowledge Portal (CMDKP), supported by the Accelerating Medicines Partnership in Common Metabolic Diseases® (AMP CMD®). The DCC will aggregate data, support analyses, and continue to update capabilities to disseminate results relevant to type 1 diabetes, type 2 diabetes, cardiovascular disease, cerebrovascular disease, sleep disorders, and related traits, while coordinating collaboration within the CMDKP Project. The DCC will also be responsible for sharing the results from the data coordination and analysis activities on the CMDKP.
Data Aggregation, Analysis, and Resource Distribution
As the DCC, the Broad Institute intends to (a) serve as the gateway to a large (and growing) aggregation of genetic and genomic data relevant to common metabolic diseases and their complications; (b) perform and automate analyses required to interpret those data; and (c) communicate results to diverse audiences via an open access Web Portal (CMDKP), presentation, and publication. Each of these goals involves distinct categories of resources and activities that we will share and/or manage, as described below.
Data aggregated under this effort will not be generated by the Knowledge Portal development work funded through the Portal-specific grants, but rather obtained from other investigators and repositories who wish to collaborate on and contribute to this effort, including other investigators funded by the wider AMP T2D program. Moreover, the role of the Portal will not be to redistribute individual-level data, but rather to generate results attained via standard and customized queries that can be widely shared with the scientific community. Because the primary, individual-level data are neither generated by this project, nor redistributed to other users, the role of the Portal is limited to secure intake, storage and management, automated analyses, and dissemination of results in summary (i.e., not individual-level) form while complying with intended use of the data and all relevant regulations.
We will focus on three classes of data: individual-level genotypes, individual-level phenotypes, and external precomputed results or annotations (e.g., results from individual studies or meta-analyses of multiple studies, processed annotations).
The data and results currently stored in the Portal have either been generated at the Broad Institute as part of IRB-approved secondary use protocols or, in the case of meta-analysis results from published GWAS datasets, obtained in such a form that Broad was determined to be "not engaged in human subjects research" (per the criteria described in the U.S. Department of Health and Human Services Office for Human Research Protections' 2008 Guidance on Engagement of Institutions in Human Subjects Research). All data have been de-identified prior to being sent to Broad; at no time and under no circumstances will investigators funded by this grant have information linking data back to subject identifiers.
For datasets subsequently added to the Portal, raw data will be obtained through formal NIH systems for data-sharing such as dbGaP, or directly from investigators who collected the data. It will be the decision of NIH and the AMP T2D Steering Committee whether to accept into the Portal data that are not in dbGaP and, if so, the terms on which the data can be made accessible for analyses by other parties. For raw data transferred to the DCC for representation on the Knowledge Portal, we will in all cases execute a Data Transfer Agreement (DTA) between the Submitter and the Broad Institute (outlined below). The DTA outlines the use and protection of the transferred data. The submitting site will be responsible for ensuring that the datasets transferred to the Broad are consented for transfer, genetic analysis, and representation on the CMD Knowledge Portal. At the DCC, we will develop software for storing and managing datasets, and will not redistribute the raw data to any outside third parties. However, as part of our analysis process, we may send array data to the imputation server at the University of Michigan (an AMP T2D-funded site) for imputation purposes only. See Appendix C of our data transfer agreement for additional details. For bulk data (e.g., raw and harmonized individual-level genotype data), we will use object storage systems, with access controlled through application programming interfaces (APIs). Only authenticated and authorized users can access data; all such access is logged and auditable.
Additionally, we are currently building a database tool that captures data use restrictions for each dataset electronically using an ontology-based consent database, and can match those restrictions against potential research usage to ensure that only appropriate users can query specific datasets. Web-based tools, currently in development, support both the entry of data-use restrictions and the review of access requests. This will enable the aggregation of additional and more diverse datasets for the Portal.
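The matching step described above can be illustrated with a minimal sketch. It is not the DCC's tool: the class names, the restriction fields, and the small disease hierarchy are all hypothetical, chosen only to show how a recorded data-use restriction might be checked against a stated research purpose.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set

# Hypothetical toy hierarchy (child term -> parent term); a real system
# would draw these relationships from a curated ontology.
PARENT: Dict[str, str] = {
    "type 1 diabetes": "diabetes",
    "type 2 diabetes": "diabetes",
    "diabetes": "metabolic disease",
}


@dataclass(frozen=True)
class DataUseRestriction:
    """Hypothetical electronic record of one dataset's consent terms."""
    general_research_use: bool = False
    allowed_diseases: FrozenSet[str] = frozenset()
    non_commercial_only: bool = False


@dataclass(frozen=True)
class ResearchPurpose:
    """Hypothetical description of what a requester intends to study."""
    disease: str
    commercial: bool = False


def term_and_ancestors(term: str) -> Set[str]:
    """Return the term plus every broader term above it in the hierarchy."""
    found: Set[str] = set()
    current: Optional[str] = term
    while current is not None and current not in found:
        found.add(current)
        current = PARENT.get(current)
    return found


def is_permitted(restriction: DataUseRestriction, purpose: ResearchPurpose) -> bool:
    """Check one research purpose against one dataset's restrictions."""
    if restriction.non_commercial_only and purpose.commercial:
        return False
    if restriction.general_research_use:
        return True
    # A dataset consented for a broad term also covers its narrower terms.
    return bool(term_and_ancestors(purpose.disease) & restriction.allowed_diseases)


def queryable_datasets(
    restrictions: Dict[str, DataUseRestriction], purpose: ResearchPurpose
) -> List[str]:
    """List the dataset names whose restrictions permit the stated purpose."""
    return sorted(name for name, r in restrictions.items() if is_permitted(r, purpose))
```

In this sketch, a dataset consented for "diabetes" would be queryable for a "type 2 diabetes" study, while a dataset restricted to non-commercial use would be excluded for a commercial requester.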
Examples of datasets
- Current and future genetic studies of type 1 diabetes, type 2 diabetes, cardiovascular and cerebrovascular disease, and sleep and circadian disorders
- Current and future genetic studies of related quantitative traits
- Current and future studies of complications related to these diseases
- Annotations of function
- Results of computational methods that make predictions about the disease relevance of variants, genes, and tissues based on the data types listed above
Classes of datasets for storage and analysis in the Portal
- Those that do not require ethical and regulatory approval
  - Results from publicly available datasets
  - Summary statistics
- Those that require approval from public access sites
  - Access controlled export (dbGaP, EGA, etc.) of individual-level data where we will serve results and summary statistics
- Those that require local IRB approval
  - Data generated directly at the Broad on de-identified DNA samples
- Those that require a Data Transfer Agreement (DTA)
  - All de-identified individual-level genetic and phenotypic data generated externally
Data Transfer Agreement
For all de-identified individual-level genetic and phenotypic data generated externally (genotype, phenotypes, annotations, etc.) received by the Broad Institute as DCC for the AMP-T2D Knowledge Portal, we will execute a Data Transfer Agreement (DTA) with the submitting institution. We will ensure that the usage of the data is compliant with the Data Use Restrictions associated with the dataset. It will be the responsibility of the submitting site or institution to outline the appropriate Data Use Restrictions for the data coming to the DCC as part of the executed DTA. Below we outline the data use and analysis plan for the DCC.
Find complete information on data submission in our AMP T2D Knowledge Portal Submitter and Analysis Guide for Data at the DCC.
Data use and analysis
For all de-identified individual-level genetic and phenotypic data generated externally (genotype, phenotypes, annotations, etc.) received by the Broad Institute as DCC for the CMDKP, we will perform quality control assessment, harmonization, and association analysis for relevant traits. This will be a collaborative "handshake" process with the submitter. At the completion of each major phase, we will share a report and results with the submitter. All results will be approved by the submitting site before they are made available through the CMDKP. We have outlined our procedures for data use and analysis in a document entitled "AMP T2D Knowledge Portal Submitter and Analysis Guide for Data at the DCC," which may be downloaded here.
The DCC will share the results of all approved analyses directly with the submitter upon completion for review. We will partner to address any quality control matters or confounders in the data before deposition in the CMDKP. Once the results are finalized, the DCC will make the data available for query via the CMDKP.
The individual-level data sent by data submitters, stored, and analyzed by the DCC will never be shared with Portal users; only results will be shared. The individual-level and summary data will reside in one or several data vaults behind a secure firewall. User-activated analytical modules will be deployed behind the firewall to analyze the data or query precomputed results. The Portal will provide results in response to queries for information, obtained from genetic analyses performed on the data. The purpose of the CMD web Portal will be to enable broad access to the comprehensive results of genetic studies of T2D and the common metabolic diseases listed above. To ensure that the web Portal is effective in allowing access to results and data – both within AMP T2D and with the broader biomedical research community – we will develop an interface to provide access to results in a form designed to meet user needs while maintaining individual data privacy requirements, and will engage Portal users in assessing the value of these features.
Results from studies included in the Portal will be available genome-wide (i.e., not limited to "top hits"), and results from different studies and types will be integrated and presented simultaneously. Metadata and other technical details (e.g., analysis parameters, explanations of terms, documentation of methods) will be available at lower levels of drilldown.
Resource Distribution and Sharing
We will share software, methods, and code developed as part of consortium efforts. Specifically, we envision three types of sharing: (a) sharing of software source code; (b) sharing services; and (c) sharing of effort between groups with the intention of maintaining or extending existing software.
Sharing of software source code
We are producing open-source software under the terms of the BSD 3-Clause open-source license. As such, this code will be freely available for use by any other parties; the software will be supplied “AS IS” with no implied warranty or promises of support. We will maintain a GitHub repository from which interested parties can download the source code. The source code of the AMP-T2D Portal, entitled Framework for investigating Genetic Associations (FGA), is located here.
Our software will be constructed as a distributed system in which computers communicate using standard protocols (HTTP for transport, with REST as an organizing principle and data payloads defined with JSON), with well-defined interfaces specific to the computational topics addressed by each computer system. These services will in principle be accessible by any other party willing to adopt the conventions used by our services. To the extent that data on these services may be under privacy and use restrictions, these running services will be designed to provide information only in forms that protect privacy and security, or in secure, encrypted mode for other parties with permission to access and receive the information.
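As a rough illustration of these conventions, the sketch below shows how a REST-style request for one resource could be answered with a JSON payload restricted to summary-level fields. The resource path, field names, and stored values are hypothetical placeholders, not the Portal's actual API or results.

```python
import json
from typing import Dict, Tuple

# Hypothetical precomputed results keyed by (gene, phenotype).
# All values are placeholders, not real Portal results.
_SUMMARY_STORE: Dict[Tuple[str, str], Dict[str, object]] = {
    ("GENE1", "TRAIT1"): {
        "p_value": 0.5,
        "effect_size": 0.0,
        "sample_size": 1000,
        "sample_ids": ["s1", "s2"],  # individual-level field, never served
    },
}

# Only these summary-level fields may appear in a response.
_PUBLIC_FIELDS = ("p_value", "effect_size", "sample_size")


def handle_get(path: str) -> Tuple[int, str]:
    """Answer GET /associations/<gene>/<phenotype> with (status, JSON body)."""
    parts = [p for p in path.split("/") if p]
    if len(parts) != 3 or parts[0] != "associations":
        return 404, json.dumps({"error": "unknown resource"})
    gene, phenotype = parts[1], parts[2]
    record = _SUMMARY_STORE.get((gene, phenotype))
    if record is None:
        return 404, json.dumps({"error": "no results for this gene and phenotype"})
    # Copy only the allow-listed fields so individual-level data cannot leak.
    payload: Dict[str, object] = {"gene": gene, "phenotype": phenotype}
    payload.update({name: record[name] for name in _PUBLIC_FIELDS})
    return 200, json.dumps(payload)
```

Building the response from an explicit allow-list, rather than removing known sensitive fields, means a newly added individual-level field stays private by default.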
Sharing of effort between groups to maintain or extend existing software
The Portal architecture will be designed to facilitate front-end contributions (e.g., extensions of existing widgets for data exploration) from a wide community of developers. Data or computations for REST servers will be encapsulated as loosely coupled "plug-in" modules that may be written in different languages (e.g., Python, JVM-based languages, shell scripting). This approach anticipates the contribution of computational modules from other individuals and groups, both within and outside of the AMP consortium.
Policies for Data Release, Accessing the Portal, and Terms of Conduct
Data processing and availability (applicable for both data coming to the DCC and to Federated nodes)
The DCC and Federated nodes will receive data from submitters on an ongoing basis. The data deposition stage has several components that must be completed before the data are ready for release into the Portal. These are:
- Data use agreements and ethical approvals for data transfer to the DCC (or the Federated nodes) and release into the Portal.
- Physical transfer of data and all metadata, in required formats, into the Data Intake System at the DCC (or at a Federated node).
- Data storage, curation, QC, and harmonization.
In general, upon depositing data into the Portal, a QC filter on genotypes and phenotypes will be deployed as per standard operating procedures in the field. Data will only be available after these initial filters are applied. Filters will include automated steps and final human curation, as determined by the AMP T2D investigative team.
Data use and availability
Datasets labeled "Open access": All users are welcome to use results from analyses of these data to further their research without seeking explicit permission from the Portal team or funders. Users are also welcome to cite the data in scientific publications, provided that they cite the Portal as the source. If users are citing a single dataset represented in the Portal, they should cite both the Portal and the relevant paper for that dataset.
Datasets labeled "Pre-publication": These data have been submitted to the Portal by authors in advance of publication in order to provide the immediate benefits of data access to the research community. Portal users may explore these data via all of the Portal tools and interfaces, but are not permitted to submit for publication the results of any such analyses until the primary paper has been published or until the results have been available in the Portal for 6 months, whichever comes first.
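The "whichever comes first" rule for pre-publication datasets can be worked through as a small date calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "6 months" means six calendar months from the date the results became available in the Portal, with the day clamped to the end of shorter months.

```python
import calendar
from datetime import date
from typing import Optional

EMBARGO_MONTHS = 6  # taken from the policy text


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def embargo_end(portal_release: date, paper_published: Optional[date] = None) -> date:
    """Earliest date on which a user may submit analyses for publication."""
    six_month_mark = add_months(portal_release, EMBARGO_MONTHS)
    if paper_published is None:
        return six_month_mark
    # Whichever comes first: primary publication or the six-month mark.
    return min(paper_published, six_month_mark)


def may_submit(today: date, portal_release: date,
               paper_published: Optional[date] = None) -> bool:
    """True once the pre-publication restriction has lapsed."""
    return today >= embargo_end(portal_release, paper_published)
```

For example, for results released in the Portal on 15 January 2020 with no primary paper yet published, `embargo_end` returns 15 July 2020; if the primary paper had appeared on 1 March 2020, the restriction would lapse on that date instead.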
Portal users are expected to abide by the following provisions on data use:
- Users will not attempt to download any dataset in bulk from the Portal, other than those made available for public download
- Users will not attempt to identify or contact research participants
- Users will protect data confidentiality
- Users will not share any of the data with unauthorized users
- Users will report any inadvertent data release, security breach, or other data management incidents of which they become aware
- Users will abide by all applicable laws and regulations for handling genomic data
- Users will not submit a manuscript for publication until the primary manuscript is published, or 6 months after the dataset becomes available in the Portal (whichever comes first), to allow for beta testing on the integrity of the dataset and finalization of the results on the Portal.
Agreeing to these provisions is a requirement of Portal use. Violating them may result in an NIH investigation and sanctions including revocation of access to the Portal.
Citing portal data
Users who wish to cite data in this Portal in a scientific publication should do so in the following format:
Common Metabolic Diseases Knowledge Portal (cmdkp.org). Year/Month/Date of access; URL of page you are citing (RRID:SCR_020937).
For instance, a user who viewed the Portal's page on the gene SLC30A8 on September 1, 2020, and wanted to cite it would use this citation:
Common Metabolic Diseases Knowledge Portal (cmdkp.org). SLC30A8 Gene page. 2020 Sept 1; https://hugeamp.org/gene.html?gene=SLC30A8 (RRID:SCR_020937).
The Portal does not yet have a PubMed identifier.
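For users who generate citations programmatically, the format above can be assembled as in the following sketch. The function name is hypothetical, and the month abbreviations other than "Sept" (which appears in the example above) are assumptions about the intended style.

```python
from datetime import date
from typing import Optional

PORTAL_NAME = "Common Metabolic Diseases Knowledge Portal (cmdkp.org)"
RRID = "RRID:SCR_020937"

# Only "Sept" is attested in the Portal's own example; the other
# abbreviations are assumed for illustration.
MONTH_ABBREV = ("Jan", "Feb", "Mar", "Apr", "May", "June",
                "July", "Aug", "Sept", "Oct", "Nov", "Dec")


def cite_portal_page(url: str, accessed: date,
                     page_title: Optional[str] = None) -> str:
    """Build a citation string in the Portal's recommended format."""
    title_part = f" {page_title}." if page_title else ""
    when = f"{accessed.year} {MONTH_ABBREV[accessed.month - 1]} {accessed.day}"
    return f"{PORTAL_NAME}.{title_part} {when}; {url} ({RRID})."


# Reproduces the SLC30A8 example given above.
print(cite_portal_page(
    "https://hugeamp.org/gene.html?gene=SLC30A8",
    date(2020, 9, 1),
    page_title="SLC30A8 Gene page",
))
```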
Re-using written content on the portal
Except where otherwise noted, text on this site is licensed under a Creative Commons Attribution Non-Commercial Share Alike 4.0 International License.
The Portal team tracks a limited set of usage statistics. We do this to improve functionality based on how users interact with the Portal and to ensure that Portal data are being used properly (see our data use policy). Two types of people are allowed to view usage statistics at different levels of detail:
- Our website developers track deidentified, aggregate analytics (such as hit counts for specific pages) in order to improve the Portal's user experience. They do not view statistics attached to individual user accounts.
- NIH personnel may be asked to examine individual user histories in cases of suspected misuse of Portal data.