Data Guidelines
1. Background
Gates Open Research requires that the source data underlying the results are made available as soon
as an article is published (see also the
Gates Foundation’s data policy). This page
provides information about data you need to include, where your data can be stored, and how your
data should be presented. In accordance with our data policies, authors are required to provide
details of where their data are hosted upon submission (except where ethical, data protection or
confidentiality considerations apply). Before you begin, please consult any Data Management Plan completed in relation to the research.
For additional help, please see our Getting Started Guide.
A large number of
journals and publishers have
confirmed that they welcome research articles reporting analysis and conclusions that are based on
previously published datasets. They do not consider the publication of a dataset with a DOI and
associated protocol information as a ‘prior publication’ that would preclude subsequent
publication of new results obtained from such a dataset.
1.1 Open Data Policy
Gates Open Research advocates an Open Data
policy.
All articles should include citations to repositories that host the data underlying the results, together with
details of any software used to process results. It is essential that others can see the raw data
so that they can replicate your study and analysis and, in some circumstances, reuse the data.
Furthermore, publishing your data will show clearly that you did the work first. Others
who then reuse your data for their own studies can cite your data (which can be cited separately
from your article if appropriate). We ask authors to deposit their data with approved data
repositories (see list below), normally under the
CC0
license which facilitates data reuse. Failure to provide data for publication without good
justification is likely to result in your article being rejected.
Exceptions: We recognise that openly sharing data may not always be feasible, either because of
ethical or confidentiality considerations or because the data have been obtained from a third
party and access restrictions apply (see
data policy for details). If you think that you cannot provide the
source data, please let the
editorial team know, so we can advise further.
1.2 FAIR Data Principles
Gates Open Research endorses the
FAIR Data Principles, alongside an Open Data
policy, as a framework to promote the broadest reuse of research data.
Findable
In order for data to be reused, it must be findable. To ensure that others can find your data, we
ask that data be hosted by a stable and recognised open repository (where it is safe to do so) and
assigned a globally unique persistent identifier (such as a DOI). Using such a repository and
identifier ensures that your dataset continues to be available to both humans and machines in a
useable form in the future.
To aid discoverability, data should also be described using appropriate metadata. The content and
format of metadata is often guided by a specific discipline and/or repository through the use of a
metadata standard. When depositing data in a repository, it is important that you fill in as many
fields as possible as this information usually contributes to the metadata record(s). In some
cases, specifically where using a discipline-specific repository, the submission of metadata files
alongside the data may be required.
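As an illustration of filling in as many metadata fields as possible, the minimal sketch below lists the fields of a metadata record that are still blank before deposit. The field names are hypothetical and are not taken from any particular repository or metadata standard; substitute the fields your chosen repository actually asks for.

```python
# Hypothetical field names; replace them with those used by your repository.
EXPECTED_FIELDS = ["title", "description", "creators", "keywords", "license"]


def is_blank(value):
    """Treat None, empty or whitespace-only strings, and empty lists as blank."""
    if isinstance(value, str):
        return not value.strip()
    return not value


def missing_fields(record, expected=EXPECTED_FIELDS):
    """Return the expected metadata fields that are absent or blank in record."""
    return [field for field in expected if is_blank(record.get(field))]


record = {
    "title": "Household survey responses",
    "description": "  ",
    "creators": ["A. Author"],
    "keywords": [],
}
print(missing_fields(record))
```

Running the check before submission gives a short list of fields to complete; it says nothing about whether the values entered are accurate.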
For practical guidance please see Select a Repository.
Accessible
Data accessibility is defined by the presence of a user license. Data supporting
Gates Open Research articles should be openly published under the
CC0 license, which facilitates data reuse. For software
and source code, we strongly advise the use of an
OSI-approved license.
However, we recognise that there may be cases where openly sharing data may not be feasible (due to
ethical or confidentiality considerations). In these cases, we have
policies in place to allow the publication of articles associated with
such data, whilst maintaining the appropriate level of security.
Interoperable
Interoperable data can be compared and combined with data from different sources by both humans and
machines – promoting integrative analyses. To bolster interoperability, data supporting
Gates Open Research articles should be stored in a non-proprietary open file format and described
using a standard vocabulary (where available). In some cases, the preferred file formats and
vocabularies will be dictated by the repository you choose to host your data.
For practical guidance please see Prepare your data for sharing.
Reusable
Data that are findable, accessible, and interoperable are generally fit for reuse. On occasion, the
inclusion of additional documentation alongside the data may be required to ensure that the data
are understandable and thus reusable. As a general rule, someone who is not familiar with the data
should be able to understand what they are about using only the metadata and documentation provided.
By extension, the same practices that enable data reuse also support reproducibility.
2. Share Your Data in 4 Steps
2.1 Prepare Your Data for Sharing
Before you begin, we strongly suggest that you consult
FAIRSharing.org
for details of data standards specific to the topic of your research. Depending on your field of
study, there may already be standards in place that will help guide how your data should be
structured, formatted, and annotated.
When depositing data involving human participants, authors must ensure that all datasets have been
de-identified in accordance with the
Safe Harbor method before submission.
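The sketch below illustrates two of the steps involved in this kind of de-identification: dropping columns that hold direct identifiers, and aggregating ages over 89 into a single "90+" category. It is only a partial illustration. The column names are hypothetical, the Safe Harbor method covers many more identifier types (for example dates and small geographic units), and running this code does not by itself establish that a dataset is de-identified.

```python
# Hypothetical column names holding direct identifiers in an example dataset.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "address"}


def deidentify(rows, identifier_columns=DIRECT_IDENTIFIERS, age_column="age"):
    """Return copies of rows without identifier columns and with ages over 89 grouped."""
    cleaned = []
    for row in rows:
        # Drop every column listed as a direct identifier.
        out = {key: value for key, value in row.items() if key not in identifier_columns}
        value = out.get(age_column)
        # Only whole-number ages are recognised; anything else is left untouched.
        if value is not None and str(value).strip().isdigit() and int(value) > 89:
            out[age_column] = "90+"
        cleaned.append(out)
    return cleaned


rows = [
    {"name": "Participant A", "age": "93", "score": "4"},
    {"name": "Participant B", "age": "45", "score": "7"},
]
print(deidentify(rows))
```

The input rows are not modified, so the original file can be kept securely while only the cleaned copy is deposited.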
Please ensure that all files are labelled clearly so readers will understand the contents of, and
difference between, the files. For each file/group, we suggest you provide:
- A single short title describing the content of the files;
- A more detailed legend describing each dataset, so it is clear
that the files are distinct and downloadable (including the explanation of any acronyms
used in the dataset).
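One way to keep titles and legends alongside the files is a plain-text README generated from a small list of entries. The sketch below is a minimal example; the `FileEntry` structure, the file name and the layout of the output are assumptions made for illustration, not a format required by Gates Open Research.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    filename: str  # name of the deposited file
    title: str     # single short title describing the content
    legend: str    # more detailed description, with acronyms explained


def render_manifest(entries):
    """Return README text listing each file with its title and legend."""
    lines = []
    for entry in entries:
        lines.append(f"File: {entry.filename}")
        lines.append(f"Title: {entry.title}")
        lines.append(f"Legend: {entry.legend}")
        lines.append("")  # blank line between files
    return "\n".join(lines).rstrip() + "\n"


entries = [
    FileEntry(
        "survey_responses.csv",
        "Household survey responses",
        "One row per household. HH = household; NA = not answered.",
    ),
]
print(render_manifest(entries))
```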
In the manuscript, please provide a brief summary of the deposited datasets under a heading
“Data Availability”.
2.1.1 Spreadsheet data
To increase the accessibility and reusability of spreadsheet data (i.e. large tables or raw data),
spreadsheets should adhere to the following best practices:
DO
- Give each column a descriptive heading.
- Use a single header row.
- Ensure you have used the first cell, i.e. A1.
- Include a title and a legend to describe each spreadsheet.
- Save each data file with a name that appropriately reflects the content
of that file.
- Deposit each table that is part of the dataset as a separate file.
- Deposit each worksheet as a separate file.
DO NOT
- Embed charts, comments or tables within a spreadsheet.
- Use color coding (machine-based data mining cannot interpret this).
- Include special (i.e. non-alphanumeric) characters within the
spreadsheet, including commas.
- Use merged cells.
- Deposit multiple worksheets within a spreadsheet (such as those used in
Microsoft Excel), as these are not supported by CSV and TAB formats.
Spreadsheets should be deposited in CSV or TAB format, except where the spreadsheet contains variable
labels, code labels, or defined missing values; such spreadsheets should be deposited in SAV, SAS or POR
format, with the variables defined in English.
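Some of the points above can be checked automatically before deposit. The following is a minimal sketch using Python's standard `csv` module. It checks that cell A1 is used, that every column has a heading, and that no cell contains a comma; the row-length check is an extra sanity test added here and is not part of the list above. It cannot detect color coding, merged cells or embedded charts, which are lost on export to CSV in any case.

```python
import csv
import io


def check_csv(text):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if not header:
        problems.append("cell A1 is empty")
    for index, name in enumerate(header, start=1):
        if not name.strip():
            # The first heading lives in cell A1, so report it as such.
            problems.append("cell A1 is empty" if index == 1 else f"column {index} has no heading")
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"row {number} has {len(row)} cells, expected {len(header)}")
        if any("," in cell for cell in row):
            problems.append(f"row {number} contains a comma inside a cell")
    return problems


print(check_csv('id,age\n1,34\n2,"5,1"\n'))
```

An empty list means none of these particular problems were found; it does not mean the file meets every recommendation.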
2.1.2 Software source code
All articles should include details of any software that is required to view the datasets described
or to replicate the analysis. For all software used, please state the version used, details of
where the software can be accessed, and any variable parameters that could impact the outcome of
the results.
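For analyses run in Python, version details can be collected programmatically rather than by hand. The sketch below uses the standard library's `importlib.metadata` to look up installed package versions; the package names passed in are examples only, and the same information for other languages or tools would need to be gathered by other means.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return a mapping from component name to version string."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Make a missing package visible instead of failing.
            report[name] = "not installed"
    return report


# Example package names; list the ones your analysis actually imports.
for component, version in describe_environment(["numpy", "pandas"]).items():
    print(f"{component}: {version}")
```

The printed lines can be pasted into the article alongside details of where each package can be accessed and any parameters that affect the results.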
Where software has been coded by the authors of the article, the source code should be made
available. If there are ethical or privacy considerations as to why the source code may not be made
available, please contact the
editorial team.
2.2 Select a Repository
Where it is possible to do so, data should be deposited in a stable and recognised open
repository under a CC0 license prior to article submission. Please check that the
DOI(s) and/or accession number(s) you provide us are publicly available.
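As a first check before submission, a DOI can be validated for basic syntax and turned into a resolver link to open in a browser. The sketch below is a syntax check only, and the pattern is a commonly used approximation rather than a complete definition of valid DOIs. It does not contact the network, so whether the record is publicly available still has to be confirmed by visiting the resulting URL while logged out of the repository.

```python
import re

# Approximate pattern for modern DOIs: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalise_doi(value):
    """Strip a resolver prefix and return the bare DOI, or None if malformed."""
    candidate = value.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    return candidate if DOI_PATTERN.match(candidate) else None


def resolver_url(doi):
    """Return the doi.org link for a bare DOI."""
    return f"https://doi.org/{doi}"


doi = normalise_doi("https://doi.org/10.5256/repository.4591.d34639")
print(doi, resolver_url(doi))
```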
Gates Open Research strongly encourages the use of community-recognised repositories. For some
data types, such as genetic sequences and protein structures, it is essential that the data are
deposited in GenBank and Protein Data Bank, respectively. For X-ray crystal structures, please also
submit your validation reports.
Where a community-recognised repository does not exist, prepare the files according to the
guidelines above and submit to a
general data
repository, institutional repository, or national repository. Please include descriptive legends and, where applicable, coding schemas
alongside your datasets.
Some types of data benefit from visualization within the article. Gates Open Research welcomes the
submission of articles featuring Plot.ly interactive figures and Code Ocean compute capsules.
Videos and images can be displayed through a widget provided by Figshare. If you think your dataset
would benefit from visualization, please contact us. We will then advise whether such visualization
is suitable for your data.
2.2.1 Non-exhaustive list of Gates Open Research-approved repositories
Below is a list of repositories that have already been approved for hosting data alongside a Gates Open Research article.
If you are an author who wishes to use a repository not already on this list, please
contact us. If you manage a
repository and would like to be included on the list, please complete our
Repository Evaluation form and
return it to us.
General data, research materials and supporting documents
Data Type | Where to submit* | What to include in the data availability section of your article
Any | Figshare$ | Title, DOI
Any, but especially deposits with mixed data and code | Zenodo | Title, DOI
Any | Dryad | Title, DOI
Any, but especially humanities/social science data and data in SAV and POR formats | Dataverse | Title, DOI
Any, but especially deposits with mixed data, materials and documents | Open Science Framework† | Title, DOI
Deposits of mixed data and code | Code Ocean | Title, DOI, embed code for interactive reanalysis tool
Research materials | Any appropriate public repository, such as Addgene, American Type Culture Collection, Arabidopsis Biological Resource Center, Bloomington Drosophila Stock Center, Caenorhabditis Genetics Center, DSMZ, European Conditional Mouse Mutagenesis Program, European Mouse Mutant Archive, Knockout Mouse Project, Jackson Laboratory, Mutant Mouse Regional Resource Centers and RIKEN Bioresource Centre | Accession number(s) or unique identifier(s)
* Please note that many repositories have a limit on the size (usually 2 or 5 GB) of single file
uploads and charge for larger data files.
$ If you think your data are suitable for visualization within your article through the Figshare viewer, please
contact us.
† Deposits must be made public and your project must be registered to ensure that a record will remain persistent and unchangeable.
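Because of the upload limits mentioned above, it can help to identify large files before starting a deposit. The sketch below lists files in a folder that exceed a chosen limit. The 2 GB figure is only one of the typical values quoted above, repositories may count a gigabyte differently, and the actual limit should be taken from the repository's own documentation.

```python
from pathlib import Path

GB = 10 ** 9  # decimal gigabyte; some services count 1024 ** 3 bytes instead


def collect_sizes(folder):
    """Map each file path under folder to its size in bytes."""
    return {
        str(path): path.stat().st_size
        for path in sorted(Path(folder).rglob("*"))
        if path.is_file()
    }


def oversized(sizes, limit_bytes):
    """Return the names of files whose size exceeds limit_bytes."""
    return [name for name, size in sizes.items() if size > limit_bytes]


# Checks the current working directory against an assumed 2 GB limit.
print(oversized(collect_sizes("."), 2 * GB))
```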
Software & source code
Data Type | Where to submit* | What to include in the data availability section of your article
Latest source code | GitHub, GitLab, BitBucket | URL
Archived source code | Zenodo | Title, DOI and license* used
Deposits of mixed data and code | Code Ocean | Title, DOI, embed code for interactive reanalysis tool
Software | Authors may host software where they wish, though it is strongly recommended to use a stable URL | URL
* An open license must be assigned and we strongly advise authors to use an
OSI-approved license.
3D-printable models
Data Type | Where to submit | What to include in the data availability section of your article
All 3D-printable models (including molecular, cellular, medical/anatomical and labware models) | NIH 3D Print Exchange | Title, model ID, URL
Health data (restricted access to protect anonymity of participants possible)
Humanities and social science data
* Deposits must be open access.
Transcript data
Qualitative data resulting from recordings of interviews or focus group discussions should be anonymised by redaction and uploaded to a general data repository (see above). If it is not possible to anonymise the data sufficiently by redaction, a restricted route of data access should be provided by the authors and a comprehensive statement must be added to the Data Availability section of the article (see below for data that cannot be shared). If the transcript data cannot be shared under any circumstances, please contact the editorial team, who will be able to advise you.
Environmental and ecological data
* Data entries must be made public.
Chemical and macromolecular structures
* X-ray crystallography validation reports should be submitted (as a PDF) directly to Gates Open
Research via the submission system.
Neuroimaging data
Data Type | Where to submit | What to include in the data availability section of your article
Raw fMRI datasets | OpenfMRI | Title and accession number(s)
MRI and PET unthresholded statistical maps | NeuroVault* | Title and URL (which includes a unique data ID)
* Please note that authors will still be expected to deposit their raw neuroimaging data in an
appropriate repository. Also, once submitted, administrative powers will be transferred to Gates
Open Research. This is necessary to ensure stability of the dataset; this transfer does not affect
the CC0 license assigned to all NeuroVault submissions.
Sequence and omics data
Data Type |
Where to submit |
What to include in the data availability section of your article |
Expression and sequence data (including Nucleotide/protein sequence, microarray, SNP/SNV, GWAS, phenotype or sequence-based reagent data)
Systems and chemical biology data (including chemical entities, chemical reactions, computational models, metabolic profiles, or molecular interactions)
|
Any appropriate INSDC member repository, e.g. DDBJ, ENA or NCBI repositories.*
The GSA, which is working towards INSDC membership, is also acceptable.
Researchers in China may alternatively use the CNGB Sequence Archive.
|
Accession number(s). For SNP/SNV data please provide HGVS name(s), local ID(s) and rs/ss number(s) |
Metabolomic data |
Metabolomics Workbench$
|
Project DOI, Study ID |
Proteomic data |
Any appropriate ProteomeXchange member repository
|
Accession number(s) |
* Some higher-level repositories, such as BioProject and BioStudies, provide access to data deposited in various archival databases. In these cases, please cite the accession numbers that are assigned to the data submissions by the archival databases in addition to the higher-level identifier.
$ Or any appropriate INSDC member repository, see above.
2.3 Add a Data Availability Statement to Your Article
All articles must include a Data Availability statement, even where there is no data associated
with the article. This statement should be added to the end of the article prior to
submission. The Data Availability statement should not refer readers or
reviewers to contact an author to obtain the data, but should instead include the applicable details
listed below.
No associated or additional data
For articles which have no associated data, the statement should read:
“No data are associated with this article.”
For articles where all associated data are presented in the article itself, please include the
statement:
“All data underlying the results are available as part of the article and no additional
source data are required.”
Repository-hosted data
Where underlying and/or extended data are hosted in a repository, please include the name of the
repository used and the license along with details indicated in the ‘What to include in the
data availability section of your article’ column in the
tables
above. This should be done in the style of, for example:
Repository: Manually annotated miRNA-disease and miRNA-gene interaction corpora.
https://doi.org/10.5256/repository.4591.d34639.
This project contains the following underlying data:
- Data file 1. (Description of data.)
- Data file 2. (Description of data.)
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
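When several datasets have to be described, the entry above can be assembled from its parts to keep the wording consistent. The sketch below follows the layout of the example; the function name, the repository, the title and the DOI used in the usage lines are hypothetical placeholders, and the result should still be checked against the Article Guidelines.

```python
CC0_STATEMENT = (
    'Data are available under the terms of the Creative Commons Zero '
    '"No rights reserved" data waiver (CC0 1.0 Public domain dedication).'
)


def data_availability_entry(repository, title, doi, files, licence_text=CC0_STATEMENT):
    """Build one repository entry; files is a list of (name, description) pairs."""
    lines = [
        f"{repository}: {title}.",
        f"https://doi.org/{doi}.",
        "This project contains the following underlying data:",
    ]
    for name, description in files:
        lines.append(f"- {name}. ({description})")
    lines.append(licence_text)
    return "\n".join(lines)


# Placeholder values for illustration only.
print(data_availability_entry(
    "Zenodo",
    "Example survey data",
    "10.1234/example",
    [("Data file 1", "Description of data."), ("Data file 2", "Description of data.")],
))
```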
Where data are held in a structured, subject-specific repository, the following example would be appropriate:
NCBI Gene: Ihe1 intestinal helminth expulsion 1 [ Mus musculus (house mouse) ]. Accession number 107537.
Each dataset mentioned in the article, including those in the Data Availability statement, must
also be referenced using a formal
data citation.
For more information on how to structure the Data Availability section, please see
our Article Guidelines.
Data that cannot be shared
Ethical and security considerations
If data access is restricted for ethical or security reasons, please include a description of the
restrictions on the data and all necessary information required for a reader or reviewer to apply
for access to the data and the conditions under which access will be granted.
Data protection issues
Where human data cannot be sufficiently de-identified, please include: an explanation of the data
protection concern; what, if anything, the relevant Institutional Review Board (IRB) or equivalent
said about data sharing; and, where applicable, all necessary information required for a reader or
reviewer to apply for access to the data and the conditions under which access will be granted.
Large data
Where data is too large to be feasibly hosted by a repository approved by Gates Open Research,
please include all necessary information required for a reader or reviewer to access the data
alongside a description of this process.
Data under license by a third party
In cases where data has been obtained from a third party and restrictions apply to the availability
of the data, the manuscript must include: all necessary information required for a reader or
reviewer to access the data by the same means as the authors; and publicly available data that is
representative of the analysed dataset and can be used to apply the methodology described in the
article (please see
Repository-hosted data
above).
If you are unable to share your data for any reason not included here, or have additional
questions about data sharing, please let our editorial team know and we will be happy to advise.
2.4 Link Your Datasets to Your Article
Once your article is published, we strongly advise that you update your repository project with the DOI for your article, which will be emailed to you upon article publication. Linking your data to your article will enable your data and article to be reciprocally connected, ensuring you receive credit for your work.