Guest post by Sarah H Carl (@sarahhcarl)
In many ways, the currency of the scientific world is publications. Published articles are seen as proof – often by colleagues and future employers – of the quality, relevance and impact of a researcher’s work. Scientists read papers to familiarize themselves with new results and techniques, and then they cite those papers in their own publications, increasing the recognition and spread of the most useful articles. However, while there is undoubtedly a role for publishing a nicely-packaged, (hopefully) well-written interpretation of one’s work, are publications really the most valuable product that we as scientists have to offer one another?
As biology moves more and more towards large-scale, high-throughput techniques – think all of the ‘omics – an increasingly large proportion of researchers’ time and effort is spent generating, processing and analyzing datasets. In genomics, large sequencing consortia like the Human Genome Project or ENCODE were funded in part to generate public resources that could serve as roadmaps to guide future scientists. However, in smaller labs, all too often after a particular set of questions is answered, large datasets end up languishing on a dusty server somewhere. Even for projects whose express purpose is to create a resource for the community, the process of curating, annotating and making data available is a time-consuming and often thankless task.
Current genomics data repositories like GEO and ArrayExpress serve an important role in making datasets available to the public, but they typically contain data that is already described in a published article, and citing the dataset is usually secondary to citing the paper. If more easy-to-use platforms existed for publishing datasets themselves, alongside methods to quantify the use and impact of these datasets, it might help drive a shift away from ascribing value purely to journal articles and towards a more holistic approach in which the actual products of research projects – including datasets as well as code or software tools used to analyse them, in addition to articles – are valued. Such a shift could bring benefits to all levels of biological research, from ensuring that students who toiled for years to produce a dataset get adequate credit for their work, to encouraging greater sharing and reuse of data that might not have made it into a paper but still has the potential to yield scientific insights.
Tools and platforms to do just this are gradually emerging and gaining recognition in the biological community. Figshare is a particularly promising platform that allows for the sharing and discovery of many types of research outputs, including datasets as well as papers, posters and various media formats. Importantly, items uploaded to Figshare are assigned a Digital Object Identifier (DOI), which provides a unique and persistent link to each item and allows it to be easily cited. This is analogous to the treatment of articles on preprint servers such as arXiv and bioRxiv, whose use is also growing in biological disciplines; however, Figshare is more flexible in terms of the types of research output it accepts. In addition to the space and ability to share and cite data, the research community could benefit from better quantification of data citation and impact. Building on the altmetrics movement, which attempts to provide alternative measures of the impact of scientific articles besides the traditional journal impact factor, a new Data-Level Metrics pilot project has recently been announced as a collaboration between PLOS, the California Digital Library and DataONE. The goal of this project is to create a new set of metrics that quantify usage and impact of shared datasets.
Although slow at times, the biological research community is gradually adapting to the new needs and possibilities that come along with high-throughput datasets. Particularly in the field of genomics, I hope that researchers will continue to push for and embrace innovative ways of sharing their data. If data citation becomes the new standard, it could facilitate collaboration and reproducibility while helping to diversify the range of outputs that scientists consider valuable. Hopefully, the combination of easy-to-use platforms and metrics that capture the impact of non-traditional research outputs will provide incentives to researchers to make their data available and encourage the continued growth of sharing, recognizing and citing biological datasets.
Sarah Carl is a PhD student in Developmental Biology and Genetics at the University of Cambridge. She is inquisitive about coding, evolution and open science.