The idea for this guest post by Kate Hodesdon of Seven Bridges Genomics grew out of a discussion with Adam Resnick (Children’s Hospital of Philadelphia) and Deniz Kural (Seven Bridges Genomics).
There is widespread recognition that sharing data benefits science. In this article, I’ll examine the best practices of data sharing, and assess the prospects for codifying these into a metric for how well scientists share data.
When scientists say that sharing data is good for science, they have certain models of sharing and certain kinds of data in mind. I want to look at what makes someone a good sharer of data. For instance, simply being a prolific sharer is useless if the quality or relevance of the data is poor. And sharing high-quality data is not helpful if you store it in an insecure repository or in an obscure format. Clarifying best practices of data sharing will help us maximize the value of shared data, but it can also play another important role: helping to incentivize data sharing.
The problem of incentivization is that while data sharing undoubtedly benefits scientific progress, it is only beneficial to individuals if they can take advantage of another’s shared data.
In other words, data sharing requires multilateral adoption: a scientist who shares alone gains nothing in return.
To encourage this, researchers have investigated ways of incentivizing scientists to share their data — see, for instance, the report commissioned by the Wellcome Trust: ‘Large-scale data sharing in the life sciences: Data standards, incentives, barriers and funding models’. Financial rewards, in the form of research grants, are clearly one way to do this: funding bodies can opt to award grants to projects on the condition that any data they generate is made accessible. Similarly, hiring committees can look for evidence of data sharing when evaluating applicants for research posts or tenure.
However, if funding and hiring bodies are going to usefully assess how well a scientist shares data, then we need to make clear exactly what it is that they should measure. We need a metric. The metric ought to be objective, and permit comparisons between scientists, ultimately resulting in a ranking: a kind of sharers’ leaderboard.
The sharing metric would play an analogous role to the h-index, which aims to measure scientists’ productivity. A scientist has an h-index of h if h is the largest number such that he or she has published h articles, each of which has received at least h citations. The h-index rewards strength in two of the most prized academic ideals: being both prolific and widely cited. When it comes to research output, simply valuing quality over quantity isn’t enough: we want both, at once. This means that good scientists who publish just one or two highly influential papers can, counterintuitively, be penalized with a lower h-index than their colleagues who publish more papers that are less widely cited. However, despite its flaws, the h-index provides funding and hiring bodies with a usable quantitative proxy for a scientist’s quality. Adam Resnick, of the Children’s Hospital of Philadelphia, proposed we create an analogous index for a scientist’s contribution of open access data. He called this prospective metric the ‘s-index’.
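To make the h-index definition concrete, here is a minimal sketch in Python. The function name and the two citation lists are invented for illustration; they are not real publication records.

```python
def h_index(citations):
    """Return the largest h such that h papers each have at least h citations."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts, one entry per paper.
prolific = [12, 9, 7, 6, 5, 5, 3, 1]
influential = [950, 400]

print(h_index(prolific))     # 5
print(h_index(influential))  # 2
```

The second scientist has far more citations in total, yet the h-index can never exceed the number of papers published, which is the counterintuitive penalty described above.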
The simplest way to measure how much a scientist shares is just to sum the volume of data that they make available. But just as the sum of a scientist’s publications doesn’t tell you much about the value of their work to the scientific community, tracking the sheer quantity of data output won’t give us useful information. We need the s-index to correlate with quality as well as quantity. What makes data high quality is a tricky normative question: it includes data that is gathered accurately, with good coverage, but will also often be subject-specific. Fortunately, we don’t need to pin down quality precisely: we can simply assume that a dataset’s quality is reflected in the number of times it is actually used. We can measure data usage by imposing a system of formal citation, just as we do to keep track of when an article is used in further research; something like PageRank for datasets.
Incorporating the number of times a dataset is used by other people into the s-index will also help to ensure that data is made available in the most useful and usable format. For instance, genomic data is more likely to be used if it is properly annotated with metadata specifying, for example, the identities of the samples used and any sequencing technology used to process them. To increase usage rates, sharers should also adhere to standard conventions about naming and file formats, and provide readme files when they deviate from these.
The resulting s-index won’t be perfect — metrics rarely are — but its weaknesses seem to be no worse than those of the h-index. One drawback is that an s-index will favour scientists in disciplines that produce a lot of data by volume, like genomics or particle physics, over those in subjects like geology. However, a similar discrepancy is also built into the h-index: physicists typically publish far more papers (as n-th named author) than historians do. The problem is dealt with by applying the index only in context — by comparing particle physicists to other particle physicists, geologists to other geologists, and so on.
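Applying an index only in context amounts to ranking scientists within their own field rather than on one global list. The sketch below assumes each scientist already has some numerical score; the names, fields and scores are hypothetical placeholders, and nothing here says how an s-index would actually be computed.

```python
from collections import defaultdict


def rank_within_field(records):
    """Group (name, field, score) records by field and sort each group by score."""
    by_field = defaultdict(list)
    for name, field, score in records:
        by_field[field].append((name, score))
    # Highest score first inside each field; fields are never compared to each other.
    return {
        field: sorted(members, key=lambda member: member[1], reverse=True)
        for field, members in by_field.items()
    }


# Hypothetical scores, for illustration only.
records = [
    ("A", "genomics", 40),
    ("B", "geology", 6),
    ("C", "genomics", 25),
    ("D", "geology", 9),
]

leaderboards = rank_within_field(records)
print(leaderboards["genomics"])  # [('A', 40), ('C', 25)]
print(leaderboards["geology"])   # [('D', 9), ('B', 6)]
```

On a single global list, both geologists would sit below both genomicists; ranked within their field, D leads the geology leaderboard.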
You will notice that although I recommend that the s-index be based both on the quantity of data produced and the amount of use it receives, I haven’t suggested an explicit formula. I leave this discussion as a topic for another day. My hope is that a numerical metric for data sharing will make it easier for us to reward the data sharing pioneers, and ultimately lead to wider adoption of data sharing in science.
Kate Hodesdon, PhD, is an editor at Seven Bridges Genomics, the cloud-powered platform for population-scale bioinformatics.