Andrew Magee

The Makings of a Meta-Analysis


How I Wasted Dozens of Hours Obtaining Publicly Available Data

This post was written by Andrew Magee as one of the honorable mentions for our data sharing essay competitions.

by Andrew Magee from UC Davis.

Phylogenies are estimates of the genealogical relationships among species, and are increasingly critical to research in a vast and rapidly expanding number of scientific disciplines, including evolutionary and conservation biology, comparative genomics, medicine and epidemiology. The process of estimating phylogenies from genetic sequence data is technically demanding and computationally intensive: many modern estimation techniques rely on Bayesian Markov chain Monte Carlo (MCMC) methods, which can require a great deal of expertise to apply and hundreds or thousands of CPU hours to perform. Given their incredible utility and the effort required to estimate them, it is crucial that phylogenetic data are readily available to the scientific community.

There have been numerous initiatives to promote the permanence of and increase access to phylogenetic data; among these are strict journal and publisher policies and even a government mandate for publicly funded projects. Stated reasons for such policies vary, but reproducibility and accountability, foundational ideas of science, are common. Still, despite policies mandating data sharing, and a clear moral case for sharing, it remains difficult to obtain phylogenetic data from published studies. Many studies, although presenting figures of their phylogenies in print, fail to make the corresponding tree files readily available in a usable format for other researchers, either as a supplement or in an online archive. This widespread failure has driven some researchers interested in acquiring phylogenies to print out pictures of them and measure them with calipers (see McPeek 2008)!

Recently, I was tasked with collecting phylogenetic data for a meta-analysis of the prevalence of diversification-rate decreases across the tree of life. The initial plan was rather straightforward: gather studies and data, reanalyze the data, and write the paper. However, it quickly became evident that, while the studies I sought were easy to access and cataloguing their findings was simple, obtaining the associated phylogenetic data was anything but easy. After finalizing a list of studies, many published in journals with policies requiring authors to deposit their data in public archives, I had to comb through the entire list and attempt to extract the data. This process was not as simple as originally anticipated. Many times a paper would reference an archive, only for me to find that the study was absent from it, or that only part of the data was archived. Other times a paper would direct me to supplementary data that I could not obtain.

In total, I gathered over 250 studies, and barely nine percent had publicly accessible data. Although an additional twenty-one percent of studies had partially available data, approximately seventy percent of studies failed to make any of their data publicly available, and over eighty percent required direct solicitation of the data from the authors. Gathering the data went from being a minor step in a reasonably short project, a step that I was to handle alone, to a multi-step, multi-person job requiring several weeks to complete. A lab-mate devised a script to take the information I had catalogued on the studies and turn it into emails requesting the relevant study data. Then, with the help of our advisor, we began the process of sending out what ended up being hundreds of emails. The automated generation of requests saved us a lot of time, and having help sending the emails saved me a lot of time; for both I am grateful to the members of my lab. Soliciting the data by email was a long process that required me to track down many people who had changed email addresses since writing their papers. It also required persistence in the face of authors who were non-responsive or reluctant to provide the data. In the end, between the three of us, we spent more than fifty hours on emails and obtained many datasets, but we were still unable to retrieve some data because of uncooperative authors, misplaced files, and computer failures.
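For readers curious what such a request-generating script might look like, here is a minimal sketch in Python. It is not the script my lab-mate wrote: the column names (author, email, title, journal, year, data_status), the wording of the template, and the function name build_requests are all assumptions made for illustration. It only builds the text of each message; sending is left out.

```python
import csv
import io
from string import Template

# Hypothetical wording; the text of the actual requests is not reproduced here.
REQUEST = Template(
    "Dear $author,\n\n"
    "I am collecting phylogenetic data for a meta-analysis and would like to "
    'include the tree files from your study "$title" ($journal, $year). '
    "Could you share them, or point me to where they are archived?\n\n"
    "Thank you,\n$sender\n"
)


def build_requests(catalogue_csv, sender):
    """Return one (address, subject, body) tuple per study without fully public data.

    catalogue_csv is the text of a CSV file with the assumed columns
    author, email, title, journal, year, data_status.
    """
    requests = []
    for row in csv.DictReader(io.StringIO(catalogue_csv)):
        # Studies whose data are already fully public need no email.
        if row["data_status"] == "full":
            continue
        body = REQUEST.substitute(row, sender=sender)
        subject = f"Data request: {row['title']}"
        requests.append((row["email"], subject, body))
    return requests
```

Keeping the catalogue in a plain CSV file means the same spreadsheet used to track the studies can drive the emails, and a bounced address only needs to be corrected in one place before regenerating the request.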

Data sharing is an incredibly important issue, and the frequency of sharing data needs to improve. It is entirely conceivable that we could have replicated most of the analyses from scratch, since almost every study made it easy to find the raw sequence data on GenBank. However, doing this would have represented a significant waste of our time and computational resources, and would have made our results not directly comparable to those of the studies we were using. This seems especially ridiculous given both how easy it can be for authors to deposit their data in an online repository, and how many authors are obligated to archive their data. Whether we accept that data access is important because of the need for reproducibility, or because of the resources wasted when sharing is absent, we can all agree that sharing data is desirable. Data archival policies are meant to ensure preservation of and ease of access to data for the foreseeable future, but my experiences show that data sharing has a long way to go.