Fiona Nielsen from DNAdigest interviewed Phil Bourne, Associate Director for Data Science at the National Institutes of Health, about the Big Data to Knowledge (BD2K) program.

Photo of Phil E. Bourne (credit: Wikipedia)

What is Big Data to Knowledge (BD2K)?

BD2K is an NIH program spanning 27 institutes, with a budget of about 110 million dollars a year. The program focuses on the challenges emerging around data across the biosciences, and on leveraging the power of biomedical data to benefit the NIH.

The impact of BD2K reaches across all of biomedical data science research, supporting exciting science that would not happen through traditional means.

Take, for example, the Center for Predictive Phenotyping, which is mining electronic health records (EHRs) at scale. It combines computing power with the vast information captured in EHRs, such as the ICD-9 codes associated with medical conditions, to reach the point of informing medical interventions for patients.

Another example is a Stanford project on mobility data, which uses mobility information, body mass index, GPS coordinates, and more for research into gait rehabilitation and weight management.

What does BD2K provide for these projects?

BD2K provides the funding as well as an environment that supports big data research. This includes:

– addressing the shortage of a skilled data science workforce: 20% of the budget goes to training people in data analysis and analytics;

– emphasising the digital preservation of grant outputs, so that data and software feed back into the ecosystem and can be reused by the community.

“Once, all research output was paper-based, or lab consumables that could not be preserved. Today in data science everything is digital; there is no longer any excuse for not sharing research outputs.”

“If we can improve the efficiency of biomedical research by just 5%, that is approximately 1.5 billion dollars that can be spent on making more progress!”

What is your role at BD2K and what do you see as your mission?

I want to improve the efficiency of the biomedical research world. Enabling big data insights is what will contribute to the next major breakthrough; my gut feeling tells me it will happen. Part of my personal mission in this work is to create satisfying careers for researchers.

If you were to give some advice to young researchers today, what would it be?

What is really important is to get training that gives you diversification. For instance, training in data analysis combined with experimental work allows you to understand the nuances. Experience in collecting, analysing, and applying data gives you the most insightful understanding. The best researchers are able to both design and conduct experiments with data, and I think the key is this hybrid approach to science – combining the theoretical with the experimental.

I would also advise them to follow these Ten Simple Rules for Building and Maintaining a Scientific Reputation.

What is the best advice you can give to research organisations to achieve the biggest impact with their use of big biomedical data?

Both researchers and research institutions should adopt the FAIR principles: only by making all research data FAIR (Findable, Accessible, Interoperable and Re-usable) will knowledge discovery and data analysis achieve their maximum impact.

Unfortunately, there is a cost to implementing the FAIR principles, and the current reward system in science is skewed. We need to redesign the reward mechanisms so that all kinds of scholarship, including FAIR data management, are rewarded, not just paper publications.

In the past, the emphasis on re-use has been missing, but today the definition of scholarship is changing: all data should be cited and rewarded.

In your slideshare presentation, ‘Understanding the Big Data Enterprise’, you state that the desired endpoint for the University as a Digital Enterprise is ‘Uber’… What do you mean by that?

I am drawing a parallel to the very successful new services we have seen in recent years. All of them, including Uber, Airbnb, and eBay, are based on the principle of connecting suppliers and consumers. Each of these services has succeeded only because it has created TRUST between supplier and consumer, all carried out and supported by its software platform.

Drawing the parallel to my academic work: if I write a paper, I am a supplier; if I read a paper, I am a consumer. It is by knowing the journal of the publication that I have a level of TRUST in the supply. In the context of consuming research data, however, that TRUST is the missing link.

In science we have many different layers and platforms – for publications, for data repositories, for job searches – but no unifying platform. All of these principles bring us to the concept of a commons: a platform that unifies across the different layers. It is not a new concept, nor one invented by me (see e.g. this article in WIRED), but I think this concept of a trusted, unifying commons is what is missing in research as a digital enterprise.


Phil Bourne will be speaking at the BioData World Congress in Boston on September 14–15, presenting: Highlights from the NIH Big Data To Knowledge (BD2K) program.