This interview focuses on the open source DataSHIELD software that enables you to take the analysis to the data, not the data to the analysis. Just like the software itself, this interview is a result of a group effort.
D2K group: (from left to right) Dr Andrew Turner, Prof Paul Burton, Dr Demetris Avraam, Dr Stephanie Roberts, Prof Madeleine Murtagh, Dr Olly Butters, Dr Neil Parley, Dr Becca Wilson. The two dogs are the group mascots Java (left) and Data (right).
Please introduce yourself. What is your background and your role in the project?
The DataSHIELD project is co-ordinated by the Data to Knowledge (D2K) Research Group from the School of Social and Community Medicine, University of Bristol. The following people are involved in the day-to-day running of the project:
Paul Burton – Professor of Infrastructural Epidemiology, Principal Investigator of the overall DataSHIELD project and an active developer of the software and statistical methods.
Becca Wilson – originally a planetary scientist – now the DataSHIELD Lead. I coordinate the project and contribute to the expansion of the project beyond biomedical applications.
Demetris Avraam – mathematical modeller. I have a leading role in the development, implementation and testing of new statistical methodologies for use in DataSHIELD.
Andrew Turner – originally a philosopher. I conduct social studies of health science with respect to the development of DataSHIELD. I also contribute to the development of DataSHIELD’s socio-technical infrastructure.
Madeleine Murtagh – Professor of Social Studies of Health Science. I lead Governance and studies of the human dimensions of DataSHIELD and data sharing.
Tom Bishop – data scientist, MRC Epidemiology Unit. I am a member of the DataSHIELD core development team and contribute towards the development of new platform functionality.
When was DataSHIELD created and why?
The idea underpinning DataSHIELD emerged at a joint meeting, held in May 2009, of the EU Framework 6 project PHOEBE and the Public Population Project in Genomics and Society. DataSHIELD comprises a series of R packages and a computing infrastructure that enables the fully efficient remote analysis of distributed, sensitive individual-level biomedical data (microdata) as if one had access to all the required data in one place. In fact, under DataSHIELD the data remain secure – invisible and unattainable – on their study/host servers, behind the host firewall. DataSHIELD can be used when governance or other constraints prohibit release of microdata to third parties, e.g. where ethical and legal concerns surrounding patient confidentiality limit the access and use of such data.
DataSHIELD can also be used to precisely replicate the mathematics of a study-level meta-analysis, a process in which: (i) each study analyses its own data; (ii) each study transmits its results to the analysis centre; and (iii) the analysis centre undertakes, for example, a random-effects meta-analysis of all the results. Under DataSHIELD, however, this is highly streamlined – all commands are issued by the DataSHIELD user and there is no need to wait for studies to analyse their own data. DataSHIELD performs precisely the same analysis as each study would, but the studies themselves don't have to do any work at all.
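The pooling arithmetic in step (iii) can be sketched in a few lines. The sketch below uses the simpler fixed-effect inverse-variance model rather than the random-effects model mentioned above, and the study estimates and standard errors are invented purely for illustration – they are not from any DataSHIELD analysis.

```python
# Sketch of the inverse-variance (fixed-effect) pooling step performed
# at the analysis centre. The per-study estimates and standard errors
# below are hypothetical.

def pool_fixed_effect(estimates, std_errors):
    """Combine per-study effect estimates by inverse-variance weighting."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Three hypothetical studies, each reporting a log odds ratio and its
# standard error (what each study would transmit in step (ii)).
betas = [0.42, 0.35, 0.50]
ses = [0.10, 0.15, 0.12]

pooled, pooled_se = pool_fixed_effect(betas, ses)
print(round(pooled, 3), round(pooled_se, 3))  # → 0.431 0.068
```

Under DataSHIELD the same combination happens automatically on the client, with the per-study results arriving as non-disclosive summaries rather than being e-mailed around by the studies.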
Given its functionality, DataSHIELD has the potential to increase access to and usability of datasets across the world. It is of particular value when the original individual-level data (microdata) cannot physically be shared because of: (i) governance constraints; (ii) concerns about the loss of intellectual property or professional control of a precious set of data; or (iii) physical size of the data objects that would need to be transferred. In essence, DataSHIELD can legitimately “lower the bar” on the governance processes that determine whether a particular set of data can be used in a given co-analysis.
How does DataSHIELD work for the data provider, and what is required to set it up?
A data provider needs to set up both a technical (hardware and software) and social infrastructure (people and protocols) to use DataSHIELD successfully. In terms of hardware, data providers require a database server (MongoDB or MySQL) to host the data and a virtual machine to provide the DataSHIELD R environment that will analyse individual level data from behind the study firewall.
The minimum technical specifications of these hardware components are outlined in Table 1.
| Component | Specification |
| --- | --- |
| CPU | Recent server-grade or high-end consumer-grade processor |
| Memory (RAM) | Minimum: 2 GB; recommended: >4 GB |
| Disk space | Rule of thumb: 10 GB for the operating system + 4 GB per 10,000 participants |
Table 1: A table of the minimum hardware specification to host data within the DataSHIELD infrastructure (from Opal Server Administrator Guide).
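The disk-space rule of thumb from Table 1 is simple arithmetic; the helper below just encodes it, and the 50,000-participant study used in the example is hypothetical.

```python
# Table 1 rule of thumb: 10 GB for the operating system
# plus 4 GB per 10,000 participants.

def recommended_disk_gb(participants, os_gb=10, gb_per_10k=4):
    """Return the recommended disk allocation in GB for a study server."""
    return os_gb + gb_per_10k * participants / 10_000

print(recommended_disk_gb(50_000))  # hypothetical 50,000-participant study → 30.0
```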
The software required to run DataSHIELD includes the freeware statistical programming software R and the open source Opal data warehouse (utilising Java) that has DataSHIELD built into it.
If DataSHIELD is being used in a co-analysis setting, each data provider will need to harmonise their data with those of all other studies in the consortium. One of the studies in the consortium will need to set up and maintain a DataSHIELD client portal – utilising RStudio on a separate virtual machine – and manage a governance mechanism for data access, including the usernames and passwords of authorised users of the consortium data.
In terms of social infrastructure, a computer infrastructure specialist is essential to install and set up DataSHIELD at a study. Furthermore, once DataSHIELD has been set up at a given study, it is important that someone (an infrastructure specialist) is formally identified as having the responsibility to maintain the hardware and to update the DataSHIELD, Opal and R software. One or more named individuals should also be identified as having responsibility for governance and for local implementation of the protocols underpinning user access to the data, user access to DataSHIELD and maintenance of the dataset.
Which organisations have already successfully deployed DataSHIELD?
DataSHIELD has been piloted under the EU FP7 funded BioSHaRE project. Within BioSHaRE two use cases implemented DataSHIELD:
- The Healthy Obesity Project (HOP) ultimately co-analysed data from 10 European biobanks, using 99 variables. The project was undertaken to determine whether it was possible to identify personal, demographic or environmental factors that could explain why some people who are ‘obese’ nonetheless seem completely healthy.
- The Environmental Core Project (ECP) co-analysed data from 5 biobanks, using 51 phenotypic variables and 14 environmental variables extracted from exposure models. ECP set out to explore whether air pollution or excessive noise were associated with a disordered distribution of biochemical markers for cardiovascular disease.
Further piloting of DataSHIELD is already planned:
- The FP7-funded Interconnect Study – led by Professor Nick Wareham, MRC Epidemiology Unit, Cambridge – is constructing an epidemiological infrastructure aimed at exploring variation in the risk of diabetes and obesity between different populations and, in particular, at investigating the cause of the excess prevalence of diabetes in certain high-risk populations. In an initial pilot, an exploration will be undertaken of the potential benefits of physical activity during pregnancy on offspring birthweight. Data from 8 studies – comprising a total of 144,349 participants – will be harmonized and then securely co-analysed using generalized linear modelling via DataSHIELD. If the pilot is successful, a portfolio of co-analyses is planned that may incorporate up to 40 studies.
- ENPADASI is an FP7-funded project to deliver an open access research infrastructure that will contain data from a wide variety of nutritional studies to facilitate combined analyses in the future. ENPADASI is investigating the use of DataSHIELD for this context.
- SPIRIT comprises a network of academic and industry partners investigating intra-uterine determinants of child health and development and perinatal health services in Quebec and Shanghai, China. DataSHIELD is being implemented to co-analyse 4 cohort studies (3 in Canada, 1 in China).
To date, DataSHIELD has predominantly been used to conduct epidemiological analyses based on conventional (non-‘omics) phenotypic data. We are in discussion with research groups considering quite different applications in other areas of ‘omics – including epigenomics. It has always been intended that DataSHIELD should be capable of analysing genomic/genetic data.
DataSHIELD can already be used to analyse a small to moderate number of genetic covariates (e.g. up to 200 Single Nucleotide Polymorphisms [SNPs]) simply by treating the SNPs as binary, or three-level, nominal/ordinal covariates. There is also no reason in principle why DataSHIELD could not be used to analyse high-throughput genomics data (e.g. 1 million SNPs from 10,000 individuals in a genome-wide association analysis). However, before this can be enabled, funding is required to develop and pilot a computing infrastructure – including compatible data storage and an analytic environment – to support a DataSHIELD approach to the secure analysis of this type of big data.
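Treating SNPs as three-level covariates usually means counting copies of the minor allele. The sketch below shows one common additive 0/1/2 coding; the genotype strings and mapping are a general illustration of the convention, not part of DataSHIELD's own API.

```python
# Illustrative additive coding of SNP genotypes as three-level
# covariates: 0, 1 or 2 copies of the (hypothetical) minor allele 'a'.
# This is a standard convention, not DataSHIELD-specific code.

ADDITIVE_CODE = {"AA": 0, "Aa": 1, "aA": 1, "aa": 2}

def encode_additive(genotypes):
    """Map genotype strings to 0/1/2 minor-allele counts."""
    return [ADDITIVE_CODE[g] for g in genotypes]

print(encode_additive(["AA", "Aa", "aa", "aA"]))  # → [0, 1, 2, 1]
```

Once encoded this way, each SNP enters a regression model like any other ordinal covariate.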
Do you provide support for new users and new organisations who want to use DataSHIELD?
DataSHIELD is an open source project – core information about how to install and use DataSHIELD is freely available on our wiki page, and our code – including Puppet install scripts – is available on the DataSHIELD GitHub page. We are keen to work with other projects and research groups across the scientific community and have created a free DataSHIELD demo environment, allowing potential users and developers to explore our software and infrastructure.
Throughout the year we run DataSHIELD user and developer training courses at conferences and workshops as well as an annual training course as part of the short-course programme for our department. Our next set of DataSHIELD training courses (for users and data providers) runs across 19th and 20th October 2016:
- 19th October 2016 – Introduction to R (a refresher course in R, designed for those who have never used R)
- 20th October 2016 (morning) – DataSHIELD User training course
- 20th October 2016 (afternoon) – DataSHIELD technical course (for data providers or developers wanting to implement or develop in DataSHIELD)
Where appropriate, these courses can be tailored to the specific needs and professional experience of a particular group of researchers.
In the future, as part of our sustainability planning, we intend that projects that wish to deploy DataSHIELD will be offered tailored support plans costed on an as-needs basis. This may include support to implement and/or use DataSHIELD in a study/consortium and, where it is necessary, the development of new analytic functionality.
Would it be possible to connect every biorepository in the world using DataSHIELD, or does it have limitations?
In principle, there is no maximum limit to the number of datasets (or number of observations) that may be pooled together and analysed using DataSHIELD. However, as the number of data sources co-analysed rises, the potential for a technical problem to arise in the IT system of at least one of the sources inevitably increases. In practice, this means that co-analyses involving more than a few tens of studies may prove impractical. It is important to recognise that we do not propose that DataSHIELD should be used for all analyses of all studies – it is only of value when there would otherwise be a problem in physically sharing the microdata. Because DataSHIELD introduces additional complexity over and above the work required for an equivalent analysis without DataSHIELD, it would be inconvenient – indeed counterproductive – for analysts to use it when it is not required. This would risk putting off potential users from working with DataSHIELD when it is useful.
From a technical perspective, R (the DataSHIELD analytic environment) has a finite memory capacity and it may be thought that this would limit the number of studies that may be co-analysed simultaneously. But this is not the case. The study-specific analysis carried out on the R environment behind the firewall at each study takes place entirely on the server running Opal and R at that study. This means that it is the amount of data (observations and variables) to be analysed from an individual study that determines memory use and computing load. Furthermore, because the non-disclosive information passed back from all studies to the central analysis server is generally simple (small and low-dimensional) it is extremely unlikely that the R environment on the central client server would be overwhelmed even with a very large number of studies and analytic requests.
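The reason the client's load stays small can be illustrated with a deliberately simplified sketch: each study computes a handful of low-dimensional summaries behind its own firewall, and only those summaries travel to the client, which combines them. The function names and the choice of statistics below are ours for illustration – this is the general pattern, not DataSHIELD's actual protocol or API.

```python
# Simplified illustration of federated aggregation: each study returns
# only three numbers (n, sum, sum of squares), and the client combines
# them into a pooled mean and variance. Not DataSHIELD's real API.

def study_summary(values):
    """Runs behind each study's firewall; only these summaries leave."""
    n = len(values)
    return n, sum(values), sum(v * v for v in values)

def pooled_mean_var(summaries):
    """Runs on the client: combine per-study summaries."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    var = (total_sq - n * mean ** 2) / (n - 1)  # pooled sample variance
    return mean, var

# Two hypothetical studies; the raw values never reach the client.
study_a = study_summary([1.0, 2.0, 3.0])
study_b = study_summary([4.0, 5.0])
mean, var = pooled_mean_var([study_a, study_b])
print(mean, round(var, 2))  # → 3.0 2.5
```

However many studies join the co-analysis, the client only ever handles a few numbers per study, which is why its memory footprint stays negligible. (Real DataSHIELD additionally applies disclosure checks before any summary is released.)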
Finally, we note that it is only meaningful to co-analyse studies that hold comparable information: data should therefore be harmonized before using any form of joint analysis, including DataSHIELD. This is time- and resource-intensive and, because it is context-specific, it may have to be repeated when a new analysis is proposed. In consequence, whilst we could see an argument for many studies installing DataSHIELD so they could use the infrastructure whenever it might be of value to them, our future vision is based on the assumption that any one co-analysis would be limited to a group of studies with a particular shared goal, where the data have been adequately harmonized. In practice, because of the practical limitations noted above, we suspect that this will rarely exceed tens of studies, but any one study could potentially be a member of a series of different co-analysis consortia (addressing a range of different scientific questions). This could produce a large network of studies that all run DataSHIELD and undertake co-analyses in many different combinations, but it would be misleading to view this as being equivalent to all studies being simultaneously linked together in one huge network for a single shared analysis.
What do you think is the biggest barrier for research collaborations and data sharing in the genomics research community?
Ironically it is failure to recognise that there is no single “biggest barrier” to data sharing in genomics that is central to the important challenges that do exist. As outlined by Murtagh et al (2016), data sharing is a complex phenomenon providing both scientific and human opportunities and challenges that are changing over time as perspectives and understanding of the key issues are evolving amongst the professional research community, ethico-social commentators and – most fundamentally – across society at large. If we are to optimise data sharing and co-analysis of the increasingly large datasets and data objects, not only in genomics research but across the biomedical research arena as a whole, it is essential that we develop secure but streamlined governance systems and infrastructures that properly reflect the complex reality of contemporary data sharing, and then develop and apply novel solutions to address each of the individual challenges (scientific, technical, ethico-legal, social and political) that are generated by those systems.
One key challenge relates to concerns about loss of intellectual property. Researchers who generate data and make them available to others commonly report that, having physically shared microdata with external users, one or more key problems arise: research is undertaken that goes beyond the scope of what was formally agreed when the data were released; important results are published without proper involvement or acknowledgement of the data generators; or the data are forward-shared with new users who have not been subject to the formal governance oversight that applies to all potential users of the data – potentially including transfer of the data to jurisdictions whose governance controls are markedly weaker than those that apply where the data were first generated. These are human problems, compounded by the often perverse incentives in academic institutions whereby discovery science and its outputs, measured in high-impact research papers, are given high professional credit, whereas data generation – even when it is extensive, of high quality, and involves providing data to multiple research groups with large impact – is given little credit. Because DataSHIELD allows data providers to share the information in their data rather than the data themselves, they have much greater control over the nature and extent of the analysis that can then be undertaken.
The scientific desirability of sharing and co-analysing individual-level data (microdata) is on an upward trajectory. By enabling microdata to be shared and co-analysed without physically transferring them to the user and by preventing direct visualisation of those data and providing disclosure controls to reduce the risk of inferential disclosure based on particular patterns of analysis, DataSHIELD provides a technical solution to these challenges – human and scientific – that can then be applied to data being released under a wide variety of governance systems.
Are you part of a project that facilitates data sharing for genomics or other related research?
Are you directly or indirectly involved in the Open Science movement?
Would you like to be featured on our blog?
We would love to hear from you.