Interview with Dr Yaniv Erlich, Assistant Professor of Computer Science at Columbia University and Core Member of the New York Genome Center, about one of his projects: DNA.Land
1) What is DNA.Land?
DNA.Land is a website where people can upload their genome, contribute their data to science, and learn more about themselves. The idea of this project is that in order to realise the promise of precision medicine you need to analyse a large number of samples, genomes and phenomes. It takes a lot of resources to collect this kind of data, but we already have 2-3 million people here in the US who have access to their digitised genomes (23andMe and Ancestry each have more than one million samples, Family Tree DNA has several hundred thousand). With this number of people, you do not want to start everything from scratch; instead, you can try to reach out to them and ask if they would like to donate their data to science.
2) What is your background and your role in the project?
I am a computational biologist by training. I received my PhD in genomics and bioinformatics from the Watson School of Biological Sciences at the Cold Spring Harbor Laboratory in New York, and my B.Sc. was in computational neuroscience from Tel Aviv University.
I joined Columbia University in 2015, and prior to that I was a Fellow at MIT’s Whitehead Institute where my research lab studied genetic privacy, created lobSTR, a short tandem repeat profiler for personal genomes, and also constructed a genealogy tree linking 13 million people.
My main interest is dissecting the complex relationships of genes, health and privacy. My current role is a joint one with the New York Genome Center and aims to bridge the strong computer-science communities at Columbia with the Genome Center’s focused efforts on translating genomic research into patient care.
DNA.Land is just one of many projects that I run with my team at Columbia.
3) How, when, and why did it start? How is it funded?
As mentioned above, since a lot of data is already available, we thought it was worth trying to reach out to people and use crowdsourcing to collect it.
The platform was launched at the end of 2015. So far we have collected more than 15,000 genomes and the number is growing. The number of users is smaller though, since many people upload more than one genome, e.g. from their family members.
At the moment, funding comes from the New York Genome Center and the Burroughs Wellcome Career Award. In fact, it is not a very expensive project to run. We have some fixed costs for personnel and storage, but the process of collecting genomes itself is basically free. And in a way we are funded by all these sequencing companies, like 23andMe, because they are doing the job of creating the data that we collect.
4) How is it different from other projects, like e.g. openSNP or GEDmatch?
Unlike openSNP, which is a wonderful project, we believe that we need to reciprocate immediately to people who donate their genomes. You need to give them something back, because many people do sympathise with the aim of donating their data to science, but everybody is super busy, does not have time, etc. If we give our users some incentive, then we can probably create some traffic to the website and motivate them to give their data.
What is our difference from GEDmatch? Again, GEDmatch is a great tool, but it is not really a scientific project. They do not collect phenotypes and they have no consent in place, which is a necessary condition for human-subject research.
5) What do you expect from the participants and what is their reward for participation? What kind of information can they get after uploading their data?
As I said, at DNA.Land we are immediately reciprocating our users. As soon as people upload their genomes, we aim to give them something back. For instance, we will analyse our users’ ancestry. We will also impute* their genome immediately. We need to impute in order to standardise the data before we do any scientific studies. But imputation can also be quite useful for the individuals themselves, since they can get access to their whole genome. We also allow people to search for relatives among the samples in our database. Genetic companies also search for relatives, but only within their own databases, whereas we allow searching through the data irrespective of which company produced the genomic data. We also allow people who have a Geni.com profile to connect their genealogy tree from it to their account on DNA.Land. We plan to add more interesting features in the future.
6) How do you store and handle participants’ data? Who can apply for access to the data for use in their own projects?
We decided to use the Amazon cloud to store data, since it has built-in security mechanisms that we use, such as encryption and strong access control. We also store the passwords of users in an encrypted form. Finally, the people in my group who work with the participants’ data need to take human-subjects training and are individually vetted by our Institutional Review Board, which is an external committee that oversees the project.
Regarding access, the users themselves can access their data after imputation. They can only access their own data, but they can see their matches and the matches of their matches. This matching report is an opt-in feature, not active by default, i.e. the user can decide whether or not he/she wants to be matched to other users of the DNA.Land platform.
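The visibility rule described above can be illustrated with a minimal sketch. It is not DNA.Land's actual code: the function name `visible_users`, the `matches` mapping and the `opted_in` set are hypothetical, and the sketch assumes that a user who has not opted in neither sees matches nor appears in anyone else's report.

```python
from typing import Dict, Set


def visible_users(user: str,
                  matches: Dict[str, Set[str]],
                  opted_in: Set[str]) -> Set[str]:
    """Return the users whose match entries `user` may see.

    Assumption (hypothetical rule): only opted-in users take part in
    matching, and a user sees matches plus matches of matches.
    """
    if user not in opted_in:
        return set()  # matching is opt-in, so nothing is shown by default

    # Direct matches, restricted to users who also opted in.
    direct = {m for m in matches.get(user, set()) if m in opted_in}

    # Matches of matches, again restricted to opted-in users.
    second: Set[str] = set()
    for m in direct:
        second |= {x for x in matches.get(m, set()) if x in opted_in}

    return (direct | second) - {user}


# Toy data: names and relationships are invented for illustration.
matches = {
    "ana": {"ben"},
    "ben": {"ana", "cy", "dee"},
    "cy": {"ben"},
    "dee": {"ben"},
}
opted_in = {"ana", "ben", "cy"}

print(sorted(visible_users("ana", matches, opted_in)))
```

In this toy data, "dee" has not opted in, so she does not appear in anyone's report and sees no matches herself.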
The DNA.Land team does see all the data that is uploaded to the platform. In the future, and according to the consent that the users signed, we will release summary statistics based on the analysis of this data, which we believe will be beneficial for genetic research. Right now we are still accumulating data and focusing on building new features.
The consent does not allow sharing individual-level data without the explicit authorisation of users. In the future, we plan to allow users to contribute their data to external studies they care about (e.g. cancer) run by researchers that we have specifically vetted. However, at the moment no external researchers can apply for and access this data, since we do not have a mechanism to allow this. We are a very small group and there are many other things to do. Also, we think that researchers would prefer to wait a little bit until we get more data.
7) How would you advise young researchers today in how they approach data sharing in genomics?
It is crucial for young researchers working in human genetics to understand the human part of it: to understand that there is a human being behind this process. There is no need to shy away from thinking about the consent, about what they are asking their participants. Because if you do not have proper consent from the very beginning it is very hard to change or adjust it later on. So, when you are building a consent form, do not just treat it as a bureaucratic step, but think about it carefully.
At DNA.Land, we have so-called dynamic consent, which means that by default we do not share our participants’ data but we are allowed to ask them in the future if they wish to donate their data to this or that research project.
* Imputation is a process that uses a large dataset to predict what will happen in a specific instance. It is like the ability to read misspelled words, so long as the first and last letters are in the right places. Using prior data, imputation can fill in missing parts of genomes by tracking genetic markers that are commonly inherited together.
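As a toy illustration of the idea in this footnote, the sketch below fills in missing markers by copying them from the reference sequence that agrees best with the markers that were actually measured. This is a deliberate simplification and not the method DNA.Land uses; real imputation tools rely on statistical models over large reference panels. The function name `impute` and the example data are hypothetical.

```python
from typing import List, Optional


def impute(target: List[Optional[str]], panel: List[str]) -> str:
    """Fill missing alleles (None) in `target` from the closest panel sequence.

    Toy nearest-match approach: pick the reference sequence that agrees
    with the target at the most observed positions, then copy its alleles
    into the gaps.
    """
    def agreement(reference: str) -> int:
        # Count observed positions where target and reference agree.
        return sum(1 for t, r in zip(target, reference)
                   if t is not None and t == r)

    best = max(panel, key=agreement)
    return "".join(t if t is not None else r for t, r in zip(target, best))


# Invented reference panel: each string is one sequence of six markers.
panel = ["ACGTAC", "ACGTTC", "TCGAAG"]

# A genotyped sample with two markers that were not measured.
target = ["T", None, "G", "A", None, "G"]

print(impute(target, panel))
```

Here the third reference sequence matches all four measured markers, so the two gaps are filled from it. This mirrors the footnote's point that markers commonly inherited together let you predict the ones you did not measure.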
Are you part of a project that facilitates data sharing for genomics or other related research?
Are you directly or indirectly involved in the Open Science movement?
Would you like to be featured on our blog?
We would love to hear from you.