Biopeer is a data sharing tool for small- to medium-scale collaborative sequencing efforts that began as a project by a group of senior students at Bilkent University, Turkey. Today, DNAdigest interviews Can Alkan, an Assistant Professor in the Department of Computer Engineering at Bilkent University and one of the minds behind Biopeer.
1. Please introduce yourself; what is your background, position?
I am an Assistant Professor in the Department of Computer Engineering at Bilkent University, Ankara, Turkey. I’m a computer scientist by training. I finished my PhD at Case Western Reserve University, where I worked on algorithms for the analysis of centromere evolution, and then on RNA folding and RNA-RNA interactions. Later, I did a lengthy postdoc at the Genome Sciences Department of the University of Washington. I was lucky that next-generation sequencing (NGS) took off a few months after I joined UW, and I suddenly found myself in many large-scale sequencing projects such as the 1000 Genomes Project. Since NGS was entirely new, we needed to develop many novel algorithms to analyze the data. Together with my colleagues, I developed read mappers (mrFAST/mrsFAST) specifically for segmental duplication analysis, which we used to generate the first personalized segmental duplication and copy number polymorphism maps. We then moved on to structural variation discovery and developed VariationHunter and NovelSeq. After working on many exciting projects at UW, I moved to Bilkent University in January 2012.
2. What is Biopeer? How did it start?
Biopeer is a data sharing tool for small- to medium-scale collaborative sequencing efforts. Both during my postdoc and after I started my position at Bilkent, I was involved in projects where we generated a few to tens of genomes (Neandertal, Great Ape Diversity, dog domestication, etc.). Larger projects like the 1000 Genomes Project rely on high-throughput data transfer services such as Aspera, but since it requires dedicated servers and the license is not free, smaller groups and groups with intermittent data sharing needs cannot afford it. Many labs all over the world are involved in smaller projects, and the usual way of distributing the data is to write it onto external hard disks and ship them by courier. Needless to say, this method is annoying at the least. At Bilkent, senior students are expected to complete a year-long senior design project in groups of 3-5 people, in which they are supposed to develop innovative solutions to real-life problems. I mentioned this problem to one of the groups, and they started on the Biopeer project. Since most research data is kept private until publication, one of the requirements was to establish access control. We use ORCID to authenticate users and AES to encrypt the data transfer. Using Biopeer is simple: users create a “project” to which they add the files to be synchronized among peers and then give the necessary permissions to collaborators, after which all files are shared between the people in the same project. The collaborators can also add more files to the project. Biopeer also uses UDT, a UDP-based high-speed data transfer protocol, and, much like BitTorrent, it is implemented to support multicasting. A few months after we started developing Biopeer, BitTorrent Sync was released, which is pretty similar, but the free version has some limitations.
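The workflow described in this answer (create a project, add files, grant permissions, encrypt what goes over the wire) can be sketched in Java, the language most of Biopeer is written in. This is only an illustrative sketch, not Biopeer's code: the class `ProjectSketch`, its methods, the two-level `Permission` enum, the owner-only granting rule, and the plain-string user identifiers are all hypothetical. The interview says only that AES is used, so the AES-GCM mode, the 128-bit key, and the 12-byte nonce prepended to each chunk are assumptions as well.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical sketch: none of these names come from the Biopeer code base.
class ProjectSketch {
    enum Permission { READ, READ_WRITE }

    private final String ownerId;
    private final Map<String, Permission> members = new HashMap<>();
    private final Set<String> files = new TreeSet<>();

    ProjectSketch(String ownerId) {
        this.ownerId = ownerId;
        members.put(ownerId, Permission.READ_WRITE);
    }

    // Assumption: only the project owner may grant permissions.
    void grant(String grantingId, String collaboratorId, Permission permission) {
        if (!ownerId.equals(grantingId)) {
            throw new SecurityException("only the owner can grant access");
        }
        members.put(collaboratorId, permission);
    }

    boolean canRead(String userId) { return members.containsKey(userId); }

    boolean canWrite(String userId) { return members.get(userId) == Permission.READ_WRITE; }

    // Collaborators with write permission can add more files to the project.
    void addFile(String userId, String fileName) {
        if (!canWrite(userId)) {
            throw new SecurityException(userId + " may not add files");
        }
        files.add(fileName);
    }

    // Non-members see nothing; members see every file in the project.
    Set<String> filesVisibleTo(String userId) {
        return canRead(userId) ? Collections.unmodifiableSet(files) : Collections.<String>emptySet();
    }

    static SecretKey newSessionKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts one chunk with AES-GCM; output layout is nonce (12 bytes) + ciphertext + tag.
    static byte[] encryptChunk(SecretKey key, byte[] chunk) {
        try {
            byte[] nonce = new byte[12];
            new SecureRandom().nextBytes(nonce);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
            byte[] sealed = cipher.doFinal(chunk);
            byte[] message = Arrays.copyOf(nonce, nonce.length + sealed.length);
            System.arraycopy(sealed, 0, message, nonce.length, sealed.length);
            return message;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encryptChunk; fails if the message was tampered with.
    static byte[] decryptChunk(SecretKey key, byte[] message) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, message, 0, 12));
            return cipher.doFinal(message, 12, message.length - 12);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ProjectSketch project = new ProjectSketch("owner-id");
        project.addFile("owner-id", "sample1.bam");
        project.grant("owner-id", "collaborator-id", Permission.READ_WRITE);
        project.addFile("collaborator-id", "sample2.bam");
        System.out.println(project.filesVisibleTo("collaborator-id"));

        SecretKey key = newSessionKey();
        byte[] chunk = "ACGT".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(chunk, decryptChunk(key, encryptChunk(key, chunk))));
    }
}
```

In a real deployment the identifiers would be ORCID iDs confirmed through ORCID's sign-in, and the session key would have to be agreed between peers; both are outside the scope of this sketch.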
3. How and where is Biopeer used currently, where could it potentially be used and what problems could it potentially solve?
Since we haven’t published it yet, I guess we and a couple of our collaborators in the US are the only users and beta testers so far. It is a desktop application, and it can be used to share and synchronize data among project members. We plan to write a short manuscript describing it, which we hope will improve Biopeer’s visibility.
4. Is it easy to extend or build upon Biopeer?
Biopeer is mostly written in Java. The underlying data transfer protocol is written in C++ but also has a Java interface. We tried to decouple the parts of the software as much as we could, so it should be easy to extend. We also plan to release the low-level design details and descriptions of the classes and subsystems, which will be even more beneficial to developers.
5. If anyone is interested in getting involved with Biopeer development, what should they do?
We develop Biopeer on GitHub, and we will make the source code public soon. GitHub provides great resources and very good tools for collaborative software development. As I mentioned above, the details of the class interfaces will also be public, so anyone who wants to get involved in Biopeer can just fork it, modify it, and send us pull requests.
6. Are there any plans for a public “tracker” which would enable researchers to discover content shared through Biopeer?
It is a relatively easy extension to provide tracker support in Biopeer, but we will need to make additional changes to make it feasible for large projects. We are already “tracking” which files are on which user’s desktop and which users are allowed to access them; we will just make this information and the access rights public for projects that require tracker support. But we want to separate Biopeer from the likes of BitTorrent when it comes to public trackers; we don’t really want to enable movie sharing with it. Maybe the public trackers need to be audited first, or at least only ORCID-validated users should be allowed to share and download public data. Actually, Richard Durbin from the Sanger Institute recommended at the HiTSeq conference that we implement a public tracker. His idea was to make data sharing for the 1000 Genomes Project, and maybe for TCGA and others, distributed, so that the NCBI and EBI servers could breathe a little. But since Biopeer was initially designed as a desktop application, we will need to modify it accordingly to enable large-scale data sharing, as I describe below.
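The tracker idea in this answer amounts to a lookup table: which peers hold which file, who may access it, and whether the file has been made public. A minimal Java sketch under stated assumptions follows. The class `TrackerSketch` and every method name are hypothetical, users are plain strings standing in for ORCID iDs, and the rule that even public files are served only to validated users follows the suggestion in the answer rather than any existing Biopeer behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a tracker registry; not taken from Biopeer.
class TrackerSketch {
    private final Map<String, Set<String>> holders = new HashMap<>();  // file -> peers with a copy
    private final Map<String, Set<String>> allowed = new HashMap<>();  // file -> users with access
    private final Set<String> publicFiles = new HashSet<>();
    private final Set<String> validatedUsers = new HashSet<>();

    // Assumption: called only after the user has been validated externally (e.g. via ORCID).
    void registerValidatedUser(String userId) { validatedUsers.add(userId); }

    // A peer reports that it holds a copy of the file.
    void announce(String fileName, String peerId) {
        holders.computeIfAbsent(fileName, f -> new HashSet<>()).add(peerId);
    }

    // Grants one user access to a private file.
    void allow(String fileName, String userId) {
        allowed.computeIfAbsent(fileName, f -> new HashSet<>()).add(userId);
    }

    // Publishes the file so that any validated user can find it.
    void makePublic(String fileName) { publicFiles.add(fileName); }

    // Returns the sorted peers holding the file, or an empty list if the
    // requester is not validated or has no access to the file.
    List<String> lookup(String fileName, String requesterId) {
        if (!validatedUsers.contains(requesterId)) {
            return Collections.emptyList();
        }
        boolean permitted = publicFiles.contains(fileName)
                || allowed.getOrDefault(fileName, Collections.<String>emptySet()).contains(requesterId);
        if (!permitted) {
            return Collections.emptyList();
        }
        List<String> peers = new ArrayList<>(holders.getOrDefault(fileName, Collections.<String>emptySet()));
        Collections.sort(peers);
        return peers;
    }

    public static void main(String[] args) {
        TrackerSketch tracker = new TrackerSketch();
        tracker.registerValidatedUser("alice");
        tracker.registerValidatedUser("bob");
        tracker.announce("cohort.vcf", "alice");
        tracker.makePublic("cohort.vcf");
        System.out.println(tracker.lookup("cohort.vcf", "bob"));
    }
}
```

Keeping the public flag per file (or per project) rather than per tracker means private collaborations and public releases could share one registry, which is one possible reading of "make this information and the access rights public for projects that require tracker support".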
7. How do you see the future development of Biopeer?
Since the implementation was a project done by undergraduate students who have all graduated now, we lost considerable manpower for future development. But fortunately I was able to convince one of the group members (Tuğba Doğan, second from the left in the group photo) to stay for an M.Sc. degree. The plans are already there. We first want to resolve some minor issues with multicasting. Then we will include tracker support, but since Biopeer is designed as a desktop application, it may be difficult to keep it running for extended time periods for large-scale projects. To work around this issue, we will extend Biopeer to run as a daemon on servers and include a terminal interface using an ncurses-like library. I have some other ideas as well, so Tuğba will be pretty busy.
8. This news article came out 7 days ago: Could Biopeer be used to set up a similar system between collaborating organisations?
This system for CENIC is about building a very fast network infrastructure on the hardware level. Biopeer can be used similarly, but as a software-based infrastructure over regular networks. Obviously a 100 Gbps underlying network would be nice to have.
Are you part of a project that facilitates data sharing for genomics research? Would you like to be featured on our blog? We would love to hear from you. Drop us an email at email@example.com or use our contact page to get in touch.