EOS Keystone Skills in Bioinformatics Training Fellowship: February 2014


For two intensive weeks in February I was fortunate enough to be selected to attend a bioinformatics training fellowship. The course was a NERC-funded, ELIXIR-UK project held at the Centre for Ecology and Hydrology in Wallingford. Eighteen young researchers from diverse biological backgrounds participated in the course, and we were lucky to be taught by an equally diverse range of computer scientists and biologists.

I was amazed by the diversity of topics covered in just two weeks of teaching! However, despite the myriad of information and bioinformatic specialities out there, some central recurring themes emerged.


Theme #1. – Learning to program is hard! But fun
Trying to tell a computer what you want it to do involves an entirely new skill set. Firstly, you have to learn a new language – we were learning Python. Secondly, and perhaps more difficult, you have to learn to think like a computer: breaking problems down into logical steps and then writing the appropriate code for each step. Sounds pretty easy, I suppose. But it gets very complicated very quickly.
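That step-by-step way of thinking is easier to show than to describe. Here is a toy example of my own (not from the course materials): calculating the GC content of a DNA sequence, with each logical step spelled out for the computer.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA string."""
    # Step 1: normalise the input so case doesn't matter
    sequence = sequence.upper()
    # Step 2: count the bases we care about
    gc = sequence.count("G") + sequence.count("C")
    # Step 3: guard against an empty sequence before dividing
    if not sequence:
        return 0.0
    # Step 4: turn the count into a proportion
    return gc / len(sequence)

print(gc_content("ATGCGC"))  # ~0.667
```

Four tiny steps for a one-line biological question – and real problems need hundreds of them, which is where it gets complicated quickly.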


Theme #2. – Programming vs graphical user interface
There seem to be two diverging fields of bioinformatics. One is the traditional programming/computer science field – the typical terminal environment – with maximum user flexibility and maximum computing efficiency. However, many biologists find working in this environment unfamiliar and challenging and instead want to work with a graphical interface. As a result, there is an emerging branch of bioinformatics which lets users work through a graphical interface, but at the cost of computing efficiency and flexibility.

I find the divergence slightly worrying. On one hand, it is exciting that so many biologists will have access to powerful bioinformatic tools to answer their research questions. On the other hand, there is a danger that biologists will become detached, and even more segregated, from real computer science. User interfaces are appealing, but once working in a comfortable graphical environment it can be easy to forget the computational methods running underneath. The limitations of working with large data sets are easily forgotten as we are lured into a false sense of security with the click of a button... It will certainly be interesting to observe, and participate in, the evolution of bioinformatics. We can only hope that increased accessibility will help to bridge the gap between biologists and computer scientists, instead of pushing them even further apart.


Theme #3. – You don’t have to be an expert at everything
One of the major challenges of being a PhD student is managing a project from beginning to end: from the conception of ideas and field collections, through experimental design and implementation, to laboratory work and then data analysis and write-up. Each step is, in itself, full of specialised skill sets, and as a PhD student you have to be able to do each step proficiently enough to pull off a research project. However, in the real research world – post-PhD – research is typically conducted in teams. No single team member will be able to do every aspect of the project. Instead, researchers specialise and collaborate; at this stage it is not important to be good at everything. You just need to be good at the role you play and, most importantly, be able to communicate with, and understand to a certain extent, what everybody else is doing.

So how does this fit into bioinformatics? What matters is not being able to carry out every bioinformatic analysis yourself, or to tell a computer what to do. The trick is to understand the concepts behind the analyses and to be able to communicate with collaborators. The speed at which bioinformatics is progressing is amazing! In three years' time, when I (hopefully) finish my PhD, there is absolutely no chance I will be using the same algorithms, pipelines and packages I'm using today. To be successful you have to be flexible: understand the concepts, understand the limitations and let the research questions drive your analyses.


Theme #4. – Big data = big computers
As next-generation sequencing technology develops, the size of our sequence data sets is rapidly increasing. While this poses problems in terms of analytical methods and statistics, which is an entirely separate issue, it also places serious demands on computing power. Increasingly, biologists require access to servers, supercomputers, clusters and “scalable clouds” (four jargon phrases which all mean the same thing – big computers). One major problem I envision is the passing of data from individual scientists onto computing platforms. Projects such as Galaxy Cloud, CloudMan, Taverna and Bio-Linux have all made significant effort and progress towards pulling together lots of useful tools into one place. But how can you run Galaxy or Bio-Linux on an academic cluster which is set up mainly for theoretical physics and maths? I have my data, access to big computers and the know-how regarding the bioinformatics tools I want to use. The problem is getting all three things to work in one place!


Theme #5. – Reproducibility
How many times have you read the bioinformatics paragraph of a paper and been left wondering what on earth they actually did? Often the methods section will include a single sentence: “the transcriptome was assembled using Trinity”. Or: “experimental treatments were mapped back onto the transcriptome using Maq”. Or even: “all statistics were carried out using R”. Great! But not reproducible. One of the major benefits of user interfaces such as Galaxy and Taverna is the creation of workflows, which are publishable and easy to share. Personally, I think publishing your raw data, workflow and/or script is the only way forward, and I seriously question the motives of any scientist who argues otherwise.
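Even without a workflow system, a script can do a lot of this for you. As a minimal sketch (my own illustration, not a course recipe – the tool name, version and command below are hypothetical), you can record the exact command, parameters and tool version alongside your results, instead of leaving a bare “assembled using Trinity” in the methods:

```python
import json
import datetime

def log_step(tool, version, command, logfile="analysis_log.json"):
    """Append one analysis step to a machine-readable log."""
    entry = {
        "tool": tool,
        "version": version,
        "command": command,
        "timestamp": datetime.datetime.now().isoformat(),
    }
    with open(logfile, "a") as fh:
        fh.write(json.dumps(entry) + "\n")

# Hypothetical parameters, for illustration only:
log_step(
    tool="Trinity",
    version="r2013-02-25",
    command="Trinity.pl --seqType fq --left reads_1.fq --right reads_2.fq",
)
```

Publish that log file with the paper and anyone can rerun exactly what you did.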


Theme #6. – Collapsing your data vs using all your data
This issue is huge! It is perhaps the biggest issue facing bioinformatic analyses, but I will try to summarise it here. There are generally two approaches to understanding large sequence data sets. The approach you choose should, of course, be driven by the biological question and is ideally aided by some biological insight. On one hand, you might want to collapse your data down to a small number of variables in an attempt to understand what is going on – for example, quantitative trait locus analysis and gene network analysis. This approach is very powerful for understanding molecular pathways. However, it assumes that there are a few loci of large biological effect. We are constantly looking for a strong signal in the noise, but what if there is no such signal? What if the biological phenomenon you are trying to understand cannot be explained by general linear models and p-values?
The opposite approach has Bayesian inference at its core. Instead of collapsing your data down and throwing half of it away, keep it all and use it to help you fit a model using probabilities and priors. Instead of being obsessed with “independence”, try to understand your data in the knowledge that nothing in biology is independent. For example, mixed-effects models use information on variation across the entire data set and create hyper-variables which can predict phenotypes with remarkable accuracy. This approach is arguably more black-box: we cannot describe mechanisms for how the variation occurs, but we gain incredible predictive power. In the future, ideally, we should try to integrate both approaches to truly understand omics data.


Covered here are just a few of the issues raised on the course which were of particular interest to me. Just as each student entered the course from a diverse background, I expect each student came away with different perspectives too. Only two weeks on from the course, I am already finding my day-to-day tasks are becoming more efficient. I have written some small programs and scripts and I'm well on my way to analysing de novo assembled transcriptome data from a non-model species experiment. Perhaps more importantly, I can now communicate with bioinformaticians and the I.T. department at work – I feel like I have quite literally learnt a new language.

All that is left is for me to thank all the amazing instructors, course co-ordinators and fellow students for a great two weeks. Tim Booth, Dawn Field, Tracey Timms-Wilson, Richard Nichols, Robert Knell, Pete Kille, Philipp Antczak, Enis Afgan, Kathryn Couch, Peter Cock, Francisco Quevedo and Norman Morris – you’re all complete legends! Thank you so much for imparting some of your bioinformatic and computer science wisdom to us!
