What makes ML in Computational Biology especially difficult?


There are several related issues that make ML for computational biology hard.
1- Getting “signal” is challenging. The data often comes with a lot of noise and missing values, and imputation is hard. Say you are measuring single-cell RNA sequencing counts: the data comes with considerable noise, and lots of things may be changing dynamically during your measurements, so it’s challenging to clean. Vision and NLP data are often cleaner.


2- A lot of biological data is high dimensional in the number of variables that affect it (the curse of dimensionality). Take a protein sequence of length 20: with 20 amino acids to choose from at each position, there are 20^20 (roughly 10^26) possible sequences. You can only hope to have collected data on an infinitesimally small fraction of this space, so if the space itself is not easy (hint: it’s often not), you have a poor chance of building a model that generalizes. Setting aside how big the space is, the largest biological datasets (which may cost millions of dollars to acquire) are on the order of 10^6 samples, and most fall within 10^3–10^4. For vision and NLP, it’s much cheaper to get bigger datasets.
3- Labeling your data is expensive and time-consuming, so you cannot iterate much. In my work, each experiment that labels my protein sequences, when done by experts, takes a couple of months to finish. At best, I can hope to conduct 3–4 experiments for a project. Unsupervised learning is also hard due to the lack of data.
4- Validation is hard (especially with generative models). If my language model is producing garbage sentences, I will know immediately, and can occasionally even troubleshoot (oh, it thinks humans bark too). If my image recognition algorithm is recognizing dogs as volcanoes, I’d probably know. But if my model is telling me a protein is toxic, or that a particular sequence is not going to bind to another one, I have to run at least a couple of time-consuming experiments to verify it.
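To put the numbers from point 2 in perspective, here is a quick back-of-the-envelope sketch in Python. It only redoes the arithmetic above (20 amino acids, sequences of length 20, datasets of 10^3 to 10^6 samples). The function name `coverage_fraction` is my own, for illustration, and it optimistically assumes every sample is a distinct sequence.

```python
# Back-of-the-envelope check of the numbers in point 2.
# Assumption: every collected sample is a distinct sequence (best case).

ALPHABET_SIZE = 20    # standard amino acids
SEQUENCE_LENGTH = 20  # the sequence length used in the example


def coverage_fraction(num_samples: int,
                      alphabet_size: int = ALPHABET_SIZE,
                      length: int = SEQUENCE_LENGTH) -> float:
    """Fraction of all possible sequences covered by num_samples distinct ones."""
    total = alphabet_size ** length
    return num_samples / total


total_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH
print(f"possible sequences: {total_sequences:.3e}")

# Typical dataset sizes mentioned in the text: 10^3, 10^4 and 10^6 samples.
for n in (10**3, 10**4, 10**6):
    print(f"{n:>9,} samples cover {coverage_fraction(n):.1e} of the space")
```

Even the million-sample dataset covers on the order of 10^-20 of the space, which is why the model has to lean on structure in the space rather than on coverage.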
These bottlenecks slow down ML research relevant to computational biology, which also means the algorithms are less mature for the purposes we care about. But there is no lack of effort, and it is growing over time. The fact that it is hard just makes it more interesting.
