What makes ML in Computational Biology especially difficult?
There are several related issues that make ML for computational biology hard.
1- Getting “signal” is challenging. The data often comes with a lot of noise and missing values, and imputation is hard. Say you are measuring single-cell RNA sequencing counts: the data comes with considerable noise, and many things may be changing dynamically during your measurements, so it is challenging to clean. Vision and NLP data are often cleaner.
2- A lot of biological data is high dimensional in the number of variables that affect it (the curse of dimensionality). Take a protein sequence of length 20: with 20 amino acids to choose from at each position, there are 20^20 possible sequences. You can only hope to have collected data on an infinitesimally small fraction of this space; hence, if the space itself is not easy (hint: it often isn’t), you have a poor chance of building a model that generalizes. Size of the space aside, the largest biological datasets (which may cost millions of dollars to acquire) are on the order of 10^6 samples, and most fall within 10^3–10^4. For vision and NLP, it’s much cheaper to get bigger datasets.
3- Labeling your data is expensive and time consuming, so you cannot iterate much. In my work, each experiment that labels my protein sequences takes a couple of months to finish when done by people who are experts at it. At best, I can hope to conduct 3–4 experiments for a project. Unsupervised learning is also hard due to the lack of data.
4- Validation is hard (especially with generative models). If my language model is producing garbage sentences, I will know immediately, and can occasionally even troubleshoot (Oh, it thinks humans bark too). If my image recognition algorithm is recognizing dogs as volcanoes, I’d probably know. But if my model is telling me a protein is toxic, or that a particular sequence is not going to bind to another one, I have to run at least a couple of time-consuming experiments to verify.
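To put the numbers from point 2 in perspective, here is a rough back-of-the-envelope sketch. It assumes an alphabet of the 20 standard amino acids and the length-20 example from the text; the function names are made up for illustration, and the "coverage" is an upper bound that pretends every sample is a distinct sequence.

```python
# Illustrative arithmetic only. The sequence length and dataset sizes are the
# orders of magnitude quoted in the post; the helper names are hypothetical.
ALPHABET_SIZE = 20  # standard amino acids (assumption for this sketch)
SEQ_LENGTH = 20     # the length-20 protein example from point 2


def sequence_space_size(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def coverage(n_samples: int, space_size: int) -> float:
    """Upper bound on the fraction of the space a dataset could cover."""
    return n_samples / space_size


space = sequence_space_size(ALPHABET_SIZE, SEQ_LENGTH)
print(f"possible sequences: {space:.3e}")
for n in (10**3, 10**4, 10**6):
    print(f"{n:>9,} samples cover at most {coverage(n, space):.1e} of the space")
```

The space has roughly 10^26 sequences, so even a million-sample dataset could cover less than 10^-20 of it, and that is for a very short protein.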
These bottlenecks slow down ML research as it relates to computational biology, which also means that the algorithms are less mature for the purposes we care about. But there is no lack of effort, and it is growing over time. The fact that it is hard just makes it more interesting.