Machine Learning in the World of Genomics and Genetics

Varshitha G
Updated: Oct 19, 2020

Interesting title, right?! And so is the article!

Introduction

Around the world, the major big shifts have been towards DNA sequencing and synthesis, AI, MV and ML, and automation, and among these DNA sequencing and synthesis has won the crown!

A company that combines unique bio IP (intellectual property), software, and an essential database can build products rapidly and attract even more partnerships.

There are more CompBio (computational biology) companies than we might imagine, with many of them having benefitted by more than $20B! Colossal!

Let’s get started with the prelims of Machine Learning in Genomics and Genetics!

Machine Learning: the art of training a machine on data so that it learns to perform a task on its own; it is a branch of AI.

Genetics: DNA (deoxyribonucleic acid) is a double helix that carries the genetic information for the development, functioning, growth, and reproduction of all organisms and of many viruses too! Every infant inherits genes from their biological parents, and the study of these genes is genetics. Most of our cells carry two copies of the genome, about 6 billion base pairs of DNA in total. The genome contains genes as well as noncoding DNA, and the study of the whole genome is genomics.

Let’s get Started…!

To meet our requirements, we need an approach or method that suits them.

Machine learning offers three broad methods that together cover most of our requirements.

They are supervised learning, unsupervised learning, and semi-supervised learning.

1. Supervised Learning

Let us start with an example for a better understanding! Consider the DNA sequence of a chromosome.

For this, machine learning offers a family of methods known as gene-finding algorithms.

The purpose of such an algorithm is to predict the locations and the detailed intron/exon structure of the protein-coding genes on a chromosome.

In supervised learning, we train the algorithm/machine so that it can recognize the required value from the data set we provide (here, a labeled data set).

For this algorithm, the input is genome data in which the start and end of each gene are marked (in genetics terminology, the transcription start site (TSS) and the transcription termination site (TTS)).

Now it’s the turn of the model to use the provided data and learn about DNA sequence patterns, the length distributions of UTRs (untranslated regions), and introns.

Together, these learned properties help in finding novel genes that resemble the genes in the data provided, which is technically called the training set.
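The supervised idea above can be sketched in a few lines of Python. This is only a toy illustration, not a real gene finder: the sequences, the labels, the `gc_content` feature, and the nearest-mean rule are all assumptions made for this example, and real gene-finding models learn far richer patterns.

```python
# Toy supervised learning: learn from labeled sequences, then label new ones.
# Every sequence and label below is invented for illustration.

def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def train(labeled_seqs):
    """Return the mean GC content observed for each label."""
    totals, counts = {}, {}
    for seq, label in labeled_seqs:
        totals[label] = totals.get(label, 0.0) + gc_content(seq)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

def predict(model, seq):
    """Assign the label whose mean GC content is closest to that of seq."""
    gc = gc_content(seq)
    return min(model, key=lambda label: abs(model[label] - gc))

# Hypothetical training set: (sequence, label) pairs.
training_set = [
    ("ATGGCCGCGGGCTGA", "gene"),
    ("ATGCGCGGCCCGTAA", "gene"),
    ("ATATTTAAATATTAT", "non-gene"),
    ("TTTAAATATATAAAT", "non-gene"),
]

model = train(training_set)
print(predict(model, "ATGGGCGCCGCGTAG"))  # a GC-rich sequence not seen in training
print(predict(model, "AATATTTTAAATTAA"))  # an AT-rich sequence not seen in training
```

The point to notice is that the labels in `training_set` do all the teaching: the model never sees a rule for what a gene is, only examples.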

2. Unsupervised Learning

Let us start this type of learning by examining epigenomic data sets. They hold a huge volume of data, so obtaining the required output by hand becomes impractical!

For such unlabeled data sets, we use a robust method called unsupervised learning.

In simple terms, the essence of this type of learning is that we don’t need to provide labeled data. We provide only the unlabeled data, and the algorithm sorts it into groups of similar items; a human is then needed only to assign a semantic label to each group.

The fringe benefit of this type of learning is that we can find novel genes from epigenomic data when labeled data isn’t available.
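Here is a minimal sketch of that idea using k-means clustering, one common unsupervised method (the article does not name a specific one, so this choice is an assumption). Each region is a pair of made-up signal values, and the names in `semantic_labels` are hypothetical labels a human might assign after inspecting the clusters.

```python
# Toy unsupervised learning: k-means clustering of unlabeled genomic regions.
# Each region is a pair of invented signal values (think of two histone marks).

def distance_sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: distance_sq(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = mean_point(members)
    return assignment, centroids

regions = [(9.0, 1.0), (8.5, 1.5), (9.5, 0.5), (1.0, 8.0), (1.5, 9.0), (0.5, 8.5)]

# Deterministic start: use the first and last region as the initial centroids.
clusters, centroids = k_means(regions, [regions[0], regions[-1]])
print(clusters)

# The algorithm only produces cluster numbers; a human supplies the meaning.
semantic_labels = {0: "promoter-like", 1: "enhancer-like"}  # hypothetical names
print([semantic_labels[c] for c in clusters])
```

No labels go in: the grouping comes purely from which regions look alike, and the naming step at the end is the only place a human is involved.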

3. Semi-supervised Learning

As we have seen so far, supervised learning needs labeled data as input, while in unsupervised learning the algorithm receives only data without labels.

An interesting thing to note is that… You might have guessed by now… Yes, it’s a combo of the two! Let’s see how 😉

In simple terms, the input of this type of learning is a small amount of labeled data combined with a large amount of unlabeled data during the process of training.

In a gene-finding algorithm, the input has both kinds of data. Here the labeled data set is used to find and label genes in the remaining data.

The whole process is iterated until no new genes are found. In this type of learning the model is able to learn from much larger data sets, and it is the type of learning generally used in genomics and genetics.
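The loop described above is often called self-training. Below is a rough sketch of it under toy assumptions: the sequences are invented, the only feature is GC content, and `margin` is a hypothetical confidence threshold chosen for this example.

```python
# Toy semi-supervised learning (self-training) on invented sequences.

def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def class_means(labeled):
    """Mean GC content per label."""
    groups = {}
    for seq, label in labeled:
        groups.setdefault(label, []).append(gc_content(seq))
    return {label: sum(values) / len(values) for label, values in groups.items()}

def self_train(labeled, unlabeled, margin=0.25):
    """Repeatedly label the unlabeled sequences the model is confident about.

    A sequence is labeled only when it is at least `margin` closer to one
    class mean than to the other. The loop stops when a pass adds nothing new.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    while True:
        model = class_means(labeled)
        newly_labeled = []
        for seq in unlabeled:
            gc = gc_content(seq)
            ranked = sorted(model, key=lambda label: abs(model[label] - gc))
            best, runner_up = ranked[0], ranked[1]
            if abs(model[runner_up] - gc) - abs(model[best] - gc) >= margin:
                newly_labeled.append((seq, best))
        if not newly_labeled:
            return labeled, unlabeled
        labeled.extend(newly_labeled)
        done = {seq for seq, _ in newly_labeled}
        unlabeled = [seq for seq in unlabeled if seq not in done]

# A small labeled set and a larger unlabeled set (all hypothetical).
labeled = [("GGCGCCGCGG", "gene"), ("ATATTATAAT", "non-gene")]
unlabeled = ["GCGGCCGATA", "GCGCGCATAT", "GCGCATATAT", "ATTATAGATA"]

final_labeled, still_unlabeled = self_train(labeled, unlabeled)
print(final_labeled)
print(still_unlabeled)
```

With this data, one sequence is too ambiguous on the first pass and only gets labeled on the second, after the model has been updated with the confident ones, and one sequence is never labeled at all. That is the iteration at work.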

Depending on the data set we have, we must choose the right learning method wisely! (This is a choice of method, which is different from feature selection, where we choose which input variables to feed the model.)

Generative Approach vs Discriminative Approach

These are two approaches to applying machine learning in genomics and genetics. Let us tackle them in a lucid way!

Suppose we have two types of data set.

The generative approach focuses on building a model of each data set.
The discriminative approach focuses only on dividing the two types of data set, that is, on the boundary between them.
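To make the contrast concrete, here is a small sketch that solves the same toy problem both ways. The data is invented, and the two specific techniques are assumptions picked for illustration: a per-class base-composition model with add-one smoothing and equal class priors for the generative side, and a perceptron on GC content for the discriminative side.

```python
import math

# Invented labeled sequences, for illustration only.
data = [
    ("ATGGCCGCGGGCTGA", "gene"),
    ("ATGCGCGGCCCGTAA", "gene"),
    ("ATATTTAAATATTAT", "non-gene"),
    ("TTTAAATATATAAAT", "non-gene"),
]

# Generative: build a model of each class (here, its base composition).
def fit_generative(data):
    """Estimate P(base | class) for each class with add-one smoothing."""
    models = {}
    for label in {lab for _, lab in data}:
        counts = {base: 1 for base in "ACGT"}
        for seq, lab in data:
            if lab == label:
                for base in seq:
                    counts[base] += 1
        total = sum(counts.values())
        models[label] = {base: n / total for base, n in counts.items()}
    return models

def predict_generative(models, seq):
    """Pick the class whose model gives seq the highest log-likelihood."""
    def log_likelihood(label):
        return sum(math.log(models[label][base]) for base in seq)
    return max(models, key=log_likelihood)

# Discriminative: learn only the boundary between the two classes.
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def fit_discriminative(data, epochs=100):
    """Perceptron on one feature (GC content); learns a weight and a bias."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        mistakes = 0
        for seq, label in data:
            x = gc_content(seq)
            y = 1 if label == "gene" else -1
            if y * (w * x + b) <= 0:  # misclassified: nudge the boundary
                w += y * x
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def predict_discriminative(params, seq):
    """Classify seq by which side of the learned boundary it falls on."""
    w, b = params
    return "gene" if w * gc_content(seq) + b > 0 else "non-gene"

models = fit_generative(data)
params = fit_discriminative(data)
query = "ATGGGCGCCGCGTAG"
print(predict_generative(models, query), predict_discriminative(params, query))
```

Notice what each one stores after training: the generative side keeps a full description of each class in `models`, while the discriminative side keeps just two numbers in `params` that say where the dividing line is.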

Machine learning in genetics helps us to identify gene expression patterns, genetic interactions, sequence features, and more.

We have a mammoth amount of data on many factors, including transcription factors, histone modifications, chromatin accessibility, and much more. Choosing the right machine learning method for the data set that we have plays a prominent role!

More research into machine learning in genomics and genetics can lead to astonishing discoveries!

Varshitha G

I'm an IT Enthusiast and Inquisitive about Modern Technologies. I love coding, especially in Python. I'm highly organized and determined towards my Work. I'm an ARTH Learner.