Machine Learning for Protein Family Classification

Over the past several months, I have been experimenting with artificial neural networks for protein family classification.

Although small-scale neural networks for protein family classification offer less predictive utility than large-scale folding models, they can still be invaluable tools for researchers seeking to ascertain the likely evolutionary homology and physiological functions of newly sequenced proteins.

Herein, I demonstrate the feasibility and evaluate the performance of a simple artificial neural network trained to probabilistically predict protein superfamily from amino acid sequence data. The network was trained on annotated, centre-padded, one-hot encoded amino acid sequences (N=92,982) and evaluated on a hold-out test set (N=11,565), achieving favourable performance metrics, including an average classification accuracy upwards of 98%.
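To make the input representation concrete, here is a minimal sketch of centre-padded one-hot encoding. The 20-letter alphabet, its ordering, the all-zero treatment of padding and non-standard residues, and the function name centre_pad_one_hot are all assumptions made for illustration; the original preprocessing may differ in any of these details.

```python
import numpy as np

# The 20 standard amino acids; this alphabet and its ordering are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def centre_pad_one_hot(sequence, max_len):
    """Return a (max_len, 20) one-hot array with the sequence centred.

    Padding positions and non-standard residues are left as all-zero rows.
    Sequences longer than max_len are cropped around their centre
    (an assumption; truncation could equally be done from one end).
    """
    sequence = sequence.upper()
    if len(sequence) > max_len:
        start = (len(sequence) - max_len) // 2
        sequence = sequence[start:start + max_len]

    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    offset = (max_len - len(sequence)) // 2  # left padding so the sequence sits mid-array
    for pos, aa in enumerate(sequence):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[offset + pos, idx] = 1.0
    return encoded
```

For example, centre_pad_one_hot("ACD", 7) yields a 7 x 20 array in which rows 2 to 4 each contain a single 1 and the remaining rows are zero padding.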

The network's most common errors were characterised in a confusion matrix heatmap, while its understanding of protein families was probed via back-queried sequence generation. Using state-of-the-art protein folding tools, comparisons were made between the tertiary structures of real sequences and those of artificial sequences generated in silico by back-querying the corresponding protein family.
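The error analysis described above can be sketched as follows. This is an illustrative implementation, not the original analysis code: it assumes integer class labels, and the helper names confusion_matrix and top_confusions are hypothetical.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = np.zeros((n_classes, n_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when (row, col) pairs repeat.
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return matrix


def top_confusions(matrix, k=3):
    """Return up to k (true_class, predicted_class, count) tuples for the
    most frequent misclassifications, ignoring the correct-prediction diagonal."""
    off_diag = matrix.copy()
    np.fill_diagonal(off_diag, 0)
    flat_order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [(int(r), int(c), int(off_diag[r, c]))
            for r, c in zip(rows, cols) if off_diag[r, c] > 0]
```

Overall accuracy is the trace of the matrix divided by its sum, and the matrix itself can be passed to a plotting routine such as matplotlib's imshow to draw the heatmap.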

 

Protein Classification Tool

As a potential application of this model, I developed a simple tool for proteomics researchers and structural biologists. The tool takes the amino acid sequence of a newly sequenced protein as input and outputs the likely protein family, along with relevant contextual information retrieved from the Pfam protein database via the organisation’s associated API (Boldrini, 2022). An example output for a call of this function is provided below. The tool has been made available on Google Colab in view-only format, via this link.

 

The Blue Sky Vision

With large enough datasets, a model trained to predict protein function from amino acid sequences may inadvertently solve protein folding. This may enable the ad hoc generation of novel proteins that achieve the same physiological functions as proteins from real families, but with low sequence identity to any real proteins. Analogous to OpenAI's DALL·E (Ramesh et al., 2021) or AlphaDesign (Jendrusch et al., 2021), I envisage a de novo protein designer network that can generate proteins for any function desired by the user.

By back-querying the trained classification model for a particular combination of functions, it may be possible to generate a selection of proteins that fulfil the user's specifications. Such a tool could be leveraged, for instance, to (1) construct biologics that perform similarly to patented ones, but are different enough in sequence to fall outside their intellectual property, (2) rapidly generate a repertoire of therapeutic proteins for a potential drug target, which can then be evaluated via high-throughput screening, and (3) accelerate antibiotic discovery in silico by rapidly generating potential antimicrobial proteins which are functionally analogous but share little sequence identity, thus evading microbial recognition and circumventing antibiotic resistance.
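One common way to back-query a classifier is gradient ascent on the input: the weights are held fixed and the input is adjusted to maximise the predicted probability of a target class. The sketch below shows the idea on a toy linear-softmax classifier standing in for the trained network. The stand-in model, the relaxed (softmax) input parameterisation, the hyperparameters, and the function name back_query are all assumptions; the original project may have used a different procedure, and a real network would normally rely on an automatic differentiation framework rather than hand-written gradients.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet and ordering


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def back_query(weights, bias, target_class, steps=200, lr=0.5, seed=0):
    """Gradient ascent on the input to maximise log p(target_class).

    weights: (seq_len, 20, n_classes) and bias: (n_classes,) define a toy
    linear-softmax classifier, a hypothetical stand-in for the real model.
    Returns the decoded sequence and the classifier's probability for it.
    """
    rng = np.random.default_rng(seed)
    seq_len, n_aa, n_classes = weights.shape
    input_logits = rng.normal(scale=0.01, size=(seq_len, n_aa))
    target = np.zeros(n_classes)
    target[target_class] = 1.0

    for _ in range(steps):
        x = softmax(input_logits, axis=1)  # relaxed one-hot input per position
        class_probs = softmax(np.tensordot(x, weights, axes=([0, 1], [0, 1])) + bias)
        # d log p_target / d x = sum_c (target_c - p_c) * W[:, :, c]
        grad_x = weights @ (target - class_probs)
        # Chain rule through the per-position softmax over residues.
        grad_logits = x * (grad_x - (x * grad_x).sum(axis=1, keepdims=True))
        input_logits += lr * grad_logits

    best = input_logits.argmax(axis=1)
    sequence = "".join(AMINO_ACIDS[i] for i in best)
    one_hot = np.eye(n_aa)[best]  # score the discretised sequence, not the relaxed input
    final_probs = softmax(np.tensordot(one_hot, weights, axes=([0, 1], [0, 1])) + bias)
    return sequence, float(final_probs[target_class])
```

Optimising a softmax-relaxed input keeps every position a valid distribution over residues, so the final argmax decoding stays close to what the optimiser actually scored. Sequences generated this way are only candidates: nothing in the procedure guarantees that they fold or function like real members of the family.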

 