Machine learning | KarolisBio

Machine learning

One of my most ambitious machine learning projects focused on predicting protein-protein interaction (PPI) sites at the amino acid level. The goal was to develop a model that identifies interface residues in homodimeric protein complexes. To build a reliable training dataset, I curated a large-scale benchmark from experimentally resolved protein structures in the Protein Data Bank (PDB). Interaction interfaces were annotated with the Voronota toolkit, which derives inter-chain contacts from a Voronoi tessellation of the structure.
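To make the idea of a per-residue interface label concrete, here is a minimal sketch. It does not reproduce the Voronoi-based annotation used in the project: it swaps in a much simpler distance cutoff, and the `interface_residues` helper, the 5 Å threshold, and the toy coordinates are all assumptions made for illustration.

```python
from math import dist
from typing import Dict, List, Set, Tuple

# Hypothetical toy representation: residue number -> list of atom coordinates (in Å).
Atom = Tuple[float, float, float]
Chain = Dict[int, List[Atom]]


def interface_residues(chain_a: Chain, chain_b: Chain, cutoff: float = 5.0) -> Set[int]:
    """Return residues of chain_a with at least one atom within `cutoff` Å of chain_b.

    This is a distance-cutoff approximation, not a Voronoi tessellation.
    """
    atoms_b = [atom for atoms in chain_b.values() for atom in atoms]
    return {
        res_id
        for res_id, atoms in chain_a.items()
        if any(dist(a, b) <= cutoff for a in atoms for b in atoms_b)
    }


# Toy homodimer: one atom per residue, coordinates invented for the example.
chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(10.0, 0.0, 0.0)], 3: [(30.0, 0.0, 0.0)]}
chain_b = {1: [(0.0, 4.0, 0.0)], 2: [(12.0, 3.0, 0.0)]}

labels = interface_residues(chain_a, chain_b)
print(sorted(labels))
```

A tessellation-based definition counts only atoms that actually share a contact face, so it tends to be less sensitive to the choice of threshold than a plain cutoff like this one.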

Each protein sequence was then encoded with protein language models, namely ESM-2 and ProtTrans, to extract high-dimensional, contextual per-residue embeddings that capture both evolutionary and structural information. These embeddings were complemented with solvent-accessible surface area (SASA) metrics derived from structure-based tools, providing an additional layer of biophysical context. The combined feature vectors were fed into a custom neural network built in PyTorch. Over the course of the project, the architecture evolved in depth, regularization strategy, and attention mechanisms to improve generalization and interpretability.
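The feature combination described above can be sketched roughly as follows. This is not the project's actual architecture: the `ResidueClassifier` class, the layer sizes, the single attention layer, and the random input tensors are all placeholders, assuming only that per-residue embeddings and SASA values are concatenated and passed to a PyTorch model that outputs one interface logit per residue.

```python
import torch
from torch import nn


class ResidueClassifier(nn.Module):
    """Illustrative per-residue interface classifier (hypothetical layout)."""

    def __init__(self, embed_dim: int, sasa_dim: int = 1, hidden_dim: int = 64,
                 num_heads: int = 4, dropout: float = 0.1):
        super().__init__()
        # Project concatenated [embedding, SASA] features to a common width.
        self.proj = nn.Linear(embed_dim + sasa_dim, hidden_dim)
        # Self-attention lets each residue attend to the rest of the sequence.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)
        # Dropout as a simple regularizer before the per-residue output layer.
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden_dim, 1))

    def forward(self, embeddings: torch.Tensor, sasa: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, length, embed_dim); sasa: (batch, length, sasa_dim)
        x = torch.cat([embeddings, sasa], dim=-1)
        h = torch.relu(self.proj(x))
        attn_out, _ = self.attn(h, h, h)
        h = self.norm(h + attn_out)  # residual connection followed by layer norm
        return self.head(h).squeeze(-1)  # (batch, length) interface logits


# Random stand-ins for language-model embeddings and SASA values.
torch.manual_seed(0)
embeddings = torch.randn(2, 50, 32)  # 2 sequences, 50 residues, placeholder dim 32
sasa = torch.rand(2, 50, 1)

model = ResidueClassifier(embed_dim=32)
model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    logits = model(embeddings, sasa)
    probs = torch.sigmoid(logits)  # per-residue interface probabilities
print(logits.shape)
```

In training, logits of this shape would typically be paired with `nn.BCEWithLogitsLoss` against binary interface labels, with padding positions masked out for sequences of unequal length.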