Improving Prediction for Disease-Associated Frameshift and Nonsense Mutations
Author(s)
Xu, Kyle
Abstract
Frameshift and nonsense mutations account for approximately 8.4% of disease-causing mutations but have received substantially less computational attention than missense variants. This research builds upon ENTPRISE-X and TransPPMP by incorporating protein language model embeddings and implementing class-aware loss functions to improve pathogenicity prediction, with particular focus on reducing false positive rates. We systematically evaluate three loss functions designed to address class imbalance: Focal Loss, Class-Balanced Loss, and Label-Distribution-Aware Margin (LDAM) Loss.
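As a rough illustration of the three class-aware ideas named above, the following minimal sketch computes class-balanced weights, a per-example focal loss, and LDAM-style per-class margins. It follows the commonly published formulations of these losses and is not taken from the thesis code; the class counts, the values of `beta`, `gamma`, and `max_margin`, and all function names are assumptions chosen for illustration only.

```python
import math

def class_balanced_weights(counts, beta=0.999):
    """Weights (1 - beta) / (1 - beta**n_c), rescaled to sum to the number of classes."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example, given the predicted probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def ldam_margins(counts, max_margin=0.5):
    """Per-class margins proportional to n_c**(-1/4); the largest is set to max_margin."""
    raw = [n ** -0.25 for n in counts]
    scale = max_margin / max(raw)
    return [m * scale for m in raw]

# Hypothetical class counts, ordered [majority class, minority class].
counts = [9000, 1000]
weights = class_balanced_weights(counts)
margins = ldam_margins(counts)

# A class-balanced focal loss multiplies the focal term by the true class's weight.
cb_focal_minority = weights[1] * focal_loss(0.6)
```

In this sketch the rarer class receives both the larger weight and the larger margin, and the focal term shrinks the contribution of examples the model already classifies confidently.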
Our best-performing model, utilizing Class-Balanced Loss combined with ESM-2 embeddings and mutation index features, achieves a Matthews Correlation Coefficient (MCC) of 0.697, F-score of 0.717, sensitivity of 0.817, and specificity of 0.963 on the VEST-indel test set. This represents a 4.2% relative improvement in MCC over the current state-of-the-art TransPPMP method (MCC: 0.669, Specificity: 0.939), while achieving a 39% relative reduction in false positive rate (from 6.1% to 3.7%) and maintaining strong sensitivity.
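The relative changes quoted above can be checked directly from the reported specificities and MCC values, since the false positive rate equals one minus specificity. The sketch below does that arithmetic and also shows how sensitivity, specificity, and MCC follow from a confusion matrix. The confusion-matrix counts are hypothetical and are not the thesis's test-set counts; the function names are likewise illustrative.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return sensitivity, specificity, mcc

def relative_change(old, new):
    """Signed relative change from old to new."""
    return (new - old) / old

# Values reported in the abstract.
fpr_baseline = 1.0 - 0.939   # TransPPMP false positive rate, about 6.1%
fpr_model = 1.0 - 0.963      # best model's false positive rate, about 3.7%
fpr_reduction = -relative_change(fpr_baseline, fpr_model)   # about 0.39
mcc_gain = relative_change(0.669, 0.697)                     # about 0.042

# Hypothetical counts, used only to exercise confusion_metrics.
sens, spec, mcc = confusion_metrics(tp=80, fp=50, tn=950, fn=20)
```

Computed this way, moving from a 6.1% to a 3.7% false positive rate is a relative reduction of roughly 39%, and the MCC gain over the baseline is roughly 4.2%.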
An alternative configuration using Class-Balanced Focal Loss achieves even higher specificity (0.976) for applications where minimizing false positives is paramount, at the cost of reduced sensitivity (0.732). These results demonstrate that strategic loss function selection can meaningfully reduce false positives in mutation pathogenicity prediction without compromising overall performance, offering practical value for clinical variant interpretation, where each false positive triggers expensive confirmatory testing and adds to patient burden.
Date
2025-12
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)