Academic ML Research · 6-week sprint

Predicting Mechanical Strength of β-Sheet Protein Domains

Sequence-, secondary-structure-, and geometry-aware descriptors were combined with nonlinear ensembles to estimate the peak stretching force (Fmax) observed in Sulkowska & Cieplak's β-rich protein survey.


Research Question

How much of the variance in mechanical strength (Fmax) across 54 β-sheet-rich protein domains can be explained by engineered descriptors derived from FASTA, DSSP, and PDB sources?

Challenges

• Only 54 samples versus ~70 correlated descriptors
• Risk of overfitting and signal dilution in linear models
• Need for rigorous cross-validation and honest uncertainty estimates

Research Overview

Predicting the mechanical strength of β-sheet domains is difficult because mechanical stability emerges from distributed interactions such as hydrogen-bond ladders, strand topology, and solvent shielding. Length alone is insufficient, and causal claims are inappropriate with such a small cohort. This study therefore targets predictive, not causal, insight by quantifying how far carefully validated ML can push accuracy without overstating confidence.

A bespoke feature set translates FASTA sequence statistics, DSSP secondary-structure summaries, and PDB-level geometric descriptors into a compact dataframe. Each transformation is traceable, enabling reproducible notebooks and sanity checks for collinearity. Emphasis is placed on modeling discipline over flashy metrics: the best models explain roughly 30% of Fmax variance, underscoring that remaining variance likely depends on nonlinear physics yet to be captured.

Technical Stack

Python, scikit-learn, XGBoost, NumPy, Pandas, matplotlib, Seaborn, DSSP parsing, PDB processing, cross-validation, feature engineering

Dataset & Feature Engineering

The dataset contains 54 β-rich domains curated from Sulkowska & Cieplak's theoretical stretching survey. Roughly 70 engineered descriptors describe sequence balance, secondary-structure geometry, and packing density. Multicollinearity is pervasive, so perfectly correlated pairs (notably Lf_A vs residue count) are removed before modeling, and all statistics are recomputed inside each CV split to avoid leakage.

Sequence descriptors capture amino acid charge and hydrophobic balance, DSSP-derived metrics summarize β-topology length scales, and PDB-derived features quantify contact density, SASA, and β-sheet connectivity. These engineered signals support richer hypotheses than scalar length alone while retaining the interpretability needed for structural biology discussions.

FASTA-derived sequence descriptors

Composition-heavy sequence descriptors capture amino acid balance, charge distribution, and hydrophobic trends that influence stretch response.

• Amino acid composition vectors and physicochemical group ratios
• Net charge, fraction charged residues, and calculated pI
• Kyte-Doolittle hydrophobicity mean and variance per domain
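As a rough illustration of the sequence descriptors listed above, the following minimal sketch computes composition, a simple charge count, and Kyte-Doolittle statistics from a one-letter sequence. The function name `sequence_descriptors`, the dictionary keys, and the charge convention (Lys/Arg positive, Asp/Glu negative, His neutral) are assumptions made for illustration and are not taken from the repository. The pI calculation is omitted because it depends on a choice of pKa table.

```python
from collections import Counter
from statistics import mean, pvariance

# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")  # assumption: histidine is treated as neutral
NEGATIVE = set("DE")


def sequence_descriptors(seq):
    """Return composition, charge, and hydropathy summaries for one sequence.

    Hypothetical helper: non-standard residue codes (e.g. 'X') raise KeyError.
    """
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    hydropathy = [KD[aa] for aa in seq]
    n_pos = sum(counts[aa] for aa in POSITIVE)
    n_neg = sum(counts[aa] for aa in NEGATIVE)
    return {
        "composition": {aa: counts[aa] / n for aa in KD},
        "net_charge": n_pos - n_neg,            # simple residue count, not pH-dependent
        "fraction_charged": (n_pos + n_neg) / n,
        "kd_mean": mean(hydropathy),
        "kd_variance": pvariance(hydropathy),   # population variance over residues
    }


example = sequence_descriptors("AKDE")
print(example["net_charge"], example["fraction_charged"])
```

Each returned value maps to one or more dataframe columns, so every descriptor stays traceable to a single function.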

DSSP secondary-structure metrics

Hand-crafted strand statistics summarize β-architecture complexity beyond simple counts.

• Strand and helix fractions with context on β-richness
• Mean strand/segment lengths (Ln_A, Lm_A) and longest strand Lf_A
• Per-residue transition frequencies to quantify architectural disorder
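The strand statistics above can be sketched from a per-residue DSSP string. This is an illustrative sketch only: the function `dssp_descriptors` and its output keys are hypothetical, and it assumes that `E` marks strand residues and that `H`, `G`, and `I` mark helical residues. It does not claim to reproduce the exact definitions of Ln_A, Lm_A, or Lf_A used in the study.

```python
from itertools import groupby


def dssp_descriptors(ss):
    """Summarize a DSSP secondary-structure string (one code per residue)."""
    n = len(ss)
    # Collapse the string into runs of identical codes, e.g. "EEE--" -> [("E", 3), ("-", 2)].
    runs = [(code, len(list(group))) for code, group in groupby(ss)]
    strand_lengths = [length for code, length in runs if code == "E"]
    n_transitions = len(runs) - 1  # number of positions where the code changes
    return {
        "strand_fraction": ss.count("E") / n,
        "helix_fraction": sum(ss.count(c) for c in "HGI") / n,
        "n_strands": len(strand_lengths),
        "mean_strand_length": (
            sum(strand_lengths) / len(strand_lengths) if strand_lengths else 0.0
        ),
        "longest_strand": max(strand_lengths, default=0),
        "transition_frequency": n_transitions / (n - 1) if n > 1 else 0.0,
    }


example = dssp_descriptors("EEEE--HHH-EE")
print(example)
```

Working on runs rather than raw counts is what lets the same pass yield both segment-length statistics and the transition frequency.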

PDB geometry & contact topology

3D descriptors approximate packing and solvent exposure drivers of mechanostability.

• Cα–Cα contact density within 8 Å and β-sheet topology proxies
• Radius of gyration alongside solvent-accessible surface area (SASA)
• Percent buried residues to flag tightly packed cores
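Two of the geometric descriptors above, Cα contact density and radius of gyration, can be computed directly from Cα coordinates. The sketch below assumes the coordinates have already been parsed from a PDB file into an (N, 3) array. The function names, the minimum sequence separation of 2, and the normalization of contacts by residue count are assumptions for illustration, not the repository's definitions. SASA and burial fractions need a dedicated surface algorithm and are not shown.

```python
import numpy as np


def ca_contact_density(ca_xyz, cutoff=8.0, min_seq_sep=2):
    """Contacts per residue: Calpha pairs closer than `cutoff` (in angstroms).

    Pairs closer than `min_seq_sep` along the chain are skipped, since
    sequence neighbours are always within the cutoff.
    """
    xyz = np.asarray(ca_xyz, dtype=float)
    n = len(xyz)
    diff = xyz[:, None, :] - xyz[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # full N x N distance matrix
    i, j = np.triu_indices(n, k=min_seq_sep)        # pairs with j - i >= min_seq_sep
    n_contacts = int((dist[i, j] < cutoff).sum())
    return n_contacts / n


def radius_of_gyration(ca_xyz):
    """Unweighted radius of gyration over Calpha atoms."""
    xyz = np.asarray(ca_xyz, dtype=float)
    centered = xyz - xyz.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))


# Toy chain: four Calpha atoms on a line, 3.8 angstroms apart.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
density = ca_contact_density(toy)
rg = radius_of_gyration(toy)
print(density, rg)
```

In the toy chain only the (i, i+2) pairs fall within 8 Å, giving two contacts over four residues.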

Figure: Correlation heatmap for engineered descriptors. The correlation structure reveals clusters of redundant descriptors, guiding pruning and informing the need for nonlinear models that can consider interactions without assuming independence.
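The removal of perfectly correlated descriptors could look roughly like the sketch below, which keeps the first column of each perfectly correlated group. The helper `drop_perfectly_correlated`, the tolerance, and the toy column names are hypothetical, and the repository may implement this step differently.

```python
import numpy as np


def drop_perfectly_correlated(X, names, tol=1e-8):
    """Drop any column whose |Pearson r| with an already-kept column is ~1."""
    X = np.asarray(X, dtype=float)
    if np.any(X.std(axis=0) == 0):
        # Constant columns have undefined correlation; handle them upstream.
        raise ValueError("constant columns must be removed first")
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < 1.0 - tol for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


# Toy matrix: column "b" is exactly 2 * "a", so it is redundant.
X_toy = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 4.0, 0.0],
    [3.0, 6.0, 1.0],
    [4.0, 8.0, 3.0],
])
X_kept, kept_names = drop_perfectly_correlated(X_toy, ["a", "b", "c"])
print(kept_names)
```

Only exact redundancy is removed here. Strongly but imperfectly correlated descriptors remain and are left to regularization or to the tree ensembles.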

Validation Strategy

Every reported metric stems from cross-validation; no result relies on a single train/test split, which could inflate performance with only 54 samples. The workflow emphasizes reproducibility and conservative estimates.

• 5-fold cross-validation for initial screening across all models
• Nested CV for Ridge/Lasso α selection to avoid optimistic bias
• Leave-One-Out CV (LOOCV) on nonlinear ensembles to maximize data use
• Removal of perfectly collinear descriptors (e.g., Lf_A vs residue count)
• Feature-scaling statistics fit on the training portion of each fold only
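A minimal sketch of how nested CV and fold-local scaling can be combined in scikit-learn is shown below. It uses synthetic noise in place of the real descriptor table, and the α grid, fold seeds, and inner scoring metric are assumptions rather than the study's settings. Placing `StandardScaler` inside the pipeline means its mean and variance are re-fit on each training split only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the study's data (54 x ~70).
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 70))
y = rng.normal(size=54)

# Scaling lives inside the pipeline, so no statistics leak from held-out rows.
pipe = make_pipeline(StandardScaler(), Ridge())

inner = KFold(n_splits=5, shuffle=True, random_state=0)  # selects alpha
outer = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": np.logspace(-2, 3, 6)},  # hypothetical grid
    cv=inner,
    scoring="neg_mean_absolute_error",
)

# Each outer fold re-runs the whole alpha search on its own training data.
outer_r2 = cross_val_score(search, X, y, cv=outer, scoring="r2")
print(outer_r2.round(2))
```

Because the targets here are pure noise, the outer R² values are expected to hover around or below zero, which is the behavior an unbiased estimate should show when there is no signal.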

Validation explicitly communicates limitations: R² values near 0.30 do not imply causal understanding of mechanostability; they simply show partial predictability under rigorous splitting.

Model Results

Linear baselines

The length-only baseline and the Ridge and Lasso regressors all performed poorly; the regularized models in particular struggled with the multicollinear ~70-feature space.

R²: ≤ 0.0 (often negative)
MAE: ≈ 0.18
RMSE: ≈ 0.25

Even after α tuning, the penalties simply shrank (Ridge) or zeroed (Lasso) the weights, and neither model beat the simple residue-length baseline, highlighting the limits of linear additivity in this regime.
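For reference, the standard definitions of the three reported metrics are given below, under the assumption that these are the definitions used (they match scikit-learn's defaults). Here y_i are held-out Fmax values, ŷ_i the corresponding predictions, and ȳ the mean of the held-out values.

```latex
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i}\lvert y_i - \hat{y}_i\rvert,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}
```

R² becomes negative whenever the residual sum of squares exceeds the total sum of squares, that is, when the model predicts held-out domains worse than a constant equal to their mean. This is why cross-validated R² can fall below zero even though in-sample R² for a least-squares fit with an intercept cannot.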

Nonlinear ensembles

Random Forest (400 trees, depth=5) and XGBoost (600 trees, lr=0.05) captured distributed interactions.

R²: ≈ 0.30
MAE: ≈ 0.15
RMSE: ≈ 0.21

Performance held across 5-fold and LOOCV splits, showing ~30% of Fmax variance is predictable without implying causality.
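Under LOOCV each test fold holds a single domain, so R² cannot be computed per fold; one common approach is to pool the held-out predictions and score them once. The sketch below illustrates that approach on synthetic data with a Random Forest at the reported depth of 5. The helper `loocv_metrics`, the synthetic target, and the reduced tree count (50 rather than the reported 400, to keep the sketch fast) are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loocv_metrics(model, X, y):
    """Pool leave-one-out predictions, then compute R2, MAE, and RMSE once."""
    preds = cross_val_predict(model, X, y, cv=LeaveOneOut())
    metrics = {
        "r2": r2_score(y, preds),
        "mae": mean_absolute_error(y, preds),
        "rmse": float(np.sqrt(mean_squared_error(y, preds))),
    }
    return metrics, preds


# Synthetic stand-in: 54 samples with a purely interaction-driven target.
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 10))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=54)

# The write-up reports 400 trees; 50 are used here only to keep the run short.
model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
metrics, preds = loocv_metrics(model, X, y)
print({name: round(value, 3) for name, value in metrics.items()})
```

The same helper accepts any scikit-learn-compatible regressor, so a boosted model could be scored the same way.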

Key Insight

Nonlinear models explain roughly 30% of Fmax variance by capturing distributed structural interactions. Linear models that assume additive effects cannot match this performance, which is consistent with mechanostability depending on collective β-sheet architecture rather than on any single descriptor.

Visual Results Gallery

Each visualization supports the validation narrative—highlighting correlation structure, nonlinear response patterns, and ensemble importances without implying causality.

Correlation heatmap: Multicollinearity across ~70 descriptors motivated feature pruning and regularization-aware validation.

Fmax relationships: Pairwise views reinforce that no single descriptor dominates mechanical strength.

Random Forest importances: Contact density and strand topology emerge as leading nonlinear cues.

XGBoost importances: Boosting favors combinations of SASA, β-topology, and charge balance features.
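Importance rankings like those in the two plots above are typically read from a fitted ensemble's `feature_importances_` attribute. The sketch below shows this for a Random Forest with the reported settings (400 trees, depth 5) on synthetic data. The feature names and the target are placeholders, not the study's descriptors, and it is an assumption that the figures were produced this way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 54 samples, 10 placeholder descriptors.
rng = np.random.default_rng(0)
names = [f"feat_{i}" for i in range(10)]
X = rng.normal(size=(54, 10))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=54)

model = RandomForestRegressor(n_estimators=400, max_depth=5, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importances = model.feature_importances_
ranked = sorted(zip(names, importances), key=lambda item: item[1], reverse=True)
for name, score in ranked[:3]:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances split credit among correlated descriptors, so with a collinear feature set the ranking is better read as a group-level hint than as a per-feature effect size.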

Future Directions

• Larger β-rich domain collections spanning experimental and simulated pulls
• Graph neural networks that reason over residue-level contact maps
• Geometric deep-learning features (curvature, torsion, interface exposure)
• Descriptors augmented with molecular dynamics-derived mechanical observables

Reproducibility

The public repository exposes notebooks, feature-generation scripts, and configuration files used for all experiments. Every figure on this page comes from that code, ensuring transparent provenance.

Clone: git clone https://github.com/matthue-lee/compsci_380
Environment: create a Python ≥3.10 virtual environment and install the repository requirements (pip install -r requirements.txt).
Run: execute the documented notebooks or training scripts to regenerate cross-validation tables and figures. All scripts log fold-wise metrics for auditing.
