Academic ML Research · 6-week sprint
Predicting Mechanical Strength of β-Sheet Protein Domains
Sequence-, secondary-structure-, and geometry-aware descriptors were combined with nonlinear ensembles to estimate the peak stretching force (Fmax) observed in Sulkowska & Cieplak's β-rich protein survey.
Research Question
How much of the mechanical strength (Fmax) variance across 54 β-sheet-rich protein domains can be explained from engineered descriptors derived from FASTA, DSSP, and PDB sources?
Challenges
- Only 54 samples versus ~70 correlated descriptors
- Potential overfitting and signal dilution in linear models
- Need for rigorous cross-validation and honest uncertainty
Research Overview
Predicting force-clamp strength in β-sheet domains is difficult because mechanical stability emerges from distributed interactions such as hydrogen-bond ladders, strand topology, and solvent shielding. Length alone is insufficient, and causal claims are inappropriate with such a small cohort. This study therefore targets predictive, not causal, insight by quantifying how far carefully validated ML can push accuracy without overstating confidence.
A bespoke feature set translates FASTA sequence statistics, DSSP secondary-structure summaries, and PDB-level geometric descriptors into a compact dataframe. Each transformation is traceable, enabling reproducible notebooks and sanity checks for collinearity. Emphasis is placed on modeling discipline over flashy metrics: the best models explain roughly 30% of Fmax variance, underscoring that remaining variance likely depends on nonlinear physics yet to be captured.
Technical Stack
Dataset & Feature Engineering
The dataset contains 54 β-rich domains curated from Sulkowska & Cieplak's theoretical stretching survey. Roughly 70 engineered descriptors describe sequence balance, secondary-structure geometry, and packing density. Multicollinearity is pervasive, so perfectly correlated pairs (notably Lf_A vs residue count) are removed before modeling, and all statistics are recomputed inside each CV split to avoid leakage.
Sequence descriptors capture amino acid charge and hydrophobic balance, DSSP-derived metrics summarize β-topology length scales, and PDB-derived features quantify contact density, SASA, and beta-sheet connectivity. These engineered signals provide richer hypotheses than scalar length alone while retaining interpretability needed for structural biology discussions.
FASTA-derived sequence descriptors
Composition-heavy sequence descriptors capture amino acid balance, charge distribution, and hydrophobic trends that influence stretch response.
- Amino acid composition vectors and physicochemical group ratios
- Net charge, fraction charged residues, and calculated pI
- Kyte-Doolittle hydrophobicity mean and variance per domain
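As a rough illustration of the charge and hydrophobicity descriptors listed above, the sketch below computes them from a raw sequence string. The function name `sequence_descriptors` and the example peptide are hypothetical, and the net charge is a crude count (K/R as +1, D/E as -1, histidine and termini ignored) rather than a pH-dependent calculation; pI is not covered.

```python
from statistics import mean, pvariance

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_descriptors(seq):
    """Return simple charge and hydropathy statistics for one sequence.

    Raises KeyError on non-standard residue codes such as 'X'.
    """
    seq = seq.upper()
    n = len(seq)
    hydropathy = [KYTE_DOOLITTLE[aa] for aa in seq]
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return {
        "length": n,
        "net_charge": positive - negative,  # crude integer estimate
        "fraction_charged": (positive + negative) / n,
        "kd_mean": mean(hydropathy),
        "kd_variance": pvariance(hydropathy),  # population variance
    }


# Hypothetical six-residue peptide, used only to exercise the function.
desc = sequence_descriptors("KVEDAL")
print(desc)
```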
DSSP secondary-structure metrics
Hand-crafted strand statistics summarize β-architecture complexity beyond simple counts.
- Strand and helix fractions with context on β-richness
- Mean strand/segment lengths (Ln_A, Lm_A) and longest strand Lf_A
- Per-residue transition frequencies to quantify architectural disorder
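Strand statistics of this kind can be derived from a per-residue DSSP string. The sketch below is an assumption-laden illustration: the function name, the returned keys, and the choice to count only `E` as strand are mine, and the exact definitions behind Ln_A, Lm_A, and Lf_A in the study are not reproduced here.

```python
from itertools import groupby


def dssp_descriptors(ss, strand_codes="E", helix_codes="HGI"):
    """Summarise a per-residue DSSP string (one code per residue).

    'E' is extended strand; 'H', 'G', 'I' are the DSSP helix codes.
    """
    n = len(ss)
    # Collapse the string into runs of identical codes: "EEE" -> ("E", 3).
    runs = [(code, sum(1 for _ in group)) for code, group in groupby(ss)]
    strand_lengths = [length for code, length in runs if code in strand_codes]
    transitions = len(runs) - 1  # positions where the code changes
    return {
        "strand_fraction": sum(strand_lengths) / n,
        "helix_fraction": sum(1 for c in ss if c in helix_codes) / n,
        "n_strands": len(strand_lengths),
        "mean_strand_length": (
            sum(strand_lengths) / len(strand_lengths) if strand_lengths else 0.0
        ),
        "longest_strand": max(strand_lengths, default=0),
        "transition_frequency": transitions / (n - 1) if n > 1 else 0.0,
    }


# Made-up 19-residue assignment: two strands, one turn, one helix.
ss_stats = dssp_descriptors("-EEEEE-TT-EEE-HHHH-")
print(ss_stats)
```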
PDB geometry & contact topology
3D descriptors approximate packing and solvent exposure drivers of mechanostability.
- Cα–Cα contact density within 8 Å and β-sheet topology proxies
- Radius of gyration alongside solvent-accessible surface area (SASA)
- Percent buried residues to flag tightly packed cores
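Two of the geometric descriptors above, contact density and radius of gyration, need only Cα coordinates. The following sketch assumes the coordinates have already been parsed from a PDB file into (x, y, z) tuples. The `min_separation` filter that skips near neighbours along the chain is an assumption of this example, and the radius of gyration is unweighted (Cα positions only). SASA and burial need a dedicated tool and are not shown.

```python
from itertools import combinations
from math import dist, sqrt


def geometry_descriptors(ca_coords, cutoff=8.0, min_separation=3):
    """Contact density and radius of gyration from C-alpha coordinates (in Å)."""
    n = len(ca_coords)
    # Count residue pairs closer than the cutoff, ignoring pairs that are
    # fewer than `min_separation` positions apart in sequence.
    contacts = sum(
        1
        for (i, a), (j, b) in combinations(enumerate(ca_coords), 2)
        if j - i >= min_separation and dist(a, b) <= cutoff
    )
    centroid = [sum(axis) / n for axis in zip(*ca_coords)]
    rg = sqrt(sum(dist(p, centroid) ** 2 for p in ca_coords) / n)
    return {"contacts_per_residue": contacts / n, "radius_of_gyration": rg}


# Idealised toy hairpin: two antiparallel three-residue strands, 5 Å apart.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
geom = geometry_descriptors(hairpin)
print(geom)
```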

Validation Strategy
Every metric reported stems from cross-validation; there is no single train/test split that could inflate performance given only 54 samples. The workflow emphasized reproducibility and conservative estimates.
- 5-fold cross-validation for initial screening across all models
- Nested CV for Ridge/Lasso α selection to avoid optimistic bias
- Leave-One-Out CV (LOOCV) on nonlinear ensembles to maximize data use
- Removal of perfectly collinear descriptors (e.g., Lf_A vs residue count)
- Strict separation of feature scaling statistics inside each fold
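One way to combine nested CV with fold-local scaling is a scikit-learn pipeline wrapped in a grid search. The sketch below uses synthetic data with the same shape as the study (54 × 70); the α grid, fold counts, seeds, and MAE scoring are illustrative assumptions, not the repository's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real descriptor table (54 domains, 70 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 70))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=54)

# The scaler lives inside the pipeline, so its mean and variance are
# re-estimated from the training portion of every split.
pipeline = make_pipeline(StandardScaler(), Ridge())

inner = KFold(n_splits=5, shuffle=True, random_state=1)  # selects alpha
outer = KFold(n_splits=5, shuffle=True, random_state=2)  # estimates error

search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": np.logspace(-2, 3, 6)},
    cv=inner,
    scoring="neg_mean_absolute_error",
)

# Each outer fold reruns the whole alpha search on its own training data,
# so the reported error never comes from data used for tuning.
outer_mae = -cross_val_score(
    search, X, y, cv=outer, scoring="neg_mean_absolute_error"
)
print(f"nested-CV MAE: {outer_mae.mean():.3f} +/- {outer_mae.std():.3f}")
```

Swapping `Ridge` for `Lasso` only requires changing the estimator and the parameter prefix (`lasso__alpha`).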
Validation explicitly communicates limitations: R² values near 0.30 do not imply causal understanding of mechanostability; they simply show partial predictability under rigorous splitting.
Model Results
Linear baselines
The length-only baseline and the Ridge and Lasso regressors all performed poorly; the regularized models in particular struggled with the multicollinear ~70-feature space.
- R²: ≤ 0.0 (often negative)
- MAE: ≈ 0.18
- RMSE: ≈ 0.25
Even after α tuning, the penalties simply shrank (Ridge) or zeroed (Lasso) most weights, and neither model beat the simple residue-length baseline, highlighting the limits of linear additivity in this regime.
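A negative R² can look like a bug, so it helps to spell out the three metrics. The sketch below uses made-up numbers: R² compares the model's squared error with that of always predicting the mean of the held-out targets, so it is 0 for that constant prediction and negative for anything worse.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """R², MAE, and RMSE computed from paired true/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": sqrt(ss_res / n),
    }


y_true = [1.0, 2.0, 3.0, 4.0]

# Predicting the held-out mean everywhere gives R² = 0 by construction.
mean_model = regression_metrics(y_true, [2.5] * 4)

# A model that gets the trend backwards does worse than the mean: R² < 0.
bad_model = regression_metrics(y_true, [4.0, 3.0, 2.0, 1.0])

print(mean_model["r2"], bad_model["r2"])
```

Under cross-validation the mean is estimated on training folds, so even a constant predictor typically lands slightly below zero on held-out data.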
Nonlinear ensembles
Random Forest (400 trees, depth=5) and XGBoost (600 trees, lr=0.05) captured distributed interactions.
- R²: ≈ 0.30
- MAE: ≈ 0.15
- RMSE: ≈ 0.21
Performance held across 5-fold and LOOCV splits, showing ~30% of Fmax variance is predictable without implying causality.
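Comparing 5-fold and LOOCV scores needs some care: with one held-out sample per fold, R² is undefined within a fold, so predictions have to be pooled across folds before scoring. The sketch below shows one way to do that with scikit-learn on synthetic data. The helper `pooled_cv_scores` is hypothetical, and the forest uses 50 trees only to keep the example quick, whereas the study reports 400 trees at depth 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict


def pooled_cv_scores(model, X, y, cv):
    """Collect one out-of-fold prediction per sample, then score them together."""
    pred = cross_val_predict(model, X, y, cv=cv)
    return {
        "r2": r2_score(y, pred),
        "mae": mean_absolute_error(y, pred),
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
    }


# Synthetic nonlinear target; these are not the study's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=30)

# Depth matches the reported setting; the tree count is reduced for speed.
forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)

kfold = pooled_cv_scores(forest, X, y, KFold(n_splits=5, shuffle=True, random_state=0))
loo = pooled_cv_scores(forest, X, y, LeaveOneOut())
print("5-fold:", kfold)
print("LOOCV: ", loo)
```

Any estimator with the scikit-learn `fit`/`predict` interface can be passed as `model`, so a gradient-boosted regressor could be scored through the same helper.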
Key Insight
Nonlinear models explain roughly 30% of Fmax variance, plausibly by capturing distributed structural interactions. Linear models that assume additive effects did not match this performance, which is consistent with mechanostability depending on collective β-sheet architecture rather than on any single descriptor.
Visual Results Gallery
Each visualization supports the validation narrative by highlighting correlation structure, nonlinear response patterns, and ensemble importances without implying causality.




Future Directions
- Larger β-rich domain collections spanning experimental and simulated pulls
- Graph neural networks that reason over residue-level contact maps
- Geometric deep-learning features (curvature, torsion, interface exposure)
- Augment descriptors with molecular dynamics-derived mechanical observables
Reproducibility
The public repository exposes notebooks, feature-generation scripts, and configuration files used for all experiments. Every figure on this page comes from that code, ensuring transparent provenance.
- Clone: git clone https://github.com/matthue-lee/compsci_380
- Environment: create a Python ≥3.10 virtual environment and install the repository requirements (pip install -r requirements.txt).
- Run: execute the documented notebooks or training scripts to regenerate cross-validation tables and figures. All scripts log fold-wise metrics for auditing.