MLB Pitch Quality (Stuff+) Evaluation System
In progress
Modeling pure stuff quality — independent of location — across 700K+ Statcast pitches.
Independently replicated and extended the FanGraphs Stuff+ methodology across two end-to-end
pipelines trained on 700K+ Statcast pitches (2015–2025): a cascaded ensemble
(Random Forest → contact classifier → LightGBM regressor) and a PyTorch MLP
(256→128→64, GELU, BatchNorm) that isolates pure stuff quality from 12 physical features
with no location confounders.
Applied per-pitch-type normalization (average = 100 per type), GroupKFold cross-validation
grouped by pitcher to prevent leakage, and a gradient-based stability regularizer
penalizing year-over-year inconsistency. Validated via Spearman correlation and quartile
persistence metrics. Deployed on W&M's HPC cluster (SciClone) via SLURM with
automated multi-year data ingestion, model serialization, and leaderboard generation.
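A minimal sketch of two of the steps above: per-pitch-type normalization and a pitcher-grouped fold assignment. It is an illustration under stated assumptions, not the project's code. The field names (`pitch_type`, `pred_run_value`, `pitcher`), the 10-points-per-standard-deviation scale, and the sign convention (lower predicted run value is better for the pitcher) are all assumed. `grouped_folds` is a standard-library stand-in that mimics the idea behind scikit-learn's GroupKFold rather than calling it.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stuff_plus(rows, points_per_sd=10.0):
    """Scale predicted run values so every pitch type averages 100.

    rows: dicts with 'pitch_type' and 'pred_run_value' (assumed: lower is
    better for the pitcher, so the sign is flipped). points_per_sd is an
    assumed scale, not taken from the source.
    """
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["pitch_type"]].append(row["pred_run_value"])
    stats = {t: (mean(v), pstdev(v)) for t, v in by_type.items()}

    scores = []
    for row in rows:
        mu, sd = stats[row["pitch_type"]]
        z = 0.0 if sd == 0 else (row["pred_run_value"] - mu) / sd
        scores.append(100.0 - points_per_sd * z)
    return scores


def grouped_folds(pitcher_ids, n_splits=5):
    """Give each pitcher one fold index so no pitcher's pitches are split
    between training and validation."""
    counts = defaultdict(int)
    for pid in pitcher_ids:
        counts[pid] += 1
    fold_sizes = [0] * n_splits
    fold_of = {}
    # Greedy balancing: place the largest groups first into the smallest fold.
    for pid, n in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        k = fold_sizes.index(min(fold_sizes))
        fold_of[pid] = k
        fold_sizes[k] += n
    return [fold_of[pid] for pid in pitcher_ids]


# Toy rows with made-up values, for illustration only.
rows = [
    {"pitcher": "A", "pitch_type": "FF", "pred_run_value": -0.02},
    {"pitcher": "A", "pitch_type": "SL", "pred_run_value": -0.01},
    {"pitcher": "B", "pitch_type": "FF", "pred_run_value": 0.00},
    {"pitcher": "B", "pitch_type": "SL", "pred_run_value": 0.01},
    {"pitcher": "C", "pitch_type": "FF", "pred_run_value": 0.02},
]
scores = stuff_plus(rows)
folds = grouped_folds([r["pitcher"] for r in rows], n_splits=2)
for r, s, f in zip(rows, scores, folds):
    print(r["pitcher"], r["pitch_type"], round(s, 1), "fold", f)
```

Normalizing within each pitch type keeps a fastball from being graded against a slider, and grouping folds by pitcher keeps a model from scoring well simply by recognizing pitchers it has already seen.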
- Python
- PyTorch
- LightGBM
- scikit-learn
- pybaseball
- Statcast
- SLURM
GitHub →
Repository under refinement
Free-Agent Contract Valuation Model
Complete · Personal
A front-office-style surplus-value model for an online baseball simulation league — valuing players and pricing the market as two independent problems.
A model that recommends a single contract — years and average annual value (AAV) — for any free agent in
Frostfire, a long-running online OOTP simulation
league with 21 seasons of data. The design mirrors how real front offices think: estimate what a player
is worth and what the market will pay as two independent models, then recommend a
signing only when value clears price by enough margin to absorb projection risk. I built it through
directed, end-to-end AI-assisted development — supplying the domain knowledge, modeling decisions, and
validation judgment while iterating the implementation through conversation with an AI coding assistant.
Value is assembled bottom-up: box-score stats are park-neutralized, decomposed into components (power,
contact, baserunning, defense, and catcher framing), converted to runs above replacement using linear
weights fit from the league's own run environment, aged forward with delta-method curves fit per
position and component, then run through a 40,000-iteration Monte Carlo that separates true-talent
uncertainty from season-to-season luck before pricing on a convex dollars-per-run curve. Price is a
deliberately simple ridge regression on real signings, and a length optimizer picks the contract that
maximizes risk-adjusted surplus — or recommends not signing at all. I excluded an available
player-ratings access token by choice: using information other managers couldn't see would be
cheating, not analytics.
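The simulation and length-selection steps can be sketched roughly as follows. This is an illustrative sketch, not the model's actual code. The price-curve form and its parameters (`base_rate`, `convexity`), the normal distributions, the linear aging term, the 10th-percentile downside measure, and `risk_weight` are all assumptions made for the example. What it aims to show is the structural idea: true talent is drawn once per iteration and persists across seasons, while luck is redrawn every season.

```python
import random


def dollars_for_runs(runs, base_rate=0.1, convexity=1.3):
    """Hypothetical convex price curve: each extra run is worth more than the last."""
    return base_rate * max(runs, 0.0) ** convexity


def simulate_surplus(talent_mean, talent_sd, luck_sd, aging_per_year,
                     years, aav, n_iter=40_000, seed=0):
    """Return (mean, 10th-percentile) total surplus over the contract."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_iter):
        # One talent draw per iteration: this error persists across seasons.
        talent = rng.gauss(talent_mean, talent_sd)
        surplus = 0.0
        for year in range(years):
            # Fresh luck draw every season: this error does not persist.
            season_runs = talent + aging_per_year * year + rng.gauss(0.0, luck_sd)
            surplus += dollars_for_runs(season_runs) - aav
        totals.append(surplus)
    totals.sort()
    return sum(totals) / n_iter, totals[int(0.10 * n_iter)]


def best_length(talent_mean, talent_sd, luck_sd, aging_per_year, aav,
                max_years=6, risk_weight=0.5, n_iter=40_000):
    """Pick the contract length with the highest risk-adjusted surplus.

    Returns (0, 0.0) when no length beats not signing at all.
    """
    best = (0, 0.0)
    for years in range(1, max_years + 1):
        avg, downside = simulate_surplus(talent_mean, talent_sd, luck_sd,
                                         aging_per_year, years, aav, n_iter)
        # Blend the average outcome with a bad-case outcome (assumed weighting).
        score = (1 - risk_weight) * avg + risk_weight * downside
        if score > best[1]:
            best = (years, score)
    return best


# Made-up players: a strong one at a low price, a weak one at a high price.
print(best_length(30.0, 5.0, 10.0, -3.0, aav=4.0, n_iter=5_000))
print(best_length(5.0, 5.0, 10.0, -3.0, aav=10.0, n_iter=5_000))
```

Starting the search from a zero-year contract with zero surplus means "do not sign" wins whenever every length has a non-positive risk-adjusted score.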
On 280 held-out signings the market model reached an R² of 0.568, but only 18.6% of predicted deals landed
within ±15% of actual AAV, far short of my pre-committed accuracy bar of 85%. Rather than lower the target, I
diagnosed the gap: three independent rounds of experiments (gradient boosting, robust regression, added
features, and more training data) all failed to beat the simple baseline on held-out data. With no
scouting or player-ratings data exposed by the league and only ~250–280 unbiased signings to learn from,
the ceiling was structural — a limit of the available data, not the architecture. Identifying that
distinction, rather than chasing a number, was the real result.
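The two evaluation numbers above measure different things, and a short sketch may help show why they can diverge. The function names and the toy AAV figures below are hypothetical; only the ±15% tolerance comes from the text.

```python
def within_tolerance_rate(actual, predicted, tol=0.15):
    """Share of predictions within +/- tol (as a fraction) of the actual value."""
    hits = sum(abs(p - a) <= tol * abs(a) for a, p in zip(actual, predicted))
    return hits / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy AAVs in arbitrary units, for illustration only.
actual = [1.0, 2.0, 4.0, 10.0]
predicted = [1.1, 2.5, 4.4, 9.0]
print(within_tolerance_rate(actual, predicted))
print(round(r_squared(actual, predicted), 3))
```

R² rewards getting the large contracts roughly right in absolute terms, while the tolerance rate counts every deal equally and judges it relative to its own size, so a model can score respectably on one and poorly on the other.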
- Python
- Surplus-value modeling
- Monte Carlo simulation
- Ridge regression
- Sabermetrics
- StatsPlus API
- AI-assisted (Claude)
GitHub →
Computational Modeling of T₂ Relaxation via Single-Sided NMR
Manuscript in prep
Simulating molecular T₂ relaxation for binary mixtures from first principles.
A research project in the Meldrum Spin Lab predicting T₂ relaxation times of substances and
molecular mixtures from physics-based first principles. Simulates molecular behavior in the
presence of a single-sided NMR magnet using molecular dynamics, with manuscript submission
targeted for August 2026.
Built an automated simulation pipeline combining OpenMM, MDAnalysis, SLURM, MPI, and tmux on
W&M's HPC cluster, with a validation suite comparing simulated vs. experimental T₂
values across a panel of test mixtures.
- OpenMM
- MDAnalysis
- SLURM
- MPI
- Bash
- Python
- SciPy
Repository private — available on request
Automated Google Business Profile Pipeline
Shipped · Client work
Replacing daily manual social posting with a fully automated content pipeline for a Northern Virginia realtor.
Designed and deployed an end-to-end content automation system for Andrew Capuano, a realtor and
certified appraiser serving the Gainesville and Bristow area. The pipeline replaces
roughly 20 minutes of daily manual work with a fully hands-off system that publishes professional,
localized Google Business posts every morning.
On a daily schedule, the workflow draws a randomized hook–topic pair from a curated content library,
selects an image at random from a pool of 20 pre-sized assets, generates the post copy via a ChatGPT
step keyed to the day's inputs, and publishes the formatted post to the Google Business
Profile API through Zapier. The output is consistent, on-brand, and indistinguishable from manual posting.
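The selection logic of the daily run can be sketched in Python, although the shipped workflow runs in Zapier rather than in code like this. Everything named here is hypothetical: the sample hooks and topics, the image file names, the prompt wording, drawing the hook and topic independently, and seeding the random generator from the date so that a given day always reproduces the same picks.

```python
import random
from datetime import date

# Placeholder content library; the real library is curated by hand.
HOOKS = ["Thinking about selling this spring?", "First-time buyer?"]
TOPICS = ["pricing your home", "what an appraisal covers", "local market trends"]
IMAGES = [f"image_{i:02d}.jpg" for i in range(1, 21)]  # pool of 20 pre-sized assets


def pick_daily_inputs(day, hooks=HOOKS, topics=TOPICS, images=IMAGES):
    """Draw the day's hook, topic, and image at random."""
    rng = random.Random(day.toordinal())  # assumed: date-seeded for repeatability
    return {
        "hook": rng.choice(hooks),
        "topic": rng.choice(topics),
        "image": rng.choice(images),
    }


def build_prompt(inputs, area="Gainesville and Bristow"):
    """Assemble the text handed to the copy-generation step (wording assumed)."""
    return (
        f"Write a short Google Business post for a realtor serving {area}. "
        f"Open with: '{inputs['hook']}' and focus on {inputs['topic']}."
    )


today = pick_daily_inputs(date(2025, 1, 6))
print(today["image"])
print(build_prompt(today))
```

Keeping the random draw separate from prompt construction means the content library can grow without touching the generation step.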
- Zapier
- ChatGPT API
- Google Business Profile
- Workflow Automation
- Content Systems
Workflow template →