Thomas Felton — Data Science & Baseball Analytics

01

About

I'm a fourth-year student at the College of William & Mary studying data science with a concentration in artificial intelligence and a minor in chemistry. My technical work outside of class spans two worlds I genuinely care about: building predictive models for baseball and developing computational pipelines for physical chemistry research.

In Dr. Tyler Meldrum's Spin Lab, I've spent two summers developing end-to-end molecular dynamics pipelines to simulate single-sided NMR T₂ relaxation for binary mixtures, with a manuscript in preparation. On the baseball side, I serve as Student Manager for Offensive Analytics & Advanced Scouting for William & Mary Baseball. During the season, I build scouting pipelines and produce reports for 10–20 opponents; in the offseason, I build analytical models such as the Stuff+ project. I'm also pitching coach and president of W&M Club Baseball.

My goal after graduation is an entry-level data scientist role. My dream position is in an MLB front office, but I am open to any opportunity. Outside of work, I enjoy fishing, hiking, exploring new places, and pitching whenever possible. I am also a semi-professional blitzballer.

02

Experience

Laboratory Assistant

Meldrum Physical Chemistry Lab · William & Mary

Jan 2024 — Present · 400+ hours

Developed an end-to-end molecular dynamics pipeline using OpenMM and MDAnalysis to compute and validate T₂ relaxation times, orchestrating 1,000+ SLURM job submissions on W&M's HPC cluster via MPI and Bash.
Engineered a validation suite comparing simulated vs. experimental T₂ values and automated publication-quality figures for diffusion and T₂ profiles using Matplotlib, SciPy, and LaTeX.
Manuscript in preparation: Computational Modeling of T₂ Relaxation via Single-Sided NMR (target submission: August 2026).

Student Manager — Offensive Analytics & Advanced Scouting

William & Mary Baseball

Mar 2024 — Present

Built a semi-automated scouting pipeline in Python that scrapes and aggregates opposing-pitcher data from Synergy and TrackMan, eliminating manual data entry.
Produced integrated scouting reports combining Google Sheets analytics and video through a Python integration for 10–20 opponents per season, delivered to coaching staff and players ahead of each series.

Pitching Coach & President

William & Mary Club Baseball

Aug 2024 — Present

Provide individualized mechanical and pitch-development coaching while directing in-game strategy based on batter–pitcher matchups and high-leverage situations.
Design structured practice programs that drive team-wide development. Recognized as NCBA 1st Team All-Region Pitcher (Mid-Atlantic North).

03

Projects

MLB Pitch Quality (Stuff+) Evaluation System

In progress

Modeling pure stuff quality — independent of location — across 700K+ Statcast pitches.

Independently replicated and extended the FanGraphs Stuff+ methodology across two end-to-end pipelines trained on 700K+ Statcast pitches (2015–2025): a cascaded ensemble (Random Forest → contact classifier → LightGBM regressor) and a PyTorch MLP (256→128→64, GELU, BatchNorm) that isolates pure stuff quality from 12 physical features with no location confounders.

Applied per-pitch-type normalization (average = 100 per type), GroupKFold cross-validation grouped by pitcher to prevent leakage, and a gradient-based stability regularizer penalizing year-over-year inconsistency. Validated via Spearman correlation and quartile persistence metrics. Deployed on W&M's HPC cluster (Sciclone) via SLURM with automated multi-year data ingestion, model serialization, and leaderboard generation.
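To illustrate the normalization and leakage control described above, here is a minimal standard-library sketch, not the project code. The field names (`pitch_type`, `raw`), the 10-points-per-standard-deviation spread, and the round-robin fold assignment are all assumptions made for illustration. The project itself uses scikit-learn's GroupKFold, which balances folds differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_pitch_type(rows, spread=10.0):
    """Rescale raw scores so every pitch type averages exactly 100.

    rows: list of dicts with hypothetical keys "pitch_type" and "raw"
          (higher raw score = better stuff).
    spread: points per standard deviation (an assumed convention).
    """
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["pitch_type"]].append(row["raw"])

    # Mean and population standard deviation of raw scores per pitch type.
    stats = {t: (mean(v), pstdev(v)) for t, v in by_type.items()}

    scaled = []
    for row in rows:
        mu, sigma = stats[row["pitch_type"]]
        z = 0.0 if sigma == 0 else (row["raw"] - mu) / sigma
        scaled.append({**row, "stuff_plus": 100.0 + spread * z})
    return scaled


def grouped_folds(pitcher_ids, n_splits=5):
    """Yield (train_idx, val_idx) with each pitcher confined to one fold."""
    unique = sorted(set(pitcher_ids))
    # Round-robin assignment of pitchers (not pitches) to folds.
    fold_of = {pid: i % n_splits for i, pid in enumerate(unique)}
    for k in range(n_splits):
        val = [i for i, pid in enumerate(pitcher_ids) if fold_of[pid] == k]
        train = [i for i, pid in enumerate(pitcher_ids) if fold_of[pid] != k]
        yield train, val
```

Splitting by pitcher rather than by pitch matters because a pitcher's fastballs resemble each other far more than they resemble anyone else's. A random pitch-level split would let the model score well by recognizing pitchers it has already seen.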

Python
PyTorch
LightGBM
scikit-learn
pybaseball
Statcast
SLURM

GitHub → Repository under refinement

Free-Agent Contract Valuation Model

Complete · Personal

A front-office-style surplus-value model for an online baseball simulation league — valuing players and pricing the market as two independent problems.

A model that recommends a single contract — years and average annual value (AAV) — for any free agent in Frostfire, a long-running online OOTP simulation league with 21 seasons of data. The design mirrors how real front offices think: estimate what a player is worth and what the market will pay as two independent models, then recommend a signing only when value clears price by enough margin to absorb projection risk. I built it through directed, end-to-end AI-assisted development — supplying the domain knowledge, modeling decisions, and validation judgment while iterating the implementation through conversation with an AI coding assistant.
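The sign-or-pass rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the model's actual optimizer: the function name, the dictionary inputs, and the 10% margin are hypothetical, and the real model works with risk-adjusted surplus rather than a flat threshold.

```python
def recommend_contract(value_by_years, price_by_years, margin=0.10):
    """Pick the contract length with the largest surplus, or None.

    value_by_years / price_by_years: dicts mapping contract length in years
    to total projected value and total expected market price (same units).
    margin: required surplus as a fraction of price (an assumed threshold).
    """
    best = None
    for years in sorted(value_by_years.keys() & price_by_years.keys()):
        value = value_by_years[years]
        price = price_by_years[years]
        surplus = value - price
        # Skip any length whose surplus does not cover the risk margin.
        if surplus < margin * price:
            continue
        if best is None or surplus > best[1]:
            best = (years, surplus)
    return best
```

Returning `None` when no length clears the margin keeps "do not sign" as a first-class outcome instead of forcing a recommendation.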

Value is assembled bottom-up: box-score stats are park-neutralized, decomposed into components (power, contact, baserunning, defense, and catcher framing), converted to runs above replacement using linear weights fit from the league's own run environment, aged forward with delta-method curves fit per position and component, then run through a 40,000-iteration Monte Carlo that separates true-talent uncertainty from season-to-season luck before pricing on a convex dollars-per-run curve. Price is a deliberately simple ridge regression on real signings, and a length optimizer picks the contract that maximizes risk-adjusted surplus — or recommends not signing at all. I excluded an available player-ratings access token by choice: using information other managers couldn't see would be cheating, not analytics.
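A toy version of the two-layer Monte Carlo shows why separating the noise sources matters once pricing is convex. Everything numeric here is a placeholder: the quadratic price curve, its coefficients, the normal distributions, and the function names are assumptions for illustration, not the fitted components of the model.

```python
import random
from statistics import mean


def dollars_for_runs(runs, base_rate=0.5, curvature=0.01):
    """Convex pricing: each extra run costs more than the last (in $M).

    base_rate and curvature are placeholder values, not fitted ones.
    Seasons below replacement level are floored at zero value.
    """
    r = max(runs, 0.0)
    return base_rate * r + curvature * r * r


def simulate_value(mean_runs, talent_sd, luck_sd, n_iter=40_000, seed=0):
    """Expected one-season dollar value (in $M) under two noise layers."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_iter):
        # Layer 1: uncertainty about the player's true talent level.
        talent = rng.gauss(mean_runs, talent_sd)
        # Layer 2: season-to-season luck around that talent level.
        observed = rng.gauss(talent, luck_sd)
        values.append(dollars_for_runs(observed))
    return mean(values)
```

Because the pricing curve is convex, the simulated expected value lands above the price of the mean projection alone, so the spread of outcomes affects the valuation and not just the point estimate.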

On 280 held-out signings, the market model reached an R² of 0.568, but only 18.6% of deals fell within ±15% of actual AAV, far short of my pre-committed bar of 85%. Rather than lower the target, I diagnosed the gap: three independent rounds of experiments (gradient boosting, robust regression, added features, and more training data) all failed to beat the simple baseline on held-out data. With no scouting or player-ratings data exposed by the league and only ~250–280 unbiased signings to learn from, the ceiling was structural: a limit of the available data, not the architecture. Identifying that distinction, rather than chasing a number, was the real result.
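For reference, the ±15% accuracy measure can be computed as below. This is a minimal sketch with a hypothetical function name, assuming predicted and actual AAV are given as parallel lists in the same units.

```python
def share_within_tolerance(predicted, actual, tolerance=0.15):
    """Fraction of deals whose predicted AAV is within ±tolerance of actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one deal is required")
    # A deal counts as a hit when the error is at most 15% of the actual AAV.
    hits = sum(
        abs(p - a) <= tolerance * abs(a) for p, a in zip(predicted, actual)
    )
    return hits / len(actual)
```

Unlike R², this measure ignores how far off the misses are, which is why a model can explain a fair share of the variance and still land few deals inside the band.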

Python
Surplus-value modeling
Monte Carlo simulation
Ridge regression
Sabermetrics
StatsPlus API
AI-assisted (Claude)

GitHub →

Computational Modeling of T₂ Relaxation via Single-Sided NMR

Manuscript in prep

Simulating molecular T₂ relaxation for binary mixtures from first principles.

A research project in the Meldrum Spin Lab predicting T₂ relaxation times of substances and molecular mixtures from physics-based first principles. Simulates molecular behavior in the presence of a single-sided NMR magnet using molecular dynamics, with target manuscript submission in August 2026.

Built an automated simulation pipeline combining OpenMM, MDAnalysis, SLURM, MPI, and tmux on W&M's HPC cluster, with a validation suite comparing simulated vs. experimental T₂ values across a panel of test mixtures.

OpenMM
MDAnalysis
SLURM
MPI
Bash
Python
SciPy

Repository private — available on request

Automated Google Business Profile Pipeline

Shipped · Client work

Replacing daily manual social posting with a fully automated content pipeline for a Northern Virginia realtor.

Designed and deployed an end-to-end content automation system for Andrew Capuano, a realtor and certified appraiser serving the Gainesville and Bristow area. The pipeline replaces roughly 20 minutes of daily manual work with a fully hands-off system that publishes professional, localized Google Business posts every morning.

On a daily schedule, the workflow draws a randomized hook–topic pair from a curated content library, selects an image at random from a pool of 20 pre-sized assets, generates the post copy via a ChatGPT step keyed to the day's inputs, and publishes the formatted post directly to the Google Business Profile API through Zapier. The output is consistent, on-brand, and indistinguishable from manual posting.

Zapier
ChatGPT API
Google Business Profile
Workflow Automation
Content Systems

Workflow template →

04

Skills

Languages

Python · R · SQL · Bash · LaTeX

Machine Learning & Data Science

PyTorch · scikit-learn · LightGBM · CatBoost · statsmodels · SciPy · pandas · NumPy · Matplotlib · Seaborn

Statistics & Modeling

Regression · neural networks · random forests & gradient boosting · generative models (diffusion, VAE, transformers) · cross-validation with leakage controls

Baseball Analytics

Statcast (pitch- and event-level) · pybaseball · MLB Stats API · TrackMan · Synergy · Baseball Savant · pitch-quality modeling (Stuff+) · xwOBA & run-value frameworks · advanced scouting workflows

Simulation & HPC

OpenMM · CHARMM · MDAnalysis · SLURM · MPI · TCSH · tmux · W&M Sciclone HPC cluster

Data Engineering & Tools

JSON · CSV · NumPy binary (.npz) · molecular simulation formats (.dcd, .pdb, .mol2) · Git · Jupyter · Google Sheets · Excel · Cinema4D

AI-Assisted Development

Directing LLM coding assistants to build & iterate full analytical pipelines (“vibe-coding”) · pairing domain expertise with AI implementation · validation & verification of AI-built systems · prompt engineering · LLM API integration (ChatGPT API) · AI-driven workflow automation (Zapier)

05

Coursework

Selected courses completed or in progress at William & Mary, ordered by relevance.

DATA 301

Applied Machine Learning

MATH 351

Probability & Statistics for Scientists

MATH 352

Statistical Data Analysis

DATA 302

Databases

DATA 303

Data Visualization

DATA 201

Intro to Data Science

DATA 209

Applied Linear Algebra & Calculus

DATA 440

Supercomputing for Science

DATA 446

Generative AI · in progress

DATA 442

Neural Networks & Deep Learning · Fall 2026

DATA 448

Reinforcement Learning · Fall 2026

DATA 451

AI Systems · Fall 2026

MATH 212

Multivariable Calculus

CSCI 141

Computational Problem Solving

06

Contact

Open to internships, research collaborations, and conversations about baseball analytics or computational science. The fastest way to reach me is email.