Independent Research

Benchmarks, failure cases, and physical models

I use independent projects to test ideas that need room outside day-to-day work: small benchmarks, failure cases, reproducible studies, and code that makes a technical claim easier to check.

Projects

Costgate: CI Cost Regression Gate for LLM Inference (2026)

Status: manuscript in preparation / released code · LLM Evaluation · Tooling

A CI-native tool for catching LLM inference cost regressions before they reach production.

  • Why - LLM changes can pass quality checks while becoming slower, longer, or more expensive.
  • Approach - Runs fixed prompt suites, records token/cost/latency metrics, and compares pull-request runs against a saved baseline.
  • Outcome - Produces pass/fail gates and reviewable Markdown/JSON reports for cost-per-success, token drift, and latency regressions.
  • Methods & checks - Provider adapters, rate cards, deterministic harness settings, repeated trials, practical thresholds, and statistical comparison tests.
  • Artifacts - Manuscript under preparation · Zenodo Code DOI
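
A minimal sketch of the baseline comparison described above, not the Costgate implementation. The `RunSummary` schema, the `gate` function, and the 10% threshold are hypothetical names and values chosen for illustration, and only a practical threshold is shown, not the statistical comparison tests.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    """Per-trial metrics from one run of a fixed prompt suite (hypothetical schema)."""
    costs_usd: list[float]    # cost of each trial
    successes: list[bool]     # whether each trial passed its quality check
    latencies_s: list[float]  # wall-clock latency of each trial


def cost_per_success(run: RunSummary) -> float:
    """Total spend divided by the number of trials that passed the quality check."""
    n_ok = sum(run.successes)
    return sum(run.costs_usd) / n_ok if n_ok else float("inf")


def gate(baseline: RunSummary, candidate: RunSummary,
         max_rel_increase: float = 0.10) -> dict:
    """Fail any metric that grows by more than max_rel_increase over baseline."""
    metrics = {
        "cost_per_success": (cost_per_success(baseline), cost_per_success(candidate)),
        "mean_latency_s": (mean(baseline.latencies_s), mean(candidate.latencies_s)),
    }
    report = {}
    for name, (base, cand) in metrics.items():
        if not 0.0 < base < float("inf"):
            raise ValueError(f"baseline {name} must be positive and finite")
        rel_change = (cand - base) / base
        report[name] = {
            "baseline": base,
            "candidate": cand,
            "rel_change": rel_change,
            "passed": rel_change <= max_rel_increase,
        }
    overall = all(entry["passed"] for entry in report.values())
    return {"passed": overall, "metrics": report}


# Illustrative numbers only: the candidate doubles cost and loses one success.
baseline = RunSummary([0.010] * 4, [True] * 4, [1.00] * 4)
candidate = RunSummary([0.020] * 4, [True, True, False, True], [1.05] * 4)
result = gate(baseline, candidate)
print(result["passed"])  # prints False
```

Here the latency check passes (a 5% increase) while cost-per-success fails, which is the case the gate is meant to catch: quality can look acceptable while each success costs more.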

Trajectory-Only Structural Regime Detection (2026)

Status: preprint / released code · Nonlinear Dynamics · Scientific Computing · Trajectory Analysis

Trajectory-based study of whether structural regime changes can be detected without giving the detector access to the governing equations.

  • Why - Recorded trajectories are often available before clean equations, Jacobians, or model internals.
  • Approach - Built trajectory-derived indicators, combined them into a structural score, and extracted boundaries across parameter sweeps.
  • Outcome - On FitzHugh-Nagumo and autonomous forced van der Pol tests, the structural boundary appeared earlier in the parameter sweep than the qualitative transition.
  • Methods & checks - Change-point extraction, lead-distance analysis, nuisance sweeps, robustness tests, ablations, and manifest validation.
  • Artifacts - Preprint PDF · Supplement · Zenodo Preprint DOI · Zenodo Code DOI
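
A minimal sketch of change-point extraction and lead distance on a swept score, under stated assumptions. The least-squares single-split detector is a generic stand-in, not necessarily the method in the preprint, and `single_change_point`, `lead_distance`, and the synthetic score are hypothetical.

```python
def single_change_point(values):
    """Index k (1..n-1) that best splits values into two constant segments.

    Chooses the split that minimizes the summed squared deviation from
    each segment's own mean.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    best_k, best_cost = 1, float("inf")
    for k in range(1, n):
        left, right = values[:k], values[k:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        cost = (sum((v - mean_left) ** 2 for v in left)
                + sum((v - mean_right) ** 2 for v in right))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


def lead_distance(params, structural_score, transition_param):
    """Structural boundary and its distance ahead of a known transition.

    Assumes params are sorted ascending. A positive lead means the boundary
    occurs at a smaller parameter value than the qualitative transition.
    """
    k = single_change_point(structural_score)
    boundary = 0.5 * (params[k - 1] + params[k])  # midpoint of the jump
    return boundary, transition_param - boundary


# Synthetic sweep: the score jumps between 0.3 and 0.4, and the
# qualitative transition is taken to sit at 0.6 (illustrative values).
params = [i / 10 for i in range(10)]
score = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
boundary, lead = lead_distance(params, score, transition_param=0.6)
print(boundary, lead)
```

The detector sees only the score series, so it needs no access to the governing equations; the transition location is used only afterward to compute the lead.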

ThermoBench-Consist v1.0 (2025)

Status: released code · Benchmark · Thermodynamics

A benchmark and diagnostic for learned equations of state, with four identity checks (monotonicity, stability, Clapeyron slope, acoustic speed), JSON scoring, and guardrails near critical regions.

  • Why - Accuracy metrics can hide violations of basic physical identities.
  • Approach - Minimal API; per-check assertions emit structured failures that map directly to the violated identity.
  • Outcome - CI-ready tests that block physically impossible predictions before deployment.
  • Methods & checks - Finite-difference identities; envelope/phase-boundary sanity tests; critical-region clamping.
  • Artifacts - Preprint PDF · Zenodo Documentation DOI · Zenodo Code DOI
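
A minimal sketch of one finite-difference identity check that emits structured failures, not the ThermoBench-Consist API. The reduced van der Waals equation stands in for a learned model, and `check_mechanical_stability` and the failure fields are hypothetical. The raw van der Waals loop below the critical temperature is used deliberately as a known violator; a real check would also account for the phase envelope.

```python
import json


def vdw_pressure(v_r, t_r):
    """Reduced van der Waals pressure; a stand-in for a learned EOS."""
    return 8.0 * t_r / (3.0 * v_r - 1.0) - 3.0 / v_r ** 2


def check_mechanical_stability(pressure_fn, t_r, volumes, h=1e-4):
    """Flag states where the isothermal slope (dP/dv)_T is positive.

    The slope is estimated with a central finite difference of step h.
    """
    failures = []
    for v_r in volumes:
        slope = (pressure_fn(v_r + h, t_r) - pressure_fn(v_r - h, t_r)) / (2.0 * h)
        if slope > 0.0:
            failures.append({
                "check": "stability",
                "identity": "(dP/dv)_T <= 0",
                "T_r": t_r,
                "v_r": v_r,
                "dP_dv": slope,
            })
    return failures


volumes = [0.5 + 0.1 * i for i in range(26)]  # reduced volumes from 0.5 to 3.0
supercritical = check_mechanical_stability(vdw_pressure, 1.2, volumes)
subcritical = check_mechanical_stability(vdw_pressure, 0.9, volumes)

# Each failure names the violated identity and the state where it occurred.
print(json.dumps(subcritical[:1], indent=2))
```

Above the critical temperature the check returns no failures; at a reduced temperature of 0.9 it flags the states inside the loop, each tagged with the identity it violates.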

Assessing the Limits of Graph Neural Networks for Vapor-Liquid Equilibrium (2025)

Status: preprint · Negative Results · VLE

Negative-results study showing that seemingly accurate GNNs fail global thermodynamic consistency; proposes a hybrid fallback with classical libraries.

  • Why - Small pointwise error does not guarantee integrable or identity-preserving fields.
  • Approach - Stress tests with identity checks; trigger a classical fallback policy when tests fail.
  • Outcome - The useful result was not a better GNN; it was a set of failure cases and a fallback rule for when a learned property model should hand off to a classical library.
  • Methods & checks - Integrability tests, common-tangent/Maxwell constructions, cross-library comparisons.
  • Artifacts - Preprint PDF · arXiv
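
A minimal sketch of an integrability test driving a fallback rule, under stated assumptions. It checks the Maxwell relation (∂S/∂v)_T = (∂P/∂T)_v by central differences. The toy ideal-gas-like model (gas constant set to 1), the perturbed "learned" model, the tolerance, and all function names are hypothetical and are not the models or libraries from the study.

```python
import math


def classical_model(t, v):
    """Toy consistent model: returns (entropy, pressure) with R = 1."""
    return math.log(v) + 1.5 * math.log(t), t / v


def learned_model(t, v):
    """Toy 'learned' model: entropy carries an error that grows away from t = 1."""
    s, p = classical_model(t, v)
    return s + 0.05 * v * (t - 1.0), p


def maxwell_residual(model, t, v, h=1e-5):
    """|dS/dv - dP/dT| by central differences; near zero if S and P share a potential."""
    ds_dv = (model(t, v + h)[0] - model(t, v - h)[0]) / (2.0 * h)
    dp_dt = (model(t + h, v)[1] - model(t - h, v)[1]) / (2.0 * h)
    return abs(ds_dv - dp_dt)


def with_fallback(learned, classical, tol=1e-6):
    """Wrap a learned model so that failed states hand off to the classical one."""
    def predict(t, v):
        if maxwell_residual(learned, t, v) > tol:
            return classical(t, v), "classical"
        return learned(t, v), "learned"
    return predict


predict = with_fallback(learned_model, classical_model)
print(predict(1.0, 2.0)[1])  # consistent state: learned model is kept
print(predict(2.0, 2.0)[1])  # inconsistent state: classical fallback
```

The pressure prediction is identical in both models, so a pointwise pressure error would be zero everywhere; only the cross-derivative test exposes the inconsistency.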

Energy-Efficient Robotics Software (2020-2024): Systematic Literature Review (2025)

Status: preprint / released code · Survey · Energy

A 79-study synthesis on energy in autonomy stacks (planning, perception, middleware, HW acceleration) across mobile and manipulator platforms.

  • Why - Robotics papers report speed and latency but rarely report energy in comparable terms.
  • Approach - Systematic screening; taxonomy of techniques (scheduling/DVFS, refactoring, off-loading, task-level policies); reporting checklist.
  • Outcome - Quantified gaps (Wh/mission, energy-latency trade-offs, standard workloads) and under-measured components (sensing, comms, OS); full replication package.
  • Methods & checks - Inclusion/exclusion rationale; code + data to regenerate figures; foundation for REB-1’s Wh→gCO₂ framing.
  • Artifacts - Preprint PDF · arXiv · Zenodo Code DOI

REB-1: Robot Energy Benchmark v0.1 (2025)

Status: alpha · Energy · Robotics

Micro-benchmark for measuring power and carbon footprint of robotics workloads; standardized 60-second traces and tidy logs for repeatable comparison.

  • Why - Teams lack a consistent yardstick for energy across perception/control stacks.
  • Approach - Fixed-duration runs; synchronized power + CPU/GPU counters; ready-to-plot summaries.
  • Outcome - Repeatable numbers that support energy-aware choices and checks of rank stability across hardware.
  • Methods & checks - Normalized telemetry; per-board calibration notes; Wh → gCO₂ conversion.
  • Artifacts - Zenodo Code DOI
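
A minimal sketch of the Wh → gCO₂ step for one fixed-duration trace, not REB-1's code. The function names, the constant 12 W trace, and the grid intensity of 400 gCO₂/kWh are illustrative assumptions; the real intensity depends on the grid and the time of measurement.

```python
def energy_wh(timestamps_s, power_w):
    """Integrate a power trace with the trapezoidal rule; joules to watt-hours."""
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need two or more matching samples")
    joules = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3600.0  # 1 Wh = 3600 J


def grams_co2(wh, grid_intensity_g_per_kwh):
    """Convert energy to emissions using a grid carbon intensity in gCO2/kWh."""
    return wh / 1000.0 * grid_intensity_g_per_kwh


# Illustrative 60-second trace sampled at 1 Hz with a constant 12 W draw.
timestamps = [float(t) for t in range(61)]
power = [12.0] * 61
wh = energy_wh(timestamps, power)
print(wh, grams_co2(wh, grid_intensity_g_per_kwh=400.0))
```

A constant 12 W over 60 seconds is 720 J, or 0.2 Wh, which gives a quick sanity check on the integration before using real traces.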

Research Themes

Applied ML, Control, and Robotics

Models are most interesting to me when they have to touch a sensor, a controller, or a physical budget.

Physical Modeling & Sustainable Systems

Molecular modeling, carbon capture, and energy measurement work where the units matter as much as the code.

Reproducible Scientific Plumbing

The unglamorous parts I keep coming back to: adapters, logs, manifests, tests, and small datasets that make a result easier to rerun.