machine learning engineer & researcher
Brihat Sharma
I build and study machine learning systems.
I care about evaluation as much as capability: how we decide whether a model or an agent is actually good, not just whether it looks good.
- evaluation
- agents
- measurement
- clinical NLP
- physics
Projects
Writing
- Two Agents, the Same Score, Different Failures (2026-06-09)
Aggregate success rate tells you an agent failed. It will not tell you that two agents with the same score fail in completely different ways. A short look at why per-axis failure profiles are the more useful number.
- Writing a Judge Rubric Two Models Can Agree On (2026-06-09)
When two LLM judges from different families disagree at high confidence, the problem is usually the rubric, not the judges. Here is the three-iteration loop that got them to converge, and the one change that actually did it.
- Which Capability Actually Broke? A Calibrated 5-Axis Judge for Agent Tool Use (2026-06-04)
End-task success tells you an agent failed, not why. Here is a 5-axis decomposition of tool-use correctness, graded by cross-family LLM judges, with the calibration discipline that makes the verdicts mean something.