Recursive agent engineering

Robust agent engineering through extreme vibe coding.

I build agent systems where evaluation is not the end. It is the feedback signal for the next version.

I'm Pratik Bhavsar, working across tools, memory, judges, datasets, benchmarks, and feedback loops. The goal is simple: turn agent behavior into better agents.

Agent builder: tools, memory, multi-turn systems
Open-source benchmarking: leaderboards, datasets, public evals
Writer: books, essays, field guides
Self-improvement loops: traces, judges, feedback loops
5 Books · 25K+ Followers · 100+ Articles · 11 Talks · 6 Startups

I build the loop where agent traces become better agents.

The interesting work starts after an agent runs: capture the tool path, memory hits, judge disagreements, failed actions, and production edge cases. Those traces become the raw material for the next prompt, policy, tool, memory rule, dataset, and benchmark.

That is the thread across Galileo, open-source leaderboards, books, and community work: evals are not a separate stage. They are the control system that lets agents improve from one run to the next.

01

Instrument the run

Capture tool calls, retrievals, memory writes, human corrections, and judge signals as first-class evidence.
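
In code, this step can be as small as an append-only trace log. A minimal Python sketch, assuming no particular framework; AgentTrace and log_trace are illustrative names, not any real library's API:

```python
# A minimal sketch of trace capture, assuming no particular framework.
# AgentTrace and log_trace are illustrative names, not a real library's API.
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class AgentTrace:
    """One run's evidence: everything the improvement loop needs later."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    input: str = ""                                        # the task the agent was given
    tool_calls: list = field(default_factory=list)         # [{"tool": ..., "args": ..., "ok": ...}]
    retrievals: list = field(default_factory=list)         # chunks pulled into context
    memory_writes: list = field(default_factory=list)      # what the agent chose to remember
    human_corrections: list = field(default_factory=list)  # edits a human made to the output
    judge_signals: list = field(default_factory=list)      # [{"judge": ..., "score": ..., "rationale": ...}]

def log_trace(trace: AgentTrace, path: str = "traces.jsonl") -> None:
    """Append-only log: traces are evidence, never overwritten."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(trace)) + "\n")
```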

02

Select what survives

Use judges, datasets, and benchmark pressure to separate durable behavior from lucky demos.
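
A sketch of that selection pressure, under assumed cutoffs (the 0.8 judge-agreement and 0.9 pass-rate thresholds are illustrative, and the trace schema matches the capture sketch above): behavior counts as durable only if judges agree a run passed and it keeps passing across reruns.

```python
# A minimal sketch of selection pressure over logged traces.
# Thresholds are illustrative; judge scores are assumed normalized to [0, 1].
def survives(traces_for_task: list[dict],
             min_judge_agreement: float = 0.8,
             min_pass_rate: float = 0.9) -> bool:
    """Durable behavior = judges agree a run passed, and the task keeps
    passing across reruns, not just in one lucky demo."""
    if not traces_for_task:
        return False
    passes = 0
    for trace in traces_for_task:
        scores = [s["score"] for s in trace["judge_signals"]]
        if not scores:
            continue  # unjudged runs never count as evidence
        agreement = sum(s >= 0.5 for s in scores) / len(scores)
        if agreement >= min_judge_agreement:
            passes += 1
    return passes / len(traces_for_task) >= min_pass_rate
```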

03

Patch the agent

Feed accepted traces back into prompts, policies, tools, memory rules, and regression suites.
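
One way to close the regression-suite part of that loop, again using the illustrative trace schema from the capture sketch: freeze each accepted trace into a test case the next agent version must still pass.

```python
# A minimal sketch of the patch step, reusing the illustrative trace schema
# from the capture sketch above. Each accepted trace becomes a frozen case.
import json

def traces_to_regression_suite(accepted: list[dict]) -> list[dict]:
    return [
        {
            "case_id": t["run_id"],
            "input": t.get("input", ""),
            "expected_tools": [c["tool"] for c in t.get("tool_calls", []) if c.get("ok")],
            # the bar the next version must clear: the worst judge score this run got
            "min_judge_score": min(s["score"] for s in t["judge_signals"]),
        }
        for t in accepted
        if t.get("judge_signals")
    ]

# Usage: freeze whatever survived the selection step into a versioned suite file.
survivors: list[dict] = []  # output of the selection sketch above
with open("regression_suite.json", "w") as f:
    json.dump(traces_to_regression_suite(survivors), f, indent=2)
```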

04

Publish the loop

Turn private lessons into books, leaderboards, benchmarks, and reusable patterns for other builders.

Books for building production AI systems.

The Mastering GenAI series turns hard-won lessons into readable field guides: agents, multi-agent systems, RAG, LLM-as-a-judge, and eval engineering.

Latest book

Eval Engineering

The emerging discipline of building trust in production AI systems. The book is structured as a practical path from evaluation basics to LLM-as-judge systems, SME refinement, SLM scaling, and production guardrails.

Metrics · Judges · Traces · Guardrails

Measuring agent behavior in the open.

Benchmarks should expose how systems behave under real pressure: multi-turn tasks, domain constraints, factuality, retrieval quality, and tool use. That is the work behind these public artifacts.

Featured benchmark

Agent Leaderboard v2

Enterprise-grade benchmark for AI agents: 30+ models, 500 real-world multi-turn support scenarios, and domains including banking, healthcare, investment, telecom, and insurance.

Reliability benchmark

Hallucination Index

A living index tracking factual reliability and hallucination rates across foundation models.

Fine-tuned RAG models

BRAG

Fine-tuned open-source LLMs that pushed state-of-the-art quality on retrieval-augmented generation and evaluation workloads.

Full-stack AI, from models to communities.

I'm a full-stack AI engineer at Galileo (employee #26, Series A), leading open-source evaluations and developer relations. I built the Agent Leaderboard, an open benchmark for AI agents on real-world tasks, along with the Hallucination Index and the BRAG model series for RAG evaluation.

Before Galileo, I was a founding engineer at Enterpret (employee #6, pre-seed), building NLP systems for customer feedback at scale, and a principal data scientist at TaskHuman, designing semantic search with transformers. Earlier, I launched end-to-end AI initiatives at Morningstar as their first quantitative research hire.

My RAG evaluation research was featured in Andrew Ng's newsletter, and I joined the Latent Space Podcast to discuss agent evaluation. I was also named one of the Top AI Developers to Watch in 2023.

Education

IIT Bombay

M.Tech.

Galileo

Employee #26

Open-source evals, developer relations, books, benchmarks, and agent reliability work.

Enterpret

Employee #6

Founding engineer building NLP systems for customer feedback intelligence.

Recognition

Top AI Developer

Named one of the Top AI Developers to Watch in 2023.

Applying AI at the edge of products.

June 2023 - present

Galileo

Employee #26. Leading open-source evaluations and developer relations. Built the Agent Leaderboard, the Hallucination Index, and the BRAG model series; authored five GenAI books.

2019 - present

Maxpool

Founded and grew a community of AI professionals building production GenAI systems.

Feb 2021 - May 2023

Enterpret

Employee #6, pre-seed. Built semantic search, reranking, text generation, MLOps, and NLP pipelines for customer feedback intelligence.

Sep 2020 - Dec 2020

Jina AI

Employee #6, seed. Contributed to the open-source neural search framework powering multi-modal search at scale.

2017 - 2020

TaskHuman and Morningstar

At TaskHuman, built semantic search and recommendation systems. At Morningstar, joined as the first quantitative research hire and built NLP extraction, sentiment, and quantitative ML workflows.

The loop gets better when more builders can inspect it.

I founded Maxpool as a community for AI engineers building production systems. The same idea runs through the books and benchmarks: make practice visible, inspectable, and useful beyond one team.

Let's build the next agent loop.

Reach out for agent systems, evaluation, writing, speaking, or communities around production AI.