Recursive agent engineering

Robust agents, built through extreme vibe coding.

I build agent systems where evaluation is not the ending. It is the feedback signal for the next version.

I'm Pratik Bhavsar, working across tools, memory, judges, datasets, benchmarks, and feedback loops. The goal is simple: turn agent behavior into better agents.

01 / Build: Agent systems
02 / Measure: Open benchmarks
03 / Share: Books & field guides
04 / Improve: Feedback loops
5 Books
25K+ Followers
100+ Articles
11 Talks
6 Startups

Books for building production AI systems.

The Mastering GenAI series turns hard-won lessons into readable field guides: agents, multi-agent systems, RAG, LLM-as-a-judge, and eval engineering.

Latest book

Eval Engineering

The emerging discipline of building trust in production AI systems. The book is structured as a practical path from evaluation basics to LLM-as-judge systems, SME refinement, SLM scaling, and production guardrails.

Metrics, Judges, Traces, Guardrails

I build the loop where agent traces become better agents.

The interesting work starts after an agent runs: capture the tool path, memory hits, judge disagreements, failed actions, and production edge cases. Those traces become the raw material for the next prompt, policy, tool, memory rule, dataset, and benchmark.

That is the thread across Galileo, open-source leaderboards, books, and community work: evals are not a separate page. They are the control system that lets agents improve from one run to the next.

01

Instrument the run

Capture tool calls, retrievals, memory writes, human corrections, and judge signals as first-class evidence.

02

Select what survives

Use judges, datasets, and benchmark pressure to separate durable behavior from lucky demos.

03

Patch the agent

Feed accepted traces back into prompts, policies, tools, memory rules, and regression suites.

04

Publish the loop

Turn private lessons into books, leaderboards, benchmarks, and reusable patterns for other builders.
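As a minimal sketch of steps 01 to 03, the Python below records a run as a trace, filters traces by judge scores, and turns the survivors into regression cases. Everything in it is an assumption for illustration: the Trace fields, the accept rule, the 0.8 threshold, and the tool names are hypothetical, not the schema or selection policy used in any of the systems described here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trace:
    """One recorded agent run (hypothetical schema)."""
    task: str
    tool_calls: list = field(default_factory=list)    # ordered tool names
    judge_scores: list = field(default_factory=list)  # each score in [0, 1]
    human_correction: Optional[str] = None            # set if a human fixed the output


def accept(trace: Trace, min_score: float = 0.8) -> bool:
    """Step 02: keep a trace only if it was judged, every judge clears
    the bar, and no human had to correct it (an assumed policy)."""
    if not trace.judge_scores or trace.human_correction is not None:
        return False
    return min(trace.judge_scores) >= min_score


def build_regression_suite(traces, min_score: float = 0.8):
    """Step 03: turn accepted traces into regression cases that pin
    the tool path the agent is expected to reproduce."""
    return [
        {"task": t.task, "expected_tools": list(t.tool_calls)}
        for t in traces
        if accept(t, min_score)
    ]


traces = [
    Trace("refund order", ["lookup_order", "issue_refund"], [0.9, 0.85]),
    Trace("close account", ["lookup_user"], [0.95, 0.4]),  # judges disagree
    Trace("update address", ["lookup_user", "update_profile"], [0.9],
          human_correction="wrong field"),
]
suite = build_regression_suite(traces)
print(suite)
```

Only the first trace survives here: the second fails because one judge scores it below the threshold, and the third because a human corrected it. Taking the minimum judge score rather than the mean is a deliberately conservative choice, so a single dissenting judge is enough to keep a lucky demo out of the suite.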

Measuring agent behavior in the open.

Benchmarks should expose how systems behave under real pressure: multi-turn tasks, domain constraints, factuality, retrieval quality, and tool use. That is the work behind these public artifacts.

Featured benchmark

Agent Leaderboard v2

Enterprise-grade benchmark for AI agents: 30+ models, 500 real-world multi-turn support scenarios, and domains including banking, healthcare, investment, telecom, and insurance.

Reliability benchmark

Hallucination Index

A living index tracking factual reliability and hallucination rates across foundation models.

Full-stack AI, from models to communities.

I'm now part of Cisco, working on tokenomics and agent evaluations after Cisco acquired Galileo in May 2026. At Galileo (employee #26, Series A), I led open-source evaluations and developer relations and built the Agent Leaderboard, Hallucination Index, and BRAG model series.

Before Galileo, I was a founding engineer at Enterpret (employee #6, pre-seed), building NLP systems for customer feedback at scale, and a principal data scientist at TaskHuman, designing transformer-based semantic search. Earlier, I launched end-to-end AI initiatives at Morningstar as their first quantitative research hire.

My RAG evaluation research was featured in Andrew Ng's newsletter, and I joined the Latent Space Podcast to discuss agent evaluation. I was also named one of the Top AI Developers to Watch in 2023.

Education

IIT Bombay

M.Tech.

Cisco / Galileo

Agent evals + tokenomics

Now building at Cisco after Galileo's acquisition; previously employee #26 at Galileo.

Enterpret

Employee #6

Founding engineer building NLP systems for customer feedback intelligence.

Recognition

Top AI Developer

Named one of the Top AI Developers to Watch in 2023.

Applying AI at the edge of products.

May 2026 - present

Cisco

Working on tokenomics and agent evaluations after Cisco's acquisition of Galileo.

June 2023 - May 2026

Galileo

Employee #26. Led open-source evaluations and developer relations. Built the Agent Leaderboard, Hallucination Index, and BRAG model series, and authored five GenAI books.

2019 - present

Maxpool

Founded and grew a community of AI professionals building production GenAI systems.

Feb 2021 - May 2023

Enterpret

Employee #6, pre-seed. Built semantic search, reranking, text generation, MLOps, and NLP pipelines for customer feedback intelligence.

Sep 2020 - Dec 2020

Jina AI

Employee #6, seed. Contributed to the open-source neural search framework powering multi-modal search at scale.

2017 - 2020

TaskHuman and Morningstar

At TaskHuman, built semantic search and recommendation systems. At Morningstar, joined as the first quantitative research hire and built NLP extraction, sentiment, and quantitative ML workflows.

The loop gets better when more builders can inspect it.

I founded Maxpool as a community for AI engineers building production systems. The same idea runs through the books and benchmarks: make practice visible, inspectable, and useful beyond one team.

Let's build the next agent loop.

Reach out for agent systems, evaluation, writing, speaking, or communities around production AI.