Robust agent engineering through extreme vibe coding.
I build agent systems where evaluation is not the ending. It is the feedback signal for the next version.
I'm Pratik Bhavsar, working across tools, memory, judges, datasets, benchmarks, and feedback loops. The goal is simple: turn agent behavior into better agents.
I build the loop where agent traces become better agents.
The interesting work starts after an agent runs: capture the tool path, memory hits, judge disagreements, failed actions, and production edge cases. Those traces become the raw material for the next prompt, policy, tool, memory rule, dataset, and benchmark.
That is the thread across Galileo, open-source leaderboards, books, and community work: evals are not a separate phase. They are the control system that lets agents improve from one run to the next.
Instrument the run
Capture tool calls, retrievals, memory writes, human corrections, and judge signals as first-class evidence.
Select what survives
Use judges, datasets, and benchmark pressure to separate durable behavior from lucky demos.
Patch the agent
Feed accepted traces back into prompts, policies, tools, memory rules, and regression suites.
Publish the loop
Turn private lessons into books, leaderboards, benchmarks, and reusable patterns for other builders.
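The instrument → select → patch steps above can be sketched as a small pipeline. This is a minimal illustration, not the actual Galileo stack: the names AgentTrace, select_traces, and patch_regression_suite are hypothetical stand-ins for whatever trace store, judge ensemble, and regression suite you already run.

```python
from dataclasses import dataclass

# Hypothetical trace record: the evidence captured from one agent run.
@dataclass
class AgentTrace:
    task_id: str
    tool_calls: list[str]            # ordered tool path the agent took
    memory_hits: list[str]           # memory entries it retrieved along the way
    judge_scores: dict[str, float]   # score per judge for this run
    succeeded: bool                  # did the final action land?


def select_traces(traces: list[AgentTrace],
                  min_score: float = 0.8,
                  max_disagreement: float = 0.2) -> list[AgentTrace]:
    """Keep runs where the action succeeded and the judges roughly agree."""
    kept = []
    for trace in traces:
        scores = list(trace.judge_scores.values())
        if not scores or not trace.succeeded:
            continue
        agreed = (max(scores) - min(scores)) <= max_disagreement
        if agreed and min(scores) >= min_score:
            kept.append(trace)
    return kept


def patch_regression_suite(suite: list[dict], accepted: list[AgentTrace]) -> list[dict]:
    """Fold accepted traces back into the suite as fixed regression cases."""
    for trace in accepted:
        suite.append({
            "task_id": trace.task_id,
            "expected_tool_path": trace.tool_calls,
            "required_memory": trace.memory_hits,
        })
    return suite


if __name__ == "__main__":
    traces = [
        AgentTrace("refund-001", ["lookup_order", "issue_refund"], ["policy:refunds"],
                   {"judge_a": 0.92, "judge_b": 0.95}, succeeded=True),
        AgentTrace("refund-002", ["issue_refund"], [],
                   {"judge_a": 0.91, "judge_b": 0.40}, succeeded=True),  # judges disagree
    ]
    suite = patch_regression_suite([], select_traces(traces))
    print(f"{len(suite)} new regression case(s)")  # -> 1 new regression case(s)
```

The selection rule is the important design choice here: a trace only graduates into the regression suite when the action succeeded and the judges roughly agree, which is one way to keep lucky demos out of the loop.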
Books for building production AI systems.
The Mastering GenAI series turns hard-won lessons into readable field guides: agents, multi-agent systems, RAG, LLM-as-a-judge, and eval engineering.
Eval Engineering
The emerging discipline of building trust in production AI systems. The book is structured as a practical path from evaluation basics to LLM-as-judge systems, SME refinement, SLM scaling, and production guardrails.
Mastering Multi-Agent Systems
Design patterns for systems where agents coordinate, specialize, and hand off work.
Agent systems
Mastering AI Agents
A practical guide to building agents with tools, memory, planning, and evaluation.
240+ pages
Mastering RAG 2.0
Modern retrieval, generation, and evaluation patterns for grounded AI systems.
Judge pipelines
Mastering LLM-as-a-Judge
How to design, calibrate, and trust model-based evaluation workflows.
Measuring agent behavior in the open.
Benchmarks should expose how systems behave under real pressure: multi-turn tasks, domain constraints, factuality, retrieval quality, and tool use. That is the work behind these public artifacts.
Agent Leaderboard v2
Enterprise-grade benchmark for AI agents: 30+ models, 500 real-world multi-turn support scenarios, and domains including banking, healthcare, investment, telecom, and insurance.
BRAG
Fine-tuned open-source LLMs that pushed state-of-the-art quality on retrieval-augmented generation and evaluation workloads.
Full-stack AI, from models to communities.
I'm a full-stack AI engineer at Galileo (employee #26, Series A), leading open-source evaluations and developer relations. I built the Agent Leaderboard, an open benchmark for AI agents in real-world tasks, along with the Hallucination Index and the BRAG model series for RAG evaluation.
Before Galileo, I was founding engineer at Enterpret (employee #6, pre-seed) building NLP systems for customer feedback at scale, and a principal data scientist at TaskHuman designing semantic search using transformers. Earlier, I launched end-to-end AI initiatives at Morningstar as their first quantitative research hire.
My RAG evaluation research was featured in Andrew Ng's newsletter, and I joined the Latent Space Podcast to discuss agent evaluation. I was also named one of the Top AI Developers to Watch in 2023.
IIT Bombay
M.Tech.
Employee #26
Open-source evals, developer relations, books, benchmarks, and agent reliability work.
Employee #6
Founding engineer building NLP systems for customer feedback intelligence.
Top AI Developer
Named one of the Top AI Developers to Watch in 2023.
Learning in public sharpens the work.
I share what I learn through podcasts, talks, essays, community sessions, and technical guides. Speaking and writing force the ideas to survive contact with practitioners.
Ranking Agentic LLMs
Galileo Agent Leaderboard, agent evaluation, benchmarks, and measuring real AI behavior.
02 / DAIR.AI / 1.3K views
AI Agent Evaluation
How to test agent behavior beyond one-shot LLM answers.
03 / Galileo Live / 213 views
Behind Agent Leaderboard 2.0
What it takes to build a public agent benchmark for real workflows.
04 / DAIR.AI / 3.8K views
101 Ways to Solve Search
Search systems, retrieval quality, and the engineering choices behind relevance.
05 / WiMLDS / 2.6K views
Modeling Fallacies in NLP
NLP modeling, data, and the traps that show up in language systems.
06 / PyData Mumbai / 538 views
Automated Machine Learning
Earlier work on AutoML and practical model-building workflows.
Applying AI at the edge of products.
Galileo
Employee #26. Leading open-source evaluations and developer relations. Built the Agent Leaderboard, the Hallucination Index, and the BRAG model series, and authored five GenAI books.
Maxpool
Founded and grew a community of AI professionals building production GenAI systems.
Enterpret
Employee #6, pre-seed. Built semantic search, reranking, text generation, MLOps, and NLP pipelines for customer feedback intelligence.
Jina AI
Employee #6, seed. Contributed to the open-source neural search framework powering multi-modal search at scale.
TaskHuman and Morningstar
At TaskHuman, built semantic search and recommendation systems. At Morningstar, joined as the first quantitative research hire and built NLP extraction, sentiment, and quantitative ML workflows.
The loop gets better when more builders can inspect it.
I founded Maxpool as a community for AI engineers building production systems. The same idea runs through the books and benchmarks: make practice visible, inspectable, and useful beyond one team.
Maxpool
Practitioners sharing research, tools, and real-world lessons.
Newsletter
Pakodas
Notes on technology, business, AI engineering, and what changes when AI reaches production.
Technical blog
Galileo writing
Deep dives on evaluation, model behavior, RAG quality, and agent benchmarks.
Archive
Older essays
Writing across AI engineering, startups, and technical learning.
Let's build the next agent loop.
Reach out for agent systems, evaluation, writing, speaking, or communities around production AI.