Full Stack AI Engineer · Author of the Mastering GenAI Series
I've worked across the AI stack: trained models, built evaluation datasets, designed benchmarks the community ships against, and written 100+ articles before distilling them into 5 books on production GenAI.
Five comprehensive guides on the most critical topics in modern AI engineering — starting with the latest, Eval Engineering. All free.
The emerging discipline of building trust in production AI systems. Master the science of AI evaluation — from metrics design to judge pipelines.
Part of the Galileo Mastering GenAI Series · All books available free online
I'm a full stack AI engineer at Galileo (employee #26, Series A), leading open-source evaluations and developer relations. I built the Agent Leaderboard, an open benchmark that evaluates AI agents on real-world tasks, along with the Hallucination Index and the BRAG model series for RAG evaluation.
Before Galileo, I was founding engineer at Enterpret (employee #6, pre-seed) building NLP systems for customer feedback at scale, and a principal data scientist at TaskHuman designing semantic search using transformers. Earlier, I launched end-to-end AI initiatives at Morningstar as their first quantitative research hire.
My RAG evaluation research was featured in Andrew Ng's newsletter, I was a guest on the Latent Space Podcast talking agent evaluation, and I've been named one of the Top AI Developers to Watch in 2023.
Benchmarks, datasets, and models built in the open. Because evaluation should never be locked away behind a paywall.
The enterprise-grade benchmark for AI agents. Tests 30+ LLMs across 500 real-world multi-turn support scenarios spanning banking, healthcare, investment, telecom, and insurance.
A living index tracking factual reliability and hallucination rates across foundation models. Gives engineering teams the data they need to pick the right model for accuracy-critical workloads.
A series of state-of-the-art open-source LLMs fine-tuned for retrieval-augmented generation. BRAG models outperform much larger models on RAG evaluation benchmarks while staying production-friendly.
From the Latent Space podcast to PyData to community livestreams. Speaking forces you to clarify your thinking in ways writing cannot.
Leading open-source evaluations and developer relations. Built the Agent Leaderboard (open benchmark for AI agents), the Hallucination Index, and the BRAG model series. Authored 5 books on GenAI published under the Mastering GenAI Series.
Founded and grew Maxpool, an open community of AI professionals building production GenAI systems. Built Max, an AI research assistant for the Discord community.
Built semantic search, reranking, text generation, MLOps, and NLP pipelines to categorize customer feedback across 50K+ topics.
Contributed to Jina, the open-source neural search framework powering multi-modal search at scale.
Led development of core semantic search using transformers for autocompletion and search. Developed and deployed an unsupervised deep learning recommendation engine.
First hire on the new quantitative research team. Led full-stack NLP with deep learning: financial document extraction, sentiment analysis, ML-based fund rating, and search/recommendation.
Founded in 2019, Maxpool is a thriving community of AI engineers building production systems. Join practitioners from around the world sharing research, tools, and real-world lessons.
I write about AI engineering, evaluation, and the business of AI. My Substack newsletter covers tech and business insights. Deep technical dives live on galileo.ai/blog.
Whether it's AI evaluation, production systems, speaking, or community, I'm always up for a good conversation.