How to Argue With a Language Model (And Win)

Learn structured experimentation for AI agents. Discover A/B testing, dataset curation, and evaluation methods to improve your agent's performance beyond guesswork.

Mastra, A/B testing, Dataset curation, Model evaluation, AI agents

Overview

You’ve built an AI agent, it mostly works, and now you’re stuck in a loop of tweaking prompts and hoping for the best. Sound familiar? In this talk we’ll move past vibes-based development and into structured experimentation. We’ll cover how to set up A/B tests for your agents, build and curate datasets from captured interactions or static data, and wire up evals that actually tell you whether your changes made things better or worse.

Tech stack

Mastra

Mastra: The open-source TypeScript framework for building and scaling AI agents, featuring durable workflows, RAG, and a unified router for 40+ LLM providers (e.g., OpenAI, Gemini).

Mastra is an open-source TypeScript framework for AI agents, launched in 2024 by Gatsby veterans Sam Bhagwat, Shane Thomas, and Abhi Aiyer. It provides autonomous Agents, graph-based Workflows for complex orchestration, and RAG pipelines for grounded responses. The unified model router connects to 40+ LLM providers (e.g., OpenAI, Gemini). Backed by $13M from investors like Y Combinator, Mastra also ships the pieces needed in production: built-in Evals, Observability, and a local development playground.

https://mastra.ai

A/B testing

A/B testing is a randomized controlled experiment: it compares two variants (A: Control, B: Variation) to determine which one produces a statistically significant lift in a key metric.

This methodology (also called split testing) is a standard way to compare two versions of a digital asset: a webpage, an email subject line, or a mobile app feature. The process randomly splits user traffic (e.g., 50/50) between the Control and the Variation and measures the impact on a specific business goal. Because assignment is random, a difference in the metric can be attributed to the change itself rather than to differences between the two groups of users. By testing elements like a new call-to-action button color or a revised headline, teams move optimization from 'we think' to 'we know.' Shipping only the changes that show a statistically significant gain in conversion rate, click-through rate, or revenue per visitor keeps optimization effort focused on what measurably works.

https://www.optimizely.com/optimization-glossary/ab-testing/
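Applied to agents, the same split-and-compare idea might look like the following minimal TypeScript sketch. It assumes each user is assigned to a prompt variant by hashing a stable id (a common stand-in for random assignment) and that success is a yes/no outcome per conversation. The names `assignVariant`, `twoProportionZ`, and `isSignificant` are hypothetical helpers for illustration, not part of Mastra or Optimizely.

```typescript
// Hypothetical helpers for illustration; not a Mastra or Optimizely API.

type Variant = "A" | "B";

// FNV-1a 32-bit hash, so the same user id always lands in the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Deterministic, roughly 50/50 split between Control (A) and Variation (B).
function assignVariant(userId: string): Variant {
  return fnv1a(userId) % 2 === 0 ? "A" : "B";
}

interface ArmResult {
  successes: number; // e.g. conversations rated as resolved
  trials: number;    // total conversations served by this variant
}

// Two-proportion z statistic using the pooled success rate.
function twoProportionZ(a: ArmResult, b: ArmResult): number {
  const rateA = a.successes / a.trials;
  const rateB = b.successes / b.trials;
  const pooled = (a.successes + b.successes) / (a.trials + b.trials);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / a.trials + 1 / b.trials),
  );
  // No variation at all (every trial succeeded or every trial failed).
  if (standardError === 0) return 0;
  return (rateB - rateA) / standardError;
}

// Two-sided test at the 5% level: |z| must exceed 1.96.
function isSignificant(z: number): boolean {
  return Math.abs(z) > 1.96;
}
```

Hashing the user id keeps a returning user on the same variant across sessions, which a fresh random draw per request would not. The 1.96 threshold corresponds to a two-sided 5% significance level; a stricter level would need a larger threshold.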

Dataset curation

Dataset curation is the systematic process of cleaning, labeling, and filtering raw data to build high-performance AI models.

Modern AI performance often depends more on data quality than on model architecture. Curation involves removing duplicates (deduplication), fixing label errors, and balancing class distributions to reduce bias. Platforms like Hugging Face and tools like Cleanlab help engineers audit data at scale, from millions of labeled rows to web-scale corpora such as the 15-trillion-token FineWeb dataset, so that training sets are diverse and accurate. By filtering out low-quality noise and PII, teams reduce compute costs and improve downstream benchmark results such as MMLU scores.

https://huggingface.co/docs/datasets/index
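For a dataset built from captured agent interactions, deduplication and PII filtering could be sketched as below. This is a minimal TypeScript example under stated assumptions: the `Example` shape and the `curate` function are hypothetical, exact-match deduplication on normalized text stands in for more robust near-duplicate detection, and the email pattern is a deliberately crude proxy for real PII detection.

```typescript
// Hypothetical row shape for a captured interaction; not a library type.
interface Example {
  input: string;    // what the user asked
  expected: string; // the reference answer used by evals
}

// Crude email matcher, used here only as a stand-in for PII detection.
const EMAIL_PATTERN = /[^\s@]+@[^\s@]+\.[^\s@]+/;

// Collapse case and whitespace so trivially different copies compare equal.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Keep the first occurrence of each distinct input; drop empty rows
// and rows that appear to contain an email address.
function curate(rows: Example[]): Example[] {
  const seen = new Set<string>();
  const kept: Example[] = [];
  for (const row of rows) {
    if (EMAIL_PATTERN.test(row.input) || EMAIL_PATTERN.test(row.expected)) {
      continue;
    }
    const key = normalize(row.input);
    if (key.length === 0 || seen.has(key)) {
      continue;
    }
    seen.add(key);
    kept.push(row);
  }
  return kept;
}
```

Keeping the first occurrence makes the result depend on input order, so sorting rows by capture time beforehand gives a reproducible dataset.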

Model evaluation

Model evaluation is the systematic process of using objective metrics and validation techniques to quantify a machine learning model's predictive accuracy and generalization performance.

Effective model evaluation moves beyond simple accuracy to show how a model handles unseen data. Practitioners rely on metrics tailored to the task: precision, recall, and F1-score for classification; Mean Squared Error (MSE) and R-squared for regression; and the silhouette coefficient for clustering. Beyond these numbers, robust evaluation uses validation strategies like k-fold cross-validation to detect overfitting, and diagnostic tools like confusion matrices to pinpoint specific error patterns. By applying these standards, teams ensure that models not only score well on held-out data but are also reliable enough for production environments where edge cases and class imbalances are the norm.

https://scikit-learn.org/stable/modules/model_evaluation.html
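The classification metrics named above can be computed directly from paired labels. The following is a minimal TypeScript sketch for the binary case; `binaryMetrics` is a hypothetical helper, not a scikit-learn or Mastra function, and returning 0 when a denominator is zero is one convention among several.

```typescript
interface BinaryMetrics {
  precision: number; // of the predicted positives, how many were correct
  recall: number;    // of the actual positives, how many were found
  f1: number;        // harmonic mean of precision and recall
}

// Hypothetical helper: compares ground-truth labels with predictions.
function binaryMetrics(actual: boolean[], predicted: boolean[]): BinaryMetrics {
  if (actual.length !== predicted.length) {
    throw new Error("actual and predicted must have the same length");
  }
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (let i = 0; i < actual.length; i++) {
    if (predicted[i] && actual[i]) truePositives++;
    else if (predicted[i] && !actual[i]) falsePositives++;
    else if (!predicted[i] && actual[i]) falseNegatives++;
  }
  // Assumption: report 0 instead of NaN when a denominator is zero.
  const predictedPositives = truePositives + falsePositives;
  const actualPositives = truePositives + falseNegatives;
  const precision =
    predictedPositives === 0 ? 0 : truePositives / predictedPositives;
  const recall = actualPositives === 0 ? 0 : truePositives / actualPositives;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

For an agent eval, `actual` might hold human judgments of whether each response was acceptable and `predicted` the verdicts of an automated scorer, which shows how far the scorer can be trusted before it is used to compare variants.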

AI agents

Autonomous software systems that leverage LLMs to reason, plan, and execute complex, multi-step goals across external tools and data sources.

AI agents are goal-driven, autonomous systems that handle entire workflows without constant human prompting. Unlike simple chatbots, agents use a core LLM for reasoning, integrate with external tools (APIs, CRMs, web browsers), and maintain memory to adapt and improve over time. Enterprises deploy them for high-value automation, such as a sales agent generating over 2,000 qualified leads monthly or a research agent analyzing 50 petabytes of clinical data for insights. This technology is about scaling complex decision-making and action, not just conversation.

https://cloud.google.com/vertex-ai
