AI Engineering Anti-Patterns — Chip Huyen

June 04, 2025

Speaker Background

Former Snorkel AI / NVIDIA (core developer on NeMo generative AI framework)
Founded an AI infrastructure startup
Taught ML Systems Design at Stanford
Author: Designing Machine Learning Systems (2022, Amazon bestseller, 10+ language translations)
Author: AI Engineering (most-read book on O’Reilly platform since launch)
Currently building an AI-powered story production studio (Entanglements of Never End)

Traditional ML Engineering

Build models from scratch (spam detection, fraud detection, recommender systems)
Requires deep ML theory knowledge: gradient descent, overfitting/underfitting
Model is the core artifact; data collection and training are prerequisites
Deployed as a feature embedded within an existing product (e.g., Spotify recommendations, bank fraud detection)

AI Engineering

Build applications on top of foundation models (GPT, Claude, Gemini, etc.)
Foundation model = broader term than LLM; encompasses multi-modal inputs (text, image, video, audio)
Reversed development flow: Idea → Prototype → Data collection → (optionally) Fine-tune
Can deploy a standalone app without a parent product
Lower barrier to entry — ML theory not a hard prerequisite
Competitive moat shifts to product sense and proprietary data, not model ownership

Combining Traditional ML and AI Engineering

Not either/or: most real-world GenAI systems combine both
Ratio shifting: used to be ~50/50, now often 70–90% AI engineering
Example: Customer support chatbot architecture
- Input classifier (traditional ML): Is the request appropriate? Is it answerable by FAQ?
- Foundation model call (AI engineering): Generate a contextual response
- Output classifier (traditional ML): Does the response contain PII or sensitive data?
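The three-stage architecture above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `classify_input`, `contains_pii`, `handle_request`, the `FAQ` table, and the keyword and regex checks are all hypothetical stand-ins for trained classifiers, and `call_model` is a placeholder for whichever foundation model API is actually used.

```python
import re
from typing import Callable

# Hypothetical FAQ table and policy list; a real system would use trained classifiers.
FAQ = {"reset password": "Use the 'Forgot password' link on the sign-in page."}
BLOCKED_TERMS = ("exploit", "credit card dump")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def classify_input(request: str) -> str:
    """Stand-in input classifier: returns 'reject', 'faq', or 'generate'."""
    text = request.lower()
    if any(term in text for term in BLOCKED_TERMS):
        return "reject"
    if any(key in text for key in FAQ):
        return "faq"
    return "generate"


def contains_pii(response: str) -> bool:
    """Stand-in output classifier: flags email addresses only."""
    return EMAIL_PATTERN.search(response) is not None


def handle_request(request: str, call_model: Callable[[str], str]) -> str:
    """Route a request through input check, model call, and output check."""
    route = classify_input(request)
    if route == "reject":
        return "Sorry, I can't help with that request."
    if route == "faq":
        text = request.lower()
        return next(answer for key, answer in FAQ.items() if key in text)
    # Only open-ended requests reach the foundation model.
    response = call_model(request)
    if contains_pii(response):
        return "Sorry, I can't share that information."
    return response
```

Passing the model call in as a function keeps the classifiers testable without a live model, for example `handle_request("Where is my order?", lambda prompt: "Your order ships Friday.")`.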
AI is becoming part of the standard software engineering toolkit, like databases or JavaScript

Anti-Pattern 1: Using GenAI When Unnecessary

Example: Startup pitching GenAI to reduce customer electricity bills by 30%
The problem: Scheduling high-intensity tasks during off-peak hours is a classic optimization problem, solvable with greedy algorithms or classical automation, which are faster, cheaper, and more reliable than GenAI
Key question to ask: Could a simpler algorithm solve this? What’s the baseline?
GenAI adds value when the problem is inherently open-ended or language-driven, not when deterministic logic suffices
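To make the "what's the baseline?" question concrete, here is a minimal sketch of a greedy scheduler for the electricity example. It is an illustration under stated assumptions, not anything shown in the talk: the function names, the price list, and the task format (hours needed, power draw in kW) are hypothetical, each hour hosts at most one task, and tasks may be split across non-adjacent hours.

```python
from typing import Dict, List, Tuple

# Hypothetical task format: name -> (hours needed, power draw in kW)
Tasks = Dict[str, Tuple[int, float]]


def schedule_off_peak(hourly_price: List[float], tasks: Tasks) -> Dict[str, List[int]]:
    """Greedy baseline: give the most power-hungry tasks the cheapest hours."""
    total_hours = sum(hours for hours, _ in tasks.values())
    if total_hours > len(hourly_price):
        raise ValueError("not enough hours to schedule every task")
    # Hours sorted by price, ties broken by hour index for determinism.
    cheapest_first = sorted(range(len(hourly_price)), key=lambda h: (hourly_price[h], h))
    # Tasks sorted by power draw, highest first.
    by_power = sorted(tasks.items(), key=lambda item: (-item[1][1], item[0]))
    plan: Dict[str, List[int]] = {}
    cursor = 0
    for name, (hours, _kw) in by_power:
        plan[name] = sorted(cheapest_first[cursor:cursor + hours])
        cursor += hours
    return plan


def plan_cost(hourly_price: List[float], tasks: Tasks, plan: Dict[str, List[int]]) -> float:
    """Total cost of a plan: power draw times price, summed over assigned hours."""
    return sum(tasks[name][1] * hourly_price[h] for name, hours in plan.items() for h in hours)
```

A baseline like this runs in milliseconds and gives a cost figure that any GenAI proposal would need to beat.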

Anti-Pattern 2: Confusing Bad Product with Bad AI

Three real-world examples where the bottleneck was product design, not model quality:

Example 1: Meeting summarization tool
Initial assumption: Users want summaries; debate was over length (3 sentences vs. 5–7)
Reality: Users don’t care about summaries — they want per-person action items
Fix: Model output was restructured to extract individualized next steps per attendee

Example 2: Job-fit chatbot
Initial assumption: Users want accurate fit/no-fit judgments
Reality: Blunt rejection is discouraging; users want a roadmap
- Gap analysis: What skills are missing?
- Course/project recommendations
- Alternative job suggestions
The chatbot should behave more like a recommender system than a binary classifier

Example 3: Tax chatbot
Initial assumption: Users avoid the bot due to privacy concerns
Reality: Two different problems
1. Users don’t know enough about tax to even formulate questions (blank-box problem)
2. Users hate typing
Fix: Replaced open-ended chat with pre-populated clickable questions — dramatically increased engagement
Takeaway: The technical part is rarely the hardest part; understanding how users interact with the product is

Anti-Pattern 3: Starting Too Complex

Companies get stuck debating which vector database to use (Pinecone vs. Weaviate vs. Chroma, etc.)
In practice, switching vector databases yields minimal performance gain
What actually moves the needle in RAG: Data preprocessing quality
- Chunking strategy — naïve splits break semantic continuity
- Prepending summaries or extracted entities to chunks to preserve context across boundaries
- Converting documents to Q&A format before indexing (high-impact, low-glamour)
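The first two preprocessing ideas above can be sketched as follows: split on paragraph boundaries rather than at fixed character offsets, then prepend document-level context to every chunk. This is a minimal sketch under assumptions. The function names, the word budget, and the header format are hypothetical, and the summary is assumed to come from elsewhere (written by hand or generated by a model).

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 50) -> List[str]:
    """Group whole paragraphs into chunks of roughly max_words words.

    Paragraphs are never cut in the middle, so a single long paragraph
    can produce a chunk larger than max_words.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        words = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def add_context(chunks: List[str], doc_title: str, doc_summary: str) -> List[str]:
    """Prepend the document title and summary so each chunk stands on its own."""
    header = f"Document: {doc_title}\nSummary: {doc_summary}"
    return [f"{header}\n\n{chunk}" for chunk in chunks]
```

The contextualized chunks, not the raw ones, would then be embedded and indexed in whichever vector database is already in use.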
Similarly, chasing new agent frameworks (LangChain, LlamaIndex, etc.) introduces risk:
- Abstraction layers contain bugs (including typo-riddled default prompts)
- An unannounced prompt fix in a dependency can silently change your app’s behavior
- Makes debugging regressions very hard

Anti-Pattern 4: Foregoing Human Evaluation

Why evaluation is hard: GenAI outputs are open-ended, so there is no ground truth for poems, summaries, or essays
LLM-as-Judge is now nearly universal for open-ended tasks (coding, by contrast, can be checked for functional correctness)
- Use one model (e.g., Gemini) to score outputs of another (e.g., Claude)
- Prompt the judge with rubric + few-shot examples
Limitations of LLM-as-Judge:
- Non-deterministic: same input can score 3 today, 4 tomorrow
- Prompt drift: if the judge prompt changes (especially via a third-party eval tool), scores shift invisibly
- Scores become meaningless without knowing whether the judge itself changed
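One way to make judge changes visible is to store a fingerprint of the judge prompt and the judge model name alongside every score. The sketch below is one possible approach, not a method described in the talk; `JudgeScore`, `fingerprint`, `record_score`, and `comparable` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    score: int
    judge_model: str
    prompt_fingerprint: str


def fingerprint(judge_prompt: str) -> str:
    """Short, stable hash of the exact judge prompt text."""
    return hashlib.sha256(judge_prompt.encode("utf-8")).hexdigest()[:12]


def record_score(score: int, judge_model: str, judge_prompt: str) -> JudgeScore:
    """Attach judge identity to a score so later comparisons can be checked."""
    return JudgeScore(score, judge_model, fingerprint(judge_prompt))


def comparable(a: JudgeScore, b: JudgeScore) -> bool:
    """Two scores are comparable only if the same judge and prompt produced them."""
    return a.judge_model == b.judge_model and a.prompt_fingerprint == b.prompt_fingerprint
```

This does not address non-determinism, but it does flag when a score shift coincides with a change to the judge prompt or model.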
Mitigation: Always pair AI judges with human evaluation
- Daily or weekly sampling (30 to 1,000 samples depending on scale)
- Human review grounds the signal and catches distribution shifts
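The sampling step above can be sketched as drawing a reproducible random sample for human review and then measuring how often the AI judge agrees with the reviewers. This is an illustrative sketch only: the function names, the integer score format, and the `tolerance` parameter are assumptions.

```python
import random
from typing import Dict, List, Sequence


def sample_for_review(record_ids: Sequence[str], k: int, seed: int) -> List[str]:
    """Draw up to k record IDs at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(record_ids), min(k, len(record_ids)))


def agreement_rate(
    judge_scores: Dict[str, int], human_scores: Dict[str, int], tolerance: int = 0
) -> float:
    """Fraction of human-reviewed records where the judge is within tolerance."""
    shared = [rid for rid in human_scores if rid in judge_scores]
    if not shared:
        raise ValueError("no records were scored by both judge and human")
    agree = sum(
        1 for rid in shared if abs(judge_scores[rid] - human_scores[rid]) <= tolerance
    )
    return agree / len(shared)
```

Tracking this rate over successive samples gives a simple signal: a falling agreement rate suggests either the judge or the input distribution has shifted.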
Greg Brockman (OpenAI president, former Stripe CTO) quote: manual data inspection has the highest value-to-prestige ratio in ML work — engineers undervalue it because it feels unglamorous

Anti-Pattern 5: Overindexing on Early Demo Success

GenAI makes it trivially easy to build impressive demos (hours to a weekend)
This creates a dangerous illusion about how long production deployment takes
The 80/90/95% problem:
- Kovvena: 0→80% accuracy in 1 month; 80→90% took another full month
- LinkedIn: 0→80% in 1 month; 80→95% took 4 more months
- Some products never cross the threshold required for production viability
This nonlinear final stretch makes quarterly planning and milestone estimation nearly impossible
Rule: demo complexity is not a predictor of production complexity

Summary of Anti-Patterns

#	Anti-Pattern	Core Failure
1	Using GenAI when unnecessary	Technology selection bias
2	Confusing bad product with bad AI	Insufficient user research
3	Starting too complex	Over-engineering infrastructure
4	Foregoing human evaluation	Over-trusting automated metrics
5	Overindexing on early demo success	Misestimating delivery timelines

Key Takeaways

Learning AI models is a small part of the job
Skills that matter most:
- Rigorous engineering practices (version control for prompts, treat prompts as code)
- Ability to inspect and wrangle data manually
- Web dev proficiency — build demos fast to validate ideas before committing
- Product sense — understanding what users actually need
Evaluation pipeline is never “done” — user behavior and environment evolve continuously
Coding is the most popular GenAI application because it’s one of the few tasks with objective evaluation (compilation + unit tests) — a signal that teams should design applications around measurable outcomes where possible

Q&A

On management anti-patterns: Engineers must have the power to push back on misguided leadership; this is hard to mandate without cultural buy-in
On eval pipelines: Never “good enough” — must evolve as user base and external context shift
On identifying snake oil: Demand specific, named examples with real challenges; reject vague generalities
On getting engineers to learn: You can’t force it — intrinsic motivation is the only reliable driver