We Built a Quality Lab for AI-Generated Websites (And Test Every Component)
TL;DR: We built an internal evaluation framework that systematically tests every AI-generated component of a website—strategy, brand colors, headers, footers, content sections—across multiple models. Each evaluation produces a self-contained HTML report with live previews, dimension scores, and head-to-head model comparisons. This is how we ship AI that doesn’t embarrass our users.
The Problem: AI Output Is a Black Box
When you generate an entire website with AI, you’re making dozens of independent AI calls: strategy planning, brand color selection, header generation, content sections, footers. Each call can fail in subtle ways:
- A header that looks fine on desktop but breaks on mobile
- Brand colors that work for a bakery but look wrong for a law firm
- A footer that invents a phone number the business never provided
- Content sections with template syntax leaking into the output
The standard approach? Ship it and wait for complaints. We needed something better.
The Evaluation Framework
We built six evaluation tools, one for each major AI output in our pipeline. Each tool:
- Generates output for 10-30 diverse businesses (plumber, bakery, law firm, gym, salon, etc.)
- Runs quality checks (structural, visual, data integrity)
- Scores on multiple dimensions using a separate AI evaluator
- Produces an HTML report with live rendered previews
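In pseudocode, every tool follows the same loop. Here `generate`, `check`, and `evaluate` are hypothetical stand-ins for the real pipeline calls, and the business list is trimmed for illustration:

```javascript
// Minimal evaluation-runner sketch (not the actual internals).
const businesses = ['plumber', 'bakery', 'law firm', 'gym', 'salon'];

async function runEvaluation({ generate, check, evaluate }) {
  const results = [];
  for (const business of businesses) {
    const output = await generate(business); // AI-generated component
    const issues = check(output);            // structural / data-integrity checks
    const scores = await evaluate(output);   // dimension scores from a separate evaluator
    results.push({ business, output, issues, scores });
  }
  return results;
}
```

The important property is that every tool emits the same result shape, so one report template can render all six.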
What We Test
| Component | Businesses Tested | Key Checks |
|---|---|---|
| Strategy | 3+ with real data | Strategic clarity, page structure, CTA distribution, SEO value |
| Brand Colors | 30 diverse types | Color-industry match, palette cohesion, customer trust |
| Headers | 30 businesses | Mobile menu presence, script blocks, inline style contamination |
| Footers | 30 businesses | Data invention detection, copyright, proper structure |
| Sections | 60 (10 × 6 variants) | Form elements, background handling, template syntax leaks |
| Enrichment | 15 real businesses | Contact completeness, SEO value, trust data, competitive intel |
Strategy Evaluation: Model Head-to-Head
The strategy step is the most expensive and impactful call in our pipeline. It determines page structure, section briefs, CTA placement, and keyword targeting. Getting it wrong cascades into failures downstream.
We test four models head-to-head:
```javascript
const models = [
  { name: 'Haiku 4.5', provider: 'anthropic', model: 'claude-haiku-4-5-20251001' },
  { name: 'Opus 4.5', provider: 'anthropic', model: 'claude-opus-4-5-20250115' },
  { name: 'DeepSeek Chat', provider: 'deepseek', model: 'deepseek-chat' },
  { name: 'DeepSeek Reasoner', provider: 'deepseek', model: 'deepseek-reasoner' },
];
```
Each model generates a complete website strategy for the same business. Then a separate AI evaluator (Opus 4.5, temperature 0.3) scores each strategy on six dimensions, each from 1 to 10:
- Strategic Clarity: Is the primary goal clear? Are priorities well-ordered?
- Page Structure: Are pages logical, well-named, and purposeful?
- Section Quality: Are sections well-briefed with clear AI instructions?
- CTA Strategy: Are CTAs strategic and well-distributed (not excessive)?
- SEO Value: Are keywords relevant and specific?
- Audience Alignment: Does the strategy match the business type and audience?
The report shows average scores, win counts, generation speed, token usage, and cost per call—per model.
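Tallying the head-to-head can be sketched as follows. The dimension keys and helper names are illustrative, not our exact internals:

```javascript
// Six scoring dimensions, each rated 1-10 by the evaluator model.
const DIMENSIONS = [
  'strategicClarity', 'pageStructure', 'sectionQuality',
  'ctaStrategy', 'seoValue', 'audienceAlignment',
];

// Mean of the six dimension scores for one model's strategy.
function averageScore(scores) {
  const total = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
  return total / DIMENSIONS.length;
}

// resultsByModel: { 'Opus 4.5': { strategicClarity: 8, ... }, ... }
// Returns the model with the highest average score.
function pickWinner(resultsByModel) {
  return Object.entries(resultsByModel)
    .map(([model, scores]) => ({ model, avg: averageScore(scores) }))
    .sort((a, b) => b.avg - a.avg)[0];
}
```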
Brand Color Evaluation: Does It Look Right?
This one surprised us. AI models are decent at picking colors that technically work, but terrible at picking colors that feel right for a specific industry.
Our brand pipeline has three stages:
- AI generates color descriptions in natural language (no hex codes)
- AI converts descriptions to hex values
- Algorithm matches to the nearest DaisyUI theme
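Stage 3 is a plain nearest-color match. A minimal sketch, assuming squared RGB distance and an illustrative set of theme primaries (the theme names are real DaisyUI themes, but the color values here are invented for the example):

```javascript
// Illustrative theme palette -- NOT the actual DaisyUI primaries.
const themes = [
  { name: 'corporate', primary: [59, 130, 246] }, // blue
  { name: 'autumn',    primary: [139, 30, 30] },  // deep red
  { name: 'garden',    primary: [94, 160, 60] },  // green
];

function hexToRgb(hex) {
  const n = parseInt(hex.replace('#', ''), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// Pick the theme whose primary is closest in RGB space.
function nearestTheme(hex) {
  const [r, g, b] = hexToRgb(hex);
  let best = null;
  for (const t of themes) {
    const [tr, tg, tb] = t.primary;
    const d = (r - tr) ** 2 + (g - tg) ** 2 + (b - tb) ** 2;
    if (!best || d < best.d) best = { name: t.name, d };
  }
  return best.name;
}
```

Squared Euclidean distance in RGB is crude (a perceptual space like Lab would match human judgment better), but it is cheap and deterministic, which matters when the step runs on every generation.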
We test 30 businesses and score 1-5 on: Does the primary color match the business type? Is the palette cohesive? Would customers trust this? Does the DaisyUI theme complement the brand?
The report renders actual color swatches and preview website headers so we can visually scan for obvious mismatches. Anything scoring below 3 gets flagged.
The Data Invention Problem
Our most critical check: does the AI fabricate contact information?
When generating footers, the AI sometimes invents a phone number or email address that the business never provided. This is catastrophic—a customer calls a fake number, or emails a nonexistent address.
```javascript
// Footer evaluation: flag invented data
const hasInventedPhone = footer.includes('555-') ||
  (footer.match(/\d{3}[-.]?\d{3}[-.]?\d{4}/) && !businessData.phone);
const hasInventedEmail = footer.includes('@example.com') ||
  (footer.match(/[\w.-]+@[\w.-]+\.\w+/) && !businessData.email);

if (hasInventedPhone || hasInventedEmail) {
  status = 'BROKEN'; // Hard fail
}
```
We run this check across 30 businesses with varying data availability—some have phone numbers, some don’t, some have email only. If the AI invents data even once, it’s flagged as BROKEN.
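Wrapped as a reusable function, the check yields a per-footer verdict (the function name is ours for illustration):

```javascript
// Returns 'BROKEN' if the footer contains a phone or email the
// business never provided, 'CLEAN' otherwise.
function checkFooter(footer, businessData) {
  const hasInventedPhone = footer.includes('555-') ||
    (/\d{3}[-.]?\d{3}[-.]?\d{4}/.test(footer) && !businessData.phone);
  const hasInventedEmail = footer.includes('@example.com') ||
    (/[\w.-]+@[\w.-]+\.\w+/.test(footer) && !businessData.email);
  return hasInventedPhone || hasInventedEmail ? 'BROKEN' : 'CLEAN';
}
```

Note the asymmetry: a phone-like string is only a failure when the business data has no phone. A footer that repeats a real, provided number passes.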
Section Variants: 6 Ways to Break
Content sections are our most complex output. We test six variants for each business:
- Text-only: No images, pure content
- With stock photo: Two-column layout
- With original photo: Business’s own photos
- Contact form: Interactive form with anti-spam
- Area of operation: Embedded Google Maps
- Visual break: Hero image with overlay
That’s 60 sections per evaluation run. Each gets structural checks (has a `<section>` wrapper, no inline styles, no template syntax contamination) plus live rendered previews at desktop, tablet, and mobile breakpoints.
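The structural checks amount to a few regex tests per section (illustrative, not our exact rules):

```javascript
// Returns a list of structural issues found in one generated section.
function checkSection(html) {
  const issues = [];
  // Must be wrapped in a top-level <section> element
  if (!/^\s*<section[\s>]/.test(html)) issues.push('missing <section> wrapper');
  // Inline styles should have been stripped by post-processing
  if (/style\s*=\s*"/.test(html)) issues.push('inline styles present');
  // Unreplaced template placeholders like {{businessName}}
  if (/\{\{\s*\w+\s*\}\}/.test(html)) issues.push('template syntax leaked');
  return issues;
}
```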
Post-Processing Pipeline
A critical insight: AI models make predictable mistakes. Instead of fighting the model, we fix outputs systematically:
- Inline styles: Stripped and converted to Tailwind classes
- Tailwind numeric classes: `text-[16px]` → proper Tailwind scale
- Hex colors: Replaced with DaisyUI theme variables
- Template syntax: `{{businessName}}` contamination detected and flagged
Our evaluation tracks which outputs needed post-processing fixes (blue tags), which had issues (yellow), and which were broken beyond repair (red). Over time, this data tells us whether our prompts are improving or degrading.
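A minimal sketch of such a pass. The size map and the hex-to-theme mapping are simplified stand-ins for the real rules, which cover the full Tailwind scale and all DaisyUI variables:

```javascript
// Illustrative subset of the pixel-to-scale mapping.
const SIZE_MAP = { '14px': 'text-sm', '16px': 'text-base', '18px': 'text-lg', '20px': 'text-xl' };

// Rewrites known AI mistakes and reports whether anything changed,
// so the evaluation can tag the output CLEAN or FIXED.
function postProcess(html) {
  const fixed = html
    // text-[16px] -> text-base, etc.; unknown sizes are left untouched
    .replace(/text-\[(\d+px)\]/g, (m, px) => SIZE_MAP[px] ?? m)
    // bg-[#1d4ed8] -> theme variable class (simplified mapping)
    .replace(/bg-\[#[0-9a-fA-F]{3,6}\]/g, 'bg-primary');
  return { html: fixed, tag: fixed !== html ? 'FIXED' : 'CLEAN' };
}
```

Because `postProcess` returns a tag alongside the cleaned output, the fix rate falls out of the evaluation for free.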
The Report Format
Every evaluation produces a self-contained HTML file. No build step, no server—just open the file. Each report includes:
- Stats cards: Pass/fail counts, average scores, color-coded
- Per-item cards: Detailed view with dimension scores
- Live iframes: Rendered output at multiple breakpoints
- Theme switcher: Test across DaisyUI themes
- Issue tags: Color-coded (CLEAN / FIXED / BROKEN)
- Collapsible JSON: Raw source data for debugging
We store these in scripts/evaluations/ and review them before any prompt change ships.
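Generating such a report is deliberately boring: one function building a single HTML string, written straight to disk. A simplified sketch (field names are illustrative):

```javascript
// Renders evaluation results as one self-contained HTML document.
function renderReport(results) {
  const cards = results.map((r) => `
    <div class="card ${r.tag.toLowerCase()}">
      <h3>${r.business}</h3>
      <p>Score: ${r.score}/10 (${r.tag})</p>
      <details><summary>Raw JSON</summary><pre>${JSON.stringify(r, null, 2)}</pre></details>
    </div>`).join('\n');
  return `<!doctype html><html><body><h1>Evaluation Report</h1>${cards}</body></html>`;
}

// Usage: fs.writeFileSync('scripts/evaluations/report.html', renderReport(results));
```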
What We Learned
1. AI quality is measurable. You don’t have to guess. Build evaluation criteria, score systematically, track over time.
2. Test at the component level. Testing a full website is too coarse. Test headers, footers, sections, strategy independently. Failures in one component don’t mask successes in others.
3. Data invention is the #1 risk. AI models will confidently fabricate phone numbers, emails, and addresses. You must check for this explicitly.
4. Post-processing is not a hack—it’s a feature. AI outputs need cleanup. Tracking what you fix tells you where your prompts need work.
5. Evaluation is a product. Our evaluation reports are now the first thing we check when testing a new model or prompt change. They’ve prevented at least a dozen regressions from shipping.
The Pipeline
Prompt Change → Run Evaluation Scripts → Review HTML Reports →
Compare Scores → Fix Regressions → Ship
This loop runs before every significant prompt update. It takes about 10 minutes per evaluation suite and has caught issues that would have taken days to surface from user reports.
Try It
Every website on WebZum is built by the AI pipeline we test with this framework. The headers, footers, sections, colors, and strategy have all been evaluated across dozens of business types before they ever reached your site.