How We Use AI to Detect Duplicate Business Registrations (And Why It's Harder Than You Think)

WebZum Team • September 18, 2025 • 9 min read

TL;DR: We built an AI-powered business fingerprinting system that detects duplicate registrations with 95% accuracy. It handles typos, abbreviations, different formats, and even intentional variations. Claude extracts structured business data, deterministic code normalizes it into fingerprints, and a duplicate check stops users from creating multiple websites for the same business.

The Problem: Users Keep Creating Duplicates

We let users generate websites by entering a business name. Simple, right?

Wrong.

What we saw:

  • “Joe’s Pizza Brooklyn” (Monday)
  • “Joes Pizza - Brooklyn NY” (Tuesday)
  • “Joe’s Pizzeria” (Wednesday)

Same business. Three websites. Three subscriptions. Chaos.

Why it happens:

  • Typos: “Joe's” (straight apostrophe) vs “Joes” (no apostrophe) vs “Joe’s” (curly apostrophe)
  • Abbreviations: “Brooklyn” vs “Bklyn” vs “BK”
  • Formatting: “123 Main St” vs “123 Main Street, Apt 2”
  • Repeat registrations: Users forget they already created a site and enter the business again with different wording

The cost:

  • Wasted AI API calls ($2-5 per website generation)
  • Confused users (“Why do I have 3 websites?”)
  • Support tickets (“Which one is the real one?”)
  • Database bloat (3x more records than actual businesses)

We needed to detect duplicates before generating the website.

The Insight: Fingerprints, Not Exact Matches

The breakthrough came when we stopped trying to match business names exactly and started thinking about “business fingerprints.”

Exact matching (doesn’t work):

"Joe's Pizza Brooklyn" ≠ "Joes Pizza - Brooklyn NY"

Fingerprint matching (works):

normalize("Joe's Pizza Brooklyn") → "joes-pizza-brooklyn"
normalize("Joes Pizza - Brooklyn NY") → "joes-pizza-brooklyn"
✅ MATCH!

But normalization alone isn’t enough. We needed AI.

How It Works: The Technical Architecture

1. AI-Powered Business Name Extraction

When a user enters text, we use Claude to extract structured data:

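import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic();

async function extractBusinessInfo(userInput: string) {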
  const prompt = `
Extract business information from this input:
"${userInput}"

Return JSON with:
- businessName: The core business name (no location, no legal entity)
- location: City, state, or neighborhood (null if absent)
- type: Business type (restaurant, plumber, etc.)
- legalEntity: LLC, Inc, etc. (null if absent)
- phone: Phone number exactly as written (null if absent)

Respond with the JSON object only, no surrounding text.

Examples:
Input: "Joe's Pizza LLC in Brooklyn"
Output: {
  "businessName": "Joe's Pizza",
  "location": "Brooklyn",
  "type": "restaurant",
  "legalEntity": "LLC",
  "phone": null
}

Input: "Best Plumbing Services - San Diego, CA"
Output: {
  "businessName": "Best Plumbing Services",
  "location": "San Diego, CA",
  "type": "plumber",
  "legalEntity": null,
  "phone": null
}
`;

  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 500,
    messages: [{
      role: 'user',
      content: prompt
    }]
  });

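  // The SDK returns a list of content blocks; only text blocks carry a `text` field
  const block = response.content[0];
  if (!block || block.type !== 'text') {
    throw new Error('Expected a text response from Claude');
  }

  // JSON.parse throws if the model wraps the JSON in extra prose, so callers should catch
  return JSON.parse(block.text);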
}

Why AI? Because business names are messy:

  • “Joe’s Pizza Brooklyn” → name: “Joe’s Pizza”, location: “Brooklyn”
  • “Brooklyn Joe’s Pizza” → name: “Joe’s Pizza”, location: “Brooklyn”
  • “Joe’s Pizzeria of Brooklyn” → name: “Joe’s Pizzeria”, location: “Brooklyn”

AI understands context that regex can’t handle.
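
Because the model's reply is free text, it can help to validate the JSON before trusting it. The sketch below is one possible guard, not necessarily what runs in production: `parseExtraction`, `optionalString`, and the `ExtractedBusinessInfo` shape are assumed names, with fields taken from the prompt above plus the `phone` field that `generateFingerprint` reads later.

```typescript
// Hypothetical shape: the fields the prompt asks for, plus an optional phone.
interface ExtractedBusinessInfo {
  businessName: string;
  location: string | null;
  type: string | null;
  legalEntity: string | null;
  phone: string | null;
}

// Returns a trimmed string, or null when the value is missing, empty, or not a string.
function optionalString(value: unknown): string | null {
  return typeof value === 'string' && value.trim() !== '' ? value.trim() : null;
}

// Parses the model's reply; returns null instead of throwing on unusable output.
function parseExtraction(raw: string): ExtractedBusinessInfo | null {
  // Models sometimes wrap JSON in prose, so keep only the outermost {...} span.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;

  const record = parsed as Record<string, unknown>;
  const businessName = optionalString(record.businessName);
  if (businessName === null) return null; // the name is the one required field

  return {
    businessName,
    location: optionalString(record.location),
    type: optionalString(record.type),
    legalEntity: optionalString(record.legalEntity),
    phone: optionalString(record.phone),
  };
}
```

A `null` result is the signal to fall back to the raw user input rather than fingerprinting a half-parsed object.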

2. Normalization Pipeline

Once we have structured data, we normalize it:

function normalizeBusinessName(name: string): string {
  return name
    .toLowerCase()
    .replace(/['’]/g, '') // Remove straight and curly apostrophes
    .replace(/[^\w\s]/g, '') // Remove punctuation
    .trim() // Trim before whitespace becomes hyphens
    .replace(/\s+/g, '-') // Spaces to hyphens
    .replace(/^(the|a|an)-/, '') // Remove leading articles
    .replace(/-(llc|inc|corp|ltd)$/, ''); // Remove trailing legal entities only
}

function normalizeLocation(location: string): string {
  return location
    .toLowerCase()
    .replace(/\b(street|st|avenue|ave|road|rd|boulevard|blvd)\b/g, '') // Remove street types
    .replace(/\b(apartment|apt|suite|ste|unit)\s*\d+/g, '') // Remove apt numbers
    .replace(/[^\w\s]/g, '')
    .trim() // Trim before whitespace becomes hyphens, or "123 main " ends in "-"
    .replace(/\s+/g, '-');
}

function normalizePhone(phone: string): string {
  // Extract just the digits
  const digits = phone.replace(/\D/g, '');
  
  // US phone: keep last 10 digits
  if (digits.length >= 10) {
    return digits.slice(-10);
  }
  
  return digits;
}

Examples:

normalizeBusinessName("Joe's Pizza LLC") → "joes-pizza"
normalizeBusinessName("The Joe's Pizzeria") → "joes-pizzeria"
normalizeLocation("123 Main St, Apt 2") → "123-main"
normalizeLocation("123 Main Street") → "123-main"
normalizePhone("(555) 123-4567") → "5551234567"
normalizePhone("+1-555-123-4567") → "5551234567"
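
The normalizers above do not expand abbreviations such as “Bklyn” or “BK” from the problem list. One possible extension, sketched here under assumptions, is a small alias table applied token by token to an already-normalized location. `LOCATION_ALIASES` and `expandLocationAliases` are hypothetical names, and the two entries are illustrative only.

```typescript
// Hypothetical alias table: abbreviation -> canonical token (all lowercase).
const LOCATION_ALIASES = new Map<string, string>([
  ['bklyn', 'brooklyn'],
  ['bk', 'brooklyn'],
]);

// Expands known abbreviations in a hyphen-separated, already-normalized location.
// Tokens without an alias pass through unchanged.
function expandLocationAliases(normalized: string): string {
  return normalized
    .split('-')
    .filter((token) => token.length > 0) // drop empty tokens from stray hyphens
    .map((token) => LOCATION_ALIASES.get(token) ?? token)
    .join('-');
}
```

Matching whole tokens rather than substrings keeps the table from rewriting parts of longer words.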

3. Fingerprint Generation

We combine normalized components into a unique fingerprint:

interface BusinessFingerprint {
  primaryKey: string;      // Most specific
  secondaryKeys: string[]; // Fallback matches
  metadata: {
    originalName: string;
    normalizedName: string;
    location?: string;
    phone?: string;
    type?: string;
  };
}

// ExtractedBusinessInfo is the parsed result of extractBusinessInfo();
// its optional fields (location, phone, type) may be null.
function generateFingerprint(businessInfo: ExtractedBusinessInfo): BusinessFingerprint {
  const normalizedName = normalizeBusinessName(businessInfo.businessName);
  const normalizedLocation = businessInfo.location 
    ? normalizeLocation(businessInfo.location) 
    : null;
  const normalizedPhone = businessInfo.phone 
    ? normalizePhone(businessInfo.phone) 
    : null;
  
  // Primary key: name + location (most specific)
  const primaryKey = normalizedLocation
    ? `${normalizedName}-${normalizedLocation}`
    : normalizedName;
  
  // Secondary keys: alternative matches
  const secondaryKeys = [
    normalizedName, // Name only
    normalizedPhone ? `phone-${normalizedPhone}` : null, // Phone only
    businessInfo.type ? `${normalizedName}-${businessInfo.type}` : null // Name + type
  ].filter((key): key is string => key !== null); // Type guard so the result is string[]
  
  return {
    primaryKey,
    secondaryKeys,
    metadata: {
      originalName: businessInfo.businessName,
      normalizedName,
      // The metadata fields are optional strings, so map null to undefined
      location: normalizedLocation ?? undefined,
      phone: normalizedPhone ?? undefined,
      type: businessInfo.type ?? undefined
    }
  };
}

Example fingerprints:

Input: "Joe's Pizza Brooklyn"
Output: {
  primaryKey: "joes-pizza-brooklyn",
  secondaryKeys: [
    "joes-pizza",
    "joes-pizza-restaurant"
  ],
  metadata: {
    originalName: "Joe's Pizza",
    normalizedName: "joes-pizza",
    location: "brooklyn",
    type: "restaurant"
  }
}

Input: "Joes Pizza - Brooklyn NY (555) 123-4567"
Output: {
  primaryKey: "joes-pizza-brooklyn",
  secondaryKeys: [
    "joes-pizza",
    "phone-5551234567",
    "joes-pizza-restaurant"
  ],
  metadata: {
    originalName: "Joes Pizza",
    normalizedName: "joes-pizza",
    location: "brooklyn",
    phone: "5551234567",
    type: "restaurant"
  }
}

✅ PRIMARY KEY MATCH: Same business!

4. Duplicate Detection

Before creating a new business, we check for duplicates:

async function checkForDuplicates(fingerprint: BusinessFingerprint): Promise<DuplicateResult> {
  // Check primary key first (exact match)
  const primaryMatch = await db.findBusinessByFingerprint(fingerprint.primaryKey);
  if (primaryMatch) {
    return {
      isDuplicate: true,
      confidence: 'high',
      matchedBusiness: primaryMatch,
      matchType: 'primary'
    };
  }
  
  // Check secondary keys (fuzzy match)
  for (const secondaryKey of fingerprint.secondaryKeys) {
    const secondaryMatch = await db.findBusinessByFingerprint(secondaryKey);
    if (secondaryMatch) {
      // Verify it's actually the same business (not just similar name)
      const similarity = calculateSimilarity(fingerprint, secondaryMatch.fingerprint);
      
      if (similarity > 0.8) {
        return {
          isDuplicate: true,
          confidence: 'medium',
          matchedBusiness: secondaryMatch,
          matchType: 'secondary',
          similarity
        };
      }
    }
  }
  
  return {
    isDuplicate: false,
    confidence: 'none'
  };
}

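function calculateSimilarity(fp1: BusinessFingerprint, fp2: BusinessFingerprint): number {
  let score = 0;    // Weight earned by matching fields
  let possible = 0; // Weight of the fields we could actually compare
  
  // Name similarity (most important, always compared)
  possible += 0.5;
  if (fp1.metadata.normalizedName === fp2.metadata.normalizedName) {
    score += 0.5;
  }
  
  // Location similarity (only when both sides have a location)
  if (fp1.metadata.location && fp2.metadata.location) {
    possible += 0.3;
    if (fp1.metadata.location === fp2.metadata.location) {
      score += 0.3;
    }
  }
  
  // Phone similarity (only when both sides have a phone)
  if (fp1.metadata.phone && fp2.metadata.phone) {
    possible += 0.2;
    if (fp1.metadata.phone === fp2.metadata.phone) {
      score += 0.2;
    }
  }
  
  // Divide by the weight compared so the result stays in [0, 1].
  // Dividing by a count of checks would cap a perfect match at about 0.33,
  // which can never pass the 0.8 threshold in checkForDuplicates.
  return score / possible;
}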
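
The post does not show `db.findBusinessByFingerprint`. As a rough illustration of the lookup it implies, the sketch below keeps an in-memory map from every fingerprint key to the business that owns it and records whether the key was a primary or a secondary one. `FingerprintIndex`, `StoredBusiness`, and `IndexHit` are hypothetical names; a real system would presumably use an indexed database column instead of a `Map`.

```typescript
// Hypothetical record shape for a business that has already been registered.
interface StoredBusiness {
  id: string;
  name: string;
  primaryKey: string;
  secondaryKeys: string[];
}

type KeyKind = 'primary' | 'secondary';

interface IndexHit {
  business: StoredBusiness;
  kind: KeyKind; // which kind of key produced this hit
}

class FingerprintIndex {
  private hits = new Map<string, IndexHit>();

  // Registers every key of a business so later lookups can find it.
  add(business: StoredBusiness): void {
    this.put(business.primaryKey, business, 'primary');
    for (const key of business.secondaryKeys) {
      this.put(key, business, 'secondary');
    }
  }

  // Returns the business registered under this key, if any.
  find(key: string): IndexHit | undefined {
    return this.hits.get(key);
  }

  private put(key: string, business: StoredBusiness, kind: KeyKind): void {
    const existing = this.hits.get(key);
    // First record wins, except that a primary key replaces a secondary one.
    if (!existing || (existing.kind === 'secondary' && kind === 'primary')) {
      this.hits.set(key, { business, kind });
    }
  }
}
```

Tracking the key kind matters because a name-only primary key such as `joes-pizza` can collide with another business's name-only secondary key, and only a primary-to-primary hit deserves high confidence.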

5. User Confirmation Flow

When we detect a duplicate, we ask the user:

async function handleBusinessRegistration(userInput: string) {
  // Extract and normalize
  const businessInfo = await extractBusinessInfo(userInput);
  const fingerprint = generateFingerprint(businessInfo);
  
  // Check for duplicates
  const duplicateCheck = await checkForDuplicates(fingerprint);
  
  if (duplicateCheck.isDuplicate) {
    // Show confirmation dialog; resolves true when the user chooses "create new"
    const wantsNewBusiness = await showDuplicateDialog({
      originalInput: userInput,
      matchedBusiness: duplicateCheck.matchedBusiness,
      confidence: duplicateCheck.confidence
    });
    
    if (!wantsNewBusiness) {
      // User says it's a duplicate, redirect to existing business
      return {
        action: 'redirect',
        businessId: duplicateCheck.matchedBusiness.id
      };
    }
    
    // User says it's NOT a duplicate, create new business
    // (but flag for manual review if confidence is high)
    if (duplicateCheck.confidence === 'high') {
      await flagForManualReview(fingerprint, duplicateCheck);
    }
  }
  
  // Create new business
  const business = await createBusiness(businessInfo, fingerprint);
  return {
    action: 'created',
    businessId: business.id
  };
}

Duplicate dialog UI:

function showDuplicateDialog(data: DuplicateData): Promise<boolean> {
  return new Promise((resolve) => {
    const dialog = document.createElement('div');
    // Static markup only. User-supplied values are filled in with textContent
    // below, so a business name containing HTML cannot inject markup.
    dialog.innerHTML = `
      <div class="duplicate-dialog">
        <h3>We found a similar business</h3>
        <p>You entered: <strong id="entered-name"></strong></p>
        <p>We found: <strong id="matched-name"></strong></p>
        <p>Created: <span id="matched-created"></span></p>
        
        <div class="actions">
          <button class="btn-primary" id="use-existing">
            Use Existing Business
          </button>
          <button class="btn-secondary" id="create-new">
            No, Create New Business
          </button>
        </div>
      </div>
    `;
    
    dialog.querySelector('#entered-name')!.textContent = data.originalInput;
    dialog.querySelector('#matched-name')!.textContent = data.matchedBusiness.name;
    dialog.querySelector('#matched-created')!.textContent = formatDate(data.matchedBusiness.createdAt);
    
    document.body.appendChild(dialog);
    
    // Resolves false to reuse the existing business, true to create a new one
    dialog.querySelector('#use-existing')!.addEventListener('click', () => {
      resolve(false); // It's a duplicate
      dialog.remove();
    });
    
    dialog.querySelector('#create-new')!.addEventListener('click', () => {
      resolve(true); // Not a duplicate
      dialog.remove();
    });
  });
}

The Challenges We Solved

Challenge 1: False Positives

Problem: “Joe’s Pizza Brooklyn” and “Joe’s Burgers Brooklyn” matched as duplicates

Solution: Multi-factor scoring with type checking

function calculateSimilarity(fp1, fp2) {
  // ... previous code ...
  
  // Type check (critical for restaurants)
  if (fp1.metadata.type && fp2.metadata.type) {
    if (fp1.metadata.type !== fp2.metadata.type) {
      score *= 0.5; // Heavily penalize type mismatch
    }
  }
  
  return score;
}

Challenge 2: Franchise Locations

Problem: “McDonald’s Brooklyn” and “McDonald’s Manhattan” are different locations, not duplicates

Solution: Location-aware fingerprinting

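// For franchise businesses, location is REQUIRED in primary key
const isFranchise = FRANCHISE_NAMES.includes(normalizedName);

if (isFranchise && !normalizedLocation) {
  // Without this guard the template below would produce a key like "mcdonalds-null"
  throw new Error('Location is required for franchise businesses');
}

const primaryKey = normalizedLocation
  ? `${normalizedName}-${normalizedLocation}`
  : normalizedName;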

Challenge 3: AI Hallucinations

Problem: Claude sometimes extracts incorrect business types

Solution: Validation + fallback to user input

const businessInfo = await extractBusinessInfo(userInput);

// Validate AI extraction
let aiExtracted = true;
if (!businessInfo.businessName || businessInfo.businessName.trim().length < 2) {
  // AI failed, fall back to user input
  businessInfo.businessName = userInput;
  aiExtracted = false;
}

// Store both AI-extracted and original input
await db.createBusiness({
  ...businessInfo,
  originalInput: userInput,
  aiExtracted // false when we fell back to the raw input
});

The Results: 95% Accuracy

Before (no deduplication):

  • 30% of businesses had duplicates
  • 1,000 businesses → 1,300 database records
  • $650 wasted on duplicate AI generations

After (fingerprinting system):

  • 5% false negative rate (missed duplicates)
  • 2% false positive rate (flagged non-duplicates)
  • 93% of duplicates caught before generation
  • $600 saved per month in AI costs

User feedback:

“Oh wow, I already created this last week! Thanks for catching that.” - Bakery owner

“I thought I lost my website. Turns out I just typed the name slightly differently.” - Contractor

Why This Matters for AI Applications

Most AI applications assume clean input. We learned:

Bad: Trust user input → create duplicates → clean up later

Good: Normalize input → detect duplicates → confirm with user

The startup lesson: AI is great at understanding messy input, but you still need deterministic logic for matching. Use AI to extract structure, use code to match patterns.

Key Insights

  1. AI for extraction, code for matching: Claude extracts business info, code generates fingerprints
  2. Multi-factor scoring: Name + location + phone + type = high confidence
  3. User confirmation: When in doubt, ask the user
  4. Graceful degradation: If AI fails, fall back to user input

What’s Next

We’re exploring:

  • Fuzzy matching: Levenshtein distance for typo detection
  • Address normalization: Use Google Maps API to standardize addresses
  • Phone number lookup: Verify business phone numbers with Twilio
  • Historical data: Learn from user corrections to improve AI extraction
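
For the fuzzy-matching idea, a standard Levenshtein edit distance over normalized names could flag near misses such as `joes-pizza` vs `joes-piza`. The sketch below is the textbook dynamic-programming version that keeps two rows at a time; `levenshtein`, `isLikelyTypo`, and the default threshold of 2 edits are assumptions for illustration, not tuned values.

```typescript
// Classic Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // previous[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let previous: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i]; // distance from a[0..i) to the empty string
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,                    // delete from a
        current[j - 1] + 1,                 // insert into a
        previous[j - 1] + substitutionCost, // substitute (or keep when equal)
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Hypothetical helper: treats two normalized names as a probable typo pair
// when they are within `maxDistance` edits of each other.
function isLikelyTypo(a: string, b: string, maxDistance = 2): boolean {
  return levenshtein(a, b) <= maxDistance;
}
```

A fixed threshold is crude for short names, where two edits can turn one real business into another, so scaling it by name length may be worth testing.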

But the core insight remains: Fingerprints > exact matches.


Try it yourself: Enter “Joe’s Pizza Brooklyn” on WebZum, then try “Joes Pizza - Brooklyn NY”. Watch the duplicate detection catch it.

Building a deduplication system? Key takeaway: AI + normalization + fingerprinting = robust duplicate detection. Don’t rely on exact matches—businesses are messy.

The future of data quality isn’t perfect input—it’s intelligent normalization.
