Making Your Website Crawlable: Optimizing for Search Engines and AI Chatbots

With the rapid emergence of AI answer engines like ChatGPT Search, Perplexity, Claude, and Gemini, the landscape of web discovery has fundamentally shifted. Traditional SEO was about ranking in a list of blue links. Today, Answer Engine Optimization (AEO) is about ensuring that AI models can crawl your site, reconcile your identity, and accurately cite your URLs as references.

Recently, I set out with a clear objective: make this website fully crawlable, and ensure chatbots and search engines consistently reference and link back to my canonical domain: www.chetanrawal.com.

Here is a technical walkthrough of how we engineered these optimizations on this site.


1. Subdomain Consolidation & 301 Redirection

A common issue that dilutes search equity is having duplicate content served across both the bare domain (chetanrawal.com) and the www subdomain (www.chetanrawal.com). For AI agents and search crawlers, this can cause index fragmentation, where crawl budget is split and citation links point to inconsistent variations.

To fix this, we implemented a strict 301 (Permanent) Redirection at the edge using our Cloudflare Worker.

In the worker's fetch handler, we intercept incoming requests and redirect any bare domain traffic to the www subdomain:

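// Parse the incoming request URL (assumes the handler's request argument is named `request`)
const url = new URL(request.url);

// Redirect bare domain to www domain (primary canonical)
if (url.hostname === 'chetanrawal.com') {
    url.hostname = 'www.chetanrawal.com';
    url.protocol = 'https:';
    // Path and query string are carried over unchanged
    return Response.redirect(url.toString(), 301);
}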

This guarantees that every bot and user is immediately sent to the canonical domain, establishing a single source of truth for references.
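To make the redirect rule easier to test outside the Worker runtime, the decision can be pulled into a pure function. This is a minimal sketch, not the site's actual Worker code: the helper name `canonicalRedirectTarget` and the two constants are assumptions introduced for illustration. It returns the URL to redirect to, or `null` when no redirect is needed.

```javascript
// Hypothetical helper (not from the original Worker): decide whether a
// request URL needs a 301 to the canonical www host.
const BARE_HOST = 'chetanrawal.com';
const CANONICAL_HOST = 'www.chetanrawal.com';

function canonicalRedirectTarget(requestUrl) {
  const url = new URL(requestUrl);
  if (url.hostname !== BARE_HOST) {
    return null; // already on the canonical host (or another host): no redirect
  }
  url.hostname = CANONICAL_HOST;
  url.protocol = 'https:';
  return url.toString(); // path and query string are preserved
}

// Inside a fetch handler, this could be used as:
//   const target = canonicalRedirectTarget(request.url);
//   if (target) return Response.redirect(target, 301);
```

Keeping the rule in a plain function means it can be unit-tested in Node without deploying the Worker.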


2. Granular Crawler Control via robots.txt

Many websites make the mistake of blocking all AI scrapers to protect copyright, but this also prevents real-time search assistants (like ChatGPT Search or Perplexity) from referencing them. The key is distinguishing between training bots (which scrape data to train future models) and retrieval bots (which search the web in real-time to answer user queries with citations).

In our new robots.txt configuration, we:

  1. Removed restrictive blocks on user-facing retrieval agents.
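  2. Allowed traditional search engines (Googlebot, Bingbot) through the wildcard group, explicitly named the AI search agents (OAI-SearchBot, ChatGPT-User, Claude-SearchBot, PerplexityBot), and left Applebot-Extended unblocked.
  3. Kept our drafts and internal agent plans (/ai-agent-plans/) out of crawler scope. Note that robots.txt is advisory: it asks compliant bots not to crawl a path, but it is not access control.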

Here is a look at the optimized robots.txt:

User-agent: *
Disallow: /ai-agent-plans/
Allow: /

# Specifically allow AI search and retrieval agents
User-agent: OAI-SearchBot
Disallow: /ai-agent-plans/
Allow: /

User-agent: ChatGPT-User
Disallow: /ai-agent-plans/
Allow: /

User-agent: Claude-SearchBot
Disallow: /ai-agent-plans/
Allow: /

User-agent: PerplexityBot
Disallow: /ai-agent-plans/
Allow: /

Sitemap: https://www.chetanrawal.com/sitemap.xml
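
To sanity-check rules like the ones above, a small matcher can confirm which paths a given agent may fetch. The sketch below is deliberately simplified and is not a full robots.txt implementation: it handles plain path prefixes only (no `*` or `$` patterns), picks the first group naming the agent instead of merging groups, and uses the common "longest matching path wins, Allow wins ties" rule. The function names `parseRobots` and `isAllowed` are assumptions for illustration.

```javascript
// Simplified robots.txt matcher for illustration only.
function parseRobots(text) {
  const groups = []; // each group: { agents: [...], rules: [{ allow, path }] }
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if ((field === 'allow' || field === 'disallow') && current && value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  // Prefer a group naming this agent; otherwise fall back to the "*" group.
  const group =
    groups.find((g) => g.agents.includes(ua)) ||
    groups.find((g) => g.agents.includes('*'));
  if (!group) return true; // no applicable rules: crawling is allowed
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}

// A shortened copy of the rules shown above.
const robotsTxt = [
  'User-agent: *',
  'Disallow: /ai-agent-plans/',
  'Allow: /',
  '',
  'User-agent: PerplexityBot',
  'Disallow: /ai-agent-plans/',
  'Allow: /',
  '',
  'Sitemap: https://www.chetanrawal.com/sitemap.xml',
].join('\n');

const groups = parseRobots(robotsTxt);
console.log(isAllowed(groups, 'PerplexityBot', '/blog/some-post/'));
console.log(isAllowed(groups, 'PerplexityBot', '/ai-agent-plans/draft.md'));
```

Because every named group repeats the wildcard rules, a named agent sees the same restrictions as any other bot. Naming the agents mainly documents intent.
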

3. Standardized XML Sitemap Generation

To make sure crawlers can discover every single article we publish, we updated the sitemap generator in our static build pipeline (scripts/build-blog.js).

The script now automatically compiles a standardized XML sitemap containing clean, directory-based canonical URLs with the www prefix. It uses the official Sitemap protocol namespace from sitemaps.org:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.chetanrawal.com/</loc>
    <priority>1.0</priority>
  </url>
  ...
</urlset>
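
The article does not show the generator itself, so the following is only a sketch of how such a step could look. The function names and the shape of the `pages` array (`path`, `priority`) are assumptions, not the real contents of scripts/build-blog.js. It resolves each path against the canonical origin and entity-escapes the result, which the Sitemap protocol requires for characters such as `&`.

```javascript
// Hypothetical sketch of a sitemap builder; field names are assumptions.
const SITE_ORIGIN = 'https://www.chetanrawal.com';

// Escape the characters that must be entity-encoded inside XML text.
function escapeXml(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// pages: e.g. [{ path: '/', priority: '1.0' }, { path: '/blog/my-post/' }]
function buildSitemap(pages) {
  const entries = pages.map((page) => {
    // Resolving against SITE_ORIGIN keeps every URL on the www host.
    const loc = escapeXml(new URL(page.path, SITE_ORIGIN).toString());
    const priority = page.priority
      ? `\n    <priority>${page.priority}</priority>`
      : '';
    return `  <url>\n    <loc>${loc}</loc>${priority}\n  </url>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
    '',
  ].join('\n');
}
```
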

4. Serving LLM-Specific Aggregations (llms.txt)

Large language models ingest information differently from human readers. To make the full library of insights on this site easy for AI bots to read, our build script generates two special files at the root of the site:

  1. /llms.txt: A markdown file that serves as a structured map of the portfolio and blog, outlining key sections and summarizing all articles.
  2. /llms-full.txt: A consolidated text dump containing the full, raw markdown body of every single blog post.

If a crawler such as GPTBot or ClaudeBot chooses to fetch /llms-full.txt, it can read every post as plain Markdown in a single request, without parsing HTML or making a separate request for each page. Keep in mind that llms.txt is a proposed convention rather than a formal standard, so support varies by crawler and should not be assumed.
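
As a rough illustration of how a build step could produce these two files, here is a minimal sketch. It is not the site's actual build script: the function names, the post fields (`title`, `slug`, `summary`, `markdown`), and the `/blog/<slug>/` URL pattern are all assumptions. The layout of the first file (an H1 title, a blockquote summary, then a section of links) follows the llms.txt proposal as commonly described.

```javascript
// Hypothetical sketch; post fields and URL pattern are assumptions.
function buildLlmsTxt(site, posts) {
  const links = posts.map(
    (post) =>
      `- [${post.title}](${site.origin}/blog/${post.slug}/): ${post.summary}`
  );
  return [
    `# ${site.name}`,
    '',
    `> ${site.description}`,
    '',
    '## Blog',
    '',
    ...links,
    '',
  ].join('\n');
}

function buildLlmsFullTxt(site, posts) {
  // One section per post: title, canonical source URL, then the raw Markdown.
  const sections = posts.map(
    (post) =>
      `# ${post.title}\n\nSource: ${site.origin}/blog/${post.slug}/\n\n${post.markdown.trim()}`
  );
  return sections.join('\n\n---\n\n') + '\n';
}
```

Including the canonical URL next to each post body gives a model something concrete to cite when it quotes the content.
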


5. Rich Schema Markup & Semantic HTML

Finally, we aligned the site's metadata. We updated the canonical link tags, Open Graph tags (for previews on Slack/LinkedIn), and JSON-LD structured data to use www.chetanrawal.com consistently.
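
One way to keep these tags consistent is to derive all of them from a single canonical origin. The helper below is a hypothetical sketch, not the site's real template code: `canonicalHeadTags`, `escapeAttr`, and the choice of tags are assumptions for illustration.

```javascript
// Hypothetical sketch: build canonical and Open Graph tags from one origin.
const CANONICAL_ORIGIN = 'https://www.chetanrawal.com';

// Escape characters that would break a double-quoted HTML attribute.
function escapeAttr(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;');
}

function canonicalHeadTags(path, title) {
  const url = escapeAttr(new URL(path, CANONICAL_ORIGIN).toString());
  return [
    `<link rel="canonical" href="${url}">`,
    `<meta property="og:url" content="${url}">`,
    `<meta property="og:title" content="${escapeAttr(title)}">`,
  ].join('\n');
}
```
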

On the homepage and blog listing, we inject structured data mapping:

  • Person Entity: Declaring credentials, worksFor relationships, and skills.
  • BlogPosting Schema: Detailing headline, description, datePublished, author, and keywords for each blog entry.

By providing explicit machine-readable facts, we help LLMs reconcile the entity "Chetan Rawal" in their knowledge bases, making it more likely that they cite this site when answering queries about my work.
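
For illustration, a BlogPosting block could be generated as in the sketch below. The function name and the shape of the `post` object are assumptions, and the real site may emit different or additional properties. The property names themselves (`headline`, `description`, `datePublished`, `author`, `keywords`) are standard schema.org vocabulary.

```javascript
// Hypothetical sketch: render a schema.org BlogPosting as a JSON-LD script tag.
function blogPostingJsonLd(post) {
  const data = {
    '@context': 'https://schema.org',
    '@type': 'BlogPosting',
    headline: post.title,
    description: post.description,
    datePublished: post.datePublished, // ISO 8601 date string
    keywords: post.keywords.join(', '),
    url: post.url,
    mainEntityOfPage: post.url,
    author: {
      '@type': 'Person',
      name: 'Chetan Rawal',
      url: 'https://www.chetanrawal.com/',
    },
  };
  // Escape "<" so a "</script>" inside any field cannot close the tag early.
  const json = JSON.stringify(data, null, 2).replace(/</g, '\\u003c');
  return `<script type="application/ld+json">\n${json}\n</script>`;
}
```

Pointing `url` and `mainEntityOfPage` at the www address keeps the structured data aligned with the canonical link tag.
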


Conclusion

Making a website crawlable for the modern web requires looking beyond basic keyword-based SEO. By aligning our canonical subdomains, building a welcoming robots.txt for AI retrievers, and serving structured feeds like llms.txt, we ensure this website is ready for both search engines and the next generation of AI chatbots.

Are you building a product or optimizing your web visibility? Let's talk about it!