How-to guide · 2026-03-02

How to Configure Your robots.txt for AI Bots (Practical Guide 2026)




TL;DR

Your robots.txt file controls which AI bots can access your site. Block them, and AI systems can't cite your latest content. Allow them, and you increase your visibility in ChatGPT, Perplexity, Claude, and Gemini responses.


Why Does robots.txt Matter for AI Visibility?

Modern AI systems don't only use training data. Perplexity, ChatGPT with browsing, and Google Gemini crawl the web in real time to answer questions. If your robots.txt blocks their bots, you're invisible to 55% of informational searches.

Key finding: A 2025 study found that 42% of websites block at least one AI bot without knowing it — usually because they're using outdated robots.txt templates.


The AI Bots You Need to Know

| Bot | Company | Function |
| --- | --- | --- |
| GPTBot | OpenAI | Trains and browses for ChatGPT |
| ChatGPT-User | OpenAI | ChatGPT real-time browsing |
| ClaudeBot | Anthropic | Browses for Claude |
| PerplexityBot | Perplexity | Real-time search |
| Google-Extended | Google | Trains Gemini/Bard |
| Applebot-Extended | Apple | Trains Apple Intelligence |
| cohere-ai | Cohere | Model training |
| Bytespider | ByteDance | TikTok/Douyin model training |

How to Check Your Current robots.txt

Go to https://yoursite.com/robots.txt in your browser. Look for any of these patterns that block AI bots:

Pattern 1 — Blocks ALL bots:

User-agent: *
Disallow: /

This blocks every compliant crawler, including Googlebot — you disappear from classic search and AI answers alike.

Pattern 2 — Explicitly blocks OpenAI:

User-agent: GPTBot
Disallow: /

Pattern 3 — Old wildcard that blocks modern AI: Some robots.txt files written in 2022-2023 have User-agent: * rules with Disallow: directives. Any AI bot that arrived later and has no entry of its own falls back to that wildcard group, so it inherits the block.
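
You can verify what these patterns actually do with Python's standard-library urllib.robotparser, which applies robots.txt rules the same way compliant crawlers do. A minimal sketch (the sample paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Pattern 1: a global block — every compliant crawler, AI or not, is shut out.
block_all = RobotFileParser()
block_all.parse("User-agent: *\nDisallow: /".splitlines())

# Pattern 2: only OpenAI's GPTBot is blocked; other crawlers are unaffected.
block_gptbot = RobotFileParser()
block_gptbot.parse("User-agent: GPTBot\nDisallow: /".splitlines())

print(block_all.can_fetch("GPTBot", "/blog/post"))        # False
print(block_all.can_fetch("Googlebot", "/blog/post"))     # False
print(block_gptbot.can_fetch("GPTBot", "/blog/post"))     # False
print(block_gptbot.can_fetch("Googlebot", "/blog/post"))  # True
```

Swap in your own robots.txt text to see exactly which bots each rule catches.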


The Recommended robots.txt Template

Here's a template that allows all major AI bots while maintaining control over sensitive content:

# Standard search engine crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training and browsing bots — ALLOW ALL
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: cohere-ai
Allow: /

# Global default
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
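
Before deploying, you can sanity-check the template locally with Python's urllib.robotparser. The string below is an abridged copy of the template above, trimmed for illustration:

```python
from urllib.robotparser import RobotFileParser

# Abridged version of the allow-all template.
TEMPLATE = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(TEMPLATE.splitlines())

# Named bots match their own group; everyone else falls back to '*'.
for bot in ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider"):
    print(bot, rp.can_fetch(bot, "/any/page"))  # all True
```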

Selective Blocking: Allow ChatGPT, Block Training Data Scrapers

If you want to allow real-time browsing (which builds visibility) but block bulk training data scraping, use this pattern:

# Allow BROWSING bots (real-time, builds your AI visibility)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block TRAINING bots (optional — prevents use in future training data)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /
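
A quick check with urllib.robotparser confirms the split behaves as intended — browsing agents get through while training agents are refused (sample path is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Browsing bots allowed, training bots blocked — mirrors the pattern above.
POLICY = """\
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

print(rp.can_fetch("ChatGPT-User", "/blog/post"))  # True  — browsing allowed
print(rp.can_fetch("GPTBot", "/blog/post"))        # False — training blocked
```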

Tradeoff: Blocking training bots means future model versions won't learn about your site from training data, but Perplexity and ChatGPT with browsing can still find you in real-time searches.

For most businesses, allowing everything is the right call — the more AI systems know about you, the more they recommend you.


Allowing Only Specific Directories

If you have content you want to protect (internal tools, admin pages, sensitive docs) while still allowing AI bots access to your public content:

User-agent: GPTBot
Allow: /blog/
Allow: /products/
Allow: /about/
Disallow: /admin/
Disallow: /private/
Disallow: /api/

User-agent: PerplexityBot
Allow: /blog/
Allow: /products/
Disallow: /
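
Note the asymmetry in the two groups above: the GPTBot group only blocks the listed directories (everything else stays allowed), while the PerplexityBot group's trailing Disallow: / denies anything not explicitly allowed. A urllib.robotparser check makes the difference concrete (note that Python's parser uses first-match rule ordering rather than Google's longest-match, but for these rules both give the same result; sample paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

SCOPED = """\
User-agent: GPTBot
Allow: /blog/
Allow: /products/
Allow: /about/
Disallow: /admin/
Disallow: /private/
Disallow: /api/

User-agent: PerplexityBot
Allow: /blog/
Allow: /products/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(SCOPED.splitlines())

# GPTBot: only the listed directories are blocked; the rest defaults to allow.
print(rp.can_fetch("GPTBot", "/blog/post"))         # True
print(rp.can_fetch("GPTBot", "/admin/panel"))       # False
print(rp.can_fetch("GPTBot", "/pricing"))           # True (no rule matches)

# PerplexityBot: the trailing 'Disallow: /' denies everything not allowed above.
print(rp.can_fetch("PerplexityBot", "/blog/post"))  # True
print(rp.can_fetch("PerplexityBot", "/pricing"))    # False
```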

Common Mistakes to Avoid

Mistake 1: The "Security Through robots.txt" Fallacy

robots.txt is a suggestion, not a barrier. Malicious scrapers ignore it. The only bots that respect robots.txt are legitimate ones (Googlebot, GPTBot, etc.).

Don't block AI bots to "protect" content from being scraped — they're the legitimate ones. The scrapers that actually steal content don't care about robots.txt.

Mistake 2: Testing on a Staging Site, Forgetting in Production

Many sites block all bots on staging with User-agent: * / Disallow: / — which is correct — but then accidentally deploy that same robots.txt to production.

Fix: Always check your production robots.txt after each deployment.
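
One way to automate that check is a small guard in your deploy pipeline. The function below is a hypothetical sketch (the name has_global_block is ours, not a library API): it scans the raw file for a User-agent: * group containing Disallow: /, so it can run on the file before it's ever served:

```python
# Hypothetical CI guard: fail the deploy if robots.txt still carries a
# staging-style global block (User-agent: * with Disallow: /).
def has_global_block(robots_txt: str) -> bool:
    current_agents = []
    in_group = False  # True once a group has received Allow/Disallow lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_group:  # a directive already closed the previous group
                current_agents = []
                in_group = False
            current_agents.append(value)
        elif key in ("allow", "disallow"):
            in_group = True
            if key == "disallow" and value == "/" and "*" in current_agents:
                return True
    return False

staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nAllow: /\n"
print(has_global_block(staging))     # True — would fail the deploy
print(has_global_block(production))  # False
```

Wire it into CI so the build fails whenever the staging block leaks into the production file.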

Mistake 3: Blocking with No-Index but Allowing Crawling (or Vice Versa)

A <meta name="robots" content="noindex"> tag keeps a page out of search indexes, but it doesn't stop crawling: if your robots.txt still allows GPTBot, AI browsers can fetch and quote that page. Keep robots.txt and meta tags consistent about what you actually want hidden.


After Updating Your robots.txt

  1. Wait 2-4 weeks for AI bots to re-crawl your site
  2. Test your visibility by asking ChatGPT and Perplexity questions in your category
  3. Monitor monthly — AI systems change frequently

Automate the visibility check: EchoSignal audits your robots.txt for AI bot blocks and tests your visibility across ChatGPT, Claude, Gemini, and Perplexity — free, in 60 seconds.

Check if AI bots can access your site


Quick Reference: AI Bot Names

| If you see this... | It belongs to... | Recommendation |
| --- | --- | --- |
| GPTBot | OpenAI | ✅ Allow |
| ChatGPT-User | OpenAI | ✅ Allow |
| anthropic-ai | Anthropic | ✅ Allow |
| ClaudeBot | Anthropic | ✅ Allow |
| PerplexityBot | Perplexity | ✅ Allow |
| Google-Extended | Google (Gemini) | ✅ Allow |
| Applebot-Extended | Apple | ✅ Allow |
| Amazonbot | Amazon (Alexa) | ✅ Allow |
| FacebookBot | Meta AI | Consider |
| Bytespider | ByteDance (TikTok) | Your choice |

Published by EchoSignal | Last updated: March 2026

Is your site visible to AIs?

Find out for free in 30 seconds with our automatic diagnostic.

Analyze your site for free →