Your content can be brilliant and your authority signals strong — but if AI crawlers can't access or parse your pages, none of it matters. This is the technical foundation that GEO actually requires.

A SaaS company with strong Google rankings — #1 or #2 for three head terms in their category — ran a GeoXylia audit and discovered something uncomfortable. AI citability score: 23 out of 100. Their pages were technically invisible to Perplexity, ChatGPT, and Google AI Overviews, not because the content was weak, but because their JavaScript-rendered product pages were blocking AI crawlers, their Schema markup was five years out of date, and their homepage had no Organization structured data whatsoever.
This is the gap most technical SEO audits miss: AI systems don't use Google's crawler infrastructure. They use their own — with different tolerances, different behaviors, and different requirements. Your pages can pass every Google Core Web Vital and still be unreadable to the crawlers that determine whether your brand gets recommended inside AI-generated answers.
This guide is the technical foundation for GEO. Everything else — content quality, author credentials, passage structure — depends on AI crawlers being able to access and parse your pages in the first place. If you're failing the technical basics, no amount of LLMO optimization rescues your citation chances.
Google's crawler has been refining its behavior for over 25 years. It handles JavaScript rendering, waits for lazy-loaded content, follows redirect chains intelligently, and maintains sophisticated crawl budgets that prioritize important pages. AI crawlers vary significantly in sophistication, and the less-mature ones behave more like the early Googlebot than the current version.
The major AI crawlers you need to know: GPTBot (OpenAI's training crawler), OAI-SearchBot and ChatGPT-User (OpenAI's search and browsing agents), PerplexityBot, ClaudeBot (Anthropic), Google-Extended (the robots.txt token controlling whether Google uses your content for Gemini training), CCBot (Common Crawl, whose corpus feeds many training datasets), and Bingbot (which also feeds Microsoft Copilot).
The critical difference from traditional SEO: these crawlers often have shorter patience windows and less sophisticated rendering pipelines. A page that Googlebot renders successfully after a two-day wait may be abandoned by an AI crawler after a single unsuccessful attempt.
Your first technical SEO task for AI citability is checking whether these crawlers can actually access your pages. In Google Search Console's URL Inspection tool, paste a key URL and check the robots.txt status. Then manually test the same URL with each AI bot user-agent if your hosting allows it — or use a log analysis tool to check your server logs for visits from known AI crawler IPs.
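If you have access to server logs, a small script can surface which AI crawlers are actually visiting. This is a minimal sketch that tallies hits by user-agent substring from combined-format access log lines; the sample lines and the crawler list are illustrative, not exhaustive:

```python
import re
from collections import Counter

# A representative subset of known AI crawler user-agent tokens
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot", "CCBot"]

def count_ai_crawler_hits(log_lines):
    """Tally visits per AI crawler from combined-format access log lines."""
    hits = Counter()
    for line in log_lines:
        # The user-agent is the last quoted field in the combined log format
        quoted = re.findall(r'"([^"]*)"', line)
        ua = quoted[-1] if quoted else ""
        for bot in AI_CRAWLERS:
            if bot in ua:
                hits[bot] += 1
    return hits

# Hypothetical sample log lines for illustration
sample = [
    '1.2.3.4 - - [10/May/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/May/2025:10:01:00 +0000] "GET /blog/geo HTTP/1.1" 200 8300 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '9.9.9.9 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1100 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]
print(count_ai_crawler_hits(sample))
```

If the AI crawlers you expect never appear in your logs, that absence is itself the diagnostic: something upstream — robots.txt, a WAF rule, a CDN bot filter — is turning them away.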
Your robots.txt file determines which crawlers can access what on your site. For AI citability, the critical question is: are you accidentally blocking the crawlers that determine whether you get cited?
Common accidental blocks include a blanket 'Disallow: /' for all user-agents, which keeps out AI crawlers along with everyone else — fine if intentional, catastrophic if accidental. More commonly, sites explicitly block GPTBot site-wide because they don't want their content used for AI training, not realizing this also prevents ChatGPT from citing them. There's currently no standard way to allow citations while blocking training access — they're the same crawler.
The right configuration for most sites allows all major AI crawlers:
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: Google-Extended
Allow: /
If you genuinely don't want your content in AI training datasets, the honest answer is there's no standard technical way to allow citations while blocking training. The better business decision for most brands is to allow crawlers and invest in making your content citeable — the training exposure is essentially free brand presence in AI model knowledge.
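You can verify what your robots.txt actually permits before deploying it. Python's standard-library robots.txt parser makes this easy to check per user-agent; the example below uses a hypothetical robots.txt body that blocks GPTBot while allowing everyone else:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blanket allow, but GPTBot explicitly blocked
robots_txt = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

def bot_can_fetch(robots_body, user_agent, path="/"):
    """Return whether the given user-agent may fetch the path
    under this robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_body.splitlines())
    return rp.can_fetch(user_agent, path)

print(bot_can_fetch(robots_txt, "GPTBot"))         # blocked by its own group
print(bot_can_fetch(robots_txt, "PerplexityBot"))  # falls through to the * group
```

Running this against your real robots.txt for each AI user-agent is a fast way to catch the "accidentally blocked GPTBot" scenario described above.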
A page that isn't indexable can't be cited. This sounds obvious, but the failure modes are subtler than most teams realize.
Direct indexability failures include noindex meta tags on content pages you want cited, Disallow directives in robots.txt blocking the page, pages behind login walls or paywalls (AI crawlers won't authenticate), and canonical tags pointing to a different URL (AI crawler follows canonical, doesn't index the original).
Rendering-based failures are more common and harder to diagnose: content that only exists after client-side JavaScript executes, lazy-loaded sections with no fallback HTML, text hidden inside tabs or accordions that require interaction to render, and infinite-scroll pages whose content never appears in the initial server response.
The test: disable JavaScript in your browser, navigate to your key content pages, and see what content is actually present in the raw HTML. If critical content disappears, AI crawlers that don't fully render JavaScript will see the same thing.
For sites built on React, Vue, or Next.js without SSR, the fix is implementing proper server-side rendering or static generation for content pages. Next.js with getStaticProps or React with server-side rendering ensures the HTML your server delivers contains the actual content — not just a JavaScript shell that requires client-side execution to render anything.
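The JavaScript-disabled test can be automated. This sketch checks whether key phrases appear in the server-delivered HTML after stripping script tags — roughly what a non-rendering crawler sees. The two sample pages (an SSR page and an SPA shell) are hypothetical:

```python
import re

def content_in_raw_html(html, required_phrases):
    """Check whether key phrases appear in server-delivered HTML,
    ignoring anything inside script tags, which a non-rendering
    crawler would never execute."""
    visible = re.sub(r"<script\b[^>]*>.*?</script>", " ", html, flags=re.S | re.I)
    return {phrase: phrase.lower() in visible.lower() for phrase in required_phrases}

# Server-rendered page: content is in the raw HTML
ssr_page = "<html><body><h1>Pricing Plans</h1><p>Starter plan details</p></body></html>"
# SPA shell: content only exists after client-side JavaScript runs
spa_shell = '<html><body><div id="root"></div><script>render("Pricing Plans")</script></body></html>'

print(content_in_raw_html(ssr_page, ["Pricing Plans"]))   # → {'Pricing Plans': True}
print(content_in_raw_html(spa_shell, ["Pricing Plans"]))  # → {'Pricing Plans': False}
```

Point the same check at the raw HTML your server returns (via curl or a fetch without rendering) for each key page and each phrase you need cited.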
Schema.org structured data is the formal bridge between your content and how AI systems understand it. When implemented correctly, it gives AI systems unambiguous signals about what entities your page describes, who created the content, and how it relates to your broader web presence.
For AI citability, three Schema types cover 90% of sites:
Organization schema is your brand's ID card — the most critical for AI systems trying to understand who you are. On your homepage and About page, include name (your official brand name matching exactly across all citations), url (your canonical homepage URL), logo (a publicly accessible URL), and sameAs (an array of URLs for every official profile — LinkedIn, Twitter/X, Facebook, Wikipedia, Wikidata, Crunchbase, industry associations). These links are trust signals that AI systems use to verify your entity's legitimacy.
Article/BlogPosting schema establishes content provenance. For every article, implement headline (exact match to your H1), author (a Person entity with name and url), datePublished (ISO 8601 format), image (publicly accessible URL), and publisher (your Organization entity, creating the author-publisher chain).
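Putting the two together, here is a sketch of Organization and BlogPosting JSON-LD built as Python dictionaries and serialized for the page head. All names, URLs, and profile links are hypothetical placeholders — substitute your own values:

```python
import json

# Hypothetical brand details -- replace with your own
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Corp",
    "url": "https://www.example.com/",
    "logo": "https://www.example.com/logo.png",
    "sameAs": [
        "https://www.linkedin.com/company/example-corp",
        "https://x.com/examplecorp",
        "https://www.crunchbase.com/organization/example-corp",
    ],
}

article = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "How AI Crawlers Read Your Site",  # must match the page's H1 exactly
    "author": {
        "@type": "Person",
        "name": "Jane Doe",
        "url": "https://www.example.com/authors/jane-doe",
    },
    "datePublished": "2025-05-10",  # ISO 8601
    "image": "https://www.example.com/images/ai-crawlers.png",
    "publisher": organization,  # completes the author-publisher chain
}

# Emit as a JSON-LD script block for the page head
jsonld = f'<script type="application/ld+json">{json.dumps(article)}</script>'
print(jsonld)
```

Embedding the Organization object as the article's publisher is what ties each piece of content back to the verified brand entity.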
FAQPage schema is one of the most reliable structured data types for AI citation. AI systems frequently extract FAQ answers verbatim and cite them in AI Overviews. The markup format is well-understood and relatively simple to implement correctly.
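A minimal FAQPage sketch follows the same pattern — each question-answer pair becomes a Question entity with an acceptedAnswer. The question text here is a placeholder drawn from this guide:

```python
import json

# Minimal FAQPage JSON-LD sketch; the Q&A content is illustrative
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Does blocking GPTBot prevent ChatGPT citations?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes. There is currently no standard way to allow "
                        "citations while blocking training access.",
            },
        },
    ],
}
print(json.dumps(faq, indent=2))
```

Keep the answer text identical to the visible answer on the page — mismatches between markup and rendered content are a common validation failure.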
Here's what most SEO teams miss: AI systems aren't just looking for Schema markup to confirm what they already parsed from HTML. They're using structured data to disambiguate entities and verify relationships in ways raw text can't communicate.
When an AI system encounters Organization schema with sameAs links to your LinkedIn page, Crunchbase profile, and Wikipedia article, it's using those as verification signals — external confirmations of your entity's existence that create low-uncertainty signals. This is why Wikipedia entities are disproportionately cited by AI systems: Wikipedia's editorial process creates compounding authority signals. You can't buy Wikipedia citations, but you can build the structured data foundation that makes your brand citable with the same confidence.
For implementation priority:
1. Organization schema on homepage — the foundation; everything else builds on it
2. Article schema on content pages — establishes content provenance and author attribution
3. FAQPage schema — high citation ROI, relatively easy to implement correctly
4. Person schema for named authors — if authors have credible, verifiable credentials, make them citable entities
5. Product or HowTo schema — if your business warrants it
Test your Schema markup with Google's Rich Results Test and GeoXylia's structured data analyzer. Common failures: missing required fields, malformed dates, images returning 404, and logo URLs that redirect in ways crawlers don't follow.
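The missing-field class of failures is easy to catch in-house. This sketch extracts JSON-LD blocks from a page and reports absent required fields — an illustrative subset of checks, not a substitute for a full validator; the sample page and required-field lists mirror the recommendations above:

```python
import json
import re

# Required fields per type, per the recommendations in this guide
REQUIRED = {
    "Organization": ["name", "url", "logo", "sameAs"],
    "BlogPosting": ["headline", "author", "datePublished", "image", "publisher"],
}

def validate_jsonld(html):
    """Extract JSON-LD blocks from HTML and report missing required fields."""
    problems = []
    blocks = re.findall(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.S
    )
    for block in blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            problems.append("malformed JSON-LD block")
            continue
        for field in REQUIRED.get(data.get("@type", ""), []):
            if field not in data:
                problems.append(f"{data['@type']} missing {field}")
    return problems

# Hypothetical homepage with incomplete Organization schema
page = ('<html><head><script type="application/ld+json">'
        '{"@type": "Organization", "name": "Example Corp", '
        '"url": "https://www.example.com/"}</script></head></html>')
print(validate_jsonld(page))  # → ['Organization missing logo', 'Organization missing sameAs']
```

Checks for 404 images and redirecting logo URLs would require HTTP requests on top of this, but the structural pass alone catches the most frequent failures.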
Google's Core Web Vitals — LCP, INP, and CLS — are primarily framed as user experience metrics. For AI citability, they matter for a different reason: slow pages consume more of an AI crawler's budget and may not be fully processed before the crawler moves on.
Perplexity and ChatGPT with browsing have both been observed to apply timeouts to pages that don't return usable content quickly. A page with a 6-second LCP may be abandoned before the AI crawler finishes rendering and processing the content. This compounds over large sites: if every page takes 4+ seconds to deliver meaningful content, the crawler may deprioritize your site in favor of faster competitors.
The LCP (Largest Contentful Paint) threshold for AI crawlers is similar to Google's — under 2.5 seconds is good, over 4 seconds is poor. But AI crawlers may be less forgiving on mobile-emulated connections or from geographic locations distant from your server.
INP (Interaction to Next Paint) matters for JavaScript-heavy pages: if your page requires significant client-side JavaScript to become interactive and render content, AI crawlers with shorter rendering patience may abandon the page before it's ready. Server-side rendering or static HTML delivery eliminates this risk entirely.
CLS (Cumulative Layout Shift) is less directly critical for AI citability than for Google rankings, but unexpected layout shifts during rendering can interfere with content extraction — particularly if ads or late-loading images push text content around after the AI crawler has already extracted it.
The practical action: test your key pages using WebPageTest from multiple geographic locations. Treat any Time to First Byte over 3 seconds as a failure, and aim for under 1 second. If your server consistently responds in under 1 second from most locations, you've cleared the technical baseline for both Google and AI crawlers.
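For spot checks between full WebPageTest runs, a rough client-side TTFB probe can be scripted. This sketch times the gap from request start to first response byte and classifies it against the thresholds above; it measures from wherever you run it, so it's a proxy, not a replacement for multi-location testing:

```python
import time
import urllib.request

def measure_ttfb(url, timeout=10):
    """Rough client-side TTFB: seconds from request start until the
    first response byte is readable. Network location affects results."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1)  # wait for the first byte only
    return time.perf_counter() - start

def classify_ttfb(seconds):
    """Bucket a TTFB measurement using the thresholds from this guide."""
    if seconds < 1.0:
        return "good"
    if seconds < 3.0:
        return "needs attention"
    return "poor"

print(classify_ttfb(0.4))  # → good
```

Run measure_ttfb against your key URLs from a few different regions (e.g. cloud VMs) and feed the results through classify_ttfb to get a quick pass/fail picture.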
Internal linking tells AI systems which pages matter most and how your topics relate to each other. It's as important for AI citability as it is for traditional SEO — but the mechanism is slightly different.
When a page receives multiple internal links from related content using descriptive anchor text, AI systems interpret that as a topical authority signal. The page that gets the most links from related pages within your site is, in AI's assessment, your most authoritative page on that topic.
This is why pillar-and-cluster content models work well for AI citability: the pillar page accumulates internal links from satellite articles, and that concentrated topical authority signals to AI systems that this page is the definitive source on the topic. When Perplexity is building an answer that requires your topic, it's more likely to cite the page that has demonstrated topical authority through internal link architecture.
Internal linking best practices for AI citability: use descriptive anchor text that names the target page's topic (not "click here" or "learn more"); link every cluster article to its pillar page; link laterally between related cluster articles; and eliminate orphan pages — every page you want cited should be reachable through contextual links, not just the sitemap.
External linking to authoritative sources — official documentation, academic research, recognized industry publications — also signals that your content is situated within a credible knowledge context. AI systems interpret unlinked, isolated content with more uncertainty than content that explicitly situates itself alongside known authoritative references.
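The concentration of internal links described above can be measured from a crawl. This sketch parses anchors out of page HTML with the standard-library HTML parser and counts inbound internal links per path — the pages with the highest counts should be your intended pillar pages. The sample pages and domain are hypothetical:

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect (href, anchor_text) pairs from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def internal_link_counts(pages, domain="example.com"):
    """Count inbound internal links per path across crawled pages,
    skipping links that point to other domains."""
    counts = Counter()
    for html in pages.values():
        parser = LinkCollector()
        parser.feed(html)
        for href, _text in parser.links:
            parsed = urlparse(href)
            if parsed.netloc in ("", domain):
                counts[parsed.path] += 1
    return counts

# Hypothetical crawl output: path -> page HTML
pages = {
    "/blog/a": '<a href="/guides/geo">GEO technical guide</a>',
    "/blog/b": ('<a href="/guides/geo">technical SEO for AI crawlers</a> '
                '<a href="https://other.com/x">external ref</a>'),
}
print(internal_link_counts(pages).most_common(1))  # → [('/guides/geo', 2)]
```

If the top of this ranking isn't the page you want cited for the topic, your link architecture and your content strategy are pointing AI systems in different directions.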
Here's every technical check that determines whether AI systems can access, parse, and cite your content:
Crawl access: No blanket Disallow directives for AI crawler user-agents; no page-level noindex on content you want cited; key content not behind login walls, paywalls, or CAPTCHA gates; canonical tags correctly pointing to preferred URLs.
Rendering: Critical content present in raw HTML (test with JavaScript disabled); no content exclusively in JavaScript-rendered tabs, accordions, or expandable sections; lazy-loaded content has fallback HTML or is server-rendered; client-side hydration completes within 3 seconds on representative connections.
Structured data: Organization schema on homepage with complete sameAs links; Article/BlogPosting schema on all content pages with author and publisher; FAQPage schema on FAQ sections (valid, complete); Person schema for named authors with credentialed profiles; all Schema markup validated — no missing required fields, no 404 image URLs.
Performance: TTFB under 1 second from multiple geographic locations; LCP under 2.5 seconds on mobile-emulated connections; INP acceptable (under 200ms) on pages with interactive elements; no render-blocking resources that delay content access for crawlers.
Content accessibility: llms.txt file at domain root (draft standard but growing adoption); sitemap.xml accessible and submitted to Google Search Console; no soft 404s or error pages returning 200 status codes.
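The soft-404 item in that checklist is checkable with a simple heuristic: a 200 response whose body reads like an error page. This sketch illustrates the idea; the marker phrases are an illustrative subset and will need tuning for your templates:

```python
def looks_like_soft_404(status_code, body):
    """Heuristic soft-404 check: flags 200 responses whose body
    reads like an error page. Marker list is illustrative only."""
    if status_code != 200:
        return False  # real error codes are not soft 404s
    markers = ["page not found", "404", "doesn't exist", "no longer available"]
    text = body.lower()
    return any(marker in text for marker in markers)

print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))  # → True
print(looks_like_soft_404(404, "<h1>Page Not Found</h1>"))  # → False
print(looks_like_soft_404(200, "<h1>Pricing Plans</h1>"))   # → False
```

A useful companion test: request a deliberately nonexistent URL on your domain and confirm the server returns a genuine 404 status rather than a 200 with an error template.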
Run GeoXylia's free AI Citability Audit to test your site against all of these dimensions. You'll get a specific technical SEO readiness score alongside your full AI citability assessment across all 7 dimensions — including passage retrieval, entity precision, and structural clarity.