How LLMs Actually Choose Who to Cite: A Walk-Through of the Princeton GEO Paper

In November 2023, a team led by Pranjal Aggarwal at Princeton - with collaborators from Georgia Tech, the Allen Institute for AI, and IIT Delhi - uploaded a paper to arXiv titled “GEO: Generative Engine Optimization” (arXiv:2311.09735).

As far as we can tell, it was the first peer-reviewed academic framework for optimizing content for LLM retrieval. It was accepted to ACM SIGKDD 2024 (ACM proceedings), and the codebase and benchmark are public on GitHub.

The Aggarwal et al. result, in one sentence: specific, repeatable content tactics can boost a source’s visibility inside generative-engine answers by up to 40%.

That is a number worth understanding line by line, because it is the cleanest empirical foundation we have for what actually works in GEO.

What the paper actually tested

The researchers built a 10,000-query benchmark called GEO-bench, drawing from sources like MS MARCO, Natural Questions, AllSouls, and ELI5 - covering a wide range of query intents, domains, and difficulties.

They evaluated against a generative engine modeled on Bing Chat (a production AI search system at the time), then validated the results on Perplexity.

They tested nine distinct content optimization tactics, applied to source documents before retrieval, and measured visibility in the final generative answer using two novel metrics:

Position-Adjusted Word Count - measures both whether a source is cited AND how prominently it appears in the answer.
Subjective Impression - measures qualitative authority and influence of the citation in the answer.
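As a rough illustration of the first metric, the sketch below scores a source by the share of answer words that sit in sentences citing it, with earlier sentences weighted more heavily. The exponential decay, the naive sentence splitter, and the `[n]` citation-marker format (placed before the closing punctuation) are assumptions made for illustration; the paper's exact formulation and normalization may differ.

```python
import math
import re

# Assumed citation format: bracketed numbers such as [1] inside the sentence.
CITATION_MARKER = re.compile(r"\[(\d+)\]")

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_count(sentence):
    """Count words, ignoring citation markers such as [1]."""
    return len(re.findall(r"\w+", CITATION_MARKER.sub("", sentence)))

def position_adjusted_word_count(answer, source_id):
    """Share of answer words credited to source_id, discounted by position."""
    sentences = split_sentences(answer)
    total_words = sum(word_count(s) for s in sentences)
    if total_words == 0:
        return 0.0
    score = 0.0
    for pos, sentence in enumerate(sentences):
        cited = {int(m) for m in CITATION_MARKER.findall(sentence)}
        if source_id in cited:
            # Earlier sentences get a weight closer to 1 (assumed decay form).
            score += word_count(sentence) * math.exp(-pos / len(sentences))
    return score / total_words

answer = (
    "Solar costs fell sharply [1]. Wind also grew [2]. "
    "Solar leads new capacity [1]."
)
for source in (1, 2, 3):
    print(source, round(position_adjusted_word_count(answer, source), 3))
```

A source that is never cited scores zero, and the same sentence counts for less the later it appears in the answer.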

The methodology - pre-modify a source, run the query, measure the change in citation prominence - gives causal evidence that specific writing techniques shift LLM behavior. Not correlational, not anecdotal: A/B-tested.

The tactics that worked

Of the nine tactics tested, three stood out as consistent winners across domains:

1. Citation Addition

Adding inline citations to relevant external sources lifted Position-Adjusted Word Count by 30–40% in factual domains. LLMs reward content that points to other authoritative sources.

“Citation Addition demonstrated significant improvements in factual content, particularly for queries seeking factual information.” - Aggarwal et al., 2023

2. Quotation Addition

Adding relevant quotations from authoritative figures or sources produced gains of similar magnitude, especially for historical and opinion-based content.

3. Statistics Addition

Inserting specific numerical data - measured percentages, study findings, hard counts - produced the largest gains in domains classified as “Law & Government” and other technical or analytical contexts.

The pattern is consistent: LLMs preferentially cite content that itself behaves like a citable source. Specific numbers, named sources, direct quotes. Not vague claims.
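One way to make that pattern concrete is a rough audit that counts the three signals in a draft. The sketch below is a heuristic of our own, not something from the paper: the regular expressions, the 20-character minimum for a quotation, and the function name `citable_signals` are all illustrative assumptions, and they will miss or miscount plenty of real-world cases.

```python
import re

# Heuristic patterns (assumptions for illustration, not from the GEO paper).
# Statistics: percentages ("42%", "3.5 percent") or comma-grouped counts ("1,200").
STATISTIC = re.compile(r"\b\d+(?:[.,]\d+)*\s?(?:%|percent\b)|\b\d{1,3}(?:,\d{3})+\b")
# Quotations: at least 20 characters between straight or curly double quotes.
QUOTATION = re.compile(r'["\u201c][^"\u201c\u201d]{20,}["\u201d]')
# Citations: [1]-style markers, (Author, 2023)-style references, or bare URLs.
CITATION = re.compile(r"\[\d+\]|\([A-Z][A-Za-z&.\s]+,? \d{4}\)|https?://\S+")

def citable_signals(text):
    """Count rough proxies for statistics, quotations, and citations in text."""
    return {
        "statistics": len(STATISTIC.findall(text)),
        "quotations": len(QUOTATION.findall(text)),
        "citations": len(CITATION.findall(text)),
    }

vague = "Our product is widely considered the best solution on the market."
specific = (
    'Adoption rose 42% in 2023 (Smith et al., 2023). "We saw gains across '
    'every region we measured," said the lead author. '
    "See https://example.org/report for details."
)
print(citable_signals(vague))
print(citable_signals(specific))
```

A count of zero across all three is a hint that a passage reads as a vague claim rather than as something an engine could attribute.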

The tactics that helped less

Five other tactics produced moderate gains:

Fluency Optimization - cleaner prose, better readability.
Authoritative Voice - definitive language, expert framing.
Easy-to-Understand - simplifying complex content.
Technical Terms - appropriate domain-specific terminology.
Unique Words - semantic distinctiveness.

These work in specific domains but don’t generalize. The big three (statistics, quotations, citations) work nearly everywhere.

And one tactic - Keyword Stuffing - actively reduced visibility. Classic SEO behavior gets penalized in LLM retrieval. The optimization surface has inverted.
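A simple density check can flag drafts that drift toward stuffing. This is a minimal sketch under our own assumptions: the paper does not define a density threshold, so the `0.05` cutoff and the function name `keyword_density` are purely illustrative.

```python
import re

def keyword_density(text, keyword):
    """Fraction of the words in text that belong to occurrences of keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    phrase = keyword.lower().split()
    if not words or not phrase:
        return 0.0
    n = len(phrase)
    # Count every position where the full phrase appears as consecutive words.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)
    return hits * n / len(words)

STUFFING_THRESHOLD = 0.05  # hypothetical cutoff, not taken from the paper

stuffed = (
    "Best running shoes: buy best running shoes, "
    "because best running shoes win."
)
density = keyword_density(stuffed, "best running shoes")
print(f"density={density:.2f} stuffed={density > STUFFING_THRESHOLD}")
```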

The headline finding

Combined optimization using the top tactics produced what the paper calls “up to a 40% improvement in source visibility on generative engine responses.”

In a market where the difference between being cited and not being cited is the difference between brand recall and brand invisibility, 40% is a chasm.
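It helps to be precise about what a figure like "40%" means. The sketch below assumes the improvement is relative to the unmodified baseline, not a gain of 40 percentage points; the visibility scores in the example are made-up numbers for illustration, not results from the paper.

```python
def relative_improvement(baseline, optimized):
    """Relative change of a visibility score versus its unmodified baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (optimized - baseline) / baseline

# Hypothetical scores: a source holding 20% of the answer's weighted words
# before optimization and 28% after has improved by 40% in relative terms.
lift = relative_improvement(0.20, 0.28)
print(f"{lift:.0%}")
```

Under this reading, a source that starts with a small share of the answer still ends with a small share; the lift scales what was already there.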

Where the production systems align

The Princeton work is foundational, but the production AI search engines - Perplexity, ChatGPT Search, Google’s AI Overviews - implement their own retrieval-augmented generation (RAG) pipelines that align with and extend these findings.

Perplexity in particular uses a multi-stage RAG pipeline:

Query parsing and intent classification.
Hybrid retrieval (BM25 keyword + dense embedding similarity).
A three-tier ML reranker, with a final XGBoost-based “quality gate” filtering for entity clarity, source authority, and document provenance (technical breakdown via ZipTie, 2025).

Perplexity’s source-evaluation criteria reportedly hinge on four pillars: trustworthiness, authority, corroboration, and provenance (Singularity Digital analysis, 2025). Note: these technical analyses come from industry research rather than Perplexity’s own documentation - directionally accurate but worth verifying for your specific use case.
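To make the hybrid retrieval step less abstract, here is one common way to merge a keyword ranking with an embedding ranking: reciprocal rank fusion (RRF). This is a generic sketch, not Perplexity's implementation; nothing in the sources above says which fusion method Perplexity uses. The document IDs and the two input rankings are hypothetical, and `k=60` is a conventional default for RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a BM25 retriever and a dense-embedding retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

The practical point for content owners: a page that surfaces in only one of the two rankings tends to lose to a page that surfaces in both, which is one reason exact terminology and clear semantic focus both matter.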

RAG is the substrate; retrieval is the bottleneck

A 2025 arXiv survey of retrieval-augmented generation systems counted over 1,200 RAG-related papers published in 2024 alone (arXiv:2506.00054) - a research explosion that reflects the production reality: every major AI agent now relies on RAG to ground its answers.

The implication for brands: your visibility in an AI answer is not primarily a function of the LLM weights. It is a function of whether your content gets retrieved in the first place. Retrieval is the bottleneck. And retrieval is something you can engineer.

A 2025 Springer BISE journal review puts the academic consensus simply: RAG reduces but does not eliminate hallucinations, and its effectiveness depends on retrieval quality, source relevance, and the interpretability of the retrieved context (Springer BISE, 2025).

In other words: if your content is structured to be retrieved cleanly and parsed accurately, you have a measurable advantage in being cited.

What the data does NOT yet support

Honest framing matters. There is a lot of GEO advice circulating that is either agency-sourced without methodology or extrapolated past what the research supports.

A few examples we’d encourage hedging on:

The widely-cited “0.334 correlation between brand search volume and LLM citation rate” comes from agency reports without published methodology.
“4.8x knowledge graph entity multiplier” - agency-sourced, not peer-reviewed.
“2.3x more citations for 50–150 word chunks” - same caveat.

A December 2024 study by Search Atlas found no statistically significant correlation between schema markup coverage and LLM citation rates in their sample - suggesting that relevance and topical authority outweigh schema markup alone. Schema may still be useful infrastructure for AI parsing, but it is not a magic input.

Microsoft did publicly confirm in March 2025 that its LLMs use structured data to interpret web content (Search Engine Land). The mechanism matters; the magnitude of effect is what’s still being measured.
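For readers who have not worked with structured data, the sketch below builds a minimal schema.org `Article` object as JSON-LD and wraps it in the script tag that pages typically embed. The helper name `article_jsonld` and all field values are placeholders for illustration; which properties any given AI system actually reads is not established by the sources above.

```python
import json

def article_jsonld(headline, author, date_published, citations):
    """Build a minimal schema.org Article description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,  # ISO 8601 date string
        "citation": list(citations),      # works this article cites
    }

# Placeholder values for illustration only.
markup = article_jsonld(
    headline="Example headline",
    author="Example Author",
    date_published="2025-01-01",
    citations=["https://arxiv.org/abs/2311.09735"],
)
print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```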

The summary, plainly

The Princeton GEO paper gives us the cleanest empirical answer we have to the question “how do I get an LLM to cite me?” The answer is:

Write content that itself cites things - sources, statistics, quotations.
Be quotable - use definitive, attributable language.
Be retrievable - clean structure, accurate schema, parsable chunks.
Don’t keyword-stuff - it actively hurts you.

Everything else in GEO is execution detail and platform-specific tuning. But the foundation - the reason some brands get cited and others don’t - is now empirical, not speculative.

Primary sources for this article:

Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., Deshpande, A. (2023). “GEO: Generative Engine Optimization.” arXiv:2311.09735. https://arxiv.org/abs/2311.09735 - published at ACM SIGKDD 2024.
GEO codebase / GEO-bench on GitHub
ACM SIGKDD 2024 proceedings
arXiv:2506.00054 - Comprehensive Survey of RAG (2025)
Springer BISE Journal RAG review (2025)
ZipTie technical breakdown of Perplexity retrieval
Search Engine Land - Schema markup in AI search