Google Leak: Compilation of Terms Mentioned in the Documents

The leak of Google documents shared by Rand Fishkin and Mike King has given us access to an enormous amount of internal documentation related to the systems that power Google’s search engine.

We obviously cannot know which of the modules and attributes mentioned in this documentation are active in production within the search engine, or which of them affect SEO. Even so, the revealed information is important, both for the sheer quantity of documents (more than 2,500) and concepts, and for what is mentioned directly or can be deduced.

It has always been said that Google does not call things the same way as those of us who do SEO, and this glimpse into their internal documentation is proof of that.

That’s why I’ve started to compile a glossary of terms mentioned in the leaked documents that are of interest for SEO, trying to provide as much detail as I can for the moment. When the known information makes it clear what a concept is used for, I have included that in the description, and when there is confirmation or external evidence that a concept exists or is used at Google, I have mentioned that too.

For now I have 49 terms, with their respective explanations. If you think any are missing, you can let me know in the comments.

AGC

Artificially (or Automatically) Generated Content. Mentioned specifically in the QualityNSR module, as part of an attribute called racterScores (see Racter / RacterScores), which is described as a site-wide classification of whether the site contains machine-generated content or not.

AnchorMismatch

An attribute to check if the anchor text of a link matches the content of the target page. If it doesn’t match, according to the Compressed Quality Signals document, the link will be devalued. We already knew that Google positively considers links from pages that contain mentions of the search terms, and this could be a way to implement it.
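As a rough illustration of the idea, and not a description of how Google implements it, the sketch below flags a mismatch when too few words of the anchor text appear in the target page. The function names, the token-overlap measure and the 0.5 threshold are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def anchor_mismatch(anchor_text, target_text, threshold=0.5):
    """Return True when too few anchor tokens appear in the target page.

    Hypothetical heuristic: the share of anchor tokens found in the
    target text must reach `threshold`, otherwise the link is flagged.
    """
    anchor_tokens = tokenize(anchor_text)
    if not anchor_tokens:
        return True  # an empty anchor cannot match anything
    overlap = len(anchor_tokens & tokenize(target_text)) / len(anchor_tokens)
    return overlap < threshold


page = "Our guide to brewing espresso at home with a manual machine."
print(anchor_mismatch("espresso brewing guide", page))  # anchor words all appear
print(anchor_mismatch("cheap car insurance", page))     # no anchor word appears
```

A real system would more likely compare meanings than exact words, but the flow is the same: score the anchor against the target, then devalue the link below some cutoff.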

Ascorer

The traditional ranking system used by Google before other systems (e.g., Deep Learning systems) do a re-ranking process. The A in Ascorer comes from Amit Singhal, head of Search at Google between 2000 and 2015.

BabyPanda

Possibly a new version of Panda. We know there is a BabyPanda score and there could be or could have been devaluations due to BabyPanda, which were applied on top of the previous Panda devaluation.

BrainLoc

A ranking or list, with scores, of the main locations by categories (countries, states, cities, counties or provinces).

CenterPiece

The main content of a page, discarding menus, sidebars, footer, ads, etc. This concept appears many times in Google’s Quality Rater Guidelines. It is also something mentioned and publicly confirmed by the Googler Martin Splitt in this video.

Chunks

A chunk is a fragment of a page or site. It is a necessary concept for properly doing embeddings, since generating embeddings for an entire page is not the same as for each small part of a page.

In that sense, I believe chunks are closely related to the Passage Ranking system and also to the generation of Featured Snippets (since it is difficult to extract the most relevant phrase or phrases for a search if you have not previously analyzed all the small parts of the page).

What surprised me were the site chunks (parts of the site). My hypothesis is that Google chooses a series of text fragments from the entire site (randomly or according to some criteria) to extract characteristics of the site, such as the theme and possibly the quality.
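To make the idea of page chunks concrete, here is a minimal sketch that splits a text into overlapping windows of words. The leaked documents do not say how Google builds its chunks, so the function name `chunk_words`, the window size and the overlap are assumptions chosen only for illustration.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words.

    Each window shares `overlap` words with the previous one, so that a
    sentence cut at a boundary still appears whole in some chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


page = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(page, size=5, overlap=2):
    print(chunk)
```

Each chunk could then receive its own embedding, which is what would make it possible to score a single passage rather than the whole page.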

Coati

Coati was an evolution of Panda, which we know thanks to the Googler Hyung-Jin Kim. Panda was integrated into the core algorithm, but Coati persisted as an independent system, at least until November 2022. Coati is mentioned once in the leaked docs.

CompositeDoc

Record used to store together all the information about a document.

See PerDoc.

ConstituencyTree

A syntactic phrase analysis tree. Google uses this to understand the relationships between terms mentioned in a text. See PoS.

DocJoin / DocJoiner

A system or protocol to join all the pieces of information about a document or page that Google generates from the moment of crawling and throughout the rest of the phases.

See CompositeDoc.

Embeddings

Numerical representation of a text, phrase, word or token. It is usually a vector with many dimensions, which tries to capture the meaning of that text numerically. If two words have a similar meaning or are closely related (they often appear together in a corpus of texts), their embeddings will have close values.

It is one of the key concepts on which Transformers are based, and therefore BERT and any other LLM, such as GPT-4. That’s why embeddings are very important for Google (and any search engine) to be able to understand and rank documents.

In the leaked documentation there are 46 mentions of embeddings, with specific mentions of site embeddings, topic embeddings and page embeddings.
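A small sketch can show what "close values" means in practice. Cosine similarity is a common way to compare two embeddings. The three-dimensional vectors below are invented for the example (real embeddings have hundreds or thousands of dimensions and come from a trained model), and nothing here is taken from the leaked documents.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


# Toy, hand-made vectors (assumption): related words point in similar directions.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_embeddings["dog"], toy_embeddings["puppy"]))
print(cosine_similarity(toy_embeddings["dog"], toy_embeddings["invoice"]))
```

The first pair scores much higher than the second, which is the property a search engine can exploit to match a query with documents that use different words for the same thing.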

ExactMatchDomainDemotion

Google can devalue domains that contain the terms of a search query.

Forum

There is an isForumPage attribute, to differentiate forum pages from other types of pages.

Geostore

It seems to be how they refer to the map and the annotation system on the map for local intent searches.

Gobi / GobiSite

A site that Google elevates for certain searches. For example, amazon.com is a «gobi» site for the search «hdtv» – but there are Amazon subdomains that are not elevated for that search, like «askville.amazon.com».

Gold-standard

These seem to be seed pages manually chosen and annotated to be used in different algorithms. One of these could be the updated PageRank patent algorithm, which uses manually chosen seed pages. I talked about this algorithm and patent in this post.

Golden

Flag or annotation that indicates if a document is «gold-standard».

Goldmine

For now, an unknown. From the structure of the docs we can deduce that it is a component of Quality/freshness. There are at least 20 mentions of Goldmine in the docs, closely related to annotations. There is a Goldmine Page Score. Ideas about what Goldmine could be are welcome.

Indextier

There are 3 layers or levels within the index. Google gives more value to links from pages that are part of the higher quality index layer (indextier), and also to links from pages that it has recently indexed (I add that this is probably a temporary effect, which fades when the page is no longer recent).

Muppet

It seems that this is how they refer to a hacked site used for spam.

Mustang

Mentioned at least 144 times in the documents. According to Mike King, it is the main system for indexing, ranking and serving. The mentions of Mustang as an index are evident. The rest of the concepts I’m not so sure about, but 144 mentions leave room for a lot.

Navboost / NavboostCraps

We learned about Navboost during the revelations from the antitrust lawsuit against Google, so it’s no surprise to find it here.

Navboost is a probabilistic model based on users’ click history to predict clicks on a result, given a query. According to Pandu Nayak’s testimony: it has been active since 2005, works for query/URL pairs, distinguishes between mobile and desktop data, its data can be separated by location, and it can help distinguish search intents.

In the leaked documentation we have found even more detail: Craps seems to be the Navboost component that handles storing click (and impression) features, including whether it was a «good» click, a «bad» click, the «last longest click», etc. According to Pedro Dias (Xoogler) CRaPS is an acronym for Clicks and Results Prediction System.
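The following sketch shows one simple way a click store and a click predictor could fit together. It is purely illustrative. The class `ClickStore`, its fields, and the smoothing formula with its `prior` and `weight` parameters are assumptions of this example, not something described in the testimony or in the leaked documents.

```python
from collections import defaultdict


class ClickStore:
    """Hypothetical store of click features per (query, URL, device) key."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"impressions": 0, "good": 0, "bad": 0})

    def record(self, query, url, device, clicked=False, good=False):
        """Register one impression and, if there was a click, its quality."""
        entry = self.stats[(query, url, device)]
        entry["impressions"] += 1
        if clicked:
            entry["good" if good else "bad"] += 1

    def predicted_good_click_rate(self, query, url, device, prior=0.1, weight=10):
        """Smoothed share of impressions that ended in a good click.

        With no history the estimate equals `prior`; as impressions
        accumulate, the observed data dominates.
        """
        entry = self.stats.get((query, url, device))
        if entry is None:
            return prior
        return (entry["good"] + prior * weight) / (entry["impressions"] + weight)


store = ClickStore()
for i in range(20):
    # 20 mobile impressions: 14 clicks, of which 12 are "good" and 2 "bad"
    store.record("hdtv", "https://shop.example/tv", "mobile", clicked=i < 14, good=i < 12)

print(store.predicted_good_click_rate("hdtv", "https://shop.example/tv", "mobile"))
print(store.predicted_good_click_rate("hdtv", "https://shop.example/tv", "desktop"))
```

Keying the data by device mirrors the mobile/desktop separation mentioned in the testimony; adding location to the key would work the same way.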

NLP

Natural language processing, which of course is something we already knew Google does with the texts of a page.

Especially interesting is the GoogleApi.ContentWarehouse.V1.Model.NlpSaftDocument module, which explains that Google stores both the raw text and its natural language analysis, with various annotations and attributes such as the relationships between the mentioned entities, the topic, and the main subject of the document.

NSR

We don’t know what these initials mean, but they are certainly important within the set, with several modules indicating NSR in their name. We also know from one mention that there is an «NSR team», an NSR score and an NSR confidence score. According to Mike King, it could mean «Neural Semantic Retrieval».

Ocean

This is a Google index. There are 64 mentions of Ocean in the docs.

OysterRank / Oyster Rank

A system to apply basic categorization, and it seems to only be applied to map elements. There are many mentions of OysterRank, but all in relation to Geostore.

PageRank

The original algorithm for calculating the importance of a page on the web, based on the importance of the links it receives. Although it seems that Google no longer evaluates links with PageRank (or only with PageRank), it continues to use this algorithm for many purposes, for example as a signal for canonicalization. I’ve found 17 mentions of PageRank in the docs – but some of them just explain that PageRank has been deprecated, and that PageRankNS is the value to be used in production.
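For readers who have never seen it written down, this is a minimal sketch of the textbook PageRank computation by repeated iteration. It follows the classic published formulation with a damping factor of 0.85, and the tiny three-page graph is invented for the example. It says nothing about how Google computes or uses the value today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # every page receives the same small baseline share
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page without outgoing links spreads its rank over all pages
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


graph = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}
for page, score in sorted(pagerank(graph).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

The page that collects the most link weight from the others ends up with the highest score, which is the whole intuition behind the algorithm.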

PageRankNS / PageRank-NearestSeeds

According to at least 5 mentions in the docs, this is the PageRank method currently used in production. «NearestSeeds» makes me think of the Google patent for an updated version of PageRank called «Producing a ranking for pages using distances in a web-link graph», which dealt with seed pages and link distances for propagating PageRank. I analyzed that patent in this post.
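To illustrate what "distance to the nearest seed" could look like, the sketch below runs a breadth-first search from a set of seed pages and records how many links away each page is. Treating every link as having the same length is an assumption made for simplicity, and the function name and sample graph are hypothetical; this is not the method from the patent or from the leaked documents.

```python
from collections import deque


def distance_to_nearest_seed(links, seeds):
    """Number of link hops from the closest seed page to each reachable page.

    `links` maps each page to the pages it links to. Pages that cannot be
    reached from any seed are left out of the result.
    """
    distances = {seed: 0 for seed in seeds}
    queue = deque(distances)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


graph = {
    "seed-a": ["p1"],
    "p1": ["p2"],
    "seed-b": ["p2", "p3"],
    "p3": ["p4"],
    "island": ["p4"],
}
print(distance_to_nearest_seed(graph, ["seed-a", "seed-b"]))
```

A ranking built on this idea would give more weight to pages that are few hops away from trusted seeds, and little or none to pages that no seed can reach.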

PerDoc / PerDocData

All the data that Google stores about a document (a page, a PDF, a video, etc.) for indexing and serving it as a result in the SERP. There are no fewer than 142 attributes.

Some of them are to be expected, such as quality signals. Others are less so, such as the PageRank of the home page of the site to which the page belongs, and a KeywordStuffing score.

I highly recommend starting with this document or module if you want to do a systematic analysis of the contents of the leak.

PoS

Part of Speech. After indexing the pages, Google can perform a syntactic analysis on them, to better understand what they are talking about and what the relationships are between the terms that appear on the page. «A dog bites a child» is not the same as «a child bites a dog», and this is something Google can only understand by extracting the subjects and other syntactic parts of the sentences.

QnAPage

There is the QnAPage attribute, to differentiate pages that contain questions and answers from other types of pages.

QualitySignals

There is a module that lists all the quality signals at the page level: GoogleApi.ContentWarehouse.V1.Model.CompressedQualitySignals

Some of these signals are not very intelligible from the outside for now (for example, experimentalQstarDeltaSignal). Others, judging by their name and description, could fit well with recent updates publicly communicated by Google, such as «ugcDiscussionEffortScore» (hidden gems) and all the mentions of ProductReview.

This document is probably one of the densest and most interesting from the leak, and contains many threads to pull on to understand what Google considers page quality and what it does not.

Racter / RacterScores

RacterScores are a site-level AGC classification score. Racter was a ChatGPT precursor developed in the 1980s, so it wouldn’t surprise me that the Google engineers introduced a geeky reference there.

RankableSensitivity

There are more than 30 modules with mentions of this concept. I believe it always refers to «sensitive» content (adult topics, politics, violence, etc.), a quality that Google can extract from various places (from its analysis to understand the query, analysis of a document’s text, etc.).

RankEmbed

Again, thanks to the antitrust lawsuit, where Pandu Nayak spoke about it in some detail, we know that RankEmbed (and a later version called RankEmbed BERT) is a Deep Learning system used in the re-ranking phase, trained with user data, and which among other things is capable of «rescuing» documents that would not have been considered as candidates in the Retrieval phase.

RankEmbed is mentioned at least 16 times in the leaked documents, and seems closely related to Mustang and quality models.

RankLab

Mentioned at least 25 times, RankLab is a framework or library related to the training of machine learning models, but possibly also to their inference (making predictions based on training). Specifically, it seems related to systems for making predictions about Titles and snippets (the text below the Title of a result in the SERP).

SAFT

According to former Googler Pedro Dias, it means Structured Annotation Framework and Toolkit.

Semantic Date

A date that is extracted from either the content of the page, the anchor text of links pointing to that page or some other related documents.

See Syntactic Date.

SiteEmbeddings

See Embeddings.

SmallPersonalSite

As its name indicates, a personal website. Note that the only mention I found in the docs talks about elevating this type of site in the rankings, not devaluing them.

SnippetBrain

The system used to determine if a Featured Snippet will be shown for a page, and what specific text from that page will be shown.

Syntactic Date

A date that is explicitly mentioned in the URL or the Title of a page.

See Semantic Date.
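As an example of a date that is explicit in the URL, the sketch below looks for a /YYYY/MM/DD/ pattern in the path and checks that it is a real calendar date. The pattern, the function name and the sample URLs are assumptions for illustration; the leaked documents do not describe how the extraction is done.

```python
import re
from datetime import date

# Hypothetical pattern: a /YYYY/MM/DD segment followed by a slash or the end of the URL
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def syntactic_date_from_url(url):
    """Return the date written in the URL path, or None if there is none."""
    match = DATE_IN_URL.search(url)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits that look like a date but are not one, e.g. month 13


print(syntactic_date_from_url("https://blog.example/2024/05/28/google-leak"))
print(syntactic_date_from_url("https://blog.example/about"))
```

A date found this way needs no interpretation of the content, which is what separates it from a date inferred from the text or from links.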

SuperRoot

The central data processing and storage system of Google’s infrastructure (the service that powers Google Search, according to a former Googler). As such, it has modules for practically everything: from recommending podcasts to storing document information, to processing scores for ranking. SuperRoot integrates with many other systems and modules, such as ascoring, docjoins and TopicEmbeddings.

TeraGoogle

It is an index of documents with certain characteristics. Most of its contents are stored on hard disk, which means that those contents are stable and do not change as frequently as contents stored in flash memory. If so, a document is included in TeraGoogle because Google considers it worth storing in the medium or long term, and inclusion in TeraGoogle could itself be a quality signal (although the documentation specifies that some signals are stored in flash).

TitleMatch

The degree of relevance of a page’s Title to the query made by the user.

Topicality / TopicalityScore / TopicalityWeight

Topicality is how Google internally refers to relevance, that is, the degree to which a document talks about the same thing as the query entered by the user. We know this from Pandu Nayak’s testimony in the antitrust lawsuit against Google and from many IR papers.

There are at least 4 mentions of Topicality in the docs, with specific mentions of a Topicality score (TopicalityScore) and a weight (TopicalityWeight), which is used to qualify the links pointing to a URL.

TopicEmbeddings

An estimation of the main topic that a page or group of pages is about. There is a TopicEmbedding at the site level, which includes metrics such as SiteFocusScore (how much a site focuses on its main theme) and SiteRadius (how far in terms of theme the embeddings of its pages are from the site’s embedding). See Embeddings.
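The leaked documents name these metrics but do not give their formulas, so the following is only a guess at the general shape of a SiteRadius-like measure. It assumes the site embedding is the average of the page embeddings and measures how far, on average, each page is from it. The function names, that averaging choice and the two-dimensional toy vectors are all assumptions of this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def site_embedding(page_embeddings):
    """Assumed site embedding: the component-wise mean of the page embeddings."""
    count = len(page_embeddings)
    return [sum(values) / count for values in zip(*page_embeddings)]


def site_radius(page_embeddings):
    """Average cosine distance between each page and the site embedding."""
    center = site_embedding(page_embeddings)
    distances = [1.0 - cosine_similarity(page, center) for page in page_embeddings]
    return sum(distances) / len(distances)


# Toy vectors (assumption): a site about one topic vs. a site covering several
focused_site = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
scattered_site = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

print(site_radius(focused_site))
print(site_radius(scattered_site))
```

Under these assumptions a tightly themed site gets a small radius and a site that covers unrelated topics gets a larger one, so a focus score would move in the opposite direction.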

Twiddlers

These are functions to obtain a re-ranking for a specific goal, without having to alter the conditions of the system. It would be like putting a filter or patch on the final results that users see.

The experiments that are done live in the search engine are twiddlers, and it is very possible that some of the specific updates that Google communicates (not the core updates) are also twiddlers.

If Google needs to make an emergency change to its results, it will do so by means of a twiddler.
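Conceptually, a twiddler can be pictured as a small function applied to an already ranked list. The sketch below is a loose illustration of that idea only. The function names, the multipliers and the sample URLs are hypothetical, and the real interface is not described in the leaked documents.

```python
def apply_twiddlers(results, twiddlers):
    """Re-rank (url, score) pairs by passing each score through every twiddler.

    The base ranking is not modified; a new, re-sorted list is returned.
    """
    adjusted = []
    for url, score in results:
        for twiddler in twiddlers:
            score = twiddler(url, score)
        adjusted.append((url, score))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)


def boost_forums(url, score):
    """Hypothetical twiddler: lift forum threads."""
    return score * 1.5 if "/forum/" in url else score


def demote_domain(url, score):
    """Hypothetical twiddler: push down one specific domain."""
    return score * 0.5 if url.startswith("https://spam.example/") else score


base_ranking = [
    ("https://spam.example/page", 0.9),
    ("https://site.example/article", 0.8),
    ("https://site.example/forum/thread", 0.7),
]
for url, score in apply_twiddlers(base_ranking, [boost_forums, demote_domain]):
    print(url, round(score, 2))
```

Because the adjustment lives outside the base scoring, a twiddler like this can be added, changed or removed quickly, which fits the idea of using them for experiments and emergency fixes.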