Entity Archaeology
Diagnosing what machines believe, and what they’ll do when signals disappear.
Entity Archaeology is the process of figuring out what the machine already believes about a brand, and whether those beliefs are stable. It’s not about building. It’s about understanding what’s already there, what’s gone dark, and what the fallback chain looks like when a primary signal disappears.
What Is It
Entity Archaeology is the process of figuring out what the machine already believes about a brand, and whether those beliefs are stable.
It is distinct from Entity Architecture, which is the process of building a knowledge graph from scratch. Archaeology comes first when a brand has history. Architecture comes first when it doesn’t.
The goal of archaeology is not just to find what’s there. It’s to find what the machine will do when something disappears, because that fallback behavior is either working for you or against you, and most brands have no idea which.
The Core Insight
A node’s stability is not just about which sources reference a brand. It’s about whether multiple sources are saying the same thing.
Corroboration across training data tiers is the key variable. A brand mentioned once in Forbes has some signal. A brand mentioned in Forbes, Investopedia, and an industry directory, with all three pointing to the same facts, has a load-bearing node.
When archaeology reveals anchored nodes, those nodes become the foundation for positioning work, including identifying Blue Puddles, which are emerging micro-markets where a brand’s capabilities align with unmet demand. Blue Puddles is a market positioning framework; what archaeology provides is the structural clarity that makes that positioning credible to machines.
Node Classification
Every signal about a brand falls into one of three categories:
| Node Type | Definition | Action Required |
|---|---|---|
| Anchored Node | In training data AND still retrievable now. The machine knows it and can verify it. | Protect. Strengthen corroboration. Use the Blue Puddles positioning framework to identify emerging micro-markets you can credibly claim around this node. |
| Ghost Node | In training data but no longer retrievable. Creates active uncertainty in LLM synthesis. | Restore or replace. Build a narrative bridge if the source is gone. |
| Architectural Gap | Never had meaningful Tier 1 or 2 presence. Nothing to find. | Build from scratch. This is architecture, not archaeology. |
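The table reduces to two diagnostic questions: is the signal in training data, and is it still retrievable now? A minimal sketch of that decision rule in Python; the type and function names are illustrative, not part of the framework:

```python
from enum import Enum

class NodeType(Enum):
    ANCHORED = "anchored"            # in training data AND retrievable now
    GHOST = "ghost"                  # in training data but no longer retrievable
    ARCHITECTURAL_GAP = "gap"        # never had meaningful Tier 1/2 presence

def classify_node(in_training_data: bool, retrievable_now: bool) -> NodeType:
    """Map the two diagnostic signals onto the three node types."""
    if in_training_data and retrievable_now:
        return NodeType.ANCHORED
    if in_training_data:
        return NodeType.GHOST
    # Never in training data: nothing to excavate, only to build.
    return NodeType.ARCHITECTURAL_GAP
```

Note that a signal which is retrievable now but was never in training data also lands in the gap bucket: the machine can fetch it, but has no prior belief to stabilize.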
Training Data Source Tiers
Not all sources are equally likely to appear in training data. The exact mix of pages in any specific LLM’s corpus is proprietary, much like Google’s search algorithm or TikTok’s interest-driven algorithm: careful observers can get a close estimate, but never an exact answer. Even so, a general understanding of the types of data most likely to be represented in each tier helps me prioritize both archaeology and architecture efforts.
| Tier | Examples | Likelihood |
|---|---|---|
| Tier 1 | Wikipedia, NYT, Forbes, WSJ, Reuters, Amazon, LinkedIn (public), Investopedia, government databases, academic publications | Almost certain |
| Tier 2 | Mid-tier trade publications, podcast transcripts, Reddit, Quora, GitHub, YouTube transcripts, Goodreads | Likely at scale |
| Tier 3 | Niche blogs, smaller publications, personal websites, press releases | Possible but inconsistent |
Corroboration across tiers carries more weight than depth within a single tier. A brand confirmed by three Tier 1 sources has a stronger knowledge graph anchor than a brand with ten Tier 3 mentions.
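One way to make this weighting concrete is a toy scoring function. The tier weights and the cross-tier bonus below are illustrative assumptions, not calibrated values; the point is only the shape of the rule, where a few strong corroborating sources beat a pile of weak ones:

```python
# Hypothetical weights encoding "higher tiers count for much more".
TIER_WEIGHT = {1: 1.0, 2: 0.5, 3: 0.1}

def anchor_strength(mention_tiers: list[int]) -> float:
    """Score an entity's anchor from the tier (1-3) of each corroborating source.

    Base score sums tier weights; a bonus for each additional distinct tier
    encodes "corroboration across tiers carries more weight than depth
    within a single tier". The 0.25 bonus factor is an assumption.
    """
    base = sum(TIER_WEIGHT[t] for t in mention_tiers)
    distinct_tiers = len(set(mention_tiers))
    return base * (1 + 0.25 * (distinct_tiers - 1))

# Three Tier 1 sources score 3.0; ten Tier 3 mentions score only 1.0.
```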
When to Run It
| Brand Age | Recommendation |
|---|---|
| Under 2 years | Skip archaeology. Go straight to architecture. Revisit in 18 months. |
| 2–5 years | Light diagnostic. Enough may have accumulated to create ghost nodes. Run a quick confidence check. |
| 5–10 years | Full archaeology warranted. Highest-value use of this framework. |
| 10+ years | Essential before any repositioning or rebrand. Ghost nodes from a decade ago can actively contradict current identity. |
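The table is a simple age-based decision rule. A sketch, with the boundary ages treated as assumptions since the table’s 2–5 and 5–10 bands share endpoints:

```python
def archaeology_recommendation(brand_age_years: float) -> str:
    """Age-based decision rule: which diagnostic effort fits this brand.

    Boundary handling (whether exactly 5 years falls in the light-diagnostic
    or full-archaeology band) is an assumption of this sketch.
    """
    if brand_age_years < 2:
        return "Skip archaeology. Go straight to architecture. Revisit in 18 months."
    if brand_age_years < 5:
        return "Light diagnostic. Run a quick confidence check for ghost nodes."
    if brand_age_years < 10:
        return "Full archaeology warranted. Highest-value use of this framework."
    return "Essential before any repositioning or rebrand."
```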
The Process: 5 Steps
The Fallback Chain
When a primary anchor goes dark, the retrieval layer doesn’t give up. It reaches for whatever else points to the same entity. This is the fallback chain.
Most brands don’t know their fallback chain exists. They discover it the way you discover a backup generator: when the power goes out.
The strategic question Entity Archaeology answers: what is the machine’s fallback chain for this brand, and can it be engineered intentionally? A well-architected fallback chain means that if the primary anchor weakens or disappears, the system reaches for the right signals, not random old ones.
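Mapped deliberately, a fallback chain is just the entity’s still-live signals ordered by how readily the retrieval layer is likely to reach for them. A toy sketch, assuming source tier is a reasonable proxy for that ordering; the field names and example sources are hypothetical:

```python
def fallback_chain(signals: list[dict]) -> list[str]:
    """Return the sequence of live signals the retrieval layer would
    plausibly walk if earlier signals go dark.

    Each signal is a dict like {"source": str, "tier": int, "live": bool};
    ordering by tier is an assumption of this sketch.
    """
    ordered = sorted(signals, key=lambda s: s["tier"])
    return [s["source"] for s in ordered if s["live"]]

chain = fallback_chain([
    {"source": "Forbes profile", "tier": 1, "live": False},  # primary anchor, gone dark
    {"source": "Investopedia entry", "tier": 1, "live": True},
    {"source": "Reddit AMA thread", "tier": 2, "live": True},
])
# chain[0] is the signal the system falls back to first
```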
Working Answers to Common Questions
These questions come up in practice. Here are working answers based on how Entity Archaeology has been applied across engagements.
| Question | Working Answer |
|---|---|
| What is the minimum corroboration threshold for a node to be considered anchored vs. weak signal? | Two independent Tier 1 or Tier 2 sources referencing the same fact about the entity, with consistent language, is the working minimum. One source is a signal. Two corroborating sources is an anchor. Below that threshold, treat it as a weak signal and reinforce before building architecture on top of it. |
| Can fallback chains be mapped predictively before a primary anchor goes dark? | Yes, and this is one of the highest-value uses of the framework. Run a confidence check without allowing search, then compare what surfaces to your known anchor inventory. The sources the machine reaches for in the absence of your primary signals are your fallback chain. Map it deliberately, then reinforce the fallbacks you want and starve the ones you don’t. |
| How does the confidence check differ across LLMs, and does that matter for the diagnostic? | It matters significantly. Different LLMs have different training data cutoffs, different retrieval behaviors, and different confidence thresholds for hedging. Run the confidence check across at least two LLMs (Claude plus ChatGPT or Perplexity) and note where the responses diverge. Divergence is data: it tells you which signals are stable across systems and which are model-specific. Stable anchors show up consistently. Ghost nodes produce hedging in some models and silence in others. |
| What is the relationship between Blue Puddles cadence and anchor node stability? | Blue Puddles positioning (claiming emerging micro-markets) strengthens the Specialization node of the MVKG when done correctly. Each puddle adds a specific, corroborated sub-category signal. The risk is claiming puddles that contradict existing anchor signals. If your anchored identity says one thing and your puddle claims say another, the machine encounters conflicting signals and hedges. The rule: puddles should extend your specialization, not contradict it. Run Entity Archaeology before claiming any new puddle to verify your anchor nodes can support the shift. |
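The cross-LLM divergence check can be sketched as a small labeling pass over collected responses. Everything here is an assumption: the hedge markers, the three labels, and the idea that substring matching is an adequate first-pass proxy. Gathering the responses from each model (manually or via each vendor’s API) is out of scope for this sketch:

```python
# Hypothetical phrases that suggest the model is hedging rather than asserting.
HEDGE_MARKERS = ("not sure", "don't have", "might", "uncertain", "as of my", "unable to verify")

def diagnose(responses: dict[str, str]) -> dict[str, str]:
    """Label each model's confidence-check response.

    responses maps model name -> what it said about the entity.
    Labels: "silent" (nothing surfaced: likely gap or ghost node),
    "hedging" (signal present but unstable), "confident" (candidate anchor).
    """
    labels = {}
    for model, text in responses.items():
        t = text.lower().strip()
        if not t:
            labels[model] = "silent"
        elif any(marker in t for marker in HEDGE_MARKERS):
            labels[model] = "hedging"
        else:
            labels[model] = "confident"
    return labels
```

Where models agree on "confident", you likely have a stable anchor; where one hedges and another goes silent, you are looking at a ghost node or model-specific signal worth excavating.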
