Entity Archaeology

Diagnosing what machines believe — and what they’ll do when signals disappear.

Entity Archaeology is the process of figuring out what the machine already believes about a brand — and whether those beliefs are stable. It’s not about building. It’s about understanding what’s already there, what’s gone dark, and what the fallback chain looks like when a primary signal disappears.

Framework ID
F-001 · Entity Archaeology
Layer
Layer 01 — Diagnostic
Status
Published · 2026
Author
Sorilbran Stone · Five-Talent Strategy House
Track
Diagnostic · Proprietary
Use When
Brand has 2+ years of digital history. Run before any repositioning, rebrand, or entity architecture work.

What Is It

At its core, Entity Archaeology is the discipline of excavating what the machine already believes about a brand, then testing whether those beliefs are stable.

It is distinct from Entity Architecture, which is the process of building a knowledge graph from scratch. Archaeology comes first when a brand has history. Architecture comes first when it doesn’t.

The goal of archaeology is not just to find what’s there. It’s to find what the machine will do when something disappears — because that fallback behavior is either working for you or against you, and most brands have no idea which.

LLMs have two distinct ways of knowing things about a brand: training data (baked in at model training, creates ambient recognition) and the retrieval layer (on-demand search during inference, current and specific). The retrieval layer does not simply confirm training data — it can introduce doubt. If it goes looking and finds nothing where it expected something, that absence is itself a signal. The machine doesn’t forget — but it can lose confidence.

The Core Insight

A node’s stability is not just about which sources reference a brand. It’s about whether multiple sources are saying the same thing.

Corroboration across training data tiers is the key variable. A brand mentioned once in Forbes has some signal. A brand mentioned in Forbes, Investopedia, and an industry directory — with all three pointing to the same facts — has a load-bearing node.

When archaeology reveals anchored nodes, those nodes become the foundation for positioning work — including identifying Blue Puddles, which are emerging micro-markets where a brand’s capabilities align with unmet demand. Blue Puddles is a market positioning framework; what archaeology provides is the structural clarity that makes that positioning credible to machines.

Node Classification

Every signal about a brand falls into one of three categories:

Anchored Node
Definition: In training data AND still retrievable now. The machine knows it and can verify it.
Action: Protect. Strengthen corroboration. Use the Blue Puddles positioning framework to identify emerging micro-markets you can credibly claim around this node.

Ghost Node
Definition: In training data but no longer retrievable. Creates active uncertainty in LLM synthesis.
Action: Restore or replace. Build a narrative bridge if the source is gone.

Architectural Gap
Definition: Never had meaningful Tier 1 or Tier 2 presence. Nothing to find.
Action: Build from scratch. This is architecture, not archaeology.

Ghost nodes are the most dangerous. They don’t just go quiet — they introduce active uncertainty into how the system synthesizes a brand’s identity. The machine reaches for fallback signals, which may be old, inaccurate, or off-brand.

Training Data Source Tiers

Not all sources are equally likely to appear in training data. The exact mix of pages in any specific LLM’s corpus is proprietary, much like Google’s search algorithm or TikTok’s interest-driven feed: careful analysts can build a close estimate, but nobody outside the lab knows exactly. Still, a general understanding of the types of data most likely to be represented in each tier helps prioritize both archaeology and architecture efforts.

Tier 1 (almost certain to be in training data): Wikipedia, NYT, Forbes, WSJ, Reuters, Amazon, LinkedIn (public), Investopedia, government databases, academic publications
Tier 2 (likely at scale): Mid-tier trade publications, podcast transcripts, Reddit, Quora, GitHub, YouTube transcripts, Goodreads
Tier 3 (possible but inconsistent): Niche blogs, smaller publications, personal websites, press releases

Corroboration across tiers carries more weight than depth within a single tier. A brand confirmed by three Tier 1 sources has a stronger knowledge graph anchor than a brand with ten Tier 3 mentions.
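The corroboration rule above reduces to a small counting heuristic. The sketch below is illustrative, not part of the framework’s tooling: the data shapes are assumptions, the tier numbers follow the table above, and the two-source minimum comes from the working answers later in this piece. Independence and same-fact consistency are assumed to have been checked upstream.

```python
# Illustrative corroboration heuristic: a fact is treated as anchored when at
# least two independent Tier 1 or Tier 2 sources corroborate it. The dict
# shape {"name": ..., "tier": ...} is an assumption, not a prescribed schema.

def is_anchored(sources, min_corroboration=2):
    """Return True when a fact has enough cross-tier corroboration."""
    strong = [s for s in sources if s["tier"] in (1, 2)]
    return len(strong) >= min_corroboration

forbes_only = [{"name": "Forbes", "tier": 1}]
corroborated = [
    {"name": "Forbes", "tier": 1},
    {"name": "Investopedia", "tier": 1},
    {"name": "Industry directory", "tier": 2},
]
ten_tier3_mentions = [{"name": f"niche-blog-{i}", "tier": 3} for i in range(10)]

print(is_anchored(forbes_only))        # False: one source is a signal, not an anchor
print(is_anchored(corroborated))       # True: three corroborating Tier 1/2 sources
print(is_anchored(ten_tier3_mentions)) # False: depth within Tier 3 does not substitute
```

Note that the third case encodes the point above directly: ten Tier 3 mentions never cross the threshold, because the count only considers Tier 1 and Tier 2 sources.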

When to Run It

Under 2 years: Skip archaeology. Go straight to architecture. Revisit in 18 months.
2–5 years: Light diagnostic. Enough may have accumulated to create ghost nodes. Run a quick confidence check.
5–10 years: Full archaeology warranted. Highest-value use of this framework.
10+ years: Essential before any repositioning or rebrand. Ghost nodes from a decade ago can actively contradict current identity.
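The age thresholds above amount to a simple decision rule. A minimal sketch, with the caveat that the boundary behavior at exactly 5 and 10 years is a judgment call the original guidance leaves open:

```python
def archaeology_recommendation(brand_age_years):
    """Map brand age to the recommended diagnostic depth (per the guidance above)."""
    if brand_age_years < 2:
        return "skip: go straight to architecture, revisit in 18 months"
    if brand_age_years <= 5:
        return "light diagnostic: quick confidence check for early ghost nodes"
    if brand_age_years <= 10:
        return "full archaeology: highest-value use of the framework"
    return "essential: run before any repositioning or rebrand"

print(archaeology_recommendation(1))
print(archaeology_recommendation(7))
print(archaeology_recommendation(15))
```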

The Process — 5 Steps

1. Inventory What Should Exist
Ask the founder or brand lead: where has this brand been mentioned, listed, published, cited, or sold — ever? Don’t filter. Amazon, press mentions, directories, awards, podcasts, speaking engagements, partnerships, old websites. Everything goes on the list.

2. Check What Still Exists
Go find each item. Is it still live? Still accurate? Still pointing to the current version of the brand? Flag anything that has changed, moved, broken, or disappeared.

3. Run the Confidence Check — No Search
Ask the LLM — without allowing web search — what it knows about the brand. Compare what surfaces to the inventory list. Rate confidence: Strong signal, Weak signal, or No signal. Note what type of source the confidence appears anchored to.

4. Cross-Reference the Retrieval Layer
Now allow search. What does the retrieval layer find? Compare it to both the inventory list and the training-data confidence check. The gaps between all three are where ghost nodes live.

5. Classify Everything
Still live + in training data = Anchored Node. Gone or broken + was in training data = Ghost Node. Never existed at Tier 1 or 2 = Architectural Gap. The classification drives every next action.
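The Step 5 rules are mechanical enough to state as code. A minimal sketch, assuming each inventoried signal has already been flagged in Steps 3–4 for training-data presence and current retrievability; note it approximates "never had Tier 1 or 2 presence" as "absent from the no-search confidence check":

```python
def classify_node(in_training_data, still_retrievable):
    """Three-way classification from Step 5.

    in_training_data: the no-search confidence check surfaced this signal.
    still_retrievable: the retrieval layer can still find a live source for it.
    """
    if in_training_data and still_retrievable:
        return "Anchored Node"    # protect; strengthen corroboration
    if in_training_data:
        return "Ghost Node"       # restore, replace, or build a narrative bridge
    return "Architectural Gap"    # nothing to find; build from scratch

print(classify_node(True, True))    # Anchored Node
print(classify_node(True, False))   # Ghost Node
print(classify_node(False, False))  # Architectural Gap
```

Making the rules executable has one practical benefit: every signal on the inventory list gets a classification, so nothing from Step 1 silently falls through the diagnostic.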

The Fallback Chain

When a primary anchor goes dark, the retrieval layer doesn’t give up. It reaches for whatever else points to the same entity. This is the fallback chain.

Most brands don’t know their fallback chain exists. They discover it the way you discover a backup generator — when the power goes out.

The strategic question Entity Archaeology answers: what is the machine’s fallback chain for this brand, and can it be engineered intentionally? A well-architected fallback chain means that if the primary anchor weakens or disappears, the system reaches for the right signals — not random old ones.

Recency is not just a human credibility issue — it’s a machine confidence issue. Old signals with no corroborating current presence carry less weight, not because the machine forgets, but because it can’t verify that those signals are still true. Brands must create ongoing signal, not just archive past signal, to maintain anchor node stability.
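One way to make the fallback chain visible is the three-way cross-reference from Steps 3 and 4, expressed as set arithmetic. This is a sketch under one assumption: each layer’s findings have been normalized to a set of signal identifiers (the names below are hypothetical).

```python
def cross_reference(inventory, training_recall, retrieval_hits):
    """Compare what should exist (Step 1), what the model recalls without
    search (Step 3), and what the retrieval layer actually finds (Step 4)."""
    return {
        # recalled but no longer retrievable: active uncertainty
        "ghost_nodes": training_recall - retrieval_hits,
        # recalled AND retrievable: stable anchors to protect
        "anchored": training_recall & retrieval_hits,
        # inventoried but invisible to both layers: gaps to build
        "gaps": inventory - (training_recall | retrieval_hits),
        # surfaces without being in your own inventory: the de facto fallback chain
        "fallback_chain": (training_recall | retrieval_hits) - inventory,
    }

report = cross_reference(
    inventory={"forbes_profile", "2019_podcast", "old_award_page"},
    training_recall={"forbes_profile", "old_award_page", "founder_linkedin"},
    retrieval_hits={"forbes_profile", "founder_linkedin"},
)
print(report["ghost_nodes"])     # {'old_award_page'}
print(report["fallback_chain"])  # {'founder_linkedin'}
```

In the example, the founder’s LinkedIn profile never appeared on the brand’s own inventory, yet the machine reaches for it anyway — exactly the kind of unplanned fallback the framework says to either reinforce deliberately or starve.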

Working Answers to Common Questions

These questions come up in practice. Here are working answers based on how Entity Archaeology has been applied across engagements.

Q: What is the minimum corroboration threshold for a node to be considered anchored vs. a weak signal?
A: Two independent Tier 1 or Tier 2 sources referencing the same fact about the entity — with consistent language — is the working minimum. One source is a signal. Two corroborating sources are an anchor. Below that threshold, treat it as a weak signal and reinforce before building architecture on top of it.

Q: Can fallback chains be mapped predictively, before a primary anchor goes dark?
A: Yes — and this is one of the highest-value uses of the framework. Run a confidence check without allowing search, then compare what surfaces to your known anchor inventory. The sources the machine reaches for in the absence of your primary signals are your fallback chain. Map it deliberately, then reinforce the fallbacks you want and starve the ones you don’t.

Q: How does the confidence check differ across LLMs, and does that matter for the diagnostic?
A: It matters significantly. Different LLMs have different training data cutoffs, different retrieval behaviors, and different confidence thresholds for hedging. Run the confidence check across at least two LLMs — Claude plus ChatGPT or Perplexity — and note where the responses diverge. Divergence is data: it tells you which signals are stable across systems and which are model-specific. Stable anchors show up consistently. Ghost nodes produce hedging in some models and silence in others.

Q: What is the relationship between Blue Puddles cadence and anchor node stability?
A: Blue Puddles positioning — claiming emerging micro-markets — strengthens the Specialization node of the MVKG when done correctly. Each puddle adds a specific, corroborated sub-category signal. The risk is claiming puddles that contradict existing anchor signals. If your anchored identity says one thing and your puddle claims say another, the machine encounters conflicting signals and hedges. The rule: puddles should extend your specialization, not contradict it. Run Entity Archaeology before claiming any new puddle to verify your anchor nodes can support the shift.
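The cross-model divergence check described above can be sketched the same way: collect a per-model confidence rating for each signal and flag where the models disagree. The rating labels and the input structure are illustrative assumptions; how you elicit the ratings (manual review of transcripts, for instance) is up to you.

```python
def signal_stability(ratings_by_model):
    """ratings_by_model: {model_name: {signal: 'strong' | 'weak' | 'none'}}.

    A signal is 'stable' when every model rates it identically; otherwise it
    is 'model-specific'. A model that never mentions a signal is treated as
    rating it 'none', so silence in one model counts as divergence."""
    all_signals = set()
    for ratings in ratings_by_model.values():
        all_signals |= set(ratings)
    return {
        signal: "stable"
        if len({r.get(signal, "none") for r in ratings_by_model.values()}) == 1
        else "model-specific"
        for signal in all_signals
    }

report = signal_stability({
    "model_a": {"forbes_profile": "strong", "old_award_page": "weak"},
    "model_b": {"forbes_profile": "strong"},  # silent on the award page
})
print(report["forbes_profile"])  # stable
print(report["old_award_page"])  # model-specific
```

This mirrors the diagnostic claim above: stable anchors show up consistently across systems, while ghost nodes surface as hedging in one model and silence in another.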