OSINTBench

hidden-data-layer-ai-osint

Public databases like EDGAR, the NPI registry, and IRS Form 990 have always been open — but functionally inaccessible to most investigators without deep domain expertise. This guide shows how to use AI tools to query, extract, and synthesize records across these sources while maintaining the source validation standards that real investigations require.

intermediate Updated 2026-04-02

Why AI Changes Public Database OSINT

The data has been available for a long time. EDGAR holds decades of SEC filings. The NPI registry lists U.S. healthcare providers that have been assigned a National Provider Identifier. SAM.gov registers entities that do business with the federal government. IRS Form 990s reveal nonprofit officer pay and grant relationships. The information was public, just not usable.

The obstacle was not access, but making sense of it. Reading an SEC 10-K to grasp ownership structure means finding a limited partnership agreement in Exhibit 4. Parsing dual affiliations from a Form 990 requires knowing that Schedule L is where transactions with interested parties are reported. Verifying whether a SAM.gov awardee matches a sanctioned OFAC entry means cross-referencing EINs, Unique Entity IDs (which replaced DUNS numbers in SAM.gov in 2022, though older records may still carry a DUNS), and addresses across disconnected systems.

AI does not replace expertise. It lowers the cost of entry for investigators who know what they are looking for but lack the specialist knowledge to navigate complex government data quickly.

The shift is this: plain-English queries let you ask a specific question about a complex document without becoming a specialist. Summarization condenses a 200-page 10-K into a hypothesis in minutes. Structured extraction turns financial disclosures into an entity map draft you can validate.

Verification is still needed. AI compresses the initial research phase, but human analysis is still required to catch the moments when it fills gaps with convincing fabrications.

This guide is built around that tension.


The Database Classes That Matter Most

EDGAR (SEC Electronic Data Gathering, Analysis, and Retrieval) EDGAR is the go-to source for digging into public companies and private entities that file with the SEC. You'll find 10-Ks, S-1s, proxy statements, and 8-Ks packed with info on subsidiaries, officer pay, related-party deals, and equity structures. The real investigative gold is in the fine print. A three-word mention of a management agreement with an affiliate on page 74 of an annual report can be the thread that unravels an entire shell entity network.

NPI Registry (National Provider Identifier) U.S. healthcare providers that conduct HIPAA-covered transactions are required to have an NPI number, which in practice includes most providers that bill insurers. The registry maps individuals to group practices, and those practices to parent organizations. For investigators tracking medical billing fraud, private equity roll-ups, or corporate healthcare structures, NPI records are a key pivot point between provider identities and the ownership structures found in EDGAR or state filings.

PubMed and BioRxiv These databases shine for conflict-of-interest mapping and research network analysis. Look at author affiliations, funding disclosures, and acknowledgment sections. Even retraction notices can be revealing. BioRxiv preprints can carry funding disclosures that differ from the published version, and that difference is sometimes significant.

SAM.gov and USASpending This duo offers a window into federal contracting. SAM.gov holds entity registration and exclusion records. USASpending breaks down award data by recipient, agency, contract type, and performance period. Together, they help you track who lands federal contracts, under what entity IDs, and if those entities have been excluded or debarred. The catch: entity names change, subsidiaries aren't always linked, and OFAC cross-referencing is a manual process.

IRS Form 990 This is the primary source for nonprofit financial transparency, yet many investigators overlook it. Schedule J reveals officer compensation. Schedule L flags transactions with interested parties. Schedule R covers related organizations. For probes into foundations, think tanks, advocacy groups, or research nonprofits, reading multiple years of Form 990 filings uncovers compensation trends, board changes, and grant relationships that aren't found elsewhere.

In each case, AI can best help with interpretation. Access is easy; understanding the data is hard.


Which AI Tools Actually Work on Which Databases

Our testing matrix rates each AI tool on how it actually performed in hands-on workflow testing against each source, not on what vendors claim. Performance varies widely by tool and by database.

Tool                         EDGAR   NPI     PubMed  SAM.gov  Form 990
Claude (MCP)                 ★★★★★   ★★★★★   ★★★★☆   ★★☆☆☆    ★★★☆☆
ChatGPT Plus (EDGAR plugin)  ★★★★☆   ★★☆☆☆   ★★★☆☆   ★★☆☆☆    ★★☆☆☆
Elicit                       ★★☆☆☆   ★☆☆☆☆   ★★★★★   ★☆☆☆☆    ★★☆☆☆
Perplexity                   ★★★☆☆   ★★☆☆☆   ★★★☆☆   ★★★☆☆    ★★☆☆☆
Gemini                       ★★★☆☆   ★★☆☆☆   ★★★☆☆   ★★☆☆☆    ★★☆☆☆

Claude with MCP integrations leads on EDGAR and NPI and comes second only to Elicit on PubMed. It reads files directly, passing raw text into context without any plugin preprocessing. You get the full document structure, where the important details often hide.

ChatGPT Plus handles EDGAR through its EDGAR plugin. It queries filings cleanly, but the output is pre-filtered and the filtering logic is not visible. That makes it good for spotting relationships and weaker for detail work.

Elicit shines on PubMed and preprints, where it handles funding disclosures, author lists, and citations well.

SAM.gov and Form 990 are the trouble spots. SAM.gov scatters information about a single entity across multiple records, and many Form 990s are PDFs with messy OCR. Treat AI results from both as hypotheses, not facts.

Verify AI output manually in three situations: when the output is going into a report, when it disagrees with prior findings, and when the underlying filing looks unusual.


Workflow 1: Tracing Beneficial Owners in a Private Equity-Backed Medical Chain

Scenario: A regional urgent care clinic chain got acquired by a private equity-backed management services organization. The parent entity structure is murky. The goal is to figure out who actually controls clinical operations and what the ownership chain looks like on paper.

Step 1: Start in EDGAR. Search for the management services organization and its affiliates by name and known variants. Grab the latest 10-K, plus any S-1 or other registration statement if the parent was publicly traded. Focus on three places: the "Business" section for subsidiary lists, Exhibit 21 (the registrant's list of subsidiaries), and the related-party transaction disclosures in the financial statement footnotes. Have the AI tool summarize the ownership sections and extract every named entity, including legal names, doing-business-as names, and jurisdiction of formation.
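As a rough illustration of Step 1, the sketch below narrows a company's filing list to the form types named above. It assumes you have already downloaded the company's submissions JSON from the SEC's data.sec.gov service (which expects a descriptive User-Agent header) and that the file uses the `filings.recent` layout of parallel arrays. The CIK and sample rows are invented, and the function names are hypothetical.

```python
def submissions_url(cik: int) -> str:
    """Build the data.sec.gov submissions URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def select_filings(submissions: dict, forms: set[str]) -> list[dict]:
    """Return filings whose form type is in `forms`, newest first."""
    recent = submissions["filings"]["recent"]
    rows = [
        {"form": form, "filed": filed, "accession": acc, "document": doc}
        for form, filed, acc, doc in zip(
            recent["form"],
            recent["filingDate"],
            recent["accessionNumber"],
            recent["primaryDocument"],
        )
        if form in forms
    ]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(rows, key=lambda r: r["filed"], reverse=True)


# Invented sample standing in for a downloaded submissions file.
sample = {
    "filings": {
        "recent": {
            "form": ["8-K", "10-K", "S-1", "10-K"],
            "filingDate": ["2025-05-01", "2025-03-15", "2021-06-30", "2024-03-14"],
            "accessionNumber": ["a-1", "a-2", "a-3", "a-4"],
            "primaryDocument": ["d1.htm", "d2.htm", "d3.htm", "d4.htm"],
        }
    }
}

hits = select_filings(sample, {"10-K", "S-1"})
for row in hits:
    print(row["filed"], row["form"], row["accession"])
```

The accession number on each row is what you keep as the citation for anything extracted from that filing.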

Step 2: Build a working entity map. Create a list of extracted entity names. Note which entities show up in SEC disclosures and which are only mentioned in management agreement language. Entities mentioned in passing but never formally described are often key.

Step 3: Pivot to NPI. Query the NPI registry for clinic group names, individual provider names from SEC filings, and group practice identifiers. NPI records reveal which legal entity employs or affiliates individual providers, and which group NPIs are active at specific clinic addresses. This connects the ownership layer to the operational layer.
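One way to sketch the NPI pivot in code: build a query URL for the public NPPES registry API and group the returned NPIs by practice location. The query parameters shown (`version`, `enumeration_type`, `organization_name`, `state`, `limit`) are ones the registry API accepts; the response field names are an assumption based on its JSON layout and should be checked against a live response. The clinic name, NPI numbers, and function names are invented for illustration.

```python
from urllib.parse import urlencode

NPPES_API = "https://npiregistry.cms.hhs.gov/api/"


def npi_org_query(organization_name: str, state: str = "", limit: int = 50) -> str:
    """Build a registry query URL for organization (NPI-2) records."""
    params = {
        "version": "2.1",
        "enumeration_type": "NPI-2",  # NPI-2 = organizations, NPI-1 = individuals
        "organization_name": organization_name,
        "limit": limit,
    }
    if state:
        params["state"] = state
    return f"{NPPES_API}?{urlencode(params)}"


def npis_by_location(response: dict) -> dict[str, list[str]]:
    """Group NPI numbers by practice-location address (mailing addresses are skipped)."""
    grouped: dict[str, list[str]] = {}
    for result in response.get("results", []):
        for addr in result.get("addresses", []):
            if addr.get("address_purpose") != "LOCATION":
                continue
            key = f'{addr["address_1"]}, {addr["city"]}, {addr["state"]}'.upper()
            grouped.setdefault(key, []).append(str(result["number"]))
    return grouped


# Invented response fragment; real responses carry many more fields.
sample_response = {
    "results": [
        {"number": 1111111111, "addresses": [
            {"address_purpose": "MAILING", "address_1": "PO Box 1", "city": "Columbus", "state": "OH"},
            {"address_purpose": "LOCATION", "address_1": "100 Main St", "city": "Dayton", "state": "OH"},
        ]},
        {"number": 2222222222, "addresses": [
            {"address_purpose": "LOCATION", "address_1": "100 MAIN ST", "city": "DAYTON", "state": "OH"},
        ]},
    ]
}

url = npi_org_query("Example Urgent Care", state="OH")
shared = npis_by_location(sample_response)
print(url)
print(shared)
```

Two NPIs at one address is a lead to check against filings, not evidence of common ownership.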

Step 4: Validate each node. Every entity in your map needs a source record citation. Entity maps are just starting points. Each node requires a specific filing, a specific NPI record, or a specific state registration document as its evidentiary basis. The reporting logic is: link investment ownership to provider operations only where there's a disclosed control relationship, a shared identifier, or a documented management agreement — not proximity or assumed affiliation.
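The validation rule in Step 4 can be expressed as a small check over the entity map. This is a minimal sketch: the `Node` and `Link` structures, the basis labels, and the entity names are all hypothetical, chosen to mirror the three acceptable grounds named above.

```python
from dataclasses import dataclass, field

# The only grounds on which a link may be reported, per the rule above.
ALLOWED_BASES = {"disclosed_control", "shared_identifier", "management_agreement"}


@dataclass
class Node:
    name: str
    citations: list[str] = field(default_factory=list)  # filing, NPI record, or state document


@dataclass
class Link:
    parent: str
    child: str
    basis: str
    citation: str = ""


def unverified(nodes: list[Node], links: list[Link]) -> list[str]:
    """List every node or link that lacks the evidence needed to report it."""
    problems = []
    for node in nodes:
        if not node.citations:
            problems.append(f"node without source: {node.name}")
    for link in links:
        if link.basis not in ALLOWED_BASES or not link.citation:
            problems.append(f"link not reportable: {link.parent} -> {link.child} ({link.basis})")
    return problems


nodes = [
    Node("Example MSO Holdings", ["10-K, Exhibit 21"]),
    Node("Example Urgent Care PC"),  # named by the AI summary, not yet sourced
]
links = [
    Link("Example MSO Holdings", "Example Urgent Care PC", "shared_address"),
]

problems = unverified(nodes, links)
for line in problems:
    print(line)
```

Running the check before drafting keeps address-based guesses out of the reported ownership chain.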


Workflow 2: Finding Hidden Financial and Institutional Ties

Pattern A: Research conflict of interest. Start with PubMed or BioRxiv to pull a target author's publication history. Use Elicit or Claude with direct file access to extract, across all papers, the listed affiliations, funding acknowledgments, and any corrections or retraction notices.

Note affiliation discrepancies between papers and the author's institutional employment at publication time. These can be investigatively significant.

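Cross-reference disclosed funders against Form 990 filings for nonprofit funder entities. The relevant parts are Schedule B (contributors, though donor names are often redacted in public copies), Schedule I (grants), and Schedule R (related organizations). To check board overlap between funders and funded institutions, compare the officer and director lists in each organization's filing.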

Elicit or Claude helps with assembly; extracting recurring funders across papers is tedious manually. Validation is still manual.
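Once funder names have been extracted per paper, the tally of recurring funders is simple to reproduce and audit outside the AI tool. A minimal sketch, assuming each paper is a dict with a `funders` list; the paper IDs and funder names are invented.

```python
from collections import Counter


def recurring_funders(papers: list[dict], min_papers: int = 2) -> list[tuple[str, int]]:
    """Return (funder, paper_count) pairs for funders named in at least `min_papers` papers."""
    counts: Counter = Counter()
    for paper in papers:
        # A set ensures a funder named twice in one paper is counted once.
        counts.update({name.strip().lower() for name in paper["funders"]})
    recurring = [(name, n) for name, n in counts.items() if n >= min_papers]
    # Most frequent first, alphabetical within ties, so output is stable.
    return sorted(recurring, key=lambda pair: (-pair[1], pair[0]))


papers = [
    {"id": "paper-1", "funders": ["Example Foundation", "Sample Health Trust"]},
    {"id": "paper-2", "funders": ["example foundation ", "Example Foundation"]},
    {"id": "paper-3", "funders": ["Sample Health Trust", "Example Foundation"]},
    {"id": "paper-4", "funders": ["One-Off Fund"]},
]

result = recurring_funders(papers)
print(result)
```

Lowercasing and trimming only catches trivial variants; differently worded names for the same funder still need a manual pass.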

Pattern B: Contractor risk. Start in SAM.gov or USASpending to identify federal awards to a target entity or entity family. Extract award amounts, agency counterparties, contract types, and performance dates.

Search EDGAR for SEC filings involving those entities — enforcement actions, 8-K disclosures, or registration statements referencing contracts or government work.

Check if the entity or its principals appear in OFAC's SDN list or prior debarment actions in SAM.gov's exclusion records. Active federal contracting with sanctions exposure is an investigative finding.

Confirm matches across EDGAR, SAM.gov, and OFAC by verifying EINs, addresses, officer names.


Where AI Helps, Where It Hallucinates, and How to Verify

The failure modes are consistent across tools and databases.

Fabricated relationships. AI asserts Entity A is a subsidiary of Entity B because they share an address or officer name. That is inference presented as fact. A relationship needs a source record before it can be reported.

Overconfident summaries. Summarization loses nuance. A 10-K may disclose a complex management agreement but get summarized as "Company X controls Company Y." That isn't what the filing says.

Missing context in filings. AI tools often miss information in exhibits, XBRL-tagged data, or amendments. Always check for amended filings (10-K/A, 8-K/A) before taking AI output as final.
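The amendment check is easy to automate once you have a filing list. A minimal sketch, assuming each filing is a dict with `form` and `filed` (ISO date) keys; the sample rows are invented. It flags same-type amendments filed on or after the filing the AI summary was based on, and does not try to match amendments to a specific reporting period, which still needs a manual look.

```python
def later_amendments(filings: list[dict], used: dict) -> list[dict]:
    """Return amendments of the same form type filed on or after the `used` filing."""
    amended_form = used["form"] + "/A"  # e.g. "10-K" -> "10-K/A"
    return [
        f for f in filings
        # ISO dates (YYYY-MM-DD) compare correctly as strings.
        if f["form"] == amended_form and f["filed"] >= used["filed"]
    ]


filings = [
    {"form": "10-K", "filed": "2025-03-15"},
    {"form": "10-K/A", "filed": "2025-04-30"},
    {"form": "8-K", "filed": "2025-05-01"},
    {"form": "10-K/A", "filed": "2024-05-02"},
]

summary_source = filings[0]  # the 10-K the AI summary was built from
to_review = later_amendments(filings, summary_source)
print(to_review)
```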

Weak performance on fragmented procurement data. Entity naming inconsistencies in SAM.gov and USASpending data lead to AI extraction errors. Treat AI output as a hypothesis.
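Some of the naming inconsistency can be reduced before any AI extraction by grouping obvious variants under a normalized key. This is a minimal sketch: the suffix list is a small illustrative subset, and the entity names are invented.

```python
import re
from collections import defaultdict

# Illustrative subset of legal-form suffixes; extend for real data.
SUFFIXES = {"INC", "INCORPORATED", "LLC", "LLP", "LP", "CORP", "CORPORATION", "CO", "COMPANY", "LTD"}


def name_key(name: str) -> str:
    """Normalize a name: uppercase, strip punctuation, drop trailing legal suffixes."""
    cleaned = name.upper().replace(".", "")       # "L.L.C." -> "LLC"
    tokens = re.sub(r"[^A-Z0-9 ]", " ", cleaned).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names: list[str]) -> dict[str, list[str]]:
    """Group raw names by normalized key so likely variants sit together."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw in names:
        groups[name_key(raw)].append(raw)
    return dict(groups)


names = [
    "Acme Health Services, L.L.C.",
    "ACME HEALTH SERVICES INC",
    "Acme Health Svcs LLC",
]

groups = group_variants(names)
for key, variants in groups.items():
    print(key, "<-", variants)
```

The abbreviated "Svcs" variant lands in its own group, which shows the limit of rule-based matching. Note also that an LLC and an Inc. grouped under one key may be different legal entities, so a shared key is a prompt to compare identifiers, not a match.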

To verify, preserve the original record link, such as the SEC EDGAR filing URL, NPI lookup URL, or USASpending award ID, for every assertion. Confirm extracted names and dates directly. Separate document claims from inferences. Document uncertainty; a finding you can't source isn't a finding.


How to Operationalize the Hidden Data Layer in an Investigation Stack

These databases are most useful in the middle of an investigation: you already have names, companies, and leads, and you need to dig deeper before you have enough to write up a case.

A practical sequence for analysts:

1. Start with the question. Decide what you need to establish: ownership, financial ties, research conflicts, or contracting history. The question determines the database.

2. Pick the primary source. For company structure, use EDGAR. For healthcare provider information, use NPI. For research connections, use PubMed, queried through Elicit if you prefer. For nonprofit finances, use Form 990. For federal contracts, use SAM.gov or USASpending.

3. Extract with AI. Summarize the documents, pull out key names and entities, find links to other records, and form a hypothesis with a list of key players.

4. Verify manually. Confirm each fact against a source document. Read the document yourself and make sure it says what the AI output claims.

5. Build the case on verified facts. Keep evidence separate from assumptions and note every uncertainty.

These sources (EDGAR, NPI, PubMed, Form 990, SAM.gov, and USASpending) hold underused public records. AI makes them more accessible. It doesn't change the standards for evidence.

This guide leads OSINTBench's series on AI-enhanced OSINT. Future guides will dive deeper into specific databases, including EDGAR entities, Form 990, and SAM.gov, with practical examples and verified sources, using current tools and public records. Precision is key.

That's the bar we'll hold.

Last updated 2026-04-02. Techniques and tools change — verify current capabilities with vendors directly.