API · In Development · Early Access Open

Structured access
to the graph.

Every entity in the registry carries a UUID, a timestamp, a source URL, and a content hash. The API surfaces that record — structured, provenanced, traversable — for AI training pipelines, research datasets, and knowledge graph construction.
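As a rough illustration of the record described above, the sketch below models an entity with the four fields named on this page. The class name, field names, and the choice of SHA-256 for the content hash are assumptions made for illustration; the registry's actual schema is not published here.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EntityRecord:
    """Hypothetical shape of a registry entity; field names are assumptions."""
    entity_id: uuid.UUID
    minted_at: datetime
    source_url: str
    content_hash: str  # hex digest; SHA-256 is assumed, not confirmed by the page


def mint(source_url: str, content: bytes) -> EntityRecord:
    """Build an illustrative record from fetched content."""
    return EntityRecord(
        entity_id=uuid.uuid4(),
        minted_at=datetime.now(timezone.utc),
        source_url=source_url,
        content_hash=hashlib.sha256(content).hexdigest(),
    )


def matches(record: EntityRecord, content: bytes) -> bool:
    """Return True if the content still hashes to the recorded value."""
    return hashlib.sha256(content).hexdigest() == record.content_hash


record = mint("https://example.com/", b"<html>hello</html>")
print(matches(record, b"<html>hello</html>"))
print(matches(record, b"<html>changed</html>"))
```

A consumer could use a check like `matches` to confirm that content fetched later still corresponds to the hash stored at mint time.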

Data Types

Six data types.
All provenance-backed.

The registry mints entities at scale. Each domain is measured against the full schema vocabulary, timestamped, and provenanced. Every field carries a traceable origin.

01 · Entity Records

Schema scores. Content fingerprints. Semantic matches across 42 language dictionaries. Temporal attestation. Discovery timestamps. Full recrawl history. Every entity a complete structured record.

02 · Graph Edge Data

Common edges — what entities share. Uncommon edges — what they don't. Substrate connections, dimensional relationships. Edges discovered through measurement. Nothing manually tagged.
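One way to picture common and uncommon edges is as set operations over what was measured for each entity. The sketch below is a simplification under that assumption: the function names and the use of schema type names as the measured attributes are hypothetical, and the registry's actual edge discovery may work differently.

```python
def common_edge(a: set[str], b: set[str]) -> set[str]:
    """Attributes measured on both entities (what they share)."""
    return a & b


def uncommon_edge(a: set[str], b: set[str]) -> set[str]:
    """Attributes measured on exactly one of the two entities (what they don't share)."""
    return a ^ b


# Illustrative measurements only; not real registry data.
site_a = {"Organization", "Product", "FAQPage"}
site_b = {"Organization", "Article"}

print(sorted(common_edge(site_a, site_b)))
print(sorted(uncommon_edge(site_a, site_b)))
```

Because both functions depend only on the measured sets, the same inputs always produce the same edges, with no tagging step in between.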

03 · Schema Scores & Gap Measurements

Scores calculated against the full 916-type vocabulary. Cross-referenced with language and geography. Empirical coverage patterns across the web, by industry and region.
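The scoring methodology is not spelled out on this page. As a minimal sketch, assuming a score is simply the share of vocabulary types a domain uses and a gap is the set of types it does not, the idea could look like this. The function names and the four-type stand-in vocabulary are hypothetical; the real vocabulary has 916 types.

```python
def schema_score(found: set[str], vocabulary: set[str]) -> float:
    """Fraction of the vocabulary that appears on the domain (assumed definition)."""
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    return len(found & vocabulary) / len(vocabulary)


def schema_gap(found: set[str], vocabulary: set[str]) -> set[str]:
    """Vocabulary types the domain does not use."""
    return vocabulary - found


# Small stand-in vocabulary for illustration; the registry's has 916 types.
vocabulary = {"Organization", "Product", "Article", "Event"}
found = {"Organization", "Product", "NotInVocabulary"}

print(schema_score(found, vocabulary))
print(sorted(schema_gap(found, vocabulary)))
```

Types outside the vocabulary are ignored by the intersection, so they neither raise nor lower the score in this sketch.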

04 · Timestamped Crawl Archives

Every recrawl preserved. Previous passes archived, never overwritten. Temporal attestation for training on web change patterns. A record of what a domain was, not only what it is.
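The "archived, never overwritten" behavior can be sketched as an append-only history with point-in-time lookup. The class and method names below are hypothetical, and the in-memory list stands in for whatever storage the registry actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CrawlPass:
    """One timestamped crawl of a domain (illustrative fields only)."""
    crawled_at: datetime
    content_hash: str


@dataclass
class CrawlArchive:
    """Append-only history: passes are added, never replaced or removed."""
    _passes: list[CrawlPass] = field(default_factory=list)

    def record(self, crawl: CrawlPass) -> None:
        self._passes.append(crawl)

    def history(self) -> tuple[CrawlPass, ...]:
        # Return a tuple so callers cannot mutate the archive.
        return tuple(self._passes)

    def as_of(self, moment: datetime) -> Optional[CrawlPass]:
        """Most recent pass at or before the given moment, if any."""
        earlier = [p for p in self._passes if p.crawled_at <= moment]
        return max(earlier, key=lambda p: p.crawled_at, default=None)


archive = CrawlArchive()
first = CrawlPass(datetime(2024, 1, 1, tzinfo=timezone.utc), "aaa")
second = CrawlPass(datetime(2024, 6, 1, tzinfo=timezone.utc), "bbb")
archive.record(first)
archive.record(second)

print(archive.as_of(datetime(2024, 3, 1, tzinfo=timezone.utc)))
```

The `as_of` lookup is what makes the archive useful for studying change over time: it answers what a domain looked like at a given date, not just at the latest pass.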

05 · ROOT-LD & Recursive-LD

Three-layer linked data structure — Anchor, Body, Recursive — with full provenance, timestamped passes, and dimensional context. Open specification at root-ld.org.

06 · Machine-Readable Manifests

Every entity has a manifest.json. Structured data without HTML parsing. Optimized for crawler ingestion, RAG retrieval, and training pipeline integration.
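Because a manifest is plain JSON, a consumer can load and validate it with a standard JSON parser and no HTML handling. The sketch below assumes a manifest exposes the four fields every entity is said to carry; the key names and sample values are placeholders, not the published manifest format.

```python
import json

# Hypothetical key names, based on the four fields the page says every entity carries.
REQUIRED_KEYS = {"uuid", "timestamp", "source_url", "content_hash"}


def load_manifest(text: str) -> dict:
    """Parse manifest JSON and check that the assumed keys are present."""
    manifest = json.loads(text)
    if not isinstance(manifest, dict):
        raise ValueError("manifest must be a JSON object")
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        raise ValueError(f"manifest is missing keys: {sorted(missing)}")
    return manifest


# Placeholder values for illustration; not real registry data.
sample = json.dumps({
    "uuid": "00000000-0000-0000-0000-000000000000",
    "timestamp": "2024-01-01T00:00:00Z",
    "source_url": "https://example.com/",
    "content_hash": "0" * 64,
})

manifest = load_manifest(sample)
print(manifest["source_url"])
```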

Access

Three use cases.
One infrastructure layer.

The registry is built for operators who need provenance-declared, machine-readable data at scale.

Frontier AI Companies
Training data that can be verified.

Citation-grounded web data with full provenance. Every entity traced to its source URL and crawl timestamp. Temporal attestation for tracking how information changes. Semantic fingerprints across 42 languages. Structured, falsifiable knowledge — not noisy crawl data.

Research Institutions
Empirical datasets for falsifiable work.

Schema adoption across languages and geographies. Linguistic bias analysis. Knowledge graph evolution over time. The only web dataset measuring knowledge organization at the structural layer — beneath language, beneath keywords.

Infrastructure Builders
The foundation layer for RAG and knowledge graphs.

Pre-structured entities with active graph edges. Traverse from any entry point. Common and uncommon edges surface connections invisible to keyword search. Manifests enable direct structured data fetching. Build domain-specific graphs from the registry's foundation.

Properties

What the data carries
that other sources don't.

Full Provenance

Discovery timestamp, mint timestamp, source URL, content hash, recrawl history. The origin of every field is structural — not a claim appended after the fact.

Falsifiable Measurements

Schema scores calculated against the full 916-type vocabulary. Semantic fingerprints run across 42 language dictionaries. Every number has a methodology. No black box. No opaque scoring.

Deterministic Edge Discovery

Common and uncommon edges form from accumulated measurements across the corpus. No manual tagging. No subjective classification. Relationships emerge from the data.

Temporal Attestation

Entities are recrawled on a schedule. Every pass generates a new timestamped record. Previous data is archived in entity folders, never overwritten. A record of what a domain was across time, not only what it is today.

Multilingual by Design

42 language dictionaries. Semantic overlap patterns across language families. Linguistic bias measurements built into schema scoring from the first pass. Knowledge organization measured at the structural layer — beneath language, beneath keywords.

Technical Specifications

Endpoints.
Specification in progress.

API design is active. Response format: JSON-LD with ROOT-LD wrapper. Machine-readable, traversable, provenanced. Early access partners help shape the final specification.

Method  Endpoint               Description
GET     /entities              Query by domain, TLD, schema score, language matches
GET     /entities/{id}         Full entity record: manifest, ROOT-LD, folder contents, recrawl history
GET     /entities/{id}/edges   All graph edges for an entity: common, uncommon, dimensional
GET     /graph/edges           Query edges by type, confidence score, entity pairs
GET     /schema/scores         Schema score distribution by industry, geography, language
GET     /manifests/{domain}    Fetch entity manifest.json directly by domain; no HTML parsing required
GET     /rootld/{id}           ROOT-LD context pod: Anchor, Body, Recursive layers with full provenance
GET     /search                Search across all minted entities by keyword, schema type, or edge
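Since the specification is still in progress, the following is only a sketch of how a client might build requests against the endpoint paths listed above. The base URL is a placeholder, the query parameter name is hypothetical, and sending the API key as a bearer token is an assumption; only the paths come from the table. The request is constructed but not sent.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

# Placeholder host; the real base URL is not published on this page.
BASE_URL = "https://api.example.invalid"


def entity_edges_path(entity_id: str) -> str:
    """Path for GET /entities/{id}/edges with the id safely percent-encoded."""
    return f"/entities/{quote(entity_id, safe='')}/edges"


def build_request(path: str, api_key: str,
                  params: Optional[dict[str, str]] = None) -> Request:
    """Build (but do not send) a GET request for an endpoint path."""
    url = BASE_URL + path
    if params:
        url += "?" + urlencode(params)
    headers = {
        # Assumption: the key is sent as a bearer token.
        "Authorization": f"Bearer {api_key}",
        # application/ld+json is the registered JSON-LD media type.
        "Accept": "application/ld+json",
    }
    return Request(url, headers=headers, method="GET")


# "tld" is a hypothetical query parameter name.
req = build_request("/entities", "YOUR_API_KEY", {"tld": "org"})
print(req.get_method(), req.full_url)
```

Keeping path construction separate from request construction means the same helper can serve all eight endpoints, and it can be adjusted in one place once the final specification is settled.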
Response Format
JSON-LD
ROOT-LD wrapper. Traversable. Provenanced. Machine-readable without HTML parsing.

Authentication
API Key · OAuth
Research, commercial, and enterprise tiers. Rate limits set per engagement during early access.

Status
In Development
Early access open. Specification shaped with first partners. Contact us to engage.

Early Access

Early access is open.

The specification is in development. Early access is being coordinated with frontier AI companies, research institutions, and infrastructure builders. To request access or discuss terms, reach out directly.

contact@globaldataregistry.com
globaldataregistry.com/contact
[Page Service Name] is a service of the Global Data Registry  —  open provenance infrastructure for the machine-readable web.