🟡 Golden Suite

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.

GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenAnalysis reports, all orchestrated by GoldenPipe. The suite also includes InferMap for schema mapping, a Rust extension layer for Postgres / DuckDB, and optional WebAssembly acceleration behind the edge-safe TypeScript ports.

⚡ GoldenMatch scales from a CSV on your laptop to 100M+ rows on a Ray cluster. Verified: 100,000,000 records deduplicated on the recall-complete path (correct under any input partitioning) in 9.2 min, with the driver process peaking at 0.36 GB RSS.

[Screenshot: pair drilldown in the web workbench, showing cluster members, a field-level diff, and a one-line natural-language explanation per pair. To try it, run pip install 'goldenmatch[web]', then goldenmatch serve-ui <project>.]

# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge runtimes
npm install goldenmatch

🆕 v2.0.0 — GoldenMatch 2.0.0: the first backwards-incompatible major. It removes four deprecation-window items, each shipped with a 1.x runway: the legacy :hash: identity lookup bridge + GOLDENMATCH_IDENTITY_ID_SCHEME (run goldenmatch identity migrate-ids before upgrading; un-fingerprintable rows keep their :hash: id), the GOLDENMATCH_CLUSTER_FRAMES_OUT gate + legacy dict cluster path (build_clusters stays as a frames-backed adapter), and the cheapest_healthy / _scale_aware_backend shims. Pipeline behavior is output-equivalent. Migration guide: Migrating to v2.

v1.30.0 — Zero-training Fellegi-Sunter now beats hand-rolled, expert-tuned Splink, head-to-head and reproducibly. On one shared evaluator across every dataset Splink scores, GoldenMatch's probabilistic auto-config wins on all of them: historical_50k pairwise F1 0.778 vs 0.757 (cluster-level B³ 0.844 vs 0.789), febrl3 0.991 vs 0.965, synthetic_person 0.998 vs 0.996 — made reproducible by an EM training-pair determinism fix (#829). Full bake-off: docs/benchmarks/2026-06-09-splink-bakeoff.md.

v1.26.0 — 100M records, distributed, on a 4-worker Ray cluster — verified. The distributed Phase-5 pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) now runs a full 100,000,000-row dedupe end to end in ~213 s with the driver process peaking at 0.30 GB RSS. The unlock was removing every driver-side collect from the pipeline (scoring -> per-partition local connected-components -> distributed join -> distributed golden build + write), so nothing funnels back to a single node.

Why a suite?

Each tool stands alone, but they compose into a single pipeline:

flowchart LR
    raw([raw rows])
    golden([golden records])

    subgraph orchestration ["GoldenPipe orchestrates"]
        direction LR
        infermap[InferMap]
        goldencheck[GoldenCheck]
        goldenflow[GoldenFlow]
        goldenmatch[GoldenMatch]
        infermap --> goldencheck --> goldenflow --> goldenmatch
    end

    raw --> infermap
    goldenmatch --> golden

Step	Role
InferMap	schema mapping — auto-aligns columns across heterogeneous sources
GoldenCheck	profile + validate — encoding, format, anomaly detection
GoldenFlow	standardize + transform — phone, date, address, categorical normalization
GoldenMatch	dedupe + cluster + survivorship — fuzzy / exact / probabilistic / LLM
GoldenAnalysis	analysis + reporting — one exportable report over any stage's output, plus cross-run regression detection
GoldenPipe	orchestrator — declarative YAML pipeline wiring the steps

Zero-config defaults that admit when they're unsure — every step has a self-verifying preflight + postflight; results carry an inspectable report instead of failing silently.
96.4% F1 on DBLP-ACM out of the box for entity resolution — and the opt-in Fellegi-Sunter engine beats hand-rolled, expert-tuned Splink head-to-head on every dataset Splink scores (historical_50k pairwise F1 0.778 vs 0.757, cluster-level B³ 0.844 vs 0.789; one shared evaluator, reproducible bake-off).
Learning Memory — corrections persist across runs and re-anchor across row reorders, so the system stops needing the same correction twice (GoldenMatch v1.6.0; off by default).
Identity Graph — a durable graph layer above run-local clusters: stable entity_ids that survive across runs, an append-only event log, and create / absorb / merge / split semantics, surfaced on the CLI, REST, MCP, and SQL interfaces (the Identity Graph v2 feature, shipped in GoldenMatch v1.15).
Privacy-preserving record linkage — match across organizations without sharing raw data (PPRL, 92.4% F1 on FEBRL4).
AI-native by design — every package ships an MCP server, a REST API, and an A2A agent surface. 50+ MCP tools across the suite, including auto_configure + controller_telemetry for v1.7-v1.12 introspection.
AutoConfigController visible everywhere (v1.7-v1.12 surface-parity arc) — web ControllerPanel, TUI Ctrl+A, CLI goldenmatch autoconfig, REST /autoconfig + /controller/telemetry, Postgres goldenmatch_autoconfig + gm_telemetry, DuckDB UDFs, MCP/A2A telemetry tools. One JSON shape across every interface.
Polyglot parity — the full suite ships on npm (goldenmatch, goldencheck, goldenflow, goldenanalysis, infermap, goldenpipe) alongside PyPI; the TypeScript and Python implementations track the same outputs to 4-decimal precision via a cross-language parity harness.
Edge-safe, with optional native speed — the TypeScript cores are dependency-free and node:*-free, so they run in browsers, Cloudflare Workers, Vercel Edge, and Deno. An opt-in WebAssembly backend (await enableWasm() / enableAnalysisWasm()) swaps in the same pyo3-free Rust kernels the Python wheels and the SQL UDFs use — pure-TS stays the default and the byte-identical fallback, so default users download zero wasm bytes.
SQL-native, both engines at parity — the same functions run inside PostgreSQL (pgrx extension) and DuckDB: dedupe / match / score / auto-config + telemetry / identity graph, plus data profiling, evaluate, Fellegi-Sunter probabilistic scoring, and GoldenFlow transforms.
Production paths — Postgres sync, daemon mode, lineage tracking, review queues, dbt integration, GitHub Actions, and a Rust extension layer for Postgres / DuckDB.
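
The Learning Memory item above says corrections re-anchor across row reorders. One way such re-anchoring can work is to key each correction on a content fingerprint of the two records rather than on their row positions. The sketch below illustrates that idea only; the `fingerprint` and `pair_key` helpers and the in-memory `corrections` dict are hypothetical and are not GoldenMatch's storage format or API.

```python
import hashlib
import json


def fingerprint(row):
    """Hash a row's content so the key does not depend on row position."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def pair_key(row_a, row_b):
    """Order-independent key for a pair of records."""
    return tuple(sorted((fingerprint(row_a), fingerprint(row_b))))


# Hypothetical in-memory corrections store: pair key -> reviewer decision.
corrections = {}

run1 = [
    {"name": "Acme Inc", "zip": "10001"},
    {"name": "ACME Incorporated", "zip": "10001"},
    {"name": "Apex LLC", "zip": "10001"},
]
# A reviewer confirms that rows 0 and 1 are the same entity.
corrections[pair_key(run1[0], run1[1])] = "match"

# A later run delivers the same records in a different order.
run2 = [run1[2], run1[1], run1[0]]
decision = corrections.get(pair_key(run2[2], run2[1]))
unrelated = corrections.get(pair_key(run2[0], run2[1]))
print(decision, unrelated)  # match None
```

Because the key is derived from record content, the stored decision is found again even though both rows moved.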

The Suite

Package	Lang	What it does	Install
GoldenMatch 🟡	Python · TS	Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package.	`pip install goldenmatch` · `npm i goldenmatch`
GoldenCheck	Python · TS	Data-quality scanning: encoding, Unicode, format validation, anomaly detection.	`pip install goldencheck` · `npm i goldencheck`
GoldenFlow	Python · TS	Transforms & standardizers: phone, date, address, categorical normalization.	`pip install goldenflow` · `npm i goldenflow`
GoldenPipe	Python · TS	Orchestrator that wires Check → Flow → Match into one declarative pipeline.	`pip install goldenpipe` · `npm i goldenpipe`
InferMap	Python · TS	Schema mapping engine — auto-aligns columns across heterogeneous sources.	`pip install infermap` · `npm i infermap`
GoldenAnalysis	Python · TS	Cross-cutting analysis & reporting — consumes any stage's typed artifacts (or a raw DataFrame) and emits a unified, exportable `AnalysisReport`; optional Rust / WASM `histogram`+`quantile` kernels.	`pip install goldenanalysis` · `npm i goldenanalysis`
goldenmatch-extensions	Rust	Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching.	source build
dbt-goldensuite	dbt · Python	dbt package — quality-gate tests, correction CRUD macros + GoldenCheck assertions for warehouse models.	`pip install dbt-goldensuite`
goldencheck-action	YAML	GitHub Action — fail PRs that introduce data-quality regressions.	Marketplace

Headline pitch and the deepest docs live in packages/python/goldenmatch/README.md (~1,300 lines, full feature list, CLI, architecture, benchmarks).

Choose your path

I want to...	Go here
Deduplicate a CSV right now	`packages/python/goldenmatch`
Use from Claude Desktop / Code	`packages/python/goldenmatch` — MCP
Edit rules in a browser, label pairs, compare runs	`packages/python/goldenmatch` — Web UI
Build AI agents that deduplicate	ER Agent / A2A wiki page
Profile data quality before matching	`packages/python/goldencheck`
Standardize messy fields (phone, date, address)	`packages/python/goldenflow`
Run the full pipeline declaratively	`packages/python/goldenpipe`
Map columns across schemas	`packages/python/infermap`
Analyze + report across stages and runs	`packages/python/goldenanalysis`
Write TypeScript / Node.js / Edge (browser, Workers; optional WASM)	`packages/typescript/goldenmatch`
Match in Postgres / DuckDB SQL	`packages/rust/extensions`
Add data-quality gates to dbt	`packages/python/goldenmatch/dbt-goldensuite`
Block bad data in GitHub PRs	`packages/actions/goldencheck`
Run as Airflow DAGs	`examples/airflow/` — 12 drop-in DAGs
Run from a single MCP container	`docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest`
Pull every Suite container	GitHub Packages

Quick examples

Python — dedupe in 30 seconds

import goldenmatch as gm

# Zero-config
result = gm.dedupe("customers.csv")
print(result)  # DedupeResult(records=5000, clusters=847, match_rate=12.0%)
result.golden.write_csv("deduped.csv")

# Or be explicit
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "zip": 0.95},
    blocking=["zip"],
    threshold=0.85)
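
To make the `blocking` and `fuzzy` arguments above more concrete, here is a standard-library sketch of the general technique: records are only compared when they share a blocking key, and a pair counts as a match when its string similarity meets the threshold. It uses `difflib.SequenceMatcher` as a stand-in scorer; GoldenMatch's own scorers and internals are not shown here, and the helper names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(rows, blocking_key):
    """Yield index pairs that share the same blocking-key value."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[row[blocking_key]].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)


def fuzzy_matches(rows, field, threshold, blocking_key):
    """Return candidate pairs whose similarity on `field` meets the threshold."""
    matches = []
    for i, j in candidate_pairs(rows, blocking_key):
        a = rows[i][field].lower()
        b = rows[j][field].lower()
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            matches.append((i, j))
    return matches


rows = [
    {"name": "Jonathan Smith", "zip": "10001"},
    {"name": "Jonathon Smith", "zip": "10001"},
    {"name": "Maria Garcia", "zip": "10001"},
    {"name": "Jonathan Smith", "zip": "94105"},
]
pairs = fuzzy_matches(rows, field="name", threshold=0.85, blocking_key="zip")
print(pairs)  # [(0, 1)]
```

Row 3 is identical to row 0 by name but sits in a different zip block, so it is never compared. That is the trade-off blocking makes: far fewer comparisons in exchange for missing matches that disagree on the blocking key.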

TypeScript — edge-safe core

import { dedupe } from "goldenmatch";

const result = dedupe(rows, {
  fuzzy: { name: 0.85 },
  blocking: ["zip"],
  threshold: 0.85,
});
console.log(result.stats);  // { totalRecords, totalClusters, matchRate, ... }

Runs in browsers, Vercel Edge, Cloudflare Workers, Deno — and optionally swaps in the Rust score-core kernel via await enableWasm(). ~940 tests, strict TypeScript (noUncheckedIndexedAccess, exactOptionalPropertyTypes).

Web workbench — browser UI for matching

pip install 'goldenmatch[web]'
goldenmatch serve-ui my-project   # opens http://localhost:5050

[Screenshot: GoldenMatch web UI]

Edit rules with live validation, preview against a sampled slice, label pairs (mirrored into Learning Memory automatically), compare runs (CCMS), sweep parameters, browse the corrections store. Single-process localhost workbench shipped as the optional [web] extra.

Composed pipeline

import goldenpipe as gp

pipeline = gp.Pipeline.from_yaml("pipeline.yaml")  # check → flow → match
result = pipeline.run("customers.csv")
result.report.write_html("report.html")

More: examples/ has runnable demos for every Suite scenario: Python (quickstart, full pipeline, customer 360, PPRL, review workflow, MCP client) · TypeScript (quickstart, Vercel Edge route, MCP client) · Airflow DAGs (12 production-shaped pipelines).

Use cases (real-world pipelines)

Reproducible end-to-end pipelines running GoldenMatch on public data at scale, each with measured headline numbers vs baselines:

🕵️ goldenmatch-shell-company-network — investigative ER across ICIJ Offshore Leaks + OpenSanctions + GLEIF + UK PSC + UK disqualified-directors. Confidence-weighted graph, structure mining, named investigative candidates. −62.5% analyst-hours to triage vs single-source baselines; +133% adversarial perturbation recovery.
🛡️ goldenmatch-vuln-attribution — cross-database ER on 6.1M OSS vulnerability records across 40 sources (OSV, GHSA, PyPA, RustSec, Go vulndb, EPSS, CISA KEV, CVE Project bulk). 6,126,895 records → 847,475 canonical vulns in ~5 minutes end-to-end on a single 64GB runner via the full Golden Suite (Check + Flow + Match + Pipe).
⚖️ goldenmatch-sanctions-reconciliation — cross-list coverage analysis on 85 public sanctions lists across 50+ jurisdictions via OpenSanctions, plus 10-year OFAC SDN history and PEP/crypto cross-analysis. Coverage-gap benchmark for any sanctions-screening vendor.

Install variants

GoldenMatch keeps its heavier dependencies in optional extras, so you only pay for what you use:

pip install goldenmatch                    # core (CSV in, CSV out) + native acceleration on common platforms
pip install 'goldenmatch[native]'          # back-compat alias; native is already default on common platforms
pip install 'goldenmatch[embeddings]'      # + sentence-transformers, FAISS
pip install 'goldenmatch[llm]'             # + Claude / OpenAI for LLM boost
pip install 'goldenmatch[postgres]'        # + Postgres sync
pip install 'goldenmatch[snowflake]'       # + Snowflake connector
pip install 'goldenmatch[bigquery]'        # + BigQuery connector
pip install 'goldenmatch[databricks]'      # + Databricks connector
pip install 'goldenmatch[salesforce]'      # + Salesforce connector
pip install 'goldenmatch[duckdb]'          # + DuckDB out-of-core backend
pip install 'goldenmatch[ray]'             # + Ray distributed backend (50M+ rows)
pip install 'goldenmatch[quality]'         # + GoldenCheck integration
pip install 'goldenmatch[transform]'       # + GoldenFlow integration
pip install 'goldenmatch[mcp]'             # + MCP server for Claude Desktop
pip install 'goldenmatch[agent]'           # + A2A agent (aiohttp)
pip install 'goldenmatch[web]'             # + localhost browser workbench (FastAPI + React)

goldenmatch setup    # interactive wizard: GPU, API keys, database

Sister packages compose: pip install goldenpipe[full] brings in Check + Flow + Match together.

Remote MCP Server

GoldenMatch is hosted as an MCP server on Smithery — connect from any MCP client without installing anything.

{
  "mcpServers": {
    "goldenmatch": {
      "url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
    }
  }
}

50+ MCP tools across the suite: deduplicate, match, explain, review, link privately, configure, scan quality, transform, synthesize golden records, and manage Learning Memory corrections.

Container images

Every Suite package ships as a multi-arch container image (linux/amd64 + linux/arm64) on GitHub Container Registry. Pull anonymously, no auth needed:

# One container, every Suite tool — the convenience option
docker run -p 8300:8300 ghcr.io/benseverndev-oss/goldensuite-mcp:latest

# Per-package containers — narrower deployments
docker run -p 8200:8200 ghcr.io/benseverndev-oss/goldenmatch-mcp:latest
docker run -p 8100:8100 ghcr.io/benseverndev-oss/goldencheck-mcp:latest
docker run -p 8150:8150 ghcr.io/benseverndev-oss/goldenflow-mcp:latest
docker run -p 8250:8250 ghcr.io/benseverndev-oss/goldenpipe-mcp:latest
docker run -p 8400:8400 ghcr.io/benseverndev-oss/infermap-mcp:latest

# Postgres + extension preinstalled
docker run -e POSTGRES_PASSWORD=secret ghcr.io/benseverndev-oss/goldenmatch-extensions:latest

Tags:

:latest — current main
:main-<sha7> — every push to main, immutable
:vX.Y.Z and :vX.Y — pushed when a <package>-vX.Y.Z tag is created
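
As a rough illustration of the version-tag rule above, the following sketch derives the two image tags from a `<package>-vX.Y.Z` release tag. The `image_tags` helper and its regular expression are assumptions made for this example; the publishing workflow itself is not shown in this README and may differ.

```python
import re

# Assumed shape of a release tag: <package>-vMAJOR.MINOR.PATCH
TAG_PATTERN = re.compile(
    r"^(?P<package>[a-z0-9-]+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def image_tags(git_tag):
    """Return (package, [full version tag, major.minor tag]) for a release tag."""
    match = TAG_PATTERN.match(git_tag)
    if match is None:
        raise ValueError(f"not a <package>-vX.Y.Z tag: {git_tag!r}")
    major, minor, patch = match["major"], match["minor"], match["patch"]
    return match["package"], [f"v{major}.{minor}.{patch}", f"v{major}.{minor}"]


print(image_tags("goldenmatch-v2.0.0"))
# ('goldenmatch', ['v2.0.0', 'v2.0'])
```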

See packages/python/goldensuite-mcp/README.md for the aggregator's tool-collision behaviour.

Airflow

12 drop-in DAGs at examples/airflow/, grouped by lifecycle stage:

Group	DAGs
Core pipeline	`daily_dedupe`, `incremental_match`, `warehouse_native` (Snowflake), `customer_360` (multi-source)
Privacy	`pprl_linkage` (two-party PPRL)
Onboarding & monitoring	`schema_align_and_load`, `schema_drift_alarm`, `quality_gate`
Feedback loop	`review_worker`, `active_learning`
Operationalize	`reverse_etl` (Salesforce/HubSpot), `backfill`

TaskFlow API, Airflow 2.7+ (compatible with 3.x). Each DAG has tunable knobs at the top, idempotent retries, and is marker-protected against double-processing. Drop the file you want into your Airflow dags/ folder.
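
The marker protection mentioned above follows a common pattern: record a marker once a unit of work succeeds, and skip the work on any later attempt that finds the marker. The sketch below shows that pattern with marker files and the standard library only. It is an assumption about the general approach, not code from the DAGs, and `run_once` and the run key are hypothetical names.

```python
import tempfile
from pathlib import Path


def run_once(marker_dir, run_key, task):
    """Run task() unless a marker for run_key exists; return True if it ran."""
    marker = Path(marker_dir) / f"{run_key}.done"
    if marker.exists():
        return False  # already processed, so a retry becomes a no-op
    task()
    marker.touch()  # written only after success, so failed runs are retried
    return True


calls = []
with tempfile.TemporaryDirectory() as tmp:
    first = run_once(tmp, "daily_dedupe_2026-05-01", lambda: calls.append("ran"))
    retry = run_once(tmp, "daily_dedupe_2026-05-01", lambda: calls.append("ran"))

print(first, retry, calls)  # True False ['ran']
```

Writing the marker after the task, not before, is the design choice that keeps retries safe: a crash mid-task leaves no marker, so the next attempt reprocesses the batch.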

Repository layout

goldenmatch/
├── packages/
│   ├── python/
│   │   ├── goldenmatch/      # entity resolution — headline package
│   │   ├── goldencheck/      # data quality scanning
│   │   ├── goldenflow/       # transforms & standardizers
│   │   ├── goldenpipe/       # orchestrator
│   │   ├── infermap/         # schema mapping
│   │   └── goldenanalysis/   # cross-cutting analysis & reporting
│   ├── typescript/
│   │   ├── goldenmatch/      # full TS port (edge-safe core)
│   │   ├── goldencheck/      # TS implementation
│   │   ├── goldencheck-types/ # shared TS types
│   │   ├── goldenflow/       # TS transforms
│   │   ├── infermap/         # TS schema mapping
│   │   └── goldenanalysis/   # TS analysis & reporting (edge-safe + WASM)
│   ├── rust/
│   │   └── extensions/       # Postgres pgrx + DuckDB UDFs (own Cargo workspace)
│   ├── python/goldensuite-mcp/ # aggregator MCP server (one container, all tools)
│   ├── dbt/goldencheck/      # dbt package
│   └── actions/goldencheck/  # GitHub Action
├── examples/
│   ├── python/               # 6 runnable Python scripts (quickstart → MCP)
│   ├── typescript/           # 3 TS scripts (quickstart, Vercel Edge, MCP)
│   └── airflow/              # 12 drop-in Airflow DAGs
├── docs/superpowers/         # design specs and implementation plans
├── justfile                  # install / test / lint / build, all languages
├── pyproject.toml            # uv workspace (root)
├── pnpm-workspace.yaml       # TypeScript pnpm workspace (Turborepo)
├── package.json              # root scripts + pnpm workspace root
└── .github/workflows/ci.yml

Workspaces (Cargo vs pnpm)

Cargo — no root workspace. packages/rust/extensions/ is itself a Cargo workspace (the postgres crate is excluded for pgrx-specific build requirements). Cargo doesn't allow nested workspaces sharing members, so Cargo commands run from inside packages/rust/extensions/.
TypeScript — a single pnpm workspace. packages/typescript/* form one pnpm + Turborepo workspace (see TypeScript dev setup). .npmrc pins node-linker=hoisted, giving a flat node_modules that avoids the Windows symlink issues an earlier per-package layout hit.

Build / test / lint everything

just install   # uv sync + per-package npm install + cargo fetch
just test      # all languages
just lint
just build

Reproducing benchmarks

Published GoldenMatch numbers (DQbench composite 91.04, DBLP-ACM 0.9641 F1, Febrl3 0.9443 F1, NCVR 0.9719 F1) map back to a single committed runner: scripts/run_benchmarks.py. See docs/reproducing-benchmarks.md for per-number commands, dataset URLs, expected output (with tolerance), variance notes (deterministic vs LLM-augmented), and a copy-pasteable one-click reproduction snippet for the DQbench composite. The same runner powers the weekly benchmarks.yml workflow.
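
For readers unfamiliar with the metric, pairwise F1 treats every pair of records placed in the same cluster as a predicted match and scores those pairs against the ground-truth pairs. The sketch below shows that calculation on toy clusters. It is an illustration of the metric, not the evaluator in scripts/run_benchmarks.py, and the function names are hypothetical.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Expand clusters into the set of unordered record pairs they imply."""
    pairs = set()
    for members in clusters:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_f1(predicted, truth):
    """Harmonic mean of pairwise precision and recall."""
    pred = cluster_pairs(predicted)
    gold = cluster_pairs(truth)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


truth = [{1, 2, 3}, {4, 5}]
predicted = [{1, 2}, {3}, {4, 5}]  # record 3 was left out of its cluster
print(round(pairwise_f1(predicted, truth), 4))  # 0.6667
```

Here precision is 1.0 (both predicted pairs are correct) and recall is 0.5 (two of the four true pairs were found), which gives 0.6667.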

Scale envelope

"How big can this handle?" is answered in docs/scale-envelope.md: per-backend ranges (Polars in-memory < 500K, DuckDB out-of-core 500K - 50M, Ray distributed >= 50M), block-size failure modes, candidate-pair math, and a single-page decision tree for picking a backend.

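The per-backend ranges and the candidate-pair math can be sketched in a few lines. The row-count thresholds below are the ones quoted above; the `pick_backend` and `candidate_pair_count` helpers and the string labels they return are hypothetical, and the full decision tree in docs/scale-envelope.md may weigh more factors than row count.

```python
def pick_backend(n_rows):
    """Map a row count to the backend range quoted in the scale envelope."""
    if n_rows < 500_000:
        return "polars"   # in-memory
    if n_rows < 50_000_000:
        return "duckdb"   # out-of-core
    return "ray"          # distributed


def candidate_pair_count(block_sizes):
    """Pairs compared when every block is scored all-against-all: n*(n-1)/2 each."""
    return sum(n * (n - 1) // 2 for n in block_sizes)


print(pick_backend(120_000))                # polars
print(pick_backend(8_000_000))              # duckdb
print(pick_backend(100_000_000))            # ray
print(candidate_pair_count([1_000_000]))    # 499999500000
print(candidate_pair_count([1_000] * 1_000))  # 499500000
```

The last two lines show why block size matters: one million rows in a single block imply roughly 5 × 10^11 comparisons, while the same rows split into 1,000 even blocks imply about 5 × 10^8.
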
Verified at the top end: a full 100,000,000-row GoldenMatch dedupe on a 5-node Ray cluster (e2-standard-16, 80 CPU) in 9.2 min (554 s), 20,000,000 golden records recovered exactly, driver process peak 0.36 GB RSS — the default distributed path is now recall-complete (blocking-key shuffle scoring + a distributed randomized-contraction WCC), so duplicates merge correctly no matter how the input is partitioned, and it stays driver-collect-free end to end (#844). A faster per-partition path is available via GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0 (driver-collect-free, ~213 s on a 4-worker run) for inputs where duplicates already co-locate within partitions — but it under-merges when a cluster's members land in different input partitions, which is why recall-complete is the default. Recipe in packages/python/goldenmatch/configs/distributed-100m.yaml.
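
The under-merge failure mode described above is easy to reproduce on a toy input. The sketch below runs connected components once over all match edges and once per partition, and compares the results. It uses a plain union-find and is not the distributed randomized-contraction WCC or any other GoldenMatch code; the record ids and partition layout are invented for the example.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters using union-find over the match edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


records = ["a1", "a2", "a3", "b1", "b2"]
match_edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
partitions = [["a1", "a2", "b1"], ["a3", "b2"]]  # duplicates split across partitions

# Global view: every edge is visible, so clusters are complete.
global_clusters = connected_components(records, match_edges)

# Per-partition view: an edge is only seen if both ends share a partition.
local = []
for part in partitions:
    members = set(part)
    local_edges = [(a, b) for a, b in match_edges if a in members and b in members]
    local.extend(connected_components(part, local_edges))
local_clusters = sorted(local)

print(global_clusters)  # [['a1', 'a2', 'a3'], ['b1', 'b2']]
print(local_clusters)   # [['a1', 'a2'], ['a3'], ['b1'], ['b2']]
```

The per-partition run produces four clusters where the global run produces two, because the edges a2-a3 and b1-b2 cross the partition boundary and are never scored.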

Contributing

Feature work goes on feature/<name> branches; merge via squash PR.
PR title format: feat: <description>, fix: <description>, docs: <description>.
Tests must pass in all three languages (Python, TypeScript, Rust) where the change applies; the parity harness in packages/typescript/goldenmatch/tests/parity/ enforces 4-decimal-tolerance Python ↔ TypeScript scorer parity.
See docs/superpowers/specs/ for design rationale on architectural decisions.
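
To give a feel for what a 4-decimal parity check involves, the sketch below compares two score maps and reports any pair that differs by more than 1e-4 or is missing from one side. The absolute-tolerance rule, the `parity_mismatches` helper, and the sample scores are assumptions for illustration; the harness in packages/typescript/goldenmatch/tests/parity/ defines the real comparison.

```python
import math


def parity_mismatches(python_scores, typescript_scores, abs_tol=1e-4):
    """Return (key, python score, typescript score) for every out-of-tolerance pair."""
    mismatches = []
    for key in sorted(set(python_scores) | set(typescript_scores)):
        py = python_scores.get(key)
        ts = typescript_scores.get(key)
        if py is None or ts is None:
            mismatches.append((key, py, ts))  # scored by only one implementation
        elif not math.isclose(py, ts, rel_tol=0.0, abs_tol=abs_tol):
            mismatches.append((key, py, ts))
    return mismatches


python_scores = {"pair-1": 0.857142, "pair-2": 0.912500}
typescript_scores = {"pair-1": 0.857150, "pair-2": 0.913000}
print(parity_mismatches(python_scores, typescript_scores))
# [('pair-2', 0.9125, 0.913)]
```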

TypeScript dev setup (pnpm + Turborepo)

The TypeScript packages live in a single pnpm workspace orchestrated by Turborepo. From the repo root:

corepack enable                               # one-time, picks up pnpm@9.15.0 from package.json
pnpm install                                  # installs all workspace packages
pnpm turbo run build test typecheck lint      # full pipeline (cached after first run)
pnpm --filter goldenmatch test                # single package

Windows: enable Developer Mode for pnpm. pnpm install creates symlinks under node_modules/. Settings → For Developers → Developer Mode → On. If you see EPERM: operation not permitted, symlink ... during install, Dev Mode is off.

If corepack enable fails (often needs an admin shell on Windows), the fallback is npm i -g pnpm@9.15.0 — functionally equivalent.

History

This repository was formed on 2026-05-01 by folding 8 sibling repos into the existing goldenmatch repo using git filter-repo. Full commit history is preserved for every source. See docs/superpowers/specs/2026-05-01-goldenmatch-monorepo-fold-in-design.md for the design rationale and docs/superpowers/plans/2026-05-01-goldenmatch-monorepo-fold-in.md for the step-by-step migration plan.

Author & License

Built by Ben Severn.

MIT — see LICENSE.