GoldenMatch


Find duplicate records in 30 seconds. Zero-config entity resolution, 97.2% F1 out of the box.


🟡 Golden Suite

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.

GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenAnalysis reports, all orchestrated by GoldenPipe. The suite also includes InferMap for schema mapping, a Rust extension layer for Postgres / DuckDB, and optional WebAssembly acceleration behind the edge-safe TypeScript ports.

⚡ GoldenMatch scales from a CSV on your laptop to 100M+ rows on a Ray cluster. Verified: 100,000,000 records deduped recall-complete (correct across any partitioning) in 9.2 min, with a 0.36 GB driver footprint.


Figure: pair drilldown in the GoldenMatch web workbench, showing cluster members, a field-level diff, and a one-line natural-language explanation per pair. To try it, run pip install goldenmatch[web], then goldenmatch serve-ui <project>.

# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge runtimes
npm install goldenmatch

🆕 v2.0.0: GoldenMatch 2.0.0 is the first backwards-incompatible major release. It removes four deprecation-window items, each shipped with a 1.x runway: the legacy :hash: identity lookup bridge + GOLDENMATCH_IDENTITY_ID_SCHEME (run goldenmatch identity migrate-ids before upgrading; un-fingerprintable rows keep their :hash: id), the GOLDENMATCH_CLUSTER_FRAMES_OUT gate + legacy dict cluster path (build_clusters stays as a frames-backed adapter), and the cheapest_healthy / _scale_aware_backend shims. Pipeline behavior is output-equivalent. Migration guide: Migrating to v2.

v1.30.0: Zero-training Fellegi-Sunter now beats hand-rolled, expert-tuned Splink, head-to-head and reproducibly. On one shared evaluator across every dataset Splink scores, GoldenMatch's probabilistic auto-config wins on all of them: historical_50k pairwise F1 0.778 vs 0.757 (cluster-level B³ 0.844 vs 0.789), febrl3 0.991 vs 0.965, synthetic_person 0.998 vs 0.996. An EM training-pair determinism fix (#829) makes these results reproducible. Full bake-off: docs/benchmarks/2026-06-09-splink-bakeoff.md.

v1.26.0: 100M records, distributed, on a 4-worker Ray cluster, verified. The distributed Phase-5 pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) now runs a full 100,000,000-row dedupe end to end in ~213 s with the driver process peaking at 0.30 GB RSS. The unlock was removing every driver-side collect from the pipeline (scoring -> per-partition local connected-components -> distributed join -> distributed golden build + write), so nothing funnels back to a single node.


Why a suite?

Each tool stands alone, but they compose into a single pipeline:

flowchart LR
    raw([raw rows])
    golden([golden records])

    subgraph orchestration ["GoldenPipe orchestrates"]
        direction LR
        infermap[InferMap]
        goldencheck[GoldenCheck]
        goldenflow[GoldenFlow]
        goldenmatch[GoldenMatch]
        infermap --> goldencheck --> goldenflow --> goldenmatch
    end

    raw --> infermap
    goldenmatch --> golden

| Step | Role |
| --- | --- |
| InferMap | schema mapping: auto-aligns columns across heterogeneous sources |
| GoldenCheck | profile + validate: encoding, format, anomaly detection |
| GoldenFlow | standardize + transform: phone, date, address, categorical normalization |
| GoldenMatch | dedupe + cluster + survivorship: fuzzy / exact / probabilistic / LLM |
| GoldenAnalysis | analysis + reporting: one exportable report over any stage's output, plus cross-run regression detection |
| GoldenPipe | orchestrator: declarative YAML pipeline wiring the steps |

  • Zero-config defaults that admit when they're unsure: every step has a self-verifying preflight + postflight; results carry an inspectable report instead of failing silently.
  • 96.4% F1 on DBLP-ACM out of the box for entity resolution, and the opt-in Fellegi-Sunter engine beats hand-rolled, expert-tuned Splink head-to-head on every dataset Splink scores (historical_50k pairwise F1 0.778 vs 0.757, cluster-level B³ 0.844 vs 0.789; one shared evaluator, reproducible bake-off).
  • Learning Memory: corrections persist across runs and re-anchor across row reorders, so the system stops needing the same correction twice (GoldenMatch v1.6.0; off by default).
  • Identity Graph: a durable graph layer above run-local clusters, with stable entity_ids that survive across runs, an append-only event log, and create / absorb / merge / split semantics, surfaced on the CLI, REST, MCP, and SQL interfaces (the Identity Graph v2 feature, shipped in GoldenMatch v1.15).
  • Privacy-preserving record linkage: match across organizations without sharing raw data (PPRL, 92.4% F1 on FEBRL4).
  • AI-native by design: every package ships an MCP server, a REST API, and an A2A agent surface. 50+ MCP tools across the suite, including auto_configure + controller_telemetry for v1.7-v1.12 introspection.
  • AutoConfigController visible everywhere (v1.7-v1.12 surface-parity arc): web ControllerPanel, TUI Ctrl+A, CLI goldenmatch autoconfig, REST /autoconfig + /controller/telemetry, Postgres goldenmatch_autoconfig + gm_telemetry, DuckDB UDFs, MCP/A2A telemetry tools. One JSON shape across every interface.
  • Polyglot parity: the full suite ships on npm (goldenmatch, goldencheck, goldenflow, goldenanalysis, infermap, goldenpipe) alongside PyPI; the TypeScript and Python implementations track the same outputs to 4-decimal precision via a cross-language parity harness.
  • Edge-safe, with optional native speed: the TypeScript cores are dependency-free and node:*-free, so they run in browsers, Cloudflare Workers, Vercel Edge, and Deno. An opt-in WebAssembly backend (await enableWasm() / enableAnalysisWasm()) swaps in the same pyo3-free Rust kernels the Python wheels and the SQL UDFs use; pure-TS stays the default and the byte-identical fallback, so default users download zero wasm bytes.
  • SQL-native, both engines at parity: the same functions run inside PostgreSQL (pgrx extension) and DuckDB, covering dedupe / match / score / auto-config + telemetry / identity graph, plus data profiling, evaluate, Fellegi-Sunter probabilistic scoring, and GoldenFlow transforms.
  • Production paths: Postgres sync, daemon mode, lineage tracking, review queues, dbt integration, GitHub Actions, and a Rust extension layer for Postgres / DuckDB.
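
For readers new to the Fellegi-Sunter model named in the list above, here is a minimal textbook sketch of how per-field agreement turns into a match weight and a match probability. It is not GoldenMatch's implementation: the field names, the m/u probabilities, and the prior are hypothetical values chosen for illustration, and a real engine would estimate them from data (for example with EM).

```python
import math

# Hypothetical per-field probabilities (illustration only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "name": {"m": 0.95, "u": 0.01},
    "zip": {"m": 0.90, "u": 0.05},
    "email": {"m": 0.98, "u": 0.001},
}


def match_weight(agreements: dict[str, bool]) -> float:
    """Sum log2 likelihood ratios over fields (textbook Fellegi-Sunter)."""
    total = 0.0
    for field, agrees in agreements.items():
        m = FIELD_PARAMS[field]["m"]
        u = FIELD_PARAMS[field]["u"]
        if agrees:
            total += math.log2(m / u)  # agreement is evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement is evidence against
    return total


def match_probability(weight: float, prior: float = 0.01) -> float:
    """Convert a match weight into a posterior probability, given a prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * (2 ** weight)
    return posterior_odds / (1 + posterior_odds)


all_agree = match_weight({"name": True, "zip": True, "email": True})
only_zip = match_weight({"name": False, "zip": True, "email": False})
print(round(match_probability(all_agree), 4))
print(round(match_probability(only_zip), 4))
```

Agreement on a rarely coinciding field such as email contributes far more weight than agreement on a commonly shared field such as zip, which is why the pair that agrees only on zip ends up with a low match probability.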

The Suite

| Package | Lang | What it does | Install |
| --- | --- | --- | --- |
| GoldenMatch 🟡 | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine: auto-aligns columns across heterogeneous sources. | pip install infermap · npm i infermap |
| GoldenAnalysis | Python · TS | Cross-cutting analysis & reporting: consumes any stage's typed artifacts (or a raw DataFrame) and emits a unified, exportable AnalysisReport; optional Rust / WASM histogram+quantile kernels. | pip install goldenanalysis · npm i goldenanalysis |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| dbt-goldensuite | dbt · Python | dbt package: quality-gate tests, correction CRUD macros + GoldenCheck assertions for warehouse models. | pip install dbt-goldensuite |
| goldencheck-action | YAML | GitHub Action: fail PRs that introduce data-quality regressions. | Marketplace |

Headline pitch and the deepest docs live in packages/python/goldenmatch/README.md (~1,300 lines, full feature list, CLI, architecture, benchmarks).


Choose your path

| I want to... | Go here |
| --- | --- |
| Deduplicate a CSV right now | packages/python/goldenmatch |
| Use from Claude Desktop / Code | packages/python/goldenmatch (MCP) |
| Edit rules in a browser, label pairs, compare runs | packages/python/goldenmatch (Web UI) |
| Build AI agents that deduplicate | ER Agent / A2A wiki page |
| Profile data quality before matching | packages/python/goldencheck |
| Standardize messy fields (phone, date, address) | packages/python/goldenflow |
| Run the full pipeline declaratively | packages/python/goldenpipe |
| Map columns across schemas | packages/python/infermap |
| Analyze + report across stages and runs | packages/python/goldenanalysis |
| Write TypeScript / Node.js / Edge (browser, Workers; optional WASM) | packages/typescript/goldenmatch |
| Match in Postgres / DuckDB SQL | packages/rust/extensions |
| Add data-quality gates to dbt | packages/python/goldenmatch/dbt-goldensuite |
| Block bad data in GitHub PRs | packages/actions/goldencheck |
| Run as Airflow DAGs | examples/airflow/ (12 drop-in DAGs) |
| Run from a single MCP container | docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest |
| Pull every Suite container | GitHub Packages |

Quick examples

Python: dedupe in 30 seconds

import goldenmatch as gm

# Zero-config
result = gm.dedupe("customers.csv")
print(result)  # DedupeResult(records=5000, clusters=847, match_rate=12.0%)
result.golden.write_csv("deduped.csv")

# Or be explicit
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "zip": 0.95},
    blocking=["zip"],
    threshold=0.85)
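
To make the blocking and threshold parameters above concrete, here is a small standard-library sketch of the general technique: rows are compared only within the same blocking key, and a pair counts as a match when its similarity reaches the threshold. This is an illustration under assumptions, not GoldenMatch's internals; the helper names are hypothetical and difflib's SequenceMatcher stands in for whatever scorer the library actually uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Toy rows; the field names mirror the example above.
rows = [
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "John Smith", "zip": "10001"},
    {"id": 3, "name": "Mary Jones", "zip": "10001"},
    {"id": 4, "name": "John Smith", "zip": "94105"},
]


def candidate_pairs(rows, blocking_key):
    """Group rows by the blocking key and only pair rows within a block."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_key]].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (difflib, not GoldenMatch's scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches(rows, blocking_key="zip", field="name", threshold=0.85):
    """Return id pairs whose field similarity meets the threshold within a block."""
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(rows, blocking_key)
        if similarity(a[field], b[field]) >= threshold
    ]


print(matches(rows))  # [(1, 2)]
```

Rows 2 and 4 share an identical name but are never compared because they fall in different zip blocks. That is the usual blocking trade-off: far fewer comparisons, at the risk of missing matches whose blocking key differs.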

TypeScript: edge-safe core

import { dedupe } from "goldenmatch";

const result = dedupe(rows, {
  fuzzy: { name: 0.85 },
  blocking: ["zip"],
  threshold: 0.85,
});
console.log(result.stats);  // { totalRecords, totalClusters, matchRate, ... }

Runs in browsers, Vercel Edge, Cloudflare Workers, and Deno, and optionally swaps in the Rust score-core kernel via await enableWasm(). ~940 tests, strict TypeScript (noUncheckedIndexedAccess, exactOptionalPropertyTypes).

Web workbench: browser UI for matching

pip install 'goldenmatch[web]'
goldenmatch serve-ui my-project   # opens http://localhost:5050


Edit rules with live validation, preview against a sampled slice, label pairs (mirrored into Learning Memory automatically), compare runs (CCMS), sweep parameters, browse the corrections store. Single-process localhost workbench shipped as the optional [web] extra.

Composed pipeline

import goldenpipe as gp

pipeline = gp.Pipeline.from_yaml("pipeline.yaml")  # check → flow → match
result = pipeline.run("customers.csv")
result.report.write_html("report.html")

More: examples/ has runnable demos for every Suite scenario: Python (quickstart, full pipeline, customer 360, PPRL, review workflow, MCP client) · TypeScript (quickstart, Vercel Edge route, MCP client) · Airflow DAGs (12 production-shaped pipelines).


Use cases (real-world pipelines)

Reproducible end-to-end pipelines running GoldenMatch on public data at scale, each with measured headline numbers vs baselines:

  • 🕵️ goldenmatch-shell-company-network: investigative ER across ICIJ Offshore Leaks + OpenSanctions + GLEIF + UK PSC + UK disqualified-directors. Confidence-weighted graph, structure mining, named investigative candidates. −62.5% analyst-hours to triage vs single-source baselines; +133% adversarial perturbation recovery.
  • 🛡️ goldenmatch-vuln-attribution: cross-database ER on 6.1M OSS vulnerability records across 40 sources (OSV, GHSA, PyPA, RustSec, Go vulndb, EPSS, CISA KEV, CVE Project bulk). 6,126,895 records → 847,475 canonical vulns in ~5 minutes end-to-end on a single 64GB runner via the full Golden Suite (Check + Flow + Match + Pipe).
  • ⚖️ goldenmatch-sanctions-reconciliation: cross-list coverage analysis on 85 public sanctions lists across 50+ jurisdictions via OpenSanctions, plus 10-year OFAC SDN history and PEP/crypto cross-analysis. Coverage-gap benchmark for any sanctions-screening vendor.

Install variants

GoldenMatch ships its heavier dependencies as optional extras, so you only install what you use:

pip install goldenmatch                    # core (CSV in, CSV out) + native acceleration on common platforms
pip install goldenmatch[native]            # back-compat alias; native is already default on common platforms
pip install goldenmatch[embeddings]        # + sentence-transformers, FAISS
pip install goldenmatch[llm]               # + Claude / OpenAI for LLM boost
pip install goldenmatch[postgres]          # + Postgres sync
pip install goldenmatch[snowflake]         # + Snowflake connector
pip install goldenmatch[bigquery]          # + BigQuery connector
pip install goldenmatch[databricks]        # + Databricks connector
pip install goldenmatch[salesforce]        # + Salesforce connector
pip install goldenmatch[duckdb]            # + DuckDB out-of-core backend
pip install goldenmatch[ray]               # + Ray distributed backend (50M+ rows)
pip install goldenmatch[quality]           # + GoldenCheck integration
pip install goldenmatch[transform]         # + GoldenFlow integration
pip install goldenmatch[mcp]               # + MCP server for Claude Desktop
pip install goldenmatch[agent]             # + A2A agent (aiohttp)
pip install goldenmatch[web]               # + localhost browser workbench (FastAPI + React)

goldenmatch setup    # interactive wizard: GPU, API keys, database

Sister packages compose: pip install goldenpipe[full] brings in Check + Flow + Match together.


Remote MCP Server

GoldenMatch is hosted as an MCP server on Smithery, so you can connect from any MCP client without installing anything.

{
  "mcpServers": {
    "goldenmatch": {
      "url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
    }
  }
}

50+ MCP tools across the suite: deduplicate, match, explain, review, link privately, configure, scan quality, transform, synthesize golden records, and manage Learning Memory corrections.


Container images

Every Suite package ships as a multi-arch container image (linux/amd64 + linux/arm64) on GitHub Container Registry. Pull anonymously, no auth needed:

# One container, every Suite tool (the convenience option)
docker run -p 8300:8300 ghcr.io/benseverndev-oss/goldensuite-mcp:latest

# Per-package containers (narrower deployments)
docker run -p 8200:8200 ghcr.io/benseverndev-oss/goldenmatch-mcp:latest
docker run -p 8100:8100 ghcr.io/benseverndev-oss/goldencheck-mcp:latest
docker run -p 8150:8150 ghcr.io/benseverndev-oss/goldenflow-mcp:latest
docker run -p 8250:8250 ghcr.io/benseverndev-oss/goldenpipe-mcp:latest
docker run -p 8400:8400 ghcr.io/benseverndev-oss/infermap-mcp:latest

# Postgres + extension preinstalled
docker run -e POSTGRES_PASSWORD=secret ghcr.io/benseverndev-oss/goldenmatch-extensions:latest

Tags:

  • :latest: current main
  • :main-<sha7>: every push to main, immutable
  • :vX.Y.Z and :vX.Y: pushed when a <package>-vX.Y.Z tag is created
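
If deployment tooling needs to tell these tag shapes apart, for example to prefer the immutable :main-<sha7> form when pinning, a few regular expressions are enough. The sketch below is an illustration only: the category names are made up, and it assumes <sha7> is seven lowercase hexadecimal characters, which the list above does not state explicitly.

```python
import re

# Patterns follow the tag list above; category names are illustrative.
# Assumption: <sha7> is 7 lowercase hex characters.
TAG_PATTERNS = [
    ("latest", re.compile(r"^latest$")),
    ("main-sha", re.compile(r"^main-[0-9a-f]{7}$")),
    ("release", re.compile(r"^v\d+\.\d+\.\d+$")),
    ("minor", re.compile(r"^v\d+\.\d+$")),
]


def classify_tag(tag: str) -> str:
    """Return the category of an image tag, or 'unknown' if no pattern fits."""
    for name, pattern in TAG_PATTERNS:
        if pattern.match(tag):
            return name
    return "unknown"


for tag in ["latest", "main-abc1234", "v2.0.0", "v2.0", "nightly"]:
    print(tag, "->", classify_tag(tag))
```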

See packages/python/goldensuite-mcp/README.md for the aggregator's tool-collision behaviour.


Airflow

12 drop-in DAGs at examples/airflow/, grouped by lifecycle stage:

| Group | DAGs |
| --- | --- |
| Core pipeline | daily_dedupe, incremental_match, warehouse_native (Snowflake), customer_360 (multi-source) |
| Privacy | pprl_linkage (two-party PPRL) |
| Onboarding & monitoring | schema_align_and_load, schema_drift_alarm, quality_gate |
| Feedback loop | review_worker, active_learning |
| Operationalize | reverse_etl (Salesforce/HubSpot), backfill |

TaskFlow API, Airflow 2.7+ (compatible with 3.x). Each DAG has tunable knobs at the top, idempotent retries, and is marker-protected against double-processing. Drop the file you want into your Airflow dags/ folder.
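
The marker idea can be sketched without Airflow at all. The example below is a generic illustration of marker-protected, retry-safe processing and makes no claim about how the shipped DAGs store their markers; the function name, the .done file convention, and the batch id are all hypothetical.

```python
import tempfile
from pathlib import Path


def process_once(batch_id: str, marker_dir: Path, work) -> bool:
    """Run work(batch_id) unless a marker says it already completed.

    Returns True if the work ran, False if it was skipped.
    """
    marker = marker_dir / f"{batch_id}.done"
    if marker.exists():
        return False  # an earlier run (or retry) already finished this batch
    work(batch_id)
    # Write the marker only after the work succeeds, so a failed
    # attempt is retried instead of being silently skipped.
    marker.write_text("ok")
    return True


processed = []
with tempfile.TemporaryDirectory() as tmp:
    markers = Path(tmp)
    first = process_once("2026-05-01", markers, processed.append)
    retry = process_once("2026-05-01", markers, processed.append)

print(first, retry, processed)  # True False ['2026-05-01']
```

Because the marker is written last, a task that crashes midway leaves no marker behind and the retry repeats the work, so the work itself still has to be safe to repeat.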


Repository layout

goldenmatch/
├── packages/
│   ├── python/
│   │   ├── goldenmatch/      # entity resolution (headline package)
│   │   ├── goldencheck/      # data quality scanning
│   │   ├── goldenflow/       # transforms & standardizers
│   │   ├── goldenpipe/       # orchestrator
│   │   ├── infermap/         # schema mapping
│   │   └── goldenanalysis/   # cross-cutting analysis & reporting
│   ├── typescript/
│   │   ├── goldenmatch/      # full TS port (edge-safe core)
│   │   ├── goldencheck/      # TS implementation
│   │   ├── goldencheck-types/ # shared TS types
│   │   ├── goldenflow/       # TS transforms
│   │   ├── infermap/         # TS schema mapping
│   │   └── goldenanalysis/   # TS analysis & reporting (edge-safe + WASM)
│   ├── rust/
│   │   └── extensions/       # Postgres pgrx + DuckDB UDFs (own Cargo workspace)
│   ├── python/goldensuite-mcp/ # aggregator MCP server (one container, all tools)
│   ├── dbt/goldencheck/      # dbt package
│   └── actions/goldencheck/  # GitHub Action
├── examples/
│   ├── python/               # 6 runnable Python scripts (quickstart → MCP)
│   ├── typescript/           # 3 TS scripts (quickstart, Vercel Edge, MCP)
│   └── airflow/              # 12 drop-in Airflow DAGs
├── docs/superpowers/         # design specs and implementation plans
├── justfile                  # install / test / lint / build, all languages
├── pyproject.toml            # uv workspace (root)
├── pnpm-workspace.yaml       # TypeScript pnpm workspace (Turborepo)
├── package.json              # root scripts + pnpm workspace root
└── .github/workflows/ci.yml

Workspaces (Cargo vs pnpm)

  • Cargo: no root workspace. packages/rust/extensions/ is itself a Cargo workspace (the postgres crate is excluded for pgrx-specific build requirements). Cargo doesn't allow nested workspaces sharing members, so Cargo commands run from inside packages/rust/extensions/.
  • TypeScript: a single pnpm workspace. packages/typescript/* form one pnpm + Turborepo workspace (see TypeScript dev setup). .npmrc pins node-linker=hoisted, giving a flat node_modules that avoids the Windows symlink issues an earlier per-package layout hit.

Build / test / lint everything

just install   # uv sync + per-package npm install + cargo fetch
just test      # all languages
just lint
just build

Reproducing benchmarks

Published GoldenMatch numbers (DQbench composite 91.04, DBLP-ACM 0.9641 F1, Febrl3 0.9443 F1, NCVR 0.9719 F1) map back to a single committed runner: scripts/run_benchmarks.py. See docs/reproducing-benchmarks.md for per-number commands, dataset URLs, expected output (with tolerance), variance notes (deterministic vs LLM-augmented), and a copy-pasteable one-click reproduction snippet for the DQbench composite. The same runner powers the weekly benchmarks.yml workflow.

Scale envelope

"How big can this handle?" is answered in docs/scale-envelope.md: per-backend ranges (Polars in-memory < 500K, DuckDB out-of-core 500K - 50M, Ray distributed >= 50M), block-size failure modes, candidate-pair math, and a single-page decision tree for picking a backend.

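As a rough illustration of the backend ranges and the candidate-pair math mentioned above, the sketch below encodes the quoted row thresholds and counts comparisons per block. The function names and backend labels are hypothetical, not a GoldenMatch API, and the real decision tree in docs/scale-envelope.md may weigh more than row count.

```python
from collections import Counter


def pick_backend(n_rows: int) -> str:
    """Map a row count to a backend using the ranges quoted above."""
    if n_rows < 500_000:
        return "polars"  # in-memory
    if n_rows < 50_000_000:
        return "duckdb"  # out-of-core
    return "ray"  # distributed


def candidate_pairs(block_sizes) -> int:
    """Pairs compared when every block of size b contributes b*(b-1)/2."""
    return sum(b * (b - 1) // 2 for b in block_sizes)


# Toy blocking keys: one oversized block and two small ones.
keys = ["10001"] * 1000 + ["94105"] * 10 + ["60601"] * 10
sizes = Counter(keys).values()
print(pick_backend(len(keys)), candidate_pairs(sizes))  # polars 499590
```

The single block of 1,000 rows accounts for 499,500 of the 499,590 pairs, which shows how one oversized block can dominate the cost regardless of backend.
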
Verified at the top end: a full 100,000,000-row GoldenMatch dedupe on a 5-node Ray cluster (e2-standard-16, 80 CPU) in 9.2 min (554 s), with 20,000,000 golden records recovered exactly and the driver process peaking at 0.36 GB RSS. The default distributed path is now recall-complete (blocking-key shuffle scoring + a distributed randomized-contraction WCC), so duplicates merge correctly no matter how the input is partitioned, and it stays driver-collect-free end to end (#844). A faster per-partition path is available via GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0 (driver-collect-free, ~213 s on a 4-worker run) for inputs where duplicates already co-locate within partitions. That path under-merges when a cluster's members land in different input partitions, which is why recall-complete is the default. Recipe in packages/python/goldenmatch/configs/distributed-100m.yaml.
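
The under-merge risk of a per-partition path can be seen in a few lines. The sketch below uses a single-process union-find for connected components, not the distributed randomized-contraction WCC named above, and the record ids, match edges, and partition assignment are hypothetical.

```python
def connected_components(nodes, edges):
    """Union-find connected components; returns a sorted list of sorted groups."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: records 1, 2, 3 are the same entity; 4 is distinct.
nodes = [1, 2, 3, 4]
match_edges = [(1, 2), (2, 3)]
partition_of = {1: "p0", 2: "p0", 3: "p1", 4: "p1"}

# Per-partition view: an edge is only seen if both ends share a partition.
local_edges = [(a, b) for a, b in match_edges if partition_of[a] == partition_of[b]]

print(connected_components(nodes, local_edges))  # [[1, 2], [3], [4]]  (under-merged)
print(connected_components(nodes, match_edges))  # [[1, 2, 3], [4]]    (all edges seen)
```

Record 3 is split from its cluster only because it landed in a different partition than records 1 and 2. Shuffling by blocking key before scoring is one way to make such cross-partition edges visible.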


Contributing

  • Feature work goes on feature/<name> branches; merge via squash PR.
  • PR title format: feat: <description>, fix: <description>, docs: <description>.
  • Tests must pass in all three languages where the change applies; the parity harness in packages/typescript/goldenmatch/tests/parity/ enforces 4-decimal-tolerance Python ↔ TypeScript scorer parity.
  • See docs/superpowers/specs/ for design rationale on architectural decisions.
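
A 4-decimal parity check of the kind described above can be approximated as follows. This is a sketch under assumptions: the function name is hypothetical, and it interprets the tolerance as an absolute difference of at most half a unit in the fourth decimal place, which may differ from how the actual harness defines it.

```python
import math


def scores_match(py_scores, ts_scores, decimals: int = 4) -> bool:
    """True when paired scores agree within half a unit of the last kept decimal."""
    if len(py_scores) != len(ts_scores):
        return False
    tol = 0.5 * 10 ** -decimals
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
        for a, b in zip(py_scores, ts_scores)
    )


print(scores_match([0.85001, 0.91234], [0.85003, 0.91234]))  # True
print(scores_match([0.8500], [0.8502]))  # False
```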

TypeScript dev setup (pnpm + Turborepo)

The TypeScript packages live in a single pnpm workspace orchestrated by Turborepo. From the repo root:

corepack enable                               # one-time, picks up pnpm@9.15.0 from package.json
pnpm install                                  # installs all workspace packages
pnpm turbo run build test typecheck lint      # full pipeline (cached after first run)
pnpm --filter goldenmatch test                # single package

Windows: enable Developer Mode for pnpm. pnpm install creates symlinks under node_modules/. Settings → For Developers → Developer Mode → On. If you see EPERM: operation not permitted, symlink ... during install, Dev Mode is off.

If corepack enable fails (often needs an admin shell on Windows), the fallback is npm i -g pnpm@9.15.0, which is functionally equivalent.


History

This repository was formed on 2026-05-01 by folding 8 sibling repos into the existing goldenmatch repo using git filter-repo. Full commit history is preserved for every source. See docs/superpowers/specs/2026-05-01-goldenmatch-monorepo-fold-in-design.md for the design rationale and docs/superpowers/plans/2026-05-01-goldenmatch-monorepo-fold-in.md for the step-by-step migration plan.


Author & License

Built by Ben Severn.

MIT. See LICENSE.