PMH paper — block findings

Synthesized from main.pdf · Train on A, deploy on B, same labels

Library vs paper: The headline metrics here come from the paper, not from running `pip install matching-pmh` on the demo loaders. Use the notebooks and `pmh-train try` on your own stack, and expect to iterate until Step 5 passes on the deploy holdout. See docs/START.md.

Summary: 12 blocks pass, 1 partial / limit (pre-registered criteria)

The Perturbation Matching Hypothesis (PMH) treats label-preserving deploy change as one estimation problem: learn the geometry of nuisance variation, train with a matched penalty, and falsify with wrong-direction and isotropic controls before claiming deploy gains.

**12 of 13** pre-registered blocks meet their pass criteria in the paper ([main.pdf](main.pdf)); see [docs/findings.html](docs/findings.html) for a block summary. Wins span classical projection, ViT noise robustness, pose and depth, domain adaptation (DomainNet, Cityscapes), molecules, code renames, speech, HAR, LLM style, and PGD robustness.

**T1 / Office-31** is the honest partial case: on frozen ResNet-18 features, CORAL can beat projection-only PMH on accuracy, while PMH still beats ERM and the wrong-W controls. This illustrates the eigengap limits of Lemma D1, not a silent library bug.
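As a rough illustration of what an eigengap limit can look like in practice, the sketch below computes a relative eigengap for two toy covariance matrices. It assumes that "eigengap" means the separation between the k-th and (k+1)-th largest eigenvalues of an estimated nuisance covariance, as in standard subspace perturbation arguments; Lemma D1 in the paper may define it differently. The function name `relative_eigengap` and the toy spectra are hypothetical and not part of the library.

```python
import numpy as np

def relative_eigengap(cov, k):
    """Gap between the k-th and (k+1)-th largest eigenvalues, scaled by the largest.

    Hypothetical diagnostic, not the paper's definition: a small value suggests
    a rank-k nuisance subspace is hard to separate from the remaining directions.
    """
    evals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; reverse it
    return float((evals[k - 1] - evals[k]) / evals[0])

# Toy spectra (made up for illustration only).
sharp = np.diag([10.0, 9.0, 0.1, 0.1])  # clear cut after two directions
flat = np.diag([10.0, 9.0, 8.5, 8.0])   # no clean cut after two directions

print("sharp spectrum:", relative_eigengap(sharp, 2))
print("flat spectrum: ", relative_eigengap(flat, 2))
```

When the spectrum is flat, a projection onto the top-k directions removes only part of the shift, which is one plausible reason a full second-order alignment such as CORAL could come out ahead on frozen features.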

Falsification arms (matched vs wrong-W vs isotropic) recur across blocks: gains tied to estimated nuisance geometry, not generic regularization.
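The matched, wrong-W, and isotropic arms can be sketched on synthetic data with a projection-penalized ridge regression. This is a minimal sketch under stated assumptions, not the paper's protocol or the library's API: the data generator, the estimate of W from paired differences, the penalty form `lam * ||W^T w||^2`, and every name in the code (`make_split`, `fit`, `arms`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 2, 2000  # feature dim, nuisance rank, rows per split

# Hypothetical ground truth: nuisance lives in span(W_true), signal in span(S).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
W_true, S = Q[:, :k], Q[:, k:]
a = rng.normal(size=d - k)  # true signal weights

def make_split(n_rows, spurious):
    s = rng.normal(size=(n_rows, d - k))
    y = s @ a + rng.normal(size=n_rows)
    if spurious:  # train (A): nuisance coordinates leak the label
        nuis = 0.5 * y[:, None] + 0.5 * rng.normal(size=(n_rows, k))
    else:         # deploy (B): nuisance is large and label-independent
        nuis = 3.0 * rng.normal(size=(n_rows, k))
    return s @ S.T + nuis @ W_true.T, y

X_a, y_a = make_split(n, spurious=True)
X_b, y_b = make_split(n, spurious=False)

# Estimate nuisance geometry from label-preserving paired differences (assumed available).
pair_diffs = rng.normal(size=(500, k)) @ W_true.T + 0.01 * rng.normal(size=(500, d))
_, evecs = np.linalg.eigh(pair_diffs.T @ pair_diffs)  # ascending eigenvalues
W_hat = evecs[:, -k:]    # top-k directions: matched arm
W_wrong = evecs[:, :k]   # bottom-k directions: wrong-W control

def fit(X, y, P, lam):
    # Ridge-style solve of (X^T X / n + lam * P) w = X^T y / n.
    A = X.T @ X / len(y) + lam * P + 1e-8 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y / len(y))

lam = 1e4
arms = {
    "erm": np.zeros((d, d)),
    "matched": W_hat @ W_hat.T,
    "wrong_w": W_wrong @ W_wrong.T,
    "isotropic": (k / d) * np.eye(d),  # same trace as the matched penalty
}
mse = {name: float(np.mean((X_b @ fit(X_a, y_a, P, lam) - y_b) ** 2))
       for name, P in arms.items()}
for name, value in mse.items():
    print(f"{name:>9}: deploy MSE = {value:.3f}")
```

In this toy setup only the matched arm suppresses the weights along the shifted directions while leaving the signal weights alone, so its deploy error should be the lowest of the four. The controls exist to show that the gain comes from the estimated geometry rather than from regularization strength.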

Thirteen blocks (T1–T7)

| Block | Task | Headline result (paper) | Status | Task doc |
| --- | --- | --- | --- | --- |
| T1<br>t01-classical | Classical ML + matched projection (ridge, SVM, k-NN, logistic)<br>Lemma D1 · sklearn | Ridge theorem + oracle-W on MNIST/Fashion/SVHN; Office-31: CORAL > PMH on frozen ResNet, PMH > B0 on SVM (**documented D1 eigengap** case). | Partial / honest limit | docs/tasks/t01-classical.md |
| T2A<br>t02a-vit-isotropic | ViT / image classifier: isotropic sensor noise<br>Lemma D2 · pytorch | ViT-B/16 isotropic PMH: **+4.29 pp** mean ImageNet-C; TDI **−58%** at σ=0.10. | Pass | docs/tasks/t02a-vit-isotropic.md |
| T2B<br>t02b-chexpert-isotropic | Medical imaging: hospital / scanner embedding shift<br>Lemma D2 · pytorch | CheXpert E1: best saliency **0.723**; ~**9×** lower embedding drift vs baseline. | Pass | docs/tasks/t02b-chexpert-isotropic.md |
| T3A<br>t03a-pose-gradient | Pose / keypoints: camera & studio shift<br>Lemma D3 · pytorch | COCO pose E1_aniso: **54.49%** PCK@0.05 (+22.4 pp vs baseline 32.07%). | Pass | docs/tasks/t03a-pose-gradient.md |
| T3B<br>t03b-depth-augmentation | Depth estimation: photometric shift<br>Lemma D3 · pytorch | Depth photometric hard stress: E1_aniso AbsRel **0.2152** (wins on combined_hard). | Pass | docs/tasks/t03b-depth-augmentation.md |
| T4A<br>t04a-vision-domain | Vision domain shift (single-layer / ResNet)<br>Lemma D4 · pytorch | DomainNet real→sketch E1_multiscale: **42.15%** acc (+3.31 pp vs B0 38.84%). | Pass | docs/tasks/t04a-vision-domain.md |
| T4B<br>t04b-multilayer-vision | Vision domain shift (multilayer FPN / U-Net)<br>Lemma D4 · pytorch | GTA5→Cityscapes rare-5 mIoU **30.75%** (+11.1 pp vs B0 19.68%). | Pass | docs/tasks/t04b-multilayer-vision.md |
| T5A<br>t05a-qm9-molecule | Molecules / graphs (QM9-style)<br>Lemma D5 · pytorch | QM9 position PMH: clean MAE **24.921**; robust under σ=0.2 Å noise. | Pass | docs/tasks/t05a-qm9-molecule.md |
| T5B<br>t05b-code-tokens | Code models: token-group shift<br>Lemma D5 · pytorch | Code rename stress: E1 rename_bacc_ratio **0.9383** vs B0 **0.8297**; wrong blocks fail. | Pass | docs/tasks/t05b-code-tokens.md |
| T6A<br>t06a-speech-whisper | Speech / ASR: mic & room shift<br>Lemma D6 · pytorch | Whisper/Libri content-residual: other-WER **14.63%** (−8.6 pp vs 23.26%). | Pass | docs/tasks/t06a-speech-whisper.md |
| T6B<br>t06b-temporal-har | Time-series / HAR: sensor drift<br>Lemma D6 · pytorch | HAR stress 3.0: balanced acc **0.4099** vs baseline **0.2794** (3 seeds). | Pass | docs/tasks/t06b-temporal-har.md |
| T7A<br>t07a-llm-style | LLM: format / tone / template<br>Lemma D7 · hf | Style RM + DPO: sycophancy **38.5%→13.5%**; margin_pmh Style TDI **1.836**. | Pass | docs/tasks/t07a-llm-style.md |
| T7B<br>t07b-adversarial-pgd | Adversarial / PGD perturbations<br>Lemma D7 · pytorch | CIFAR PGD-W pmh_aniso: TDI **0.878** (−19% vs 1.090); clean **80.9%**. | Pass | docs/tasks/t07b-adversarial-pgd.md |
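
The deltas in the table use two conventions: percentage points (an absolute difference between two percentages) and relative percent change (used for TDI). The short check below recomputes a few of them from the paired values quoted in the table; the helper names `pp_delta` and `rel_change_pct` are illustrative only.

```python
def pp_delta(new, base):
    """Absolute difference between two percentages, in percentage points."""
    return new - base

def rel_change_pct(new, base):
    """Relative change of new versus base, in percent."""
    return 100.0 * (new - base) / base

# Value pairs copied from the table above.
print("T3A PCK pp:   ", round(pp_delta(54.49, 32.07), 2))
print("T4A acc pp:   ", round(pp_delta(42.15, 38.84), 2))
print("T4B mIoU pp:  ", round(pp_delta(30.75, 19.68), 2))
print("T6A WER pp:   ", round(pp_delta(14.63, 23.26), 2))
print("T7B TDI rel %:", round(rel_change_pct(0.878, 1.090)))
```

Reading the T7B entry as a percentage-point change would be a mistake: TDI is not a percentage, so its −19% is a relative reduction.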

Use the library