When PMH helps (and when it does not)¶
Adoption path
First: Pick a task (does your setup fit?). This page: honest expectations.
PMH is a matching principle for the training loss (main.pdf): estimate $\Sigma_{\text{task}}$ for label-preserving changes at deployment, penalize the encoder Jacobian along a matched $\Sigma'$, and falsify with wrong-direction and isotropic arms before you trust a deployment metric. Pick your situation from the T1–T7 table in the README.
This page is the honesty layer: the theory does not guarantee higher accuracy on every benchmark. It gives named failure modes (e.g. Lemma D1 eigengap on Office-31, label-changing shifts out of scope) and requires Step 5 controls so gains are tied to the estimated nuisance geometry — not generic training noise.
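To make "penalize the encoder Jacobian along a matched $\Sigma'$" concrete, here is a minimal sketch in plain Python. It assumes the penalty has the form $\mathrm{tr}(J \Sigma' J^\top)$ with $\Sigma' = \sum_k \lambda_k u_k u_k^\top$, which is one common way to write a directional Jacobian penalty and is not confirmed by this page. The toy `encoder`, the helper names, and the finite-difference estimate are illustrative only and are not part of the `pmh` API.

```python
import math

def encoder(x):
    """Hypothetical 2-D toy encoder; a real run would use your network's h."""
    return [
        math.tanh(1.5 * x[0] + 0.2 * x[1]),
        math.tanh(0.1 * x[0] + 0.3 * x[1]),
    ]

def directional_sensitivity(f, x, u, eps=1e-5):
    """Central finite-difference estimate of ||J_f(x) u||^2."""
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    jvp = [(a - b) / (2 * eps) for a, b in zip(f(x_plus), f(x_minus))]
    return sum(v * v for v in jvp)

def jacobian_penalty(f, xs, directions, weights):
    """Mean over samples of sum_k weights[k] * ||J u_k||^2 (assumed form)."""
    total = 0.0
    for x in xs:
        for u, w in zip(directions, weights):
            total += w * directional_sensitivity(f, x, u)
    return total / len(xs)

samples = [[0.0, 0.0], [0.1, -0.2], [-0.3, 0.4]]
# Assumption for this demo: the nuisance direction is the first input axis.
matched = jacobian_penalty(encoder, samples, [[1.0, 0.0]], [1.0])
wrong_direction = jacobian_penalty(encoder, samples, [[0.0, 1.0]], [1.0])
isotropic = jacobian_penalty(
    encoder, samples, [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]
)
print(f"matched={matched:.3f} wrong={wrong_direction:.3f} isotropic={isotropic:.3f}")
```

In this toy setup the encoder is most sensitive along the first axis, so the matched penalty is the largest of the three. A loss that adds it would push the encoder to flatten that direction specifically, whereas the isotropic arm spreads the same pressure over all directions.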
Quick decision¶
```mermaid
flowchart TD
    Q1{Same labels on A and B?}
    Q1 -->|No| NO1["Do not use PMH: fix labels or use label-shift methods"]
    Q1 -->|Yes| Q2{New classes only at deploy?}
    Q2 -->|Yes| NO2["Do not use PMH: open-set / label shift"]
    Q2 -->|No| Q3{Any target / deploy signal?}
    Q3 -->|No unlabeled target| NO3["Collect target data or use style pairs for LLM"]
    Q3 -->|Yes| Q4{Representation h you can hook?}
    Q4 -->|No| NO4["Extract features first: sklearn G2 path"]
    Q4 -->|Yes| GO["Run check_applicability + preflight + controls"]
```
```python
from pmh import check_applicability

print(check_applicability(
    stack="pytorch",          # or "sklearn" / "hf"
    n_source=500,
    n_target=200,
    has_target_labels=False,  # unlabeled target OK for estimation
).summary())
```
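For readers who prefer code to diagrams, the four questions of the quick-decision flowchart can be written as a plain function. This is only a restatement of the flowchart: the function name, argument names, and the `"stop"` / `"prepare"` / `"go"` labels are hypothetical and are not what `check_applicability` returns.

```python
def pmh_quick_decision(same_labels, new_classes_at_deploy,
                       has_target_signal, can_hook_representation):
    """Walk Q1..Q4 of the flowchart and return (verdict, advice)."""
    if not same_labels:                      # Q1
        return ("stop", "fix labels or use label-shift methods")
    if new_classes_at_deploy:                # Q2
        return ("stop", "open-set / label shift")
    if not has_target_signal:                # Q3
        return ("prepare", "collect target data or use style pairs for LLM")
    if not can_hook_representation:          # Q4
        return ("prepare", "extract features first (sklearn G2 path)")
    return ("go", "run check_applicability + preflight + controls")


verdict, advice = pmh_quick_decision(
    same_labels=True,
    new_classes_at_deploy=False,
    has_target_signal=True,
    can_hook_representation=True,
)
print(verdict, "->", advice)
```

The order of the checks matters: label problems rule PMH out entirely, while a missing target pool or a missing hook only means there is preparation to do first.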
When you are more likely to see a benefit¶
| Signal | Why |
|---|---|
| Clear domain shift (camera, site, corpus style) with same semantics | PMH estimates directions that vary between A and B but should not change the label |
| Enough target data to estimate geometry | Rule of thumb: 50+ unlabeled target samples; 200+ for stable D1/D4; more for high rank |
| ERM underperforms on target but source training looks fine | Room to move the representation without destroying source fit |
| Preflight passes (`pass` or strong eigengap) | Estimated nuisance subspace is identifiable; check with `pmh-train doctor` |
| Matched beats wrong-W and isotropic in your control table | Gain is tied to the estimated nuisance story, not arbitrary regularization |
| End-to-end fine-tuning (Mode A) with a hook on h | The Jacobian penalty can reshape the representation, which ERM on a frozen-feature linear probe cannot |
Toy sanity check: `PMH_QUICK=1 python scripts/demos/first_run_domain_shift.py` runs a synthetic shift on which PMH often beats ERM on target accuracy, in about one minute on CPU.
When not to expect much (or use something else)¶
| Situation | What usually happens | Better approach |
|---|---|---|
| Frozen features + linear head on an easy DA benchmark | Small or no accuracy gain; CORAL may match or beat projection | See Office-31 table below; try Mode A fine-tuning if possible |
| Very small target pool (< ~30) | Unstable $\hat{W}$, marginal preflight | More target data, lower rank, or simpler nuisance (D2/D4) |
| Target already near source accuracy | Little headroom | ERM + report that PMH was unnecessary |
| Label shift / new classes | PMH is the wrong tool | Open-set, class-balanced reweighting, separate heads |
| Only generic noise robustness | Isotropic arm may look similar to matched | Augmentation, adversarial training |
| No target domain at all | Cannot estimate deployment nuisance | Collect unlabeled target batches |
Honest reference numbers (do not cherry-pick)¶
Office-31 (T1, frozen ResNet-18 features, Amazon → DSLR)¶
Protocol: paper_protocol=True, preset t1_office31_sklearn, rank 32. Runbook: T1 classical · scripts/demos/office31_sklearn.py
| Arm | Target accuracy (holdout) | Comment |
|---|---|---|
| B0 (ERM) | 0.224 | Baseline |
| Matched PMH | 0.216 | Slightly below B0 on accuracy alone |
| CORAL | 0.268 | Strong on this linear frozen-feature setup |
| Isotropic control | 0.184 | Different objective — not “free accuracy” |
Takeaway: On this benchmark, matched projection does not beat ERM on accuracy (0.216 vs 0.224), and CORAL is the strongest arm (0.268). PMH is still useful here for replication, geometry metrics (TDI, $D_N/D_S$), and falsification (wrong-W should not beat matched on both accuracy and geometry). Do not use this table as a marketing headline.
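The gaps in the Office-31 table are easier to judge as differences against B0. The snippet below only does arithmetic on the four accuracies reported above; the dictionary and variable names are illustrative.

```python
# Target accuracies copied from the Office-31 table above.
office31_target_accuracy = {
    "B0 (ERM)": 0.224,
    "Matched PMH": 0.216,
    "CORAL": 0.268,
    "Isotropic control": 0.184,
}

baseline = office31_target_accuracy["B0 (ERM)"]

# Absolute difference in target accuracy relative to the ERM baseline.
deltas = {
    arm: round(acc - baseline, 3)
    for arm, acc in office31_target_accuracy.items()
}

for arm, delta in deltas.items():
    print(f"{arm:<18} {office31_target_accuracy[arm]:.3f}  delta vs B0: {delta:+.3f}")

best_arm = max(office31_target_accuracy, key=office31_target_accuracy.get)
```

Relative to B0, matched PMH is 0.008 lower, CORAL is 0.044 higher, and the isotropic control is 0.040 lower in absolute target accuracy.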
Synthetic domain shift (first-run demo)¶
Controlled shift in input space + trainable backbone — PMH often shows higher target accuracy than ERM because the representation can move. This is the right mental model for Mode A end-to-end training.
PMH vs other approaches (same goal, different lever)¶
| Approach | What moves | Controls built in? | Typical best when |
|---|---|---|---|
| ERM (source only) | Task loss on A | — | Strong baseline; document target metric |
| Fine-tune on target | All weights on labeled B | — | Many target labels available |
| CORAL / moment match | Feature covariance toward target | Optional baseline arm | Frozen features + linear classifier |
| DANN / domain adversary | Encoder vs domain classifier | External | Unlabeled target, classic DA setup |
| matching-pmh (matched) | Penalize sensitivity along $\hat\Sigma_{\text{task}}$ | wrong-W, isotropic arms | Same labels, target signal, hook on h, need credible claim |
On frozen features, compare PMH arms with CORAL in compare_arms_sklearn(..., include_coral=True) — see T1 classical.
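As background on what "feature covariance toward target" means for the CORAL row, here is a minimal NumPy sketch of textbook covariance alignment: whiten the source features, then re-color them with the target covariance. It is not the implementation behind `include_coral=True`, which this page does not show. The function names, the `eps` regularizer, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def _sym_power(mat, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals**power) @ vecs.T

def coral_align(x_source, x_target, eps=1e-3):
    """Linearly map source features so their covariance matches the target's."""
    d = x_source.shape[1]
    # Small ridge term keeps both covariances invertible (assumed value).
    cov_s = np.cov(x_source, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(x_target, rowvar=False) + eps * np.eye(d)
    whitened = x_source @ _sym_power(cov_s, -0.5)   # remove source correlations
    return whitened @ _sym_power(cov_t, 0.5)        # apply target correlations

# Synthetic stand-ins for frozen features; not real benchmark data.
rng = np.random.default_rng(0)
x_source = rng.normal(size=(500, 3))
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
x_target = rng.normal(size=(400, 3)) @ mixing

x_aligned = coral_align(x_source, x_target)
cov_target = np.cov(x_target, rowvar=False)
gap_before = np.abs(np.cov(x_source, rowvar=False) - cov_target).max()
gap_after = np.abs(np.cov(x_aligned, rowvar=False) - cov_target).max()
print(f"max covariance gap before={gap_before:.3f} after={gap_after:.3f}")
```

A classifier is then trained on `x_aligned` with the source labels. Unlike the matched arm, nothing here uses an estimated nuisance direction, which is why the table lists CORAL as a baseline arm and not as a control.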
How to know it “worked” (beyond accuracy)¶
- Preflight: `pass` before large training runs (`artifact.preflight` / `pmh-train doctor`).
- Target metric: accuracy / AUROC on held-out target, not source only.
- Falsification: matched > wrong-W on deployment metric; isotropic should not beat matched on both accuracy and geometry (`evaluate_baseline_vs_pmh` / `evaluate_robust_fit`).
- Geometry (optional): `tdi_cls`, $D_N/D_S$ from `pmh.tdi` / `compare_arms_sklearn(..., include_geometry=True)`.
```python
# sklearn (frozen features)
from pmh import evaluate_baseline_vs_pmh

report = evaluate_baseline_vs_pmh(
    x_source, y_source, x_target, y_target,
    compare_to=("coral",),
)
print(report.summary())
```
```python
# PyTorch (ERM vs PMH on labeled target val_loader)
from pmh import evaluate_robust_fit

report = evaluate_robust_fit(
    model, train_loader, val_loader,
    source_batches=src, target_batches=tgt,
    hook="auto", head=classifier, epochs=10,
)
print(report.summary())
# Then: compare_arms(...) for matched / wrong_w / isotropic
```
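Once you have per-arm numbers, the falsification rule from the list above can be checked mechanically. The sketch below is a hedged illustration and does not depend on `pmh`. `ArmResult` and `falsification_checks` are hypothetical names, and it assumes you have already reduced your geometry metric to a single score where higher is better.

```python
from dataclasses import dataclass

@dataclass
class ArmResult:
    """One training arm; both fields are 'higher is better' by assumption."""
    target_metric: float    # e.g. held-out target accuracy or AUROC
    geometry_score: float   # hypothetical summary of your geometry metric

def falsification_checks(matched, wrong_w, isotropic):
    """Return named checks; True means the check passed."""
    return {
        # Matched must beat the wrong-direction arm on the deployment metric.
        "matched_beats_wrong_w":
            matched.target_metric > wrong_w.target_metric,
        # Isotropic may win on one axis, but not on both at once.
        "isotropic_not_better_on_both": not (
            isotropic.target_metric > matched.target_metric
            and isotropic.geometry_score > matched.geometry_score
        ),
    }

# Placeholder numbers for illustration only, not measured results.
checks = falsification_checks(
    matched=ArmResult(target_metric=0.71, geometry_score=0.60),
    wrong_w=ArmResult(target_metric=0.64, geometry_score=0.35),
    isotropic=ArmResult(target_metric=0.69, geometry_score=0.40),
)
print(checks)
print("claim supported" if all(checks.values()) else "treat gain as unexplained")
```

If either check fails, the gain is not tied to the estimated nuisance directions and should be reported as generic regularization at most.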
Minimum checklist before production claims¶
- [ ] `check_applicability` is `go` (not `no_go`)
- [ ] Target holdout evaluated with same label space as source
- [ ] Matched compared to B0 and at least one control arm
- [ ] If only isotropic wins, treat as generic regularization
- [ ] Report protocol (Mode A vs B, rank, pool size) for reproducibility
Next steps¶
| Goal | Doc |
|---|---|
| Install and first run | QUICKSTART.md |
| Pick a paper task | 13 tasks |
| T1 Office-31 + sklearn | t01-classical.md |
| API reference | api/index.md |