R3 · PLO4 UTG/MP VPIP — Resolution Report

Positional VPIP inversion (UTG opens wider than MP) confirmed at endpoint · A1 / A1-strict B1 properties · 2026-05-11

B1 known-failure · Properties: A1 (5pp tolerance, spec) · A1-strict (1pp noise floor, spec) · Reviewer: Brad Wilson

The issue (one paragraph)

On PLO4 at 100bb, 6-max, 2-blind, 5%/1bb rake (5% capped at 1bb), the model opens UTG wider than MP, violating the expected positional ordering (UTG < MP < CO < BTN). Brad Wilson surfaced this in his April 2026 review, where the baseline showed UTG 36.5% > MP 35.7% (a 0.8pp inversion). The 2026-05-11 endpoint re-run on the same model shows the inversion has widened to 3.9pp (UTG 36.41% > MP 32.51%). The ordering among the later seats (MP < CO < BTN) is preserved.

Current endpoint data (2026-05-11)

Model: universal-dense-v4-player_20260402_150328.onnx (same model Brad reviewed on 2026-04-17). Setup: PLO4, 6-max, 100bb, 2-blind, no ante, 5%/1bb rake. Test script: tests/pull_r3_plo4_rfi.py.

Position	Endpoint RFI (2026-05-11)	Endpoint limp%	Brad's baseline (2026-04-17)	Δ (endpoint − baseline)
UTG	36.41%	0.02%	36.5%	−0.09pp
MP	32.51%	0.72%	35.7%	−3.19pp
CO	36.21%	0.83%	39.4%	−3.19pp
BTN	43.30%	3.07%	52.4%	−9.10pp
SB	13.47%r / 30.17%c	—	n/a	—

Monotonicity check:

UTG → MP: 36.41% → 32.51% (−3.9pp inversion, wider than Brad's 0.8pp baseline)
MP → CO: 32.51% → 36.21% (+3.7pp ✓)
CO → BTN: 36.21% → 43.30% (+7.1pp ✓)
Strict monotonic: FAIL
A1-strict (1pp noise floor): FAIL
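The check above can be reproduced with a small adjacent-seat comparison. This is a minimal sketch, not the production runner: it assumes a tolerance means a later seat may trail the earlier one by at most that many percentage points (`curr >= prev - tolerance`) and that VPIP is held as a fraction. `findInversions` and the types are hypothetical names.

```typescript
// Sketch: adjacent-seat monotonicity check with a tolerance (noise floor).
// Assumption: VPIP is a fraction (0.3641 = 36.41%), so 0.01 = 1pp.

type Seat = "UTG" | "MP" | "CO" | "BTN";

interface SeatVpip {
  seat: Seat;
  value: number; // RFI frequency as a fraction
}

interface Inversion {
  from: Seat;
  to: Seat;
  dropPp: number; // how far the later seat sits below the earlier one, in pp
}

// Returns each adjacent pair where the later seat opens narrower than the
// earlier seat by more than `tolerance` (0 = no tolerance at all).
function findInversions(seats: SeatVpip[], tolerance: number): Inversion[] {
  const found: Inversion[] = [];
  for (let i = 1; i < seats.length; i++) {
    const prev = seats[i - 1];
    const curr = seats[i];
    if (curr.value < prev.value - tolerance) {
      found.push({
        from: prev.seat,
        to: curr.seat,
        dropPp: Math.round((prev.value - curr.value) * 10000) / 100,
      });
    }
  }
  return found;
}

// Endpoint RFI from the 2026-05-11 table, in seat order.
const endpoint: SeatVpip[] = [
  { seat: "UTG", value: 0.3641 },
  { seat: "MP", value: 0.3251 },
  { seat: "CO", value: 0.3621 },
  { seat: "BTN", value: 0.433 },
];

const checks: Array<[string, number]> = [
  ["zero tolerance", 0],
  ["1pp (A1-strict)", 0.01],
  ["5pp (A1 spec)", 0.05],
];

for (const [label, tol] of checks) {
  const inversions = findInversions(endpoint, tol);
  console.log(`${label}: ${inversions.length === 0 ? "PASS" : "FAIL " + JSON.stringify(inversions)}`);
}
```

With the 2026-05-11 numbers, the zero-tolerance and 1pp checks FAIL on UTG → MP, while the 5pp check PASSes because a 3.9pp drop sits inside a 5pp tolerance. If the yaml's 5pp tolerance means what this sketch assumes, A1 as specified would not flag this inversion; only the zero-tolerance implementation or A1-strict would.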

Existing B1 properties that should catch this

ID	Property	Spec	Production runner
`A1`	Position → range width (UTG < MP < CO < BTN)	5pp tolerance (per yaml spec)	✅ Implemented (`classA.ts:54`), but actually uses a STRICT comparison (no tolerance)
`A1-strict`	Same property, 1pp noise floor	Added in v2 (2026-04) after Brad's review	❌ NOT implemented in production runner

⚠ Two gaps between spec and implementation

A1-strict has no runner. The B1 yaml v2 added A1-strict (with the plo4 MP/UTG inversion listed in known_failures), but no corresponding TypeScript runner exists in projects_dev/agserving-rangeviewer-v2/tests/parity/properties/runners/classA.ts. The framework can record the known failure in the yaml but cannot automatically detect it on new model versions.
A1's runner doesn't sweep rake. A1's runner DSL is `${type}.${mode}.${N}p.2b.A0.${stack}bb / ${pos} : open`, which has no rake parameter, so it tests at the DSL's default rake (possibly 0%; whatever the parser assumes has not been confirmed). Brad's review and our re-run both use 5%/1bb rake, so the runner may not be exercising the model regime where the inversion exists.

Together, these gaps could explain why Nimit's 2026-04-23 cross-format sweep showed plo4-6max 55/55 PASS while Brad's review and our re-run both show a clear inversion. A1's strict comparison should catch any inversion, but only if it is queried at the rake configuration where the inversion occurs.

Why is MP opening tighter than UTG at all?

This is the model-side question, separate from the framework gap. Possible explanations:

Training coverage gap at MP. Adjacent to KI-5 (MP val=1 VPIP non-monotonicity in Squid), and part of the same family of "MP is the awkward middle position with thin training signal." MP sits between the well-trained early seats and the well-trained late seats; if the model's training data is dominated by UTG-style or CO-style spots, MP can drift.
Rake interaction. Because the 1bb cap makes the rake a full 5% of small pots but a shrinking share of large ones, a 5%/1bb rake creates non-linear EV penalties for marginal hands. If MP's opening range includes more marginal hands than UTG's (because UTG has more players left to act, so it opens fewer but stronger hands), the rake penalty could hit MP harder and shrink its VPIP.
Real strategic finding. In PLO4 at 5% rake, opening MP tighter than UTG might be the equilibrium response to specific BB / CO defensive behavior. This is unlikely, since it violates conventional poker intuition, but it cannot be ruled out without external solver confirmation.
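To make the rake-interaction hypothesis concrete, the sketch below computes effective rake as a share of the pot under the two structures named in this report. It assumes rake = min(rate × pot, cap) on the final pot and ignores site rules such as "no flop, no drop"; `rakeBb` and `effectiveRate` are hypothetical helpers.

```typescript
// Sketch: capped percentage rake, assuming rake = min(rate * pot, cap) on the
// final pot. Site rules such as "no flop, no drop" are deliberately ignored.

interface RakeConfig {
  rate: number;  // fraction of the pot, e.g. 0.05 for 5%
  capBb: number; // maximum rake in big blinds
}

function rakeBb(potBb: number, cfg: RakeConfig): number {
  return Math.min(cfg.rate * potBb, cfg.capBb);
}

// Rake as a share of the pot; this falls once the cap binds.
function effectiveRate(potBb: number, cfg: RakeConfig): number {
  return potBb > 0 ? rakeBb(potBb, cfg) / potBb : 0;
}

const ploRake: RakeConfig = { rate: 0.05, capBb: 1 };  // 5%/1bb (Brad's PLO setup)
const cashRake: RakeConfig = { rate: 0.03, capBb: 3 }; // 3%/3bb (Cash baseline)

for (const pot of [10, 20, 40, 100, 200]) {
  const plo = (effectiveRate(pot, ploRake) * 100).toFixed(2);
  const cash = (effectiveRate(pot, cashRake) * 100).toFixed(2);
  console.log(`pot ${pot}bb: 5%/1bb -> ${plo}%, 3%/3bb -> ${cash}%`);
}
```

Under this model, 5%/1bb takes the full 5% of any pot up to 20bb, then drops to 2.5% at 40bb and 1% at 100bb, while 3%/3bb holds at 3% until the pot reaches 100bb. So 5%/1bb takes more rake than 3%/3bb on pots below roughly 33bb (where 3% of the pot reaches 1bb) and less above that, which is the kind of non-linearity the hypothesis relies on.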

No published external PLO RFI reference exists for cross-check. Unlike NLHE (where GTOw publishes per-position RFI on its blog), PLO solver vendors (Mastermind, MonkerSolver) don't routinely publish per-position RFI charts. The "right answer" for PLO4 UTG vs MP at 100bb / 5% rake is not externally documented.

Resolution

Three parallel tracks. All required.

Track A — Implement A1-strict in the production runner

The yaml spec exists; the runner doesn't. Add it.

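Add runA1Strict() to classA.ts following the pattern of runA1(), but with the 1pp noise-floor check (curr.value >= prev.value - 0.01) in place of the zero-tolerance curr.value >= prev.value. Note that 0.01 equals 1pp only if VPIP values are stored as fractions; if they are stored as percentages, the floor should be 1.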
Add a rake parameter to the DSL and sweep it: at minimum, test 5%/1bb cap (Brad's PLO setup) and 3%/3bb cap (Cash baseline). Today A1 tests only the DSL's default rake; that is the coverage gap.
Register A1-strict in the property suite so it runs in the cross-format sweep alongside the other A-series properties.
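A possible shape for the runner, with the rake sweep folded in, is sketched below. Everything here is hypothetical: the `.R{pct}c{cap}` rake segment is not part of the current DSL, `VpipFetcher` stands in for whatever query helper classA.ts actually uses (kept synchronous for brevity; a real endpoint call is likely async), and `runA1StrictSweep` is an illustrative name, not the existing `runA1()`.

```typescript
// Hypothetical A1-strict runner with a rake sweep. The rake DSL segment,
// VpipFetcher, and function names are illustrative assumptions, not the
// existing classA.ts API.

type Position = "UTG" | "MP" | "CO" | "BTN";
const ORDER: Position[] = ["UTG", "MP", "CO", "BTN"];

interface Rake {
  pct: number;   // rake percentage, e.g. 5
  capBb: number; // cap in big blinds, e.g. 1
}

// Stand-in for the runner's query helper; returns RFI as a fraction.
type VpipFetcher = (query: string) => number;

const NOISE_FLOOR = 0.01; // 1pp, assuming fractional VPIP

function buildQuery(type: string, mode: string, players: number, stackBb: number,
                    rake: Rake, pos: Position): string {
  // Current shape: ${type}.${mode}.${N}p.2b.A0.${stack}bb / ${pos} : open
  // The ".R{pct}c{cap}" rake segment appended here is hypothetical.
  return `${type}.${mode}.${players}p.2b.A0.${stackBb}bb.R${rake.pct}c${rake.capBb} / ${pos} : open`;
}

interface A1StrictFailure {
  rake: Rake;
  from: Position;
  to: Position;
  prev: number;
  curr: number;
}

function runA1StrictSweep(fetchVpip: VpipFetcher, type: string, mode: string,
                          players: number, stackBb: number, rakes: Rake[]): A1StrictFailure[] {
  const failures: A1StrictFailure[] = [];
  for (const rake of rakes) {
    // Query every seat at this rake, in positional order.
    const values = ORDER.map((pos) => fetchVpip(buildQuery(type, mode, players, stackBb, rake, pos)));
    for (let i = 1; i < ORDER.length; i++) {
      // A1-strict: a later seat may trail the earlier seat by at most 1pp.
      if (values[i] < values[i - 1] - NOISE_FLOOR) {
        failures.push({ rake, from: ORDER[i - 1], to: ORDER[i], prev: values[i - 1], curr: values[i] });
      }
    }
  }
  return failures;
}
```

Because each failure carries the rake it occurred at, an inversion that appears only at 5%/1bb would surface as rake-specific instead of being hidden by a single default-rake query.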

Track B — Audit A1's existing strict comparison vs Apr 23 PLO4 55/55 PASS

A1's runner uses a strict >= comparison with no tolerance, so if A1 had actually been run on PLO4 with Brad's setup, it should have caught the 0.8pp inversion. Either (a) the run was at a different rake, (b) the model behavior shifted between Apr 17 (baseline) and Apr 23 (Nimit's run), or (c) the DSL parses PLO4 differently than NLHE and short-circuits before the monotonic check.

Reproduce the Apr 23 PLO4 A1 run locally with the production vitest runner against the current universal-dense-v4-player_20260402_150328.onnx model.
Inspect the raw VPIP values A1 collects at UTG/MP/CO/BTN for PLO4. If they match our re-run (36/32/36/43), A1's strict comparison must fire on them, which would confirm that the runner has a coverage-path issue. If they disagree, there is a payload-construction divergence between the runner's DSL and our RVV2-exact pull.
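The Track B decision rule can be written down as a small triage function. This is a sketch under assumptions: per-seat values are fractions, and "match" means every seat agrees within 0.5pp, a threshold chosen for illustration rather than taken from the spec. `triage` and its verdict labels are hypothetical.

```typescript
// Sketch of the Track B triage. Per-seat values are fractions; the 0.005
// (0.5pp) "match" tolerance is an illustrative assumption, not a spec value.

type Seat = "UTG" | "MP" | "CO" | "BTN";
type SeatValues = Record<Seat, number>;

const SEATS: Seat[] = ["UTG", "MP", "CO", "BTN"];

type Verdict =
  // Runner values agree with the re-run, so A1's strict check must fire on
  // them; an Apr 23 PASS then points to a coverage-path issue.
  | { kind: "coverage-path-issue" }
  // Runner values disagree: the runner's DSL payload differs from our pull.
  | { kind: "payload-divergence"; seats: Seat[] };

function triage(runner: SeatValues, rerun: SeatValues, matchTol = 0.005): Verdict {
  const diverging = SEATS.filter((s) => Math.abs(runner[s] - rerun[s]) > matchTol);
  return diverging.length === 0
    ? { kind: "coverage-path-issue" }
    : { kind: "payload-divergence", seats: diverging };
}

// Our 2026-05-11 RVV2-exact pull.
const rerun: SeatValues = { UTG: 0.3641, MP: 0.3251, CO: 0.3621, BTN: 0.433 };
```

Passing in the values the runner logs (for example `triage(runnerValues, rerun)`, where `runnerValues` is whatever the vitest run captures) returns which Track B branch applies and, on divergence, which seats to inspect first.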

Track C — Investigate the model-side cause

Why does MP open tighter than UTG in PLO4 at 5% rake?

Cross-check PLO5 / PLO6 at the same setup — does the inversion extend to other PLO variants, or is it PLO4-specific? Tim Ulmer's earlier cross-format B1 sweep showed PLO6 weakest at A1/A1-strict (it failed there). PLO4 was clean by aggregate metric but may have the inversion at this specific config.
Sweep rake values {0%, 3%, 5%, 7%} at the same positions to test the rake-interaction hypothesis. If the inversion only appears at 5%/1bb cap but not at 3%/3bb cap, that's evidence of rake-specific behavior.
Check training-data coverage at MP for PLO4 — is the model under-trained at MP relative to UTG and CO? Connects to KI-5 family (MP non-monotonicity in Squid).
Cross-check with the PLO-format coach (Brad): at 5%/1bb rake, is there a defensible reason MP could open tighter than UTG, or is this clearly a model error?
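Once the Track C sweep has produced UTG and MP RFI per variant and rake, the rake-interaction test reduces to a per-variant comparison. A minimal sketch, assuming the 1pp A1-strict floor defines an inversion and that each result cell is labeled with a variant and a rake string; the `Cell` type, labels, and function names are hypothetical.

```typescript
// Sketch: summarising the Track C sweep. Cell labels and the 1pp threshold are
// assumptions chosen to mirror the A1-strict noise floor.

interface Cell {
  variant: "PLO4" | "PLO5" | "PLO6";
  rake: string; // e.g. "5%/1bb", "3%/3bb"
  utg: number;  // UTG RFI as a fraction
  mp: number;   // MP RFI as a fraction
}

const FLOOR = 0.01; // 1pp

// UTG opening more than 1pp wider than MP counts as an inversion.
function isInverted(c: Cell): boolean {
  return c.utg - c.mp > FLOOR;
}

// Variants showing the Track C rake-interaction pattern: inverted at
// 5%/1bb but not at 3%/3bb. Variants missing either cell are skipped.
function rakeSpecificVariants(cells: Cell[]): string[] {
  const variants = cells
    .map((c) => c.variant)
    .filter((v, i, all) => all.indexOf(v) === i);
  return variants.filter((v) => {
    const at = (rake: string): Cell | undefined =>
      cells.filter((c) => c.variant === v && c.rake === rake)[0];
    const high = at("5%/1bb");
    const low = at("3%/3bb");
    return high !== undefined && low !== undefined && isInverted(high) && !isInverted(low);
  });
}
```

A variant that is inverted at both rakes would not show up in this list, which would point away from the rake hypothesis and toward the MP training-coverage explanation.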

Suggested next steps (priority order)

Implement A1-strict in classA.ts with 1pp noise floor + rake-sweep DSL. Smallest, highest-leverage fix.
Re-run B1 A-series specifically against current PLO4 model at 5%/1bb rake — confirm A1 catches the 3.9pp inversion now.
Add cross-PLO check — same A1/A1-strict run at PLO5 and PLO6 to map the inversion across variants.
Brad re-review at current snapshot — show Brad the updated 32.51% MP number (down from 35.7% in his baseline) and ask whether the wider inversion changes his interpretation.
Investigate model training coverage at MP for PLO4 — same family as KI-5 (MP val=1 VPIP non-monotonicity in Squid).
Decide on coaching-surface display — if MP opens are confirmed model-side wrong by 3-4pp, should the coach UI flag affected positions with a warning?

Open questions for the team

Nimit (B1 framework): Why is A1-strict in the spec but not in the production runner? Was its implementation deferred? Also, does A1's DSL accept a rake parameter, and if not, can we add one?
Model training team: Is there a known reason MP would open tighter than UTG at PLO4 5%/1bb rake? Same family as KI-5?
Brad Wilson: Does the wider 3.9pp inversion (vs the 0.8pp he initially flagged) change his confidence in the original recommendation? Has he seen any external PLO solver output that suggests UTG>MP is defensible at this rake?

Source of truth: engineering-department/gameplay-ai/projects/external-solver-benchmark/. Test script: tests/pull_r3_plo4_rfi.py. Property spec: engineering-department/gameplay-ai/projects/llm-verifier-game-expansion/shared/b1-properties/core.yaml:48-54. Cross-reference: R3 in Reviewer Findings · Solver QA index.