This page tracks every divergence flagged by an external reviewer (coach, pro, or theorist) comparing Quintace strategy outputs against public solvers or theory baselines.
This is the reviewer-sourced subset of the full Solver QA catalog. Every entry here started as a reviewer comment (Google Doc, article QA pass, or in-person review), not a bench run. For the full benchmark catalogue, including quantitative dataset comparisons, see the Solver QA index.
📋 Reviewer-side consolidated tracker
"QuintAce — QA findings" Google Doc → — reviewer-owned master tracker. This page is downstream; reviewer notes land in the Google Doc first, then surface here as R-findings.
Two tabs in the doc:
Strategy Issues — UTGvBB major differences with GTOw · Hygiene · RV v2 vs GTOw (maps to R1, R2, hygiene check)
Articles — per-article review status: JRB Dumbest Hands · Ivey 5 Defining Hands · Easy Game (Seidman) · Adjusting to Donk Bets [KILL] · Uri Peleg Stand-Up · Mariano AA · HCL 5-way (maps to R4, R5, R8, R9)
R1 · UTGvBB postflop differences vs GTOw — Open for discussion
Latest Quintace endpoint results on each spot, alongside the reference number the reviewer used (from GTO Wizard or another external solver). Some spots agree, some differ. The differences are open for discussion — we haven't pre-assigned whether the external solver or Quintace is correct.
| Board | Spot | Quintace (latest endpoint) | Reviewer reference (GTOw / external) | Difference |
| --- | --- | --- | --- | --- |
| AK3ss | UTG c-bet freq | 66.4% | 88-89% | −22pp |
| AK3ss | UTG fold vs BB x/r | 28.6% | 27% | +1.6pp |
| KQ6hh | UTG c-bet freq | 72.8% | "similar" (no exact) | ≈ |
| KQ6hh | BB fold vs 2bb c-bet | 41.3% | 54% | −13pp |
| KQ6hh | UTG fold vs BB x/r | 26.1% | 47% | −21pp |
| 456ss | UTG c-bet freq | 51.2% | (none published) | — |
| 456ss | BB fold vs 2bb c-bet | 33.5% | 30% | +3.5pp |
| 456ss | UTG fold vs BB x/r | 36.1% | "similar" (no exact) | ≈ |
| JT8ss | UTG c-bet freq | 54.6% | (none published) | — |
| JT8ss | BB fold vs 2bb c-bet | 39.0% | 38% | +1.0pp |
| JT8ss | UTG fold vs BB x/r | 28.6% | 38% | −9.4pp |
Spots that closely agree (≤ ±4pp): AK3ss UTG vs xr, KQ6hh UTG c-bet (qualitative), 456ss BB defense, 456ss UTG vs xr (qualitative), JT8ss BB defense.
Spots where Quintace and the reviewer's reference materially differ — open for discussion:
AK3ss UTG c-bet — Quintace 66% vs GTOw 88-89%. Is GTOw's high c-bet rate right on AK3ss, or is Quintace's more selective c-bet defensible?
KQ6hh BB defense vs 2bb c-bet — Quintace 41% fold vs GTOw 54%. Is GTOw over-folding because of poor OOP equity-realization assumption, or is Quintace under-folding?
KQ6hh UTG vs BB xr — Quintace 26% fold vs GTOw 47%. Is GTOw too willing to give up here, or is Quintace too sticky?
JT8ss UTG vs BB xr — Quintace 29% fold vs GTOw 38%. Smaller gap, same question.
Open question: these are differences, not verdicts. For each, the discussion is whether the published external-solver number reflects a well-converged equilibrium on a comparable tree, or whether Quintace's number is the better answer (e.g., a universal model with no preflop-tree abstraction may legitimately produce a different postflop equilibrium). Next step: review with the gameplay-AI and coach teams to decide which differences to treat as model issues and which as legitimate strategic differences.
Methodology: endpoint sweep run via V2 strategy_grid (RVV2-exact payload), model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx; BB defense queried at a 2.0bb c-bet, UTG response queried at a BB x/r to 6.0bb. Test script: tests/pull_r1_all_5_boards_full.py.
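For orientation, a minimal sketch of the sweep shape — the endpoint URL and model name are from this page, while the payload field names and line encoding are illustrative assumptions (the canonical RVV2-exact payload lives in tests/pull_r1_all_5_boards_full.py):

```python
# Sketch of the R1 sweep shape. Payload fields below are illustrative
# assumptions -- the canonical RVV2-exact payload lives in
# tests/pull_r1_all_5_boards_full.py.
import requests

ENDPOINT = "https://preview.rlserv.aceguardianrl.com/api/strategy_grid"
MODEL = "moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx"
BOARDS = ["AK3ss", "KQ6hh", "456ss", "JT8ss"]

def query_node(board: str, line: list[str]) -> dict:
    """Query one postflop decision node (hypothetical 'line' encoding)."""
    payload = {"model": MODEL, "board": board, "line": line}
    resp = requests.post(ENDPOINT, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()

for board in BOARDS:
    flop = query_node(board, [])                                    # UTG c-bet freq
    vs_cbet = query_node(board, ["UTG bet 2.0bb"])                  # BB fold vs 2bb c-bet
    vs_xr = query_node(board, ["UTG bet 2.0bb", "BB raise 6.0bb"])  # UTG fold vs x/r
    print(board, flop, vs_cbet, vs_xr)
```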
R2 · Timothy Ulmer · Preflop RFI — Resolved at model layer (2026-05-11)
Preflop RFI Ranges 100bb 2-blind no ante 6-max — Quintace vs GTOW vs Jonathan Little
2026-05-11 update — endpoint verification. Direct query against the canonical V2 strategy_grid endpoint (RVV2-style payload, model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx) shows the model output matches GTOw to within ~1pp at every position. The "Quintace opens 5-10pp wider" claim was based on the reviewer's display surface, not the model.
What this resolves: The "REFERENCE-CONFIG / architectural divergence" framing was wrong. The model does NOT open structurally wider than external solvers. The postflop "H2 — preflop range cascades" hypothesis (in the companion UTG-vs-BB finding) is retired.
Open sub-issue: Reviewer's surface shows 25/29/36/48/42% — uniformly 5-10pp wider than the endpoint. Same family as the postflop aggregation/display drift flag. Routed to service team (Adil + Scott + Breno).
Test script: tests/pull_rfi_5positions_rvv2_payload.py
R3 · Brad Wilson · PLO4 UTG/MP VPIP monotonicity P1 · B1 known-failure · A1-strict not implemented in production runner
🃏 Hand: Preflop RFI · PLO4 6-max 100bb 2-blind no ante 5%/1bb rake · all 5 positions (UTG/MP/CO/BTN/SB) · Reviewer: Brad Wilson · Source: b1-properties/core.yaml:48-54 (A1-strict known_failure) · Date: 2026-05-06 · endpoint re-run 2026-05-11
PLO4 model opens UTG wider than MP at 100bb/5% rake — violating positional ordering. Brad's 2026-04-17 baseline showed a 0.8pp inversion (UTG 36.5% > MP 35.7%). The 2026-05-11 endpoint re-run on the same model shows the inversion has widened to 3.9pp (UTG 36.41% > MP 32.51%).
Cross-rake corroboration (added 2026-05-11): the PLO hygiene check in R10 reproduces the inversion at two additional rake settings — 3% rake / 3 BB cap and 0% rake — showing UTG 36.8% > MP 33.8% (3.0pp inversion). The monotonicity defect is rake-invariant, consistent with R10's separate finding that the PLO model is rake-insensitive on preflop opens overall.
B1 property gap discovered: A1 IS implemented in the production runner (with strict comparison) and SHOULD catch this, yet Nimit's Apr 23 cross-format sweep showed plo4-6max 55/55 PASS — so either A1 doesn't sweep the rake parameter, or there's a payload-construction divergence. A1-strict (1pp noise floor) was added to the yaml spec but NOT implemented in the runner (a sketch of the intended check follows the summary below).
Includes: current endpoint data table, B1 property coverage gaps (A1-strict not in runner, A1's DSL doesn't sweep rake), three-track resolution plan (implement A1-strict + audit A1 + investigate model-side cause), suggested next steps, open questions for Nimit / Brad / model team.
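A minimal sketch of what the missing A1-strict check could look like, assuming a vpip_by_position() helper that wraps the endpoint query — the 1pp noise floor is from the yaml spec; everything else is illustrative:

```python
# Sketch of A1-strict: positional VPIP ordering with a 1pp noise floor,
# swept across rake settings. vpip_by_position() is a hypothetical helper
# standing in for the production runner's endpoint query.
NOISE_FLOOR_PP = 1.0
POSITIONS = ["UTG", "MP", "CO", "BTN"]            # SB handled separately
RAKE_SETTINGS = [(0.00, 0.0), (0.03, 3.0), (0.05, 1.0)]  # (pct, cap_bb)

def check_a1_strict(vpip_by_position) -> list[str]:
    violations = []
    for rake_pct, cap_bb in RAKE_SETTINGS:
        vpip = vpip_by_position(rake_pct, cap_bb)  # {"UTG": 36.4, ...} in %
        for early, late in zip(POSITIONS, POSITIONS[1:]):
            # Later positions should open at least as wide, minus the floor.
            if vpip[early] > vpip[late] + NOISE_FLOOR_PP:
                violations.append(
                    f"{early} {vpip[early]:.1f}% > {late} {vpip[late]:.1f}% "
                    f"(rake {rake_pct:.0%}/{cap_bb}bb cap)"
                )
    return violations
```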
R4 · Easy Game (Seidman) · K72r UTG c-bet — Resolved · endpoint matches GTOw
Quintace's latest endpoint c-bet rate on K72r matches GTO Wizard's published number within solver-noise tolerance.
| Source | UTG c-bet on K72r | Notes |
| --- | --- | --- |
| Quintace endpoint (latest, 2026-05-11) | 88.6% | V2 strategy_grid, RVV2-exact payload, model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx |
| GTO Wizard published | 88% | Article reference |
| Article (Seidman v1, stale snapshot) | 59% | Pre-v2.0; surface was pinning a deprecated model |
| Article (Seidman v2.0 refresh) | 79% | Different surface aggregation than the endpoint above |
Resolution: Quintace's latest endpoint (88.6%) and GTOw (88%) agree within 0.6pp. No model-side discrepancy on this spot.
Article numbers worth noting: the Seidman article shipped 59% in v1 (stale model), then 79% in v2.0 (current model, but rendered through the article's own pull script). Neither matches the underlying endpoint's 88.6%. The article-side numbers come from a different surface aggregation than the canonical endpoint — an article-pipeline question, not a model question, and separate from this finding's resolution.
Test script: tests/pull_postflop_utg_cbet_boards.py · also confirmed in tests/pull_hygiene_check_all.py (verifies query setup against this canonical reference).
R5 · adjusting-to-donk-bets-in-srp · fold-to-donk — Model results match GTOw · article killed for editorial reasons
BB-lead / SB-call donk-bet defense — UTG response to BB donk in SRP
Reviewer flagged that Quintace "shows little folding vs donk" compared to GTOw's ~20% fold reference at b33. Endpoint says otherwise.
| Board | Donk size | Quintace endpoint UTG fold% | GTOw reference |
| --- | --- | --- | --- |
| 654r | 2.0bb (~36% pot) | 20.4% | ~20% |
| 765r | 2.0bb (~36% pot) | 22.4% | ~20% |
| T98r | 2.0bb (~36% pot) | 20.3% | ~20% |
| 654r | 1.0bb (~18% pot) | 2.4% | — |
| 765r | 1.0bb | 3.3% | — |
| T98r | 1.0bb | 12.7% | — |
Endpoint result: UTG fold% against BB's 2.0bb donk (≈b33) is 20-22% across 3 standard donk boards — essentially identical to the GTOw ~20% reference the reviewer cited. The "Quintace under-folds vs donk" claim does not hold at the model layer. The article's published number that triggered the reviewer's flag came from a different surface aggregation than the canonical endpoint.
Article kill still stands — but for scope, not fold-rate divergence. The reviewer's other concerns remain valid kill reasons:
BB-lead boards are 0% reach at equilibrium (BB never donks these in the solve, so analyzing "how to defend the donk" is testing a near-zero-frequency node)
SB cold-call preflop is only 3% reach (the article's preflop assumption is itself rare at equilibrium — most SB plays a 3-bet-or-fold strategy)
The article was teaching defense at a premise-broken decision tree. Killing the article is correct; the underlying model behavior on donks is fine.
Side note: At smaller donk sizes (1.0bb / b18) the endpoint folds 2-13% on 654/765/T98 — closer to the "little folding" pattern. If the article's published number was based on a smaller donk size than the b33 the reviewer compared against, that explains part of the reviewer's confusion. The article-pipeline issue (surface aggregation different from endpoint) is the same family as the R4 Seidman issue and is being routed to the article team separately.
Test script: tests/pull_r5_bb_donk_defense.py
R7 · EV-display variance for folded hands P1 · B1 known-failure · KI-4 / G-series · model EV noise on rarely-trained buckets
Per-combo raise-EV varies wildly across "similar" folded hands at RFI positions
🃏 Hand: Per-combo EV at preflop RFI · Cash NLHE 6-max 100bb · all 5 RFI positions (focus on folded combos like 26o, 72o, 82o) · Reporter: internal observation (2026-05-11) · Date: 2026-05-11 endpoint sweep
Per-combo raise-action EV varies dramatically across folded hands at RFI positions — standard junk (72o, 82o, etc.) shows sensible negative raise EVs (−0.2 to −2.0 bb), but rarely-played buckets like 26o show absurdly positive raise EVs (e.g. 26o = +6.77 bb at UTG despite folding 100%). Same family as B1 KI-4 (model EV shape concerns) and G-series violations.
2026-05-11 re-run of B1 I-series properties against current production model showed I3 violation rate 28.5% (12,988 / 45,598) and I5 violation rate 66.5%. The EV-shape issue is not just 26o — it's pervasive across hand classes including premium ranges (AA call vs raise gap = 1.27 bb).
Includes: existing B1 property coverage (26 EV-related properties), 2026-05-11 I-series re-run results, sample violations (AA, AKs, 26o), why production B1 didn't catch it, two-track resolution plan (re-run + verify implementations), suggested next steps, open questions for Trung (EV owner) + Nimit (B1 framework).
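As a sketch of the symptom guard (not the B1 DSL — the record shape and thresholds here are assumptions):

```python
# Sketch: flag combos the model folds ~100% that nonetheless show large
# positive raise EV. Record shape is an assumption, not the B1 schema.
FOLD_FREQ_MIN = 0.99     # "pure fold" threshold
RAISE_EV_MAX_BB = 0.0    # a pure-folded combo should not show positive raise EV

def flag_ev_shape_violations(combos):
    """combos: iterable of dicts like
    {"hand": "26o", "fold_freq": 1.0, "raise_ev_bb": 6.77}"""
    return [
        c for c in combos
        if c["fold_freq"] >= FOLD_FREQ_MIN and c["raise_ev_bb"] > RAISE_EV_MAX_BB
    ]

# Example from the finding: 26o folds 100% yet shows +6.77bb raise EV.
sample = [{"hand": "26o", "fold_freq": 1.0, "raise_ev_bb": 6.77},
          {"hand": "72o", "fold_freq": 1.0, "raise_ev_bb": -1.4}]
assert [c["hand"] for c in flag_ev_shape_violations(sample)] == ["26o"]
```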
R9 · Stand-Up depth-flip on Klu T9 at val=10 P2 · B1 known-failure · stack-depth continuity violated · SQ4 property needed
Klu T9 verdict flips call → fold between 100bb and 226bb at val=10 (HCL 5-way Stand-Up)
🃏 Hand: HCL 5-way Stand-Up · Francisco AT · Mariano Q7 · Adi AK · Klu T9 · Peter 22 · 5-way preflop all-in · Jan 3 2025 · Scenario D (Peter+Adi+Klu standing) · val=10 · Surfaced by: HCL 5-way comic article (May 2026) · Source: hcl-five-way-stand-up/deviation-log.md H1 · Date: 2026-05-11
At Stand-Up val=10 / Scenario D, Klu's T9 verdict flips categorically: 100bb = call 35%, 226bb = fold 100% (−65pp swing across just 126bb of depth). NLHE cash control sweep (50-1000bb on UTG RFI) shows total swing of −1.73pp — depth-stable. The Stand-Up flip is ~16× larger in magnitude across ~8× less depth → Stand-Up-specific phenomenon, not a general model-depth issue.
B1 framework gap: property B1 (stack-depth continuity) exists and applies to Stand-Up, but the production runner caps the stack range at 100bb, tests only the UTG open node, and doesn't sweep the val parameter. A new SQ4 property (stack-depth × val × state continuity) is needed to catch this automatically.
Related context (Mariano AA article): at deep Stand-Up Game stack depths (~1260bb in the Mariano AA hand), the model's verdicts flip fold ↔ jam ↔ call across small parameter perturbations and violate val-monotonicity. Root cause is a training-coverage gap; the OOD symptom is consistency violation. The article shipped only 500bb-validated queries as editorial mitigation.
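A minimal sketch of the proposed SQ4 shape — sweep stack depth × val and flag categorical verdict flips between adjacent grid points. The grid values echo this finding; verdict() is a hypothetical accessor for the modal action at a node:

```python
# Sketch of SQ4: stack-depth x val continuity. verdict() is a hypothetical
# accessor returning the modal action ("call" / "fold" / "jam") at a node.
STACKS_BB = [50, 100, 150, 226, 500, 1000]
VALS = [6, 8, 10, 12]

def sq4_flips(verdict) -> list[str]:
    flips = []
    for val in VALS:
        for lo, hi in zip(STACKS_BB, STACKS_BB[1:]):
            a, b = verdict(stack_bb=lo, val=val), verdict(stack_bb=hi, val=val)
            if a != b:  # categorical flip across adjacent depths
                flips.append(f"val={val}: {lo}bb={a} -> {hi}bb={b}")
    return flips
```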
R10 · PLO preflop RFI hygiene (100bb 6-max) — UTG/MP/CO opening thresholds too wide ISSUE · Model
First PLO hygiene pass mirroring the NLHE/MTT pattern. RFI VPIP vs two independent solver-grade references (100bb 6-max GTO PLO):
| Position | Quintace | Upswing (Monker) | PLO Genius (proprietary) | Avg ref | Δ vs avg | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| UTG | 36.8% | 17.9% | 16.8% | 17.35% | +19.45pp | DIVERGE |
| MP (HJ) | 33.8% | 21.8% | 21.5% | 21.65% | +12.15pp | DIVERGE |
| CO | 37.8% | 30.0% | 29.0% | 29.50% | +8.30pp | DIVERGE |
| BTN | 47.3% | ~42.5%* | 48.0% | 45.25% | +2.05pp | OK |
*BTN Upswing value is community consensus (less reliable than the two solver-grade refs); PLO Genius gives 48%. The two solver-grade references corroborate each other within 1.1pp at UTG/MP/CO — the divergence is robust across independent solvers, not an "Upswing happens to be tight" artifact.
Diagnosis — what the model gets right, and where the gap lives:
Limps are negligible (0.02%) — the 36.8% UTG VPIP is essentially all raises, directly comparable to the "opening range" semantics in both Upswing and PLO Genius references.
The +19.45pp UTG divergence is a shifted opening threshold — the model classifies more borderline hands as "open" than either reference solver does. Gap is largest at UTG, shrinks progressively to BTN (where Quintace lands in-range). Pattern fits "open threshold mis-calibrated wider in early positions."
Related (but distinct) findings: a separate monotonicity inversion (UTG > MP) is captured in R3. A separate rake-sensitivity diagnosis is in R16 (resolved — model is rake-flat at low rake, rake-aware at higher rake). R10 = absolute threshold too wide; R3 = positional ordering wrong; R16 = rake sensitivity correctly handled. R10's +18.9pp UTG over-loose still holds at heavy rake (R16 verified: +15.9pp gap at 10%/1cap).
Methodology notes (not findings):
ploStrategy returns 856 canonical PLO4 classes (out of 270,725 specific combos), fixed across every parameter swept. Upswing and PLO Genius references are also class-aggregated, so comparisons are apples-to-apples.
actionLabels[].strategy is NOT exposed for PLO (unlike NLHE/MTT) — per-combo aggregation via Metrics.plo_preflop() is the only path; the audit confirmed it correct. Audit + diagnosis scripts: tests/probe_plo_methodology_audit.py + tests/probe_plo_query_diagnosis.py.
Limp-inclusion confound (checked): our "VPIP" sums limp + raise, while Upswing's reference is raise-only. Model limp by position: UTG 0.0% · MP 1.0% · CO 1.2% · BTN 4.3%. At UTG (where the gap is biggest, +18.9pp), limp = 0 — the gap is pure raise-vs-raise. At MP/CO, excluding limp narrows the gap by ~1pp (CO 7.8→6.6pp, still WARN). At BTN, excluding limp closes most of the gap (4.8→0.5pp) — consistent with BTN's existing OK verdict in the table. Net: limp inclusion does not explain the UTG/MP/CO over-loose finding (see the aggregation sketch after this list).
Raise-size confound (checked): the model effectively opens at ~3.5× only (2.5× sizing used 0.8% of the time per query diagnosis). Upswing's "% of hands that open" is size-agnostic (any raise counts), so comparing open-frequency vs open-frequency is sizing-invariant.
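A sketch of the limp-exclusion arithmetic behind the confound check — the class-record shape is illustrative; the real aggregation path is Metrics.plo_preflop():

```python
# Sketch of the limp-inclusion confound check. Class records are illustrative;
# the real per-class aggregation path is Metrics.plo_preflop().
def vpip_pct(classes, include_limp: bool) -> float:
    """classes: [{"weight": combos_in_class, "limp": p, "raise": p}, ...]"""
    total = sum(c["weight"] for c in classes)
    vpip = sum(
        c["weight"] * (c["raise"] + (c["limp"] if include_limp else 0.0))
        for c in classes
    )
    return 100.0 * vpip / total

# At UTG limp frequency is 0%, so both aggregations coincide and the
# +18.9pp raise-vs-raise gap against Upswing stands unchanged.
```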
📋 Full finding doc: findings/2026-05-11_plo-rfi-hygiene-100bb-6max.md — includes external-reference sourcing, hypothesis decomposition (MODEL vs methodology), parameter-sensitivity audit, triage routing, and next-step list (postflop paired-boards hygiene as the second KVL half; non-monotone num_players response as a sub-investigation).
Triage routing: Primary owner gameplay AI (model layer — Luong-Ha / Yaroslav, cc Brad since this extends R3). Methodology questions for Ha on the 856-class reduction scheme (informational, not load-bearing).
R11 · MTT ICM-block conditioning — payouts shape + alive/entries used; per-seat rank, prize pool, absolute entries ignored
MTT model conditions on TWO ICM-block inputs (payouts schedule shape + alive/entries threshold) and ignores THREE (hero per-seat rank, prize pool, absolute entries). The earlier "single-feature" claim was wrong — the original payouts test wrote to a non-existent field.
Self-correction: the initial R11 said payouts shape was silent. That was wrong — the test in probe_mtt_icm_block_only.py assigned to mtt.payouts (a field the server does not read), so the actual payout schedule never changed across the test. Rerun via probe_mtt_payouts_field_correct.py mutating the correct mtt.payout_structure_limit_top10 field shows the model DOES respond.
What the model actually uses (corrected)
1. Payouts schedule shape — 5.14pp RFI spread across schedules at the same spot. Direction is physically correct (flat payouts → tighter; winner-take-all → wider):
| Payout schedule | RFI | Δ vs standard |
| --- | --- | --- |
| Standard (geometric decay, default) | 11.83% | baseline |
| Flat (every paid place equal) | 8.76% | −3.07pp |
| Top-heavy (70/18/8 + tail) | 13.79% | +1.96pp |
| Winner-take-all | 13.90% | +2.07pp |
2. Alive / entries ratio — threshold step around ~5% of field remaining. Above the threshold, RFI ≈ 13.33% (baseline); below it, RFI snaps to 10.41% regardless of how it got there (16/1000, 9/1000, and 200/100000 all give an identical 10.41%). 3.39pp effect.
3. Absolute entries (ignored, spot-checked) — 0.00pp effect: identical RFI across all 5 values tested when the alive/entries ratio is held fixed (see the probe sketch after this list).
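A minimal sketch of the two probes behind items 2 and 3 — rfi_for(alive, entries) is a hypothetical helper wrapping the MTT endpoint query; the grid values for item 2 are from the finding, those for item 3 are illustrative:

```python
# Sketch of the alive/entries probes. rfi_for(alive, entries) is a
# hypothetical helper that queries the MTT endpoint and returns RFI in %.
def probe_alive_entries(rfi_for):
    # Item 2: configs below the ~5% alive/entries threshold all snap to the
    # same 10.41% RFI regardless of absolute field size.
    below = [rfi_for(16, 1000), rfi_for(9, 1000), rfi_for(200, 100_000)]
    assert len(set(below)) == 1, f"below-threshold RFI should be identical: {below}"

    # Item 3: with the ratio held fixed (here 10%), absolute entries is a no-op.
    fixed_ratio = [rfi_for(n, n * 10) for n in (50, 100, 500, 1000, 5000)]
    assert len(set(fixed_ratio)) == 1, f"ratio-only dependence expected: {fixed_ratio}"
```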
Implication for hygiene checks
External ICM solvers (ICMIZER, HRC) condition on per-seat ranks, payout slope, AND bounty dollar amounts. Our model captures payout slope and the alive-threshold step, but NOT per-seat ranks or prize-pool magnitude. Hygiene checks against ICM references need to match the payout schedule precisely (the dominant ICM-block input) and accept that rank-conditioned reference differences won't be reproduced.
Triage routing: Trung (model EV / ICM-feature ownership). Two distinct questions to decide:
(a) Is per-seat rank intentionally ignored? The model has no chip-position awareness — chipleader vs shortstack at the same overall stack distribution returns identical strategy. If intentional (universal model treats ICM only through aggregate / payouts), document it. If not, add rank conditioning.
(b) Is prize_pool intentionally ignored? $1k vs $10M tournaments produce bit-identical output. This is consistent with chip-EV scaling, but reads odd to coaches who expect higher-stakes situations to influence play.
How-to: shared/tools/USING_MTT_ENDPOINT.md updated to reflect the corrected behavior (payouts shape now flagged as USED, with the gotcha that the schedule must be written to mtt.payout_structure_limit_top10 — not a sibling payouts field).
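The gotcha in code form — only the two mtt field names come from the finding; the surrounding payload shape is an assumption:

```python
# The corrected payouts gotcha from R11. Only the two field names below come
# from the finding; the surrounding payload shape is an assumption.
payload = {"mtt": {}}

# WRONG: the server never reads this field, so the schedule silently stays
# at the default and the model output never changes.
payload["mtt"]["payouts"] = [0.70, 0.18, 0.08]

# RIGHT: the model conditions on the schedule written here.
payload["mtt"]["payout_structure_limit_top10"] = [0.70, 0.18, 0.08]
```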
R12 · Mystery bounty silently equivalent to normal across all configs P2 · ISSUE · Model · bounty_type=mystery_bounty no-op
11 mystery_bounty configurations spanning mysterious_prize_left $0 → $10M, head_bounties $0 → $10K, bounty_proportion 0.1 → 0.9 all return bit-identical output to bounty_type=normal. Model does not condition on any mystery-bounty field.
Reference behavior — PKO and knockout bounty types correctly tighten RFI by 6–10pp (incentive to hunt elimination):
| Bounty type | RFI | Δ vs normal |
| --- | --- | --- |
| normal | 11.83% | 0 |
| PKO uniform bounty=50 | 5.70% | −6.13pp |
| Flat KO bounty (initial=50) | 4.43% | −7.40pp |
Mystery-bounty sweep — 11 configurations, every one returns RFI=11.83% (bit-identical to normal):
| Config | mysterious_prize_left | head_bounties | bounty_proportion | RFI | Δ vs normal |
| --- | --- | --- | --- | --- | --- |
| no extras (mpl=None) | — | — | 0.5 | 11.83% | 0.00pp |
| orig R12 test | 2,500 | [50]×8 | 0.5 | 11.83% | 0.00pp |
| zero everything | 0 | [0]×8 | 0.5 | 11.83% | 0.00pp |
| small | 100 | [10]×8 | 0.5 | 11.83% | 0.00pp |
| mid | 10,000 | [100]×8 | 0.5 | 11.83% | 0.00pp |
| large | 100,000 | [500]×8 | 0.5 | 11.83% | 0.00pp |
| huge | 1,000,000 | [5,000]×8 | 0.5 | 11.83% | 0.00pp |
| varied head_bounties | 10,000 | [500,300,200,100,50,50,30,20] | 0.5 | 11.83% | 0.00pp |
| low proportion | 10,000 | [100]×8 | 0.1 | 11.83% | 0.00pp |
| high proportion | 10,000 | [100]×8 | 0.9 | 11.83% | 0.00pp |
| bounty pool >> prize pool | 10,000,000 | [10,000]×8 | 0.5 | 11.83% | 0.00pp |
Verdict: spread = 0.00pp across 11 configurations including extremes. All payload JSON hashes differ (so the JSON payloads truly vary on the wire), but the model produces bit-identical output. bounty_type=mystery_bounty is a no-op — the model treats it identically to normal, with the auxiliary fields ignored regardless of value.
Triage routing: Trung / Gameplay AI. Either (a) flag bounty_type=mystery_bounty as unsupported at the endpoint layer (return 400 or warning), or (b) add mystery-bounty conditioning to the universal model. Option (a) is the minimum so downstream consumers (CAI, AceCoach) don't silently get normal-MTT advice for mystery tournaments.
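The detection pattern behind the verdict, as a sketch — query() is a hypothetical helper returning raw response bytes:

```python
# Sketch of the no-op check: payloads must differ on the wire while model
# output stays bit-identical. query(payload) is a hypothetical helper
# returning the raw response bytes.
import hashlib
import json

def is_noop(base_payload: dict, variant_payloads: list[dict], query) -> bool:
    all_payloads = [base_payload, *variant_payloads]
    payload_hashes = {
        hashlib.sha256(json.dumps(p, sort_keys=True).encode()).hexdigest()
        for p in all_payloads
    }
    assert len(payload_hashes) > 1, "variants must actually differ on the wire"
    output_hashes = {hashlib.sha256(query(p)).hexdigest() for p in all_payloads}
    return len(output_hashes) == 1  # one output hash across all configs => no-op
```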
R13 · mtt.players[].stack is dead input P3 · ISSUE · Serving · payload schema redundancy
Setting mtt.players[].stack to arbitrary values has zero effect on output — only hand.players[].stack is read by the model. Verified across 9 configurations with hand and mtt stack blocks varied independently.
Verification sweep — 9 configurations, hand and mtt stack blocks varied independently:
| hand.players[].stack | mtt.players[].stack | RFI | Δ vs baseline |
| --- | --- | --- | --- |
| REAL_SPREAD (baseline) | REAL_SPREAD | 17.39% | 0.00pp |
| REAL_SPREAD | all 1bb (extreme) | 17.39% | 0.00pp |
| REAL_SPREAD | all 1000bb (extreme) | 17.39% | 0.00pp |
| REAL_SPREAD | VERY VARIED (200/5/10/100/80/60/25/8 bb) | 17.39% | 0.00pp |
| REAL_SPREAD | REAL_SPREAD seat-swapped | 17.39% | 0.00pp |
| UNIFORM 30bb | REAL_SPREAD | 16.03% | −1.36pp |
| UNIFORM 30bb | VERY VARIED | 16.03% | −1.36pp |
| hero@5bb / others 50bb | hero@10bb / others 50bb | 31.83% | +14.44pp |
| hero@100bb / others 50bb | hero@10bb / others 50bb | 16.12% | −1.27pp |
Pattern: when hand.players[].stack is held constant, RFI is bit-identical regardless of what mtt.players[].stack contains. When hand.players[].stack changes, RFI changes dramatically. Within each "hand=X" group: spread = 0.00pp. mtt.players[].stack is silently ignored.
Classification: unlike R11 and R12 (model-behavior issues), R13 is a payload-schema redundancy. The model itself is not wrong — the payload format has two redundant stack fields where only one is actually consumed.
Triage routing: Service team (Scott / Adil). Either drop mtt.players[].stack from the payload schema (it's misleading — clients can spend time syncing both fields thinking both matter) or make it the authoritative source so the client lib can simplify (one stack field per seat instead of two). Affects strategy_grid_client.MttPreflop._build_mtt_hand(), which currently fills both blocks from the same source.
Documentation: captured as a gotcha in shared/tools/USING_MTT_ENDPOINT.md so future agents know which field actually matters.
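A minimal sketch of the dead-input check behind the table above — build_payload() and rfi_for() are hypothetical stand-ins for the client-lib plumbing:

```python
# Sketch of the dead-input check behind the R13 table. build_payload() and
# rfi_for() are hypothetical stand-ins for the client-lib plumbing around
# strategy_grid_client.MttPreflop._build_mtt_hand().
def check_mtt_stack_is_dead(build_payload, rfi_for):
    hand_stacks = [200, 5, 10, 100, 80, 60, 25, 8]   # held constant throughout
    baseline = rfi_for(build_payload(hand_stacks=hand_stacks,
                                     mtt_stacks=hand_stacks))
    for mtt_stacks in ([1] * 8, [1000] * 8, list(reversed(hand_stacks))):
        rfi = rfi_for(build_payload(hand_stacks=hand_stacks,
                                    mtt_stacks=mtt_stacks))
        # Current behavior: bit-identical output, i.e. the mtt block is ignored.
        assert rfi == baseline, f"mtt.players[].stack affected output: {rfi}"
```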
R20 · Cash universal model has near-flat BB-defense curve across stack depths P1 · ISSUE · Model · cash · depth-insensitive defense
Cash universal model defends BB at ~80–83% VPIP across all stack depths from 20bb to 100bb (3.3pp range) — qualitative theory expects depth-sensitivity (deeper → wider defense for more postflop equity realization). MTT model on the same spot shows the expected curve (64.8% → 85.8% across the same depth range, 21pp range).
🃏 Scope: NLHE cash · BB defends vs BTN 2.0x open · 8-max · symmetric stacks · ante=12 chips (0.12bb) · 0 rake · 7 stack depths from 20bb to 100bb · Surfaced by: Solver-QA agent · BB-defense depth curve probe (originally a side-finding while triangulating R14) · Test script: tests/probe_bb_defense_depth_curve.py · output probe_bb_defense_depth_curve.out.json · Server endpoint: https://preview.rlserv.aceguardianrl.com/api/strategy_grid · Cash model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx · MTT model (for comparison): universal-dense-v4-player_20260402_150328.onnx · Date: 2026-05-11
VPIP curve — BB defends vs BTN 2.0x by stack depth, both models on identical payloads (only stack varies):
| Stack | MTT model VPIP | Cash model VPIP | MTT − Cash |
| --- | --- | --- | --- |
| 20bb | 64.76% | 80.77% | −16.01pp |
| 25bb | 66.85% | 79.70% | −12.85pp |
| 30bb | 70.20% | 79.71% | −9.51pp |
| 40bb | 78.17% | 79.94% | −1.77pp |
| 50bb | 81.16% | 80.82% | +0.34pp |
| 75bb | 84.73% | 82.36% | +2.37pp |
| 100bb | 85.82% | 82.98% | +2.84pp |
| Spread | 21.06pp | 3.28pp | |
3.28pp
Observations:
Cash model is essentially depth-blind in this spot. 3.28pp VPIP range from 20bb to 100bb (a 5× change in stack depth). Theory expects substantial widening with depth — postflop playability is the main argument for defending more hands at 100bb than 20bb.
MTT model produces the expected shape. 21.06pp VPIP range across the same depths, monotone increasing — short stacks defend tight, deep stacks defend wide. Direction and magnitude both look right.
The two models cross around 40–50bb. At 30bb the cash model defends 9.5pp wider than MTT; at 100bb the cash model defends 2.8pp narrower. Two different shapes, not just a level shift.
Single-spot caveat: this curve is from one spot (BB vs BTN 2.0x). Before drawing strong conclusions about the cash universal model, this should be reproduced at other defense spots (BB vs CO, BB vs HJ, SB vs BTN) and at different open sizes. The pattern may be specific to BB-vs-BTN or generalize across cash defense; only further testing tells us which.
Triage routing: Trung / Gameplay AI. Two questions to answer:
(a) Does the depth-insensitivity hold at other cash defense spots? Run the same depth sweep for BB vs CO, BB vs HJ, SB vs BTN (sketch after this list). If the pattern is uniform across defense spots, that's strong evidence of a cash-model calibration gap on stack-depth conditioning.
(b) If real, is depth-insensitivity intentional or a defect? Cash universal may be trained primarily at 100bb and use an architecture that doesn't explicitly condition on depth — in which case the model's 80% at 20bb is the 100bb-equilibrium response, unaware of the short-stack regime. That would be a known training-distribution limit; the fix is either retraining with mixed depths or surfacing the depth limitation downstream.
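A sketch of the question-(a) sweep — defend_vpip() is a hypothetical query helper, and the CO/HJ open sizes are placeholders, not solver-standard numbers:

```python
# Sketch of the reproduction sweep for question (a). defend_vpip() is a
# hypothetical query helper; the CO/HJ open sizes are placeholders.
DEPTHS_BB = [20, 25, 30, 40, 50, 75, 100]
SPOTS = [("BB", "BTN", 2.0), ("BB", "CO", 2.3), ("BB", "HJ", 2.3), ("SB", "BTN", 2.0)]

def depth_spreads_pp(defend_vpip):
    spreads = {}
    for hero, opener, open_x in SPOTS:
        curve = [defend_vpip(hero, opener, open_x, depth) for depth in DEPTHS_BB]
        # ~3pp spread reproduces the depth-blind cash pattern; ~20pp matches MTT.
        spreads[(hero, opener)] = max(curve) - min(curve)
    return spreads
```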
Related: independent of retracted R14. R14's "MTT model under-defends" framing was wrong — the depth curve shows the MTT model has a sensible curve and the gap was against an unverified reference. R20 is the opposite: the cash model's curve is the suspect one.
R15 · Chip-EV RFI divergence vs GTOw — multi-faceted gap pattern across MTT + Cash models P1 · ISSUE · Model · structural · 5 spots × 2 models vs verified GTOw references
Group finding covering 5 chip-EV RFI spots compared against directly-verified GTOw blog references. Both models tend to under-open at deep/mid stacks; MTT model gap widens at HJ 30bb; cash model flips to over-open at 5bb push-fold range; MTT model uses a sharp push-fold transition between 5bb and 17bb. Sub-findings are grouped because they share methodology and source references.
🃏 Scope: 8-max chip-EV RFI · 4 UTG stacks (5/17/50/100 bb) + 1 HJ 30bb spot · ante=12 chips (0.12bb) · 0 rake · symmetric stacks · alive/entries=95% (well above R11's bubble threshold → chip-EV regime on the model side) · References: GTOw blog articles, verified directly in this scrutiny pass
(a) Both models under-open at deep + mid UTG stacks. Gap is 1.5–3.2pp for MTT model, 1.0–2.1pp for cash model at UTG 100/50/17 bb. Direction is consistent (both tighter than GTOw); magnitude is near solver-noise tolerance but the consistent same-direction bias across 3 stack depths suggests a real (small) calibration drift.
(b) MTT model gap widens dramatically at HJ 30bb. One non-UTG data point shows MTT model 7.1pp tighter than GTOw vs the ~2pp UTG gap. Suggests position-conditioned bias (wider gap at later positions). Caveat: single non-UTG spot; need MP, CO, BTN coverage to confirm the pattern.
(c) Cash model FLIPS sign at 5bb (over-opens by 4.66pp). Cash universal is 1–2.7pp tight at 100 / 50 / 17 / 30 bb, then +4.66pp WIDER at 5bb. Cash model uses 21.86% all-in at 5bb vs GTOw's ~20% total RFI (where almost all should be jams). The cash model isn't push-fold trained — once min-raise becomes structurally infeasible, it gambles all-in too liberally.
(d) MTT model has a sharp push-fold transition between 5bb and 17bb. At 100/50/17 bb, MTT model all-in usage is 0.000–0.004% (essentially never jams). At 5bb, all-in is 17.60% (essentially the entire RFI). GTOw transitions smoothly across this range; the MTT model has a binary switch. Caveat: no data at intermediate stacks (8–14 bb) — exact transition shape unknown.
(e) MTT model uses a different sizing tree at HJ 30bb. Both models have the same available sizes (2.0 / 2.5 / 4.5 / 6.0 / 9.5x). Cash model uses 2.0 + 2.5x as modal opens (10.97% + 15.66%). MTT model uses 4.5x as the single modal open (21.83%) with near-zero weight on 2.0x (0.003%). Same spot, very different sizing-tree resolutions.
Methodology
All 5 audits use identical parameter alignment to the GTOw reference: 8-max NLHE, symmetric stacks, ante=12 chips (0.12bb per player), 0 rake, chip-EV regime. Single known parameter mismatch: 0.5-chip ante rounding (12 vs 12.5 chips) — per ante sensitivity probe, accounts for ≤ 0.25pp of any gap. Reference numbers were re-verified by direct WebFetch of the primary GTOw blog posts (not relying on secondary citations). Side-by-side script + JSON outputs are at external-solver-benchmark/tests/single_spot_audit_*.{py,out.json}; each script prints the full payload for spot-check.
Open scope (what would strengthen this finding)
Fill in UTG curve at 30bb, 20bb, 14bb, 2bb (all have GTOw published numbers) to validate the curve shape.
Add MP, CO, BTN at 30bb chip-EV to confirm (or refute) position-conditioning of the MTT model gap.
Add 8bb, 10bb, 12bb, 14bb data points to characterize the MTT model's push-fold transition shape.
Verify the cash-model 5bb over-jam holds across positions (not just UTG).
Triage routing: Trung (model EV / chip-EV calibration). The data is parameter-aligned and the references are now directly verified from the source GTOw articles. Four distinct concerns grouped here:
(a) Both universal models under-open at deep/mid stacks vs GTOw chip-EV — gap is small (1–3pp) but consistent direction across UTG 100/50/17/30 bb. The MTT model has an additional 1pp drift on top of the cash model's shared bias.
(b) MTT model gap widens at wider opening positions (HJ 30bb shows 7pp gap) — single data point; needs more position coverage to confirm position-conditioning.
(c) Cash model over-jams at 5bb (push-fold zone) — separate calibration concern at short stacks. Cash model was likely trained primarily at 100bb and extrapolates poorly to push-fold range.
(d) MTT model has binary push-fold switch and uses unusual sizing-tree choices — affects coaching surfaces that display sizing decisions.
R16 · PLO preflop rake-flat at low rake, rake-aware at higher rake — Resolved · correct model behavior · SB notably more rake-sensitive than other positions
Original "rake invariant" claim was a sampling artifact; preflop DOES respond at 5%+ rake; postflop also responds
Initial observation (preflop, 2 rake values): PLO preflop RFI output was identical across 0% rake and 3% / 3 BB cap — hypothesized "rake silently ignored."
Postflop follow-up (BTN c-bet, 3 boards × 5 rake settings): postflop DOES respond to rake (2.10–4.80pp range per board). So endpoint consumes rake fields; field-strip hypothesis refuted.
Per-position preflop sweep (5 positions × 5 rake settings): preflop ALSO responds when swept at heavier rake values. Original "invariant" claim was a sampling artifact — 0% and 3%/3 BB cap happen to produce identical output (3 BB cap rarely binds on small preflop pots, making both effectively rake-free for preflop EV).
Full picture — preflop VPIP by position × rake:
| Position | 0% | 3%/3cap | 5%/3cap | 5%/1cap | 10%/1cap | Range |
| --- | --- | --- | --- | --- | --- | --- |
| UTG | 36.80% | 36.80% | 35.90% | 36.40% | 33.80% | 3.00pp |
| MP | 33.80% | 33.80% | 32.70% | 33.20% | 30.60% | 3.20pp |
| CO | 37.80% | 37.80% | 36.40% | 37.00% | 33.80% | 4.00pp |
| BTN | 47.30% | 47.30% | 45.70% | 46.40% | 43.70% | 3.60pp |
| SB | 46.00% | 46.00% | 42.30% | 43.60% | 38.50% | 7.50pp |
Three takeaways:
Low-rake equivalence is correct. 0% and 3%/3 BB cap produce identical output across all 5 positions — both are functionally "no effective rake" at small preflop pot sizes. The model correctly treats them as equivalent.
Higher-rake response is real and directional. At 5%+ rake, the model tightens preflop ranges 3–4pp across non-SB positions; SB tightens 7.5pp. Direction is correct (heavier rake → tighter range).
SB rake-sensitivity is notable. SB shows 2× the rake response of other positions. Consistent with GTO intuition — SB has positional disadvantage, so every BB of rake hurts SB EV more than other positions; tightening opens compensates.
Postflop response (sub-data): BTN c-bet across 3 boards × 5 rake settings — bet% range 2.10pp (Ts9s8s) / 3.80pp (KsKd2c) / 4.80pp (Ks7d2c). Postflop also responds, as expected.
Resolution: the original "rake silently ignored" framing was wrong. The model DOES respond to rake at both preflop and postflop; the initial observation was a sampling artifact (0% and 3%/3cap both fall in the rake-flat zone). With a finer rake sweep, response is real, directional, and matches GTO expectations.
Open sub-questions: SB shows 2× the rake response of other positions — could be a real positional effect, could be model noise; worth comparing against an external SB-rake-sweep reference if one exists. Postflop response on KsKd2c (paired board) is non-monotone — 10%/1cap shows MORE betting than 5%/1cap; possibly model noise at heavy rake, possibly a real rake × texture interaction.
Implication for R10: R10's +18.9pp UTG over-loose finding was measured at 0% and 3%/3cap (both in the rake-flat zone). At 10%/1cap, UTG opens at 33.8% — still well above Upswing's 17.9% reference (+15.9pp gap). The R10 over-loose calibration holds even at heavy rake; doesn't disappear under realistic rake structures.
R17 · KVL ea5 — NLHE BTN open average sizing claim is stale — Resolved · current endpoint matches Pio/GTOW reference
KVL register entry claimed avg_raise_bb=4.8 at BTN; current endpoint produces 2.81 — within Pio/GTOW 2.0–3.0 expected range
The KVL register entry claimed BTN open avg_raise_bb = 4.8 BB, well above Pio/GTOW's expected 2.0–3.0 BB. The 4.8 figure was logged previously against an older model; the current endpoint output is sharply different.
| Rake setting | Fold | Call | Raise | avg_raise_bb | Verdict |
| --- | --- | --- | --- | --- | --- |
| 0% rake | 59.35% | 0.33% | 40.32% | 2.81 | OK |
| 3% / 3 BB cap | 59.35% | 0.33% | 40.32% | 2.81 | OK |
Sizing breakdown (when raising): 71.5% at 2.5× · 26.9% at 3.5× · 1.7% at 5.0×. Weighted average = 2.81 BB, right in the middle of the Pio/GTOW 2.0–3.0 expected range.
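The weighted-average arithmetic checks out; a one-line sanity check:

```python
# Sanity check on the avg_raise_bb arithmetic from the sizing breakdown.
sizes = {2.5: 0.715, 3.5: 0.269, 5.0: 0.017}
avg = sum(s * w for s, w in sizes.items()) / sum(sizes.values())
print(round(avg, 2))  # -> 2.81
```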
Resolution: the KVL register entry is stale. The current endpoint produces BTN avg_raise_bb = 2.81, within Pio/GTOW expected range. The original 4.8 figure was from an older model snapshot that has since been retrained. No action needed at the model layer. Removing from Tier 3 KVL register.
Note: NLHE preflop opens are also confirmed rake-invariant at low rake (0% and 3%/3cap produce identical output) — the same pattern as PLO preflop at low rake values (R16). Behavior is consistent across game types.
R18 · KVL q-cash-bb-btn-3bet — BB 3-bet vs BTN matches GTOw within tolerance — Resolved · model matches reference
BB 3-bet frequency vs BTN 2.5x / 3.0x open at 100bb 6-max no-ante is within 0.32pp of GTOw published references
Three-bet sizing is also sensible: vs 2.5x, BB mixes 11 BB (8.87% prob) and 16.5 BB (4.75% prob); vs 3.0x, BB shifts to 13 BB primary (9.99% prob) with a tiny 19.5 BB tail. Sizes scale appropriately with the open size.
Size-sensitivity sanity check: BB tightens 2.56pp from 13.82% vs 2.5x to 11.26% vs 3.0x — correct GTO direction (defense tightens against larger opens).
Resolution: the KVL register entry was either based on an older model snapshot or on a different sizing assumption. The current endpoint produces BB 3-bet rates within 0.32pp of GTOw at the standard 2.5x and 3.0x BTN open sizes, with correct directional response to sizing. No model action needed. Removing from Tier 3 KVL register.
Note: rake is again invariant at low values (0% and 3%/3cap produce identical output) — consistent with the NLHE preflop pattern in R17 and PLO preflop pattern in R16.
R19 · PLO paired-board c-bet observations — data captured, awaiting external reference P2 · Open · cannot classify without external solver reference
BTN c-bet frequencies on 7 paired flops captured; classification deferred until paid PLO solver reference is available
BTN c-bet frequencies captured across 7 paired flop classes — kept here as raw observations awaiting an external PLO solver reference to validate against:
| Board | Texture | BTN bet% | Check% | Avg bet (BB) |
| --- | --- | --- | --- | --- |
| AsAd7h | high pair · mid kicker · rainbow | 60.70% | 39.30% | 2.00 |
| KsKd7c | high pair · mid kicker · rainbow | 27.10% | 72.90% | 2.00 |
| TsTd5c | mid pair · mid kicker · rainbow | 24.30% | 75.70% | 2.00 |
| 8s8d3h | mid pair · low kicker · rainbow | 57.00% | 43.00% | 2.00 |
| 9d9s2c | mid-low pair · low kicker · rainbow | 53.30% | 46.70% | 2.00 |
| 4s4d2c | low pair · low kicker · rainbow | 48.20% | 51.80% | 2.20 |
| Ks7s7d | paired · flush draw | 70.30% | 29.70% | 2.20 |
Why this is NOT classified as an Issue:
Each paired board has a genuinely different texture (different ranks, different kickers, different equity distributions, different BB cap dynamics, different reach configurations on both sides). The earlier "structurally equivalent" framing in my first draft was wrong — solver outputs for AsAd7h vs KsKd7c can legitimately differ substantially. PLO solver outputs are known to vary 20–40pp across boards that look similar at the rank level.
Without a paid PLO solver reference (Monker, Pio PLO, PLO Genius, PLO Mastermind, Vision GTO Trainer — all subscription) producing comparison numbers for these exact boards, there's no external anchor to call any single number "right" or "wrong."
Heuristics like "high pair should c-bet more than low pair" are domain knowledge, not solver-validated claims. PLO postflop GTO frequently inverts NLHE intuition.
Status: data captured; classification deferred. To upgrade R19 to Issue or Resolved, need an external PLO solver to produce comparison numbers for at least 3 of these 7 boards. Options:
Subscribe to PLO Ninja / RangeConverter / PLO Mastermind ($19–$249) for solver-grade postflop ranges
Coach reference — Brad Wilson or another PLO pro with paid solver access screenshots a few of these specific boards
Run our own internal full-CFR on these spots if PLO CFR engine access is available
Implication for KVL: the q-plo-solver-crosscheck-paired-boards KVL item is NOT fully closed — preflop half captured as R10 with confirmed Issue status (had external refs), postflop half captured here as R19 in Open status pending external refs.
Adding a new reviewer finding: drop a markdown file in engineering-department/gameplay-ai/projects/external-solver-benchmark/findings/{date}_{slug}.md, register it in INDEX-existing-data.md under Tier 1.x (reviewer-named docs) or Tier 3 (article-QA flags), then add the entry here. Include reviewer name (or "unattributed"), source link (Google Doc / article path), date, headline claim, and triage verdict.