
Reviewer Findings

Catalog of every divergence flagged by an external reviewer (coach, pro, theorist) who compared Quintace strategy outputs against public solvers or theory baselines.

This is the reviewer-sourced subset of the full Solver QA catalog. Every entry here started as a reviewer comment (Google Doc, article QA pass, or in-person review), not a bench run. For the full benchmark catalog, including quantitative dataset comparisons, see the Solver QA index.

📋 Reviewer-side consolidated tracker

"QuintAce — QA findings" Google Doc — reviewer-owned master tracker. This page is downstream; reviewer notes land in the Google Doc first, then surface here as R-findings.

Two tabs in the doc:

Summary table

| Priority | Date | Reviewer | Spot / Article | Status |
|---|---|---|---|---|
| P1 | 2026-05-06 | Timothy Ulmer | R1 · UTG vs BB, Cash NLHE 6-max 100bb (4 boards × 3 nodes) | Open for discussion |
| | 2026-05-06 | Timothy Ulmer | R2 · Preflop RFI 100bb 6-max (5 positions) | Resolved at model layer |
| P1 | 2026-05-06 | Brad Wilson | R3 · PLO4 UTG/MP VPIP monotonicity inversion | B1 known-failure |
| | 2026-05-06 | GTO lab | R4 · seidman-easy-game-reexamined — K72r c-bet | Resolved at endpoint |
| | 2026-05-06 | GTO lab | R5 · adjusting-to-donk-bets-in-srp — fold-to-donk | Model OK · article kill editorial |
| P1 | 2026-05-11 | Internal observation | R7 · EV-display variance for folded hands at RFI (model EV noise) | B1 known-failure |
| P2 | 2026-05-11 | Mariano AA comic article | R8 · Stand-Up Game deep-stack OOD (1260bb verdict instability) | B1 known-failure |
| P2 | 2026-05-11 | HCL 5-way comic article | R9 · Stand-Up depth-flip on Klu T9 at val=10 (100bb call → 226bb fold) | B1 known-failure · SQ4 needed |
| P1 | 2026-05-11 | PLO hygiene track | R10 · PLO4 6-max RFI over-loose by 8–19pp vs two solver refs (UTG/MP/CO) | Issue · open-threshold calibration |
| | 2026-05-11 | PLO hygiene track | R16 · PLO preflop rake-flat at low rake, rake-aware at higher rake — correct model behavior | Resolved · correct behavior |
| | 2026-05-11 | KVL absorption (ea5) | R17 · KVL ea5 — NLHE BTN open avg-sizing claim is stale (current 2.81 BB, within Pio/GTOW 2.0–3.0) | Resolved · stale register entry |
| | 2026-05-11 | KVL absorption (q-cash-bb-btn-3bet) | R18 · KVL q-cash-bb-btn-3bet — BB 3-bet vs BTN matches GTOw within 0.32pp | Resolved · model matches reference |
| P2 | 2026-05-11 | PLO hygiene track | R19 · PLO paired-board c-bet observations — needs external solver reference to classify | Open · data captured, awaiting external ref |
| P2 | 2026-05-11 | MTT hygiene track | R11 · MTT ICM block: uses payouts shape + alive/entries threshold; ignores rank, pool, entries-alone | Issue (Model) · partial ICM coverage |
| P2 | 2026-05-11 | MTT hygiene track | R12 · Mystery bounty silently equivalent to normal across all configs | Issue (Model) · bounty_type=mystery_bounty no-op |
| P3 | 2026-05-11 | MTT hygiene track | R13 · mtt.players[].stack is dead input — only hand.players[].stack is read | Issue (Serving) · payload schema redundancy |
| P1 | 2026-05-11 | Cash hygiene track | R20 · Cash universal model has near-flat BB-defense curve across stack depths (20bb → 100bb) | Issue (Model) · cash · depth-insensitive defense |
| P1 | 2026-05-11 | MTT + Cash hygiene track | R15 · Chip-EV RFI vs GTOw: both models under-open at deep/mid stacks; MTT model has sharp push-fold transition; cash model over-opens at 5bb | Issue (Model) · multi-faceted chip-EV gap pattern |

Findings — detail

R1 · Timothy Ulmer · UTG-vs-BB Latest results vs reviewer's GTOw reference — open for discussion
Cash NLHE 6-max 100bb, 2-blind, no ante — 4 boards (AK3ss / KQ6hh / 456ss / JT8ss)
🃏 Hand: UTG vs BB · Cash NLHE 6-max 100bb 2-blind no ante · 4 boards (AK3ss / KQ6hh / 456ss / JT8ss) Reviewer: Timothy Ulmer (MTT pro) Date: 2026-05-06 · endpoint sweep 2026-05-11

Below: the latest Quintace endpoint results on each spot, alongside the reference number the reviewer used (from GTO Wizard or another external solver). Some spots agree and some differ. The differences are open for discussion — we haven't pre-assigned whether the external solver or Quintace is correct.

| Board | Spot | Quintace (latest endpoint) | Reviewer reference (GTOw / external) | Difference |
|---|---|---|---|---|
| AK3ss | UTG c-bet freq | 66.4% | 88-89% | −22pp |
| AK3ss | UTG fold vs BB xr | 28.6% | 27% | +1.6pp |
| KQ6hh | UTG c-bet freq | 72.8% | "similar" (no exact) | |
| KQ6hh | BB fold vs 2bb cbet | 41.3% | 54% | −13pp |
| KQ6hh | UTG fold vs BB xr | 26.1% | 47% | −21pp |
| 456ss | UTG c-bet freq | 51.2% | (none published) | |
| 456ss | BB fold vs 2bb cbet | 33.5% | 30% | +3.5pp |
| 456ss | UTG fold vs BB xr | 36.1% | "similar" (no exact) | |
| JT8ss | UTG c-bet freq | 54.6% | (none published) | |
| JT8ss | BB fold vs 2bb cbet | 39.0% | 38% | +1.0pp |
| JT8ss | UTG fold vs BB xr | 28.6% | 38% | −9.4pp |
Spots that closely agree (≤ ±4pp): AK3ss UTG vs xr, KQ6hh UTG c-bet (qualitative), 456ss BB defense, 456ss UTG vs xr (qualitative), JT8ss BB defense.

Spots where Quintace and the reviewer's reference materially differ — open for discussion:
  • AK3ss UTG c-bet — Quintace 66% vs GTOw 88-89%. Is GTOw's high c-bet rate right on AK3ss, or is Quintace's more selective c-bet defensible?
  • KQ6hh BB defense vs 2bb c-bet — Quintace 41% fold vs GTOw 54%. Is GTOw over-folding because of poor OOP equity-realization assumption, or is Quintace under-folding?
  • KQ6hh UTG vs BB xr — Quintace 26% fold vs GTOw 47%. Is GTOw too willing to give up here, or is Quintace too sticky?
  • JT8ss UTG vs BB xr — Quintace 29% fold vs GTOw 38%. Smaller gap, same question.
Open question: these are differences, not verdicts. For each, the discussion is whether the published external-solver number reflects a well-converged equilibrium on a comparable tree, or whether Quintace's number is the better answer (e.g., universal-model with no preflop-tree abstraction may legitimately produce a different postflop equilibrium). Next step: review with gameplay AI + coach team to decide which differences are worth treating as model issues vs which are legitimate strategic differences.

Methodology: the endpoint sweep was run via the V2 strategy_grid endpoint (RVV2-exact payload), model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx, with BB defense queried at a 2.0bb c-bet and the UTG response queried at a BB x/r to 6.0bb. Test script: tests/pull_r1_all_5_boards_full.py.
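The agree/diverge split used above can be reproduced mechanically from the table. A minimal sketch, assuming results are already pulled into a plain list (the ≤ ±4pp tolerance mirrors the "closely agree" cut-off above; the helper and its data layout are illustrative, not part of the test script, and rows without an exact published reference are represented by None):

```python
# Classify Quintace-vs-reference gaps for an R1-style sweep.
# Rows with no exact published reference (None) are skipped, matching
# the "qualitative" rows in the table above.

def classify_gaps(rows, agree_tol_pp=4.0):
    """rows: list of (spot, quintace_pct, reference_pct_or_None).
    Returns {"agree": [(spot, delta_pp)], "diverge": [(spot, delta_pp)]}."""
    out = {"agree": [], "diverge": []}
    for spot, ours, ref in rows:
        if ref is None:               # "similar" / "(none published)" rows
            continue
        delta = round(ours - ref, 2)  # positive = Quintace higher
        bucket = "agree" if abs(delta) <= agree_tol_pp else "diverge"
        out[bucket].append((spot, delta))
    return out

r1 = [
    ("AK3ss UTG c-bet",       66.4, 88.5),   # reviewer gave 88-89%; midpoint
    ("AK3ss UTG fold vs xr",  28.6, 27.0),
    ("KQ6hh BB fold vs cbet", 41.3, 54.0),
    ("KQ6hh UTG fold vs xr",  26.1, 47.0),
    ("456ss BB fold vs cbet", 33.5, 30.0),
    ("JT8ss BB fold vs cbet", 39.0, 38.0),
    ("JT8ss UTG fold vs xr",  28.6, 38.0),
    ("KQ6hh UTG c-bet",       72.8, None),   # qualitative only
]
result = classify_gaps(r1)
```

This reproduces the split in the prose: three quantitative agreements and four material divergences.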
R2 · Timothy Ulmer · Preflop RFI Resolved at model layer (2026-05-11)
Preflop RFI Ranges 100bb 2-blind no ante 6-max — Quintace vs GTOW vs Jonathan Little
🃏 Hand: Preflop RFI · Cash NLHE 6-max 100bb 2-blind no ante · LJ / HJ / CO / BTN / SB Reviewer: Timothy Ulmer Date: 2026-05-06 · resolved 2026-05-11
| Position | Endpoint (2026-05-11) | GTO Wizard | Endpoint vs GTOw | Reviewer's surface |
|---|---|---|---|---|
| LJ (UTG) | 17.85% | 17.5% | +0.35pp | 25% |
| HJ (MP) | 21.02% | 21.7% | −0.68pp | 29% |
| CO | 27.05% | 27.9% | −0.85pp | 36% |
| BTN | 39.78% | 40.6% | −0.82pp | 48% |
| SB raise | 35.00% | 34.4% | +0.60pp | 42% |
2026-05-11 update — endpoint verification. Direct query against the canonical V2 strategy_grid endpoint (RVV2-style payload, model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx) shows the model output matches GTOw to within ~1pp at every position. The "Quintace opens 5-10pp wider" claim was based on the reviewer's display surface, not the model.

What this resolves: The "REFERENCE-CONFIG / architectural divergence" framing was wrong. The model does NOT open structurally wider than external solvers. The postflop "H2 — preflop range cascades" hypothesis (in the companion UTG-vs-BB finding) is retired.

Open sub-issue: Reviewer's surface shows 25/29/36/48/42% — uniformly 5-10pp wider than the endpoint. Same family as the postflop aggregation/display drift flag. Routed to service team (Adil + Scott + Breno).

Test script: tests/pull_rfi_5positions_rvv2_payload.py
R3 · Brad Wilson · PLO4 UTG/MP VPIP monotonicity P1 B1 known-failure · A1-strict not implemented in production runner
PLO4 100bb 5%/1bb rake, 6-max — positional ordering inversion
🃏 Hand: Preflop RFI · PLO4 6-max 100bb 2-blind no ante 5%/1bb rake · all 5 positions (UTG/MP/CO/BTN/SB) Reviewer: Brad Wilson Source: b1-properties/core.yaml:48-54 (A1-strict known_failure) Date: 2026-05-06 · endpoint re-run 2026-05-11

PLO4 model opens UTG wider than MP at 100bb/5% rake — violating positional ordering. Brad's 2026-04-17 baseline showed a 0.8pp inversion (UTG 36.5% > MP 35.7%). The 2026-05-11 endpoint re-run on the same model shows the inversion has widened to 3.9pp (UTG 36.41% > MP 32.51%).

Cross-rake corroboration (added 2026-05-11): the PLO hygiene check in R10 reproduces the inversion at two additional rake settings — 3% rake / 3 BB cap and 0% rake — showing UTG 36.8% > MP 33.8% (3.0pp inversion). The monotonicity defect is rake-invariant, consistent with R10's separate finding that the PLO model is rake-insensitive on preflop opens overall.

B1 property gap discovered: A1 IS implemented in the production runner (with strict comparison) and SHOULD catch this inversion, yet Nimit's Apr 23 cross-format sweep showed plo4-6max 55/55 PASS — so either A1 doesn't sweep the rake parameter, or there's a payload-construction divergence. A1-strict (1pp noise floor) was added to the yaml spec but NOT implemented in the runner.

📋 Full resolution report (separate page): R3 · PLO4 UTG/MP VPIP — Resolution Report →

Includes: current endpoint data table, B1 property coverage gaps (A1-strict not in runner, A1's DSL doesn't sweep rake), three-track resolution plan (implement A1-strict + audit A1 + investigate model-side cause), suggested next steps, open questions for Nimit / Brad / model team.
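For reference, the missing A1-strict check is small. A sketch under the yaml spec's 1pp noise floor (the function name and data layout are illustrative, not the runner's DSL; the VPIP figures plugged in are the cross-rake numbers quoted in this finding and R10's hygiene table):

```python
# A1-strict sketch: an earlier position must not open wider than any
# later position by more than the noise floor (1pp per the yaml spec).

POSITION_ORDER = ["UTG", "MP", "CO", "BTN"]  # RFI positions, earliest first

def a1_strict_violations(vpip_by_pos, noise_floor_pp=1.0):
    """Return (earlier, later, inversion_pp) triples where an earlier
    position opens wider than a later one beyond the noise floor."""
    violations = []
    for i, early in enumerate(POSITION_ORDER[:-1]):
        for late in POSITION_ORDER[i + 1:]:
            inversion = vpip_by_pos[early] - vpip_by_pos[late]
            if inversion > noise_floor_pp:
                violations.append((early, late, round(inversion, 2)))
    return violations

# PLO4 endpoint values quoted in R3/R10 (UTG 36.8 > MP 33.8 inversion):
plo4 = {"UTG": 36.8, "MP": 33.8, "CO": 37.8, "BTN": 47.3}
```

Run against the PLO4 numbers this flags exactly the UTG > MP pair; a monotone NLHE-shaped curve passes clean.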
R4 · seidman-easy-game-reexamined · K72r c-bet Resolved — endpoint matches GTOw within 0.6pp
K72r flop UTG c-bet frequency — Quintace endpoint vs GTO Wizard
🃏 Hand: UTG c-bet on K72r · Cash NLHE 6-max 100bb 2-blind no ante Source: REVIEW_FEEDBACK_2026-05-06.md:23-31 Date: 2026-05-06 · endpoint sweep 2026-05-11

Quintace's latest endpoint c-bet rate on K72r matches GTO Wizard's published number within solver-noise tolerance.

| Source | UTG c-bet on K72r | Notes |
|---|---|---|
| Quintace endpoint (latest, 2026-05-11) | 88.6% | V2 strategy_grid, RVV2-exact payload, model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx |
| GTO Wizard published | 88% | Article reference |
| Article (Seidman v1, stale snapshot) | 59% | Pre-v2.0; surface was pinning a deprecated model |
| Article (Seidman v2.0 refresh) | 79% | Different surface aggregation than the endpoint above |
Resolution: Quintace's latest endpoint (88.6%) and GTOw (88%) agree within 0.6pp. No model-side discrepancy on this spot.

Article numbers worth noting: The Seidman article shipped 59% in v1 (stale model), then 79% in v2.0 (current model but rendered through the article's own pull script). Neither matches the underlying endpoint of 88.6%. The article-side numbers come from a different surface aggregation than the canonical endpoint — separate from this finding's resolution and treated as an article-pipeline question, not a model question.

Test script: tests/pull_postflop_utg_cbet_boards.py · also confirmed in tests/pull_hygiene_check_all.py (verifies query setup against this canonical reference).
R5 · adjusting-to-donk-bets-in-srp · fold-to-donk Model results match GTOw · article killed for editorial reasons
BB-lead / SB-call donk-bet defense — UTG response to BB donk in SRP
🃏 Hand: UTG response to BB donk-bet in SRP · Cash NLHE 6-max 100bb · standard donk boards (654r / 765r / T98r) Article: adjusting-to-donk-bets-in-srp (approval-blog system; article killed editorially) Source: REVIEW_FEEDBACK_2026-05-06.md:33-44 Date: 2026-05-06 · endpoint sweep 2026-05-11

Reviewer flagged that Quintace "shows little folding vs donk" compared to GTOw's ~20% fold reference at b33. Endpoint says otherwise.

| Board | Donk size | Quintace endpoint UTG fold% | GTOw reference |
|---|---|---|---|
| 654r | 2.0bb (~36% pot) | 20.4% | ~20% |
| 765r | 2.0bb (~36% pot) | 22.4% | ~20% |
| T98r | 2.0bb (~36% pot) | 20.3% | ~20% |
| 654r | 1.0bb (~18% pot) | 2.4% | |
| 765r | 1.0bb (~18% pot) | 3.3% | |
| T98r | 1.0bb (~18% pot) | 12.7% | |
Endpoint result: UTG fold% against BB's 2.0bb donk (≈b33) is 20-22% across 3 standard donk boards — essentially identical to the GTOw ~20% reference the reviewer cited. The "Quintace under-folds vs donk" claim does not hold at the model layer. The article's published number that triggered the reviewer's flag came from a different surface aggregation than the canonical endpoint.

Article kill still stands — but for scope, not fold-rate divergence. The reviewer's other concerns remain valid kill reasons:
  • BB-lead boards are 0% reach at equilibrium (BB never donks these in the solve, so analyzing "how to defend the donk" is testing a near-zero-frequency node)
  • SB cold-call preflop is only 3% reach (the article's preflop assumption is itself rare at equilibrium — most SB plays a 3-bet-or-fold strategy)
The article was teaching defense at a premise-broken decision tree. Killing the article is correct; the underlying model behavior on donks is fine.

Side note: At smaller donk sizes (1.0bb / b18) the endpoint folds 2-13% on 654/765/T98 — closer to the "little folding" pattern. If the article's published number was based on a smaller donk size than the b33 the reviewer compared against, that explains part of the reviewer's confusion. The article-pipeline issue (surface aggregation different from endpoint) is the same family as the R4 Seidman issue and is being routed to the article team separately.
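The ~36% / ~18% pot labels follow from the single-raised pot. A minimal check of that arithmetic, assuming a 2.5bb UTG open called by BB with the SB folding (the exact preflop sizing isn't stated above, but this composition is consistent with the ~36% figure):

```python
# Donk-bet size as a fraction of the single-raised pot.
# Assumed preflop (illustrative): UTG opens 2.5bb, BB calls, SB folds.

def pot_fraction(bet_bb, pot_bb):
    return bet_bb / pot_bb

srp_pot = 2.5 + 2.5 + 0.5          # 5.5bb going to the flop (0.5bb dead SB)
b33 = pot_fraction(2.0, srp_pot)   # the 2.0bb donk -> ~36% pot
b18 = pot_fraction(1.0, srp_pot)   # the 1.0bb donk -> ~18% pot
```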

Test script: tests/pull_r5_bb_donk_defense.py
R7 · EV-display variance for folded hands P1 B1 known-failure · KI-4 / G-series · model EV noise on rarely-trained buckets
Per-combo raise-EV varies wildly across "similar" folded hands at RFI positions
🃏 Hand: Per-combo EV at preflop RFI · Cash NLHE 6-max 100bb · all 5 RFI positions (focus on folded combos like 26o, 72o, 82o) Reporter: Internal observation (2026-05-11) Date: 2026-05-11 endpoint sweep

Per-combo raise-action EV varies dramatically across folded hands at RFI positions — standard junk (72o, 82o, etc.) shows sensible negative raise EVs (−0.2 to −2.0 bb), but rarely-played buckets like 26o show absurdly positive raise EVs (e.g. 26o = +6.77 bb at UTG despite folding 100%). Same family as B1 KI-4 (model EV shape concerns) and G-series violations.

2026-05-11 re-run of B1 I-series properties against current production model showed I3 violation rate 28.5% (12,988 / 45,598) and I5 violation rate 66.5%. The EV-shape issue is not just 26o — it's pervasive across hand classes including premium ranges (AA call vs raise gap = 1.27 bb).
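A property-style check for the headline symptom (combos that always fold yet carry large positive raise EV) is cheap to state. A sketch; the 26o value is from this finding, while the 72o/82o EVs are illustrative points inside the sensible −0.2 to −2.0 bb band quoted above:

```python
# Flag combos that fold (almost) always yet show positive raise EV --
# the R7 symptom. Thresholds are illustrative, not B1 spec values.

def flag_folded_combo_ev(combos, fold_floor=0.99, ev_ceiling_bb=0.0):
    """combos: {combo_name: (fold_freq_0to1, raise_ev_bb)}.
    Returns the sorted list of anomalous combos."""
    return sorted(
        name for name, (fold, ev) in combos.items()
        if fold >= fold_floor and ev > ev_ceiling_bb
    )

utg_rfi = {
    "26o": (1.00, +6.77),   # absurd positive raise EV (from this finding)
    "72o": (1.00, -1.40),   # sensible junk EV (illustrative, in-band)
    "82o": (1.00, -0.60),   # sensible junk EV (illustrative, in-band)
}
```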

📋 Full resolution report (separate page): R7 · EV-display variance — Resolution Report →

Includes: existing B1 property coverage (26 EV-related properties), 2026-05-11 I-series re-run results, sample violations (AA, AKs, 26o), why production B1 didn't catch it, two-track resolution plan (re-run + verify implementations), suggested next steps, open questions for Trung (EV owner) + Nimit (B1 framework).
R9 · Stand-Up depth-flip on Klu T9 at val=10 P2 B1 known-failure · stack-depth continuity violated · SQ4 property needed
Klu T9 verdict flips call → fold between 100bb and 226bb at val=10 (HCL 5-way Stand-Up)
🃏 Hand: HCL 5-way Stand-Up · Francisco AT · Mariano Q7 · Adi AK · Klu T9 · Peter 22 · 5-way preflop all-in · Jan 3 2025 · Scenario D (Peter+Adi+Klu standing) · val=10 Surfaced by: HCL 5-way comic article (May 2026) Source: hcl-five-way-stand-up/deviation-log.md H1 Date: 2026-05-11

At Stand-Up val=10 / Scenario D, Klu's T9 verdict flips categorically: 100bb = call 35%, 226bb = fold 100% (−65pp swing across just 126bb of depth). NLHE cash control sweep (50-1000bb on UTG RFI) shows total swing of −1.73pp — depth-stable. The Stand-Up flip is ~16× larger in magnitude across ~8× less depth → Stand-Up-specific phenomenon, not a general model-depth issue.

B1 framework gap: the B1 framework's stack-depth-continuity property (itself named B1) exists and applies to Stand-Up, but the production runner caps the stack range at 100bb, tests only the UTG open node, and doesn't sweep the val parameter. A new SQ4 property (stack-depth × val × state continuity) is needed to catch this automatically.
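The core of the proposed SQ4 property can be sketched as an adjacent-depth continuity check. The 20pp jump threshold and the control-curve values here are illustrative placeholders; the Stand-Up slice uses the call 35% → fold 100% flip from this finding, and the real SQ4 would run this across the full val × state grid, not one slice:

```python
# Flag categorical verdict flips across adjacent stack depths.

def depth_flip_violations(freq_by_depth, max_jump_pp=20.0):
    """freq_by_depth: {stack_bb: call_freq_pct} for one (val, scenario)
    slice. Returns (depth_lo, depth_hi, jump_pp) for jumps beyond the
    threshold."""
    depths = sorted(freq_by_depth)
    flips = []
    for lo, hi in zip(depths, depths[1:]):
        jump = freq_by_depth[hi] - freq_by_depth[lo]
        if abs(jump) > max_jump_pp:
            flips.append((lo, hi, round(jump, 2)))
    return flips

# Klu T9 at val=10 / Scenario D: call 35% at 100bb -> pure fold at 226bb.
standup = {100: 35.0, 226: 0.0}
# Depth-stable control shape (illustrative values; the real NLHE control
# sweep showed only a 1.73pp total swing over 50-1000bb):
control = {100: 17.4, 226: 16.5}
```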

📋 Full resolution report (separate page): R9 · Stand-Up depth-flip — Resolution Report →

Includes: the flip data, NLHE cash control sweep evidence, existing B1 property coverage analysis (B1 + SQ1/SQ2/SQ3 — none cover depth × val × state), proposed SQ4 yaml spec, two-track resolution (diagnose model behavior · add SQ4 + extend B1 runner), suggested next steps, open questions for Trung + Nimit + Stand-Up coach.
R8 · Stand-Up Game deep-stack OOD P2 B1 known-failure · training-coverage gap at deep stacks
Stand-Up Game verdicts unstable at 1260bb (surfaced by Mariano AA $340K comic article)
🃏 Hand: Mariano AA vs Lambo Tyler 88 · HCL High Stakes Friday · Jan 30 2026 · $340,500 pot · 1260bb Source: marketoonist/mariano-stand-up-aces/deviation-log.md Date: 2026-05-11

At deep Stand-Up Game stack depths (~1260bb in the Mariano AA hand), the model's verdicts flip fold ↔ jam ↔ call across small parameter perturbations and violate val-monotonicity. Root cause is a training-coverage gap; the OOD symptom is consistency violation. Article shipped only 500bb-validated queries as editorial mitigation.

📋 Full resolution report (separate page): R8 · Stand-Up Game deep-stack OOD — Resolution Report →

Includes: consistency-violation symptoms (val-monotonicity break, verdict flipping, B1 property B1 stack-depth-continuity), connection to KI-5 / R7 family, two-track resolution (Track A: training-coverage extension or max-depth gate · Track B: extend B1 runner stack-depth coverage), suggested next steps, open questions for Trung + Nimit.
R10 · PLO4 6-max RFI hygiene — open threshold mis-calibrated P1 Issue · over-opens 8–19pp at UTG/MP/CO vs two independent solver references
PLO4 100bb 6-max cash RFI opens ~2× wider than Upswing GTO reference in early positions
🃏 Scope: PLO4 6-max cash · 100bb starting · 4 positions UTG/MP/CO/BTN · two rake passes (3%/3 BB cap; 0%) — RFI VPIP / raise / fold / limp metrics Surfaced by: Solver-QA agent · PLO hygiene track (KVL q-plo-solver-crosscheck-paired-boards P1, preflop half) Source: engineering-department/gameplay-ai/projects/external-solver-benchmark/findings/2026-05-11_plo-rfi-hygiene-100bb-6max.md Test script: tests/pull_hygiene_check_plo.py · output pull_hygiene_check_plo.out.json Server model: universal-dense-v4-player_20260402_150328.onnx Date: 2026-05-11

First PLO hygiene pass mirroring the NLHE/MTT pattern. RFI VPIP vs two independent solver-grade references (100bb 6-max GTO PLO):

| Position | Quintace | Upswing (Monker) | PLO Genius (proprietary) | Avg ref | Δ vs avg | Verdict |
|---|---|---|---|---|---|---|
| UTG | 36.8% | 17.9% | 16.8% | 17.35% | +19.45pp | DIVERGE |
| MP (HJ) | 33.8% | 21.8% | 21.5% | 21.65% | +12.15pp | DIVERGE |
| CO | 37.8% | 30.0% | 29.0% | 29.50% | +8.30pp | DIVERGE |
| BTN | 47.3% | ~42.5%* | 48.0% | 45.25% | +2.05pp | OK |

*BTN Upswing value: community consensus (less reliable than the two solver-grade refs); PLO Genius gives 48%. Both solver-grade references corroborate within 1.1pp at UTG/MP/CO — the divergence is robust across independent solvers, not a "Upswing happens to be tight" artifact.
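The Δ-vs-avg and corroboration logic behind the verdict column can be made explicit. A sketch (the 4pp DIVERGE tolerance and 2pp reference-agreement tolerance are illustrative choices, not values from the finding doc):

```python
# Cross-check one model against two independent solver references.

def crosscheck(position_rows, diverge_tol_pp=4.0, ref_agree_tol_pp=2.0):
    """position_rows: {pos: (quintace_pct, ref_a_pct, ref_b_pct)}.
    Returns {pos: (avg_ref, delta_pp, verdict, refs_corroborate)}.
    A DIVERGE verdict is only fully trusted when the two references
    themselves corroborate within ref_agree_tol_pp."""
    out = {}
    for pos, (ours, a, b) in position_rows.items():
        avg = round((a + b) / 2, 2)
        delta = round(ours - avg, 2)
        verdict = "DIVERGE" if abs(delta) > diverge_tol_pp else "OK"
        out[pos] = (avg, delta, verdict, abs(a - b) <= ref_agree_tol_pp)
    return out

plo_rfi = {
    "UTG": (36.8, 17.9, 16.8),
    "MP":  (33.8, 21.8, 21.5),
    "CO":  (37.8, 30.0, 29.0),
    "BTN": (47.3, 42.5, 48.0),
}
res = crosscheck(plo_rfi)
```

Note how the BTN row comes out OK but non-corroborated, matching the footnote that the BTN Upswing value is the less reliable one.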

Diagnosis — what the model gets right, and where the gap lives:

Related (but distinct) findings: a separate monotonicity inversion (UTG > MP) is captured in R3. A separate rake-sensitivity diagnosis is in R16 (resolved — model is rake-flat at low rake, rake-aware at higher rake). R10 = absolute threshold too wide; R3 = positional ordering wrong; R16 = rake sensitivity correctly handled. R10's +18.9pp UTG over-loose still holds at heavy rake (R16 verified: +15.9pp gap at 10%/1cap).

Methodology notes (not findings):

📋 Full finding doc: findings/2026-05-11_plo-rfi-hygiene-100bb-6max.md — includes external-reference sourcing, hypothesis decomposition (MODEL vs methodology), parameter-sensitivity audit, triage routing, and next-step list (postflop paired-boards hygiene as the second KVL half; non-monotone num_players response as a sub-investigation).

Triage routing: Primary owner gameplay AI (model layer — Luong-Ha / Yaroslav, cc Brad since this extends R3). Methodology questions for Ha on the 856-class reduction scheme (informational, not load-bearing).
R11 · MTT ICM block — uses payouts shape + alive/entries threshold; ignores rank, prize pool, absolute entries P2 ISSUE Model · partial ICM-block coverage · per-seat rank silent
MTT model conditions on TWO ICM-block inputs (payouts schedule shape + alive/entries threshold) and ignores THREE (hero per-seat rank, prize pool, absolute entries). Earlier "single-feature" claim was wrong — original payouts test wrote to a non-existent field.
🃏 Scope: MTT preflop · UTG open at 10bb · 8-max · ante=0.12bb · per-seat stacks fixed Surfaced by: Solver-QA agent · MTT parameter sensitivity probes Test scripts: tests/probe_mtt_param_sensitivity.py, tests/probe_mtt_coordinated_stages.py, tests/probe_mtt_coordinated_fixed_stack.py, tests/probe_mtt_icm_block_only.py, tests/probe_mtt_payouts_field_correct.py Server model: universal-dense-v4-player_20260402_150328.onnx Date: 2026-05-11 (initial) · 2026-05-11 (corrected re: payouts shape)
Self-correction: the initial R11 said payouts shape was silent. That was wrong — the test in probe_mtt_icm_block_only.py assigned to mtt.payouts (a field the server does not read), so the actual payout schedule never changed across the test. Rerun via probe_mtt_payouts_field_correct.py mutating the correct mtt.payout_structure_limit_top10 field shows the model DOES respond.

What the model actually uses (corrected)

1. Payouts schedule shape — 5.14pp RFI spread across schedules at the same spot. Direction is physically correct (flat payouts → tighter; winner-take-all → wider):

| Payout schedule | RFI | Δ vs standard |
|---|---|---|
| Standard (geometric decay, default) | 11.83% | baseline |
| Flat (every paid place equal) | 8.76% | −3.07pp |
| Top-heavy (70/18/8 + tail) | 13.79% | +1.96pp |
| Winner-take-all | 13.90% | +2.07pp |

2. Alive / entries ratio — threshold step around ~5% of field remaining. Above threshold, RFI ~13.33% (baseline). Below threshold, RFI snaps to 10.41% regardless of how it got there (16/1000, 9/1000, 200/100000 all give identical 10.41%). 3.39pp effect.

What the model still ignores (re-verified)

| Parameter | Test range | Payload hash changed? | Effect on RFI |
|---|---|---|---|
| Hero per-seat rank | 1 (chipleader) / 50 (mid) / 199 (shortstack) | Yes — payloads differ | 0.00pp (rank=1: 14.40%, rank=199: 14.40%, rank=50: 14.40%) |
| prize_pool | $1k → $10M | Yes — payloads differ | 0.00pp (identical RFI 13.33% across all 5 values) |
| total_entries alone (alive% held constant) | 100 → 100,000 | Yes — payloads differ | 0.00pp (identical RFI across all 5 values when alive/entries ratio is fixed) |

Implication for hygiene checks

External ICM solvers (ICMIZER, HRC) condition on per-seat ranks, payout slope, AND bounty $$ amounts. Our model captures payout slope and the alive-threshold step, but NOT per-seat ranks or prize-pool magnitude. Hygiene checks against ICM references need to match the payout schedule precisely (the dominant ICM-block input) and accept that rank-conditioned reference differences won't be reproduced.

Triage routing: Trung (model EV / ICM-feature ownership). Two distinct questions to decide:

(a) Is per-seat rank intentionally ignored? The model has no chip-position awareness — chipleader vs shortstack at the same overall stack distribution returns identical strategy. If intentional (universal model treats ICM only through aggregate / payouts), document it. If not, add rank conditioning.

(b) Is prize_pool intentionally ignored? $1k vs $10M tournaments produce bit-identical output. This is consistent with chip-EV scaling, but reads odd to coaches who expect higher-stakes situations to influence play.

How-to: shared/tools/USING_MTT_ENDPOINT.md updated to reflect the corrected behavior (payouts shape now flagged as USED, with the gotcha that the schedule must be written to mtt.payout_structure_limit_top10 — not a sibling payouts field).
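The gotcha in code form, as a hypothetical payload helper (the payload skeleton and schedule values are simplified for illustration; the load-bearing detail is the field name mtt.payout_structure_limit_top10):

```python
import copy
import hashlib
import json

# Simplified MTT payload skeleton (illustrative, not the full schema).
BASE = {
    "hand": {"players": [{"stack": 10.0}] * 8},
    "mtt": {"payout_structure_limit_top10": [0.30, 0.20, 0.13, 0.09, 0.07,
                                             0.06, 0.05, 0.04, 0.03, 0.03]},
}

def with_schedule(payload, schedule):
    """Return a copy with the payout schedule written to the field the
    server actually reads. Writing to a made-up mtt['payouts'] key
    instead would leave the effective schedule unchanged -- the original
    R11 mistake."""
    p = copy.deepcopy(payload)
    p["mtt"]["payout_structure_limit_top10"] = schedule
    return p

def payload_hash(payload):
    """Canonical hash, so 'did the payload really change?' is checkable."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

wta = with_schedule(BASE, [1.0])        # winner-take-all
flat = with_schedule(BASE, [0.1] * 10)  # flat
```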
R12 · Mystery bounty silently equivalent to normal across all configs P2 ISSUE Model · bounty_type=mystery_bounty no-op
11 mystery_bounty configurations spanning mysterious_prize_left $0 → $10M, head_bounties $0 → $10K, bounty_proportion 0.1 → 0.9 all return bit-identical output to bounty_type=normal. Model does not condition on any mystery-bounty field.
🃏 Scope: MTT preflop · UTG seat-2 open, 10bb hero · 8-max · ante=0.12bb · realistic stack spread Surfaced by: Solver-QA agent · MTT parameter sensitivity probe Test scripts: tests/probe_mtt_param_sensitivity.py section G (initial) · tests/probe_mtt_mystery_bounty_sweep.py (rigorous 11-config sweep) Server model: universal-dense-v4-player_20260402_150328.onnx Date: 2026-05-11

Reference behavior — PKO and knockout bounty types correctly tighten RFI by 6–10pp (incentive to hunt elimination):

| Bounty type | RFI | Δ vs normal |
|---|---|---|
| normal | 11.83% | 0 |
| PKO uniform bounty=50 | 5.70% | −6.13pp |
| Flat KO bounty (initial=50) | 4.43% | −7.40pp |

Mystery-bounty sweep — 11 configurations, every one returns RFI=11.83% (bit-identical to normal):

| Config | mysterious_prize_left | head_bounties | bounty_proportion | RFI | Δ vs normal |
|---|---|---|---|---|---|
| no extras (mpl=None) | None | (none) | 0.5 | 11.83% | 0.00pp |
| orig R12 test | 2,500 | [50]×8 | 0.5 | 11.83% | 0.00pp |
| zero everything | 0 | [0]×8 | 0.5 | 11.83% | 0.00pp |
| small | 100 | [10]×8 | 0.5 | 11.83% | 0.00pp |
| mid | 10,000 | [100]×8 | 0.5 | 11.83% | 0.00pp |
| large | 100,000 | [500]×8 | 0.5 | 11.83% | 0.00pp |
| huge | 1,000,000 | [5,000]×8 | 0.5 | 11.83% | 0.00pp |
| varied head_bounties | 10,000 | [500,300,200,100,50,50,30,20] | 0.5 | 11.83% | 0.00pp |
| low proportion | 10,000 | [100]×8 | 0.1 | 11.83% | 0.00pp |
| high proportion | 10,000 | [100]×8 | 0.9 | 11.83% | 0.00pp |
| bounty pool >> prize pool | 10,000,000 | [10,000]×8 | 0.5 | 11.83% | 0.00pp |

Verdict: spread = 0.00pp across 11 configurations including extremes. All payload JSON hashes differ (so the JSON payloads truly vary on the wire), but the model produces bit-identical output. bounty_type=mystery_bounty is a no-op — the model treats it identically to normal, with the auxiliary fields ignored regardless of value.
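The verdict logic ("payload hashes all differ, outputs bit-identical") generalizes into a reusable no-op detector. A sketch over already-collected (payload, output) pairs; the endpoint querying itself is elided:

```python
import hashlib
import json

def detect_noop(runs):
    """runs: list of (payload_dict, output_dict) pairs from one sweep.
    A parameter family is a no-op when every payload hashes differently
    (the inputs truly varied on the wire) but every output is
    bit-identical -- the R12 signature."""
    h = lambda obj: hashlib.sha256(
        json.dumps(obj, sort_keys=True).encode()).hexdigest()
    payload_hashes = {h(p) for p, _ in runs}
    output_hashes = {h(o) for _, o in runs}
    return len(payload_hashes) == len(runs) and len(output_hashes) == 1

# Illustrative slice of the mystery-bounty sweep: inputs vary, output doesn't.
sweep = [
    ({"mysterious_prize_left": mpl}, {"rfi": 11.83})
    for mpl in (0, 100, 10_000, 1_000_000)
]
```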

Triage routing: Trung / Gameplay AI. Either (a) flag bounty_type=mystery_bounty as unsupported at the endpoint layer (return 400 or warning), or (b) add mystery-bounty conditioning to the universal model. Option (a) is the minimum so downstream consumers (CAI, AceCoach) don't silently get normal-MTT advice for mystery tournaments.
R13 · mtt.players[].stack is dead input P3 ISSUE Serving · payload schema redundancy
Setting mtt.players[].stack to arbitrary values has zero effect on output — only hand.players[].stack is read by the model. Verified across 9 configurations with hand and mtt stack blocks varied independently.
🃏 Scope: MTT preflop · payload schema (hand.players[].stack vs mtt.players[].stack) Surfaced by: Solver-QA agent · MTT parameter sensitivity probe Test scripts: tests/probe_mtt_param_sensitivity.py section A (initial) · tests/probe_mtt_stack_field_authority.py (9-config verification) Server model: universal-dense-v4-player_20260402_150328.onnx Date: 2026-05-11

Verification sweep — 9 configurations, hand and mtt stack blocks varied independently:

| hand.players[].stack | mtt.players[].stack | RFI | Δ vs baseline |
|---|---|---|---|
| REAL_SPREAD (baseline) | REAL_SPREAD | 17.39% | 0.00pp |
| REAL_SPREAD | all 1bb (extreme) | 17.39% | 0.00pp |
| REAL_SPREAD | all 1000bb (extreme) | 17.39% | 0.00pp |
| REAL_SPREAD | VERY VARIED (200/5/10/100/80/60/25/8 bb) | 17.39% | 0.00pp |
| REAL_SPREAD | REAL_SPREAD seat-swapped | 17.39% | 0.00pp |
| UNIFORM 30bb | REAL_SPREAD | 16.03% | −1.36pp |
| UNIFORM 30bb | VERY VARIED | 16.03% | −1.36pp |
| hero@5bb / others 50bb | hero@10bb / others 50bb | 31.83% | +14.44pp |
| hero@100bb / others 50bb | hero@10bb / others 50bb | 16.12% | −1.27pp |

Pattern: when hand.players[].stack is held constant, RFI is bit-identical regardless of what mtt.players[].stack contains. When hand.players[].stack changes, RFI changes dramatically. Within each "hand=X" group: spread = 0.00pp. mtt.players[].stack is silently ignored.

Classification: unlike R11 and R12 (model-behavior issues), R13 is a payload-schema redundancy. The model itself is not wrong — the payload format has two redundant stack fields where only one is actually consumed.

Triage routing: Service team (Scott / Adil). Either drop mtt.players[].stack from the payload schema (it's misleading — clients can spend time syncing both fields thinking both matter) or make it the authoritative source so the client lib can simplify (one stack field per seat instead of two). Affects strategy_grid_client.MttPreflop._build_mtt_hand(), which currently fills both blocks from the same source.

Documentation: captured as a gotcha in shared/tools/USING_MTT_ENDPOINT.md so future agents know which field actually matters.
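The verification pattern behind the sweep (group runs by the hand-stack config, require zero within-group output spread) can be stated as a small helper. A sketch over a subset of the rows above; the data layout is illustrative, not probe_mtt_stack_field_authority.py's actual structure:

```python
from collections import defaultdict

def dead_input_spread(runs):
    """runs: list of (hand_stack_key, mtt_stack_key, rfi_pct).
    Returns the within-group RFI spread per hand_stack_key. A dead
    mtt-stack input shows 0.0 spread in every group: varying it while
    hand stacks are fixed never moves the output."""
    groups = defaultdict(list)
    for hand_key, _mtt_key, rfi in runs:
        groups[hand_key].append(rfi)
    return {k: round(max(v) - min(v), 2) for k, v in groups.items()}

r13 = [
    ("REAL_SPREAD",  "REAL_SPREAD", 17.39),
    ("REAL_SPREAD",  "all 1bb",     17.39),
    ("REAL_SPREAD",  "all 1000bb",  17.39),
    ("REAL_SPREAD",  "VERY VARIED", 17.39),
    ("UNIFORM 30bb", "REAL_SPREAD", 16.03),
    ("UNIFORM 30bb", "VERY VARIED", 16.03),
]
```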
R20 · Cash universal model has near-flat BB-defense curve across stack depths P1 ISSUE Model · cash · depth-insensitive defense
Cash universal model defends BB at ~80–83% VPIP across all stack depths from 20bb to 100bb (3.3pp range) — qualitative theory expects depth-sensitivity (deeper → wider defense for more postflop equity realization). MTT model on the same spot shows the expected curve (64.8% → 85.8% across the same depth range, 21pp range).
🃏 Scope: NLHE cash · BB defends vs BTN 2.0x open · 8-max · symmetric stacks · ante=12 chips (0.12bb) · 0 rake · 7 stack depths from 20bb to 100bb Surfaced by: Solver-QA agent · BB-defense depth curve probe (originally a side-finding while triangulating R14) Test script: tests/probe_bb_defense_depth_curve.py · output probe_bb_defense_depth_curve.out.json Server endpoint: https://preview.rlserv.aceguardianrl.com/api/strategy_grid Cash model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx MTT model (for comparison): universal-dense-v4-player_20260402_150328.onnx Date: 2026-05-11

VPIP curve — BB defends vs BTN 2.0x by stack depth, both models on identical payloads (only stack varies):

| Stack | MTT model VPIP | Cash model VPIP | MTT − Cash |
|---|---|---|---|
| 20bb | 64.76% | 80.77% | −16.01pp |
| 25bb | 66.85% | 79.70% | −12.85pp |
| 30bb | 70.20% | 79.71% | −9.51pp |
| 40bb | 78.17% | 79.94% | −1.77pp |
| 50bb | 81.16% | 80.82% | +0.34pp |
| 75bb | 84.73% | 82.36% | +2.37pp |
| 100bb | 85.82% | 82.98% | +2.84pp |
| Spread | 21.06pp | 3.28pp | |

Observations:

Single-spot caveat: this curve is from one spot (BB vs BTN 2.0x). Before drawing strong conclusions about the cash universal model, this should be reproduced at other defense spots (BB vs CO, BB vs HJ, SB vs BTN) and at different open sizes. The pattern may be specific to BB-vs-BTN or generalize across cash defense; only further testing tells us which.
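The spread rows in the table reduce to a one-liner, which is what a future depth-sensitivity hygiene check could assert at each defense spot. A sketch; the 5pp "flat" threshold is an illustrative choice, not an agreed cut-off:

```python
# Depth-sensitivity spread: max VPIP minus min VPIP across the sweep.

def depth_spread_pp(vpip_by_depth):
    vals = list(vpip_by_depth.values())
    return round(max(vals) - min(vals), 2)

# BB vs BTN 2.0x curves from the table above:
cash = {20: 80.77, 25: 79.70, 30: 79.71, 40: 79.94,
        50: 80.82, 75: 82.36, 100: 82.98}
mtt  = {20: 64.76, 25: 66.85, 30: 70.20, 40: 78.17,
        50: 81.16, 75: 84.73, 100: 85.82}

FLAT_THRESHOLD_PP = 5.0  # illustrative cut-off for "depth-insensitive"
cash_flat = depth_spread_pp(cash) < FLAT_THRESHOLD_PP
mtt_flat = depth_spread_pp(mtt) < FLAT_THRESHOLD_PP
```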

Triage routing: Trung / Gameplay AI. Two questions to answer:

(a) Does the depth-insensitivity hold at other cash defense spots? Run the same depth sweep for BB vs CO, BB vs HJ, SB vs BTN. If the pattern is uniform across defense spots, that's strong evidence of a cash-model calibration gap on stack-depth conditioning.

(b) If real, is depth-insensitivity intentional or a defect? Cash universal may be trained primarily at 100bb and use an architecture that doesn't explicitly condition on depth — in which case the model's 80% at 20bb is the 100bb-equilibrium response, unaware of the short-stack regime. That would be a known training-distribution limit; the fix is either retraining with mixed depths or surfacing the depth limitation downstream.

Related: independent of retracted R14. R14's "MTT model under-defends" framing was wrong — the depth curve shows the MTT model has a sensible curve and the gap was against an unverified reference. R20 is the opposite: the cash model's curve is the suspect one.
R15 · Chip-EV RFI divergence vs GTOw — multi-faceted gap pattern across MTT + Cash models P1 ISSUE Model · structural · 5 spots × 2 models vs verified GTOw references
Group finding covering 5 chip-EV RFI spots compared against directly-verified GTOw blog references. Both models tend to under-open at deep/mid stacks; MTT model gap widens at HJ 30bb; cash model flips to over-open at 5bb push-fold range; MTT model uses a sharp push-fold transition between 5bb and 17bb. Sub-findings are grouped because they share methodology and source references.
🃏 Scope: 8-max chip-EV RFI · 4 UTG stacks (5/17/50/100 bb) + 1 HJ 30bb spot · ante=12 chips (0.12bb) · 0 rake · symmetric stacks · alive/entries=95% (well above R11's bubble threshold → chip-EV regime on the model side) References (verified directly from GTOw articles in this scrutiny pass): Test scripts: tests/single_spot_audit_utg_50bb.py, tests/single_spot_audit_utg_100bb.py, tests/single_spot_audit_utg_17bb.py, tests/single_spot_audit_utg_5bb.py, tests/single_spot_audit_hj_30bb.py, tests/single_spot_audit_cash_chipev.py Server endpoint: https://preview.rlserv.aceguardianrl.com/api/strategy_grid MTT model: universal-dense-v4-player_20260402_150328.onnx Cash model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx Date: 2026-05-11

Comparison matrix

| Spot | GTOw ref | MTT model RFI | MTT gap | Cash model RFI | Cash gap | MTT vs Cash |
|---|---|---|---|---|---|---|
| UTG 100bb | 16.50% | 15.00% | −1.50pp | 15.46% | −1.04pp | −0.46pp |
| UTG 50bb | 17.70% | 15.43% | −2.27pp | 16.47% | −1.23pp | −1.04pp |
| UTG 17bb | 15.80% | 12.58% | −3.22pp | 13.67% | −2.13pp | −1.09pp |
| UTG 5bb | 20.00% | 18.17% | −1.83pp | 24.66% | +4.66pp | −6.49pp |
| HJ 30bb | 29.60% | 22.49% | −7.11pp | 26.88% | −2.72pp | −4.39pp |
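As a quick arithmetic re-check, the gap columns follow directly from the ref/model columns (values copied from the comparison matrix; pure arithmetic, no endpoint calls):

```python
# Arithmetic re-check of the R15 comparison-matrix gap columns.
ROWS = {  # spot -> (GTOw ref, MTT model RFI, Cash model RFI), in %
    "UTG 100bb": (16.50, 15.00, 15.46),
    "UTG 50bb":  (17.70, 15.43, 16.47),
    "UTG 17bb":  (15.80, 12.58, 13.67),
    "UTG 5bb":   (20.00, 18.17, 24.66),
    "HJ 30bb":   (29.60, 22.49, 26.88),
}
for spot, (ref, mtt, cash) in ROWS.items():
    print(f"{spot}: MTT {mtt - ref:+.2f}pp · Cash {cash - ref:+.2f}pp · "
          f"MTT vs Cash {mtt - cash:+.2f}pp")

# The MTT model under-opens everywhere; only the cash model at 5bb flips sign.
assert all(mtt < ref for ref, mtt, _ in ROWS.values())
assert sum(cash > ref for ref, _, cash in ROWS.values()) == 1
```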

Sub-findings grouped under R15

(a) Both models under-open at deep + mid UTG stacks. Gap is 1.5–3.2pp for MTT model, 1.0–2.1pp for cash model at UTG 100/50/17 bb. Direction is consistent (both tighter than GTOw); magnitude is near solver-noise tolerance but the consistent same-direction bias across 3 stack depths suggests a real (small) calibration drift.

(b) MTT model gap widens dramatically at HJ 30bb. One non-UTG data point shows MTT model 7.1pp tighter than GTOw vs the ~2pp UTG gap. Suggests position-conditioned bias (wider gap at later positions). Caveat: single non-UTG spot; need MP, CO, BTN coverage to confirm the pattern.

(c) Cash model FLIPS sign at 5bb (over-opens by 4.66pp). Cash universal is 1–2.7pp tight at 100 / 50 / 17 / 30 bb, then +4.66pp WIDER at 5bb. Cash model uses 21.86% all-in at 5bb vs GTOw's ~20% total RFI (where almost all should be jams). The cash model isn't push-fold trained — once min-raise becomes structurally infeasible, it gambles all-in too liberally.

(d) MTT model has a sharp push-fold transition between 5bb and 17bb. At 100/50/17 bb, MTT model all-in usage is 0.000–0.004% (essentially never jams). At 5bb, all-in is 17.60% (essentially the entire RFI). GTOw transitions smoothly across this range; the MTT model has a binary switch. Caveat: no data at intermediate stacks (8–14 bb) — exact transition shape unknown.
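The missing 8–14bb coverage flagged in (d) could be filled with a short intermediate-stack sweep that locates the jam cutoff. `query_fn` below is an illustrative stand-in returning all-in usage (% of range) at a given depth; the hard-switch stub is hypothetical, chosen only to mimic the binary behavior seen at the 5bb (17.60%) and 17bb (~0%) endpoints:

```python
# Sketch for the missing 8-14bb coverage: sweep intermediate stacks and
# locate the first depth where jamming effectively stops.
def jam_transition(query_fn, stacks_bb=range(5, 18)):
    """Return the (stack, jam%) curve and the first depth where jam% < 1%."""
    curve = [(s, query_fn(s)) for s in stacks_bb]
    cutoff = next((s for s, jam in curve if jam < 1.0), None)
    return curve, cutoff

# Stub with a hard switch at 8bb (hypothetical transition point):
curve, cutoff = jam_transition(lambda s: 17.6 if s <= 7 else 0.004)
assert cutoff == 8  # a smooth GTOw-like curve would taper instead of switching
```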

(e) MTT model uses a different sizing tree at HJ 30bb. Both models have the same available sizes (2.0 / 2.5 / 4.5 / 6.0 / 9.5x). Cash model uses 2.0 + 2.5x as modal opens (10.97% + 15.66%). MTT model uses 4.5x as the single modal open (21.83%) with near-zero weight on 2.0x (0.003%). Same spot, very different sizing-tree resolutions.

Methodology

All 5 audits use identical parameter alignment to the GTOw reference: 8-max NLHE, symmetric stacks, ante=12 chips (0.12bb per player), 0 rake, chip-EV regime. Single known parameter mismatch: 0.5-chip ante rounding (12 vs 12.5 chips) — per ante sensitivity probe, accounts for ≤ 0.25pp of any gap. Reference numbers were re-verified by direct WebFetch of the primary GTOw blog posts (not relying on secondary citations). Side-by-side script + JSON outputs are at external-solver-benchmark/tests/single_spot_audit_*.{py,out.json}; each script prints the full payload for spot-check.
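The shared alignment can be summarized as one base parameter block plus per-spot overrides. Field names here are illustrative, not the endpoint's actual schema; the real payloads live in the tests/ scripts:

```python
# Minimal sketch of the parameter alignment shared by all 5 audits.
BASE_PARAMS = {
    "game": "NLHE",
    "players": 8,
    "symmetric_stacks": True,
    "ante_chips": 12,    # GTOw uses 12.5; the ante sensitivity probe bounds
                         # this 0.5-chip mismatch at <= 0.25pp of any gap
    "rake_pct": 0.0,
    "regime": "chip_ev",
}

def spot_params(position: str, stack_bb: int) -> dict:
    """One audit spot = shared alignment + (position, stack depth)."""
    return {**BASE_PARAMS, "position": position, "stack_bb": stack_bb}

SPOTS = [("UTG", 5), ("UTG", 17), ("UTG", 50), ("UTG", 100), ("HJ", 30)]
payloads = [spot_params(pos, bb) for pos, bb in SPOTS]
assert all(p["rake_pct"] == 0.0 and p["ante_chips"] == 12 for p in payloads)
```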

Open scope (what would strengthen this finding)

Triage routing: Trung (model EV / chip-EV calibration). The data is parameter-aligned and the references are now directly verified from the source GTOw articles. Four distinct concerns are grouped here:

(a) Both universal models under-open at deep/mid stacks vs GTOw chip-EV — gap is small (1–3pp) but consistent direction across UTG 100/50/17/30 bb. The MTT model has an additional 1pp drift on top of the cash model's shared bias.

(b) MTT model gap widens at wider opening positions (HJ 30bb shows 7pp gap) — single data point; needs more position coverage to confirm position-conditioning.

(c) Cash model over-jams at 5bb (push-fold zone) — separate calibration concern at short stacks. Cash model was likely trained primarily at 100bb and extrapolates poorly to push-fold range.

(d) MTT model has binary push-fold switch and uses unusual sizing-tree choices — affects coaching surfaces that display sizing decisions.
R16 · PLO preflop rake-flat at low rake, rake-aware at higher rake — model behaves correctly Resolved · correct model behavior · SB notably more rake-sensitive than other positions
Original "rake invariant" claim was a sampling artifact; preflop DOES respond at 5%+ rake; postflop also responds
🃏 Scope tested: PLO4 preflop RFI · 5 positions (UTG/MP/CO/BTN/SB) × 5 rake settings · postflop BTN c-bet on 3 boards × 5 rake settings
  • Surfaced by: PLO hygiene track parameter audit (initial sampling caught the rake-flat zone; per-position + postflop followups corrected the picture)
  • Test scripts: tests/probe_plo_methodology_audit.py · tests/probe_plo_postflop_rake_sensitivity.py · tests/probe_plo_preflop_rake_per_position.py
  • Server model: universal-dense-v4-player_20260402_150328.onnx
  • Date: 2026-05-11

Three-step diagnosis:

  1. Initial observation (preflop, 2 rake values): PLO preflop RFI output was identical across 0% rake and 3% / 3 BB cap — hypothesized "rake silently ignored."
  2. Postflop follow-up (BTN c-bet, 3 boards × 5 rake settings): postflop DOES respond to rake (2.10–4.80pp range per board). So endpoint consumes rake fields; field-strip hypothesis refuted.
  3. Per-position preflop sweep (5 positions × 5 rake settings): preflop ALSO responds when swept at heavier rake values. Original "invariant" claim was a sampling artifact — 0% and 3%/3 BB cap happen to produce identical output (3 BB cap rarely binds on small preflop pots, making both effectively rake-free for preflop EV).
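The three-step diagnosis comes down to step 3's finer sweep. A sketch, where `query_fn` is an illustrative stand-in for the endpoint client in tests/probe_plo_preflop_rake_per_position.py and the stub reproduces the UTG row of the per-position table in this finding:

```python
# Sketch of step 3's finer rake sweep.
RAKE_GRID = [(0.0, None), (3.0, 3.0), (5.0, 3.0), (5.0, 1.0), (10.0, 1.0)]

def rake_response(query_fn, position):
    """VPIP% at each (rake%, cap BB) setting, plus the spread in pp."""
    vpips = [query_fn(position, pct, cap) for pct, cap in RAKE_GRID]
    return vpips, max(vpips) - min(vpips)

# Stub with the UTG numbers: identical output at 0% and 3%/3cap (the
# rake-flat zone), then a real response at heavier rake.
utg_stub = {(0.0, None): 36.8, (3.0, 3.0): 36.8, (5.0, 3.0): 35.9,
            (5.0, 1.0): 36.4, (10.0, 1.0): 33.8}
vpips, spread = rake_response(lambda pos, pct, cap: utg_stub[(pct, cap)], "UTG")
assert vpips[0] == vpips[1]      # the flat zone that caused step 1's artifact
assert abs(spread - 3.0) < 1e-9  # matches the UTG "Range" column (3.00pp)
```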

Full picture — preflop VPIP by position × rake:

| Position | 0% | 3%/3cap | 5%/3cap | 5%/1cap | 10%/1cap | Range |
|---|---|---|---|---|---|---|
| UTG | 36.80% | 36.80% | 35.90% | 36.40% | 33.80% | 3.00pp |
| MP | 33.80% | 33.80% | 32.70% | 33.20% | 30.60% | 3.20pp |
| CO | 37.80% | 37.80% | 36.40% | 37.00% | 33.80% | 4.00pp |
| BTN | 47.30% | 47.30% | 45.70% | 46.40% | 43.70% | 3.60pp |
| SB | 46.00% | 46.00% | 42.30% | 43.60% | 38.50% | 7.50pp |

Three takeaways:

Postflop response (sub-data): BTN c-bet across 3 boards × 5 rake settings — bet% range 2.10pp (Ts9s8s) / 3.80pp (KsKd2c) / 4.80pp (Ks7d2c). Postflop also responds, as expected.

Resolution: the original "rake silently ignored" framing was wrong. The model DOES respond to rake at both preflop and postflop; the initial observation was a sampling artifact (0% and 3%/3cap both fall in the rake-flat zone). With a finer rake sweep, response is real, directional, and matches GTO expectations.

Sub-flags worth keeping warm (not blocking R16 resolution):
  • SB shows 2× the rake response of other positions. Could be a real positional effect, could be model noise. Worth comparing against an external SB-rake-sweep reference if one exists.
  • Postflop response on KsKd2c (paired board) is non-monotone — 10%/1cap shows MORE betting than 5%/1cap. Possibly model noise at heavy rake, possibly real complex rake × texture interaction.
Implication for R10: R10's +18.9pp UTG over-loose finding was measured at 0% and 3%/3cap (both in the rake-flat zone). At 10%/1cap, UTG opens at 33.8% — still well above Upswing's 17.9% reference (+15.9pp gap). The R10 over-loose calibration holds even at heavy rake; doesn't disappear under realistic rake structures.
R17 · KVL ea5 — NLHE BTN open average sizing claim is stale Resolved · current endpoint matches Pio/GTOW reference
KVL register entry claimed avg_raise_bb=4.8 at BTN; current endpoint produces 2.81 — within Pio/GTOW 2.0–3.0 expected range
🃏 Scope tested: NLHE 6-max · 100bb · 0-ante · 2-blind · BTN open · 0% rake and 3%/3 BB cap
  • Originally surfaced as: KVL register item ea5-btn-open-vs-public-solvers (P2, partners)
  • Test script: tests/probe_ea5_btn_open_avg_size.py
  • Server model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx
  • Date: 2026-05-11

The KVL register entry claimed BTN open avg_raise_bb = 4.8 BB, well above Pio/GTOW's expected 2.0–3.0 BB. The 4.8 figure was logged previously against an older model; the current endpoint output is sharply different.

| Rake setting | Fold | Call | Raise | avg_raise_bb | Verdict |
|---|---|---|---|---|---|
| 0% rake | 59.35% | 0.33% | 40.32% | 2.81 | OK |
| 3% / 3 BB cap | 59.35% | 0.33% | 40.32% | 2.81 | OK |

Sizing breakdown (when raising): 71.5% at 2.5× · 26.9% at 3.5× · 1.7% at 5.0×. Weighted average = 2.81 BB, right in the middle of the Pio/GTOW 2.0–3.0 expected range.
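The 2.81 figure re-derives cleanly as the probability-weighted mean of the three sizes in the breakdown (values copied from above; the raw weights sum to 100.1% due to rounding, so normalize by their sum):

```python
# Re-deriving avg_raise_bb = 2.81 from the sizing breakdown above.
sizes_bb = [2.5, 3.5, 5.0]     # raise sizes in BB
weights = [71.5, 26.9, 1.7]    # % of the raising range at each size
avg = sum(w * s for w, s in zip(weights, sizes_bb)) / sum(weights)
assert 2.0 <= avg <= 3.0       # inside the Pio/GTOW expected band
assert round(avg, 2) == 2.81   # matches the endpoint's avg_raise_bb
```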

Resolution: the KVL register entry is stale. The current endpoint produces BTN avg_raise_bb = 2.81, within Pio/GTOW expected range. The original 4.8 figure was from an older model snapshot that has since been retrained. No action needed at the model layer. Removing from Tier 3 KVL register.

Note: NLHE preflop opens are also confirmed rake-invariant at low rake (0% and 3%/3cap produce identical output) — the same pattern as PLO preflop at low rake values (R16). Behavior is consistent across game types.
R18 · KVL q-cash-bb-btn-3bet — BB 3-bet vs BTN matches GTOw within tolerance Resolved · model matches reference
BB 3-bet frequency vs BTN 2.5x / 3.0x open at 100bb 6-max no-ante is within 0.32pp of GTOw published references
🃏 Scope tested: NLHE 6-max · 100bb · 0-ante · 2-blind · BB defends vs BTN 2.5x and 3.0x open · 2 rake settings each
  • Originally surfaced as: KVL register item q-cash-bb-btn-3bet-crosscheck (P2, thanh)
  • Test script: tests/probe_q_cash_bb_btn_3bet.py
  • Server model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx
  • Date: 2026-05-11

| BB defends vs | Rake | Fold | Call | 3-bet | GTOw ref | Δ pp | Verdict |
|---|---|---|---|---|---|---|---|
| BTN 2.5x | 0% | 54.09% | 32.09% | 13.82% | 13.5% | +0.32 | OK |
| BTN 2.5x | 3%/3cap | 54.09% | 32.09% | 13.82% | 13.5% | +0.32 | OK |
| BTN 3.0x | 0% | 66.65% | 22.08% | 11.26% | 11.5% | −0.24 | OK |
| BTN 3.0x | 3%/3cap | 66.65% | 22.08% | 11.26% | 11.5% | −0.24 | OK |

Three-bet sizing is also sensible: vs 2.5x, BB mixes 11 BB (8.87% prob) and 16.5 BB (4.75% prob); vs 3.0x, BB shifts to 13 BB primary (9.99% prob) with a tiny 19.5 BB tail. Sizes scale appropriately with the open size.

Size-sensitivity sanity check: BB tightens 2.56pp from 13.82% vs 2.5x to 11.26% vs 3.0x — correct GTO direction (defense tightens against larger opens).
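The sanity check is trivial to encode (frequencies copied from the table above):

```python
# Direction check: BB 3-bet% should fall as the BTN open size grows.
threebet_pct = {2.5: 13.82, 3.0: 11.26}  # open size (x) -> BB 3-bet %
delta = threebet_pct[2.5] - threebet_pct[3.0]
assert delta > 0                # defense tightens vs the larger open
assert round(delta, 2) == 2.56  # the 2.56pp quoted above
```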

Resolution: the KVL register entry was either based on an older model snapshot or on a different sizing assumption. The current endpoint produces BB 3-bet rates within 0.32pp of GTOw at the standard 2.5x and 3.0x BTN open sizes, with correct directional response to sizing. No model action needed. Removing from Tier 3 KVL register.

Note: rake is again invariant at low values (0% and 3%/3cap produce identical output) — consistent with the NLHE preflop pattern in R17 and PLO preflop pattern in R16.
R19 · PLO paired-board c-bet observations — data captured, awaiting external reference P2 Open · cannot classify without external solver reference
BTN c-bet frequencies on 7 paired flops captured; classification deferred until paid PLO solver reference is available
🃏 Scope tested: PLO4 6-max · 100bb · BTN open 2.5x · BB call · 7 paired flops + 2 unpaired anchors · 0% rake
  • Originally surfaced as: KVL q-plo-solver-crosscheck-paired-boards (P1, thanh) — postflop half
  • Test script: tests/probe_plo_paired_boards_postflop.py
  • Server model: universal-dense-v4-player_20260402_150328.onnx
  • Date: 2026-05-11

BTN c-bet frequencies captured across 7 paired flop classes — kept here as raw observations awaiting an external PLO solver reference to validate against:

| Board | Texture | BTN bet% | Check% | Avg bet (BB) |
|---|---|---|---|---|
| AsAd7h | high pair · mid kicker · rainbow | 60.70% | 39.30% | 2.00 |
| KsKd7c | high pair · mid kicker · rainbow | 27.10% | 72.90% | 2.00 |
| TsTd5c | mid pair · mid kicker · rainbow | 24.30% | 75.70% | 2.00 |
| 8s8d3h | mid pair · low kicker · rainbow | 57.00% | 43.00% | 2.00 |
| 9d9s2c | mid-low pair · low kicker · rainbow | 53.30% | 46.70% | 2.00 |
| 4s4d2c | low pair · low kicker · rainbow | 48.20% | 51.80% | 2.20 |
| Ks7s7d | paired · flush draw | 70.30% | 29.70% | 2.20 |

Why this is NOT classified as an Issue: there is currently no external PLO solver reference to grade these frequencies against, so they cannot be called right or wrong from internal data alone.

Status: data captured; classification deferred. To upgrade R19 to Issue or Resolved, need an external PLO solver to produce comparison numbers for at least 3 of these 7 boards. Options:
  • Subscribe to PLO Ninja / RangeConverter / PLO Mastermind ($19–$249) for solver-grade postflop ranges
  • Coach reference — Brad Wilson or another PLO pro with paid solver access screenshots a few of these specific boards
  • Run our own internal full-CFR on these spots if PLO CFR engine access is available
Implication for KVL: the q-plo-solver-crosscheck-paired-boards KVL item is NOT fully closed — the preflop half was captured as R10 with confirmed Issue status (it had external refs); the postflop half is captured here as R19 and stays Open pending external refs.
Adding a new reviewer finding:
  1. Drop a markdown file in engineering-department/gameplay-ai/projects/external-solver-benchmark/findings/{date}_{slug}.md.
  2. Register it in INDEX-existing-data.md under Tier 1.x (reviewer-named docs) or Tier 3 (article-QA flags).
  3. Add the entry here, including reviewer name (or "unattributed"), source link (Google Doc / article path), date, headline claim, and triage verdict.
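A hypothetical scaffold for that procedure — the path template and required fields come from this page, but the helper itself is illustrative (the real workflow is manual):

```python
from datetime import date
from pathlib import Path

FINDINGS_DIR = Path("engineering-department/gameplay-ai/projects/"
                    "external-solver-benchmark/findings")
REQUIRED_FIELDS = ["reviewer", "source", "date", "claim", "verdict"]

def finding_stub(slug: str, **fields) -> tuple:
    """Return (target path, markdown front-matter) for a new R-finding."""
    missing = [f for f in REQUIRED_FIELDS if f not in fields]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    path = FINDINGS_DIR / f"{date.today():%Y-%m-%d}_{slug}.md"
    body = "\n".join(f"{k}: {fields[k]}" for k in REQUIRED_FIELDS)
    return path, body

path, body = finding_stub(
    "example-slug",
    reviewer="unattributed",
    source="(Google Doc / article path)",
    date="2026-05-11",
    claim="(headline claim)",
    verdict="(triage verdict)",
)
assert path.suffix == ".md" and body.startswith("reviewer: unattributed")
```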