This page tracks every divergence flagged by an external reviewer (coach, pro, or theorist) comparing Quintace strategy outputs against public solvers or theory baselines.
This is the reviewer-sourced subset of the full Solver QA catalog. Every entry here started as a reviewer comment (Google Doc, article QA pass, or in-person review), not a bench run. For the full benchmark catalogue, including quantitative dataset comparisons, see the Solver QA index.
📋 Reviewer-side consolidated tracker
"QuintAce — QA findings" Google Doc → — reviewer-owned master tracker. This page is downstream; reviewer notes land in the Google Doc first, then surface here as R-findings.
Two tabs in the doc:
Strategy Issues — UTGvBB major differences with GTOw · Hygiene · RV v2 vs GTOw (maps to R1, R2, hygiene check)
Articles — per-article review status: JRB Dumbest Hands · Ivey 5 Defining Hands · Easy Game (Seidman) · Adjusting to Donk Bets [KILL] · Uri Peleg Stand-Up · Mariano AA · HCL 5-way (maps to R4, R5, R8, R9)
R1 · UTGvBB postflop differences vs GTOw — Open for discussion
Latest Quintace endpoint results on each spot, alongside the reference number the reviewer used (from GTO Wizard or another external solver). Some spots agree, some differ. The differences are open for discussion — we haven't pre-assigned whether the external solver or Quintace is correct.
| Board | Spot | Quintace (latest endpoint) | Reviewer reference (GTOw / external) | Difference |
| --- | --- | --- | --- | --- |
| AK3ss | UTG c-bet freq | 66.4% | 88-89% | −22pp |
| AK3ss | UTG fold vs BB x/r | 28.6% | 27% | +1.6pp |
| KQ6hh | UTG c-bet freq | 72.8% | "similar" (no exact) | ≈ |
| KQ6hh | BB fold vs 2bb c-bet | 41.3% | 54% | −13pp |
| KQ6hh | UTG fold vs BB x/r | 26.1% | 47% | −21pp |
| 456ss | UTG c-bet freq | 51.2% | (none published) | — |
| 456ss | BB fold vs 2bb c-bet | 33.5% | 30% | +3.5pp |
| 456ss | UTG fold vs BB x/r | 36.1% | "similar" (no exact) | ≈ |
| JT8ss | UTG c-bet freq | 54.6% | (none published) | — |
| JT8ss | BB fold vs 2bb c-bet | 39.0% | 38% | +1.0pp |
| JT8ss | UTG fold vs BB x/r | 28.6% | 38% | −9.4pp |
Spots that closely agree (≤ ±4pp): AK3ss UTG vs xr, KQ6hh UTG c-bet (qualitative), 456ss BB defense, 456ss UTG vs xr (qualitative), JT8ss BB defense.
Spots where Quintace and the reviewer's reference materially differ — open for discussion:
AK3ss UTG c-bet — Quintace 66% vs GTOw 88-89%. Is GTOw's high c-bet rate right on AK3ss, or is Quintace's more selective c-bet defensible?
KQ6hh BB defense vs 2bb c-bet — Quintace 41% fold vs GTOw 54%. Is GTOw over-folding because of poor OOP equity-realization assumption, or is Quintace under-folding?
KQ6hh UTG vs BB xr — Quintace 26% fold vs GTOw 47%. Is GTOw too willing to give up here, or is Quintace too sticky?
JT8ss UTG vs BB xr — Quintace 29% fold vs GTOw 38%. Smaller gap, same question.
Open question: these are differences, not verdicts. For each, the discussion is whether the published external-solver number reflects a well-converged equilibrium on a comparable tree, or whether Quintace's number is the better answer (e.g., a universal model with no preflop-tree abstraction may legitimately produce a different postflop equilibrium). Next step: review with the gameplay-AI and coach teams to decide which differences to treat as model issues and which as legitimate strategic differences.
Methodology: endpoint sweep run via V2 strategy_grid (RVV2-exact payload), model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx; BB defense queried at a 2.0bb c-bet, UTG response queried at a BB x/r to 6.0bb. Test script: tests/pull_r1_all_5_boards_full.py.
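For orientation, a minimal sketch of the sweep shape — the endpoint URL and model name are from this page, while the payload field names and line encoding are illustrative assumptions (the canonical RVV2-exact payload lives in tests/pull_r1_all_5_boards_full.py):

```python
# Sketch of the R1 sweep shape. Payload fields below are illustrative
# assumptions -- the canonical RVV2-exact payload lives in
# tests/pull_r1_all_5_boards_full.py.
import requests

ENDPOINT = "https://preview.rlserv.aceguardianrl.com/api/strategy_grid"
MODEL = "moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx"
BOARDS = ["AK3ss", "KQ6hh", "456ss", "JT8ss"]

def query_node(board: str, line: list[str]) -> dict:
    """Query one postflop decision node (hypothetical 'line' encoding)."""
    payload = {"model": MODEL, "board": board, "line": line}
    resp = requests.post(ENDPOINT, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()

for board in BOARDS:
    flop = query_node(board, [])                                    # UTG c-bet freq
    vs_cbet = query_node(board, ["UTG bet 2.0bb"])                  # BB fold vs 2bb c-bet
    vs_xr = query_node(board, ["UTG bet 2.0bb", "BB raise 6.0bb"])  # UTG fold vs x/r
    print(board, flop, vs_cbet, vs_xr)
```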
R2 · Timothy Ulmer · Preflop RFI — Resolved at model layer (2026-05-11)
Preflop RFI Ranges 100bb 2-blind no ante 6-max — Quintace vs GTOW vs Jonathan Little
2026-05-11 update — endpoint verification. Direct query against the canonical V2 strategy_grid endpoint (RVV2-style payload, model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx) shows the model output matches GTOw to within ~1pp at every position. The "Quintace opens 5-10pp wider" claim was based on the reviewer's display surface, not the model.
What this resolves: The "REFERENCE-CONFIG / architectural divergence" framing was wrong. The model does NOT open structurally wider than external solvers. The postflop "H2 — preflop range cascades" hypothesis (in the companion UTG-vs-BB finding) is retired.
Open sub-issue: Reviewer's surface shows 25/29/36/48/42% — uniformly 5-10pp wider than the endpoint. Same family as the postflop aggregation/display drift flag. Routed to service team (Adil + Scott + Breno).
Test script: tests/pull_rfi_5positions_rvv2_payload.py
R3 · Brad Wilson · PLO4 UTG/MP VPIP monotonicity P1 · B1 known-failure · A1-strict not implemented in production runner
🃏 Hand: Preflop RFI · PLO4 6-max 100bb 2-blind no ante 5%/1bb rake · all 5 positions (UTG/MP/CO/BTN/SB) · Reviewer: Brad Wilson · Source: b1-properties/core.yaml:48-54 (A1-strict known_failure) · Date: 2026-05-06 · endpoint re-run 2026-05-11
PLO4 model opens UTG wider than MP at 100bb/5% rake — violating positional ordering. Brad's 2026-04-17 baseline showed a 0.8pp inversion (UTG 36.5% > MP 35.7%). The 2026-05-11 endpoint re-run on the same model shows the inversion has widened to 3.9pp (UTG 36.41% > MP 32.51%).
Cross-rake corroboration (added 2026-05-11): the PLO hygiene check in R10 reproduces the inversion at two additional rake settings — 3% rake / 3 BB cap and 0% rake — showing UTG 36.8% > MP 33.8% (3.0pp inversion). The monotonicity defect is rake-invariant, consistent with R10's separate finding that the PLO model is rake-insensitive on preflop opens overall.
B1 property gap discovered: A1 IS implemented in the production runner (with strict comparison) and SHOULD catch this, yet Nimit's Apr 23 cross-format sweep showed plo4-6max 55/55 PASS — so either A1 doesn't sweep the rake parameter, or there's a payload-construction divergence. A1-strict (1pp noise floor) was added to the yaml spec but NOT implemented in the runner (a sketch of the intended check follows the summary below).
Includes: current endpoint data table, B1 property coverage gaps (A1-strict not in runner, A1's DSL doesn't sweep rake), three-track resolution plan (implement A1-strict + audit A1 + investigate model-side cause), suggested next steps, open questions for Nimit / Brad / model team.
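A minimal sketch of what the missing A1-strict check could look like, assuming a vpip_by_position() helper that wraps the endpoint query — the 1pp noise floor is from the yaml spec; everything else is illustrative:

```python
# Sketch of A1-strict: positional VPIP ordering with a 1pp noise floor,
# swept across rake settings. vpip_by_position() is a hypothetical helper
# standing in for the production runner's endpoint query.
NOISE_FLOOR_PP = 1.0
POSITIONS = ["UTG", "MP", "CO", "BTN"]            # SB handled separately
RAKE_SETTINGS = [(0.00, 0.0), (0.03, 3.0), (0.05, 1.0)]  # (pct, cap_bb)

def check_a1_strict(vpip_by_position) -> list[str]:
    violations = []
    for rake_pct, cap_bb in RAKE_SETTINGS:
        vpip = vpip_by_position(rake_pct, cap_bb)  # {"UTG": 36.4, ...} in %
        for early, late in zip(POSITIONS, POSITIONS[1:]):
            # Later positions should open at least as wide, minus the floor.
            if vpip[early] > vpip[late] + NOISE_FLOOR_PP:
                violations.append(
                    f"{early} {vpip[early]:.1f}% > {late} {vpip[late]:.1f}% "
                    f"(rake {rake_pct:.0%}/{cap_bb}bb cap)"
                )
    return violations
```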
R4 · Easy Game (Seidman) · K72r UTG c-bet — Resolved · endpoint matches GTOw
Quintace's latest endpoint c-bet rate on K72r matches GTO Wizard's published number within solver-noise tolerance.
| Source | UTG c-bet on K72r | Notes |
| --- | --- | --- |
| Quintace endpoint (latest, 2026-05-11) | 88.6% | V2 strategy_grid, RVV2-exact payload, model moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx |
| GTO Wizard published | 88% | Article reference |
| Article (Seidman v1, stale snapshot) | 59% | Pre-v2.0; surface was pinning a deprecated model |
| Article (Seidman v2.0 refresh) | 79% | Different surface aggregation than the endpoint above |
Resolution: Quintace's latest endpoint (88.6%) and GTOw (88%) agree within 0.6pp. No model-side discrepancy on this spot.
Article numbers worth noting: the Seidman article shipped 59% in v1 (stale model), then 79% in v2.0 (current model, but rendered through the article's own pull script). Neither matches the underlying endpoint's 88.6%. The article-side numbers come from a different surface aggregation than the canonical endpoint — an article-pipeline question, not a model question, and separate from this finding's resolution.
Test script: tests/pull_postflop_utg_cbet_boards.py · also confirmed in tests/pull_hygiene_check_all.py (verifies query setup against this canonical reference).
R5 · adjusting-to-donk-bets-in-srp · fold-to-donk — Model results match GTOw · article killed for editorial reasons
BB-lead / SB-call donk-bet defense — UTG response to BB donk in SRP
Reviewer flagged that Quintace "shows little folding vs donk" compared to GTOw's ~20% fold reference at b33. Endpoint says otherwise.
| Board | Donk size | Quintace endpoint UTG fold% | GTOw reference |
| --- | --- | --- | --- |
| 654r | 2.0bb (~36% pot) | 20.4% | ~20% |
| 765r | 2.0bb (~36% pot) | 22.4% | ~20% |
| T98r | 2.0bb (~36% pot) | 20.3% | ~20% |
| 654r | 1.0bb (~18% pot) | 2.4% | — |
| 765r | 1.0bb | 3.3% | — |
| T98r | 1.0bb | 12.7% | — |
Endpoint result: UTG fold% against BB's 2.0bb donk (≈b33) is 20-22% across 3 standard donk boards — essentially identical to the GTOw ~20% reference the reviewer cited. The "Quintace under-folds vs donk" claim does not hold at the model layer. The article's published number that triggered the reviewer's flag came from a different surface aggregation than the canonical endpoint.
Article kill still stands — but for scope, not fold-rate divergence. The reviewer's other concerns remain valid kill reasons:
BB-lead boards are 0% reach at equilibrium (BB never donks these in the solve, so analyzing "how to defend the donk" is testing a near-zero-frequency node)
SB cold-call preflop is only 3% reach (the article's preflop assumption is itself rare at equilibrium — most SB plays a 3-bet-or-fold strategy)
The article was teaching defense at a premise-broken decision tree. Killing the article is correct; the underlying model behavior on donks is fine.
Side note: At smaller donk sizes (1.0bb / b18) the endpoint folds 2-13% on 654/765/T98 — closer to the "little folding" pattern. If the article's published number was based on a smaller donk size than the b33 the reviewer compared against, that explains part of the reviewer's confusion. The article-pipeline issue (surface aggregation different from endpoint) is the same family as the R4 Seidman issue and is being routed to the article team separately.
Test script: tests/pull_r5_bb_donk_defense.py
R7 · EV-display variance for folded hands P1 · B1 known-failure · KI-4 / G-series · model EV noise on rarely-trained buckets
Per-combo raise-EV varies wildly across "similar" folded hands at RFI positions
🃏 Hand: Per-combo EV at preflop RFI · Cash NLHE 6-max 100bb · all 5 RFI positions (focus on folded combos like 26o, 72o, 82o) · Reporter: internal observation (2026-05-11) · Date: 2026-05-11 endpoint sweep
Per-combo raise-action EV varies dramatically across folded hands at RFI positions — standard junk (72o, 82o, etc.) shows sensible negative raise EVs (−0.2 to −2.0 bb), but rarely-played buckets like 26o show absurdly positive raise EVs (e.g. 26o = +6.77 bb at UTG despite folding 100%). Same family as B1 KI-4 (model EV shape concerns) and G-series violations.
2026-05-11 re-run of B1 I-series properties against current production model showed I3 violation rate 28.5% (12,988 / 45,598) and I5 violation rate 66.5%. The EV-shape issue is not just 26o — it's pervasive across hand classes including premium ranges (AA call vs raise gap = 1.27 bb).
Includes: existing B1 property coverage (26 EV-related properties), 2026-05-11 I-series re-run results, sample violations (AA, AKs, 26o), why production B1 didn't catch it, two-track resolution plan (re-run + verify implementations), suggested next steps, open questions for Trung (EV owner) + Nimit (B1 framework).
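As a sketch of the symptom guard (not the B1 DSL — the record shape and thresholds here are assumptions):

```python
# Sketch: flag combos the model folds ~100% that nonetheless show large
# positive raise EV. Record shape is an assumption, not the B1 schema.
FOLD_FREQ_MIN = 0.99     # "pure fold" threshold
RAISE_EV_MAX_BB = 0.0    # a pure-folded combo should not show positive raise EV

def flag_ev_shape_violations(combos):
    """combos: iterable of dicts like
    {"hand": "26o", "fold_freq": 1.0, "raise_ev_bb": 6.77}"""
    return [
        c for c in combos
        if c["fold_freq"] >= FOLD_FREQ_MIN and c["raise_ev_bb"] > RAISE_EV_MAX_BB
    ]

# Example from the finding: 26o folds 100% yet shows +6.77bb raise EV.
sample = [{"hand": "26o", "fold_freq": 1.0, "raise_ev_bb": 6.77},
          {"hand": "72o", "fold_freq": 1.0, "raise_ev_bb": -1.4}]
assert [c["hand"] for c in flag_ev_shape_violations(sample)] == ["26o"]
```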
R9 · Stand-Up depth-flip on Klu T9 at val=10 P2 · B1 known-failure · stack-depth continuity violated · SQ4 property needed
Klu T9 verdict flips call → fold between 100bb and 226bb at val=10 (HCL 5-way Stand-Up)
🃏 Hand: HCL 5-way Stand-Up · Francisco AT · Mariano Q7 · Adi AK · Klu T9 · Peter 22 · 5-way preflop all-in · Jan 3 2025 · Scenario D (Peter+Adi+Klu standing) · val=10 · Surfaced by: HCL 5-way comic article (May 2026) · Source: hcl-five-way-stand-up/deviation-log.md H1 · Date: 2026-05-11
At Stand-Up val=10 / Scenario D, Klu's T9 verdict flips categorically: 100bb = call 35%, 226bb = fold 100% (−65pp swing across just 126bb of depth). NLHE cash control sweep (50-1000bb on UTG RFI) shows total swing of −1.73pp — depth-stable. The Stand-Up flip is ~16× larger in magnitude across ~8× less depth → Stand-Up-specific phenomenon, not a general model-depth issue.
B1 framework gap: property B1 (stack-depth continuity) exists and applies to Stand-Up, but the production runner caps the stack range at 100bb, tests only the UTG open node, and doesn't sweep the val parameter. A new SQ4 property (stack-depth × val × state continuity) is needed to catch this automatically.
Related context (Mariano AA article): at deep Stand-Up Game stack depths (~1260bb in the Mariano AA hand), the model's verdicts flip fold ↔ jam ↔ call across small parameter perturbations and violate val-monotonicity. Root cause is a training-coverage gap; the OOD symptom is consistency violation. The article shipped only 500bb-validated queries as editorial mitigation.
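A minimal sketch of the proposed SQ4 shape — sweep stack depth × val and flag categorical verdict flips between adjacent grid points. The grid values echo this finding; verdict() is a hypothetical accessor for the modal action at a node:

```python
# Sketch of SQ4: stack-depth x val continuity. verdict() is a hypothetical
# accessor returning the modal action ("call" / "fold" / "jam") at a node.
STACKS_BB = [50, 100, 150, 226, 500, 1000]
VALS = [6, 8, 10, 12]

def sq4_flips(verdict) -> list[str]:
    flips = []
    for val in VALS:
        for lo, hi in zip(STACKS_BB, STACKS_BB[1:]):
            a, b = verdict(stack_bb=lo, val=val), verdict(stack_bb=hi, val=val)
            if a != b:  # categorical flip across adjacent depths
                flips.append(f"val={val}: {lo}bb={a} -> {hi}bb={b}")
    return flips
```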
R10 · PLO preflop RFI hygiene (100bb 6-max) — UTG/MP/CO opening thresholds too wide ISSUE · Model
First PLO hygiene pass mirroring the NLHE/MTT pattern. RFI VPIP vs two independent solver-grade references (100bb 6-max GTO PLO):
| Position | Quintace | Upswing (Monker) | PLO Genius (proprietary) | Avg ref | Δ vs avg | Verdict |
| --- | --- | --- | --- | --- | --- | --- |
| UTG | 36.8% | 17.9% | 16.8% | 17.35% | +19.45pp | DIVERGE |
| MP (HJ) | 33.8% | 21.8% | 21.5% | 21.65% | +12.15pp | DIVERGE |
| CO | 37.8% | 30.0% | 29.0% | 29.50% | +8.30pp | DIVERGE |
| BTN | 47.3% | ~42.5%* | 48.0% | 45.25% | +2.05pp | OK |
*BTN Upswing value is community consensus (less reliable than the two solver-grade refs); PLO Genius gives 48%. The two solver-grade references corroborate each other within 1.1pp at UTG/MP/CO — the divergence is robust across independent solvers, not an "Upswing happens to be tight" artifact.
Diagnosis — what the model gets right, and where the gap lives:
Limps are negligible (0.02%) — the 36.8% UTG VPIP is essentially all raises, directly comparable to the "opening range" semantics in both Upswing and PLO Genius references.
The +19.45pp UTG divergence is a shifted opening threshold — the model classifies more borderline hands as "open" than either reference solver does. Gap is largest at UTG, shrinks progressively to BTN (where Quintace lands in-range). Pattern fits "open threshold mis-calibrated wider in early positions."
Related (but distinct) findings: a separate monotonicity inversion (UTG > MP) is captured in R3. A separate rake-sensitivity diagnosis is in R16 (resolved — model is rake-flat at low rake, rake-aware at higher rake). R10 = absolute threshold too wide; R3 = positional ordering wrong; R16 = rake sensitivity correctly handled. R10's +18.9pp UTG over-loose still holds at heavy rake (R16 verified: +15.9pp gap at 10%/1cap).
Methodology notes (not findings):
ploStrategy returns 856 canonical PLO4 classes (out of 270,725 specific combos), fixed across every parameter swept. Upswing and PLO Genius references are also class-aggregated, so comparisons are apples-to-apples.
actionLabels[].strategy is NOT exposed for PLO (unlike NLHE/MTT) — per-combo aggregation via Metrics.plo_preflop() is the only path; the audit confirmed it correct. Audit + diagnosis scripts: tests/probe_plo_methodology_audit.py + tests/probe_plo_query_diagnosis.py.
Limp-inclusion confound (checked): our "VPIP" sums limp + raise, while Upswing's reference is raise-only. Model limp by position: UTG 0.0% · MP 1.0% · CO 1.2% · BTN 4.3%. At UTG (where the gap is biggest, +18.9pp), limp = 0 — the gap is pure raise-vs-raise. At MP/CO, excluding limp narrows the gap by ~1pp (CO 7.8→6.6pp, still WARN). At BTN, excluding limp closes most of the gap (4.8→0.5pp) — consistent with BTN's existing OK verdict in the table. Net: limp inclusion does not explain the UTG/MP/CO over-loose finding (see the aggregation sketch after this list).
Raise-size confound (checked): the model effectively opens at ~3.5× only (2.5× sizing used 0.8% of the time per query diagnosis). Upswing's "% of hands that open" is size-agnostic (any raise counts), so comparing open-frequency vs open-frequency is sizing-invariant.
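A sketch of the limp-exclusion arithmetic behind the confound check — the class-record shape is illustrative; the real aggregation path is Metrics.plo_preflop():

```python
# Sketch of the limp-inclusion confound check. Class records are illustrative;
# the real per-class aggregation path is Metrics.plo_preflop().
def vpip_pct(classes, include_limp: bool) -> float:
    """classes: [{"weight": combos_in_class, "limp": p, "raise": p}, ...]"""
    total = sum(c["weight"] for c in classes)
    vpip = sum(
        c["weight"] * (c["raise"] + (c["limp"] if include_limp else 0.0))
        for c in classes
    )
    return 100.0 * vpip / total

# At UTG limp frequency is 0%, so both aggregations coincide and the
# +18.9pp raise-vs-raise gap against Upswing stands unchanged.
```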
📋 Full finding doc: findings/2026-05-11_plo-rfi-hygiene-100bb-6max.md — includes external-reference sourcing, hypothesis decomposition (MODEL vs methodology), parameter-sensitivity audit, triage routing, and next-step list (postflop paired-boards hygiene as the second KVL half; non-monotone num_players response as a sub-investigation).
Triage routing: Primary owner gameplay AI (model layer — Luong-Ha / Yaroslav, cc Brad since this extends R3). Methodology questions for Ha on the 856-class reduction scheme (informational, not load-bearing).
R11 · MTT ICM-block conditioning — payouts shape + alive/entries used; per-seat rank, prize pool, absolute entries ignored
MTT model conditions on TWO ICM-block inputs (payouts schedule shape + alive/entries threshold) and ignores THREE (hero per-seat rank, prize pool, absolute entries). The earlier "single-feature" claim was wrong — the original payouts test wrote to a non-existent field.
Self-correction: the initial R11 said payouts shape was silent. That was wrong — the test in probe_mtt_icm_block_only.py assigned to mtt.payouts (a field the server does not read), so the actual payout schedule never changed across the test. Rerun via probe_mtt_payouts_field_correct.py mutating the correct mtt.payout_structure_limit_top10 field shows the model DOES respond.
What the model actually uses (corrected)
1. Payouts schedule shape — 5.14pp RFI spread across schedules at the same spot. Direction is physically correct (flat payouts → tighter; winner-take-all → wider):
| Payout schedule | RFI | Δ vs standard |
| --- | --- | --- |
| Standard (geometric decay, default) | 11.83% | baseline |
| Flat (every paid place equal) | 8.76% | −3.07pp |
| Top-heavy (70/18/8 + tail) | 13.79% | +1.96pp |
| Winner-take-all | 13.90% | +2.07pp |
2. Alive / entries ratio — threshold step around ~5% of field remaining. Above the threshold, RFI ≈ 13.33% (baseline); below it, RFI snaps to 10.41% regardless of how it got there (16/1000, 9/1000, and 200/100000 all give an identical 10.41%). 3.39pp effect.
3. Absolute entries (ignored, spot-checked) — 0.00pp effect: identical RFI across all 5 values tested when the alive/entries ratio is held fixed (see the probe sketch after this list).
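A minimal sketch of the two probes behind items 2 and 3 — rfi_for(alive, entries) is a hypothetical helper wrapping the MTT endpoint query; the grid values for item 2 are from the finding, those for item 3 are illustrative:

```python
# Sketch of the alive/entries probes. rfi_for(alive, entries) is a
# hypothetical helper that queries the MTT endpoint and returns RFI in %.
def probe_alive_entries(rfi_for):
    # Item 2: configs below the ~5% alive/entries threshold all snap to the
    # same 10.41% RFI regardless of absolute field size.
    below = [rfi_for(16, 1000), rfi_for(9, 1000), rfi_for(200, 100_000)]
    assert len(set(below)) == 1, f"below-threshold RFI should be identical: {below}"

    # Item 3: with the ratio held fixed (here 10%), absolute entries is a no-op.
    fixed_ratio = [rfi_for(n, n * 10) for n in (50, 100, 500, 1000, 5000)]
    assert len(set(fixed_ratio)) == 1, f"ratio-only dependence expected: {fixed_ratio}"
```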
Implication for hygiene checks
External ICM solvers (ICMIZER, HRC) condition on per-seat ranks, payout slope, AND bounty dollar amounts. Our model captures payout slope and the alive-threshold step, but NOT per-seat ranks or prize-pool magnitude. Hygiene checks against ICM references need to match the payout schedule precisely (the dominant ICM-block input) and accept that rank-conditioned reference differences won't be reproduced.
Triage routing: Trung (model EV / ICM-feature ownership). Two distinct questions to decide:
(a) Is per-seat rank intentionally ignored? The model has no chip-position awareness — chipleader vs shortstack at the same overall stack distribution returns identical strategy. If intentional (universal model treats ICM only through aggregate / payouts), document it. If not, add rank conditioning.
(b) Is prize_pool intentionally ignored? $1k vs $10M tournaments produce bit-identical output. This is consistent with chip-EV scaling, but reads odd to coaches who expect higher-stakes situations to influence play.
How-to: shared/tools/USING_MTT_ENDPOINT.md updated to reflect the corrected behavior (payouts shape now flagged as USED, with the gotcha that the schedule must be written to mtt.payout_structure_limit_top10 — not a sibling payouts field).
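The gotcha in code form — only the two mtt field names come from the finding; the surrounding payload shape is an assumption:

```python
# The corrected payouts gotcha from R11. Only the two field names below come
# from the finding; the surrounding payload shape is an assumption.
payload = {"mtt": {}}

# WRONG: the server never reads this field, so the schedule silently stays
# at the default and the model output never changes.
payload["mtt"]["payouts"] = [0.70, 0.18, 0.08]

# RIGHT: the model conditions on the schedule written here.
payload["mtt"]["payout_structure_limit_top10"] = [0.70, 0.18, 0.08]
```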
R12 · Mystery bounty silently equivalent to normal across all configs P2 · ISSUE · Model · bounty_type=mystery_bounty no-op
11 mystery_bounty configurations spanning mysterious_prize_left $0 → $10M, head_bounties $0 → $10K, bounty_proportion 0.1 → 0.9 all return bit-identical output to bounty_type=normal. Model does not condition on any mystery-bounty field.
Reference behavior — PKO and knockout bounty types correctly tighten RFI by 6–10pp (incentive to hunt elimination):
| Bounty type | RFI | Δ vs normal |
| --- | --- | --- |
| normal | 11.83% | 0 |
| PKO uniform bounty=50 | 5.70% | −6.13pp |
| Flat KO bounty (initial=50) | 4.43% | −7.40pp |
Mystery-bounty sweep — 11 configurations, every one returns RFI=11.83% (bit-identical to normal):
| Config | mysterious_prize_left | head_bounties | bounty_proportion | RFI | Δ vs normal |
| --- | --- | --- | --- | --- | --- |
| no extras (mpl=None) | — | — | 0.5 | 11.83% | 0.00pp |
| orig R12 test | 2,500 | [50]×8 | 0.5 | 11.83% | 0.00pp |
| zero everything | 0 | [0]×8 | 0.5 | 11.83% | 0.00pp |
| small | 100 | [10]×8 | 0.5 | 11.83% | 0.00pp |
| mid | 10,000 | [100]×8 | 0.5 | 11.83% | 0.00pp |
| large | 100,000 | [500]×8 | 0.5 | 11.83% | 0.00pp |
| huge | 1,000,000 | [5,000]×8 | 0.5 | 11.83% | 0.00pp |
| varied head_bounties | 10,000 | [500,300,200,100,50,50,30,20] | 0.5 | 11.83% | 0.00pp |
| low proportion | 10,000 | [100]×8 | 0.1 | 11.83% | 0.00pp |
| high proportion | 10,000 | [100]×8 | 0.9 | 11.83% | 0.00pp |
| bounty pool >> prize pool | 10,000,000 | [10,000]×8 | 0.5 | 11.83% | 0.00pp |
Verdict: spread = 0.00pp across 11 configurations including extremes. All payload JSON hashes differ (so the JSON payloads truly vary on the wire), but the model produces bit-identical output. bounty_type=mystery_bounty is a no-op — the model treats it identically to normal, with the auxiliary fields ignored regardless of value.
Triage routing: Trung / Gameplay AI. Either (a) flag bounty_type=mystery_bounty as unsupported at the endpoint layer (return 400 or warning), or (b) add mystery-bounty conditioning to the universal model. Option (a) is the minimum so downstream consumers (CAI, AceCoach) don't silently get normal-MTT advice for mystery tournaments.
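The detection pattern behind the verdict, as a sketch — query() is a hypothetical helper returning raw response bytes:

```python
# Sketch of the no-op check: payloads must differ on the wire while model
# output stays bit-identical. query(payload) is a hypothetical helper
# returning the raw response bytes.
import hashlib
import json

def is_noop(base_payload: dict, variant_payloads: list[dict], query) -> bool:
    all_payloads = [base_payload, *variant_payloads]
    payload_hashes = {
        hashlib.sha256(json.dumps(p, sort_keys=True).encode()).hexdigest()
        for p in all_payloads
    }
    assert len(payload_hashes) > 1, "variants must actually differ on the wire"
    output_hashes = {hashlib.sha256(query(p)).hexdigest() for p in all_payloads}
    return len(output_hashes) == 1  # one output hash across all configs => no-op
```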
R13 · mtt.players[].stack is dead input P3 · ISSUE · Serving · payload schema redundancy
Setting mtt.players[].stack to arbitrary values has zero effect on output — only hand.players[].stack is read by the model. Verified across 9 configurations with hand and mtt stack blocks varied independently.
Verification sweep — 9 configurations, hand and mtt stack blocks varied independently:
| hand.players[].stack | mtt.players[].stack | RFI | Δ vs baseline |
| --- | --- | --- | --- |
| REAL_SPREAD (baseline) | REAL_SPREAD | 17.39% | 0.00pp |
| REAL_SPREAD | all 1bb (extreme) | 17.39% | 0.00pp |
| REAL_SPREAD | all 1000bb (extreme) | 17.39% | 0.00pp |
| REAL_SPREAD | VERY VARIED (200/5/10/100/80/60/25/8 bb) | 17.39% | 0.00pp |
| REAL_SPREAD | REAL_SPREAD seat-swapped | 17.39% | 0.00pp |
| UNIFORM 30bb | REAL_SPREAD | 16.03% | −1.36pp |
| UNIFORM 30bb | VERY VARIED | 16.03% | −1.36pp |
| hero@5bb / others 50bb | hero@10bb / others 50bb | 31.83% | +14.44pp |
| hero@100bb / others 50bb | hero@10bb / others 50bb | 16.12% | −1.27pp |
Pattern: when hand.players[].stack is held constant, RFI is bit-identical regardless of what mtt.players[].stack contains. When hand.players[].stack changes, RFI changes dramatically. Within each "hand=X" group: spread = 0.00pp. mtt.players[].stack is silently ignored.
Classification: unlike R11 and R12 (model-behavior issues), R13 is a payload-schema redundancy. The model itself is not wrong — the payload format has two redundant stack fields where only one is actually consumed.
Triage routing: Service team (Scott / Adil). Either drop mtt.players[].stack from the payload schema (it's misleading — clients can spend time syncing both fields thinking both matter) or make it the authoritative source so the client lib can simplify (one stack field per seat instead of two). Affects strategy_grid_client.MttPreflop._build_mtt_hand(), which currently fills both blocks from the same source.
Documentation: captured as a gotcha in shared/tools/USING_MTT_ENDPOINT.md so future agents know which field actually matters.
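A minimal sketch of the dead-input check behind the table above — build_payload() and rfi_for() are hypothetical stand-ins for the client-lib plumbing:

```python
# Sketch of the dead-input check behind the R13 table. build_payload() and
# rfi_for() are hypothetical stand-ins for the client-lib plumbing around
# strategy_grid_client.MttPreflop._build_mtt_hand().
def check_mtt_stack_is_dead(build_payload, rfi_for):
    hand_stacks = [200, 5, 10, 100, 80, 60, 25, 8]   # held constant throughout
    baseline = rfi_for(build_payload(hand_stacks=hand_stacks,
                                     mtt_stacks=hand_stacks))
    for mtt_stacks in ([1] * 8, [1000] * 8, list(reversed(hand_stacks))):
        rfi = rfi_for(build_payload(hand_stacks=hand_stacks,
                                    mtt_stacks=mtt_stacks))
        # Current behavior: bit-identical output, i.e. the mtt block is ignored.
        assert rfi == baseline, f"mtt.players[].stack affected output: {rfi}"
```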
R20 · Cash universal model has near-flat BB-defense curve across stack depths P1 · ISSUE · Model · cash · depth-insensitive defense
Cash universal model defends BB at ~80–83% VPIP across all stack depths from 20bb to 100bb (3.3pp range) — qualitative theory expects depth-sensitivity (deeper → wider defense for more postflop equity realization). MTT model on the same spot shows the expected curve (64.8% → 85.8% across the same depth range, 21pp range).
🃏 Scope: NLHE cash · BB defends vs BTN 2.0x open · 8-max · symmetric stacks · ante=12 chips (0.12bb) · 0 rake · 7 stack depths from 20bb to 100bb · Surfaced by: Solver-QA agent · BB-defense depth curve probe (originally a side-finding while triangulating R14) · Test script: tests/probe_bb_defense_depth_curve.py · output probe_bb_defense_depth_curve.out.json · Server endpoint: https://preview.rlserv.aceguardianrl.com/api/strategy_grid · Cash model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx · MTT model (for comparison): universal-dense-v4-player_20260402_150328.onnx · Date: 2026-05-11
VPIP curve — BB defends vs BTN 2.0x by stack depth, both models on identical payloads (only stack varies):
| Stack | MTT model VPIP | Cash model VPIP | MTT − Cash |
| --- | --- | --- | --- |
| 20bb | 64.76% | 80.77% | −16.01pp |
| 25bb | 66.85% | 79.70% | −12.85pp |
| 30bb | 70.20% | 79.71% | −9.51pp |
| 40bb | 78.17% | 79.94% | −1.77pp |
| 50bb | 81.16% | 80.82% | +0.34pp |
| 75bb | 84.73% | 82.36% | +2.37pp |
| 100bb | 85.82% | 82.98% | +2.84pp |
| Spread | 21.06pp | 3.28pp | |
3.28pp
Observations:
Cash model is essentially depth-blind in this spot. 3.28pp VPIP range from 20bb to 100bb (a 5× change in stack depth). Theory expects substantial widening with depth — postflop playability is the main argument for defending more hands at 100bb than 20bb.
MTT model produces the expected shape. 21.06pp VPIP range across the same depths, monotone increasing — short stacks defend tight, deep stacks defend wide. Direction and magnitude both look right.
The two models cross around 40–50bb. At 30bb the cash model defends 9.5pp wider than MTT; at 100bb the cash model defends 2.8pp narrower. Two different shapes, not just a level shift.
Single-spot caveat: this curve is from one spot (BB vs BTN 2.0x). Before drawing strong conclusions about the cash universal model, this should be reproduced at other defense spots (BB vs CO, BB vs HJ, SB vs BTN) and at different open sizes. The pattern may be specific to BB-vs-BTN or generalize across cash defense; only further testing tells us which.
Triage routing: Trung / Gameplay AI. Two questions to answer:
(a) Does the depth-insensitivity hold at other cash defense spots? Run the same depth sweep for BB vs CO, BB vs HJ, SB vs BTN (sketch after this list). If the pattern is uniform across defense spots, that's strong evidence of a cash-model calibration gap on stack-depth conditioning.
(b) If real, is depth-insensitivity intentional or a defect? Cash universal may be trained primarily at 100bb and use an architecture that doesn't explicitly condition on depth — in which case the model's 80% at 20bb is the 100bb-equilibrium response, unaware of the short-stack regime. That would be a known training-distribution limit; the fix is either retraining with mixed depths or surfacing the depth limitation downstream.
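A sketch of the question-(a) sweep — defend_vpip() is a hypothetical query helper, and the CO/HJ open sizes are placeholders, not solver-standard numbers:

```python
# Sketch of the reproduction sweep for question (a). defend_vpip() is a
# hypothetical query helper; the CO/HJ open sizes are placeholders.
DEPTHS_BB = [20, 25, 30, 40, 50, 75, 100]
SPOTS = [("BB", "BTN", 2.0), ("BB", "CO", 2.3), ("BB", "HJ", 2.3), ("SB", "BTN", 2.0)]

def depth_spreads_pp(defend_vpip):
    spreads = {}
    for hero, opener, open_x in SPOTS:
        curve = [defend_vpip(hero, opener, open_x, depth) for depth in DEPTHS_BB]
        # ~3pp spread reproduces the depth-blind cash pattern; ~20pp matches MTT.
        spreads[(hero, opener)] = max(curve) - min(curve)
    return spreads
```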
Related: independent of retracted R14. R14's "MTT model under-defends" framing was wrong — the depth curve shows the MTT model has a sensible curve and the gap was against an unverified reference. R20 is the opposite: the cash model's curve is the suspect one.
R15 · Chip-EV RFI divergence vs GTOw — multi-faceted gap pattern across MTT + Cash models P1 · ISSUE · Model · structural · 5 spots × 2 models vs verified GTOw references
Group finding covering 5 chip-EV RFI spots compared against directly-verified GTOw blog references. Both models tend to under-open at deep/mid stacks; MTT model gap widens at HJ 30bb; cash model flips to over-open at 5bb push-fold range; MTT model uses a sharp push-fold transition between 5bb and 17bb. Sub-findings are grouped because they share methodology and source references.
🃏 Scope: 8-max chip-EV RFI · 4 UTG stacks (5/17/50/100 bb) + 1 HJ 30bb spot · ante=12 chips (0.12bb) · 0 rake · symmetric stacks · alive/entries=95% (well above R11's bubble threshold → chip-EV regime on the model side) · References: GTOw blog articles, verified directly in this scrutiny pass
(a) Both models under-open at deep + mid UTG stacks. Gap is 1.5–3.2pp for MTT model, 1.0–2.1pp for cash model at UTG 100/50/17 bb. Direction is consistent (both tighter than GTOw); magnitude is near solver-noise tolerance but the consistent same-direction bias across 3 stack depths suggests a real (small) calibration drift.
(b) MTT model gap widens dramatically at HJ 30bb. One non-UTG data point shows MTT model 7.1pp tighter than GTOw vs the ~2pp UTG gap. Suggests position-conditioned bias (wider gap at later positions). Caveat: single non-UTG spot; need MP, CO, BTN coverage to confirm the pattern.
(c) Cash model FLIPS sign at 5bb (over-opens by 4.66pp). Cash universal is 1–2.7pp tight at 100 / 50 / 17 / 30 bb, then +4.66pp WIDER at 5bb. Cash model uses 21.86% all-in at 5bb vs GTOw's ~20% total RFI (where almost all should be jams). The cash model isn't push-fold trained — once min-raise becomes structurally infeasible, it gambles all-in too liberally.
(d) MTT model has a sharp push-fold transition between 5bb and 17bb. At 100/50/17 bb, MTT model all-in usage is 0.000–0.004% (essentially never jams). At 5bb, all-in is 17.60% (essentially the entire RFI). GTOw transitions smoothly across this range; the MTT model has a binary switch. Caveat: no data at intermediate stacks (8–14 bb) — exact transition shape unknown.
(e) MTT model uses a different sizing tree at HJ 30bb. Both models have the same available sizes (2.0 / 2.5 / 4.5 / 6.0 / 9.5x). Cash model uses 2.0 + 2.5x as modal opens (10.97% + 15.66%). MTT model uses 4.5x as the single modal open (21.83%) with near-zero weight on 2.0x (0.003%). Same spot, very different sizing-tree resolutions.
Methodology
All 5 audits use identical parameter alignment to the GTOw reference: 8-max NLHE, symmetric stacks, ante=12 chips (0.12bb per player), 0 rake, chip-EV regime. Single known parameter mismatch: 0.5-chip ante rounding (12 vs 12.5 chips) — per ante sensitivity probe, accounts for ≤ 0.25pp of any gap. Reference numbers were re-verified by direct WebFetch of the primary GTOw blog posts (not relying on secondary citations). Side-by-side script + JSON outputs are at external-solver-benchmark/tests/single_spot_audit_*.{py,out.json}; each script prints the full payload for spot-check.
Open scope (what would strengthen this finding)
Fill in UTG curve at 30bb, 20bb, 14bb, 2bb (all have GTOw published numbers) to validate the curve shape.
Add MP, CO, BTN at 30bb chip-EV to confirm (or refute) position-conditioning of the MTT model gap.
Add 8bb, 10bb, 12bb, 14bb data points to characterize the MTT model's push-fold transition shape.
Verify the cash-model 5bb over-jam holds across positions (not just UTG).
Triage routing: Trung (model EV / chip-EV calibration). The data is parameter-aligned and the references are now directly verified from the source GTOw articles. Four distinct concerns grouped here:
(a) Both universal models under-open at deep/mid stacks vs GTOw chip-EV — gap is small (1–3pp) but consistent direction across UTG 100/50/17/30 bb. The MTT model has an additional 1pp drift on top of the cash model's shared bias.
(b) MTT model gap widens at wider opening positions (HJ 30bb shows 7pp gap) — single data point; needs more position coverage to confirm position-conditioning.
(c) Cash model over-jams at 5bb (push-fold zone) — separate calibration concern at short stacks. Cash model was likely trained primarily at 100bb and extrapolates poorly to push-fold range.
(d) MTT model has binary push-fold switch and uses unusual sizing-tree choices — affects coaching surfaces that display sizing decisions.
R16 · PLO preflop rake-flat at low rake, rake-aware at higher rake — Resolved · correct model behavior · SB notably more rake-sensitive than other positions
Original "rake invariant" claim was a sampling artifact; preflop DOES respond at 5%+ rake; postflop also responds
Initial observation (preflop, 2 rake values): PLO preflop RFI output was identical across 0% rake and 3% / 3 BB cap — hypothesized "rake silently ignored."
Postflop follow-up (BTN c-bet, 3 boards × 5 rake settings): postflop DOES respond to rake (2.10–4.80pp range per board). So endpoint consumes rake fields; field-strip hypothesis refuted.
Per-position preflop sweep (5 positions × 5 rake settings): preflop ALSO responds when swept at heavier rake values. Original "invariant" claim was a sampling artifact — 0% and 3%/3 BB cap happen to produce identical output (3 BB cap rarely binds on small preflop pots, making both effectively rake-free for preflop EV).
Full picture — preflop VPIP by position × rake:
| Position | 0% | 3%/3cap | 5%/3cap | 5%/1cap | 10%/1cap | Range |
| --- | --- | --- | --- | --- | --- | --- |
| UTG | 36.80% | 36.80% | 35.90% | 36.40% | 33.80% | 3.00pp |
| MP | 33.80% | 33.80% | 32.70% | 33.20% | 30.60% | 3.20pp |
| CO | 37.80% | 37.80% | 36.40% | 37.00% | 33.80% | 4.00pp |
| BTN | 47.30% | 47.30% | 45.70% | 46.40% | 43.70% | 3.60pp |
| SB | 46.00% | 46.00% | 42.30% | 43.60% | 38.50% | 7.50pp |
Three takeaways:
Low-rake equivalence is correct. 0% and 3%/3 BB cap produce identical output across all 5 positions — both are functionally "no effective rake" at small preflop pot sizes. The model correctly treats them as equivalent.
Higher-rake response is real and directional. At 5%+ rake, the model tightens preflop ranges 3–4pp across non-SB positions; SB tightens 7.5pp. Direction is correct (heavier rake → tighter range).
SB rake-sensitivity is notable. SB shows 2× the rake response of other positions. Consistent with GTO intuition — SB has positional disadvantage, so every BB of rake hurts SB EV more than other positions; tightening opens compensates.
Postflop response (sub-data): BTN c-bet across 3 boards × 5 rake settings — bet% range 2.10pp (Ts9s8s) / 3.80pp (KsKd2c) / 4.80pp (Ks7d2c). Postflop also responds, as expected.
Resolution: the original "rake silently ignored" framing was wrong. The model DOES respond to rake at both preflop and postflop; the initial observation was a sampling artifact (0% and 3%/3cap both fall in the rake-flat zone). With a finer rake sweep, response is real, directional, and matches GTO expectations.
Open sub-questions: SB shows 2× the rake response of other positions — could be a real positional effect, could be model noise; worth comparing against an external SB-rake-sweep reference if one exists. Postflop response on KsKd2c (paired board) is non-monotone — 10%/1cap shows MORE betting than 5%/1cap; possibly model noise at heavy rake, possibly a real rake × texture interaction.
Implication for R10: R10's +18.9pp UTG over-loose finding was measured at 0% and 3%/3cap (both in the rake-flat zone). At 10%/1cap, UTG opens at 33.8% — still well above Upswing's 17.9% reference (+15.9pp gap). The R10 over-loose calibration holds even at heavy rake; doesn't disappear under realistic rake structures.
R17 · KVL ea5 — NLHE BTN open average sizing claim is stale — Resolved · current endpoint matches Pio/GTOW reference
KVL register entry claimed avg_raise_bb=4.8 at BTN; current endpoint produces 2.81 — within Pio/GTOW 2.0–3.0 expected range
The KVL register entry claimed BTN open avg_raise_bb = 4.8 BB, well above Pio/GTOW's expected 2.0–3.0 BB. The 4.8 figure was logged previously against an older model; the current endpoint output is sharply different.
| Rake setting | Fold | Call | Raise | avg_raise_bb | Verdict |
| --- | --- | --- | --- | --- | --- |
| 0% rake | 59.35% | 0.33% | 40.32% | 2.81 | OK |
| 3% / 3 BB cap | 59.35% | 0.33% | 40.32% | 2.81 | OK |
Sizing breakdown (when raising): 71.5% at 2.5× · 26.9% at 3.5× · 1.7% at 5.0×. Weighted average = 2.81 BB, right in the middle of the Pio/GTOW 2.0–3.0 expected range.
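The weighted-average arithmetic checks out; a one-line sanity check:

```python
# Sanity check on the avg_raise_bb arithmetic from the sizing breakdown.
sizes = {2.5: 0.715, 3.5: 0.269, 5.0: 0.017}
avg = sum(s * w for s, w in sizes.items()) / sum(sizes.values())
print(round(avg, 2))  # -> 2.81
```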
Resolution: the KVL register entry is stale. The current endpoint produces BTN avg_raise_bb = 2.81, within Pio/GTOW expected range. The original 4.8 figure was from an older model snapshot that has since been retrained. No action needed at the model layer. Removing from Tier 3 KVL register.
Note: NLHE preflop opens are also confirmed rake-invariant at low rake (0% and 3%/3cap produce identical output) — the same pattern as PLO preflop at low rake values (R16). Behavior is consistent across game types.
R18 · KVL q-cash-bb-btn-3bet — BB 3-bet vs BTN matches GTOw within tolerance — Resolved · model matches reference
BB 3-bet frequency vs BTN 2.5x / 3.0x open at 100bb 6-max no-ante is within 0.32pp of GTOw published references
Three-bet sizing is also sensible: vs 2.5x, BB mixes 11 BB (8.87% prob) and 16.5 BB (4.75% prob); vs 3.0x, BB shifts to 13 BB primary (9.99% prob) with a tiny 19.5 BB tail. Sizes scale appropriately with the open size.
Size-sensitivity sanity check: BB tightens 2.56pp from 13.82% vs 2.5x to 11.26% vs 3.0x — correct GTO direction (defense tightens against larger opens).
Resolution: the KVL register entry was either based on an older model snapshot or on a different sizing assumption. The current endpoint produces BB 3-bet rates within 0.32pp of GTOw at the standard 2.5x and 3.0x BTN open sizes, with correct directional response to sizing. No model action needed. Removing from Tier 3 KVL register.
Note: rake is again invariant at low values (0% and 3%/3cap produce identical output) — consistent with the NLHE preflop pattern in R17 and PLO preflop pattern in R16.
R19 · PLO paired-board c-bet observations — data captured, awaiting external reference P2 · Open · cannot classify without external solver reference
BTN c-bet frequencies on 7 paired flops captured; classification deferred until paid PLO solver reference is available
BTN c-bet frequencies captured across 7 paired flop classes — kept here as raw observations awaiting an external PLO solver reference to validate against:
| Board | Texture | BTN bet% | Check% | Avg bet (BB) |
| --- | --- | --- | --- | --- |
| AsAd7h | high pair · mid kicker · rainbow | 60.70% | 39.30% | 2.00 |
| KsKd7c | high pair · mid kicker · rainbow | 27.10% | 72.90% | 2.00 |
| TsTd5c | mid pair · mid kicker · rainbow | 24.30% | 75.70% | 2.00 |
| 8s8d3h | mid pair · low kicker · rainbow | 57.00% | 43.00% | 2.00 |
| 9d9s2c | mid-low pair · low kicker · rainbow | 53.30% | 46.70% | 2.00 |
| 4s4d2c | low pair · low kicker · rainbow | 48.20% | 51.80% | 2.20 |
| Ks7s7d | paired · flush draw | 70.30% | 29.70% | 2.20 |
Why this is NOT classified as an Issue:
Each paired board has a genuinely different texture (different ranks, different kickers, different equity distributions, different BB cap dynamics, different reach configurations on both sides). The earlier "structurally equivalent" framing in my first draft was wrong — solver outputs for AsAd7h vs KsKd7c can legitimately differ substantially. PLO solver outputs are known to vary 20–40pp across boards that look similar at the rank level.
Without a paid PLO solver reference (Monker, Pio PLO, PLO Genius, PLO Mastermind, Vision GTO Trainer — all subscription) producing comparison numbers for these exact boards, there's no external anchor to call any single number "right" or "wrong."
Heuristics like "high pair should c-bet more than low pair" are domain knowledge, not solver-validated claims. PLO postflop GTO frequently inverts NLHE intuition.
Status: data captured; classification deferred. To upgrade R19 to Issue or Resolved, need an external PLO solver to produce comparison numbers for at least 3 of these 7 boards. Options:
Subscribe to PLO Ninja / RangeConverter / PLO Mastermind ($19–$249) for solver-grade postflop ranges
Coach reference — Brad Wilson or another PLO pro with paid solver access screenshots a few of these specific boards
Run our own internal full-CFR on these spots if PLO CFR engine access is available
Implication for KVL: the q-plo-solver-crosscheck-paired-boards KVL item is NOT fully closed — preflop half captured as R10 with confirmed Issue status (had external refs), postflop half captured here as R19 in Open status pending external refs.
Adding a new reviewer finding: drop a markdown file in engineering-department/gameplay-ai/projects/external-solver-benchmark/findings/{date}_{slug}.md, register it in INDEX-existing-data.md under Tier 1.x (reviewer-named docs) or Tier 3 (article-QA flags), then add the entry here. Include reviewer name (or "unattributed"), source link (Google Doc / article path), date, headline claim, and triage verdict.