Solver QA

External Solver Benchmark — every catalogued divergence between Quintace's strategy outputs and public solvers (PioSolver, GTO Wizard, MonkerSolver, GTO+).

Project: external-solver-benchmark Owner: TBD (proposed Scott / Luong-Ha) Sponsor: Thanh Last swept: 2026-05-06

This is the third leg of model evaluation for gameplay AI — alongside metrics-framework (B1, internal consistency) and solution-quality (cross-endpoint parity). Different question, different oracle: a model can pass B1 perfectly and still produce a strategy that disagrees with established solver consensus by 30+ percentage points on basic spots.

This page catalogues divergences. It does not fix them. Findings get routed to gameplay-AI (model investigation), solution-quality (serving / config bug), or metrics-framework (missing internal invariant).

→ Reviewer findings log → Reviewer methodology & goals → Query hygiene check (validated 2026-05-11)

Pattern reading — where does Quintace land?

The data is regime-dependent. The story is not "Quintace is better/worse than GTOW" — it's "where does Quintace agree, and where does it diverge."

Where Quintace looks BETTER than GTOW

Postflop ICM — QuintAce CFR ~0.28% MAE vs Pio; GTOW ~13% MAE vs Pio. ~46× more accurate than GTOW on this regime. Strongest QuintAce-favorable result in the corpus.

Where Quintace looks WORSE than GTOW

HU 200bb self-play — Quintace loses 10.68 bb/100 AIVAT. Sizing distribution structurally collapsed (74% of bets at 76-100% pot, no small bets, no overbets).
6-max 100bb defense reviewer pass — Quintace under-folds 15-39pp on 5/5 spots in 6-max single-raised flop defense (partly aggregation artifact — see triage note).
Seidman K72r — c-bet 79% AG vs 88% GTOW (article showed 59% from stale model — fixed).
Adjusting-to-donk-bets — fold-to-donk much lower than GTOW's 20% vs b33.

Where Quintace ≈ GTOW or both ≈ Pio

1,755 NLHE flop spots — DRL vs Pio — FCR distance 0.94, FC distance 1.00.
3B05A_8handed — DRL vs GTOW — mean alignment 0.83, tree-match 28/30 spots.

Query hygiene — canonical reference spots (validated 2026-05-11)

Before trusting any divergence finding on this page, we verify our endpoint setup against well-established reference spots from the llm-verifier-game-expansion/cash/data/external-solver-candidates.yaml catalog (last full audit 2026-04-16, v1.5.3). Every spot below is a published GTO Wizard / Upswing reference where Quintace previously matched within tolerance. If the endpoint deviates here, our query setup itself is wrong; if it matches, divergences we report elsewhere are real.

All checks run via the canonical V2 strategy_grid rail (https://preview.rlserv.aceguardianrl.com/api/strategy_grid) with RVV2-exact payload semantics. Model: moe_dynamic_703_universal_v1_20260330_0900_no_cap_trung.onnx. Tests live at engineering-department/gameplay-ai/projects/external-solver-benchmark/tests/.

✅ Postflop hygiene — 6 of 7 spots within tolerance

Spot ID	Claim	Endpoint	External ref	Gap	Status
`B1-MDF-QQ3r`	BB fold% vs BTN 2.0bb c-bet on QQ3r	46.71%	~50%	−3.3pp	OK
`F3-AK8-cbet`	UTG c-bet% on AK8r	92.12%	~90%	+2.1pp	OK
`F3-BB-check-non-exception`	BB check% on K94ss in SRP (non-donk board)	100.00%	~98%	+2.0pp	OK
`F3-BB-donk-654`	BB donk% on 654r in SRP at 100bb	60.55%	~55%	+5.6pp	OK
`E4-suit-isomorphism`	K72r 4-rotation UTG c-bet% spread	0.09pp	0pp	+0.1pp	PERFECT
`H8-3BP-BB-check-100bb`	BB 3-bettor check% in 3BP on 654r at 100bb	47.51%	~48%	−0.5pp	OK
`K72r-LJ-cbet`	UTG c-bet% on K72r (Seidman article reference)	88.62%	88%	+0.6pp	OK

✅ Preflop RFI hygiene — 5 of 5 positions within 1pp of GTOw

Position	Endpoint	GTO Wizard	Jonathan Little	Endpoint vs GTOw
LJ (UTG)	17.85%	17.5%	17.0%	+0.35pp
HJ (MP)	21.02%	21.7%	21.4%	−0.68pp
CO	27.05%	27.9%	27.8%	−0.85pp
BTN	39.78%	40.6%	43.3%	−0.82pp
SB raise	35.00%	34.4%	24.0%	+0.60pp

⚠ One spot reclassified as SCOPE_MISMATCH after article re-check

Spot ID	Original yaml claim	Endpoint (cash)	Status
`D5-donk-654r-stack-depth`	BB donk% on 654r at 20bb vs 100bb (yaml said "BB donk reduces ~2/3 from 20bb→100bb")	20bb cash 2.5bb open: 45.69% 100bb cash 2.5bb open: 60.55%	SCOPE_MISMATCH

Article re-check (2026-05-11): GTO Wizard "Is Donk Betting for Donkeys?" article scope is MTT with 2x min-raise opens on board 764r, not cash with 2.5bb opens. My queries were cash NLHE at 2.5bb — apples-to-oranges. The yaml entry's scope: {format: "6-max", stacks: "20-100"} didn't disambiguate cash vs MTT; the article is MTT-specific. To verify the article direction, queries must be re-run against MTT 2x trees. Yaml should be amended: verdict downgraded from EXACT to SCOPE_MISMATCH until MTT-format query lands.

Bottom line: The endpoint matches canonical reference spots within ±5pp in 12 of 13 cases (preflop + postflop combined). One direction-inverted finding (D5 stack-depth) flagged. Query setup is healthy — divergences reported in Tiers 1–4 below are real and not query artifacts. Test scripts: tests/pull_hygiene_check_all.py, tests/pull_rfi_5positions_rvv2_payload.py, tests/pull_postflop_utg_cbet_boards.py.

✅ MTT hygiene — 5 of 7 spots within ±5pp · 2 flagged (UTG 2bb, BTN 50bb)

First MTT hygiene pass — possible now that Scott aligned V2 MTT config with GTOw MTT solution library presets (#dom_gameplayai 2026-05-11). External refs from GTOw blog: How Stack Sizes Change Your Range. Setup: 8-max MTT chip-EV, 2-blind, ante 0.12bb, no rake, model universal-dense-v4-player_20260402_150328.onnx.

Position	Stack depth	Endpoint RFI	GTOw published	Gap	Status
UTG	100bb	15.00%	16.5%	−1.5pp	OK
UTG	50bb	15.43%	17.7%	−2.3pp	OK
UTG	17bb	12.58%	15.8%	−3.2pp	OK
UTG	14bb	11.06%	16.0%	−4.9pp	OK (borderline)
UTG	5bb	18.17%	20.0%	−1.8pp	OK
UTG	2bb	21.80%	36.5%	−14.7pp	FLAG
BTN	50bb	41.57%	~55.0%	−13.4pp	FLAG

Tooling note: strategy_grid_client.py's MttPreflop.open() hardcodes ante=0 in _build_mtt_hand() (line 809). Standard MTT presets assume ~0.125bb ante; without explicit ante override the endpoint output is 8-25pp tighter than GTOw — that's an ante mismatch, not a model defect. This script patches the payload to add ante=0.12bb. Fix request: add ante parameter to MttPreflop helpers so MTT queries default to MTT-realistic config.

Two spots flagged for follow-up: UTG 2bb (jam range too narrow vs published) and BTN 50bb (open range too narrow). Both could be (a) real model findings on under-trained MTT regimes, (b) GTOw preset assumptions we haven't matched (different ante / payout / ICM), or (c) the BTN qualitative "closer to 55%" being a range rather than a point. Test script: tests/pull_hygiene_check_mtt.py.

Tier 1 — Raw quantitative datasets

Direct numerical comparisons of Quintace against external solvers (Pio / GTOW).

Pending consolidated results. Updated benchmark results from Yaroslav and Ha are pending. Once their next run lands we'll refresh this section with the current numbers and links to the underlying data.

For published findings to date, see the comparison article: When QuintAce, GTO Wizard, and Other Solvers Disagree — covers the 1,755-flop DRL vs PioSolver alignment, the 30-spot 3B05A_8handed DRL vs GTOW cross-tree pass, the postflop ICM 0.28% vs 13% MAE gap, plus side-by-side 13×13 hand-class grids and the tree-mismatch finding on the lowest-alignment spot.

Tier 2 — Reviewer-flagged divergences (from publishing pipeline)

Divergences surfaced by article reviewers during the verified-theory-publishing pipeline.

See the full reviewer findings log: /solver-qa/reviewers/ — complete catalog of reviewer-flagged divergences, current status, and triage outcomes across the publishing pipeline.

KVL register absorption (2026-05-11) — all 3 prior Tier 3 items absorbed into the Reviewer findings section: ea5-btn-open-vs-public-solvers → R17 (stale claim, resolved). q-cash-bb-btn-3bet-crosscheck → R18 (matches GTOw within 0.32pp, resolved). q-plo-solver-crosscheck-paired-boards → R10 (preflop captured as Issue; postflop work continues in same thread).

Tier 3 — Source theory references

Theoretical baselines and book-level cross-validation context. Not divergence data — flags where books expect cross-checks.

book-1 (Cash) — theory-foundation.md (foundational citations)
book-5 (PLO) — plo-theory.md + plo-theory-foundation.md (Pio / GTOW context)
book-6 (MTT) — mtt-theory.md + mtt-baselines.md + mtt-readme.md + mtt-causal.md (GTOW MTT-LIT-* citations)
all-in-ev-illusion/v1.md — GTOW for EV-formula context
seidman-easy-game-reexamined/v1.md — GTOW cross-check editor's notes at T2/T3/T7
antes-vs-straddles/v1.md + deviation-log.md — strategy_grid_client direct rendering
cfr-drl-gto-based-learning/ — series on CFR vs DRL vs GTO architecture
student-drl-vs-cfr-architecture/ — companion student-track piece
nick-squid-desperation-geometry/v1.md — Nick's squid article references

What's missing from the inventory

Yaroslav's original solver-alignment snapshot files — referenced but not located. Likely in a Bitbucket-only repo (gto-seer or agrlalg).
PLO5 / PLO6 cross-checks against Pio — register entry exists but no data.
MTT chip-EV vs Pio benchmarks — book-6 references MTT-SOL-* baselines but specific divergence runs not in this workspace.
Squid / Bomb-pot — by definition no public-solver baseline; B1-only.
Pre-2026-04 historical reviewer findings — likely scattered in older Slack threads / Confluence archives, not surfaced by file-search.

Source of truth: engineering-department/gameplay-ai/projects/external-solver-benchmark/. INDEX maintained by the External Solver Benchmark project owner. Last swept 2026-05-06.