R9 · Stand-Up Game depth-flip on Klu T9 — Resolution Report

Categorical sign change between 100bb and 226bb — call → fold — at Stand-Up val=10 · 2026-05-11

P2 · B1 · known-failure · Surfaced by: HCL 5-way Stand-Up comic article · B1 property needed: SQ4 (stack-depth × val × state continuity)

The issue (one paragraph)

In the HCL 5-way Stand-Up Game hand (Jan 3, 2025) at val=10 with Klu standing (Scenario D), the solver gives opposite verdicts for Klu's T9 depending on stack depth: at 100bb Klu calls 35% of the time; at 226bb Klu folds 100%. Standard poker intuition says format-pressure effects (the Stand-Up penalty) should weaken as stacks deepen, because the penalty becomes proportionally smaller relative to the pot; they should not flip direction. The flip is categorical (call → fold), not a smooth shift. The equivalent NLHE cash sweep does not exhibit this behavior (UTG RFI is depth-stable within ±2pp across 50–1000bb), so this looks Stand-Up-specific.

Data — the flip

Stack depth	Klu T9 verdict	Probability
100bb	Call	35%
226bb	Fold	100%

Setup: HCL 5-way Stand-Up, Scenario D (Peter + Adi + Klu standing), val=10. Source: hcl-five-way-stand-up/deviation-log.md · c:/tmp/hcl-five-way-research/SUMMARY.md Open Question #3.

Supporting evidence — NLHE cash control sweep

To distinguish "stack-depth model behavior in general" from "Stand-Up-specific depth interaction," we ran an analogous depth sweep on NLHE cash UTG RFI across [50, 100, 150, 200, 300, 500, 800, 1000] bb. Result: NLHE cash is stable — total swing in range-aggregate RFI is −1.73pp from 50bb to 1000bb. The only flagged hand (Q9s) shows a smooth 13pp drift across 950bb of depth, not a categorical flip.

Comparison	Depth range	Verdict swing	Type
H1 — Klu T9, Stand-Up val=10	100bb → 226bb (126bb)	Call 35% → Fold 100% (−65pp)	CATEGORICAL FLIP
NLHE Q9s — UTG RFI	500bb → 800bb (300bb)	Raise 50% → Fold 54% (+4pp)	Smooth shift
NLHE 87s — UTG RFI	50bb → 1000bb (950bb)	Fold 78% → 58% → 58% (smooth U-curve)	Smooth

Stand-Up's swing is ~16× larger in magnitude than NLHE's largest tabulated shift (65pp vs 4pp) and occurs across ~8× less depth (126bb vs the 950bb NLHE sweep). This is a different phenomenon, not a general model-depth issue.
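The ratios quoted above can be reproduced from the comparison table. The following is a minimal Python sketch, not part of the original analysis. It assumes the magnitude ratio compares the 65pp Stand-Up swing with the 4pp Q9s shift, and the depth ratio compares the 950bb NLHE sweep span with the 126bb Stand-Up span. The report does not state which rows each ratio uses, so this pairing is an assumption.

```python
# Figures copied from the comparison table. The pairing of rows is an assumption.
STANDUP_SWING_PP = 65            # Klu T9, reported as -65pp (call 35% -> fold 100%)
STANDUP_SPAN_BB = 226 - 100      # 126bb between the two measured depths

NLHE_LARGEST_SHIFT_PP = 4        # Q9s, 500bb -> 800bb
NLHE_SWEEP_SPAN_BB = 1000 - 50   # 950bb covered by the NLHE control sweep

magnitude_ratio = STANDUP_SWING_PP / NLHE_LARGEST_SHIFT_PP
depth_ratio = NLHE_SWEEP_SPAN_BB / STANDUP_SPAN_BB

print(f"magnitude ratio: {magnitude_ratio:.2f}x")  # 16.25x, quoted as ~16x
print(f"depth ratio: {depth_ratio:.2f}x")          # 7.54x, quoted as ~8x
```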

Existing B1 property coverage

The right property family is B1 (stack-depth continuity: "no cliffs between adjacent depths"). It exists and applies to all formats, including Stand-Up, but the production runner has three coverage gaps: its stack range stops at 100bb, it has no val axis, and it tests only `UTG : open`.

Property	Spec	Implementation	Covers H1?
`B1` (Stack depth continuity)	"No cliffs between adjacent depths" — `applies_to: [core]`	✅ `classB.ts:47` — but stack range capped at `[20…100bb]`, only tests `UTG : open`, no val-axis	No — runner doesn't reach 226bb, doesn't test Stand-Up's val parameter, doesn't test postflop multi-way spots
`SQ1` (Val monotonicity)	VPIP non-decreasing across val ∈ {1, 2, 3, 5, 10}, fixed depth	(squid.yaml SQ-series)	No — different axis (val, not depth)
`SQ2` (State monotonicity)	VPIP ordering: hero-has ≤ fresh ≤ hero-no-squid ≤ all-desperate	(squid.yaml)	No — different axis (state)
`SQ3` (Squid × position)	Later position responds more to val	(squid.yaml)	No — different axis (position × val)
SQ4 (proposed)	Stack-depth × val × state continuity	Doesn't exist yet	Yes — this is the gap H1 surfaces

Resolution

Track A — Diagnose the Klu T9 flip (model team / Trung)

Solver-tree artifact or real depth interaction?

Sweep intermediate depths (100 / 130 / 150 / 180 / 200 / 226 / 260bb) on the exact same Scenario D · val=10 setup and map where the call → fold transition happens. A transition spread across several depths points to a real depth interaction; a cliff between two adjacent depths points to a solver-tree artifact.
Inspect the bet-sizing tree at each depth — does the legal action set or sizing increments change between 100bb and 226bb in a way that mechanically forces fold?
Cross-check at val=5 (less extreme) and val=2 — does the flip still occur, or is it specifically a val=10 phenomenon?
Document the diagnosis in llm-verifier-game-expansion/squid-classic/known-issues/ if it turns out to be a real model finding.
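The first Track A step can be sketched as a small sweep helper. This is an illustrative Python sketch under stated assumptions: `call_freq_at` is a hypothetical callable standing in for whatever solver query the team uses, and the 0.7 `cliff_share` threshold is a placeholder, not a value from the B1 framework.

```python
from typing import Callable, List, Tuple

# Depths from the Track A sweep list.
DEPTHS_BB = [100, 130, 150, 180, 200, 226, 260]

Step = Tuple[int, int, float]  # (depth_from, depth_to, change in call frequency)


def map_transition(depths: List[int], call_freq_at: Callable[[int], float]) -> List[Step]:
    """Query each depth once and return the call-frequency change per adjacent pair."""
    freqs = [call_freq_at(d) for d in depths]
    return [
        (depths[i], depths[i + 1], freqs[i + 1] - freqs[i])
        for i in range(len(depths) - 1)
    ]


def classify(steps: List[Step], cliff_share: float = 0.7) -> str:
    """Label the sweep 'categorical' if one step carries most of the total change.

    cliff_share is a hypothetical threshold chosen for illustration.
    """
    total = sum(abs(delta) for _, _, delta in steps)
    if total == 0:
        return "no change"
    largest = max(abs(delta) for _, _, delta in steps)
    return "categorical" if largest / total >= cliff_share else "gradual"


# The only two measured points in this report: 35% call at 100bb, 0% at 226bb.
measured = {100: 0.35, 226: 0.0}
steps = map_transition([100, 226], measured.__getitem__)
print(classify(steps))  # categorical
```

With only two measured depths, the whole change necessarily falls in a single step, so the label is trivially "categorical". The intermediate depths in `DEPTHS_BB` are what make the classification informative.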

Track B — Add SQ4 property + extend B1 runner depth (Nimit / B1 framework)

Framework needs to be able to detect this class of issue automatically.

Add SQ4 to squid.yaml: "Stack-depth × val × state continuity — for each (val ∈ {1,2,3,5,10}, state) pair, assert no dominant-action flip between adjacent depths in stack sweep [100, 150, 200, 226, 300, 500, 800] bb."
Implement `runSQ4()` in `classB.ts` (or a new `classSQ.ts` if the SQ-series gets its own file). Use the existing `assertAdjacentSmooth` primitive, extended with the val and state coordinate axes.
Extend the `runB1` stack range for the Stand-Up format: the current range [20, 25, 30, 40, 50, 60, 80, 100] caps at 100bb, so add [150, 200, 226, 300, 500] conditionally when format = squid/standup.
Register the Klu T9 spot (Scenario D, val=10, T9 at 100bb vs 226bb) as a fixed B1 property test case so this exact regression is detectable on future model versions.
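The range extension and the fixed regression case can be prototyped as follows. This is a Python sketch for discussion only; the production change would live in `classB.ts`. The names `stacks_for`, `covers`, `KLU_T9_CASE`, and the `"nlhe"` format key are hypothetical. The stack lists and the `squid`/`standup` format names are taken from the items above.

```python
from typing import Dict, List

BASE_STACKS_BB = [20, 25, 30, 40, 50, 60, 80, 100]  # current runB1 range
DEEP_STACKS_BB = [150, 200, 226, 300, 500]           # proposed extension
DEEP_FORMATS = {"squid", "standup"}                  # formats that get the extension


def stacks_for(fmt: str) -> List[int]:
    """Return the stack sweep for a format; only Stand-Up formats go past 100bb."""
    if fmt in DEEP_FORMATS:
        return BASE_STACKS_BB + DEEP_STACKS_BB
    return list(BASE_STACKS_BB)


# Fixed regression case for the spot in this report (field names are illustrative).
KLU_T9_CASE: Dict[str, object] = {
    "scenario": "D",
    "val": 10,
    "hand": "T9",
    "depths_bb": [100, 226],
}


def covers(fmt: str, case: Dict[str, object]) -> bool:
    """True if every depth the case needs is inside the format's sweep."""
    sweep = stacks_for(fmt)
    return all(depth in sweep for depth in case["depths_bb"])


print(covers("standup", KLU_T9_CASE))  # True
print(covers("nlhe", KLU_T9_CASE))     # False
```

Keeping the extension keyed on format leaves the NLHE sweep unchanged, so existing B1 baselines would not need to be regenerated.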

Proposed SQ4 yaml spec

- id: SQ4
  category: X
  description: "Stack-depth × val × state continuity"
  formal_test: |
    For each (val, state) pair in {val ∈ {1, 2, 3, 5, 10}} × {fresh,
    hero-has, hero-no-squid, all-desperate}, sweep stack depth across
    [100, 150, 200, 226, 300, 500, 800] bb. Assert no dominant-action
    flip between adjacent depths (e.g., call-dominant → fold-dominant)
    unless the action distribution shifts smoothly (≥30% of the change
    happens across at least 2 adjacent depth steps).
  applies_to: [squid]
  severity: gate
  rationale: |
    H1 (Klu T9 100bb call 35% → 226bb fold 100%) surfaced that the
    stack-depth × val interaction can produce categorical sign changes
    in Stand-Up that do NOT occur in NLHE cash. This property catches
    that class of behavior automatically. Companion control: NLHE cash
    UTG RFI 50bb→1000bb shows total swing of −1.73pp; smooth, not flipping.
  added_in: v2

Suggested next steps (priority order)

Map the flip point with intermediate-depth queries (100/130/150/180/200/226/260bb on Scenario D · val=10 · Klu T9). This shows whether the transition is gradual or categorical. Fastest test; ~10 endpoint queries.
Cross-check at val=5 / val=2 — is the flip val-specific or does it persist across val? Tells us whether the issue is "deep stack" or "val=10 + deep stack" specifically.
Add SQ4 to squid.yaml + implement runner. Smallest, highest-leverage framework fix.
Extend the B1 runner stack range for Stand-Up to cover the 100–500bb range.
Cross-reference with R8 — R8's 1260bb instability and R9's 226bb flip may share root causes (Stand-Up training coverage in deep regimes). Joint Track A could cover both.

Open questions for the team

Trung / model team: What is the upper bound on stack depth in the Stand-Up Game training distribution? Does it include 226bb specifically? If yes, the flip is a model defect; if no, it is expected out-of-distribution (OOD) behavior plus a docs gap.
Trung: Does the bet-sizing tree change at 226bb vs 100bb for Stand-Up? E.g., does 226bb get larger raise options that 100bb doesn't have? That could mechanically alter Klu's call EV calculation.
Nimit: Can the B1 runner accept format-specific stack-depth ranges (so Stand-Up gets the extended [100-500] range while NLHE keeps [20-100])?
Coach / Stand-Up expert: At val=10 with Klu standing in a 5-way all-in spot, is there ANY defensible reason T9 should be a snap-fold at 226bb vs a partial-call at 100bb? Or is the flip clearly model error?

Source: marketing-department/04_marketing_materials/content-portal/every-format-every-game/hcl-five-way-stand-up/deviation-log.md H1. Control evidence: tests/pull_h1_depth_sweep_nlhe.py. Cross-reference: R9 in Reviewer Findings · R8 (1260bb companion) · Solver QA index.