Scope note: This cheatsheet reflects observed behavior in ChatGPT models. It does not generalize to all LLMs.
Stability note: Model names, limits, and UI signals change. Treat specific numbers and badges as indicative, not contractual.
With the release of the GPT-5 model in August 2025, OpenAI updated usage limits—especially for deeper Thinking modes. The guidance below focuses on reliably triggering deeper internal reasoning without unnecessarily consuming manual Thinking quotas.
ChatGPT (web/mobile) – observed usage tiers
| Tier | GPT-5 Standard | GPT-5 Thinking (deeper reasoning) |
|---|---|---|
| Free | ~10 messages / 5 hours, then fallback to a smaller model. | ~1 Thinking message/day. |
| Plus | ~160 messages / 3 hours (temporarily elevated at time of review). | ~200 manual Thinking messages/week. Auto-escalation does not count. |
| Pro/Team | Effectively unlimited standard usage (subject to abuse guardrails). | Access to extended/pro Thinking variants. |
Key distinction
- Manual Thinking requests → counted toward weekly limits.
- Automatic internal escalation → typically not counted.
If a standard GPT-5 request internally escalates, you benefit from deeper reasoning without spending a manual Thinking slot.
Heuristics that often correlate with deeper internal reasoning
These are correlations, not guarantees:
- Noticeably slower responses, even for short prompts.
- Structured analysis that introduces ordering, trade-offs, or falsification steps you did not explicitly request.
- The UI may later label the response as involving deeper reasoning.
Do not assume you can reliably detect or force internal escalation. The goal is to increase likelihood, not control it.
General prompt structure
“You are acting as a senior [language/framework] engineer. Task: [clear outcome]. Context: [repo summary / constraints / runtime / env]. Inputs: [code snippets, error logs, benchmarks]. Requirements: [functional + non-functional]. Deliverables: [analysis, plan, patch, tests, risks, alternatives]. Evaluate invariants, edge cases, trade-offs, and failure modes before proposing code.”
Why it helps: explicit constraints paired with stated evaluation criteria tend to correlate with deeper internal reasoning.
1) Bug triage & minimal repro
Template
“Given this failing behavior [symptoms/logs], rank likely root causes by probability. Produce a minimal reproducible example in [language/tooling]. For each cause, show a falsification experiment, then propose the smallest patch and a regression test.”
Signals that correlate with depth: ranking, falsification, MRE, patch + test pairing.
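For illustration, a minimal sketch of the MRE-plus-regression-test pairing this template asks for, built around an invented off-by-one bug in a Go pagination helper (all names are hypothetical, not from any real codebase):

```go
// pagination_test.go — minimal reproducible example plus regression test.
// Hypothetical bug: pageCount uses truncating division, dropping the
// partial final page.
package pagination

import "testing"

// Buggy version under investigation (suspected root cause).
func pageCount(items, pageSize int) int {
	return items / pageSize // falsified below: drops the partial final page
}

// Falsification experiment doubling as the regression test: it fails on
// the buggy code and passes once division rounds up, e.g.
// (items + pageSize - 1) / pageSize.
func TestPageCountPartialFinalPage(t *testing.T) {
	if got := pageCount(11, 5); got != 3 {
		t.Fatalf("pageCount(11, 5) = %d, want 3", got)
	}
}
```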
2) Spec → plan → interfaces (no code first)
Template
“Translate this feature request [spec/user story] into:
- explicit invariants and pre/post-conditions,
- module boundaries and public interfaces,
- a stepwise implementation plan with checkpoints and rollback. Highlight ambiguities and propose clarifying questions. Do not emit code until the analysis is complete.”
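As a sketch of what the invariants-and-interfaces step can produce before any implementation, here is a hypothetical Go interface whose doc comments encode the pre/post-conditions (the ledger domain is invented):

```go
package ledger

import "errors"

var ErrInsufficientFunds = errors.New("ledger: insufficient funds")

// Account is a hypothetical public interface produced by the analysis step.
// Invariant: Balance() never returns a negative value.
type Account interface {
	// Balance returns the current balance in minor units (e.g. cents).
	// Post-condition: result >= 0.
	Balance() int64

	// Withdraw debits amount from the account.
	// Pre-condition: amount > 0.
	// Post-condition: on success, Balance() decreases by exactly amount;
	// on ErrInsufficientFunds, Balance() is unchanged.
	Withdraw(amount int64) error
}
```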
3) Defensive test design
Template
“Design a test suite for [component] covering:
- happy paths and boundary values,
- property-based and adversarial inputs,
- performance guards (time/memory thresholds),
- concurrency or race conditions if applicable. Deliverables: test-matrix table, example inputs/expected outputs, and rationale per case.”
Optional add-on:
“Convert the matrix into [framework] test stubs.”
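In Go, the converted matrix typically becomes a table-driven test. A minimal sketch, with an invented clamp component and cases:

```go
package clamp

import "testing"

// clamp is a hypothetical component under test.
func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// Test stubs generated from a test matrix: one row per case,
// covering the happy path and boundary values.
func TestClamp(t *testing.T) {
	cases := []struct {
		name            string
		v, lo, hi, want int
	}{
		{"inside range", 5, 0, 10, 5},
		{"below lower bound", -1, 0, 10, 0},
		{"above upper bound", 11, 0, 10, 10},
		{"at lower bound", 0, 0, 10, 0},
		{"at upper bound", 10, 0, 10, 10},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := clamp(tc.v, tc.lo, tc.hi); got != tc.want {
				t.Errorf("clamp(%d, %d, %d) = %d, want %d", tc.v, tc.lo, tc.hi, got, tc.want)
			}
		})
	}
}
```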
4) Performance analysis
Template
“Given [benchmarks/profiles], identify the true bottleneck. Compare at least three optimization strategies (algorithmic, data-structure, system-level). Include complexity analysis, expected absolute wins, and a guardrail benchmark with acceptance thresholds.”
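One lightweight way to express a guardrail benchmark in Go is a test that fails when an agreed budget is exceeded. A sketch with an invented workload and threshold; wall-clock checks like this are deliberately coarse and belong on dedicated CI hardware:

```go
package perf

import (
	"sort"
	"testing"
	"time"
)

// TestSortGuardrail fails if sorting one million ints exceeds the budget.
// The 200ms acceptance threshold is a placeholder to be calibrated
// against real profiles, not a recommendation.
func TestSortGuardrail(t *testing.T) {
	data := make([]int, 1_000_000)
	for i := range data {
		data[i] = len(data) - i // reverse-sorted input
	}
	start := time.Now()
	sort.Ints(data)
	if elapsed := time.Since(start); elapsed > 200*time.Millisecond {
		t.Fatalf("sort took %v, budget is 200ms", elapsed)
	}
}
```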
5) Refactor with safety
Template
“Refactor [module/path] to improve [maintainability/cohesion/complexity]. Constraints: zero behavior change; public API stable. Deliverables: refactor map (before → after), risk list, and a safety net (snapshot, golden, or contract tests). Stage the work across N small PRs. Do not emit code until risks and safety nets are defined.”
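A common zero-behavior-change safety net is a golden test that pins current output before the refactor starts. A minimal sketch with an invented Render function:

```go
package report

import (
	"fmt"
	"testing"
)

// Render stands in for the hypothetical function being refactored.
func Render(name string, total int) string {
	return fmt.Sprintf("report for %s: %d items", name, total)
}

// Golden test: pins the exact current output so any behavior change
// during the refactor fails loudly. In larger codebases the expected
// value usually lives in a testdata/ file rather than a literal.
func TestRenderGolden(t *testing.T) {
	const golden = "report for Q3: 42 items"
	if got := Render("Q3", 42); got != golden {
		t.Fatalf("Render output changed:\n got: %q\nwant: %q", got, golden)
	}
}
```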
6) Concurrency & correctness
Template
“For [concurrent/async] code, enumerate interleavings that violate invariants. Provide a happens-before diagram. Identify deadlock, livelock, and starvation risks. Propose a synchronization strategy and justify it with contention analysis.”
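To make "interleavings that violate invariants" concrete, here is a classic lock-ordering hazard in Go and the usual fix (the account domain is invented):

```go
package bank

import "sync"

// Account is a hypothetical shared resource guarded by its own mutex.
type Account struct {
	mu      sync.Mutex
	id      int
	balance int
}

// transferDeadlocky can deadlock: goroutine A runs transfer(x, y) while
// goroutine B runs transfer(y, x); each holds one lock and waits forever
// on the other. This is the violating interleaving the template asks for.
func transferDeadlocky(from, to *Account, amount int) {
	from.mu.Lock()
	defer from.mu.Unlock()
	to.mu.Lock() // hazard: opposite acquisition order across goroutines
	defer to.mu.Unlock()
	from.balance -= amount
	to.balance += amount
}

// transfer avoids the deadlock by always locking the lower id first,
// imposing a global acquisition order on all callers.
func transfer(from, to *Account, amount int) {
	first, second := from, to
	if second.id < first.id {
		first, second = second, first
	}
	first.mu.Lock()
	defer first.mu.Unlock()
	second.mu.Lock()
	defer second.mu.Unlock()
	from.balance -= amount
	to.balance += amount
}
```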
7) API design review (backwards compatibility)
Template
“Evaluate this API [signature/examples] for ergonomics, consistency, discoverability, error surface, and evolution. Propose a deprecation path and versioning policy. Include adapters or shims for [old → new], with examples that make misuse difficult.”
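An adapter that keeps old call sites compiling while steering new code toward the richer API might look like this sketch (both signatures are invented):

```go
package store

import (
	"context"
	"time"
)

// Get is the new, context-aware API.
func Get(ctx context.Context, key string) (string, error) {
	// ... real lookup elided ...
	return "", nil
}

// GetLegacy preserves the old signature as a thin adapter.
//
// Deprecated: use Get with a caller-supplied context; this shim applies
// a fixed timeout and is scheduled for removal in the next major version.
func GetLegacy(key string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return Get(ctx, key)
}
```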
8) Migration or rewrite plan
Template
“Plan migration from [X] to [Y]. Map data or schema transforms, compatibility layers, dual-read/dual-write strategy, and cutover criteria. Identify irreversible steps and a rollback plan. Provide a milestone timeline with measurable gates. Do not emit code until the plan is validated.”
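The dual-write layer is often the riskiest moving part of such a plan. A hedged sketch, with an invented Store interface:

```go
package migrate

import "context"

// Store abstracts the old and new backends during migration.
type Store interface {
	Put(ctx context.Context, key, value string) error
	Get(ctx context.Context, key string) (string, error)
}

// DualStore writes to both backends and reads from the old one until
// cutover criteria are met. Writes to the new store are best-effort
// here; a production version would queue failures for reconciliation.
type DualStore struct {
	Old, New    Store
	ReadFromNew bool // flipped once cutover criteria are satisfied
}

func (d *DualStore) Put(ctx context.Context, key, value string) error {
	if err := d.Old.Put(ctx, key, value); err != nil {
		return err // old store stays authoritative pre-cutover
	}
	_ = d.New.Put(ctx, key, value) // best-effort dual write
	return nil
}

func (d *DualStore) Get(ctx context.Context, key string) (string, error) {
	if d.ReadFromNew {
		return d.New.Get(ctx, key)
	}
	return d.Old.Get(ctx, key)
}
```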
9) Security & threat modeling (lightweight)
Template
“Perform a lightweight threat model for [component] (STRIDE-lite). List assets, trust boundaries, and the top five concrete threats with exploit sketches. Recommend mitigations with cost/impact trade-offs and note residual risk.”
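Mitigations coming out of such a review are often small and local. For example, a path-traversal guard at a trust boundary; a typical finding shape, not one taken from any real audit:

```go
package files

import (
	"errors"
	"path/filepath"
	"strings"
)

var ErrOutsideRoot = errors.New("files: path escapes root")

// safeJoin resolves userPath under root and rejects traversal attempts
// (e.g. "../../etc/passwd"), a common mitigation for file-serving threats.
func safeJoin(root, userPath string) (string, error) {
	// Cleaning a rooted copy of userPath strips any ".." escapes.
	p := filepath.Join(root, filepath.Clean("/"+userPath))
	if p != root && !strings.HasPrefix(p, root+string(filepath.Separator)) {
		return "", ErrOutsideRoot
	}
	return p, nil
}
```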
10) Code review with rationale
Template
“Review this diff [patch]. Classify findings: correctness, performance, readability, testability, security. For each, provide a one-sentence rationale and a minimal fix snippet. End with a risk summary and an approve / request-changes recommendation.”
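A finding-plus-minimal-fix pair from such a review could look like this invented correctness example, where the original code used a bare `defer f.Close()` and discarded the error on a writable file:

```go
package review

import "os"

// writeConfig shows the minimal fix for a hypothetical review finding:
// "correctness: Close error on a writable file was discarded, so a
// failed flush can silently lose data." The deferred closure propagates
// the Close error when no earlier error occurred.
func writeConfig(path string, data []byte) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	_, err = f.Write(data)
	return err
}
```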
Phrases that often nudge deeper reasoning
- “Rank likely root causes and show falsification steps.”
- “Extract invariants and define pre/post-conditions first.”
- “Provide alternatives with complexity and trade-offs.”
- “Design a minimal reproducible example.”
- “Specify acceptance thresholds and rollback criteria.”
- “Enumerate edge cases before proposing code.”
- “Stage into N small PRs with safety nets.”
- “Identify hidden assumptions that could invert the recommendation.”
When not to trigger deeper reasoning
Avoid explicitly requesting depth for:
- Straightforward CRUD or glue code.
- Established patterns with canonical solutions.
- Time-boxed exploratory spikes where speed matters more than rigor.
Over-triggering depth increases latency without improving outcomes.
When to spend manual Thinking
Use it deliberately when:
- Performance, correctness, and concurrency interact.
- The task spans multiple subsystems, files, or data sources.
- Auto-escalated attempts are shallow or contradictory.
- You need step-by-step proofs (formal invariants, lock ordering).
Example (drop-in)
“You are a senior Go engineer. Task: diagnose and fix an intermittent deadlock in the job scheduler. Context: Go 1.22, Linux, 16-core system; worker pool + buffered channels. Inputs: [stack traces / pprof]. Requirements: no functional regressions; handle 50k jobs/min; p95 latency < 150ms. Deliverables:
- ranked root causes with falsification experiments,
- minimal repro,
- proposed fix with happens-before explanation,
- test plan covering races and starvation,
- rollback plan.”
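For context, the failure class that prompt targets, condensed into a deterministic, invented repro (the bug in the prompt is intermittent; this version always stalls, and the Go runtime reports the deadlock):

```go
package main

import "fmt"

// Invented repro shape: with more jobs than results-buffer slots and
// draining deferred until after enqueueing, every worker eventually
// blocks sending to results, the enqueue loop blocks sending to jobs,
// and execution never reaches the drain loop.
func main() {
	const numJobs = 8
	jobs := make(chan int)       // unbuffered
	results := make(chan int, 2) // buffer smaller than numJobs

	for w := 0; w < 2; w++ {
		go func() {
			for j := range jobs {
				results <- j * j // blocks once the buffer fills
			}
		}()
	}

	for j := 0; j < numJobs; j++ {
		jobs <- j // blocks forever once workers stall: deadlock
	}
	close(jobs)

	for i := 0; i < numJobs; i++ {
		fmt.Println(<-results) // never reached
	}
}
```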