Yes, GPT-5.3-Codex can run tests and fix failures automatically when it’s deployed in a tool-enabled environment (like an agent workflow that can execute commands, read results, and apply patches). The model itself does not “magically run tests” without tools; you need an integration that can (1) execute a test command in a sandbox, (2) capture stdout/stderr and artifacts, and (3) feed that back to the model. OpenAI’s positioning for GPT-5.3-Codex explicitly emphasizes tool use and complex execution in long-running tasks, and Codex product surfaces such as the Codex app are designed around iterative work where you can review diffs and steer the agent mid-run. That’s exactly the shape needed for “run tests, fix, re-run.”
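As a concrete illustration, here is a minimal Python sketch of that execute/capture/feed-back loop. The `propose_patch` helper, the `tests/test_billing.py::test_refund` target, and the iteration cap are all hypothetical placeholders for whatever agent integration and test suite you actually have; only the subprocess plumbing is standard-library code.

```python
import subprocess

# Hypothetical placeholder: wire this to your model/agent API.
def propose_patch(failure_log: str) -> str:
    """Send the failure log (plus relevant files) to the model; return a unified diff."""
    raise NotImplementedError("connect this to your tool-enabled agent")

def apply_patch(diff: str) -> None:
    """Apply the model's diff to the working tree via `git apply`."""
    subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)

TEST_CMD = ["pytest", "tests/test_billing.py::test_refund", "-x"]  # hypothetical target test
MAX_ITERATIONS = 5

for attempt in range(MAX_ITERATIONS):
    # (1) execute one targeted test command in the sandbox
    result = subprocess.run(TEST_CMD, capture_output=True, text=True, timeout=300)
    if result.returncode == 0:
        print("Test passes; done.")
        break
    # (2) capture stdout/stderr for the model
    failure_log = result.stdout + result.stderr
    # (3) feed it back, apply the minimal patch, and re-run the same command
    apply_patch(propose_patch(failure_log))
else:
    print(f"Still failing after {MAX_ITERATIONS} attempts; escalate to a human.")
```

The iteration cap matters: without it, a tool loop can burn time (and tokens) cycling on a failure it cannot actually fix.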
To make “fix failures” reliable, you should enforce a strict loop: run a single targeted test command, fix only what the output indicates, and re-run the same command. Beginners often ask the model to run an entire test suite repeatedly, which wastes time and worsens the signal-to-noise ratio. A better approach is: (1) run the smallest failing test or typecheck command, (2) paste the error plus the relevant file sections, (3) ask for a minimal patch, and (4) re-run. Also, don’t let it “fix” by weakening tests or skipping checks unless you explicitly allow that. State that up front: “Do not disable tests; do not loosen assertions; fix the underlying bug.” If your environment allows it, require the agent to output a diff and a short explanation of why the fix addresses the failure. That keeps changes reviewable and avoids the “random walk” problem where successive patches drift away from the original intent.
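One way to encode those guardrails, continuing the hypothetical setup above, is to pin the constraints in the prompt and mechanically reject any patch that touches test files. The file-name heuristic below assumes a conventional `tests/` directory and standard unified-diff headers; adjust it to your repository layout.

```python
FIX_INSTRUCTIONS = """\
Fix the failing test below with the smallest possible change.
Do not disable tests; do not loosen assertions; do not add skips.
Output: (1) a unified diff, (2) one short paragraph explaining why
the change addresses the failure.
"""

def patch_touches_tests(diff: str) -> bool:
    # Unified diffs name each changed file on a "+++ b/<path>" header line.
    return any(
        line.startswith("+++ ") and "/tests/" in line
        for line in diff.splitlines()
    )

def review_patch(diff: str) -> str:
    """Gate the model's patch before applying it to the working tree."""
    if patch_touches_tests(diff):
        raise ValueError("Patch edits test files; ask the model to fix the source instead.")
    return diff
```

A rejected patch can simply be fed back into the loop with the rejection reason appended, so the model retries under the stated constraints rather than silently weakening the suite.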
In larger systems, tool-driven fixing should be combined with retrieval so the model doesn’t guess at correct behavior. Many failures come from mismatched internal conventions: expected error codes, logging formats, or dependency injection patterns. Store those conventions and internal docs in Milvus or managed Zilliz Cloud, retrieve the relevant snippets for the failing module, and include them in the context before the model proposes a fix. This is especially helpful when the test failure message is ambiguous (“expected X but got Y”) and the correct resolution depends on team-specific policies. With retrieval plus tools, GPT-5.3-Codex starts to behave like a disciplined engineer: it reads the right references, makes a small patch, and validates it with tests before moving on.
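A minimal retrieval step with pymilvus’s `MilvusClient` might look like the sketch below. The `team_conventions` collection name, its `text` field, and the `embed` callable are assumptions; the real schema and embedding model depend on how you indexed your docs.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # or your Zilliz Cloud URI and token

def conventions_for_failure(failure_log: str, module: str, embed) -> list[str]:
    """Retrieve team conventions relevant to a failing module.

    `embed` is a hypothetical callable mapping text to a vector with the
    same dimensionality as the `team_conventions` collection's index.
    """
    hits = client.search(
        collection_name="team_conventions",            # assumed docs/conventions collection
        data=[embed(f"{module}: {failure_log[:2000]}")],
        limit=3,
        output_fields=["text"],
    )
    # search() returns one result list per query vector; we sent one query.
    return [hit["entity"]["text"] for hit in hits[0]]
```

The returned snippets then go into the prompt alongside the failure log, so the model resolves “expected X but got Y” against the team’s documented policy instead of guessing.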
