I got a bit further with my harness for running the Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool in which every harness finding, modification, correction, tool call, reasoning step, and execution flow is fully visible.
This level of transparency was necessary to identify the behavioral differences between native OpenAI models and Qwen 3.6, and to fine-tune the harness accordingly.
The video shows a full Codex run on Qwen 3.6, running on a single NVIDIA GeForce RTX 5090 (Codex in VS Code -> tbg(o)llama-swap -> llama.cpp with Qwen 3.6 27B).
The ongoing work can be checked here: https://github.com/Ltamann/tbg-ollama-swap-prompt-optimizer/tree/qwen3.6 (first post, second post).
Here’s the clearest current status.
Working
- apply_patch
  - create/update/delete flow
  - create_file requires non-empty diff or content
  - update_file requires non-empty diff or content
  - delete_file works without diff
- shell
- web_search (using the TBG(O)llama-swap built-in web search)
- file_search
- view_image
- request_user_input
- update_plan
- spawn_agent / wait_agent / send_input / resume_agent / close_agent
- supports_search_tool catalog inconsistency
- scenario suite: agent_send_input_roundtrip, agent_subagent_same_model, shell_patch_verify_sequence, web_research_then_notes, plan_act_switch_impl, multi_web_patch_verify, skill_create_and_use_local, workspace_summary_then_plan, skill_read_local, direct_plan_no_web, web_research_then_plan, file_search_then_patch, view_image_then_report
- invalid apply_patch retry exhaustion no longer finalizes with fake progress prose
- safer recovery branch after a broken apply_patch
- false patch-intent/path-hint extraction from instructions
- reconnect bug caused by unhealthy or duplicate upstream adoption
- long delayed 502 timeout path shortened and improved
- native-vs-local contrast harness:
  - init compare
  - per-scenario comparison.json
  - top-level comparison_summary.json
  - tool-surface diff
  - item-type diff
  - stream/completion diff
  - final visible text diff
  - grouped UX-summary diff
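For illustration, the create/update/delete rule above can be sketched as a small validator. The op shape and field names here are my assumptions for the sketch, not the harness's actual code:

```python
def validate_apply_patch_op(op: dict) -> None:
    """Enforce the apply_patch file-op rule: create_file and update_file
    must carry a non-empty diff or content; delete_file needs neither."""
    kind = op.get("type")  # "create_file" | "update_file" | "delete_file"
    payload = (op.get("diff") or op.get("content") or "").strip()
    if kind in ("create_file", "update_file") and not payload:
        raise ValueError(f"{kind} requires a non-empty diff or content")
    # delete_file works without a diff, so there is nothing to check
```

Rejecting empty create/update ops before they reach the model-facing surface is what keeps retry exhaustion from ever "succeeding" with an empty patch.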
Implemented in the Bridge Contract
- stricter separation of:
- visible assistant text
- tool call items
- tool outputs
- file/code artifacts
- explicit continuation-state handling for:
- research flow
- write-pending flow
- verification flow
- final-answer handoff
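A minimal sketch of what that separation and continuation-state tracking might look like, assuming a simple item shape (a `type` field) and invented names; the real bridge contract is more involved:

```python
from enum import Enum, auto

class ContinuationState(Enum):
    """Explicit continuation states carried between turns."""
    RESEARCH = auto()        # still gathering web/file context
    WRITE_PENDING = auto()   # a file write was announced but not yet emitted
    VERIFICATION = auto()    # changes applied, verification commands expected
    FINAL_ANSWER = auto()    # hand visible assistant text to the UI

def classify_item(item: dict) -> str:
    """Route a response item into one of the strictly separated buckets."""
    itype = item.get("type", "")
    if itype == "message":
        return "visible_text"
    if itype.endswith("_call_output"):
        return "tool_output"
    if itype.endswith("_call"):
        return "tool_call"
    return "artifact"  # file/code artifacts and anything else
```

The point of the strict buckets is that a tool output can never leak into the visible assistant text, which is one of the behavioral differences between native models and a local bridge.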
Fixed Enough To Work, But Still Not Native-Perfect
- grouped searches
- grouped tool calls
- grouped file changes
- collapsible internal history
These areas are significantly improved in both the UI and the harness, but I would still describe them as partially aligned rather than fully native-identical.
Fixed
- mcp__playwright__browser_navigate
- mcp__playwright__browser_snapshot
- mcp__playwright__browser_click
- mcp__playwright__browser_evaluate
- mcp__playwright__browser_resize
- mcp__playwright__browser_take_screenshot
Important nuance:
- llama-swap now preserves and exposes these much more accurately
- however, the WSL Codex router still rejects Playwright leaf calls as unsupported on this surface
- this is now tracked as a known limitation, not an active llama-swap bridge bug
Still Not Fully Closed / Needs More Work
- full native-style grouped worker UX parity
- some remaining model-quality quirks during long multi-step runs
- continuation/reporting polish around malformed reasoning/text splits