Article · Updated May 2026
Pick the model for the job

Claude for code, Gemini for images, Qwen for Japanese speech, DeepSeek for translation. Across five projects I route to six different models. None of them won on benchmarks. Each won on the specific task I needed done, at the cost I was willing to pay.
Here's the routing table. It's the reference artifact — the rest of the article is the reasoning behind each row.
The routing table
| Task | Model | Approx. cost | Why it won |
|---|---|---|---|
| Code generation and orchestration | Claude (Opus / Sonnet) | varies | Best at holding large codebases in context, following project conventions, multi-file edits |
| Speech-to-text (JA / KO / ZH) | Qwen3-ASR via DashScope | ~$0.007/min | Handles dialect, mixed registers, overlapping speakers, code-switching better than Whisper |
| Translation (JA / KO / ZH to EN) | DeepSeek | ~$0.004/1K chars | Domain-aware with prompting, cost-effective for subtitle-length text |
| Recipe parsing | Gemini 2.5 Flash | ~$0.002/recipe | Fast structured extraction at consumer scale, handles messy HTML reliably |
| Abstract / painterly images | Gemini 2.5 Flash Image | ~$0.008/call | Visually indistinguishable from GPT Image 2 for watercolor style, 16x cheaper |
| Logos, text-heavy images | GPT Image 2 | ~$0.13/call | Only model that reliably renders precise text inside images |
| Structured web extraction | Gemini Flash (called directly) | ~$0.002/page | Same model Firecrawl runs behind the scenes, 71% cheaper when I call it myself |
| Landing illustrations | Gemini 2.5 Flash Image | ~$0.04/call | Higher-fidelity prompts for marketing assets, still a fraction of GPT Image 2 |
Every row has a story. The interesting ones follow.
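
One way to keep a table like this honest is to put it in code, so a task with no row fails loudly instead of silently falling back to a default model. A minimal sketch: the task keys, the `Route` type, and the `route` function are names invented here for illustration, and only the model names and cost strings come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    approx_cost: str  # copied from the table; "varies" where no figure is given


# Keys are hypothetical task labels; models and costs mirror the table rows.
ROUTES = {
    "code": Route("Claude (Opus / Sonnet)", "varies"),
    "speech_to_text_cjk": Route("Qwen3-ASR via DashScope", "~$0.007/min"),
    "translation_cjk_to_en": Route("DeepSeek", "~$0.004/1K chars"),
    "recipe_parsing": Route("Gemini 2.5 Flash", "~$0.002/recipe"),
    "painterly_image": Route("Gemini 2.5 Flash Image", "~$0.008/call"),
    "text_heavy_image": Route("GPT Image 2", "~$0.13/call"),
    "web_extraction": Route("Gemini Flash", "~$0.002/page"),
    "landing_illustration": Route("Gemini 2.5 Flash Image", "~$0.04/call"),
}


def route(task: str) -> Route:
    """Look up the model for a task; fail loudly on tasks with no row yet."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no routing row for task {task!r}") from None
```

Swapping a model then means changing one row, which is the point of treating the table as a snapshot.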
The image generation shootout
glp3.wiki needed OG cards — the watercolor-style images that show up when someone shares an article on social media. I tested both GPT Image 2 and Gemini 2.5 Flash Image on the same prompts.
GPT Image 2 costs about $0.13 per call. Gemini costs about $0.008. That's a 16x difference.
For the abstract watercolor style glp3.wiki uses, the outputs were visually indistinguishable. I showed both versions to my wife without telling her which was which. Her verdict: “OpenAI doesn't make a difference.” She couldn't tell them apart and didn't prefer one over the other.
GPT Image 2 does win in one specific area: text rendering inside images. If you need a logo with precise lettering, a diagram with labels, or any image where the text has to be exactly right, GPT Image 2 is still the only reliable option. Gemini garbles text the same way every other diffusion model does.
So the routing rule is simple. Abstract, painterly, decorative → Gemini. Text-heavy or typographic → GPT Image 2. For glp3.wiki, that means every OG card goes through Gemini at a sixteenth of the cost with no perceptible quality loss.
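
The rule above fits in a few lines. This is a sketch, not production code: `pick_image_model` and `batch_cost` are hypothetical helpers, and the per-call prices are the approximate figures quoted in this article.

```python
GEMINI_COST = 0.008    # approx. USD per call, as quoted above
GPT_IMAGE_COST = 0.13  # approx. USD per call, as quoted above


def pick_image_model(needs_exact_text: bool) -> tuple[str, float]:
    """Text-heavy or typographic images go to GPT Image 2; everything else to Gemini."""
    if needs_exact_text:
        return ("GPT Image 2", GPT_IMAGE_COST)
    return ("Gemini 2.5 Flash Image", GEMINI_COST)


def batch_cost(n_images: int, needs_exact_text: bool) -> float:
    """Approximate spend for a batch of images routed by the rule above."""
    _model, cost_per_call = pick_image_model(needs_exact_text)
    return n_images * cost_per_call


# The "16x" figure is just the ratio of the two quoted prices.
price_ratio = GPT_IMAGE_COST / GEMINI_COST
```

The only input is whether the text inside the image has to be exactly right. Style, subject, and project don't enter into it.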
The extraction arbitrage
Firecrawl is a web scraping service I use across several projects. It has a JSON extraction mode that takes a URL and a schema, then returns structured data. Under the hood, it sends the scraped HTML to Gemini and charges you 5 credits per page for the privilege.
When I realised I was paying Firecrawl to call Gemini for me, I started calling Gemini directly. The pipeline became: 1 Firecrawl credit to scrape the page (HTML only, no LLM extraction), then a Gemini Flash call through my own API key to extract the structured data. Same model. Same quality. Same latency: about 17 seconds per page either way.
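
The shape of that two-step pipeline, as a sketch. To avoid guessing at either vendor's SDK, the scraper and the model call are passed in as plain callables; `extract_page`, `scrape_html`, and `call_llm` are hypothetical names, and the prompt wording is an assumption rather than the one used in production.

```python
import json
from typing import Callable


def extract_page(
    url: str,
    schema: dict,
    scrape_html: Callable[[str], str],
    call_llm: Callable[[str], str],
) -> dict:
    """Scrape raw HTML, then ask the model for JSON matching `schema`."""
    # Step 1: scrape only (HTML out, no bundled LLM extraction).
    html = scrape_html(url)

    # Step 2: call the model yourself with the schema and the scraped HTML.
    prompt = (
        "Extract data from the HTML below as JSON matching this schema.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"HTML:\n{html}"
    )
    raw = call_llm(prompt)

    # Raises json.JSONDecodeError if the model returns something that isn't JSON.
    return json.loads(raw)
```

In real use `scrape_html` would wrap the scraping service and `call_llm` would wrap the model API. Keeping them as parameters also makes either side easy to swap later.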
The cost dropped 71%. Five credits became one credit plus roughly $0.002 in Gemini API costs. At the volume journeys.im does for restaurant and hotel extraction, that adds up fast.
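
The arithmetic behind that percentage, as a sketch. The dollar price of a credit depends on your plan and isn't stated here, so it is a parameter; `extraction_savings` is a hypothetical helper, and the defaults are the figures from the paragraph above.

```python
def extraction_savings(
    credit_price: float,
    llm_cost: float = 0.002,   # approx. USD per page for the direct model call
    bundled_credits: int = 5,  # credits per page with bundled JSON extraction
    scrape_credits: int = 1,   # credits per page for an HTML-only scrape
) -> float:
    """Fraction saved per page by scraping only and calling the model directly."""
    before = bundled_credits * credit_price
    after = scrape_credits * credit_price + llm_cost
    return 1 - after / before
```

Under this formula the saving can never exceed 80% (one credit instead of five), and the model cost eats into it from there. Working backward, a 71% saving corresponds to a credit price of roughly $0.0044; that number is derived from the formula, not quoted from a price list.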
The lesson isn't “Firecrawl is overcharging.” They're bundling convenience, and that's worth something. The lesson is: know what's behind the abstraction. If the abstraction is calling a model you already have API access to, you can often cut out the middleman for the extraction step and keep the service for what it's actually good at — in Firecrawl's case, getting past Cloudflare and rendering JavaScript.
The speech-to-text decision
subs.rip processes Japanese, Korean, and Chinese audio. The entire point of the product is accurate transcription and translation of Asian-language video content, so getting the speech-to-text model right is existential.
Whisper-large was the starting point because it's the default everyone reaches for. It's good. It handles clean studio audio in Japanese well. But subs.rip's users aren't feeding it clean studio audio. They're feeding it variety shows with overlapping speakers, Korean dramas with dialect, Chinese podcasts where the host code-switches between Mandarin and English mid-sentence.
Qwen3-ASR handles all of that better. Dialect recognition is noticeably stronger, and the code-switching case — where someone flips between languages mid-sentence — is where the gap is widest. Whisper commits to one language and garbles the other. Qwen rides the switch.
The pipeline splits transcription and translation into separate steps rather than asking one model to do both end-to-end. Qwen3-ASR transcribes to the source language; DeepSeek translates to English. Each model does the thing it's best at, and the translation gets clean text as input instead of audio it has to re-interpret.
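
A sketch of the split, with the two models passed in as callables so nothing here depends on a particular SDK. The `Segment` type with start and end times is an assumption about what a subtitle pipeline carries between stages; `subtitle_pipeline`, `transcribe`, and `translate` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the audio
    end: float
    text: str


def subtitle_pipeline(
    audio_path: str,
    transcribe: Callable[[str], list[Segment]],
    translate: Callable[[str], str],
) -> list[Segment]:
    """Transcribe in the source language, then translate each segment's text."""
    # Stage 1: speech-to-text produces source-language segments with timing.
    source_segments = transcribe(audio_path)

    # Stage 2: translation only ever sees clean text; timing passes through untouched.
    return [replace(seg, text=translate(seg.text)) for seg in source_segments]
```

Because the stages only share text and timing, either model can be replaced without touching the other.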
The meta-pattern
None of these rows came from reading benchmarks. Benchmarks average across tasks; your workload is specific. The right question is never “which model is best?” It's “which model is best at this exact thing, at what cost?”
The answer changes. Six months from now I'll have different rows. The discipline is re-evaluating when the task shape or the cost shifts, not sticking with a provider out of habit.
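
Reduced to code, the question is "what is the cheapest model that passes my own check for this task?" A minimal sketch: `cheapest_passing` is a hypothetical helper, and the `passes` callable stands in for whatever task-specific evaluation you trust, whether that's a test set or a blind side-by-side.

```python
from typing import Callable, Optional


def cheapest_passing(
    candidates: dict[str, float],
    passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the lowest-cost model that passes the task check, or None if none do."""
    # Try candidates from cheapest to most expensive and stop at the first pass.
    for model, _cost in sorted(candidates.items(), key=lambda item: item[1]):
        if passes(model):
            return model
    return None
```

Re-running this when prices or the task change is the re-evaluation step. The result is only as good as the check you pass in.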
Frequently asked
Won't this change in six months?
Yes. The routing table is a snapshot, not a commitment. Qwen3-ASR might lose to a future Whisper version. Gemini Image might get expensive. A new model might appear that does extraction and translation in one pass better than the split pipeline. I treat model choices the same way I treat library choices: use the best one today, be ready to swap tomorrow.
What about audio generation models like Lyria?
I haven't shipped anything with audio generation yet. When I do, it'll get a slot in the table. I don't have opinions on models I haven't used in production. The whole point of this piece is that the routing comes from real usage, not speculation.
Within Claude, which model do you use for planning vs execution?
The “Claude for code” row is really two jobs that want different models. Planning and orchestrating — reading a whole repo, deciding the approach, dispatching subagents — rewards the strongest reasoning model. The actual edits are mostly mechanical once the plan is set, so a cheaper, faster model finishes them without losing much. Claude Code bakes this in with the opusplan setting: Opus drives plan mode, then it switches to a lighter model for execution. I leave that on for the projects where I'm the bottleneck on architecture and let the cheaper tier do the typing. The split is the same idea as the rest of this table — match the model to the exact sub-task, not to the project.
Opus or Sonnet for everyday coding?
Sonnet is the default I reach for, and Opus is the exception, not the other way round. Anthropic's own guidance lines up with how I route it: Sonnet 4.6 is “the best combination of speed and intelligence” and costs $3/$15 per million tokens (input/output), while Opus 4.8 is “the most capable Opus-tier model for complex reasoning and agentic coding” at $5/$25. The Claude Code docs put it plainly: “Sonnet handles most coding tasks well and costs less than Opus. Reserve Opus for complex architectural decisions or multi-step reasoning.” That matches my usage exactly. Most of the work across drafty.im and journeys.im is well-scoped edits where Sonnet is faster and I never feel the gap; I switch up to Opus only when the task is “figure out the approach,” not “make the change.” It's the same routing logic as the rest of this table: use the cheaper model until the task actually demands the expensive one.
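
To make the price gap concrete, here is the per-session arithmetic as a sketch. It reads the quoted prices as input/output rates per million tokens; `session_cost` is a hypothetical helper, and the token counts in the usage note are made up for illustration.

```python
# USD per million tokens as (input, output), using the prices quoted above.
PRICES = {
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Approximate API cost in USD for one session at the quoted rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000
```

For a hypothetical session with 200,000 input tokens and 20,000 output tokens, that works out to $0.90 on Sonnet and $1.50 on Opus. Both rates scale by the same factor, so Opus costs about 1.67x as much for any mix of input and output.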
How do you switch models without restarting?
In Claude Code, /model switches mid-session and /config sets the default, so the choice is per-task, not a setup decision you make once. For a subagent doing something mechanical, you can pin a cheaper tier in its config rather than letting it inherit the main model — the docs suggest model: haiku for simple subagent tasks, and Sonnet for agent-team members where you want capable coordination without paying Opus rates on every teammate. The principle is the same one this whole article is built on: the model is a per-job decision, and the tooling lets you make it cheap to change your mind.
On a subscription plan, does Opus vs Sonnet still matter if I'm not paying per token?
It matters more, not less. On a Pro or Max plan the cost isn't dollars per token — the docs are explicit that “usage is included in your subscription, so the session cost figure isn't relevant for billing.” What you're spending instead is your plan's usage allowance, and Opus draws it down far faster than Sonnet. Run a few heavy Opus sessions and you can hit the limit before the window resets — /usage shows the bars, and you can press w to see the last seven days against your limit. So the routing rule from the rest of this table doesn't go away when you swap an API key for a subscription — it just changes currency. Sonnet stays the default because it stretches the allowance; Opus is the spend you reserve for the architecture call where the stronger reasoning actually earns the bigger bite out of your week.