John Kueh

Article · Updated May 2026

Pick the model for the job

Claude for code, Gemini for images, Qwen for Japanese speech, DeepSeek for translation. Across five projects I route to six different models. None of them won on benchmarks. Each won on the specific task I needed done, at the cost I was willing to pay.

Here's the routing table. It's the reference artifact — the rest of the article is the reasoning behind each row.

The routing table

| Task | Model | Approx. cost | Why it won |
|---|---|---|---|
| Code generation and orchestration | Claude (Opus / Sonnet) | varies | Best at holding large codebases in context, following project conventions, multi-file edits |
| Speech-to-text (JA / KO / ZH) | Qwen3-ASR via DashScope | ~$0.007/min | Handles dialect, mixed registers, overlapping speakers, code-switching better than Whisper |
| Translation (JA / KO / ZH to EN) | DeepSeek | ~$0.004/1K chars | Domain-aware with prompting, cost-effective for subtitle-length text |
| Recipe parsing | Gemini 2.5 Flash | ~$0.002/recipe | Fast structured extraction at consumer scale, handles messy HTML reliably |
| Abstract / painterly images | Gemini 2.5 Flash Image | ~$0.008/call | Visually indistinguishable from GPT Image 2 for watercolor style, 16x cheaper |
| Logos, text-heavy images | GPT Image 2 | ~$0.13/call | Only model that reliably renders precise text inside images |
| Structured web extraction | Gemini Flash (self-hosted) | ~$0.002/page | Same model Firecrawl runs behind the scenes, 71% cheaper when self-hosted |
| Landing illustrations | Gemini 2.5 Flash Image | ~$0.04/call | Higher-fidelity prompts for marketing assets, still a fraction of GPT Image 2 |

Every row has a story. The interesting ones follow.
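One way to keep a table like this honest is to store it as data and look models up by task. The sketch below is an illustration, not the code behind these projects: the task keys, the `Route` class and the `route` function are hypothetical names, while the model names and costs are copied from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    approx_cost: str  # kept as text because the unit differs per row


# Task keys are hypothetical identifiers; values mirror the routing table.
ROUTES = {
    "code": Route("Claude (Opus / Sonnet)", "varies"),
    "speech_to_text": Route("Qwen3-ASR via DashScope", "~$0.007/min"),
    "translation": Route("DeepSeek", "~$0.004/1K chars"),
    "recipe_parsing": Route("Gemini 2.5 Flash", "~$0.002/recipe"),
    "painterly_image": Route("Gemini 2.5 Flash Image", "~$0.008/call"),
    "text_heavy_image": Route("GPT Image 2", "~$0.13/call"),
    "web_extraction": Route("Gemini Flash (self-hosted)", "~$0.002/page"),
    "landing_illustration": Route("Gemini 2.5 Flash Image", "~$0.04/call"),
}


def route(task: str) -> Route:
    """Return the route for a task, failing loudly when no row exists."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no route for task {task!r}; add a row first") from None
```

Raising on an unknown task, instead of falling back to a default model, keeps the "no row until it ships" rule visible in code.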

The image generation shootout

glp3.wiki needed OG cards — the watercolor-style images that show up when someone shares an article on social media. I tested both GPT Image 2 and Gemini 2.5 Flash Image on the same prompts.

GPT Image 2 costs about $0.13 per call. Gemini costs about $0.008. That's a 16x difference.

For the abstract watercolor style glp3.wiki uses, the outputs were visually indistinguishable. I showed both versions to my wife without telling her which was which. Her verdict: “OpenAI doesn't make a difference.” She couldn't tell them apart and didn't prefer one over the other.

GPT Image 2 does win in one specific area: text rendering inside images. If you need a logo with precise lettering, a diagram with labels, or any image where the text has to be exactly right, GPT Image 2 is still the only reliable option. Gemini garbles text the same way every other diffusion model does.

So the routing rule is simple. Abstract, painterly, decorative → Gemini. Text-heavy or typographic → GPT Image 2. For glp3.wiki, that means every OG card goes through Gemini at a sixteenth of the cost with no perceptible quality loss.
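The rule and the cost gap fit in a few lines. This is a minimal sketch using the per-call prices quoted above; the function names and the `needs_exact_text` flag are made up for illustration.

```python
GPT_IMAGE_2_COST = 0.13      # USD per call, as quoted above
GEMINI_IMAGE_COST = 0.008    # USD per call, as quoted above


def pick_image_model(needs_exact_text: bool) -> str:
    """Text-heavy or typographic images go to GPT Image 2, everything else to Gemini."""
    return "GPT Image 2" if needs_exact_text else "Gemini 2.5 Flash Image"


def cost_ratio() -> float:
    """How many times more a GPT Image 2 call costs than a Gemini call."""
    return GPT_IMAGE_2_COST / GEMINI_IMAGE_COST


def saved_usd(num_images: int) -> float:
    """Money saved by sending painterly images to Gemini instead of GPT Image 2."""
    return num_images * (GPT_IMAGE_2_COST - GEMINI_IMAGE_COST)
```

With these prices `cost_ratio()` comes out at about 16.25, which is where the "16x" figure comes from.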

The extraction arbitrage

Firecrawl is a web scraping service I use across several projects. It has a JSON extraction mode that takes a URL and a schema, then returns structured data. Under the hood, it sends the scraped HTML to Gemini and charges you 5 credits per page for the privilege.

When I realised I was paying Firecrawl to call Gemini for me, I started calling Gemini directly. The pipeline became: 1 Firecrawl credit to scrape the page (HTML only, no LLM extraction), then a self-hosted Gemini Flash call to extract the structured data. "Self-hosted" here means my own code calling the Gemini API, not running the model myself. Same model. Same quality. Same latency: about 17 seconds per page either way.
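The shape of that two-step pipeline can be sketched without committing to any client library. Everything named here is an assumption: `extract_structured`, `scrape_html` and `extract_json` are hypothetical, and the two callables stand in for whatever Firecrawl and Gemini clients you actually use. The schema is assumed to be a JSON-Schema-style dict with a `required` list.

```python
import json
from typing import Callable


def extract_structured(
    url: str,
    schema: dict,
    scrape_html: Callable[[str], str],
    extract_json: Callable[[str, dict], str],
) -> dict:
    """Scrape first, then extract, as two separately billed steps."""
    html = scrape_html(url)            # step 1: scraper returns HTML, no LLM involved
    raw = extract_json(html, schema)   # step 2: your own LLM call returns a JSON string
    data = json.loads(raw)
    # Check required keys so a malformed model reply fails here, not downstream.
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"extraction missing required keys: {missing}")
    return data


# Stand-ins for the real clients, so the sketch runs offline.
def fake_scrape(url: str) -> str:
    return "<h1>Example Bistro</h1>"


def fake_extract(html: str, schema: dict) -> str:
    return json.dumps({"name": "Example Bistro"})


result = extract_structured(
    "https://example.com/bistro", {"required": ["name"]}, fake_scrape, fake_extract
)
```

Passing the two steps in as functions keeps the scraper and the model swappable independently, which is the point of splitting them.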

The cost dropped 71%. Five credits became one credit plus roughly $0.002 in Gemini API costs. At the volume journeys.im does for restaurant and hotel extraction, that adds up fast.

The lesson isn't “Firecrawl is overcharging.” They're bundling convenience, and that's worth something. The lesson is: know what's behind the abstraction. If the abstraction is calling a model you already have API access to, you can often cut out the middleman for the extraction step and keep the service for what it's actually good at — in Firecrawl's case, getting past Cloudflare and rendering JavaScript.

The speech-to-text decision

subs.rip processes Japanese, Korean, and Chinese audio. The entire point of the product is accurate transcription and translation of Asian-language video content, so getting the speech-to-text model right is existential.

Whisper-large was the starting point because it's the default everyone reaches for. It's good. It handles clean studio audio in Japanese well. But subs.rip's users aren't feeding it clean studio audio. They're feeding it variety shows with overlapping speakers, Korean dramas with dialect, Chinese podcasts where the host code-switches between Mandarin and English mid-sentence.

Qwen3-ASR handles all of that better. Dialect recognition is noticeably stronger, and the code-switching case — where someone flips between languages mid-sentence — is where the gap is widest. Whisper commits to one language and garbles the other. Qwen rides the switch.

The pipeline splits transcription and translation into separate steps rather than asking one model to do both end-to-end. Qwen3-ASR transcribes to the source language; DeepSeek translates to English. Each model does the thing it's best at, and the translation gets clean text as input instead of audio it has to re-interpret.
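A minimal sketch of that split, assuming the ASR step returns timed segments and the translation step works on plain text. The `Segment` type, `make_subtitles`, and both callables are hypothetical; in a real pipeline the callables would wrap the Qwen3-ASR and DeepSeek clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def make_subtitles(
    audio_path: str,
    transcribe: Callable[[str], list[Segment]],
    translate: Callable[[str], str],
) -> list[Segment]:
    """Transcribe to the source language, then translate each segment's text."""
    source_segments = transcribe(audio_path)
    # Timing comes from the ASR step; the translator only ever sees clean text.
    return [Segment(s.start, s.end, translate(s.text)) for s in source_segments]


# Stand-ins so the sketch runs without any API keys.
def fake_transcribe(path: str) -> list[Segment]:
    return [Segment(0.0, 1.5, "こんにちは")]


def fake_translate(text: str) -> str:
    return {"こんにちは": "Hello"}[text]


subtitles = make_subtitles("episode.wav", fake_transcribe, fake_translate)
```

Because each step is a separate function, either model can be replaced without touching the other.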

The meta-pattern

None of these rows came from reading benchmarks. Benchmarks average across tasks; your workload is specific. The right question is never “which model is best?” It's “which model is best at this exact thing, at what cost?”

The answer changes. Six months from now I'll have different rows. The discipline is re-evaluating when the task shape or the cost shifts, not sticking with a provider out of habit.
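One way to turn "best at this exact thing, at what cost?" into a repeatable check is to score each candidate on your own samples and take the cheapest one that clears your bar. This is a sketch under that assumption, not how the rows above were chosen. The `Candidate` type and `cheapest_good_enough` are hypothetical, and any scores you feed in have to come from your own evaluation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    task_score: float     # your own measurement on your own samples, 0.0 to 1.0
    cost_per_unit: float  # USD per call, page, minute, or whatever the task bills in


def cheapest_good_enough(
    candidates: list[Candidate], min_score: float
) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the quality bar, or None."""
    passing = [c for c in candidates if c.task_score >= min_score]
    return min(passing, key=lambda c: c.cost_per_unit, default=None)
```

Re-running this whenever prices or the task shape change is the code equivalent of re-evaluating the table.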

Frequently asked

Won't this change in six months?

Yes. The routing table is a snapshot, not a commitment. Qwen3-ASR might lose to a future Whisper version. Gemini Image might get expensive. A new model might appear that does extraction and translation in one pass better than the split pipeline. The discipline is re-evaluating when the task shape or the cost changes, not loyally sticking with a provider. I treat model choices the same way I treat library choices — use the best one today, be ready to swap tomorrow.

What about audio generation models like Lyria?

I haven't shipped anything with audio generation yet. When I do, it'll get a slot in the table. I don't have opinions on models I haven't used in production. The whole point of this piece is that the routing comes from real usage, not speculation.