Can Local LLMs Write Articles Worth Reading? — A 6-Model Comparison
It Started With a Mac Upgrade
I'd been using a final-generation Intel MacBook Pro 16. Fine for daily work, but running multiple VSCode + Claude Code instances while playing background music on YouTube would frequently make things sluggish — typing lag of several seconds, IME freezes, fans roaring, all the classics whenever Chrome's memory hunger and SSD access piled up.
A few years into the Apple Silicon transition, the used market for M1 Max / 64GB had finally settled down, so I'd been watching for a deal. An essentially unused MacBook Pro 16 (M1 Max / 64GB) turned up second-hand at a steep discount, and I bought it.
After the switch, the freezes were gone, of course. But that created a new problem — or an "opportunity": "With 64GB of RAM and an M1 Max GPU, what's the best use of it?" A luxurious dilemma.
My everyday work doesn't even use 32GB. Letting the remaining 30GB+ sit idle felt wasteful. That's when I remembered the AI weekly news auto-generation pipeline I'd built for Draft (the article platform I run). I'd been running it on qwen2.5:72b (47GB), but the model was getting old, and I'd been wondering what would happen if I tried newer generations or different vendors.
"With 64GB I could just run six models in sequence and compare the output." That was the starting point.
The question, distilled, is one thing: If you push local LLMs to their limit, can they write something a human actually wants to read?
What I Mean By "News"
Before you can produce "something worth reading," you have to define what counts as "news." I sorted through the candidates:
- A list of RSS headlines — That's a table of contents for news, not news itself.
- Aggregated social media posts — Real-time, but tends to devolve into individual emotion and speculation without numerical backing. Good for trend feel, not trustworthy analysis.
- A self-built data collection pipeline + LLM processing — RSS, stock prices, regulatory info, and funding rounds collected myself, presented with data backing.
I went with the third. But merely listing events stops at "organized data." What turns it into "news as a reading experience," I think, is connecting events to show what's actually happening.
For example:
Demand for AI data center construction rises → Infrastructure and energy company stocks rally
If you can produce that level of "connect the dots" analysis, it's valuable to readers. If you can't, it's just a "this week's events list."
I've kept the option open to eventually swap this pipeline over to a commercial API (Claude / GPT-4 etc.). But for now I want to see how far you can go with "a local LLM an individual can reasonably afford." The reason is simple: API billing creates a fixed cost you keep paying whether or not you have readers. With local, once you've bought the hardware, the running cost is essentially electricity. That keeps you nimble during the experimental, low-traffic phase.
TL;DR
- I fed the same AI news data (Phase A/B/C shared) to six open-weight LLMs and had them write only Phase D onward (Headlines / Deep Dives / Stories / Takeaways) for comparison.
- Everything ran on Mac M1 Max 64GB + Ollama. Zero cloud API.
- Conclusion: qwen3.6:35b-a3b-q8_0 MoE (38GB / 10.6 minutes) is the overall winner on "numerical density × speed × Stories causality." The writing carries a sense of authorial enthusiasm.
- Big failure: qwen3-next:80b (50GB MoE) — physical limit on a 64GB Mac, the thinking-model trap.
- Side product: discovered many local-LLM operational pitfalls — MLX variants, Wired roundup articles, thinking mode, and more.
Baseline State and Remaining Issues
The qwen2.5:72b (47GB) pipeline I'd been using was already producing weekly articles at Draft-publishable quality. But three reader-facing dissatisfactions remained:
- Monotonous reading — Headlines → Deep Dives → Takeaways line up, but emphasis doesn't come through.
- Bottom lines escape into abstraction — Section closers default to clichés like "reshaping the industry" or "a new era of..."
- The model generation is old — qwen2.5 is over a year old. Hadn't tried Qwen3.6 / Llama3.3 / Gemma3 etc.
Items 1 and 2 I addressed via prompt improvements (Hook + Editor's Pick, Takeaways converted from imperative to suggestive form, Bottom-line banned-phrase list, etc.). Item 3 is the scope of this experiment.
Pipeline Structure (Phase A〜G)
Each article is generated through the following 7 phases. Each phase has an independent role, and intermediate JSON saved between them allows a "resume" execution that runs only the later stages.
Phase A: Data Collection
- Pulls last week's articles from 9 RSS feeds (Wired AI / TechCrunch AI / The Verge AI / MIT Tech Review / Hugging Face Blog / Google AI Blog / Ars Technica / etc.)
- Each article is body-extracted with trafilatura (HTML → clean text; see the sketch after this list)
- S&P500 stock and volume data from the Trade2 DB on the mini server
- AI-related bills from GovTrack / OpenStates
- Funding round info
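The body-extraction step is easy to show concretely. A minimal sketch of the trafilatura call mentioned above, assuming only that the library is installed; the URL and options are stand-ins, not the pipeline's exact config:

```python
import trafilatura

def fetch_clean_text(url: str) -> str | None:
    """Download a page and reduce the HTML to clean article text."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    # favor_precision trades a little recall for less nav/boilerplate in the body
    return trafilatura.extract(downloaded, favor_precision=True)

print(fetch_clean_text("https://example.com/some-ai-article"))
```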
Phase B: Theme Extraction
- One LLM call outputs "this week's 3-4 main themes" as JSON
- Each theme has narrative / evidence / market_impact / so_what / key_facts (sample shape below)
- Grouping accuracy is improved via "only bundle news from the same industry / same buyer" prompting
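As a concrete illustration of that schema, one theme object might look like this. The field names are the real ones listed above; the values are invented for the example:

```python
theme = {
    "narrative": "AI data center buildout is pulling infrastructure stocks upward",
    "evidence": ["Cerebras IPO filing", "multi-billion dollar AWS/OpenAI hardware deals"],
    "market_impact": "semiconductor and energy tickers outperformed the index",
    "so_what": "compute supply, not model quality, is the near-term bottleneck",
    "key_facts": ["$510M revenue in 2025", "76% YoY growth"],  # exact numbers ride along here
}
```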
Phase C: Theme Deep Research
- For each theme, the LLM asks "what's missing" (C2)
- DuckDuckGo web search retrieves answers (C3)
- Results are merged into enriched analysis (C4)
- Repeats up to 3 rounds until the LLM judges "I have enough" (loop sketched below)
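A minimal sketch of that loop, assuming a hypothetical `ask_llm()` helper that wraps the local model, plus the `duckduckgo_search` package for C3; prompts and the JSON contract are simplified:

```python
import json
from duckduckgo_search import DDGS

MAX_ROUNDS = 3

def deep_research(theme: dict, ask_llm) -> dict:
    """Phase C: gap analysis (C2) -> web search (C3) -> merge (C4), up to 3 rounds."""
    for _ in range(MAX_ROUNDS):
        gap = json.loads(ask_llm(
            'Reply as {"queries": [...], "enough": bool}: list up to 3 web '
            "searches that would fill missing facts in this theme:\n"
            + json.dumps(theme)))
        if gap.get("enough"):
            break
        snippets = [hit["body"]
                    for query in gap["queries"]
                    for hit in DDGS().text(query, max_results=3)]
        theme["enriched"] = ask_llm(
            "Merge these findings into the theme analysis:\n"
            + json.dumps(theme) + "\n\nFindings:\n" + "\n".join(snippets))
    return theme
```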
Phase D: Article Writing ← Model variation point
- D0: Hook + Editor's Pick (one-sentence hook + one must-read of the week)
- D1: Headlines (all news as one-line bullets, opinion pieces excluded)
- D2: Deep Dives (3-4 paragraphs per theme, 4 themes in parallel)
- D3: Stories (writes 🔗 Connecting the Dots if cross-theme connections exist; skipped if weak)
- D4: Takeaways (suggestive bullets, imperative voice forbidden)
Phase E: Self-Review + Revise
- LLM as reviewer checks "duplication / fabrication / generic phrasing / imperative tone"
- A separate LLM call applies the feedback as revisions (both calls sketched below)
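The two-call pattern, sketched with the same assumed `ask_llm()` helper (prompts abbreviated):

```python
def self_review(draft: str, ask_llm) -> str:
    """Phase E: one call critiques, a separate call applies the critique."""
    feedback = ask_llm(
        "Review this article for duplication, fabrication, generic phrasing, "
        "and imperative tone. List concrete issues only:\n\n" + draft)
    return ask_llm(
        "Revise the article to fix every issue below. Keep all numbers intact.\n\n"
        "Issues:\n" + feedback + "\n\nArticle:\n" + draft)
```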
Phase F: Final Polish + Publish
- Grammar / tone smoothing + internal-reference removal
- Programmatic removal of a "banned phrases" list ("reshaping the industry", "a new era of", etc.; see the sketch after this list)
- POST to Draft API to publish
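The banned-phrase pass is plain string work. A minimal sketch; the two phrases shown are the ones named above, and the real list is longer:

```python
import re

BANNED_PHRASES = ["reshaping the industry", "a new era of"]

def strip_banned_phrases(text: str) -> str:
    """Remove cliché phrases case-insensitively, then tidy leftover spacing."""
    for phrase in BANNED_PHRASES:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return re.sub(r" {2,}", " ", text)
```

Bare deletion can leave a limping sentence behind, so a pass like this works best as a backstop to the prompt-side ban rather than a replacement for it.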
Phase G: Japanese Translation (added this round)
- Pass the English article_final.md to the MoE model for Japanese rendering
- Strict adherence to "だ・である" tone, full preservation of proper nouns and numbers (prompt sketched below)
- Average 3.8 minutes per article (detailed below)
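A rough shape of the Phase G call, again assuming a hypothetical `ask_llm()` wrapper around the MoE model. The explicit no-rounding rule reflects a lesson covered later: without it, dense models compress "12 million" into "数百万":

```python
TRANSLATE_RULES = """Translate the following English article into Japanese.
Rules:
- Write in plain だ・である style throughout.
- Preserve every proper noun and every number exactly; do not round numbers.
- Do not add or drop content.

"""

def translate_to_japanese(article_md: str, ask_llm) -> str:
    return ask_llm(TRANSLATE_RULES + article_md)
```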
Validation Structure
For this experiment, I froze the Phase A/B/C output and ran only Phase D〜F across each model. A script called `resume_en_from_d.py` loads the saved JSON and executes from Phase D. This makes it a fair comparison: same source material, different writers.
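In outline, the resume script does nothing clever: load the frozen JSON, then call only the later phases. The file names and `run_phase_*` callables below are assumptions about the pipeline's internals, not its actual code:

```python
import json
from pathlib import Path

RUN_DIR = Path("runs/frozen_abc")  # hypothetical directory holding Phase A/B/C output

def resume_from_d(model: str, run_phase_d, run_phase_e, run_phase_f) -> None:
    """Re-run only the writer phases (D-F) against frozen source material."""
    themes = json.loads((RUN_DIR / "phase_b_themes.json").read_text())
    research = json.loads((RUN_DIR / "phase_c_research.json").read_text())
    draft = run_phase_d(themes, research, model=model)  # D0-D4 writing
    revised = run_phase_e(draft, model=model)           # self-review + revise
    run_phase_f(revised, model=model)                   # polish + publish
```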
Models Tested
| Ver | Model | Size | Architecture | Pull Source |
|---|---|---|---|---|
| v11 | qwen2.5:72b | 47GB | dense | Ollama |
| v13 | qwen3.6:27b-q8_0 | 30GB | dense Q8 | Ollama |
| v18 | qwen3.6:35b-a3b-q8_0 | 38GB | MoE 35B (active 3B) Q8 | Ollama |
| v14 | llama3.3:70b | 42GB | dense Q4 | Ollama |
| v15 | gemma3:27b | 17GB | dense | Ollama |
| v16 | mistral-small3.2:24b | 15GB | dense | Ollama |
| v17 | deepseek-r1:70b | 42GB | thinking-type | Ollama |
| ❌ v12 | qwen3-next:80b | 50GB | MoE thinking-type | Failed (detailed below) |
Results Summary
| Ver | Model | Size | Runtime | Chars | One-line |
|---|---|---|---|---|---|
| v11 | qwen2.5:72b | 47GB | 50 min | 10,461 | Baseline (old generation) |
| v13 | qwen3.6:27b-q8_0 | 30GB | 27.4 min | 13,633 | MBA-paper voice, careful driver |
| v18 | qwen3.6:35b-a3b-q8_0 | 38GB | 10.6 min | 16,489 | Editor who uses every number, fastest |
| v14 | llama3.3:70b | 42GB | 158.3 min | 13,169 | No 70B dignity, disappointing giant |
| v15 | gemma3:27b | 17GB | 25.3 min | 17,457 | Verbose analyst, quietly inflates numbers |
| v16 | mistral-small3.2:24b | 15GB | 15.6 min | 12,648 | Solid mid-tier |
| v17 | deepseek-r1:70b | 42GB | 239.6 min | 10,686 | Thinking-type, all of Headlines / Phase E/F TIMEOUT |
Numbers Worth Noting
- MoE really is fast: v18 (35B-A3B) has 3B active parameters, so it's 2.6× faster than dense 27B (v13)
- Quantization loss is smaller than expected: Despite the 27B vs 72B size gap between v13 (Q8) and v11 (Q4), v13 won on textual naturalness
- 70B-class needs CPU-spillover caution: v14 llama3.3:70b shows a 102GB footprint with KV cache (CONTEXT 131072) and runs 48% CPU / 52% GPU as the data spills into main memory. Runtime ballooned 3×.
Failure Case 1: The qwen3-next:80b Trap
The model I'd been most excited about, qwen3-next:80b (50GB MoE), ultimately didn't work. This was an important lesson about running 80B-class open-weight LLMs on a 64GB Mac.
Attempt 1: think:False (v12b)
Specified `"think": false` via Ollama's API. The intent: "don't generate thinking, just emit the final response."
Result: the thinking process flowed straight into the response. The article opened like this:
```
Okay, let's tackle this. The user wants me to final proofread and output
the corrected full text. But first, I need to understand what exactly
needs to be done. Looking at the user's message, they provided a task:...
```
22,773 characters of article_final.md, almost entirely the LLM's monologue. qwen3-next's chat template isn't designed to separate thinking with `<think>` tags, so `"think": false` simply has no effect.
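For reference, the v12b call looked roughly like this. The `/api/generate` endpoint and the `think` field are Ollama's documented API; the prompt is a stand-in:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3-next:80b",
        "prompt": "Write the Headlines section from this JSON: ...",
        "think": False,  # only honored if the chat template separates thinking
        "stream": False,
    },
    timeout=1800,
)
# with qwen3-next, the monologue lands in "response" regardless of the flag
print(resp.json()["response"])
```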
Attempt 2: think:True (v12c)
"OK then, enable thinking and let it separate into the thinking field." Switched to "think": true.
Result: macOS Jetsam SIGKILL'd Python.
Specifically:
- qwen3-next:80b model itself: 50GB
- Ollama KV cache (including thinking): +9GB
- GPU using 59GB (right at the edge of the 60GB wired_limit)
- Memory left for OS / Python / Ollama server: 5GB → Pages free 65MB
- macOS Jetsam (the OOM killer equivalent under memory pressure) killed the Python process with exit 144
Running an 80B + thinking model on a 64GB Mac is physically tight. You'd want at least 96GB.
Lessons
- Thinking models inflate KV cache, so plan for the visible model size + 10〜15GB of headroom
- The behavior of Ollama's `think` parameter is model-dependent (qwen3-next vs qwen3.6 are different beasts)
- Practical ceiling on a 64GB machine is around 50GB models (KV cache included)
Failure Case 2: The MLX bf16 Trap
"Since I have an M1 Max, surely Apple's MLX bf16 would give the highest precision?" — that was the thought when I tried qwen3.6:27b-mlx-bf16 (55GB).
Attempt 1: Homebrew Ollama 0.20.5
```
Error: 500 Internal Server Error: mlx runner failed:
MLX not available: failed to load MLX dynamic library
```
The Homebrew 0.20.5 build doesn't bundle the MLX library.
Attempt 2: Upgrade to Homebrew Ollama 0.21.2
`brew upgrade ollama` brought in 0.21.2 + mlx + mlx-c. Retry:
```
panic: mlx: There is no Stream(gpu, 1) in current thread.
at mlx-c-0.6.0/mlx/c/transforms.cpp:73
```
A bug in mlx-c 0.6.0 caused the panic.
Attempt 3: Switch to the Official Ollama .pkg
The official build bundles a different MLX (0.31.1-23-g38ad257) and didn't panic. But prompt processing was catastrophically slow:
- Processing a 14-token prompt blew past 90 seconds, and it was still stuck at 13/14 tokens after 5 minutes
- 41 minutes elapsed and the "say pong" test never completed
bf16 prioritizes precision at a cost in inference speed that puts it beyond practical use. I gave up on running it on the M1 Max.
Lessons
- MLX bf16 isn't "fast MLX" — it's "precision-first slow MLX"
- Picking a local-LLM variant by its spec sheet alone (size, precision) is a trap
- The official .pkg and Homebrew builds bundle different MLX (even at the same Ollama version)
Failure Case 3: Wired Weekly Roundup Articles
In a Phase D Deep Dive, I fed the Wired AI piece "It Takes 2 Minutes to Hack the EU's New Age-Verification App." The generated Deep Dive wasn't about the EU age-verification hack at all — it was about Section 702 (US wiretap powers) / Meta Ray-Ban / Deepfake.
The cause was simple: that Wired piece is a weekly roundup. The EU hack only occupies a small portion of the opening; the remaining 4,000 characters cram in other topics. The LLM read the whole body and ended up writing about the latter half.
Fix
I wrote a function `extract_title_relevant_text()` that scans the body paragraph by paragraph and keeps only the paragraphs containing title keywords. Specifically:
- H2 detection: detect a short paragraph matching the title (often acting as an H2 header) and grab from there to the next H2
- Keyword paragraph filter: keep paragraphs containing title-derived words; cut off after 2 consecutive non-matching paragraphs
- Fallback: if extraction is too short, return the original opening
This dropped the Section 702 / Meta / Deepfake bleed-through to zero. For pipelines that handle roundup-style sources, this preprocessing is essential.
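A condensed sketch of the keyword-filter part (H2 detection omitted; the thresholds are simplified from the description above):

```python
import re

def extract_title_relevant_text(title: str, body: str, min_len: int = 400) -> str:
    keywords = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", title)}
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    kept: list[str] = []
    misses = 0
    for para in paragraphs:
        if any(k in para.lower() for k in keywords):
            kept.append(para)
            misses = 0
        elif kept:
            misses += 1
            if misses >= 2:  # cut off after 2 consecutive non-matching paragraphs
                break
    text = "\n\n".join(kept)
    # fallback: if extraction came up too short, return the original opening
    return text if len(text) >= min_len else body[:2000]
```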
Failure Case 4: deepseek-r1:70b — Thinking Mode Is a Landmine for Mass Production
The last comparison target was deepseek-r1:70b (a thinking-type model). The expectation: "thinks more deeply, produces deeper insight." The result was the opposite.
What Happened (v17)
Runtime 239.6 minutes (4 hours) for a "completed" run, but the breakdown is grim:
| Phase | Result |
|---|---|
| D0 Hook + Pick | ✓ Got it |
| D1 Headlines | ❌ TIMEOUT (exceeded 30 minutes, empty response) |
| D2 Deep Dives 4 sections | ✓ Each took 15-20 minutes |
| D3 Stories | ✓ |
| D4 Takeaways | ✓ |
| Phase E review | ❌ TIMEOUT |
| Phase E revise | ❌ TIMEOUT |
| Phase F polish | ❌ TIMEOUT |
So the final article_final.md is the Phase D draft passed untouched through Phase E/F, with the Headlines section missing. It doesn't qualify as a comparison subject.
Cause
deepseek-r1 is thinking-type — generates a long internal thinking process before responding. In this pipeline:
- Headlines is a bulk task summarizing 14 articles each in one line → 14 chains of thought blew through the 30-minute TIMEOUT
- Phase E review reads the full article and audits everything for duplication/fabrication → thinking expanded too much, timed out
- Phase F polish, same
It can manage small tasks like "one paragraph per theme" in deep dives, but on tasks that require "parallel, large, exhaustive" coverage, the thinking chain self-destructs.
The Deep Dive Content Was Actually Good
The Deep Dive sections that did finish were high quality. Bottom lines were specific:
Cerebras's IPO filing and multi-billion dollar deals with AWS and OpenAI position it as a formidable competitor to Nvidia in the high-performance AI hardware market.
Hugging Face's Nemotron OCR v2, trained on 12 million synthetic images, reduces NED scores for non-English languages to as low as 0.035, demonstrating the transformative potential of synthetic data in multilingual OCR.
Numerical use is at v18's level, logical structure at v13's level. If Headlines and Phase E/F had worked, it could have made the top tier.
Lessons
- Thinking-type LLMs are bad at "parallel bulk tasks". Headlines-style "summarize N items in one line each at once" is fatally heavy
- Stretching TIMEOUT from 30 min to 2 hours might make it work, but a pipeline that takes 8 hours per article isn't realistic
- If you use deepseek-r1, you have to design tasks to "1 call = 1 theme deep-dive" granularity. Not suited to a general-purpose article generation pipeline like this
- Speed ratio = v18 (10.6 min) vs v17 (240 min) = about 23×. Not a production option.
For this pipeline, deepseek-r1 is firmly out.
5-Model Feature Comparison Table
5-tier rating (more ★ = better; for "fabrication risk" only, more ★ = lower risk).
| Item | v13 qwen3.6 dense | v18 qwen3.6 MoE | v14 llama3.3 70B | v15 gemma3 27B | v16 mistral-small3.2 24B |
|---|---|---|---|---|---|
| Size / Speed | ★★★ | ★★★★★ | ★ | ★★★ | ★★★★ |
| Numerical accuracy | ★★★★ | ★★★★★ | ★★★ | ★★★★ | ★★★★ |
| Stylistic naturalness | ★★★★★ | ★★★★ | ★★ | ★★★★ | ★★★ |
| Logical structure | ★★★★★ | ★★★★★ | ★★ | ★★★★ | ★★★ |
| Hook | ★★★★★ | ★★★★ | ★ | ★★ | ★★★ |
| Bottom-line specificity | ★★★★ | ★★★★★ | ★★★ | ★★★★★ | ★★★ |
| Fabrication risk (low) | ★★★★★ | ★★★★★ | ★★★ | ★★ | ★★★★ |
| Translatability | ★★★★★ | ★★★★ | ★★★★ | ★★★ | ★★★★ |
Use-Case Rankings
| Use case | 1st | 2nd | 3rd |
|---|---|---|---|
| Want to read it as prose | v18 | v13 | v15 |
| Want it dense with numbers | v18 | v15 | v16 |
| Speed-first (mass production) | v18 (10.6 min) | v16 (15.6 min) | v15 (25.3 min) |
| Memory-conscious / lightweight | v16 (15GB) | v15 (17GB) | v13 (30GB) |
Per-Model Quirks and Observations
v13 qwen3.6:27b-q8_0 — MBA-Paper Voice
- Each Deep Dive is built on the 3-step "fact → structural driver → strategic implication"
- Numbers strictly follow source data. Doesn't invent unknown numbers; substitutes "high throughput", "significantly reducing", etc.
- Deliberately omits the Stories section. It's optional in the prompt, so it makes the call not to force it
- Slightly stiff voice, but translates most naturally
- Quote:
This direct confrontation with Nvidia's dominance is further evidenced by...
v18 qwen3.6:35b-a3b-q8_0 (MoE) — Editor Who Uses Every Number
- Best at using numbers: deploys `$510M / 76% / $237.8M / non-GAAP $75.7M loss` in full
- Connecting the Dots (Stories) causality is more structural than in other models
- Thanks to MoE 3B active, 10.6 minutes — astonishing speed (1/3 of dense 27B v13)
- Quote:
generated $510 million in revenue in 2025, a nearly 76% increase from the prior year
v14 llama3.3:70b — Zero 70B Dignity
- The largest model tested at 70B, yet grammar mistakes remain. Quote: "Cerebras' IPO filing and its $23 billion valuation **the company to** further challenge Nvidia's" (verb missing)
- The Hook is extremely short: just "Cerebras files for IPO amid $15 million cryptocurrency heist."
- The 158.3-minute runtime came from CPU/GPU spillover (KV cache exceeding VRAM, falling into main memory) — 3× balloon
- A textbook case for "size doesn't equal quality"
v15 gemma3:27b — Verbose Analyst, Inflates Numbers
- Most verbose (17,457 chars). Aggressive enough to make specific predictions
- But the highest fabrication risk. Examples:
  - "App releases surged 60% year-over-year in **Q1 2025**" ← source data says Q1 2026
  - "the EU will likely mandate independent security audits ... **by Q1 2027**" ← states a prediction as if fact
  - "$2.2 billion in Series G and H funding rounds" ← Series G $1.1B + Series H $1B summed; that combined figure isn't in the source
- 17GB / lightweight, but fact-checking is mandatory before publishing
v16 mistral-small3.2:24b — Solid Mid-Tier
- 15GB / 15.6 min, lightest and fastest
- Content is unobjectionable, low fabrication risk
- Editor-meta leakage at the top: "Here is the corrected full text with the requested edits:" shows up at the start of the article
- Occasional syntax breakdowns: "Anthropic's cybersecurity model could for secure AI applications" (verb missing)
- With a post-processing step to strip the leading meta, it's fine for lightweight use
Adoption: qwen3.6:35b-a3b-q8_0 (MoE) Across the Board
For both English writing and Japanese translation, going with qwen3.6:35b-a3b-q8_0 (MoE) as the single model.
| Stage | Model | Size | Speed |
|---|---|---|---|
| English article generation | qwen3.6:35b-a3b-q8_0 (MoE) | 38GB | 10.6 min/article |
| Japanese translation | qwen3.6:35b-a3b-q8_0 (MoE) | same | 3.8 min/article |
Three Reasons MoE Won
I started thinking v13 dense's "MBA-paper-style careful driving" was the right answer for prose. But after reading both through to the final form, the conclusion flipped. MoE has more of a "human writing with conviction" feel.
- **Doesn't dodge numbers.** v13 hedges with "significant majority" and "substantial portion", ducking unknown numbers conservatively. v18 fires off $510M, 76% growth, $237.8M net income, non-GAAP $75.7M loss, 12 million synthetic images, 34.7 pages/sec, NED 0.035–0.069 — all in. From the same source, v18's information density is clearly higher.
- **Stories (🔗 Connecting the Dots) has actual causality.** v13 chose to omit this section entirely (it's optional in the prompt). v18 lays out a 3-theme causal chain: "surging hardware investment → rising dependence on security infrastructure to protect those compute assets → market expectation around Anthropic's cybersecurity model." Readers shift from "huh" to "I see."
- **2.5× faster.** v13 took 27 minutes; v18 took 10.6. Even on a weekly cadence, the headroom for "want to publish more" is different.
MoE Wins on Translation Too
The same MoE handles translation. In an experiment translating v13's English article into Japanese with both models:
- dense compresses numbers ("12 million" → "数百万", i.e. "several million", losing the figure)
- MoE preserves all numbers (5億1,000万ドル / 1,200万枚 / NED 0.035-0.069 — all source-faithful)
- MoE is 2.6× faster (dense 9.8 min vs MoE 3.8 min)
English writing and translation both running on the same model keeps operations simple. Model swap cost and memory occupancy stay at one model's worth.
Dense's Polish vs MoE's Heat
What I thought was dense's only advantage — "stylistic polish" — turned out to lose to the persuasiveness of arguments backed by numbers when read alongside MoE. Less softness of tone, more data-driven heat. The MoE 35B active 3B architecture is designed to "preserve quality at the parameter scale while accelerating inference," and it turned out to be especially well-matched to news article generation, which demands "logical structure + numerical accuracy + speed."
Use Cases (Reference)
| Use case | Pick | Reason |
|---|---|---|
| Want to read it as prose | v18 qwen3.6 MoE | Number density + causality + speed; you feel the writer's heat |
| Stylistic polish first | v13 qwen3.6 dense | MBA-paper voice, safe via hedging but less information |
| Lightweight / hardware-constrained | v16 mistral-small3.2 | 15GB / 15.6 min, assuming a post-processing script |
Models Not Adopted
- v14 llama3.3:70b — Mediocre across size, time, and quality (grammar mistakes remain even at 70B)
- v15 gemma3:27b — Fabrication risk is too high; fact-checking labor erases the savings ("Q1 2025" vs "Q1 2026" mistakes etc.)
- v16 mistral-small3.2:24b — Editor-meta leakage makes it sub-production ("Here is the corrected text..." at the top)
- v17 deepseek-r1:70b — Thinking-type mismatched with the pipeline structure (Headlines / Phase E/F all TIMEOUT)
The biggest takeaway is that "size," "vendor," and "generation" guarantee none of the quality. You can't pick from benchmark numbers alone — you have to actually run the same data through and read them side by side. I also got to experience first-hand that the relatively new MoE architecture can flip dense for the right use case.
Lessons Summary
Model Selection
- Practical ceiling on a 64GB Mac is around 50GB models (KV cache included)
- 70B+ thinking models are heavy on a 64GB machine (96GB recommended)
- MoE speed is at the active-parameter level. 35B-A3B inference runs at roughly 3B speed
- Q8 quantization loss is small. No need to insist on bf16
Ollama Ops
- Homebrew and official .pkg builds bundle different libraries
- `think: true/false` behavior is model-dependent (qwen3-next is broken both ways; qwen3.6 works with false)
- When swapping models, an explicit `ollama stop` is needed, or memory leaks (see the sketch after this list)
- `iogpu.wired_limit_mb=61440` (set via `sysctl`) extends the M1 Max GPU memory cap to 60GB (resets on reboot)
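The model-swap hygiene from the list above, as a loop. `ollama stop` is a real CLI subcommand; the loop around it is illustrative:

```python
import subprocess

MODELS = ["qwen3.6:27b-q8_0", "qwen3.6:35b-a3b-q8_0", "gemma3:27b"]

previous = None
for model in MODELS:
    if previous:
        # unload the previous model explicitly so its weights leave memory
        subprocess.run(["ollama", "stop", previous], check=False)
    # ... run the Phase D-F comparison with `model` here ...
    previous = model
```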
News Article Generation Pipeline
- Wired/Verge weekly roundup articles mix unrelated topics in the body → preprocessing to extract title-related paragraphs is mandatory
- Phase B narrative tends to compress specific facts → carrying important numbers in a `key_facts` field is effective
- Confirm sourcing before using "partners with" in Headlines (LLMs jump to "partnership" too easily)
- Without a Bottom-line banned-phrase list, "reshaping the industry" / "a new era of" run rampant
- Imperative-voice Takeaways ("Watch X" / "Evaluate Y") translate into preachy "〜せよ" in Japanese → force "observation" / "implication" form
- Without "do not round numbers" explicitly in the translation prompt, dense models compress "12 million" into "数百万"
Side Note: Built an Unlisted Publishing Feature for Draft
For readers who want to read each model's full article, I posted them as unlisted on Draft with URLs below. They don't appear in the listing — only people who know the URL can read them.
Draft (the platform I run) didn't have this "unlisted" feature, so I added it to the server + CLI for this comparison article. I added `unlisted` to the status enum; the listing query was already `status='published'`-only, so no extra exclusion logic was needed. The article detail page blocks via `if status in ('draft','pending_review','blocked')`, and `unlisted` isn't on that list, so it falls into the "anyone can view" bucket automatically. A two-file change, minimal.
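A minimal sketch of those two checks, with stand-in function names (the real handlers live in Draft's server code):

```python
BLOCKED_FOR_DETAIL = ("draft", "pending_review", "blocked")

def visible_in_listing(status: str) -> bool:
    # the listing query was already published-only, so 'unlisted' never shows up
    return status == "published"

def visible_on_detail_page(status: str) -> bool:
    # 'unlisted' is absent from the block list, so a direct URL renders it
    return status not in BLOCKED_FOR_DETAIL
```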
On the CLI side, I added a `draft publish --unlisted` flag.
```bash
draft publish            # Normal publish (appears in listing)
draft publish --unlisted # Unlisted (hidden from listing, direct URL only)
```
Each model's full article (English + Japanese):
- v13 qwen3.6:27b-q8_0 (dense): English / 日本語
- v18 qwen3.6:35b-a3b-q8_0 (MoE) ← Adopted: English / 日本語
- v14 llama3.3:70b: English / 日本語
- v15 gemma3:27b: English / 日本語
- v16 mistral-small3.2:24b: English / 日本語
- v17 deepseek-r1:70b (partial-failure reference): English / 日本語
Work log: 2026-04-25 〜 2026-04-26, Mac M1 Max 64GB / Ollama 0.21.2 official build