Ditching the cloud for local AI — how I use two mini PCs to process millions of tokens a day and save money on costly API fees

1 hour ago 1

For heavy AI users, the economics of the current boom are starting to bite. Over the past year, major labs have nudged prices upward while tightening the screws on usage — whether through stricter rate limits, smaller context windows on lower tiers, or the gradual reshuffling of features behind more expensive plans. Even where per-token costs have fallen in headline terms, the reality for users is more complicated: higher volumes, more complex workflows, and new tooling expectations mean monthly bills are creeping up, not down.

At the same time, open-weight models have improved rapidly, consumer hardware has become more capable, and tools like LM Studio, Ollama, and llama.cpp have made local deployment far more accessible than it was even a year ago. The result is a renaissance in running models on your own machines.

I’m one of the people who has taken the leap myself. In mid-March, I bought a GMKtech mini PC with an AMD Ryzen AI Max+ 395 chip and 96GB of RAM. The purchase — at the time something like £1,500 ($2,000) — was a calculated decision. The kinds of volume I wanted AI models to run through would have blown through my current subscriptions to AI models (I have a ChatGPT Plus and GLM Coding Lite plan, which combined cost me around $23 a month), and forced me onto the higher-cost monthly plans, or API-based inference.

The system I set up on my hardware was designed to try and help me keep track of the constantly changing news in the areas I cover for sites like Tom’s Hardware Premium and others. It takes RSS feeds and ingests the contents of stories in key beats that I cover, then grades them against a digital ‘brain’ made of how I think about the world and what I report on, generated by analyzing nearly 2,000 of my past stories over the previous four years.

When it finds candidates that are potentially interesting, those stories are ‘assigned’ to AI beat reporters, who then read around the subject on the web and produce pitches, similar to those that I send to my editors here and elsewhere. Those AI reporters then send their pitches to AI editors, who engage in a conversation with the reporters to fine-tune the idea’s framing, before presenting me with a couple of paragraphs of a broad idea that is meant to be tailored to my tastes via Telegram. The outputs are far from perfect — I’d equate them to a newly-graduated student that I teach in terms of their taste and depth — but they’re a good starting point for me to learn about what’s important on a given day, and a provocation for how I might think about framing those events. For the kind of things I’m using AI for, even the bleeding-edge frontier models aren’t much better than the local LLM options, though I appreciate that there’s a bigger gap when thinking about coding.

The whole process uses LM Studio and runs on a mix of quantized models, generally of Qwen3.5 and 3.6. Because I’m running multiple editor and reporter processes in parallel, the parameter count on each model may seem undersized for the 96GB of RAM that my AMD GPU can access (after some BIOS tweaks): I’m using a mix of Qwen’s straightforward 3.5-9B model, as well as Jackrong’s Qwen-3.5-9B-GLM-5.1-Distilled and Qwopus-3.5-9B models. In part, that’s because thousands of calls on the models take place every day, and in order to keep on top of the backlog of stories to look through and ‘discuss,’ throughput needs to be high.

For this kind of reading, thinking, analyzing, and re-presenting, local models work brilliantly. They have high throughput but are working in the background, meaning that the slower time to first token that many local LLM users complain about in comparison to big lab-hosted alternatives isn’t an issue for me. The model runs 24 hours a day, and if it takes two seconds or two minutes to process the prompts (between 7,000 and 18,000 tokens, depending on whether it’s a reporter or editor and how far through the discussion process it is), it doesn’t bother me. Tokens per second won’t impress those talking a big game about local LLMs on social media: the models handle the prompts at around 300 tok/s, while the output is a much slower 5-10 tok/s. Yet it works for me.

But for now, I’m still keeping my big lab subscriptions — though I’m using them differently. My GLM Coding plan, bought around Christmastime and which lasts for a year, is used alongside Codex through my OpenAI subscription to troubleshoot and tinker with the projects when issues arise. My coding knowledge stopped at some QuickBASIC and Delphi in my teenage years, so having the ability to call on them (and an OpenCode Go subscription I occasionally dip into) to fix problems is invaluable.

However, the proportion of my AI use has shifted significantly. Two-thirds or more of my total token use is now locally-hosted LLMs I run myself. And as local models continue to develop their abilities and the gap between them and the state of the art from big labs closes, I can envisage that it will increase. For instance, I recently vibe-coded a web interface for LM Studio that allows me to use it as a regular chatbot just this last week. And in just two months, the amount I’ve saved if I had run that project every day through API calls on GPT-5.4-mini, arguably a comparable model, is three-quarters of the cost of that first mini PC — around $1,500.

In hindsight, I wish I’d bought the 128GB version of my mini PC, which is why I decided around two weeks ago, before another memory-based price hike, to buy the bigger version. The reason was a simple one: the volume of queries I was putting through my 96GB box was starting to hit the limits, and I wanted to expand the project. I also wanted to test out locally hosted coding harnesses like Claude Code or Hermes using a local model.

The experience, trials, and tribulations from my first mini PC setup helped enormously with setting up the second PC. Token count has increased from 20-50 million tokens a day to more like 50-80 million tokens a day. I offloaded part of that massive ingest and analysis project onto the new hardware and put it onto more powerful 27B and 36B parameter models (through the Final-Bench-Darwin-36B-Opus model), freeing up space on my first mini PC and allowing me to test the idea of a locally-hosted Claude Code-style project with the spare space on my second mini PC.

That has been less successful — at least so far. Underpinning the coding harness with GLM-4.7-Flash works, but feels like too big a step back in model generations to be a useful tradeoff. Larger Qwen models have so far got stuck in their own thinking (or burned through a lot of the context window they’re assigned), but I’m considering swapping Claude Code out for a lighter-weight, less context-heavy harness and giving it a proper run.

The bet I’ve made is a simple one: subscription and API prices from frontier labs — with the odd outlier like DeepSeek excepted — are only going to go in one direction as the companies behind them realize they need to make a financial return for investors. Even if prices don’t go into the stratosphere, labs might make tradeoffs to cut down on usage — as we’ve already seen GitHub doing. And while the race to build capacity to meet demand for those major AI labs will continue to push up prices for hardware in the short term, I still think it’s a better bet to have control over your own models and how much you pay for them than to leave it in the hands of big companies.

So I’ll keep tinkering with my local stack, which has already gone from one mini PC to two interlinked ones — and already have my eyes on a PC with an Nvidia GPU to give me the token speed that’s currently missing. But for now, I think it’s worth keeping what I have for a while and seeing how I can eke out additional benefits before making the leap financially in expanding my whole system.

Read Entire Article