Local AI Models Are Getting Dangerously Good

Local AI Models Are Getting Dangerously Good

Okay, I have to preface this video with a warning. For the next few minutes, I’ll be a petty little guy who is just happy to see that not everything goes according to plan for the world’s greatest villains. And if you think villain is too strong a word, let me know what we should call people who seem genuinely excited to dismantle the middle class in their pursuit of enough wealth and fame to earn a spot in Homelander’s inner circle.

Well, today we should celebrate because the co-founder of Hugging Face recently tweeted the following:

“Qwen 3.6 is running inside of a pi coding agent via llama.cpp on a MacBook Pro, and for non-trivial tasks on the Hugging Face codebases, this feels very, very close to hitting the latest Opus in Claude or whatever shiny monopolistic closed-source API of the day is, in full airplane mode.”

But if you are getting too excited, please let me stop you. After all, this tweet comes from somebody involved in running an open-model platform, and he has some bias on the matter. If something seems to be too good to be true, it usually is. And don’t get me wrong, as much as I love celebrating any small win we have against the AI team, we have to spend this Monday morning review realistically assessing the current capabilities of local models.

The claim is pretty bold. Basically, you have a model small enough to run locally on a $2,000 laptop in airplane mode, which allegedly has similar outputs as the state-of-the-art trillion-dollar data center machines that consume enough electricity and water that they are literally changing weather patterns.

What Are Local Models?

Now, if you are not familiar with local models, here is a quick TL;DR.

Most AI tools people use today work like this: You type something into an app. That request is sent to a server somewhere. The model runs in a giant data center, and the answer comes back over the internet.

A local model runs directly on your machine with no internet connection needed. This will definitely be a trend in the future because we are already seeing hardware manufacturers build specifically for this world. And the local hardware setup is actually a key aspect here. This worked so smoothly on a laptop thanks to Apple’s M-series, which has unified memory, meaning the GPU can access system RAM as VRAM. A $2,000 Windows laptop with a 4 GB discrete GPU would probably choke on this since the model needs about 17 GB to work.

Another key aspect in this local architecture is llama.cpp. This is an open-source project that lets people run large language models efficiently on normal hardware, especially on CPUs and consumer GPUs. It became popular because it made local AI practical for people who don’t own a small data center or a pile of leftover crypto-mining hardware. Basically, llama.cpp is one of the reasons you can take a large model like Qwen, compress it through techniques like quantization, and run it on a MacBook.

Quantization is another important detail here. Normally, AI models are large and expensive to run because their weights are stored with high numerical precision. Quantization reduces that precision, making the model smaller and faster at the cost of some quality. This is basically a trade-off where you lose some accuracy, but you gain the ability to run the thing on hardware that was originally purchased for producing petty YouTube videos while playing in closed source.

Also, note that choosing Qwen as the model for this test was not accidental. Qwen 3.6 has built-in native thinking mode reasoning steps that it passes along through the code context. So, it is doing deep reasoning entirely on consumer hardware.

The Unbeatable Advantages of Going Local

Benchmarks aside, there’s a whole category of advantages local models have that the closed labs literally cannot compete with. The most important one is that it runs locally, so your code stays on your machine. That means:

  • fewer security and compliance issues
  • no dependency on a cloud provider
  • no rate limits
  • no usage caps
  • no usage-based bill

So going back to the original claim, you can already see that the tweet comes with a lot of footnotes. Can a local model have similar results to what the most recent Claude or Codex is offering? Obviously, not across the board, but the fact that local models are now good enough for this comparison to even be taken seriously is already a huge deal. Both for us developers, but mostly for the AI marketing and hype machine we are so sick and tired of.

And before some Anthropic fanboy starts roasting me in the comments, let me be specific about the actual gap between these two approaches. For narrow work like writing a function, debugging a file, or scaffolding a CRUD app, local models are getting suspiciously close. However, hand the same local model a 50-file monorepo and ask it to refactor the thing while holding 200,000 tokens of context for 3 hours, and you’ll have to buy a new MacBook. Clearly, Opus is still in a different weight class.

The Trillion-Dollar AI Bubble

But here is the part the AI labs really don’t want you to think about for too long. The second scenario of an autonomous senior engineer running for hours and replacing entire teams is what they sell to investors and CEOs. But we all know that this has no legs in the real world. And as much as Anthropic is trying to sell this idea with projects like writing a compiler from scratch or rewriting the entire Bun codebase from Zig to Rust, these are just marketing stunts with very little relevance in the real software world.

A couple of weeks ago, an Nvidia executive stated that the cost of compute is far beyond the cost of the employees. And right now, AI is more expensive than paying human workers. We all know AI labs are operating at massive losses while trying to capture market share and convince everyone that their tools can increase worker productivity, or better yet, replace workers entirely.

But what we have so far is:

  • Amazon workers “token maxing” and doing fake AI work to keep managers happy.
  • Uber running through its entire 2026 AI budget just to end up with a codebase in a worse shape.
  • Meta creating an internal leaderboard and incentivizing people to use AI, which led to the company burning through 60 trillion tokens in 30 days.

The top user at Meta burned 281 billion tokens, which at Claude Opus’s pricing would be $1.4 million for one engineer in a month. This puts Meta on track for what would be a $900 million monthly API bill. I used to laugh when Jensen Huang said every developer should get $250,000 worth of tokens. Now, somehow he sounds like the reasonable one in the room.

So, these AI companies are burning through billions of dollars in their attempt to make this happen. And now, all of a sudden, there is a possibility you’ll get maybe 80% of the same result by running a free model on your local laptop. It must have been a rough couple of weeks at Anthropic headquarters, and I’m sure having Musk breathing down their backs didn’t help either.

I was mentioning in a recent video that nobody really knows how this LLM era will end. Things are moving extremely fast, and 6 months from now local models could close the gap, or Sam Altman could reach AGI. As always, I have to mention yet again that I don’t have a problem with the technology itself, which is rather useful in many scenarios, but all the terrible practices and hype around it are just the worst.

Trivia: Why Apple Silicon Feels Like Retro Hardware

And before wrapping things up, here is your awesome trivia of the day. I mentioned that none of this would be possible without Apple silicon and the unified memory architecture.

When we talk about this layout where the GPU directly accesses system RAM to load a massive 17 GB AI model on a laptop, we tend to treat it like a brand new invention. But the core concept actually dates back to retro computing and early video game consoles. Apple essentially revived the hardware architecture used in the ’90s.

In a standard modern PC, the CPU and GPU are separated. If the GPU needs data, the CPU has to copy it from system RAM and transfer it over a hardware interface called the PCIe lane into the dedicated video memory. This transfer process creates a massive data bottleneck when running large language models.

Back in the 1980s and ’90s, computer and console manufacturers couldn’t afford to put separate dedicated memory chips for graphics into consumer hardware because it was too expensive. So, in order to save money, engineers designed systems where all components share the exact same pool of memory. Nintendo used the exact same approach for the N64, running the entire system on a unified pool of Rambus RAM, allowing its custom Silicon Graphics co-processor to handle 3D geometry without a traditional memory bottleneck.

The PC industry eventually abandoned this layout because modular swappable parts allowed users to upgrade their components independently. When Apple decided to shift to its M-series, they looked at the traditional PC layout and recognized its inefficiencies for modern data-heavy workloads. By soldering the RAM directly onto the same chip package as the processors, the CPU, GPU, and neural engine all access a single memory pool.

Because of this, when you run a model like Qwen locally on a Mac, the GPU doesn’t waste time transferring gigabytes of weights over a bottleneck interface. The data is already sitting in the shared memory pool, fully accessible.

This proves once again that the ’90s was the golden age of humanity when somehow we managed to do everything right.

If you like this video, you should consider joining our community where I’m posting more dedicated weekly content. Please don’t forget to smash all the buttons.