Bring your own brain? Why local LLMs are taking off

Feature After a decade or two of the cloud, we’re used to paying for our computing capability by the megabyte. As AI takes off, the whole cycle promises to repeat itself, and while AI might seem relatively cheap now, it might not always be so.

Foundational AI model-as-a-service companies charge for insights by the token, and they’re doing it at a loss. The profits will have to come eventually, whether directly from your pocket or from your data. Either way, you might be interested in other ways to get the benefits of AI without being beholden to a corporation.

Increasingly, people are experimenting with running those models themselves. Thanks to developments in hardware and software, it’s more realistic than you might think.

Think local

There’s a cultural shift driving local LLM adoption, and part of it has to do with distrust of big tech. Pew Research Center found 81 percent of Americans fret that AI companies will misuse their data. The Federal Trade Commission felt it necessary to warn AI model companies to honor their commitments around customer data. That was before the present administration came to power and changed the regulatory landscape.

OpenAI has said it will forget your chats if you ask it to, but that doesn’t mean it purges that data. In fact, it can’t. A court ordered the company to retain its chat logs as part of the case it’s currently fighting against the New York Times and other publications.

Even those that start with a focus on ethics and privacy will bend to market dynamics. Anthropic extended its data retention rules from 30 days to five years in late August, just a few days after announcing that it was giving its AI model a memory. It also started training its models on user data. Yes, users can turn this off, but these are opt-out policies, rather than opt-in.

The privacy argument segues into a sovereignty one, especially as the US takes the brakes off AI regulation. European companies are considering making their own alternatives. For example, German engineering company Makandra baked its own locally run AI to ensure that its usage followed GDPR rules.

Companies that advocate for local LLMs also cite technological democracy as a driver. “AI is one of the greatest sources of leverage humanity has ever had,” says Emre Can Kartal, growth engineer and lead at Jan, a project from Menlo Research to build locally run models and the tools that manage them. “Our mission is to make sure it remains open and in people’s hands, not concentrated [among] a few tech giants.”

Cost is also a factor. AI companies selling compute power at a loss tend to rate-limit users. Anyone who pays more than $100 a month to one of the foundational model vendors only to get cut off during a marathon AI-powered coding session will understand the issue.

“I was experimenting extensively with GPT-3 (before ChatGPT), and was building programs you might call ‘agents’ today,” says Yagil Burowski, founder of LM Studio, a tool that allows users to download and run LLMs. “It was a real bummer to remember that, every time my code runs it cost money, because there was just so much to explore.”

Environmental impacts

Not worried about the financial impact of overspending on tokens? Perhaps the environmental impact of cloud AI might give you pause. US datacenters will consume over 9 percent of the country’s electricity by 2030, according to research company EPRI. Many use evaporative cooling that slurps vast quantities of water, with the rough calculation working out at about half a liter per conversation.

The environmental advantages of running LLMs yourself lie not so much in the carbon cost of the training, but in the inference. If you’re using an open-weight foundational model, the training has already happened, and the more inference you shift from the cloud to your own machine, the bigger the difference you can make. The only liquid you’ll likely use to cool your PC is in a closed loop, so you’re not wasting water.

There are caveats. You’re still chewing through power locally, and where you get that power from makes a difference. Someone whose grid relies on hydroelectric power will fare better than someone whose region burns coal. There’s also the carbon lifecycle of the PC components to consider. Semiconductor manufacturing produces lots of greenhouse gas.

Squeezing the models

Generally, though, the more you use generative AI (or even classic AI for that matter), the more appealing a local model becomes. So what do you need to run one effectively? A lot depends on the precision at which you run the model, and you can dial that in using a key concept in LLMs: quantization.

Quantization reduces the numerical precision of a neural network’s weights, which cuts the storage and computing power needed to process them. You increase quantization by lowering the precision of the floating-point values, potentially even replacing them with plain integers.

While quantization decreases the accuracy of the neural network-based algorithms that underpin generative AI, the effect isn’t substantial. And the power and performance gains it brings open up the possibility of running models on systems better suited to the server room, the edge appliance, or the home.
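
For a rough sense of what quantization does to the numbers themselves, here is a minimal sketch in Python using NumPy. The scale-and-round scheme shown is a generic symmetric int8 quantization for illustration, not the exact method any particular runtime uses:

```python
import numpy as np

# A small tensor of fp32 weights, as a neural network layer might hold them.
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric int8 quantization: map the largest absolute weight to 127 and
# store 1-byte integers plus a single scale factor instead of 32-bit floats.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)

# At inference time the integers are rescaled back to approximate floats.
dequantized = quantized.astype(np.float32) * scale

print("max error:", np.abs(weights - dequantized).max())
print("storage: %d bytes -> %d bytes" % (weights.nbytes, quantized.nbytes))
```

The error per weight is tiny, but the storage drops to a quarter, which is exactly the trade the local-LLM crowd is making.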

What does all this look like in practice? AI infrastructure company Modal says that at half-precision (16 bits), 2 GB of VRAM usage per billion parameters is a reasonable bet. You can serve more parameters by either increasing your VRAM (an Nvidia RTX 5090 GPU has 32 GB VRAM), or by increasing your quantization (halving, or even quartering, the precision of the model). Or both.
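
Those rules of thumb are easy to turn into a back-of-the-envelope calculator. A minimal sketch using the 2 GB-per-billion-parameters figure at 16 bits, ignoring the extra memory the context window and runtime overhead consume:

```python
def vram_estimate_gb(params_billion: float, bits: int = 16) -> float:
    """Rough VRAM needed to hold the weights alone (2 GB per billion params at 16-bit)."""
    bytes_per_param = bits / 8
    return params_billion * 1e9 * bytes_per_param / 1e9  # gigabytes

for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{vram_estimate_gb(70, bits):.0f} GB")
# 16-bit: ~140 GB, 8-bit: ~70 GB, 4-bit: ~35 GB -- which is why quantization matters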

“The real sweet spot? Previous-generation enterprise hardware like used Quadro RTX cards often beat new consumer GPUs on VRAM per dollar,” says Ramon Perez, product engineer and lead at Jan. “But don’t sleep on M2 MacBook Pros as their 24 GB unified memory runs 20B+ models surprisingly well.”

Advances in software

Hardware alone is not enough, though. Today, running LLMs on a wide range of equipment is only possible because of developments in the underlying software stack.

“In my opinion the ggml stack (e.g. llama.cpp and whisper.cpp) has made the biggest impact in making local AI possible by a large margin,” says Georgi Gerganov. If you’re doing any client-side inferencing these days, he’s likely responsible for at least some of it. ggml is his low-level library for running machine learning models on different types of hardware.

Gerganov also maintains llama.cpp, which is a bedrock package for running LLMs on hardware with different capabilities. It supports CPUs, but also takes advantage of GPUs if you have them.

Ollama, one of the most popular CLI platforms for running your own LLMs, is a developer layer built atop llama.cpp. It offers single-command installation of over 200 pre-configured LLMs, making it easy for developers to get up and running with local generative AI.
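
In practice, getting a response out of a locally running model takes only a few lines. A minimal sketch against Ollama’s local HTTP API, assuming the daemon is running on its default port and a model such as llama3.2 has already been pulled:

```python
import requests

# Ollama listens on localhost:11434 by default; /api/generate returns the
# full completion as JSON when streaming is disabled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",  # any model you have pulled locally
        "prompt": "Explain quantization in one sentence.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```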

It isn’t just the low-level part of the stack that has evolved. For many, local LLM projects begin with consumer-friendly environments like Jan and LM Studio. These make it easy for people to take open source AI and package it into a format that anyone can use. They abstract away things like Nvidia’s CUDA library and low-level dependencies.

This means users don’t have to be developers anymore, says Burowski. “We think that many of our users are not engineers at all. Many folks who come to our Discord server or reach out via email have no programming background. Many lawyers, teachers, folks in finance, and many other industries make use of this technology.”

A model for everybody

There seems to be a model for everyone, based on their use case and hardware capabilities. General models like Llama and Mistral offer various parameter counts from small to large. Google Gemma 2 scales down to two billion parameters for on-device work.

“As smaller LLMs become more effective and as edge compute becomes more efficient, smaller organizations can explore open source which offers GPT-OSS, Qwen, Gemma, and even our own Jan Models, which offers increasingly competitive performance,” says Jan’s Can Kartal.

There are also models specialized for particular tasks. For coding, there are Qwen 2.5 Coder 7B and DeepSeek Coder V2.

“I use local code assistance for completions and questions on a daily basis,” says Gerganov.

Some models can get very specific. For example, we hear that storytelling LLMs like Mythomax are good for roleplaying games (+10 XP when you install it).

Are local LLMs good enough?

The question is whether all these models are better than the heavyweight ones running in the cloud, or whether they need to be.

Andriy Mulyar, founder of AI company Nomic, started out trying to make local AI models. His company developed an open source model, GPT4All, designed to run locally. But he got nowhere trying to sell services based on that to potential customers.

“For personal and hobbyist use cases, it’s great. You can get value. You can write your email. You can demonstrate coding something,” he says of local LLMs. But for him, that’s where it ends.

“Ultimately, if you want to go in and do a serious business task with these models, they’re not of high enough quality because the actual amount of knowledge you can bake into a 20-billion-parameter model is not sufficient for the needs of a general business or general enterprise.”

Instead, Nomic uses OpenAI with a zero-retention agreement, adding Nomic’s own services to interpret the specialized documents used in the engineering and construction sectors that he targets.

Sizing up the situation

There are two factors that keep cloud models ahead of local ones. The first is size.

“Larger models will always be more generally intelligent,” agrees Perez. “Smaller models tend to specialize and adapt faster to your evolving needs through fine tuning and reinforcement learning. Most users and teams don’t need a 500-billion-parameter model to remember every detail of WW2.”

Retrieval-augmented generation (RAG) is another helpful tool here. Those with particular interests who build their own knowledge bases for LLMs to draw on can produce impressive results in narrowly defined areas. That includes everything from answering questions about War and Peace for your university class through to expert advice garnered from collating technical manuals.
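
Stripped to its essentials, RAG is retrieval bolted onto prompt assembly. A minimal sketch using TF-IDF from scikit-learn as the retriever; real setups typically use an embedding model and a vector store, and the chunks here are stand-ins for your own knowledge base:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Your knowledge base, pre-split into chunks (placeholders here).
chunks = [
    "Pierre Bezukhov inherits his father's fortune early in War and Peace.",
    "The novel follows the French invasion of Russia in 1812.",
    "Natasha Rostova's engagement to Prince Andrei collapses.",
]

question = "What historical event does War and Peace cover?"

# Retrieve: rank chunks by similarity to the question.
vectorizer = TfidfVectorizer()
chunk_vecs = vectorizer.fit_transform(chunks)
question_vec = vectorizer.transform([question])
scores = cosine_similarity(question_vec, chunk_vecs).ravel()
top = [chunks[i] for i in scores.argsort()[::-1][:2]]

# Augment: build a prompt that grounds the local model in what was retrieved.
prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQ: {question}\nA:"
print(prompt)  # hand this to your local model via Ollama, llama.cpp, etc.
```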

There’s also a lot you can do with multi-agent architectures, where you swap in different centralized models to handle particular tasks, whether that’s summarizing legal documents, handling transcription, coaching you in Second World War history, or being dungeon master for your latest Cthulhu adventure.

Frameworks like Langchain and CrewAI are available for those who want to get into this LLM orchestration, allowing people to piece together agents with different specialist functionality.
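
The core idea behind that orchestration is routing: pick the right model for each task instead of one model for everything. A minimal sketch of the routing idea in plain Python rather than the LangChain or CrewAI APIs, where the model names are merely examples of what you might have installed locally:

```python
import requests

# Map task types to the local specialist models suited to them (example names).
SPECIALISTS = {
    "code": "qwen2.5-coder:7b",
    "summarize": "llama3.2",
    "roleplay": "mythomax",
}

def ask(task: str, prompt: str) -> str:
    """Route the prompt to the specialist model for this task via a local Ollama daemon."""
    model = SPECIALISTS.get(task, "llama3.2")  # fall back to a generalist
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

print(ask("code", "Write a Python function that reverses a string."))
```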

A fragile lead

The second factor that keeps the foundational cloud model providers ahead is secrecy. The likes of OpenAI guard their flagship models closely to maintain a market lead. DeepSeek’s market disruption last year showed how fragile that is. As a leaked Google memo put it: “We have no moat.”

However, that lead is minimal. “The quality differences are diminishing very quickly,” says Gerganov. “Today, local quality is equal or better than cloud quality a year ago. It will continue to improve.”

In the meantime, you can go a long way by being more intentional with your prompting, and with what you put into a local LLM’s context (the model’s working per-chat memory, which is limited). “Models like GPT or Claude can handle very messy context and semi-clear instructions,” says Burowski. “Local models need more careful steering.”
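
Managing that limited context is largely bookkeeping. A minimal sketch that trims a chat history to fit a model’s context window, using the common rough estimate of about four characters per token; a real runtime would use the model’s own tokenizer:

```python
def trim_history(messages: list[str], context_tokens: int = 4096) -> list[str]:
    """Keep the most recent messages that fit the token budget (rough estimate)."""
    budget = context_tokens
    kept = []
    for msg in reversed(messages):       # walk from the newest message backwards
        cost = max(1, len(msg) // 4)     # ~4 characters per token, very rough
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))          # restore chronological order

history = ["System: be concise."] + [f"Turn {i}: ..." for i in range(1000)]
print(len(trim_history(history)), "messages fit in the window")
```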

Your next step in the local AI journey

Whether you’re a law firm looking for a private system to manage sensitive work or a hobbyist trying to build a personal knowledge graph, your locally run LLM journey should begin by matching your ambitions to reality.

Start with clear use cases where your privacy, cost, and performance needs justify local deployment. Select the appropriate tooling for your level of technical expertise, and use a general model that matches your hardware profile.

When the dust settles and we’re through the hype cycle, modern AI will still represent a new era in computing. The more you’re willing to experiment with everything from prompting expertise to LLMs hosted next to your desk, the closer you’ll get to controlling it. ®
