Gentoo and NetBSD ban ‘AI’ code, but Debian doesn’t – yet • The Register

Comment The Debian project has decided against joining Gentoo Linux and NetBSD in rejecting program code generated with the assistance of LLM tools, such as GitHub’s Copilot.

The first FOSS OS project to ban code generated by LLM bots was Gentoo, which issued a council policy forbidding code generated with “AI” tools in mid-April. This week, the NetBSD project updated its commit guidelines with a similar edict.

Gentoo’s policy identifies three points that prompted the decision: copyright, quality, and ethical concerns. Of the three, the middle one is the easiest to understand. Code quality is almost self-explanatory: these tools often produce extremely poor quality code. Firstly, no project wants to include bad code. Secondly, nobody really wants contributions from programmers who aren’t able to identify poor-quality code, or who are unable to write better themselves – or at least to improve the bot’s efforts. Either way, this is the most clear-cut of the three reasons.

The other two are trickier, but underlying them is the basis for the NetBSD project’s decision. To understand the significance of these criteria, it’s necessary to understand what these so-called “AI assistants” are and how they work – which is also the reason that the Reg FOSS desk puts “AI” in quotation marks. “AI assistants”, and “generative AI” in general, are not intelligent. The clue here is that the industry itself has invented a new term for its attempts at computer intelligence: the companies behind LLMs now call that AGI, or Artificial General Intelligence.

LLM stands for Large Language Model and these tools work by building statistical models of extremely large “corpuses” of text: vast collections of terabytes of text (and imagery, for the graphical tools). The tools automatically build unimaginably vast models of which words appear with which other words, how close to them, in what sequence and in what permutation. Given enough instances in enough text, plus an example of the desired output, a type of algorithm called a “transformer” can use this model to generate more text. The Financial Times has an excellent free explainer about how the Transformer works.

The result is a statistical model that can extrapolate patterns of words. It’s a very clever idea that extends the autopredict function of fondleslab on-screen keyboards: one that doesn’t just predict the next word that will appear on the tablet, but can generate whole sentences and paragraphs. That is why it’s called “generative”: it generates text, according to the patterns in its model, which were calculated from the text in the corpus that was fed into the model.
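To make the “autopredict” idea concrete, here is a deliberately tiny sketch in Python. It is not a transformer and bears no resemblance to a production LLM: it is a bigram model that only records which word followed which in a one-line corpus, then extends a starting word by sampling from those records. The corpus, the function names, and the fixed random seed are all invented for illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Record, for every word, the words that followed it in the corpus."""
    words = text.split()
    model = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word].append(next_word)
    return model


def generate(model, start_word, max_words, rng):
    """Extend start_word by repeatedly sampling a word seen after the last one."""
    output = [start_word]
    while len(output) < max_words:
        followers = model.get(output[-1])
        if not followers:  # dead end: nothing ever followed this word
            break
        output.append(rng.choice(followers))
    return output


# Hypothetical toy corpus; real corpora run to terabytes.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)

# A seeded generator keeps the sketch repeatable from run to run.
generated = generate(model, "the", 8, random.Random(0))
print(" ".join(generated))
```

Every adjacent pair of words in the output already appeared somewhere in the corpus, so the program can only rearrange what it was fed. Real models track far longer and subtler patterns, but this toy shows in miniature what it means to generate text from the statistics of an input corpus.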

It turns out that, if you can afford to take a whole datacenter of tens of thousands of servers stuffed full of idle programmable GPUs able to repeat simple mathematical operations very fast, and you have the storage and bandwidth to feed this server farm multiple thousands of gigabytes of text, these models can extrapolate very plausible-looking material.

But you need an inconceivably huge amount of material. So, the “corpus” of input contains as much text as the teams creating the models can get either for free or cheaply. The input often contains all of Wikipedia, all of Project Gutenberg, and the contents of social networks and online forums… such as source code foundries.

(This is, incidentally, why social networks, other web-based SaaS providers, and indeed, cloud providers in general have now found themselves in the money: the nature of their businesses means that they are sitting on troves of pure, all-human-generated LLM training data – which is suddenly hugely valuable.)

The original models worked well, but inducing them to emit useful text was tricky. The next big step from this was using a chat bot to prompt the transformer, so that plain natural-language queries could generate useful answers. Natural-language interfaces are nothing new: at the beginning of the 1980s, Infocom’s text adventure games had excellent ones, based on the seminal Zork game. Later, work from SRI International – which was already around long before it begat Apple’s Siri – led to the Q&A database for MS-DOS, to market which Symantec (“semantic”, geddit?!) was founded.

If there was text in the corpus on which the model was trained that closely matches the input query, LLMs can generate good, coherent answers. There are tons of tutorials on complex software out there, which have found their way into LLM bots’ indexes. This is genuinely a great use case for these tools: bespoke custom tutorials and guidance on how to use complicated programs such as Git.

But that doesn’t mean that the bot itself understands Git. It doesn’t: it can just generate text which fits the pattern of the text in lots of Git tutorials. LLM bots can’t think, or reason, or spell, or even count – but if there was text in their input that resembles the answers you want, they can do an extremely good impersonation of thinking and solving problems.

Any tool being marketed as “AI” is not intelligent – because real intelligence is now being called AGI, and nobody can do it yet. Instead, LLM tools are one of several forms of machine learning, which means humans writing software to find patterns in lots of input material. Much machine learning has the goal of creating software that can recognize new, unfamiliar patterns that it hasn’t seen before which resemble patterns it has seen. It’s just in the last few years that a new business has exploded: improvising text (or graphics, or sounds) containing the patterns the bots were trained on.

Not only are LLM bots not intelligent, they aren’t artificial, either. The word “artificial” means made by humans to resemble existing examples in nature; it’s from the same root as “art”, “artful”, “artifice” and “artificer.” In the case of LLMs, the models are built by transformer algorithms, not built by humans. The artifice was constructing the algorithm, not the gigabytes of statistics it created. As a comparison, even if you learn how 3D printers work, design one, buy all the components and build it yourself, the plastic shapes it outputs are not hand-made. Very artful, skilled people write transformer algorithms, which extract patterns from huge amounts of human-written source material and then imitate it. That imitation is not artifice: the art was in writing the tool that can learn to imitate its input data.

“Not intelligent” + “not artificial” = not artificial intelligence. QED.

This is why LLM bots are starting to get some splendid nicknames. We especially like “stochastic parrots” – in other words, they parrot their input data, but re-arranged randomly. As an acronym, we also like “Plagiarized Information Synthesis System.” (We feel that “synthetic information” is a particularly interesting term: it looks like information, but it’s not, really.) We also admire the caustic observation by cURL author Daniel Stenberg:

As this vulture pointed out at the end of last year, we are already drowning in code that is entirely hand-written, by tens of thousands of people working together over the internet for decades. The codebase of any modern general-purpose OS is already far, far too large for any one human to read, digest and modify as a whole. To quote the Debian project:

That’s about 116GB of code. It would fit on a $13 (£10) USB key.

Large language models are, by nature, orders of magnitude bigger than that, and they are not human-readable code. They are vast tables of billions or trillions of numerical values, calculated by huge numbers of machines. They cannot be checked or verified or tweaked: it would take cities full of people working for millennia to read them, let alone understand and amend them.

Extrapolating text, or images, modelled upon human-created input is how LLMs work. All of their output is hallucinated. Since no humans designed the models or can inspect the models, it is not possible to adjust them so that they don’t emit text that is not factual. The only LLM bots that create interesting, useful output are the really big ones: small ones can only find and copy simple patterns, so they can’t produce interesting output.

(This vulture’s cynical suspicion is that very large models only became feasible after the industry realized that blockchains are hugely wasteful, horribly slow, and will never facilitate any online business except crime. What else can you do with those newly idle server farms? Run transformer models on them!)

If someone is typing program code into an online editor – for example, if that editor is an Electron app consisting of JavaScript which is already running in a browser engine – the editor can feed that code into an LLM bot as you type it… which leads to the mother of all autocomplete tools: one which can, on the fly, match patterns in the code you’re writing with some of the millions of patterns in its corpus, and instantly extrapolate individually personalized code that’s close enough to what you’re typing to be directly usable.

The snag with this is that if your code is close enough to some of the training data, the bot will emit matching code. In principle, the output should not be absolutely identical to the original code in the input corpus, but it might be indistinguishable – for example, identical code with different variable names. Getting LLM bots to reveal their training data [PDF] is an established technique now. It’s even an actual game.
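The “identical code with different variable names” case is easy to demonstrate. The sketch below is purely illustrative and is not how any real plagiarism or license scanner works: it parses two Python snippets with the standard `ast` module, replaces every non-builtin identifier with a placeholder numbered by first appearance, and compares the resulting trees. Python stands in here for the C code a real OS project would care about, and the helper names and sample snippets are hypothetical.

```python
import ast
import builtins

# Built-in names keep their identity: len(x) and sum(x) must not compare equal.
_BUILTINS = set(dir(builtins))


class _Renamer(ast.NodeTransformer):
    """Replace identifiers with placeholders numbered by first appearance."""

    def __init__(self):
        self.mapping = {}

    def _placeholder(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"id{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._placeholder(node.name)
        self.generic_visit(node)  # carry on into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node):
        if node.id not in _BUILTINS:
            node.id = self._placeholder(node.id)
        return node


def normalized(source):
    """Dump the syntax tree of source with all local names anonymized."""
    return ast.dump(_Renamer().visit(ast.parse(source)))


def same_modulo_names(a, b):
    """True if the snippets differ only in their choice of identifiers."""
    return normalized(a) == normalized(b)


original = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

renamed = """
def accumulate(items):
    acc = 0
    for x in items:
        acc += x
    return acc
"""

different = """
def total(values):
    result = 1
    for v in values:
        result *= v
    return result
"""

print(same_modulo_names(original, renamed))    # renamed copy of the same code
print(same_modulo_names(original, different))  # genuinely different logic
```

A plain text diff would report the first pair as entirely different, which is exactly why this kind of copying is hard to spot by eye. The check is crude: it ignores reordered statements and reformulated logic, so a negative result proves nothing about provenance.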

For an open source software project, that means that if the training data contained, for instance, C code for functionality common to many operating systems, then LLM-powered programming assistants will generate code that is extremely similar to the code that was in their corpus. If the code is close enough to be recognizable to a skilled programmer – which up to a point means not the sort of programmer to use such LLM-powered tools – then there is the risk of license violations. Code that the bot picked up from a different project could get incorporated into other projects, even though no human knowingly copied anything.

This is at the heart of the copyright and ethical concerns that Gentoo identified. If the code the LLM “assistants” provide is traceable to other projects, that would open up a Linux distribution to ownership issues. If code is inadvertently copied that contains vulnerabilities, who is responsible? The programmer who contributed the code – even if they didn’t write it themselves? The original author, who never contributed that code or even knew that a bot was parroting it?

For NetBSD, all this applies and more, because of licensing issues. While the online code forges are heavily used by Linux developers, meaning that they’re full of GPL code, NetBSD is not GPL: it’s BSD licensed. Accidentally incorporating GPL code into a BSD codebase is a problem: it would mean either relicensing existing code, or totally replacing it – neither of which they have the manpower to do.

If you consider that these are not significant risks, we would point out that GitHub’s owner Microsoft isn’t feeding any of its own proprietary OSes’ source code into its LLM training corpuses.

LLM bots are a wholly remarkable new type of tool, and absolutely not useless toys – although in the tradition of recent developments in IT, they are enormously wasteful and consume vast amounts of computing power, electricity, and cooling. Like all blockchain efforts before them, they are environmentally catastrophic. That’s not going away, and anyone who tries to tell you that it is – for instance, by driving the development of more efficient technology – is trying to sell you something. LLMs are already driving the development of new processors with circuitry devoted to running LLM models, which renders older processors obsolete and thus consigns them to landfill, a different kind of ecological disaster.

Creative formulation of LLM bot prompts is itself a form of programming, which is rapidly increasing in both importance and remuneration. (Previously, we thought that moving programmer effort to interpreted languages running in JIT-compiled bytecode engines was egregiously inefficient. How woefully we underestimated human ingenuity in finding new ways to squander compute resources on epic new scales.) As we mentioned earlier this month, server farms don’t complain about where you want them to work, so every vendor is pouring money into this area, in the hope of eliminating those expensive, difficult humans.

It’s also significant, but rarely mentioned, that generative AI is almost no help in handling real-world situations. The Irish Sea wing of Vulture Towers is entirely free of Amazon Alexa, Google Assistant, Apple Siri, or any other such paid privacy-intrusion systems, but nonetheless, many people are inexplicably fond of paying for giant corporations to listen in to avoid the arduous labor of turning on lights or playing music. Generative AI has made few inroads here, because for almost everyone, poor and limited understanding is far preferable to hallucinatory misunderstandings and randomly invented guesses. It isn’t going to help guide self-driving cars, either, unless we want innovative massively parallel new categories of trolley problem.

Last month, Linux Weekly News covered Gentoo’s discussions in depth, and we recommend that if you want to know more. More recently it has examined Debian’s deliberations. This issue is going to grow and grow, perhaps even more rapidly than the training databases themselves. ®
