Court ruling suggests AI systems may be in the clear as long as they don’t make exact copies

A California district court has partially dismissed a copyright lawsuit against Microsoft’s GitHub Copilot programming tool and its former underlying language model, OpenAI’s Codex. The ruling could set a precedent for AI tools trained on copyrighted data.

The U.S. District Court for the Northern District of California dismissed significant portions of a 2022 lawsuit filed by the Joseph Saveri Law Firm.

The plaintiffs claimed that GitHub and OpenAI infringed copyrights by allowing Copilot and Codex to reproduce source code without adhering to license terms such as retaining copyright notices and providing attribution.

While GitHub Copilot now uses GPT-4 and Codex has been discontinued, the court’s decision could apply to other AI models with similar training methods and capabilities.

Court finds no clear copyright infringement

The court rejected the plaintiffs’ claim under Section 1202(b) of the Digital Millennium Copyright Act (DMCA), which prohibits removing or altering copyright management information such as copyright notices. In an earlier decision, the court had ruled that this provision applies only if the plaintiffs can show that Copilot makes identical copies of their protected works.

On the revised second claim, the court again found that plaintiffs had not shown that Copilot tended to reproduce copyrighted code identically.

GitHub recently introduced an optional feature allowing users to hide suggestions resembling publicly available code. The plaintiffs argued this demonstrates Copilot’s ability to accurately reproduce copyrighted code.

The court disagreed and dismissed the DMCA claim with prejudice, stating that the existence of such a filter did not make it more likely that Copilot would make an identical copy of the plaintiffs’ works in normal use.

The plaintiffs also referred to a March 2023 study that found that the likelihood of AI systems reproducing their training data verbatim increases with the size of the models. However, this study did not specifically address the plaintiffs’ works or Copilot.

Judge Jon S. Tigar cited the study’s conclusion that Copilot “rarely emits memorized code in benign situations, and most memorization occurs only when the model has been prompted with long code excerpts that are very similar to the training data.”

This decision could set a precedent for AI systems trained on copyrighted data, suggesting that copyright claims may be difficult to sustain as long as AI systems do not regularly make verbatim copies of their training material in normal use.

The ruling may also benefit OpenAI in the copyright lawsuit brought by the New York Times, in which the company accuses the newspaper of using manipulative prompts to generate exact copies of NYT articles in ChatGPT.

While dismissing additional claims for unjust enrichment and unfair competition, the court allowed a claim for breach of open-source license agreements to proceed. The plaintiffs argue that Copilot violates the terms of open-source licenses by reproducing the code without attribution.

According to the court, even if the breached license terms are more akin to copyright “conditions,” case law does not preclude a breach of contract claim.

Programmer and lawyer Matthew Butterick, who is involved in the lawsuit, highlighted concerns about potential violations of open-source licenses, arguing that AI programming tools like Copilot are monetizing open-source work without permission.
