How Google Could Make It Easier To Identify AI Text
We've lived with narrow AI — the type that focuses on specific tasks — for decades now, with phones, computers, and even technology as simple as calculators being a part of our daily lives. Throughout that time, the true threat of artificial intelligence was thought to be the coming arrival of what's known as Artificial General Intelligence, or AGI. This hypothetical form of the technology refers to a machine capable of learning, understanding, and carrying out any intellectual task that a human can, and is very much the source of widespread concerns about robots taking everyone's jobs. Its future arrival was touted as the moment when AI would become self-aware and, as a result, we as human beings would be under immediate threat.
But as narrow AI becomes increasingly ubiquitous, it turns out that this less capable form of the technology was all it took to bring major changes to our way of life. With the rise of chatbots, AI art, and even the U.S. military adopting AI, the technology has become a prevalent part of modern life, causing widespread anxiety among many who fear their jobs are at risk. While we might be far from the potentially apocalyptic scenario envisioned by some writers on the topic, such as Nick Bostrom in his book "Superintelligence: Paths, Dangers, Strategies," this growth of narrow AI has already led to multiple controversies, not least in the sphere of writing.
AI writing might have only become popular in recent years, but there has been no shortage of controversy surrounding its arrival. Now, however, researchers at Google DeepMind have created a method for identifying and marking artificially produced text that would make it a lot easier to navigate the coming AI-dominated world.
AI writing is an increasingly important issue
While AI has already proven it has multiple benefits, such as reading old scientific papers and making new discoveries, or bringing ancient texts to life, there has also been no shortage of controversy surrounding the rise of artificially generated articles. Back in November 2023, Sports Illustrated was found to be publishing what appeared to be stories and photos made using AI. A report by Futurism claimed the company had published the artificially produced articles under bylines of authors who didn't exist, complete with author headshots taken from a website that sells AI-generated portraits. Sports Illustrated would not confirm the claims, but the whole debacle was yet another reminder of how contentious a topic AI-produced writing is, especially since it followed similarly controversial experiments with AI at newspaper giant Gannett and tech website CNET earlier that year.
Needless to say, AI writing remains a hot-button issue as companies look to reap the benefits of large language models and their ability to produce increasingly human-like text. Meanwhile, both writers and the public at large have proven to be, at the very least, uncomfortable with that prospect — not only because companies have used AI writing without marking it as such, but because of the ethical concerns that come with LLMs being trained on masses of pre-existing writing produced by human beings.
Thankfully, the team over at Google is trying to make it easier for readers to identify AI writing with a new "watermark" that would invisibly label text as AI-generated.
Google is developing a watermark for AI text
Researchers at Google DeepMind — a collective of scientists and engineers working to develop safe AI systems — have created what they call a "watermark" which would invisibly mark text as AI-generated. The watermark, called "SynthID-Text," is not the first of its kind but is notable for having been deployed in the real world, with Google rolling out the feature to millions of its Google Gemini users for testing.
Detailed in a research paper published in Nature, the SynthID-Text watermarking technology alters the way large language models work by changing certain words the models select during text generation. These changes are subtle and, according to DeepMind, "non-distortionary," meaning the quality of the text itself is preserved even after SynthID swaps in specific words to imprint the watermark.
DeepMind's watermarking technology is particularly exciting as the Google Gemini trial showed that, on the whole, users rated text altered by SynthID as being of equal quality to text that had not been watermarked by the technology. What's more, DeepMind has made SynthID open, allowing others to adopt the underlying technology and develop their own watermarking function.
How does Google's new AI watermark work?
SynthID is relatively easy to understand if you know how large language models work. LLMs such as ChatGPT and Google's own Gemini create text using a system of "tokens." Each token can represent a single character, a word, or a phrase, and tokens are chosen one at a time to build up a string of text. The model uses the preceding words and each token's probability score to predict the most likely next token. This is perhaps easiest to understand through the fact that when large language models appear to do math, what they are really doing is drawing on patterns learned from their training data to predict the most likely answer to, say, "2+2." When "4" emerges with the highest probability score, that token is chosen as the answer.
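To make that concrete, here's a minimal Python sketch of next-token selection. The vocabulary, prompt, and probability scores are all invented for illustration; a real model weighs tens of thousands of candidate tokens at every step.

```python
import random

def next_token(candidates: dict[str, float]) -> str:
    """Sample the next token in proportion to its probability score."""
    tokens = list(candidates)
    weights = list(candidates.values())
    return random.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores a model might assign after the prompt "2+2=".
scores = {"4": 0.92, "four": 0.04, "5": 0.02, "22": 0.02}
print(next_token(scores))  # almost always "4", the highest-probability token
```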
DeepMind explains this process further on its website, highlighting how an LLM might take a sentence fragment like "My favorite tropical fruits are," and consider tokens such as "mango," "lychee," "papaya," or "durian," each of which is assigned a probability score. This is where SynthID comes in. The watermarking technology uses a cryptographic key to assign its own scores to each candidate token, then compares them in a tournament-style process that whittles down the options until the highest-scoring token wins and is used in the text. Repeated throughout generation, this process leaves a finished passage containing many tokens chosen according to SynthID's adjusted scores, essentially embedding a statistical signature into the text that can later be used to certify its origin.
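As a rough illustration of the idea (a toy sketch, not DeepMind's actual implementation), the Python below uses a keyed hash as a stand-in for SynthID's watermarking functions and runs a knockout tournament over candidate tokens:

```python
import hashlib
import random

SECRET_KEY = b"watermark-key"  # assumption: stands in for SynthID's cryptographic key

def g_score(token: str, context: str, layer: int) -> int:
    """Pseudorandom watermark score (0 or 1) derived from the secret key."""
    digest = hashlib.sha256(SECRET_KEY + context.encode() + token.encode() + bytes([layer]))
    return digest.digest()[0] & 1

def tournament_sample(candidates: dict[str, float], context: str, rounds: int = 3) -> str:
    """Pick the next token via a knockout tournament over sampled candidates."""
    tokens = list(candidates)
    weights = list(candidates.values())
    # Draw 2**rounds candidates from the model's own probability distribution.
    pool = random.choices(tokens, weights=weights, k=2 ** rounds)
    # In each round, the token with the higher watermark score advances.
    for layer in range(rounds):
        pool = [max(pair, key=lambda t: g_score(t, context, layer))
                for pair in zip(pool[::2], pool[1::2])]
    return pool[0]

# Hypothetical probability scores for DeepMind's tropical-fruit example.
scores = {"mango": 0.4, "lychee": 0.3, "papaya": 0.2, "durian": 0.1}
print(tournament_sample(scores, context="My favorite tropical fruits are"))
```

Because the winning tokens skew toward those the secret key scores highly, anyone holding the key can later test a passage for that bias.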
What does SynthID mean for AI-generated content?
The SynthID watermark is a promising development, not only because it is slightly easier to detect than the alternatives, but because applying it doesn't degrade the quality of the output text and has no impact on the speed of text generation. The watermark can be detected in passages as short as three sentences, though longer passages increase its robustness and accuracy. Still, however robust the watermark becomes, there are ways for users to remove it: rewriting or paraphrasing the generated text, whether manually or by feeding it into another LLM, can strip away SynthID's statistical signature.
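To picture the detection side (an extrapolation from the toy scheme above, not DeepMind's published detector), one could re-score every token in a passage with the same secret key and check whether the average sits meaningfully above chance:

```python
import hashlib

SECRET_KEY = b"watermark-key"  # the same shared secret as in the generation sketch

def g_score(token: str, context: str, layer: int) -> int:
    """Keyed pseudorandom watermark score (0 or 1), as in the sketch above."""
    digest = hashlib.sha256(SECRET_KEY + context.encode() + token.encode() + bytes([layer]))
    return digest.digest()[0] & 1

def watermark_strength(tokens: list[str], context: str, rounds: int = 3) -> float:
    """Mean watermark score across a passage: ~0.5 for plain text, higher if watermarked."""
    total = sum(g_score(tok, context, layer)
                for tok in tokens
                for layer in range(rounds))
    return total / (len(tokens) * rounds)
```

Every additional token contributes more scores to that average, which is why longer passages make the statistical signal, and thus detection, more robust.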
Still, with a large Gemini trial backing up its imperceptibility, SynthID is a big step forward in ensuring AI text can be identified as such, though it requires producers of the text to actually deploy the technology in the first place. That could soon be a requirement, however, as mandatory watermarking of AI text is fast becoming a reality. The Chinese government has already made it a legal requirement, and in September 2024, California passed sweeping legislation to regulate AI usage, including a requirement for large AI developers to watermark AI-generated images. It is likely only a matter of time before similar legislation is enacted elsewhere in the United States.
Meanwhile, Google DeepMind has developed SynthID for use in multiple areas, including images, audio, and video. As we all debate whether AI is good or bad, and begin to realize that AGI is much less of a concern than the very real issues posed by today's narrow AI products, such developments at least show that important work is being done to regulate the technology and increase transparency.