Medium asks AI bot crawlers: Please, please don't scrape bloggers' musings

OpenAI and Google might respect robots.txt but how about the others?


Blogging platform Medium would like organizations to not scrape its articles without permission to train AI models, and has warned that this policy may be difficult to enforce.

CEO Tony Stubblebine on Thursday explained how Medium intends to curb the harvesting of people's written work by developers seeking to build training data sets for neural networks. He said, above all, devs should ask for consent – and offer credit and compensation to writers – before training large language models on people's prose.

Those AI models can end up aping the writers they were trained on, which feels to some like a double injustice: the scribes weren't compensated in the first place, and now models are threatening to take their place as well as income derived from their work.

"To give a blunt summary of the status quo: AI companies have leached value from writers in order to spam internet readers," he wrote in a blog post. "Medium is changing our policy on AI training. The default answer is now: no."

Medium has thus updated its website's robots.txt file to ask OpenAI's web crawler bot GPTBot to not copy content from its pages. Other publishers – such as CNN, Reuters, the Chicago Tribune, and the New York Times – have already done this.
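
As a rough illustration of what such a request looks like, the sketch below feeds a two-line robots.txt rule to Python's standard-library parser. GPTBot is the user agent token named above; the sample rules, the story URL, and the SomeOtherBot name are made up for illustration and are not copied from Medium's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: ask GPTBot to stay away from every path.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://medium.com/some-story"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, may_fetch(SAMPLE_ROBOTS_TXT, agent, url))
```

A well-behaved crawler runs a check along these lines before each request. Nothing in the file itself stops a crawler that skips the check.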

Stubblebine called this a "soft block" on AI: it relies on GPTBot heeding the request in robots.txt to not access Medium's pages and lift the content. But other crawlers can and may ignore it. Medium could wait for those crawlers to provide a way to block them via robots.txt, and update its file accordingly, but there's no guarantee that will happen.

For what it's worth, though, not only does OpenAI support blocking via robots.txt, so too does Google, which also on Thursday detailed how to block its AI training crawlers for its Bard and Vertex generative API services, again via robots.txt. Medium has yet to update its robots.txt to exclude Google's AI training spiders.
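
Covering Google's AI training would mean adding one more rule group to the file. The sketch below generates such groups for a list of crawler tokens. It assumes Google-Extended is the token Google uses for this purpose; the helper name build_ai_block_rules and the token list are illustrative, not anything Medium has published.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens to turn away. GPTBot is OpenAI's crawler; Google-Extended is
# assumed here to be Google's control for Bard and Vertex AI training.
AI_CRAWLER_TOKENS = ["GPTBot", "Google-Extended"]


def build_ai_block_rules(tokens):
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    # Groups are separated by a blank line, as robots.txt parsers expect.
    return "\n".join(groups)


def allowed(rules: str, user_agent: str, url: str) -> bool:
    """Check the generated rules with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_ai_block_rules(AI_CRAWLER_TOKENS)
    print(rules)
    url = "https://medium.com/some-story"  # placeholder URL
    for agent in ("GPTBot", "Google-Extended", "Googlebot"):
        print(agent, allowed(rules, agent, url))
```

Because the rules name specific tokens, an ordinary search crawler such as Googlebot is not affected by them in this sketch.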

Blocking web crawlers at a level lower than robots.txt, such as by IP address or user agent string, will work, too – until the bots get new IP addresses or alter their user agent strings. It's a game of whack-a-mole that may be too tedious to play.
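
To make the whack-a-mole problem concrete, here is a minimal sketch of a server-side deny list, assuming a site checks each request's IP address and user agent string before serving a page. The function name is_blocked and both deny lists are hypothetical, and the IP ranges are reserved documentation addresses rather than any real crawler's.

```python
import ipaddress

# Hypothetical deny lists; a real deployment would have to maintain its own.
BLOCKED_AGENT_TOKENS = ("gptbot",)
BLOCKED_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)  # documentation range


def is_blocked(client_ip: str, user_agent: str) -> bool:
    """Return True if the request matches the IP or user agent deny list."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(token in agent for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    # Caught by IP range, whatever the user agent says.
    print(is_blocked("192.0.2.10", "Mozilla/5.0"))
    # Caught by user agent token, from an unlisted address.
    print(is_blocked("198.51.100.7", "Mozilla/5.0 (compatible; GPTBot/1.0)"))
    # Same unlisted address with a browser-like user agent: slips through.
    print(is_blocked("198.51.100.7", "Mozilla/5.0"))
```

The last case is the weakness: a crawler that moves to a fresh address and presents a generic user agent passes both checks until the lists are updated again.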

"Unfortunately, the robots.txt block is limited in major ways," Stubblebine said. "As far as we can tell, OpenAI is the only company providing a way to block the spider they use to find content to train on. We don't think we can block companies other than OpenAI perfectly."

By that he means that at least OpenAI, and now Google, have promised to observe robots.txt. Other orgs collecting data for machine-learning training might just ignore it.

That all said, regardless of robots.txt protections, Medium has promised to send cease and desist letters to those crawling its pages without permission for articles to train models.

So, effectively: Medium has asked OpenAI's crawler to leave it alone, at least, and the website will take other data-set crawlers to task via legal threats if they don't back off. The website's terms-of-service were updated to forbid the use of spiders and other crawlers to scrape articles without Medium's consent, we're told.

Stubblebine also warned writers on the platform that it's not clear whether copyright law can protect them from companies training models on their work and using those models to produce similar or almost identical material, amid multiple ongoing lawsuits over that very question.

The CEO also reminded Medium users that no one can resell copies of their work on the site without permission. "In the default license on Medium stories, you retain exclusive right to sell your work," Stubblebine wrote.

He went on to say that some AI developers may have done just that: bought or obtained copies of articles and other works scraped off Medium and other parts of the internet by third-party resellers, to then train networks on that content. He dubbed that laundering of people's copyrighted material "an act of incredible audacity."

Stubblebine advised companies looking to crawl web data from Medium to contact the site to discuss credit and compensation among other sticking points. "I'm saying this because our end goal isn't to block the development of AI. We are opting all of Medium out of AI training sets for now. But we fully expect to opt back in when these protocols are established," he added.

Medium proposed that if an AI maker were to offer compensation for scraped text, the blogging biz would give 100 percent of this to its writers.

In July, it also confirmed that although AI-generated posts aren't completely banned, it would not be recommending any text completely written by machines.

"Medium is not a place for fully AI-generated stories, and 100 percent AI-generated stories will not be eligible for distribution beyond the writer's personal network," it stated. ®
