Come, friendly robots, and copy my inimitable style
Or why I'd like the LLMs to read my stuff and take over the tedious work
Is a robot reading this? Are you a robot?[1] For some people, this question is a fundamental one, because the last thing that they want is for a robot to be reading their hand-crafted, artisanal words on a screen. No, what they want to be sure about is that only humans are absorbing their words with the intention of learning from them, considering them, and perhaps regurgitating them, or referencing them, or quoting them at some point in the future. But under no circumstances should robots be allowed to do that. Apart from search engine robots, of course. They’re OK.
This, roughly speaking, is the framework of the latest concern about AI. Or, more precisely, about large language models (LLMs), which basically trawl as much of the internet as possible and regurgitate approximations of it on request. On Thursday morning, I received an email from Medium titled “Why we’re blocking AI companies from training their bots on your words”, which included this content:
Since ChatGPT launched last November, you've probably read (or written) at least a few stories about AI. Many of you have analyzed how these new chatbots work and why they matter. Writer Zulie Rane even paid a professional to edit an AI-written article—and discovered that while ChatGPT is pretty confident and verbose, it's not always factually reliable.
But there’s another issue here for writers. Generative AI companies train their chatbots on text and images found online—including Medium stories. In other words: Your writing is helping power popular new technologies, but you haven't been compensated or credited. That doesn't feel right to us, and we're guessing it may not feel right to you either.
If you’ve been following the LLM/chatbot discussion at all, then none of the above is in the least surprising. ChatGPT isn’t factual? Amazing! Chatbots are trained on stuff from online? I’m shocked, shocked I tell you.
That’s because this is exactly how humans learn. We walk around an art gallery and look at the pictures. Aspiring artists try to reproduce famous works of art. For myself, I taught myself how to write more fluidly by literally finding journalists whose work I admired, typing their pieces out, and breaking them down to see how they structured their sentences and paragraphs, chose their similes and metaphors, and where they injected their bits of journalism. Then I compared those against less good work to see how the structure differed and why the weaker similes and metaphors fell flat, all to try to find where the magic resided. (You can form your own opinion of whether that was time well spent. It was principally for sports journalism, which is where I started.)
How is any of what I did different from what ChatGPT or Google Bard do, except that theirs happens inside a black box? I’m not sure that I see it.
And yet in the Washington Post there’s an article by William Cohan, “a best-selling author and a founding partner of Puck News”, which begins:
The other day someone sent me the searchable database published by Atlantic magazine of more than 191,000 e-books that have been used to train the generative AI systems being developed by Meta, Bloomberg and others. It turns out that four of my seven books are in the data set, called Books3. Whoa.
Not only did I not give permission for my books to be used to generate AI products, but I also wasn’t even consulted about it. I had no idea this was happening.
I’ve had a look in the Atlantic database too, and I found that one of the three books I’ve written is in there. And I can tell you, I’m profoundly disappointed. Only one? And it’s not even the latest one! What I was really looking forward to was discovering that mine were the only books in the database, and that every chatbot in the world would relay its answers in my eminently reasonable, free-flowing style, learnt from dismantling Sports Illustrated articles written by Frank Deford.
But Cohan isn’t finished.
This is wholly unacceptable behavior. Our books are copyrighted material, not free fodder for wealthy companies to use as they see fit, without permission or compensation. Many, many hours of serious research, creative angst and plain old hard work go into writing and publishing a book, and few writers are compensated like professional athletes, Hollywood actors or Wall Street investment bankers. Stealing our intellectual property hurts.
Well, sure, Mr Cohan, but I have to point out: there are humans out there reading your books and getting ideas from them. Or at least, one certainly hopes there are, because otherwise all those many hours of serious research etc have really gone to waste. As writers, if we don’t influence what people think, what’s the point? Furthermore, if we get a chance to influence what robots write, shouldn’t we leap at it?
Cohan quotes Scott Galloway, who pointed out recently that the market value of companies deploying or part-owning LLMs (Microsoft, Google, Facebook) has increased dramatically in the past few months:
[Galloway] said recently on the Pivot podcast that 70 percent of Nasdaq’s gains in the first half of 2023 came from seven technology companies, most of which had AI product offerings. “So the question is, if AI is literally sucking the oxygen outta the room and all the market cap, it’s like, well what is driving that value?”
The answer, obviously, is the hundreds of thousands of content creators who are taking the time — often over many months and years — to report and to write and to think up the content that Meta, Google and Microsoft are scraping up into the LLMs without asking permission or paying proper compensation.
This, again, is shortsighted. First, there’s the question of how much of that increase in value is actually down to LLMs. Clearly not all of it, because those companies aren’t all deploying LLMs: Apple grew by 49% in the first half of 2023, faster than the Nasdaq (of which it is part), yet isn’t talking about LLMs, so that 70% figure is already an overestimate of AI’s contribution. And some of the increase is simply the market’s expected growth, since stock values are the market’s guess at a company’s future profits.
Second, more importantly, those content creators are doing that either for money or because they just want to. People like making their voice heard. Blogs and then social media are testament to that fact. This Substack is free; I don’t ask about the business model that lets Substack, the company, offer it in that way without stuffing ads all over the place. (Obviously, the freemium model presently means that a few paid-for and very successful blogs here pay for freeloaders like me. But maybe like Netflix, which used to be ad-free but has moved to an ad-tier model, Substack will shift in time.)
So what’s the “proper compensation” for having your content loaded into an LLM? The implication seems to be that having your book injected into the maw of the machine is equivalent to having it read by a person; so perhaps a single book royalty, or a single library lending payment? That seems reasonable enough. Royalties on 191,000 books might sound eyewatering, but at a few dollars each the total would come to well under a million dollars (at, say, $3 a book, about $573,000), which these companies could pay without noticing.
But what about the zillions of web pages—the Reddit forums, blogposts, unpaywalled news articles—that have also been swallowed up? Do those deserve any payment?
As far as Medium is concerned, perhaps they do. The Medium CEO wrote a blogpost in September (which prompted the email I received a few weeks later):
We let Google spider our site — in fact, we optimize for this. But the only reason we do this is because Google sends a lot of readers to your stories. For the vast majority of writers, this is a fair exchange of value…
Then he offers two possibilities for writers:
How would you feel about a deal that offered you the ability to opt out of allowing AI companies to train on your writing, but offered a 10% boost in earnings on Medium to people who opted-in? Would you opt out? Opt in? Leave Medium?
Would you allow a search engine to train on your writing in order to generate AI summarized answers that credited you? Google Bard seems headed in this direction. Would your answer change if the amount of traffic the search engine sent you dropped in half?
My answer:
I’ll take the 10%. Because I’m pretty certain that if I’m not telling these systems how to talk proper, someone else will. It might not make a big difference—it might not make any difference at all, like trying to fill the ocean with a cup—but 10% is real money.
If I get credit, sure, even if the traffic drops—because, again, the internet is big and if I don’t then someone else will. (The internet is the place where your margin is, endlessly, someone else’s opportunity.) Getting your name known is about the only thing you have left in the attention economy.
Now we let machines do it
Let’s look at the potential aftermath of letting machines gulp down all that content. In that Medium blogpost, the CEO writes:
You would hope that these AI innovations would lead to a better Internet. For example, AI as a writing aid has the potential to empower new voices who may have previously gone unheard for lack of writing ability.
But in practice, the overwhelming experience of our readers, editors, and curators is that they are seeing more AI-generated writing that amounts to nothing more than spam: plausible sentences that are unreliable in substance and fact.
Actually, having laboured in the salt mines of content creation, writing and editing news and features for decades, I personally welcome something that will do the tedious writing. In other words…
They’re already writing some of the tedious news content, and they’re already helping programmers to write code, so I don’t particularly have a problem with AI-generated content. The reality is that it’s either going to be as good as a human’s, in which case it’s worth reading in its own right, or it isn’t, in which case we’ll be able to tell the difference.
I think that humans will be able to rise above the machine-generated content. The magic that we contain is choice: we can pick and choose among all the words and metaphors that we might use, and discern which is the right one to go for. Let machines read and regurgitate the press releases (written by AI, fed by humans). Humans can move up the scale of complication: doing the interviews, analysing what the trends buried in the numbers mean, writing the overview that explains (to other humans, and LLMs) what the hell it’s all about.
If you don’t agree, consider three pieces of content written this week, all three free to read. The second and third feed off the first: Marc Andreessen’s latest missive on What We Ought To Do, which he titled The Techno-Optimist Manifesto.
Andreessen’s piece is 5,221 words, consisting of (if I counted correctly) 418 sentences, an average of just over 12 words each. LLMs don’t write like that, and certainly can’t be pushed into that style without a great deal of effort, and the chances that someone who’s prompting a chatbot will have the bright idea of forcing it to write 12-word sentences are next to nothing. It’s something only a human could write.
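(For the pedants: the arithmetic holds up. Here’s a trivial Python sketch of that back-of-envelope count; the naive sentence splitter is my own assumption, not how the manifesto was actually tallied.)

```python
import re

def avg_sentence_length(text: str) -> float:
    """Naive words-per-sentence: whitespace-split words over .!? sentence breaks."""
    words = len(text.split())
    sentences = len(re.findall(r"[.!?]+", text)) or 1  # guard against zero sentences
    return words / sentences

# Andreessen's manifesto, by my count: 5,221 words across 418 sentences.
print(round(5221 / 418, 1))  # 12.5 -- just over 12 words per sentence
```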
Next, Dave Karpf—an academic who has been doing a deep dive into the archives of Wired magazine, the fountainhead of techno-optimism—put Andreessen’s missive into context:
Reading it felt a lot like walking past preteens wearing Nirvana t-shirts or seeing an ad for the Frasier reboot:
“Oh, have we decided it’s 1993 again? I guess I didn’t get the memo.”
There’s plenty more, of course (his Substack is well worth reading in its own right). But would any machine have thought of the Nirvana and Frasier references, or to reach back to the time when Andreessen was the kid with the big idea at Netscape? I doubt it. This is where humans shine: in our ability to find the connections.
And then, finally, there’s the artist and coder Ben Grosser’s take: a PDF in which he’s gone through Andreessen’s manifesto and filleted it to the essentials. It’s the same length. But also shorter. And I absolutely guarantee you that no machine would ever think to do this.
In that sense, I’m a techno-optimist too: optimistic that we’re still going to be the ones who have the new ideas, or who pick out the ideas that are worth pursuing. Look, you got this far. How are you not persuaded? Only a robot wouldn’t be.
• You can buy Social Warming in paperback, hardback or ebook via Oneworld Publications, or order it through your friendly local bookstore. Or listen to me read it on Audible.
• You could also sign up for The Overspill, a daily list of links with short extracts and brief commentary on things I find interesting in tech, science, medicine, politics and any other topic that takes my fancy.
• I’m part of a proposed Class Representative for a lawsuit against Google in the UK on behalf of publishers. If you sold open display ads in the UK after 2014, you might be a member of the class. Read more at Googleadclaim.co.uk. (Or see the press release.)
• I’m taking a one-week break, so the next edition will be on November 2. You can leave a comment here, or in the Substack chat, or Substack Notes, or write it in a letter and put it in a bottle so that The Police write a song about it after it falls through a wormhole and goes back in time.
[1] The sort of question that would occupy an entire Philip K Dick novel, or just a short story, and would lead us to profound questions about the nature of reality. Perhaps another time?