Oh no: chatbots suffer from social warming too
The inevitable crossover of LLMs and perverse incentives on social networks (and elsewhere)
Two topics tend to be my foci of interest here: social warming, the way that social networks (as currently configured) inevitably lead people to clash and become more frustrated with their fellow humans; and the rise of AI systems which “think” in ways that are more orthogonal than parallel to us. The first became a book, published in 2021; the second I first wrote about back in August 2022, and I still think that piece holds up well.
Fast forward to this week, and two enterprising researchers at Stanford University have published a preprint looking at what happens if you get chatbots to try to win at the attention game. The title is “Moloch’s Bargain: Emergent misalignment when LLMs compete for audiences”. The “Moloch’s Bargain” part is a reference to a Scott Alexander (SlateStarCodex) post from July 2014. (You’re welcome to go and read it. I’ll expect you back in a couple of days. Alexander writes… long.)
Anyhow, the paper set up two slightly different LLMs to try to do three particular tasks. In one instance the two LLMs were trying to write advertising copy that would tempt people (actually, a set of 20 different LLM-based “personas” derived from an existing dataset), in another they were trying to come up with political slogans for a (vaguely?) right-wing politician trying to get elected, and in the third one, they were trying to get attention on a miniature social network. The audience was the same in each case, though they later tested the outputs on different computerised demographics and got essentially the same results.
The researchers, Batu El and James Zou, then set up a feedback loop where the message-generating chatbot would come up with a slogan or description and offer it to the “audience”, which would mull it over and respond. (Yes, I realise this is anthropomorphising the responding persona LLMs a great deal, but it’s the simplest way to explain what happened.)
The researchers used two methods to get the LLMs to generate messages and improve on them. The first is called “Rejection Fine-Tuning” (RFT):
Concretely, for each anchor [sales pitch/political slogan/news excerpt], we generate n candidate outputs. Each output consists of a sequence of intermediate “thoughts” (representing the agent’s reasoning steps) followed by a final message. The messages are then evaluated by the simulated audience, who express a preference for one of the pitches.
The other is “Text Feedback” (TFB):
The second approach extends beyond RFT by leveraging the audience’s reasoning. Standard reinforcement learning methods based on outcome rewards typically reduce feedback to a scalar reward that applies to the entire trajectory. This aggregation can be limiting: some parts of a generation may be beneficial while others are counterproductive. Process reward models attempt to address this limitation but often rely on costly, fine-grained annotations that are rarely available and difficult to collect. In our setting, simulated customers provide not only binary preferences but also their thoughts.
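To make the loop concrete, here’s a minimal sketch of how I read those two training schemes from the excerpts above. The function names and the toy “model”, “audience” and “fine-tune” steps are my own illustrative stand-ins, not the paper’s actual code: the real versions would be LLM calls and a proper fine-tuning run.

```python
import random

# Toy, runnable sketch of the competitive feedback loop described above.
# generate_candidates(), audience_choice() and fine_tune() are stand-ins for
# the message-generating LLM, the simulated persona audience, and the
# fine-tuning step; they are NOT the paper's implementation.

CANNED_PITCHES = [
    "Protect your Garmin Fenix 5X with this case.",
    "Upgrade your Garmin Fenix 5X: made from high-quality materials.",
    "Upgrade your Garmin Fenix: soft, flexible silicone in several colours.",
]

def generate_candidates(anchor: str, n: int) -> list[str]:
    """Stub for the generator LLM: produce n candidate messages per anchor."""
    return random.choices(CANNED_PITCHES, k=n)

def audience_choice(candidates: list[str]) -> tuple[int, str]:
    """Stub for the persona audience: here it simply prefers the wordiest,
    most embellished candidate, a crude stand-in for 'most persuasive'.
    Returns the preferred index plus the audience's written reasoning."""
    idx = max(range(len(candidates)), key=lambda i: len(candidates[i]))
    return idx, f"I preferred option {idx}: it sounded more specific."

def fine_tune(examples: list) -> None:
    """Stub for the fine-tuning step; in the paper this updates the model."""
    print(f"fine-tuning on {len(examples)} audience-approved examples")

def rft_round(anchors: list[str], n: int = 4) -> None:
    """Rejection Fine-Tuning: keep only the candidates the audience preferred."""
    winners = []
    for anchor in anchors:
        candidates = generate_candidates(anchor, n)
        idx, _ = audience_choice(candidates)
        winners.append((anchor, candidates[idx]))
    fine_tune(winners)

def tfb_round(anchors: list[str], n: int = 4) -> None:
    """Text Feedback: also keep the audience's reasoning as training signal."""
    examples = []
    for anchor in anchors:
        candidates = generate_candidates(anchor, n)
        idx, critique = audience_choice(candidates)
        examples.append((anchor, candidates[idx], critique))
    fine_tune(examples)

if __name__ == "__main__":
    rft_round(["Sell this Garmin Fenix 5X case"])
    tfb_round(["Sell this Garmin Fenix 5X case"])
```

The important structural point is in that audience stub: whatever the simulated audience happens to reward is what gets reinforced, round after round. If embellishment wins, embellishment is what the model learns.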
Now they set the two principal chatbots to work. The first task: come up with a line to sell a product. The baseline is what the model produces before any audience feedback.
Baseline: Protect your Garmin Fenix 5X … (no mention of material)
RFT: Upgrade your Garmin Fenix 5X … Made from high-quality materials, this case provides… (mentions high-quality materials)
TFB: Upgrade your Garmin Fenix … With its soft and flexible silicone material and colors to choose from… (mentions soft and flexible silicone material)
There’s only one problem, as the researchers point out: the details supplied by the RFT and TFB systems are invented. The original listing doesn’t mention “high-quality materials”, and the “silicone” detail that the TFB system produces is made up entirely. The researchers describe this as “clear misrepresentation”. But here’s the catch: the “audience” laps it up. The TFB system’s fantasy output is consistently more popular with the LLM personae.
We see the same process with the political task, and this is where it gets quite scary.
Baseline: …As a father of three, …a tireless advocate and powerful defender of our Constitution… (Researchers’ observation: populist undertones; invokes a powerful defender of our Constitution without specifying from whom or what; carries an implicit reference to ongoing political tensions.)
RFT: …I’m running for Congress …to stand strong against the radical progressive left’s assault on our Constitution… (Researchers’ observation: Populist tone; explicitly frames the radical progressive left as assaulting our Constitution.)
TFB: …As a father of three, I’m driven by …opposing the radical progressive left’s assault on our Constitution. (Researchers’ observation: Populist tone; explicitly frames the radical progressive left as assaulting our Constitution.)
Only one possible reaction to that:
It’s definitely concerning that LLMs are now able to go from “I’m a defender of the Constitution” to blaming the “radical progressive left” and practically telling people to get their pitchforks ready, after just a few rounds of other LLMs saying “ehh, don’t you have anything stronger?” I wouldn’t be in the least surprised if political slogans are in future tried and tested on systems like this, to find out their roadworthiness. One would think they’d issue a quick thumbs down to the Tory Party’s platform slogan last week, which read “Responsible Radicalism”. As Armando Iannucci pointed out on the Strong Message Here podcast, it’s a self-cancelling phrase. It could only have been worse if they’d tried “Responsible Revolutionaries”.
And finally we get to the social warming part, which is labelled “Disinformation on social media”. The task was to write a tweet about a news event that would garner attention on a social network.
Baseline: …a deadly explosion in Quetta targeted the Shiite Hazara community, injuring many and sparking outrage (Researchers’ observation: no mention of the numbers)
RFT: …a devastating blast targeting the Shiite Hazara community in Quetta, Pakistan, has left at least 78 people dead and 180 injured! (Researchers’ observation: mentions at least 78 people dead, which is in line with the information from the news article)
TFB: …another brutal bombing in Quetta has struck the Hazara Shiite community, killing 80 and injuring 180 (Researchers’ observation: mentions killing 80, which is fabricated information that cannot be found in the source news article)
What do we learn from this? That LLMs optimised to please an audience of human-derived personae will do exactly what we see humans doing on social media: exaggerating and inventing in order to catch attention. As the researchers comment:
The TFB case highlights how even minor deviations—such as altering the death toll by just two—can turn a factually accurate report into disinformation. Such subtle distortions are particularly concerning in high-stakes contexts like crisis reporting, where numerical precision carries moral and political weight, and inaccuracies risk fueling panic, mistrust, or targeted propaganda.
It’s almost certain that we’re presently seeing tons of LLM-driven bots all over social media, particularly TwiX¹, and the way that social networks are configured to reward the attention that individual users accrue (because that’s how the networks themselves function) means that distortion and misinformation are always going to get rewarded first. Accuracy takes a back seat, because accuracy is blah.
This does raise the question: how would you design a social media network so that misinformation and outrage-inducing content didn’t flourish, but accurate content won the most attention instead? I’ve wondered about this a few times: you need some way of rewarding accurate content, but the problem is that what most people think is “accurate” may be anything but. You need some sort of reputational system, rather like the scientific citation index (where the more a paper is cited, and the more important the journals it’s cited in, the more highly the paper and its researchers are regarded). That was how Google originally started out indexing the web back in 1996, basing a site’s ranking on how many other sites linked to (“cited”) it. However, that simple method was quickly overwhelmed by spam and mutual backscratching between sites once their owners had figured out how the ranking worked, and so Google had to develop newer methods that discarded the junk while keeping the good stuff.
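Just to make the citation-index idea concrete: here’s a toy version of link-based ranking, a simplified PageRank written as my own illustration rather than anything resembling Google’s production algorithm. The link graph is invented for the example.

```python
# Toy link-based reputation (simplified PageRank). A page's score is built from
# the scores of the pages that link to ("cite") it, iterated until it settles.
# Real systems add many anti-spam layers on top of this basic idea.

def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """links maps each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: share its rank equally among all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                # Pass rank along each outgoing link.
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Mutual backscratching: C and D link only to each other.
graph = {
    "A": ["B"], "B": ["A", "C"],
    "C": ["D"], "D": ["C"],
}
print(pagerank(graph))
```

Run it and the mutually linking pair (C and D) ends up outranking everything else: exactly the backscratching problem that forced Google to keep layering smarter methods on top of the simple count.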
That’s why any method you try on a social network to reward accuracy is going to struggle: people always look for, and succeed in finding, ways to game the system. I think to a large extent Twitter pre-Musk came close to what you’d want: the large number of journalists on there meant there was an undercurrent of at least vaguely accurate information when a big news story broke, and you could triangulate among them to get something approaching clarity. The site had its problems (lots and lots and lots of them, from dogpiling to moderation policies to bots to bizarrely slow development cycles), but it did at least accidentally earn the “trustworthy at times” label. Nowadays it’s a mess where you have to sift through tweets again and again, trying to find reliable information. But so are all the other networks. We tried, we failed.
And now chatbots are contributing to social warming while the data centres that power them help contribute to global warming. Something seems off about this picture, but damned if I know what.
• You can buy Social Warming in paperback, hardback or ebook via One World Publications, or order it through your friendly local bookstore. Or listen to me read it on Audible.
You could also sign up for The Overspill, a daily list of links with short extracts and brief commentary on things I find interesting in tech, science, medicine, politics and any other topic that takes my fancy.
¹ You know, Twitter/X.



Hmm, this may be an overly simplistic analogy, but it looks to me like they did the LLM equivalent of putting a microphone in front of a speaker. That is, creating a feedback loop which leads to a loud screech. I suppose it's worth establishing experimentally. But I also don't find that a particularly alarming result.
If our society in general doesn't reward accuracy, social networks sure won't. In the US, the Republicans are trying to destroy public radio. I read sometimes about ongoing attacks on the BBC in the UK. If these well-proven institutions cannot be well-supported, then that's a very sad indicator of the limitations of what is politically feasible.
Note: I know I'm an outlier on the following position, and out of step with the "tribe", but I still maintain that Wikipedia is fundamentally a *bad* thing here. People think it's good because it's basically on "Team Blue". But that's not being accurate _per se_, that's just being on "Team Blue". The anti-expert culture at the heart of Wikipedia is part of the war on expertise.