WikiProject iconThis page is within the scope of WikiProject AI Cleanup, a collaborative effort to clean up artificial intelligence-generated content on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

RFC

The following discussion is an archived record of a request for comment. Please do not modify it. No further edits should be made to this discussion. A summary of the conclusions reached follows.
This RfC seeks to establish a policy or guideline on how large language models (LLMs) can be used on Wikipedia. There is no consensus to adopt any wording as either a policy or guideline at this time, with most participating editors falling into one of three groups.
The largest group of editors was in favor of adopting the proposed wording as either a policy or guideline. The smallest group of editors proposed a complete ban of LLM use in mainspace, but with more flexibility for use in other spaces (what came to be called option 1b). The remaining editors opposed the wording proposed in the RfC. The primary objections from those opposing included concerns about mandatory disclosure and about the proposed ability to summarily remove suspect LLM content, including from those who feel there is no accurate way to identify LLM-generated content. Some of those in favor of the proposal did note that we have other policies and guidelines where detection of violations can be difficult, though some of those in favor of the wording overall also opposed summary removal. There does seem to be an implied consensus for "Large language model output, if used on Wikipedia, must be manually checked for accuracy (including references it generates)" among those both favoring and opposing this wording, but this was not stated explicitly enough by enough editors for me to formally find a consensus for it. Nothing in this close should be construed to suggest that current policies and guidelines do not apply to large language models, with a number of editors (especially among those opposing) explicitly noting that current policies and guidelines do apply.
Housekeeping: As an RfC to establish policy, I used no weighting of any arguments when determining this consensus nor did I use any AI other than what is built into GoogleDocs to help with spelling and basic proofreading. Barkeep49 (talk) 20:47, 19 January 2024 (UTC)[reply]


Per a prior RfC at WP:LLM, should the following sentence be adopted as a policy/guideline (whether or not it is expanded later, or whether or not there are subsequent decisions with respect to the use of LLMs)?

Large language model output, if used on Wikipedia, must be manually checked for accuracy (including references it generates), and its use (including which model) must be disclosed by the editor; text added in violation of this policy may be summarily removed.

jp×g🗯️ 22:22, 13 December 2023 (UTC)[reply]

Survey

Notified WikiProject AI Cleanup. Queen of Hearts ❤️ (no relation) 22:30, 13 December 2023 (UTC)[reply]

Notes

  1. AI detectors like GPTZero are unreliable and should never be used to decide whether a text was created by an AI.
Option 1, absolutely, either as policy or as guideline. I appreciate the "we have no real way to prove it's been used" argument, but I think in most of the problem cases there's simply no reasonable doubt both that it is LLM-generated and that it's just inappropriate junk. If better tools come along, we can certainly reconsider this later: but the way things are now, a clean ban on anything except well-supervised and clearly disclosed drafting assistance seems the sensible step. Andrew Gray (talk) 19:09, 14 December 2023 (UTC)[reply]
@Skarz: According to Buidhe's formulation, option 1b is ban all LLM-generated content from the wiki. I just wanted to confirm if that is what you meant since the explanation of your choice sounds different. Phlsph7 (talk) 18:06, 24 December 2023 (UTC)[reply]
Did I post my response in the wrong place? skarz (talk) 18:24, 24 December 2023 (UTC)[reply]
No, the place is correct. I only got the impression from your explanation that you were not generally opposed to LLMs. But it's possible that I misinterpreted your explanation so please excuse the confusion. Phlsph7 (talk) 18:59, 24 December 2023 (UTC)[reply]
I am not opposed to LLMs. I thought this RFC was regarding whether or not LLM-generated text should have to be attributed/cited as such, not whether LLM-generated content was prohibited. skarz (talk) 19:03, 24 December 2023 (UTC)[reply]
I agree, it has become a little confusing since some editors have suggested additional options in their votes. Currently, the most popular ones are
  • Option 1: new guideline/policy that all LLM output has to be disclosed
  • Option 1b: new guideline/policy that all LLM-generated content is banned
  • Option 2: no new guideline/policy.
Phlsph7 (talk) 19:08, 24 December 2023 (UTC)[reply]
Huh? I sure as heck hope not, because that isn't what Option 1 says. I suppose there is no rule against making RfC comments like "Support option N I think Option N is bad and we should do something different". But I think if we want to have a referendum on a different issue, we should have a separate RfC for it. jp×g🗯️ 06:26, 26 December 2023 (UTC)[reply]
I'll note that the previous draft (WP:LLM) met with ignominious failure after people attached a litany of additional pet-peeve provisions onto it, resulting in a huge trainwreck referendum-about-everything-in-the-world on which it was impossible for consensus to develop (and indeed, none did, meaning that we spent months with no guidance whatsoever). jp×g🗯️ 06:31, 26 December 2023 (UTC)[reply]
Question put to Bard:
 How many legs do ten elephants have, if two of them are legless?
Bard's reply:
 Eight legs per elephant is the standard, so ten elephants would have 8 legs/elephant * 10 elephants = 80 legs.
 However, if two of the elephants are legless, then we need to subtract their leg count from the total. Since legless elephants have 0 legs, the total leg count becomes 80 legs - 0 legs from 2 legless elephants = 80 legs.
 Therefore, ten elephants with two of them being legless would have 80 legs.
Maproom (talk) 18:40, 27 December 2023 (UTC)[reply]
I doubt this is in any way representative of the kind of prompts that a reasonable editor would use when contributing to the encyclopedia. A better example of one might be something like this. Sohom (talk) 19:14, 27 December 2023 (UTC)[reply]
I tried again using the same prompt:
Bard's reply:
Eight-legged elephants are a fun concept, but in reality, an elephant has four legs. So, even if two elephants were missing all their legs (which is not possible for an elephant), the remaining eight elephants would still have a total of 32 legs.
Chat GPT reply:
Elephants typically have four legs each. If two of the ten elephants are legless, the remaining eight elephants would have a total of 32 legs (8 elephants x 4 legs/elephant). The two legless elephants would not contribute any legs to the total count.
Bard still seems to be daydreaming, as nobody told it that an elephant has 8 legs, but ChatGPT provided an accurate answer. ✠ SunDawn ✠ (contact) 02:22, 28 December 2023 (UTC)[reply]

Discussion

Question Do the disclosures include cases where LLMs were used for paraphrasing help? Sohom (talk) 22:31, 13 December 2023 (UTC)[reply]

I believe that's the intention. An error might be introduced by the paraphrasing work, after all.
Unfortunately, there doesn't seem to be any reliable way to detect (or prove) violations. If this passes, you could go revert any addition you want, and say "Well, I thought it was a violation of this rule, and the rule says it can be summarily reverted". WhatamIdoing (talk) 00:59, 14 December 2023 (UTC)[reply]
Yes, that is definitely a weird loophole. I've personally used LLM outputs as inspiration for paraphrasing/rewriting attempts, and having to declare all of them/have them be reverted for no apparent reason is not really something I'm willing to support. Sohom (talk) 01:33, 14 December 2023 (UTC)[reply]
Editors absolutely have to be responsible for what they post, but even for the most benign uses, I really wonder how many people are actually able to say "which model" they used. We have editors who aren't really sure what their web browser is. WhatamIdoing (talk) 01:47, 14 December 2023 (UTC)[reply]
Even something like "used Bing Chat" would be useful to identify LLM content, although I'd certainly prefer more detail for the prompt used or specific model (when used in a Direct Chat that lists it). TROPtastic (talk) 02:38, 14 December 2023 (UTC)[reply]
My thinking on this is, basically, that the playing field as it stands is very uneven. Prompting is a delicate art that can take a while to set up (and tokens often cost money), but nonetheless, a language model can generate a paragraph in a couple seconds. When I do GA reviews or proofread Signpost articles, I take a heck of a lot longer than a couple seconds to go over a paragraph (maybe a couple minutes, maybe half an hour if I have to look something up in a field I'm not familiar with). Normally, the system we have on Wikipedia is somewhat balanced in this respect -- it takes a while to review that a paragraph is legit, and it also takes a while to write a paragraph. While it's not perfectly balanced, it's at least within an order of magnitude. With language models, however, it's possible to create a quite large volume of text with virtually zero input, all of which (under our current policy) is ostensibly required to be treated with the same amount of delicate surgical care as paragraphs written through the hard work of manual effort.
Another thing that's important is the ability to separate people who put a lot of work into the process (i.e. multi-shot prompting, multiple runs, lorebook-style preparation) from people who are literally just typing "Write a Wikipedia article about XYZ" into the box and copypastaing whatever pops out into the edit window. The first group of people, which includes me, is responsible for stuff like the Signpost templates functioning properly and not having busted CSS (thanks GPT-3.5). The second group of people is a nuisance at best and a force of destruction at worst. If someone is writing paragraphs of text and can't be arsed to figure out what website they got them from, why should we spend minutes or hours going through each sentence of that text individually on the assumption that it's legit? jp×g🗯️ 09:53, 14 December 2023 (UTC)[reply]
I share the concern, but the second group will probably not disclose it anyway, so the disclosure requirement is unlikely to help with cleanup. We'll have to continue relying on practices such as removing unsourced content, verifying sources, etc. If the problems are severe enough, such as using fake sources, blocks can be served (there are precedents already). MarioGom (talk) 23:04, 14 December 2023 (UTC)[reply]
I think it might be worth clarifying in the policy itself, in the event of a disclosure policy, that LLM use would not itself be a reason for reversion, akin to analogous policies. The editor is explicitly guaranteeing that they have vetted the accuracy and suitability of the LLM-generated text, so if a reversion is necessary it is just due to the normal reasons one would revert an edit. Remsense 21:06, 24 December 2023 (UTC)[reply]

User:JPxG, is this based on this suggestion by User:Tamzin in the prior RFC? WhatamIdoing (talk) 01:01, 14 December 2023 (UTC)[reply]

Great minds, etc etc -- this page is from Sep 1, that comment is from Sep 6, but there has certainly been some intellectual confluence since then. jp×g🗯️ 01:16, 14 December 2023 (UTC)[reply]

Perhaps out of boredom, I asked a large language model (GPT-3.5) to write an essay in support of this policy and an essay in opposition to this policy. I've included them below in a collapsed column if anyone is interested:

In support

Large language models like GPT-3 possess remarkable capabilities to generate text across diverse subjects, offering unprecedented assistance to editors contributing to platforms like Wikipedia. However, their usage on such platforms demands responsible oversight to maintain the integrity and accuracy of information. Therefore, implementing a policy mandating the manual verification of language model output, including the references it generates, is crucial. This measure ensures that the information added to Wikipedia is reliable and aligns with the platform's standards.

Firstly, while large language models can produce coherent and seemingly accurate content, they lack the discernment and fact-checking abilities inherent in human editors. Verification of their output becomes imperative to sift through the vast array of generated text and validate the accuracy of the information and references provided. This manual review serves as a safeguard against the inadvertent propagation of misinformation or biased content, maintaining Wikipedia's reputation as a reliable source of information.

Secondly, disclosure regarding the use of language models in contributing to Wikipedia is essential for transparency. Editors should openly acknowledge the utilization of such technology and specify the model employed. This disclosure not only fosters transparency but also allows for better monitoring and evaluation of the content generated. It enables the community to understand the potential sources and methodologies used, promoting accountability and trust among editors and readers alike.

Consequently, any content added in contravention of these guidelines should be subject to removal from Wikipedia. Upholding strict policies ensures the platform's credibility, as information obtained from unchecked language model output poses the risk of inaccuracy or bias. By promptly removing such content, Wikipedia maintains its commitment to providing accurate and reliable information, upholding its status as a reputable knowledge-sharing platform.

In conclusion, while large language models offer incredible potential to aid in content creation, their utilization on platforms like Wikipedia demands stringent oversight. Mandating manual verification of generated text, disclosing the use of language models, and promptly removing content added in violation of these policies are essential steps to preserve the integrity and accuracy of information on Wikipedia. These measures ensure that the platform continues to be a reliable and trustworthy source of information for millions of users worldwide.

In opposition

While the integration of large language models like GPT-3 in content creation offers a revolutionary approach to generating information, imposing stringent manual verification and disclosure requirements on their usage for Wikipedia may pose several counterproductive challenges. First, the sheer volume of data these models produce is immense and manually verifying each output, including references, becomes an impractical task. The time and resources required for such meticulous verification could significantly impede the pace of content creation and hinder the platform's ability to keep up with the constantly evolving information landscape.

Furthermore, mandating disclosure of which model an editor uses and the generated text's origin might inadvertently lead to bias against certain models or methodologies. Editors might face unwarranted skepticism or prejudice based on their choice of model, potentially discouraging them from using innovative tools that could otherwise enhance the quality and diversity of content. This limitation might stifle the exploration of new technology within Wikipedia, impeding the platform's ability to adapt and evolve in the rapidly advancing field of information generation and dissemination.

The strict removal of content added without adhering to these stringent policies could also hinder the collaborative nature of Wikipedia. It may discourage contributions from well-intentioned editors who, while using language models, might inadvertently miss the manual verification or disclosure steps. This approach risks alienating contributors and potentially diminishing the wealth of knowledge that diverse perspectives and contributions bring to the platform.

In conclusion, while ensuring accuracy and transparency are paramount in maintaining Wikipedia's credibility, imposing rigid manual verification and disclosure requirements for language model usage might create more challenges than solutions. Striking a balance between leveraging these technological advancements and maintaining editorial standards is crucial to foster an environment that encourages innovation, inclusivity, and the continuous growth of accurate and reliable information on Wikipedia.

Red-tailed hawk (nest) 03:28, 14 December 2023 (UTC)[reply]

It's amusing that the Oppose essay objects not because the proposal could remove human content too hastily, but because it attempts to control LLM use at all. "The time and resources required for such meticulous [manual verification of output including references] could significantly impede the pace of content creation" indeed... TROPtastic (talk) 07:27, 14 December 2023 (UTC)[reply]
I especially loved its final sentence: I was waiting for it to say "synergy" and "paradigm".
I was curious about how bad the "Oppose" essay prompt was, considering the "Support" essay was decent enough -- at least in line with the policies and how people generally interpret them. So I asked GPT-4 via MS Copilot to write a short essay in opposition to the policy change as written:
Opposition to Proposed Policy
Wikipedia is a free online encyclopedia that anyone can edit. It aims to provide reliable and verifiable information on a wide range of topics. However, some editors have suggested that Wikipedia should adopt a new policy that would restrict the use of large language models (LLMs) on the site. LLMs are artificial intelligence systems that can generate natural language text based on a given input. They can be used to create or improve Wikipedia articles, especially on topics that are underrepresented or lack sources. However, the proposed policy would require that any text generated by LLMs must be manually checked for accuracy and disclosed by the editor, and that any text added in violation of this policy may be summarily removed. This policy is unnecessary, unfair, and harmful to the quality and diversity of Wikipedia.
First, the policy is unnecessary because Wikipedia already has mechanisms to ensure the quality and verifiability of its content. Wikipedia has a set of core policies and guidelines that all editors must follow, such as citing reliable sources, maintaining a neutral point of view, and avoiding original research. Wikipedia also has a community of editors who can review, edit, or revert any changes made by other editors, including those made by LLMs. Wikipedia also has various tools and bots that can detect and flag vandalism, spam, or other problematic edits. Therefore, there is no need to single out LLMs as a special case that requires additional scrutiny and disclosure.
Second, the policy is unfair because it would create an unequal and discriminatory environment for editors who use LLMs. LLMs are not inherently good or bad, but rather depend on how they are used and for what purpose. LLMs can be used to generate text that is accurate, informative, and helpful, or text that is inaccurate, misleading, and harmful. The same can be said for any other source of information, such as books, websites, or human experts. Therefore, it is unfair to impose a higher burden of proof and transparency on editors who use LLMs than on editors who use other sources. Moreover, the policy would discourage editors from using LLMs, even if they could improve the quality and diversity of Wikipedia articles. This would create a disadvantage for editors who have limited access to other sources, such as those from developing countries, minority groups, or marginalized communities.
Third, the policy is harmful because it would reduce the quality and diversity of Wikipedia. LLMs have the potential to enhance Wikipedia's mission of providing free knowledge to everyone. LLMs can help create or improve articles on topics that are underrepresented or lack sources, such as those related to women, people of color, indigenous cultures, or emerging fields. LLMs can also help generate text that is more readable, engaging, and accessible to a wider audience, especially for non-native speakers or people with disabilities. By restricting the use of LLMs, the policy would limit the scope and depth of Wikipedia's coverage, and undermine its goal of being a comprehensive and inclusive encyclopedia.
In conclusion, the proposed policy to regulate the use of LLMs on Wikipedia is unnecessary, unfair, and harmful. It would create an unequal and discriminatory environment for editors who use LLMs, and reduce the quality and diversity of Wikipedia's content. Therefore, the policy should not be adopted as a policy or guideline, and instead, Wikipedia should encourage and support the responsible and ethical use of LLMs to improve its encyclopedia.
Honestly, in many respects it makes pretty much the case I make. SamuelRiv (talk) 02:54, 15 January 2024 (UTC)[reply]
GPT-3.5 lacks significantly in reasoning and logic; I'm sure you'd get drastically different results if you were to use GPT-4. skarz (talk) 17:31, 24 December 2023 (UTC)[reply]
The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Discussion at Wikipedia:Templates for discussion/Log/2023 December 13 § Template:AI-generated notification

 You are invited to join the discussion at Wikipedia:Templates for discussion/Log/2023 December 13 § Template:AI-generated notification. –Novem Linguae (talk) 08:26, 14 December 2023 (UTC)[reply]

Discussion at Wikipedia:Templates for discussion/Log/2023 December 13 § Template:OpenAI

 You are invited to join the discussion at Wikipedia:Templates for discussion/Log/2023 December 13 § Template:OpenAI. –Novem Linguae (talk) 08:26, 14 December 2023 (UTC)[reply]

Future directions

I think it may be appropriate to note here my intentions for after the RfC, assuming it is successful.

When writing the proposal, I did my best to prevent it from being a "pro-LLM" or "anti-LLM" policy as written. My hope is that, rather than a meandering general referendum on the whole field of artificial intelligence, we could establish some simple and non-intrusive rule to cut down on the bottom 10% of slop without presenting too much of an obstacle to people who are interested in using the tools productively. And we are getting a rather consistent flow of slop (see WP:WikiProject AI Cleanup), from people who are either using these models improperly, using them for tasks to which they're not suited, or being insufficiently careful in verifying their output. This puts a rather large (and unnecessary) strain on new page patrollers, AfC reviewers, and editors in general.

For what it's worth, I am myself a great fan of transformer models, and have followed them with great interest for several years (I created the articles for GPT-2 and DALL-E, my first interaction with them was a GPT-2-124M in summer 2019, and I had access to the GPT-3 API in 2020). Last August I used the GPT-3 API to assist in writing several Signpost columns; I guess you will have to take my word for it that I didn't write this as a stalking-horse for a project-wide LLM ban.

Some people think that these things are just plain crap, and there is a lot of very lively debate on what utility they really have, and whether it is worth the effort, et cetera. Well, I think it is, but the consensus of the editing community isn't mine to decide, and if everyone thinks that they are junk, then I guess we will have to live with that.

I will note that the number of people who want to ban LLMs entirely increases every time a gigantic bucket of GPT slop is poured into the NPP queue, so if there's some very low-effort solution we can implement to slow down the flow, I think it is worth it even if you are an LLM maximalist who resents any sort of restriction.

Anyway, it is hard to predict the trajectory of a technology like this. They may get better, they may level off, or they may improve a lot at some things and very little at other things in a disjunct way that makes no sense. So maybe we are right on the precipice of a tsunami of crap, or maybe it already passed over, or maybe we're on the precipice of a tsunami of happiness. What I do think is important is that we have policies that address existing issues without prematurely committing to things in the future being good or bad. If it turns out that this cuts down on 90% of the slop and we never have an ANI thread about GPT again, then maybe there does not need to be any further discourse on the issue. If it turns out that this short sentence isn't enough, then maybe we can write more of them. jp×g🗯️ 09:37, 15 December 2023 (UTC)[reply]

Then:
  • Old problem: We had a bunch of badly written articles posted.
  • Old action: We wrote a bunch of rules against undisclosed paid editing.
  • Old result: A few folks changed their behavior, and the rest kept doing the same thing anyway, because we had no good way to identify them.
Now:
  • New problem: We have a bunch of badly written articles being posted.
  • New action: We write some rules against a set of tools that might be used to make them.
  • New result: A few folks changed their behavior, and the rest kept doing the same thing anyway, because we had no good way to identify them?
WhatamIdoing (talk) 04:04, 17 December 2023 (UTC)[reply]
Even if there is no good way to identify them, that does not mean it is a bad idea to institute this as policy. Is there an easy way to, for example, identify bot-like or semi-automated editing? Unless there are tags identifying the script or tool that made the edit, a semi-automated edit could have any edit summary or no summary and no one would really know that it was semi-automated. The whole point is that not banning LLMs from mainspace poses a significant risk of disruption, and encouraging it would just be encouraging more disruption. And DE is one thing that, regardless of the means or intent, results in a block if it is prolonged. Awesome Aasim 22:13, 17 December 2023 (UTC)[reply]
The thing is that everything about LLM use that disrupts Wikipedia is already prohibited by existing policies. Nobody in any discussion so far has provided any evidence of anything produced by an LLM that is both permitted by current policy and harmful to Wikipedia. Thryduulf (talk) 10:27, 18 December 2023 (UTC)[reply]
Because the issue the policy is trying to address is more about larger editing patterns than individual diffs. It's not illogical if the scope of policies overlap—in fact, it's arguably a feature, since it reinforces the points that the community find most important. Remsense 14:11, 31 December 2023 (UTC)[reply]
While there is inevitably some overlap in policies, I disagree that it's a feature per se. Generally speaking, it is easier for editors to keep track of fewer policies than more, so having a small number of central policies with supporting guidance that expands on the details provides an organizing structure that simplifies remembering and following guidance. Avoiding redundancy supports this principle and helps prevent guidance from getting out of sync, and thus becoming contradictory. It also can forestall complaints about there being too much guidance, as the basic shape of the guidance can be understood from the central policies, and the details can be learned gradually, without having to jump between overlapping guidance. isaacl (talk) 17:04, 31 December 2023 (UTC)[reply]
I don't think that "the whole point is that not banning LLMs from mainspace poses a significant risk". I think there are some good old human emotions at play here, but the problem is that we already know the ban will be ineffective. Most people won't know the rule, you won't be able to catch them (and we will wrongly accuse innocent people), and most of the few people who are using LLMs and actually know the rule won't follow it, either, because a good proportion of them don't know that you decided that their grammar checker is an LLM, and the rest don't think it's really any of your business.
This is King Canute and the tide all over again: We declare that people who are secretly using LLMs must stop doing it secretly, so that we know what they're doing (and can revert them more often). You're standing on the beach and saying "You, there! Tide! Stop coming in, by orders of the king!" We can't achieve any of the goals merely by issuing orders.
And your plan for "And what if they don't follow your edict?" is what exactly? To harrumph about how they are violating the policies? To not even know that they didn't follow your orders? WhatamIdoing (talk) 07:06, 11 January 2024 (UTC)[reply]
A good summary of our WP:COI guidelines, but it doesn't seem a reason to scrap them. CMD (talk) 07:28, 11 January 2024 (UTC)[reply]
I am also concerned that it will add an unnecessary burden on those of us who will follow the policy, for no apparent reason. MarioGom (talk) 12:04, 11 January 2024 (UTC)[reply]

Request for close

I'm going to make a request, because the bot just removed the RFC template since it's been a month (I obviously am not going to close it myself). jp×g🗯️ 10:18, 13 January 2024 (UTC)[reply]