The following discussion is an archived debate of the proposed deletion of the miscellaneous page below. Please do not modify it. Subsequent comments should be made on the appropriate discussion page (such as the page's talk page or in a deletion review). No further edits should be made to this page.

The result of the discussion was: keep. However, there is also consensus to disable the script until the major bugs have been fixed and relative stability has been achieved. If this script is being widely used, and that use is causing significant damage to the project, then both the editors using the tool and the creator of the tool are to blame. Therefore, @BrandonXLF: please disable this script temporarily to work out the bugs. Then, consider opening up access to the script to a small number of users for additional testing before opening it up more widely. —⁠ScottyWong⁠— 06:04, 20 June 2023 (UTC)

User:BrandonXLF/ReferenceExpander

User:BrandonXLF/ReferenceExpander (edit | talk | history | links | watch | logs)
User:BrandonXLF/ReferenceExpander.js (edit | talk | history | links | watch | logs)[a]

Careless use of this script has spawned multiple AN/ANI threads (1, 2, 3) and created a huge mess, resulting in thousands of damaged citations that will likely take several months and hundreds of collective hours of work to clean up. I don't have any technical expertise, so I'm not qualified to evaluate the code, but a cursory examination reveals that a very large portion (perhaps a majority) of edits made using this script have removed useful information from references or introduced errors. The author of the script, BrandonXLF, has added a disclaimer to the UI, but hasn't made any other changes to the script since he was made aware of the issues. The disclaimer hasn't been entirely effective at preventing misuse, as seen here. As a preventative measure, this script should be deleted or at least disabled until its functionality is improved. — SamX [talk · contribs] 20:03, 27 May 2023 (UTC) edited 03:56, 30 May 2023 (UTC)

  1. ^ A technical decision was made to nominate the documentation page. This discussion is intended to pertain to the JavaScript code. Folly Mox (talk) 13:51, 3 June 2023 (UTC)
  1. Trust the input over the served page. Run the "populate citation template" parsing function on the input data, and run it separately on the URL.
    1. Add any newly populated fields obtained from the URL to the template created from the existing data.
    2. For any fields that differ between the two results, leave the original values as they are, but highlight the results from the URL for users to compare before they decide whether to commit the suggestion.
    3. If the count of alphanumeric characters in the populated fields in the template created from the input data is lower than the count of alphanumeric characters minus the URL in the input data itself, the parser has missed something and the suggestion should be discarded. This check should handle things like bundled citations, quotes, and other notes that I'm not smart enough to have suggestions about.
      Trusting the manually entered information over information parsed from the URL with this or a similar process should by itself solve a large number of issues. Crucially, it would prevent the citation-wrecking error of processing a link to the root of a usurped domain, which destroys all the information in the reference, but also other information-loss scenarios: the script failing to parse out an author, publication date, page number, etc. that was present in the original reference; changing references entirely because someone got an ISBN off by one; the script mistaking an author for a title, or a website for a title; and many other misparses.
  2. Stop removing archives. When a pair of ref tags contains a call to webarchive, incorporate that data into the new citation rather than discarding it. This is one of the more damaging errors, since it can make verification impossible without checking the article history if the link is dead, and takes a long time to repair manually since we have to go dig up an archive.
  3. If the input contains a citation template and any other information, pass along unchanged to the output any information outside the citation template, in the same location.
  4. Do basic error checking on the results. If the page title contains "404", "Page not found", "Request rejected", "Not anonymous", etc., skip the reference. If the author fields contain numeric strings, "Contact Us", "Uploading...", or more than five non-consecutive whitespace characters, skip the reference. String parsing for arbitrary information over the set of all webpages is not an easy task, and it's better to know your limits than to assume that your parser is going to get it right every time.
  5. Discriminate between editors and authors in book results. Search for "ed." or "eds." or "edited by" in the served webpage.
  6. Incorporate functionality to handle chapter contributions by authors who are not listed amongst the main editors of books. I know page numbers for chapters will probably not be possible, since this is almost never included in the page results, but comparing the input to the suggested output should be a viable route to this functionality.
  7. Stop escaping special characters (particularly =, &, and ?) in URLs. I see your homebrew Citoid.js is responsible for this. A bot (don't remember which) has been following behind edits made by your script and correcting this unwanted behaviour; User:Citation bot will correct this when run on pages following edits made using your script.
    11:31, 4 June 2023 (UTC): Updating this bit again to add that escaping % can easily break links containing non-ASCII glyphs.
  8. Look for special separator characters like dashes and pipes in the html title metadata, since they usually indicate a break between the actual title and the name of the website, series, or author, and any information after the separator almost certainly doesn't belong in the title= parameter.
  9. Any template calls in the input data that are not a citation template or webarchive template, or anything your script will be resolving, like Template:bare URL inline, pass along unchanged. This should stop the script from removing things like Template:pd-notice, which breaks attribution.
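The most mechanical of the suggestions above (1, 4, and 8) could be sketched as small, testable helpers. A minimal sketch in JavaScript, the language of the script itself; every function name and threshold here is a hypothetical illustration, not ReferenceExpander's actual code:

```javascript
// Sketch of suggestions 1, 4, and 8: prefer manually entered data,
// sanity-check URL-derived data, and trim site names off titles.
// All names and thresholds here are hypothetical illustrations.

// Suggestion 4: reject obvious error pages.
function looksLikeErrorPage(title) {
  return /404|page not found|request rejected|not anonymous/i.test(title || '');
}

// Suggestion 4: reject junk author strings. The word-count limit is a
// rough stand-in for the "too much whitespace" heuristic in the text.
function looksLikeBadAuthor(author) {
  if (!author) return true;
  if (/^\d+$/.test(author)) return true;          // purely numeric
  if (/contact us|uploading/i.test(author)) return true;
  return author.split(/\s+/).length > 6;          // too many words to be a name
}

// Suggestion 8: keep only what comes before a pipe/dash separator
// in the HTML <title>, since the rest is usually the site name.
function trimHtmlTitle(rawTitle) {
  return rawTitle.split(/\s+[|\u2013\u2014-]\s+/)[0].trim();
}

// Suggestion 1: merge URL-derived fields into fields parsed from the
// existing reference, trusting the existing data on any conflict.
function mergeCitationFields(inputFields, urlFields) {
  const merged = Object.assign({}, inputFields);
  const conflicts = {};
  for (const [key, value] of Object.entries(urlFields)) {
    if (!(key in merged)) {
      merged[key] = value;    // 1.1: new field, safe to add
    } else if (merged[key] !== value) {
      conflicts[key] = value; // 1.2: keep original, flag for the user
    }
  }
  return { merged, conflicts };
}
```

The point of returning `conflicts` separately is that the UI can highlight disagreements for the user instead of silently overwriting manually entered data.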
These changes should improve the script to the point where it no longer damages references. To improve it further, to the level where it can genuinely expand references well enough that a manual double check is probably not needed, I suggest the following changes:
  • In served webpages, look for boxes like "how to cite this page" or "download citation" to make things easier (or "entry information", "datasheet citation", etc.).
  • Look harder for things like authors and publication dates, which are often at the top or bottom of the body text.
  • Check better to see if the citation is to a book or news article. I've only seen ReferenceExpander create Template:cite book or Template:cite journal when it's fed an ISBN or DOI or already properly formatted citation. In all other cases it generates Template:cite web, irrespective of the type of source, just because it happens to have a URL. (Edited to add that I have now seen the script produce Template:cite book on its own, although in two cases it should have been Template:cite magazine.)
  • Be more discriminating about populating the website= parameter. About nineteen times out of twenty, I see the script fill this in with whatever is between the https:// and the next forward slash character, which is trivially obvious by inspecting the URL and adds no value.
  • This is a genuine nitpick, but it's been frustrating for me personally to go through all these diffs and see how many times the script has added the useless (on en-wiki) language=en field, while failing to add a language= parameter for foreign languages, like Malay, Russian, Icelandic, or Latin, all of which I've encountered during my repairs. language=en doesn't alter the appearance of the page on en-wiki, and so adds no benefit to the reader. (The script does sometimes add parameters identifying foreign language sources; its lapses in that regard are only frustrating in the context of how frequently it needlessly identifies sources as English.)
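The last two bullets amount to a simple post-processing pass over the generated parameters. A hedged sketch, assuming parameters are held in a plain object keyed by {{cite web}} parameter names (the function name is made up):

```javascript
// Sketch of the last two bullets: drop a website= value that merely
// repeats the URL's hostname, and drop language=en, which is redundant
// on en-wiki. This is a hypothetical post-processing step, not
// ReferenceExpander's actual code.
function pruneRedundantParams(params) {
  const out = Object.assign({}, params);
  if (out.website && out.url) {
    const host = new URL(out.url).hostname.replace(/^www\./, '');
    if (out.website.toLowerCase() === host.toLowerCase()) {
      delete out.website; // trivially recoverable from the URL itself
    }
  }
  if ((out.language || '').toLowerCase() === 'en') {
    delete out.language; // adds nothing for the en-wiki reader
  }
  return out;
}
```

A real website name like "Example Magazine", or a genuine foreign-language tag like `language=ms`, passes through untouched.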
I've repaired some citations during the course of this cleanup that seemed impossible for even the most thoughtfully crafted algorithm, like one where the publication information on Google Books matched a different version than the preview pages, and having done some string parsing lo these many decades ago I understand it's a difficult task and I'm not expecting perfection. I'd rather see the script improved than thrown out, and it's true that we have User:Philoserf to blame for the vast, vast majority of bad edits facilitated by this script. But I think I said in an edit summary somewhere that ReferenceExpander gave Philoserf a lot of dumb suggestions.
We trusted Philoserf to make constructive edits, and Philoserf trusted ReferenceExpander to make good suggestions. The cleanup has already taken tens, maybe hundreds of volunteer hours, and it's unclear how far we've gotten, because the failure states are so varied that every repair is its own journey. Each reference usually takes me about five to ten minutes (I'm likely slower than most editors due to editing on mobile, and also maybe my standards are too high?), and sometimes there are dozens of references in a diff. Right now I think the safest thing is to disable the script while BrandonXLF works on improvements; then maybe we can do a trial run, like BRFAs do, until we're satisfied it can be used safely. Unfortunately, if we can't get the script disabled by consensus or voluntarily by the maintainer, my second choice is delete. I'm pretty sure there are other scripts that have similar functionality (ReFill maybe?) without the dangers.
I intend to come back and add diffs to this.  Done Folly Mox (talk) 06:19, 30 May 2023 (UTC) Edited 09:54, 30 May 2023 (UTC). Diffed 31 May 2023
Peeking at the code again, it looks like the entire reference suggestion hinges on getCitoidRef, which relies on MediaWiki's own Citoid.js, so I'm wondering if the bulleted suggestions above might be asking BrandonXLF to do the impossible; maybe other citation-creating scripts using Citoid will have the same error rate given an input URL. I'm also not sure if there's a way to call a Citoid function to create its JSON object (or whatever it does) based on the input text of an existing reference rather than a URL, but that's really what needs to happen to make sure this script doesn't lose information when altering references that include more than a bare URL. It looks like the fix might not be as simple as I conceived, as per usual in programming. Still planning on adding diffs; I have other things going on, unfortunately. Folly Mox (talk) 18:08, 30 May 2023 (UTC)
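If Citoid really can only be queried with a URL, the script could still parse the existing reference first and let Citoid fill only the gaps. A sketch of that fallback, using a mock response object; the field shapes here (title, date, author as [first, last] pairs) are assumptions about Citoid's mediawiki-format output, not verified against it:

```javascript
// Hedged sketch: convert a (mocked) Citoid response into citation
// template parameters, then let the existing reference's own data win.
// The response shape below is an assumption; the real Citoid
// "mediawiki" format may differ in detail.
function citoidToParams(citoid) {
  const params = {};
  if (citoid.title) params.title = citoid.title;
  if (citoid.date) params.date = citoid.date;
  if (Array.isArray(citoid.author)) {
    citoid.author.forEach(([first, last], i) => {
      params['first' + (i + 1)] = first;
      params['last' + (i + 1)] = last;
    });
  }
  return params;
}

// Existing reference data always wins; Citoid only fills missing fields.
function fillGaps(existing, citoidParams) {
  return Object.assign({}, citoidParams, existing);
}
```

With this ordering, a Citoid misparse can add a wrong field but can never overwrite one the editor already supplied, which is the failure mode behind most of the damage described above.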
I just repaired an edit in which the script ignored {{cbignore}} and removed it from the reference. I'm not sure if user scripts are supposed to follow {{cbignore}}, but I figured I'd bring it up here since it probably isn't desired behavior. — SamX [talk · contribs] 16:28, 31 May 2023 (UTC)
As far as I've been able to determine from my half-assed code review, the entire core functionality goes like this: extract the first URL from the input data → generate a citation template based on processing that URL and nothing else → suggest the change. It doesn't look at anything else in the input. Folly Mox (talk) 16:44, 31 May 2023 (UTC)
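That URL-only pipeline can be contrasted with what a safer one would need to preserve. A toy illustration; the regexes here are stand-ins for real wikitext parsing, not the script's actual code:

```javascript
// Toy contrast between the two pipelines. The URL-only approach keeps
// just this much of the input:
function firstUrl(refText) {
  const m = refText.match(/https?:\/\/\S+/);
  return m ? m[0] : null;
}

// A safer pipeline would first collect the named parameters already
// present in the reference, so none of them can be silently dropped.
// (Naive parse: breaks on values containing "|" or "}", fine for a toy.)
function existingParams(refText) {
  const params = {};
  for (const m of refText.matchAll(/\|\s*([\w-]+)\s*=\s*([^|}]*)/g)) {
    params[m[1]] = m[2].trim();
  }
  return params;
}
```

Run on a typical input like `{{cite web |url=https://example.com/x |title=Some title |author=Jane Doe}}`, the first function retains only the URL, while the second shows the title and author that a URL-only pipeline throws away.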

References

  1. ^ MfDs ending in ".js" have been made before, and nothing seemed to break.
The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made on the appropriate discussion page (such as the page's talk page or in a deletion review). No further edits should be made to this page.