The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was Request Expired.

Operator: Naypta (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:13, Saturday, June 20, 2020 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Golang

Source code available: https://github.com/mashedkeyboard/yapperbot-scantag

Function overview: Scans every article on Wikipedia, and checks for configured patterns. When it finds a pattern, it tags the article with an appropriate maintenance tag.

Links to relevant discussions (where appropriate): Wikipedia:Bot requests#Populate tracking category for CS1|2 cite templates missing "}}" (now archived here)

Edit period(s): Continuous

Estimated number of pages affected: Hard to say; as it's configurable, potentially the entire corpus of Wikipedia articles

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: GreenC was looking for a bot that would populate a tracking category for templates that are unclosed and contained within a <ref> tag. I started out making exactly that, but then realised that, rather than having a mishmash of separate bots each running its own regexes over every article, it would be better to have one bot parse the pages and run all of the regexes people want matched against them. So that's what I've made.

Scantag, the name I've given to this task, is a dynamic system for defining rules - regular expressions that match articles needing maintenance - and then tagging the matching articles with the associated maintenance templates. At present, it is only configured for the specific request that GreenC has made; however, its configuration (as is becoming a recurring theme with my bots!) is entirely on-wiki, so it can be reconfigured on-the-fly. You can see its configuration file here. This is in a user JSON file, so it is only editable by administrators and by me through the bot account; I think this should be a sufficient threshold to prevent abuse, but I was considering getting the content model changed to JS to make it interface-protected instead, given the potential for danger inherent in the task. Thoughts on this would be appreciated.

Whilst the edit filter also matches changes against regexes, it is designed only for preventing serious issues that actively harm the wiki, and there is a limit to the number of rules it can run - after all, a user is waiting on every edit. Scantag, on the other hand, is a deliberately slow process - it runs with a short maxlag, a high number of maxlag retries, and after every edit it waits a full ten seconds before continuing. The advantage is that, while it may be slow, it can be used for far more than the edit filter ever could be. Because it looks over every single article, it can also find and tag articles that would be impossible to reach through a standard Elasticsearch regex search, because the search would simply time out. Case in point: the maintenance tagging we're talking about here - but potentially, the same system could be useful for a number of other applications that involve matching patterns in articles.
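For illustration only, the throttling described above might be captured by settings along these lines in Go - the ten-second wait is as described here, the 3-second maxlag value is mentioned later in this discussion, and the retry count is purely a placeholder, not a figure from the bot's source:

```go
// Illustrative throttle settings only; not taken from the bot's actual source.
package scantagsketch

import "time"

const (
	maxLagSeconds = 3                // short maxlag: back off whenever replication lag rises
	maxLagRetries = 50               // "a high number of retries" - the exact figure is a placeholder
	postEditWait  = 10 * time.Second // full ten-second pause after every edit
)
```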

The task works as follows:

  1. The bot examines the User:Yapperbot/Scantag.json file, reads the rules, and compiles the regexes.
  2. The bot then iterates through the latest database dump's list of page titles in NS0.
  3. For every title in NS0, the bot retrieves the wikitext.
  4. The bot matches each regex (currently only the one) specified in Scantag.json against the wikitext of the article.
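As a rough sketch only (the bot's actual source is linked above), the rule-loading and regex-compiling in steps 1 and 4 might look something like the following in Go. The JSON layout assumed here - an array of rule objects with the properties described on this page - is for illustration, not a statement of the real Scantag.json format:

```go
// Hypothetical sketch of rule loading; the JSON layout is an assumption.
package scantagsketch

import (
	"encoding/json"
	"regexp"
)

// scantagRule mirrors the rule properties described on this page.
type scantagRule struct {
	Regex    string          `json:"regex"`    // pattern matched against article wikitext
	NoTagIf  json.RawMessage `json:"noTagIf"`  // a regex string, or false to disable the check
	Prefix   string          `json:"prefix"`   // wikitext to prepend, if any
	Suffix   string          `json:"suffix"`   // wikitext to append, if any
	Detected string          `json:"detected"` // reason given in the edit summary
	TestPage string          `json:"testpage"` // sandbox test page name, if any
}

// compiledRule pairs a rule with its compiled regexes.
type compiledRule struct {
	match   *regexp.Regexp
	noTagIf *regexp.Regexp // nil when noTagIf is false
	rule    scantagRule
}

// loadRules parses the Scantag.json content and compiles every regex,
// returning an error (and tagging nothing) if any pattern fails to compile.
func loadRules(configJSON []byte) ([]compiledRule, error) {
	var rules []scantagRule
	if err := json.Unmarshal(configJSON, &rules); err != nil {
		return nil, err
	}
	compiled := make([]compiledRule, 0, len(rules))
	for _, r := range rules {
		m, err := regexp.Compile(r.Regex)
		if err != nil {
			return nil, err
		}
		var noTagIf *regexp.Regexp
		if s := string(r.NoTagIf); s != "" && s != "false" {
			var pattern string
			if err := json.Unmarshal(r.NoTagIf, &pattern); err != nil {
				return nil, err
			}
			if noTagIf, err = regexp.Compile(pattern); err != nil {
				return nil, err
			}
		}
		compiled = append(compiled, compiledRule{match: m, noTagIf: noTagIf, rule: r})
	}
	return compiled, nil
}
```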

If there is no match, the bot skips to the next article. If the bot matches the regex, however, it performs the following steps:

  1. Check the "noTagIf" regex corresponding to the main regex. This is designed to detect articles that have already been tagged with the correct maintenance tag; if it matches, the bot skips the article.
  2. Prefix the article with the corresponding "prefix" property in the JSON file, if there is one.
  3. Suffix the article with the corresponding "suffix" property in the JSON file, if there is one.
  4. Edit the page, with an edit summary linking to the task page, and listing the "detected" parameter as the reason.
  5. Wait ten seconds before moving on. This is a safety mechanism: if a badly-written regex ever sends the bot haywire, tagging every single article it comes across, the delay limits how much damage it can do before it is stopped.
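Continuing the illustrative sketch above (same hypothetical file, with "time" additionally imported), the match-handling steps might look roughly like this; editPage below is a stand-in for whatever function actually saves the edit through the API, not a real function from the bot:

```go
// Hypothetical continuation of the sketch above; not the bot's actual source.
func processArticle(title, wikitext string, rules []compiledRule) error {
	for _, r := range rules {
		if !r.match.MatchString(wikitext) {
			continue // no match: nothing to do for this rule
		}
		if r.noTagIf != nil && r.noTagIf.MatchString(wikitext) {
			continue // noTagIf matched: the article is already tagged
		}
		newText := r.rule.Prefix + wikitext + r.rule.Suffix
		summary := "Tagging: " + r.rule.Detected + " ([[User:Yapperbot/Scantag|Scantag]])"
		if err := editPage(title, newText, summary); err != nil {
			return err
		}
		// Safety throttle: a badly-written regex can then only do limited
		// damage before somebody notices and stops the bot.
		time.Sleep(10 * time.Second)
	}
	return nil
}

// editPage is a placeholder so the sketch is self-contained; the real bot
// performs a full edit through the MediaWiki API here.
func editPage(title, text, summary string) error {
	return nil
}
```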

In common with other Yapperbot tasks, the bot respects the kill page at User:Yapperbot/kill/Scantag, so in the event of an emergency, it could be turned off that way.
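Purely as a guess at the shape of that check - the convention for what counts as "killed" isn't documented on this page, so the non-empty/"false" test below is a placeholder rather than the bot's real behaviour - it might look something like this (with "strings" imported):

```go
// A guess at the kill-switch check, not taken from the bot's source; fetchPage
// stands in for a simple wikitext fetch of the given title.
func shouldStop(fetchPage func(title string) (string, error)) bool {
	content, err := fetchPage("User:Yapperbot/kill/Scantag")
	if err != nil {
		return true // if the kill page can't be read, stop to be safe
	}
	content = strings.TrimSpace(strings.ToLower(content))
	return content != "" && content != "false"
}
```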

Because writing these rules requires not only a good knowledge of regex, but also for the regexes to be correctly JSON-escaped so they survive inside the JSON strings, and because of the potential for issues to come up as a result, there is also a sandbox for the rules. I, or any other administrator configuring a Scantag rule, can set one up to test there. Rules in the sandbox generate a report at User:Yapperbot/Scantag.sandbox, explaining exactly what the bot has understood from the JSON it's been given, and rendering an error if there are any obvious problems (e.g. failure to compile one of the regexes, noTagIf being set to anything other than a regex or false, etc.). Each rule can also have a "testpage" parameter, specifying a test page under the prefix "User:Yapperbot/Scantag.sandbox/tests/", which is designed as a place to set up tests to make sure that the corresponding regex matches when it's supposed to, and doesn't match when it's not. An example of one of these is here.
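To make the escaping point concrete, a completely made-up rule might look like the following - every value here is invented for illustration, and the real Scantag.json linked above is the authoritative format; note how every backslash in the regexes has to be doubled to survive JSON string escaping:

```json
[
  {
    "regex": "<ref[^>]*>[^<]*\\{\\{[Cc]ite [^}]*</ref>",
    "noTagIf": "\\[\\[Category:Articles with unclosed citation templates\\]\\]",
    "prefix": "",
    "suffix": "\n[[Category:Articles with unclosed citation templates]]",
    "detected": "a citation template left unclosed inside a <ref> tag",
    "testpage": "unclosed-cite-in-ref"
  }
]
```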

I appreciate that this is a fair bit more complicated than the previous bot tasks, so I'm absolutely happy to answer any questions! There are specific instructions for admins on how to deal with Scantag rule requests on the task page. I think there is also an open question here as to whether each rule would require a separate BRFA. Fundamentally, what's going on here isn't all that different from a "retroactive edit filter" of sorts, so I should think either the default restriction on user JSON files (only admins and the bot account can edit), or changing the content model to JS so that only interface administrators can edit, should be sufficient to protect against misuse; however, I'd definitely like to hear BAG members' thoughts on this.

Discussion

Regarding the editing restrictions, I don't think there's a need to restrict it to intadmins. Just put a banner as an editnotice asking admins not to edit unless they know what they're doing. (non-BAG comment) SD0001 (talk) 14:05, 2 July 2020 (UTC)
@SD0001: I had a chat with some of the team in either #wikimedia-cloud or #wikimedia-operations on IRC (one or the other, I don't recall which, I'm afraid) who had indicated that there wouldn't be an issue with doing it this way, so long as maxlag was set appropriately (which is deliberately low here, at 3s). I didn't initially want to do too many page requests in a batch, for fear of ending up with a ton of edit conflicts towards the end of the batch; even with the ability to handle edit conflicts, it's expensive both in terms of client performance and also in terms of server requests to do so. That being said, batching some of the requests could be an idea - if either you or anyone else has a feel for roughly what that batch limit ought to be, I'd appreciate any suggestions, as this is the first time I'm building a bot that parses the entire corpus. Naypta ☺ | ✉ talk page | 14:38, 2 July 2020 (UTC)
@Naypta: Now I've actually read the task description. Since the bot is only editing the very top or bottom of the page, it is unlikely to run into many conflicts. Edit conflicts are only raised if the edits touched content in nearby areas; the rest are auto-merged using diff3. I'd be very surprised if you get more than 5-10 edit conflicts in a batch of 500. So if you want to reduce the number of server requests (from about 1000 to about 510 per 500 pages), batching is what I'd use. If you choose to do this, you'd want to give the jsub command enough memory to avoid an OOM. SD0001 (talk) 16:09, 2 July 2020 (UTC)
@SD0001: Sure, thanks for the recommendation - I'll plonk them all into batches then. You're right that it's only editing the very top and bottom, but it does need to do a full edit (because of maintenance template ordering) rather than just a prependtext and appendtext, which is unfortunate, so the edit conflict issues might still come about from that. No harm in giving it a go batched and seeing how it goes, though! I'll make sure it gets plenty of memory assigned on the grid engine to handle all those pages - a few gigabytes should do it in all cases. Naypta ☺ | ✉ talk page | 16:13, 2 July 2020 (UTC)
Done - batching implemented. I've also changed the underlying library I use to interact with MediaWiki to make it work with multipart encoding, so it can handle large pages and large queries, like these batches, an awful lot better. Naypta ☺ | ✉ talk page | 22:20, 3 July 2020 (UTC)
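For reference, a much-simplified illustration (not the bot's actual code) of what a batched read looks like at the API level - one query request fetching the wikitext of many titles, rather than one request per page:

```go
// Illustrative batched fetch of current wikitext via the MediaWiki API.
// Error handling, query continuation and a proper User-Agent are omitted.
package scantagsketch

import (
	"encoding/json"
	"net/http"
	"net/url"
	"strings"
)

func fetchBatch(titles []string) (map[string]string, error) {
	params := url.Values{
		"action":        {"query"},
		"format":        {"json"},
		"formatversion": {"2"},
		"prop":          {"revisions"},
		"rvprop":        {"content"},
		"rvslots":       {"main"},
		"titles":        {strings.Join(titles, "|")},
		"maxlag":        {"3"},
	}
	resp, err := http.Get("https://en.wikipedia.org/w/api.php?" + params.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var data struct {
		Query struct {
			Pages []struct {
				Title     string `json:"title"`
				Revisions []struct {
					Slots struct {
						Main struct {
							Content string `json:"content"`
						} `json:"main"`
					} `json:"slots"`
				} `json:"revisions"`
			} `json:"pages"`
		} `json:"query"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&data); err != nil {
		return nil, err
	}

	texts := make(map[string]string, len(titles))
	for _, page := range data.Query.Pages {
		if len(page.Revisions) > 0 {
			texts[page.Title] = page.Revisions[0].Slots.Main.Content
		}
	}
	return texts, nil
}
```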

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Primefac (talk) 22:22, 2 August 2020 (UTC)

Request Expired. Operator inactive. No prejudice against re-opening upon their return. Primefac (talk) 21:02, 14 November 2020 (UTC)[reply]

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.