The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

Operator: Anomie

Automatic or Manually Assisted: Automatic, unsupervised

Programming Language(s): Perl

Function Summary: Correct reference syntax errors and attempt to recover orphaned refs from page history

Edit period(s) (e.g. Continuous, daily, one time run): Continuous

Already has a bot flag (Y/N): N

Function Details: The bot will process mainspace pages in Category:Pages with incorrect ref formatting to correct some common errors, as described on the bot's user page. The most interesting correction is that the bot will detect orphaned refs (i.e. named references without content) and attempt to find content for them in the page history. The bot will process each mainspace page in the category only once per revision, and once it has scanned the page history for a particular orphaned reference and not found a replacement it will not bother searching for that reference again.
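For illustration only (this is not the bot's code, and the helper name is hypothetical), a minimal Perl sketch of what "named references without content" means in practice: a name that only ever appears in self-closing <ref name="..." /> tags and is never defined with content anywhere in the wikitext.

    use strict;
    use warnings;

    # Sketch: return names that are used via self-closing <ref name="..." /> tags
    # but never defined with content anywhere in the wikitext.
    sub find_orphaned_refs {
        my ($wikitext) = @_;
        my (%defined, %used);

        # Refs with content define the name.
        while ($wikitext =~ m{<ref\s+name\s*=\s*["']?([^"'>/]+)["']?\s*>.*?</ref>}sig) {
            $defined{$1} = 1;
        }
        # Self-closing refs only use the name.
        while ($wikitext =~ m{<ref\s+name\s*=\s*["']?([^"'>/]+)["']?\s*/>}sig) {
            $used{$1} = 1;
        }

        return grep { !$defined{$_} } keys %used;
    }

Any name returned by such a check is an orphan, and the bot then searches the page history for a revision that still carried the corresponding content.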

Discussion

I am open to suggestions as to how often to check the category for additions once the bot has cleaned up all of the errors it knows how to handle. Anomie 03:55, 20 August 2008 (UTC)[reply]

I copied this list of tasks from the bot userpage. – Quadell (talk) 14:00, 20 August 2008 (UTC)[reply]

That last one looks pretty tricky. Is the code ready? The others look pretty easy to understand. – Quadell (talk) 14:00, 20 August 2008 (UTC)[reply]

Ready and uploaded at User:AnomieBOT/source/tasks/OrphanReferenceFixer.pm. I've even run the task in "testing" mode (where it writes proposed edits to the local filesystem for me to manually verify) and it seems to work well. The "find references" function pulls out <ref> tags with a regular expression and {{#tag:ref}}s with a simple parser. The task uses this to check the current version for orphans, and then (at 10 revisions/minute, currently) iterates back through the page history until it either runs out of orphans or runs out of history.
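Roughly, that recovery loop might look like the following sketch. This is not the actual OrphanReferenceFixer.pm; fetch_older_revisions and extract_ref_definitions are hypothetical helpers standing in for the API calls and the <ref>/{{#tag:ref}} extraction described above.

    # Walk backwards through the page history, filling in content for orphaned
    # names as definitions are found in older revisions.
    sub recover_orphans {
        my ($title, @orphans) = @_;
        my %recovered;

        for my $rev (fetch_older_revisions($title)) {    # newest to oldest
            last unless @orphans;                        # ran out of orphans
            my %defs = extract_ref_definitions($rev->{wikitext});
            for my $name (@orphans) {
                $recovered{$name} = $defs{$name} if exists $defs{$name};
            }
            @orphans = grep { !exists $recovered{$_} } @orphans;
        }
        return %recovered;    # empty for names the history never defined
    }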
There's also some logic in the iteration to allow the bot to stop processing any particular page, save any edits to that point, and later continue where it left off. Otherwise a page with an unfindable orphan and 5000 revisions would monopolize the bot for 8+ hours at the current rate limit. Anomie 16:14, 20 August 2008 (UTC)[reply]
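A hedged sketch of that resume idea, assuming a per-page state record (the storage format and helper are purely illustrative, not what the bot actually keeps):

    # Remember, per page, the oldest revision already scanned and which orphans
    # are still outstanding, so a later pass can pick up where this one stopped
    # instead of re-reading the whole history.
    my %state;    # $state{$title} = { last_revid => ..., pending => [...] }

    sub scan_a_batch {
        my ($title, $max_revs) = @_;
        my $s = $state{$title} //= { last_revid => undef, pending => [] };

        my $count = 0;
        for my $rev (fetch_older_revisions($title, before => $s->{last_revid})) {
            last if ++$count > $max_revs;    # stop early, but keep our place
            $s->{last_revid} = $rev->{revid};
            # ... look for the pending orphans in $rev->{wikitext} ...
        }
        # Fixes found so far can be saved now; anything left resumes next pass.
    }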
Since Wikipedia:Creating a bot#General guidelines for running a bot has been revised to remove the "10 reads per minute" limit when a bot respects maxlag, I have removed that limit from this task. Anomie 02:12, 25 August 2008 (UTC)[reply]
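For context, respecting maxlag means sending a lag threshold with every API request; if database replication lag exceeds it, the servers refuse the request and suggest how long to wait. A minimal sketch of that pattern (the user agent string, endpoint handling, and retry structure here are illustrative, not the bot's code):

    use LWP::UserAgent;
    use URI;

    # Sketch: include maxlag=5 on every request; if the servers are lagged they
    # refuse the request and send a Retry-After header saying how long to wait.
    sub api_get {
        my (%params) = @_;
        my $ua  = LWP::UserAgent->new(agent => 'ExampleBot/0.1 (sketch)');
        my $uri = URI->new('https://en.wikipedia.org/w/api.php');

        while (1) {
            $uri->query_form(%params, format => 'json', maxlag => 5);
            my $res   = $ua->get($uri);
            my $retry = $res->header('Retry-After');
            return $res->decoded_content unless $retry;
            sleep $retry;    # lagged: back off and try again
        }
    }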

{{BAGAssistanceNeeded}}

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. BJTalk 19:16, 26 August 2008 (UTC)[reply]

Trial complete. All the edits are contiguous in the bot's contribs.[1] Anomie 21:39, 26 August 2008 (UTC)[reply]

{{BAGAssistanceNeeded}}

Haven't reviewed all the edits, but what I see looks good. I do note a couple of instances where the bot ended up fixing refs broken due to vandalism; I'm not sure offhand what, if anything, could or should be done about that, other than maybe adding a separate delay (besides the existing 5-minute delay after any edit) so the bot does not edit any page until at least 24(?) hours after the (latest) edit that actually broke the refs. —Ilmari Karonen (talk) 17:52, 31 August 2008 (UTC)[reply]
That's not a bad idea, but then of course it won't fix new legitimate breakages for that same period of time. What is the normal turnaround time for recent changes vandal patrollers fixing something like that? Anomie 19:03, 31 August 2008 (UTC)[reply]
Hard to say, but I'd guess it varies a lot. Most simple page-blanking vandalism is caught and reverted in minutes if not seconds by bots and human RC patrollers. The rest is likely to linger until someone spots it on their watchlist (hours to days), or, in the worst case, just happens to stumble across that page and notices something funny (weeks to years). So it all boils down to just picking a cutoff to minimize the risks in either direction; I suggested 24 hours since I'd guess that'd probably be enough for most well-watched pages, but certainly one could argue for changing it either way.
I do know that some interwiki bots tend to be a bit too hasty with this, not infrequently "fixing" partially blanked articles by restoring the IW links. Even here I have no statistics, just some personal experience that has led me to double-check any bot edits on my watchlist that claim to add more than one new interwiki link. —Ilmari Karonen (talk) 19:43, 31 August 2008 (UTC)[reply]
I also check all edits on my watchlist, bot or not, just in case some vandalism was hidden by a later bot edit. Anyway, I added the necessary code to check this once the proper timeframe is determined. Anomie 19:46, 31 August 2008 (UTC)[reply]
Looks good, except... if I'm reading the code right, shouldn't it really be checking the timestamp of the revision after the one in which the ref was found? —Ilmari Karonen (talk) 19:54, 31 August 2008 (UTC)[reply]
D'oh. Yes, you're right. I neglected to take into account the possibility that edits can have long gaps between them. Thanks! New code is uploading now, check in a minute or two. Anomie 20:13, 31 August 2008 (UTC)[reply]
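In other words, the ref content was last present in some revision R, so the edit that actually broke it is the one immediately after R, and that is the timestamp to measure the delay from. A hypothetical sketch of that check (the function and field names are illustrative; the timestamp format is the one returned by the MediaWiki API):

    use Time::Local qw(timegm);

    # Sketch: $next_rev is the revision immediately after the last one that still
    # contained the ref content, i.e. the edit that broke it.
    sub breakage_old_enough {
        my ($next_rev, $delay_seconds) = @_;
        # Parse a MediaWiki-style timestamp like "2008-08-31T19:54:00Z".
        my ($y, $mo, $d, $h, $mi, $s) =
            $next_rev->{timestamp} =~ /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$/;
        my $broke_at = timegm($s, $mi, $h, $d, $mo - 1, $y);
        return (time() - $broke_at) >= $delay_seconds;
    }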

I'd say five minutes; ClueBot triggers within 10 seconds, and Hugglers usually within a minute. BJTalk 12:26, 1 September 2008 (UTC)[reply]

Sounds good, that's what I already have it set at. Unless anyone has other comments, I'm just awaiting the final approval (or rejection). Anomie 16:46, 1 September 2008 (UTC)[reply]


The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.