The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was  Approved.

Operator: Green Cardamom (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 15:49, Friday, June 17, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Nim and AWK

Source code available: WaybackMedic on GitHub

Function overview: User:Green Cardamom/WaybackMedic 2

Links to relevant discussions (where appropriate): Wikipedia:Bots/Requests for approval/GreenC bot - first revision, approved and successfully completed.

Edit period(s): one time run

Estimated number of pages affected: ~380,000 pages have Wayback links as of July 20, 2016

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: The bot is nearly the same as the first bot (User:Green Cardamom/WaybackMedic), with these differences:

  1. In fix #2, instead of making changes only when other changes are made, it now always makes them. For example, it will convert all web.archive.org http links to secure https even if that is the only change. This modification amounts to commenting out the skindeep() function, so it requires no new code.
  2. The first bot was limited in scope to articles previously edited by Cyberbot II. This bot will look at all articles on the English Wikipedia containing Wayback Machine links, somewhere around 380k. The bot determines target articles by regexing a Wikipedia database dump prior to the run. (A sketch of both mechanisms follows the list.)
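
A minimal Nim sketch of the two mechanisms above, with illustrative proc names; this is not WaybackMedic's actual source:

 import std/re
 
 proc toSecureWayback(link: string): string =
   ## Fix #2: rewrite http://web.archive.org/... links to https.
   link.replace(re"^http://web\.archive\.org/", "https://web.archive.org/")
 
 proc hasWaybackLink(wikitext: string): bool =
   ## Dump scan: does this article contain a Wayback Machine link?
   ## Used to build the ~380k target list before the run.
   wikitext.contains(re"https?://web\.archive\.org/")
 
 when isMainModule:
   doAssert toSecureWayback("http://web.archive.org/web/2016/http://x.com") ==
     "https://web.archive.org/web/2016/http://x.com"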

Most of the edits will be URL formatting fix #2. Fix #4 will impact somewhere around 5% of the links (based on stats from the first run of WaybackMedic). The rest of the fixes should be minimal, affecting 1% of links or less.

Discussion

Nothing in the code changed to widen the scope of the task other than what is explained in bullet #1 above. -- GreenC 01:19, 18 June 2016 (UTC)[reply]

Approved for trial (250 edits or 15 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — xaosflux Talk 03:25, 25 June 2016 (UTC)[reply]

Trial 1


WM will process in batches of 100 articles each, but some articles may not need changes, so the number of edits will vary within each batch.

@Green Cardamom: In this edit why was the content removed? — xaosflux Talk 02:52, 8 July 2016 (UTC)[reply]
It appears the original URL is working; it's possible that's why. ~ Rob13Talk 04:26, 8 July 2016 (UTC)[reply]
Am I missing something - that condition doesn't appear to be on this list. — xaosflux Talk 04:53, 8 July 2016 (UTC)[reply]
This is fix #4 on that list. If an archive URL is not working, it tries to find a working snapshot date; if it can't find one, the archive is removed, as was done here. In this case, since the original URL is still working, it didn't leave a ((dead)). However, there is a problem -- the archive URL is working. The bot keeps logs, so I checked the JSON returned by the Wayback API, which shows the URL was not available at Wayback. But the bot also does a header check to verify, since the API is sometimes wrong. The header check also returned unavailable (at the time it ran). I just re-ran a dry run and it came back as link available, so the problem doesn't appear to be with the bot. If I had to guess, it's robots.txt, as that is the most common reason links come and go from Wayback. robots.txt files are controlled by the owners of the website. -- GreenC 13:14, 8 July 2016 (UTC)[reply]
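For reference, a minimal sketch of the two-step check described above: query the Wayback availability API, then verify with a header (HEAD) request. Names and details here are illustrative assumptions, not the bot's actual code:

 import std/[httpclient, json, uri]
 
 proc snapshotAvailable(client: HttpClient, url: string): bool =
   # Step 1: the Wayback availability API (a public archive.org endpoint).
   let resp = client.getContent(
     "https://archive.org/wayback/available?url=" & encodeUrl(url))
   let closest = parseJson(resp){"archived_snapshots", "closest"}
   if closest.isNil or not closest{"available"}.getBool(false):
     return false
   # Step 2: header check on the snapshot URL itself, since the API
   # is sometimes wrong.
   let head = client.request(closest{"url"}.getStr(), httpMethod = HttpHead)
   result = head.code.is2xx or head.code.is3xx
 
 when isMainModule:
   # Compile with -d:ssl for https support.
   let client = newHttpClient(timeout = 30_000)
   echo snapshotAvailable(client, "http://example.com")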
@Xaosflux: - It's removing a non-working archive link. Any human editor would do the same. One might make the case that if it's non-working due to robots.txt on Internet Archive, it could start working in the future; however, until then (if ever) we have a broken archive link, which is the point of the bot to fix. One could "preserve" the broken link in a comment or on the talk page, but what's the point? Anyone can check IA using the original URL; there's no information needing preservation. It's better to remove broken links where they exist and let bots like IABot (and other tools) re-add them if they become available, as normal. BTW, I've already done this for tens of thousands of articles without any complaints or concerns, including during the last Bot Request. -- GreenC 14:54, 18 August 2016 (UTC)[reply]

Trial 2

Approved for extended trial (5000 article targets). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Note: this is for targeting 5000 articles only, with between 0 and 5000 edits as appropriate for those targets. This should be the final trial round; it is about 1% of the estimated targets. — xaosflux Talk 15:06, 18 August 2016 (UTC)[reply]

@Xaosflux: Sorry, if you don't mind me asking: what is the rationale for a second trial? The bot has already been tested extensively on 100,000 articles. The point of the request was to extend the number of articles to the whole site, plus some minor changes which are working. -- GreenC 15:18, 18 August 2016 (UTC)[reply]

As your final run is so large, I just want one last checkpoint. You do not need to personally evaluate it: if you can run 5000 targets, just give a count of how many updates were needed. If there are no complaints in a week, I think you are good to go. — xaosflux Talk 15:41, 18 August 2016 (UTC)[reply]
Ok no problem. -- GreenC 17:40, 18 August 2016 (UTC)[reply]
@Green Cardamom: Out of curiosity, what happens if the Wayback Machine goes down for some reason while the bot is running? Would the bot begin to remove every archive link as non-working, or is there some check to prevent this from happening? ~ Rob13Talk 18:04, 18 August 2016 (UTC)[reply]
It handles that a number of ways. I can describe the details if you're interested. -- GreenC 18:58, 18 August 2016 (UTC)[reply]
As long as it fails gracefully in this scenario, I don't need to hear details. Just don't want a server outage to result in the bot going wild. ~ Rob13Talk 03:01, 20 August 2016 (UTC)[reply]
It's a good question. The design philosophy is to sanity-check data and, on failure, skip and log. Errors end up in logs, not in Wikipedia. Critical failures at the network level (such as timeouts or the Wayback API not responding, which happens) get logged, and the articles are reprocessed in a future batch. When processing the first 140k articles for WaybackMedic #1 it never went wild, even during API outages. -- GreenC 12:40, 20 August 2016 (UTC)[reply]
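A rough sketch of that sanity-check, skip-and-log pattern, with stand-in procs; this illustrates the philosophy only, not the bot's internals:

 import std/logging
 
 var retryQueue: seq[string]        # articles to reprocess in a future batch
 let logger = newConsoleLogger()
 
 proc fetchSnapshotData(title: string): string =
   ## Stand-in for the Wayback API call; raises on timeout or outage.
   raise newException(IOError, "Wayback API not responding")
 
 proc processArticle(title: string) =
   try:
     let data = fetchSnapshotData(title)
     if data.len == 0:              # sanity check before touching the article
       logger.log(lvlWarn, "bad API data, skipping: " & title)
       return
     # ... apply fixes and save the article here ...
   except IOError as e:
     # Network-level failure: log it and queue the article for a future
     # batch, so an API outage never turns into mass link removal.
     logger.log(lvlError, title & ": " & e.msg)
     retryQueue.add(title)
 
 when isMainModule:
   processArticle("Example article")
   echo "to reprocess later: ", retryQueue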

If I may suggest an additional feature for future runs: there may be articles in which |archiveurl= has a functioning WBM link but |archivedate= is empty or missing. It would be nice if this bot could fix this issue by extracting the archive date information from the WBM URL. --bender235 (talk) 19:19, 22 August 2016 (UTC)[reply]

Ok, this is now part of Fix #3. I'm hesitant to do major feature additions this late in the BRFA, but this is simple to check and fix. It will also log. I'll keep an eye on it in the first batch; manual testing shows no problems. I don't think there will be too many, since the CS templates generate a red error and such cases likely get fixed manually after a while. -- GreenC 21:40, 22 August 2016 (UTC)[reply]
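For illustration, a sketch of recovering the date from the 14-digit timestamp (YYYYMMDDHHMMSS) embedded in Wayback URLs; the proc name is hypothetical, not the bot's:

 import std/[re, strutils]
 
 const months = ["January", "February", "March", "April", "May", "June",
                 "July", "August", "September", "October", "November", "December"]
 
 proc archiveDateFromUrl(archiveurl: string): string =
   ## "https://web.archive.org/web/20160822123456/..." -> "August 22, 2016"
   ## Returns "" if no 14-digit timestamp is present.
   var m: array[1, string]
   if archiveurl.find(re"/web/(\d{14})", m) >= 0:
     let y = m[0][0..3]
     let mo = m[0][4..5].parseInt
     let d = m[0][6..7].parseInt
     result = months[mo - 1] & " " & $d & ", " & y
 
 when isMainModule:
   doAssert archiveDateFromUrl(
     "https://web.archive.org/web/20160822123456/http://example.com") ==
       "August 22, 2016"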
Thanks. I just realized there could be another issue (at least theoretically; I didn't check whether we have such cases): there might be articles in which the information in |archivedate= contradicts the actual date from the WBM URL (imagine, for instance, an editor who thought he should put the date when he added the archive link to Wikipedia rather than when WBM archived the page; or, even simpler, typos). Cases like these could be corrected based on the WBM URL's archive-date information. --bender235 (talk) 00:43, 23 August 2016 (UTC)[reply]
Alright, it will now verify that |archivedate= matches the date in the Wayback URL and, if not, change |archivedate=. There is one exception: if |archivedate= is in dmy format and the page has neither ((use dmy dates)) nor ((use mdy dates)), it will leave it alone. The reason is that editors often forget to use the dmy template, and I don't want the bot to undo proper formatting (the bot defaults to mdy). Note: this was not a big change to the code. I've tested every combination I can think of on a test case, every change is being logged, and when the first batch runs I'll keep a tight eye on it. I don't think it will be a very common occurrence. -- GreenC 14:37, 23 August 2016 (UTC)[reply]
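A sketch of that decision logic under stated assumptions: urlDate is the mdy date recovered from the Wayback URL (as in the sketch above), the template test is a simple substring check, and all names are illustrative:

 import std/[strutils, times]
 
 proc normalized(date: string): string =
   ## Parse "22 August 2016" (dmy) or "August 22, 2016" (mdy);
   ## returns "" if neither format matches.
   for fmt in ["d MMMM yyyy", "MMMM d, yyyy"]:
     try:
       return parse(date, fmt).format("yyyy-MM-dd")
     except TimeParseError:
       discard
 
 proc shouldRewrite(pageText, archivedate, urlDate: string): bool =
   if normalized(archivedate) == normalized(urlDate):
     return false                   # dates already agree; nothing to do
   # Exception: a dmy |archivedate= on a page with neither date template
   # is left alone, so the bot (which defaults to mdy) doesn't undo
   # deliberate dmy formatting.
   let lower = pageText.toLowerAscii
   if archivedate.len > 0 and archivedate[0] in {'0'..'9'} and
       "{{use dmy dates" notin lower and "{{use mdy dates" notin lower:
     return false
   result = true
 
 when isMainModule:
   doAssert shouldRewrite("{{use mdy dates}} ...", "August 1, 2016",
     "August 22, 2016")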
I just ran the bot on the 5500 articles of the trial using only the two new features added above (other features disabled). It found about 400 article changes. I checked them manually and saw no problems. That was a good suggestion, bender235; there are a fair number of problems. They were all in Fix #8, none in Fix #3. -- GreenC 19:57, 24 August 2016 (UTC)[reply]
You're welcome. I hope this bot gets final approval soon. --bender235 (talk) 19:07, 25 August 2016 (UTC)[reply]
 Approved. Task approved. — xaosflux Talk 13:13, 4 September 2016 (UTC)[reply]
The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.