The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Request Expired.

Operator: Josh Parris

Time filed: 23:47, Friday December 6, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): Python, wikitools

Source code available: Sure, prompt me

Function overview: Replace broken urls to *.thecanadianencyclopedia.com with working ones to thecanadianencyclopedia.ca

Links to relevant discussions (where appropriate): Wikipedia:Bot requests/Archive 57#URL updates for The Canadian Encyclopedia

Edit period(s): one run

Estimated number of pages affected: ~3468

Exclusion compliant (Yes/No): No, one-time run

Already has a bot flag (Yes/No): Yes

Function details: Swap broken URLs for tested good ones. I have assembled a mapping of URL updates for The Canadian Encyclopedia based on Wayback Machine lookups of all external URLs that match *.thecanadianencyclopedia.com, and used that to generate and test combinations of URLs against thecanadianencyclopedia.ca until I got a 200 success. Links to the home page of the site will be stripped. URLs where I couldn't get a successful hit will be left unchanged. Variations on the ((dead link)) templates are added to or removed from the article to reflect the status of external links; they're only removed for thecanadianencyclopedia. The work parameter of the various cite templates is altered to change hyperlinks into domains.
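
As an illustration of the substitution step described in the function details, here is a minimal sketch in Python using only the standard library. It is not the bot's actual source: the regular expression, the function name swap_urls, and the mapping entry are all assumptions made for the example, and the new-site path shown is a placeholder rather than a real address.

```python
import re

# Matches external links on any subdomain of the old site (assumed pattern).
OLD_URL = re.compile(
    r'https?://(?:[\w-]+\.)*thecanadianencyclopedia\.com[^\s\]|}<>"]*',
    re.IGNORECASE,
)

def swap_urls(wikitext, url_map):
    """Replace each old URL that has a tested mapping; leave the rest unchanged."""
    return OLD_URL.sub(lambda m: url_map.get(m.group(0), m.group(0)), wikitext)

# Hypothetical mapping entry, for illustration only.
url_map = {
    "http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=EXAMPLE":
        "http://www.thecanadianencyclopedia.ca/example-article",
}
text = (
    "[http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=EXAMPLE Example] "
    "{{cite web|url=http://thecanadianencyclopedia.com/unresolved|title=T}}"
)
print(swap_urls(text, url_map))
```

URLs without an entry in the mapping pass through untouched, which mirrors the stated rule that unresolved links are left unchanged.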

Discussion

Approved for trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. —  HELLKNOWZ  ▎TALK 22:30, 11 December 2013 (UTC)

Trial complete.
10 trial edits are in. I made a stupid error: my HTML comment was not an actual comment. I fixed all the edits.
More interesting errors:
  • [1] "corrected" to a 404. The substitution did what it was told. The translation list was mis-populated because of a parsing error on http://web.archive.org/web/20110929060526/http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=U1ARTU0000203 where the "writing" section was determined to be the article title. An inspection of the translation list shows this has not occurred on any other occasion, nor has "author" nor "bibliography" been the target of any translations. It was just good luck catching this one. Fixed on a couple of levels - tighter regex matching, plus those headings have been added to the blacklist.
  • [2] shows a replacement within a ref tag where the link is followed by a ((dead link|date=December 2013)). Ought I be removing these ((dead link))s? Josh Parris 02:44, 12 December 2013 (UTC)
  • Do I get this right -- you are comparing the actual page content on wayback archived version to find matches? Or title?
  • Yes, you should remove ((dead link))s after the citation or reference tag if you fix them. —  HELLKNOWZ  ▎TALK 13:16, 13 December 2013 (UTC)
    The technique I'm using for translating from the old URL to the new one is:
    1. Check for a 302 redirect sometime in 2012. The redirect will be to a URL similar to what's used now, with quite a few variations - a trailing slash may or may not be required, the order of words may have changed, parts of the path might have been moved around.
    2. Failing that, the Wayback Machine's copy will have an article title, which might be transformable in various ways into the corresponding URL in the new website
    All I'm doing is checking for a 200 status code to confirm a match - do you think I ought to be doing something less naive?
    I'll get onto removing deadlink tags; it might be easy, or perhaps not. Josh Parris 11:48, 14 December 2013 (UTC)
I think that's good enough -- I don't think there would be obvious false positives, especially if you use their own 302s. —  HELLKNOWZ  ▎TALK 12:03, 14 December 2013 (UTC)
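
The two-step lookup discussed above (prefer the site's own archived 302 target, otherwise derive candidates from the archived article title, and accept the first candidate that returns 200) could be sketched roughly as follows. This is a sketch under stated assumptions, not the operator's code: NEW_BASE, the "/en/article/" path layout, the slug rule, and the function names are hypothetical, and the status lookup is passed in as a callable so the example runs without network access.

```python
import re

NEW_BASE = "http://www.thecanadianencyclopedia.ca"  # assumed base URL

def candidates_from_title(title):
    """Build hypothetical new-site URLs from an archived article title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    path = f"{NEW_BASE}/en/article/{slug}"  # path layout is an assumption
    return [path, path + "/"]  # a trailing slash may or may not be required

def resolve(archived_redirect, archived_title, status_of):
    """Return the first candidate URL answering 200, or None.

    archived_redirect: target of a 302 recorded by the Wayback Machine, or None
    archived_title:    article title taken from the archived copy, or None
    status_of:         callable mapping a URL to its HTTP status code
    """
    candidates = []
    if archived_redirect:
        stripped = archived_redirect.rstrip("/")
        candidates += [stripped, stripped + "/"]
    if archived_title:
        candidates += candidates_from_title(archived_title)
    for url in candidates:
        if status_of(url) == 200:
            return url
    return None

# Usage with a stand-in for real HTTP requests.
live = {NEW_BASE + "/en/article/example-article/": 200}
print(resolve(None, "Example Article", lambda u: live.get(u, 404)))
```

Injecting status_of keeps the candidate generation separate from the network layer, so the translation list can be regenerated or re-tested without changing the matching logic.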

Approved for extended trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. —  HELLKNOWZ  ▎TALK 12:03, 14 December 2013 (UTC)

Trial complete. Wow, that expanded the source code dramatically. I selected ten articles that had ((dead link)) and canadianencyclopedia urls. Performed 10 trial edits, highlights include:
  • [3] Shows the ((dead link)) template being removed for the repaired link
  • [4] Shows how subst doesn't work for bots (fixed)
  • [5] demonstrates removing a URL from a cite template's work parameter
So, it seems all went well. Josh Parris 06:55, 17 December 2013 (UTC)
Okay, these look good, but that's quite a range of functionality. —  HELLKNOWZ  ▎TALK 13:47, 18 December 2013 (UTC)
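
For readers wondering what removing the dead link tag for a repaired link might involve, here is a minimal sketch. It assumes the real wikitext form of the template (double curly braces) and handles only the simple cases of a bracketed link or a cite template, optionally followed by a closing ref tag; the function name and the regular expression are illustrative, not the bot's actual code.

```python
import re

def drop_dead_link_after(wikitext, fixed_url):
    """Remove a dead link template that directly follows the link using fixed_url."""
    pattern = re.compile(
        "("
        + re.escape(fixed_url)
        + r"(?=[\s|\]}])"             # the URL must end here, not merely be a prefix
        + r"[^\[\]{}<]*(?:\]|\}\})"   # rest of the bracketed link or cite template
        + r"(?:\s*</ref>)?"           # tag may sit after the closing ref
        + ")"
        + r"\s*\{\{\s*(?i:dead link)\s*(?:\|[^{}]*)?\}\}"
    )
    return pattern.sub(lambda m: m.group(1), wikitext)

before = "<ref>[http://new.example/a Title]{{dead link|date=December 2013}}</ref>"
print(drop_dead_link_after(before, "http://new.example/a"))
```

Dead link tags attached to other URLs are left alone, in line with the rule that the templates are only removed for links the bot has actually repaired.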

Approved for extended trial (20 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. More trial since the addition of code and just a larger sample. —  HELLKNOWZ  ▎TALK 13:47, 18 December 2013 (UTC)

Trial complete, with results here. Points of note:
  • [6] has the bot swapping out a dead url for text, which would be fine except this is a url= field. I've removed this functionality from the bot and will leave it to humans to clean up these urls. But [7] shows I removed it wrong; I should have detected those URLs and done nothing, rather than treating them as any other URL. Fixed.
  • [8] has the bot making supplemental fixes but not the main fix of swapping dead urls. This was due to a logic bug in the code to detect null edits - fixed.
I stand ready for another trial. Josh Parris 00:59, 19 December 2013 (UTC)
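
The null-edit problem mentioned above (supplemental fixes saved without the main URL swap) suggests gating the save on the main fix. The following is only a sketch of that idea with hypothetical names; it does not reproduce the bot's actual logic or the specific bug that was fixed.

```python
def build_edit(original, main_fix, supplemental_fixes):
    """Return the new page text, or None when the main fix changes nothing.

    main_fix and each supplemental fix are callables taking and returning text.
    """
    after_main = main_fix(original)
    if after_main == original:
        # No URL was repaired, so skip the page rather than save
        # an edit that contains only supplemental changes.
        return None
    result = after_main
    for fix in supplemental_fixes:
        result = fix(result)
    return result

# Usage with stand-in fixes.
swap = lambda t: t.replace("old-url", "new-url")
tidy = lambda t: t.replace("  ", " ")
print(build_edit("see  old-url", swap, [tidy]))
```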

Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Without removing external links from |work=. —  HELLKNOWZ  ▎TALK 10:28, 19 December 2013 (UTC)

Functionality altered to reflect this. Josh Parris 10:40, 19 December 2013 (UTC)
Trial complete after 50 edits. Every edit seems fine.
I did get a scare from [9], but looking at http://web.archive.org/web/20120315000000*/http://www.thecanadianencyclopedia.com/index.cfm?PgNm=TCE&Params=U1ARTU0002865 I'm reassured that the bot isn't at fault. Josh Parris 11:27, 19 December 2013 (UTC)

Approved for extended trial (20 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. —  HELLKNOWZ  ▎TALK 10:48, 27 December 2013 (UTC)

Sorry for the delay; flaky Internet. Trial complete. After 20 edits, every edit seems fine. Josh Parris 09:17, 28 December 2013 (UTC)
I've got solid Internet under my feet now, so ((BAGAssistanceNeeded)) Josh Parris 20:39, 5 January 2014 (UTC)

Note that I haven't (yet) gone through previous trials link by link. —  HELLKNOWZ  ▎TALK 21:59, 5 January 2014 (UTC)

If only there was an exasperated sigh template I could invoke here.
The Ben Johnson (and Kurdish protest) edits show the "check for a 200 status" rule isn't adequate. I'll work up something more robust in the face of this.
The Eva Rose York edit is actually fine.
The The Queensway – Humber Bay edit is particularly galling, as running the list generator against the page today pulls up the 404 and can't resolve it, but going to the URL in the article redirects to a valid article. The site operator has not only 404'd their old URL, they've made the older one work by redirecting it to their new one. I'm going to have to throw away my old translation list and regenerate it.
I'll ping back once I've made the necessary code changes. Expect a two-week delay. Josh Parris 21:40, 6 January 2014 (UTC)
That fix was easier than I thought.
It seems something similar to the mcleans thing happened with French articles, so I already had code to simply strip it out.
I've coded up a fix to the 404.
I'm going to review all the edits the bot made since the start of time and confirm they correlate to what the bot would now do, and repair anything that's wrong. Josh Parris 14:49, 8 January 2014 (UTC)
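
Since a bare 200 status turned out to be insufficient, one possible stricter check (not necessarily the fix the operator coded) is to also compare the live page's title with the title from the archived copy. The sketch below assumes both titles have already been extracted; the function names and the example page-title format are hypothetical, and the substring comparison is deliberately crude.

```python
import re

def normalise(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def confirmed_match(status, page_title, archived_title):
    """Accept a candidate only on a 200 whose page title contains the archived title."""
    if status != 200 or not page_title or not archived_title:
        return False
    return normalise(archived_title) in normalise(page_title)

# Usage with made-up page titles.
print(confirmed_match(200, "Example Article - The Canadian Encyclopedia", "Example Article"))
print(confirmed_match(200, "Page Not Found", "Example Article"))
```

A check like this would reject a page that answers 200 but shows an error or unrelated content, at the cost of missing articles whose titles were reworded on the new site.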

A user has requested the attention of the operator. Once the operator has seen this message and replied, please deactivate this tag. (user notified) Anything new about this task? 46.107.88.236 (talk) 16:45, 24 January 2014 (UTC)

@Josh Parris: Any progress? (t) Josve05a (c) 13:12, 2 April 2014 (UTC)

Request Expired. --slakrtalk / 07:03, 12 April 2014 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.