The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator: Anomie

Automatic or Manually assisted: Automatic, unsupervised

Programming language(s): Perl

Source code available: User:AnomieBOT/source/tasks/ReplaceExternalLinks2.pm

Function overview: Clean up oocities.com (and other GeoCities-mirror) spam and add archive.org links for geocities.com.

Links to relevant discussions (where appropriate): User talk:AnomieBOT/Archive 3#Bot assist?, User talk:Xeno#Bot question (perm), WT:WikiProject External links/Geocities#New script; bot

Edit period(s): As needed

Estimated number of pages affected: 5658 articles need removal of oocities.com spam. Another 9484 articles have links to geocities.com.

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N): Yes

Function details: The bot will scan all mainspace pages linking to oocities.com and/or geocities.com, and/or any other geocities mirrors that are discovered. On each page, it will do the following:

  1. Globally replace ".oocities.com" and "/oocities.com" (and the like for any other mirror domain) with ".geocities.com" and "/geocities.com".
  2. Check citation templates for instances with url being a geocities.com link and no archiveurl. For each found, it will query the "most recent" archive from archive.org and fill in archiveurl and archivedate. If archive.org returns a 404, it will instead add ((dead link|date=...|bot=AnomieBOT)) after the citation template. If archive.org returns any other failure code, it will retry a few times before treating it as a 404.
  3. Check for geocities.com external links (bracketed or bare) where the text "archive" or "Archive" or "webcitation.org" does not appear in the rest of the line of wikitext (this catches cases where something like this was done), and do the same replacement of the URL or addition of ((dead link)).
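The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the bot's actual Perl code (which lives in ReplaceExternalLinks2.pm); the mirror list and the fetch callback are hypothetical stand-ins.

```python
# Illustrative sketch of the task's logic; the real bot is written in Perl.
# MIRRORS is a hypothetical list of spammed GeoCities-mirror domains.
MIRRORS = ["oocities.com"]

def normalize_mirrors(wikitext):
    """Step 1: replace ".mirror.com" and "/mirror.com" forms with
    the corresponding geocities.com forms."""
    for mirror in MIRRORS:
        wikitext = wikitext.replace("." + mirror, ".geocities.com")
        wikitext = wikitext.replace("/" + mirror, "/geocities.com")
    return wikitext

def resolve_archive(url, fetch, retries=3):
    """Steps 2-3: ask the Wayback Machine for the most recent snapshot.

    `fetch` is a caller-supplied callable returning (status, final_url)
    after following redirects. A 404 means the caller should tag the
    link with ((dead link)); other failure codes are retried a few
    times before being treated as a 404, per the task description.
    """
    for _ in range(retries):
        status, final_url = fetch("http://web.archive.org/" + url)
        if status == 200:
            return ("archive", final_url)
        if status == 404:
            break
    return ("dead", None)
```

In use, the caller would plug an HTTP client into `fetch`, substitute the snapshot URL into |archiveurl=, and append ((dead link|date=...|bot=AnomieBOT)) when `"dead"` comes back.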

Discussion

I note in [1] someone requested that the article GeoCities not be processed. It turns out that there is nothing for the bot to do on that page anyway. Anomie 17:02, 21 September 2010 (UTC)

Why the most recent archive? Should it not be the one closest to the original access date? Using the latest snapshot assumes the content never changed, an assumption a bot should not make.
Examples such as [2] or [3] are also archived (if poorly) copies, which may be found in the remainder of the text after the external link. The user may also have used ((Wayback)).
Also, you do not mention removing ((dead link)) if you have added |archiveurl=. Do note that some users place it outside the reference: <ref>((cite web |url=googla.com |title=googla))</ref>((dead link)) —  HELLKNOWZ  ▎TALK 17:20, 21 September 2010 (UTC)
It should be, but it seems that in order to get the list of all archived versions from archive.org you have to screen-scrape; you can get the "most recent" archive by requesting a URL like http://web.archive.org/page and seeing where you get redirected.
I'm not too worried about other archival services, as the majority of archive links for non-citation links were probably done by WebCiteBot. Similarly, I wonder whether many users have bothered to ((dead link)) existing geocities links. But I'll see what I can come up with. Anomie 19:48, 21 September 2010 (UTC)
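The redirect trick described above works because the snapshot URL that web.archive.org redirects to embeds a YYYYMMDDhhmmss timestamp, which can supply the |archivedate= value. A small Python sketch of extracting it (the helper name is hypothetical):

```python
import re

def archive_date_from_url(archive_url):
    """Extract an ISO date from a Wayback snapshot URL such as
    http://web.archive.org/web/20091027123456/http://geocities.com/x .
    Returns None if the URL carries no /web/<timestamp>/ component."""
    m = re.search(r"/web/(\d{4})(\d{2})(\d{2})", archive_url)
    if not m:
        return None
    return "-".join(m.groups())
```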
Yes, it does need scraping to get all links. But the bot's assumption that the last version is the correct one is still invalid. There is a reason for having access dates. I suppose we could treat GeoCities as a special case, but I wouldn't know why; GeoCities wasn't really known for its stability or permanent content.
This issue is of course irrelevant to Webcite links.
I don't think too many editors bother with ((dead link))s, usually just bots do. —  HELLKNOWZ  ▎TALK 20:32, 21 September 2010 (UTC)
I think between having a dead link and having a link to the most recent archive, the latter is preferable. Keep in mind that this is the first step in a gargantuan effort to review the many links to Geocities with an eye to removing those that are inappropriate - so the reviewers can change to an earlier archive if they feel it is necessary. –xenotalk 12:41, 22 September 2010 (UTC)
A dead link and the most recent archive are not the only two choices. I am unsure if I have given an "all or nothing" impression, but it is quite feasible to parse the revision history and find, with little error, when the link was actually added. This was done before in DASHBot and in H3llBot. I brought up the issue of bots and access dates at VP before (VP 1, VP 2), unfortunately without garnering broader attention due to, I suspect, a lack of any hard evidence/statistics.
I am all for archiving GeoCities, I am merely pointing out previously encountered/considered issues. —  HELLKNOWZ  ▎TALK 13:56, 22 September 2010 (UTC)
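The revision-parsing approach mentioned above (as used by DASHBot and H3llBot) amounts to walking the page history oldest-first and taking the timestamp of the first revision that contains the link. A simplified sketch, assuming revisions are (timestamp, wikitext) pairs already fetched from the API:

```python
def infer_access_date(revisions, url):
    """Return the timestamp of the earliest revision containing `url`,
    or None if the link never appears. The caveats discussed here
    (links removed by vandalism, text split/copied from other
    articles) are deliberately ignored in this sketch."""
    for timestamp, wikitext in sorted(revisions):
        if url in wikitext:
            return timestamp
    return None
```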
With oocities.com on the spam blacklist, this bot job should be approved and finished as quickly as possible: it's hindering editors, and a typical response is removing/disabling/hiding the offending references (e.g. here at ANI). If the details of this BOTREQ need further discussion, at least the part of turning them back into geocities links should be approved and done quickly (if need be as a mass-undo or another AWB job). Amalthea 12:36, 23 September 2010 (UTC)
We could use a mass rollback script with an edit summary changer and &bot=1 to roll back the top edits in the interim? –xenotalk 12:45, 23 September 2010 (UTC)
Fine by me. We'd probably lose a few genfixes, but that's not worth any concerns. Amalthea 12:53, 23 September 2010 (UTC)[reply]
[4]. –xenotalk 15:52, 23 September 2010 (UTC)[reply]
1395 reverts. As far as I can tell, Updatehelper did not run with general fixes on in the first place. –xenotalk 18:37, 24 September 2010 (UTC)[reply]
The API tells me there are still 3949 links to oocities.com, and checking the first few bears this out. So I'll still have the bot do the oocities replacement (if this BRFA ever goes anywhere). Anomie 20:28, 24 September 2010 (UTC)
Yes, the reverts only got those that were still (top) edits. –xenotalk 20:16, 25 September 2010 (UTC)

H3llkn0wz's suggestions should be implemented now, and it will check for archives on webcite too. I'll update the source once I finish testing to make sure there aren't any obvious bugs. Anomie 17:43, 23 September 2010 (UTC)

If you are checking revisions for the access date, do note that links may have been removed temporarily due to vandalism, and that a link's appearance in the first few edits is likely to be a split/copy from other articles. I am personally OK with an occasional error; the benefits should really outweigh the uncommon slight inaccuracies. Have fun coding! :) —  HELLKNOWZ  ▎TALK 18:34, 23 September 2010 (UTC)
Not checking revisions, just citation templates. Anomie 19:11, 23 September 2010 (UTC)
Also checking for if someone manually enters "Retrieved <date>" or the like inside a <ref></ref>. Anomie 20:28, 24 September 2010 (UTC)
There should be a better edit summary and explanation when the bot does more work. The trial seemed to use one of these two edit summaries:
  1. Reverting oocities.com spam. Errors? User:AnomieBOT/shutoff/ReplaceExternalLinks2
  2. Reverting oocities.com spam and changing archived geocities links. Errors? User:AnomieBOT/shutoff/ReplaceExternalLinks2
Neither of these is helpful for the many editors who are going to be puzzled by the bot's activity. The link should point to a page with a brief but helpful explanation, with links to the discussions and to the shutoff page. Otherwise editors will just revert the bot (and if that doesn't work due to the blacklist, they will be very irritated), and waste time wondering what's going on. Johnuniq (talk) 00:45, 26 September 2010 (UTC)
There are 4 possible clauses that can appear in the edit summaries, depending on just what the bot did to the page: "reverting oocities.com spam", "adding archiveurl for archived geocities cites", "changing archived geocities links", and "tagging dead geocities links". I'm open to suggestions on what a good link might be, although I'm not particularly clear on how anyone would be confused by any of that or why they would want to revert it (unless they happen to be involved in oocities, anyway). Anomie 20:52, 26 September 2010 (UTC)
I note that it might make the work of the Wikipedia:WikiProject External links/Geocities people actually harder if they can't easily get at the links with the external links search. Should an alternative to changing plain links to straight archive links be considered? A simple template maybe that links both? Amalthea 21:11, 26 September 2010 (UTC)
I would not be opposed to such a template, and the bot could easily be adjusted to use it. At the simplest, a template such as ((geocities link|url=...)) could just output the passed url. OTOH, there already seems to be a full list at Wikipedia:WikiProject Spam/LinkSearch/geocities.com/All. Anomie 21:26, 26 September 2010 (UTC)
Ah, that's just as well. If the oocities links are already on there (or being added) then my comment is moot. Thanks, Amalthea 21:37, 26 September 2010 (UTC)
Hmm, that's a good point, I don't know whether it includes oocities links or not. OTOH, by combining xeno's revert list, AnomieBOT's 51 edits, and the remaining links to oocities, we can easily enough create a supplementary list: [6] Anomie 21:56, 26 September 2010 (UTC)

I've added to the proposed task that the bot will handle any other geocities mirrors that are discovered to have been spammed (e.g. these, should they turn out to be the same deal as oocities.com). Also, ((BAGAssistanceNeeded)) Anomie 00:54, 4 October 2010 (UTC)

It appears MBisanz approved this request with this and this edit, and just forgot to edit this page. So I'll just do a little clerical work to mark this page as Approved. Anomie 03:48, 7 October 2010 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.