The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator: Harej

Automatic or Manually assisted: Automatic

Programming language(s): PHP

Source code available: User:Full-date unlinking bot/code

Function overview: Removes links from dates.

Edit period(s): Continuous

Estimated number of pages affected: 650,000+, exclusively in the article space.

Exclusion compliant (Y/N): Yes

Already has a bot flag (Y/N): No

Function details: On Wikipedia:Full-date unlinking bot, a consensus was reached that full dates (a month, a day, and a year) should not be linked in articles unless they are germane to the article itself (for example, in articles about the dates themselves). To secure the broadest support, the bot will operate conservatively and will unlink only full dates. The details of operation and the exceptions are available on User:Full-date unlinking bot.
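To illustrate the kind of transformation involved, here is a minimal sketch (not the bot's actual code, which handles far more variants, punctuation repair, and exclusions):

  <?php
  // Illustrative sketch only -- not the bot's actual regular expressions.
  // Unlinks the two most common autoformatted full-date forms:
  //   [[January 1]], [[2009]]  ->  January 1, 2009
  //   [[1 January]] [[2009]]   ->  1 January 2009
  $months = 'January|February|March|April|May|June|July|August|September|October|November|December';

  function unlink_full_dates($text, $months) {
      // American style: [[Month Day]], [[Year]]
      $text = preg_replace("/\[\[($months) (\d{1,2})\]\] *, *\[\[(\d{1,4})\]\]/",
                           '$1 $2, $3', $text);
      // International style: [[Day Month]] [[Year]]
      $text = preg_replace("/\[\[(\d{1,2}) ($months)\]\] +\[\[(\d{1,4})\]\]/",
                           '$1 $2 $3', $text);
      return $text;
  }

  echo unlink_full_dates('Born [[1 January]] [[2009]] in [[London]].', $months);
  // Born 1 January 2009 in [[London]].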

Discussion

The specification mentions an exclusion list in addition to {{bots}}/{{nobots}}. This doesn't seem to be present. Mr.Z-man 19:20, 23 August 2009 (UTC)
It is available here: User:Full-date unlinking bot#Exceptions. @harej 19:32, 23 August 2009 (UTC)
I'm referring to detail #6, "An exclusion list will contain the articles the bot will not edit. This list will contain the few article titles where a link to month-day and/or year within at least one triple date meets the relevance requirement in MOSNUM. (In these cases, it would be easier to edit the page manually in accordance with this at a later time.) Articles will be added to the list after manual review; there should be no indiscriminate mass additions of articles to the exclusion list. The list will be openly editable for one month before the bot starts running." Mr.Z-man 20:18, 23 August 2009 (UTC)
Ultimately what ended up happening was that rather than a full list of articles, we made a list of types of articles that the bot would exclude, as enumerated in the Exceptions list. @harej 20:27, 23 August 2009 (UTC)

Code review

Note that I'm not going to touch the consensus issues relating to this bot or any evaluation of its specification; I'm just going to see whether the code does what the specification at User:Full-date unlinking bot states. This review is based on this revision.

  1. It will fail to process pages linking to "January 1" and so on, because the code on line 140 will check "January1" (no space) instead.
  2. It doesn't check for API errors from $objwiki->whatlinkshere. Which means that, on an API error, it will try to process the page with a null/empty title (whether it will "succeed" or not I haven't determined) and skip all the actual pages linking to that month/day.
  3. It would be more efficient to pass "&blnamespace=0" as the $extra parameter to whatlinkshere.
  4. It would also be more efficient (and less likely to fail) to check the "ns" parameter in each returned page record from list=backlinks than to match a regex against the page title to detect namespaces. Of course, that would mean not using $objwiki->whatlinkshere.
    • I'd recommend doing both #3 and #4, just in case Domas decides to break blnamespace like he did cmnamespace.
      • Is there a substantial chance of that happening? @harej 04:27, 24 August 2009 (UTC)
        • I would hope not, but you never can really tell. Anomie 17:57, 24 August 2009 (UTC)
  5. Your namespace-matching regex is broken; it will only match talk namespaces, which means the bot would edit any even-numbered namespace, not just the article namespace.
    Resolved by getting rid of the namespace-matching regex altogether. @harej 04:27, 24 August 2009 (UTC)
  6. Your regex3 will match "1st", "3rd", and "4th"-"9th", but not "2nd".
  7. Your list of topics in regex3 and regex4 may or may not be comprehensive. It might be safer to just match all possible topics with ".+".
  8. Your regex3/regex4 will not match "intrinsically chronological articles" named like "1990s in X", or "List of 1994 Xs", or "List of 1990s Xs", or "List of 20th century Xs", or "List of 2nd millennium Xs", or "List of Xs in the 1990s", or "List of Xs in the 20th century", or "List of Xs in the 2nd millennium". There may be other patterns, those are just what I can think of off the top of my head.
    As far as I know, I have fixed this. @harej 04:53, 24 August 2009 (UTC)
    "List of Xs in 1990" is still missing, make "the" optional in regex4. Anomie 17:57, 24 August 2009 (UTC)
    I made "the" optional, so now "List of Xs in 1990" will work. @harej 18:29, 24 August 2009 (UTC)
  9. For that matter, your regex3 won't even compile. It has unbalanced parentheses.
  10. checktoprocess() doesn't check for errors from $objwiki->getpage either. Which could easily lead to the bot blanking pages.
  11. Putting a comment in the page to indicate that the bot processed it is The Wrong Way to do it. If someone reverts the bot they're going to be reverting your comment too, which means that it will completely fail in its purpose. You need to use a database of some sort (sqlite is easy if you don't already have one available), and I recommend storing the pageid rather than the title as pageid is not affected by moves. And even if no one reverts it, that means that hundreds of thousands of articles will have this useless comment in them for quite some time.
    If the bot must confer with a database for each page to make sure it has not already been processed, how much will this add to the amount of time it takes to process a page? @harej 19:11, 23 August 2009 (UTC)
    Negligible, really, especially if the database is on the same machine so network issues are minimized. Communicating with the MediaWiki API will take much more of your time, and even that will likely be dwarfed by the necessary sleeps to avoid blowing up the servers. Anomie 17:57, 24 August 2009 (UTC)
    Come to think of it, I should rewrite all my scripts to interface directly with the database instead of with the API (though having it interface with the API makes it a lot more portable). Anyways, as I have said below, I am working to replace the comment-based system with a sqlite database. @harej 18:29, 24 August 2009 (UTC)
  12. $contents never makes its way into unlinker(), as there are no "global" declarations.
    As far as I know, I have fixed this. @harej 18:29, 24 August 2009 (UTC)
  13. You could theoretically skip "Sept" in the date-matching regular expressions, enwiki uses Sep. OTOH, I don't know whether any broken date links use it. Also, technically only "[ _]" is needed rather than "[\s_]".
    I think our intent is to recognize all the cases that MediaWiki recognizes, plus more. Thus, cases like "[[Sept 1]], [[2009]]" and "[[1 January]][[2009]]" (no space) would be delinked and have their punctuation corrected. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
  14. To match the comma in the same way MediaWiki does, use "(?: *, *| +)"; use " *, *" or " +" to match comma or non-comma, respectively. Also, you'll need "(?![a-z])" at the end of each regex to truly match MediaWiki's date autoformatting.
    See above response. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
  15. You could probably do each of the replacements with one well-constructed preg_replace instead of with a match and loop (a sketch appears after this list).
    See proposed change to replace logic here. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
  16. Your "brReg" regex is broken, it'll leave code like "[[1 January 2009".
    Fixed by this edit. -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
  17. Your "amOdd" regex is broken, "January 1 2009" is not a valid format for autoformatted dates.
    Fixed by this edit. -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
  18. Using strtotime and date will screw up unusual dates like "February 30 2009" which are correctly handled by the MediaWiki date autoformatting.
    What would be a good alternative? @harej 19:11, 23 August 2009 (UTC)
    See proposed change to replace logic here. -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
    ↑ That is the good alternative. Anomie 17:57, 24 August 2009 (UTC)
  19. Note that "(?!\s\[)(?:\s*(?:,\s*)?)" in the brOdd regex could be reduced to ",\s*", which doesn't match what MediaWiki's date formatting will match.
    We are also looking to match cases with multiple spaces or spaces before the comma. -- Tcncv (talk) 01:50, 24 August 2009 (UTC)
  20. Your bot will edit articles solely to add your comment, which is annoying to anyone watching the page and generally a waste. When you change it to use a local database instead, the bot will waste resources making null edits.
    I was not expecting this to be a problem, since the bot only works from articles that appear on WhatLinksHere, but it would still leave the comment if, say, August 23 appeared in the article without an accompanying year. That is the problem. @harej 19:11, 23 August 2009 (UTC)
  21. While it's unlikely to run into an edit conflict, I see you do no edit conflict checking. And given the nature of the bot, people will complain if it overwrites someone's legitimate edit even once.
  22. There is no error checking from the page edit either. It could be useful to log protected page errors and the like.
  23. It is not exclusion compliant.
  24. You really do need to add in sleep(10) or the like after each edit. And you should really use maxlag on all queries, too.
  25. The edit summary must be more informative than a cryptic "Codes: amReg". Explain what you're doing in plain English, and put the codes at the end.
    Each edit summary links to a page about the codes. Obviously the page will be made before the bot begins to edit. @harej 19:11, 23 August 2009 (UTC)
    Not good enough, IMO. Consider what a random person unaware of all this date mess will think when seeing edits from this bot in his watchlist. "Codes: blah blah" is much less informative than "Unlinking auto-formatted dates per WP:Whatever. Codes: blah blah". Anomie 17:57, 24 August 2009 (UTC)
    Edit summary now begins with "Unlinking autoformatted dates per User:Full-date unlinking bot. Codes: ". @harej 18:29, 24 August 2009 (UTC)
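(For what it's worth, here is the kind of single-pass replacement #15 is getting at, using the comma handling from #14 and avoiding the strtotime/date round-trip flagged in #18. This is only a sketch, not Tcncv's actual proposal.)

  // Sketch only: one preg_replace per date style, no strtotime/date round-trip.
  // "(?: *, *| +)" and the trailing "(?![a-z])" follow the suggestions in #14;
  // the separator is normalized to ", " since the bot also corrects punctuation.
  $months = 'January|February|March|April|May|June|July|August|September|October|November|December';
  $contents = 'He was born on [[January 1]], [[2009]] in [[London]].';
  $contents = preg_replace(
      "/\[\[($months) (\d{1,2})\]\](?: *, *| +)\[\[(\d{1,4})\]\](?![a-z])/",
      '$1 $2, $3',
      $contents
  );
  // $contents is now: He was born on January 1, 2009 in [[London]].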

All in all, this code needs a lot of work before it's ready to even be given a test run. Anomie 18:43, 23 August 2009 (UTC)

Some of these have already been fixed. Would it be okay if I struck out parts of your comment as problems are rectified? @harej 18:51, 23 August 2009 (UTC)
Go ahead. Anomie 18:56, 23 August 2009 (UTC)
Added some regex related responses above -- Tcncv (talk) 01:34, 24 August 2009 (UTC)
I agree with Anomie wrt the hidden comments in pages. Hidden comments may be removed by people who don't know what they're for, or if people revert the bot, they may revert the comment too, so they aren't reliable. Using a database should add only a trivial amount of overhead, much less than getting the page text. Also, if you run the bot from the toolserver, you could put the database on the sql-s1 server, replace the API backlinks call with a database query with a join on your own database, and not have to do a separate query to check if you've already processed it (you also wouldn't have to worry about blnamespace being disabled without warning if you use the database directly). Mr.Z-man 13:38, 24 August 2009 (UTC)
I have been looking into using an SQLite database backend to store the page IDs of pages that have already been treated. Consider it definitely a part of the next version. @harej 17:16, 24 August 2009 (UTC)

Comments on the revised code in beta release 3:

  1. The negative-lookahead in part_AModd_punct and part_BRodd_punct should be unnecessary, as the prior processing of the non-odd versions should have already taken care of anything the negative-lookahead group would match. OTOH, it doesn't hurt anything being there either.
    You are correct. It is my intent to code the regular expressions with minimal dependence on each other, so that they can be tested individually to demonstrate that they match the intended targets and do not match unintended targets. As you point out, it doesn't hurt anything, so I think I'll leave them in place. -- Tcncv (talk) 03:34, 25 August 2009 (UTC)
  2. One potential issue: "[[January 1]], [[1000_BC]]" will be delinked to "January 1, 1000_BC" (with an underscore in the year). The easiest fix might be a simple "$contents=preg_replace('/\[\[(\d{1,4})_BC\]\]/', '[[$1 BC]]', $contents);" before the other processing.
    I looked at this as a possible independent AWB task and discovered that there are no cases to fix! I'll keep an eye out in case something slips into the database. Your obviously thorough examination is welcome and appreciated. -- Tcncv (talk) 03:34, 25 August 2009 (UTC)
  3. A theoretical 45 edits per minute (15 edits every 20 seconds) is still quite fast. This task is really not so urgent that the 6epm suggestion in WP:BOTPOL needs to be ignored. To be simple, just sleep(10) after each edit; to be slightly more complicated (but closer to a real 6epm), store the value of time() after each edit and just before the next edit do "$t=($saved_time + 10 - time()); if($t>0) sleep($t);" (and be sure to implement edit conflict detection!). A sketch of this throttle appears after the exchange below.

Anomie 17:57, 24 August 2009 (UTC)

The sleep conditional does not indicate 15 edits every 20 seconds, but rather that it would make 15 edits (each edit taking at least two seconds, more if there are a lot of dates) and then do nothing for 20 seconds. I am not sure of the math, but that is a lot slower than 45 edits per minute. @harej 18:29, 24 August 2009 (UTC)
You must have an incredibly slow computer there, if those few regular expressions are going to take several seconds to run. Anomie 01:28, 25 August 2009 (UTC)
"Several" is not the right word. However long it will take, I've since changed the code so that it will sleep after each edit, for simplicity's sake (and because I probably underestimate the speed). @harej 02:09, 25 August 2009 (UTC)

Beta Release 4

Here is the difference between Beta Release 3 and Beta Release 4. The two most significant changes are that it now rests for ten seconds after each edit instead of for 20 seconds after every 15 edits, which is plain easier on the server, and that it now keeps track of pages it has already edited through page IDs recorded in a text file. I know I originally said that it would be an SQL database; however, considering how uncomplicated a task storing a list of page IDs is, I figured this would be simpler. Of course this can change again if I was a dumbass for using a comma-delimited text file. And there are still more problems to address. @harej 21:56, 24 August 2009 (UTC)

I'm glad you went with the sleep after each edit. Comments:
  1. I'd recommend going ahead with sqlite or another database, just because that has already solved the issues with efficiently reading the list to find whether a particular entry is present and then adding an entry to the list and writing it to disk (a minimal sketch follows below).
    It's a better solution, you're saying? @harej 02:43, 25 August 2009 (UTC)
    It's IMO the right tool for the job. Anomie 12:13, 25 August 2009 (UTC)
  2. Your "write back to the file" code needs to use "a" rather than "w" in the fopen; "w" will overwrite the file each time rather than appending the new entry as you intend.
  3. It would be better if you can load the page contents and pageid in one API query rather than two.
Anomie 02:25, 25 August 2009 (UTC)
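(A minimal sketch of the sqlite-backed bookkeeping suggested in #1, keyed on pageid as recommended earlier; the table and function names are illustrative, not from the bot's code.)

  // Sketch only: track processed pages by pageid in SQLite via PDO.
  $db = new PDO('sqlite:processed_pages.sqlite');
  $db->exec('CREATE TABLE IF NOT EXISTS processed (pageid INTEGER PRIMARY KEY)');

  function already_processed(PDO $db, $pageid) {
      $stmt = $db->prepare('SELECT 1 FROM processed WHERE pageid = ?');
      $stmt->execute(array($pageid));
      return (bool) $stmt->fetchColumn();
  }

  function mark_processed(PDO $db, $pageid) {
      // INSERT OR IGNORE also sidesteps the append-vs-overwrite problem in #2.
      $stmt = $db->prepare('INSERT OR IGNORE INTO processed (pageid) VALUES (?)');
      $stmt->execute(array($pageid));
  }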
If you simply iterate through page IDs starting with 6 to save time... then you will only need to store one number. Rich Farmbrough, 08:46, 27 August 2009 (UTC).
Starting with 6? Why 6? @harej 20:24, 27 August 2009 (UTC)
Page IDs 0-5 are not in main-space, if they exist at all. Rich Farmbrough, 18:16, 28 August 2009 (UTC).

You might also be interested in [1]. Rich Farmbrough, 09:17, 27 August 2009 (UTC).

Trial run complete

The bot has successfully performed 51 edits as part of the trial. See Special:Contributions/Full-date unlinking bot. Note that the bot's edit to August was due to an oversight in one of the regular expressions used to exclude article titles. I reverted the bot's edits, corrected the regex, and it should not be a problem anymore. @harej 23:57, 2 October 2009 (UTC)

Thanks and congratulations, Harej. I seem to recall that a second, larger trial is part of the plan. Is this correct? Tony (talk) 12:03, 3 October 2009 (UTC)
Possibly, but I think I will need approval for that. @harej 15:48, 3 October 2009 (UTC)
Hopefully soon. Can you run off a list (from your database) of pages you've already edited? Then we could check that these ones have been ticked off. - Jarry1250 [ In the UK? Sign the petition! ] 10:57, 4 October 2009 (UTC)
Oh, and can it keep track of articles with mixed date formats in them? Cheers, - Jarry1250 [ In the UK? Sign the petition! ] 10:57, 4 October 2009 (UTC)
Special:Contributions/Full-date unlinking bot shows which pages were actually edited, but according to the list of pages that went through the whole process because their titles did not immediately disqualify them, the bot edited the pages with the following IDs: 1004, 1005, 6851, 6851, 12028, 13316, 19300, 19758, 20354, 21651, 25508, 26502, 26750, 27028, 27277, 30629, 30747, 31833, 31852, 32385, 33835, 37032, 37419, 42132, 53669, 57858, 65143, 65145, 67345, 67583, 68143, 69045, 74201, 74581, 84944, 84945, 84947, 95233, 96781, 106575, 106767, 107555, 116386, 131127, 133172, 147605, 148375, 159852, 161971, 180802, 188171, 207333, 230993, 232200, 245989, 262804, 272866, 311406, 314227, 319727, 321364, 321374, 321380, 321387, 333126, 355604, 377314, 384009, 386397, 402587, 403102, 410430, 414908, 415034, 418947, 434000, 438349, 480615, 497034, 501745, 503981, 508364, 532636, 535852, 545822, 562592, 577390, 595530, 613263, 617640, 625573, 634093, 641044, 645624, 656167, 658273, 682782, 696237, 728775, 743540, 748238, 761530, 769466, 784200, 819324, 826344, 833837, 839074, 840678, 842728, 842970, 842995, 858154, 865117. @harej 16:00, 4 October 2009 (UTC)
I checked a few, and it looks good. It got the commas right as well. --Apoc2400 (talk) 13:26, 9 October 2009 (UTC)

Request for a second trial

Considering that the first trial was successful, I am now requesting a second trial of 500 edits. @harej 16:22, 4 October 2009 (UTC)

Approved for trial (200 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete; otherwise it is impossible to review them. - Jarry1250 [ In the UK? Sign the petition! ] 18:23, 9 October 2009 (UTC)
The bot has performed 202 edits in its second test run. In between runs, I changed the code to finally ditch the flat-file database of page IDs in favor of a proper MySQL database of page titles, which has improved performance. Additionally, the bot will no longer try to submit a page unless a change has been made to it. There are only three awry edits to report: a page blanking, another page blanking, and some weird jazz involving the unlinker that Tcncv should probably look into. The page blankings should not be a problem; I have reverted them and updated the code to make sure it does not save a blank page. @harej 01:15, 10 October 2009 (UTC)
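(For reference, the no-change and non-blank guard described above might look something like this; the variable names are illustrative, not the bot's actual code.)

  // Sketch: skip the save when the unlinker produced no change, and never
  // save an empty result (the safety net against the page blankings above).
  $newtext = unlinker($oldtext);
  if ($newtext !== '' && $newtext !== $oldtext) {
      // ... submit the edit through the API here ...
  }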

Third trial

The issues from trial #2 have been rectified, and I forgot to put in a request for a third trial. I am requesting one now, and am also asking that, contingent on the success of the third trial (and the outcome of the ArbCom motion), we begin considering letting the bot run full-time. @harej 01:02, 21 October 2009 (UTC)

Approved for trial (500 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. - Jarry1250 [ In the UK? Sign the petition! ] 17:09, 21 October 2009 (UTC)
Completed. @harej 23:36, 5 November 2009 (UTC)
Also, the aforementioned ArbCom motion has passed. Dabomb87 (talk) 00:41, 6 November 2009 (UTC)
  • This one's an oddball, which I've seen only once: linking each fragment, but not doing it properly either. Not a bot problem as such, but it just shows the stuff people do. Ohconfucius ¡digame! 10:33, 7 November 2009 (UTC)
  • Don't worry, I just dropped Rich Farmbrough a message about it. I've done a sample check of the last batch, and found no issues. Ohconfucius ¡digame! 13:27, 7 November 2009 (UTC)

Full live run

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.