The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section.

Operator: NicDumZ ~

Automatic or Manually Assisted: Both. I choose at every run.

Programming Language(s): Python. pywikipedia framework

Function Summary: Idea from Wikipedia:Bot requests#Incorrect Ref Syntax

Edit period(s) (e.g. Continuous, daily, one time run): Every time that a new dump is available.

Edit rate requested: I don't know. Standard pywikipedia throttle settings. Edit: from my test runs on fr:, 5 to 10 edits per minute.

Already has a bot flag (Y/N): N.

Function Details: Read User:DumZiBoT/refLinks. Please feel free to correct English mistakes if you find any; it is intended to be a runtime FAQ ;)

The script has been manually tested on fr, where I already have a bot flag for fr:Utilisateur:DumZiBoT and ~40k automated edits. From the ~20 edits that I've made over there, I've found several exceptions, which are now fixed.

Sidenotes:

Discussion

Do you have an estimate of the total number of pages to be edited in the first full run on enwiki? Does your parser for the dumps match things with extra spaces, such as <ref> [http://google.com ] </ref>?
— Carl (CBM · talk) 21:50, 29 December 2007 (UTC)

I can easily count this. I currently have trouble getting the latest en: dump, so it will have to wait... tomorrow, I'd say. But as an estimate, the number of pages to alter on fr is ~5500, not that much. For your second question, the answer is yes. NicDumZ ~ 23:05, 29 December 2007 (UTC)
Count from enwiki-latest-pages-articles.xml (October 23rd; pages-meta is newer, but I am still downloading it): ~62,300. Quite a lot, in fact. NicDumZ ~ 14:52, 30 December 2007 (UTC)
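For illustration, here is a minimal regex sketch that tolerates that kind of extra whitespace (a hypothetical pattern, not the one the bot actually uses; the real code is linked in the Source code section below):

import re

# Hypothetical sketch: match a bare, bracketed external link inside ref tags,
# tolerating extra whitespace around the brackets and the URL.
untitled_ref = re.compile(
    r'<ref[^>]*>\s*\[?\s*(?P<url>https?://[^\[\]\s<>"]+)\s*\]?\s*</ref>',
    re.IGNORECASE)

m = untitled_ref.search('<ref> [http://google.com ] </ref>')
if m:
    print(m.group('url'))   # http://google.com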

I did another run on fr, longer this time: [1]. I had some rare encoding problems ([2]) which I need to fix, but everything else seems fine to me. NicDumZ ~ 00:07, 30 December 2007 (UTC)

I was thinking about doing something similar to this for a while, except my bot would have been tailored to placing ((cite news)) with most parameters (date, title, author) filled in from the website. Any similar ideas? And will you be placing a comment to notify editors that the title was automatically added (so they aren't left wondering why some of the links have strange titles)? I look forward to seeing your source code. —Dispenser (talk) 08:47, 30 December 2007 (UTC)

I would definitely be interested to read through the code. I think that using ((cite news)) isn't a good idea, since in most cases you won't be able to fill in the details. You could add the string "Accessed YYYY-MM-DD" after the link, inside the ref tags, without much trouble. — Carl (CBM · talk) 15:37, 30 December 2007 (UTC)

The ((cite news)) template would have to be coded for each specific site. The ideal form would be to store this in a dictionary, which would allow easy adding of new sites. Having proper citations would immensely help with the dead link problem in citations.
Ah! No offense, but you don't seem to know what you're talking about. My bot will have to handle links to thousands of different websites: would you write a different handler for each website? xD NicDumZ ~ 19:09, 30 December 2007 (UTC)
No, but I would write it for the 20 biggest that regularly remove content after a few months, especially those that block access to the Wayback Machine via robots.txt. Some which I'd like to see are the New York Times, Yahoo News, Reuters, The Times, and the Los Angeles Times. Most of these will probably use the same regex anyway. —Dispenser (talk) 20:02, 30 December 2007 (UTC)
I checked yesterday for NYTimes links. It happens that the (rare) links which look like http://www.nytimes.com/* could easily be parsed to retrieve an author name and a publication date. But the huge majority of other links, http://select.nytimes.com/* for example, don't appear to have a common format, nor to give the author on every article... NicDumZ ~ 08:11, 3 January 2008 (UTC)
I will have to concede on this point, because of the aforementioned problem. I did come up with an idea of a user-based system, but as of now it is unimplementable. —Dispenser (talk) 06:56, 7 January 2008 (UTC)

An issue with JavaScript in the HTML. Will application/xml mime types be accepted? What about servers that don't send out any types? —Dispenser (talk) 10:03, 31 December 2007 (UTC)

Thanks for this one. I saw it, but thought that it was some strange title. About the problem, I'm a bit... stuck. (By the way, the source of that page is... so wrong!) I don't think that I should ignore text inside <script> markup, should I? NicDumZ ~ 13:29, 31 December 2007 (UTC)
JavaScript and CSS should be in an HTML comment, or better, in CDATA, but it isn't required. It's best to remove that stuff before sending the page to the interpreter. —Dispenser (talk) 04:08, 3 January 2008 (UTC)
I'll remove them then. NicDumZ ~ 08:14, 3 January 2008 (UTC)
 Done. DumZiBoT now ignores <script> tags. NicDumZ ~ 16:21, 3 January 2008 (UTC)
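A rough sketch of that stripping step (hypothetical code, not the bot's actual implementation):

import re

# Hypothetical sketch: drop <script>...</script> blocks so that strings inside
# inline JavaScript can never be mistaken for the page title.
script_re = re.compile(r'<script[^>]*>.*?</script>', re.IGNORECASE | re.DOTALL)

def strip_scripts(html):
    return script_re.sub('', html)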
application/xml mime type links are ignored, and the same goes for servers not sending out types. If you have examples of links where I *should* not ignore them, I can try to improve this behaviour. NicDumZ ~ 14:12, 31 December 2007 (UTC)
I disabled parts of the content checking in my link checker tool. The Encyclopedia Britannica website was not sending out a content type, and most sites have since fixed this. You should follow the W3C XHTML media type recommendations. Again, here are the problematic links:
  • [3] text/html; charset=ISO-8859-1
    OK, *not* ignored, but no title, though.
  • [4] - text/html for a .pdf file (actually a soft 404, without a redirect)
    Retrieves "404 - Page not found" as a title.
  • [5] - text/html for a .txt file
    Retrieves ".: Corvallis Gazette-Times: Archives" as a title.
  • [6] - No type or length
    No Content-Type HTTP header is found, hence the link is skipped.
    → TODO: Try to retrieve the meta Content-Type when no HTTP Content-Type is found.
     Done. Now retrieves "User pages" (see the sketch below).
  • [7] - application/pdf but my tool reports it as text/html (python issue?)
    Media detected.

Dispenser (talk) 04:08, 3 January 2008 (UTC)
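A rough sketch of the meta-tag fallback described in the TODO above (hypothetical code; it assumes the http-equiv attribute appears before content):

import re

# Hypothetical sketch: when the server sends no Content-Type header, look for
# <meta http-equiv="Content-Type" content="..."> in the document itself.
meta_ct = re.compile(
    r'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    r'content\s*=\s*["\']?([^"\'>]+)', re.IGNORECASE)

def meta_content_type(html):
    m = meta_ct.search(html)
    if m:
        return m.group(1).strip()
    return None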

Thanks a lot for your help...!! I will add support for the application/xhtml+xml, application/xml, and text/xml mime types. NicDumZ ~ 08:11, 3 January 2008 (UTC)
 Done. NicDumZ ~ 16:21, 3 January 2008 (UTC)
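A sketch of the resulting acceptance check (the list of types comes from the discussion above; the helper itself is hypothetical):

# Hypothetical sketch: only documents whose media type can plausibly contain
# an HTML/XHTML <title> are fetched and parsed.
ACCEPTED_TYPES = ('text/html', 'application/xhtml+xml',
                  'application/xml', 'text/xml')

def is_parsable(content_type):
    if not content_type:
        return False
    media_type = content_type.split(';')[0].strip().lower()
    return media_type in ACCEPTED_TYPES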

I encountered encoding problems with exotic charsets that BeautifulSoup couldn't handle properly (the Arabic charset windows-1256). I now try to retrieve the charset from the meta tags to give an accurate hint to BeautifulSoup: problem solved for that particular charset. (TODO: when able to fetch a valid charset from the meta tags, convert the document encoding myself and retrieve the title with a simple regex. When no valid charset is found, keep the current behavior, i.e. parse the <title> markup with BeautifulSoup, where the encoding is "guessed" by BS.) NicDumZ ~ 13:29, 31 December 2007 (UTC)

The TODO was done. I now only use the lightweight UnicodeDammit module from BeautifulSoup to help with encoding: performance has been greatly improved. NicDumZ ~ 21:38, 2 January 2008 (UTC)
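Roughly, the approach now looks like the following sketch, assuming BeautifulSoup 3's UnicodeDammit interface (the overrideEncodings argument and the .unicode attribute); the real code is in the source linked below.

from BeautifulSoup import UnicodeDammit   # BeautifulSoup 3.x
import re

title_re = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_title(raw_bytes, hinted_charset=None):
    # Hypothetical sketch: let UnicodeDammit convert the raw page to unicode,
    # hinted with any charset found in the meta tags, then grab the <title>
    # with a plain regex instead of building a full parse tree.
    hints = []
    if hinted_charset:
        hints = [hinted_charset]
    converted = UnicodeDammit(raw_bytes, overrideEncodings=hints)
    if converted.unicode is None:
        return None
    m = title_re.search(converted.unicode)
    if m:
        return m.group(1).strip()
    return None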
Is there any code to prevent the bot from downloading large files or extensions that are non-HTML? Example. —Dispenser (talk) 04:08, 3 January 2008 (UTC)
I rely on a 10-second socket timeout. Pretty bad answer, wasn't it? Sockets don't time out when downloading large files. I now never download more than 1 MB... (and the download is only started when no Content-Type header was given, or when the Content-Type header gave a valid mimetype). NicDumZ ~ 21:36, 3 January 2008 (UTC)
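A sketch of such a capped download, assuming urllib2 is used for fetching (hypothetical helper, not the bot's exact code):

import socket
import urllib2

socket.setdefaulttimeout(10)     # the 10-second socket timeout
MAX_BYTES = 1024 * 1024          # never read more than ~1 MB of the body

def fetch_capped(url):
    # Hypothetical sketch: read at most MAX_BYTES so that a huge file cannot
    # keep the bot busy; the caller has already checked the Content-Type.
    response = urllib2.urlopen(url)
    try:
        return response.read(MAX_BYTES)
    finally:
        response.close()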
Be careful if you're querying the server with HEAD instead of GET, as some sites break when using the former (e.g. http://telegraph.co.uk/). —Dispenser (talk) 06:56, 7 January 2008 (UTC)

I also just implemented a soft switch-off: User:DumZiBoT/EditThisPageToStopMe. NicDumZ ~ 21:38, 2 January 2008 (UTC)

Typically this is done with the talk page, using page.site().messages (which looks for the "You have new messages" banner in the HTML); this way you don't need to check the page manually. —Dispenser (talk) 03:38, 3 January 2008 (UTC)
Not sure here. I also check for messages, but this is a bit different. I've encountered, at least twice, users who wanted to "+&*%#&//$öä#" DumZiBoT over some misunderstanding of its behavior, asking on the French VP for a community ban, and so on... without even trying to contact me. This solution has the advantage of giving non-admin users the illusion that they *can* stop DumZiBoT, and calms down the "bot-haters". If I find abuses, I'll turn this feature off. Also, this is only an easy latestRevisionId check. NicDumZ ~ 08:11, 3 January 2008 (UTC)
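In pywikipedia terms, that check amounts to something like the sketch below; the method names are assumptions, and the real API may differ:

import wikipedia   # pywikipedia core module

STOP_PAGE = u'User:DumZiBoT/EditThisPageToStopMe'

def make_stop_check(site):
    # Hypothetical sketch with assumed method names: remember the stop page's
    # latest revision id and halt as soon as anyone edits the page.
    initial_rev = wikipedia.Page(site, STOP_PAGE).latestRevision()
    def should_stop():
        return wikipedia.Page(site, STOP_PAGE).latestRevision() != initial_rev
    return should_stop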

DumZiBoT and dead links

How will you take care of dead links such as those at User:Peteforsyth/O-vanish, and special cases such as redirects to a 404 page or to the root page, which are handled by my link checker tool? —Dispenser (talk) 18:09, 30 December 2007 (UTC)

I tested the behavior of my script on the three example links given in User:Peteforsyth/O-vanish; the log is:
fr:Utilisateur:DumZiBoT/Test :
http://www.oregonlive.com/news/oregonian/index.ssf?/base/news/1144292109305320.xml&coll=7
No title found... skipping
http://www.oregonlive.com/newsflash/regional/index.ssf?/base/news-14/1142733867318430.xml&storylist=orlocal
No title found... skipping
http://www.oregonlive.com/weblogs/politics/index.ssf?/mtlogs/olive_politicsblog/archives/2006_08.html#170825
HTTP error (404) for http://www.oregonlive.com/weblogs/politics/index.ssf?/mtlogs/olive_politicsblog/archives/2006_08.html#170825 on fr:Utilisateur:DumZiBoT/Test
[...]
Both behaviors are fine with me, aren't they? I guess that redirects to 404 pages will raise a 404, as the third link did, and that redirects to root pages will most likely raise a 404 (when redirected because the page is not available anymore), a 301, or a 303 (moved). But I do not deal with error handling; servers do. NicDumZ ~ 19:09, 30 December 2007 (UTC)
Here's a list I got from my tool; these may pose issues for bots:
These are not as common as 404s (9% of all links), but common enough that it will be a problem (1-2%). —Dispenser (talk) 20:48, 30 December 2007 (UTC)
Thanks for these. I tried; only two links are ignored, the others are processed: see the resulting page.
Now, if an editor looks at the resulting page, he/she will see that refs #3 & #8 were not processed, and will most likely check and remove the links. Then, he/she will see that refs #1, #4, #5 and #10 have strange titles (including "session cookies", "login", "ressource secured", "Page not Found") and will check and remove them. That leaves the invalid links contained in refs #6, #7, and #9 (#2 is correct)... Not bad for a script that is not intended to remove invalid links!! :)
  • If, for some reason (high load leading to a timeout, strange restrictions), DumZiBoT does not process a link (valid or not, that's not the question), it's fine. Chances are that at the next check it will get converted. If not, well... bots also have limits! :)
  • Giving a title to a dead link which reflects that it is a dead link, without actually removing it, *is* fine (e.g. ref #10, {...} - Page Not Found).
  • Giving a normal title to a dead link (e.g. ref #9, FIFA.com) is not so good, but definitely not worse than having a bad ref AND a dead link without a title :).
  • What I must avoid is the last case: giving an erroneous title (e.g. "Forbidden") to a valid link. I don't think that it has happened in my tests yet, and I can't think of an example that would trigger this (some page where my bot would get tricked into an error while regular browsers wouldn't), but this is definitely the worst case:
    • A user might think that the link is dead/invalid and delete it without actually checking it,
    • and it adds extra work for a user who would notice that the link is valid but not the title.
NicDumZ ~ 22:17, 30 December 2007 (UTC)
Here are the regexes from my tool that you may find relevant:
regreq = re.compile(r'register|registration|login|logon|logged|subscribe|subscription|signup|signin|finalAuth|\Wauth\W', re.IGNORECASE)
soft404 = re.compile(r'\D404(\D|\Z)|error|errdoc|Not.{0,3}Found|sitedown|eventlog', re.IGNORECASE)
directoryIndex = re.compile(r'/$|/(default|index)\.(asp|aspx|cgi|htm|html|phtml|mpx|mspx|php|shtml|var)$', re.IGNORECASE)
# Untested
ErrorMsgs = re.compile(r'invalid article|page not found|Not Found|reached this page in error', re.IGNORECASE)
I haven't seen anything in your worst case. The closest thing I've seen is that a very few sites actually give 404 errors for every page they serve up. I've also seen pages that act completely wrong with a browser agent (sending a soft 404) and perfectly fine with a bot agent. I've heard that some sites give access to everything when the agent is Googlebot. As for those entries you label 404/403, they don't do that in Firefox. —Dispenser (talk) 10:03, 31 December 2007 (UTC)

(←) I'm worried about false positives with your regexes. What about pages like [18], [19], or [20]? Their titles match regreq or soft404, and yet they are valid. Or are you talking about checking, with these regexes, the links where you are redirected? (Even then, what if a SSHlogin.htm is moved permanently to another address, but with the same name? 'SSHlogin.htm' does match regreq!) I do not understand everything... NicDumZ ~ 14:12, 31 December 2007 (UTC)

The regexes are designed to test redirects. They are designed to have few false negatives, but allow for false positives, as the tool presents them as possibilities. With your bot you would be writing an unwritten rule that external links that don't have a title need to be fixed and given a title. Additionally, the link title provides the most significant keywords for finding dead links again. However, those keywords aren't of much use if the title simply states that the link is not found. —Dispenser (talk) 04:42, 3 January 2008 (UTC)
With your bot you would be writing an unwritten rule that external links that don't have a title need to be fixed and given a title: I understand your concern, but I can't delete links if I'm not 100% sure that the link is dead; that's what I was trying to say... A final human check before deletion is, to me, necessary.
Additionally, the link title provides the most significant keywords for finding dead links again. However, those keywords aren't of much use if the title simply states that the link is not found: I'm not sure how to understand this. If I "flag" a link in a reference with, e.g., 404 - Page not found from its HTML title, it's much better than leaving the link unmodified, isn't it? NicDumZ ~ 19:06, 3 January 2008 (UTC)
Sorry if this is disjointed. As your bot will leave 404 pages unmarked, it will eventually give the impression that such links need to be reviewed. Second (I do not recall this being written anywhere), by not giving a title to a page it allows for a second pass by the bot at a later time. This allows retrieval of the title if a site is being slashdotted. The regexes above are relatively accurate if you use them to check the redirected-to URL; of course you need to test that the keywords don't appear in the first URL. This will ensure consistency between errors, and possibly the site will correctly forward all the old links. —Dispenser (talk) 06:56, 7 January 2008 (UTC)
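A sketch of applying those regexes to redirects as suggested (regreq and soft404 as defined earlier; the urllib2-based fetch and the helper name are assumptions):

import urllib2

def looks_like_error_redirect(url, regreq, soft404):
    # Hypothetical sketch: fetch the page, compare the final URL (after any
    # redirects) against the keyword regexes above, and only flag the link
    # when the keywords were not already present in the original URL.
    response = urllib2.urlopen(url)
    final_url = response.geturl()
    response.close()
    if final_url == url:
        return False
    for regex in (regreq, soft404):
        if regex.search(final_url) and not regex.search(url):
            return True
    return False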

French wiki got reflinks.py'ed

Over the last few days, I ran several 1000-edit batches on fr, and eventually the whole database got reflinks.py'ed. Still waiting, but for now only 2 errors have been reported to me:

I guess that DumZiBoT is now ready for the big jump... NicDumZ ~ 10:19, 5 January 2008 (UTC)

((BAGAssistanceNeeded))

Some comments, questions, updates, maybe?

Thanks...

NicDumZ ~ 08:22, 6 January 2008 (UTC)

The trial went fine, I believe. We're discussing code improvements with Dispenser, but none of these improvements will significantly alter the bot's behavior: do I need anything else before getting fully approved?

Thanks,

NicDumZ ~ 00:51, 10 January 2008 (UTC)

Source code

Is now available at User:DumZiBoT/reflinks.py.

Please edit it if you think that it needs improvements. I mean it. NicDumZ ~ 18:05, 8 January 2008 (UTC)

I looked over the source code and made the following changes:
  • Better link parsing (not perfect: http://example.org/j!;? will be parsed as http://example.org/j!;)
  • Functions for the link replacement
  • page.get() caches the result, so duplication was eliminated
  • All regexes use ur'', which preserves the \ as-is
  • Removing more than just scripts from the HTML: CDATA sections and style tags are removed too
  • 1,000,000 != 1 MB; see binary prefixes
  • Convert HTML space entities to regular spaces (some sites try to hide the Browser ID)
  • Will check for messages after every edit (this comes at no additional cost)
I haven't tested my changes. I'm still unsure about the method used in the cases where the link is not titled, as it still changes the wikitext. The bug mentioned in the first bullet would cause a loss of data. Would it be possible to simply skip the replacement in those cases? —Dispenser (talk) 22:34, 8 January 2008 (UTC)
Thanks again.
I had to correct a few minor syntax errors, and converting HTML entities is not necessary because that's the purpose of the next line, t = wikipedia.html2unicode(t), but you definitely improved my code. I quickly tested it on fr:, and as expected, it seems to be working.
About your question: the wiki is only edited if the link looks like "[1]", i.e. a bracketed untitled link. It only edits to unbracket the link, which was the original behavior requested at Wikipedia:Bot requests/Archive 16#Incorrect Ref Syntax. NicDumZ ~ 22:57, 9 January 2008 (UTC)
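To illustrate the replacement that is performed, here is a simplified sketch (not the code from User:DumZiBoT/reflinks.py): when a title is retrieved it is inserted, otherwise the link is merely unbracketed.

import re

untitled_ref = re.compile(
    r'<ref([^>]*)>\s*\[\s*(?P<url>https?://[^\[\]\s<>"]+)\s*\]\s*</ref>',
    re.IGNORECASE)

def retitle_refs(wikitext, get_title):
    # Hypothetical sketch: only bracketed, untitled links inside ref tags are
    # touched; get_title is an assumed callback returning a title or None.
    def repl(match):
        url = match.group('url')
        title = get_title(url)
        if title:
            return u'<ref%s>[%s %s]</ref>' % (match.group(1), url, title)
        # No usable title retrieved: just unbracket the link.
        return u'<ref%s>%s</ref>' % (match.group(1), url)
    return untitled_ref.sub(repl, wikitext)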
Bot trial run approved for 50 edits. ßcommand 05:17, 11 January 2008 (UTC)
 Done. By the way, I've already run a trial, and your last edit summary, Betacommand, was somewhat confusing?! Looks like you've mixed up this request with that other one, haven't you? :þ
NicDumZ ~ 11:05, 11 January 2008 (UTC)[reply]
So I looked through the edits, and here are some things:
  • [23] The bot missed a ref because the user added too many brackets; consider using [* ]* instead of [? ]?
  • [24] All-uppercase titles don't look very nice and aren't as readable. Perhaps, if more than 80% of the letters are uppercase, change it using title(). (I already implemented a simple version; see the sketch below.)
  • [25] Bad title given. Seems to be related to User-Agent sniffing; if you use a Firefox UA you'll deal with a different set of issues. The only surefire method is to spoof Googlebot.
  • [26] Bad title, "Sign In Page". Maybe there should be a blacklist for titles?
  • [27] Bad title, "Loading...", caused by a JavaScript redirect.
I should mention that there are two different URL matching methods: the bracketed one, which allows more characters and is simpler to implement, and the non-bracketed one, which you've seen the regex for. Consider which parts you are matching; a simplification of your original would work well in most cases. I've also gone ahead and implemented limited soft-404 and redirect-to-root detection. This should ensure a much lower false positive rate than my tool ever had (it is purposely inflated to catch those exceptional corner cases). —Dispenser (talk) 07:36, 14 January 2008 (UTC)
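A sketch of the heuristic proposed in [24], including the digit safeguard mentioned in the reply below (the threshold and the helper name are assumptions):

def normalize_shouting(title):
    # Hypothetical sketch: if more than 80% of the letters are uppercase, and
    # the title is not mostly digits (e.g. "RFC 1234"), assume the site is
    # shouting and title-case it.
    letters = [c for c in title if c.isalpha()]
    digits = [c for c in title if c.isdigit()]
    if not letters or len(digits) >= len(letters):
        return title
    upper = [c for c in letters if c.isupper()]
    if float(len(upper)) / len(letters) > 0.8:
        return title.title()
    return title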

(indent) Thanks a lot for your input. It was, as always, very, very useful... I'm somewhat running out of time these days, but:

NicDumZ ~ 00:08, 16 January 2008 (UTC)

  • I have modified the algorithm: it now checks for letters instead of characters, and I added a digit check to avoid titles such as RFC 1234. I'm unsure which is better, as yours weighs all lowercase characters against the string for version numbers and dates.
  • The advantages and disadvantages of spoofing a browser User-Agent:
    • Browser UA - Advantages: assurance that the page retrieved is the same one that is given to readers
    • Browser UA - Disadvantages: getting sign-in, ad-preview, and preview pages
    • Bot UA - Advantages: some websites will give out pay-for content so that it appears in search results, remove advertising, and make the website more bot-friendly
    • Bot UA - Disadvantages: some webmasters hate bots and will do anything to hide themselves
For your task I'd stick with a non-browser UA. It is probably a good idea to change the library's default UA, so that sites can identify what your program is doing. Wikipedia, for example, blocks Python's default UA and asks people to use the name of the bot. Give a URL to this RfA so people can read up on it.
I only recommended GoogleBot since websites need to allow it, but it operates from a fixed range of IP addresses.
  • I tend to optimize too much, cramming functionality into as few characters as possible. This creates all sorts of problems in maintenance (I have to compliment your code: it's very clean). I saw that you were only matching inside ref tags, in such a way that it wouldn't actually do any harm to change it.
  • Here's a sample of registration titles. All of these were pulled from my link checker, which I think means that they're all redirect URLs.
  • There's just about nothing to do for a JavaScript redirect, short of a fully blown engine.
    I've implemented a work-around: count the number of printed bytes in the page; if it is more than 2x larger than the title, then the page contains more than just the title (and a reprinted title). (See the sketch after this list.)
  • I typed this up late at night, so it tends to suffer from that. Because of the way you changed my link parser, it now doesn't do it correctly. Of course, this doesn't matter because it's inside < > tags.
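A sketch of that work-around (hypothetical helper; the 2x threshold is the one described above):

import re

tag_re = re.compile(r'<[^>]+>')

def looks_like_redirect_stub(html, title):
    # Hypothetical sketch: strip the markup, and if the visible text is less
    # than twice as long as the title, assume the page is only a redirect
    # stub (e.g. "Loading...") and skip it.
    visible = re.sub(r'\s+', ' ', tag_re.sub('', html)).strip()
    return len(visible) < 2 * len(title)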
Now let's take a look at edge cases for links. The characters . , \ : ; ? ! [ ] < > " (space) are significant at the end of URLs. Observe:

URLs ending with .,\:;?!

Characters that break the URL from the title in the bracketed case are:

Due to the way text is processed, the following quirk happens when leaving out the space and using formatting:

Most parsers not implementing HTML rendering will fail with these links.

Dispenser (talk) 05:14, 16 January 2008 (UTC), updated
I've implemented the blacklist feature; Google's allintitle: feature was useful for evaluating keywords.
Additionally, I added some support for marking links as dead when the server returns the rather obscure HTTP 410 code.
As of now I see no significant reason why this bot request should not be approved. —— Dispenser 06:04, 22 January 2008 (UTC)
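A sketch of how the title blacklist and the 410 handling could fit together (the keyword list and the helper are hypothetical, not the ones actually implemented):

import re
import urllib2

# Hypothetical keyword list, not the one actually used by the bot.
title_blacklist = re.compile(
    r'register|registration|sign.?in|log.?in|subscribe|404|not found|error',
    re.IGNORECASE)

def title_or_dead(url):
    # A 410 Gone response is an explicit "this link is dead" signal; a
    # blacklisted title is treated as "no usable title".
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.HTTPError, e:
        if e.code == 410:
            return None, True          # dead link
        raise
    m = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    if not m or title_blacklist.search(m.group(1)):
        return None, False             # no usable title
    return m.group(1).strip(), False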
One month and two successful trials. Time for approval, isn't it? NicDumZ ~ 17:18, 28 January 2008 (UTC)
Thanks ;) NicDumZ ~ 09:18, 3 February 2008 (UTC)
The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.