The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

Operator: NicDumZ ~

Automatic or Manually Assisted: Automatic, supervised

Programming Language(s): Python

Function Summary: You do know what DumZiBoT was doing, right? :p Extend its behavior to deal with duplicate references.

Edit period(s) (e.g. Continuous, daily, one time run): Same as before, on each XML dump

Already has a bot flag (Y/N): Y

Function Details:

If you look at DumZiBoT's older edits, you'll probably catch one of the minor bugs that have been reported on my talk page and fixed since (and pushed to SVN). Don't judge the bot by those older edits; ask me to run it again, if needed ;)

Discussion[edit]

Let's see what could go wrong [1] [2] [2] [3] [4] [4] [5] [6] [6] [7] [8] [9] Cite error: The opening <ref> tag is malformed or has a bad name (see the help page).

  1. ^ Example 1: named:casingref
  2. ^ a b Example 2: name:CasingRef and Tag casing
  3. ^ Example 3: Quoting (note that mixed quotes ' " don't work; also note how a space translates to an underscore)
  4. ^ a b Cite error: The named reference quoting space was invoked but never defined (see the help page).
  5. ^ bug">Example 4: ref is named "quote>bug", but is null
  6. ^ a b Example 5: Characters that can make a ref name, excluding <, >, and "
  7. ^ Example 6: special characters translate to their anchor-encoded form, so ! and .21 are equivalent
  8. ^ Cite error: The named reference bang.21 was invoked but never defined (see the help page).
  9. ^ Example 7: An empty name is apparently a valid name (BUG!)
safe_map = {}  # character -> escaped form, built lazily on first use

def escapeId(s):
	""" Anchor encode routine, similar to URL encode """
	return escapeUrl(s).replace('%', '_')

def escapeUrl(s):
	""" URL encode routine: encodes non-safe characters as .XX (hex) """
	if not safe_map:
		# generate when first used
		safe = '-.0123456789:ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz'
		for i in range(256):
			c = chr(i)
			safe_map[c] = (c in safe) and c or ('.%02X' % i)

	res = map(safe_map.__getitem__, s.replace(' ', '_'))
	return ''.join(res)
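
For illustration only (hypothetical calls, not part of either bot), assuming the snippet above has been run, reference names map to anchor ids like this, matching Examples 3 and 6:

print(escapeId('quoting space'))  # -> 'quoting_space' (a space becomes an underscore)
print(escapeId('bang!'))          # -> 'bang.21' (so "bang!" and "bang.21" collide)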

Dispenser 17:03, 25 July 2008 (UTC)[reply]

Ah, Dispenser again. Thanks :)
Well, I see no big problem with the above examples. No, my bot does not support all of them, and no, I do not think I can do so :)
  • I did not know that reference names get converted the same way anchors are; I have just implemented support for it (using url2unicode from pywikipedia ;) ). After adding a few identical references... the bot removes the duplicates, and adding an encoded identical name works the same: [1] (And yes, if some processing is needed, the decoded version will always be preferred; see the sketch after this post.)
  • Empty-named references are simply ignored by my regex (i.e. never taken into account: no merges, no duplicate flagging, and so on)
  • I just modified my regex to ignore "quote>bug". Previously it would match "quote", which was wrong, and since I don't think there's a simple way to handle those very specific references, I'll just ignore them
  • I personally really don't care about the XHTML spec, nor about removing one space from an article. I'll just leave it as it is; I don't think that one extra space will strain the servers, knowing that most of the time DumZiBoT is removing duplicates and hence reducing the overall text size
  • For this second task, I'm not working only on references that contain nothing but a link; I work on all references. That means fetching a domain name won't always work, as some references are just plain text. Also, I'd rather act very dumb and say I'm not able to guess a proper name, so that editors look into the reference themselves if they want to find a proper one. "autogenerated" is plain and simple: a bot has inserted it; a guessed name, when not relevant, can be confusing, right?
    While I appreciate your efforts in keeping the program content agnostic, most users do not rename references. A nice way to keep it content agnostic while still selecting the parts to be renamed is to have a configurable regex, like \|\s*last\s*=\s*(?P<refname>[^\w\s]+)|http://+([^/]*?)(?P<refname>[^A-Za-z0-9\-]+)\.([^A-Za-z\.]{2,6})(:\d+)?/ which will (hopefully) capture the last name of the author and, failing that, use the domain of the website. And don't take my earlier attempt at this (take the longest word and hope authors have long names) as anything that took more than 5 minutes to do. — Dispenser 04:01, 30 July 2008 (UTC)[reply]
  • As for merging references that are not perfectly identical... well... no, it's not that easy, I'm afraid :( I'd prefer to keep it rather simple, as the overall code is getting bigger and bigger...
NicDumZ ~ 18:31, 25 July 2008 (UTC)[reply]
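
(Not DumZiBoT's actual code, just a minimal sketch of the behaviour described in the bullets above, assuming the escapeId() routine from Dispenser's snippet is available; the names REF and find_colliding_names are only for this sketch. Named refs are collected, empty names and names containing > or " are skipped, and names are grouped by their anchor-encoded form so that encoded and decoded variants count as the same reference.)

import re

# Matches <ref name="foo">content</ref> and <ref name=foo>content</ref>.
# An empty name never matches, and quoted names containing > or " are
# deliberately left unmatched (cf. the "quote>bug" example), so such refs
# are simply skipped.
REF = re.compile(r'<ref\s+name\s*=\s*(?:"(?P<quoted>[^">]+)"|(?P<bare>[^\s">/]+))\s*>'
                 r'(?P<content>.*?)</ref>', re.IGNORECASE | re.DOTALL)

def find_colliding_names(text):
	"""Group reference names by their anchor-encoded form, so that e.g.
	'bang!' and 'bang.21' end up in the same group and can be merged."""
	groups = {}
	for m in REF.finditer(text):
		name = m.group('quoted') or m.group('bare')
		groups.setdefault(escapeId(name), set()).add(name)
	return dict((k, v) for k, v in groups.items() if len(v) > 1)

# e.g. find_colliding_names('<ref name="bang!">x</ref><ref name=bang.21>x</ref>')
# groups both spellings under the key 'bang.21'.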

I'd love to see this go through a couple thousand pages until it gets perfect, so that I can steal the tried-and-true code for my bot. Seriously though, I advocate its formal approval here. – Quadell (talk) 23:21, 25 July 2008 (UTC)[reply]

I see cautions and ideas above, but no objections. NicDumZ has always been diligent about checking for errors and fixing them promptly. I'm confident that this task will be performed carefully and responsibly. – Quadell (talk) 12:25, 31 July 2008 (UTC)[reply]

{{BAGAssistanceNeeded}}

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. MBisanz talk 12:40, 31 July 2008 (UTC)[reply]
Okay, I already spotted this kind of buggy edit. I'm adding nowiki tags around ref contents to solve the problem :)
NicDumZ ~ 13:16, 1 August 2008 (UTC)[reply]
Also fixed this kind of edit (the URL wasn't parsed correctly because of the spaces in the title) :) NicDumZ ~ 15:41, 1 August 2008 (UTC)[reply]
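
(Purely an illustration of the general pitfall, not the actual fix: in a bracketed external link the URL ends at the first whitespace, so a pattern that greedily captures everything up to the closing bracket swallows the title. A hypothetical pattern along these lines keeps the URL and the title apart.)

import re

# [http://example.org/page Some title, with spaces]
# The URL is everything up to the first whitespace; the rest is the title.
LINK = re.compile(r'\[(?P<url>https?://[^\s\]]+)(?:\s+(?P<title>[^\]]*))?\]')

m = LINK.search('[http://example.org/page Some title, with spaces]')
print(m.group('url'))    # http://example.org/page
print(m.group('title'))  # Some title, with spaces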
Here are 50-ish edits... :) NicDumZ ~ 12:11, 2 August 2008 (UTC)[reply]

 Approved. Looks harmless, useful, and appropriate. (You have the honor of my first RfBA closing.) – Quadell (talk) 13:39, 6 August 2008 (UTC)[reply]

The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.