The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.

Operator: Pkbwcgs

Time filed: 13:10, Monday, December 24, 2018 (UTC)

Function overview: The bot will fix a range of Unicode control characters in articles. This is CW Error #16.

Automatic, Supervised, or Manual: Supervised

Programming language(s): AWB

Source code available: AWB

Links to relevant discussions (where appropriate):

Edit period(s): Five times a week

Estimated number of pages affected: 100-250 at a time

Namespace(s): Mainspace

Exclusion compliant (Yes/No): Yes

Function details: This is an extension to Task 1 as I am already fixing Unicode Control Characters there. However, this task does more fixes to error 16 and fixes a range of Unicode control characters that WPCleaner can't fix. The following will be removed:

The following will be turned into a space:

The bot will use RegEx. General fixes will be switched on, but typo fixing will be turned off as it is not required for this task.
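The two-step replacement described above (remove some characters, turn others into a space) could be sketched roughly as follows. This is only an illustration in Python, not the bot's actual AWB rules: the lists of affected characters are not reproduced in this archive, so the two character sets below are assumptions. U+2005 and U+2008 are taken from the examples in the discussion; the zero-width and directional characters are hypothetical stand-ins.

```python
import re

# ASSUMPTION: illustrative character sets only; the task's real lists
# are not reproduced in this archive.
# Hypothetical "remove" set: zero-width space, LRM, RLM, BOM.
REMOVE = re.compile("[\u200b\u200e\u200f\ufeff]")
# "Turn into a space" set: four-per-em space and punctuation space,
# the two characters mentioned in the discussion.
TO_SPACE = re.compile("[\u2005\u2008]")


def clean_control_chars(text: str) -> str:
    """Delete invisible characters, then normalise fancy spaces to U+0020."""
    text = REMOVE.sub("", text)
    return TO_SPACE.sub(" ", text)
```

Calling `clean_control_chars` on a citation string would leave ordinary text untouched and only rewrite the listed code points. U+00A0 is deliberately left out of both sets here, since its handling is disputed in the discussion.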

Discussion

I'm not sure about some of these. In particular, U+00AD may have been added by editors to specify the proper place for long words to be broken, and U+00A0 should more likely be turned into the &nbsp; entity than changed into U+0020. The same might apply to the other space characters; editors may have specifically used these in preference to U+0020. Anomie 17:06, 24 December 2018 (UTC)

@Anomie: After going through the WP:WCW list, there are no instances of U+00AD anywhere. However, if it does come up, then I will replace it with a hyphen. U+00A0 takes up more bytes than a regular space (U+0020) so it is easier to leave a space. The other space characters can be safely replaced as they are unnecessary and they mostly come up in citations. See 1, which takes out U+2005 (four-per-em space); 2, which takes out U+2008 (punctuation space); 3, which takes out U+2005 again; 4, which takes out U+2008 again; and 5, which also takes out U+2008. All of these occurred inside citations. Pkbwcgs (talk) 17:43, 24 December 2018 (UTC)
Replacing U+00AD with a hyphen would not be correct either. You'd want to replace it with {{shy}} or the like. For NBSP, "takes up more bytes" is a very poor argument, and replacing it with a plain space could break situations described at MOS:NBSP. A figure space might be intentionally used to make columns of numbers line up properly where U+0020 would be a different width, and so on. I don't object to fixing things where specific fancy spaces don't make a difference, but you're arguing that they're never appropriate and that strikes me as unlikely. Anomie 17:55, 24 December 2018 (UTC)
@Anomie: There are no cases of U+00AD so the bot doesn't need to handle that. In terms of U+00A0, I will make sure my RegEx replaces the cases described at MOS:NBSP with &nbsp; or otherwise skips them. Pkbwcgs (talk) 18:04, 24 December 2018 (UTC)
If you're not intending to handle U+00AD after all, you should remove mention of U+00AD from the task entirely. (I see you struck it) As for "the cases described", good luck in managing to identify every variation of those cases. It would probably be better to just make that part of the task be manually evaluated rather than "always replace". Anomie 18:09, 24 December 2018 (UTC)
@Anomie: The bot will still strip U+00A0 in wikilinks because replacing them with &nbsp; is not going to work. Pkbwcgs (talk) 18:15, 24 December 2018 (UTC)
Replacing the cases stated at MOS:NBSP is trickier than I thought so I am going to skip those cases manually. This task is supervised. Pkbwcgs (talk) 18:20, 24 December 2018 (UTC)
{{BAG assistance needed}} I have made some amendments to this task, including reducing the edit period to five times a week and adding general fixes so that the removal of Unicode control characters and general fixes can be combined. I have also specified that non-breaking spaces will not be removed in the cases described at MOS:NBSP, and the bot will replace those cases with "&nbsp;" through the general fixes. Pkbwcgs (talk) 20:10, 17 January 2019 (UTC)
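The split discussed in this thread (U+00A0 is changed automatically only inside wikilinks, and every other occurrence is left for the operator to check against MOS:NBSP) could look roughly like the Python sketch below. It is an illustration, not the bot's AWB configuration. The function name and the simplified wikilink pattern are hypothetical, and reading "strip" as "replace with a plain space" is an assumption.

```python
import re

NBSP = "\u00a0"
# Simplified, hypothetical pattern: matches [[...]] without nested brackets.
WIKILINK = re.compile(r"\[\[[^\[\]]*\]\]")


def fix_nbsp_in_wikilinks(text: str) -> tuple[str, bool]:
    """Replace U+00A0 with U+0020 inside wikilinks only.

    Returns the edited text and a flag telling the operator whether
    any U+00A0 remains elsewhere and needs a manual decision.
    """
    fixed = WIKILINK.sub(lambda m: m.group(0).replace(NBSP, " "), text)
    needs_review = NBSP in fixed
    return fixed, needs_review
```

Returning a review flag rather than rewriting the remaining occurrences keeps the decision with the supervising operator, in line with the suggestion that this part of the task be evaluated manually.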

Approved. I concur with the edit summary tweak - no point in putting in the "replacement" field when it's all unicode whitespace. Primefac (talk) 13:55, 7 April 2019 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.