The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

Operator: Alex Bakharev

Automatic or Manually Assisted: Supervised during Trials, Unsupervised in the future

Programming Language(s): Perl

Function Summary: Patrolling Special:Newpages, populating various new-article lists, populating lists of potentially problematic articles.

Edit period(s) (e.g. Continuous, daily, one time run): ~every 10 minutes

Edit rate requested: 1 per minute maximum

Already has a bot flag (Y/N): N

Function Details: The bot will compare the text of new articles with predefined lists of regular expressions and add each article to the relevant list if it matches.

Discussion

Do you mind sharing the list of regular expressions and what project they would be sorted into? Perhaps in a subpage of your bot's userpage? —Mets501 (talk) 04:38, 27 January 2007 (UTC)[reply]

I certainly want to share the lists of regexps. Actually, I want to make the bot read the regexps from protected subpages of its userpage at every execution of the program, so they would be easy to fix. For Russia-related new articles it would look like { /\srussia/ /\srus'?\s/ /moscow/ /petersburg/ /leningrad/ /novgorod/ ...}. I will do a case-insensitive search; the \s in front of russia is there to separate it from Prussia or Belorussia.
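A rough illustration of the matching scheme described above (in Python rather than the bot's Perl; the rule list is abridged from the example, and the function name is invented):

```python
import re

# Hypothetical rule list for one project, mirroring the patterns quoted
# above; the real bot reads its rules from protected wiki subpages.
RUSSIA_RULES = [r"\srussia", r"\srus'?\s", r"moscow", r"petersburg",
                r"leningrad", r"novgorod"]

def matches_project(text, rules):
    """Return True if any rule matches, case-insensitively."""
    return any(re.search(p, text, re.IGNORECASE) for p in rules)

# The leading \s keeps "Prussia" and "Belorussia" from matching "russia".
assert matches_project("Born in Russia in 1900", RUSSIA_RULES)
assert not matches_project("The Kingdom of Prussia", RUSSIA_RULES)
```

Note the stated trade-off: the `\s` boundary also means a text that *begins* with "Russia" would not match that particular rule, which is one source of the false negatives mentioned below.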

I do not want to write to the new-article boards directly, but rather to a subpage; the text could then be cut-n-pasted to the main board. After a series of false positives or false negatives, I expect a board to review its regexps. Alex Bakharev 07:54, 27 January 2007 (UTC)[reply]

I'll take a more extensive look through the list later, but for now I noticed that you should probably remove Samara, see Samara for the other uses, many outside Russia. —Mets501 (talk) 14:36, 27 January 2007 (UTC)[reply]

Good job Alex. Others may want to see the request for such a bot and the general approval for it.-- Piotr Konieczny aka Prokonsul Piotrus | talk  19:02, 27 January 2007 (UTC)[reply]

I agree: good job. Since this isn't a very simple task, I'll wait for input from another member of the approvals group before giving the go-ahead for a trial. —Mets501 (talk) 01:41, 28 January 2007 (UTC)[reply]
Just to clarify: when you say this 'bot will be doing newpage-patrolling, do you mean that it'd periodically poll Special:Newpages (how often, btw -- is that the ten-minute period?) and do a page-get of every new page, continuously? That's probably not too big a hit, since these will be getting pretty heavily trawled by vandalism patrollers anyway, and will certainly hit a cached page for that reason, but I'd be worried if we ended up with one of these per wikiproject... (Ideally a single bot could do them all at once, if properly co-ordinated and integrated.) Alai 07:07, 28 January 2007 (UTC)[reply]
Yes, it gets each article once and tries to apply all the possible rules for all the projects. It goes down the listing until the creation-time difference from the first article is more than a specified time (say 15 minutes, if the run frequency is 10 minutes). Yes, I would try to handle all the projects at once. Alex Bakharev 07:56, 28 January 2007 (UTC)[reply]
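The windowing scheme described here (scan the newest-first listing until entries fall outside a window slightly longer than the run interval, so consecutive runs overlap) can be sketched as follows; Python and the names are illustrative, not the bot's Perl:

```python
from datetime import datetime, timedelta

def select_new_pages(entries, now, window_minutes=15):
    """Keep entries created within the window. The window is made a bit
    longer than the 10-minute run interval so consecutive runs overlap
    and no page slips between them. `entries` is a list of
    (title, creation_time) pairs, newest first, as the listing gives them."""
    cutoff = now - timedelta(minutes=window_minutes)
    selected = []
    for title, created in entries:
        if created < cutoff:
            break  # everything further down is older still; stop scanning
        selected.append(title)
    return selected

now = datetime(2007, 1, 28, 12, 0)
entries = [("A", datetime(2007, 1, 28, 11, 58)),
           ("B", datetime(2007, 1, 28, 11, 50)),
           ("C", datetime(2007, 1, 28, 11, 40))]
assert select_new_pages(entries, now) == ["A", "B"]
```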
OK, seems sensible. Does anyone have a back-of-the-envelope as to what the new article rate is? IIRC, there was at one point an intended bot-read-throttle of one per second. One possible tweak might be to throttle this more at peak times, or stop it completely, and clear the "backlog" during off-peak, but that would be significantly more of a pain in the neck to implement... Alai 08:08, 28 January 2007 (UTC)[reply]
We have around 3000 new articles per day, that is on average 2 articles a minute. This is not evenly distributed in time (there are peaks and dips). I have experimented with processing to a depth of ~5K articles. It requires ~1h from home (mostly limited by the connection speed). Maybe that is the way to go - just run the bot daily. Alex Bakharev 02:25, 30 January 2007 (UTC)[reply]
Yeah, that would probably work better. —Mets501 (talk) 04:33, 30 January 2007 (UTC)[reply]
That sounds like a good idea. 3000 article reads per day is by no means a server-buster, obviously, but the load-spreading seems a nice idea, since this isn't a time-critical task, and may even produce better results: you may get more useful results from the several-hours-old version of the article than when it's freshly minted. Thanks for responding so readily to suggestions. Alai 09:11, 31 January 2007 (UTC)[reply]
The bot could collect a list of 50 new pages or so, and then get them all at once using Special:Export which would probably save quite a lot of DB queries and make it much more efficient. If I remember correctly, Special:Export only does one query to get all the pages, instead of 50+ queries if each page is retrieved individually. Jayden54 23:11, 28 January 2007 (UTC)[reply]
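The batching suggestion might look roughly like this; Special:Export accepts a newline-separated `pages` parameter in a single POST, and `curonly=1` requests only the current revision. This is a Python sketch (the bot itself is Perl) that only builds the request bodies, without performing the network calls:

```python
from urllib.parse import urlencode

EXPORT_URL = "https://en.wikipedia.org/wiki/Special:Export"

def export_payloads(titles, batch=50):
    """Yield one Special:Export POST body per batch of up to `batch` titles,
    so 50 page texts come back from a single request instead of 50 fetches."""
    for i in range(0, len(titles), batch):
        chunk = titles[i:i + batch]
        yield urlencode({"pages": "\n".join(chunk), "curonly": "1"})

payloads = list(export_payloads(["Moscow", "Novgorod", "Samara"], batch=2))
assert len(payloads) == 2  # three titles, batches of two
```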
I will try to implement Jayden's idea. It sounds sensible. Alex Bakharev 02:25, 30 January 2007 (UTC)[reply]

Good job! Thanks. Is it possible to make the bot work with links and categories? Many Russia-related articles (and almost no unrelated articles) contain the "in Russian:" or "lang-ru" sequence. Besides, a link from top-importance Russia-related articles (Putin, Gazprom, Politics of Russia, Energy policy of Russia, History of Russia, Second Chechen War etc.) could also be an additional criterion. Such articles usually refer to many unrelated entries as well, but these (such as entries about foreign countries and celebrities) were usually created a long time ago and wouldn't interfere here. By the way, it would be worth adding some more keywords related to Russian politics and economy. As to important toponymic keywords such as moscow and russia, I guess they create too much noise: we have plenty of hardly relevant articles that mention Moscow and Russia. Colchicum 16:19, 2 February 2007 (UTC)[reply]

I will think about it. Many new articles written by newbies are uncategorized and do not have proper links. Most people who are good with wiki syntax announce their articles themselves; I was mostly after the new articles written by clueless newbies, where a short review by people who know the subject could greatly improve the quality. Alex Bakharev 04:55, 4 February 2007 (UTC)[reply]
So, are you ready for a trial? Have you implemented the above features? —METS501 (talk) 19:08, 10 February 2007 (UTC)[reply]
I am running trials populating 8 different lists. I am running the bot daily, supervised. Alex Bakharev 05:13, 12 February 2007 (UTC)[reply]
OK, no problem. When you're done with trials post back here with diffs. —METS501 (talk) 19:49, 12 February 2007 (UTC)[reply]

The bot is doing what it is supposed to do, providing 21 New Article feeds to different projects. See its contributions. What is the next step? Alex Bakharev 09:10, 11 March 2007 (UTC)[reply]

I am also thinking about adding a new function to the bot: minimal maintenance of the new articles (since it reads them anyway):

OK, Approved for trial. Please provide a link to the relevant contributions and/or diffs when the trial is complete. Try out some of those new tasks, and check for false positives. You may want to check for ==External link==, ==Weblink==, ==Weblinks==, and the cite templates (such as {{cite web}}). —METS501 (talk) 16:02, 17 March 2007 (UTC)[reply]

This bot was just reported on AIV [1] ST47Talk 14:31, 18 March 2007 (UTC)[reply]

It appears that it's not decoding HTML entities, producing
[[Mr. Country &amp; Western Music]]
instead of [[Mr. Country & Western Music]]. ST47Talk 14:32, 18 March 2007 (UTC)[reply]
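The fix presumably amounts to decoding HTML entities in titles scraped from the listing before writing wikilinks; the exact bug is not fully recoverable from the archive, so this Python sketch (the bot is Perl, and the function name is invented) just shows the idea:

```python
import html

def title_from_listing(raw_title):
    """Titles scraped from an HTML page arrive entity-encoded; decode
    them before writing wikilinks, otherwise '&' comes out as the
    literal text '&amp;' inside the [[...]] link."""
    return "[[%s]]" % html.unescape(raw_title)

assert title_from_listing("Mr. Country &amp; Western Music") == \
       "[[Mr. Country & Western Music]]"
```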
I have fixed it Alex Bakharev 01:36, 21 March 2007 (UTC)[reply]
OK, feel free to continue the trial. —METS501 (talk) 03:05, 23 March 2007 (UTC)[reply]

I have an objection to giving AlexNewArtBot full approval at this stage. My main issue is with the marking of articles as stubs: currently it is just using {{stub}}, which is irritating when trying to fix actual human errors. At the time of writing, there are 171 unsorted stubs, 12 of which have been added within the last 18 minutes.

The other is that the bot is hiding articles that should never have been created, by placing templates like Unreferenced, Wikify and Uncategorized on pages like Runescape Beginner guide, Scott Wilton, Ronnie Pearson and Roosa Kettunen, which are a mixture of unencyclopedic material, junk, non-notable people, etc. (most, I'd assume, would easily be put up for deletion).

All in all, I can see the bot as useful, but I must question whether it's such a great idea to jump right on the heels of new articles. Maybe apply the templates after a day or two, but not within hours, to give editors a chance to get an article started. I'd support giving bot status if the issues (especially with stubs) could be sorted out --NigelJ talk 11:32, 24 March 2007 (UTC)[reply]

I see what you mean, but I would rather start collaborative work as early as possible. If the user realized that the article needs to be referenced, [s]he would add the references straight away; in a few days they might completely forget where the info came from. If the categories are correctly identified, then there is a chance that the article already exists and we only create a redirect, or merge the articles; thus, resources are not wasted on writing redundant texts. Maybe I should introduce a short delay (say 1 hour) so as not to rush with the {{stub}} for articles 15 minutes old. Alex Bakharev 13:54, 24 March 2007 (UTC)[reply]
I still think it's incorrect to add the {{stub}} template to articles, because in my opinion it just creates more work for people in WikiProject Stub sorting (and other people that fix them when they see them). While I agree that some people may forget about an article they have created, instead of polluting the article with templates right away, why not place a notice on the user's talk page, something to the effect of "Hi, I'm x bot. I have noticed you created a new article at pagename, but it needs some more work; in particular, the article still needs to be wikified and referenced. In one day I shall include the appropriate templates on the article page, so you have time to finish working." It could be improved, but my reasoning is that a user may still log in after creating an article and notice the "You have new messages" banner; they may not realise the article they created still needs urgent work. --NigelJ talk 22:36, 24 March 2007 (UTC)[reply]
Another article I've noticed is Devika Chawla, which according to AlexNewArtBot may contain a conflict of interest. I thought that for the inclusion of templates like this, it was general practice to give a reasoning (just as a reason is expected for AfDs etc.), otherwise we don't know what we are discussing. To me, it's branching out from the Function Summary a bit too much; creating a list of problematic pages is far different from including every template under the sun and then forgetting to add a reason. --NigelJ talk 12:09, 24 March 2007 (UTC)[reply]
The article Devika Chawla is about the singer Devika Chawla Kaushish and was written by User:Kaushish, so it is most probably a Wikipedia:AUTO. The bot inserts {{COI}} if the username is included in the article and does not appear to be part of a signature. It is indeed risky: if a User:Texas were to write an article mentioning the word Texas, the bot might report it. Maybe the bot should indeed leave a notice on the talk page saying that the identification was performed by a bot, and that if it is not accurate, please just remove the notice. Alex Bakharev 13:00, 24 March 2007 (UTC)[reply]
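The heuristic described here might be sketched as below (Python, illustrative only; the signature-stripping rule is a guess at what the real Perl bot does, and the function name is invented):

```python
import re

def possible_autobiography(username, wikitext):
    """Flag a page when the creator's username occurs in the article
    body outside of signatures. Signatures are approximated here as
    [[User:Name]] / [[User talk:Name]] links, a rough stand-in for
    whatever the real bot strips before matching."""
    stripped = re.sub(r"\[\[User(?: talk)?:[^\]]*\]\]", "", wikitext,
                      flags=re.IGNORECASE)
    return username.lower() in stripped.lower()

# Matches the Devika Chawla case: creator's name in the body text.
assert possible_autobiography("Kaushish",
                              "Devika Chawla Kaushish is a singer.")
# A bare signature link alone does not trigger the flag.
assert not possible_autobiography("Kaushish", "A singer. [[User:Kaushish]]")
```

The User:Texas example in the comment above shows exactly where this substring test produces false positives, which is why the output went to a noticeboard list for manual review rather than straight to a {{COI}} tag.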
I'm fine if the bot was to place a note on the talk page explaining why; I'm sure your bot has specific criteria, so it can't be that hard to generate a logical sentence or two out of what criteria matched. --NigelJ talk 22:36, 24 March 2007 (UTC)[reply]
There is always a problem with over-tagging pages, and a line has to be drawn somewhere. I'm sure you'll agree this is far too many templates! Perhaps it would make more sense, as NigelJ mentions above, to create a list of problems on the article talk page, add one template to the article page itself indicating that problems have been identified and noted on the talk page, and leave the creator a message. You could also put the page into the appropriate cleanup categories, without the tags. Martinp23 23:03, 24 March 2007 (UTC)[reply]
I have done the following: disabled tagging with {{COI}} - it is too serious a tag to be handled by a bot. Instead the bot produces a list of possible autobiographies that is transcluded to Wikipedia:Conflict of interest/Noticeboard for manual analysis. I have stopped tagging stubs with {{catneeded}} and {{unreferenced}}. If a stub tagged with {{stub}} belongs to a project, the bot changes the stub to the corresponding project's stub - this slightly simplifies work on stub sorting. Alex Bakharev 00:22, 26 March 2007 (UTC)[reply]
Another issue has been not using some templates correctly; see [2], where another bot has had to edit the templates. I'm aware that it's dated the 19th/20th and is quite likely fixed, but there have been points where half the generic stubs I came across had edits from both bots. --NigelJ talk 20:25, 27 March 2007 (UTC)[reply]

Arbitrary section break (AlexNewArtBot)

I can see two clearly divisible tasks here: the page listing vs the page patrolling. The listing stage is probably just about ready for approval, with some more discussion. For now, I'm going to ask you to stop the NP patrolling trials until the issues here are fully ironed out. However, I would like you to (continue to?) run trials for the page listing so we can get an idea of how it works over a longer period of time. May I remind you to keep edit rates by the bot below 2 per minute when on trial, to avoid flooding recent changes - looking through the recent bot contribs, there are several occasions where it has gone nearly 3 times over this limit.

On the matter of the patrolling bot itself - I think that this should be run as a separate task (of sorts) on all new pages. This way, you can retrieve the list of new pages with a certain time offset (24 hours) and then add any required templates, rather than leaping on the article as soon as it is born :). What are your thoughts (and do any other watchers have any input to offer)? Martinp23 15:23, 27 March 2007 (UTC)[reply]

By the by - I do like the new stub tagging on the basis of subject (where determined) - a great improvement on earlier :). If the bot can store lists for itself (in either a database or text file or something) of new pages listed as belonging to a project, it can use this data in the later (proposed by me above) NPPatrolling stage to add the appropriate stub tag(s). The reason for my concern on this issue is that people often build an article over several revisions, and a flood of templates could easily put them off. Martinp23 17:33, 27 March 2007 (UTC)[reply]

Proposal To Approve Bot

Personally my proposal is:

I think these conditions are reasonably fair: they still allow the bot to do what has been applied for, and the 2 months can be considered either an extended trial or full bot status with restrictions (and hence getting the bot flag). Any thoughts? --NigelJ talk 20:25, 27 March 2007 (UTC)[reply]

I'd be fairly happy to approve the listing task soon, and a bot flag would be given for that task. Until we have this NPP task narrowed down, I wouldn't want to approve it, and even then we'd probably like to see extended trials and further community input based on its scale. I realise that it's inconvenient, but at least we'll be able to get the first part of the bot up and running while the discussion re: the NPP continues. Thanks, Martinp23 20:35, 27 March 2007 (UTC)[reply]
Hi Martin, yes, I generally agree; I'm assuming NPP is New Page Patrol? On a side note, to further back up the points in my suggestion, check out edits like [3] (not only did the bot have to edit twice to place a sorted stub category, it placed a totally unrelated template; yes, okay, Cherry Gardens, I can understand, but the article was about a road, and human NPP can better cope with this) and [4] (once again, 2x edits, possibly an incorrect edit). The bot also has a quite large talk page for a bot still in its trial period; most of the sections relate to the tagging of pages, headings in particular: "Strange?", "Your bot isn't working properly", "Blocked...", "Cleanup Tags", "Unreferenced?", "Dated Categories" (and the others at the bottom of the page). The messages started on the 18th, which happens to be when the bot started tagging articles, a day after the bot was approved for trial, and the bot has been posted at least once on AIV (see above, but link reposted here). Another note: what's with the bot blanking its log pages? (See User:AlexNewArtBot/ChinaLog and User:AlexNewArtBot/NZLog for examples.) On a side note, it may sound like I've got something against Alex; that isn't the case. I just think a bot that tags pages falsely needs to be confined quite tightly in what it can and can't do. The log pages (when filled) should have a great deal of useful information that can be parsed by tools for human NPP, and that's why I'm in the end supporting the bot flag. --NigelJ talk SIMPLE 00:44, 28 March 2007 (UTC)[reply]
Well, if there is a consensus that the bot should not tag articles, then I could disable it; it is just a matter of putting one comment sign in the script. I agree that the bot often determines the stub types wrongly; the frequency would decrease with more feeds (the bot puts the most suitable stub type it is aware of: it knows about Gardening but not Roads, thus the article about the Garden Road received the Gardening stub). On the other hand, labelling with {{Catneeded}}, {{Wikify}} and {{Unreferenced}} is almost never done wrongly and, I believe, encourages people to reference, categorize and wikify. Regarding logs - they are quite long; if not trimmed daily they soon grow unmanageable. That is why I keep only the most recent data in the current version. All old entries are still available via the history.
Since stub tagging seems to be controversial, I propose to disable it and only allow:
I will make a point of saying that the bot's talk page contains a few editors confused by those tags as well (especially under the headers I pointed out earlier). However, I think it's time that the BAG are left to make a fair judgement of all the facts, and I will now leave them to it. --NigelJ talk 03:20, 1 April 2007 (UTC)[reply]

The bot was blocked by User:MartinP with the explanation at User talk:AlexNewArtBot#Blocked. Sorry for being dense, but can you point out where the bot was discontinued from the trials of NPP? I was under the impression that my proposal for limiting the bot's functionality was not opposed by anybody. I will discontinue the bot's editing in the article space until I get some hints on the matter. Alex Bakharev 00:48, 4 April 2007 (UTC)[reply]


Neither proposal was accepted by the BAG; that's why they are named proposals, and also why I said "However, I think it's time that the BAG are left to make a fair judgement of all the facts, and will now leave them to it. --NigelJ talk 03:20, 1 April 2007" above. --NigelJ talk SIMPLE 05:11, 4 April 2007 (UTC)[reply]
Hi - the part where I specifically discontinued the trials can be found in my comment just below the section break: "For now, I'm going to ask you to stop the NP patrolling trials until issues here are fully ironed out. However, I would like you to (continue to?) run trials for the page listing so we can get an idea of how it works over a longer period of time." As the trial for NPP wasn't re-approved, it shouldn't have been running. Sorry for any confusion. Martinp23 09:35, 4 April 2007 (UTC)[reply]

Back to the listing proposal - everything looks fine in the trials and the bot seems to be working well. As an idea - would it be possible to put notes (or code letters) next to listed articles to indicate issues with them (e.g. CN posted after the article name, with a key giving the meaning "cat needed")? Of course, if code letters were to be used, a key of sorts would be required. Thoughts? Martinp23 09:46, 4 April 2007 (UTC)[reply]
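The code-letter idea could look something like this (a Python sketch; the letters, the key, and the function name are all invented for illustration):

```python
# Hypothetical key mapping code letters to their meanings, to be
# published once alongside the lists.
KEY = {"CN": "category needed", "REF": "unreferenced",
       "WFY": "needs wikifying"}

def list_entry(title, flags):
    """Build one bulleted list line, appending any known code letters
    after the article link."""
    codes = " ".join(sorted(f for f in flags if f in KEY))
    return "* [[%s]]%s" % (title, " (%s)" % codes if codes else "")

assert list_entry("Samara", ["REF", "CN"]) == "* [[Samara]] (CN REF)"
assert list_entry("Moscow", []) == "* [[Moscow]]"
```

As the reply below notes, the catch is staleness: the letters record the state of the article at the moment the bot read it, not its current state.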

It is easy to implement, but I am not sure if there is value in it. Suppose an article is labelled "Needs: Category, References and Removing signatures". That was true at the moment the bot read the article; it might not be true half a day later, as somebody can insert a category, references, etc. It is reasonable to expect people to remove maintenance tags after they fix the problem; it is unreasonable to expect them to check "What links here" and remove the maintenance label from a project list. Do I have to periodically run a separate script that would check all the projects' lists to see whether the tags are still accurate? It is doable but requires work. Is it worth it? The bot is currently generating a list of possible COIs, but that is different: if the original author has a conflict of interest, it is unlikely to disappear in a few hours. The other issue is that big ugly blue tags encourage people to fix the problem; the lists are less visible and encourage much less. I was thinking about putting notices on the author's talk page, like "Your article needs categories. Please read Help:Category and Wikipedia:Categorization FAQ and add some" or "You have put your signature in an article. All Wikipedia works are collective creations, so please remove the signature". It might be helpful for newbies but quite patronizing and embarrassing for experienced users, so I really do not know what to do. Maybe I should start a discussion on the Village Pump or somewhere? Alex Bakharev 08:23, 5 April 2007 (UTC)[reply]
I think it would be better to mark whether an article has been listed because of its categories or due to some other rules only. Colchicum 10:41, 5 April 2007 (UTC)[reply]
Yes - it would be helpful to have community input - make a note on the VP, pointing here, and we can take it from there. I'm sure that it has been mentioned above, but a good system may be to have one big notice on the article saying "A bot has detected a number of problems with this article. See the talk page for details", and list the problems on the talk page, with instructions. The appropriate categories for cleanup etc. can also be added to the article, but without the big templates. Thanks, Martinp23 10:46, 5 April 2007 (UTC)[reply]
If the result is going to be to take away mainspace page editing, would the bot be best off outputting to a toolserver account? That would allow people to look for specific things, by allowing Alex to create a PHP/Perl backend to support more specific queries, whereas bot output to user subpages would be generally unsortable. Just a thought. --NigelJ talk SIMPLE 13:47, 5 April 2007 (UTC)[reply]
It looks like an interesting idea. I am not very familiar with the possibilities of the toolserver, but at the very least the log files are better kept off the main wiki storage. They are big and should not be kept forever; on the other hand, they are to be shared with a few users. Alex Bakharev 07:27, 6 April 2007 (UTC)[reply]
Yes - sounds good, but it is out of the control of the BAG! You need to go through the proper toolserver approval process, or ask someone who already has an account to host the tool(s) for you. After listing your name on the requests page, it usually takes about a month before you are given the account. It's worth noting that toolserver bots can edit, and edit the mainspace, through the normal web interface (all of my bots are on toolserv), so you could put the bot there in any case if required. Anyway - as I said, we can't control toolserver applications, nor do we need to approve any data access methods you implement (they are all up to you) - we are only bothered by edits and, to an extent, server load from page fetches. Thanks, Martinp23 09:22, 6 April 2007 (UTC)[reply]

 Approved. I am approving the logging function of the bot here, and archiving the request. If you wish to proceed with the NPPatrol section, please create a new sub-task request. It would be good to get community input for this function first, and (ideally) have the logging part of the bot on toolserver, to make the data more easily accessible (searches, etc..). Until the bot gets its flag, please keep the edit rate below 2 per minute. Martinp23 12:27, 6 April 2007 (UTC)[reply]


The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.