The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Operator: FlagSteward (talk)
Automatic or Manually Assisted:

Supervised automatic

Programming Language(s):

PHP, Kingbot+AWB

Function Summary:

Automated assessment of article class (Stub or Start) for Projects. The main target is Country Projects where there are a lot of transwiki'd, Unassessed "town" articles. Where it's possible to extract a population size from an infobox, the bot will in many cases also assign Importance based on population. There is more description on the Talk page, although the numbers have been updated.

Task 2: In support of this main objective, I also request the ability to transwiki infoboxes automagically, as many Italian comuni articles are currently missing one.
Edit period(s) (e.g. Continuous, daily, one time run):

Typically run once per Project that wants it.

Edit rate requested: X edits per TIME

12 per minute

Already has a bot flag (Y/N): N/A

Function Details:

Overview

It scans the lists of articles covered by a project and opens each of them in turn, at intervals of more than 1.0 second, using Special:Export. It extracts various statistics, such as the number of <ref> tags and the number of images, and also which infoboxes an article uses. If the Template:Infobox CityIT box is present, the article goes on to the next stage.
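The statistics step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the bot's actual PHP code: the function name `article_stats`, the regular expressions, and the `SAMPLE` text are all assumptions, and it works on raw wikitext rather than on the XML that Special:Export returns.

```python
import re

def article_stats(wikitext: str) -> dict:
    """Collect rough per-article statistics from raw wikitext (hypothetical helper)."""
    # Opening <ref> tags, including named and self-closing ones;
    # <references/> and </ref> are not counted.
    refs = len(re.findall(r"<ref[\s>/]", wikitext, flags=re.IGNORECASE))
    # Image links such as [[Image:Foo.jpg|thumb]] or [[File:Foo.png]].
    images = len(re.findall(r"\[\[\s*(?:Image|File)\s*:", wikitext, flags=re.IGNORECASE))
    # Section headers: lines such as "== History ==".
    headers = len(re.findall(r"^=+[^=\n].*?=+[ \t]*$", wikitext, flags=re.MULTILINE))
    # Names of infobox templates, e.g. "Infobox CityIT".
    infoboxes = [
        name.strip()
        for name in re.findall(r"\{\{\s*(Infobox[^|}\n]*)", wikitext, flags=re.IGNORECASE)
    ]
    return {
        "refs": refs,
        "images": images,
        "headers": headers,
        "infoboxes": infoboxes,
        "has_cityit": any(name.lower() == "infobox cityit" for name in infoboxes),
    }

# Made-up article text, used only to exercise the helper.
SAMPLE = """{{Infobox CityIT
|name = Example
|population = 12345
}}
'''Example''' is a town.<ref>Source A</ref><ref name="b" />
[[Image:Example.jpg|thumb]]
== History ==
Text.
<references/>
"""

stats = article_stats(SAMPLE)
```

Only articles for which `has_cityit` is true would go on to the importance stage in this sketch.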

Assessing importance

The bot extracts the population of a commune if possible, and proposes the following assessment of importance:

Low: < 15000 (and < 2 JPG images in the article)
Mid: > 25000, < 75000
High: > 125000
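As a rough illustration, the bands above could be written as a small function. This is a Python sketch rather than the bot's PHP; the name `propose_importance` and its parameters are hypothetical, and leaving populations that fall between the bands unassessed is my reading of the gaps in the list, not something the list states outright.

```python
from typing import Optional

def propose_importance(population: Optional[int], jpg_images: int = 0) -> Optional[str]:
    """Propose an importance rating from population, or None if no band applies."""
    if population is None:
        # No population could be extracted from the infobox.
        return None
    if population > 125_000:
        return "High"
    if 25_000 < population < 75_000:
        return "Mid"
    if population < 15_000 and jpg_images < 2:
        return "Low"
    # Assumption: populations in the gaps (15000-25000, 75000-125000), or small
    # places with 2 or more JPG images, are left for a human to assess.
    return None
```

For example, `propose_importance(50_000)` gives `"Mid"`, while `propose_importance(20_000)` gives `None` and the article is skipped.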

Assessing class

Class is based primarily on the length of the article. In the case of Italian comuni, many have a 2kb demographic timeline which doesn't affect assessment, so the length of this is calculated and deducted from the length of the article. Classes are as follows:

Stub
Length - timeline < 2500
OR Length - timeline < 3200, <3 headers, <2 refs, <2 images
Start
3000 < Length - timeline < 12000
OR 1800 < Length - timeline < 12000, >3 headers and at least 1 ref and 1 image outside an infobox, or (images + headers) > 2
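One possible reading of these tests, sketched in Python rather than the bot's PHP. Several things here are assumptions: that the demographic timeline is a `<timeline>...</timeline>` block, the way the second Start test is grouped, and the tie-break (the class with more matching tests wins, with Start winning ties, which is a guess based on the Avezzano remark in the discussion). The function and parameter names are hypothetical.

```python
import re
from typing import Optional

def timeline_length(wikitext: str) -> int:
    """Total characters taken up by <timeline> blocks (assumed markup)."""
    blocks = re.findall(r"<timeline>.*?</timeline>", wikitext, flags=re.DOTALL | re.IGNORECASE)
    return sum(len(block) for block in blocks)

def propose_class(net_length: int, headers: int, refs: int,
                  images: int, images_outside_infobox: int) -> Optional[str]:
    """Propose Stub or Start; net_length is article length minus timeline length."""
    n = net_length
    stub_tests = [
        n < 2500,
        n < 3200 and headers < 3 and refs < 2 and images < 2,
    ]
    start_tests = [
        3000 < n < 12000,
        # Assumed grouping of the second Start test.
        1800 < n < 12000 and (
            (headers > 3 and refs >= 1 and images_outside_infobox >= 1)
            or images + headers > 2
        ),
    ]
    stub_hits, start_hits = sum(stub_tests), sum(start_tests)
    if stub_hits == 0 and start_hits == 0:
        return None  # no test matched: leave unassessed
    # Assumed tie-break: Start wins when it matches at least as many tests as Stub.
    return "Start" if start_hits >= stub_hits else "Stub"
```

In use, `net_length` would be computed as `len(wikitext) - timeline_length(wikitext)` before calling `propose_class`.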

These categories may vary depending on the Project - for instance, the France Project assesses all towns of more than 100,000 people as High by definition.

Applying assessments

Once generated, the list of assessments is examined by me and any obvious tweaks are applied. The list is then fed into Kingbot and AWB. The Italy Project has about 1500 articles to do; some of the other Projects may have up to 5000.

CityIT templates

In support of the primary task, the bot may extract Comune infoboxes from the Italian versions of articles, clean them up and translate e.g. months, and use subst:Comune to apply a CityIT infobox to the English article.
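The month translation mentioned above might look something like this. It is a Python sketch only (the bot itself is PHP); the helper name `translate_months` is hypothetical, and which infobox fields actually contain dates is not stated here, so the function simply works on any string.

```python
import re

# Italian month names mapped to English.
IT_TO_EN_MONTHS = {
    "gennaio": "January", "febbraio": "February", "marzo": "March",
    "aprile": "April", "maggio": "May", "giugno": "June",
    "luglio": "July", "agosto": "August", "settembre": "September",
    "ottobre": "October", "novembre": "November", "dicembre": "December",
}

# Match whole words only, in any capitalisation.
_MONTH_RE = re.compile(r"\b(" + "|".join(IT_TO_EN_MONTHS) + r")\b", flags=re.IGNORECASE)

def translate_months(text: str) -> str:
    """Replace Italian month names in text with their English equivalents."""
    return _MONTH_RE.sub(lambda m: IT_TO_EN_MONTHS[m.group(1).lower()], text)
```

For example, `translate_months("15 agosto")` returns `"15 August"`; other words are left untouched.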

Accuracy

The nice thing about this approach is that you can see how it handles articles that have already been assessed manually. On the Italy Project there are 2529 assessed articles with CityIT infoboxes. The bot would assess 1 current High as a Mid (Tivoli, Italy) and 14 current Mids as Lows (Tolfa would be #7 by size); 123 would not be assessed. Given that I would be eyeballing them manually to check for "obvious" misses, and that in any case one would expect that "more-important-than-their-small-population-would-suggest" villages are less likely to have remained unassessed, I think a worst-case false positive rate of 0.6% is pretty acceptable.
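The 0.6% figure can be checked with a few lines of arithmetic, assuming it means the 15 disagreements (1 High to Mid plus 14 Mid to Low) taken over all 2529 assessed articles. The variable names below are only for illustration.

```python
assessed_with_cityit = 2529  # assessed articles with CityIT infoboxes
high_to_mid = 1              # current High that the bot would call Mid
mid_to_low = 14              # current Mids that the bot would call Low
not_assessed = 123           # articles the bot would leave alone

disagreements = high_to_mid + mid_to_low
disagreement_rate = disagreements / assessed_with_cityit
unassessed_share = not_assessed / assessed_with_cityit

print(f"disagreements: {disagreements} ({disagreement_rate:.2%})")
print(f"left unassessed: {not_assessed} ({unassessed_share:.2%})")
```

On these numbers the disagreement rate is about 0.59%, which rounds to the quoted 0.6%, and a little under 5% of the articles would be left unassessed.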

On the class assessments, 30 out of 240 current Starts would be assessed as Stubs, but of those 30, 24 carry Stub tags in the article, and they're all in the fuzzy grey area between Start and Stub - #15 is Mozzanica, to give you an idea. 2206 out of 2277 Stubs would be assessed as such; the rest would be unassessed. 16 current Stubs out of 2277 would be assessed as Starts - again it's that grey area; Artegna is #8 and I'm not too worried about that. 194 Starts would be recognised as such.

So I'm pretty comfortable with the accuracy, particularly since this is not in the main space and assessment is a bit of an art in any case. Even the false positives could easily be assessed the other way by a human assessor.

Discussion

Sounds interesting. The 0.6% "false positive" rate would seem fairly acceptable to me, due to the nature of the task - i.e. it's not going to break anything or cause any major (or minor, really) problems. If people disagree with the tagging, they are, of course, free to change it... Reedy Boy 15:54, 25 February 2008 (UTC)

Only 4 out of 49 fail to be assigned a Class - Alanno, Casoli, Popoli and Silvi. The last one is an unusually well-referenced stub ;-/, the other three are genuinely debatable.

There were 8 Starts - Pescasseroli, San Giovanni Lipioni, Vasto, Campli, Montefino, Torricella Sicura, San Benedetto dei Marsi, and Avezzano; the rest were Stubs. Avezzano is by far the most interesting one: it comes up on one out of two Stub tests, and one out of two Start tests (so Start wins). Again it's a slightly malformed article; I think Start is probably the right assessment, but there's not a lot in it.

For your convenience, I sorted them so that the 'interesting' ones were done first, and after Spoltore there's nothing but Stubs in order of increasing stubbiness. Of course, since as a trial bot I was using AWB in manual mode, I couldn't resist the human temptation to muck about with things. Which means that at the start, I included the three articles assigned neither Class nor Importance, just so you could see what was going on. And then I further messed around with Talk:Popoli trying to make it extra clear what was going on, but instead just screwed things up - I'll sort it out as soon as this is over. Like I say, I was just trying to give you the "overview" of the articles that the bot is looking at - in reality Popoli, Silvi and Alanno would never have made it into the list fed into Kingbot. Let me know if you need more doing. FlagSteward (talk) 00:41, 9 March 2008 (UTC)

Looks good. Approved. -- Werdna talk 07:50, 21 March 2008 (UTC)

The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made on the appropriate discussion page, such as the current discussion page. No further edits should be made to this discussion.