The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Approved.

False Positives

If you have come here to report a false positive, please do so here.
That link should provide an easier and more pleasant experience than editing the table at the bottom of this page.

ClueBot NG

Operator: Christopher Breneman (Crispy1989), Tim1357, and Jacobi Carter (Cobi).

Time filed: 00:35, Monday October 25, 2010 (UTC)

Automatic or Manually assisted: Automatic.

Programming language(s): The core is written in C++ by Christopher Breneman. The interface to Wikipedia is written in PHP by Cobi. The dataset is maintained by Tim.

Source code available: See Christopher Breneman for access to subversion repository.

Function overview: Vandalism detection and reverting using machine learning algorithms.

Links to relevant discussions (where appropriate):

Edit period(s): Continuous.

Estimated number of pages affected: Current statistics indicate approximately 70% of vandalism is caught, so it would be editing approximately 70% of vandalized pages.

Exclusion compliant (Y/N): Yes.

Already has a bot flag (Y/N): No.

Function details: Cluebot-NG is an attempt to revolutionize practical vandalism prevention on Wikipedia. Existing anti-vandal bots use simple static heuristics and, as a result, catch a relatively small portion of vandalism while producing an unacceptable number of false positives, many of which are likely not even reported. Cluebot-NG shares no code with the original Cluebot, and uses completely different algorithms to detect vandalism. Details of these algorithms can be found at [1] . Because these algorithms must be trained on a dataset, there is also a convenient way to estimate accuracy before a live run - simply running the bot on a portion of its dataset not used for training. Currently, this is yielding a 60% to 70% vandalism detection rate - far above that of current bots.

Discussion

Pre-trial Discussion

I think one of the main concerns about anti-vandal bots is not about how good they are at detecting vandalism; rather, it's above all about how good they are at not marking legitimate edits as vandalism. Therefore, one main concern from the document might stem from the following: "Estimates given to me for an acceptable false positive rate range from 1% to 0.5%. " Consider, for a moment: a false positive rate of 1.0% means that on average 1 in every 100 legitimate edits would be reverted as vandalism; 0.5% translates to 5 of every 1000 legitimate edits being reverted as vandalism. Keeping in mind that the enwiki edit rate is relatively high, false positives tend to add up. I'm not sure what the magic number is going to be, but my suggestion would be that you should at least keep in mind the idea that as a whole the community seems to prefer dealing with vandalism manually if it avoids labeling new users as vandals. --slakr^\ talk / 06:24, 25 October 2010 (UTC)[reply]

I agree with this completely. One of the aims from the ground up has been to minimize false positives. The key is to realize that even humans can sometimes have false positives, particularly with the borderline edits (and this is the area where Cluebot-NG has a few false positives). Certainly existing bots have a significant number of false positives - likely higher than 0.5%. If the 0.5% false positive rate is deemed too high, it can be adjusted at any time. This exact number could be put up for discussion before going live. The program is capable of generating graphs comparing the false positive rate with vandalism detection rate - do you think it would be useful to post these and open up a discussion concerning them? I can also post current lists of false positives in the trial dataset - it may be useful to see that most of them are poor quality or borderline edits, and on nearly all of them the user has contributed only a few times (not just as of the time of the edit, but as of the present). In response to your concerns about new users being labeled as vandals, I'm re-running the trial dataset right now, discarding any data about previous user contributions. I will update this with the results. Crispy1989 (talk) 06:55, 25 October 2010 (UTC)[reply]

Update - I ran the dataset without any prior user information and accuracy was almost the same, with only a slight dip of about 4%. There was no significant effect on false positives.

I know that this discussion is focused more on eliminating false positives rather than increasing ClueBot NG's accuracy, but I've noticed that the bot seems to miss content-removal vandalism more often than the other bots. --Ixfd64 (talk) 20:22, 4 November 2010 (UTC)[reply]

Very interesting. The last 2 anti-vandal bots performed dry runs on their user pages before being approved. This does not require prior permission, AFAIK. And just curious, have you tested your program against the datasets provided by the CLEF 2010 LABs, and if you have, how has it performed in comparison with other approaches? Sole Soul (talk) 18:23, 25 October 2010 (UTC)[reply]
It's difficult to perform a dry run on a userpage that adequately demonstrates the capabilities of the bot. The neural network takes into account not only the raw diff, but information on activity on the page, user activity, and other statistics. Also, the neural network is trained on a dataset of main namespace articles - it wouldn't apply very well to the talk page. It may be possible to set up a simulated environment with a somewhat less functional version by either holding these inputs constant, or removing them from the neural network. This approach differs from all existing ones that I'm aware of in that this approach combines multiple different methods to catch different classes of vandalism. Most existing approaches are either statistics-based, or language-based. This is both. Another key difference is that existing approaches are designed with the mindset of being a research project, and as such, try to maximize overall accuracy without practical considerations. Cluebot-NG has been designed to be practical from the ground up, in terms of speed, and minimizing false positives even if it means a decrease in overall correctness. Crispy1989 (talk) 19:17, 25 October 2010 (UTC)[reply]
What is the accuracy when the false positive rate is capped at 0.1%? Is there a chart somewhere of the catch rate vs false positive setting? Gigs (talk) 21:03, 25 October 2010 (UTC)[reply]
Capped at 0.1%, it currently catches about 40% of vandalism. We're working to improve this by identifying under what circumstances edits are falsely identified as vandalism and correcting them. You can view one recent trial report in whole at [2] . Among other things, it includes the graph you requested: [3] Crispy1989 (talk) 22:02, 25 October 2010 (UTC)[reply]
Thanks, that's just what I was looking for. It does look like the sweet spot is 0.5% at this point. I would feel more comfortable with a 0.1% false positive rate though. Gigs (talk) 22:22, 25 October 2010 (UTC)[reply]
I'll update the graphs as the bot improves and learns. Crispy1989 (talk) 22:46, 25 October 2010 (UTC)[reply]

I would support an anti-vandal bot which could avoid absurd false positives such as these recent reversions by ClueBot [4] and [5] (apparently, any use of the words "sex" or "pussy" in an edit by a non-whitelisted user is sufficient to trigger reversion), provided that the existing anti-vandal bots are decommissioned. Will the algorithms of this new bot adhere to WP:NOT#CENSORED, and refrain from reverting edits solely because they contain "bad" words? (though questionable language may legitimately be one of the factors which weighs in favour of reversion.) Also, blanking of content is sometimes legitimate; automated restoration of copyright [6] or BLP violations is particularly disruptive. I suggest that blanking in the following situations never be reverted by a bot:

The blanking edit adds a copyright violation template, or ((db-g12)).
The blanking adds ((db-g10)).
The blanking only removes content that contains no references in a form the bot can recognize (<ref>, a reference template, or an external link.)
The blanking only removes content (doesn't replace it with something else) on an article that was previously in Category:Living persons (while content removal on BLPs can constitute vandalism, human judgement is required to determine whether this is actually the case; the content may have blatantly violated the sourcing or neutrality requirements of the policy.) Peter Karlsen (talk) 03:31, 26 October 2010 (UTC)[reply]

I also noticed this diff on Wikipedia:Bots/Requests for approval/ClueBot NG/Trial. Since User:Tide rolls is an administrator with nearly 150,000 edits, the identification of this edit for reversion suggests that a whitelisting mechanism is not (currently) implemented. Is this a planned feature, perhaps through the use of Wikipedia:Huggle/Whitelist? Peter Karlsen (talk) 03:49, 26 October 2010 (UTC)[reply]

Cluebot-NG does not use a "blacklist" of any sort. The words that compose an edit are taken into account in two ways; the presence in a set of predefined "word categories", and the result of a naive Bayes classifier (two, actually). The word categories (as opposed to a blacklist) allow the bot to recognize that certain words may be acceptable if similar words are already used in the article. The naive Bayes classifiers recognize if a certain word appears only in vandalism, or if it sometimes appears in good edits. The naive Bayes classifier also provides for instances of normally bad words to be offset by the presence of other words which are not normally found in vandalism (These lists are NOT predefined - they are empirically determined by analyzing the dataset). Additionally, neither of these factors is used independently - they're fed into a neural network along with many other statistics, so the bot can learn from example in what statistical situations a high Bayesian score, or a word's presence in a certain category, is acceptable. Also, the second Bayes classifier uses sets of two words, instead of one, so phrases like "Pussy cat" would be recognized as primarily belonging to good edits (given a large enough data set). As the bot does not use any sort of heuristics, it cannot be programmed to ignore those certain situations that you list. However, these tags, and others, can be added as inputs to the neural network. It should then learn under what circumstances they contribute to an edit being good or bad. I should also note that, of the many false positives I've examined so far, none of them fit into these categories, so it would seem the neural network already does a good job of determining that the edit is not vandalism in these cases. The entire concept relies on the neural network learning complex patterns and making inferences about the data presented to it, and this requires a large dataset.
We currently have a dataset of around 30,000 edits, about 20,000 of which are used for training, and the other 10,000 for testing/trialing. We are working on expanding this. About the whitelisting of edits - Yes, this will be implemented when the bot is actually running live. The reason this is not currently active is that we would like to find as many false positives as possible now, before the bot goes into production, so we can work on fine-tuning the statistical parameters of the neural network. Even if an edit would not be reverted in production due to a whitelist, it's useful to train as an example of what should be considered a good edit. The programmatic structure of the bot is modular, and different mechanisms can easily be added - we plan on only adding whitelisting measures post-neural-net (ie, no heuristics that could cause additional false positives). Crispy1989 (talk) 04:37, 26 October 2010 (UTC)[reply]
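To illustrate the point about the second classifier working on sets of two words: the sketch below shows one way inserted text could be split into single words and adjacent word pairs before counting. The tokenizer (lowercase letters and apostrophes only) and the function names are assumptions made for illustration; the discussion does not say how the bot actually tokenizes text.

```python
import re


def tokens(text):
    """Split text into lowercase words (hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_pairs(text):
    """Return adjacent two-word sequences, the units a pair-based classifier would count."""
    words = tokens(text)
    return list(zip(words, words[1:]))


print(tokens("The pussy cat sat"))
print(word_pairs("The pussy cat sat"))
```

With pairs as the counting unit, a phrase such as "pussy cat" gets its own good-edit and vandalism counts, separate from the counts of either word alone.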

It sounds like ClueBot NG could greatly outperform traditional Wikipedia anti-vandalism bots, reverting significantly more vandalism with a lower false positive rate. One of the most important issues, however, is exactly where the acceptable false-positive rate is set. I would expect a percentage of false positives no greater than an experienced, careful human editor would produce. 0.1% false positives is probably at least as accurate as most humans could be, and, at 40% of vandalism reverted (based on the discussion above) is still far more effective than existing bots. Given that ClueBot NG would be performing far more reversions in total than DASHBotAV and ClueBot have, setting the false positive rate below the 1% or 0.5% that has heretofore been considered acceptable is particularly important. Peter Karlsen (talk) 05:12, 26 October 2010 (UTC)[reply]

Keep in mind that number of reversions has no bearing on false positive percentage. False positives are measured based on how many legitimate edits it processes, not how many edits it reverts. False positives percentage is (number of legit edits marked as vandalism) / (total legit edits processed) * 100. -- Cobi^(t|c|b) 06:35, 26 October 2010 (UTC)[reply]

Are you saying that if it processes 100,000 edits, 100 of which are vandalism, it will revert 70 of the vandalism edits and 500 legitimate edits? (given 70% catch rate at 0.5% false positive). If so, that's far, far worse than I thought. Gigs (talk) 18:59, 26 October 2010 (UTC)[reply]

That is how false positive rates are measured - This is significantly better than existing bots nonetheless. I.e., if the existing bots are decommissioned and replaced with this, there will be fewer false positives, and much more vandalism caught. Also, vandalism comprises more than 1% of edits, so although that example does show how the rates are calculated, it does not represent reality. Crispy1989 (talk) 19:13, 26 October 2010 (UTC)[reply]
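The arithmetic in this exchange can be checked with a short sketch. It applies the catch rate to the vandalism edits and the false positive rate to the legitimate edits, using the figures from the hypothetical example above (100,000 edits, 100 of them vandalism, 70% catch rate, 0.5% false positive rate). The function name is an assumption for illustration only.

```python
def expected_reverts(total_edits, vandalism_edits, catch_rate, false_positive_rate):
    """Return (vandalism reverted, legitimate edits reverted) for one batch of edits."""
    legitimate_edits = total_edits - vandalism_edits
    caught = vandalism_edits * catch_rate
    # The false positive rate is a fraction of legitimate edits, not of reverts.
    false_positives = legitimate_edits * false_positive_rate
    return caught, false_positives


caught, false_positives = expected_reverts(100_000, 100, 0.70, 0.005)
print(round(caught), round(false_positives, 1))
```

This gives about 70 vandalism reverts and about 500 reverted legitimate edits for that example. Raising `vandalism_edits` changes the mix of reverts without changing either rate.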

In any case, at a 0.1% false positive rate and 40% of vandalism reverted, ClueBot NG represents an opportunity for significantly improved performance, so that more vandalism is reverted, while events like [7], [8], and [9] become less frequent occurrences. I would suggest that setting the acceptable false positive rate as low as it possibly can be while still reverting a reasonable portion of vandalism increases the probability that the bot will be approved after its trial period (for the same reason, any "final" trial, either dry or with actual reversions, should be conducted with whitelisting enabled.) Peter Karlsen (talk) 19:33, 26 October 2010 (UTC)[reply]

Also, I should mention that the bot is constantly being improved every day. Accuracy will likely be even higher before the final trial. Anyone who feels they have something to add is welcomed to help. The more pertinent statistical information the neural network gets, the better. We've only added the things that we can think of off the tops of our heads. All of this information is configured at run time using configuration files, so new metrics can be easily added. If you have something to add or suggest, stop by irc.cluenet.org #cluebotng . Crispy1989 (talk) 20:40, 26 October 2010 (UTC)[reply]

When do you think you will be ready for a trial and what kind of sample size will you need to verify the method? MBisanz ^talk 05:34, 27 October 2010 (UTC)[reply]

Even its current performance is far superior to current bots. It's up to the BAG when the trial happens. As I said above, accuracy can only get higher, and development won't stop as soon as it goes live. We already have enough of a sample size (around 30,000 edits) to verify that it works very well. This sample size is what the other figures in this discussion are based on. The larger the dataset, the more accurate it will be. Crispy1989 (talk) 05:55, 27 October 2010 (UTC)[reply]

I ran it for a few hours today in dry run mode and exported its data to User:ClueBot NG/Dry Run -- warning, this is a very large page (about 2500 links). -- Cobi^(t|c|b) 08:29, 28 October 2010 (UTC)[reply]

Generally pretty impressive, imo. I will say that the fourth link I clicked on in a list of 2500 (Christopher Boykin by 68.229.109.100 (talk · contribs) at 2010-10-27 19:37:08 - ANN scored at 0.956627) was a false positive, however. :P Ale_Jrb ^talk 11:34, 30 October 2010 (UTC)[reply]

That sort of false positive is very rare, and will be fixed over time as the dataset grows. Part of the algorithm uses a naive Bayes classifier on the inserted text. In the dataset we have, the word "banana" has been used in 4 vandalism edits, and 0 good edits. Since that user made no past contributions to analyze, and there was no data other than the inserted word, this statistic was the only one that contributed to the score. Because our existing dataset is pretty large, false positives like this (and false positives in general) very rarely occur. These can be fixed in one of two ways. The best way is to increase the size of the dataset. If even a single good edit in the dataset included the word "banana", the neural network would put less stock in it. The other way is to increase the minimum number of dataset appearances of a word before it contributes to the Bayesian score. This is currently set at 4 (barely enough for "banana" to trigger it). We are using our best judgement to adjust these parameters to the optimal values, but as I said, the best way is to increase the dataset size. Crispy1989 (talk) 12:04, 30 October 2010 (UTC)[reply]
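A minimal sketch of the minimum-appearances rule described above, assuming the per-word evidence is simply the fraction of a word's dataset appearances that were vandalism. The discussion gives the counts for "banana" (4 vandalism, 0 good) and the cutoff (4), but not the real scoring formula, so the ratio and the function name here are illustrative assumptions.

```python
def word_vandalism_ratio(vandal_count, good_count, min_appearances=4):
    """Fraction of a word's appearances that were vandalism, or None if the word is too rare."""
    total = vandal_count + good_count
    if total < min_appearances:
        return None  # too few appearances: the word contributes nothing
    return vandal_count / total


print(word_vandalism_ratio(4, 0))                     # "banana" as described above
print(word_vandalism_ratio(4, 1))                     # same word with one good edit added
print(word_vandalism_ratio(4, 0, min_appearances=5))  # stricter cutoff
```

Both remedies mentioned above show up here: one extra good edit lowers the ratio, and a higher `min_appearances` removes the word from scoring altogether.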

A couple comments/questions:

I think it would be good to get some wider community input on what an acceptable false positive rate would be.
How do false positives affect the bot's learning ability? Will a higher false positive rate lead to "bad" entries in the dataset that could cause it to "learn" slower (or worse, get worse with time)? Or will the fact that its reverting more actual vandalism, including corner cases, mean that it will learn faster?
A somewhat WP:BEANS-y question - If it looks at users' past edits, is it possible to game the bot? If a user makes a legitimate edit to an article about penises, will that make the bot less likely to detect "penis vandalism" by the same user?

-- Mr.Z-man 19:34, 30 October 2010 (UTC)[reply]

How would I go about getting wider community input for the false positive rate? I agree with this, but so far there have been different acceptable estimates, but no real consensus. My recommendation is somewhere between 0.5% and 0.1%. Anything above 0.5% is probably too high, but close to 0.1%, there's a dropoff in effectiveness.
The bot does not automatically learn. If it did so, regardless of the false positive rate, its performance would simply remain status quo. Over time, we plan on manually growing the dataset, both by using human-reviewed random edits, and by adding specific false positives (and false negatives) to the set.
The bot only looks at general statistics of the user's past edits (number of edits, time frame, number of past edits that were vandalism, number of unique pages edited). It does not process content of past edits. It learns things like, if the user has made two previous edits, and neither were vandalism, it's much less likely for the edit in question to be vandalism than if it's the user's first edit, or if the user has made 2 edits before, both of which were vandalism. (Also note that these scenarios do not alone positively identify an edit as vandalism. They only contribute to the result.)

A note about "penis" vandalism and similar: Currently, the bot's Bayesian database shows "penis" appearing in 157 vandalism edits, and 0 good edits. Because of this, a user simply adding the word "penis" alone to a page would likely be classified as vandalism. However, almost no legitimate edit would add only this word. Bayesian scoring also takes into account other words that are added with it (for example, if "birth" was also added, it appears in 43 good edits and only 15 vandalism edits, so it would bring the score far below threshold). In addition to this, the bot also monitors if words already appear in an article before the edit, so even if the word "penis" alone were added, if it already appeared in the article, the bot may take that into account (however, that may not be enough to have it classified as a good edit - all statistics are taken into account). Also, the bot handles words inside quotes differently, as direct quotes on Wikipedia are allowed to contain many things that direct article text should not. Crispy1989 (talk) 20:20, 30 October 2010 (UTC)[reply]
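The offsetting effect described above can be sketched with a textbook naive Bayes combination. This is not the bot's actual code. It assumes equal numbers of vandalism and good edits in the dataset (so class totals cancel), add-one smoothing, and a logistic squash to a probability; only the word counts come from the comment above. The absolute values from this toy formula should not be compared with the bot's real threshold, since the real Bayesian score is just one input to the neural network.

```python
import math


def vandalism_probability(added_words, counts):
    """Combine per-word evidence as summed log-odds (naive Bayes, equal class priors assumed)."""
    log_odds = 0.0
    for word in added_words:
        vandal, good = counts.get(word, (0, 0))  # unseen words are neutral
        # Add-one smoothing keeps a zero count from producing infinite odds.
        log_odds += math.log((vandal + 1) / (good + 1))
    return 1 / (1 + math.exp(-log_odds))


# (vandalism edits, good edits) per word, as quoted in the discussion.
counts = {"penis": (157, 0), "birth": (15, 43)}

alone = vandalism_probability(["penis"], counts)
with_context = vandalism_probability(["penis", "birth"], counts)
print(with_context < alone)
```

The only point of the sketch is the direction of the effect: a word seen mostly in good edits pulls the combined estimate down.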

To me the false positive rate is the key question, since any vandalism reversion is useful. For comparison, what are the false positive rates of the existing anti-vandalism bots? IMHO any bot with a lower false positive rate than currently approved bots can be readily authorised. Rjwilmsi 20:08, 30 October 2010 (UTC)[reply]

If I understood the info from the report.txt that's linked to above, if the bot catches 2100 vandalism edits while having 5 false positives it sounds like it's doing very well. Rjwilmsi 20:11, 30 October 2010 (UTC)[reply]

It's very difficult to estimate the false positive rates of existing bots, because the majority of false positives probably are not even reported. Since existing bots are heuristics-based, they are frequently over-aggressive with many things that usually occur in vandalism, but can sometimes be acceptable. They mitigate this to some extent using a user-whitelist, and not reverting edits made by a user with more than a certain number of contributions. But this means that all or nearly all of the false positives are occurring with new users, who probably don't understand why their edit was classified as vandalism, or how to report it (if they even notice). Just by looking at their logic and considering how many legitimate edits could be misclassified by simple heuristics, it can be seen that existing false positive rates should be significantly higher than the values being considered for this bot.

About the comment on report.txt, you are reading it correctly, but the inference is a little off. The raw numbers of vandalism detected versus false positives are based on the trial dataset, which is about 50/50 vandalism/good edits. If Wikipedia edits followed this same proportion, then saying "2100 vandalism edits reverted for every 5 false positives" would be accurate. However, there are more good edits on Wikipedia than vandalism, so this isn't an accurate way to think about it. That's why I've been discussing false positive rate (which is a percentage of good edits), rather than the ratio, because the false positive rate is independent of the ratio of good edits to vandalism.

On the other hand, a few other things to keep in mind: Of those 5 false positives, one isn't actually a false positive (it's a misclassified edit in the dataset). We are working to correct these few errors by manually reviewing edits. They do not affect training as the very few errors are washed out by all the correctly classified ones, but they can affect calculation of threshold (by making false positive rate seem higher than it actually is). Also, at least one other edit of those 5 false positives would probably be caught by the post-processing filter (using a similar whitelist to existing bots) which will be added in production. Also keep in mind the bot is continually being improved. Many changes have been made since that report was generated. I'll post a new report soon. Crispy1989 (talk) 20:41, 30 October 2010 (UTC)[reply]

Examining the bot's current performance, I've decided to recommend a false positive rate of 0.25%. I also posted the report from a recent dataset trial here. As you can see from the graph, there's a sharp dropoff after about 0.25% false positives. At 0.25% false positives, the score threshold is 95.4%, and 63.7% of vandalism is caught. In reality, the number of false positives will be less than demonstrated here. Looking at the list of false positives, 8 of the 13 aren't even real false positives - they're edits misclassified in the dataset, so the bot is actually identifying these correctly. Since the whole dataset is human-reviewed, this demonstrates that the bot can, under some circumstances, be even more accurate than a human - it can even recognize and ignore errors in its training set. Crispy1989 (talk) 21:52, 30 October 2010 (UTC)[reply]
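The relationship between the false positive cap, the score threshold and the catch rate can be sketched as follows. Given classifier scores for held-out good edits and vandalism edits, the function picks the lowest threshold whose false positive rate stays within the cap, then reports the catch rate at that threshold. The function name and the toy scores are hypothetical; they are not taken from the bot's reports.

```python
def pick_threshold(good_scores, vandal_scores, max_fp_rate):
    """Return (threshold, fp_rate, catch_rate); an edit is reverted when score > threshold."""
    # Try candidate thresholds from lowest to highest; the first one that
    # satisfies the cap catches the most vandalism.
    for threshold in sorted(set(good_scores) | set(vandal_scores)):
        fp_rate = sum(s > threshold for s in good_scores) / len(good_scores)
        if fp_rate <= max_fp_rate:
            catch_rate = sum(s > threshold for s in vandal_scores) / len(vandal_scores)
            return threshold, fp_rate, catch_rate


# Toy held-out scores, invented for illustration.
good = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.97]
vandal = [0.50, 0.80, 0.96, 0.98, 0.99]

print(pick_threshold(good, vandal, 0.125))
print(pick_threshold(good, vandal, 0.0))
```

Tightening the cap pushes the threshold up and the catch rate down, which is the trade-off visible in the graphs discussed above.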

Approved for trial (14 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Approved for editing at 0.25% FP rate. A 0.25% FP rate means that, on average, 2 to 3 out of every 1000 legitimate edits will be reverted, which is lower than our current bots and many of our human editors as well. Crispy and Cobi and Tim are working continuously on this bot, and it should only improve from here. What's more, with the dataset being improved, the FP rate is actually lower than stated, so this should be an all right FP rate. (X! · talk) · @234 · 04:37, 2 November 2010 (UTC)[reply]

Trial 1 discussion

I fixed a botched attempt at a redirect by a noob[10] and got reverted by this bot. Thanks, Jon. — Preceding unsigned comment added by 81.145.247.25 (talk • contribs) 19:26, 2 November 2010

Thanks for pointing this out - "REDIRECT" was not included in the list of wiki markup to ignore. We are adding it now. Crispy1989 (talk) 19:30, 2 November 2010 (UTC)[reply]

That's not the only error so far. [11] is a rather inexplicable reversion. Also, the bot seems to have a bad habit of reverting itself [12] [13] for no apparent reason (or perhaps the means of identifying vandalism is so problematic that the bot really is marking its own edits as vandalism.) Peter Karlsen (talk) 19:37, 2 November 2010 (UTC)[reply]

Reverting itself has been fixed. The other errors are due to the dataset not being broad enough (adding these edits to the dataset and retraining should rectify this) and REDIRECT not being in the list of wiki markup (being fixed right now). Crispy1989 (talk) 19:43, 2 November 2010 (UTC)[reply]

Thanks. [14] appears to be an incorrect reversion as well. Peter Karlsen (talk) 19:46, 2 November 2010 (UTC)[reply]

I blanked an attack page and got a warning. Not good. Carl Sixsmith (talk) 19:45, 2 November 2010 (UTC)[reply]

This also seems to be an issue with dataset completeness. There are no instances of complete page blanking in the dataset that are legitimate. As soon as these are added, this will correct itself. Also, users above a certain threshold number of edits should be ignored (and you should qualify). We'll look into lowering the threshold. Crispy1989 (talk) 19:49, 2 November 2010 (UTC)[reply]

I placed a speedy tag ((db-redirnone)) on a page with a broken redirect (to a page that has been deleted) and was reverted.[15] Thanks, Jon. — Preceding unsigned comment added by 81.145.247.25 (talk • contribs) 20:05, 2 November 2010
Thanks, will also add this tag to markup list and edit to the dataset. Crispy1989 (talk) 20:38, 2 November 2010 (UTC)[reply]

So far, the false positive rate seems to fall right into the expected range of around 0.25%. It also seems to be reverting more than half of all vandalism on Wikipedia, also as expected. The false positives that do exist seem to be primarily problems with redirects (which has been fixed in the code and is being tested before restarting the process running the trial) and problems that can be solved by increasing dataset size. Please continue to report any issues so they can be added to the training dataset, and so I can add tags and other things (like redirects) that I may have missed to the special text handling code. Crispy1989 (talk) 21:35, 2 November 2010 (UTC)[reply]

It would be better if you assign a special page for reporting FPs instead of here. Also, what is the number of edits above which a registered user will be ignored. Sole Soul (talk) 21:58, 2 November 2010 (UTC)[reply]

Users with more than 50 edits are now ignored. This is what the old ClueBot was set at. Crispy1989 (talk) 22:13, 2 November 2010 (UTC)[reply]
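A minimal sketch of a post-network filter of the kind described in this discussion: the score is compared with the threshold first, and the whitelist and edit-count checks can only cancel a revert, never cause one. The function and parameter names are hypothetical, and the whitelist is assumed to be a plain set of usernames.

```python
def should_revert(score, threshold, username, user_edit_count,
                  whitelist, max_untrusted_edits=50):
    """Decide whether to revert, applying exemptions only after the score check."""
    if score <= threshold:
        return False  # the network did not flag the edit
    if username in whitelist:
        return False  # explicitly trusted user
    if user_edit_count > max_untrusted_edits:
        return False  # users with more than 50 edits are ignored
    return True


whitelist = {"ExampleTrustedUser"}  # hypothetical entry
print(should_revert(0.97, 0.954, "NewUser", 3, whitelist))
print(should_revert(0.97, 0.954, "Veteran", 5000, whitelist))
```

Because the exemptions run after scoring, they cannot introduce new false positives, which matches the stated plan of adding whitelisting only after the neural network.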

I changed a stub article, flagged as poor into a redirect and got a false positive. See The Secret Peacemaker for details. Jim no.6 (talk) 22:02, 2 November 2010 (UTC)[reply]

The robot is automatically reverting my edit to the page Ideologia, but "Ideologia" could refer either to the album Ideología or to the album Ideologia (Cazuza's album). The robot needs some adjustment. — Preceding unsigned comment added by 187.68.100.92 (talk • contribs) 23:45, 2 November 2010

Reverted [16]. In the absence of a better solution, I suggest that article => redirect and redirect => article conversions be removed from the "vandalism" dataset. Peter Karlsen (talk) 23:54, 2 November 2010 (UTC)[reply]

The problem was not that it thought redirects were vandalism, but that there was no special handling for "REDIRECT". This has been fixed (special handling added). Testing the fixes right now, bot will be restarted with updates soon (it takes some time to retrain and test it). I also added special handling for disambiguation tags. Crispy1989 (talk) 00:02, 3 November 2010 (UTC)[reply]

I edited the Greatest Hits So Far redirecting page to a disambiguation page in order to include both The Greatest Hits, So Far and Greatest Hits... So Far!!!, since it is an upcoming release. Yet it was since reverted & I got a vandalism warning. Same thing happened when I corrected the UK edition track listing in the Greatest Hits... So Far!!! page, which is currently incorrect, and I replaced it with the track listing cited in the reference, yet it was reverted again. My edits should stop being reverted as they are not unconstructive nor vandalistic. Imjayyy (talk) 00:04, 3 November 2010 (UTC)[reply]

The issue with redirects and disambiguation pages is known and is being corrected. Cluebot-NG has not edited Greatest Hits... So Far!!!, it did not consider these edits vandalism. They were reverted by another user. Crispy1989 (talk) 00:18, 3 November 2010 (UTC)[reply]

Thank you! Imjayyy (talk) 00:21, 3 November 2010 (UTC)[reply]

Is it possible to generate new false positive vs catch rate charts once the bot has a few days under its belt so that we can see the progress? Gigs (talk) 01:12, 3 November 2010 (UTC)[reply]
The reports and charts are generated from the trial dataset. I can regenerate them (from the same trial dataset) after making these modifications, but it wouldn't look much different - the changes I'm making now are to prevent false positives from things which aren't well-represented in the dataset (seemingly, redirects and disambiguation pages). We will add these reported false positives to the training dataset, and add special handling code for these, but it won't greatly affect the results of a run with the trial dataset. What I think you want is statistics on exactly how much vandalism is really being caught, versus the false positive rate. For this, we need to have a human go through all the edits that the bot has seen, and manually classify them to check for accuracy. We have already designed an interface exactly for this purpose (see User:Cluebot_NG for details), but have not gotten any volunteers yet. If we get enough volunteers, then we'll add all the edits the bot has seen to the review queue, and generate a separate trial dataset from this (then, we can generate the stats and graphs). Crispy1989 (talk) 01:23, 3 November 2010 (UTC)[reply]

80.192.184.101 (talk · contribs) made a test edit to thin layer chromatography, which I reverted. When the user made the same edit again, it was reverted by ClueBot NG. Does the bot assume that reverted edits are vandalism? --Ixfd64 (talk) 06:31, 4 November 2010 (UTC)[reply]

Well, one of the parameters into the ANN is number of times reverted. So, not directly, but the bot probably did pick up on the fact that that user had been reverted before. -- Cobi^(t|c|b) 06:56, 4 November 2010 (UTC)[reply]

Actually, the parameter is number of times user has been warned. This is one factor out of over a hundred, and alone, is not sufficient to indicate that an edit is vandalism. You can think of it more as an "estimation of good faith" for borderline edits. Crispy1989 (talk) 07:15, 4 November 2010 (UTC)[reply]

False positive reports

I made an edit to Shell game that was reverted, despite being a genuine edit. 75.67.220.101 (talk) 04:00, 22 November 2010 (UTC)[reply]
An attempt to remove the blatant advertising from the Lemsip article was reverted.

It's pretty obvious the article was written by someone working for the producer of the medicine.

I added some variations to the information for Kings, the drinking game. It actually has a section for variations and it had kicked off the suggestions I have added. Any suggestions how to make it stick?
tried to edit "Hoy_Me_Voy" to change sentences to a more appropriate tone, but ClueBot flagged the change as vandalism, put a notice on my talk page, and reverted the change back. What do I do? (talk)
tried to make a change on Fernando_Garibay, adding "so happy i could die" to his 2010 productions but it got removed. the proofs in his website and its already a source so what do i do. 70.173.230.88 (talk)
The following is the log entry regarding this warning: Four (energy drink) was changed by 109.46.144.246 (u) (t) ANN scored at 0.817485 on 2010-11-18T07:44:16+00:00 . Thank you. ClueBot NG (talk) 07:44, 18 November 2010 (UTC)
I attempted to revert an instance of vandalism to this page; ClueBot flagged my change as vandalism and reverted my change to the previously vandalized version. --24.72.122.184 (talk) 22:22, 15 November 2010 (UTC)[reply]

Okay, after this happened the first time, an automated message appeared on my talk page stating, "ClueBot NG produces very few false positives, but it does happen. If you believe the change you made should not have been detected as unconstructive, please report it here, remove this warning from your talk page, and then make the edit again." I did this, and it re-reverted the change again and placed another warning on my talk page. What is this? --24.72.122.184 (talk) 22:31, 15 November 2010 (UTC)[reply]

Okay, it was in fact another user that re-reverted the change, and not ClueBot. Please disregard that follow-up comment. --24.72.122.184 (talk) 22:55, 15 November 2010 (UTC)[reply]

This looks wrong. http://en.wikipedia.org/w/index.php?title=Kamen_Rider_OOO_%28character%29&curid=28242801&diff=396938724&oldid=396938713 How would it even know if this is vandalism? Stupid bot. 131.202.131.250 (talk) 17:04, 15 November 2010 (UTC)[reply]
I followed the link provided to give a false positive report, but as far as I can tell, there's not actually a section set aside here for that purpose. I apologize for any intrusion in my making this section. —Bill Price (nyb) 03:28, 3 November 2010 (UTC)[reply]

Sorry for the confusion - that page is currently "left-over" from the original ClueBot. We thought it would be good to keep false positives here while the BRFA is open, so all reviewers can get a good idea of its accuracy.

Same thing. Happybunny95 (talk) 04:03, 15 November 2010 (UTC)[reply]

This edit of Bed Intruder Song made by Allyisaunicorn (talk · contribs). The edit needed to be reverted due to copyright issues, but it was not an act of vandalism. —Bill Price (nyb) 03:28, 3 November 2010 (UTC)[reply]
Edits like this are very difficult to distinguish from vandalism from a bot's point of view. The bot does specially handle things within quotes, but these lyrics were presented as normal page content, so they were handled normally. They contain unterminated sentences, slang, Bayesian keywords, abnormal use of capitalization, and other things which are fed into the neural network. It may be possible to add a special case for lyrics, but this might require that the training dataset contain several examples of lyrics being added. I'll look into the feasibility of this. Crispy1989 (talk) 03:52, 3 November 2010 (UTC)[reply]

You should note that the account that made that edit seems to be a single purpose vandalism only account. So the number of past edits that were vandalism statistic would likely be higher. As such I'm not sure this specific edit is really a false positive due to the user's obvious bad faith. --nn123645 (talk) 15:10, 3 November 2010 (UTC)[reply]

Thanks for pointing this out. Yes, the bot does take this into account. It may be a factor into why this particular edit was reverted. Crispy1989 (talk) 17:46, 3 November 2010 (UTC)[reply]

Here's another false positive: [17] (it's not immediately clear to me why the edit might have seemed like vandalism at all; the bot is surely not policing the addition of unreferenced material to Wikipedia.) Peter Karlsen (talk) 03:35, 3 November 2010 (UTC)[reply]
One unique thing about using a neural network as the core detection engine is that it's a bit of a "black box". Sometimes it's not immediately apparent why an error occurred. Usually it's because the dataset just isn't large enough, and as it grows, these will disappear. Currently, the dataset is large enough for the statistics mentioned above (about 60% of vandalism caught at 0.25% false positives), and just by estimating, it looks like these are approximately correct for the live run as well. As the dataset grows, these kinds of false positives will be eliminated. Crispy1989 (talk) 03:52, 3 November 2010 (UTC)[reply]
Do you intend to keep the target false positive rate at 0.25%? (for editors new to this discussion, that's 0.25% of every edit examined; the number of incorrect reversions will be well over the two and a half per 1000 rollbacks by the bot that might seem to be indicated by the raw percentage.) If so, then as the dataset improves, the threshold for reversion will simply be lowered to continue to meet the 0.25% target, resulting in more vandalism reverted, but new and exciting false positives to replace the ones that have been eliminated. This is why, in the discussion above, I suggested that a 0.1% false positives target would be more conducive to community acceptance of the bot, and ultimate approval. Peter Karlsen (talk) 05:07, 3 November 2010 (UTC)[reply]
It can be adjusted to whatever people want. As the dataset improves, I'll update the graphs. The 0.25% was determined by looking for a sharp dropoff point below 0.5% on the graph. As the dataset improves, the false positive rate will be lowered as well. Crispy1989 (talk) 15:56, 3 November 2010 (UTC)[reply]
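
The trade-off described in this exchange can be sketched in a few lines. This is only an illustration, not the bot's actual code: it assumes every trial edit has a score between 0 and 1 plus a human label, and it picks the lowest threshold whose false positive rate on the good edits stays at or below a chosen target. The function name `pick_threshold` and the sample scores are hypothetical.

```python
def pick_threshold(good_scores, vandal_scores, target_fp_rate):
    """Return (threshold, fp_rate, catch_rate) for the lowest threshold
    whose false positive rate does not exceed target_fp_rate.

    An edit counts as reverted when its score is >= threshold.
    """
    # Candidate thresholds: every observed score, plus one above all of them
    candidates = sorted(set(good_scores) | set(vandal_scores)) + [float("inf")]
    for threshold in candidates:
        false_positives = sum(s >= threshold for s in good_scores)
        fp_rate = false_positives / len(good_scores)
        if fp_rate <= target_fp_rate:
            caught = sum(s >= threshold for s in vandal_scores)
            return threshold, fp_rate, caught / len(vandal_scores)


# Made-up scores, purely for illustration
good = [0.05, 0.10, 0.20, 0.40, 0.97]
vandal = [0.30, 0.60, 0.90, 0.98, 0.99]

print(pick_threshold(good, vandal, 0.0))  # strict target
print(pick_threshold(good, vandal, 0.2))  # looser target
```

With the looser target the threshold drops and more vandalism is caught, at the price of more good edits being reverted. The same happens if the target is held fixed while the scores improve, which is the effect Peter Karlsen describes.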

I too followed a link here to report a false positive, i really have no idea why it auto reverted my changes. The trigger that seems to have set off the bot was my use of the word "homosexual", although i was simply substituting it for the word "gay" in a sentence because it seemed more appropriate in the context, and created a better flow in the prose. Oddly the Bot also reverted a number of other minor changes i had made, which were merely filling in missing words, and could in no way be construed as vandalism. Here are the changes i made[18] — Preceding unsigned comment added by Carensdp (talk • contribs) 05:13, 3 November 2010

I've taken the liberty of reverting the bot [19]. One thing that ought to be apparent from the false positives up till now is the bot's persistent homophobia - any mention of "gay" or "homosexual" seems to be enough to trigger it (for instance, [20].) Perhaps it might be advisable to add some edits to LGBT-related articles to the dataset of legitimate contributions, so the bot might (hopefully) be able to distinguish between references to actual homosexuality, and "gay" in the pejorative sense as a generalized insult. Peter Karlsen (talk) 05:35, 3 November 2010 (UTC)[reply]

Yeah, this is what needs to happen. In the current dataset, there are no instances of these words being used correctly. As soon as these edits are added, this problem should correct itself.

Cluebot seems to be immediately reverting all contributions made by IP's to the Reference Desk, eg. [21]. WikiDao ☯ (talk) 11:41, 3 November 2010 (UTC)[reply]
(edit conflict) here's another, maybe keep it to article space, or at least out of the discussion space? - Kingpin¹³ (talk) 11:42, 3 November 2010 (UTC)[reply]
The neural network is only trained on articles in the main namespace. It is not (currently) meant to handle any other articles. I was unaware that articles from other namespaces were fed to the core. I'll tell the person running the interface code to exclude any edits not in the main namespace. Crispy1989 (talk) 15:56, 3 November 2010 (UTC)[reply]
One thing this has brought to attention, is that the exclusion compliance is apparently not working, see here. - Kingpin¹³ (talk) 16:35, 3 November 2010 (UTC)[reply]
I'll tell the developer of the Wikipedia interface. It handles all whitelists and exclusions. Crispy1989 (talk) 16:49, 3 November 2010 (UTC)[reply]
Also, are you aware that it's not currently warning users? - Kingpin¹³ (talk) 16:50, 3 November 2010 (UTC)[reply]

I am confused by what you mean by "developer of the Wikipedia interface"? Exclusion compliant means following the ((bots)) template in this case, such as this. — HELLKNOWZ ▎TALK 16:55, 3 November 2010 (UTC)[reply]
The bot's code is created primarily by two people - myself and User:Cobi. I wrote the core which does the main vandalism detection with the machine learning techniques. Cobi wrote the interface to Wikipedia, which handles everything that's not machine-learning (exclusions, whitelists, etc). The interface was largely taken from the existing Cluebot. Crispy1989 (talk) 17:40, 3 November 2010 (UTC)[reply]

Exclusion compliance fixed. The ((nobots)) was working, but not ((bots)). Not warning users was due to someone setting the bot's shutoff page, and due to a bug that has now been fixed, it only honored that page for warns. -- Cobi^(t|c|b) 17:47, 3 November 2010 (UTC)[reply]
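
For readers unfamiliar with exclusion compliance, the check can be sketched roughly as follows. This is a simplified reading of how the bots and nobots templates are generally documented to behave, not Cobi's PHP code. It assumes the templates appear in page text in the usual double-brace form, and the function name `bot_allowed` is hypothetical.

```python
import re


def bot_allowed(wikitext, bot_name):
    """Simplified exclusion check: False if the page opts out of bot_name."""
    # {{nobots}} excludes every bot
    if re.search(r"\{\{\s*nobots\s*\}\}", wikitext, re.IGNORECASE):
        return False
    match = re.search(r"\{\{\s*bots\s*\|([^}]*)\}\}", wikitext, re.IGNORECASE)
    if not match:
        return True  # no template, or a bare {{bots}}: no restriction
    params = {}
    for part in match.group(1).split("|"):
        key, _, value = part.partition("=")
        params[key.strip().lower()] = [v.strip().lower() for v in value.split(",")]
    name = bot_name.lower()
    if "allow" in params:
        # Only listed bots (or "all") may edit
        return "all" in params["allow"] or name in params["allow"]
    if "deny" in params:
        # Listed bots (or "all") may not edit
        return not ("all" in params["deny"] or name in params["deny"])
    return True


print(bot_allowed("Some text {{nobots}}", "ClueBot NG"))
print(bot_allowed("{{bots|deny=ClueBot NG}} Some text", "ClueBot NG"))
print(bot_allowed("Some text", "ClueBot NG"))
```

The bug described above corresponds to handling only the first branch (nobots) and skipping the parameterised bots form.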
ClueBot NG reverted a speedy deletion tag db-vandalism which was added by Uncle Milty. Minima c (talk) 14:23, 3 November 2010 (UTC)[reply]
This is being fixed. Crispy1989 (talk) 15:56, 3 November 2010 (UTC)[reply]

Note: Cluebot-NG has reviewed over 70,000 edits so far, resulting in a handful of false positives, which are either being fixed now programmatically, or will be fixed with the growing of the dataset. Crispy1989 (talk) 17:00, 3 November 2010 (UTC)[reply]

I should also note that, while Cluebot-NG has a false positive rate comparable to some humans (if a human were to review every single edit made to Wikipedia), the false positives are not always the same ones that you might expect a human to make. Crispy1989 (talk) 19:39, 3 November 2010 (UTC)[reply]

On Ajharper18 -- the content of the page was "test" so I tagged as ((db-g2)) and was reverted by ClueBot NG. There's no way that should have happened. I've never had any bot revert any of my speedy taggings previously. 174.109.197.174 (talk) 11:36, 4 November 2010 (UTC)[reply]
This seems to be an issue in some cases because the current dataset does not contain instances of speedy deletion tags being added. We are generating a new dataset now which should solve the issue. Crispy1989 (talk) 14:17, 4 November 2010 (UTC)[reply]

This is probably a silly question, but what does the "NG" stand for? New Generation? --Ixfd64 (talk) 20:22, 4 November 2010 (UTC)[reply]

Our intent was Next Generation. Crispy1989 (talk) 20:33, 4 November 2010 (UTC)[reply]

This edit on Franco Selleri was made by an inexperienced user, and seems to have been reverted just because he accidentally added his signature ~~~~ to his edit. -- Crowsnest (talk) 10:24, 5 November 2010 (UTC)[reply]
The signature probably is the primary reason it was reverted - the training set doesn't include talk pages or areas where signatures are used, so without seeing a signature before, it probably seems like a random mashup of punctuation by a new user. As the dataset grows, and it sees instances of accidental signatures classified as constructive, this type of thing won't happen. In addition to the signature, a possible complicating factor is that the bot can detect common vandal grammatical errors, such as unterminated sentences - and the user's edit, in this case, adds one. Again, as the dataset grows, and there are instances of where edits like this are not classified as vandalism, the bot will score these lower. Crispy1989 (talk) 16:33, 5 November 2010 (UTC)[reply]

[22] Can't see why this revert was made. Philip Trueman (talk) 15:05, 5 November 2010 (UTC)[reply]
This is an instance where the dataset isn't large enough. For some reason, the only edits the bot has learned from with similar statistics have been vandalism. With a larger and more complete dataset, as is being generated now by volunteers, there will be fewer gaps in its training. Crispy1989 (talk) 16:38, 5 November 2010 (UTC)[reply]

This revert [23], while not referenced, is definitely not vandalism. Not sure why it was labelled as such (use of ball?). AIRcorn (talk) 20:53, 5 November 2010 (UTC)[reply]
This is purely a case of a gap in the dataset. The Bayesian classifier (i.e., words) was not what caused it, at least not alone: "ball" isn't even in the Bayesian database (the bot learned that it occurs about equally in vandalism and nonvandalism). A few words may have contributed ("you" occurs in 548 vandalism articles, and 45 good articles), but this should have been counterbalanced by other words ("22" occurs in 82 good articles, and 22 vandalism articles). With an increase in dataset size, this should stop. Crispy1989 (talk) 00:26, 6 November 2010 (UTC)[reply]
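
To make the word counts in this reply concrete, here is a minimal sketch that turns each pair of counts into a simple vandalism ratio. The counts are the ones quoted in the reply; the ratio formula and the name `vandalism_ratio` are assumptions for illustration, and the discussion does not say how the bot actually combines per-word statistics.

```python
def vandalism_ratio(vandal_count, good_count):
    """Share of the articles containing a word that were classified as vandalism."""
    return vandal_count / (vandal_count + good_count)


# Counts quoted in the reply: (vandalism articles, good articles)
word_counts = {"you": (548, 45), "22": (22, 82)}

for word, (vandal, good) in word_counts.items():
    print(f"{word!r}: {vandalism_ratio(vandal, good):.3f}")
```

On these counts "you" leans strongly towards vandalism (about 0.92) while "22" leans towards constructive edits (about 0.21), which is the counterbalancing the reply refers to.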

[24] looks like a revert of a perfectly legitimate and correct edit (see definition of ionization). PleaseStand ^(talk) 05:37, 6 November 2010 (UTC)[reply]
This appears to be because the bot was counting "i" and "e" both as uncapitalized sentences, and "i" as an uncapitalized 'I'. Thanks for pointing out this special case. It is now fixed. Crispy1989 (talk) 07:07, 6 November 2010 (UTC)[reply]
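
The special case described here can be illustrated with a small sketch. It is not the bot's C++ code: the sentence splitter is deliberately naive, and the abbreviation list and the function name `count_uncapitalized_sentences` are hypothetical.

```python
import re

# Abbreviations whose internal periods should not end a sentence (illustrative list)
ABBREVIATIONS = ("i.e.", "e.g.")


def count_uncapitalized_sentences(text, abbreviations=ABBREVIATIONS):
    """Count sentences that begin with a lowercase letter.

    Listed abbreviations have their periods removed first, so they are not
    mistaken for sentence boundaries.
    """
    for abbr in abbreviations:
        text = re.sub(re.escape(abbr), abbr.replace(".", ""), text, flags=re.IGNORECASE)
    sentences = re.split(r"[.!?]+\s*", text)
    return sum(1 for s in sentences if s and s[0].islower())


sample = "Ionization, i.e. the loss of an electron, occurs here. it is common."
print(count_uncapitalized_sentences(sample))                    # abbreviation-aware
print(count_uncapitalized_sentences(sample, abbreviations=()))  # naive splitting
```

Passing an empty abbreviation list reproduces the kind of over-counting reported for the ionization edit, where the fragments of "i.e." are treated as separate sentences.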

I made some improvements to the English translation on Du, du liegst mir im Herzen (here), however my changes were reverted as vandalism... why? 71.38.118.252 (talk) 06:51, 6 November 2010 (UTC)[reply]
Occasionally it has issues dealing with song lyrics because they do not follow standard acceptable wiki formatting. We're looking into adding special cases in code, and increasing dataset size should help as well. Crispy1989 (talk) 07:09, 6 November 2010 (UTC)[reply]

[25] I don't like the use of the word 'your', but to call the original edit vandalism is a stretch. Philip Trueman (talk) 11:22, 6 November 2010 (UTC)[reply]
This is the kind of false positive I'd expect - poor edits with borderline vandalism qualities. Even these should be reduced with a larger dataset (containing constructive edits with these traits). In addition to the word "your", the lack of space after the previous sentence was also a factor - it registers punctuation present in the middle of words (other than things like apostrophe). Crispy1989 (talk) 23:00, 6 November 2010 (UTC)[reply]

[26] (filling in the chronology part of an infobox with a link) definitely isn't vandalism. PleaseStand ^(talk) 19:21, 6 November 2010 (UTC)[reply]
Definitely something that could be fixed with a larger dataset. Crispy1989 (talk) 23:01, 6 November 2010 (UTC)[reply]

[27] Maybe it was a poor, unsourced edit to content about a living person, but it's not vandalism. PleaseStand ^(talk) 19:59, 6 November 2010 (UTC)[reply]
The bot most likely figured it was a poor/borderline edit based on statistics, and perhaps the word "estained". The previous warning for vandalism was used as an estimation of good faith (1/3 of all previous edits made were vandalism at the time of the edit). As with the other similar false positives here, increasing dataset size and including cases where previous vandals make constructive edits should help. Crispy1989 (talk) 23:06, 6 November 2010 (UTC)[reply]

[28] Again, not vandalism. PleaseStand ^(talk) 20:11, 6 November 2010 (UTC)[reply]
The bot is failing to recognize this as a link. It currently recognizes external links with either [blah or <a href=blah syntax. I'll correct it to look for more general forms. Crispy1989 (talk) 23:10, 6 November 2010 (UTC)[reply]

[29] I first corrected the non-working external links pointing to Finnish Army Insignias on the Finnish Defence Forces' website. Next I added a few spaces to the links' texts to correct their appearance, and the bot reported this as unconstructive. I forgot to flag the last change as a small change, which may have affected the bot's report. Kime79 ^(talk) 14:57, 9 November 2010 (UTC)[reply]
Thanks for pointing this out - this brings to my attention that, although links are removed (and analyzed separately) before being input to the neural net, total size difference includes the links. Because links are very rarely this long, this threw off the neural net. I'll look into modifying it into removing links in a preprocessing step instead. Crispy1989 (talk) 15:02, 9 November 2010 (UTC)[reply]
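
The proposed preprocessing step might look something like the sketch below, which strips bracketed external links before measuring how much an edit changed the page size. The regular expression is a simplification that only covers bracketed http/https links, and all names are hypothetical. The bot's real link handling is not described in this discussion.

```python
import re

# Bracketed external links such as [http://example.org label]; a simplification
EXTERNAL_LINK = re.compile(r"\[https?://[^\]]*\]")


def strip_links(wikitext):
    """Remove bracketed external links from the text."""
    return EXTERNAL_LINK.sub("", wikitext)


def size_difference(old_text, new_text, ignore_links=True):
    """Change in length between two revisions, optionally ignoring links."""
    if ignore_links:
        old_text, new_text = strip_links(old_text), strip_links(new_text)
    return len(new_text) - len(old_text)


old = "See [http://a.example/x Insignia]."
new = "See [http://a.example/very/long/path/to/page.html Army Insignia]."
print(size_difference(old, new, ignore_links=False))  # link length dominates
print(size_difference(old, new))                      # links ignored
```

With links ignored, an edit that only repairs long URLs no longer looks like a large change to the article text.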

[30] Teach your bot what wikify and Manual of Style are.
What sort of a word is 'indiscovered'? Philip Trueman (talk) 03:22, 11 November 2010 (UTC)[reply]
The bot may have been picking up on the direct replacements of formal terms with informal terms (ie. your replacement of "large" with "big"), and the replacements of words with incorrect spellings of the words (ie. your replacement of "undiscovered" with "indiscovered"). If enough edits like this are added to the dataset and classified as constructive, the bot will stop recognizing it as vandalism. But it seems to me that this kind of edit is very borderline - adding misspelled words is one thing, but replacing correct words with misspelled ones, and formal words with informal ones, in multiple places, is another thing. Crispy1989 (talk) 03:32, 11 November 2010 (UTC)[reply]

[31] Edit incorrectly reverted for not being 'constructive'.
Considering that the article is about the TLD and not the slang word, this edit seems very borderline. Crispy1989 (talk) 16:09, 11 November 2010 (UTC)[reply]
But the bot actually reinstated the piece about the slang term. Ucucha 16:11, 11 November 2010 (UTC)[reply]
I can't believe I missed that - wow. Yeah, it's the same problem that has caused a few other issues. Not enough reverts in the dataset. Crispy1989 (talk) 16:14, 11 November 2010 (UTC)[reply]
[32] The 0.25% false positive statistic doesn't seem correct; if you're calculating it by taking the number of people who take the time to post on this page divided by its total amount of edits, you're going to get a very skewed "statistic". Shubinator (talk) 05:09, 14 November 2010 (UTC)[reply]
Just looked over the last 50 of the bot's contribs (for the record, that's these), and found two more: [33] [34]. By my (very informal) data, ClueBot NG has a 4% false positive rate. Don't get me wrong, the bot is unique and the work you're doing is great, but the bot definitely needs some tweaking before being let loose unmonitored. Shubinator (talk) 05:22, 14 November 2010 (UTC)[reply]
The false positive rate is the percentage of good edits it classifies as bad. I.e., it classifies 25 out of every 10,000 good edits as bad. And, yes, we realize there needs to be work done to tweak it -- that is why we have a review interface so we can create a better dataset. We calculated 0.25% by training with 20,000 edits in our current 30,000 edit dataset, and then having it classify the remaining 10,000, and seeing how many it said are vandalism, when our dataset said they were good. -- Cobi^(t|c|b) 06:20, 14 November 2010 (UTC)[reply]
Just to clarify what Cobi said: He's correct about how false positive rate is determined. To accurately determine what it is during a live run, you have to count the number of false positives in a time period, and divide that by total number of legitimate edits that were made in that time period. Also as Cobi said, the false positive rate is not determined by false positive reports. We divide our dataset up into two parts - 2/3 to use for training and 1/3 for trialing. That 1/3 is run through the network and is used for rate calculations. This should be a very accurate way of calculating it, assuming a representative dataset. Crispy1989 (talk) 07:41, 14 November 2010 (UTC)[reply]
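
The two figures being compared in this exchange use different denominators, which a short sketch makes explicit. The 25-in-10,000 figure is Cobi's example; the count of 475 correctly reverted edits is invented purely for illustration and is not a measurement of the bot. Both function names are hypothetical.

```python
def false_positive_rate(false_positives, good_edits):
    """Fraction of legitimate edits that were wrongly classified as vandalism."""
    return false_positives / good_edits


def wrong_revert_share(false_positives, true_positives):
    """Fraction of all reverts that hit a legitimate edit."""
    return false_positives / (false_positives + true_positives)


# 25 misclassified out of 10,000 good edits, as in Cobi's example
print(false_positive_rate(25, 10_000))  # 0.0025

# Hypothetical: the same 25 false positives alongside 475 correct reverts
print(wrong_revert_share(25, 475))      # 0.05
```

A sample of the bot's contributions measures the second quantity, so it will normally come out well above the false positive rate, because reverts are a small subset of all edits examined.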

This edit was reverted within seconds for no obvious reason. -- Smjg (talk) 16:36, 16 November 2010 (UTC)[reply]

[35] the improvement of a poor quote translation was reverted within seconds (see http://www.nybooks.com/books/imprints/classics/the-way-of-the-world/ for a source of the correct quote translation)

Reverting page deletion by author

This edit (now deleted) reverted the blanking of a page by its author. It is very confusing for an author who realises his page is inappropriate and blanks it, which is a frequent occurrence, when the inappropriate page is restored instead of being tagged db-g7. JohnCD (talk) 10:09, 3 November 2010 (UTC)[reply]

We'll add an exemption for the author of the page. Crispy1989 (talk) 16:31, 3 November 2010 (UTC)[reply]

False positives

See User_talk:ClueBot_Commons#Cluebot_-too_many_false_positives: too many false positives on the Wikipedia science reference desk, and the error report ID function seems broken. Sf5xeplus (talk) 13:25, 3 November 2010 (UTC)[reply]

Cluebot NG is not meant to edit anything outside of the main namespace. This is apparently a misunderstanding between the developer of the core and the developer of the Wikipedia interface. The interface will be changed to ignore edits not in the main namespace, unless at some point in the future we train separate neural networks for separate namespaces. Crispy1989 (talk) 16:33, 3 November 2010 (UTC)[reply]

That page was on the optin list. I've removed everything from the optin list, for now. Keep in mind, when users add pages there, they are inviting the bot somewhere where it has not been tested or designed for. It may work well. It may not. -- Cobi^(t|c|b) 17:33, 3 November 2010 (UTC)[reply]

Thanks. Sf5xeplus (talk) 19:00, 3 November 2010 (UTC)[reply]

Why did you populate User:ClueBot NG/Optin from User:ClueBot/Optin? No one requested that ClueBot NG revert pages outside of the article namespace. Peter Karlsen (talk) 20:16, 3 November 2010 (UTC)[reply]

I copied all of the control pages from ClueBot's userspace. I forgot to remove all but the comment at the top. -- Cobi^(t|c|b) 20:30, 3 November 2010 (UTC)[reply]

Then I'm concerned. Can you at least consider widening the scope to include the Template namespace? Philip Trueman (talk) 18:37, 5 November 2010 (UTC)[reply]

The method can be expanded to work with pretty much any namespace or content, but it should use a separate neural network, and must be trained on a training set from that namespace. I'd like to get the core perfected and approved for the main namespace first, then we'll look into generating datasets for other namespaces. If necessary, while getting it to work with other namespaces, the old heuristics-based cluebot could be run just on those namespaces. Crispy1989 (talk) 00:35, 6 November 2010 (UTC)[reply]
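
The namespace restriction discussed in this section amounts to a filter in front of the classifier. A minimal sketch, assuming each edit is represented as a dictionary with a numeric `namespace` and a `title` (both field names are hypothetical), and treating the opt-in list as a set of page titles:

```python
MAIN_NAMESPACE = 0  # MediaWiki numbers the article namespace 0


def should_score(edit, opted_in_titles=frozenset()):
    """Return True if an edit should be passed to the classifier.

    Edits outside the main namespace are skipped unless the page was
    explicitly opted in.
    """
    return edit["namespace"] == MAIN_NAMESPACE or edit["title"] in opted_in_titles


article_edit = {"namespace": 0, "title": "Shell game"}
refdesk_edit = {"namespace": 4, "title": "Wikipedia:Reference desk/Science"}

print(should_score(article_edit))
print(should_score(refdesk_edit))
print(should_score(refdesk_edit, opted_in_titles={"Wikipedia:Reference desk/Science"}))
```

The last call shows why a populated opt-in list led to reverts on the Reference Desk even though the network was trained only on articles.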

Review Interface

I already mentioned this, but it's important, so I'll bring it to attention again. Cluebot NG's accuracy depends almost entirely on its dataset. By fixing its current dataset, and helping to classify new edits, you can help to greatly improve its performance. We have an interface specifically designed for this, and should make it easy for volunteers to help out. The interface can be found at this link. You need a Google account to use it, and we need to authorize you to access it. If you'd like to help out, please follow the link and go to the signup section. Help is needed, and greatly appreciated! Crispy1989 (talk) 22:07, 3 November 2010 (UTC)[reply]

Thank you to all the people helping with dataset classification! We've added some stats for who's doing what right here.

We're looking to double our current dataset size (currently a little over 30,000 edits) and replace it with a model closer to reality by using a truly random sampling of data. The interface is currently loaded with around 70,000 edits - about a day's worth. Each edit must be reviewed by at least two different people (more if the first two disagree). If we can get this data, I believe the bot's performance can significantly improve, even from what it's at right now. Crispy1989 (talk) 17:17, 4 November 2010 (UTC)[reply]
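
The review rule stated here (at least two reviewers, more if they disagree) could be sketched as below. How the real interface resolves disagreements is not described, so the rule used here is an assumption: a label is accepted once it has at least two votes and strictly more votes than any other label, with "skip" not counted as a vote. The function name `resolve` is hypothetical.

```python
from collections import Counter

MIN_REVIEWS = 2  # "at least two different people"


def resolve(labels):
    """Return the agreed label, or None if more reviews are needed.

    labels holds one classification per distinct reviewer:
    "constructive", "vandalism" or "skip".
    """
    votes = [label for label in labels if label != "skip"]
    if len(votes) < MIN_REVIEWS:
        return None
    (top, top_count), *rest = Counter(votes).most_common()
    if top_count >= MIN_REVIEWS and (not rest or top_count > rest[0][1]):
        return top
    return None


print(resolve(["vandalism"]))
print(resolve(["vandalism", "vandalism"]))
print(resolve(["vandalism", "constructive"]))
print(resolve(["vandalism", "constructive", "constructive"]))
```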

Is there any, umm, help or documentation for this interface? I've activated my Google account, I've got as far as the screen that asks for my Wikipedia username to match my Google email id, and I'm looking at a page that says "Stored.". Now what do I do? Philip Trueman (talk) 18:48, 5 November 2010 (UTC)[reply]

Sorry, I need to fix that message to be more intuitive. It means you were added to the list of users for admins to review. I've approved you. You should be getting an e-mail about it. -- Cobi^(t|c|b) 19:40, 5 November 2010 (UTC)[reply]

Thanks - it's working now, and I've done a few. A few comments: If the dataset is aimed at mainspace articles only, why was I offered a User talk space edit? I classified it as per its space - in an article it wouldn't've been good but it was fine as part of an attempt at dialogue. Also, some of the edits were edits made by approved bots that might equally well have been made by a human (e.g. RjwilmsiBot adding ((Persondata))). Couldn't these have been automatically classified as OK? Finally, if there's any disagreement with another reviewer about any of my classifications then I'd appreciate learning about it, if only to improve my own performance. Philip Trueman (talk) 01:32, 6 November 2010 (UTC)[reply]

Thanks for your help. Although the bot currently is only being trained on mainspace articles, a few edits from other namespaces may have made their way into the random edit set. Classify these as you would normally (constructive, vandalism, skip). They simply won't be used for the main namespace training. We plan on expanding in the future to handle other namespaces as well, in which case, classifications from other namespaces will be used. We really don't want to assume any bot always makes good edits. Although this is usually the case in practice, we'd prefer to have every edit verified. Just classify these as constructive as usual (unless it's another anti-vandal bot with a false positive or something - in this case, it should probably be skipped). As for the question about being notified of any disagreement, I'll defer that to Cobi (the developer of the interface). Crispy1989 (talk) 01:43, 6 November 2010 (UTC)[reply]

I use Windows 7 and IE8. I had an edit come up in the review interface that caused IE to go into Compatibility View, and the diff it showed was blank. Sorry, can't remember which edit it was, but I marked the edit as 'Skip' (because I couldn't classify it) with a comment. Philip Trueman (talk) 12:35, 6 November 2010 (UTC)[reply]

Quiff

ClueBot NG gave a final warning to an IP editor for this edit (2 diffs). Is it just detecting that he's restoring content that was reverted by someone else, or is there something about the edit itself that's triggering ClueBot? Also, the warning given on the user's talk page suggests that ClueBot NG was giving the final warning simply because of the addition of the word "an" ... does ClueBot by default only give a diff link for one diff, or is it actually only the second diff that triggered ClueBot NG?—Soap— 22:40, 4 November 2010 (UTC)[reply]

Feel free to move this up to Section 2 if it fits better ... I'm not meaning to make my edit stand out from all the others. —Soap— 22:49, 4 November 2010 (UTC)[reply]

The only concept the bot has of restoring old content is if the edit summary says so. In this case, the edit was probably identified as vandalism because it had borderline statistics (but would not ordinarily be considered vandalism), combined with the fact that the user had vandalized a number of times before. Statistically, if a large portion of a user's previous edits have been vandalism, it's much more likely for their current edits to be vandalism. Alone this is not enough to trigger a vandalism classification, but it can push over the edge what might otherwise be a borderline edit. As the dataset grows, this will become more fine-tuned and less likely to be identified as vandalism, and the percentage of past edits that have been vandalism will remain a useful statistic in estimating good faith/bad intentions. Crispy1989 (talk) 00:25, 5 November 2010 (UTC)[reply]

Could you also clarify why the user was warned for adding "an"? - Kingpin¹³ (talk) 15:09, 5 November 2010 (UTC)[reply]

The neural network functions by analyzing statistics. Because "an" is a common word, word-based statistics do not apply. What the neural network sees is a user inserting a short word into the middle of an article - a user that already has several warnings. Without the existing warnings, the score would end up being 0.5 or less, well below the 0.95 threshold it's currently at. Multiple previous warnings significantly increase the probability that a given edit is classified as vandalism. Increasing dataset size and including instances where users with multiple warnings made constructive edits will decrease this kind of occurrence. Note: Removing this statistic from the neural network decreased catch rate when normalized to the same false positive rate, so this statistic is helpful overall to the performance of the method. Crispy1989 (talk) 16:46, 5 November 2010 (UTC)[reply]

Helping to classify ..

This [36] appears in the Dry Run but does not seem to be clear vandalism to me. Maybe greater weight needs to be given to the context of the change?

Is it actually worthwhile for humans to review the whole of the Dry Run? If so, what's the best way to flag what's been reviewed?

BTW, I tried to get myself a Google id to help out with reviewing the dataset and ended up writing a scathing comment about the user hostility of the application process. Philip Trueman (talk) 06:40, 5 November 2010 (UTC)[reply]

There are no (preset) weights. Statistics are combined using a neural network. To correct outlying datapoints like this, the dataset must grow. It's not really worthwhile to review the entire dry run - particularly since it's with an older version. The dataset review interface combines edits randomly from several sources - one of these sources is edits that the bot is unsure of. So the dataset review interface is by far the best way to help. Crispy1989 (talk) 06:51, 5 November 2010 (UTC)[reply]

possible false positive

I may be wrong, but this edit didn't seem like vandalism to me. --Ixfd64 (talk) 03:55, 8 November 2010 (UTC)[reply]

This can be fixed by enlarging the dataset, and by fine-tuning word categories. Crispy1989 (talk) 12:21, 8 November 2010 (UTC)[reply]

[37] You'll understand I'm a bit miffed. Philip Trueman (talk) 09:06, 8 November 2010 (UTC)[reply]

Exactly the false positive I was coming to report. I reverted ClueBot's reversion. Curious to see if I get a warning too. :D Millahnna (talk) 09:08, 8 November 2010 (UTC)[reply]

Ouch. This shouldn't be happening at all. The real issue can be fixed by enlarging the dataset (the current dataset doesn't contain many vandalism reversions) ... but there should be a hard threshold of edits per user. Users with more than 50 edits shouldn't be reverted at all - this is a bug in the Wikipedia interface code. We'll correct it ASAP. Crispy1989 (talk) 12:25, 8 November 2010 (UTC)[reply]
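
The hard threshold mentioned in this reply is a guard that runs regardless of the network's score. A minimal sketch, assuming the interface knows the user's edit count and the score threshold in force (the function and constant names are hypothetical):

```python
MAX_EDITS_FOR_REVERT = 50  # "Users with more than 50 edits shouldn't be reverted at all"


def may_revert(score, threshold, user_edit_count):
    """Decide whether the bot may revert an edit.

    Established users are never reverted, whatever the score.
    """
    if user_edit_count > MAX_EDITS_FOR_REVERT:
        return False
    return score >= threshold


print(may_revert(0.97, 0.95, user_edit_count=3))
print(may_revert(0.97, 0.95, user_edit_count=20000))
print(may_revert(0.50, 0.95, user_edit_count=3))
```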

It's OK, I'm not offended. Well, not much. Strangely, I've just been presented in the review interface with one of my own reversions. I added a comment asking for a 'Recuse' button ... Philip Trueman (talk) 14:45, 8 November 2010 (UTC)[reply]

[38], [39]

Poor quality edits to poor quality article rather than deliberate vandalism. Philip Trueman (talk) 05:36, 9 November 2010 (UTC)[reply]

In my opinion, false positives of poor quality edits aren't quite as bad as false positives of good quality edits - but they still shouldn't happen. These should also be able to be prevented by expanding the dataset. The second of these two even looks like it's so poor quality that it could be borderline vandalism. Crispy1989 (talk) 11:20, 9 November 2010 (UTC)[reply]

[40] Not vandalism. Not enough good edits in the database with the word 'toilets', right? Philip Trueman (talk) 08:52, 10 November 2010 (UTC)[reply]

[41] Certainly not vandalism; an improvement if anything. Philip Trueman (talk) 13:19, 10 November 2010 (UTC)[reply]

Both of these can only be explained by the dataset not being large enough. I'm not really sure why the second one was misclassified - it must just be a hole in the training data. Crispy1989 (talk) 13:23, 10 November 2010 (UTC)[reply]

[42] Ooops! Just asking, but is this a case where the bot would have reverted itself back again? The word 'iincluding' is presumably rare in good edits. If a diff counts as vandalism both ways then surely it should hold off. Also, how much does the bot know about article categories? Words like 'love' and 'hate' and 'pregnant' are normal in, say, Category:Serial drama television series when they're not in, say, Category:Chemical elements. Philip Trueman (talk) 08:46, 13 November 2010 (UTC)[reply]
The word "iincluding" is not present in the dataset at all, so it would not contribute at all to the Bayesian score. If a word has never been seen before, it is not assumed to be good or bad, beyond a few basic things to detect if it's gibberish or leetspeak. Also, for a word to contribute to the score at all, it has to appear in a certain minimum number of articles total (currently 6). You bring up a good point about the words in the categories. Right now, it assesses which words belong in context by checking added words against words that already appear on the page. This is usually sufficient, but as you pointed out, does have some holes. I may be able to figure out a way to determine statistical word relations - not as in a Markov chain, or a Bayesian classifier, but in a broader sense to sort-of automatically categorize an article. Crispy1989 (talk) 16:05, 13 November 2010 (UTC)[reply]
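The minimum-count rule described above can be sketched roughly as follows. This is only an illustration in Python, not ClueBot NG's actual C++ code: the function name, the data layout, and the plain ratio used as a score are assumptions, and only the cutoff of 6 comes from the comment above.

```python
from collections import Counter

MIN_ARTICLES = 6  # cutoff quoted in the discussion; everything else here is hypothetical


def word_scores(added_words, vandal_counts, good_counts, min_total=MIN_ARTICLES):
    """Return {word: fraction of dataset appearances that were vandalism}.

    vandal_counts / good_counts map a word to the number of dataset
    articles where it was added by vandalism / by constructive edits.
    Words seen fewer than min_total times are skipped entirely, so they
    contribute nothing to the score in either direction.
    """
    scores = {}
    for word in set(added_words):
        bad = vandal_counts.get(word, 0)
        good = good_counts.get(word, 0)
        total = bad + good
        if total < min_total:
            continue  # unseen or rare word: treated as neither good nor bad
        scores[word] = bad / total
    return scores


# Made-up counts, purely for illustration.
vandal_counts = Counter({"poop": 9, "hate": 4})
good_counts = Counter({"poop": 1, "hate": 6, "including": 50})
print(word_scores(["iincluding", "poop", "hate"], vandal_counts, good_counts))
```

With these made-up counts, "iincluding" is dropped because it never appears in the dataset, which matches the behaviour described for the edit in question.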

Riding an old hobby-horse

This edit [43] is fine as far as it goes, but clearly needed to go one revision further back. I don't know whether the 100 or so things the neural network takes into account include the identity of the editor of the version that would be rolled back to. In my experience if reverting a bad edit by an IP would mean rolling back to a version last edited by a similar IP then it's worth digging deeper, and I've modified my anti-vandal tool to warn the user in this case. I seem to recall that at least one anti-vandal bot had a rule not to revert in such a case, so as not to 'lock in' an earlier bad edit. Perhaps what's really needed here is a semi-protected page the bot can write to flagging up articles it thinks need human attention. Philip Trueman (talk) 10:25, 9 November 2010 (UTC)[reply]

There are a few things involved here in figuring out how to handle these situations. While it's true that if an edit is vandalism, immediately previous edits by the same user on the same article are probably also vandalism, it begs the question, why didn't the bot catch the earlier edit? The best thing to do is just to keep improving the bot (which I'm doing) and the dataset. It would definitely be possible to post borderline edits somewhere - the neural net generates a score which is compared against a threshold. The threshold (currently around 0.95) is calculated from a given false positive rate at dataset training/trial time. A second threshold could be set somewhat below this, where edits falling into that group could be posted somewhere. At the ~0.95 threshold it's currently catching 60% of vandalism with 0.25% false positives (calculated from the trial dataset). At a threshold of around 0.65, it gets over 90% of vandalism (with about 3% false positives). Maybe a threshold around 0.65 would be useful. Crispy1989 (talk) 11:38, 9 November 2010 (UTC)[reply]
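The relationship between a chosen false positive rate and the score threshold can be sketched as below. This is a minimal illustration assuming the trial set is simply two lists of neural-net scores (one for constructive edits, one for vandalism); the function names and the sample scores are made up and are not taken from the bot's code.

```python
import math


def threshold_for_fp_rate(good_scores, max_fp_rate):
    """Pick a threshold so that at most max_fp_rate of constructive
    edits score strictly above it (those would be false positives)."""
    ranked = sorted(good_scores, reverse=True)
    allowed = math.floor(max_fp_rate * len(ranked))  # tolerated false positives
    if allowed >= len(ranked):
        return float("-inf")  # every edit may be reverted
    return ranked[allowed]


def rates(good_scores, bad_scores, threshold):
    """Return (false positive rate, vandalism catch rate) at a threshold."""
    fp = sum(s > threshold for s in good_scores) / len(good_scores)
    caught = sum(s > threshold for s in bad_scores) / len(bad_scores)
    return fp, caught


# Hypothetical trial scores, for illustration only.
good = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.97]
bad = [0.50, 0.80, 0.96, 0.99]

strict = threshold_for_fp_rate(good, 0.0)
loose = threshold_for_fp_rate(good, 0.125)
print(strict, rates(good, bad, strict))
print(loose, rates(good, bad, loose))
```

Lowering the threshold catches more vandalism at the cost of more false positives, which is the trade-off behind the idea of a second, lower threshold for edits that are only flagged for human attention.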

I thought ClueBot used rollback anyway, but that's not important. It's not the same editor, so that's probably why it didn't analyze. (X! · talk) · @538 · 11:54, 9 November 2010 (UTC)[reply]

Perhaps I didn't make myself clear. It's precisely because rollback only rolls back consecutive edits by exactly the same editor that this case needs to be trapped. It is frequently the case that when the same article has consecutive edits by different IPs that are in the same narrow range that in fact they were made by the same person. If the latest edit is vandalism then the earlier ones are suspect. Using rollback in this case runs the risk that earlier vandalism may become locked in - further vandalisms will show up, to bots and in anti-vandalism tools, as bad diffs, but the earlier vandalism might remain in place for some time. Philip Trueman (talk) 13:11, 9 November 2010 (UTC)[reply]

If you have a good suggestion as to how to reliably determine when an IP is sufficiently similar to warrant reverting it as part of the rollback, I'd like to hear it, but I don't think that can be done without adding more false positives. As for the idea of a page for review, that could be done. Do you just want it for any previous IP who is in the same /24? -- Cobi^(t|c|b) 21:19, 9 November 2010 (UTC)[reply]

The only other piece of information I can think of that a bot could go on is how recent the previous edit was. If the same article gets hit several times in a short period by several IPs in a narrow range, and one edit is clearly vandalism, then (in my experience) all those edits are suspect, especially if it's the same line that's been edited. If there's a longish gap then it sometimes turns out that it's a case of different editors at an educational establishment editing a page about that establishment and the previous edit was in good faith. That's why I have my anti-vandal tool warn the user rather than revert back further. I have the range set at /16 and (I'm guessing here, I don't have any statistics) I'd say the previous edit is also bad at least 80% of the time. That's nowhere near enough for a bot, of course. BTW, I also have the tool hold off if it would revert to a version by a previously reverted vandal - is that worth considering? My preferred solution would be to hold off if the IPs are in the same /16 range, list the article for attention by a human, and (ultimately) give the bot sysop rights so it can briefly semi-protect the page if it considers that more than one of the IPs in that range has recently made a bad edit to the article.

Slightly separately, the idea of a page to log edits that are just below the threshold is attractive on the face of it, but in practice it may prove difficult to make it useful - many edits that are just below the threshold will be bad and will have been reverted by humans with good anti-vandal tools almost immediately, so it'll be out-of-date almost immediately. Philip Trueman (talk) 02:42, 10 November 2010 (UTC)[reply]

Interesting ideas. I'll leave it up to Cobi whether it's feasible or not to put additional rules to prevent reverting to a previous edit if the previous edit is potentially vandalism. I can say that this would likely incur a significant delay to fetch the extra information (although I'm not certain of this). Also, I believe it's possible to get the bot accurate enough to the point where the previous edit would have already been caught if it were vandalism. Another thing to consider is that vandals tend to follow a pattern - if the current edit is reverted, it's likely previous edits in the same style would also be reverted.

But your suggestions give me a few ideas for how to potentially improve accuracy. It may be possible to add an input to the neural network that is the time of the previous edit. Also, I may be able to add a parameter in cases where both the current and previous revisions are made by IPs - the "distance" between the IPs. This parameter would just be the smallest CIDR subnet size that contains both IPs. I'll look into this. Crispy1989 (talk) 02:58, 10 November 2010 (UTC)[reply]
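The IP "distance" parameter mentioned above could be computed along these lines. This is a sketch using Python's standard ipaddress module; the function name is hypothetical, and returning 0 for mixed IPv4/IPv6 pairs is an assumption made here, not something stated in the discussion.

```python
import ipaddress


def common_prefix_length(ip_a, ip_b):
    """Prefix length of the smallest CIDR block containing both addresses.

    32 means the two IPv4 addresses are identical; 24 means they share
    a /24; 0 means they have no leading bits in common.
    """
    a = ipaddress.ip_address(ip_a)
    b = ipaddress.ip_address(ip_b)
    if a.version != b.version:
        return 0  # assumption: IPv4 and IPv6 addresses are never "close"
    differing = int(a) ^ int(b)  # set bits mark positions where they differ
    return a.max_prefixlen - differing.bit_length()


print(common_prefix_length("192.168.1.10", "192.168.1.200"))
print(common_prefix_length("10.1.0.0", "10.2.0.0"))
```

A check such as `common_prefix_length(x, y) >= 16` would correspond to the "same /16 range" test Philip Trueman describes for his own tool, while the raw number could be fed to the neural network as an input.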

Here's [44] an excellent example of what I'd like the bot to hold off doing - or, at least, ask a human for help. Presumably the first bad edit wasn't bad enough, and the second was by an IP that had already been reverted on that page that day. So when it reverted the third bad edit it reverted to a version by an editor it had previously reverted. Philip Trueman (talk) 09:56, 11 November 2010 (UTC)[reply]

"Expand the dataset"

Firstly, I think this bot is doing a great job, however, it is getting a large number of false positives. Some of these false positives are things which can easily be fixed, at the source code for the bot. Such as not reverting experienced users, ignoring discussion pages, not reverting CSD-tagging etc. (I believe these things are being addressed in the code, but I'm not sure about that :D) and it's really good to see these getting resolved. However, the bulk of the false positives seem to be down to not having a large enough dataset. Some of these are understandable, for example edits which in the context of the article are good, but appear to be vandalism otherwise. But the large majority seem to be edits which can't really be said to look anything like vandalism. I'd just like to say I think it's key for this bot to not assume that edits are vandalism. Saying "we need to expand the dataset so the bot picks up more vandalism" makes sense, saying "we need to expand the dataset so the bot picks up less false positives" doesn't, for me anyway. - Kingpin¹³ (talk) 11:57, 9 November 2010 (UTC)[reply]

That's how artificial neural networks work. In this case, it is basically a classifier - either vandalism or not, with a given certainty. If the neural network has never seen a given edit before, its internal weights are not trained to classify it, so it may end up giving an unexpected output. In fact, the network needs much more good edits in its set than bad edits to not make false positives. — HELLKNOWZ ▎TALK 12:01, 9 November 2010 (UTC)[reply]

Well that's kind of what I mean. I understand that the reviewed edits are either "good" or "bad". So if you have only reviewed bad edits, the bot is going to be more likely to assume that edits are bad. But I think it's gone too far towards assuming the edits are bad. Maybe reviewing more good edits would deal with this - I don't know. Basically, it seems too concentrated on identifying bad edits, and not enough on identifying good edits. - Kingpin¹³ (talk) 12:06, 9 November 2010 (UTC)[reply]

Right now the dataset is roughly 50/50 vandalism/constructive. The dataset we are generating with the interface will come from a day's worth of edits (roughly 70k edits), and will have a more realistic ratio. -- Cobi^(t|c|b) 12:13, 9 November 2010 (UTC)[reply]

I should point out that the false positive rate is selectable, and can be reconfigured at any time. I should also point out that the false positive rate is currently set at 0.25% - and that the actual number of false positives is *below* this. For the issues in the code, here's a quick overview:

Redirects - Fixed. At the beginning of the trial, there was no metric to recognize these to input to the neural net, so the neural net just saw it as shouting. This metric has been added.
Various tags - Fixed. A metric was added for certain tags, and template names are now removed before statistical processing.
Non-main namespace pages - Fixed. This wasn't actually a bug, but was due to importing the old Cluebot's opt-in list. It has since been cleared.
Not reverting experienced users - Fixed. This was actually two separate problems. The first is that the edit threshold was initially too high, and has been decreased. The second is that the WP API was returning errors in a few cases, so the number of edits was being treated as zero. Error handling has been added to solve this.

Even context-specific false positives are at a much lower rate than existing bots, and can continue to be improved with a larger dataset.

Also, H3llkn0wz is right about the neural net - increasing dataset size and quality will both increase vandalism catch rate and decrease false positives. Cluebot-NG's false positive rate is very, very low, considering the sheer number of edits it reviews. Now, after fixing the programmatic issues, it's only getting a few false positives a day. As I mentioned earlier, Cluebot-NG's false positive rate is very low, but the false positives it does have aren't necessarily the same ones you'd expect from another bot.

About the dataset ratio, this actually doesn't really matter. Having a dataset ratio that differs from reality will affect the average result score from the neural net, but remember that the threshold is calculated and calibrated using a set false positive rate, so even if the average score is higher in general, the threshold will also be calculated to be higher, and will normalize the results. Crispy1989 (talk) 12:21, 9 November 2010 (UTC)[reply]

Multiple reverts

I'm not going to argue that these reverts shouldn't be done – it's quite obvious they should have been. However it was my understanding that the old ClueBot would not revert the same thing twice. Was this just a coincidence or was that true? This bot doesn't seem to follow that same pattern. Is that intentional or not? --Shirik (Questions or Comments?) 18:31, 9 November 2010 (UTC)[reply]

Cluebot-NG does follow the same behavior of the old Cluebot in this regard - the interface to Wikipedia (of which this functionality is a part) is largely just copied, and is the same code. Cobi knows the exact logic behind it, but my understanding is that, by default, it does not revert the same user/article combination twice in the same day, with some exceptions. These exceptions are for the article of the day (which this is), and any articles listed in the "angry opt-in list". Crispy1989 (talk) 18:49, 9 November 2010 (UTC)[reply]
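The rule as described above might look roughly like the following. This is a sketch of the behaviour as Crispy1989 understands it, not the PHP interface code; the class and method names are hypothetical, and reading "same day" as a rolling 24-hour window is an assumption.

```python
from datetime import datetime, timedelta


class RevertLog:
    """Tracks reverts so the same user/article pair is not reverted twice in a day."""

    def __init__(self, always_revert=()):
        # Pages exempt from the rule, e.g. the featured article of the day
        # or entries on the "angry opt-in list".
        self.always_revert = set(always_revert)
        self.last_revert = {}  # (user, article) -> time of last revert

    def should_revert(self, user, article, now):
        if article in self.always_revert:
            return True
        previous = self.last_revert.get((user, article))
        # Assumption: "same day" is treated as "within the last 24 hours".
        return previous is None or now - previous >= timedelta(days=1)

    def record(self, user, article, now):
        self.last_revert[(user, article)] = now


log = RevertLog(always_revert={"Today's featured article"})
t0 = datetime(2010, 11, 9, 12, 0)
log.record("192.0.2.1", "Some article", t0)
print(log.should_revert("192.0.2.1", "Some article", t0 + timedelta(hours=1)))
print(log.should_revert("192.0.2.1", "Today's featured article", t0 + timedelta(hours=1)))
```

The page titles and the IP address in the usage lines are placeholders.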

Another false positive for your dataset

IP added vandalism, another IP removed it and ClueBot tagged the second edit as vandalism. Timing issue I'm guessing? Millahnna (talk) 15:54, 11 November 2010 (UTC)[reply]

This is actually a real false positive. The dataset needs more instances of people reverting vandalism. Right now it has very few. The means used to generate it apparently don't generate a random sampling. As soon as the review interface generates a large enough dataset from random edit reviews, we'll replace our current dataset entirely. Crispy1989 (talk) 16:05, 11 November 2010 (UTC)[reply]

Just kind of a random question, but, what does the "NG" stand for? Allmightyduck  What did I do wrong? 03:47, 12 November 2010 (UTC)[reply]

Believe I read it stands for "Next Generation". N419 BH 07:44, 12 November 2010 (UTC)[reply]

"This is probably a silly question, but what does the "NG" stand for? New Generation? --Ixfd64 (talk) 20:22, 4 November 2010 (UTC) [reply]

Our intent was Next Generation. Crispy1989 (talk) 20:33, 4 November 2010 (UTC)"[reply]

Another Dataset Plea

Looking at some of the current data from the review interface, it seems that our training dataset is significantly biased. The bot's current performance, while still better than existing bots, is significantly inferior to what it could be. This is due entirely to the bias in the dataset. I'd like to scrap our entire existing dataset and replace it with the truly random sampling (and verified) edits from the review interface. But not enough edits have been reviewed yet to provide sufficient data for training. Is there anything we can do to make it easier to review edits, or make it seem more worthwhile to people? Thanks to those who are already helping! Crispy1989 (talk) 15:19, 12 November 2010 (UTC)[reply]

I reply to this with some diffidence, because I've already talked enough on this page. (Thank you! Thank you!) But I do have a few comments. Firstly, please give us some time: quite a few people, me included, have already helped out, and I see no indication yet of contributors dropping out - rather the reverse. But this whole project is very much a volunteer effort and we all have real lives elsewhere, no matter what may seem to be the case here. Secondly, it really would be helpful to have some feedback on our efforts: this is, I understand, a fairly basic result in experimental psychology - performance improves with feedback, even negative feedback, compared with no feedback. Even putting up a message to say "At the current rate we expect to go live with a fully-reviewed dataset in the middle of February" would give us a target to beat. Thirdly, it would be nice to have specific feedback on the quality of the classification of difficult edits. No-one expects an individual thank-you for correctly classifying the addition of a ((Persondata)) template or of "MRS FINKELSTEIN IS A GREAT BIG CHODE!!!", but in my experience some of these edits have proved quite difficult, and it's going to be the bot's ability to classify borderline cases correctly that will distinguish it from the rest, and justify the effort that goes into building it. I didn't really like the suggestion I saw somewhere that if two reviewers disagree then the edit will be dropped from the dataset - surely that is a recipe for blunting the sensitivity of the bot? If that is the case then I wonder what is happening to all those comments I've placed on difficult edits. What needs to happen is for those edits to be reviewed even more carefully, and perhaps even put up for community discussion. We'd all learn something, the reviewers as well as the bot. Enough! Philip Trueman (talk) 20:10, 12 November 2010 (UTC)[reply]

About the reviewing interface, it is really easy to work with. I do regularly get an error message that makes me have to refresh. I get both generic error messages telling me something went wrong and I need to refresh, as well as messages telling me that it is out of revisions. I like that there is a counter in the corner, so one can set a goal for themselves as 'I will review x amount of revisions this session', and then do just that.

About getting people to participate, this same problem is faced all the time by wikiprojects who organize 'drives' to improve certain parts of the encyclopedia. Some of the techniques I see used in these drives are: fixed timespans, clear goals, 'rewards' (meaning: glorified thank-you notes), and advertisement on places such as the 'Community bulletin board' (on the community portal). Arthena ^(talk) 22:57, 12 November 2010 (UTC)[reply]

Review interface is fine. Though it would be nice if 30% of edits for review were not my own bot's addition of persondata. To encourage wider participation include some stats (e.g. "after the n000 reviews bot accuracy has improved 5%" or whatever it is) and just politely spam the various tech village pump, bot owner noticeboard, huggle talk pages etc. Can we establish how many reviews are needed to reach production-level accuracy, set a target for the review phase? Rjwilmsi 00:13, 13 November 2010 (UTC)[reply]

Interesting points. The following have been added to Cobi's and my TODO list:

For giving feedback, it's not really possible to set a certain goal, because it will always be improvable. It would work fine right now. Rjwilmsi's suggestion about giving statistics on the bot's current accuracy, given the current dataset, is definitely possible, though. We're going to work on setting up a system to retrain and retrial the bot daily, each time using the new current dataset. The results of these trial runs will be posted. We may also be able to take this data over a period of a number of days and create things such as graphs of dataset size versus accuracy.
For discarding edits where there's some disagreement, we've decided to change this to a scheme where every edit is always classified as something (Vandalism, Constructive, or Skip), and that the classification that is used must have at least 3x the votes as any other classification.
For getting feedback on difficult edits, we've discussed ways to do this, and it may be possible to set something up, but it would likely require some restructuring of the database. The idea is to allow users to view a list of all edits they've classified, that others have classified differently, and allow them to view and add more comments, and change their existing vote. But the internal database currently cannot support this. The best way to implement this will be to wait until all edits currently in the database are classified (10,000ish), then upgrade the database. In the meantime, we'll see if there's any halfway point (possibly viewing controversial edits without being able to change the past vote) that we can implement without reconstructing the db.
Continue to try to figure out the few bugs that are causing random (but harmless) occasional error messages.

Crispy1989 (talk) 16:21, 13 November 2010 (UTC)[reply]
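The proposed voting scheme (the winning classification needs at least 3x the votes of any other) can be sketched as below. The function name and the handling of edge cases, such as a single uncontested vote, are assumptions; the discussion only gives the 3x ratio and the three labels.

```python
from collections import Counter


def final_classification(votes, ratio=3):
    """Return the winning label, or None if no label has at least
    `ratio` times the votes of every other label."""
    counts = Counter(votes)
    if not counts:
        return None  # nobody has reviewed this edit yet
    ranked = counts.most_common()
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return winner if top >= ratio * runner_up else None


print(final_classification(["Vandalism", "Vandalism", "Vandalism", "Constructive"]))
print(final_classification(["Vandalism", "Vandalism", "Constructive"]))
```

Under this reading an edit with a 2-to-1 split stays undecided and would need more reviews, rather than being dropped from the dataset.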

Providing a better user experience

ClueBot NG seems to catch vandalism much better than the old ClueBot did. However, we must not forget that regardless of how amazingly well such statistical techniques as artificial neural networks work, there will be false positives; it will never be possible to recreate 100% of the brain of a human RC patroller in computer software. When users who make acceptable edits have them reverted, misunderstandings arise. For example, see Old revision of User talk:ClueBot Commons/Archives/2010/November#are anons not allowed to post subst:prod.

I believe that a concise, informative FAQ page is a necessity if we are to approve this bot. We need to explain that:

The bot is not perfect, and it will never reach 100% accuracy, although its false positive rate has been set to revert only 1 in 400 legitimate edits. This is to help Wikipedia remain free of vandalism.
There are certain types of edits that the Wikipedia community does not find acceptable. (Summarize the vandalism policy here, including the different types of vandalism.)
The bot's revert of a user's edit does not necessarily mean that it is unacceptable.
If the user believes that his edit is not vandalism, he may repeat the edit, and the bot will not take action. (Include instructions for reverting the bot using undo, maybe even a link in the talk page message.)
The bot operators are open to suggestions of how to improve the bot, including reports of false positives. PleaseStand ^(talk) 22:02, 12 November 2010 (UTC)[reply]

A FAQ like this would indeed be useful, and I'll work on writing something up. A few comments on your list, though:

Several of these points are already mentioned in the warning the bot posts on user talk pages (although it can't hurt to have it be elsewhere in a FAQ as well).
It's probably a good idea to emphasize, "If this edit was made in good faith, do not be afraid to post a false positive report, and clear your good name." I can understand that new users could potentially be intimidated by a big warning, so something to this effect would probably be helpful.
It may not be the best idea to make it clearly apparent that the bot will not re-revert the same edit. Even now, without this fact being made clear, a significant amount of vandalism is being caught, but slips through when the user re-vandalizes the page. This behavior of the bot is necessary (unless the false positives are eventually somehow reduced to an incredibly low amount), but making it apparent to vandals, and even providing links for them to re-vandalize in one click, could drastically reduce the actual effectiveness of the bot.
The old Cluebot has a nice user-friendly false positive reporting mechanism. When Cluebot-NG goes into production, we'll bring this interface live again.

Crispy1989 (talk) 16:32, 13 November 2010 (UTC)[reply]

Old topic, but I wouldn't put "and clear your good name", that implies that the reversion is saying something about their name in the first place. It might be good to compare it to a spam filter as well, since people understand that those sometimes have false positives. Gigs (talk) 21:00, 21 November 2010 (UTC)[reply]

Glitch?

what triggered this? Choyoołʼįįhí:Seb az86556 ^{> haneʼ} 05:41, 14 November 2010 (UTC)[reply]

Looks like a dataset completeness issue to me. Crispy1989 (talk) 07:43, 14 November 2010 (UTC)[reply]

Some comments on the review mechanism

N.B. Bot owners - feel free to move this to a new page if you feel it doesn't belong here.

I'm sure that at least twice now I've had the same edit come up for review twice - one on the safety of microwave ovens and one about common given names in Azerbaijan. Does the interface not check whether a reviewer has seen the edit before?

I support Rjwilmsi in his comments about the frequency of his bot's edits. Maybe there should be a few, but I get the impression that we're heading for a skewed dataset. The idea of "one day's edits" is flawed - much will depend on what bots are active that day, whether school's in or out, and what the major news item of the day is (we're getting a lot of stuff about the 2010 mid-term elections right now). If it has to be a random selection then it needs to be sampled from a period of several weeks. Also, there should be a limit on the number of edits in the dataset by any given editor.

I could do with three more choices: "This needs a subject matter expert", "Content dispute", and "Recuse". I've had an edit of my own come up, and Rjwilmsi has been in the uncomfortable position of having to classify an edit by his own bot.

It would be nice to know what the criteria are for asking 'Are you sure?'. I've assumed that the answer is "This is the first time we've had that answer for this edit". If that's the case, then I'd say 0.5%-1% of the edits in the dataset are currently wrongly classified. Is that right? Some feedback on how many errors the reviewers have uncovered would be welcome.

Finally, how should we handle the case where the edit we're presented with is OK on its own, but is the latest of a string of edits by the same editor that cumulatively are bad? In my experience, this is a common case when doing RCP. Normally, I'd hit revert, but for the cumulative edit. What's the correct action here? Philip Trueman (talk) 11:10, 15 November 2010 (UTC)[reply]

You should not have gotten the exact same edit twice (maybe the user re-did their edit?).
We will be able to add a more random sampling over a span of a few weeks.
I don't see why there should be a limit, so long as it is proportional to the number of edits they make in a day.
"Subject matter expert" should be able to be handled by the skip or the refresh button. "Recuse" is handled by refreshing. "Content dispute" should be a skip. We could make a dedicated button to do the same as refreshing, though.
Furthermore, about a "recuse" button, clearly, if you get an edit of your own, it's constructive. It's not a courtroom, just dataset generation. If someone who makes vandalized edits has access to the interface anyway, there are much larger problems than someone classifying their own edit.
"Are you sure?" comes up in some circumstances where the current bot isn't sure on the edit.
We are working on adding some more feedback to the interface.
For an edit that isn't vandalism, but is in the same string of edits where vandalism occurred by the same user, just hit skip.

Hope that helps. -- Cobi^(t|c|b) 14:35, 15 November 2010 (UTC)[reply]

Yes, that helps. Thank you. Philip Trueman (talk) 16:14, 15 November 2010 (UTC)[reply]

Cobi has made some major improvements to the review interface based on received comments. One of the important asked-for improvements is that users can now view what others have voted on edits they've already reviewed by clicking on the counter in the top-right corner, and potentially change their own vote in retrospect. Note that you cannot view what others have voted before voting yourself - this is to prevent any prior bias. Also, the logic to determine the final result has changed, and contested edits are no longer discarded. Crispy1989 (talk) 02:12, 22 November 2010 (UTC)[reply]

Status Update

In the last day or so we've made some major improvements with the dataset. We discovered an issue with the dataset we've been using. The output of the dataset downloader was not matching the output of the live downloader, essentially adding some degree of randomness to some of the fields, and causing the bot's live performance to not measure up to its theoretical performance based on a dataset trial. After rewriting the dataset downloader to use the same code as the live downloader, and regenerating the dataset, the bot's live performance is now much closer to its theoretical dataset performance (before the live bot was catching only about 10% of vandalism, about twice that of existing bots - now it's catching 50%-60%, in the range of the dataset trial). The false positive rate remains at the same 0.25% as before.

Also, the classifications from the review interface are now enough to start being used. There aren't enough to use as a training dataset yet, but there are enough to use for trials. This means two things. First, it means we can train the bot using our entire existing dataset, instead of reserving a portion for trials. This should slightly increase accuracy. Second, it means that the statistics we give about the accuracy of the bot are now guaranteed to be accurate and unbiased (the 50%-60% above is an example). Crispy1989 (talk) 14:50, 15 November 2010 (UTC)[reply]

Trial complete.

Trial Summary

The trial is now over, and I'd like to take a moment to go over what was found during the trial.

Problems found and fixed during the trial

Redirect handling.
Quote handling.
Speedy deletion tag handling.
Imported opt-in list.
Incorrect downloading of some fields in the dataset.
Reverting own edits.

Outstanding issues that can be fixed by improving the dataset

Reverting occasional vandal reverts.
A few "bad words" that haven't been seen to be used in good edits.
A few random, rare statistical flukes.

Things that can be improved

Better markup handling.
Larger, more accurate dataset.

End-of-trial statistics

False positive rate below the set 0.25% (the false positive threshold is calculated before applying revert exemptions, such as minimum edit count).
Vandalism catch rate at approx. 55%. Vandalism revert rate at an estimated 40%. Not all caught vandalism is reverted, mostly because the bot won't re-revert edits, and users often re-vandalize.

Overall

The bot performs as expected. The false positive rate (which can still be adjusted if necessary) is set at 0.25%, which, after the revert exemptions, causes only a few false positives per day. This is below the false positive rate of existing bots. The vandalism catch rate, determined by using the random sampling of edits from the review interface, is right around 55%, about an order of magnitude more than existing bots. This puts a very large dent in vandalism on Wikipedia, and will continue to improve.

While there are things that can still be improved to catch more vandalism, the false positive rate will always remain at a fixed percentage. Further improvements will yield a greater vandalism catch rate, but the false positive rate is adjusted by hand, and will not change unless it is decided that it should change.

The single most important thing for improving the bot is improving the dataset. Many people are already contributing large amounts of time to this purpose, and because of this, we can now use a real random sampling for statistics determination. As these people, and others, continue to help, we'll eventually be able to use the random sampling as a training set as well.

Request

I'd like to ask for an extended trial. The bot is production ready, and performs much better than existing bots, both in terms of false positives and vandalism catch rate. But an extended trial will maintain interest in helping us to expand the dataset so it becomes as good as it can be, while still reverting vandalism just as well as it would in production. Crispy1989 (talk) 23:20, 16 November 2010 (UTC)[reply]

Approved for extended trial (14 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. It seems the biggest thing needed is the improved dataset. Anomie ⚔ 04:40, 18 November 2010 (UTC)[reply]

Trial 2 discussion

Question

What does "possible vandalism by 41.252.6.218 to version by SicaSunny" mean?

It means the bot incorrectly identified the edit as vandalism. This false positive looks like it was caused by the bot not recognizing HTML color codes as such. This will be fixed as soon as the parser is complete. Crispy1989 (talk) 16:02, 18 November 2010 (UTC)[reply]

Thank you

[45] Philip Trueman (talk) 04:49, 19 November 2010 (UTC)[reply]

False positive experience

It was not clear to me from the warning I got that I could revert the bot action. All the text was harsh, with none of the FAQ comments (addressed up page). Also, I was not sure if additional attempts to edit would make the bot treat me more and more as a vandal (as can happen with spam catchers on blogs that act up). 72.82.33.250 (talk) 08:19, 20 November 2010 (UTC)[reply]

In this case, the false positive looks like it was partially a result of the earlier vandalism warning. Also, if you do not have any prior warnings, the first warning it gives is much nicer, and is more clear about what to do in the case of a false positive. If you have any ideas how to make it clear (on subsequent warnings) that the bot will not revert the edit twice, without making it obvious to vandals (and easy to revandalize), we'd love to hear them - we've been trying to think of a good solution to this ourselves. Crispy1989 (talk) 17:08, 20 November 2010 (UTC)[reply]

I think the initial vandal tagging was a bit of an over-reaction for the type of error I committed, but I don't want to get into it more. I'm not "wounded". Just a datapoint among many. WRT the bot, I would leave it as is, in terms of harsh remarks for tagged vandals. The collateral damage is probably small and the benefits high. Just keep an eye on it.

Also, for whatever it is worth, the mechanism for reporting a false positive seems pretty daunting, especially for a new user (I basically blew it off, for example). I suspect that the average false positive will not be reported through the mechanism now required. So if your 0.25% is the reported false positives, then the true rate is probably significantly higher. Some manual surveying ought to show that. (Still, 0.25% may be the right setting, but just realize actual collateral damage is higher.) Of course, if I don't understand this, so be it...just trying to help.

0.25% false positives is based on a dataset trial with random, human-verified, edits. It is not only accurate, but the actual rate is less, because post-processing prevents reversions in some circumstances that will remove some false positives. The number of reported false positives has no bearing on these reported statistics.

If you have an idea how to improve the false positive reporting, go ahead and make any changes you feel could improve it. Just tacking on a new section was getting unmanageable. I don't have any particular preference on exactly how it works, but I think that actual discussion should be clearly separated from false positive reports, and the false positive reports should be represented in a concise manner. Crispy1989 (talk) 20:49, 20 November 2010 (UTC)[reply]

I'm not sure mechanically how to improve it. It's just discouraging for a user to feel like he is prey to a machine and then that the appeals process is arduous. No biggie, just a datapoint...

ClueBot NG Reverting good faith edits

For example here. Access Denied – talk to me 16:44, 20 November 2010 (UTC)[reply]

I also found this. Wow. Access Denied – talk to me 16:46, 20 November 2010 (UTC)[reply]

False positives with Cluebot-NG are (essentially) inevitable. The amount of caught vandalism depends on a set false positive rate. Currently, the FP rate is set at 0.25% - this has generally been deemed an acceptable price for eliminating over 50% of vandalism. Of the false positives that do exist, most are poor quality edits in some way (like the first of these two edits) that share traits with vandalism. There are occasionally unexpected reverts that don't appear to have any vandalism traits. This is a consequence of using a neural network as a core, and these should virtually disappear as the dataset grows. Crispy1989 (talk) 16:57, 20 November 2010 (UTC)[reply]

If you accept that 0.25% of edits it catches are FPs, then I think it needs constraints on how many times it will revert the same editor. If it's possible, 1RR would be advisable for edits that aren't certain vandalism. It can afford to be more aggressive on things like profanities and blanking. HJ Mitchell | Penny for your thoughts? 01:19, 22 November 2010 (UTC)[reply]

The bot already adheres to 1RR. It does not revert the same user/article combination more than once in the same day. This allows users that are reverted as a false positive to simply redo their edit without being reverted. The bot does not contain simple heuristics, so we cannot make it more aggressive for certain offenses. However, it may be possible to override the 1RR (this rule does make the bot miss a fair amount of second-time vandalism) in some strict circumstances, such as where the edit has a very high score, and more than half of the user's previous edits have been vandalism, or something like that. But before overriding 1RR under any circumstances, there should be significant community discussion on the issue. Crispy1989 (talk) 01:25, 22 November 2010 (UTC)[reply]
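The once-per-day rule described above can be sketched as follows. This is a minimal illustration, not the bot's actual PHP interface code: the class and method names are hypothetical, and "the same day" is assumed here to mean a rolling 24-hour window rather than a calendar day.

```python
from datetime import datetime, timedelta


class RevertLimiter:
    """Allow at most one revert per (user, article) pair within a window.

    Hypothetical helper for illustration; the 24-hour window is an assumption.
    """

    def __init__(self, window=timedelta(days=1)):
        self.window = window
        self._last_revert = {}  # (user, article) -> time of the last revert

    def may_revert(self, user, article, now):
        """Return True if this user/article pair has not been reverted recently."""
        last = self._last_revert.get((user, article))
        return last is None or now - last >= self.window

    def record_revert(self, user, article, now):
        """Remember that the pair was reverted at time `now`."""
        self._last_revert[(user, article)] = now


limiter = RevertLimiter()
t0 = datetime(2010, 11, 22, 1, 0)
limiter.record_revert("ExampleUser", "Example article", t0)

# A redo of the same edit two hours later is left alone.
print(limiter.may_revert("ExampleUser", "Example article", t0 + timedelta(hours=2)))  # False
```

Keying on the user/article pair, rather than on the user alone, is what lets a wrongly reverted editor simply redo the edit while the bot can still act on the same user's edits to other pages.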

Additional: [46] this is reverting the addition of an internal English Wikipedia link, formatted as an external link. Presumably it looks like spam, but perhaps this can be tweaked. Rd232 ^talk 20:17, 21 November 2010 (UTC)[reply]

New headers each warning??

Hi - it seems that ClueBot NG makes a new header for the month for every first warning they give out - there are 3 November 2010 headers here : [47] - if you can fix it that would be great :) --Addi hockey 10^e-mail 18:16, 22 November 2010 (UTC)[reply]

This is a known issue and appears to be intermittent. Cobi is working on fixing it. Crispy1989 (talk) 19:08, 22 November 2010 (UTC)[reply]

It's an issue with the fact that ClueBot NG (and ClueBot) simply append a subst'd template to the end of the talk page (specifically, this one). Someone decided to add a header to the level 1 template. I will see about fixing this properly in the code, as it cannot be done in a template, but it is somewhat low-priority right now. -- Cobi^(t|c|b) 19:42, 22 November 2010 (UTC)[reply]

Race condition?

This edit by the bot restored a bit of vandalism that had just been reverted in the same second. My guess is that it appropriately identified the vandalism but missed that it had been changed before it got to do it itself? --John (User:Jwy/talk) 20:37, 22 November 2010 (UTC)[reply]

It's not a race condition, just a normal false positive. It presently has a few issues with vandalism reverts, because there are few/none present in the dataset. This should stop with time when the review interface dataset becomes large enough to use as a training set. Crispy1989 (talk) 20:40, 22 November 2010 (UTC)[reply]

Thanks for the reply. I'll assume you will take care of getting it into your database if useful? --John (User:Jwy/talk) 21:34, 22 November 2010 (UTC)[reply]

Blanking a section of an article, and replacing it with "DONKEY BALLS" isn't even remotely acceptable behavior for a bot. It's also 100% preventable, without dataset extension: the bot simply needs to evaluate the reversion it is about to make, as though it were considering whether an edit made by another user would be identified as vandalism. If so, the reversion should be suppressed. I assume that the bot would consider section blanking and replacement with "DONKEY BALLS" to be vandalism if done by a non-whitelisted user, as this is the sort of malicious edit that is most easily identified by anti-vandalism bots. Peter Karlsen (talk) 04:46, 23 November 2010 (UTC)[reply]
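The check proposed above (score the bot's own pending revert as if it were someone else's edit, and suppress it if it would be classified as vandalism) might look roughly like this. It is only a sketch: `safe_to_revert` and `toy_score` are hypothetical names, and `toy_score` is a crude stand-in used purely so the example runs. The real bot scores edits with a trained neural network, not with heuristics like these.

```python
def toy_score(old, new):
    """Crude stand-in scorer (NOT the bot's classifier): returns a value in [0, 1].

    Flags edits that remove most of the text or that add all-caps words.
    """
    removed = max(0, len(old) - len(new)) / max(1, len(old))
    old_words = set(old.split())
    added_words = [w for w in new.split() if w not in old_words]
    shouting = any(w.isupper() and len(w) > 3 for w in added_words)
    return max(removed, 1.0 if shouting else 0.0)


def safe_to_revert(current_text, restore_text, score_edit, threshold):
    """Treat the pending revert as an ordinary edit from current_text to
    restore_text, and allow it only if it would not itself be reverted."""
    return score_edit(current_text, restore_text) <= threshold


article = "The donkey is a domesticated member of the horse family."
vandalized = "DONKEY BALLS"

# Restoring the vandalized text over the good text is suppressed.
print(safe_to_revert(article, vandalized, toy_score, threshold=0.5))  # False
# Restoring the good text over the vandalized text is allowed.
print(safe_to_revert(vandalized, article, toy_score, threshold=0.5))  # True
```

Whether such a check would have caught the specific revert discussed here depends on how the real classifier scores the restored text, which this sketch does not model.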

false positive rate

After reviewing hundreds of bot edits, I'm concerned that the false positive rate may be set too high. The 0.25% false positive rate sounds impressive until you consider more intuitive measures of performance. Assuming 10% of edits are malicious and the bot reverts 60% of those, a 0.25% false positive rate implies that 3.6% of the bot's reverts (1 in 28) are false positives. If the bot makes 2,500 reverts per day, that's 2,410 good reverts and 90 false positives per day.

If you view Wikipedia purely as a data repository, that looks like great progress. However, Wikipedia is also a community of editors, one constantly in need of "new blood". I believe that false positives do great harm to the Encyclopedia by driving away good-faith contributors. Most of the editors hit with false positives are newcomers with less than ten edits. If an experienced editor gets wrongly reverted, she presumably knows enough to take it with a grain of salt. But a good-faith user whose first or fifth contribution is reverted three seconds later by a bot is unlikely to return. Most don't bother to report the error. Of course, such harm must be balanced against our workload as vandal-fighters and the harm that might occur if more vandalism went undetected. I raised these issues at User_talk:ClueBot_Commons#false_positive_rate and got many interesting replies, and now I am moving the discussion here. --Stepheng3 (talk) 20:01, 28 November 2010 (UTC)[reply]
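The per-revert figures quoted in this comment follow from the stated assumptions (10% of edits malicious, 60% of those reverted, 0.25% of legitimate edits wrongly reverted). The short sketch below reproduces the arithmetic; the function name is hypothetical, the inputs are the commenter's assumptions rather than measured values, and revert exemptions applied after scoring are ignored.

```python
def per_revert_false_positive_share(vandal_fraction, catch_rate, fp_rate):
    """Fraction of all reverts that hit legitimate edits.

    vandal_fraction: share of all edits that are vandalism
    catch_rate:      share of vandalism edits that get reverted
    fp_rate:         share of legitimate edits that get reverted
    """
    true_reverts = vandal_fraction * catch_rate
    false_reverts = (1.0 - vandal_fraction) * fp_rate
    return false_reverts / (true_reverts + false_reverts)


share = per_revert_false_positive_share(0.10, 0.60, 0.0025)
print(f"{share:.1%}")        # 3.6%
print(round(1 / share))      # 28  (about 1 revert in 28)
print(round(2500 * share))   # 90  (false positives among 2,500 reverts)
```

The calculation shows why the two measures diverge: the 0.25% rate is taken over the much larger pool of legitimate edits, while the per-revert share is taken over the smaller pool of reverted edits.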

To summarize the discussion on the talk page (not in chronological order):

Initially, several people misunderstood the meaning of "False Positive Rate", although it has been clearly explained in multiple places that it means "portion of legitimate edits that are incorrectly classified as vandalism".
Using estimations from several users on actual number of false positives, the actual live false positive rate was calculated to be well within the stated 0.25%.
Someone pointed out that many of the supposed "false positives" reported by the user(s) opposed to the bot's current performance are not actually false positives, and were indeed correctly reverted as vandalism. Even so, the counted number of false positives was within the expected 0.25%.
A user used the bot's administrator shut-off (intended to be used when the bot is behaving unexpectedly) to stop the bot's operation. The same user reversed this decision about a day later.
There was some misunderstanding about the accuracy of the false positive rate. 0.25% false positives is not based on number of reported false positives, but is an accurate number based on a dry trial of random edits not used for training.
Much of what is explained on the bot's user page and on this BRFA was reiterated, including how the FP rate is calculated, and how a certain number of FPs are necessary for the bot's proper operation.
The impact of vandalism and importance of human vandal fighters' time was reiterated by myself and several other impartial users.
It was implied that all users subject to a false positive will leave Wikipedia and never edit again. This was given without proof and is incorrect.
The statistic that "1 in 400 incorrectly reverted legitimate edits is worth 200 in 400 correctly reverted vandalism edits" was put forth, and debated.
Whether or not the time that human vandal fighters spend patrolling edits is significant, was debated.
Whether or not human vandal fighters catch 100% of vandalism immediately, was debated.
The fact that ClueBot NG's false positives are not what one would expect from a normal bot and often are not triggered by things such as bad words, was reiterated. This makes it much clearer to users that are subject to false positives that they did not do something wrong.
It was suggested by an impartial user that his/her own human false positive rate is likely greater than the bot's.

My position, and that of several others posting there, is that reducing human vandal fighter workload by half or more allows them to contribute significantly more new material to the encyclopedia. It also prevents half or more of the vandalism that currently gets through, from getting through, keeping Wikipedia twice as clean from undetected vandalism. I believe this is well-worth the minimal impact of less than 1 in 400 false positives, particularly considering that the warning makes it clear the revert may have been a false positive, and provides instructions for undoing the revert. Crispy1989 (talk) 21:16, 28 November 2010 (UTC)[reply]

Your comments are incorrect, and self-contradictory. If users presently fighting vandalism reduce the amount of time spent on it to "contribute significantly more new material to the encyclopedia", the purported vandalism-reduction benefits of the bot will be blunted by the diminution of human effort in this area. However, if users do indeed contribute less time to vandalism reversion, the more likely outcome will be a reduction in their total contributions, since most users with a desire to write content are already doing so. Have many users actually said that they would write more for Wikipedia, if only they weren't tied up with RC patrol?

For the purpose of comparing this bot's false positive rate to that of human users, it is absolutely imperative that the rate be quoted in the same terms that would intuitively be used to measure the accuracy of human anti-vandalism efforts: the percentage of the total edits reverted that are false positives. Once the false positive rate is provided in a comprehensible format, I believe that the ugly truth will become apparent: one would be hard pressed to find nearly as many false positives in 250 edits reverted by an experienced, skilled human user, as there would be in the same number of edits reverted by this bot during approximately the same time. Any edits which can be automatically identified as almost certainly vandalism, without an unacceptably high false positive rate, are already blocked by the edit filter. Contributions which are accepted by the filter currently require human judgment to evaluate, to avoid automated violations of WP:BITE and discouragement of new editors by an unacceptably inaccurate bot.

The time spent on developing this bot shouldn't be considered wasted, however. Perhaps the neural network feature could be integrated with a human-assisted anti-vandalism program such as Huggle, as a means of identifying edits which are probably vandalism, with a user-adjustable threshold score for identification. Peter Karlsen (talk) 23:58, 28 November 2010 (UTC)[reply]

The statements you say are incorrect are not my saying. They are summarized from the talk page discussion. Please take the time to read there. And yes, there are users that have made these statements.

Whether or not the bot is used to revert vandalism is up to the BAG and will be decided at the completion of its trial. The benefits of using it in a human vandalism program are limited, as it is designed as a first line of defense. Considering that the false positive rate is adjustable and can be easily changed (I don't know why I have to keep saying this, people just don't seem to understand), there's no reason it shouldn't be approved.

I find it unfortunate that I have to spend so much time repeating over and over things I have already said, when I could be spending the time improving the code. But apparently this is a necessity in getting community approval. Crispy1989 (talk) 00:08, 29 November 2010 (UTC)[reply]

I am also surprised that, in all this complaining, nobody has suggested simply using an alternate false positive rate. I'll even take suggestions for thresholds. Every time Cluebot NG reverts an edit, it leaves a score. Suggestions for a score threshold or false positive rate (within reason) will be considered, and I can post stats on bot effectiveness given either the threshold or FP rate. Crispy1989 (talk) 00:31, 29 November 2010 (UTC)[reply]

I am well aware that the target false positive rate is adjustable by the bot operators only. A member of BAG could presumably require you to reduce the rate. However, I have taken notice of the fact that, despite the mounting criticism of the bot's incorrect reversions, you have not actually reduced the false positive target. There would certainly be no objection to the bot running at a lower false positive rate than the one under which it was approved for the trials. This refusal to modify a clearly problematic bot task until a BAG member actually forces you to do so is worrisome. Therefore, I am evaluating the bot based on its present mode of operation, rather than some hypothetical alternate configuration that might exist, had you been more responsive to the community's concerns.

The value of integrating a neural network into an application like Huggle is that existing filters used to present possibly malicious edits for human examination are extremely primitive. I would venture to say that over half of the "filtered" edits are not vandalism, while much of the vandalism that this bot catches is missed. The benefits of using a neural network to allow human users to identify likely vandalism in a flood of other edits would be extraordinary, especially considering that the target false positive rate could safely be set much higher in a manually-confirmed reversion application.

Your claim that "in all this complaining, nobody has suggested simply using an alternate false positive rate" is untrue [48] - and the fact that you made it shows a disregard for community input. Whether there's a reason the bot shouldn't be approved depends largely on whether you are willing to respond to the community's critiques by lowering the false positive target now. (I still believe that my suggestion of 0.1% above is prudent.) The ball is in your court. Peter Karlsen (talk) 01:04, 29 November 2010 (UTC)[reply]

I have not lowered it below 0.25% because there has been no consensus. The BAG does have the final say, but if there was a community consensus, I would immediately adjust it. Looking at the bot talk page, there have been a number of instances of people happy with the bot's current performance. Additionally, at least one user has explicitly stated that they are happy with the aggressiveness. Without consensus, the only sane option is to delegate to the BAG for arbitration.

Before you say that I do not listen to community input, you should take note of the fact that the original FP rate was set at 0.5%. I reduced it to 0.25% very early on, because at that point, there was consensus that 0.25% was preferable to 0.5%. I also evaluated your 0.1% suggestion, and determined that it would cause a significant drop in the bot's catch rate.

0.1% may be an acceptable value with decent performance, but the trial dataset is not currently large enough to accurately calculate the threshold. I will be able to accurately evaluate its effectiveness and calculate the threshold when the dataset from the review interface is approximately doubled in size.

In lieu of a larger trial dataset, I can at present evaluate a given threshold, although as the bot changes, a set threshold can vary significantly. Reviewing the reported false positives (with a grain of salt - some of them aren't really false positives) may allow you to suggest a threshold. I would be open to running the bot for a day or two with a given set threshold within reason, to see if its performance in that mode is acceptable. Crispy1989 (talk) 01:26, 29 November 2010 (UTC)[reply]

If it's unclear whether any given reduced false positives target would retain sufficient performance to significantly decrease the rate of false positives per edits reverted (which is what the community is actually concerned about), then why did you claim that because "the false positive rate is adjustable and can be easily changed... there's no reason it shouldn't be approved", as though this would solve all of the bot's problems? Given the uncertainty you've described, I find it reasonable to evaluate the performance of the bot under its present configuration, without assuming that there necessarily is a better one.

BRFAs require consensus for approval. In light of the strong concerns many editors have expressed about the bot's excessive incorrect reversions, I don't believe that such a consensus exists. If the probability of an improvement in the bot's configuration is significant, then this request could be left open until it occurs, or is clearly not possible. Peter Karlsen (talk) 02:30, 29 November 2010 (UTC)[reply]

I believe BRFAs require consensus among the BAG. The reason the BAG was created is that non-members often don't have the knowledge or perspective to make informed decisions on automated processes. As neither you nor I are members, it's not up to either of us to decide, and we should leave it to them.

About the ease of changing the false positive rate, it is exactly as I have described. The automatic threshold calculation is a helpful feature on top of the core. For excessively low false positive rates, it requires a very large trial dataset to accurately calculate. At 0.1% false positives, our current trial dataset would only yield a single false positive. I could run it with this, but I'd rather not, because there are some people who would interpret this as an inaccurate claim. Instead, as I have already explained (and you, once again, ignored), the threshold can be manually adjusted based on observed false positives.

As I already stated, I'd much rather spend my time actually working on improvements, as it's a continual process, instead of repeating myself and arguing. Whether or not the bot is approved in its current state is up to the BAG as soon as the trial ends. Crispy1989 (talk) 03:04, 29 November 2010 (UTC)[reply]

I have increased the threshold. With the new threshold, our trial dataset (containing 963 good edits) has zero false positives. So the false positive rate should now be approximately 0.1%. The catch rate has decreased a fair amount (it's hard to tell exactly how much, again due to dataset size), but it should still be at least twice as effective as the old Cluebot. Crispy1989 (talk) 03:34, 29 November 2010 (UTC)[reply]
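The relationship between a target false positive rate, the score threshold, and the size of the trial dataset can be illustrated with a small sketch. This is not the bot's actual threshold calculation: the function names are hypothetical, the scores are synthetic placeholders, and the rule "revert when the score is strictly above the threshold" is an assumption. Only the dataset size of 963 good edits comes from the discussion.

```python
import math


def threshold_for_fp_rate(good_edit_scores, target_fp_rate):
    """Pick a threshold so that at most target_fp_rate of the known-good
    trial edits score strictly above it."""
    scores = sorted(good_edit_scores, reverse=True)
    allowed = math.floor(len(scores) * target_fp_rate)  # whole false positives permitted
    if allowed >= len(scores):
        return float("-inf")
    return scores[allowed]


def observed_fp_rate(good_edit_scores, threshold):
    """Share of known-good trial edits that would be reverted at this threshold."""
    hits = sum(1 for s in good_edit_scores if s > threshold)
    return hits / len(good_edit_scores)


# Synthetic scores for 963 good edits (placeholder values, not real data).
scores = [i / 963 for i in range(963)]

t_025 = threshold_for_fp_rate(scores, 0.0025)  # permits 2 of 963
t_010 = threshold_for_fp_rate(scores, 0.001)   # permits 0 of 963

print(observed_fp_rate(scores, t_025) * 963)   # 2.0
print(observed_fp_rate(scores, t_010))         # 0.0
print(f"{1 / 963:.3%}")                        # 0.104%
```

With 963 good edits, one false positive already corresponds to about 0.104%, so a 0.1% target can only be checked as "zero false positives in the trial set". That is why a larger trial dataset is needed before such a low rate can be calculated accurately.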

I certainly hope that the change in the threshold for reversion ultimately produces better results than this. If the per edit examined false positive rate is halved, but the amount of vandalism caught is decreased by a similar factor, then the per-revert false positive rate, which is intuitively used to measure the accuracy of anti-vandalism work, will remain unchanged. Comments at this BRFA and at User talk:ClueBot Commons suggest that this continued high level of inaccuracy would still be unsatisfactory. The latest complaint about the bot, User_talk:ClueBot_Commons#Problem_with_the_bot, came in after the threshold for reversion was increased. The system for reporting false positives, and lack of individualized responses, was also critiqued. The claim that (with three bot operators) everyone responsible for the bot is simply too busy to articulate responses to the false positive reports [49] is troubling. The accepted Wikipedia standard for responses to claimed errors in automated tools designed to stop malicious edits, as shown at Wikipedia:Edit filter/False positives/Reports, is that each and every false positive report is examined, and receives a response to determine whether it is genuine. When an edit filter produces an actual false positive, it is usually possible to modify it to prevent a recurrence. However, you state that this bot is a "black box"[50] such that the cause of errors often cannot be ascertained and immediately corrected. Instead, false positives are generically attributed to "the dataset being too small". The only solution offered is to increase the size of the bot's dataset. But if the bot is not yet adequately configured to avoid an unacceptably high (per-revert) level of false positives, then why is it making live edits at all? 
Continuing the dry run, and examining which edits it would have reverted, would provide adequate data on new false positives, while relieving the bot operators of the burden of responding on-wiki to false positive reports.

I apologize if my critiques of the bot's operations, and those of my colleagues, seem inadequately appreciative of your software development efforts. The theory of the bot's operation is original and intellectually intriguing; the present code, configuration, and dataset can already identify edits that are probably vandalism, and could be used to improve human-assisted anti-vandalism programs. With sufficient refinement, the bot may one day be an acceptable fully automated anti-vandalism tool. No deprecation of your contributions is intended in the candid observation that the bot is not yet ready for mainspace live-editing approval. Peter Karlsen (talk) 04:15, 1 December 2010 (UTC)[reply]

The comment you link to is mostly out of frustration. Despite your apparent surety that the bot is inadequate, you are one of only two people that I can see to strongly complain about the false positive rate, where many people have been happy and satisfied with it. I find myself making these FP rate changes, just because it's very time consuming to carry on these debates about the same topic, where the pertinent information has already been stated in various places.

The complaint was about the FP rate, not percentage of reverts that are false positives. The FP rate has been more than halved. In fact, the percentage of reverts that are false positives has also been decreased, due to the effect postprocessing has on the results. The FP rate is determined by the core, but the final decision to revert may be overridden by some set metrics in the Wikipedia interface. Because these metrics apply less-often to higher-scored edits, increasing the threshold lowers the percentage of would-be reverts stopped by the post-processing filters. Therefore, overall, even the percentage of reverts that are false positives has been decreased.

I'd also like to point out that, due to these post-processing filters, the given FP rates, whether 0.25% or 0.1%, are maximums. Observed FP rate is likely to be significantly below this, as many FPs are caught and eliminated by the post-processing filters.

To support this, take a look at one of the recent comments on the bot talk page, made after the threshold increase. The user states that they had to review over 100 diffs/reverts from Cluebot-NG to find a single false positive. While this isn't a wide sample set, it should give you some idea of the "accuracy" after the change.

You may wonder why I still disagree with such a low FP rate, even if I know that it increases overall "accuracy" - the reason is that, for an antivandal bot to really make a difference, it has to revert a significant portion of vandalism. Bots like the old ClueBot reverted an estimated 5% of vandalism. This is enough to get it noticed by human editors when they're beaten to a revert, but doesn't significantly decrease the time necessary for human patrollers to spend. Even with the lowered 0.1% false positive rate, ClueBot-NG is more than five times as effective as the old ClueBot, but the entire purpose of an antivandal bot should be to make a real difference.

You mention a recent complaint - but it is unrelated to the FP rate, or number of false positives at all. Rather, it is related to the handling of false positives. The discussion there clearly spells out our reasoning, and is mostly supported by at least one independent and impartial user.

It really is very time-consuming to respond to every false positive manually, and even with three bot operators, there's not enough spare time to go around. One of the bot operators has a wife, two jobs, and school to worry about, and still finds time to work on bot development. Another of the bot operators spends most of his time working on dataset management, which as we've repeatedly stated, is what can most improve the bot - his remaining time is spent on real-life commitments. The third wasn't really involved in core development, and doesn't know enough about it to respond to false positives with anything more than a form-letter response.

What's more is that individual responses are not necessary. As you point out, each one should be reviewed to make sure it's actually a false positive. And each one is. As stated in multiple places, each reported false positive is submitted to the review interface, where we can draw on community effort to classify them.

While the system for reporting false positives was criticized, no suggestions were offered on how to improve it (by the primary user doing the criticizing). Another user spent the time to find a false positive and report it, and not only determined that most of the criticism was invalid, but also did indeed give some suggestions, which are being discussed and will likely be implemented very soon.

The neural network is indeed a form of a "black box", but this is not the only reason that simply examining false positives will not directly and immediately help accuracy. As explained in multiple places, a certain number of false positives are absolutely necessary for the bot's operation. The choice to be made is simply how many false positives are acceptable, and the bot operates as well as it can, given that number. Ideally, with time, the FP rate can be decreased without hindering the bot's performance much or at all (as the dataset is improved), but false positives as a whole, and individual occurrences, can never be entirely eliminated.

Extending this, it can be seen that your following comment about the bot not yet being ready is invalid. The dataset will always be able to be improved, more and more. It's a continual process. There will never be a point at which we can say "Stop. It's as good as it can be. Nothing more can be done." Just because there is still room for improvement doesn't mean the bot is not ready to make live edits. If this were so, the bot would never be ready. A point has to be set at which the FP rate is considered acceptable - you've suggested the FP rate of 0.1%, and that has been acted upon.

Your suggestion about continuing a dry run is noted, but would never work in practice. Keep in mind that around or less than 1% (estimated as per the above-mentioned user that went through a series of diffs) of CBNG edits are now false positives. Reviewing data from a dry run would take 100x more time than responding to individual false positives - if there isn't time for the latter, there would never be time for the former. Another important consideration is that we cannot improve the dataset by ourselves, and nobody wants to spend time on a review interface for a bot that isn't active. The live edits, even at the current state, are not only extremely worthwhile for Wikipedia, but also bring in a steady stream of contributors to continue to help improving the dataset. Stopping the bot now would all but eliminate these contributions, and this would probably mean that it would never actually be approved.

My comments about "lack of appreciation" on the user talk page discussion do not apply to you. While I disagree with you on most of the points you bring up (for reasons I believe are correct and stated above), it's also clear that you are trying to help, with nothing but good faith. I mentioned "lack of appreciation" on the talk page because the user in that context was engaging in nonconstructive flaming and even making up quotes from the dev team to try to make us look bad. Nonconstructive complaining, and, worse, flaming, are not welcome at all. All other forms of comments and suggestions, even if we disagree with them, are welcome, and at the very least open to consideration.

The upshot of this all is, the bot is already in a much-more-than-adequate state to be running live. There is indeed still room for improvement, but there will always be room for improvement. The bot is already much improved from all predecessors, and only seems to be having more issues with false positives because of its much higher overall edit rate - so much so that things such as minor bugs in the Wikipedia interface, that have remained unnoticed and unreported for the three years the original ClueBot has been running, are now being noticed and fixed very rapidly. Even in trial, CBNG is making a significant difference, more than noticeable by vandal fighters and other users alike, as clearly evidenced by numerous comments on the user page and talk page. There are no significant outstanding problems - particularly when significance of problems is compared with previous AV bots. Crispy1989 (talk) 09:48, 1 December 2010 (UTC)

I realize that information and logic about behavior of the bot, particularly related to false positives, has been spread out over multiple discussions in different places. To make it easier to follow along, I have consolidated the information in a few places, and tried to explain it simply and concisely: FAQ on CBNG False Positives, Detailed Information on CBNG False Positives, CBNG Algorithms. Crispy1989 (talk) 10:53, 1 December 2010 (UTC)

Trial complete. We'll post a summary shortly. -- Cobi^(t|c|b) 04:33, 2 December 2010 (UTC)

Trial 2 Summary

Major Events During Trial 2

False positive rate was lowered from the previous 0.25% (as it was for Trial 1) to 0.1%, at user request, more than halving the number of false positives. The change was made about half-way through Trial 2.
Data from the dataset review interface has grown large enough to use as a trial set, so the threshold and statistics can be calculated more accurately from the false positive rate.
False positive reporting switched from freeform reporting to the old ClueBot false positive reporting interface, so we can more easily use the data from reports to improve the dataset.

Controversies

Several controversies not (conspicuously) present during Trial 1 were raised during Trial 2.

False Positive Rate - A couple of users believed that the 0.25% max. false positive rate (at most 1 in 400 non-vandalism edits incorrectly reverted) was too high, and a fair amount of debate followed. Eventually, at one user's suggestion, the false positive rate was lowered to 0.1% max.
Ease of False Positive Reporting - A couple of users believed that the false positive report interface was too difficult to use practically. Then, one user actually took the time to find a false positive (stating he/she had to go through over 100 bot edits to find one), and tried to report it, determining that the interface was quite easy and painless to use. Users have also suggested some improvements to the interface, which we are now implementing. This discussion took place on the ClueBot NG talk page.
Commenting on Every False Positive - A couple of users had a problem with the fact that the developers do not personally comment on every false positive. The developers do not have nearly enough time to write a personalized response to each one, but every false positive is submitted to the review interface for verification and dataset use. A confirmation page is being added to the report interface to clarify how the reports are used. A user also suggested periodic overviews of false positive statistics - this may be possible, but difficult, and we are looking into it.

Clarifications

These are clarifications on some things that are available elsewhere, but are restated here because they are commonly misunderstood.

Meaning of False Positive Rate - The false positive rate is calculated as Number of Incorrect Classifications / Number of Non-vandalism Edits.
False Positive Rate Calculation - The false positive rate is not calculated based on reported false positives (which may be fewer than the actual number). The false positive rate is calculated from a random sampling of human-verified edits, from the review interface, so it is accurate. In fact, the actual false positive rate will be less than stated, due to post-processing filters.
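
As an illustration of the two clarifications above, here is a minimal Python sketch of how a false positive rate could be measured on a human-verified sample, and how a score threshold could be picked to stay under a maximum rate. It is not the bot's actual code (the core is C++): the function names, the convention that a higher score means "more likely vandalism", and the sample data are all assumptions made for the example.

```python
def false_positive_rate(scores, is_vandalism, threshold):
    """Incorrect classifications / non-vandalism edits, at a given threshold.

    An edit counts as classified vandalism when its score >= threshold
    (assumed convention for this sketch).
    """
    constructive = [s for s, v in zip(scores, is_vandalism) if not v]
    if not constructive:
        raise ValueError("sample contains no non-vandalism edits")
    flagged = sum(1 for s in constructive if s >= threshold)
    return flagged / len(constructive)


def threshold_for_max_fp_rate(scores, is_vandalism, max_fp_rate):
    """Lowest observed score that keeps the FP rate at or below the cap.

    The FP rate can only fall as the threshold rises, so the first
    candidate (in ascending order) that meets the cap is the lowest one.
    Returns None when no observed score meets the cap.
    """
    for candidate in sorted(set(scores)):
        if false_positive_rate(scores, is_vandalism, candidate) <= max_fp_rate:
            return candidate
    return None


# Hypothetical human-verified sample: (score, verified as vandalism?)
sample_scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.8]
sample_labels = [False, False, False, True, True, False]

print(false_positive_rate(sample_scores, sample_labels, 0.5))
print(threshold_for_max_fp_rate(sample_scores, sample_labels, 0.25))
```

Note that the denominator is the number of non-vandalism edits in the sample, not the number of bot reverts, which is why a 0.1% rate does not mean that 0.1% of reverts are mistaken.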

Important Documentation

Those not already familiar with how the bot works should read these links. They are critical to understanding its behavior. These were written during Trial 2 in response to numerous repeated questions for the same information.

The entire user page, particularly stats, false positive info, threshold, and post-processing.
The FAQ.

Support for the Bot

While the bot has generated some controversy, it has also received a large amount of support and praise - this support isn't on the BRFA, but may be useful. Only "pure support" messages are included here - there are others that are part of controversial discussions.

It's also worth noting that this praise is coming from people who are familiar with and used to the old ClueBot, so they are noticing a real difference.

Summary

The bot is performing well within its expected parameters. It was approved for Trial 1 for operation at 0.25% false positives, and it was always well within that limit. Halfway through Trial 2, it was changed to 0.1% false positives at user request, or 1 in 1000 incorrectly reverted edits (also note that this is a maximum).

Controversy has sprung up, often due to misunderstandings about how various statistics are calculated and used. These have been clarified, and an FAQ page written to explain these issues. The remaining controversy has been addressed (false positive rate has been more than halved, report interface improved, etc).

Cluebot NG's performance is almost an order of magnitude better than all previous anti-vandal bots. Using novel algorithms and approaches, it truly is the next generation to practical automated vandal-fighting on Wikipedia. And over time, as we continue to work on the bot, its accuracy will improve even more.

Request

The developers request that the bot be approved to operate at a false positive rate of the operators' discretion. We would like the ability to adjust the false positive rate for a few reasons:

We select an appropriate rate based on generated graphs of statistical performance, looking for a dropoff point, which can change as the bot changes.
Actual FP rate is less than stated FP rate due to post-processing filters. As these post-processing filters are modified, the core FP rate may need to be modified to maintain accuracy.
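
To make the first reason more concrete, the sketch below tabulates the vandalism catch rate that a scored, human-verified sample would allow at several maximum false positive rates. Reading down such a table (or the equivalent graph) is one way to look for the point where tightening the false positive rate starts costing a lot of catch rate. This is only an assumed illustration in Python: the function name, the "higher score means more likely vandalism" convention, and the data are hypothetical and are not taken from the bot's source.

```python
def catch_rate_at_fp_cap(scores, is_vandalism, max_fp_rate):
    """Best catch rate achievable while the FP rate stays at or below the cap."""
    good = [s for s, v in zip(scores, is_vandalism) if not v]
    bad = [s for s, v in zip(scores, is_vandalism) if v]
    if not good or not bad:
        raise ValueError("sample needs both vandalism and non-vandalism edits")
    # Scan thresholds from low to high; the first one that meets the cap
    # is the lowest usable threshold and so catches the most vandalism.
    for threshold in sorted(set(scores)):
        fp_rate = sum(s >= threshold for s in good) / len(good)
        if fp_rate <= max_fp_rate:
            return sum(s >= threshold for s in bad) / len(bad)
    return 0.0  # no threshold meets the cap, so nothing would be reverted


# Hypothetical sample: four constructive edits and three vandalism edits.
demo_scores = [0.05, 0.1, 0.2, 0.6, 0.5, 0.7, 0.9]
demo_labels = [False, False, False, False, True, True, True]

for cap in (0.5, 0.25, 0.0):
    rate = catch_rate_at_fp_cap(demo_scores, demo_labels, cap)
    print(f"max FP rate {cap:.2f} -> catch rate {rate:.2f}")
```

In this toy data the catch rate is unchanged between the 0.5 and 0.25 caps and only drops at the strictest cap; a real sample would be far larger, and the curve would be read the same way.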

We will never set the FP rate to anything above 0.25% (or 1 in 400), and for now, it will remain at 0.1% (1 in 1000), as this is where community support lies. We will also always listen to the community and try to determine consensus if disagreement about the FP rate ever again arises.

After approval, we will restart the bot, so it can continue doing its job of keeping Wikipedia clean, and reducing vandal-fighter workload. Crispy1989 (talk) 04:36, 2 December 2010 (UTC)

False Positive Reporting

Less than 0.1% of constructive or well-intentioned edits are misclassified as vandalism by Cluebot-NG. Please see Information About False Positives for more information about why this happens, and why it is necessary. Reports posted here are reviewed by the bot developers in case anything can be done to the bot to improve its accuracy.

Old False Positives

Diff	Comment	Reason/explanation/fix/discussion etc.
	The bot undid my interwiki link for the Template page. 78.72.250.55 (talk) 13:41, 18 November 2010 (UTC)	A few of the other false positives above have also been related to interwiki links - it's a dataset issue. As the dataset grows, this should be fixed. Crispy1989 (talk) 16:08, 18 November 2010 (UTC)
Edit	The bot undid an IP user's good faith edit, [51], a perfectly fine insertion of a word for clarification. I have reverted back and intend to post on the IP talk page that it was a mistake. I realise it might be harder to distinguish between such edits and some forms of vandalism, but I imagine minor edits of this type are common and we don't want to deter casual editors. ChiZeroOne (talk) 16:48, 18 November 2010 (UTC)	This is one of those weird ANN things that will definitely be fixed by a larger dataset. The word "accidental" does not appear in the Bayesian database at all, so the statistical properties of the message must just be falling into a strange gap in the training set. Crispy1989 (talk) 16:50, 18 November 2010 (UTC)
[52]	Bot undid an edit on [53] from SML/Texting to SMS/Texting. SML documentation provides no mention of IDC, and the person who added SML/Texting was someone who has previously vandalized. As SMS is the only topic relating to the definition that follows the SML/Texting entry ("Idc" or "I.D.C." is an acronym for "I don't care.") I made the change to SMS/Texting. Phoenixkin (talk) 18:27, 18 November 2010 (UTC)	The dataset had not previously seen "SMS" or "SML" in all caps, and as such, treated it as shouting. Crispy1989 (talk) 01:18, 19 November 2010 (UTC)
Edit	Bot reverted edit that wasn't vandalism, but fancruft. [54] I reverted the bot edit to let the editors decide. --Confession0791 ^talk 21:28, 18 November 2010 (UTC)
	I just added a note that QuickTime is required to hear the Start-up and Chimes Of Death in MacTracker. --216.96.2.184 (talk) 22:53, 18 November 2010 (UTC)
	Bot reverted an edit where I added the relevant quote to a reference. Quote was buried in lengthy source text therefore quote in reference was necessary & useful
	Added a new link to the recently-released English version of the product website, but was reverted.
Edit	I don't believe that this should have been reverted. The IP was simply trying to say that the actress is now in another soap and it got reverted as possible vandalism. --5 albert square (talk) 09:38, 19 November 2010 (UTC)
	I seem to have gotten credit for several vandal edits from this IP address on subjects which I have never written about-footballers and the Fritzl family and so on. The edit on "The Sparrow" was legit.--83.70.226.255 (talk) 16:09, 19 November 2010 (UTC) Also will someone please tell Jimmy wales the multimillionaire megalomaniac philanthropist to take his smirking begging face off every page!
	The edit by anonymous user 24.215.26.57 was perfectly legitimate. It appears that the bot was triggered by the facts that the edit was made by an unregistered user, combined with the added material being in full caps (a stylistic gaffe, to be sure, but one that can easily be corrected).—Jerome Kohl (talk) 16:37, 19 November 2010 (UTC)
	I was pointing out that the legend of the swan song was a misconception, as mute swans do not "sing right before their deaths". The bot reverted the one-word edit. 192.12.88.68 (talk) 03:01, 20 November 2010 (UTC)
Edit	I wonder why this one triggered the bot. It was a pretty clear similar addition to what was already on the page. ? Melodia Chaconne ? (talk) 23:07, 15 November 2010 (UTC)
Edit	Or this one, which seems like a legitimate addition to the Languages list for the article. (I don't read this language, but I was able to locate the article.) Cynwolfe (talk) 00:16, 16 November 2010 (UTC)
List of Representatives from North Carolina	My edit was misconstrued as nonconstructive. I was trying to put up a cleanup tag on the page List of Representatives from North Carolina, and not having enough experience doing that, I did it wrong. 71.91.99.47 (talk) 01:52, 16 November 2010 (UTC)	This is mostly an issue with dataset completeness (not many/any instances of incorrectly added templates), but it probably would not have been classified as vandalism if that IP had not vandalized in the past. Crispy1989 (talk) 02:08, 16 November 2010 (UTC)
Edit	This [55] reads like a content dispute where the bot should not be taking sides. Philip Trueman (talk) 08:32, 16 November 2010 (UTC)	I don't see any reason this edit would be classified as vandalism. It must be a dataset completeness issue. Crispy1989 (talk) 12:46, 16 November 2010 (UTC) There's a change of "facebook" to "myspace", and that might have been a contributing factor, since I presume vandals sometimes spam social site links. — HELLKNOWZ ?TALK 12:50, 16 November 2010 (UTC)
[56]	[57]. Trying to improve the page. --130.233.79.47 (talk) 08:34, 16 November 2010 (UTC)	Just asking, but has the problem of "i.e." being treated as ultra-short sentences been fixed? Philip Trueman (talk) 10:50, 16 November 2010 (UTC) I thought I fixed it, but seemingly not. There's no other reason this edit should have been reverted. To fix it before, I just removed all instances of "i.e." before processing. I just changed it to instead replace it with "Ie". Crispy1989 (talk) 12:37, 16 November 2010 (UTC) I checked for that. Actually it was myspace replaced with facebook, so the bayesian score would be influenced by the addition of "facebook", but "facebook" isn't even in the bayesian database (close to 50/50 vandalism/constructive occurrences). There are a few words that are in the database, like "rumors", "very", and "happy", but none of these have particularly high scores - definitely shouldn't be high enough to cause a false positive. There must just be some statistical property of the edit that fell into a gap in the ANN's training set. Crispy1989 (talk) 12:54, 16 November 2010 (UTC)
Edit	Nothing to do with me really but this edit doesn't look like vandalism to me. Rambo's Revenge (talk) 20:04, 16 November 2010 (UTC)	The bot was treating what was inside there as a quote, and text inside quotes isn't processed (other than counted). Combined with the multiple other recent warnings, it was enough to trigger it. Crispy1989 (talk) 21:30, 16 November 2010 (UTC)
	That what they do is vandalism. The Vlachs are not Romanians, they are ethnic group recognized by the Constitution! In census their language is recognized as Vlach! So, we must to respect that. Thanks!	The edit you're referring to looks like vandalism to me. It's not obvious vandalism, but removing all mention of "Romania" from an article where it is pertinent usually constitutes vandalism. The bot is capable of catching non-obvious vandalism such as this. Crispy1989 (talk) 16:06, 18 November 2010 (UTC)[reply]
Shawn Johnson	There was no vandalism. This is an automated process, so not sure why my edit was tagged and reverted. I wanted to keep working on the article!
	Re Tropaeum Traiani cluebot NG revert 10.18am 20 November 2010 I simply added the Latin text of the memorial underneath the section heading "Legionares Memorial" The text is from note2 source, this is not "vandalism".
Edit	I was adding a source that is already in the article! Followorders (talk) 21:43, 20 November 2010 (UTC)	False positive due to Bayesian keyword. Will be mitigated by complete parser and fixed by improved dataset.
	The stupid bot deleted my entry! MAKE IT STOP, MOM!!! 173.59.219.49 (talk) 22:31, 20 November 2010 (UTC)	Not a false positive. The edit is vandalism.
Edit	Correcting name of a band from "Wakey Wakey" to "Wakey!Wakey!" but the system thinks it knows best...	Sole addition of two (non-consecutive) exclamation points. Might be fixed by improved dataset, but very rare edge case anyway.
Edit	Looks like the user was adding the "Smoky Bacon" flavor of Pringles chips and was nailed for it	Since the user did not add a space or comma, it was treated as a very long nonsense word. A full parser may help this, as could a very extensive dataset including examples of this kind of well-intentioned error.
http://en.wikipedia.org/w/index.php?title=Harry_Potter_and_the_Deathly_Hallows_(film)&diff=prev&oldid=398006837	Dude, WTF? You restored an every-word-has-a-link vandalism.	Definitely fixed by larger dataset. With the current dataset, all mass removals of links have been vandalism. Clearly this is not always the case.
Edit	Previous version by 70.26.181.136 had corrected errors and improved the section with valid text. Your edit removed the addition and restored the typo corrected by the editor.

False Positives

This page is currently inactive and is retained for historical reference.
Either the page is no longer relevant or consensus on its purpose has become unclear. To revive discussion, seek broader input via a forum such as the village pump.

Approval

Approved to operate at operators' discretion. —Ree dy 02:24, 3 December 2010 (UTC)

Thanks. The false positive rate will remain at less than 0.1% for the foreseeable future, unless improvements are made to the bot which cause a slightly higher dropoff point than present, or the bot's accuracy improves to the point where it can be lowered without significantly affecting accuracy. Crispy1989 (talk) 02:37, 3 December 2010 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.