Discuss this story

If you really want to focus on this particular study, rather than gathering raw data, somebody should start asking why WMF got "between 30 and 60 percent useful" and my preliminary results are about 10% useful. That's a huge red flag. Is it because only one person cared enough to look at my data and post an estimate? Was 200 a big enough sample? Is it because your study used 3 people? If you personally looked at the data, would you come back and say that your estimate is 30%, not 10%? Is it because in both cases the person doing the evaluation was self-selected? If I saw results like that, I would try to rip my own methodology to shreds, and then I would try to rip the methodology of the other study to shreds. Somebody is doing something wrong. My attitude toward science: http://xkcd.com/242/ --Guy Macon (talk) 03:00, 7 February 2013 (UTC)
Frankly, I can't answer those questions; I'm not the researcher here ;p. I'll poke Aaron and see if he can comment. Okeyes (WMF) (talk) 11:29, 7 February 2013 (UTC)
poke received First of all, I want to direct you to the official report I wrote, which includes the strategy for drawing both a random and a stratified sample along with the details of my methodology: meta:Research:Article_feedback/Final_quality_assessment. I'm sad to find that this report was not clearly referenced; you're not the first to have missed it. We had 18 Wikipedians each evaluate at least 50 feedback items individually (though some evaluated more than 200). All feedback submissions were evaluated by two different people. The 30-60% figure is a non-statistically founded, conservative minimization of these two evaluations per item. In the study, we found that 66% of feedback was marked *useful* by at least one evaluator ("best" in the report) and 39% of feedback was marked useful by both evaluators ("worst" in the report). Here's the breakdown of the four categories we asked the evaluators to apply:
  • Useful - This comment is useful and suggests something to be done to the article.
  • Unusable - This comment does not suggest something useful to be done to the article, but it is not inappropriate enough to be hidden.
  • Inappropriate - This comment should be hidden: examples would be obscenities or vandalism.
  • Oversight - Oversight should be requested. The comment contains one of the following: phone numbers, email addresses, pornographic links, or defamatory/libelous comments about a person.
Note that these exact descriptions appear as tooltips in multiple places in the feedback evaluation tool. If you'd like to personally replicate the study, I'd be happy to pull another random sample for you and load it up in the evaluation tool. --EpochFail(talkwork) 15:42, 7 February 2013 (UTC)
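The "best"/"worst" counting described above (marked useful by at least one evaluator vs. by both) can be sketched in a few lines. The `useful_rates` function and the sample data here are hypothetical illustrations, not the study's actual code or figures:

```python
def useful_rates(evaluations):
    """evaluations: list of (rating_a, rating_b) pairs, one pair per
    feedback item, where each rating is one of the four categories.

    Returns (best, worst): the fraction marked "Useful" by at least one
    evaluator, and the fraction marked "Useful" by both.
    """
    n = len(evaluations)
    best = sum(1 for a, b in evaluations if a == "Useful" or b == "Useful")
    worst = sum(1 for a, b in evaluations if a == "Useful" and b == "Useful")
    return best / n, worst / n

# Illustrative sample only -- not data from the actual study.
sample = [
    ("Useful", "Useful"),            # counts toward best AND worst
    ("Useful", "Unusable"),          # counts toward best only
    ("Inappropriate", "Unusable"),   # counts toward neither
    ("Unusable", "Useful"),          # counts toward best only
]
best, worst = useful_rates(sample)
# best = 0.75 (3 of 4 items marked Useful by at least one evaluator)
# worst = 0.25 (1 of 4 items marked Useful by both evaluators)
```

With two evaluators per item, the true useful rate lies somewhere between the two bounds, which is what the report's 30-60% range expresses.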
Before I respond, let me reiterate that I think everyone at the WMF is doing a good job and has the right goals. This is a discussion about possible improvements, starting with some future study. Those who are looking for a club to beat WMF with should look elsewhere.
meta:Research:Article_feedback/Final_quality_assessment is a very useful overview of the methodology used, but in my opinion an additional detailed methodology would be a Good Thing. (I am about to write some questions, but please don't post the answers. They are examples of what should be in a detailed methodology -- I cannot explain what I am talking about without giving examples of questions that the overview does not answer.) For example, the overview says "We assigned each sampled feedback submissions to at least two volunteer Wikipedians." A detailed methodology would have said something like this:
"Between 3AM and 4AM on December 24th, we posted a request for volunteers (in French) on Talk:Mojave phone booth and on the main page of xh.wikipedia.org. 43 people volunteered, and we rejected 20 of them for being confirmed sockpuppets of User:Messenger2010 (See Wikipedia:Long-term abuse/Messenger2010) and rejected 11 of them because Guy drank too much and decided he doesn't like editors with "e" in their username. That left us with Jimbo and a six-year-old girl (username redacted for privacy reasons). We then..."
Unlike "We assigned each sampled feedback submissions to at least two volunteer Wikipedians", the above details exactly how those volunteers were chosen. Again, I don't care how they were chosen. I just want future studies to contain a detailed methodology page that answers questions like these, or questions about the RNG used. To pick another example, the post above this one says "We had 18 Wikipedians evaluate at least 50 feedback items individually (though some evaluated more than 200)." That detail is not found in the methodology overview. --Guy Macon (talk) 16:51, 7 February 2013 (UTC)
The specific 'how they were chosen' list I can provide, actually. The purpose of the study was to compare the rating of feedback that did get rated to feedback that was missed, suspecting that people overwhelmingly checked feedback for high-profile articles. To get some consistency between the two sets of numbers, I pulled from the database a list of all users who had, in the 30 days before we started the recruitment process, monitored more than 10 pieces of feedback in some fashion. The users in question were then sent a talkpage invitation asking 'would you like to participate in this?'. I appreciate that's more a specific example to highlight a general point than anything else - and I'm going to bear your general point in mind when writing up something I've been working on recently, actually - but I thought I'd address it :). Okeyes (WMF) (talk) 18:50, 7 February 2013 (UTC)
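The recruitment filter described above (users who monitored more than 10 pieces of feedback in the 30 days before recruitment started) can be sketched roughly as follows. The `eligible_reviewers` function, the threshold parameter name, and all the data here are hypothetical illustrations, not the actual WMF database query:

```python
from collections import Counter
from datetime import datetime, timedelta

def eligible_reviewers(monitor_events, recruitment_start, threshold=10):
    """monitor_events: iterable of (username, timestamp) pairs, one per
    feedback-monitoring action.

    Returns the set of users who monitored more than `threshold` feedback
    items in the 30 days before recruitment_start.
    """
    window_start = recruitment_start - timedelta(days=30)
    counts = Counter(
        user for user, ts in monitor_events
        if window_start <= ts < recruitment_start
    )
    return {user for user, n in counts.items() if n > threshold}

# Illustrative data: "Alice" monitored 12 items inside the window,
# "Bob" only 3, and "Carol" was active but outside the 30-day window.
start = datetime(2012, 12, 1)
events = [("Alice", start - timedelta(days=d)) for d in range(1, 13)]
events += [("Bob", start - timedelta(days=d)) for d in range(1, 4)]
events += [("Carol", start - timedelta(days=40))]
invitees = eligible_reviewers(events, start)
# invitees == {"Alice"}
```

The point of counting only recent activity is the one Okeyes gives: it restricts invitations to people already in the habit of monitoring feedback, at the cost of a self-selected (rather than random) evaluator pool.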