The Signpost

Recent research

Vandalizing Wikipedia as rational behavior


A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.


Vandalizing Wikipedia as rational behavior

A paper[1] presented last year at the International Conference on Social Media and Society studies possible rational motivations for Wikipedia vandalism:

"Competing theories in criminology seek to explain the motivations for and causes of crime, ascribing criminal behavior to such factors as lack of impulse control, lack of morals, or to societal failure. Alternatively, rational choice theory proposes that behaviors are the product of rational choices. In order to apply rational choice theory to vandalism, this project seeks to understand vandal decision-making in terms of preferences and constraint"

The author observes that "vandalism-related research has tended to focus on the detection and removal of vandalism, with relatively little attention paid to understanding vandals themselves" (which can be readily confirmed by searching the archives of this newsletter for "vandalism"; one exception being a 2018 study that asked students to guess why their classmates vandalize Wikipedia: "Only 4% of students vandalize Wikipedia – motivated by boredom, amusement or ideology (according to their peers)"). She notes that

"Although the harm is clear, the benefit to the vandal is less clear. In many cases, the thing being damaged may itself be something the vandal uses or enjoys. Vandalism holds communicative value: perhaps to the vandal themselves, to some audience at whom the vandalism is aimed, and to the general public."

The theoretical framework used to study such rational motivations is "rational choice theory (RCT) as applied in value expectancy theory (VET)". It conceptualizes the expected utility of a choice (such as engaging in an act of vandalism) as the sum, over all possible outcomes, of the product of "the probability of some outcome O [...] and the utility valuation U of that same outcome".
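
To make the framework concrete (the notation here is a generic expected-utility formulation, not symbols quoted from the paper): for a choice C with possible outcomes O_1, ..., O_n, the expected utility is

EU(C) = \sum_{i=1}^{n} P(O_i) \, U(O_i)

that is, each outcome's utility weighted by its probability. Under this framing, a would-be vandal might, for example, weigh a low-probability but high-utility outcome ("the edit survives and is seen by many readers") against a near-certain but low-value one ("the edit is reverted within minutes"), and vandalize only if the weighted sum exceeds that of refraining.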

Based on a sample of 141 vandalism edits (from the English Wikipedia), the author proposes an ontology of Wikipedia vandalism, extending classifications used in previous vandalism detection studies (e.g. blanking, misinformation, "image attack", "link spam") with a few new ones: "Attack graffiti" (i.e. "attack an individual or group") and "Community-related Graffiti" (expressing "opposition to community, norms, or policies").

The quantitative part of this mixed methods paper "examine[s] vandalism from four groups: users of a privacy tool Tor Browser, those contributing without an account, those contributing with an account for the first time, and those contributing with an account but having some prior edit history". Tor Browser edits are generally blocked automatically on Wikipedia, and those in the dataset consist of edits that slipped through this mechanism. This raises the question of whether some or many of these edits required the editor to try several times to get around the block, which would set them apart from less dedicated vandals in the other groups.

The observation that contributing under an account requires more effort (i.e. creating that account, and logging into it) than contributing as an IP editor motivates the author's first hypothesis: "(H1) users who have created accounts will vandalize less frequently". She finds it confirmed by the examined edit data.

Secondly, the author hypothesizes that "the least identifiable individuals are more likely to produce vandalism that has high-risk repercussions" (H2) because value expectancy theory "suggests that identifiability acts as a constraint on deviant behavior." The author finds this hypothesis partially supported. Among other findings, "Tor-based users are substantially more likely than other groups to engage in large-scale vandalism and least likely to engage in the lowest risk type of vandalism, that which communicates friendly and sociable intent."

In motivating her third hypothesis, the author observes that "the groups under study differ by how they are treated by community policies. Newcomers are targeted for social interventions to welcome, train, and retain them. Wikipedia invites IP-based editors to create accounts as well as welcoming them. However, Tor-based editors generally experience rejection." The resulting hypothesis is "(H3) Members of excluded groups are more likely to strike against the community targeting them," operationalized as a higher rate of vandalism in the "community-related" category (e.g. directly attacking Wikipedia norms or policies).

The paper contains various other interesting observations that might make it worth reading for Wikipedia editors who spend time dealing with vandalism and related community policies. To pick just one example, the author highlights that vandalism can also have positive effects, referring to a 2014 paper.[2] That earlier study combined interviews with editors with a quantitative analysis of an English Wikipedia dataset that included edit counts by editor experience level, page watcher numbers, pageview numbers and other data. It found that "novice contributors’ participation has a direct negative effect on the quality of goods produced [i.e. newbie edits decreased article quality on average], but a positive indirect effect because it acts as a cue for expert contributors to improve the quality of those goods that consumers [i.e. Wikipedia readers] are most interested in." It also found "that the positive direct effect of article consumption [i.e. pageviews] on expert editing patterns is fully mediated by novice contributions. Results [...] support the theory that experts are unaware of demand [i.e. experienced editors do not usually check traffic levels of the articles they edit] but they are stimulated to respond to article consumption if consumers signal demand for that particular good through their contributions as novice producers."

Briefly

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Language biases in Wikipedia's "information landscapes"

From the abstract and conclusions:[3]

"We test the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted. Controlling the size factor, we investigate this hypothesis for a number of 25 subject areas. [...] at least in the context of the subject areas examined here [Wikipedia's] different language versions differ so much in their treatment of the same subject area that it is necessary to know which area in which language someone is consulting if one wants to know how much the part of the IL [information landscape] he or she is traversing is biased."

"Universal structure" of collective reactions to invididual actions found in Twitter, Wikipedia and scientific citations

From the abstract:[4]

"In a social system individual actions have the potential to trigger spontaneous collective reactions. [...] We measure the relationship between activity and response with the distribution of efficiency [...]. Generalizing previous results, we show that the efficiency distribution presents a universal structure in three systems of different nature: Twitter, Wikipedia and the scientific citations network."

"Novel Version of PageRank, CheiRank and 2DRank for Wikipedia in Multilingual Network Using Social Impact"

From the abstract:[5]

"... we propose a new model for the PageRank, CheiRank and 2DRank algorithm based on the use of clickstream and pageviews data in the google matrix construction. We used data from Wikipedia and analysed links between over 20 million articles from 11 language editions. We extracted over 1.4 billion source-destination pairs of articles from SQL dumps and more than 700 million pairs from XML dumps. [...] Based on real data, we discussed the difference between standard PageRank, Cheirank, 2DRank and measures obtained based on our approach in separate languages and multilingual network of Wikipedia."

(see also earlier coverage of related research that applied such ranking metrics to graphs of Wikipedia articles)
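
The exact Google matrix construction is described in the paper; purely as a hypothetical illustration of the general idea of combining link structure with reader-behavior data, the Python sketch below computes ordinary PageRank by power iteration but replaces the uniform teleportation vector with one weighted by pageview counts. The toy graph and view counts are invented, and this is not the authors' algorithm.

import numpy as np

def pagerank(adj, teleport, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; 'teleport' biases the random-jump step."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-normalize the link matrix; dangling pages link uniformly to all pages.
    P = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg[:, None], 1),
                 1.0 / n)
    v = teleport / teleport.sum()      # e.g. pageview counts, normalized
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = alpha * (P.T @ r) + (1 - alpha) * v
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Toy example: the last article receives few in-links but many pageviews.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
views = np.array([100.0, 50.0, 20.0, 5000.0])
print(pagerank(adj, views))          # pageview-biased ranking
print(pagerank(adj, np.ones(4)))     # standard (uniform-teleportation) PageRank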

"Modeling Popularity and Reliability of Sources in Multilingual Wikipedia"

From the accompanying blog post:[6]

"In this paper authors analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. In the research authors presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. [....] For example, among the most popular scientific journals in references of English Wikipedia are: Nature, Astronomy and Astrophysics, Science, The Astrophysical Journal, Lloyd’s List, Monthly Notices of the Royal Astronomical Society, The Astronomical Journal and others."

References

  1. ^ Champion, Kaylea (2020-07-22). "Characterizing Online Vandalism: A Rational Choice Perspective". International Conference on Social Media and Society. SMSociety'20. New York, NY, USA: Association for Computing Machinery. pp. 47–57. doi:10.1145/3400806.3400813. ISBN 9781450376884. (blog post)
  2. ^ Gorbatai, Andreea D. (2014). "The Paradox of Novice Contributions to Collective Production: Evidence from Wikipedia". SSRN 1949327.
  3. ^ Mehler, Alexander; Hemati, Wahed; Welke, Pascal; Konca, Maxim; Uslu, Tolga (2020). "Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks". Frontiers in Education. 5. doi:10.3389/feduc.2020.562670. ISSN 2504-284X.
  4. ^ Martin-Gutierrez, Samuel; Losada, Juan C.; Benito, Rosa M. (2020-07-22). "Impact of individual actions on the collective response of social systems". Scientific Reports. 10 (1): 12126. Bibcode:2020NatSR..1012126M. doi:10.1038/s41598-020-69005-y. ISSN 2045-2322. PMC 7376036. PMID 32699262. S2CID 220682026.
  5. ^ Coquidé, Célestin; Lewoniewski, Włodzimierz (2020). Abramowicz, Witold; Klein, Gary (eds.). "Novel Version of PageRank, CheiRank and 2DRank for Wikipedia in Multilingual Network Using Social Impact". Business Information Systems. Lecture Notes in Business Information Processing. Cham: Springer International Publishing. 389: 319–334. arXiv:2003.04258. doi:10.1007/978-3-030-53337-3_24. ISBN 978-3-030-53337-3. S2CID 212649841.
  6. ^ Lewoniewski, Włodzimierz; Węcel, Krzysztof; Abramowicz, Witold (May 2020). "Modeling Popularity and Reliability of Sources in Multilingual Wikipedia". Information. 11 (5): 263. doi:10.3390/info11050263. See also blog post



Discuss this story

  • Results [...] support the theory that experts are unaware of demand [i.e. experienced editors do not usually check traffic levels of the articles they edit] but they are stimulated to respond to article consumption if consumers signal demand for that particular good through their contributions as novice producers. Very true. I think that article views can get away from some people, so someone randomly editing an article for the first time to try and fix some error (even if they are unable to) generally captures editor attention more often than no one actively doing anything. –MJLTalk 07:45, 2 December 2021 (UTC)
    • I do something slightly similar myself. When in an urban place that I haven't seen in recent years, I bring up the Commons App map or sometimes the WikiShootMe site. It shows me any nearby unphotographed Wikipedia articles or Wikidata items, and I snap them. Sometimes the object isn't there, because WD has the wrong coordinates. Editing WD coords correctly on the phone screen is difficult for me, but that's okay. I just edit the location incorrectly. It's seldom worse than before, and any watchers' watchlists will show it. And whose watchlist? Mine. Upon returning home I've got the big screen and can easily make it right.
    • But hmm, editors are unaware of the amount of demand for their particular articles. Maybe there ought to be an option to make the monthly reader count more prominent. Jim.henderson (talk) 01:13, 4 December 2021 (UTC)
    • I came here to highlight the exact same passage. It would be interesting if there was an opt-in tool where you could specify articles to "super-watchlist" and be notified of sudden pageview spikes (say, if a single day hits x10 of the previous monthly/yearly average, or x3 of the previous daily record). Perhaps it'd only be worthwhile for topics where there is high potential for this: a TV show whose latest season just aired; any living person (who might gain increased media attention for any number of reasons) etc. But, as the paper finds, I do encounter in practice that inexperienced/new editors draw my attention to new developments by adding a basic description that needs expansion and good referencing. The most recent case of this for me happened today with Death to 2020, which will have a sequel this year.
      As for editors being unaware of baseline levels of views, there's a possibility here for someone to write a bot to send a monthly opt-in personalised message saying "of the articles you've substantially contributed to [added/removed more than 500 bytes last month], these are the ones with the highest pageviews that you might consider giving priority to". — Bilorv (talk) 23:51, 5 December 2021 (UTC)
  • ...value expectancy theory "suggests that identifiability acts as a constraint on deviant behavior." Dissidents realize that they are outliers and their asocial behavior only finds voice when they can hide from the consequences. I suspect that many of the editors who protect IP editing know this as they, themselves, are deviants or are deviant-adjacent and support this asociality. (I deliberately edit under my real name.) Were Wikipedia to adapt some version of attributable point of view, as opposed to the farce of WP:NPOV (which isn't really neutral) perhaps we could include minority narratives to create a useful release valve for these dead-enders. Chris Troutman (talk) 20:41, 5 December 2021 (UTC)
  • Working in political areas, it appears to me that much involved vandalism (something that took the vandal more than 30 seconds to write) arises from the vandal's perceived lack of political autonomy and a lack of representation of their views in mainstream media. Wikipedia merely repeats what the (fact-based) mainstream media say, but we are editable where other news sources are not. Media narratives arise from one of two places: the elite dictate what narrative to impose on the public (Rupert Murdoch is the most-cited example here); or market interests dictate that a news source should, in a hyperpartisan manner, manufacture sensationalised stories in whatever topics research indicates will be most-clicked on. Both types of narratives are exclusionary of many people in our society.
    The more superficial vandalism (racist comments, blanking etc.) in political topics is more a consequence of mainstream media manufacturing negative attention on a particular person or scapegoat. Most of it is fairly transient, but sometimes it amounts to long-term persistent attacks from what is undoubtedly the same group of people who will be sending death threats on social media and otherwise engaging in harassment and violence. — Bilorv (talk) 23:51, 5 December 2021 (UTC)