Top scholarly citers, lack of open access references, predicting editor departures: And other research publications.
The Signpost

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

The first scholarly references on Wikipedia articles, and the editors who placed them there

Reviewed by Bri

The authors of the study "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles"[1] developed "a methodology to detect [when] the oldest scholarly reference [was] added" to 180,795 unique pages on English Wikipedia. The authors concluded the dataset can help investigate "how the scholarly references on Wikipedia grew and which editors added them". The paper includes a list of the top English Wikipedia editors in a number of scientific research fields.

English Wikipedia lacking in open access references

Reviewed by Gerald Waldo Luis

A five-author study titled "Exploring open access coverage of Wikipedia-cited research across the White Rose Universities" was published by the journal Insights on February 2, 2022.[2] As the title implies, it analyzes English Wikipedia references to research published by the universities of the White Rose University Consortium (Leeds, Sheffield, and York) and examines why open access (OA) is an important feature for Wikipedians to use. It concludes that the English Wikipedia is still lacking in OA references, at least among those citing the consortium's research.

The study opens by stating that despite the open source nature of Wikipedia editing, there is no requirement to link to OA sites where possible. It then criticizes this lack of scrutiny, reasoning that it is contrary to Wikipedia's goal of being an accessible portal to knowledge. Several subsequent sections describe the importance of Wikipedia to the research community, which makes OA crucial; this was recognized by the World Health Organization when it announced that it would make its COVID-19 content free to use for Wikipedia. Wikipedia has also proven to be a factor in increasing the readership of papers.

Overall, 300 references were sampled for this study. The authors also added: "Of the 293 sample citations where an affiliation could be validated, 291 (99.3%) had been correctly attributed." "In total," the study summarizes, "there were 6,454 citations of the [consortium's] research on the English Wikipedia in the period 1922 to April 2019." It then presents tables breaking down these references into specific categories: Sheffield was cited the most (2,523), while York was cited the least (1,525). Biology-related articles cited the consortium the most (1,707), while art and writing articles cited it the least (7). As expected by the authors, journal articles, specifically those from Sheffield, were cited the most (1,565). There is also a table breaking the references down by OA license. York had the highest share of OA sources cited on the English Wikipedia (56%). Fewer sources carry non-commercial and non-derivative Creative Commons licenses. The study cautions, however, that this is not a review of all English Wikipedia references.
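
The validation rate quoted above can be recomputed directly from the two counts. The following is a minimal Python sketch of that arithmetic; the helper name `share` is hypothetical and does not come from the study.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Counts quoted from the study: 291 of the 293 sample citations with a
# verifiable affiliation had been correctly attributed.
validated_rate = share(291, 293)
print(validated_rate)
```

The same helper could be applied to any of the other counts reported in the study's tables, provided the appropriate denominator is known.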

In a penultimate "discussion" section, the study says that while there are many OA references, there is still "some way to go before all Wikipedia citations are fully available [in OA]", with nearly half of the sampled references paywalled, thus stressing the need for more OA scholarly works. However, with Plan S, a recent OA-endorsing initiative, the study expresses optimism about this goal. It also proposes the solution of more edit-a-thons, which usually involve librarians and researchers who can help with this OA effort. The study notes that Leeds once held an edit-a-thon too. Its "conclusion" section states that "This [effort] can be achieved through greater awareness regarding Wikipedia's function as an influential and popular platform for communicating science, [a] greater understanding […] as to the importance of citing OA works over [paywalled works]."


Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer

"Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability"

From the abstract:[3]

"In this paper, we aim to provide an empirical characterization of the reasons why and how Wikipedia cites external sources to comply with its own verifiability guidelines. First, we construct a taxonomy of reasons why inline citations are required, by collecting labeled data from editors of multiple Wikipedia language editions. We then crowdsource a large-scale dataset of Wikipedia sentences annotated with categories derived from this taxonomy. Finally, we design algorithmic models to determine if a statement requires a citation, and to predict the citation reason."

"Psychology and Wikipedia: Measuring Psychology Journals’ Impact by Wikipedia Citations"

From the abstract:[4]

"We are presenting a rank of academic journals classified as pertaining to psychology, most cited on Wikipedia, as well as a rank of general-themed academic journals that were most frequently referenced in Wikipedia entries related to psychology. We then compare the list to journals that are considered most prestigious according to the SciMago journal rank score. Additionally, we describe the time trajectories of the knowledge transfer from the moment of the publication of an article to its citation in Wikipedia. We propose that the citation rate on Wikipedia, next to the traditional citation index, may be a good indicator of the work’s impact in the field of psychology."

"Measuring University Impact: Wikipedia Approach"

From the abstract:[5]

"we discuss the new methodological technique that evaluates the impact of university based on popularity (number of page-views) of their alumni’s pages on Wikipedia. [...] Preliminary analysis shows that the number of page-views is higher for the contemporary persons that prove the perspectives of this approach [sic]. Then, universities were ranked based on the methodology and compared to the famous international university rankings ARWU and QS based only on alumni scales: for the top 10 universities, there is an intersection of two universities (Columbia University, Stanford University)."

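The comparison described in this abstract amounts to ranking universities by their alumni's page views and then intersecting the resulting top list with an external ranking. Below is a rough Python sketch of that idea. The university names and view counts are made-up placeholders, and summing page views per university is an assumption, since the abstract does not state the exact aggregation used.

```python
from collections import defaultdict


def rank_by_alumni_views(alumni_views, k=10):
    """Rank universities by the summed page views of their alumni.

    alumni_views: iterable of (university, page_views) pairs, one per alumnus.
    Summation is an assumed aggregation, not confirmed by the paper.
    """
    totals = defaultdict(int)
    for university, views in alumni_views:
        totals[university] += views
    # Highest total first; ties are broken alphabetically for determinism.
    ranked = sorted(totals, key=lambda u: (-totals[u], u))
    return ranked[:k]


def top_k_overlap(ranking_a, ranking_b, k=10):
    """Return the universities that appear in the top k of both rankings."""
    return set(ranking_a[:k]) & set(ranking_b[:k])


# Hypothetical toy data: placeholder names, not real measurements.
toy_views = [("Univ A", 500), ("Univ B", 300), ("Univ A", 200), ("Univ C", 900)]
wiki_ranking = rank_by_alumni_views(toy_views, k=2)
external_ranking = ["Univ A", "Univ D", "Univ C"]
print(wiki_ranking, top_k_overlap(wiki_ranking, external_ranking, k=2))
```

With `k=10` and real rankings, the size of the returned set would correspond to the "intersection of two universities" mentioned in the abstract.
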
"Creating Biographical Networks from Chinese and English Wikipedia"

From the abstract and paper:[6]

"The ENP-China project employs Natural Language Processing methods to tap into sources of unprecedented scale with the goal to study the transformation of elites in Modern China (1830-1949). One of the subprojects is extracting various kinds of data from biographies and, for that, we created a large corpus of biographies automatically collected from the Chinese and English Wikipedia. The dataset contains 228,144 biographical articles from the offline Chinese Wikipedia copy and is supplemented with 110,713 English biographies that are linked to a Chinese page. We also enriched this bilingual corpus with metadata that records every mentioned person, organization, geopolitical entity and location per Wikipedia biography and links the names to their counterpart in the other language." "By inspecting the [Chinese Wikipedia dump] XML files, we concluded that there was no metadata that identifies the biographies and, therefore, we had to rely on the unstructured textual data of the pages. [...] we decided to rely on deep learning for text classification. [...] The task is to assign a document to one or more predefined categories, in our case, “biography” or “non-biography.” [...] For our extraction, we used one of the most widely used contextualized word representations to date, BERT, combined with the neural network's architecture, BiLSTM. BiLSTM is state of the art for many NLP tasks, including text classification. In our case, we trained a model with examples of Chinese biographies and non-biographies so that it relies on specific semantic features of each type of entry in order to predict its category."

See also an accompanying blog post.

Apparently the authors were unaware of Wikipedia categories such as zh:Category:人物 (or its English Wikipedia equivalent Category:People) which might have provided a useful additional feature for the machine learning task of distinguishing biographies and non-biographies. On the other hand, they made use of Wikidata to generate a training dataset of biographies and non-biographies.

"Learning to Predict the Departure Dynamics of Wikidata Editors"

From the abstract:[7]

"...we investigate the synergistic effect of two different types of features: statistical and pattern-based ones with DeepFM as our classification model which has not been explored in a similar context and problem for predicting whether a Wikidata editor will stay or leave the platform. Our experimental results show that using the two sets of features with DeepFM provides the best performance regarding AUROC (0.9561) and F1 score (0.8843), and achieves substantial improvement compared to using either of the sets of features and over a wide range of baselines"

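For readers unfamiliar with the two metrics reported in this abstract, the following Python sketch computes AUROC and the F1 score from scratch on a toy example. It does not reproduce the paper's DeepFM model; the labels, scores, and 0.5 threshold are invented for illustration, and treating label 1 as "editor leaves" is an assumption.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = editor leaves, 0 = editor stays (assumed coding).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(auroc(y_true, scores), f1_score(y_true, y_pred))
```

AUROC depends only on how the scores order the examples, while F1 depends on the chosen decision threshold, which is why papers such as this one commonly report both.
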
"When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia"

From the abstract and paper (preprint version):[8]

"we have studied the ongoing crisis in which experienced and prolific editors withdraw. We performed extensive analysis of the editor activities and their language usage to identify features that can forecast prolific Wikipedians, who are at risk of ceasing voluntary services. To the best of our knowledge, this is the first work which proposes a scalable prediction pipeline, towards detecting the prolific Wikipedians, who might be at a risk of retiring from the platform and, thereby, can potentially enable moderators to launch appropriate incentive mechanisms to retain such `would-be missing' valued Wikipedians."

"We make the following novel contributions in this paper. – We curate a first ever dataset of missing editors, a comparable dataset of active editors along with all the associated metadata that can appropriately characterise the editors from each dataset.[...]

– First we put forward a number of features describing the editors (activity and behaviour) which portray significant differences between the active and the missing editors.[...]

– Next we use SOTA machine learning approaches to predict the currently prolific editors who are at the risk of leaving the platform in near future. Our best models achieve an overall accuracy of 82% in the prediction task. [...]

An intriguing finding is that some very simple factors like how often an editor’s edits are reverted or how often an editor is assigned administrative tasks could be monitored by the moderators to determine whether an editor is about to leave the platform"


  1. ^ Kikkawa, Jiro; Takaku, Masao; Yoshikane, Fuyuki (March 14, 2022), "Dataset of first appearances of the scholarly bibliographic references on Wikipedia articles", Scientific Data, Nature Research, 9 (85), doi:10.1038/s41597-022-01190-z open access
  2. ^ Tattersall, A.; Sheppard, N.; Blake, T.; O’Neill, K.; Carroll, C. (February 2, 2022), "Exploring open access coverage of Wikipedia-cited research across the White Rose Universities", Insights, 35, doi:10.1629/uksg.559
  3. ^ Redi, Miriam; Fetahu, Besnik; Morgan, Jonathan; Taraborelli, Dario (2019). "Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability". The World Wide Web Conference. WWW '19. New York, NY, USA: ACM. pp. 1567–1578. doi:10.1145/3308558.3313618. ISBN 9781450366748. closed access, preprint version: Redi, Miriam; Fetahu, Besnik; Morgan, Jonathan; Taraborelli, Dario (2019-02-28). "Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability". arXiv:1902.11116 [cs]. See also: code, research project page on Meta-wiki
  4. ^ Banasik-Jemielniak, Natalia; Jemielniak, Dariusz; Wilamowski, Maciej (2021). "Psychology and Wikipedia: Measuring Psychology Journals' Impact by Wikipedia Citations". Social Science Computer Review. doi:10.1177/0894439321993836. closed access, Author's copy
  5. ^ Babkina, Tatiana Kozitsina; Goiko, Viacheslav; Khomutenko, Valentin; Palkin, Roman; Mundrievskaya, Yulia; Myagkov, Mikhail; Sukhareva, Maria; Froumin, Isak (November 2021). "Measuring University Impact: Wikipedia Approach". 2021 3rd International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA). pp. 625–632. doi:10.1109/SUMMA53307.2021.9632112. closed access, preprint version: Kozitsina, Tatiana; Goiko, Viacheslav; Palkin, Roman; Khomutenko, Valentin; Mundrievskaya, Yulia; Sukhareva, Maria; Froumin, Isak; Myagkov, Mikhail (2020-12-27). "Measuring University Impact: Wikipedia approach". arXiv:2012.13980 [cs].
  6. ^ Blouin, Baptiste; Bosch, Nora van den; Magistry, Pierre (2021-05-05). "Creating Biographical Networks from Chinese and English Wikipedia". To appear in Journal of Historical Network Research (dataset)
  7. ^ Piao, Guangyuan; Huang, Weipeng (2021). "Learning to Predict the Departure Dynamics of Wikidata Editors". The Semantic Web – ISWC 2021. Cham: Springer International Publishing. pp. 39–55. doi:10.1007/978-3-030-88361-4_3. ISBN 9783030883614. closed access, Repository version: Guangyuan Piao and Weipeng Huang, "Learning to Predict the Departure Dynamics of Wikidata Editors | The Insight Centre for Data Analytics".
  8. ^ Das, Paramita; Guda, Bhanu Prakash Reddy; Chakraborty, Debajit; Sarkar, Soumya; Mukherjee, Animesh (2021). "When Expertise Gone Missing: Uncovering the Loss of Prolific Contributors in Wikipedia". Towards Open and Trustworthy Digital Societies. Cham: Springer International Publishing. pp. 291–307. doi:10.1007/978-3-030-91669-5_23. ISBN 9783030916695. closed access, preprint version: Das, Paramita; Guda, Bhanu Prakash Reddy; Chakraborty, Debajit; Sarkar, Soumya; Mukherjee, Animesh (2021-09-21). "When expertise gone missing: Uncovering the loss of prolific contributors in Wikipedia". arXiv:2109.09979 [cs].
Discuss this story
  • It then criticizes this lack of scrutiny, reasoning that it is contrary to Wikipedia's goal of being an accessible portal to knowledge - I've always thought of it as the other way around. By using non-OA sources such as print books, we're making available knowledge that we personally have available to us, but isn't available to most. FUTON bias strikes again. I don't think we should favor one over the other, but I've seen stuff arguing we should prioritize open access several times, and that's never really made sense to me. Hog Farm Talk 02:16, 28 March 2022 (UTC)
    Hog Farm, I ironically agree with you; not all encyclopedic information is available on OA, and sometimes you gotta look through the wall to get the bigger picture. I do, however, think that if an OA source is a perfect substitute for a paywalled source, it should be used more. Though overall, like when I review FACs or GANs, I'm generally okay with paywalled sources, since what matters (I suppose) is verifiability and reliability. GeraldWL 07:44, 28 March 2022 (UTC)
    With FAC, I've become pretty well convinced over 15 successful FACs and scores of reviews that most articles cannot be brought to FA standard without print books. It's a pain that that's the case (until last year I lived in a pretty rural area whose libraries were very nonfiction poor), but with the exception of internet-age sportspeople, recent albums/songs, and many videogames, it's just unavoidable. Hog Farm Talk 13:25, 28 March 2022 (UTC)
    I expanded To Fly!, a 1976 film (now awaiting my mentor to review it before a FAC), and this rings true. Luckily for me, I have The Wikipedia Library, as well as an online friend with the resources and kindness to help me find books and all that jazz. Although I still try to include OA sources wherever possible; better than nothing IMO. Of course, it all comes down to personal preference. GeraldWL 16:33, 28 March 2022 (UTC)
    Yeah, and it really depends on what field you're looking in. I mainly work with the American Civil War, and aside from some old archived articles of the Missouri Historical Review, there's not much open access quality out there on that subject, while other fields have better collections of OA materials. I have to lean pretty heavy on print books and JSTOR for what I write. Hog Farm Talk 16:42, 28 March 2022 (UTC)
    For me, I lean on JSTOR, ProQuest, and the TWL search engine. Google Books previews help a lot too. Often the TWL engine would lead me to Gale as well. GeraldWL 03:11, 29 March 2022 (UTC)
  • I have recently encountered for the first time a problem with adding citations linked to Open Access sources. The problem is that English Wikipedia has an automatic filtering program that discourages editors from adding links to open access material published by what the program describes as predatory open access publishers. The program relies upon a very long list of such (alleged) publishers, and makes no provision for an editor like me to question the inclusion of any of the publishers in the list. I therefore have little difficulty in assuming that many of the entries in that list are not "predatory" at all. But why should I risk having what I consider to be reliable material later deleted from Wikipedia when I can just choose instead to publish it without the links? Bahnfrend (talk) 09:50, 28 March 2022 (UTC)
    Bahnfrend, if the only source you have is OA and considered predatory, look into the sources. Are the sources something that is of common characteristic within predatory sources? Are they peer reviewed, referenced, accurate, etc? If they look like a legit source, then yes it's OK to use it. I use a lot even though they're considered predatory, because the specific works I cite are high quality; not all POA references are POA-like. If the works are predatory-ish, look into other sources, paywalled ones if you have to. The study I reviewed here even disclaims only using OA sources where possible. GeraldWL 12:00, 28 March 2022 (UTC)
    Gerald Waldo Luis I have been citing books, not journals, published by IGI Global, which is allegedly a predatory open access journal publisher (my emphasis). The relevant book chapters are about East Timor, and were written by academics based at what appear to be reputable public universities in Portugal, Brazil (including Brazil's most prestigious university, according to its Wikipedia article), and at an accredited private institute of technology in East Timor (the last of these authors is also a former Minister in the government of East Timor). The editors of the books are academics based at apparently reputable public universities in Portugal and the UK. Unfortunately there is not a massive amount of reliable source material on East Timor, which is an impoverished country with a population of only about 1.3 million. My internet researches cast considerable doubt on whether IGI Global is really a predatory publisher of any kind of literature, whether in journals or in books. But the automated filtering program has no mechanism by which I can challenge any of the publishers on its list of (alleged) predatory publishers. Bahnfrend (talk) 12:24, 28 March 2022 (UTC)
    Bahnfrend, if you think it is reliable and you are able to assure its reliability should the sources be questioned, then yes, do use it. IGIG's reliability according to my research seems like a good source, it has been approved by ResearchGate, and implements a "double-blind peer review process". GeraldWL 12:49, 28 March 2022 (UTC)
    @Gerald Waldo Luis: Thanks, I will bear your comments in mind next time I encounter the automatic filtering program. However, it would still be nice if there were some mechanism for me (or you) to challenge the inclusion of IGI Global (or any other publisher) in the list. Bahnfrend (talk) 08:58, 29 March 2022 (UTC)
  • I agree that the idea that we should totally rely on OA sources is misguided. Quality and accuracy matter. Besides, I always feel it's of better use to the internet when I add a piece of info from a book that I know has never been put anywhere on the internet before. -Indy beetle (talk) 06:21, 29 March 2022 (UTC)
    • That's a great point, Indy beetle. The internet can be a very small loop of regurgitated knowledge. Adding to the information ecosystem rather than just recycling it is a genuine contribution. Ganesha811 (talk) 12:28, 30 March 2022 (UTC)
      I also second Indy. And yeah, I sometimes find book sources interesting too, and a good editor should try to diversify their scope of sources to get the bigger picture. Unfortunately, we live in a world where OA is not a universal thing, so yeah. GeraldWL 17:03, 30 March 2022 (UTC)
  • "An intriguing finding is that some very simple factors like how often an editor’s edits are reverted or how often an editor is assigned administrative tasks could be monitored by the moderators to determine whether an editor is about to leave the platform." Eh? An editor is assigned administrative tasks? Perhaps the researchers have not looked into how such assignments are made. Jim.henderson (talk) 17:30, 30 March 2022 (UTC)
  • I've skimmed the paper and also find this an odd choice of words because on page 8 they acknowledge that admins "act voluntarily". The paper used XTools admin score as a predictor. It looks like perhaps they are postulating that asking people to do stuff specifically to increase their admin score is healthy wrt editor retention. Is it true though that people who e.g. participate in AfD or make AIV reports are more likely to stay? ☆ Bri (talk) 17:53, 30 March 2022 (UTC)