The Signpost

In focus

The Wikipedia SourceWatch

A new project to find unreliable sources cited by Wikipedia

A few years back, while working on WikiProject Academic Journals' Journals Cited by Wikipedia (JCW) compilation, I realized we could harness the power of bots to identify a variety of unreliable sources cited by Wikipedia. I've dubbed the project The Wikipedia SourceWatch (or just The SourceWatch),[a] as it aims to identify and combat unreliable sourcing, much as Quackwatch aims to identify and combat medical quackery, and Retraction Watch reports on retracted research in scientific journals.

For context, the JCW compilation takes the various |journal= parameters of ((cite xxx)) templates found in articles, and compiles them into various lists. For example, in the following citation

  • ((cite journal |last1=Yager |first1=K. |year=2006 |title=Wiki ware could harness the Internet for science |journal=Nature |volume=440 |issue=7082 |pages=278–278 |doi=10.1038/440278a))

a bot would find |journal=Nature and then report it at WP:JCW/N7.[b][c] The compilation is organized in many ways (alphabetically, by citation count, and so on) and is typically updated a few days after the 1st and 20th of each month, when database dumps are generated. Those who want a bit of history and technical details can check the main JCW page or this talk I gave in Montreal for Wikimania 2017.
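The extraction step described above can be sketched as follows. This is only an illustration and not the bot's actual code: it assumes raw wikitext, where templates are delimited by double curly braces, it ignores nested templates, and the names `extract_journals` and `count_journals` are hypothetical.

```python
import re
from collections import Counter

# Matches a {{cite ...}} template that contains no nested templates
# (a simplifying assumption; real wikitext can nest templates).
CITE_TEMPLATE = re.compile(r"\{\{\s*cite\s+\w+[^{}]*\}\}", re.IGNORECASE)

# Captures the value of |journal= up to the next pipe or closing brace.
JOURNAL_PARAM = re.compile(r"\|\s*journal\s*=\s*([^|{}]*)", re.IGNORECASE)


def extract_journals(wikitext):
    """Return the |journal= values found in cite templates, in order."""
    journals = []
    for template in CITE_TEMPLATE.findall(wikitext):
        match = JOURNAL_PARAM.search(template)
        if match:
            name = match.group(1).strip()
            if name:
                journals.append(name)
    return journals


def count_journals(articles):
    """Tally journal names over a mapping of article title -> wikitext."""
    counts = Counter()
    for wikitext in articles.values():
        counts.update(extract_journals(wikitext))
    return counts


templated = (
    "{{cite journal |last1=Yager |first1=K. |year=2006 "
    "|title=Wiki ware could harness the Internet for science "
    "|journal=Nature |volume=440 |issue=7082 |pages=278–278 "
    "|doi=10.1038/440278a}}"
)
# A hand-formatted citation: no cite template, so nothing is extracted.
untemplated = "Maddox, J. (1988). ''Nature''. '''334''' (6180). {{doi|10.1038/334287a0}}."

print(extract_journals(templated))
print(extract_journals(untemplated))
```

As in the compilation itself, a citation written without a citation template yields nothing, because there is no |journal= parameter to read.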

The Directory of Open Access Journals (DOAJ) does not allow predatory journals to be listed in its directory. As such, some journals will lie about being included in DOAJ to appear more legitimate. Predatory journals will also lie about having impact factors or about being included in high-reputation databases like Scopus or Web of Science. The DOAJ advises "to ALWAYS check at https://doaj.org that a journal is indexed in DOAJ even if its web site carries the DOAJ logo or says that it is indexed [in DOAJ]". This is good advice, which applies equally to the other indexing services.

The idea of using the JCW compilation to fight unreliable sourcing stewed in my mind for a while, until I finally decided to take action in August 2018. I contacted JLaTondre, who runs the bot, and together we began laying down the first bricks of The SourceWatch. The bot would look for the various |journal= parameters of citation templates and cross-check them against Beall's List, a list maintained by librarian Jeffrey Beall to identify predatory journals and publishers until it was taken down in 2017. Beall's List is not perfect by any means, especially if you want a list that only identifies journals that are definitely predatory, rather than journals that range from questionable to definitely predatory, but it was a good start. Since there are other efforts beyond Beall's List to identify unreliable sources in general, I expanded The SourceWatch to draw from a variety of additional sources, including circular references to Wikipedia, deprecated or generally unreliable sources, journals lying about being included in the Directory of Open Access Journals, Quackwatch's list of non-recommended periodicals, self-published sources and vanity publications, and sources from notoriously unreliable fields (which are broadly speaking the subcategories of Category:Pseudo-scholarship and a few others). While journals from Cabell's blacklist could not be included as of writing due to the exorbitant paywall, they might get included in the future.

Two main ways of using The SourceWatch exist:

  1. Browsing WP:SOURCEWATCH directly. If 5 or fewer articles cite a specific publication, the links to these articles will be given. If more than 5 articles cite it, you will have to search Wikipedia to find where it is cited. This is useful to find articles which need to be updated with reliable sources, or where unreliable sources need to be removed.
  2. Using Special:WhatLinksHere on an article and looking for links from Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1 (or .../Questionable2, .../Questionable3, ...). This won't directly tell you which potentially unreliable publication is cited, but it will let you know that at least one such publication is cited. This is useful when you edit an article and want to make sure you are not citing bad sources. However, this method only works if 5 or fewer articles cite a specific publication.
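The five-article rule that both methods depend on can be written down in a few lines. This is a minimal sketch under assumptions: `report_entry` and the dictionary layout are hypothetical, and only the threshold of 5 comes from the text.

```python
MAX_LISTED_ARTICLES = 5  # threshold stated in the text above


def report_entry(publication, citing_articles):
    """Build one report row: list the citing articles only when there are few."""
    articles = sorted(set(citing_articles))
    entry = {"publication": publication, "count": len(articles)}
    # Links are given only for 5 or fewer citing articles; otherwise the
    # reader has to search Wikipedia to find where the publication is cited.
    entry["links"] = articles if len(articles) <= MAX_LISTED_ARTICLES else None
    return entry


few = report_entry("Example Journal", ["Article B", "Article A"])
many = report_entry("Example Journal", ["Article %d" % i for i in range(6)])
print(few)
print(many)
```

Because an article is linked from a report page only in the first case, the WhatLinksHere method cannot detect publications cited by more than 5 articles.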

For example, as of writing, the article on Heinrich Albert cites Deutsche Allgemeine Zeitung, a German newspaper published from 1861 to 1945, which is categorized in Category:Propaganda > Category:Nazi propaganda > Category:Nazi newspapers. This does not mean that citing Deutsche Allgemeine Zeitung is necessarily inappropriate – the newspaper did not exclusively publish Nazi propaganda over the 84 years of its existence – but it is good to verify that we are not citing Nazi propaganda inappropriately. This can be found either by browsing WP:SOURCEWATCH, which features Deutsche Allgemeine Zeitung under the 'Propaganda' category, or through Special:WhatLinksHere/Heinrich Albert, which shows a link from Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1.
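The WhatLinksHere check can also be scripted. The sketch below only builds a query URL for the MediaWiki Action API's `list=backlinks` module and filters a list of page titles; it performs no network request, and the helper names are hypothetical. Restricting the query to namespace 4 (the "Wikipedia:" project namespace) is a convenience, not something the article prescribes.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
QUESTIONABLE_PREFIX = (
    "Wikipedia:WikiProject Academic Journals/"
    "Journals cited by Wikipedia/Questionable"
)


def backlinks_url(article_title):
    """Build an API URL listing project-namespace pages that link to an article."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": article_title,
        "blnamespace": 4,  # "Wikipedia:" project namespace
        "bllimit": "max",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def questionable_backlinks(titles):
    """Keep only the backlinks that come from a .../Questionable# page."""
    return [title for title in titles if title.startswith(QUESTIONABLE_PREFIX)]


url = backlinks_url("Heinrich Albert")
hits = questionable_backlinks([
    "Wikipedia:Village pump (technical)",
    QUESTIONABLE_PREFIX + "1",
])
print(url)
print(hits)
```

An empty result does not prove an article is clean, since publications cited by more than 5 articles are not linked from the report pages.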

A figure from the famous "Get me off Your Fucking Mailing List" paper by David Mazières and Eddie Kohler, accepted by the International Journal of Advanced Computer Technology.[1] The journal's 'review' process deemed the paper "excellent". Figure 2 in the paper shows even more rigorous data on why Mazières and Kohler should be taken off the aforementioned mailing list.

Of course, due to the inherently subjective nature of what constitutes an unreliable source, The SourceWatch includes sources that range from questionable to definitely unreliable, but it also has a few false positives. Among the questionable, we have, for example, journals and publishers which may merely engage in questionable practices such as sending spam emails to researchers, but which nonetheless remain committed to scientific and academic standards. Among the definitely unreliable, we have journals that literally accept anything, even SCIgen papers, if you pay them. Among the false positives, we have hijacked journals, which are legitimate publications impersonated by fraudulent ones designed to have identical or similar names.[d] Other false positives can include members of categories such as Category:Paranormal magazines, which may set out to debunk hoaxes and nonsensical claims, rather than perpetuate them. Yet another cause of false positives is that the algorithm used to find those unreliable sources is not perfect. It is designed to find typos and similar names (Journal of Science vs Journal of Sciences), but will sometimes pick up journals that are obviously (to humans) unrelated (African Journal of ... vs American Journal of ...). However, false positives can be manually identified, and the compilation will be updated accordingly in future bot runs. And lastly, The SourceWatch is heavily based on third-party lists and will to an extent reflect the opinion of those lists' compilers, which could be inaccurate or outdated in certain cases.
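To see how a similar-name match can catch typos and still produce false positives, consider the sketch below. The article does not describe the bot's actual matching algorithm, so Python's `difflib.SequenceMatcher` stands in for it here; the 0.9 threshold, the function names, the exclusion set, and the "Foo" journal titles are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def flag_matches(cited, watchlist, threshold=0.9, exclusions=frozenset()):
    """Return watchlist names that resemble the cited name.

    `exclusions` holds (cited, listed) pairs a human has marked as
    false positives, so that later runs skip them.
    """
    flagged = []
    for listed in watchlist:
        if (cited, listed) in exclusions:
            continue
        if similarity(cited, listed) >= threshold:
            flagged.append(listed)
    return flagged


# A plausible typo or variant is caught, as intended.
print(flag_matches("Journal of Sciences", ["Journal of Science"]))

# Two unrelated titles that differ by a few letters are caught too.
print(flag_matches("American Journal of Foo", ["African Journal of Foo"]))

# Recording the pair as a known false positive suppresses it.
known_false_positives = {("American Journal of Foo", "African Journal of Foo")}
print(flag_matches("American Journal of Foo", ["African Journal of Foo"],
                   exclusions=known_false_positives))
```

Raising the threshold would drop the second match but would also start missing genuine variants, which is why a manually maintained exclusion list is a reasonable complement to purely string-based matching.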

I want to emphasize here just how much work JLaTondre has done on this and JCW over the nearly 10 years of the compilation. The original JCW compilation and The SourceWatch may be my ideas, but JLaTondre is the one responsible for the heavy lifting and making them a reality since 2011.[e] I must also acknowledge the contributions of several people: Ronhjones for their help managing the configuration pages,[f] Tokenzero for their help with the creation of several redirects useful to The SourceWatch,[g] as well as the many people at Village Pump (technical) who have helped over the years with various matters, Galobtter in particular. Hundreds of citations were cleaned up using The SourceWatch during development, but it was only known to a handful of people due to its unpolished state. The compilation was at times plagued with a staggering number of false positives and poor presentation structure. Now, after several iterations, The SourceWatch is something that should be usable by the community at large. While there likely is still room for improvements and debates on what should or should not be listed, one no longer needs to be familiar with the intricate workings of the bot to make sense of The SourceWatch lists, or spend months playing Whac-A-Mole against false positives.

The SourceWatch does not definitively answer whether a source is unreliable. Even if a source is unreliable, it does not definitively answer whether it is appropriate to cite it either. However, The SourceWatch is a good starting point to find unreliable sources, at least those which make use of citation templates. Once they are found, the community can then critically evaluate whether or not they should be cited, leading to a better, more reliable, Wikipedia. Whether a source should be cited can be discussed at the reliable sources noticeboard, or alternatively at a relevant WikiProject's talk page, such as WikiProject Medicine for medically dubious sources, or WikiProject Physics for sources claiming to have proven aether theories.

Suggestions on how to improve The Wikipedia SourceWatch can be made at WT:SOURCEWATCH. Particularly welcome would be suggestions for additional sources that The SourceWatch could draw from, like lists of journals lying about being indexed by reputable databases. Other efforts to identify and prevent unreliable sourcing can be found in the "other efforts" section of the WP:JCW navbox.

Notes and references

Notes
  1. ^ Renamed The Wikipedia CiteWatch or The CiteWatch in May 2019, per RFC.
  2. ^ As of writing. If you are reading this at a later date, Nature may be reported at a different location.
  3. ^ Non-templated citations like
    • Maddox, J.; Randi, J.; Stewart, W. W. (1988). "'High-dilution' experiments a delusion". ''Nature''. '''334''' (6180): 287–290. ((doi|10.1038/334287a0)).
    are completely ignored by the bot.
  4. ^ For example, the perfectly respectable journal Wulfenia's web presence has been hijacked (with the fake websites www.wulfeniajournal.at / www.wulfeniajournal.com / www.multidisciplinarywulfenia.org), while the real website is hosted by the Regional Museum of Carinthia. As of writing, the bot will report Wulfenia, out of concern it may be a citation to one of the fraudulent websites, even though in all likelihood those citations will be to the real website. This behaviour may change in the future.
  5. ^ From 2009 to 2011, ThaddeusB coded WikiStatsBOT to take care of JCW.
  6. ^ Specifically, Ronhjones coded RonBot (Task #10), which sorts and organizes WP:SOURCEWATCH/SETUP (upon which The SourceWatch is based) and WP:JCW/EXCLUDE (which removes false positives).
  7. ^ Specifically, Tokenzero coded TokenzeroBot (Tasks #5 and #6 especially), which creates redirects of the type Predatory Journal → Predatory Publisher, including the ISO 4 abbreviations of such journals. It also puts appropriate disambiguation notes in articles, when relevant.
References
  1. ^ Beall, J. (20 November 2014). "Bogus journal accepts profanity-laced anti-spam paper". Scholarly Open Access. Archived from the original on 2014-11-22.
  2. ^ Wales, Jimmy (23 March 2014). "Jimmy Wales, Founder of Wikipedia: Create and enforce new policies that allow for true scientific discourse about holistic approaches to healing. > Jimmy Wales's response". Change.org. Retrieved 18 February 2019.

Discuss this story

A previous discussion about a technical error related to the publishing script was moved to User talk:Evad37/SPS.js#Publishing script error. Headbomb {t · c · p · b} 23:11, 31 March 2019 (UTC)[reply]
  • Thank you for fighting the good fight. Efforts like this are why the credibility of Wikipedia is increasing. Keep up the good work.  SchreiberBike | ⌨  04:42, 1 April 2019 (UTC)[reply]
    • I second this, and I'd like to thank everyone involved for stepping up to make Wikipedia a more reliable and trustworthy resource for our readers. — Newslinger talk 04:54, 2 April 2019 (UTC)[reply]
  • The "list" is rather all-inclusive - even managing to have The New York Times in its talons. Bot-generated lists are not something I recommend for use as the number of "false positives" is beyond belief. Argh. Collect (talk) 18:15, 1 April 2019 (UTC)[reply]
It is on the list in the very first column one arrives at through the only "search" box provided! - and even casual users would notice it, as most of the "bad sites" are also in that second column. This from the "search" function on the "Sourcewatch" redirected page. As are all the other NY newspapers, the MIT Technology Review (listed quite prominently as a hijacked journal) and more. Voice of America and Radio Free Europe are listed as "propaganda sources". Page N18 of the sources list. I fear the "search box" missed your editorial review I had suggested a while back? Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/N18 To begin with, cut out that "search" which misleadingly lists every journal known to man. The best part is the list of redlinked sources - but as there are an infinite number of possible redlinks to add, that does not help. I suspect that the separated "actual real Wikipedia problem sources" list will be much more manageable. Oh, and blacklisting every "wrong science source" may be nice to some, but deleterious to many articles. Meanwhile, is there a reason to keep publishers on the whist which Beall had deleted from that list? Collect (talk) 20:16, 1 April 2019 (UTC)[reply]
@Collect:
1) The New York Times: That's from WP:JCW/N18 which is part of Journals Cited by Wikipedia, a compilation of every |journal= used across Wikipedia, not The SourceWatch, which is a specific subset of JCW (specifically the pages ending in /Questionable#). A thing that will help here is that if you do not see the big SourceWatch warning on the page, you are not dealing with The SourceWatch.
2) Voice of America is categorized in Category:United States government propaganda organizations and Radio Free Europe in Category:Anti-communist propaganda.
3) The MIT Technology Review was indeed hijacked.
Headbomb {t · c · p · b} 20:29, 1 April 2019 (UTC)[reply]
Are you asserting that MIT Technology Review is not listed as "hijacked"? Item number 30 on the very first page of your list? Did you read Beall's comments which made clear that "Tec Review" was his problem and not "MIT Technology Review"? That would be reassuring as it would then be clear that evil forces are corrupting my downloads. Meanwhile, it means the "search" function is totally useless for this. I am glad you pointed out that many organizations are given deprecatory descriptions, by the way. It makes one feel reassured that WP:NPOV is adhered to in all projects. And the reason for "redlinked journals" in profusion is? Collect (talk) 20:34, 1 April 2019 (UTC)[reply]
I don't think you understand what a hijacked journal is. I also don't know what you mean by the "redlinked journals" in profusion.Headbomb {t · c · p · b} 20:35, 1 April 2019 (UTC)[reply]
Beall's lists "Tech Review" as the hijacked journal, and MIT Technology Review as the genuine journal! One column (the left one in his list) is the fakes, the second column (the right one is clearly labeled "authentic journal") is the "authentic journal" It helps to read the column headers!!! Collect (talk) 20:41, 1 April 2019 (UTC)[reply]
MIT Technology Review is the hijacked journal (a legitimate academic journal for which a bogus website has been created by a malicious third party), TECH REV: Technology Review journal is the hijacker. The Beall website gets the terminology wrong. Also I've tweaked the search box to only search in the SourceWatch when on a /Questionable page.Headbomb {t · c · p · b} 20:46, 1 April 2019 (UTC)[reply]
As noted, Beall only lists the fake one as the "hijacked journal" and the "real one" is listed as "authentic." And Beall got his own terminology wrong? Nope. It quite appears the reverse. The person who writes the first list is the one who gets to choose his terminology. But "The Beall website got the terminology wrong" does not quite impress me. Sorry. Collect (talk) 20:52, 1 April 2019 (UTC)[reply]
Jeffrey Beall did not invent the term hijacked journal. Again, see our article hijacked journal and the explanatory note Hijacked journals are legitimate academic journals with imposters pretending to be the legitimate publication. These citations are likely not problematic, but it is good to check that the real journal is being cited. Headbomb {t · c · p · b} 20:54, 1 April 2019 (UTC)[reply]
You mean the article I corrected because it misrepresented the sources? The one where you removed an old talk page entry as "no one cares"? [1]? The one where you reverted my actual use of the sources? [2] which admits Butler (a main source" was "misused - but doubled down on the misuse? Sorry, I was giving you the benefit of the doubt -- but misusing sources and doubling down on that misuse is not my cup of tea. Add all the sites you wish as I seem to have a very bad taste in my mouth. Collect (talk) 21:24, 1 April 2019 (UTC)[reply]
See the note on the talk page. And no one cares about a talk page message posted by a bot ages ago. Headbomb {t · c · p · b} 21:27, 1 April 2019 (UTC)[reply]
And I am getting a teensy bit upset about your hatting and rehatting of my post at WP:RS/N#Hijacked_journal_problems as a violation of WP:CANVASS while your post at WT:WikiProject_Academic_Journals#Talk:Hijacked_journal#bad_reverts which seems not to relate to the problems at hand - specifically making "interpretations of sources" directly contradicted by the sources. Collect (talk) 00:34, 2 April 2019 (UTC)[reply]
That's because, again, per WP:CANVASSING, if you want to bring people to a discussion, you give a neutral notice of the discussion happening. You don't poison the well by injecting your opinion/side all over the place. WP:RSN is to discuss whether or not sources are reliable. It is not the place to debate their interpretation, or decide on what terminology is clearest. Headbomb {t · c · p · b} 00:50, 2 April 2019 (UTC)[reply]
WP:RSN is a neutral noticeboard, and I stated the issue clearly. It is not "CANVASSING" by a mile or two. The article has exceedingly few viewers, and you "pinged" friends to go there, while I "pinged" no one at all. Period. I rather think that when a reliable source uses a word, we should not assign it a diametrically opposite meaning. Maybe I am in a minority, in a Carrollian way. That should end the contretemps as I have done my best to state facts and not hat the heck out of someone. Collect (talk) 01:22, 2 April 2019 (UTC)[reply]
WP:RSN is a neutral noticeboard, yes. It was your message that was not neutral, and violated WP:CANVASS. But that's rather irrelevant to this Signpost piece, so can we please keep debate about what to do with the article on the article's talk page, rather than have a meta debate about how to have a debate on a half a dozen page? Headbomb {t · c · p · b} 01:39, 2 April 2019 (UTC)[reply]
You hatted and rehatted my post which I believed and still believe set forth the issue. That you assert it was not "neutral" and not placed in a "neutral" place is not of import as it is still hatted and anyone can read it for themselves to see how non-neutral it was. Or possibly actually feel it was a reasonable post on the proper noticeboard, and less CANVASS that "pinging" three friends. Collect (talk) 01:46, 2 April 2019 (UTC)[reply]
Again, I did not ping three friends. I pinged the original author of the words, and the two people that posted on the talk page before. Neutrally. You on the other hand, poisoned the well at WP:RSN, presenting your side, rather than neutrally advertise the ongoing discussion. Now, take it to Talk:Hijacked journal, as has been requested of you over a dozen times now, where a discussion of the actual issue can happen, rather than these silly meta debates about how to have a debate about something. Headbomb {t · c · p · b} 01:53, 2 April 2019 (UTC)[reply]

Can this system be gamed?

  • Assume for a moment that for ideological reasons a fairly large number of Wikipedia editors dislike some reliable sources and like other, unreliable sources.
Given the above assumption, is there any way that this list of questionable sources could be gamed in such a way that it can be weaponized in the ongoing bare-knuckle, no-rules brawl between Team Blue and Team Red? --Guy Macon (talk) 15:16, 2 April 2019 (UTC)[reply]
My gut feeling is that this is as 'gameable' as any of the original sources themselves. Debate about whether something on Beall's list is reliable has occurred countless times. The answer with Beall is usually 'Beall was right, this is a garbage journal', but since Beall classified questionable journals alongside literally zero academic worth journals, these discussions often result in 'Yeah this is published by X, which isn't the greatest, but it's not zero worth'. Facts backed up by those journals would usually fail a WP:MEDRS check, but those journals would often be considered perfectly valid sources for basic claims that aren't at the cutting edge of research (e.g. Foobarin is a complex protein discovered by James of Foo in 1942) and aren't used to back up completely wild OR/POV claims. Likely all this list will be doing is accelerating the rate at which those discussions occur, since it makes finding these potentially problematic citations easier.
The list is a tool, and like any other tool it can be abused if you really want to. But you'd have to ignore the bigass disclaimer at the top of the list, which says that the SourceWatch is only a starting point, that it's not perfect, that it doesn't know the full context in which a source is used, and that you shouldn't go on a mass purge without discussing things at the RSN first. Headbomb {t · c · p · b} 15:36, 2 April 2019 (UTC)[reply]