Machine-assisted approaches have proven successful for a wide variety of problems on Wikipedia, most notably vandalism and spam. Wikipedia already uses a rule-based edit filter, a neural network bot (Cluebot), and a variety of semi-automated programs (Stiki, Huggle, Igloo, Lupin) to determine which edits are least likely to be constructive. The worst of these are reverted automatically without any human oversight, while those that fall into a gray area are queued for manual review, prioritized by suspiciousness based on a number of factors that have correlated with problematic edits in the past.

This page explores how we can apply the same approach to the issue of copyright. Copyright violations are a major problem on Wikipedia, because the encyclopedia aims to be free for anyone to use, modify, or sell. To maintain copyright compatibility, content on Wikipedia that does not carry those permissions is severely limited (as in the case of non-free content/fair use), and such content is always explicitly tagged or labeled. Editors who contribute text or images to the encyclopedia under the pretense that it is 'free' content, when it is actually held under someone else's copyright, introduce practical and legal problems for the community. If we cannot trust that the content on Wikipedia is free, then neither can our content re-users. In addition to exposing Wikipedia to liability, we do the same for those who take from our site and assume the content has no strings attached (beyond attribution and share-alike provisions).

Steps for implementation

CorenSearchBot

How it works

Weaknesses

Questions

Copyvio detection approaches

Direct: text comparison

Indirect: attribute comparison

Combined method

Approaches for developing a corpus

Questions from Turnitin

Questions from Chris Harrick

Editors have a decision to make about how much to assist Turnitin with its efforts. Some may prefer to optimize Wikipedia's own tools instead, and feel less positive about helping a private company, even one with whom we may be partnering. Do whatever feels right for you.

  1. How hard is it to identify and maintain a list of mirrors?
  2. How confident are you that the mirror list is up-to-date?
  3. Do you have a resident expert on mirror sites who can tell us:
  4. How many mirrors are there?
  5. Are there new mirrors daily, weekly, monthly?
  6. Are there ways to potentially automate the tracking of mirrors?
  7. What constitutes a mirror versus a legitimate copy? (Just the attribution and link back we assume but we will most likely need to filter for copies as well)
  8. How many brand new articles are being created per day?

New mirrors of Wikipedia content are constantly popping up. Some of them reprint sections of articles or entire articles; others mirror the entire encyclopedia. Wikipedia content is completely free to use, reuse, modify, repurpose, or even sell, provided attribution is given and downstream re-users share the content under the same terms. The classic sign of a 'legitimate' mirror is that it acknowledges Wikipedia as the original 'author' and is tagged with one of the compatible licenses: the Creative Commons Attribution/Share-Alike License (CC-BY-SA), the GNU Free Documentation License (GNU FDL or simply GFDL), Public Domain, etc. The presence of attribution and one of those licenses on the page is de facto compliance with our license and therefore legitimate. Unfortunately, the absence of either attribution or one of those licenses does not mean that the site didn't copy content from Wikipedia; it's simply harder to tell whether Wikipedia or the other website came first. A potential way to check is to compare the date the content was added to Wikipedia with the date it appeared on the other website. Though computationally intensive, that is one approach.
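As a minimal sketch of that date-comparison idea, the helper below assumes we already have two ISO 8601 timestamps in hand, e.g. a revision date from the MediaWiki API and the earliest known date for the external copy from an archive snapshot; actually fetching those values is out of scope, and the function name is invented for illustration:

```python
from datetime import datetime

def likely_original(wiki_added: str, external_seen: str) -> str:
    """Crude precedence check: whichever date is earlier points to the
    likely original source. Both arguments are ISO 8601 timestamps
    (an assumption; real data would need fetching and normalizing)."""
    wiki = datetime.fromisoformat(wiki_added)
    ext = datetime.fromisoformat(external_seen)
    if wiki <= ext:
        return "wikipedia-first"   # external site likely mirrored us
    return "external-first"        # possible copyvio on Wikipedia

# Example: text added to Wikipedia in 2009, first seen on the external
# site in 2011 -> the site probably copied from Wikipedia.
print(likely_original("2009-03-01T12:00:00", "2011-06-15T00:00:00"))
```

This obviously cannot resolve cases where the external page has no reliable date at all, which is the common situation the text describes as hard.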

The most comprehensive list we have of known mirrors is at Wikipedia:Mirrors_and_forks/All (approximately 1000). There is also a list of mirrors here (approximately 30). A record of license compliance is maintained here and here. That list should overlap with the mirrors list, but there may be discrepancies between them. There is a Wikipedia category for 'websites that use Wikipedia' here (7 sites). There is a category of 'Wikipedia-derived encyclopedias' here (7 encyclopedias). There is a meta list of mirrors here (approximately 230). There is a list of 'live mirrors' here (approximately 170). In addition to mirrors there are also 'republishers'. These sites package and often sell Wikipedia articles as collections. A small list of known republishers is available here (6 republishers).

Also note that Google maintains a cache of slightly outdated Wikipedia articles: details here.

To get a sense of how well-maintained our main mirror list is, consider that mirrors beginning with D-E-F were updated 20 times between July 2010 and July 2012. Extrapolating across the nine such three-letter groups in the alphabet, a rough estimate is that we've updated the complete mirror list 20*9=180 times in the last two years. That number is likely lower than the actual number of new mirrors in that period, and almost certainly much lower than the number of isolated instances of copying/reprinting/excerpting individual articles.

Statistics for the number of new articles each day are here. A quick review of that table shows approximately 1000 new articles daily. There are approximately 4 million existing articles.

Automation of mirror detection could help address, though not entirely solve, the problem. One option is to look for license terms: Creative Commons, CC-BY-SA, GNU, GNU FDL, or GFDL. Another is to look for attribution phrases such as: from Wikipedia, by Wikipedia, via Wikipedia, etc. We can examine mirrors manually to see what other clues to content reuse there are. One possibility for semi-automating mirror detection is to add a feature to Turnitin reports so that a Wikipedia editor could 'flag' a matched-text source site as a mirror. Those sites could be added to a list for review to determine whether they are mirrors. This would require an investment in the interface and infrastructure of Turnitin's reports.
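A minimal sketch of the term-scanning idea, assuming the page's plain text is already available; the phrase lists below are illustrative only, drawn from the terms suggested above, and a production detector would need a much richer set:

```python
# Illustrative phrase lists (an assumption, not a vetted detector).
LICENSE_TERMS = ["creative commons", "cc-by-sa", "gnu fdl", "gfdl",
                 "gnu free documentation"]
ATTRIBUTION_TERMS = ["from wikipedia", "by wikipedia", "via wikipedia"]

def mirror_clues(page_text: str) -> dict:
    """Scan a page's plain text for license and attribution phrases
    that suggest it is a (possibly legitimate) Wikipedia mirror."""
    text = page_text.lower()
    found_license = [t for t in LICENSE_TERMS if t in text]
    found_attrib = [t for t in ATTRIBUTION_TERMS if t in text]
    return {
        "license_terms": found_license,
        "attribution_terms": found_attrib,
        # Both kinds of clue present = strong hint of a compliant mirror.
        "looks_like_mirror": bool(found_license and found_attrib),
    }
```

As the text notes, absence of these clues proves nothing; this only catches sites that advertise their reuse.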

Questions from Turnitin's CTO

The answers to these questions differ considerably depending on whether we are analyzing new Wikipedia articles or existing (old) ones. When looking at brand new articles, mirrors are irrelevant: assuming we run a report soon enough after the article is posted, there is simply no time for mirror sites to copy the content. Thus, with a new-article analysis, any text match from an external website is likely to be a copyright violation (Wikipedia inappropriately/illegally copying from it). The one exception is content that was legally licensed for reuse, often indicated by a Creative Commons or GNU FDL (GFDL) license tag. We can possibly identify those and screen them out as well. The issue of mirrors only arises when enough time has passed between the addition of content to Wikipedia and the copyright analysis for pages/sites to have copied/mirrored our content in between. That is a much more difficult problem to solve, and certainly not the low-hanging fruit.

A mirror, for our copyright compliance purposes, is any page, collection of pages, subdomain, or entire site which copies content from Wikipedia. (Obviously we get the most leverage from identifying known entire sites that copy from Wikipedia en masse). Copying from Wikipedia is fully permitted by our license, provided attribution is given to Wikipedia and downstream reusers honor the same terms. Thus, if a site has copied verbatim from us and followed the terms, it is de facto not a copyright violation. If we copy verbatim from them, however, it almost definitely is (with the exception of direct quotations, which are permitted within reason).

If an external page contains some content that was copied from Wikipedia (CFW) and some content that was not copied from Wikipedia (NCFW), it's still possible that a copyright violation occurred, if the NCFW content turns up on Wikipedia. My suspicion is that there is frankly no way to know which parts were copied and which were not without intensive manual inspection. So, I think we have little choice but to ignore these mixed instances. That said, this is an extreme edge case: the likelihood of a site copying from us also being a site that we copied from seems very slim. Furthermore, as a point of comparison, no existing tool that we have access to can parse the difference either.

There are thousands if not millions of cases where an author or page has plagiarized Wikipedia. People copy from us all the time, sometimes with proper attribution ("From Wikipedia", "Wikipedia says:", etc.) and sometimes in a way that would get them brought into the Dean's Office or fired. But again, this needs to be seen in the context of Wikipedia's CC-BY-SA license (Creative Commons, with attribution, share-alike). It's OK to copy verbatim from Wikipedia, and even when people do so without attribution it's more a problem for them than for us. Meanwhile, it's not OK for Wikipedia to copy verbatim from others unless we give attribution, and even then we must not copy so much that it exceeds what would be appropriate under fair use (we can't "quote" an entire article, for example, only a minimal excerpt).

To determine if content within a Wikipedia page is problematic and worth investigating, we use a number of approaches. The first is automated detection. For example, CorenSearchBot pulls from the feed of new articles and feeds them into the Yahoo Boss API, searching for the title of the Wikipedia article. It then pulls the top 3 results, converting those results to plain text. It then compares that text to the Wikipedia article and computes the Wagner-Fischer algorithm score (i.e., edit distance). If the score is high enough it indicates a likely copyright violation, and the Wikipedia article is flagged while our Suspected Copyright Violations board is notified. From there editors manually inspect the highest matching site. Other approaches utilize the 'smell test'. If an article suddenly appears with perfect grammar, densely researched content, no Wikipedia formatting (markup language), especially from a new editor or an editor known to have issues with plagiarism, then editors will explore further. Sometimes a comparison of the posting date on a website with the version date of the Wikipedia article is a dead giveaway of which one came first. Other times searching around the site reveals that a majority of the content is copied from Wikipedia, allowing a manual determination that it is a mirror. That determination is more easily made if the matching site is authoritative, such as a known newspaper, book, or blog. It's not impossible that a book copied from Wikipedia, but it appears to be more common that the opposite happens.
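The comparison step described above can be illustrated with a minimal Wagner-Fischer implementation. The bot's actual flagging threshold and text normalization are not specified here, so this is only a sketch of the distance computation itself, plus a simple normalization into a similarity score:

```python
def wagner_fischer(a: str, b: str) -> int:
    """Classic Wagner-Fischer dynamic-programming edit distance,
    the measure CorenSearchBot is described as computing. Uses two
    rolling rows instead of the full matrix to save memory."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]

def similarity(a: str, b: str) -> float:
    """Normalize distance into 0..1; the threshold at which a match
    would be flagged is an open parameter, not specified here."""
    if not a and not b:
        return 1.0
    return 1.0 - wagner_fischer(a, b) / max(len(a), len(b))
```

In practice the bot compares whole articles against the plain text of the top search results, so the strings involved are long; the O(m*n) cost is why this comparison is only run against a handful of candidate pages.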

One of the strengths of Andrew and Sean (Madman)'s involvement is that they will be collaborating on data-mining a corpus of known positive and negative findings from our copyright archives. That corpus should number several thousand instances, among them numerous identified mirrors. In addition to the mirrors we find in the corpus, we have a list of approximately 3000 identified/suspected mirrors ready to go in a spreadsheet. That will give us a good head start on analyzing existing (old) Wikipedia articles for copyright violations. Andrew also intends to use data-mining techniques to determine whether mirror detection can be automated. Thus, it would be desirable if we were able to append to the Turnitin mirror list using API functionality.

In the end, it's only necessary that we reduce the number of false positives to a level that editors might be able to manually evaluate (1-3 might be tolerable but 20-30 would render reports meaningless). Andrew and Sean's work will also allow us to develop a profile of the typical editor who violates copyright, by determining a variety of metadata such as whether or not they are registered, how many edits they have made, how many blocks they have on their record, and about 40 others. That profile could be used in conjunction with Turnitin scores to develop a composite metric of your best evaluation combined with ours.
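As a purely hypothetical sketch of such a composite metric: the weights, feature choices, and function names below are invented for illustration and are not the project's actual model, which would be learned from the mined corpus rather than hand-tuned.

```python
def editor_risk(is_registered: bool, edit_count: int, block_count: int) -> float:
    """Toy 0..1 risk profile built from three of the ~40 metadata
    signals mentioned above (all weights are placeholder guesses)."""
    risk = 0.5
    if is_registered:
        risk -= 0.2
    risk -= min(edit_count, 1000) / 1000 * 0.2   # experience lowers risk
    risk += min(block_count, 5) / 5 * 0.3        # blocks raise risk
    return max(0.0, min(1.0, risk))

def composite_score(turnitin_match: float, risk: float, w: float = 0.7) -> float:
    """Blend a Turnitin match fraction (0..1) with editor risk; the
    weight w is an arbitrary placeholder."""
    return w * turnitin_match + (1 - w) * risk
```

The point of the blend is triage: a high text match from a brand-new, previously blocked editor should outrank the same match from a veteran, so reviewers see the likeliest violations first.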

Questions from Turnitin's Product Manager

I am trying to gain an understanding of how Wikipedia would like to use the iThenticate API. I have been relayed information about the desire to remove mirrored sites from the reporting results, and that there are likely thousands of mirrors in existence. Since this is the case, I imagine the folks at Wikipedia would like a quick way to identify the mirrors and remove them from the iThenticate report results. Currently I need to get a better idea of how the API would be used and the best way to identify mirrors in order to bulk upload them to our URL filter...

The GUI interface for the "filter" already exists (as you know), and the add/remove/list functionality it provides is sufficient. We just want API hooks into those actions so we can interact with the list programmatically (and not have to resort to screen-scraping). To be pedantic, here is what I imagined the API calls might look like (sweeping generalizations here; I have not interacted with your API much, though I intend to code a Java library for it this weekend):

EDIT: After delving into the API documentation over the weekend, my PHP-esque examples below are obviously not the XML-RPC that iThenticate uses. Nonetheless, the general input/output/method sentiment remains the same, so I am not going to modify at length below. Thanks, West.andrew.g (talk) 20:46, 20 September 2012 (UTC)[reply]

OUTPUT FILTER LIST

http://www.ithenticate/....?action=filterlist

Response
FID URL
1 http://www.someurl1.com
2 http://www.someurl2.com
3 http://www.someurl3.com
4 http://www.someurl4.com

DELETE FROM LIST

http://www.ithenticate/....?action=filterdelete&fid=1

Response
"deleted FID 1"

ADD TO LIST

http://www.ithenticate/....?action=filteradd&url=http://www.someurl.com

Response
"added as FID [x]"

These IDs could just be the auto-increment on some table? They are not strictly necessary, but might make management a bit easier than having to deal with string-matching subtleties (i.e., in the "delete" case). Obviously, we'd also need to encode the "url" parameter in order to pass the special characters via HTTP.
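Continuing in the same hypothetical query-string style (the real iThenticate API is XML-RPC, as noted above), a client helper might simply build and encode these URLs. The base URL keeps the elided "...." from the examples, and `filter_request` is an invented name:

```python
from urllib.parse import urlencode

# Placeholder base URL taken verbatim from the sketches above.
BASE = "http://www.ithenticate/...."

def filter_request(action: str, **params) -> str:
    """Build a request URL for the hypothetical filter API,
    URL-encoding parameters such as 'url' so special characters
    survive the query string."""
    query = {"action": action, **params}
    return BASE + "?" + urlencode(query)
```

For example, `filter_request("filteradd", url="http://www.someurl.com")` percent-encodes the nested URL, which is exactly the encoding concern raised above.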

Client-side

All the complicated processing regarding mirrors will be done in our client-side application. You don't have to concern yourselves with this, but regardless, I'll describe it in brief:

  1. We provide Turnitin a document
  2. Per our telecon, the API will be able to provide us the URL matches and their match percentages: i.e. "99% www.someurl1.com; 97% www.someurl2.com".
  3. For matches above a certain threshold, we will then go fetch those URLs and put them through a machine-learning-derived classifier that determines whether: (a) the site is a Wikipedia mirror, or (b) the site is freely licensed. If either is true, we will immediately append that URL/domain to the filter list.
  4. Once this is done for all needed URLs in the match list, we will re-submit the content and have the report re-run under the new filter settings. That output will be the one published to users and upon which any actions are based.
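The four steps above can be sketched as a control-flow skeleton in which the external interactions (submission, fetching, classification, filter updates) are passed in as callables. Every name here is an assumption for illustration, not a real Turnitin SDK:

```python
def process_document(doc, submit, fetch, classify, add_to_filter,
                     threshold=0.8):
    """Run a report, filter out mirror/free-license matches, re-run.

    submit(doc)       -> [(score, url), ...]   (steps 1-2)
    fetch(url)        -> page content          (step 3)
    classify(page)    -> 'mirror', 'free', or 'other'
    add_to_filter(url)                          appends to the filter list
    The 0.8 threshold is a placeholder, not a tuned value.
    """
    matches = submit(doc)
    for score, url in matches:
        if score < threshold:
            continue                  # only inspect strong matches
        verdict = classify(fetch(url))
        if verdict in ("mirror", "free"):
            add_to_filter(url)        # exclude from future reports
    return submit(doc)                # step 4: re-run under new filter
```

The second `submit` call models the re-run; the report it returns is the one that would be published to users.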

This summarizes my opinion on what is needed. West.andrew.g (talk) 19:45, 14 September 2012 (UTC)[reply]

Notes from Aug. 14, 2012 Turnitin Meeting

Things noted by West.andrew.g (talk):

Noted by Ocaasi

Turnitin Trial design

Example report of 25 or so articles with CSB results and iThenticate results: [1]. Let me know what you think. — madman 16:16, 4 September 2012 (UTC)[reply]

iThenticate trial observations

In project

SCV Corpus

API integration

Mirror detection


Low level observations

Plans for June 15, 2013 progress meeting

Ocaasi
Andrew
Madman
Zach