This page is currently inactive and is retained for historical reference. Either the page is no longer relevant or consensus on its purpose has become unclear. To revive discussion, seek broader input via a forum such as the village pump.
Tools, such as bots, semi-automated editing and administrative tools, and Toolserver tools with access to the Wikipedia database, regularly help in dealing with routine everyday tasks, either by automating them completely or streamlining the workflow to only involve human input where needed. They can make certain types of tasks possible that are impossible or too tedious to be cost-effective using ordinary website functions.
However, tool design is currently fragmented among many individuals with limited public discussion. The result of this is that designs are often not well-reviewed before implementation, it is difficult to recruit developers for complex tool development efforts, and creative contributors with tool ideas may have trouble finding people with the skills to make their idea a reality. The purpose of this page is to propose new tool ideas, flesh out their high-level requirements and design, and recruit interested developers.
Please be bold and invite feedback even if you're not quite sure how your tool idea would work - this is a collaborative forum and we can all work together to come up with good designs.
In the future, proposed tools may be further categorized and structured as necessary.
== Name of tool ==
(one-sentence description of the tool, with a link to the tool development website, if one exists)

=== Problem ===
(description of the problem motivating this tool)

=== Requirements ===
(what does the tool need to do? do not include details about implementation here)

=== Interface design ===
(describe how you imagine the user interface might look; it can be web-based, GUI-based, console-based, or whatever you like)

=== List of interested developers ===

=== High-level architecture ===
(to be filled in by developers; what components will the tool have, and how will they interact?)

=== Implementation details ===
(to be filled in by developers; how will the tool be implemented? what technologies will be used and what implementation issues do you anticipate?)

=== Progress ===
(as the tool is developed, describe here how far along it is and what problems are being encountered)
== Copyright tool ==
A copyright tool that checks articles for text copied from the web.
=== Problem ===
As far as I know, there are currently only two WP tools available to check articles for text copied from the Web. Both have limitations.
User:CorenSearchBot runs as a background task on newly created articles. A particular article can also be run through it by adding its name to a queue; the bot states it will process queued articles when it has a free moment. Its major limitation is that, because it's an automated task, it can't search Google or GBooks.
User:The Earwig's tool [1] is manually invoked. It searches Google, but not GBooks. It would not have caught the material that caused the recent flap [2]. (I don't know whether CSbot would have caught it either.) Its author is a student who has said they won't have time to improve its algorithm. It doesn't create permanent output (I realize that might pose a maintenance problem). I'm not sure, but from looking at the code [3], I think that if it finds one match, it adds that URL to an exclusion list. If true, this means that the person who tries to clean the article will need to go on manually comparing the rest of the website to the article; it would be much more efficient to see every match.
=== Requirements ===
Check article sentences to see whether they were copied verbatim, or close to verbatim, from websites (excluding known WP mirrors and public-domain sources) and from books in Gprint. Create output listing, for each match: the article section title, the matching sentence or good-sized sentence fragment, and the URL. Optional but useful: a second-pass option with checkboxes that would let the user exclude some of the matched websites, because even if the usual WP mirrors are automatically excluded, one often sees random sites that have scraped WP.
While it's under development, or maybe after it goes live too, dump its search strings somewhere; we could then consider why it didn't find a match where we would have expected one, and think of ways to further improve the algorithm. Novickas (talk) 15:23, 5 November 2010 (UTC)
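As a rough illustration of the requested output format, here is a hypothetical Python sketch that filters out known WP mirrors and emits one line per match with the section title, matched fragment, and URL. The `KNOWN_MIRRORS` set and the match tuples are made-up placeholders, not part of any existing tool:

```python
# Illustrative only: mirror list and input format are assumptions.
KNOWN_MIRRORS = {"en.wikipedia.org", "wikiwand.com"}

def report(matches):
    """Drop matches from known mirror domains; format the rest for review.

    Each match is a (section_title, matched_fragment, url) tuple.
    """
    lines = []
    for section, fragment, url in matches:
        domain = url.split("/")[2]  # host part of http(s)://host/path
        if domain in KNOWN_MIRRORS:
            continue
        lines.append(f'[{section}] "{fragment}" -- {url}')
    return lines
```

A real tool would normalize domains (strip `www.`, handle ports) and load the mirror list from a maintained on-wiki page rather than hard-coding it.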
=== Interface design ===
Console-based, like Earwig's tool.
=== High-level architecture ===
(to be filled in by developers; what components will the tool have, and how will they interact?)
=== Implementation details ===
(to be filled in by developers; how will the tool be implemented? what technologies will be used and what implementation issues do you anticipate?)
=== Progress ===
Just this morning I implemented a basic prototype of this that seems to do a pretty good job. It doesn't yet account for things like detecting close paraphrasing or eliminating common phrases and proper names, but a few people have tried it and given good feedback. See:
It's based on a simple n-gram search algorithm: the webpages are stripped down to text and split into sequences of words; an index data structure is then built from one of them by collecting, for each pair of adjacent words, all positions at which that word pair occurs. The tool then scans the other document's word sequence and, at each position, matches its current word pair against each position where that pair occurs in the indexed document, extending the match as far as possible. Finally, the results are sorted by number of matched words in reverse order, and any result that is a substring of a result already listed is eliminated. PDFs are simply filtered through the existing pdftotext tool first. Dcoetzee 17:23, 21 March 2011 (UTC)
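The matching scheme described above (bigram index, greedy extension, longest-first listing with substring elimination) can be sketched in Python roughly like this. All function names are illustrative; this is a minimal reconstruction of the idea, not the prototype's actual code:

```python
from collections import defaultdict

def tokenize(text):
    """Split a document into a sequence of lowercase words."""
    return text.lower().split()

def build_bigram_index(words):
    """Map each adjacent word pair to every position where it occurs."""
    index = defaultdict(list)
    for i in range(len(words) - 1):
        index[(words[i], words[i + 1])].append(i)
    return index

def find_matches(article_words, web_words):
    """Return matched word runs, longest first, dropping substrings."""
    index = build_bigram_index(web_words)
    runs = []
    for i in range(len(article_words) - 1):
        pair = (article_words[i], article_words[i + 1])
        for j in index.get(pair, []):
            # Extend the match as far as both sequences agree.
            length = 2
            while (i + length < len(article_words)
                   and j + length < len(web_words)
                   and article_words[i + length] == web_words[j + length]):
                length += 1
            runs.append(" ".join(article_words[i:i + length]))
    # Longest first; keep only runs not contained in an already-kept run.
    runs.sort(key=len, reverse=True)
    kept = []
    for run in runs:
        if not any(run in longer for longer in kept):
            kept.append(run)
    return kept
```

Detecting close paraphrasing or filtering common phrases, as noted above, would need more than this exact-match extension, e.g. stopword removal or fuzzy comparison of the extended runs.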