Per a "Shutdown imminent" message from the developer, this tool appears to have been taken offline in late 2009. |
G'day All,
There's a link-suggesting tool I'm temporarily putting out there for you all to have a play with and to give feedback and comments on (either via email or on my talk page).
What it does is it takes an article of your choosing from the English Wikipedia, and suggests bits of text in that article that could potentially be linked. You can then accept or reject those individual suggestions, and then save your changes back to the Wikipedia. Please do have a look at WP:OVERLINK for some ideas about how to use your editorial judgment when selecting which links to employ. Also, before adding a wikilink to an article, follow each suggested link to the article that it points to, in order to avoid subtle mistakes. For example, many people have the same name; linking to a football player in a biochemistry article is probably not correct.
It tries to do this in a reasonably pleasant UI, where you see the list of suggestions, and then simply select "yes", "no", or "don't know" for each suggestion, and click "Preview with Added Links".
If you want to play with it now, it's at: http://can-we-link-it.nickj.org/
The source code for this is available, under the GPL. Detailed setup instructions are below. You'll need Julien Lemoine's Suggestion Search daemon (TcpQuery) installed for this to work (use the "[archive]" link for downloading - it's C++ code that gets compiled - I think it assumes a UNIX / Linux type of system, although I'm not certain - check the README in the archive). It runs as a daemon that my PHP script talks to, to help determine which phrases have existing Wikipedia articles.
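To give a feel for how the PHP-to-daemon conversation works, here's a minimal sketch (my assumption, not the actual CWLI source: the daemon speaks the simple newline-terminated protocol shown by the "echo fish | nc localhost 22581" test in the setup steps below, on the same port):

<?php
// Minimal sketch (my assumption, not the actual CWLI source): ask the
// TcpQuery daemon which article titles match a phrase, over a plain TCP
// socket. The newline-terminated query and the port number 22581 are
// inferred from the "echo fish | nc localhost 22581" test shown below.
function suggest_titles($phrase, $host = 'localhost', $port = 22581)
{
    $sock = fsockopen($host, $port, $errno, $errstr, 5);
    if (!$sock) {
        die("Could not reach TcpQuery: $errstr ($errno)\n");
    }
    fwrite($sock, $phrase . "\n");
    $response = '';
    while (!feof($sock)) {
        $response .= fgets($sock, 4096);
    }
    fclose($sock);
    // e.g. [["Fish","3167",""],["Fishing","2146",""], ...
    return $response;
}

print suggest_titles('fish') . "\n";
?>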
If you want to set up a local copy of CWLI, then you can. However, setting this up has a number of steps, and is a bit complicated. I'm not trying to make it more complicated than it needs to be, but CWLI depends on a number of other bits of software to work, and so those dependencies do make it more complicated.
Firstly, you need a machine with PHP, a web server (such as Apache), and MySQL, all installed, working, and already configured. It is assumed from here onwards that you already have this, as setting these up is outside the scope of this document.
The second thing you need is Julien Lemoine's Suggestion Search daemon (TcpQuery): http://suggest.speedblue.org/download.php
mkdir suggest
cd suggest
# Note: I am using TcpQuery v0.44 - there is a later version, v0.51, but I do not suggest using it,
# because it crashes for me, whereas v0.44 does not crash.
wget http://suggest.speedblue.org/tgz/wikipedia-suggest-0.44.tar.gz
tar xfz wikipedia-suggest-0.44.tar.gz
cd wikipedia-suggest-0.44
# If you need to install expat & glib and are on Debian or Ubuntu, try the following command:
# aptitude install libexpat1 libexpat1-dev libglib2.0-0 libglib2.0-dev
# ... and if you need a compiler installed, do this:
# aptitude install build-essential
# The netcat package provides the nc command used below. To install it:
# aptitude install netcat
./configure
make
make check
cd cmd
# This next step downloads a 122 Mb file, so it can be a bit slow depending on your internet connection...
wget http://www2.speedblue.org/download/WikipediaSuggestCompiledEn-20060810.tar.bz2
# This command will be quite slow as it decompresses the above file:
tar xfj WikipediaSuggestCompiledEn-20060810.tar.bz2
# Check that there is a "pages.bin" and a "trie.bin" file in this directory from the above archive:
ls -al En
# Now start the TcpQuery daemon. Note: you can include a "-m" switch for much improved speed if you
# have ~ 1 Gb of memory on this box:
./TcpQuery -t 10 22581 En/trie.bin En/pages.bin &
# Now test with:
echo fish | nc localhost 22581
# You should get back an answer like:
# [["Fish","3167",""],["Fishing","2146",""],["Fishspinner","1113","Tropical cyclone"] (...etcetera...)
# Once you've got this working, you can add a line to your /etc/rc.local file to make the daemon run on bootup.
# An example line is something like:
# ( cd /root/tmp/wikipedia-suggest/wikipedia-suggest-0.44/cmd ; ./TcpQuery -m -t 10 22581 trie.bin pages.bin > /dev/null & )
# Note: you'll need to update the paths and file names as appropriate.
# Your path may be "/var/www/", so update as appropriate based on how you have configured your apache:
cd /var/www/hosts
# Note: I have updated the ZIP file, so it will have to be re-downloaded if you already have it - sorry!
wget http://can-we-link-it.nickj.org/can-we-link-it.zip
unzip can-we-link-it.zip
# This should show some files:
ls -al can-we-link-it.nickj.org/
# Then open this directory in your web browser (the URL to use will depend on how you have configured apache),
# and it should show a page just like on http://can-we-link-it.nickj.org/ . If it doesn't, then there is something
# wrong with this step (either with copying over the PHP files, or with the configuration of apache).
# Now try typing something in the box (e.g. "test"). As you type, it should suggest articles that match. If it
# does, then it's working - if not, then there is a problem with TcpQuery.
The last step is setting up MySQL. I have created a full dump of my MySQL database as I currently have it, so with this you should have all the data that I have as at 19 Oct 2007.
To download and load this dump:
wget http://can-we-link-it.nickj.org/suggest-links.sql.bz2
bunzip2 suggest-links.sql.bz2
echo "create database suggest_links;" | mysql
# Load the data - might take a minute or two:
mysql suggest_links < suggest-links.sql
mysql suggest_links
# ... and issue these three commands in MySQL:
grant all on suggest_links to links@localhost identified by "links";
grant all on suggest_links.* to links@localhost identified by "links";
exit;
Now you should have all of the data loaded, and a functional local copy of the site.
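As an aside, if you want to sanity-check the database and the "links" account from PHP, something like this minimal sketch should work (the link_votes table name is taken from the purge queries below; the rest is illustrative, not the actual CWLI code):

<?php
// Minimal sanity check (illustrative only, not part of CWLI): connect as
// the "links" user created by the grants above, and count the stored link
// votes. The link_votes table name comes from the purge queries below.
$db = mysql_connect('localhost', 'links', 'links')
    or die('Could not connect: ' . mysql_error() . "\n");
mysql_select_db('suggest_links', $db);
$result = mysql_query('select count(*) from link_votes', $db);
$row = mysql_fetch_row($result);
print "link_votes rows: {$row[0]}\n";
mysql_close($db);
?>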
Try it on a Wikipedia page. When it is finding links, it may take about 20 seconds per page, and the hard disk light will flash a lot.
If this is too slow to be usable, then you'll need to add the "-m" memory switch to the TcpQuery line, which loads everything into memory - however, it needs around 600 Mb of RAM for this to work.
Oh, and periodically, you should purge the unpopular link suggestions, so that they are not suggested any more. Here are the queries that I use:
mysql suggest_links
# Shows suggested links that were strongly disliked and which have not been purged yet:
select * from link_votes where against < 100 and against >= in_favour + 2 order by against - in_favour;
# Check that the results look sensible. If they do, then you can purge the strongly disliked links,
# so that they are not suggested any more, like so:
update link_votes set against = against + 100 where against < 100 and against >= in_favour + 2;
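If you'd rather not do that by hand every time, here's a minimal sketch of the same purge as a PHP script that could be run from cron (note it skips the manual review of the select results, so use your judgment; the queries and the links/links credentials are copied from above, and this is illustrative rather than part of CWLI):

<?php
// Minimal cron-style sketch of the purge above (illustrative only). It
// skips the manual review of the select results, so only run it if you
// trust the thresholds. Queries and credentials are copied from above.
$db = mysql_connect('localhost', 'links', 'links')
    or die('Could not connect: ' . mysql_error() . "\n");
mysql_select_db('suggest_links', $db);
mysql_query('update link_votes set against = against + 100
             where against < 100 and against >= in_favour + 2', $db);
print mysql_affected_rows($db) . " strongly disliked suggestions purged\n";
mysql_close($db);
?>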
I want to give a big thank you to Julien Lemoine for writing his Suggestion Search daemon / server, which this tool uses (or rather, abuses) in a rather cruel way to determine what's a valid article name and what's not :-) Also the front page uses a modified version of his web form to help you find the right page that you want to suggest links for.
Q: What is the difference between the Link Suggester tool, and Can We Link It?
A: The difference between the Link Suggester / LinkBot and Can-We-Link-It is that the Link Suggester came first, and it was an offline script that I would manually run to suggest links, and the LinkBot would save those link suggestions to article talk pages. However, after 3 or 4 small-scale test runs it became clear that this approach had a number of problems:
Because of these problems, the talk-page approach was abandoned. Instead, the Link Suggester scripts were modified to make a web-based link-suggesting tool, called Can-We-Link-It. This tool has a number of benefits:
The main downsides of the tool as it currently stands are:
Sorry folks, but I'm going to have to shut down the Can-we-link-it web site in the next few weeks (during November 2009), and I don't know when or even if it will be back.
Why now?
Why don't you just move it to your new home?
Okay, so why don't you just move to the toolserver?
So why don't you rewrite it to use less RAM / do something else?
So where to from here?
So those are the options that I am aware of. I personally favor the rewrite approach, throwing away the current code, taking the current ideas, improving them, and making them native to MediaWiki as a MediaWiki extension that can hopefully ultimately run on the Wikimedia servers ... but I don't envy the poor sod who gets to do the rewrite!
Anyway, whatever is going to happen, you have between 2 weeks and 4 weeks from now to make a decision and do it, because that's when the current site will be shut down.
-- All the best, Nickj (t) 07:37, 13 October 2009 (UTC)