As of February 2009, Wikipedia allows all search engines to cache its pages. This means that if a search engine such as Google happens to crawl a page, any inappropriate or bad content, including WP:BLP violations, may be propagated out onto the Internet for an indeterminate amount of time. However, we have the ability to set Wikipedia to be NOCACHE in our robots.txt file. The major benefit of this is that search engines would only ever report the current state of an article (or any page) at any given time.

At least once, a slightly prominent BLP article was vandalized with racial epithets that the world's search engines then cached.[1] A vandal replaced the entire BLP article with three epithets.[2] The edit was reverted less than two minutes later,[4] but the damage was done: according to cached copies of Wikipedia on search engines, we were now referring to the BLP subject as "NIGGA".[3]

That was one of the most-watched BLP articles we've ever had--what chance do the hundreds of thousands of lesser-known BLP articles have? The idea behind this proposal is to protect not just BLPs, but the integrity of all our articles, from being cached with bad information, even temporarily.

References

  1. ^ Oswald, Ed (2009-02-17). "Google Search for Barack Obama Reveals Racial Epithets". Technologizer. Retrieved 2009-02-18.
  2. ^ 04:44, February 17, 2009 edit to Barack Obama.
  3. ^ The edit in question, made at 04:44, February 17, 2009 and reverted at 04:46, less than two minutes later, but cached by Google and saved for the world to see, on one of the most well-known living people on Earth, for an unknown length of time.
  4. ^ 04:46, February 17, 2009 edit to Barack Obama.

Discussion is good, and here is what I want to discuss

  • In a perfect world, this plus flagged revs on BLPs would nuke just about anything new from getting in. Just a note too, the user that made the edit? Registered and editing since 2006. rootology (C)(T) 08:18, 18 February 2009 (UTC)
  • This rather odd user had not edited since 18 September 2007 until sparking off this incident, and has made fewer than 50 edits in total. MickMacNee (talk) 10:02, 18 February 2009 (UTC)
  • In that regard, doesn't that essentially negate Rootology's suggestion of flagged revisions? As his revision would not have been flagged, it would have gone on anyway -- just another example of why flagged revisions falls short of even the lowest of expectations, in my biased opinion. 128.61.56.41 (talk) 10:39, 18 February 2009 (UTC)
Not at all. FRs are very flexible, and while it's possible (I believe) to turn on FRs for only IP users, any implementation would almost surely require all edits to be flagged. Plus, someone with that few edits wouldn't be a sighter, so FRs in this case would indeed have prevented the incident. ♫ Melodia Chaconne ♫ (talk) 13:01, 18 February 2009 (UTC)
  • What will the effects of this be on Wikipedia's appearance in the search results? If Google isn't allowed to cache Wikipedia articles, it won't be able to present the two-line excerpts in the results, making search results less useful and less attractive. --Carnildo (talk) 08:38, 18 February 2009 (UTC)
    • Maybe I'm wrong, but as I understand it, as long as a page is indexed, there will be a snippet. The "cache" feature just lets the user see a full copy of the whole page. But as long as the page is indexed, the user can see a snippet of text, usually containing the search terms. If you do a Google search, you'll often see some results with no cache but still with a snippet. --Rob (talk) 11:48, 18 February 2009 (UTC)
  • I believe you are correct, and it would mean in this particular case that Google searchers would still have seen the racial epithet on arguably one of the most important BLPs in Wikipedia. This proposal may be misleading people by implying that it would have made any difference in this high-profile incident. Delicious carbuncle (talk) 15:11, 18 February 2009 (UTC)
  • Per Wikidemon, the result of this proposal may well be to diminish Wikipedia to save 2 minutes of misfortune. I wonder, though, if it might be possible to only display revisions which have been around for 3 days without a revert. That is, instead of having such results instantly change, have only the page with non-controversial edits. Just a thought. 128.61.56.41 (talk) 10:29, 18 February 2009 (UTC) I suppose I just reiterated Rob's suggestion above. Whoops. 128.61.56.41 (talk) 10:39, 18 February 2009 (UTC)

"That was one of the most-watched BLP articles we've ever had--what chance do the hundreds of thousands of lesser-known BLP articles have?" Is this not irrelevant to this proposal? The proposal highlights the fact that vandalism, no matter how short-lived, can be cached for an indeterminate amount of time after it has been reverted. The risk that bad edits linger in poorly watched articles themselves is separate from the risk of caching bad, out-of-date versions, and is not an issue fixable by this proposal. "The idea behind this proposal would be to protect not just BLPs, but the integrity of our articles themselves from being cached with bad information, even temporarily." Similarly, this sentence also seems to miss the point of what this proposal can prevent or protect against. The proposal only addresses the harm caused by temporarily caching bad, out-of-date information. If bad information is staying around for longer because it is not being reverted in articles, that is not something this proposal protects us from at all. MickMacNee (talk) 11:31, 18 February 2009 (UTC)

This will not work. It interferes with the GFDL. It will overload the WP server. It will not prevent vandalism. Fear of vandalism is prompting ever more aggressive security measures here. Get too restrictive and over-secure and we kill the goose that lays the golden eggs - it's already being strangled. Riversider (talk) 23:14, 18 February 2009 (UTC)
Something like the HTML code <meta name="robots" content="noarchive"> would be added to each page served. This isn't very different from other directives Wikipedia gives for certain types of pages (for instance "NOINDEX" for talk pages). So I don't see how an overload could occur. --Rob (talk) 02:31, 19 February 2009 (UTC)
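To make concrete the kind of change Rob describes, here is a minimal sketch. It is purely illustrative: MediaWiki is written in PHP and would emit the tag in its own output layer, and the function name here is hypothetical; only the `noarchive` directive itself is real.

```python
# Hypothetical sketch only: MediaWiki is PHP, not Python. This merely
# illustrates the idea of injecting one robots meta tag per served page.

def add_noarchive(html: str) -> str:
    """Insert a robots 'noarchive' meta tag just after <head>.

    'noarchive' asks search engines not to keep a cached copy of the
    page, while still allowing it to be indexed (unlike 'noindex').
    """
    tag = '<meta name="robots" content="noarchive">'
    if tag in html:
        return html  # tag already present; nothing to do
    return html.replace("<head>", "<head>" + tag, 1)

page = "<html><head><title>Example</title></head><body>...</body></html>"
print(add_noarchive(page))
```

Because the tag is a constant string added at page-render time, the extra work per request is negligible, which is the substance of Rob's "no overload" point.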
Aside from the objections already made, Google's cache feature allows users to get around censorship. I am in Vietnam, which has briefly blocked Wikipedia on several occasions. China blocks Wikipedia all the time. Kauffner (talk) 09:46, 19 February 2009 (UTC)

No, absurd and disproportionate knee-jerk reaction to a problem of questionable significance. — Werdna • talk 06:32, 20 February 2009 (UTC)

And sometimes the discussion is based on ignorance

It's wonderful to behold the mixed levels of knowledge and assumptions above. Some people say it will increase the load on Wikipedia servers, others gainsay that, and some have even read the link describing what NOCACHE really does.

How will the folk looking at the discussion on this proposal reassure all of us, the knowledgeable, the quasi-knowledgeable, the dangerous loon who knows nothing but still uses that lack of knowledge to have an opinion, and the truly undecided person questing for knowledge that our various opinions have been listened to, counted or discounted based upon technical accuracy, and weighted correctly according to firmly held opinion?

That question mark should have come earlier. The sentence is far too long, but I don't care! The point is that most of us haven't a clue what we are talking about, but we are sure that we are right. I number myself in that, so don't feel I'm accusing you. Most of us opine out of ignorance, or what a "bloke in the pub" said and that we believed after the third pint.

Please will a truly technically competent person use very short words and very short sentences to tell us what effects this proposal would have, if adopted. Please don't refer us elsewhere. Many of us have read it, and some of us understand it.

Looking at the reason for proposing this in the first place, that reason looks to me like: "If we cache, then, well, but, but, but, but, but, but.... it, er, well, but..." So I am rather lost. It's a rather small thing, isn't it? I mean really? Fiddle Faddle (talk) 15:58, 19 February 2009 (UTC)

I'm not even convinced the right explanation is now linked above. (I believe it was added after the original proposal.) The link is to HTTP headers, but the proposal is about robots.txt, which appears meaningfully different. Which makes much of the discussion about that point look meaningfully off to me. But what do I know, because I am certainly not an expert on either. GRBerry 16:46, 19 February 2009 (UTC)
But this is Wikipedia. We have the Wisdom of Crowds. This means that it must be correct. Or not. Fiddle Faddle (talk) 16:49, 19 February 2009 (UTC)
My proposal was actually significantly different from what this has turned into, which bizarrely veered into GFDL advocacy and who-knows-what now, as the original formatting has been changed by (sorry guys) anti-polling types. It was literally just a query whether people supported the idea of minimizing exposure of cached copies of the site (which has nothing to do with the GFDL, because by that argument we violate the GFDL by even deleting anything). I think discussion manipulation actually killed this, bizarrely, and that makes me sad. I'm starting to get mightily sick of anti-survey and anti-poll FUD. This is what I originally posted. I was looking for feedback in an organized, orderly fashion. rootology (C)(T) 16:54, 19 February 2009 (UTC)
It seems like most of the objectors are objecting on grounds completely separate from the GFDL concerns. I don't see any real basis for the GFDL concerns, but there are a lot of other issues with this sort of proposal that have been brought up above. JoshuaZ (talk) 16:58, 19 February 2009 (UTC)
You have to remember that we actually have The Ignorance of Crowds rather than the Wisdom. The opposite of Artificial Intelligence is Genuine Stupidity.
I have just been pointed at the "right" link which describes what Google does with requests not to cache. From reading the page in detail I return to "If we cache, then, well, but, but, but, but, but, but.... it, er, well, but..." as stated above. So, that will be amusing, then, if this proposal goes ahead. I can't help but think "Good luck with that, then!"
I'm disappointed that manipulation killed it. The discussion is interesting, if bizarre. Fiddle Faddle (talk) 17:01, 19 February 2009 (UTC)

This is what I originally posted before Scott Mac changed the formatting. This was ********NOT******** a proposal to turn this on, but to gauge a clear view of what people thought, for a possible future proposal. rootology (C)(T) 17:08, 19 February 2009 (UTC)

I think your original good intention was based upon a storm in a teacup. So some fool called the US President by a bad name. In the UK we have "a one-eyed Scottish idiot", according to Jeremy Clarkson, who had to apologise for the two factual statements and refused to apologise for the opinion. This bad-name-for-Obama rubbish got picked up by some blogger or other, and he attracted folk to his site to get advertising revenue from it.
Wikipedia can be vandalised. Whoopeeee! Now let's all get over it. This was manufactured news. Ask Obama if he actually cares! Fiddle Faddle (talk) 17:24, 19 February 2009 (UTC)
No, my original intention was based on the fact that if Obama's article, with a million eyes on it, can be dinged this way with potential defamation and/or libel archived across multiple websites from our "broadcast", what chance do the tens of thousands of unwatched BLPs have, as I explicitly said in the lead of this page, in my conclusion? rootology (C)(T) 17:28, 19 February 2009 (UTC)
But Wikipedia can be vandalised. It truly does not matter. No-one will die. The earth will not stop spinning. The magnetic poles will not reverse. It's not as if it actually matters. And the actions of search engines are absolutely not Wikipedia's responsibility. People here create or vandalise. Search engines visit and record. Sometimes it hits at a bad time. So, even though I recognise that you feel strongly about this, even hijacked to an extent, I suggest that it is truly not important. Interesting? Yes. But not important. Fiddle Faddle (talk) 17:36, 19 February 2009 (UTC)

As mentioned above, this is actually the relevant mechanism for doing what's proposed. Notice there's no keyword "NOCACHE" (it's "noarchive"), and it's *not* part of the "robots.txt" file (despite what somebody says above). It also doesn't directly affect snippets, which is what the original "NIGGA" example used (in the picture); suppressing snippets can be done separately. And contrary to some opinions expressed, it has nothing really to do with Wikipedia's web server caching, and no effect on performance. While the reformatting of this page messed things up, there was never a well-formed proposal to begin with. Finally, User:Scott MacDonald didn't reformat the comments, he removed them entirely. Somebody else returned them, without the polling format (legitimately wanting to keep the comments, without edit warring over whether there's a poll or not). --Rob (talk) 17:43, 19 February 2009 (UTC)
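To make the distinction above concrete, here is a small illustrative sketch of how the standard robots meta directives combine. The directive names (`noindex`, `noarchive`, `nosnippet`) are real; the helper function itself is hypothetical, written in Python purely for illustration.

```python
# Illustrative only: the directive names are the standard robots meta
# keywords, but this helper is a hypothetical sketch, not MediaWiki code.

def robots_meta(noarchive=False, nosnippet=False, noindex=False) -> str:
    """Build a robots meta tag from the standard directive names.

    - noarchive: hide the "Cached" copy, but keep the page indexed
    - nosnippet: suppress the text excerpt shown under the result
    - noindex:   drop the page from results entirely (used on talk pages)
    """
    directives = [name for name, on in
                  [("noindex", noindex), ("noarchive", noarchive),
                   ("nosnippet", nosnippet)] if on]
    if not directives:
        return ""  # no tag needed; default crawler behaviour applies
    return '<meta name="robots" content="%s">' % ",".join(directives)

print(robots_meta(noarchive=True))                 # cache hidden, snippet kept
print(robots_meta(noarchive=True, nosnippet=True)) # cache and snippet hidden
```

This is why "noarchive" alone would not have stopped the snippet in the original incident: hiding the cached copy and hiding the excerpt are separate directives.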

All true, and all not terribly relevant. Most of the objections are specifically about issues that have little or nothing to do with the technical details. JoshuaZ (talk) 20:57, 19 February 2009 (UTC)
And thus we come full circle to the heading of this section. Fiddle Faddle (talk) 21:04, 19 February 2009 (UTC)
What we really need is a way to ask Google to "re-snippet" the page so that the snippet is current. Then when you see BLP vandalism, you revert it and ask for a re-snippet. As long as we promised to keep it under a hundred or so requests a minute, Google would probably be happy to work something out with our devs. Franamax (talk) 22:41, 19 February 2009 (UTC)
Every webmaster on Earth would love it if they could get Google to refresh their results faster just by asking. Google already refreshes highly ranked pages fast, which Wikipedia pages tend to be, so we're not going to make it better by asking. The one semi-related thing a webmaster can ask for is to have the removal of deleted pages expedited, which would be good, since deleted pages were often deleted for BLP violations but sit around in Google long after deletion (or they used to; I haven't checked lately). We can also ensure certain archives, like archive.org, don't archive such pages. --Rob (talk) 22:50, 19 February 2009 (UTC)
I'm thinking in particular of a self-published poet who unwisely decided to register here in his own name and create an article about himself, then started moaning about how his ANI's and AfD's were showing up in gsearches. Yes, it was his own fault, but we courtesy-blanked around and it still took a few weeks for it all to flush out of Google. It would have been nice in that case to queue up some re-snippeting.
And yeah, every webmaster would love it - but I'm pretty sure Google has "noticed" where we show up on searches. :) I'd bet they'd cooperate in setting up a direct channel. Wikia, no way, but we're such nice people here. :)
That's an interesting link you provide and actually could be implemented in software, triggered by "Delete this page". I haven't seen too many devs hanging out here though... Franamax (talk) 01:26, 20 February 2009 (UTC)[reply]
Note that Google no longer indexes ANI and AFDs anyway, so the primary issue here is done. Making pages lose their index when they are deleted isn't terribly useful and would damage transparency. The idea of a person being smart enough to go use a cached version but not smart enough to realize that it was likely deleted for a reason just isn't very credible. JoshuaZ (talk) 16:20, 20 February 2009 (UTC)
Not a lot of smarts are involved. Somebody sees a listing, finds the page is gone, but then sees the "cache" link, which they click. Deleted pages will ultimately be removed from Google's index, so doing so faster would at worst be harmless. I don't understand why "ANI and AFDs" were the primary issue. If an AFD is started over a problem BLP, then the actual article is more harmful than the related AFD or ANI discussion. Keep in mind we're not worried about the reader, who bears some blame if they blindly trust something that's clearly expired. But the BLP subject isn't to blame for libel surviving, and it's often the only thing found in a Google search on their name. Also, if Wikipedia wanted this type of "transparency", we'd make deleted contents visible (which is very easy to do). --Rob (talk) 18:45, 20 February 2009 (UTC)
In regard to AfD and ANI, I was talking about the problem mentioned above of people who are unhappy with the discussions about them on AfDs and ANI. That's a far more common issue. We need cached pages for DRV, as well as for allowing users to access Wikipedia where it is blocked. This really is throwing the baby out with the bathwater. It doesn't help much at all. We need to deal more with vandalism, not try to remove secondary consequences of it. JoshuaZ (talk) 18:53, 20 February 2009 (UTC)
Sometimes there's good reason to make deleted content available on both Wikipedia and Google. Sometimes there's good reason to hide deleted content on both Wikipedia and Google. There's never a good reason to hide content on Wikipedia but still show it on Google. In a DRV, an admin can temporarily restore safe content to an appropriate place for viewing. I agree cached pages are good for places where Wikipedia is blocked, and we should continue to allow non-deleted pages to be cached as normal. --Rob (talk) 19:05, 20 February 2009 (UTC)
Sure there is. The vast majority of content deleted doesn't belong on Wikipedia but isn't at all libelous. There's no good reason to force cache removal of such material. JoshuaZ (talk) 19:06, 20 February 2009 (UTC)

In summary

Obama's article was vandalized, and Google indexed the vandalism. A proposal was made to prevent this in the future, but was based on a misunderstanding of how Google works, and would not have actually prevented it. The correct mechanism for preventing it was discovered, but it would have a major negative impact on using Google to search Wikipedia. Any questions? --Carnildo (talk) 22:30, 19 February 2009 (UTC)

Yawn. "Wikipedia gets vandalised. Get over it." That is a briefer summary. Fiddle Faddle (talk) 22:55, 19 February 2009 (UTC)

This is the wrong horse

The motives are first class, but the saddle is on the carthorse, not the racehorse.

All good problem solving goes back to analyse the problem, and the problem is not, repeat not, search engine caches, nor archive system archives. No, the problem is... Vandalism.

And yet that is not the problem. The true problem is the delay in detecting which bits are vandalism on pages whose watched status is such that few folk notice vandalism.

The proposal should be:

"In order to minimise the risk of disrepute due to undetected or slowly corrected vandalism, Wikipedia must run ever more numerous and ever more efficient vandal detection and rectification services."

At a stroke this reduces the probability of an embarrassing edit being hailed as "Wikipedia is the product of a load of amateur authors having fun." Ah wait, that one is true. Ah yes, "Wikipedia is inaccurate trash." Hmm. Some of that is true. And surely, speaking unemotionally, Obama is hailed as being black despite being half white, so maybe the rude word he was called is fact too?

Our fight is not with cache. Cache is a digression. Our fight is against vandalism. Fiddle Faddle (talk) 23:13, 19 February 2009 (UTC)

Spartans! Prepare for glory! Bigbluefish (talk) 23:18, 19 February 2009 (UTC)
Ave Imperator. Morituri te salutant! Fiddle Faddle (talk) 23:22, 19 February 2009 (UTC)

Unfortunately, you didn't mention the one solution that actually would address this problem: flagged revisions. Flagged revisions would ensure that the page displayed to anonymous users — that is, the page that Google caches — has always been looked over by a trusted user, at least on BLPs. Had this system been in place, all of this could have been avoided. --Cyde Weys 02:51, 24 February 2009 (UTC)