Computing: C‑class, Mid‑importance
Internet: C‑class, High‑importance
For the old archive, please see Talk:Robots.txt protocol. —Preceding unsigned comment added by Vacuum (talk • contribs) 02:29, 27 March 2004
Perhaps this article should be named Robots exclusion standard instead of Robots Exclusion Standard? Wmahan. 00:17, 2004 Sep 12 (UTC)
guest: This page used to be findable under "robot exclusion standard" as my browser still remembers finding it there. I see no reason not to retain a top-level link under robot versus robots so that either "robot exclusion" or "robots exclusion" will find the same page, esp. as nobody would guess to search for it under the plural (I didn't.) I gripe because the old page was not found, not even "this was renamed or moved." —Preceding unsigned comment added by 76.235.68.177 (talk) 01:09, 13 December 2008 (UTC)
I have fixed the Warning section, and reduced the level to a level 3 heading (=== ... === instead of == .. ==), and removed the ((tone)) tag.--|333173|3|_||3 05:38, 27 June 2006 (UTC)
A source is needed. - Ta bu shi da yu 13:53, 14 August 2006 (UTC)
Can anyone confirm this? It sounds general, but I know of not a single reference anywhere. projectphp 00:13, 15 August 2006 (UTC)
AFAIK, the NOINDEX tag was introduced by Yandex, a Russian search engine; see the Yandex help page (in Russian). 212.176.39.52 12:12, 15 August 2006 (UTC)
Another way to exclude a portion of a webpage from indexing is used by the ASPSeek and DataparkSearch search engines: two special comments mark the beginning and the end of the region to exclude, <!--noindex--> / <!--/noindex-->; see DataparkSearch's documentation.
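To illustrate the comment-marker approach described above, here is a minimal Python sketch of how an indexer might drop the marked regions before indexing. It is not taken from ASPSeek or DataparkSearch; the function name is hypothetical, and tolerating extra whitespace or mixed case inside the markers is an assumption.

```python
import re

# Matches everything from an opening <!--noindex--> marker to the nearest
# closing <!--/noindex--> marker. Whitespace and case tolerance are assumptions.
NOINDEX_REGION = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex_regions(html: str) -> str:
    """Return the HTML with every marked region removed (hypothetical helper)."""
    return NOINDEX_REGION.sub("", html)
```

An indexer built this way would still fetch the whole page; only the text between the two markers would be left out of the index.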
There's a bit of a discrepancy between the first two and the other examples; the first two talk about "robots" while the latter talk about "crawlers". Should this be fixed/changed? Aeluwas 21:14, 30 May 2007 (UTC)
When search engines talk about their robots, they tend to call them "crawlers". However, the robots.txt applies to all robots, even the ones that don't crawl (and just check sites). Accordingly, I suggest that we use "robots" as a standard term for this article unless it's in a section that is very clearly only about a search engine crawler (such as the crawl delay). Ian McAnerin (talk) 05:20, 21 November 2007 (UTC)
The first example says that it allows all robots to crawl all directories so why is Mediapartners-Google mentioned in the user-agent section?--87.80.96.31 (talk) 19:37, 30 June 2008 (UTC)
I just removed the following external link: *[ht tp://www.google-msn-yahoo.info/ Windows XP Update Repaire] It caught my eye when I noticed "repaire" was spelled wrong. When I followed the link, it went to one of the spammier sites I've ever seen. The top half was all about wooden flooring, and there was a little tiny note at the bottom saying that robots.txt is important. Ian McAnerin (talk) 05:05, 21 November 2007 (UTC)
http://yro.slashdot.org/comments.pl?sid=377285&cid=21554125 gives the history of the robots.txt standard. However, I'm not sure if the information is particularly encyclopedic, and I'm betting a slashdot comment isn't a reliable, verifiable source. OTOH, the people monitoring this talk page might want to chase it down. Theorbtwo 23:54, 2 December 2007 (UTC)
There's no info on dynamic links. ceo 13:21, 7 December 2007 (UTC)
www.share_ali.com —Preceding unsigned comment added by 82.38.218.169 (talk) 07:48, 21 September 2008 (UTC)
The title is misleading. This is no standard but a protocol. hAl (talk) 15:30, 13 February 2009 (UTC)
The result of the move request was no consensus. @harej 00:16, 24 August 2009 (UTC)
Robots exclusion standard → robots.txt — Some googling suggests this term may be more common. --Cybercobra (talk) 18:29, 14 August 2009 (UTC)
Does ACAP belong here? It's no kind of extension for robots.txt - it's a totally different proposal (not even a standard) —Preceding unsigned comment added by 78.86.8.122 (talk) 10:30, 16 October 2009 (UTC)
The big 3 (Google, Yahoo, MSN/Live/Bing) now support wildcards (* and $) in robots.txt, e.g.:
disallow: /*.php$
matches any path ending in .php —Preceding unsigned comment added by Smremde (talk • contribs) 12:14, 25 October 2009 (UTC)
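As a rough illustration of the wildcard semantics mentioned above, the following Python sketch turns a rule path into a regular expression. It assumes that * matches any run of characters, that a trailing $ anchors the end of the path, and that rules otherwise match as prefixes; the function names are hypothetical, and real crawlers may differ in details.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt rule path into a compiled regex (sketch)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without "$" the rule is a prefix match, so no end anchor is added.
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True if the rule applies to the given URL path."""
    # re.match anchors at the start of the path.
    return rule_to_regex(pattern).match(path) is not None
```

Under these assumptions, rule_matches("/*.php$", "/index.php") is true, while a path with a query string such as "/index.php?x=1" is only caught by the unanchored form "/*.php".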
As per the comments in the external links code, can I propose adding a link to
Mtcooper (talk) 18:20, 22 February 2010 (UTC)
When I tried to check the url: http://www.kpmg.com.hk/robots.txt I got this message: "Sorry, this page is either no longer valid or currently under maintenance." What does it mean? The main URL does have a valid web page. Ottawahitech (talk) 17:29, 18 March 2010 (UTC)
Granted, the robots exclusion protocol has not historically capitalized "agent" in any of its specifications or examples. However, as it's an informal agreement, not a standard or a proposal (including an RFC), such is not controlling. RFC 2616 does capitalize "agent", and it has been accepted into the official HTTP standard. Therefore, as the use of "User-agent" in the robots context refers to the same header name and data as in the HTTP protocol ("User-Agent" at Section 14.43 et al.), shouldn't "agent" be capitalized in the robots context also? 71.106.210.230 (talk) 06:36, 13 July 2010 (UTC)
The exact mixed-case directives may be required, so be sure to capitalize Allow: and Disallow: , and remember the hyphen in User-agent:
Looking back at the original 1994 robots.txt definition, it says that field names are case insensitive, so the "A" may be capitalized or not.[1] Additionally, the field data (for User-Agent info) should also be interpreted by robots as case insensitive for matching purposes. The draft RFC in 1997 repeats the case insensitivity for field names, even though it shows a lowercase "a" for this field's name in the ABNF syntax.[2] 71.106.210.230 (talk) 06:32, 28 July 2010 (UTC)
Both appear to be acceptable in robots.txt, but "User-agent" seems more commonplace and outweighs the RFC anyway. I would go with that. --Hm2k (talk) 08:34, 28 July 2010 (UTC)
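To make the case-insensitivity point concrete, here is a small Python sketch of a parser that normalizes field names so that "User-agent", "User-Agent" and "USER-AGENT" are treated alike. The helper names are hypothetical, and the substring comparison used for the agent value is an assumption about how a robot might implement case-insensitive matching.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one robots.txt line into (lowercased field name, value)."""
    line = line.split("#", 1)[0].strip()  # drop comments and padding
    if ":" not in line:
        return None  # blank line, comment, or malformed line
    field, value = line.split(":", 1)
    return field.strip().lower(), value.strip()


def agent_matches(record_agent: str, robot_name: str) -> bool:
    """Case-insensitive match of a User-agent value against a robot's name."""
    if record_agent == "*":
        return True
    # Assumption: a case-insensitive substring match is good enough here.
    return record_agent.lower() in robot_name.lower()
```

With this normalization, the capitalization of "agent" in the file makes no difference to the result, which is consistent with either spelling being acceptable.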
Hi, when you try to access the Wayback Machine in the Beta version, you get this error message: robots.txt has blocked this content from being crawled. Is there a way it should be fixed in Night, when in Korea. 121.164.146.185 (talk) 16:31, 11 November 2010 (UTC)
From [ http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2011-08-15/Technology_report ]:
"A question raised at Wikimania – why the Chinese Wikipedia was getting so much more traffic than it used to – turned out to have a technical answer. The robots.txt file for the Chinese Wikipedia was written in both traditional and simplified Chinese, causing problems for bots from search engines and the like, a Chinese Wikimedian explained ( http://ultimategerardm.blogspot.com/2011/08/why-chinese-wikipedia-is-doing-so-well.html )."
I have no idea why someone would want to make a file that is not meant to be read by humans have multiple languages, but I don't see anything in the Robots exclusion standard that covers this. Does anyone have more information? Guy Macon (talk) 01:17, 17 August 2011 (UTC)
In section Examples the words robots and crawlers are intermixed. It may be confusing, but it may also be educational. I'm not sure if it needs fixing, and if so, what word to choose. David A se (talk) 17:21, 3 March 2012 (UTC)
Robot blocker can be cited [3]. --Trivanderumtequila (talk) 05:06, 26 November 2013 (UTC)
The article states that bingbot complies with robots.txt, but that's not always factual. Last year we blocked crawlers from downloading images to save on bandwidth. This worked for every bot except bingbot, which continued to crawl directories that have been explicitly disallowed by robots.txt. (There are no bingbot/msnbot sections overriding this).
User-agent: *
Disallow: /images/
Disallow: /image.php
Disallow: /imagesize.php
Disallow: /photogallery/
Recently rules were added to "Forbid" bot requests to these disallowed URLs, and bingbot is the only bot to trigger them (at a rate of about 2-5 requests per minute). They all seem to come from legitimate MS IPs, here is one line:
157.55.39.42 - - [28/Feb/2015:20:25:53 -0500] "GET /images/deals/BT_1_973646942.jpg HTTP/1.1" 403 266 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
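For anyone who wants to double-check the rules quoted above against the logged request, here is a minimal sketch using Python's standard urllib.robotparser. The rules and the request path are copied from this comment; the variable names are only for illustration, and this shows what a compliant parser concludes, not what bingbot actually does.

```python
from urllib.robotparser import RobotFileParser

# Rules as quoted in the comment above.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /image.php
Disallow: /imagesize.php
Disallow: /photogallery/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Path taken from the access-log line above.
logged_path = "/images/deals/BT_1_973646942.jpg"
logged_path_allowed = parser.can_fetch("bingbot", logged_path)

# A path outside the disallowed prefixes, for comparison (hypothetical).
other_path_allowed = parser.can_fetch("bingbot", "/index.html")
```

Here logged_path_allowed comes out False and other_path_allowed comes out True, so a robot applying the "*" record should not have requested the logged image.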
I looked it up and it seems I'm not the only one who's observed bingbot's disregard for robots.txt:
https://graphiclineweb.wordpress.com/2013/06/14/bing-banned/
https://www.techinasia.com/bing-denies-wrongdoing-sogou-privacy-leak-mess/
What would be the best way to improve the article's accuracy? Add examples of bingbot ignoring robots.txt at this point? "Some major search engines following this standard include Ask,[7] AOL,[8] Baidu,[9] Bing,[10] Google,[11] Yahoo!,[12] and Yandex.[13]" A new section maybe? 69.112.203.39 (talk) 02:02, 1 March 2015 (UTC)
We've received a response from our Product Group and we're happy to inform you that Bingbots respected the Robots.txt directives by not showing the content. The robots.txt disallow directive generally does not exclude a URL from index but blocks its content. Evidently, Bing's image crawlers do not show the blocked images of the site as configured in its robots.txt; this shows that Bingbot is following the directives.
We would like to inform you that a backend fix was done by our Product Group which should address your concern. Kindly verify this from your end.
Robots.txt instructs crawlers not to crawl a page, but that's not sufficient to keep content out of search results. To accomplish that (with Google, Bing and Yahoo), you must *not* use robots.txt and instead you must use the noindex HTML tag and the X-Robots-Tag HTTP headers. Since most people coming to this page are trying to understand how to block a page from appearing in search engines, I think a new section should be added that explains this in detail. Currently, this article addresses this at the very end by saying:
> Even if a robot honors robots.txt, it is still possible for the robot to find and index a disallowed URL from other places on the web. This can be prevented by using robots.txt directives in combination with robots meta tags or X-Robots-Tag headers.
That's just barely correct. If you use robots.txt in combination with robots meta or X-Robots-Tag headers, the result will be that your content is not crawled, the meta tags will not be seen, and the item will show up in search results.
I intend to update this article with this information because in my experience almost all web devs expect robots.txt to work, and are then shocked when it doesn't. For many search engines (including Bing, Yahoo! and Google, the three most important ones to English readers) it's a deprecated way of blocking content, and this article should reflect that.
Before I make these changes, I'm interested in any feedback people might have about this proposal. --mjlissner (talk) 18:04, 8 April 2015 (UTC)
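The reasoning in this proposal can be sketched in Python with the standard urllib.robotparser: a compliant crawler can only read a noindex instruction if robots.txt lets it fetch the page in the first place. The function name, the bot name "ExampleBot" and the paths are hypothetical, and real search engines may handle edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def crawler_sees_noindex(robots_txt: str, agent: str, path: str,
                         response_headers: dict) -> bool:
    """True only if a compliant crawler would fetch the page and find noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, path):
        # The page is never requested, so its headers are never read.
        return False
    tag = response_headers.get("X-Robots-Tag", "")
    return "noindex" in tag.lower()


headers = {"X-Robots-Tag": "noindex"}
blocked = "User-agent: *\nDisallow: /private/"
unblocked = "User-agent: *\nDisallow:"

seen_when_blocked = crawler_sees_noindex(
    blocked, "ExampleBot", "/private/page.html", headers)
seen_when_unblocked = crawler_sees_noindex(
    unblocked, "ExampleBot", "/private/page.html", headers)
```

In this sketch the noindex header is only seen in the second case, which is why combining a Disallow rule with noindex on the same URL defeats the purpose.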
Hello fellow Wikipedians,
I have just modified one external link on Robots exclusion standard. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at ((Sourcecheck))).
Cheers.—cyberbot IITalk to my owner:Online 13:52, 1 April 2016 (UTC)
The article omits documentation of the meta robots noimageindex instruction. Unclear if this is/was deliberate or not. Here are a few citations:
I think it deserves inclusion at least in a summary list, though commenting here first just to see if I missed some deliberate reasons for exclusion. — Tantek (talk) 21:46, 12 August 2019 (UTC)
Per Wikipedia:Administrators' noticeboard#Spam / vandalism magnets this page is now under indefinite pending changes protection. See Wikipedia:Pending changes and Wikipedia:Reviewing pending changes. If any of the regulars who watch this page don't have the PC approval right, you can apply at Wikipedia:Requests for permissions or just ask me to do it for you. (PC approval is routinely granted to experienced editors). --Guy Macon (talk) 17:12, 21 May 2020 (UTC)
I have an idea for some external links that could be added to Robots.txt#External links. There's Wikipedia's own robots.txt file and YouTube's robots.txt file (which contains an Easter egg). EthanGaming7640 (talk) 15:35, 17 November 2021 (UTC)
Robots exclusion standard → Robots.txt – Per WP:COMMONNAME. There was a move proposal before, but that discussion wasn't really grounded in policy and was also quite a long time ago. "Robots exclusion standard" doesn't even seem to be the real name; the robots.txt website only names it when linking to Wikipedia. The original document is titled "A Standard for Robot Exclusion" and the newest IETF RFC is titled "Robots Exclusion Protocol", which is also a different name. PhotographyEdits (talk) 14:21, 18 December 2022 (UTC)