First ask whether it is so, then ask why. The answer is that their robots.txt files are all written incorrectly. Bing did follow The Robots Exclusion Protocol. And following it or not hardly matters in the first place: it is not an industry standard, and the protocol does not even rise to the level of a moral obligation.

First, how a robots.txt should be written. The following is quoted from The Web Robots Pages:
What to put in it

The "/robots.txt" file is a text file, with one or more records. Usually contains a single record looking like this:
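The sample record that the quoted page shows at this point (as I recall it from the same source) is:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
```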
Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The "*" in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
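To see how a parser that implements only the original protocol treats such wildcard lines, Python's `urllib.robotparser` can serve as a reference: it does plain prefix matching, exactly as the quoted text describes. (The example.com URLs below are placeholders, not taken from any of the sites discussed.)

```python
from urllib.robotparser import RobotFileParser

# A record that (incorrectly) tries to use a wildcard in a Disallow line.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp/*",
])

# A strict, original-protocol parser treats the rule as the literal
# prefix "/tmp/*", so ordinary pages under /tmp/ are NOT excluded.
print(rp.can_fetch("*", "http://example.com/tmp/secret.html"))  # True (allowed)
```

In other words, the wildcard does not fail loudly; it silently matches nothing useful, which is why such mistakes survive in production robots.txt files.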
What needs saying: the error in http://zhihu.com/robots.txt is that it tries to use wildcards in its Disallow fields; I hope they fix it. The robots.txt files under http://taobao.com and http://weibo.com are beyond saving; tear them down and start over.

(Correction: that last sentence contains a mistake of mine. I opened both files in the Edge browser, saw no line breaks, and wrongly concluded that both were malformed. The cause of my error: Linux line endings differ from Windows ones. http://taobao.com's robots.txt follows the Linux convention, using LF for line breaks. However, http://weibo.com's robots.txt does not separate its records with blank lines, and its line endings are mixed, using both LF and CRLF; I don't fully understand this, and corrections from experts are welcome.)

The basis for correction 2 is here: Performance, Implementation, and Design Notes
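On the line-ending point: a quick way to check which newline conventions a downloaded robots.txt actually mixes is to count them directly. This is a hypothetical helper of my own, not tied to any particular site:

```python
def line_ending_styles(data: bytes) -> set:
    """Report which newline conventions appear in a byte string."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # bare LF, not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # bare CR, not part of a CRLF pair
    styles = set()
    if crlf:
        styles.add("CRLF")
    if lf:
        styles.add("LF")
    if cr:
        styles.add("CR")
    return styles

# A Unix-style file reports only LF; a file with mixed endings reports both.
print(line_ending_styles(b"User-agent: *\nDisallow: /\n"))    # {'LF'}
print(line_ending_styles(b"User-agent: *\r\nDisallow: /\n"))  # CRLF and LF mixed
```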
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
"Disallow: /help" disallows both /help.html and /help/index.html, whereas "Disallow: /help/" would disallow /help/index.html but allow /help.html.
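The prefix behaviour the quote describes can be checked against Python's `urllib.robotparser`, which implements this same original-protocol matching (example.com is again just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# "Disallow: /help" excludes every path that starts with /help ...
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /help"])
print(rp.can_fetch("*", "http://example.com/help.html"))        # False
print(rp.can_fetch("*", "http://example.com/help/index.html"))  # False

# ... while "Disallow: /help/" excludes only the /help/ directory.
rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /help/"])
print(rp2.can_fetch("*", "http://example.com/help.html"))        # True
print(rp2.can_fetch("*", "http://example.com/help/index.html"))  # False
```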