大家好,又见面了,我是你们的朋友全栈君。
robots.txt
Here’s an exercise: open a new tab and type in the URL of your favorite website. Add /robots.txt
to the end and hit enter.
这是一个练习:打开一个新选项卡,然后输入您喜欢的网站的URL。 将/robots.txt
添加到末尾,然后按Enter。
There’s a good chance that the end result will be something like a list of lines starting with words like “User-agent” and maybe “Disallow.” These are probably followed by strings of text peppered with forward slashes and stars. Let’s talk about what it all means — and what it has to do with robots.
最终结果很有可能会是一些以“ User-agent”(可能是“ Disallow”)开头的行列表。 紧随其后的是带有斜线和星号的文本字符串。 让我们谈谈这一切的含义-以及它与机器人的关系。
运行互联网的隐形机器人 (The Invisible Robots Running the Internet)
From the very beginning of the Internet age, programmers have been building these little pieces of software known as “bots,” “web crawlers,” “spiders,” and any number of other names. Bots navigate the Internet from webpage to webpage by automatically following any links they find. They’re used for all kinds of purposes, most notably by search engines in a process called indexing. Indexing occurs when bots encounter a new webpage and add it to the search engine’s database. The search engine pulls webpages from this database when it returns relevant results for someone’s search.
从Internet时代的开始,程序员一直在构建这些小软件,称为“机器人”,“网络爬虫”,“蜘蛛”以及许多其他名称。 漫游器通过自动跟踪找到的任何链接来在网页之间浏览Internet。 它们用于各种目的,最著名的是在搜索引擎中称为indexing的过程中使用。 当漫游器遇到新网页并将其添加到搜索引擎的数据库中时,就会发生索引编制。 当搜索引擎返回与某人搜索相关的结果时,它将从该数据库中提取网页。
But web crawling can also be used for more nefarious purposes. Email harvesting is the process of using bots to find email addresses to target for scams or bulk emails. Hackers can also use bots to find security vulnerabilities or spread malware.
但是,网络爬网也可以用于更邪恶的目的。 电子邮件收集是使用漫游器查找电子邮件地址以定位骗局或大量电子邮件的过程。 黑客还可以使用漫游器来发现安全漏洞或传播恶意软件。
Even if a bot was used without malicious intent, it’s still possible that it could cause harm to the site — bots have been known to inadvertently crash web servers by making too many requests to the server in a small amount of time. This can obviously ruin the experience for everyone else trying to use the site. There are also parts of a website that its owners don’t want to make visible to search engines. For example, a banking website shouldn’t allow a user’s account balances showing up in Google’s search results.
即使在没有恶意的情况下使用了漫游器,它仍然有可能对站点造成损害-众所周知,漫游器会在短时间内向服务器发出过多请求,从而无意中破坏Web服务器。 这显然会破坏其他所有人尝试使用该网站的体验。 网站的某些部分也不希望其所有者对搜索引擎可见。 例如,银行网站不应允许用户的帐户余额显示在Google的搜索结果中。
Given the slew of circumstances surrounding when web crawling is and isn’t appropriate, it’s probably necessary for there to be some kind of etiquette or regulation on how these robots should behave when navigating the net.
考虑到网络爬取是否合适的各种情况,可能有必要对这些机器人在浏览网络时的行为进行某种礼节或规定。
机器人排除标准 (The Robots Exclusion Standard)
In 1994, Internet pioneer Martijn Koster accomplished just this by proposing a standard for instructing robots on how to navigate a website. This standard uses a text file called “robots.txt” to list the parts of a site that are and aren’t available for crawling. Let’s use the first few lines of Google’s robots.txt as an example:
1994年,互联网先驱Martijn Koster 提出了一个指导机器人如何导航网站的标准 ,从而实现了这一目标。 该标准使用一个名为“ robots.txt”的文本文件来列出网站上哪些部分可以进行爬取,哪些不可以进行爬网。 让我们以Google的robots.txt的前几行为例:
Lines beginning with User-agent
refer to the names of particular bots. If the line reads User-agent: *
as it does above, the exclusion standards apply to all bots crawling the site.
以User-agent
开头的行是指特定漫游器的名称。 如果该行显示为“ User-agent: *
如上述操作),则排除标准适用于所有抓取该网站的漫游器。
Lines labeled Disallow
indicate the parts of the site that are off-limits to the user agent. In the example above, bots are not allowed to navigate to https://www.google.com/search. If the line reads Disallow: /
, the entire website is off-limits to the user agent.
标有“ Disallow
行表示该站点的用户代理禁区。 在上面的示例中,不允许漫游器导航到https://www.google.com/search 。 如果该行显示Disallow: /
,则整个网站都超出了用户代理的权限。
Optionally, there are lines that begin with Allow
, indicating subsections of disallowed sections that bots have permission to navigate. Google allows bots to access https://www.google.com/search/about even though most other webpages in the “search” folder are off-limits. A few robots.txt files will also include a line providing a link to a sitemap, which models how the website is structured so that crawlers and/or humans can more easily navigate it.
(可选)有些行以Allow
开头,表示漫游器有权导航的不允许部分的子部分。 Google允许漫游器访问https://www.google.com/search/about,即使“搜索”文件夹中的大多数其他网页都是禁止访问的。 一些robots.txt文件还将包括提供指向站点地图的链接的一行,该行对网站的结构进行建模,以便爬虫程序和/或人类可以更轻松地浏览该网站。
When bots complying with the standard first navigate to a website, they try adding /robots.txt
to the URL just as we did earlier. If such a file exists, the bots will read the file and avoid disallowed portions of the website. If the file doesn’t exist, the entire site is considered fair game for crawling.
当遵循该标准的漫游器首先导航到网站时,他们像我们之前那样尝试将/robots.txt
添加到URL。 如果存在此类文件,则漫游器将读取该文件,并避免访问该网站的不允许部分。 如果文件不存在,则将整个站点视为爬网的公平游戏。
The robots exclusion standard has become the de facto standard followed by most legitimate bots. It helps websites exclude portions of their sites from search results, public viewing, and bot traffic. It also helps websites direct search engine bots to only the most relevant portions of the site, as some search engine bots can be constrained by a “crawl budget” limiting their processes. In these ways, the robots exclusion standard is undoubtedly an important contributor to the courtesy and efficiency of the technologies that define our Internet.
机器人排除标准已成为大多数合法机器人遵循的事实上的标准。 它可以帮助网站从搜索结果,公众查看和漫游器流量中排除网站的某些部分。 它还可以帮助网站将搜索引擎机器人仅定向到网站最相关的部分,因为某些搜索引擎机器人可能会受到“ 抓取预算 ”的限制,从而限制了其流程。 通过这些方式,机器人排除标准无疑是对定义我们的互联网的技术的礼貌和效率的重要贡献。
没有上锁的门 (Not a Locked Door)
Despite being a nifty and efficient tool for managing bot behavior, the robots exclusion standard isn’t perfect. The most important shortcoming of the standard is that bots don’t have to abide by anything robots.txt says; the standard isn’t legally binding and doesn’t contain technology to actually stop bots from doing whatever they want to. In fact, while the robots exclusion standard has been adopted by most major search engines, there are many other bots, both good and bad, that haven’t done so.
尽管它是一种用于管理机器人行为的灵巧而有效的工具,但机器人排斥标准并不完美。 该标准的最重要的缺点是机器人不必遵守robots.txt所说的任何东西 ; 该标准没有法律约束力,并且不包含实际上阻止机器人执行其所需操作的技术。 实际上,尽管大多数主要搜索引擎都采用了机器人排除标准,但还有许多其他机器人(无论好坏)都没有这样做。
For example, the Internet Archive, an organization that preserves webpages all over the Internet, stopped following the exclusion standard in 2017, as they felt that “robots.txt files that are geared toward search engine crawlers do not necessarily serve [the Internet Archive’s] archival purposes.” Robot exclusion standards are likewise ignored by bad actors — in fact, one scary implication of the standard is that many malicious bots use the disallowed listings on robots.txt to identify which parts of a website to target first.
例如,在Internet上保留网页的组织Internet Archive在2017年停止遵循排除标准 ,因为他们认为针对搜索引擎抓取工具的“ robots.txt文件不一定服务[Internet Archive’s存档目的。” 恶意行为者也同样会忽略机器人排除标准-实际上,该标准的一个可怕含义是,许多恶意机器人都使用robots.txt上不允许的清单来确定网站的哪个部分首先定位。
A main takeaway here is that despite their importance to the Internet, robots.txt files are not a replacement for proper security standards. As the official robot exclusion protocol website puts it, “think of [robots.txt] as a ‘No Entry’ sign, not a locked door.”
这里的主要要点是,尽管robots.txt文件对Internet十分重要,但它们并不能代替适当的安全标准 。 正如官方的机器人排除协议网站所说,“将[robots.txt]视为“禁止进入”标志,而不是上锁的门。”
翻译自: https://medium.com/swlh/robots-txt-a-peek-under-the-hood-of-the-internet-c38163b8f213
robots.txt
发布者:全栈程序员-用户IM,转载请注明出处:https://javaforall.cn/143864.html原文链接:https://javaforall.cn
【正版授权,激活自己账号】: Jetbrains全家桶Ide使用,1年售后保障,每天仅需1毛
【官方授权 正版激活】: 官方授权 正版激活 支持Jetbrains家族下所有IDE 使用个人JB账号...