Robots.txt: A Peek Under the Hood of the Internet

Here’s an exercise: open a new tab and type in the URL of your favorite website. Add /robots.txt to the end and hit enter.

There’s a good chance that the end result will be something like a list of lines starting with words like “User-agent” and maybe “Disallow.” These are probably followed by strings of text peppered with forward slashes and stars. Let’s talk about what it all means — and what it has to do with robots.

The Invisible Robots Running the Internet

From the very beginning of the Internet age, programmers have been building these little pieces of software known as “bots,” “web crawlers,” “spiders,” and any number of other names. Bots navigate the Internet from webpage to webpage by automatically following any links they find. They’re used for all kinds of purposes, most notably by search engines in a process called indexing. Indexing occurs when bots encounter a new webpage and add it to the search engine’s database. The search engine pulls webpages from this database when it returns relevant results for someone’s search.

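To make the crawling loop concrete, here is a toy sketch in Python using only the standard library. It is deliberately naive, with no politeness delays and, for now, no robots.txt check; the function and page limit are illustrative, not how any real search engine works:

    # A toy crawler: pop a URL off the frontier, fetch it, collect its
    # links, and repeat. Real crawlers add politeness, deduplication by
    # canonical URL, and (as discussed below) a robots.txt check.
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":  # collect the target of every <a href="..."> tag
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        seen, frontier = set(), [start_url]
        while frontier and len(seen) < max_pages:
            url = frontier.pop()
            if url in seen or not url.startswith("http"):
                continue  # skip revisits and relative links, for simplicity
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except Exception:
                continue  # unreachable pages are simply skipped
            extractor = LinkExtractor()
            extractor.feed(html)
            frontier.extend(extractor.links)  # an indexer would store the page here
        return seen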

But web crawling can also be used for more nefarious purposes. Email harvesting is the process of using bots to find email addresses to target for scams or bulk emails. Hackers can also use bots to find security vulnerabilities or spread malware.

[Image: an artistic depiction of a bot with an eye. Caption: These little guys can be used for good and evil.]

Even if a bot is used without malicious intent, it can still harm a site: bots have been known to inadvertently crash web servers by making too many requests in a short span of time. This can obviously ruin the experience for everyone else trying to use the site. There are also parts of a website that its owners don't want visible to search engines. For example, a banking website shouldn't allow a user's account balances to show up in Google's search results.

Given the wide range of circumstances in which web crawling is and isn't appropriate, there clearly needs to be some kind of etiquette or regulation governing how these robots behave when navigating the net.

The Robots Exclusion Standard

In 1994, Internet pioneer Martijn Koster accomplished just this by proposing a standard for instructing robots on how to navigate a website. This standard uses a text file called “robots.txt” to list the parts of a site that are and aren’t available for crawling. Let’s use the first few lines of Google’s robots.txt as an example:

[Image: the first few lines of Google's robots.txt file]
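Reproduced as text, the opening lines look roughly like this (a sketch consistent with the rules discussed below; the live file changes over time):

    User-agent: *
    Disallow: /search
    Allow: /search/about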

Lines beginning with User-agent refer to the names of particular bots. If the line reads User-agent: * as it does above, the exclusion standards apply to all bots crawling the site.

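Rules can also be scoped to a single named crawler rather than all of them. For example (Googlebot is Google's own crawler; the path is a placeholder):

    User-agent: Googlebot
    Disallow: /private/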

Lines labeled Disallow indicate the parts of the site that are off-limits to the user agent. In the example above, bots are not allowed to navigate to https://www.google.com/search. If the line reads Disallow: /, the entire website is off-limits to the user agent.

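So a site that wanted to shut one particular crawler out of the entire site could write (BadBot is a hypothetical name):

    User-agent: BadBot
    Disallow: /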

Optionally, there are lines that begin with Allow, indicating subsections of disallowed sections that bots have permission to navigate. Google allows bots to access https://www.google.com/search/about even though most other webpages in the “search” folder are off-limits. A few robots.txt files will also include a line providing a link to a sitemap, which models how the website is structured so that crawlers and/or humans can more easily navigate it.

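A sitemap reference is just one more line in the file, for example (the URL is a placeholder):

    Sitemap: https://www.example.com/sitemap.xml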

When bots complying with the standard first navigate to a website, they try adding /robots.txt to the URL just as we did earlier. If such a file exists, the bots will read the file and avoid disallowed portions of the website. If the file doesn’t exist, the entire site is considered fair game for crawling.

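As an aside, Python's standard library ships a small parser for exactly this check. Here is a minimal sketch of what a compliant bot does, using the Google rules shown earlier:

    # Fetch and obey /robots.txt before crawling anything else.
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://www.google.com/robots.txt")
    parser.read()  # download and parse the file

    # "*" asks on behalf of a generic, unnamed user agent.
    print(parser.can_fetch("*", "https://www.google.com/search"))        # False: disallowed
    print(parser.can_fetch("*", "https://www.google.com/search/about"))  # True: explicitly allowed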

The robots exclusion standard has become the de facto standard followed by most legitimate bots. It helps websites exclude portions of their sites from search results, public viewing, and bot traffic. It also helps websites direct search engine bots to only the most relevant portions of the site, as some search engine bots can be constrained by a “crawl budget” limiting their processes. In these ways, the robots exclusion standard is undoubtedly an important contributor to the courtesy and efficiency of the technologies that define our Internet.

Not a Locked Door

Despite being a nifty and efficient tool for managing bot behavior, the robots exclusion standard isn't perfect. Its most important shortcoming is that bots don't have to abide by anything robots.txt says; the standard isn't legally binding, and it contains no mechanism to actually stop bots from doing whatever they want. In fact, while the robots exclusion standard has been adopted by most major search engines, many other bots, both good and bad, haven't followed suit.

For example, the Internet Archive, an organization that preserves webpages all over the Internet, stopped following the exclusion standard in 2017, as it felt that "robots.txt files that are geared toward search engine crawlers do not necessarily serve [the Internet Archive's] archival purposes." The standard is likewise ignored by bad actors. In fact, one scary implication of the standard is that many malicious bots use the disallowed listings in robots.txt to identify which parts of a website to target first.

A main takeaway here is that despite their importance to the Internet, robots.txt files are not a replacement for proper security standards. As the official robot exclusion protocol website puts it, “think of [robots.txt] as a ‘No Entry’ sign, not a locked door.”

Source: https://www.robotstxt.org/about.html

Translated from: https://medium.com/swlh/robots-txt-a-peek-under-the-hood-of-the-internet-c38163b8f213
