# Robots that do not return any value to the site operator, for example
# by referring readers, are excluded by statements in this file.
#
# Many of the crawlers on this list belong to companies that provide their
# customers with some information about the web, but this information is
# not available to the general public, and cannot be audited by the web
# site operator.
#
# In most cases, such information is not meant to create a productive
# relationship for the web site operator. For example, it's reporting on
# trademark use on web sites (and possible infringement), or trawling for
# content to compare to students' papers and establish plagiarism, etc.
#
# Web site operators pay for system resources and bandwidth. They lose money
# when their resources are used this way. The problem is so bad that most
# sites today get many more hits from robots than from actual human readers.
#
# In my opinion, the robots listed here should not be allowed on your site
# unless their operators arrange to pay for the content.
#
# If you don't believe that your robot belongs on the excluded list,
# please write to Bruce Perens and explain why.
#
# Hint: Robots that do not include a URL to a page explaining what they
# do in their browser ID string are considered impolite.

# Brandimensions.com: sells information about blog content to help
# corporations police their brands and see what's being written about them.
User-agent: BDFetch
Disallow: /

# They're developing it. They don't say what for.
# http://www.commoncrawl.org/faq.htm
User-agent: CCBot
Disallow: /

# MSRBOT: They say it's for research.
User-agent: MSRBOT
Disallow: /

# Moreover.com: Online monitoring service producing "actionable information"
# for corporations.
User-agent: Moreoverbot
Disallow: /

# Attributor.com: Plagiarism monitor.
User-agent: attributor
Disallow: /

# Turnitin.com: Plagiarism monitor, has issues regarding privacy of legal
# minors.
User-agent: TurnitinBot
Disallow: /

# Doesn't say what it is.
User-agent: Gigabot
Disallow: /

# DotNetDotCom http://www.dotnetdotcom.org/#info
# A few guys trying to build the best crawler. Fine, please tell us when
# it represents some value to the site operator.
User-agent: dotbot
Disallow: /

# These are parts of the site that legitimate robots should not access.
User-agent: *
Disallow: /blogs/
Disallow: /comments/
Disallow: /sections/
Disallow: /sessions/
Disallow: /no-cache/
Disallow: /users/
# Disallow things provided by the old technocrat.net code.
Disallow: /s/
Disallow: /application/
Disallow: /misc/
Disallow: /robots/
# Robots are still looking for the old slashcode here.
Disallow: /article.pl
Disallow: /comments.pl
Disallow: /index.pl
Disallow: /journal.pl
Disallow: /News/
Disallow: /search
Disallow: /cgi-bin/
Disallow: /Articles/