robots.txt files
If you are against crappy, unethical AI bots scraping your website (as you should be), bear in mind that improper use of the robots.txt file can do more harm than good.
Do not disallow everything
A simple robots.txt file that disallows all web crawlers looks like this:
User-agent: *
Disallow: /
This is a bad idea. Web crawlers are not all unethical AI scraping tools; disallowing all crawlers also blocks search engine crawlers and other legitimate bots, meaning your site will not appear in any search engine results. Unless that is exactly what you want, this is not a good result.
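A more targeted alternative is to disallow only the specific AI crawlers you object to, identified by their published user-agent tokens, while leaving everyone else untouched. As a sketch (these tokens are accurate at the time of writing, but new bots appear and names change, so check each vendor's documentation):
# Block known AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else, including search engines, remains allowed
User-agent: *
Disallow: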
Opt-in vs opt-out for users
In most matters of data privacy and consent, the clear-cut moral position is that the user should explicitly opt in to things they want, rather than being "opted in by default" and having to hunt for an opt-out option.
This is not the case for robots.txt files. The robots.txt file is a voluntary directive. A bot visiting your website is not compelled to obey it in any way; unethical bots are often programmed to visit the exact areas of a site that the robots.txt told them not to. The logic behind this is fairly simple: in the vast majority of cases where an area of a website is disallowed, it's because valuable content or user data lives there that the website does not want scraped. Unethical bots therefore treat the disallow list as an indication of which areas are worth scraping.
Therefore, you need to balance "the privacy of users" against "the actual effectiveness of your robots.txt policy".
Tailoring your robots.txt according to your site content and user base
If a site consists primarily of users uploading their own work - which they therefore do not want scraped - then opting all such content out by default in the robots.txt makes sense.
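As a sketch, for a site where all user work lives under a couple of dedicated directories (the paths here are hypothetical; substitute wherever uploaded work actually lives on your site):
# All user-uploaded work is opted out of crawling by default
User-agent: *
Disallow: /gallery/
Disallow: /uploads/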
However, if only some of a site's users are of that nature, while others upload things that don't belong to them (for example, a blog talking about a pizza brand, a shop selling physical goods, and so forth), then an overly aggressive robots.txt is likely to backfire. You want bots to obey it, which means minimising unnecessary anti-scraping directives; the more of the site is blocked off, the more likely an unethical bot is to consider it a valuable target. In these cases, it is therefore reasonable not to disallow scraping of the whole site automatically, but to let individual users opt their own parts of the site into the disallow list, so that when they do, bots are more likely to actually honour it (as sketched below).
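A minimal sketch of what such a per-user opt-in might produce, assuming the site generates its robots.txt from user settings and serves user content under per-user paths (the usernames and path scheme here are hypothetical):
# Generated from user settings: only users who opted out of scraping appear here
User-agent: *
Disallow: /users/alice/
Disallow: /users/dana/
Because only a minority of paths end up disallowed, the file reads like routine housekeeping rather than a map of the site's most valuable content, which is exactly what makes it more likely to be honoured.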