this post was submitted on 13 Aug 2023
Technology
Very early on, at least, their spiders respected robots.txt.
I know there are folks who have all of the Big G in their robots.txt files on principle; might want to ask them whether it works or not.
I do, and I can confirm there are no requests (except for robots.txt and the odd /favicon.ico). Google sorta respects robots.txt. They do have a weird gotcha, though: they still put the URLs in search results, they just appear with a useless description. Their suggestion for avoiding that can be summarized as "don't block us, let us crawl, and just tell us not to use the result; just trust us!", when they could very easily change that behavior to make more sense. Not a single damn person with Google blocked in robots.txt wants to be indexed. Their logic on password protecting kind of makes sense, but my concern isn't security, it's that I don't like them (or Bing or Yandex).
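For anyone who wants to sanity-check their own file, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the user-agent tokens, and the example.org URL are my assumptions for illustration, not anyone's real config. It only shows what a well-behaved parser concludes; it says nothing about whether the URL still shows up in search results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block three named crawlers, allow everyone else.
# The tokens below are assumptions; check each vendor's docs for the real ones.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: Yandex
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent: str, url: str = "https://example.org/some/page") -> bool:
    """Return True if this robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot", "Yandex", "SomeOtherBot"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

The empty `Disallow:` under `*` means "nothing is disallowed", so crawlers that are not named explicitly stay allowed.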
Another gotcha I've seen linked is that their ad targeting bot for Google AdSense (different crawler) doesn't respect a * exclusion, but that kind of makes sense since it will only ever visit your site if you place AdSense ads on it.
And I suppose they'll train Bard on all data they scraped because of course. Probably no way to opt out of that without opting out of Google Search as well.
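If the AdSense crawler really skips the wildcard group, the usual workaround would be to name it explicitly. A rough sketch of the two variants follows, assuming the crawler identifies itself as `Mediapartners-Google` (that token is my assumption here, so verify it against Google's crawler list). Note that Python's generic parser applies `*` to every agent, so it reports "blocked" in both cases; the difference described above would be specific to how Google's own crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Variant 1: wildcard only. A generic parser applies this to every crawler.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

# Variant 2: also name the AdSense crawler explicitly.
# "Mediapartners-Google" is an assumed token, not taken from the thread.
NAMED_EXPLICITLY = """\
User-agent: Mediapartners-Google
Disallow: /

User-agent: *
Disallow: /
"""


def blocks(robots_txt: str, user_agent: str, url: str = "https://example.org/") -> bool:
    """Return True if a standard parser says `user_agent` may NOT fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, text in (("wildcard only", WILDCARD_ONLY), ("named", NAMED_EXPLICITLY)):
        print(name, "->", "blocked" if blocks(text, "Mediapartners-Google") else "allowed")
```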
Now that's a dirty trick.