...not only does #SemRush show complete disregard for robots.txt, it also ignores session management cookies and the like. As a result, every single file it crawled created a new anonymous session in Kallithea, each one adding a file to my session store until the filesystem ran out of inodes and it crashed.
For the first time, I finally had to put aggressive bot/crawler blocking on my nginx reverse proxy. I think being on the fediverse drew attention to myself lol...
...anyways a little PSA:
Instance admins: even if your site is small, be sure to aggressively block web crawlers. More and more of them ignore robots.txt etiquette, and in light of recent archiving incidents you want to have some control over the distribution of users' public posts.
Fediversians: REALLY BE CAREFUL about what you publicly post. Delete works only on reliable fediverse servers. Evil bots don't respect post deletes when hoarding data.
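For admins wondering what "block web crawlers" looks like in practice, here is a minimal sketch of the usual approach: match the User-Agent header against a blocklist and refuse the request. This is not the nginx config from my proxy, just the same idea in Python. The names `BLOCKED_AGENT_PATTERNS`, `is_blocked_agent` and `status_for` are made up for illustration, and the list only covers crawlers named in this thread. It also only catches bots that identify themselves honestly.

```python
import re

# Hypothetical blocklist (an assumption for this sketch): substrings that
# appear in the User-Agent strings of crawlers you want to refuse.
# Extend it with whatever turns up in your own access logs.
BLOCKED_AGENT_PATTERNS = ("semrushbot", "baiduspider")

# One case-insensitive regex that matches any of the substrings above.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(pattern) for pattern in BLOCKED_AGENT_PATTERNS),
    re.IGNORECASE,
)


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header matches a blocked crawler."""
    if not user_agent:
        # A missing or empty header is not blocked in this sketch.
        return False
    return _BLOCKED_RE.search(user_agent) is not None


def status_for(user_agent):
    """Return the HTTP status to send: 403 for blocked crawlers, else 200."""
    return 403 if is_blocked_agent(user_agent) else 200
```

In a real setup you would do this check at the reverse proxy, before the request ever reaches the app and gets a chance to create a session.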
@msh I have a rant that's about 100 miles long and laden with descriptive expletives regarding SemRushBot and what I'd like to do to it. And that's exactly what it did to my server as well!
I have blocked at the .htaccess level everything that even smells like SemRushBot, Baidu and other misbehaving web crawlers...😡 🤬
Micro-blogging site operated by Mark Shane Hayden of Coalesco Digital Systems Inc. We are located in Alberta, Canada. This is NOT intended to be a commercial/promotional site! Registration is open to anyone interested in civil discussions on any interesting topic--especially technology, current events and politics.