Opened 4 years ago

Last modified 3 years ago

#5105 new defect (bug)

Remove bot blocking (403 responses) on * sites.

Reported by: jonoaldersonwp
Owned by:
Milestone:
Priority: high
Component: Trac
Keywords: seo


We have systems in place which actively prevent Google (and other agents?) from accessing * sites/URLs. We return a 403 response (and a raw NGINX template) in these scenarios.

This 'solution' prevents these agents from seeing/accessing the robots.txt file on those respective sites, and thus results in them continuing to attempt to crawl/index them (especially as these URLs are heavily linked to throughout the ecosystem).

I propose that we remove the 403 behaviour, and rely on the robots.txt file to do its job.
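To illustrate, a minimal nginx sketch of what "remove the 403 behaviour" could look like for robots.txt specifically. This assumes the blocking/rate-limiting happens in nginx (the ticket mentions a raw NGINX template); the paths and the idea of carving out an exact-match location are illustrative assumptions, not the actual production config:

```nginx
# Hypothetical sketch: an exact-match location takes precedence over
# prefix/regex locations, so robots.txt stays reachable even for agents
# that would otherwise hit the bot-blocking / rate-limiting rules.
location = /robots.txt {
    # Deliberately no limit_req / deny rules here: crawlers must always
    # be able to read the crawl directives, or they cannot obey them.
    root /var/www/html;  # path is an assumption
}
```

This narrower carve-out would let crawlers see the rules even if the broader blocking were kept for other URLs, though the proposal here is to drop the 403 behaviour entirely.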

If we believe that it's necessary to restrict crawling behaviour for performance reasons, then we can consider tailoring the robots.txt rule(s) to be more restrictive, and/or implementing performance improvements throughout the site(s) (many of which are available and achievable, both front-end and back-end).
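As a sketch of the "more restrictive robots.txt" option, something like the following could throttle crawling of expensive endpoints without any 403s. The paths and delay value are hypothetical examples, not a recommendation for specific URLs:

```
# Hypothetical robots.txt sketch: restrict crawling of expensive
# endpoints instead of blocking agents outright.
User-agent: *
Disallow: /search      # example: dynamic search results
Disallow: /report/     # example: expensive report pages
Crawl-delay: 10        # honoured by some crawlers; Google ignores it
```

Note that Googlebot does not respect `Crawl-delay` (its crawl rate is managed via Search Console instead), so the `Disallow` rules would do most of the work for Google specifically.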

Change History (4)

#1 @jonoaldersonwp
4 years ago

NB: It looks like this might be tied to some rate-limiting logic. That doesn't change anything, though; this should still be removed.

#2 @jonoaldersonwp
4 years ago

  • Priority changed from normal to high

@afercia rightly points out that the current behaviour is likely to negatively impact the ability of contributors to contribute, as they rely on Google (either through internal or external site search) to find tickets/issues and related files. Upgrading the severity accordingly.

#3 @jonoaldersonwp
3 years ago

This has been stale for a year; how can we escalate addressing this?

This ticket was mentioned in Slack in #meta by jonoaldersonwp.

3 years ago

Note: See TracTickets for help on using tickets.