Making WordPress.org

Opened 4 months ago

Last modified 3 months ago

#4559 accepted enhancement

Dedicated robots.txt file for translate.wordpress.org

Reported by: jonoaldersonwp
Owned by: dd32
Milestone:
Priority: low
Component: Translate Site & Plugins
Keywords: seo needs-testing

Description (last modified by jonoaldersonwp)

translate.wordpress.org consumes huge amounts of crawl budget for relatively little return. We'd like to block crawling of it entirely via robots.txt.

At the moment, it shares a robots.txt with other wordpress.org domains, which makes this impossible.

Can we give it a dedicated robots.txt file, which is separate from other sites, with the following contents:

User-agent: *
Disallow: /*
Noindex: /*
Allow: /$
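
To illustrate how these rules would likely be evaluated, here is a minimal Python sketch. It assumes the precedence described in RFC 9309 and Google's documentation (the longest matching pattern wins, and Allow wins a tie); other crawlers may behave differently. The `Noindex` line is left out because it is not part of RFC 9309. The helper names `pattern_to_regex` and `is_allowed` are made up for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


# Rules from the proposed file (assumption: Noindex is ignored here).
RULES = [("disallow", "/*"), ("allow", "/$")]


def is_allowed(path: str, rules=RULES) -> bool:
    """Longest matching pattern wins; on a tie, Allow beats Disallow."""
    matches = [
        (len(pat), kind == "allow")
        for kind, pat in rules
        if pattern_to_regex(pat).match(path)
    ]
    if not matches:
        return True  # no rule applies, so crawling is permitted
    # Tuples sort by length first, then True (allow) sorts above False.
    return max(matches)[1]
```

Under those assumptions, `is_allowed("/")` is true because `Allow: /$` ties with `Disallow: /*` and Allow wins the tie, while any deeper path such as `/stats` matches only the Disallow rule.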

NB: We'll need to be absolutely certain that this is a standalone file, and doesn't bleed through to any other WP domains/contexts, or we'll cause the end of the world.

If/when this is complete, the ?filter rule can be removed from the shared/global robots.txt file.

Change History (9)

#1 @jonoaldersonwp
4 months ago

  • Description modified (diff)

#2 @ocean90
4 months ago

  • Priority changed from high to low
  • Type changed from defect to enhancement

There are probably a few pages which still should be indexed like /stats, /consistency or each /locale/$locale.

> We'd like to block crawling of it entirely

Just out of curiosity, who is "we"?

#3 @jonoaldersonwp
4 months ago

Happy to add a small number of whitelisted pages. In this case, "we" is me and Joost.
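
As a rough sketch of what adding whitelisted pages could look like, the Python snippet below builds a robots.txt body from a list of allowed path prefixes. The function name `build_robots_txt` and the example prefixes are assumptions taken from the pages mentioned in this thread, not the final rule set. A prefix such as `/locale/` also permits every URL beneath it, which may be broader than intended. The `Noindex` line from the description is omitted here.

```python
def build_robots_txt(allowed_prefixes):
    """Return a robots.txt body that blocks everything except the
    homepage and the given path prefixes."""
    lines = ["User-agent: *", "Disallow: /*", "Allow: /$"]
    # Each Allow pattern is longer than "/*", so it takes precedence
    # for crawlers that use longest-match rule selection.
    lines += [f"Allow: {prefix}" for prefix in allowed_prefixes]
    return "\n".join(lines) + "\n"


# Hypothetical allowlist based on the pages named in this thread.
print(build_robots_txt(["/stats", "/consistency", "/locale/"]))
```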

#4 @dd32
3 months ago

In 9070:

GlotPress: Add a handler to allow robots.txt to be served by WordPress on translate.wordpress.org.

See #4559.

#5 @dd32
3 months ago

In 9071:

Add a mu-plugin which contains the customised WordPress.org robots.txt content.

The intention is that we can switch from serving a static robots.txt file to serving a customised one from WordPress.

See #4559.

#6 @dd32
3 months ago

  • Keywords needs-testing added

After some discussion with @tellyworth, I think the best route forward here is to switch to using the WordPress-generated robots.txt and remove the static robots.txt that's currently in place.

That has some downsides, one being the currently unknown risk that a domain we haven't thought of relies on the static file. But assuming we can work around that, moving seems like the best option here.

The above commits combine the existing robots.txt file and the request here into something usable.
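
The idea of serving different robots.txt content per domain can be sketched as follows. This is a Python illustration only and not the code from the commits; `robots_for_host`, `TRANSLATE_RULES`, and `SHARED_RULES` are hypothetical names, and the shared rules are a placeholder. Matching on the exact hostname is one way to keep the restrictive rules from bleeding through to other domains.

```python
TRANSLATE_HOST = "translate.wordpress.org"

# Restrictive rules requested in this ticket (Noindex line omitted).
TRANSLATE_RULES = "User-agent: *\nDisallow: /*\nAllow: /$\n"

# Placeholder only: the real shared wordpress.org rules are not shown here.
SHARED_RULES = "User-agent: *\n# (placeholder for the shared rules)\n"


def robots_for_host(host: str) -> str:
    """Pick the robots.txt body by exact hostname match."""
    # Normalise case and drop any port, so that
    # 'Translate.WordPress.org:443' still matches.
    hostname = host.strip().lower().split(":", 1)[0]
    if hostname == TRANSLATE_HOST:
        return TRANSLATE_RULES
    # Every other domain keeps the shared rules.
    return SHARED_RULES
```

Because the comparison is an exact match rather than a substring or suffix check, a host such as `make.wordpress.org` or `translate.wordpress.org.example.com` falls through to the shared rules.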


The Robots file for translate.wordpress.org..

I couldn't find many other pages worth indexing on translate.wordpress.org. We really don't need to index every project in every locale, or all the strings, so having search engines request those URLs isn't doing us any good.

#7 @dd32
3 months ago

  • Owner set to dd32
  • Status changed from new to accepted

#8 follow-up: ↓ 9 @dd32
3 months ago

In 9072:

Robots.txt: Don't link to the Themes sitemap.xml, as there's a server-level redirect in place breaking it.

See #4559.

#9 in reply to: ↑ 8 @dd32
3 months ago

Replying to dd32:

For reference, see https://make.wordpress.org/systems/2019/07/24/remove-trailingslashit-rule-for-theme-directory-uris/ about removing the redirect that breaks it.
