Making WordPress.org

Opened 11 months ago

Closed 5 months ago

#4559 closed enhancement (fixed)

Dedicated robots.txt file for translate.wordpress.org

Reported by: jonoaldersonwp
Owned by: dd32
Milestone:
Priority: low
Component: Translate Site & Plugins
Keywords: seo needs-testing

Description (last modified by jonoaldersonwp)

translate.wordpress.org consumes huge amounts of crawl budget for relatively little return. We'd like to block crawling of it entirely, via robots.txt.

At the moment, it shares a robots.txt with other wordpress.org domains, which makes this impossible.

Can we give it a dedicated robots.txt file, which is separate from other sites, with the following contents:

User-agent: *
Disallow: /*
Noindex: /*
Allow: /$
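As a rough illustration of how these rules interact, here is a minimal Python sketch of wildcard matching with longest-match precedence, where `*` matches any run of characters, a trailing `$` anchors the end of the path, and ties go to `Allow`. This is an assumption about crawler behaviour (it mirrors what the major search engines document), not something this ticket specifies. The `Noindex` line is left out because it does not take part in crawl matching. The helper names `pattern_to_regex` and `is_allowed` are hypothetical.

```python
from __future__ import annotations

import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest matching pattern wins; ties go to Allow; no match means allowed."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# The Disallow/Allow pair proposed in the description.
RULES = [("disallow", "/*"), ("allow", "/$")]

for path in ["/", "/stats", "/projects/wp/dev"]:
    print(path, "allowed" if is_allowed(path, RULES) else "blocked")
```

Under those assumptions only the homepage stays crawlable; every other path matches `Disallow: /*` and nothing else.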

NB: We'll need to be absolutely certain that this is a standalone file, and doesn't bleed through to any other WP domains/contexts, or we'll cause the end of the world.

If/when this is complete, the ?filter rule can be removed from the shared/global robots.txt file.

Change History (13)

#1 @jonoaldersonwp
11 months ago

  • Description modified (diff)

#2 @ocean90
11 months ago

  • Priority changed from high to low
  • Type changed from defect to enhancement

There are probably a few pages which should still be indexed, such as /stats, /consistency or each /locale/$locale.

We'd like to block crawling of it entirely

Just out of curiosity, who is "we"?

#3 @jonoaldersonwp
11 months ago

Happy to add a small number of whitelisted pages. In this case, "we" is me and Joost.
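A small allowlist could be layered on top of the blanket block. The Python sketch below only shows the shape of such a file. The function name `build_robots_txt` and the exact patterns are hypothetical, taken loosely from the pages named in comment 2. They are plain prefix matches, so `/locale/` would re-open everything beneath it and would need tightening before real use.

```python
def build_robots_txt(allowed_patterns):
    """Block everything for all crawlers, then re-allow the listed patterns."""
    lines = ["User-agent: *", "Disallow: /*", "Allow: /$"]
    lines += [f"Allow: {pattern}" for pattern in allowed_patterns]
    return "\n".join(lines) + "\n"


# Hypothetical allowlist; the patterns actually deployed are not shown in this ticket.
ALLOWED = ["/stats", "/consistency", "/locale/"]

print(build_robots_txt(ALLOWED))
```

For crawlers that use longest-match precedence, each `Allow` pattern here is longer than `/*`, so it takes priority over the blanket `Disallow`.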

#4 @dd32
10 months ago

In 9070:

GlotPress: Add a handler to allow robots.txt to be served by WordPress on translate.wordpress.org.

See #4559.

#5 @dd32
10 months ago

In 9071:

Add a mu-plugin which contains the customised WordPress.org robots.txt content.

The intention is that we can switch from serving a static robots.txt file to serving a customised one from WordPress.

See #4559.

#6 @dd32
10 months ago

  • Keywords needs-testing added

After some discussion with @tellyworth, I think the best route forward here is to switch to using the WordPress generated robots.txt and remove the static robots.txt that's currently in place.

That has some downsides, one of them being the (currently) unknown risk that a domain we're not thinking of relies on the static file. But assuming we can work around that, moving seems like the best option here.

The above commits turn the existing robots.txt file, plus the request here, into something usable.
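To illustrate the idea of generating robots.txt per site instead of serving one static file, here is a minimal Python sketch. It is not the mu-plugin from the commits above. The names `PER_HOST`, `SHARED_DEFAULT` and `robots_txt_for` are hypothetical, and the shared default body is a placeholder, since the real shared rules are not shown in this ticket. Hosts are matched exactly, so the restrictive rules cannot leak to a domain nobody thought of.

```python
# Placeholder for the shared rules; the real contents are not shown in this ticket.
SHARED_DEFAULT = "User-agent: *\nDisallow:\n"

# Restrictive rules requested for translate.wordpress.org (Noindex line omitted).
TRANSLATE_RULES = "User-agent: *\nDisallow: /*\nAllow: /$\n"

# Only hosts listed here get a dedicated file.
PER_HOST = {"translate.wordpress.org": TRANSLATE_RULES}


def robots_txt_for(host: str) -> str:
    """Return the robots.txt body for a request's Host header.

    Exact host match only; anything unlisted falls back to the shared default.
    """
    normalized = host.strip().lower().split(":")[0]  # drop any port
    return PER_HOST.get(normalized, SHARED_DEFAULT)


print(robots_txt_for("translate.wordpress.org"))
print(robots_txt_for("make.wordpress.org"))
```

Falling back to the shared default for unknown hosts is a deliberate choice in this sketch: a missed domain keeps its existing behaviour and is not blocked by accident.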


The robots file for translate.wordpress.org.

I couldn't find many other pages worth indexing on translate.wordpress.org. We really don't need to index every project in every locale or all the strings, so having search engines request those URLs doesn't benefit us.

#7 @dd32
10 months ago

  • Owner set to dd32
  • Status changed from new to accepted

#8 follow-up: @dd32
10 months ago

In 9072:

Robots.txt: Don't link to the Themes sitemap.xml, as there's a server-level redirect in place breaking it.

See #4559.

#9 in reply to: ↑ 8 @dd32
10 months ago

Replying to dd32:

In 9072:

Robots.txt: Don't link to the Themes sitemap.xml, as there's a server-level redirect in place breaking it.

See #4559.

For reference, the request to remove the redirect that breaks it: https://make.wordpress.org/systems/2019/07/24/remove-trailingslashit-rule-for-theme-directory-uris/

#10 @jonoaldersonwp
7 months ago

Any movement on this? :)

#11 @ocean90
6 months ago

#4186 was marked as a duplicate.

#12 @dd32
5 months ago

In 9350:

Make: Ensure that make.wordpress.org/robots.txt works without a statically defined file.

See #4559.

#13 @dd32
5 months ago

  • Resolution set to fixed
  • Status changed from accepted to closed