Making WordPress.org

Opened 2 years ago

Closed 22 months ago

#4559 closed enhancement (fixed)

Dedicated robots.txt file for translate.wordpress.org

Reported by: jonoaldersonwp
Owned by: dd32
Milestone:
Priority: low
Component: Translate Site & Plugins
Keywords: seo needs-testing

Description (last modified by jonoaldersonwp)

Crawling of translate.wordpress.org consumes huge amounts of crawl budget for relatively little return. We'd like to block crawling of it entirely, via robots.txt.

At the moment, it shares a robots.txt with other wordpress.org domains, which makes this impossible.

Can we give it a dedicated robots.txt file, which is separate from other sites, with the following contents:

User-agent: *
Disallow: /*
Noindex: /*
Allow: /$

NB: We'll need to be absolutely certain that this is a standalone file, and doesn't bleed through to any other WP domains/contexts, or we'll cause the end of the world.

If/when this is complete, the ?filter rule can be removed from the shared/global robots.txt file.
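
As a rough illustration of what the requested rules would do, the following Python sketch evaluates paths against them. It assumes the wildcard semantics used by major crawlers and described in RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the path, the longest matching pattern wins, and Allow wins a tie). The `Noindex` line is left out because it is not part of that specification. The names `RULES`, `pattern_to_regex` and `is_allowed` are hypothetical and exist only for this example.

```python
import re

# The Disallow/Allow rules requested in the ticket description.
RULES = [
    ("disallow", "/*"),
    ("allow", "/$"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    # A trailing '$' anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # '*' matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Return True if a crawler following these rules may fetch `path`."""
    best = None  # (pattern length, is_allow) of the winning rule so far
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            # Longer patterns win; on equal length, Allow (True) wins.
            if best is None or candidate > best:
                best = candidate
    # A path that matches no rule is allowed by default.
    return True if best is None else best[1]
```

Under these assumptions, only the site root stays crawlable: `is_allowed("/")` is true, while `is_allowed("/stats")` and any project or locale path are false.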

Change History (13)

#1 @jonoaldersonwp
2 years ago

  • Description modified (diff)

#2 @ocean90
2 years ago

  • Priority changed from high to low
  • Type changed from defect to enhancement

There are probably a few pages which still should be indexed like /stats, /consistency or each /locale/$locale.

"We'd like to block crawling of it entirely"

Just out of curiosity, who is "we"?

#3 @jonoaldersonwp
2 years ago

Happy to add a small number of whitelisted pages. In this case, "we" is me and Joost.

#4 @dd32
2 years ago

In 9070:

GlotPress: Add a handler to allow robots.txt to be served by WordPress on translate.wordpress.org.

See #4559.

#5 @dd32
2 years ago

In 9071:

Add a mu-plugin which contains the customised WordPress.org robots.txt content.

The intention is that we can switch from serving a static robots.txt file to serving a customised one from WordPress.

See #4559.

#6 @dd32
2 years ago

  • Keywords needs-testing added

After some discussion with @tellyworth, I think the best route forward here is to switch to using the WordPress generated robots.txt and remove the static robots.txt that's currently in place.

That has some downsides, one being the currently unknown risk that a domain we're not thinking of relies on the static file. But assuming we can work around that, moving seems like the best option here.

The above commits turn the existing robots.txt file and the request here into something usable.
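
The isolation concern raised in the description (the dedicated rules must never reach another domain) comes down to an exact host match when robots.txt is generated per request. The Python sketch below shows only that idea; it is not the mu-plugin code from the commits above. `SHARED_RULES` is a placeholder, since the real shared file's contents are not shown in this ticket, and `robots_txt_for` is a hypothetical name.

```python
TRANSLATE_HOST = "translate.wordpress.org"

# Rules as requested in the ticket description.
TRANSLATE_RULES = "User-agent: *\nDisallow: /*\nNoindex: /*\nAllow: /$\n"

# Placeholder only: stands in for the shared WordPress.org rules.
SHARED_RULES = "User-agent: *\nDisallow:\n"


def robots_txt_for(host):
    """Pick the robots.txt body for the requested Host header value."""
    # Drop an optional port, a trailing dot, and case differences.
    normalized = host.strip().split(":", 1)[0].rstrip(".").lower()
    # Exact comparison, so similarly named hosts never get the blocking rules.
    if normalized == TRANSLATE_HOST:
        return TRANSLATE_RULES
    return SHARED_RULES
```

Matching on the full host name rather than a substring or suffix is the point of the sketch: `make.wordpress.org` and `wordpress.org` fall through to the shared rules.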


The Robots file for translate.wordpress.org..

I couldn't find many other pages worth indexing on translate.wordpress.org. We really don't need to index every project in every locale or all the strings, so having search engines request those URLs doesn't benefit us.

#7 @dd32
2 years ago

  • Owner set to dd32
  • Status changed from new to accepted

#8 follow-up: @dd32
2 years ago

In 9072:

Robots.txt: Don't link to the Themes sitemap.xml, as there's a server-level redirect in place breaking it.

See #4559.

#9 in reply to: ↑ 8 @dd32
2 years ago

Replying to dd32:

In 9072:

Robots.txt: Don't link to the Themes sitemap.xml, as there's a server-level redirect in place breaking it.

See #4559.

For reference, see https://make.wordpress.org/systems/2019/07/24/remove-trailingslashit-rule-for-theme-directory-uris/ for the request to remove the redirect that breaks it.

#10 @jonoaldersonwp
2 years ago

Any movement on this? :)

#11 @ocean90
23 months ago

#4186 was marked as a duplicate.

#12 @dd32
22 months ago

In 9350:

Make: Ensure that make.wordpress.org/robots.txt works without a statically defined file.

See #4559.

#13 @dd32
22 months ago

  • Resolution set to fixed
  • Status changed from accepted to closed