Opened 5 years ago
Closed 5 years ago
#4559 closed enhancement (fixed)
Dedicated robots.txt file for translate.wordpress.org
Reported by: | jonoaldersonwp | Owned by: | dd32 |
---|---|---|---|
Milestone: | Priority: | low | |
Component: | Translate Site & Plugins | Keywords: | seo needs-testing |
Cc: |
Description
The translate.wordpress.org site consumes huge amounts of crawl budget for relatively little return. We'd like to block crawling of it entirely via robots.txt.
At the moment, it shares a robots.txt with other wordpress.org domains, which makes this impossible.
Can we give it a dedicated robots.txt file, which is separate from other sites, with the following contents:
    User-agent: *
    Disallow: /*
    Noindex: /*
    Allow: /$
NB: We'll need to be absolutely certain that this is a standalone file, and doesn't bleed through to any other WP domains/contexts, or we'll cause the end of the world.
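To make the intent of the proposed rules concrete, here is a rough sketch of how a crawler that follows Google-style matching might evaluate them. It assumes that `*` matches any run of characters, that a trailing `$` anchors the end of the path, that the longest matching rule wins, and that Allow wins a tie. The helper names `rule_to_regex` and `is_allowed` are hypothetical and not part of any WordPress.org code; the non-standard Noindex line is left out.

```python
import re

def rule_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path, rules):
    # Assumed precedence: the longest matching pattern wins, Allow wins a tie.
    best = None
    for kind, pattern in rules:
        if rule_to_regex(pattern).match(path):
            key = (len(pattern), kind == "allow")
            if best is None or key > best:
                best = key
    # With no matching rule at all, crawling is allowed by default.
    return True if best is None else best[1]

# The Allow/Disallow rules proposed in this ticket.
PROPOSED_RULES = [("disallow", "/*"), ("allow", "/$")]

for path in ["/", "/projects/wp/dev/de/default", "/stats"]:
    print(path, "allowed" if is_allowed(path, PROPOSED_RULES) else "blocked")
```

Under these assumptions only the homepage stays crawlable; every other path falls under `Disallow: /*`. Crawlers that do not implement wildcard matching may read the file differently.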
If/when this is complete, the ?filter rule can be removed from the shared/global robots.txt file.
Change History (13)
#3
@
5 years ago
Happy to add a small number of whitelisted pages. In this case, "we" is me and Joost.
#6
@
5 years ago
- Keywords needs-testing added
After some discussion with @tellyworth, I think the best route forward here is to switch to using the WordPress generated robots.txt and remove the static robots.txt that's currently in place.
That has some downsides, one being the (currently) unknown risk that a domain we're not thinking of is relying upon the static file. But assuming we can work around that, moving seems like the best option here.
The above commits combine the existing robots.txt file and the request here into something that is usable.
Currently:
- https://*/robots.txt is still using the static file
- The WordPress generated file can be accessed like so: https://wordpress.org/?robots=1 or https://de.wordpress.org/?robots=1
- For translate.wordpress.org, it can be found here: https://translate.wordpress.org/?robots=1&gp_route=robots.txt
The robots file for translate.wordpress.org:
- Drops the noindex: declaration, as Google will no longer support it as of September.
- Adds the Global Stats, Consistency Report, locale index, the Locale Indexes and Locale Plugin/Theme stats to the Allow ruleset.
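As a rough illustration of the host-scoped approach described above, the sketch below returns a dedicated ruleset for translate.wordpress.org only and passes the shared rules through unchanged for every other host. It is written in Python purely for illustration; the real change lives in the commits on this ticket. The function name `robots_for_host` and the paths in `ASSUMED_ALLOW_PATHS` are assumptions, not the actual Allow list.

```python
TRANSLATE_HOST = "translate.wordpress.org"

# Illustrative paths only; the real Allow list is defined in the commits on this ticket.
ASSUMED_ALLOW_PATHS = ["/$", "/stats", "/consistency"]

def robots_for_host(host, shared_body):
    """Return the robots.txt body to serve for one host."""
    if host.lower() != TRANSLATE_HOST:
        # Every other domain keeps the shared rules untouched.
        return shared_body
    lines = ["User-agent: *", "Disallow: /"]
    lines += ["Allow: " + path for path in ASSUMED_ALLOW_PATHS]
    return "\n".join(lines) + "\n"

print(robots_for_host("translate.wordpress.org", "User-agent: *\n"))
```

Keying the decision on an exact hostname match is what keeps the restrictive rules from bleeding through to other WordPress.org domains.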
I couldn't find many other pages worth indexing on translate.wordpress.org. We really don't need to index every project in every locale or all the strings, so having search engines request those URLs isn't doing us any good.
There are probably a few pages which still should be indexed like /stats, /consistency or each /locale/$locale.
Just out of curiosity, who is "we"?