Opened 2 years ago
Last modified 13 months ago
#6763 reopened enhancement
Update robots.txt (and rosetta variations)
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | | Priority: | low |
| Component: | General | Keywords: | seo performance has-patch |
| Cc: | | | |
Description (last modified by )
The robots.txt file could be tightened up in order to prevent unnecessary crawling, which could have significant SEO and performance/efficiency benefits.
Additionally, we apply rules inconsistently across different rosetta subdomains; we should probably standardize these!
I'm looking at wordpress.org/robots.txt as a starting point for standardization; I'd suggest:
- Remove the various wp-admin 'allow' rules (e.g., wordpress.org/robots.txt)
- Combine the remaining disallow rules
- Tweak the 'search' disallow rule to add a trailing slash
- Move sitemap references to the end
- Disallow `/plugins/wp-json/plugins/v1/locale-banner` (which is crawled by Google upwards of 40k times per day!)
- Disallow subfolder variations of some rules so that they catch subsites (e.g., `/*/wp-admin/`)
- Disallow 'non-pretty' variations (e.g., `?rest_route=`)
- Add some inline comments
That gets us to the following:
# Prevent crawling of WP internals
# --------------------------------
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Disallow: /?rest_route=
Disallow: /xmlrpc.php

# Prevent crawling of leaky theme endpoints
# --------------------------------
User-agent: *
Disallow: /plugins/wp-json/plugins/v1/locale-banner

# Prevent crawling of search URLs
# --------------------------------
User-agent: *
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Sitemaps
# --------------------------------
Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
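For anyone who wants to sanity-check the proposed rules, here is a minimal Python sketch. It assumes the commonly documented wildcard semantics (a rule matches as a prefix of the path plus query string, `*` matches any run of characters, a trailing `$` anchors the end). The helper names and the sample URLs are hypothetical, and the sketch ignores `Allow` rules and longest-match precedence, so treat it as a rough check rather than a faithful crawler implementation.

```python
import re

# Disallow patterns copied from the proposed robots.txt above.
DISALLOW_RULES = [
    "/wp-admin/",
    "/*/wp-admin/",
    "/?rest_route=",
    "/xmlrpc.php",
    "/plugins/wp-json/plugins/v1/locale-banner",
    "/search/",
    "/*/search/",
    "/?s=",
    "/*/?s=",
]


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored_end = rule.endswith("$")
    if anchored_end:
        rule = rule[:-1]
    # '*' matches any run of characters; everything else is taken literally.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored_end else ""))


def is_disallowed(path_and_query: str) -> bool:
    """True if any Disallow rule matches the start of the path + query string."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in DISALLOW_RULES)


# Hypothetical sample URLs (path + query only), chosen for illustration.
for sample in [
    "/wp-admin/",
    "/plugins/wp-admin/",
    "/?rest_route=/wp/v2/posts",
    "/plugins/search/foo/",
    "/?s=foo",
    "/themes/?s=foo",
    "/search-tips/",
    "/news/",
]:
    print(f"{sample!r:32} disallowed={is_disallowed(sample)}")
```

One consequence of prefix matching that this makes visible: with the trailing slash, `/search/` no longer catches an unrelated path such as the hypothetical `/search-tips/`, which a bare `/search` rule would have blocked.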
Change History (8)
This ticket was mentioned in PR #129 on WordPress/wordpress.org by @renyot.
2 years ago
#4
- Keywords has-patch added
This ticket was mentioned in PR #4207 on WordPress/wordpress-develop by @renyot.
2 years ago
#5
Trac ticket: https://meta.trac.wordpress.org/ticket/6763
This PR adds an inline comment to robots.txt.
Related: https://github.com/WordPress/wordpress.org/pull/129
## Before
Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml

User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner
## After
Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml

# Prevent crawling of WP internals
# --------------------------------
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner
#6
@
20 months ago
- Resolution set to invalid
- Status changed from new to closed
Thank you for your correspondence regarding this matter. We have observed link spamming activities taking advantage of WordPress' behavior with `?s=`, as mentioned in the `# Prevent crawling of search URLs` section.
Although attackers could be monitoring this discussion, we have noticed that in some instances, they add valid parameters before the "s" parameter. An example of this tactic is: https://example.com/?tq=obejm&s=spam_words
The current `Disallow` rule in the robots.txt file, as proposed in the pull request, does not cover this scenario. Therefore, we suggest adding the following rule to address this issue:
Disallow: /?*&s=
This modification ensures that search URLs with crafty parameter combinations, like the example provided, will also be prevented from being indexed by search engines.
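To illustrate the gap described above, here is a small Python sketch that compares the existing search rules with the suggested `/?*&s=` rule against the example URL. It assumes the usual robots.txt wildcard semantics (prefix match, `*` for any run of characters, optional trailing `$`); the `matches` helper and the variable names are hypothetical and not part of any WordPress code.

```python
import re


def matches(rule: str, path_and_query: str) -> bool:
    """True if a robots.txt path pattern matches the start of path + query."""
    anchored_end = rule.endswith("$")
    if anchored_end:
        rule = rule[:-1]
    # '*' matches any run of characters; everything else is taken literally.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(body + ("$" if anchored_end else ""), path_and_query) is not None


# Path + query of the example from the comment above.
spam_url = "/?tq=obejm&s=spam_words"
# A plain search URL, for comparison.
plain_search = "/?s=spam_words"

for rule in ("/?s=", "/*/?s=", "/?*&s="):
    print(
        f"{rule:10} spam_url={matches(rule, spam_url)} "
        f"plain_search={matches(rule, plain_search)}"
    )
```

Under these assumptions the suggested rule only covers URLs where `s` follows another parameter, so it would complement the existing `/?s=` rule rather than replace it.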
See https://meta.trac.wordpress.org/ticket/6763
## Question
1 - In ticket 6763,
seem to have been deleted, or maybe they were just omitted without being written down?
2 - Does it still hold that `wp-admin/load-*.php` should be upstreamed to Core? Or is it no longer required (https://github.com/WordPress/wordpress.org/pull/121#discussion_r1109380766)?
3 - I'm not sure where these directives come from. Are they from Jetpack? Since the ticket asks to add inline comments and to move the Sitemap references to the end of the file, I need to locate them first.
## Before
{{{
Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
User-agent: *
Disallow: /search
Disallow: /?s=
User-agent: *
Disallow: /plugins/search/
}}}
## After
## Before (Rosetta)
## After (Rosetta)