Making WordPress.org

Opened 2 years ago

Last modified 13 months ago

#6763 reopened enhancement

Update robots.txt (and rosetta variations)

Reported by: jonoaldersonwp
Owned by:
Milestone:
Priority: low
Component: General
Keywords: seo performance has-patch
Cc:

Description (last modified by jonoaldersonwp)

The robots.txt file could be tightened up in order to prevent unnecessary crawling, which could have significant SEO and performance/efficiency benefits.

Additionally, we apply rules inconsistently across different rosetta subdomains; we should probably standardize these!

I'm looking at wordpress.org/robots.txt as a starting point for standardization; I'd suggest:

  • Remove the various wp-admin 'allow' rules (e.g., wordpress.org/robots.txt)
  • Combine the remaining disallow rules
  • Tweak the 'search' disallow rule to add a trailing slash
  • Move sitemap references to the end
  • Disallow /plugins/wp-json/plugins/v1/locale-banner (which is crawled by Google upwards of 40k times per day!)
  • Disallow subfolder variations of some rules so that they catch subsites (e.g., /*/wp-admin/)
  • Disallow 'non-pretty' variations (e.g., ?rest_route=)
  • Add some inline comments

That gets us to the following:

# Prevent crawling of WP internals
# --------------------------------
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Disallow: /?rest_route=
Disallow: /xmlrpc.php

# Prevent crawling of leaky theme endpoints
# --------------------------------
User-agent: *
Disallow: /plugins/wp-json/plugins/v1/locale-banner

# Prevent crawling of search URLs
# --------------------------------
User-agent: *
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Sitemaps
# --------------------------------
Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
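
A minimal Python sketch for sanity-checking which URLs the proposed Disallow rules would catch. It assumes the pattern semantics described in RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the URL) and models only Disallow lines. The names `rule_to_regex` and `is_blocked` and the sample paths are illustrative, not part of any WordPress.org tooling.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    Assumed semantics: '*' matches any run of characters and a
    trailing '$' anchors the end of the URL; everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

# Disallow values copied from the proposal above.
PROPOSED_DISALLOWS = [
    "/wp-admin/",
    "/*/wp-admin/",
    "/?rest_route=",
    "/xmlrpc.php",
    "/plugins/wp-json/plugins/v1/locale-banner",
    "/search/",
    "/*/search/",
    "/?s=",
    "/*/?s=",
]

def is_blocked(path: str) -> bool:
    """Return True if any proposed Disallow pattern matches the path."""
    return any(rule_to_regex(rule).match(path) for rule in PROPOSED_DISALLOWS)

# Hypothetical sample paths, chosen to exercise each group of rules.
for sample in ["/wp-admin/", "/plugins/wp-admin/edit.php",
               "/plugins/?s=foo", "/search", "/plugins/"]:
    print(sample, "blocked" if is_blocked(sample) else "crawlable")
```

Under these assumptions, adding the trailing slash to the search rule means a bare `/search` is no longer matched, while `/search/` and subsite variants such as `/plugins/search/` are.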

Change History (8)

#1 @jonoaldersonwp
2 years ago

  • Description modified (diff)

#2 @jonoaldersonwp
2 years ago

  • Description modified (diff)

#3 @jonoaldersonwp
2 years ago

  • Description modified (diff)

This ticket was mentioned in PR #129 on WordPress/wordpress.org by @renyot.


2 years ago
#4

  • Keywords has-patch added

See https://meta.trac.wordpress.org/ticket/6763

## Question
1 - In ticket 6763,

Disallow: /plugins/search/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php

seem to have been deleted. Was that intentional, or were they simply left out of the proposal?

2 - Does it still hold that wp-admin/load-*.php should be upstreamed to Core?
Or is that no longer required (https://github.com/WordPress/wordpress.org/pull/121#discussion_r1109380766)?

3 - I'm not sure where these directives come from. Are they from Jetpack? Since the ticket asks for inline comments and for the Sitemap references to move to the end of the file, I need to locate them.

Allow: /wp-admin/admin-ajax.php
Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml

## Before
{{{Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php

User-agent: *
Disallow: /search
Disallow: /?s=

User-agent: *
Disallow: /plugins/search/
}}}

## After

Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner
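
To illustrate what the retained Allow lines do in this version, here is a hedged Python sketch of longest-match precedence as described in RFC 9309: the matching rule with the longest pattern wins, and Allow wins a tie. `pattern_matches` and `is_allowed` are hypothetical helper names, and the rule lists are just the wp-admin lines from the listing above, with and without the Allow rules.

```python
import re

def pattern_matches(pattern: str, path: str) -> bool:
    """Assumed semantics: '*' matches any run of characters, the rest is a literal prefix."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

def is_allowed(path: str, rules: list) -> bool:
    """Longest matching pattern wins; Allow wins ties; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if not pattern_matches(pattern, path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed

WITHOUT_ALLOW = [
    ("Disallow", "/wp-admin/"),
    ("Disallow", "/*/wp-admin/"),
]
WITH_ALLOW = WITHOUT_ALLOW + [
    ("Allow", "/wp-admin/admin-ajax.php"),
    ("Allow", "/wp-admin/load-scripts.php"),
    ("Allow", "/wp-admin/load-styles.php"),
]

for path in ["/wp-admin/admin-ajax.php", "/wp-admin/options.php",
             "/plugins/wp-admin/admin-ajax.php"]:
    print(path, is_allowed(path, WITH_ALLOW), is_allowed(path, WITHOUT_ALLOW))
```

Under these assumptions the Allow lines only re-open the root-level `/wp-admin/` files. A subsite path such as `/plugins/wp-admin/admin-ajax.php` is still caught by `Disallow: /*/wp-admin/`, because no Allow pattern matches it.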

## Before (Rosetta)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php

User-agent: *
Disallow: /plugins/search/

## After (Rosetta)

User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner
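
One way to confirm what the Rosetta change adds or removes is to diff the two rule sets. The Python sketch below is an illustration only: `parse_rules` is a hypothetical helper, the two strings are copied from the Before/After (Rosetta) listings above, and grouping by User-agent is deliberately ignored.

```python
def parse_rules(text: str) -> set:
    """Collect (directive, value) pairs for Allow/Disallow lines, ignoring comments."""
    rules = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field = field.strip().lower()
        if field in ("allow", "disallow"):
            rules.add((field, value.strip()))
    return rules

BEFORE_ROSETTA = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php

User-agent: *
Disallow: /plugins/search/
"""

AFTER_ROSETTA = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner
"""

before = parse_rules(BEFORE_ROSETTA)
after = parse_rules(AFTER_ROSETTA)
added = after - before
removed = before - after

for directive, value in sorted(added):
    print("+", directive, value)
for directive, value in sorted(removed):
    print("-", directive, value)
```

The same comparison could be run against the robots.txt of each Rosetta subdomain to spot the inconsistencies the ticket describes.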

This ticket was mentioned in PR #4207 on WordPress/wordpress-develop by @renyot.


2 years ago
#5

Trac ticket: https://meta.trac.wordpress.org/ticket/6763

This PR adds an inline comment to robots.txt.
Related: https://github.com/WordPress/wordpress.org/pull/129

## Before

Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner

## After

Sitemap: https://wordpress.org/sitemap.xml
Sitemap: https://wordpress.org/news-sitemap.xml
Sitemap: https://wordpress.org/themes/sitemap.xml
Sitemap: https://wordpress.org/plugins/sitemap.xml
Sitemap: https://wordpress.org/news/sitemap.xml
Sitemap: https://wordpress.org/showcase/sitemap.xml
Sitemap: https://wordpress.org/documentation/sitemap.xml
# Prevent crawling of WP internals
# --------------------------------
User-agent: *
Disallow: /wp-admin/
Disallow: /*/wp-admin/
Allow: /wp-admin/admin-ajax.php
Allow: /wp-admin/load-scripts.php
Allow: /wp-admin/load-styles.php
Disallow: /?rest_route=
Disallow: /xmlrpc.php
Disallow: /plugins/search/

# Prevent crawling of search URLs
# --------------------------------
Disallow: /search/
Disallow: /*/search/
Disallow: /?s=
Disallow: /*/?s=

# Prevent crawling of leaky theme endpoints
# --------------------------------
Disallow: /plugins/wp-json/plugins/v1/locale-banner

#6 @ogumemura
20 months ago

  • Resolution set to invalid
  • Status changed from new to closed

Thank you for raising this. We have observed link spam that takes advantage of WordPress's handling of ?s=, which the "# Prevent crawling of search URLs" section is meant to address.

Attackers could be monitoring this discussion, but it is worth noting that in some instances they add valid parameters before the "s" parameter. An example of this tactic is: https://example.com/?tq=obejm&s=spam_words

The current Disallow rule in the robots.txt file, as proposed in the pull request, does not cover this scenario. Therefore, we suggest adding the following rule to address this issue:

Disallow: /?*&s=

This modification ensures that search URLs with extra leading parameters, like the example above, are also blocked from being crawled by search engines.
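
The gap can be checked with a small Python sketch. It assumes the wildcard semantics described in RFC 9309 (`*` matches any run of characters), and `pattern_matches` and `robots_target` are hypothetical helper names used only for this illustration.

```python
import re
from urllib.parse import urlsplit

def pattern_matches(pattern: str, target: str) -> bool:
    """Assumed semantics: '*' matches any run of characters, the rest is a literal prefix."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, target) is not None

def robots_target(url: str) -> str:
    """Return the path plus query string, which is what the rules are matched against."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")

spam = robots_target("https://example.com/?tq=obejm&s=spam_words")
plain = robots_target("https://example.com/?s=spam_words")

print(pattern_matches("/?s=", spam))     # existing rule against the spam URL
print(pattern_matches("/?*&s=", spam))   # suggested rule against the spam URL
print(pattern_matches("/?*&s=", plain))  # suggested rule against a plain search URL
```

Under these assumptions the existing `/?s=` rule misses the example URL and the suggested `/?*&s=` rule catches it. The suggested rule does not match a plain `/?s=` search, so both rules would be needed side by side.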

#7 @ogumemura
20 months ago

  • Resolution invalid deleted
  • Status changed from closed to reopened

My sincere apologies. I may have unintentionally changed the status.

  • Resolution set to invalid
  • Status changed from new to closed

#8 @jonoaldersonwp
20 months ago

Agreed, that's a smart tweak.
