Making WordPress.org

Opened 5 months ago

Closed 5 months ago

Last modified 5 months ago

#5740 closed task (fixed)

Add /?s= disallow rule to robots.txt

Reported by: jonoaldersonwp
Owned by: dd32
Milestone:
Priority: high
Component: General
Keywords: seo performance
Cc:

Description (last modified by jonoaldersonwp)

https://wordpress.org/robots.txt already disallows /search, but doesn't disallow /?s=*.

This omission creates an attack vector for negative SEO and spam attacks, and we're currently under heavy attack.

To prevent this, we should add the following rule to the robots.txt file:

Disallow: /?s=
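For context, the relevant section of the file would then read something like the sketch below (the live robots.txt contains additional rules beyond these):

```
User-agent: *
Disallow: /search
Disallow: /?s=
```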

Attachments (1)

image.png (321.2 KB) - added by jonoaldersonwp 5 months ago.


Change History (11)

#1 @jonoaldersonwp
5 months ago

  • Description modified (diff)

#2 @joyously
5 months ago

Would this apply to all WP sites? (as in, should this be in core?)

#3 @jonoaldersonwp
5 months ago

Potentially, but risky without tailoring and configuring on a site-by-site basis, depending on how their internal site search works.

This ticket was mentioned in Slack in #meta by tellyworth. View the logs.

5 months ago

#6 @dd32
5 months ago

/search/ is excluded for performance reasons, not for SEO spam.

/?s=* redirects to /search/, so I don't think there's any need to exclude it specifically. If I understand correctly, it seems there's no need for this ticket?

https://core.trac.wordpress.org/ticket/52457 as mentioned by @tellyworth sets a noindex tag on all search results on other sites, such as https://wordpress.org/news/?s=spam

#7 @jonoaldersonwp
5 months ago

https://core.trac.wordpress.org/ticket/52457 is related, but isn't a solution here, and isn't the same thing.

Why would I create a ticket with no reason?

Noindex tags prevent indexing of an already-crawled URL. Robots.txt directives prevent crawling. They're different systems, with different effects.

It's because our ?s URLs redirect to /search/ URLs that we have a problem - to the tune of ~500,000 spam URLs indexed in Google, damaging the WordPress brand, and consuming crawl budget which we desperately need elsewhere.
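The crawling-versus-indexing distinction can be sketched with Python's stdlib robots.txt parser. The file content below is a simplified stand-in for the real wordpress.org/robots.txt with the rule from this ticket applied; it shows that a compliant crawler would refuse to fetch search URLs at all, which a noindex tag (only seen after a fetch) cannot do:

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in for wordpress.org/robots.txt after this ticket's fix.
# A noindex meta tag, by contrast, is only discovered once the URL has
# already been crawled - it prevents indexing, not crawling.
robots_lines = [
    "User-agent: *",
    "Disallow: /search",
    "Disallow: /?s=",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# Search URLs are now off-limits to compliant crawlers...
print(rp.can_fetch("*", "https://wordpress.org/?s=spam"))      # False
print(rp.can_fetch("*", "https://wordpress.org/search/spam"))  # False
# ...while normal pages remain crawlable.
print(rp.can_fetch("*", "https://wordpress.org/plugins/"))     # True
```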

@jonoaldersonwp
5 months ago

#8 @carike
5 months ago

There have been very significant changes in SEO over the past decade or so.
While, yeah, some tactics are still really spammy, search engines *have* gotten better at detecting them - and they are now mostly only successful in the short term.
SEO is not a dirty word. When done right - and when taking a long-term view of things - it is really about user experience (including, but not limited to, performance, since page loading times are now a major factor for many search engines).

.org has changed a lot too. It is now in a mature phase, which means that the same strategies that worked even three years ago won't keep working.
We all want to give users the best experience that we possibly can - and to do that, we need to start taking a long term view on SEO.

Users need to be able to find answers - and potential users need to be able to see our marketing content - and developers need to be able to find information related to the development of WordPress.

The unfortunate reality is that a massive number of low-quality URLs is preventing users, current and future, and the budding developers who can help build WordPress in the future, from getting where they need to be.

And this ticket can help fix some of that.

#9 @dd32
5 months ago

  • Owner set to dd32
  • Resolution set to fixed
  • Status changed from new to closed

In 10989:

SEO: Add a Disallow: /?s= rule for robots.txt.

Props jonoaldersonwp.
Fixes #5740.

#10 @dd32
5 months ago

In future, please include an example rather than just "please do this".

It wouldn't be the first time a ticket has been created for behaviour that isn't actually happening, and when all the data available to us says "this isn't a problem", a ticket without an example should be taken with a grain of salt.
