Making WordPress.org

Opened 5 years ago

Closed 5 years ago

Last modified 2 months ago

#5184 closed defect (bug) (reported-upstream)

Homepage requests with a 'page' parameter should return a 404

Reported by: jonoaldersonwp
Owned by:
Milestone:
Priority: lowest
Component: General
Keywords: seo
Cc:

Description (last modified by dd32)

Requests like https://wordpress.org/page/3/ should return a 404 template and HTTP header.

Requests to paginated states of /download/, like https://wordpress.org/download/6/, should return a 404 template and HTTP header.

Requests to paginated states of pages in (and including) the 'about' section, such as https://en-gb.wordpress.org/about/features/5/, https://en-gb.wordpress.org/about/5/ and https://wordpress.org/about/license/8/, should return a 404 template and HTTP header.
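
As a rough illustration of the behaviour requested above, the following sketch flags request paths that are paginated states of a single page. It is not WordPress code: the `SINGLE_PAGES` set and both function names are hypothetical, and the set only contains the paths used as examples in this ticket.

```python
from typing import Optional, Tuple

# Hypothetical list of single (non-paginated) pages, taken from this ticket's examples.
SINGLE_PAGES = {"/", "/download/", "/about/", "/about/features/", "/about/license/"}


def split_pagination(path: str) -> Optional[Tuple[str, int]]:
    """Split '/about/5/' or '/page/3/' into (base path, page number).

    Returns None when the path does not end in a numeric segment.
    """
    segments = [s for s in path.strip("/").split("/") if s]
    if not segments:
        return None
    last = segments[-1]
    if not (last.isascii() and last.isdigit()):
        return None
    base_segments = segments[:-1]
    # Treat a trailing 'page' segment as part of the pagination suffix.
    if base_segments and base_segments[-1] == "page":
        base_segments = base_segments[:-1]
    base = ("/" + "/".join(base_segments) + "/") if base_segments else "/"
    return base, int(last)


def should_return_404(path: str) -> bool:
    """True when the path is a paginated state of a known single page."""
    parts = split_pagination(path)
    return parts is not None and parts[0] in SINGLE_PAGES
```

For example, `should_return_404("/about/features/5/")` is true, while `should_return_404("/about/features/")` is false.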

Change History (18)

#1 follow-up: @dd32
5 years ago

Would returning a canonical tag of https://wordpress.org/ suffice here? (Currently it returns <link rel="canonical" href="https://wordpress.org/3/" />)

#2 @dd32
5 years ago

  • Description modified (diff)

Closing the others as duplicates of this, as they're all Paginated states of Pages which is the same thing at the core.

#3 @dd32
5 years ago

#5185 was marked as a duplicate.

#4 @dd32
5 years ago

#5186 was marked as a duplicate.

#5 follow-up: @ocean90
5 years ago

The last two examples should be fixed by [WP47727].

#6 in reply to: ↑ 5 @dd32
5 years ago

Replying to ocean90:

The last two examples should be fixed by [WP47727].

Ah, so they are. Thanks @SergeyBiryukov!

#7 in reply to: ↑ 1 ; follow-up: @bradleyt
5 years ago

Replying to dd32:

Would returning a canonical tag of https://wordpress.org/ suffice here? (Currently it returns <link rel="canonical" href="https://wordpress.org/3/" />)

Just noting that core should really be returning a canonical of either https://wordpress.org/page/3/ or https://wordpress.org/ here - https://wordpress.org/3/ is just plain wrong. This specific canonical issue only happens on the homepage, and there is an open core ticket for this specific issue: https://core.trac.wordpress.org/ticket/49220

For wordpress.org specifically, the canonical should be equal to https://wordpress.org/
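
To make the two acceptable canonicals concrete, here is a minimal sketch. It is an illustration only, not how core builds the tag: `front_page_canonical`, `canonical_link_tag`, and the `keep_pagination` flag are hypothetical names introduced for this example.

```python
from html import escape


def front_page_canonical(home_url: str, paged: int, keep_pagination: bool = False) -> str:
    """Return the canonical URL for a front-page request with a page number.

    With keep_pagination=False (the wordpress.org preference stated above),
    every paginated state points back at the homepage itself.
    """
    home = home_url.rstrip("/") + "/"
    if keep_pagination and paged > 1:
        return f"{home}page/{paged}/"
    return home


def canonical_link_tag(url: str) -> str:
    """Render the <link rel="canonical"> tag, escaping the URL for an attribute."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}" />'
```

Calling `canonical_link_tag(front_page_canonical("https://wordpress.org/", 3))` yields a tag pointing at https://wordpress.org/, and neither option ever produces https://wordpress.org/3/.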

#8 in reply to: ↑ 7 @dd32
5 years ago

Replying to bradleyt:

Replying to dd32:

Would returning a canonical tag of https://wordpress.org/ suffice here? (Currently it returns <link rel="canonical" href="https://wordpress.org/3/" />)

...
For wordpress.org specifically, the canonical should be equal to https://wordpress.org/

Would returning that canonical tag fulfil the needs of this ticket? Specifically, can we avoid having to return a 301 or 404 here and just use the canonical tag instead?

#9 follow-up: @jonoaldersonwp
5 years ago

A canonical tag would definitely help, but we'd still be in a position where we have infinite crawl traps and pages which shouldn't exist. That'd continue to impact crawl budget, discovery, etc., across the site(s).

#10 in reply to: ↑ 9 @dd32
5 years ago

Replying to jonoaldersonwp:

we'd still be in a position where we have infinite crawl traps and pages which shouldn't exist. That'd continue to impact crawl budget, discovery, etc., across the site(s).

As paginated states of the front-page aren't ever actually linked, I'm not sure if that's realistically an issue here? 3rd party websites may link to one or two such pages, but on the whole it shouldn't be massive traffic?

#11 @jonoaldersonwp
5 years ago

The problem isn't traffic volume, it's that they're queryable and public. That means they'll still represent a point of leakage. That aside, they shouldn't exist / be exposed, regardless.

#12 @dd32
5 years ago

  • Resolution set to fixed
  • Status changed from new to closed

https://wordpress.org/page/3/

Returns a canonical tag now.
I'm not inclined to add a redirect here right now.

All other urls mentioned redirect thanks to [WP47727].

#13 @jonoaldersonwp
5 years ago

  • Priority changed from normal to lowest
  • Resolution fixed deleted
  • Status changed from closed to reopened

This is a huge improvement, but we still need to improve the handling of invalid requests to optimize crawl budget.

As per the brief, URLs like https://wordpress.org/page/3/ need to return a 404 or 301.
Prefer a 404, as these URLs might feasibly be valid in the future.


#14 @dd32
5 years ago

Unless core fixes those, these won't be returning 404s on WordPress.org.

#15 @dd32
5 years ago

  • Resolution set to reported-upstream
  • Status changed from reopened to closed

Opened https://core.trac.wordpress.org/ticket/50163 with a possible patch.

Going to mark this as reported-upstream, as it can be handled there.

#16 @jonoaldersonwp
5 years ago

Nice one, thanks! :)

#18 @johnjamesjacoby
2 months ago

I've done some independent research related to this, so I figured I'd share what I found:

HTTP Codes:

  • 301 is not quite right because it communicates permanent non-existence
  • 404 is currently best because it communicates current non-existence
  • 416 means "Range Not Satisfiable" but has a specific use-case that does not apply here (at all)
  • There isn't a better response code than 404 to communicate that a request is out-of-bounds or what the boundaries would be relative to a specific URI

WordPress Code:

  • The way that redirect_canonical() integrates with pagination is specific to something like /page/1/ or page=1. It is handled consistently for all core cases (archives, multi-page singulars, comments, etc.) and properly does a 301 back to the root/canonical URI, but it does not redirect page requests that are too large
  • I noticed that wp_get_canonical_url() does handle pagination query variables, but does not currently consider is_front_page() and the difference between page and paged, though I have not confirmed if that matters yet
  • The cpage query var for Comments may be worth checking too, if paginated comments are set anywhere and out-of-bounds requests are being made/crawled
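
The status-code reasoning above can be summarised in a small sketch. This is an assumption-laden illustration, not core's implementation: `pagination_response` and its parameters are hypothetical, and the number of valid pages is assumed to be known in advance.

```python
from typing import Optional, Tuple


def pagination_response(base_url: str, page: int, total_pages: int) -> Tuple[int, Optional[str]]:
    """Pick an HTTP status (and redirect target, if any) for a paginated request.

    - page 1 is the canonical URI itself, so redirect to it with a 301
    - pages outside 1..total_pages do not currently exist, so return a 404
    - anything else is a valid paginated state
    """
    if page == 1:
        return 301, base_url
    if page < 1 or page > total_pages:
        return 404, None
    return 200, None
```

A single page such as the homepage has `total_pages == 1`, so `pagination_response("https://wordpress.org/", 3, 1)` gives a 404 with no redirect target, while page 1 gives a 301 back to https://wordpress.org/.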