Making WordPress.org

Opened 3 years ago

Last modified 9 months ago

#4126 accepted defect

"Special contributions" template leaks PII

Reported by: jonoaldersonwp Owned by: dd32
Milestone: Priority: high
Component: Codex Keywords: seo analytics privacy close


E.g., https://codex.wordpress.org/Special:Contributions/Jany2786@gmail.com

This template should have a meta robots value of 'noindex, follow'.

Change History (16)

#2 @Otto42
3 years ago

For reference, that isn't the email address; it's the username. Those are old spam accounts that used the same value for both email and username.

We no longer allow accounts to have email addresses as their username. Been like that for a few years. Usernames must be lowercase alphanum only.

This ticket was mentioned in Slack in #meta by tellyworth.

3 years ago

#4 @tellyworth
3 years ago

Can (should) we handle URLs with user=\w+@ in a special way? Force a 404 or 410, redact the address from the page, something like that? Just in case there are any ancient non-spam addresses in there.
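As a minimal sketch of the kind of check being asked about here: the helper name looksLikeEmailUsername and the exact pattern are assumptions for illustration, not code from the Codex. It only detects such URLs; what to do with a match (404, 410, redaction) is left open.

```php
<?php
/**
 * Hypothetical helper: returns true when a Special:Contributions path
 * ends in a username that looks like an email address.
 */
function looksLikeEmailUsername( string $path ): bool {
	$prefix = '/Special:Contributions/';
	if ( strpos( $path, $prefix ) !== 0 ) {
		return false;
	}
	// Everything after the prefix is treated as the (URL-encoded) username.
	$username = rawurldecode( substr( $path, strlen( $prefix ) ) );
	// Loose check for something@something.tld; deliberately not full email validation.
	return preg_match( '/^[^@\s]+@[^@\s]+\.[^@\s]+$/', $username ) === 1;
}
```

The check is intentionally loose, so it may flag usernames that merely contain an @ and a dot.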

#5 @jonoaldersonwp
3 years ago

Hmm, we should probably avoid trying to do anything clever with the URLs on request, but we can definitely control indexing of these (types of) URLs. Separately, I plan to keep them out of Google Analytics etc. by doing some housekeeping in Google Tag Manager before tracking scripts fire.

#6 @jonoaldersonwp
3 years ago

For clarity, this still needs noindex'ing.

#7 @tellyworth
2 years ago

  • Owner set to tellyworth
  • Status changed from new to accepted

What's the solution here? Noindex all Special: pages? Just Contributions and Log? Is it specific to those with @ in the URL?

#8 @jonoaldersonwp
2 years ago

Let's noindex anything starting with https://codex.wordpress.org/Special:Contributions/ - I don't see any useful/valuable (landing) pages in that set.

#9 @dd32
13 months ago

Having just cleaned out a lot of spam from the Codex, and taking into account #4127 and #4373, I think we could noindex:

  • ^/User:*
  • ^/User_talk:*
  • ^/Special:*
  • ^/index.php?*

Some User pages do have useful content, but they're in the minority.
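Read as URL path prefixes, the four patterns listed above could be matched roughly like this. This is only a sketch: the function name pathIsNoindexed is hypothetical, and treating each pattern as a plain prefix is an assumption about what the list means.

```php
<?php
/**
 * Hypothetical helper: returns true when the request URI starts with one of
 * the prefixes proposed for noindexing (User:, User_talk:, Special:, index.php?).
 */
function pathIsNoindexed( string $requestUri ): bool {
	return preg_match( '~^/(?:User:|User_talk:|Special:|index\.php\?)~', $requestUri ) === 1;
}
```

For instance, '/Special:SpecialPages' and '/index.php?title=Foo' would match, while an ordinary page path such as '/Function_Reference' would not.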


#10 @jonoaldersonwp
13 months ago

Nice, that'd be great!

#11 @dd32
13 months ago

Fixed via r10702-codex, pending deploy.

This ticket was mentioned in Slack in #meta by tellyworth.

10 months ago

#13 @tellyworth
9 months ago

  • Keywords close added

Is this definitely fixed now?

#14 @dd32
9 months ago

  • Owner changed from tellyworth to dd32

Unfortunately, now that the above commit has been deployed (I think), some pages aren't being matched correctly.

For example, https://codex.wordpress.org/Special:SpecialPages doesn't include the noindex tag although it should, whereas User pages are noindexed.


#15 @dd32
9 months ago

In r11568-codex I've updated the noindex code to be this:

// Noindex various pages. See Meta #4373, #4127, #4126.
$noindex = (
	// No article
	(
		$pageOutput->isArticleRelated() && ! $pageOutput->getRevisionId()
	) ||
	// The User, Special, and File namespaces are not indexed.
	(
		/* namespace condition not shown */
	) ||
	// It's an internal 'index.php?..' page
	preg_match( '~^/index\.php~', $_SERVER['REQUEST_URI'] )
);

That seems to match all the pages I could find; we just need to wait on a systems deploy / cache clear again.

There are a bunch of is...() functions in MediaWiki that could be used, but given how many there are, it wasn't straightforward to work out which of the ones I found to use. isArticleRelated() is also true for user pages, as it is for wiki pages.
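To illustrate how the three conditions combine, here is a standalone sketch that takes plain values instead of MediaWiki objects. The function name sketchShouldNoindex is hypothetical. The namespace condition is not shown in the snippet above, so the namespace check here is an assumption; the IDs used are MediaWiki's default numbering (Special is -1, User is 2, File is 6).

```php
<?php
/**
 * Hypothetical standalone version of the noindex decision.
 * The real code works on MediaWiki's page output object; this sketch does not.
 */
function sketchShouldNoindex( bool $isArticleRelated, ?int $revisionId, int $namespace, string $requestUri ): bool {
	// Assumed namespace list: Special (-1), User (2), File (6) in MediaWiki's defaults.
	$noindexNamespaces = [ -1, 2, 6 ];

	// Article-type page without a revision, i.e. no article exists.
	if ( $isArticleRelated && ! $revisionId ) {
		return true;
	}
	// Page lives in one of the namespaces that should not be indexed (assumed check).
	if ( in_array( $namespace, $noindexNamespaces, true ) ) {
		return true;
	}
	// Internal 'index.php?..' request.
	return preg_match( '~^/index\.php~', $requestUri ) === 1;
}
```

Checking the namespace explicitly avoids relying on isArticleRelated() to separate user pages from ordinary wiki pages, since it returns true for both.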

#16 @jonoaldersonwp
9 months ago

Awesome, nice one!
I'll fire up a crawl once that's in the wild, and see if there's anything else we need to squash.
