Making WordPress.org

Opened 4 years ago

Closed 20 months ago

#5344 closed defect (bug) (fixed)

Delete stale, orphaned topic tags

Reported by: jonoaldersonwp's profile jonoaldersonwp Owned by: dd32's profile dd32
Milestone: Priority: low
Component: Support Forums Keywords: seo
Cc:

Description (last modified by jonoaldersonwp)

Topic tags which have only one related post, where that post was created more than a year ago, should be either consolidated into related/similar tags or, when there's no suitable candidate for consolidation, deleted (and a 410 status returned).

To facilitate the unavoidably manual process of consolidation, would it be possible to expose a list of URLs of tags which meet these criteria?

E.g., https://wordpress.org/support/topic-tag/_765258526/
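As a rough illustration of the requested list, the sketch below filters tag records down to those with a single topic older than a year and prints their URLs. The `TagRecord` shape, the `stale_tag_urls` helper, and the sample dates are hypothetical; the URL pattern is only inferred from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: URL pattern inferred from the example URL in the description.
TAG_URL = "https://wordpress.org/support/topic-tag/{slug}/"


@dataclass
class TagRecord:
    # Hypothetical record shape, not the real forum schema.
    slug: str
    topic_count: int
    newest_topic: datetime  # creation date of the newest topic using the tag


def stale_tag_urls(tags, now, max_age=timedelta(days=365)):
    """Return URLs of tags with exactly one topic that is older than max_age."""
    cutoff = now - max_age
    return [
        TAG_URL.format(slug=t.slug)
        for t in tags
        if t.topic_count == 1 and t.newest_topic < cutoff
    ]


# Made-up sample data, for illustration only.
now = datetime(2020, 8, 1)
sample = [
    TagRecord("_765258526", 1, datetime(2015, 3, 1)),
    TagRecord("woocommerce", 43541, datetime(2020, 7, 30)),
    TagRecord("fresh-tag", 1, datetime(2020, 7, 1)),
]
urls = stale_tag_urls(sample, now)
for url in urls:
    print(url)
```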

Attachments (2)

5344-limit-topic-creation.patch (1.8 KB) - added by Clorith 4 years ago.
5344-limit-topic-creation.2.patch (3.7 KB) - added by Clorith 4 years ago.


Change History (27)

#1 @jonoaldersonwp
4 years ago

  • Description modified (diff)

#2 @Clorith
4 years ago

I have said list and started work on a quick script to find similarities and consolidation for the tags so we can merge / remove them.

But I wanted a proper solution for tags before doing that work fully: fix the cause first, then the symptom, or it'll get out of hand again (there are currently close to 600k tags).

This ticket was mentioned in Slack in #meta by jonoaldersonwp. View the logs.


4 years ago

This ticket was mentioned in Slack in #forums by yui. View the logs.


4 years ago

This ticket was mentioned in Slack in #meta by tellyworth. View the logs.


4 years ago

#6 @Clorith
4 years ago

5344-limit-topic-creation.patch is the first step in resolving this.

It will limit the creation of new topic tags to moderators or above (this can be further refined if needed, but seems like the natural point).

This means the tag field remains as it is while we work on shortening the tags list, but any attempt to write a tag that does not exist is just skipped.
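The behaviour described above can be modelled roughly as follows. This is a simplified Python sketch, not the contents of the patch: `filter_new_tags`, `existing`, and `can_create` are hypothetical names, and the real change lives in the PHP forum code.

```python
def filter_new_tags(submitted, existing, can_create):
    """Keep tags that already exist; keep unknown tags only for users
    who are allowed to create them (moderators or above)."""
    kept = []
    for tag in submitted:
        if tag.lower() in existing or can_create:
            kept.append(tag)
        # Otherwise the unknown tag is silently skipped.
    return kept


# Hypothetical set of tags that already exist.
existing = {"css", "menu"}

regular_user = filter_new_tags(["css", "brand-new", "menu"], existing, can_create=False)
moderator = filter_new_tags(["css", "brand-new", "menu"], existing, can_create=True)
print(regular_user)
print(moderator)
```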

This ticket was mentioned in Slack in #meta by clorith. View the logs.


4 years ago

#8 @Otto42
4 years ago

@Clorith It seems like returning a WP_Error there will abort inserting of all terms after the non-existent one. See https://core.trac.wordpress.org/browser/trunk/src/wp-includes/taxonomy.php#L2602 where the function returns instead of continuing the loop.
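To make the control-flow issue concrete, here is a simplified Python model (not the WordPress core code) of a loop that returns on the first rejected term versus one that skips it and continues. The function names are hypothetical.

```python
def insert_terms_abort(terms, existing):
    """Model of the problem: stop at the first term that cannot be created."""
    inserted = []
    for term in terms:
        if term not in existing:
            # Early return: every term after this one is never processed.
            return inserted
        inserted.append(term)
    return inserted


def insert_terms_skip(terms, existing):
    """Model of the intended behaviour: skip the unknown term and carry on."""
    inserted = []
    for term in terms:
        if term not in existing:
            continue
        inserted.append(term)
    return inserted


existing = {"css", "menu"}
aborted = insert_terms_abort(["css", "brand-new", "menu"], existing)
skipped = insert_terms_skip(["css", "brand-new", "menu"], existing)
print(aborted)
print(skipped)
```

When the unknown tag happens to come last in the list, both variants give the same result, which makes the problem easy to miss in testing.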

#9 @Clorith
4 years ago

Good catch @Otto42, I had of course only tested with new tags coming in last so completely missed that!

5344-limit-topic-creation.2.patch takes a more roundabout way to reach the same goal. I didn't spot any good core way of preventing tag creation, but there are some bbPress filters that could be utilized to achieve what was needed here.

The new topic creation is a bit more involved than previously. I don't think we add any secondary taxonomies, but the approach taken future-proofs it in case we need to (or actually do already). It shouldn't add any noteworthy overhead with its checks, as these are all features that are used elsewhere in the process, so the responses would already be in memory for the request.

This ticket was mentioned in Slack in #meta by clorith. View the logs.


4 years ago

This ticket was mentioned in Slack in #meta by clorith. View the logs.


4 years ago

This ticket was mentioned in Slack in #meta by clorith. View the logs.


4 years ago

#13 @dd32
4 years ago

In 10599:

Support Forums: Limit new tag creation to moderators.

Props Clorith.
See #5344.

#14 follow-ups: @dd32
4 years ago

Just noting that I cleaned up 5344-limit-topic-creation.2.patch a little bit, and moved it to Performance Optimizations, so that this only affects the English support forums, and not all the localised support forums.

#15 in reply to: ↑ 14 @jonoaldersonwp
4 years ago

Replying to dd32:

Just noting that I cleaned up 5344-limit-topic-creation.2.patch a little bit, and moved it to Performance Optimizations, so that this only affects the English support forums, and not all the localised support forums.

Nice one!

#16 @dd32
4 years ago

In 10602:

Support: Fix a PHP warning caused when only one tag is added to a thread.

Amends [10599].
See #5344.

#17 @dd32
4 years ago

In 10603:

Support: Handle one,two being passed not just one, two. This shouldn't be needed, but better to be forwards compatible.

Amends [10602].
See #5344.
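The two input shapes mentioned in the commit message, plus the single-tag case from [10602], can be normalised along these lines. This is a minimal sketch with a hypothetical `split_tags` helper; the actual commit is PHP and is not shown in this ticket.

```python
def split_tags(raw):
    """Split a comma-separated tag string, tolerating 'one,two' and 'one, two'."""
    # Split on the comma only, then trim whitespace and drop empty pieces.
    return [part.strip() for part in raw.split(",") if part.strip()]


print(split_tags("one,two"))
print(split_tags("one, two"))
print(split_tags("one"))
```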

#18 in reply to: ↑ 14 ; follow-ups: @Clorith
4 years ago

Replying to dd32:

Just noting that I cleaned up 5344-limit-topic-creation.2.patch a little bit, and moved it to Performance Optimizations, so that this only affects the English support forums, and not all the localised support forums.

Sounds reasonable for now. I seem to recall Rosetta also wanting this at some point, but I think it makes sense to revisit that once the full-on tag-upgrade process is in place, since the curated tags are only really valuable once we've also implemented a way to more sanely choose tags (what I like to call phase 3 of this).

For the next phase, I'm thinking removing all "undesirables" from the tag list makes sense as a first step. Doing so will reduce the total dataset we have to work with, and make it much easier to determine the overarching tag hierarchy needed to group the remaining tags.

I'm thinking something initially manageable like removing tags that:

  • Have a numeric-only slug (these are purely HTML entity tags from a quick check, and even if the tag was fully numeric anyway, numbers alone give no context and as such hold no value)
  • Have fewer than 5 uses
  • Are literally the term WordPress, or wp (since it's fairly redundant to tag a topic as being about WordPress, on a WordPress support forum)

I'll lean a bit on @jonoaldersonwp to sanity check that these sound like sensible criteria for a first set of removable tags.
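The three criteria above could be expressed as a single predicate, sketched here in Python. The `is_removable` name, its arguments, and the sample values are hypothetical, and the real cleanup was not necessarily implemented this way.

```python
import re

# Terms considered redundant on a WordPress support forum.
REDUNDANT_NAMES = {"wordpress", "wp"}


def is_removable(slug, name, uses, min_uses=5):
    """Return True if a tag matches any of the proposed removal criteria."""
    if re.fullmatch(r"[0-9]+", slug):
        return True  # numeric-only slug
    if uses < min_uses:
        return True  # fewer than 5 uses
    return name.strip().lower() in REDUNDANT_NAMES


# Made-up sample values, for illustration only.
print(is_removable("8217", "&#8217;", 120))
print(is_removable("rare-tag", "rare tag", 3))
print(is_removable("wordpress", "WordPress", 37200))
print(is_removable("multisite", "multisite", 13680))
```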

#19 in reply to: ↑ 18 @dd32
4 years ago

Those sound reasonable to me as a first set to remove.

Replying to Clorith:

I'm thinking something initially manageable like removing tags that:

  • Have a numeric-only slug (these are purely HTML entity tags from a quick check, and even if the tag was fully numeric anyway, numbers alone give no context and as such hold no value)
  • Have fewer than 5 uses

Looking at the current list of tags, here's the counts:

Topics	Tags with that many topics
0	 36,724
1	492,877
2	 53,658
3	 19,359
4	 10,231
5	  6,288
6	  4,271
7	  3,233
8	  2,484
9	  1,945
>10	 22,257

Combining the removal of tags with fewer than 5 uses and those whose slugs are purely numeric, we'd be removing about 613k tags, leaving about 40k tags behind.
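The totals can be checked directly against the table above. The short sketch below only sums the rows with fewer than 5 topics; it does not account separately for numeric-only slugs, which the table does not break out.

```python
# Tag counts per number of topics, copied from the table above.
counts = {
    0: 36724, 1: 492877, 2: 53658, 3: 19359, 4: 10231,
    5: 6288, 6: 4271, 7: 3233, 8: 2484, 9: 1945,
}
ten_plus = 22257  # the row labelled ">10" in the table

removed = sum(n for uses, n in counts.items() if uses < 5)
kept = sum(n for uses, n in counts.items() if uses >= 5) + ten_plus

print(removed)  # 612849, i.e. roughly 613k
print(kept)     # 40478, i.e. roughly 40k
```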

  • Are literally the term WordPress, or wp (since it's fairly redundant to tag a topic as being about WordPress, on a WordPress support forum)

There are quite a few which match that too, but I suspect that list is going to be harder to come up with. Looking at the most-used tags, though, there are a number of obvious ones.
E.g., the top 20 topic tags in use:

name		count
woocommerce	43,541
plugin		38,768
wordpress	37,200
error		36,563
css		29,316
theme		25,310
menu		21,195
php		18,578
image		18,525
header		18,308
images		18,149
post		16,522
posts		16,094
categories	15,393
sidebar		15,298
widget		14,654
category	14,577
Comments	13,920
login		13,784
multisite	13,680

#20 @jonoaldersonwp
4 years ago

Oh, wow. That jump down to 40k would be amazing.

#21 @Clorith
4 years ago

That sounds great, I think we'll hold off on removing any of the other top-20 tag uses (with the exception of wordpress and wp), as some of them are plugin slugs.

The reasoning is simple: some plugin authors follow their slugs to catch topics, and depending on how they do this following, they might get odd errors on their end. I would like to announce the removal of these properly first, to ensure a good transition for those who may be using them.

#22 @dd32
4 years ago

  • Owner set to dd32
  • Status changed from new to accepted

#23 in reply to: ↑ 18 @dd32
4 years ago

Replying to Clorith:

I'm thinking something initially manageable like removing tags that:

  • Have a numeric-only slug (these are purely HTML entity tags from a quick check, and even if the tag was fully numeric anyway, numbers alone give no context and as such hold no value)

Done. All of the affected 150 terms were either numeric, or only contained non-[a-z0-9] characters like #:)

  • Have fewer than 5 uses

Done.

  • Are literally the term WordPress, or wp

Done.
I removed wordpress, wp, plugin, theme, and error. These were just the generic tags on the first page of edit-tags.php that I didn't see offering any value.

Happy to remove any other similarly generic ones if provided a list. WordPress cannot delete these terms directly itself, as it's not optimized for such operations on large tags (it fetches/diffs/sets the tags for each topic, rather than just deleting the tag from the topic, which is much faster).

Note: I took a snapshot of the terms prior to running this, so I can revert any individual change or triple-check what the data was before this if required.

#24 @jonoaldersonwp
4 years ago

Ooh, great work!

#25 @jonoaldersonwp
20 months ago

  • Resolution set to fixed
  • Status changed from accepted to closed