Making WordPress.org

Opened 11 years ago

Last modified 8 months ago

#174 assigned enhancement

Link to generally related functions/classes

Reported by: samuelsidler Owned by:
Milestone: Improved Search Priority: normal
Component: Developer Hub Keywords: has-patch
Cc:

Description

Individual code reference entries should link to generally related functions and classes based on word stem, location, and other information.

Attachments (6)

174.png (69.5 KB) - added by keesiemeijer 10 years ago.
related posts for the get_terms function
174-2.png (68.5 KB) - added by keesiemeijer 10 years ago.
related posts for the get_queried_object function
174-3.png (73.5 KB) - added by keesiemeijer 10 years ago.
related posts for the WP_Date_Query::get_sql method
174.2.png (69.5 KB) - added by keesiemeijer 10 years ago.
related posts for the () function
174.patch (17.3 KB) - added by keesiemeijer 8 years ago.
Add title keywords algorithm and display keywords in single posts pages
174.4.png (26.8 KB) - added by keesiemeijer 8 years ago.
Title keywords on single post pages


Change History (46)

#1 @samuelsidler
11 years ago

  • Priority changed from low to normal

#2 @siobhan
11 years ago

  • Cc siobhan added

#3 @siobhan
11 years ago

  • Owner set to Rarst
  • Status changed from new to assigned

#4 @Rarst
10 years ago

Relatedness is a simple concept on the surface, but far from trivial to implement, more so for atypical use cases (like ours).

Overall we are trying to:

  1. Pick data inputs that are not overly complicated to obtain
  2. Process them in a way that takes reasonable resources (more importantly human resources, less importantly computing resources)
  3. Arrive at "good" results to show to users (at least most of the time)

There are two implementations closely related to the use case (that I can think of right away):

  1. Codex uses a mostly manual process of creating and maintaining wiki categories (or whatever they are terminologically). This is an extremely flexible process with arbitrary inputs, but it is also extremely resource intensive, since people have to take care of it.
  2. QueryPosts uses stemming (or whatever it is terminologically) to dismantle names into word parts and look up the matching names (some meaningless words are thrown out, etc). In practice it produces passable results, though far from perfect. Its usefulness falls for extremely commonly used words, which produce an overabundance of matches.

The one approach I am interested in, but haven't seen tried yet, is to simply take all the data we already have (like file structure) and feed it into one of the generic related posts plugins. It would be highly automated, but we would have to try it to see the quality of the results.

So these three (or a combination of them) are essentially the choices I see on the table.

  1. Adopt/migrate manual processes from Codex.
  2. Adopt the stemming approach (I could throw the ready-made QP code over and be done).
  3. Choose and try a related posts plugin, see what happens.
Last edited 10 years ago by Rarst

This ticket was mentioned in Slack in #meta-devhub by rarst. View the logs.


10 years ago

This ticket was mentioned in Slack in #meta-devhub by rarst. View the logs.


10 years ago

#7 @keesiemeijer
10 years ago

+1 for stemming words.

Here's a proof of concept to use the Porter stemming algorithm and other rules to get related words from the post title.
http://tartarus.org/~martin/PorterStemmer/

Similar words such as "queried", "queries" and "query" get the same stem "queri" when passed through the algorithm.
The algorithm is a cheap way of getting stems from words without a database or API lookup.
It doesn't always produce real words but it increases the similarity between titles when queried.

Proof of concept rules for getting related words from the title:

  • Get the related words by splitting the title at the underscores.
  • Remove stop words such as 'the', 'as', 'by' etc...
  • Don't allow 'wp' as a related word. Add the second word from the title to it with a dash (e.g. the related words from wp_head are wp-head head).
  • 'wp' is allowed as a related word if it's the only word in a function/hook/class/method name.
  • Allow the stop words 'is' and 'get' only if they're the first word of the post title.
  • Add word stems and the file name as related words.
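
The rules above could be sketched roughly as follows. This is an illustrative Python version, not the code from the gist or the patch: the function name `title_keywords`, the stop word list, and the `stem` callable are assumptions. A real Porter stemmer would be passed in as `stem`. The handling of 'wp' in the middle of a name is a guess, since the rules only describe it as the first or only word, and only underscore-separated names are handled.

```python
# Assumed stop word list; the actual list used in the proof of concept is not shown in the ticket.
STOP_WORDS = {"the", "as", "by", "is", "get", "a", "an", "of", "to", "in"}
# Stop words that are kept when they are the first word of the title.
ALLOWED_FIRST_WORDS = {"is", "get"}


def title_keywords(title, file_name=None, stem=None):
    """Return related words for an underscore-separated title.

    `stem` is an optional callable (for example a Porter stemmer) that maps
    a word to its stem; when omitted, no stems are added.
    """
    words = [w for w in title.lower().split("_") if w]
    keywords = []
    for index, word in enumerate(words):
        if word == "wp":
            if len(words) == 1:
                # 'wp' is allowed when it is the only word in the name.
                keywords.append("wp")
            elif index == 0:
                # Otherwise join it to the second word with a dash.
                keywords.append("wp-" + words[1])
            # Assumption: a 'wp' that appears later in the name is dropped.
            continue
        is_allowed_first = index == 0 and word in ALLOWED_FIRST_WORDS
        if word in STOP_WORDS and not is_allowed_first:
            continue
        keywords.append(word)
        if stem is not None:
            stemmed = stem(word)
            if stemmed != word:
                keywords.append(stemmed)
    if file_name:
        keywords.append(file_name)
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(keywords))
```

With these assumptions, `title_keywords("wp_head")` yields `wp-head` and `head`, matching the example in the rules.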

Here you'll find related words found with these rules:
Functions
https://rawgit.com/keesiemeijer/b8ba0b01006d6d859919/raw/poc-related-words-functions.html

Classes
https://rawgit.com/keesiemeijer/0727a611ee3d171a5ea0/raw/poc-related-words-classes.html

Methods
https://rawgit.com/keesiemeijer/49d3d62068be351e7adb/raw/poc-related-words-methods.html

Hooks
https://rawgit.com/keesiemeijer/7403f273deeb546389d8/raw/poc-related-words-hooks.html

These results were created with this gist in the archive.php file of the wporg-developer theme.
https://gist.github.com/keesiemeijer/41b6c8576a2f2ac684ce

I've created a custom taxonomy 'wp-parser-related-words' for all relevant post types in my local install and added the related words to the posts as terms.
In total 3,595 terms were created. This is with all external libraries and wp-content parsed. It should be less for the developer reference.

Last edited 10 years ago by keesiemeijer

@keesiemeijer
10 years ago

related posts for the get_terms function

@keesiemeijer
10 years ago

related posts for the get_queried_object function

@keesiemeijer
10 years ago

related posts for the WP_Date_Query::get_sql method

@keesiemeijer
10 years ago

related posts for the () function

#8 @keesiemeijer
10 years ago

These are related posts found with the functions from my own related posts plugin.
https://github.com/keesiemeijer/related-posts-by-taxonomy/blob/master/functions.php

The screenshots show how many related posts were found and the number of terms in common. 20 posts are shown and the posts are randomized within their own relatedness score (terms in common).
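
For illustration, scoring by terms in common and shuffling within each score could look like the sketch below. It is written in Python and is not taken from the related-posts-by-taxonomy plugin; the function name `related_posts`, the `terms_by_post` mapping, and the `rng` parameter are hypothetical.

```python
import random
from collections import defaultdict


def related_posts(target, terms_by_post, limit=20, rng=None):
    """Rank other posts by the number of terms they share with `target`.

    `terms_by_post` maps a post name to its set of terms. Posts with the
    same score are shuffled among themselves, so the order within a score
    is random while higher scores always come first.
    """
    rng = rng or random.Random()
    target_terms = terms_by_post[target]
    by_score = defaultdict(list)
    for post, terms in terms_by_post.items():
        if post == target:
            continue
        score = len(target_terms & terms)
        if score > 0:
            by_score[score].append(post)
    ranked = []
    for score in sorted(by_score, reverse=True):
        group = by_score[score]
        rng.shuffle(group)  # randomize only within the same relatedness score
        ranked.extend((post, score) for post in group)
    return ranked[:limit]
```

Passing a seeded `random.Random` as `rng` makes the order reproducible, which can help when comparing results between runs.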

Fewer results are found for classes.

This ticket was mentioned in Slack in #meta-devhub by rarst. View the logs.


10 years ago

This ticket was mentioned in Slack in #core by swissspidy. View the logs.


9 years ago

#11 @grapplerulrich
8 years ago

Stemming would work well for functions with related names like with Post Types. https://codex.wordpress.org/Post_Types#Function_Reference

But then we have Conditional Tags and Template Tags which don't have any common words.

The list of related functions is really useful when checking if there is a WordPress function for a certain job.

#12 @keesiemeijer
8 years ago

Multiple taxonomies can be used. The related posts are found from the function words, package, and source file taxonomies. The terms used for a function like is_page() would be: is, page, query-php, Query, wp-includes/query.php. This will find all the related is_* functions like

  • is_front_page (5)
  • is_paged (5)
  • is_comments_popup (4)
  • is_home (4)
  • is_date (4)
  • is_main_query (4)
  • is_author (4)
  • is_category (4)
  • etc...

See what related words are found here http://www.stoerke.be/reference/reference/functions/is_page/

Last edited 8 years ago by keesiemeijer

#13 @grapplerulrich
8 years ago

Yes, but you miss out on the functions with has_, e.g. has_tag, or post_type_exists.

This seems like a good start :D It would be nice if they were ordered alphabetically, as they seem to be ordered randomly at the moment.

#14 @keesiemeijer
8 years ago

  • Keywords reporter-feedback added

Thanks :)

This proof of concept only shows that you can make posts more related by using stemmed words from the post title. There are of course other metrics (than terms) that could be used, like searching post content, contributed notes, used by, or manually adding related words.

Keeping with the example for is_page() I don't see how it would ever find a has_* or post_type related post unless it's mentioned in the post content or manually added.

I'm not sure if there is any interest in the related post feature anymore though.

#15 @grapplerulrich
8 years ago

There is interest. I came looking for a ticket because there was a discussion on the "Advanced WordPress Group" on Facebook about switching from the codex to the developer resource and people were missing this feature.

Did you make the changes to the PHP Doc parser? https://github.com/WordPress/phpdoc-parser

#16 @Rarst
8 years ago

  • Owner Rarst deleted

#17 @keesiemeijer
8 years ago

I meant interest from the committers here. I will ask on slack if there is still any interest.

Did you make the changes to the PHP Doc parser?

No, I only changed the wporg-developer theme to add the terms and display the related posts. I would start there though. Adding the terms when importing functions, classes, etc..

Then use a related post plugin to display the related posts in the theme.

This ticket was mentioned in Slack in #meta-devhub by keesiemeijer. View the logs.


8 years ago

This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.


8 years ago

This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.


8 years ago

@keesiemeijer
8 years ago

Add title keywords algorithm and display keywords in single posts pages

#21 @keesiemeijer
8 years ago

The 174.patch adds the title keywords algorithm as described in comment #7.
Use the function get_title_keywords( $title ) to get the keywords and their stems from a title. The title keywords are displayed on single post pages using this function.

I've put the plugin I created some time ago for this ticket online here: https://github.com/keesiemeijer/devhub-related-posts. It lets you generate related keyword terms for the DevHub parsed post type posts in batches of 500 posts. It also displays the related posts on single post pages. It uses the same algorithm as this patch.

@keesiemeijer
8 years ago

Title keywords on single post pages

This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.


8 years ago

#23 @DrewAPicture
8 years ago

  • Keywords has-patch added; reporter-feedback removed
  • Priority changed from normal to high
  • Type changed from enhancement to task

#24 follow-up: @keesiemeijer
8 years ago

In addition to the title keyword taxonomy we should create another taxonomy for function names only. Every post gets its own function name as a term in this taxonomy. This would also be the taxonomy where users can add related function names from a list with autocomplete.
Together with the title keywords, package and file name taxonomy terms it should get reliable related posts with a custom query or related posts plugin (Jetpack?)

#25 in reply to: ↑ 24 @DrewAPicture
8 years ago

Replying to keesiemeijer:

In addition to the title keyword taxonomy we should create another taxonomy for function names only. Every post gets its own function name as a term in this taxonomy. This would also be the taxonomy where users can add related function names from a list with autocomplete.

I think we're talking about two different things here.

You're suggesting we use a taxonomy to group elements together in a reusable way, and that's totally fine. At the same time, you're also talking about associating those items together with each other, which we already have the capability for with p2p. p2p also already supports autocomplete for adding connections. So I think our best option would be to leverage p2p for the 1:1 relationship part and a taxonomy for the reusable 1:many part.

Together with the title keywords, package and file name taxonomy terms it should get reliable related posts with a custom query or related posts plugin (Jetpack?)

Last edited 8 years ago by DrewAPicture

#26 @keesiemeijer
8 years ago

It doesn't matter if it's another taxonomy or a p2p connection. It's just that we can't use the title keywords taxonomy for user contributions. I would rather see the title keywords terms be added when importing or updating functions with the parser. By using another taxonomy or p2p connection (for functions) you have a list of functions for users to choose from. That was the main idea.

p2p also already supports autocomplete for adding connections.

You also get autocomplete for free with a taxonomy in the edit posts screen (when adding a term) as all posts will have their own name as a term to begin with. I don't know p2p well enough to know if it supports autocomplete on the front end as well.

A downside to using a (function) taxonomy is that 6430 terms need to be created for the users to choose from if we are planning to have related posts for all parsed post types. Upside is you only have to query taxonomy terms (function, title words, package, file name)

Another thing to think about is that the parser ships with the p2p library only, without the admin functionality. https://github.com/scribu/wp-lib-posts-to-posts

Parsing causes fatal errors if the parser and the p2p plugin are both activated, so that has to be fixed before we can use the p2p plugin in the dashboard.


Last edited 8 years ago by keesiemeijer

This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.


8 years ago

#28 @DrewAPicture
7 years ago

@keesiemeijer If you're still up for giving this a shot, I think it might actually be better to use the parser to identify the title keywords and create/assign the terms – using the stemming technique. From an efficiency perspective it makes the most sense. We should also add 'has' to the get_allowed_first_words() array.

#29 @DrewAPicture
7 years ago

Additionally, I think it would be useful to introduce a second taxonomy for broadly categorizing things. We could probably use the components and focus lists used by core trac, actually.

Either way, both would bring us a lot closer to improving the discoverability aspect of the code reference, which I feel is one of the biggest complaints right now.

#30 @pbiron
7 years ago

I started work a while back on trying to identify "related" references, tho I've had to put it aside to work on other things recently.

The general idea I was exploring is based on the realization that most (tho not all) function/method/class/hook names are of the form: [Verb] [Noun], e.g., (add|get|update|delete)_post_meta(), etc...where add, get, update, delete are Verbs and post_meta is a Noun.

So, on import:

  1. do phrase level parsing of function/method/class/hook names (stripping stopwords, but only limited stemming)
  2. do "part of speech" (POS) tagging of the phrases (see Part Of Speech Tagging)
  3. then, the "related" references are those with the same Noun but a different Verb
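
A rough sketch of step 3, in Python, assuming names have already been split into a leading Verb and a trailing Noun. The tiny `VERBS` set stands in for the POS lexicon described in this comment and is purely hypothetical, as are the function names. Real POS tagging would be considerably more involved than taking the first word.

```python
from collections import defaultdict

# Hypothetical stand-in for a POS lexicon: words tagged as Verbs.
VERBS = {"add", "get", "update", "delete"}


def split_verb_noun(name):
    """Split 'get_post_meta' into ('get', 'post_meta'); return None if no Verb leads."""
    verb, _, noun = name.partition("_")
    if verb in VERBS and noun:
        return verb, noun
    return None


def related_by_noun(names):
    """Map each name to the names that share its Noun but use a different Verb."""
    by_noun = defaultdict(list)
    for name in names:
        parsed = split_verb_noun(name)
        if parsed:
            verb, noun = parsed
            by_noun[noun].append((verb, name))
    related = {}
    for entries in by_noun.values():
        for verb, name in entries:
            related[name] = sorted(n for v, n in entries if v != verb)
    return related
```

For the `(add|get|update|delete)_post_meta()` example, each of the four functions would list the other three as related, while a name whose Noun appears only once gets an empty list.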

Using this technique, I hope, will produce "related" references with a much higher degree of Precision than stemming alone; altho the recall would undoubtedly be lower. Personally, getting 602 references "related" to get_terms() would be less than useful.

Granted, the method I was working on requires A LOT of work up-front, building/refining the POS lexicon. But once that up-front work is done, the indexing process is relatively quick (and doesn't require human input).

I built a mostly fully functioning plugin that provides a UI for assigning POS to the phrases generated in step 1. The plugin's intended use is:

  1. do an import from the sources (i.e., run phpdoc-parser), which generates potential phrases for step 1 above
  2. assign POS for each phrase (the plugin provides a UI that makes this pretty easy)
  3. iterate the process, refining the POS lexicon on each iteration

I'll try to find the time to get the plugin to the point where I can release it and get others involved in refining the POS lexicon.

#31 follow-up: @keesiemeijer
7 years ago

Hi @DrewAPicture

I've uploaded the proof of concept plugin again. It seems I had removed it.

I've made some changes to it. The terms now get imported when parsing the codebase with this plugin active. WP_Query is used to get related posts instead of a direct query. The logic for what constitutes a related post has changed too. A post should have at least 3 terms in common and should not be too far removed from the top match.
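
A minimal sketch of that cut-off logic, in Python. The ticket does not say how "too far removed from the top match" is measured, so the `max_gap` parameter (the allowed difference in shared terms from the best match) is an assumption, as is the function name.

```python
def filter_related(scored, min_terms=3, max_gap=2):
    """Keep posts with enough shared terms that are close to the top match.

    `scored` is a list of (post, terms_in_common) pairs. `max_gap` is an
    assumed measure of how far below the top score a post may fall.
    """
    candidates = [(post, score) for post, score in scored if score >= min_terms]
    if not candidates:
        return []
    top_score = max(score for _, score in candidates)
    kept = [(post, score) for post, score in candidates if top_score - score <= max_gap]
    # Highest scores first; ties keep their original order.
    return sorted(kept, key=lambda item: -item[1])
```

A smaller `max_gap` trades recall for precision: fewer posts are shown, but they are closer to the best match.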

To see the related posts results visit this mirror of the reference.

With only the title to go on it works decently for some functions but fails for others. I agree that another taxonomy, or other solutions like @pbiron's, would help in this regard. On its own it's not there yet.

How do you see the extra taxonomy terms added? Manually, by users or mods, or some other way?

#32 in reply to: ↑ 31 ; follow-up: @DrewAPicture
7 years ago

Replying to keesiemeijer:

How do you see the extra taxonomy terms added? Manually, by users or mods, or some other way?

Along the lines of the existing core components list, I'd probably add a prompt on the reference article somewhere asking users to help us improve relevancy by suggesting related components.

There's long been discussion about shoring up the @subpackage tags in the file headers to match the components list so we could do just such "categorization" at the devhub level. The thinking is that initial categorization would happen automatically from the parsed file headers, and more specific drilling-down happen by associating specific sub-components at the reference level.

So it would certainly be a multi-prong effort, but I think there's value in (at least) pursuing the subpackage audit, even if we don't primarily consider it in defining relevancy to other elements.

#33 in reply to: ↑ 32 @keesiemeijer
7 years ago

Replying to DrewAPicture:

If the @subpackage tags would be consistent it would be a good way to group posts together at a higher level (instead of mixing them in with the title and file words).

I've updated the plugin. It now also looks for (lowercase) words similar to the @subpackage tags. I've also created a synonym list of words to connect the posts better. The results have improved but not to a point where we can use it.

With only stemming and synonyms a post can have up to 200 related posts. What's difficult is deciding what's still related and where to set the cut-off point. For now I've set it to show only the top 25 related posts. Because of this it could be that better related words don't show up where you would expect them. Functions like is_* and has_* are mostly related by the other functions in the file they are in. They now also get a synonym of exist to better relate them.

I don't think we can do much more with only stemming and synonyms. I like the idea of adding a prompt to suggest related components or posts. Maybe we can re-use the explanations functionality to moderate these suggestions. I will look into it.

We should also think about what a components list would look like.

Last edited 7 years ago by keesiemeijer

#34 @keesiemeijer
7 years ago

I've come to the point where I think this isn't the right way to go about it. Generally the query for related posts is too expensive and the results vary too much.

As an experiment to fix the expensive query, I tried importing related post IDs into post meta after the parser was done, similar to how the Used By and Uses posts are connected after the parser has imported all posts. Importing the related posts took about two seconds per post, which gives you an idea of how expensive the query is. It would have taken around 4 hours to import all related posts.

I think letting users connect related functions as mentioned in 25 would lead to better results with less effort.

Last edited 7 years ago by keesiemeijer

#35 @obenland
7 years ago

  • Type changed from task to enhancement

@keesiemeijer @DrewAPicture Would you mind updating the ticket description with what it is that we're looking for here?

#36 @obenland
7 years ago

  • Priority changed from high to normal

#37 @gibrown
7 years ago

  • Milestone set to Improved Search

This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.


6 years ago

This ticket was mentioned in Slack in #meta by gibrown. View the logs.


3 years ago

This ticket was mentioned in Slack in #meta by pbiron. View the logs.


3 years ago
