Opened 11 years ago
Last modified 11 months ago
#174 assigned enhancement
Link to generally related functions/classes
Reported by: | samuelsidler | Owned by: | |
---|---|---|---|
Milestone: | Improved Search | Priority: | normal |
Component: | Developer Hub | Keywords: | has-patch |
Cc: |
Description
Individual code reference entries should link to generally related functions and classes based on word stem, location, and other information.
Attachments (6)
Change History (46)
This ticket was mentioned in Slack in #meta-devhub by rarst. View the logs.
10 years ago
This ticket was mentioned in Slack in #meta-devhub by rarst. View the logs.
10 years ago
#7
@
10 years ago
+1 for stemming words.
Here's a proof of concept to use the Porter stemming algorithm and other rules to get related words from the post title.
http://tartarus.org/~martin/PorterStemmer/
Similar words such as "queried", "queries" and "query" get the same stem "queri" when passed through the algorithm.
The algorithm is a cheap way of getting stems from words without a database or API lookup.
It doesn't always produce real words but it increases the similarity between titles when queried.
Proof of concept rules for getting related words from the title:
- Get the related words by splitting the title at the underscores.
- Remove stop words such as 'the', 'as', 'by', etc.
- Don't allow 'wp' as a related word. Add the second word from the title to it with a dash (e.g. the related words from wp_head are wp-head head).
- 'wp' is allowed as a related word if it's the only word in a function/hook/class/method name.
- Allow the stop words 'is' and 'get' only if they're the first word of the post title.
- Add word stems and the file name as related words.
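The rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from the gist: the name `related_words`, the stop-word subset, and the pluggable `stem` callable are assumptions made for the example. A Porter stemmer would be passed in as `stem`; the default leaves words unchanged.

```python
# Illustrative subset only; a real stop-word list would be longer.
STOP_WORDS = {"the", "as", "by", "is", "get"}
# Stop words that are kept when they are the first word of the title.
ALLOWED_FIRST_WORDS = {"is", "get"}


def related_words(title, file_name=None, stem=lambda word: word):
    """Return related words for a title such as 'wp_head' (hypothetical helper)."""
    parts = [p for p in title.lower().split("_") if p]
    words = []
    for index, part in enumerate(parts):
        if part == "wp":
            if len(parts) == 1:
                # 'wp' is allowed when it is the only word in the name.
                words.append(part)
            elif index == 0:
                # Otherwise join it to the second word with a dash.
                words.append("wp-" + parts[1])
            continue
        if part in STOP_WORDS and not (index == 0 and part in ALLOWED_FIRST_WORDS):
            continue
        words.append(part)
    # Add the stems of the collected words, then the file name.
    words += [stem(word) for word in words]
    if file_name:
        words.append(file_name)
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(words))
```

With the default `stem`, `related_words("wp_head")` gives `wp-head` and `head`, matching the example in the rules.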
Here you'll find related words found with these rules:
Functions
https://rawgit.com/keesiemeijer/b8ba0b01006d6d859919/raw/poc-related-words-functions.html
Classes
https://rawgit.com/keesiemeijer/0727a611ee3d171a5ea0/raw/poc-related-words-classes.html
Methods
https://rawgit.com/keesiemeijer/49d3d62068be351e7adb/raw/poc-related-words-methods.html
Hooks
https://rawgit.com/keesiemeijer/7403f273deeb546389d8/raw/poc-related-words-hooks.html
These results were created with this gist in the archive.php file of the wporg-developer theme.
https://gist.github.com/keesiemeijer/41b6c8576a2f2ac684ce
I've created a custom taxonomy 'wp-parser-related-words' for all relevant post types in my local install and added the related words to the posts as terms.
In total 3,595 terms were created. This is with all external libraries and wp-content parsed. It should be less for the developer reference.
#8
@
10 years ago
These are related posts found with the functions from my own related posts plugin.
https://github.com/keesiemeijer/related-posts-by-taxonomy/blob/master/functions.php
The screenshots show how many related posts were found and the number of terms in common. 20 posts are shown and the posts are randomized within their own relatedness score (terms in common).
Fewer results are found for classes.
This ticket was mentioned in Slack in #meta-devhub by rarst. View the logs.
10 years ago
This ticket was mentioned in Slack in #core by swissspidy. View the logs.
9 years ago
#11
@
9 years ago
Stemming would work well for functions with related names, as with Post Types. https://codex.wordpress.org/Post_Types#Function_Reference
But then we have Conditional Tags and Template Tags which don't have any common words.
A list of related functions is really useful when checking whether there is a WordPress function for a certain job.
#12
@
9 years ago
Multiple taxonomies can be used. The related posts are found from the function words, package, and source file taxonomies. The terms used for a function like is_page() would be: is, page, query-php, Query, wp-includes/query.php. This will find all the related is_* functions like
- is_front_page (5)
- is_paged (5)
- is_comments_popup (4)
- is_home (4)
- is_date (4)
- is_main_query (4)
- is_author (4)
- is_category (4)
- etc...
See what related words are found here http://www.stoerke.be/reference/reference/functions/is_page/
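As a rough illustration of the "terms in common" score shown in the list above, here is a small Python sketch. The term sets for posts other than is_page() are assumptions made for the example (they are not taken from the mirror site), and `terms_in_common` is a hypothetical helper, not a function from the plugin.

```python
def terms_in_common(terms_by_post, post):
    """Rank other posts by how many terms they share with `post`."""
    own = terms_by_post[post]
    scores = {
        other: len(own & terms)
        for other, terms in terms_by_post.items()
        if other != post
    }
    # Highest score first; ties are ordered by name for a stable result.
    return sorted(
        ((name, score) for name, score in scores.items() if score > 0),
        key=lambda item: (-item[1], item[0]),
    )


# Terms every function in wp-includes/query.php is assumed to share here.
SHARED = {"is", "query-php", "Query", "wp-includes/query.php"}
TERMS = {
    "is_page": SHARED | {"page"},
    "is_front_page": SHARED | {"front", "page"},  # assumed term set
    "is_home": SHARED | {"home"},                 # assumed term set
    "has_tag": {"has", "tag"},                    # assumed term set
}

ranked = terms_in_common(TERMS, "is_page")
```

Under these assumptions is_front_page scores 5 and is_home scores 4. A post with no shared terms, like has_tag here, does not appear at all.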
#13
@
9 years ago
Yes, but you miss out on the functions with has_, e.g. has_tag or post_type_exists.
This seems like a good start :D It would be nice if they were ordered alphabetically, as they seem to be ordered randomly at the moment.
#14
@
9 years ago
- Keywords reporter-feedback added
Thanks :)
This proof of concept only shows that you can make posts more related by using stemmed words from the post title. There are of course other metrics (than terms) that could be used, like searching post content, contributed notes, used by, or manually adding related words.
Keeping with the example for is_page(), I don't see how it would ever find a has_* or post_type related post unless it's mentioned in the post content or manually added.
I'm not sure if there is any interest in the related post feature anymore though.
#15
@
9 years ago
There is interest. I came looking for a ticket because there was a discussion on the "Advanced WordPress Group" on Facebook about switching from the codex to the developer resource and people were missing this feature.
Did you make the changes to the PHP Doc parser? https://github.com/WordPress/phpdoc-parser
#17
@
9 years ago
I meant interest from the committers here. I will ask on Slack if there is still any interest.
Did you make the changes to the PHP Doc parser?
No, I only changed the wporg-developer theme to add the terms and display the related posts. I would start there though, adding the terms when importing functions, classes, etc.
Then use a related posts plugin to display the related posts in the theme.
This ticket was mentioned in Slack in #meta-devhub by keesiemeijer. View the logs.
9 years ago
This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.
9 years ago
This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.
8 years ago
#21
@
8 years ago
The 174.patch adds the title keywords algorithm as described in comment #7.
Use the function get_title_keywords( $title ) to get the keywords and their stems from a title. The title keywords are displayed on single post pages using this function.
I've put the plugin I created some time ago for this ticket online here: https://github.com/keesiemeijer/devhub-related-posts. It lets you generate related keyword terms for the DevHub parsed post type posts in batches of 500 posts. It also displays the related posts on single post pages. It uses the same algorithm as this patch.
This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.
8 years ago
#23
@
8 years ago
- Keywords has-patch added; reporter-feedback removed
- Priority changed from normal to high
- Type changed from enhancement to task
#24
follow-up:
↓ 25
@
8 years ago
In addition to the title keyword taxonomy we should create another taxonomy for function names only. Every post gets its own function name as a term in this taxonomy. This would also be the taxonomy where users can add related function names from a list with autocomplete.
Together with the title keywords, package and file name taxonomy terms it should get reliable related posts with a custom query or related posts plugin (Jetpack?)
#25
in reply to:
↑ 24
@
8 years ago
Replying to keesiemeijer:
In addition to the title keyword taxonomy we should create another taxonomy for function names only. Every post gets its own function name as a term in this taxonomy. This would also be the taxonomy where users can add related function names from a list with autocomplete.
I think we're talking about two different things here.
You're suggesting we use a taxonomy to group elements together in a reusable way, and that's totally fine. At the same time, you're also talking about associating those items together with each other, which we already have the capability for with p2p. p2p also already supports autocomplete for adding connections. So I think our best option would be to leverage p2p for the 1:1 relationship part and a taxonomy for the reusable 1:many part.
Together with the title keywords, package and file name taxonomy terms it should get reliable related posts with a custom query or related posts plugin (Jetpack?)
#26
@
8 years ago
It doesn't matter if it's another taxonomy or a p2p connection. It's just that we can't use the title keywords taxonomy for user contributions. I would rather see the title keywords terms be added when importing or updating functions with the parser. By using another taxonomy or p2p connection (for functions) you have a list of functions for users to choose from. That was the main idea.
p2p also already supports autocomplete for adding connections.
You also get autocomplete for free with a taxonomy in the edit posts screen (when adding a term) as all posts will have their own name as a term to begin with. I don't know p2p well enough to know if it supports autocomplete on the front end as well.
A downside to using a (function) taxonomy is that 6430 terms need to be created for the users to choose from if we are planning to have related posts for all parsed post types. The upside is that you only have to query taxonomy terms.
Another thing to think about is that the parser ships with the p2p library only, without the admin functionality. https://github.com/scribu/wp-lib-posts-to-posts
It creates fatal errors when parsing if the library (used by the parser) and the p2p plugin are both activated. So that has to be fixed before we can use the p2p plugin in the dashboard.
This ticket was mentioned in Slack in #meta-devhub by drew. View the logs.
8 years ago
#28
@
7 years ago
@keesiemeijer If you're still up for giving this a shot, I think it might actually be better to use the parser to identify the title keywords and create/assign the terms, using the stemming technique. From an efficiency perspective it makes the most sense. We should also add 'has' to the get_allowed_first_words() array.
#29
@
7 years ago
Additionally, I think it would be useful to introduce a second taxonomy for broadly categorizing things. We could probably use the components and focus lists used by core trac, actually.
Either way, both would bring us a lot closer to improving the discoverability aspect of the code reference, which I feel is one of the biggest complaints right now.
#30
@
7 years ago
I started work a while back on trying to identify "related" references, tho I've had to put it aside to work on other things recently.
The general idea I was exploring is based on the realization that most (tho not all) function/method/class/hook names are of the form [Verb] [Noun], e.g., (add|get|update|delete)_post_meta(), etc., where add, get, update, delete are Verbs and post_meta is a Noun.
So, on import:
- do phrase level parsing of function/method/class/hook names (stripping stopwords, but only limited stemming)
- do "part of speech" (POS) tagging of the phrases (see Part Of Speech Tagging)
- then, the "related" references are those with the same Noun but a different Verb
Using this technique will, I hope, produce "related" references with a much higher degree of precision than stemming alone, altho the recall would undoubtedly be lower. Personally, getting 602 references "related" to get_terms() would be less than useful.
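To make the "same Noun, different Verb" idea concrete, here is a minimal Python sketch. It is not the plugin described in this comment: the tiny `VERBS` lexicon stands in for the POS lexicon, it only handles names of the form verb_noun, and both helper names are hypothetical.

```python
from collections import defaultdict

# Tiny hand-made lexicon; a real POS lexicon would be refined over many iterations.
VERBS = {"add", "get", "update", "delete"}


def split_verb_noun(name):
    """Split 'get_post_meta' into ('get', 'post_meta'), or return None."""
    verb, _, noun = name.partition("_")
    if verb in VERBS and noun:
        return verb, noun
    return None


def related_by_noun(names):
    """Map each name to the other names that share its Noun."""
    by_noun = defaultdict(list)
    for name in names:
        parsed = split_verb_noun(name)
        if parsed:
            by_noun[parsed[1]].append(name)
    related = {}
    for group in by_noun.values():
        for name in group:
            related[name] = [other for other in group if other != name]
    return related


names = ["add_post_meta", "get_post_meta", "update_post_meta",
         "delete_post_meta", "get_terms"]
related = related_by_noun(names)
```

In this toy input get_post_meta is related to the three other *_post_meta functions, while get_terms has no relations because nothing else shares the Noun terms.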
Granted, the method I was working on requires A LOT of work up-front, building/refining the POS lexicon. But once that up-front work is done, the indexing process is relatively quick (and doesn't require human input).
I built a mostly fully functioning plugin that provides a UI for assigning POS to the phrases generated in step 1. The plugin's intended use is:
- do an import from the sources (i.e., run phpdoc-parser), which generates potential phrases for step 1 above
- assign POS for each phrase (the plugin provides a UI that makes this pretty easy)
- iterate the process, refining the POS lexicon on each iteration
I'll try to find the time to get the plugin to the point where I can release it and get others involved in refining the POS lexicon.
#31
follow-up:
↓ 32
@
7 years ago
Hi @DrewAPicture
I've uploaded the proof of concept plugin again. It seems I had removed it.
I've made some changes to it. The terms now get imported when parsing the codebase with this plugin active. WP_Query is used to get related posts instead of a direct query. The logic for what constitutes a related post has changed too. A post should have at least 3 terms in common and should not be too far removed from the top match.
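The cut-off logic could look something like the Python sketch below. The threshold of 3 terms comes from the description above; how far is "too far removed" is not stated, so the `max_gap` parameter and its default are assumptions, and `filter_related` is a hypothetical name rather than a function from the plugin.

```python
def filter_related(scored, min_common=3, max_gap=2):
    """Keep posts with enough shared terms that are close to the best score.

    `scored` is a list of (post_name, terms_in_common) pairs.
    `max_gap` is an assumed tuning knob, not a value from the plugin.
    """
    if not scored:
        return []
    top = max(score for _, score in scored)
    return [
        (name, score)
        for name, score in scored
        if score >= min_common and top - score <= max_gap
    ]


candidates = [("a", 6), ("b", 5), ("c", 3), ("d", 2)]
kept = filter_related(candidates)
```

With these values "c" passes the 3-term minimum but is dropped for being too far below the top match, and "d" fails the minimum outright.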
To see the related posts results visit this mirror of the reference.
With only the title to go on it works decently for some functions but fails for others. I agree that another taxonomy, or other solutions like @pbiron's, would help in this regard. On its own it's not there yet.
How do you see the extra taxonomy terms added? Manually, by users or mods, or some other way?
#32
in reply to:
↑ 31
;
follow-up:
↓ 33
@
7 years ago
Replying to keesiemeijer:
How do you see the extra taxonomy terms added? Manually, by users or mods, or some other way?
Along the lines of the existing core components list, I'd probably add a prompt on the reference article somewhere asking users to help us improve relevancy by suggesting related components.
There's long been discussion about shoring up the @subpackage tags in the file headers to match the components list so we could do just such "categorization" at the devhub level. The thinking is that initial categorization would happen automatically from the parsed file headers, and more specific drilling-down would happen by associating specific sub-components at the reference level.
So it would certainly be a multi-prong effort, but I think there's value in (at least) pursuing the subpackage audit, even if we don't primarily consider it in defining relevancy to other elements.
#33
in reply to:
↑ 32
@
7 years ago
Replying to DrewAPicture:
If the @subpackage tags were consistent, it would be a good way to group posts together at a higher level (instead of mixing them in with the title and file words).
I've updated the plugin. It now also looks for similar (lowercase) words as the @subpackage tags. I've also created a synonym list of words to connect the posts better. The results have improved but not to a point where we can use it.
With only stemming and synonyms a post can have up to 200 related posts. What's difficult is deciding what's still related and where to set the cut-off point. For now I've set it to show only the top 25 related posts. Because of this it could be that better related words don't show up where you would expect them. Functions like is_* and has_* are mostly related by the other functions in the file they are in. They now also get a synonym of exist to better relate them.
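A synonym step of this kind might be sketched as below in Python. Only the is/has to exist pairing comes from this comment; the shape of the `SYNONYMS` map and the `expand_with_synonyms` helper are assumptions for illustration.

```python
# Hypothetical synonym map; only the is/has -> exist pairing is from this thread.
SYNONYMS = {"is": "exist", "has": "exist"}


def expand_with_synonyms(words, synonyms=SYNONYMS):
    """Append the synonym of each word, skipping ones already present."""
    expanded = list(words)
    for word in words:
        synonym = synonyms.get(word)
        if synonym and synonym not in expanded:
            expanded.append(synonym)
    return expanded


is_page_words = expand_with_synonyms(["is", "page"])
has_tag_words = expand_with_synonyms(["has", "tag"])
shared = set(is_page_words) & set(has_tag_words)
```

After expansion the two title word lists share the term exist, which gives is_* and has_* functions one term in common that their titles alone would not.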
I don't think we can do much more with only stemming and synonyms. I like the idea of adding a prompt to suggest related components or posts. Maybe we can re-use the explanations functionality to moderate these suggestions. I will look into it.
We should also think about what a components list would look like.
#34
@
7 years ago
I've come to the point where I think this isn't the right way to go about it. Generally the query for related posts is too expensive and the results vary too much.
As an experiment, to fix the expensive query, I've tried importing related post ids into post meta after the parser was done. Similarly to how the Used By and Uses posts are connected after the parser has imported all posts. It took about two seconds to import related posts for every post. This gives you an idea how expensive the query is. It would have taken around 4 hours to import all related posts.
I think letting users connect related functions as mentioned in comment 25 would lead to better results with less effort.
#35
@
7 years ago
- Type changed from task to enhancement
@keesiemeijer @DrewAPicture Would you mind updating the ticket description with what it is that we're looking for here?
Relatedness is a simple concept on the surface, but far from trivial to implement, more so for atypical use cases (like ours).
Overall we are trying to:
There are two implementations closely related to the use case (that I can think of right away):
The one approach I am interested in, but haven't seen tried yet, is to simply take all the data we already have (like file structure) and feed it into one of the generic related posts plugins. It would be highly automated, but we will have to try it to see the quality of the results.
So these three (or combination of) are essentially choices I see on the table.