Opened 8 years ago
Last modified 8 months ago
#2686 new enhancement
Plugin Search: Search should also take into account the number of ratings
Reported by: | bfintal | Owned by: | |
---|---|---|---|
Milestone: | Improved Search | Priority: | normal |
Component: | Plugin Directory | Keywords: | needs-patch |
Cc: |
Description
The overall rating is used during plugin searches while the number of ratings isn't being taken into account.
Because of this, a plugin that has only one or two 5-star ratings would rank higher (assuming all other things are similar) than a plugin that has a collective 4.5 stars from 50 ratings.
I would like to suggest that the number of ratings be used alongside the overall rating, where a higher number of ratings would have more impact.
There should be a threshold though, so as not to always favor the more established plugins. Perhaps if 20 ratings have been reached, then the number of ratings should no longer have any impact. We can probably safely assume that the rating has evened out after 20 ratings. For example, if a 4.5 star plugin has 200 ratings, then it shouldn't have much of an advantage over a 4.4 star plugin that has 20 ratings.
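The proposal above can be read as a confidence weight that grows with the number of ratings and stops growing once the threshold is reached. Here is a minimal sketch in Python, assuming a linear ramp and the example threshold of 20. The function names and the ramp shape are illustrative assumptions, not part of the plugin directory's search code:

```python
def rating_confidence(num_ratings, threshold=20):
    """Return a weight in [0, 1] that saturates once `threshold` ratings exist."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(max(num_ratings, 0), threshold) / threshold


def weighted_rating(avg_rating, num_ratings, threshold=20):
    """Scale the average rating by how much of the threshold has been reached."""
    return avg_rating * rating_confidence(num_ratings, threshold)


# Two 5-star ratings versus 4.5 stars from 50 ratings (hypothetical plugins).
print(weighted_rating(5.0, 2))
print(weighted_rating(4.5, 50))
```

With this shape, 200 ratings carry no more weight than 20, so a 4.5-star plugin with 200 ratings only leads a 4.4-star plugin with 20 ratings by the difference in their averages.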
Change History (27)
#2
@
8 years ago
Hey @gibrown, thanks for replying and your work on this area :)
I think this mostly happens in somewhat long-tail searches where there are fewer big players. The advantage also isn't enormous, but it is present.
My suggestion isn't meant to push plugin creators to ask for more ratings; I'm not in favor of that. Instead, it's so that having only a handful of ratings (maybe from the author themselves, a friend, or a few testers or early adopters) won't immediately result in an advantage. That's why I mentioned a cutoff like 20 ratings (I just pulled that number off the top of my head). My logic is that if a plugin has fewer than 20 ratings, its overall rating should have a lower impact than that of plugins with 20 or more ratings.
For example:
(I've got nothing against these plugins, I'm just using them as examples)
- "Responsive Menu": https://wordpress.org/plugins/search/responsive+menu/
Awesome Responsive Menu is higher than WP Mobile Menu. It may have something to do with the name of the plugin not matching the word "Responsive" though. But I think the amount of ratings (6 vs 45) should make WP Mobile Menu appear higher.
- "Page Builder": https://wordpress.org/plugins/search/page+builder/
Full disclosure, my plugin there is Page Builder Sandwich so I won't dive into details in this one :)
- "Under Construction": https://wordpress.org/plugins/search/under+construction/
Eazy Under Construction, with one 5-star rating, outranks a lot of similar plugins on page 2 that have 30+ ratings. Although the naming may have something to do with this, maybe the rating bumped it higher.
As for replying to ratings, I'm not sure about this. I imagine creators replying generic stuff like "Thanks for the rating, we'll do our best to blah blah blah" to 5-star ratings and actual replies would go to 4-star ratings and lower. Replying to support threads might be a better way to go, although this is taken into account already.
#3
@
8 years ago
Great examples, thanks. I'll take a look at these in the context of other changes. It's all pretty small adjustments which I think are going to be really hard to hand tune. But I'd bet that we have a lot of cases where some minor adjustments would help.
#4
@
8 years ago
Note: we encourage plugin authors to "ask" for reviews. We only tell them not to pay for them in some way, even by offering free stuff in exchange. We actively remove reviews that are paid for when we find out about them.
#5
@
8 years ago
@gibrown I think it's pretty much about small adjustments and refinements now, since the majority of search queries give great results for the most part. I would put out a patch, but I can't find the indices being used anywhere. I found a gist that you posted 8-something months ago, but that seems outdated already. I think those are private and in wpcom.
@Otto42 Let me correct myself, what you said is what I meant. I've only added the reasoning that a small amount of reviews won't be a good enough sample size or indicator of an overall rating :)
#6
follow-up:
↓ 7
@
8 years ago
@bfintal the index can get queried directly via a .com api. Here's some code I use to run bulk sets of queries against the index for testing the latest search alg: https://gist.github.com/gibrown/9b54444cb23fb61f4e6513a45163e98c
If you have adjustments to the query, you can also just post them here and I'll try them against the 3000 queries I've been using in my testing to see how they work on a wider scale.
#7
in reply to:
↑ 6
@
8 years ago
@gibrown
I'm not sure if the number of ratings is indexed, so I can't test with the scripts you gave. But here's what I have in mind. This is meant to replace the last field_value_factor for the rating field:
{ "filter": { "range": { "ratings_num": { "gte": 20 } } }, "field_value_factor": { "field": "rating", "factor": 0.25, "modifier": "sqrt", "missing": 2.5 } }
The 20 there is an arbitrary value and can be replaced with anything that can be deemed a minimum sample size for an overall rating; it could be 10 or 8. This is to prevent plugins with only one or a handful of ratings from being favored. I can't find the field for the number of ratings, so I wrote it as ratings_num above to illustrate what I have in mind.
#8
@
8 years ago
Ah ya an example doc would probably help.
Here is Jetpack: https://gist.github.com/gibrown/f236af8a1e5365f7c8493ad601d25b74
And here are the mappings for the index: https://gist.github.com/gibrown/49262d39edb5ce7936fc001990ece560
At a quick glance, my main concern is that having a fixed number as a filter creates a step function and I think that in general we'd do better to have smooth functions in how we rank things. So I'd rather have a decay for small numbers of ratings.
#9
@
8 years ago
@gibrown
Good point and thanks for the links. Here's a smoother one that applies a multiplier ranging from 0.75 to 1.0 when the number of ratings is below 20. This one actually works with the script you gave. The value 20 in the range filter, as well as the offset and decay, can be fine-tuned further. The decay can be increased to a higher value to lessen the impact of this filter.
{ "field_value_factor": { "field": "rating", "factor": 0.25, "modifier": "sqrt", "missing": 2.5 } }, { "filter": { "range": { "num_ratings": { "lte": 20 } } }, "gauss": { "num_ratings": { "origin": 20, "offset": 0, "scale": 20, "decay": 0.75 } } }
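To see what multiplier the gauss clause above would produce, here is a minimal sketch in Python of the Gaussian decay as Elasticsearch documents it: the score is 1.0 at `origin` and falls to `decay` at a distance of `scale` from it. The helper names are illustrative, and the treatment of plugins outside the range filter (multiplier of 1.0) is an assumption about how the surrounding function_score query combines functions:

```python
import math


def gauss_decay(value, origin, scale, decay, offset=0.0):
    """Gaussian decay: 1.0 at `origin`, `decay` at distance `scale` from it."""
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))


def ratings_multiplier(num_ratings, cutoff=20, decay=0.75):
    """Multiplier for the rating term, using the values from the query above."""
    # Assumption: documents that fail the range filter are left unchanged.
    if num_ratings > cutoff:
        return 1.0
    return gauss_decay(num_ratings, origin=cutoff, scale=cutoff, decay=decay)


for n in (0, 1, 5, 10, 20, 50):
    print(n, round(ratings_multiplier(n), 3))
```

Because the distance from the origin equals the scale when there are zero ratings, the multiplier bottoms out at the decay value (0.75) and rises smoothly to 1.0 at 20 ratings, which avoids the step created by a hard filter.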
#10
@
7 years ago
- Summary changed from Plugin search should also take into account the number of ratings to Plugin Search: Search should also take into account the number of ratings
#11
follow-up:
↓ 12
@
7 years ago
Ah ha!
Wilson score confidence interval for a Bernoulli parameter from http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
I think I will try putting something like this into the index. I need to figure out how to make it work on a 5-star scale, but this addresses a lot of my concerns with hard-coding any limits. Noting here mostly as a reminder.
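For the 5-star question, one possible approach is to rescale the average rating to a fraction between 0 and 1, treat that fraction as the success proportion, take the Wilson lower bound, and map the result back to the star scale. Here is a minimal sketch in Python, assuming that rescaling is acceptable and leaving out any continuity correction. The function names are illustrative and not part of the plugin directory code:

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(successes, total, confidence=0.95):
    """Lower bound of the Wilson score interval for a proportion."""
    if total <= 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = min(1.0, max(0.0, successes / total))
    centre = phat + z * z / (2 * total)
    margin = z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)


def rating_lower_bound(avg, count, lo=1.0, hi=5.0, confidence=0.95):
    """Map an average on a lo..hi star scale through the Wilson lower bound."""
    if count <= 0:
        return lo
    fraction = (avg - lo) / (hi - lo)
    bound = wilson_lower_bound(fraction * count, count, confidence)
    return lo + (hi - lo) * bound


# The example from the ticket description: two 5-star ratings versus
# an average of 4.5 stars from 50 ratings.
print(rating_lower_bound(5.0, 2))
print(rating_lower_bound(4.5, 50))
```

Ranking by this lower bound rather than by the raw average means a plugin needs both a high average and enough ratings to rise, without any hard-coded cutoff.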
#14
follow-up:
↓ 15
@
4 years ago
A kinda similar problem exists with support threads. We currently use support_threads_resolved which both counts number of threads as well as whether they are getting resolved or not. IIRC I also played around with the percentage of resolved threads, but there were a bunch of cases where that made things a lot worse. Particularly when there were small numbers of support threads, or many support threads and only 50% were resolved.
An example brought up here: https://meta.trac.wordpress.org/ticket/2753#comment:38
- search for "translate" https://wordpress.org/plugins/search/translate/
- translatepress-multilingual has 80k active sites and is 2nd: https://wordpress.org/plugins/translatepress-multilingual/
- gtranslate has 200k active sites and is 3rd https://wordpress.org/plugins/gtranslate/
There is a decent argument that the order should be reversed. I'm not entirely sure I agree that this is a problem caused by the support threads. The ratings count seems like the bigger problem to me. Either way it is kinda the same root problem with the alg. It would probably make sense that if we can fix the rating we should do something similar for the support threads.
If anyone has the time to create the php code for calculating a Wilson score confidence interval for a Bernoulli parameter that would help make this happen.
#15
in reply to:
↑ 14
@
4 years ago
Replying to gibrown:
If anyone has the time to create the php code for calculating a Wilson score confidence interval for a Bernoulli parameter that would help make this happen.
Here is a PHP implementation of https://github.com/instacart/wilson_score, hope this helps
<?php
class WilsonScore {
    public function interval($k, $n, $options = array()) {
        if($n == 0) {
            echo "second parameter cannot be 0\n";
            return false;
        }

        $confidence = isset($options['confidence']) ? $options['confidence'] : 0.95;
        $correction = isset($options['correction']) ? $options['correction'] : true;

        $z = $this->pnorm(1 - (1 - $confidence) / 2);
        $phat = $k / $n;
        $z2 = $z**2;

        // continuity correction
        if($correction) {
            $a = 2 * ($n + $z2);
            $b = 2 * $n * $phat + $z2;
            $c = $z * sqrt($z2 - 1/$n + 4 * $n * $phat * (1 - $phat) + (4 * $phat - 2)) + 1;
            $d = $z * sqrt($z2 - 1.0/$n + 4 * $n * $phat * (1 - $phat) - (4 * $phat - 2)) + 1;
            $lower = ($phat == 0 ? 0 : max(0, ($b - $c) / $a));
            $upper = ($phat == 1 ? 1 : min(1, ($b + $d) / $a));
            return array($lower, $upper);
        } else {
            $a = 1 + $z2 / $n;
            $b = $phat + $z2 / (2 * $n);
            $c = $z * sqrt(($phat * (1 - $phat) + $z2 / (4 * $n)) / $n);
            return array(($b - $c) / $a, ($b + $c) / $a);
        }
    }

    /**
     * from the statistics2 gem
     * https://github.com/abscondment/statistics2/blob/master/lib/statistics2/base.rb
     * inverse of normal distribution ([2])
     * Pr( (-\infty, x] ) = qn -> x
     */
    public function pnorm($qn) {
        $b = array(1.570796288, 0.03706987906, -0.8364353589e-3,
                   -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
                   -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
                   0.3657763036e-10, 0.6936233982e-12);

        if($qn < 0.0 or 1.0 < $qn) {
            echo "Error: qn <= 0 or qn >= 1 in pnorm()!\n";
            return 0;
        }

        if($qn == 0.5)
            return 0;

        $w1 = $qn;
        if($qn > 0.5)
            $w1 = 1 - $w1;

        $w3 = -log(4 * $w1 * (1 - $w1));
        $w1 = $b[0];
        for($i = 1; $i <= 10; $i++)
            $w1 += $b[$i] * $w3 ** $i;

        if($qn > 0.5)
            return sqrt($w1 * $w3);

        return -sqrt($w1 * $w3);
    }

    public function lower_bound($k, $n, $options = array()) {
        return $this->interval($k, $n, $options)[0];
    }

    public function rating_interval($avg, $n, $score_range, $options = array()) {
        $confidence = isset($options['confidence']) ? $options['confidence'] : 0.95;
        $correction = isset($options['correction']) ? $options['correction'] : true;

        $min = $score_range[0];
        $max = $score_range[1];
        $range = $max - $min;

        $interval = $this->interval($n * ($avg - $min) / $range, $n, array('confidence' => $confidence, 'correction' => $correction));

        return array(($min + $range * $interval[0]), ($min + $range * $interval[1]));
    }

    public function rating_lower_bound($avg, $n, $score_range, $options = array()) {
        return $this->rating_interval($avg, $n, $score_range, $options)[0];
    }
}
Tests:
<?php
include 'wilson_score.php';

$wilson = new WilsonScore;

function floats_eq($a, $b) {
    return (abs($a - $b) < 0.0001);
}

function test_n($condition, &$n) {
    if($condition === true)
        echo "$n -> pass\n";
    else
        echo "$n -> fail\n";
    $n++;
}

// test 1
$t = 1;
$interval = $wilson->interval(1, 2, array('correction' => false));
test_n(floats_eq(0.0945, $interval[0]) and floats_eq(0.9055, $interval[1]), $t);

// test2
$lower_bound = $wilson->lower_bound(1, 2);
test_n(floats_eq(0.0267, $lower_bound), $t);

// test3
$interval = $wilson->interval(1, 2);
test_n(floats_eq(0.0267, $interval[0]) and floats_eq(0.9733, $interval[1]), $t);

// test4
$interval = $wilson->interval(0, 1);
test_n(floats_eq(0, $interval[0]) and floats_eq(0.9454, $interval[1]), $t);

// test5
$interval = $wilson->interval(0, 10);
test_n(floats_eq(0, $interval[0]) and floats_eq(0.3445, $interval[1]), $t);

// test6
$interval = $wilson->interval(1, 10);
test_n(floats_eq(0.0052, $interval[0]) and floats_eq(0.4588, $interval[1]), $t);

// test7
$interval = $wilson->interval(1, 50);
test_n(floats_eq(0.0010, $interval[0]) and floats_eq(0.1201, $interval[1]), $t);

// test8
$interval = $wilson->interval(1, 1);
test_n(floats_eq(0.0546, $interval[0]) and floats_eq(1, $interval[1]), $t);

// test9
$interval = $wilson->interval(1, 1);
test_n(floats_eq(0.0546, $interval[0]) and floats_eq(1, $interval[1]), $t);

// test10
$interval = $wilson->interval(1, 3);
test_n(floats_eq(0.0176, $interval[0]) and floats_eq(0.8747, $interval[1]), $t);

// test11
$interval = $wilson->rating_interval(5, 1, array(1, 5), array('correction' => false));
test_n(floats_eq(1.8262, $interval[0]) and floats_eq(5, $interval[1]), $t);

// test12
$interval = $wilson->rating_interval(3.7, 10, array(1, 5), array('correction' => false));
test_n(floats_eq(2.4998, $interval[0]) and floats_eq(4.5117, $interval[1]), $t);

// test13
$rating_lower_bound = $wilson->rating_lower_bound(5, 1, array(1, 5), array('correction' => false));
test_n(floats_eq(1.8262, $rating_lower_bound), $t);

// test14
test_n($wilson->interval(0, 0) == false, $t);
Test results (https://github.com/instacart/wilson_score/blob/master/test/wilson_score_test.rb):
1 -> pass
2 -> pass
3 -> pass
4 -> pass
5 -> pass
6 -> pass
7 -> pass
8 -> pass
9 -> pass
10 -> pass
11 -> pass
12 -> pass
13 -> pass
second parameter cannot be 0
14 -> pass
#16
@
4 years ago
Hi @gibrown,
Have you checked the PHP implementation? If there are any other show stoppers, please let me know. I guess now it should be an easy fix.
Thanks!
This ticket was mentioned in Slack in #meta by edo888. View the logs.
4 years ago
#21
@
4 years ago
- Priority changed from high to normal
- Type changed from defect to enhancement
Sorry I didn't get back to you here. This really isn't a bug, so changing that back.
I mostly got that code rewritten to work for indexing, but as I was working on it I realized how much testing of the algorithm probably needs to get done/re-done along with this change. Changing the search algorithm is not really something we should treat as a quick "fix". Either this change really makes a difference (in which case we need to do some similar testing to what was done before), or it just isn't that high of a priority. I'm also guessing that looking at the search quality will reveal a bunch of other things to work on for better or worse.
#22
@
4 years ago
Not sure I understand you well. Is the change for indexing or for final sorting with score_query_by_recency function?
I'm mostly worried about this code:
'field_value_factor' => array( 'field' => 'support_threads_resolved', 'factor' => 0.25, 'modifier' => 'log2p', 'missing' => 0.5, )
I want to compare 2 plugins A and B:
A has 100 resolved tickets out of 200 and has 4.5 average rating from 300 reviews.
B has 20 resolved tickets out of 20 and has 4.9 average rating from 2000 reviews.
So for A and B the multiplier from support threads resolved are:
Math.log10(100 + 2) = 2.0 -- A
Math.log10(20 + 2) = 1.34 -- B
It seems to make a real difference.
Now let's calculate the rating lower bound with new algorithm for plugin A and B:
$wilson->rating_lower_bound(4.5, 300, array(1, 5), array('correction' => true)) = 4.33 # A
$wilson->rating_lower_bound(4.9, 2000, array(1, 5), array('correction' => true)) = 4.87 # B
So the multiplier is sqrt(4.33) vs sqrt(4.87) for rating.
'field_value_factor' => array( 'field' => 'rating', 'factor' => 0.25, 'modifier' => 'sqrt', 'missing' => 2.5, )
When you multiply you get a score:
New score with Wilson lower bound applied for rating only:
2 x sqrt(4.33) = 4.16 vs 1.34 x sqrt(4.87) = 2.96
Old score:
2 x sqrt(4.5) = 4.24 vs 1.34 x sqrt(4.9) = 2.96
I see there is some improvement (2%), but the same needs to be done also for support threads as you have mentioned and not just the rating which will totally change it.
So if you count the lower bound for support threads it will be:
$wilson->lower_bound(100, 200, array('correction' => true)) = 0.43 # A
$wilson->lower_bound(20, 20, array('correction' => true)) = 0.8 # B
And now new total score with new algorithm for rating and support threads will be:
0.43 x sqrt(4.33) = 0.89 vs 0.8 x sqrt(4.87) = 1.76 which already looks fair.
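The arithmetic above can be reproduced with a short sketch in Python. It follows the comment's own simplification, which multiplies only the support-thread term by the rating term and leaves out the `factor` of 0.25. In Elasticsearch's field_value_factor the factor is applied to the field value before the modifier, so the `factor` parameter below is exposed and defaults to 1.0 to match the comment. The Wilson lower bounds are the figures quoted in the comment, and the function names are illustrative:

```python
import math


def log2p(value, factor=1.0):
    """Elasticsearch `log2p` modifier: log10(factor * value + 2)."""
    return math.log10(factor * value + 2)


def score(support_term, rating_value):
    """Multiply the support-thread term by the sqrt-modified rating term."""
    return support_term * math.sqrt(rating_value)


# Current scoring: raw resolved-thread count and raw average rating.
old_a = score(log2p(100), 4.5)  # A: 100 resolved threads, 4.5 average
old_b = score(log2p(20), 4.9)   # B: 20 resolved threads, 4.9 average

# Proposed scoring: Wilson lower bounds for both terms (values from the comment).
new_a = score(0.43, 4.33)
new_b = score(0.80, 4.87)

print(f"old: A={old_a:.2f} B={old_b:.2f}")
print(f"new: A={new_a:.2f} B={new_b:.2f}")
```

Under the current terms A comes out ahead of B, while with both lower bounds applied the order flips, which is the point the comment is making.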
I can see that you have decreased the priority to normal, which is sad.
Thanks!
This ticket was mentioned in Slack in #meta by edo888. View the logs.
4 years ago
#25
@
21 months ago
An additional idea here is whether we can give recent reviews more weight than older reviews. So the current state of a plugin is not completely outweighed by past reviews. I think this would mean having multiple review scores over different time periods and combining them in some way. I am not 100% positive how this would influence scores, but it probably would encourage behavior we want (maintaining a good plugin experience).
Something to experiment with that I think could be done.
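One way to picture "multiple review scores over different time periods" is to average the ratings inside a few windows and blend the window averages with fixed weights. Here is a minimal sketch in Python. The window lengths, the weights, and the function name are all hypothetical choices made for illustration, not values anyone in this ticket has proposed:

```python
from statistics import mean

# Hypothetical windows: (maximum review age in days, weight).
# None means "all reviews regardless of age".
WINDOWS = [(90, 0.5), (365, 0.3), (None, 0.2)]


def windowed_rating(reviews, windows=WINDOWS):
    """Blend per-window average ratings; `reviews` is a list of (stars, age_days)."""
    total = 0.0
    weight_sum = 0.0
    for max_age, weight in windows:
        stars = [s for s, age in reviews if max_age is None or age <= max_age]
        if stars:  # windows without any reviews do not contribute
            total += weight * mean(stars)
            weight_sum += weight
    return total / weight_sum if weight_sum else None


# Ten old 5-star reviews followed by two recent 1-star reviews.
reviews = [(5, 800)] * 10 + [(1, 10)] * 2
print(windowed_rating(reviews))
print(mean(s for s, _ in reviews))
```

In this example the blended score sits well below the all-time average, so a recent run of poor reviews is not completely outweighed by older ones.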
This ticket was mentioned in Slack in #meta by gibrown. View the logs.
21 months ago
#27
@
8 months ago
I think this would mean having multiple review scores over different time periods and combining them in some way. I am not 100% positive how this would influence scores, but it probably would encourage behavior we want (maintaining a good plugin experience).
Just noting that this idea was raised in #6851, and it probably makes sense to continue the discussion over there. Whether that directly influences the scoring (or whether a different alg needs to be used, as above), I'm unsure.
Do you have any examples of searches and poor results where you think this will make a significant difference?
Otherwise I think this will just encourage plugins to ask for more ratings.
It came up earlier to just look at total number of one star ratings rather than trying to boost. Or on whether the ratings got replied to. Whether something is getting replied to seems like a good metric really. It indicates that the plugin author is paying attention, and is trying to give real support to real users. It is also something that is under the plugin author's control and doesn't reward folks for nudging lots of users to just give them reviews with no content in them.