Opened 6 years ago
Closed 6 years ago
#4450 closed defect (bug) (invalid)
Does WordPress.org Plugin Repo Elasticsearch function_score penalize plugins with fewer than one million installs?
Priority: normal
Component: Plugin Directory
Description
I was recently looking over the source code for the Plugin Repo's Elasticsearch function_score query. If I understand correctly, the query penalizes plugins with fewer than one million active installs, but the comments in the code suggest otherwise.
The filter clause in the Elasticsearch query applies the exponential decay scoring function to plugins with 1000000 or fewer active installs. For a plugin with 500000 (five hundred thousand) active installs, plugging all the values into the exponential decay function, following Elasticsearch's example, gives:
custom score = e^((ln(decay) / scale) * max(0, |actual_value - origin| - offset))

decay = 0.75
scale = 900000
actual_value = 500000
origin = 1000000
offset = 0

e^((ln(0.75) / 900000) * max(0, |500000 - 1000000| - 0)) = 0.8522943134
For Google Sheets:
EXP(LN(0.75)/900000 * MAX(0, ABS(500000 - 1000000) - 0)) = 0.8522943134
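The same arithmetic can be checked outside a spreadsheet. The following is a minimal Python sketch of the formula exactly as written above; the function name exp_decay and its parameter names are my own and are not taken from the Plugin Directory source or from Elasticsearch.

```python
import math


def exp_decay(value, origin, scale, decay, offset=0.0):
    """Exponential decay multiplier, following the formula quoted above.

    The name and signature are illustrative assumptions, not an
    Elasticsearch API.
    """
    # Distance from the origin, ignoring anything inside the offset band.
    distance = max(0.0, abs(value - origin) - offset)
    # e^((ln(decay) / scale) * distance)
    return math.exp(math.log(decay) / scale * distance)


# Values from the ticket: a plugin with 500000 active installs.
score = exp_decay(500_000, origin=1_000_000, scale=900_000, decay=0.75)
print(round(score, 10))
```

With these inputs the result agrees with the 0.8522943134 figure above, and a plugin sitting exactly at the origin gets a multiplier of 1.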
The resulting score is multiplied, along with the other calculated factors, with the document relevance score that Elasticsearch returns based on how well the search input matches the plugin's text content. If my understanding of the exponential decay function and my math are correct, the plugin's document relevance score is reduced to about 85% of what it would otherwise be. This multiplier is neither calculated nor applied for plugins with more than 1000000 active installs.
If I have misunderstood this query scoring, I would be grateful to have my understanding and my math corrected.
Change History (4)
#2
@
6 years ago
@jadonn overall there is a boost based on the number of active sites. There is another term which boosts based on the log of the number of sites. The exponential is there to better differentiate plugins in the range from 100k to 1m. Above 1m it didn't really matter much (that is the comment you are referring to).
There are some graphs and a bunch of the thinking behind the search algorithm here: https://data.blog/2017/03/15/improving-relevance-and-elasticsearch-query-patterns/
#4
@
6 years ago
- Resolution set to invalid
- Status changed from new to closed
Thank you for the information @otto42 and @gibrown! (And thank you @dd32 for cleaning up the description.)
I had already seen that plugins get an increase in score from having more active installs. However, the two factors are multiplied together (this is the default behavior for Elasticsearch's function_score when no alternative behavior has been specified). So even though more active installs help, plugins with fewer than one million installs end up with an apparently artificially decreased score, which makes plugins with more than one million active installs more likely to appear in search results.
I did misunderstand the comments, then, I suppose. I thought they were intended to reduce the advantage of having over one million active installs, but, as @gibrown said, it is the logarithmic behavior of the active installs boost that reduces the advantage of having over one million active installs.
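To make the interaction between the two factors concrete, here is a rough Python sketch of how a logarithmic install boost and the filtered exponential decay combine when multiplied. The decay parameters come from the description above. The log_boost function is a hypothetical stand-in, since the real boost's parameters are not given in this ticket, and treating the decay as a neutral 1.0 above one million installs assumes the function scores are combined by multiplication.

```python
import math

# Decay parameters taken from the ticket description.
ORIGIN = 1_000_000
SCALE = 900_000
DECAY = 0.75


def decay_multiplier(installs):
    """Exponential decay, applied only when installs <= ORIGIN (the filter)."""
    if installs > ORIGIN:
        # Filter does not match, so the function contributes nothing.
        # Under multiplication that is equivalent to a factor of 1.0.
        return 1.0
    return math.exp(math.log(DECAY) / SCALE * abs(installs - ORIGIN))


def log_boost(installs):
    """HYPOTHETICAL logarithmic boost; the real parameters are not in this ticket."""
    return math.log10(2 + installs)


def combined(installs):
    """Product of the two install-based factors."""
    return log_boost(installs) * decay_multiplier(installs)


for installs in (1_000, 10_000, 100_000, 500_000, 1_000_000, 5_000_000):
    print(f"{installs:>9}  decay={decay_multiplier(installs):.4f}  "
          f"combined={combined(installs):.4f}")
```

Under these assumptions the combined factor never decreases as installs grow: the decay term climbs toward 1.0 at one million installs and stays there, so past that point only the slow logarithmic term still changes.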
You use the word "penalize" here, but really, what this does is give more of a boost to those with more active installs, with a maximum of 1 million installs being relevant.
In other words, this seems to me to make active install numbers actually count. Above a million, more installs no longer benefit you.
Whether you are adjusting the scores up or down is kind of irrelevant. Counting more installs as higher is the goal, with the 1 million maximum being relevant.
cc @gibrown