Opened 5 years ago

Closed 5 years ago

#4450 closed defect (bug) (invalid)

Does Plugin Repo Elasticsearch function_score penalize plugins with fewer than one million installs?

Reported by: jadonn Owned by:
Milestone: Priority: normal
Component: Plugin Directory Keywords:

Description (last modified by dd32)

I was recently looking over the source code for the Plugin Repo's Elasticsearch function_score query. If I understand it correctly, the query penalizes plugins with fewer than one million active installs, although the comments in the code suggest otherwise. The filter clause in the query applies the exponential decay scoring function to plugins with 1000000 or fewer active installs. For a plugin with 500000 (five hundred thousand) active installs, plugging the values into the formula from Elasticsearch's example gives the following:

custom score = e^((ln(decay)/scale) * max(0, |actual_value - origin| - offset))

decay = 0.75
scale = 900000
actual_value = 500000
origin = 1000000
offset = 0

e^((ln(0.75)/900000) * max(0, |500000 - 1000000| - 0)) = 0.8522943134

For Google Sheets:
EXP(LN(0.75)/900000 * MAX(0, ABS(500000 - 1000000) - 0)) = 0.8522943134
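
The same arithmetic can be checked outside a spreadsheet. The following is a minimal Python sketch of the formula above; the function name `exp_decay` and its parameter names are chosen for illustration and are not taken from the Plugin Directory source.

```python
import math


def exp_decay(installs, origin=1_000_000, scale=900_000, decay=0.75, offset=0):
    """Exponential decay multiplier, using the formula and values quoted above."""
    distance = max(0, abs(installs - origin) - offset)
    return math.exp(math.log(decay) / scale * distance)


if __name__ == "__main__":
    # 500000 installs: about 0.8523, matching the hand calculation.
    print(exp_decay(500_000))
    # At a distance of exactly `scale` from the origin, the multiplier equals `decay`.
    print(exp_decay(100_000))
    # At the origin itself there is no reduction.
    print(exp_decay(1_000_000))
```

With these values the multiplier is 0.75 at 100000 installs and rises to 1.0 at 1000000 installs.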

The resulting score is multiplied, along with other calculated factors, with the document relevance score that Elasticsearch returns based on how well the search input matches the plugin's text content. If my understanding of the exponential decay function and my math are correct, the plugin's relevance score is reduced to about 85% of what it would otherwise be. This multiplier is not calculated or applied for plugins with more than 1000000 active installs.

If I have misunderstood this query scoring, I would be grateful to have my understanding and my math corrected.

Change History (4)

#1 @Otto42
5 years ago

You use the word "penalize" here, but really, what this does is give more of a boost to those with more active installs, with a maximum of 1 million installs being relevant.

In other words, this seems to me to make active install numbers actually count. Above a million, more installs no longer benefit you.

Whether you are adjusting the scores up or down is kind of irrelevant. Counting more installs as higher is the goal, with the 1 million maximum being relevant.

cc @gibrown

#2 @gibrown
5 years ago

@jadonn Overall, there is a boost based on the number of active sites. There is another term which boosts based on the log of the number of sites. The exponential is there to better differentiate plugins in the range from 100k to 1m. Above 1m it didn't really matter much (that is the comment you are referring to).

There are some graphs and a bunch of the thinking behind the search alg here:

#3 @dd32
5 years ago

  • Description modified (diff)

#4 @jadonn
5 years ago

  • Resolution set to invalid
  • Status changed from new to closed

Thank you for the information @otto42 and @gibrown! (And thank you @dd32 for cleaning up the description.)

I had seen that plugins get an increase in score from having more active installs. However, the two factors are multiplied together (this is the default behavior for Elasticsearch's function_score when no alternative behavior has been specified). So even though more active installs help, plugins with fewer than one million installs have an apparently artificially decreased score, which makes plugins with more than one million active installs more likely to appear in search results.
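
To illustrate how the two multiplied factors interact, here is a rough Python sketch. The decay values come from the description above, but the logarithmic term is an assumption: `log_boost` uses log10(1 + installs) purely as a stand-in, since the actual form and parameters of that term are not given in this ticket. Treating the decay as a factor of 1.0 above one million installs is likewise a simplification of "not applied".

```python
import math

ORIGIN = 1_000_000  # filter threshold and decay origin from the description
SCALE = 900_000
DECAY = 0.75


def exp_decay(installs):
    """Exponential decay multiplier from the ticket description (offset = 0)."""
    distance = abs(installs - ORIGIN)
    return math.exp(math.log(DECAY) / SCALE * distance)


def log_boost(installs):
    """Hypothetical logarithmic boost; the real term's form is not in the ticket."""
    return math.log10(1 + installs)


def combined_factor(installs):
    """Product of the two factors, mirroring the multiply behavior described above."""
    if installs <= ORIGIN:
        return log_boost(installs) * exp_decay(installs)
    # Above the threshold the decay function is not applied (modeled as 1.0).
    return log_boost(installs)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 500_000, 1_000_000, 5_000_000):
        print(n, round(combined_factor(n), 4))
```

Under these assumptions the combined factor still rises with every increase in installs. Below one million the decay term adds extra spread between plugins, and above one million only the slow logarithmic growth remains.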

I did misunderstand the comments, then, I suppose. I thought they were intended to reduce the advantage of having over one million active installs, but as @gibrown said the logarithmic behavior of the active installs boost reduces the advantage of having over one million active installs.
