Making WordPress.org

Opened 7 years ago

Closed 5 years ago

Last modified 3 years ago

#2823 closed defect (bug) (fixed)

Improve IP Geolocation Results

Reported by: iandunn's profile iandunn Owned by: iandunn's profile iandunn
Milestone: Priority: normal
Component: Events API Keywords:
Cc:

Description

There are a couple problems with our current data set. They're not strictly related to each other, but I'm creating a single ticket for this because I'm guessing that the solution to both will be the same: switch from our current data source to a better one.

It looks like our database was updated recently (March 2017), but I can't find any info on where the original data comes from. @dd32, @tellyworth, do either of you know? Do you have any thoughts on other potential sources?

Attachments (3)

2823-ipv6.1.diff (3.1 KB) - added by coreymckrill 7 years ago.
2823.diff (33.2 KB) - added by dd32 7 years ago.
2823.2.diff (30.3 KB) - added by dd32 7 years ago.
Approach using Google Geocode to power user lookups instead of attempting to roll our own geocode service - Google does it better, and doesn't receive any personal user details with a server-side request

Download all attachments as: .zip

Change History (57)

#1 @Otto42
7 years ago

If you're using the data I think you're using, then it almost certainly comes from the free data set at http://www.geonames.org/

#2 @Otto42
7 years ago

Oh, and the ip2location probably comes from here: http://lite.ip2location.com/database/ip-country-region-city-latitude-longitude

May or may not be the free one. Not sure.

#3 @iandunn
7 years ago

  • Owner set to iandunn
  • Status changed from new to accepted

Ah, yeah, it looks like IP2LOCATION-LITE-DB5.csv. It could also be the commercial version, but I'll need to look closer when I have time. Thanks!

They've got IPv6 available. If we're not already using the commercial version, then it's probably worth upgrading.

#4 @Otto42
7 years ago

At the time I first encountered those years ago, they were an outdated copy of the free version. I updated the data to the latest copy, but then we didn't use it for much, so I let it go. Looking at it now, it appears to have many more rows than the free version, so I suspect somebody updated it to the non-free version. They may still have access to that and can get the big dataset, if we know who updated it last.

#5 @dd32
7 years ago

@tellyworth did you update that data during development? I seem to recall you wrote some kind of importer for some data source for something remotely related to this?

#6 @iandunn
7 years ago

There's a bin script for updating the geonames data, but I didn't see anything for the IP geolocation.

#7 @iandunn
7 years ago

IP2LOCATION-LITE-DB5.csv also has the names in proper Sentence Case, rather than all uppercase, so it's definitely not that source, unless something went really wrong during the import.

Having the data improperly cased is another problem that would be solved by upgrading.

This ticket was mentioned in Slack in #core by iandunn. View the logs.


7 years ago

This ticket was mentioned in Slack in #core by dd32. View the logs.


7 years ago

#12 follow-up: @barry
7 years ago

Howdy,

There are a few problems here:

1) The ip2location data on WP.org is very very old.
2) WP.org doesn't have a subscription/license to use the data that I am aware of.
3) I don't think there is any legitimate use case to pass an ip in the query string - this just opens up the API to abuse and most likely violates the terms/EULA of using this data.
4) The query performance is very poor and probably won't scale to the volume that we need. For ip2location specifically we have already solved this problem on WordPress.com, so I would suggest looking at the code there. I haven't looked at the geonames code much, but I suspect it suffers from similar issues.
5) We need a way to keep the information always current if we are going to use it. It changes enough on a regular basis that once it's > 60 days old it starts becoming noticeably wrong.

The first problem to solve is #2. I don't think we really need ipv6 support yet since the API should always use REMOTE_ADDR and the WordPress.org API currently doesn't have any AAAA records.

#13 @dd32
7 years ago

The first problem to solve is #2. I don't think we really need ipv6 support yet since the API should always use REMOTE_ADDR and the WordPress.org API currently doesn't have any AAAA records.

The IP is not the REMOTE_ADDR in most cases - it's the WordPress user client IP, not the server IP. There's some "anonymisation" stuff in there - see maybe_anonymize_ip_address(). The exception is when the client IP provided is a private range, in which case it uses REMOTE_ADDR.

Would it be better if the client browser was making the request directly? (and not proxied through WordPress) - the detect-location call would need to be made on every request in that case, and not just on the first call / when user searches for a location change.

Last edited 7 years ago by dd32 (previous) (diff)

#14 @barry
7 years ago

I see why it was implemented this way, but I think it's a clear violation of the EULA and the endpoint will quickly be abused. I am not too worried about the traffic once #4 above is fixed, but I guess there are some general performance and privacy concerns that will need to be addressed if the implementation is changed.

#15 in reply to: ↑ 12 @iandunn
7 years ago

Replying to barry:

1) The ip2location data on WP.org is very very old.
2) WP.org doesn't have a subscription/license to use the data
3) I don't think there is any legitimate use case to pass an ip in the query string
4) The query performance is very poor and probably won't scale to the volume that we need
5) We need a way to keep the information always current if we are going to use it

#1, #2, and #5 are all easy to solve.

#3 (EULA)

On a technical level, it's easy to switch the client to request the data directly from api.w.org instead of proxying through their WP site.

For anyone who's concerned about privacy with that architecture, we can release a plugin that will disable the Events section of the Dashboard widget entirely. We could replace it with a simple link to the Events page on make/Community.

Does anyone think that would not be enough to mitigate privacy concerns?

#4 (performance)

I'm going to start digging into this and looking for ways to optimize. If you can say which queries you're concerned the most about, that would be helpful.

Almost all of the queries are there to support extra functionality and edge cases. Worst case scenario, the queries in the following functions can all be disabled without preventing the majority of users from successfully retrieving events: guess_location_from_ip(), guess_location_from_geonames_fallback(), get_valid_country_codes(), get_country_from_name(), and get_city_from_coordinates()

That can be done at-will on the API side without breaking the client in Core. We could even do it automatically during traffic spikes if we have some kind of signal from monitoring agents to check.

#16 @barry
7 years ago

I have been thinking about the EULA a bit, I think if we remove the actual passed IP geolocation data from the response and just return the events data it will be fine. We would then use the client IP geolocation internally, in "business logic" to figure out the closest events and return them. We still need to work on the other issues, but they are technical and thus easier to fix.

#17 @iandunn
7 years ago

That sounds like a good idea to me. I think the detailed steps look like this:

  1. The initial request. The client doesn't have a location saved at this point.
    1. Client requests events near {ip address}
    2. API returns the events object like it does today, but no location object
    3. Client parses the city name from first event (e.g. "Seattle"), and saves that as the location
  2. The second request. The client has a city name saved at this point.
    1. The client requests events near {city name}, just like it does today when a user manually types in a location
    2. The API matches the city name to lat/long coordinates using the Creative Commons data from geonames.org
    3. The API returns the events object, and a location object, just like it does today.
    4. The client saves the full location object as its location
  3. All future requests. The client has a saved lat and long at this time
    1. These would be exactly the same as they are today.

So, the client would have to make 2 requests instead of 1, but it would eventually get the same behavior as we have today. The client could do step #2 asynchronously after the events received in step #1 have been displayed, in order to make things a little smoother.

All of the above only applies if the user hasn't manually entered a location. If they enter a location manually, then we skip step #1, and go straight to #2.

@barry, does that sound right to you? If so, I'll get started on the necessary changes.

#18 follow-up: @iandunn
7 years ago

To clarify, are you thinking we'd remove the IP from the client's request and just use REMOTE_ADDR on the API side, or would the client still be able to specify an arbitrary IP?

#19 in reply to: ↑ 18 @barry
7 years ago

Replying to iandunn:

To clarify, are you thinking we'd remove the IP from the client's request and just use REMOTE_ADDR on the API side, or would the client still be able to specify an arbitrary IP?

Clients can specify an arbitrary IP. I was initially concerned about the privacy implications of the clients connecting directly to the .org API which would provide their full IP and not the truncated version that the current proxy approach uses. I thought this would be the only place that a .org site sends the client IP to WordPress.org (as opposed to the server IP which is sent in many places). However, I then remembered that we use the W.org CDN for serving emoji in core, so I am not sure if it's a huge problem.

I'm ok with any approach that satisfies the "business requirements" of this widget and doesn't violate the EULA for the data source(s) we use. The rest of the stuff like performance and scaling we can figure out pretty easily.

#20 @barry
7 years ago

Oh, and I forgot, if we send arbitrary IPs we might need IPv6 support which I haven't looked at yet. If the privacy concerns aren't an issue, having the client connect directly to the API endpoint might be the cleaner implementation.

#21 @iandunn
7 years ago

@dd32 has a really good idea to modify the process a bit. I think will make things simpler on the client side, while still staying within the EULA.

When the client makes a request for events and provides the IP, we geolocate that IP from ip2location, and fetch the events that are close to it. We do not return the location data provided by ip2location to the client.

At this point, the response object being built looks like this:

{
  'location: {},
  'events: {
    {
       'id' => 1234,
       'date' => 194323423,
       'type' => 'wordcamp',
       'title' => 'WordCamp Seattle',
       'location' => 'Seattle, WA, USA'
    },
    // more events here
  }
}

The response.location is empty. The response.events[0].location field above is pulled from wporg_events, and originally comes from the WordCamp.org API.

The next step is to take that response.events[0].location field, and search our geonames table for it. geonames is Creative Commons licensed, and provides latitude and longitude.

So now, we can populate the location field in the response with the data we've pulled from geonames. Now the response object looks like this:

{
  'location: {
     'name' => 'Seattle',
     'latitude' => '47.200324',
     'longitude' => '-121.32523',
     'country' => 'US',
  },
  'events: {
    {
       'id' => 1234,
       'date' => 194323423,
       'type' => 'wordcamp',
       'title' => 'WordCamp Seattle',
       'location' => 'Seattle, WA, USA'
    },
    // more events here
  }
}

This way, we can still return a location to the client in the initial request, but we're not directly exposing the location details from ip2location at all. The previous approach was essentially the same thing, but the response.location lookup was being done in a subsequent request (see comment:17), rather than all at once.

@barry, do you see any problems with that?

#22 @iandunn
7 years ago

In 5490:

Events: Update IP test cases with current data

The IP geolocation database was updated, and a few locations shifted as a result.

See #2823

#23 @iandunn
7 years ago

In 5491:

Events: Replace IP geolocated data with venue location data

See #2823

#24 @iandunn
7 years ago

In 5492:

Events: Restore location['description']

r5491 accidentally changed this, which caused the Core client to have an empty location description.

See #2823

#25 @iandunn
7 years ago

It turned out to be even simpler than comment:21, since we already have the venue's location data in the events table. r5491 fixed #3 from comment:15.

Barry updated the geolocation data from ip2location.com, which fixed #1 and #2.

That just leaves #4 (performance), which Dion is working on, and #5 (automatic updates), which I'll look into once higher priority tasks are done.

#26 @iandunn
7 years ago

In 5493:

Events: Restore no_location_available behavior

r5491 introduced a new condition where the client might receive an unsuccesful location lookup, but it wasn't consistent with prior behavior. This commit applies the prior behavior to the new case, so that clients can expect a consistent no_location_available error if the location wasn't found.

See #2823

#27 @coreymckrill
7 years ago

Added the above patch to address Problem 2, IPv6 support. It will need to get updated once we know the name of the table with the IPv6 location data.

@dd32
7 years ago

#28 @dd32
7 years ago

2823.diff takes a pretty harsh attack on some of the code in order to attempt to get the performance better for location searches.

  • get_city_from_coordinates() is gone - It's just not performant enough and I couldn't find an easy way to achieve decent performance for such an edgecase
  • get_valid_country_codes() wasn't actually being used (though was querying)
  • guess_location_from_geonames_fallback() and as a result get_ideographic_counties() is gone - this at first seems like a step backwards, however, I think what it's attempting to achieve was going about things the wrong way.
    • REPLACE( alternatenames, CONCAT( asciiname, ' - ' ), '' ) LIKE wasn't performant and there wasn't much of a chance of it being so
    • using a alternatenames LIKE '%,Blah,%' was working, but wasn't really fast

In order to speed things up, I created a new table based off the geoname data, geoname_summary which expands the (name, asciiname, alternatenames) out into a single name column, and also adds variants (for example the city name without dashes).

One major change in this is that if you type in, for example, The Big Apple the location returned will now be The Big Apple instead of New York City, similarly, if you search for Wien you'll get just that Wien not Vienna.
Unfortunately typo's will stay as typo's (for example, New Yorke or Lunden).

This might also make it easier to return multiple results in the future (Springfield, MO vs Springfield, MA for example, or allowing an en_AU locale to pull up Wellington, NZ instead of Wellington, AU).

All the test cases that passed are still passing albeit with the new return data, but instead of the ~100 queries taking 4-6 seconds, it now takes 0.04s for a slightly fewer number of queries.

#29 @iandunn
7 years ago

In 5495:

Events: Prefer meetups over WordCamps to improve location accuracy

See #2823

#30 @iandunn
7 years ago

In 5496:

Events: Track query execution time to aid optimization

The test suite isn't representative of real-world usage, because by nature it's weighted heavily towards edge cases. Establishing a performance baseline is still useful when optimizing, though.

See #2823

Props dd32

This ticket was mentioned in Slack in #core by iandunn. View the logs.


7 years ago

#32 @iandunn
7 years ago

In 5497:

Events: Remove get_city_from_coordinates() to improve performance

The function was not performant, and it's not easy to make it that way. It's only needed for a few edge cases, and we can get the location from the events themselves (similar to r5491). So, removing this is a good trade-off.

Because we are able to rebuild the location, no changes are needed in Core, and a new version of this endpoint is not needed.

See #2823
Props dd32

This ticket was mentioned in Slack in #core by iandunn. View the logs.


7 years ago

#34 follow-up: @karmatosed
7 years ago

Its worth adding the oddity I experience here:

https://cldup.com/U3LohVCeQu.png

Firstly, I am nowhere near Romford (that's a pretty far away from where I am).

Secondly, as someone from UK, showing US City in the box is a weird experience. Also, why only cities? In UK for example towns have meetups and a lot of people don't live in cities.. this I am sure happens in US :)

#35 in reply to: ↑ 34 ; follow-up: @iandunn
7 years ago

Replying to karmatosed:

Secondly, as someone from UK, showing US City in the box is a weird experience.

That string is internationalized, so you'll see a local city as soon as it's translated.

Also, why only cities?

The database we have is primarily city-centric. It has some region information, but it's not detailed or comprehensive enough to fully support. The API does try to match inputs other than cities, but we only explicitly mention cities in the instructions because that's the only thing we can guarantee will work.

Last edited 7 years ago by iandunn (previous) (diff)

#36 in reply to: ↑ 35 ; follow-up: @dd32
7 years ago

Replying to iandunn:

Replying to karmatosed:

Also, why only cities?

The database we have is primarily city-centric. It has some region information, but it's not detailed or comprehensive enough to fully support. The API does try to match inputs other than cities, but we only explicitly mention cities in the instructions because that's the only thing we can guarantee will work.

Worth noting that the database we've been using says "city" but in this context, city means town. We've used a data source which includes town with a population of >1k (or an administrative area of 1k, so town X of 500 and 500 in the surrounding area, where X is the common point would be included). For countrycode = GB there were 3.6k 'cities' with 14k different spellings and alternate names.

Last edited 7 years ago by dd32 (previous) (diff)

#37 @iandunn
7 years ago

In 5499:

Events: Stop using fuzzy locations for 4.8-beta2

See #2823

@dd32
7 years ago

Approach using Google Geocode to power user lookups instead of attempting to roll our own geocode service - Google does it better, and doesn't receive any personal user details with a server-side request

#38 follow-up: @ocean90
7 years ago

Searching for "Köln" gives me a "We couldn’t locate Köln." error but a search for "Cologne" or "Koeln" returns the proper "Köln". "Münster" is working fine so I'm not sure if it's an issue with umlauts.

Another oddity: "Düsseldorf" returns "Düsseldorf-Pempelfort" which is kinda unexpected because "Pempelfort" is just a city part of "Düsseldorf".

#39 @pbiron
7 years ago

As I mentioned in the devchat yesterday, searching for "Gardner, CO" gives events in Kansas City, MO...which is almost 700 miles away. Looking into the community-events-location meta_key, I see that the lat/lng stored is for KC, MO, which explains why it showing event for KC, MO.

However, if Otto's comment is correct that place name searches are using geonames.org data then that is odd, because http://www.geonames.org/search.html?q=gardner%2C+co&country= correctly finds Gardner, CO (with the correct lat/lng) as the 1st result.

#40 in reply to: ↑ 38 ; follow-up: @iandunn
7 years ago

Replying to ocean90:

Searching for "Köln" gives me a "We couldn’t locate Köln." error but a search for "Cologne" or "Koeln" returns the proper "Köln". "Münster" is working fine so I'm not sure if it's an issue with umlauts.

Another oddity: "Düsseldorf" returns "Düsseldorf-Pempelfort" which is kinda unexpected because "Pempelfort" is just a city part of "Düsseldorf".

I think most of the charset issues have been worked out, but there's still at least one that I know about (see the unit tests).

@dd32 is working on upgrading our geocoding from geonames.org to Google's Geocoding API, which should fix all of that. See 2823.2.diff.

Replying to pbiron:

As I mentioned in the devchat yesterday, searching for "Gardner, CO" gives events in Kansas City, MO...which is almost 700 miles away. Looking into the community-events-location meta_key, I see that the lat/lng stored is for KC, MO, which explains why it showing event for KC, MO.

I think that's an edge case just because there's not enough information to distinguish between the two Kansas Citys, since they're so close to each other geographically.

[wp40774] might be able to help there, though, since we could use the IP geolocation as a signal. We'd also like to have the API start returning a list of alternate locations, and have Core display those in a dropdown for the user to choose. That'll probably have to be 4.8.1, though.

However, if Otto's comment is correct that place name searches are using geonames.org data then that is odd, because http://www.geonames.org/search.html?q=gardner%2C+co&country= correctly finds Gardner, CO (with the correct lat/lng) as the 1st result.

Otto's definitely correct, but the code does a lot of gymnastics to try and parse the query and look it up in a lot of ways. There could be a bug in there, or just some conflicting edge cases, etc.

Dion's work to switch to Google's API should help there too.

#41 in reply to: ↑ 36 ; follow-up: @pbiron
7 years ago

Replying to dd32:

Worth noting that the database we've been using says "city" but in this context, city means town. We've used a data source which includes town with a population of >1k (or an administrative area of 1k, so town X of 500 and 500 in the surrounding area, where X is the common point would be included).

That probably explains the behavior I'm seeing...as Gardner, CO has a population of only 70. If the surrounding farms are included, what I call "Greater Metropolitan Gardner", the pop balloons to a whopping 400. Yes, I live in the sticks :-).

#42 in reply to: ↑ 41 @iandunn
7 years ago

Replying to pbiron:

Replying to dd32:

Worth noting that the database we've been using says "city" but in this context, city means town. We've used a data source which includes town with a population of >1k (or an administrative area of 1k, so town X of 500 and 500 in the surrounding area, where X is the common point would be included).

That probably explains the behavior I'm seeing...as Gardner, CO has a population of only 70. If the surrounding farms are included, what I call "Greater Metropolitan Gardner", the pop balloons to a whopping 400. Yes, I live in the sticks :-).

No shame in that :)

I think the only reason we limited it to towns with > 1k population is just because, for an endpoint that will see as much traffic as this one, we don't have the server resources to do the database queries on larger data sets. Once we switch to Google's API, though, that should solve that problem.

#43 in reply to: ↑ 40 @pbiron
7 years ago

Replying to iandunn:

Replying to pbiron:

As I mentioned in the devchat yesterday, searching for "Gardner, CO" gives events in Kansas City, MO...which is almost 700 miles away. Looking into the community-events-location meta_key, I see that the lat/lng stored is for KC, MO, which explains why it showing event for KC, MO.

I think that's an edge case just because there's not enough information to distinguish between the two Kansas Citys, since they're so close to each other geographically.

Sorry, "KC, MO" was just my abbreviation for "Kansas City, MO"...I should have been more clear.

#44 @iandunn
7 years ago

In 5501:

Events: Support IPv6 address geolocation.

See #2823
Props coreymckrill

#45 @iandunn
7 years ago

In 5537:

Events: Return false instead of null

See #2823
Props dd32

#46 @iandunn
7 years ago

In 5538:

Events: Remove unused get_valid_country_codes()

See #2823
Props dd32

#47 @iandunn
7 years ago

In 5539:

Events: Remove unnecessary $location check

$location will never return true in this path, so there's no need to check it.

See #2823
Props dd32

#48 @iandunn
7 years ago

In 5540:

Events: Split location by any whitespace character, including Unicode

This will match more valid input than simply checking for a single space character.

See #2823
Props dd32

#49 @iandunn
7 years ago

In 5541:

Events: Select countries from countrycodes to improve performance

See #2823
Props dd32

#50 @iandunn
7 years ago

In 5543:

Events: Query new geoname_summary table to improve performance

See #2823
Props dd32

#51 @iandunn
7 years ago

r5543 merges the last bits of 2823.diff, so the performance issues should be sorted now.

The two things still left in this ticket are:

  1. Switching to an external geocoding API for better accuracy. I'm talk to a Google rep today about ToS, pricing, etc. If that goes well, I'll work on merging 2823.2.diff. I'd like to keep the geonames stuff around as a fallback, though, at least for a few weeks. Depending on pricing, we may also want to keep it around permenantly, and use it for queries that we are confident will return good results from the geonames.org data, and then only use the external API for requests that are likely to be get better results from it.
  2. Automating the ip2location updates. I think I have everything I need to finish the script now, so I'll be working on this today.

#52 @dd32
5 years ago

  • Component changed from API to Events API

#53 @dd32
5 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

Switching to an external geocoding API for better accuracy.

3 years later, we've still not done that. I think we can shelve that discussion for a later date.

Automating the ip2location updates. I think I have everything I need to finish the script now, so I'll be working on this today.

This still hasn't been done, but is updated often enough.

I'm going to close this ticket as fixed, we can pull it back up in the future if needed.

This ticket was mentioned in Slack in #meta by pbiron. View the logs.


3 years ago

Note: See TracTickets for help on using tickets.