Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#5562 closed defect (bug) (fixed)

Events: Exact city matches fail because of state abbreviations

Reported by: iandunn's profile iandunn Owned by: dd32's profile dd32
Milestone: Priority: normal
Component: Events API Keywords:

Description (last modified by iandunn)

These two searches (and probably many others) return the "wrong" Chicago.

Instead of Chicago, Illinois (Geonames ID 4887398) they return cities with an alternate name of Chicago, with a much smaller population. It may even match some buildings with population 0 that are classified as `P.PPL`?

I've added some PHPUnit cases in r10553 (xref r10544 and r16810-dotorg for info on the new test suite).

It looks like the problem may be related to how cities are now being imported into as Chicago, IL instead of the canonical Chicago in So, they fail the exact-match search, and the fuzzy search turns up the less-desirable result.

Possibly related to #3728, #5117, r16728-dotorg

I'll look into it more tomorrow, but wanted to document what I've found so far. CC @dd32

Change History (11)

#1 @iandunn
4 years ago

In 10553:

Events: Add failing tests to show exact-match bug.

See #5562.

#2 @iandunn
4 years ago

  • Description modified (diff)

#3 @dd32
4 years ago

It looks like the problem may be related to how cities are now being imported into as Chicago, IL instead of the canonical Chicago in So, they fail the exact-match search, and the fuzzy search turns up the less-desirable result.

As per r16728-dotorg the Chicago, IL entry should be added in addition to the canonical Chicago one. I'll review the queries and results today though to triple check that it's working how I intended it to..

#4 @iandunn
4 years ago

Ah, I missed that, thanks!

#5 in reply to: ↑ description @dd32
4 years ago

Apologies for the rambling comment @iandunn.. I'm just working through it and noting my thought process for myself..

After looking into this, there's several things happening:

  1. There was a data error, in that some of the source data had invalid UTF8 data which caused only partial Chicago data to be imported.

I've re-imported the data after fixing the issues (not all committed yet) which results in a more complete data set, but still 'invalid' responses.

Replying to iandunn:

  1. This looks correct with the current logic. Yes, hang on, let me explain..
  • Timezone trumps population, #3367 comes into play here.
  • Alternate names are treated as the same level of priority as "preferred" names

I don't think timezone should be a trump here in every case, but rather only when it's a preferred name (ie. Portland, OR / Portland, ME).

The Chicago examples has come up due to the greater number of cities after #3728, it looks like most of the chicago entries are probably alternate names (There's Chicago's in every US timezone, just like there's a Portland in each timezone).

As part of that, I've rebuilt the table again, this time marking alternate names as alts, so that preferred names can be prioritised.. Unfortunately I've done something wrong, and the alt field didn't get filled correctly, but from what I can tell, it'll fix these two examples and a few others I've been able to come up with.

I'll loop back to this tomorrow, but if you want to Ian, you can run queries against the geoname_summary_20210106082158 table to test with. Adding ORDER BY alt ASC to the start of the order by's seems to do the trick (Test failures such as Dona ana are due to the alt flag failure I mentioned above)

I've struggled to test it with PHPUnit, as it keeps running out of memory and requires PHPUnit8 by looks (Not 6, 7, or 9 as I had installed)..

#6 @iandunn
4 years ago

Ah, yes, complexity abounds :) Thanks for digging into it!

Weighting the preferred name sounds like a good approach. I'll play around a bit with that too, and see if I can add some more tests around the various situations.

Although, this might be a reason to revisit the idea of switching to an external API to handle this kind of thing. I remember Josepha and I looked into Google's Geocoding API after the WP 4.8 release; I'll try to find the details of where that left off.

Another idea: Instead of the API trying and pick the perfect match, maybe it should just return all the likely ones, and Core could give the user a dropdown to pick the right one. Google's Places Autocomplete API is designed for that (h/t


I initially had problems getting PHPUnit working on my sandbox, but it's working well for me now, using v9.5. If you're running into the same problems I was, disabling xdebug should fix it. Check out the new readme file in the API root if you haven't already, it has some details.

#7 @iandunn
4 years ago

--process-isolation can also help for PHPUnit, but I've only found it necessary when using the watch task.

#8 @dd32
4 years ago

  • Owner set to dd32
  • Resolution set to fixed
  • Status changed from new to closed

In 10554:

Events API: When determining a location, preference non-alternate names and populated areas over alternate names / unpopulated locations.

Fixes #5562.

#9 @iandunn
4 years ago

In 10557:

Events: Weight country over preferred name when ordering.

See #5562.

#10 follow-up: @iandunn
4 years ago

Kudos for the fix :)

I found one problem with the new ranking, but fixed it in r10557.

What's the reason to import entries with population = 0?

I opened a #core52249 for the idea about showing users multiple potential locations.

#11 in reply to: ↑ 10 @dd32
4 years ago

Replying to iandunn:

What's the reason to import entries with population = 0?

Quite simply, the population figures aren't available for every location, they're available for most locations, but not all.

For reference, only 10% of the current non-alternate name records have a population of greater than 0.
Previous to #3728 it looks like it was in the range of 75% had a population count > 0.

Note: See TracTickets for help on using tickets.