Making WordPress.org

Opened 6 years ago

Closed 4 years ago

#3728 closed defect (bug) (fixed)

Some city names shown incorrectly in Events API

Reported by: Presskopp
Owned by: dd32
Milestone:
Priority: normal
Component: Events API
Keywords:
Cc:

Description

Hi,

I just found that entering the German city of Gießen shows the events correctly, but not the name of the city: it is shown as Giesen.

Taking it a step further with the city of Bad Gottleuba-Berggießhübel it will only show Bad, which is bad ;)
After talking to @obenland he said it will be recognized by the API as Bad Gottleuba-Berggießhübel

I tested some more cities having the ß character and such, but they were ok so far.

Attachments (2)

Giesen.png (7.6 KB) - added by Presskopp 6 years ago.
hamburg.jpg (20.9 KB) - added by Presskopp 6 years ago.


Change History (21)

This ticket was mentioned in Slack in #meta by tellyworth.


6 years ago

#2 @SergeyBiryukov
6 years ago

  • Component changed from General to API

#3 @dryanpress
6 years ago

  • Owner set to dryanpress
  • Status changed from new to assigned

I'm happy to look at this next week. Any other major special characters to account for @obenland?

#4 @obenland
6 years ago

Not to my knowledge

#5 @Presskopp
6 years ago

I found another glitch using 5.1-beta3-44723, local installation:

Entering Hamburg I get results for Hambûrg!

#6 @Presskopp
6 years ago

München -> Munchen
Berlin -> Berlín

#7 @dd32
6 years ago

IIRC this is mostly expected: we use the "primary" name for a city in the response (which is mostly English-centric, AFAIK), but many lookups of non-ASCII city names actually match against the "alternative names", so the returned name differs from the one that was searched for.

To fix this I believe we'd want to change the data-source it's looking up against.
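
The mismatch described above can be sketched roughly as follows. This is only an illustration, not the API's actual code: the `PRIMARY_NAMES` and `ALL_NAMES` tables, their rows, and both functions are hypothetical, with data made up to mirror the München -> Munchen and Gießen -> Giesen reports.

```python
# Hypothetical data: every searchable name (primary or alternate) maps to a
# city id, and each city id has exactly one "primary" name.
PRIMARY_NAMES = {1: "Munchen", 2: "Giesen"}

ALL_NAMES = {
    "munchen": 1,
    "münchen": 1,  # alternate name
    "munich": 1,   # alternate name
    "giesen": 2,
    "gießen": 2,   # alternate name
}


def lookup_city(query):
    """Return the primary name of the city matching `query`, or None."""
    city_id = ALL_NAMES.get(query.strip().lower())
    if city_id is None:
        return None
    # The response carries the primary name, not the name that was searched.
    return PRIMARY_NAMES[city_id]


def lookup_city_preserving_query(query):
    """Variant that echoes the matched name back instead of the primary name."""
    cleaned = query.strip()
    return cleaned if cleaned.lower() in ALL_NAMES else None
```

With this toy data, searching for München matches through an alternate name and comes back as Munchen, which is the behaviour reported in this ticket. The second function shows one possible alternative, echoing the searched name; whether that is desirable for the real API is an open question.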

This ticket was mentioned in Slack in #meta by tobifjellner.


6 years ago

#9 @dd32
5 years ago

  • Component changed from API to Events API

#10 @iandunn
5 years ago

  • Summary changed from Showing community events is not working for some cities to Some city names shown incorrectly in Events API

#11 @dd32
4 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

This was fixed via #5117

#12 @Presskopp
4 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

The issue is still present

Gießen -> Giesen
Bad Gottleuba-Berggießhübel -> Bad

#13 follow-up: @dd32
4 years ago

It looks like this is no longer an encoding issue, but a data-source issue.

Gießen -> Giesen

The data-source lists those as two separate cities, with Giesen having a much larger population and an alternate name (not primary name) of Gießen. The coordinates also differ slightly (though they are quite obviously the same place), so the real issue is that the data contains two entries that should have been combined.

Bad Gottleuba-Berggießhübel -> Bad

That city doesn't exist in the data-set, nor does any city whose name starts with that, which is why the search gets truncated back to Bad, which is an alt-name for Badou, Togo.

I think updating the source-data might help here; I don't think it's been updated in about 2 years.
We don't use a proper GIS here, just a very bland lookup table that has ~750k city names for ~150k unique cities around the world. It definitely doesn't cover every city the world has to offer (or even 80% of them), but it does include some obscure places such as Grytviken, which has between 2 and 30 people living there depending on the time of year.

#14 in reply to: ↑ 13 ; follow-up: @dd32
4 years ago

Replying to dd32:

I think updating the source-data might help here

Gießen -> Giesen

Looks like that's been corrected in the new dataset (which hasn't been imported yet)

Bad Gottleuba-Berggießhübel -> Bad

That doesn't exist in the smaller dataset we're using, but does exist in the full data-set.

The dataset we're using is from https://www.geonames.org/. The smaller set is ~30MB and has now grown from 150k to 200k cities, whereas the larger set is ~1.5GB and covers 12 million city names (plus 6-10 alt-names per city on average).

It might be worth us running a combined dataset: the smaller set, plus the official city names from the larger dataset (excluding the alt-names), which would probably end up as a ~250MB dataset.
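
A rough sketch of what such a combined dataset could look like is below. The record layout (`name`, `alt_names`) and the `combine_datasets` helper are hypothetical and not taken from the Geonames files or the actual import script. It also assumes that a city present in both sets keeps its record from the smaller set.

```python
def combine_datasets(small, large):
    """Merge the small set (with alt names) and primary names from the large set.

    Both arguments map a city id to {"name": str, "alt_names": list of str}.
    """
    combined = {}
    for city_id, record in large.items():
        # Keep only the official name from the large set; drop its alt names.
        combined[city_id] = {"name": record["name"], "alt_names": []}
    for city_id, record in small.items():
        # Assumption: cities in the small set keep their record, alt names included.
        combined[city_id] = {
            "name": record["name"],
            "alt_names": list(record["alt_names"]),
        }
    return combined
```

The point of the sketch is only the size trade-off: every city from the larger set becomes searchable by its official name, while alternate names are carried only for the cities in the smaller set.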

#15 @dd32
4 years ago

In 10469:

Events API: remove the UTF8 database connection juggling, which shouldn't be needed anymore.

See #3728.

#16 @dd32
4 years ago

In 10470:

Events API: Handle queries with more than 3 "words" in the search by searching backwards until one matches. Improves matches for multi-word cities/countries.

When no exact match is found, searches A B C D E, then A B C D, A B C, A B, A.

See #3728.
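
The backwards search described in this changeset can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed code: `KNOWN_LOCATIONS`, `candidate_queries`, and `find_location` are hypothetical names, and the location set is made up.

```python
# Hypothetical set of known location names (lowercased); not real API data.
KNOWN_LOCATIONS = {"bad", "new york", "bad gottleuba-berggießhübel"}


def candidate_queries(search):
    """Yield the full query, then progressively shorter prefixes by word."""
    words = search.split()
    for end in range(len(words), 0, -1):
        yield " ".join(words[:end])


def find_location(search, known=KNOWN_LOCATIONS):
    """Return the longest word-prefix of `search` that is a known location."""
    for candidate in candidate_queries(search):
        if candidate.lower() in known:
            return candidate
    return None
```

It also shows why Bad Gottleuba-Berggießhübel collapsed to Bad earlier in this ticket: if the full name is missing from the data, the search falls back to the first word that does match.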

#17 in reply to: ↑ 14 @dd32
4 years ago

  • Keywords needs-patch removed
  • Owner changed from dryanpress to dd32
  • Status changed from reopened to assigned

Replying to dd32:

Replying to dd32:

I think updating the source-data might help here

Gießen -> Giesen
Bad Gottleuba-Berggießhübel -> Bad

These both now work after updating the dataset.

I've migrated it from the limited dataset to the full 12 million cities dataset, just so there's less to update in the future.

Doing that has introduced some "bugs" in things that previously worked but no longer do. For example, Australia previously matched the country, whereas now it matches a region in Mexico. But it does also now know where my Secret Rocket Yards are.

I'm going to leave this open to see if any reports come in of breakage from the update.

#18 @dd32
4 years ago

In 10474:

Events API: Update the API to accept the new type DB field to disambiguate actual places from States/Countries/others.

As part of this the table now includes non-city locations such as countries too, and so if the best match is a country it's still handled properly (returning country-wide events rather than just distance around the central point of the country).

Updates the tests to take into account the new database changes, but these are likely to become out-of-date as the data will be updated daily going forward.

See #3728.
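
How a type field might drive the event query can be sketched like this. Everything here is hypothetical: the `Location` fields, the type values, the default radius, and `build_event_filter` are illustrative and do not reflect the real schema or API code. States are treated like cities here only because the changeset does not say how they are handled.

```python
from dataclasses import dataclass


@dataclass
class Location:
    name: str
    type: str  # hypothetical values: "city", "state", "country"
    country_code: str
    latitude: float
    longitude: float


def build_event_filter(location, radius_km=100):
    """Choose how to select events based on the kind of place matched."""
    if location.type == "country":
        # Country match: return events anywhere in that country.
        return {"country": location.country_code}
    # Otherwise search within a radius of the matched point.
    return {
        "latitude": location.latitude,
        "longitude": location.longitude,
        "radius_km": radius_km,
    }
```

A match on Australia as a country would then return country-wide events rather than only those near the central point of the country.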

#19 @dd32
4 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed