WordPress.org

Making WordPress.org

Opened 2 weeks ago

Last modified 6 days ago

#3295 new defect

Events API location search is case-sensitive

Reported by: SergeyBiryukov Owned by:
Milestone: Priority: normal
Component: API Keywords: 2nd-opinion
Cc:

Description

When entering a city name in "Events and News" widget in WordPress admin:

  • Санкт-Петербург works as expected.
  • санкт-петербург returns a could_not_locate_city error.

Both options should return the same results.

On a related note, it looks like the Events API tries to guess the location from the IP address if no other info is provided, but for some reason it doesn't always work. For example, there's a Saint-Petersburg WordPress Meetup coming up on Decemper 9, but I've got some reports that those in the city only see it in the widget once they explicitly enter the city name, and only see WordCamp US otherwise.

Attachments (1)

3295.diff (3.0 KB) - added by iandunn 10 days ago.

Download all attachments as: .zip

Change History (11)

#1 follow-up: @Otto42
2 weeks ago

  • Keywords 2nd-opinion added

This appears to be because the data in the geonames table is stored incorrectly in a latin1 field, and so the collation for casing doesn't work properly.

I downloaded a copy of the table to work with it locally, and I found that altering the name column to utf8mb4 fixes the problem. To not convert the existing data incorrectly, the SQL that should be run is as follows:

ALTER TABLE geonames_summary MODIFY name BINARY(200);
ALTER TABLE geonames_summary MODIFY name CHAR(200) CHARACTER SET utf8mb4;

(Binary first to prevent converting the data, which is already utf8, stored in a latin1 column.)

After making this change, this query works properly:

SELECT * FROM `geoname_summary` WHERE `name` LIKE '%санкт-петербург%'

I'd like a second opinion though instead of just altering the data directly. @dd32 Thoughts? Also perhaps about changing the other 3 columns, which are all English but should still be utf8 just the same.

#2 in reply to: ↑ description ; follow-up: @iandunn
2 weeks ago

Replying to SergeyBiryukov:

  • Санкт-Петербург works as expected.
  • санкт-петербург returns a could_not_locate_city error.

I was able to reproduce that. I added a test case for it on my sandbox, but am going to wait to deploy it until after the 4.9.1 release settles down, just to be safe.

Events API tries to guess the location from the IP address if no other info is provided, but for some reason it doesn't always work

I'm guessing that's caused by a different problem. It's expected that it won't work 100% of the time, because IP geolocation is never perfect, but there might still be a specific problem that we can solve. Can you get some more details, so that we can troubleshoot it? If they install the log-community-events-requests.php mu-plugin from #wp41217, and then send us the relevant parts of the PHP error log, then we should have the info we'll need.

If they've set a location manually, they'll need to clear it, so that IP geolocation is re-enabled. The easiest way to do that is to search for a city that doesn't exist (like Narnia), and then refresh.

#3 in reply to: ↑ 1 @dd32
2 weeks ago

Replying to Otto42:

@dd32 Thoughts? Also perhaps about changing the other 3 columns, which are all English but should still be utf8 just the same.

In general, utf8(mb4)? doesn't work as you'd expect it to on WordPress.org due to HyperDB and connection charsets - I'm fairly certain our HyperDB doesn't even support utf8mb4 (We force it to in some places, like on WordCamp.org and BuddyPress.org, which results in lovely charset issues everywhere else). Latin1 is a better option in general, except for this specific case of upper/lower-case values.

Personally I think I'd prefer to instead ensure that geoname_summary is always lower-case'd, even if that means running a UPDATE.. SET name = CONVERT(...) query against the table after updates to convert all the data to lower-case.. however that might not be viable.

However, as a test, I've created geoname_summary_utf8mb4_test in the same DB for testing, using your SQL from above, and as expected, it fails to query the table unless we force HyperDB to utf8 mode manually
Should also note that that conversion has resulted in the fields being padded to 200char of null bytes after each name - I tried setting it to VARCHAR but that didn't help.

<?php define('SAVEQUERIES', true ); include './wp-load.php';

add_db_table('geoip', 'geoname_summary_utf8mb4_test');

var_dump([
	'lower' => $wpdb->get_results( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE `name` LIKE '%санкт-петербург%'" ),
	'upper' => $wpdb->get_results( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE `name` LIKE '%Санкт-Петербург%'" ),
	'id' => $wpdb->get_results( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE id = 41827" ),
	'queries' => $wpdb->queries
]);
// Only the ID query will work above

$wpdb->queries = array();

$dbh = $wpdb->db_connect( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE `name` LIKE '%санкт-петербург%'" );
$wpdb->set_charset( $dbh, 'utf8' );

var_dump([
	'lower' => $wpdb->get_results( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE `name` LIKE '%санкт-петербург%'" ),
	'upper' => $wpdb->get_results( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE `name` LIKE '%Санкт-Петербург%'" ),
	'id' => $wpdb->get_results( "SELECT * FROM `geoname_summary_utf8mb4_test` WHERE id = 41827" ),
	'queries' => $wpdb->queries
]);
// All above will work.
$wpdb->set_charset( $dbh, 'latin1' ); // Must set it back or it'll affect other things too.

@iandunn Can you try running the test table above against your tests, and any other edge cases that haven't been working? Just remember you'll need to force the connection to be in utf8 mode.

#4 in reply to: ↑ 2 @SergeyBiryukov
2 weeks ago

Replying to iandunn:

I'm guessing that's caused by a different problem. It's expected that it won't work 100% of the time, because IP geolocation is never perfect, but there might still be a specific problem that we can solve. Can you get some more details, so that we can troubleshoot it? If they install the log-community-events-requests.php mu-plugin from #wp41217, and then send us the relevant parts of the PHP error log, then we should have the info we'll need.

Thanks! It looks like the IP geolocation works fine after all, it's just that the meetup was only published yesterday and the data was cached for 12 hours. I'll keep these instructions in mind for future reference if I see any other reports.

So case-sensitive search is the only problem here.

#5 @Otto42
13 days ago

The null bytes is to be expected, as it's a CHAR column, not a VARCHAR.

As for the query, we'd probably need to modify the script to SET NAMES 'UTF8'; first.

#6 @iandunn
11 days ago

In 6210:

Events: Add tests for Cyrillic character cases.

See #3295

@iandunn
10 days ago

#7 @iandunn
10 days ago

r6210 adds the new test case, and 3295.diff experiments with the temp UTF8 table.

The new test passes with the patch, and the previously failing test for Yaoundé is doing better as well. It returns a location now, although it's missing the diacritics (e.g., yaounde instead of Yaoundé).

There's 21 tests failing with the patch, though, when previously there was only 2. Some of that is from the missing diacritics, but there's also some more serious cases, where searches for Tokyo, Berlin, etc fail to return a location.

I'm guessing we'll be able to fix those, but don't have time to dig further today.

#8 follow-up: @Otto42
9 days ago

The NULL padding can be avoided by using these commands to do the conversion instead:

ALTER TABLE geoname_summary MODIFY name VARBINARY(200);
ALTER TABLE geoname_summary MODIFY name VARCHAR(200) CHARACTER SET utf8mb4;

Using varbinary is the trick, and it seems to avoid the padding bytes.

#9 in reply to: ↑ 8 ; follow-up: @dd32
8 days ago

Replying to Otto42:

The NULL padding can be avoided by using these commands to do the conversion instead:

Using varbinary is the trick, and it seems to avoid the padding bytes.

FYI I've updated the geoname_summary_utf8mb4_test table using those commands instead.

#10 in reply to: ↑ 9 @iandunn
6 days ago

Replying to dd32:

FYI I've updated the geoname_summary_utf8mb4_test table using those commands instead.

Great! That fixed all of the tests except for 5, and the only problems with the remaining ones are the missing diacritics:

> php tests/test-index.php

Running 55 location tests

* city-south-america: _FAILED_

Expected result: Array
(
    [description] => sao paulo
    [latitude] => -23.548
    [longitude] => -46.636
    [country] => BR
)

Actual result: Array
(
    [description] => são paulo
    [latitude] => -23.548
    [longitude] => -46.636
    [country] => BR
)

* city-with-diacritics-in-formal-name-but-not-in-query: _FAILED_

Expected result: Array
(
    [description] => dona ana
    [latitude] => 32.390
    [longitude] => -106.814
    [country] => US
)

Actual result: Array
(
    [description] => doña ana
    [latitude] => 32.390
    [longitude] => -106.814
    [country] => US
)

* city-endonym-accents-africa: _FAILED_

Expected result: Array
(
    [description] => yaoundé
    [latitude] => 3.867
    [longitude] => 11.517
    [country] => CM
)

Actual result: Array
(
    [description] => yaounde
    [latitude] => 3.867
    [longitude] => 11.517
    [country] => CM
)

* city-endonym-accents-north-america: _FAILED_

Expected result: Array
(
    [description] => ciudad de méxico
    [latitude] => 19.428
    [longitude] => -99.128
    [country] => MX
)

Actual result: Array
(
    [description] => ciudad de mexico
    [latitude] => 19.428
    [longitude] => -99.128
    [country] => MX
)

* 2-word-city-region: _FAILED_

Expected result: Array
(
    [description] => são paulo
    [latitude] => -23.548
    [longitude] => -46.636
    [country] => BR
)

Actual result: Array
(
    [description] => sao
    [latitude] => -23.548
    [longitude] => -46.636
    [country] => BR
)


Running 2 events tests


Running 2 add_regional_wordcamps() tests


Running 4 build_response() tests


Running 4 is_client_core() tests


Finished running all tests.

* 5 tests failed
* 85 queries ran in 0.116151 seconds
* Average time per query: 0.001366 seconds
Note: See TracTickets for help on using tickets.