Making WordPress.org

Opened 5 days ago

Closed 4 days ago

Last modified 3 days ago

#8253 closed defect (bug) (fixed)

Credits: character encoding for names

Reported by: sabernhardt
Owned by: dd32
Milestone:
Priority: normal
Component: API
Keywords:
Cc:

Description

(Reported in #core65245)

The following characters display incorrectly in about 20 names on the credits.php page:

á ä é í ó ö ñ ú ü ý Ž

@clementpolito:

On WordPress 7.0-RC4, on the page /wp-admin/credits.php, I encounter an issue with some characters.

At first, I thought there might be a character encoding issue with my database tables on my end, but in fact, some accented characters are displaying correctly elsewhere on the page. @audrasjb also reproduced the issue.

  • "Albert Juhテゥ Lluveras" should be "Albert Juhé Lluveras"
  • "Alvaro Gテウmez" should be "Alvaro Gómez"
  • "Béryl de La Grandière" is ok
  • "Eliezer Peテアa" should be "Eliezer Peña"
  • "Johannes Jテシlg" should be "Johannes Jülg"
  • Etc.

[Screenshot: core contributors list on the 6.9 credits page]

@ocean90:

This looks more like an issue on WordPress.org, since the API already returns the incorrect encoding; see https://api.wordpress.org/core/credits/1.1/?version=6.9.
The profiles, e.g. https://profiles.wordpress.org/aljullu/ and https://profiles.wordpress.org/anlino/, look fine though.

@jonsurrell:

This line looks suspicious. It hasn't changed in a long time, but changes to the underlying language data or functionality could plausibly produce different results. In particular, listing JIS before UTF-8 in the list of source encodings seems problematic. Maybe the conversion can be dropped completely if the data is already UTF-8.

<?php
$raw = 'é1234567890';
echo mb_convert_encoding($raw, 'UTF-8', 'ASCII, JIS, UTF-8, Windows-1252, ISO-8859-1') . "\n";
// テゥ1234567890
echo mb_convert_encoding($raw, 'UTF-8', 'ASCII, UTF-8, JIS, Windows-1252, ISO-8859-1') . "\n";
// é1234567890
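
The suggestion above (skip the conversion when the data is already UTF-8) could be sketched roughly as follows. This is only an illustration, not the code used on WordPress.org: the function name credits_to_utf8() is hypothetical, and the Windows-1252 fallback is an assumption about what legacy input might look like.

```php
<?php
// Hypothetical helper, not the actual WordPress.org implementation.
function credits_to_utf8( string $raw ): string {
	// Valid UTF-8 (which includes plain ASCII) is returned untouched,
	// so no encoding detection runs on already-correct data.
	if ( mb_check_encoding( $raw, 'UTF-8' ) ) {
		return $raw;
	}
	// Assumption: anything else is legacy single-byte Windows-1252.
	return mb_convert_encoding( $raw, 'UTF-8', 'Windows-1252' );
}

echo credits_to_utf8( 'é1234567890' ) . "\n";
```

Because the source encoding is named explicitly, the result does not depend on the detection order or on how a given PHP version ranks candidate encodings.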

@siliconforks:

It looks like the behavior changed in PHP 8.3:

<?php

$raw = 'é1234567890';
$raw = mb_convert_encoding( $raw, 'UTF-8', 'ASCII, JIS, UTF-8, Windows-1252, ISO-8859-1' );

// PHP 8.2: é1234567890
// PHP 8.3: テゥ1234567890
echo $raw . "\n";

Change History (7)

This ticket was mentioned in Slack in #core by jorbin.
4 days ago

#3 @TobiasBg
4 days ago

This also affects the WP release announcements, e.g. https://wordpress.org/news/2026/05/armstrong/ or https://wordpress.org/news/2025/12/gene/ so definitely an API thing.

#4 @dd32
4 days ago

Can confirm, this is a PHP 8.3 change in the Multibyte detection.

The reason for this _encode() method is that historically WordPress.org had some contributors' names malformed in the users table, due to BuddyPress writing data into the users table with the incorrect connection charset/collation.

The DB writes have been fixed over the years, and this encoding step continued to work until we switched from PHP 8.1 to PHP 8.4 in the last few days, which seems to have resulted in a range of characters being detected as JIS. JIS was never correct here, but it seemingly worked well enough for the original problem: UTF-8 characters stored into a Latin1 table via a UTF-8 charset connection, then read back as UTF8-in-latin1 via a latin1 charset connection.
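
For readers unfamiliar with this class of bug, here is a minimal sketch of a charset mismatch of that general kind. The exact byte path on WordPress.org is not spelled out in this ticket, so this only assumes the common case where UTF-8 bytes get misread as Latin-1 (ISO-8859-1) characters; both function names are hypothetical.

```php
<?php
// Illustrative only: simulates UTF-8 bytes being misread as Latin-1.
function latin1_misread( string $utf8 ): string {
	// Each input byte is treated as one Latin-1 character and re-encoded
	// as UTF-8, which doubles the length of every non-ASCII character.
	return mb_convert_encoding( $utf8, 'UTF-8', 'ISO-8859-1' );
}

// Reverses the misreading above, recovering the original bytes.
function latin1_repair( string $mojibake ): string {
	return mb_convert_encoding( $mojibake, 'ISO-8859-1', 'UTF-8' );
}

$clean    = "\xC3\xB6";                 // "ö" as UTF-8
$mojibake = latin1_misread( $clean );
echo bin2hex( $mojibake ) . "\n";       // c383c2b6, which displays as "Ã¶"
echo bin2hex( latin1_repair( $mojibake ) ) . "\n"; // c3b6
```

The repair is only safe on data known to be damaged this way, which is why applying a blanket conversion to names that are already clean is risky.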

I set Claude loose on a single specific username:

What's happening with Toni Viemerö:

  • DB returns clean UTF-8: bytes 54 6F … C3 B6 (the C3 B6 is ö).
  • mb_convert_encoding() auto-detects from the list in order, and the UTF-8 bytes C3 B6 are also valid JIS X 0201 (single-byte halfwidth katakana ﾃ + ｶ).
  • Because JIS is listed before UTF-8, PHP 8.4's mb_detect_encoding picks JIS and "converts" the bytes, corrupting ö into ﾃｶ (U+FF83 U+FF76).
  • json_encode then emits "Toni Viemer\uff83\uff76".
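
The byte-level misreading described in the list above can be reproduced without relying on mbstring's detection. The sketch below is an illustrative decoder, not mbstring's implementation; it assumes 8-bit JIS X 0201, where bytes A1 to DF map to the halfwidth katakana block U+FF61 to U+FF9F (code point = 0xFEC0 + byte). The function name is hypothetical.

```php
<?php
// Illustrative decoder: reads each byte as 8-bit JIS X 0201.
function decode_as_jis_x0201( string $bytes ): string {
	$out = '';
	foreach ( str_split( $bytes ) as $byte ) {
		$value = ord( $byte );
		if ( $value >= 0xA1 && $value <= 0xDF ) {
			// Halfwidth katakana range: U+FF61..U+FF9F.
			$out .= mb_chr( 0xFEC0 + $value, 'UTF-8' );
		} else {
			// Simplification: all other bytes are passed through unchanged.
			$out .= $byte;
		}
	}
	return $out;
}

$name = "Toni Viemer\xC3\xB6"; // clean UTF-8, ending in ö (C3 B6)
echo json_encode( decode_as_jis_x0201( $name ) ) . "\n";
// "Toni Viemer\uff83\uff76"
```

Both C3 and B6 fall inside the A1 to DF range, so every two-byte UTF-8 sequence starting with C3 and ending in a byte from A1 to BF decodes "successfully" as two katakana characters, which is consistent with the accented Latin letters listed in the description being affected.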

I've had Claude audit all other users, and found 22 users who still needed this code. I've now re-saved those users' profiles, and the users table is correct, resulting in:

17,733 distinct credited users examined

  • 17,015 pure ASCII display_names (96%)
  • 458 valid UTF-8 with non-ASCII (these get garbled by the current mb_convert_encoding JIS bug)
  • 22 display_names that are invalid UTF-8 — These are now fixed
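
A three-way split like the one above could be produced with a check along these lines. This is a hedged sketch rather than the audit that was actually run: the function name, the category labels, and the sample names other than the two taken from this ticket are all made up for illustration.

```php
<?php
// Hypothetical audit helper; name and labels are illustrative.
function classify_display_name( string $name ): string {
	// No byte above 0x7F means the name is pure ASCII.
	if ( ! preg_match( '/[^\x00-\x7F]/', $name ) ) {
		return 'ascii';
	}
	// Non-ASCII bytes are present: either well-formed UTF-8 or not.
	return mb_check_encoding( $name, 'UTF-8' ) ? 'utf8' : 'invalid';
}

// Sample input: one made-up ASCII name, two names from this ticket,
// and one made-up name ending in a stray Latin-1 byte (E9).
$names  = array( 'Example User', 'Alvaro Gómez', 'Eliezer Peña', "Example Nam\xE9" );
$counts = array_count_values( array_map( 'classify_display_name', $names ) );
print_r( $counts );
```

Only names in the 'invalid' bucket would need any repair; names in the 'utf8' bucket are exactly the ones a blanket conversion can damage.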

#5 @dd32
4 days ago

  • Owner set to dd32
  • Resolution set to fixed
  • Status changed from new to closed

In 14904:

API: Credits: Drop the Multibyte conversion & HTML encoding.

This reverts [2202] which is no longer needed now that wp_users contains correct UTF8 characters for credited users.

Bumps cache group to clear stale caches.

Fixes #8253.

#6 @dd32
4 days ago

In 14905:

API: Credits: Properly link to user nicenames.

The credits API "Groups" are keyed by user_login, not by user_nicename. This has meant that the links to profiles on credits.php were using the login, which may include spaces or other characters.

This updates the API such that it always outputs user_nicename for that field.

Cache sets are dropped here as they added no value.

Props peterwilsoncc for noticing.
See #8253.

#7 @TobiasBg
3 days ago

Follow-up ticket for other places with this issue: https://meta.trac.wordpress.org/ticket/8254
