Opened 2 months ago

Closed 8 days ago

#3192 closed enhancement (fixed)

Provide an API endpoint to fetch plugin checksums

Reported by: schlessera
Owned by: dd32
Milestone:
Priority: normal
Component: Plugin Directory
Keywords:
Cc:

Description

In the context of the WordPress Plugins & Themes Checksums Project, we need an API endpoint on the .org infrastructure that can serve plugin checksums for all the files for all plugins from the plugin repository, in all their versions.

Although we had first planned to build this on a separate server, more recent discussions with @Otto42 and @dd32 have produced the current plan:

  1. Hook into the plugin repository's ZIP downloads mechanism to add the checksums as additional files to the downloads SVN.
  2. Build a first iteration of an API endpoint for plugin checksums in collaboration with the Systems team to be directly deployed to the .org servers.
  3. Postpone theme support for a future iteration, as the theme repository has not yet caught up to the same infrastructure as the plugin repository.

This ticket will handle the .org side of the project (checksums storage & API endpoint), while the consuming side will be handled in the corresponding GitHub ticket.

( Related ticket that might make use of 1. above: #619 )

Change History (26)

#1 @dd32
2 months ago

I've done some prelim testing and have a POC working for plugins, which has brought up a few issues which will probably cause plugins to be a bad test case for this (strangely enough, themes would probably be easier, even with the older zip serving methods). I'll go into further details tomorrow when I'm able to.

#2 @dd32
2 months ago

  • Owner set to dd32
  • Status changed from new to accepted

Okay, so there are a few options for generating checksums. I'm going to outline the two main ones, which I investigated briefly and ended up writing a POC for:

Create checksums when we create ZIPs
ie. we create plugin-slug.1.2.3.zip and also create plugin-slug.1.2.3.checksums.json

This seems like the simplest option, but it's also the most unreliable and useless method available.

  • 49% of WordPress.org plugin packages have a version in the filename, but in ~8% of those it's not actually the same as the version of the plugin inside the ZIP.
    • The reason the versions are mismatched is that the SVN tag and the plugin header don't match.
  • 51% of ZIPs don't include a version at all (being built from trunk).

That means only about 45% of ZIPs (49% × 92%) end up with a filename version that matches the plugin version.
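The 45% figure follows directly from the two percentages above. A quick check, using only the rounded shares quoted in this comment (the variable names are just for illustration):

```python
# Rounded shares quoted above; the underlying figures are approximate.
versioned_share = 0.49   # ZIPs with a version in the filename
mismatch_share = 0.08    # of those, filename version != plugin header version

# Share of all ZIPs whose filename version matches the plugin version.
matching_share = versioned_share * (1 - mismatch_share)
print(f"{matching_share:.1%}")  # 45.1%
```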

Create checksums when we create ZIPs, but based on the plugin header version

This seems to be the best option to me. Plugin Zips should've been done like this too IMHO.

When generating ZIPs, we create a JSON file of the checksums. If a checksum file for that version already exists, we update it to include both the previous and the current checksums.
For example:

  • create example-plugin/tags/v2.0 with Version: 2.0.1 we create example-plugin.v2.0.zip already and we'll add example-plugin.2.0.1.checksums.json.
  • update example-plugin/trunk/plugin.php as stable_tag at version 2.0.1 with a new version and we'll create example-plugin.zip and example-plugin.2.0.1.checksums.json.

We wouldn't create checksums for the development version of a plugin (ie. /trunk/).
We would create checksums if /trunk/ is the stable_tag though, following the above file naming.
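To make the naming rule concrete, here is a minimal sketch of how the two filenames diverge. The function names are hypothetical and not taken from the Plugin Directory code; only the filename patterns come from the examples above.

```python
from typing import Optional


def checksum_filename(slug: str, header_version: str) -> str:
    """Name the checksum file after the plugin header version, not the SVN tag."""
    return f"{slug}.{header_version}.checksums.json"


def zip_filename(slug: str, svn_tag: Optional[str]) -> str:
    """ZIPs keep following the SVN tag; builds from trunk carry no version."""
    return f"{slug}.{svn_tag}.zip" if svn_tag else f"{slug}.zip"


# tags/v2.0 with "Version: 2.0.1" in the plugin header:
print(zip_filename("example-plugin", "v2.0"))        # example-plugin.v2.0.zip
print(checksum_filename("example-plugin", "2.0.1"))  # example-plugin.2.0.1.checksums.json
```

Both the tagged build and a stable release from trunk at header version 2.0.1 would end up feeding the same checksum file under this scheme.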

Creating checksum files for older versions that were released from /trunk/, or whose tags were later modified, is unlikely, but we'd have them going forward.

Database table of checksums
Another option raised was storing it in a DB table instead of flat files. This doesn't seem sane to me from a caching / updating perspective: the amount of data would grow very quickly, and it wouldn't actually use any features of SQL other than a key-value store.
(Quick maths: 50k plugins * 15 versions on average * 10 files on average = 7.5 million rows, and that's not taking into consideration that larger plugins have more files and would skew it upwards IMHO)

Themes

Themes are strangely much easier.

  • Can't have multiple versions of a file in any given version.
  • ZIP file names contain the Version Header
  • SVN tags are based on the Version Header

Theme Authors don't have access to SVN, so all the version numbers are handled correctly without human error.

Checksums could therefore be generated/served very easily there, but we'd probably leave that until the ZIP storage method for themes is updated to a similar process that we use for Plugins.

#3 @dd32
2 months ago

In 6022:

Plugin Directory: Generate md5 hashes for plugins.

This is a POC and may change or be removed in the future, it's here for testing purposes.

An api.wordpress.org endpoint may be available in the future to access it.
This is only enabled for the 'exploit-scanner' plugin at present, purely for testing, as it publishes the md5 hashes of its own files already

Compare https://wordpress.org/plugins/exploit-scanner/ to https://downloads.wordpress.org/plugins/exploit-scanner.1.5.2.checksums.json

See #3192

This ticket was mentioned in Slack in #cli by schlessera.

8 weeks ago

#5 @dd32
8 weeks ago

In 6042:

Plugin Directory: Checksums: When multiple changes are made to a file, ensure that the checksum is only listed once in checksum files.

See #3192

#6 follow-up: @schlessera
8 weeks ago

Some observations on #2 above:

A. Multiple checksums for a same file

Given the way ZIPs are being built, I suppose that different downloads of a ZIP of a given version name could result in different files as the actual content of the ZIP, depending on when the ZIP was built and downloaded (i.e. what SVN revision it was built at). To reliably check the files on existing WordPress installations would mean that we need to have all possible checksums for any given file at a named version, not only the latest for that named version. So, if a ZIP is being generated, and the checksum file already exists, the newly calculated checksums need to be added to the checksums file, if they are different from the ones that are already included.
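The merge rule described here could look roughly like the following. This is a sketch only: the in-memory layout (a list of hashes per filename) and the function name are assumptions, not the POC's actual code or file format.

```python
def merge_checksums(existing: dict, new: dict) -> dict:
    """Merge freshly computed {filename: hash} into {filename: [hashes]}.

    A hash is appended only if it is not already recorded for that file,
    so rebuilding an unchanged ZIP leaves the result untouched.
    """
    merged = {name: list(hashes) for name, hashes in existing.items()}
    for name, digest in new.items():
        known = merged.setdefault(name, [])
        if digest not in known:
            known.append(digest)
    return merged


# Placeholder hash values, for illustration only.
stored = {"plugin.php": ["aaa"]}
rebuilt = {"plugin.php": "bbb", "NEW-FILE": "ccc"}
print(merge_checksums(stored, rebuilt))
# {'plugin.php': ['aaa', 'bbb'], 'NEW-FILE': ['ccc']}
```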

If I understand correctly, the provided POC already does this. However, I wonder how we can retroactively calculate these multiple checksums for all existing versions. Does it make sense to merge all revisions that are tagged with a same version, or can some of them be excluded through some existing data?

B. Test plugin

The exploit plugin is great for making sure the calculated hashes are correct, but it seems to be very simple and properly versioned. Can we add one other plugin, that is known to include as many edge cases as possible? A plugin that has multiple released versions with a same named version tag, for example, to test the merging of multiple checksums for a same file.

C. Database

Regarding the database, I agree that an SQL database is not a good fit, but actually using SQL was never explicitly mentioned. If we want to examine the use of a database, we should rather examine the use of a NoSQL database like Redis, MongoDB, or Cassandra. They provide similar benefits to a file-based approach, but with added optimizations and integrity checks. They can also be synced more reliably across multiple servers.

However, I suppose that these are not part of the existing infrastructure, so might not even be an option at all.


#7 in reply to: ↑ 6 ; follow-up: @dd32
8 weeks ago

Replying to schlessera:

Some observations on #2 above:

A. Multiple checksums for a same file

Given the way ZIPs are being built, I suppose that different downloads of a ZIP of a given version name could result in different files as the actual content of the ZIP, depending on when the ZIP was built and downloaded (i.e. what SVN revision it was built at).

Correct

To reliably check the files on existing WordPress installations would mean that we need to have all possible checksums for any given file at a named version, not only the latest for that named version.

That is what I outlined in Comment:2 above, and as a result, the POC although it's in the ZIP generation class, *does not link checksums to a ZIP file* - they may by coincidence match, but you cannot assume that plugin.2.0.zip will have checksums in the 2.0 file (It might be in 2.0.1 instead, because the ZIP is named incorrectly).

However, I wonder how we can retroactively calculate these multiple checksums for all existing versions. Does it make sense to merge all revisions that are tagged with a same version, or can some of them be excluded through some existing data?

We could retroactively calculate them, but I don't think we should bother; it's not worth anyone's time to do so.
We can create the checksums for previous tagged releases, which will cover the vast majority of plugins. Those who made multiple releases on a tag or from trunk would only have checksums for the latest version of the files.
In other words, checksums going forward will be useful, but historical checksums for a plugin which had 50 releases from /trunk/ in 2009 won't be (Only the latest release that was made from /trunk/ would be).

B. Test plugin

The exploit plugin is great for making sure the calculated hashes are correct, but it seems to be very simple and properly versioned. Can we add one other plugin, that is known to include as many edge cases as possible? A plugin that has multiple released versions with a same named version tag, for example, to test the merging of multiple checksums for a same file.

Here's another test checksum file, which covers a file being added in a later version update, and the main plugin file being updated.
https://downloads.wordpress.org/plugins/test-plugin-3.1.1.2-20160302.checksums.json
If you wish, I can add that to the regular generation & give you commit access to it so you can add to it

C. Database

We won't be investigating anything regarding this at this time.

#8 in reply to: ↑ 7 @schlessera
8 weeks ago

Replying to dd32:

That is what I outlined in Comment:2 above, and as a result, the POC although it's in the ZIP generation class, *does not link checksums to a ZIP file* - they may by coincidence match, but you cannot assume that plugin.2.0.zip will have checksums in the 2.0 file (It might be in 2.0.1 instead, because the ZIP is named incorrectly).

Yes, that makes sense. Are there any plans to change the way ZIPs are named? There's no direct need, afaict, but having different ways of naming versions is bound to cause confusion at some point.

If you wish, I can add that to the regular generation & give you commit access to it so you can add to it

Yes, please.

We won't be investigating anything regarding this at this time.

Yes, thought so already, and I don't think that's necessary. If we hit any major issues later down the road, we can always re-assess.

#9 @dd32
8 weeks ago

Yes, that makes sense. Are there any plans to change the way ZIPs are named? There's no direct need, afaict, but having different ways of naming versions is bound to cause confusion at some point.

Unfortunately not. ZIPs are tied to SVN tags, not plugin versions.

#10 @dd32
8 weeks ago

In 6047:

Plugin Directory: Checksums: Add test-plugin-3 to the POC checksum generation code.

See #3192

#11 follow-up: @schlessera
7 weeks ago

I just tested changing the already tagged version of the plugin, and I could see the changed checksums in the JSON file after about 5 seconds. I think that delay is negligible, so we don't need any additional logic to deal with delays.

This ticket was mentioned in Slack in #cli by rkialashaki.

7 weeks ago

#13 @dd32
6 weeks ago

In 6061:

Plugin Directory: Checksums: Add sha256 checksums alongside md5.
This change also adds the full URL to the ZIP & SVN Source location, which may be an array when a single plugin version can be found within multiple ZIPs.

See #3192
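For readers who want to reproduce the hashes locally, computing md5 and sha256 for every file of an unpacked plugin can be sketched with Python's standard library. The function name and the per-file result layout are assumptions made for this example, not the format produced by this commit.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict:
    """Return {relative_path: {"md5": ..., "sha256": ...}} for each file under root."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            # Use forward slashes so keys look the same on every platform.
            result[path.relative_to(root).as_posix()] = {
                "md5": hashlib.md5(data).hexdigest(),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return result
```

Calling `hash_tree(Path("exploit-scanner"))` on an unpacked copy of a plugin would give a mapping that can be compared against the published checksum file.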

#14 in reply to: ↑ 11 @dd32
6 weeks ago

Replying to schlessera:

I just tested changing the already tagged version of the plugin, and I could see the changed checksums in the JSON file after about 5 seconds. I think that delay is negligible, so we don't need any additional logic to deal with delays.

While there's caching on downloads.wordpress.org, both ZIPs and Checksum caches should clear at about the same time. There'll only ever be delays when the SVN processing is slowed down / delayed due to system issues or a large spike in plugin commits (such as what happens just after a WordPress release).

#15 @schlessera
5 weeks ago

@dd32 The way the md5 and sha256 are stored now duplicates all of the file names for every single type of hash we add (so x2 now, but might be even more later down the road).

That's why we opted to turn around the schema, so that files are enumerated first, and for every file, you have the md5 hashes and then the sha256 hashes:

{
    "plugin": "test-plugin-3",
    "version": "1.1.2-20160302",
    "source": [
        "https://plugins.svn.wordpress.org/test-plugin-3/tags/test-tag/",
        "https://plugins.svn.wordpress.org/test-plugin-3/tags/tag1/"
    ],
    "zip": [
        "https://downloads.wordpress.org/plugins/test-plugin-3.test-tag.zip",
        "https://downloads.wordpress.org/plugins/test-plugin-3.tag1.zip"
    ],
    "files": {
        "NEW-FILE": {
            "md5": "d41d8cd98f00b204e9800998ecf8427e",
            "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
        },
        "plugin.php": {
            "md5": [
                "01565c8754903cb7b29b2e851a34b866",
                "4fdce3922d9dd6c6f717cac910dc10ba",
                "9f9e54ca1e325013bf523d6ec31a4a49",
                "79605ba19f6a0682007b314a90a2dba4"
            ],
            "sha256": [
                "2da8793f9ee56199ac7e88f34e07e412a97e91133c31068916c2e025af8ddac7",
                "221a74f498193d8f666a3f5836d9c122b0b4a3117dd05fd224f5cecffccd3837",
                "b94a3155ca9d14c205672a25a2dbadab39e69b0958fd299b847b08a7bfc08cf0",
                "1decb3435fb6915c7b3c3397f24d0a60ee83d3f4cb55a34ea905f5e164cf90c0"
            ]
        },
        "README.md": {
            "md5": "11995b9377c5bc4afcd46fb49f9bf887",
            "sha256": "ec210c8e2a08f87cbade08ad5eb4577586367dee9cc093579dd3c081d5872d61"
        }
    }
}

This multiplies the md5 and sha256 strings instead, but as the filenames will contain paths and can potentially be very long, this is preferable.
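On the consuming side, a per-file entry in this schema holds either a single hash string or a list of hashes, so a check has to accept both. A minimal sketch, assuming the caller has already parsed the JSON and picked the entry for the file (the function name is hypothetical):

```python
import hashlib


def file_matches(content: bytes, entry: dict) -> bool:
    """Check content against a "files" entry whose hashes are a string or a list."""
    expected = entry["sha256"]
    if isinstance(expected, str):
        expected = [expected]
    return hashlib.sha256(content).hexdigest() in expected


# The "NEW-FILE" entry from the example holds the hashes of an empty file.
new_file_entry = {
    "md5": "d41d8cd98f00b204e9800998ecf8427e",
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}
print(file_matches(b"", new_file_entry))           # True
print(file_matches(b"tampered", new_file_entry))   # False
```

Matching against any listed hash is what makes files valid for every variant released under the same version.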


#16 @dd32
5 weeks ago

@schlessera I don't see duplicating filenames as being an issue - size is always going to be high with plugins with lots of filenames or long path names.

However, I also don't feel strongly either way, and nesting it this way probably works better in the long run.

#17 @dd32
5 weeks ago

In 6114:

Plugin Directory: Checksums: Combine the md5 and sha256 fields back into the files property.

See #3192

#18 @schlessera
5 weeks ago

Thanks for the quick change, @dd32 !

I think the structure of the JSON file fully meets our needs for now.

What is the next step to get the actual API endpoint that redirects to the corresponding file, so that we have a proper URL to work against?

The URL we want to target is /checksums/1.0/?plugin=<slug>&version=<version>.

This ticket was mentioned in Slack in #cli by schlessera.

3 weeks ago

#20 @dd32
3 weeks ago

In 6163:

Plugin Directory: Checksums: Move the checksums to their final URL, https://downloads.wordpress.org/plugin-checksums/$plugin/$version.json.

See #3192

#21 @dd32
3 weeks ago

Since there's no actual need for it to be hosted on api.wordpress.org, and it's mostly static, after a chat we've decided that it'll be better to just leave it on downloads.wordpress.org.

I've moved it to a cleaner URL structure which is easier to construct, and which we can cache/route differently in the future if need be.
The new URL is: https://downloads.wordpress.org/plugin-checksums/exploit-scanner/1.5.2.json, or in general https://downloads.wordpress.org/plugin-checksums/$plugin/$version.json (the .json suffix is optional too).

This lacks a version in the URL structure; however, in the event we need to version it in the future, we'll be able to add support to serve those as https://downloads.wordpress.org/plugin-checksums/1.1/exploit-scanner/1.5.2.json instead.
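For clients, constructing the URL is a simple string join. The sketch below follows the pattern given in this comment; the function name, the percent-encoding of the path segments, and the `api_version` parameter are assumptions, and the versioned form is only the possible future layout mentioned above, not something that is served today.

```python
from typing import Optional
from urllib.parse import quote

BASE = "https://downloads.wordpress.org/plugin-checksums"


def checksums_url(plugin: str, version: str, api_version: Optional[str] = None) -> str:
    """Build the checksum URL for a plugin slug and plugin header version."""
    parts = [BASE]
    if api_version:
        # Hypothetical future layout with an API version segment.
        parts.append(quote(api_version, safe=""))
    parts += [quote(plugin, safe=""), quote(version, safe="") + ".json"]
    return "/".join(parts)


print(checksums_url("exploit-scanner", "1.5.2"))
# https://downloads.wordpress.org/plugin-checksums/exploit-scanner/1.5.2.json
```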

@schlessera final check, anything else needed before this starts rolling out to all plugin slugs?

#22 @schlessera
2 weeks ago

@dd32 As already discussed on Slack, this is good to go from my side. You can roll this out for all plugins.

#23 @dd32
2 weeks ago

In 6184:

Plugin Directory: Enable checksum building for all plugins.

See #3192

#24 @dd32
2 weeks ago

A job to create them for existing plugins is running, but will take quite some time to finish processing.

It's building based on active_install count, so popular plugins will be built first.

Currently building the checksums for the latest version, and any other versions in svn tags.

As previously stated, the initial checksum files will only have the checksums for the latest files for each version available. If a plugin has released many variants of many versions from /trunk/, we'll only build the checksums for the latest variant of the latest version's files.
If a plugin has deleted the tags (as some do) then those are excluded too.

If we want to go into more depth than that, then we can, but only after the initial files are generated. We'll need to figure out some way of parsing the 1.7m SVN revs and coming up with a game plan after that.

This ticket was mentioned in Slack in #meta by ocean90.

2 weeks ago

#26 @dd32
8 days ago

  • Resolution set to fixed
  • Status changed from accepted to closed

Looks like the job finished over the weekend.

If any plugins don't have the checksums available, please ping me with the plugin slug.
Checksums will also be rebuilt with the ZIP builder: php bin/rebuild-zip.php --plugin hello-dolly
