Reproducibility is an important tool for empowering users. Why should a user care about that? Let me elaborate.
For a piece of software to be reproducible means that everyone with access to the software’s source code is able to build its binary form (e.g. the executable that gets distributed). What’s the big deal? Isn’t that true for any project with accessible source code? Not at all. Reproducibility means that the resulting binary EXACTLY matches what gets distributed: each and every bit and byte of the binary is identical, no matter on which machine the software is built.
The benefit of this is that, on top of being able to verify that the source code doesn’t contain any spyware or unwanted functionality, the user is now also able to verify that the distributable they got, e.g. from an app store, has no malicious code added to it. If, for example, the Google Play Store injected spyware into an application submitted by a developer, the binary distributed via the store would differ from the binary built by the user.
Why is this important? Well, Google already requires developers to submit their signing keys, so they can modify software releases after the fact. Now, reproducibility becomes a tool to verify that Google did not tamper with the binaries.
I try to make PGPainless builds reproducible as well. A few months ago I added some lines to the build script which were supposed to make the project reproducible by using static file modification dates, as well as a deterministic file order in the JAR archive:
```groovy
// Reproducible Builds
tasks.withType(AbstractArchiveTask) {
    preserveFileTimestamps = false
    reproducibleFileOrder = true
}
```
It took a bit more tinkering back then to get it to work, though: I was using a Properties file written to disk at build time to access the library’s version at runtime, and it turns out that the default Writer for Properties files includes the current date and time in a comment line. This broke reproducibility, as the file would now be different each time the project was built. I eventually managed to fix that by writing the file myself using a custom Writer. When I tested my build script back then, both my laptop and my desktop PC produced the exact same JAR archive. I thought I was done with reproducibility.
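The fix can be sketched in plain Java like this (an illustrative sketch, not the actual PGPainless code; the class name is made up): instead of calling `Properties.store()`, which always prepends a `#`-prefixed timestamp comment, the entries are written out manually, in sorted key order so the output is byte-for-byte identical on every machine.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class DeterministicProperties {

    // Write properties without the date comment that Properties.store()
    // emits, and in sorted key order for a deterministic byte sequence.
    // Note: unlike Properties.store(), this sketch does not escape
    // special characters, which is fine for simple version strings.
    public static void store(Properties properties, Writer out) throws IOException {
        List<String> keys = new ArrayList<>(properties.stringPropertyNames());
        Collections.sort(keys);
        for (String key : keys) {
            out.write(key + "=" + properties.getProperty(key) + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("version", "1.2.2");
        StringWriter writer = new StringWriter();
        store(props, writer);
        System.out.print(writer); // no "#<date>" comment line in the output
    }
}
```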
Today I drafted another release for PGPainless. I noticed that my table of reproducible build hashes for each release was missing the checksums for some recent releases. I quickly checked out those releases, computed the checksums and updated the table. Then I randomly chose the 1.2.2 release and decided to check whether the checksum published to Maven Central still matched my local checksum. To my surprise, it didn’t! Was this a malicious act by Maven Central?
Release 1.2.2 was created while I was on my journey through Europe, so I had used my laptop to draft the release. The first thing I did was grab the laptop, check out the release’s source in Git and build the binaries. Et voilà, I got checksums matching those on Maven Central. So it wasn’t an attack; for some reason my laptop was simply producing different binaries than my main machine.
I transferred the “faulty” binaries over to my main machine to compare them in more detail. First I tried Meld, which is a nice graphical diff tool for text files. It completely froze though, as it is apparently not so great at comparing binary files.
Next I decompressed both binaries and compared the resulting folders. Meld did not report any differences, the directories matched perfectly. What the heck?
Next I tried `diff 1.jar 2.jar`, which very helpfully displayed the message “Binary files 1.jar and 2.jar are different”. Thanks for nothing. After some more research, I found out that you could use the flag `--text` to make `diff` spit out more details. However, the output was not really helpful either, as the binary files produced lots of garbled output in the terminal.
I did some research and found that there were special diff tools for JAR files. Checking out one project called jardiff looked promising initially, but eventually it reported that the files were identical. Hm…
Then I opened both files in `ghex` to inspect their raw bytes in hexadecimal. By chance I spotted some differences near the end of the files.
The same region in the other JAR file looked identical, except that the byte `A4` was replaced with `B4`. Strange. I managed to find a command which located and displayed all mismatching offsets between the two JAR files:
```
$ cmp -l 1.jar 2.jar | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}'
00057B80 ED FD
00057BB2 ED FD
00057BEF A4 B4
00057C3C ED FD
00057C83 A4 B4
00057CDE A4 B4
00057D3F A4 B4
00057DA1 A4 B4
...
```
Weird: in many places `ED` got changed to `FD`, and `A4` got changed into `B4`, in what looked like some sort of index near the end of the JAR file. At this point I was sure that my answer would unfortunately lie within the ZIP standard. Why ZIP? From what I understand, JAR files are mostly ZIP files. Change the file ending from .jar to .zip and any standard ZIP tool will be able to extract your JAR file. There are probably nuances, but if there are, they don’t matter for the sake of this post.
The first thing I did was check which versions of `zip` were installed on both of my machines. To my surprise they matched, and since I wasn’t even sure whether JAR files are generated using the standard `zip` tool, this was a dead end for me. Searching the internet some more eventually led me to this site describing the file structure of PKZIP files. I originally wasn’t sure whether PKZIP was what I was looking for, but I had seen the characters PK when inspecting the hex code earlier, so I gave the site a try.
Somewhere I read: “The signature of the local file header. This is always '\x50\x4b\x03\x04'.”
Aha! I just had to search for the octets `50 4B 03 04` in my file! They should be in proximity to the bytes in question, so I just had to read backwards until I found them. Aaaand: `50 4B 01 02`. Damn, this wasn’t it. But `01 02` looked so suspiciously non-random; maybe I had overlooked something? Let’s continue reading the website. Aha! “The Central directory file header. This is always "\x50\x4b\x01\x02".”
The section even described its format in a nice table. Now I just had to manually count the octets to determine what exactly differed between the two JAR files.
It turns out that the octets `00 00 A4 81` I had observed changing in the file were labeled “external file attributes; host-system dependent”. Again, not very self-explanatory, but something I could eventually throw into a search engine.
Some post on StackOverflow suggested that this had to do with file permissions. Apparently ZIP files (and by extension also JAR files) use the external attributes field to store read and write permissions of the files inside the archive. Now the question turned into: “How can I set those to static values?”
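To connect the dots: on Unix-like systems, the upper 16 bits of the external attributes field hold the file’s `st_mode`, i.e. its type and permission bits. The bytes `A4 81` are little-endian for `0x81A4`, a regular file with permissions 644, while `B4 81` (`0x81B4`) means permissions 664. A hedged sketch that scans a ZIP/JAR file for central directory headers and decodes this field, using the offsets from the PKZIP structure description (the `1.jar` file name is a placeholder):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExternalAttributes {

    // Scan the raw bytes of a ZIP/JAR file for central directory file
    // headers (signature 50 4B 01 02) and map each entry name to the
    // mode stored in the upper 16 bits of the "external file attributes"
    // field. The scan is naive (it could in theory match the signature
    // inside compressed data), but it is good enough for inspecting a
    // suspicious archive by hand.
    public static Map<String, Integer> readModes(byte[] zip) {
        Map<String, Integer> modes = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i + 46 <= zip.length; i++) {
            if (buf.getInt(i) != 0x02014b50) { // "PK\1\2", little-endian
                continue;
            }
            int nameLength = buf.getShort(i + 28) & 0xFFFF; // file name length
            if (i + 46 + nameLength > zip.length) {
                continue;
            }
            int externalAttrs = buf.getInt(i + 38);  // external file attributes
            int mode = externalAttrs >>> 16;         // Unix st_mode in the high bits
            String name = new String(zip, i + 46, nameLength, StandardCharsets.UTF_8);
            modes.put(name, mode);
        }
        return modes;
    }

    public static void main(String[] args) throws IOException {
        readModes(Files.readAllBytes(Paths.get("1.jar")))
                .forEach((name, mode) -> System.out.printf("%s: %o%n", name, mode & 07777));
    }
}
```

Run against the two archives, a tool like this would have shown the 644-vs-664 (and 755-vs-775) permission difference directly, instead of leaving me to count octets by hand.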
After another hour of researching the internet with permutations of the search terms jar, archive, file permissions, gradle, external attributes and zip, I finally stumbled across a bug report in another software project that described the exact same issue I had: differing JAR files on different machines. In their case, their CI built the JAR file in a Docker container and set different file permissions than the locally built file, hence the differing JAR archives.
In the end I found a bug report on the Gradle issue tracker which described my issue exactly and even presented a solution: `dirMode` and `fileMode` can be used to statically set the permissions of files and directories in the JAR archive. One of the comments in that issue reads:
We should update the documentation.
The bug report was from 3 years ago…
Yes, this would have spared me 3 hours of debugging 😉
But I probably would also not have gone onto this little dive into the JAR/ZIP format, so in the end I’m not mad.
Finally, my build script now contains the following lines to ensure reproducibility, no matter the file permissions:
```groovy
// Reproducible Builds
tasks.withType(AbstractArchiveTask) {
    preserveFileTimestamps = false
    reproducibleFileOrder = true
    dirMode = 0755
    fileMode = 0644
}
```
Now my laptop and my desktop machine produce the same binaries again. Hopefully this post can help others fix their reproducibility issues.
Happy Hacking!