[incomplete draft]
Data compression works by squeezing out redundant patterns. When there are different kinds of patterns in the data, we use different compression schemes -- like GIF for line art and JPEG for photographs. There's no single scheme that compresses all data sources well.
Different compression schemes go with different data formats. Introducing a new format is a troublesome chicken-and-egg problem: nobody uses the encoding because users don't have the decoder, and therefore users never get the decoder. A better format can stall this way for years (or forever) waiting for some force to push it to critical mass -- thus GIF is still far more popular than PNG. The barrier to adoption generates political issues: companies try to appropriate an entire medium of communication by controlling a de facto standard, while more collaborative standards become sinks of complexity because everyone's pet features must get into the standard or die. With heavyweight standards, our media advance by giant lurches instead of smooth steps.
Maybe that's the best we can do. But there's no theoretical reason why: any compression scheme can be expressed as a computer program, and if we tacked that program onto a compressed data stream it would add only a constant overhead. So reducing all data compression formats to a single practical universal format takes, in principle, `just' some engineering. In this ideal world, it would be easy to introduce new schemes, they would evolve smoothly, there would be less to argue about, and nobody could extract monopoly rents from a compression format. (A patent-holder could still charge for their scheme, but not force anyone to use it in order to reach an installed base. Promoting an incompatible format, even for a new medium, would be a naked power grab.)
So what technical problems do we need to solve? If we follow the above recipe directly, we need a decompressor programming language that's:
A decompression program, even efficiently encoded, itself takes space; when sending many data streams that use the same scheme, you redundantly send their decompressor. There's an obvious improvement for this common case: allow the data stream to refer to its decompressor by name -- such as a URL it can be fetched from. (This opens up a security issue: a process that formerly only needed local computation now might need network access, and the mere act of opening a connection could leak sensitive information. Perhaps data stream readers would be configured with a gateway that decides when connections are allowed. I think this is the most serious problem with my proposal.) For well-known decompressor names the reader needn't go out and fetch the decompressor; it could be cached, or replaced with a precompiled native version.
There are some less obvious needs. First, data often comes in a stream, whose beginning must be processed before the end comes in. So we should be able to start running the decompressor before all the data is there, perhaps even before all of the decompressor is.
Second, many formats in practice are more than a simple source of bits: archives hold multiple named files accessed at random, images may display as a low-resolution overview before they're fully retrieved, time-based media like streaming audio must sync with a realtime clock, and so on. Different applications like this need to present, and call on, arbitrary APIs; so we've gone from standard data formats as the problem, to standard APIs. Is this progress? I think so; compared to passive data, an API combined with general computation greatly reduces the assumptions you need to build in beforehand. Another requirement for our problem statement, then: our format should work with an open-ended set of APIs.
There are practical compression standards already for the major media people use. Why bother with a new one?
[Actually there seems to be plenty of motion still in the compression world: http://www.eetimes.com/story/OEG20020322S0105]
That's a strong objection. I have three answers. First, if a need for a new format turns up, there should be a worked-out generic format design for it to adopt, to reap the benefits extolled here. Unless there's a design ready to use, it will be easier to make another specialized format, with the added temptation of lock-in. Once the general format is in use, there'll be further advantages for each new application that adopts it. New media standards are still being created; it was JPEG 2000's sheer complexity that prompted this proposal in reaction, and MPEG4 audio's Turing-complete encoding looks like it could have been built atop a scheme like this one, sharing much effort. (I have to admit to saying this as an ignorant outsider, though. Think of this as one attempt to see how life could be better.)
Second, there's one very important medium without such a standard: interactive computer programs. (A cynic may object that Pentium machine code under MS Windows is the standard; but that's so preposterous from a security standpoint that nobody but virus writers will use it for mobile networked programs, which I think is the genuinely interesting medium. Java had a chance here, but seems to have botched it. Perhaps Microsoft's CIL will take over, but it hasn't yet and I would like there to be an alternative.)
Third, a mobile code system has uses well beyond compression formats -- in fact, this is about the simplest possible application. Once you're using the system in these fancier ways, it becomes attractive to also use it as a compression format, for all the reasons I've given, plus, now, simplicity.
TODO: separate section on security issues here?
If data compression were the only goal, I would build this around Juice, a very nice portable encoding of the Oberon language. Since this has wider aims, and language-specific designs like the Java virtual machine have proven to be awkward targets for other languages, I've opted to create a low-level virtual machine more like a RISC CPU. Its basic instructions operate on a stack of 32-bit words. TODO: blah blah.
Here's how it tackles our requirements:
Portable: every basic operation has a deterministic effect. For instance, the order of bytes within words is big-endian, regardless of the native byte order of the machine it executes upon. (It'll change to little-endian before the 1.0 release since the most popular platform is little-endian.)
Safe to execute: no operation can affect anything outside the virtual machine's private memory, except through capabilities that the surrounding environment explicitly grants and the program explicitly invokes. (CPU cycles and memory are among the resources that must be explicitly granted, to guard against DoS attacks.) Furthermore, the above determinism property limits wall-banging to one direction only: if you don't give a program a capability to a source of nondeterminism, there's no possible channel for information to come in through. (It's impossible in general, however, to keep a general program from wall-banging bits out.) (Also, even though a program can be kept free of covert input channels, it might still use wall-banging or steganography in the output to send the results of parasitic computations. This seems a minor concern given the relative costs of computation vs. all the fancy engineering (technical and social) needed to get away with such a scheme.)
Simple: The machine is designed to bound the amount of code that must be audited to check these claims -- far simpler than the JVM, Microsoft CIL, etc.
Compact: there are two parts to this -- a compact object file format, and compact program code within the object file. An object file holds data in a straightforward binary pickle format. The form of program code deserves more explanation. TODO.
Quick: TODO
Streamable: it's easy for the data format to accommodate streaming as long as you design with it in mind. (In contrast, the original Idel code format started execution from the last definition in the file, forcing you to read the whole file before starting.)
TODO: flesh out:
basics of idel
need for simplicity: small TCB
capability-based security and resource management (ref http://erights.org)
wall-banging and determinism
combining code compression and fast execution:
1. base instruction set plus macroinstructions
2. runtime code generation with DEFER
current status
figures for the overhead with the sample huffman compressor
need to add media-processor instructions
This way of combining flexibility and efficiency in data formats through embedded code is a special case of the same sort of advantage of mobile code for protocols in general, as explained here: http://erights.org/elib/distrib/pipeline.html. TODO: acknowledge all the capabilities work that this design is based on...
Executable compressed files are of course nothing new, with two popular instances I know of: self-installing binaries and Postscript. The former are nonportable and insecure. Postscript is more interesting, but specialized: as a data format it's much too padded, and it wasn't designed for efficient general computation. (TODO: find out what sort of design it has for security and resource control...)
The Java virtual machine is a general mobile-code solution, but with serious shortcomings for this application. It has security issues: if you download an applet as a way of downloading a compressed file, the applet can by default do many other things, like communicate back to its server. (Despite having too many powers in this way, it also has too few: it can't write the decompressed file to disk! It's supposed to be possible to change the applet security policies, but the user interface doesn't encourage this.) You can't bound its resource consumption in any standard way. The delivery format is still painfully big. (TODO: measure it.) Java is a huge system you can't casually drop into your program, unless your program is itself in Java. It comes with an enormous trusted computing base. There's been little progress on these fronts in the last 6 years, as Sun focused on other markets.
TODO: others? MPEG 4, Telescript, Omniware, Agent Tcl, ARA, dotGNU, .NET, Juice, refs to the literature on code compression, UNCOL, C--, Mite, Dis, Glulx, Taos, Ertl's papers on efficient stack VMs, ...