Compression Games
Over these past few years I've become increasingly interested in the question of how to learn good compressors. Early on I thought that the problem of figuring out how to make a good compressor for a data stream was "simply" a problem of figuring out how to make a good probabilistic model of the data stream. But after wrangling with the problem for a while I have come to prefer to think about it as learning how to play a peculiar class of two-player games.
The Structure of Compression Games
A compression game is a game with two players. The first player is called the compressor C and the second player is called the decompressor D. The players take alternating turns.
Before the game starts an external actor picks a message to compress, call it $m$, which is provided in its entirety to C only. C's turn consists of providing a hint to D, which can depend on $m$ and need not be deterministically generated. During D's turn they provide a message back to C which can depend on the current history of hints; call these messages notions. The game ends when a pre-agreed condition is met: for example, D sends a special message indicating the end of the game, or the first hint specifies the total number of turns to take. After the game ends, the history of messages passed between the two players is fed into a reconstruction process which generates an estimate of the message, $\hat{m}$.
At this point the game is scored by assigning an encoding cost to the sequence of hints generated by C. The form of the encoding cost function requires some knowledge about the form of the hints, so I will defer making explicit choices about how to measure encoding costs until after discussing specific compression games. If the rules of the game are such that it is possible for $\hat{m}$ to differ from the original message then you can also add in a penalty term for any differences, in which case you are dealing with lossy compression.
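To make the turn structure concrete, here is a minimal sketch in Python of what a single episode of such a game might look like. The `Compressor` and `Decompressor` interfaces, the `play_game` helper, and the `is_done`/`reconstruct`/`encoding_cost` callbacks are just illustrative names of my own choosing; the hint and notion languages are left completely abstract.

```python
from typing import Any, Protocol, Sequence

Hint = Any    # whatever the hint language of a particular game is
Notion = Any  # whatever the notion language is

class Compressor(Protocol):
    def next_hint(self, message: str, notions: Sequence[Notion]) -> Hint: ...

class Decompressor(Protocol):
    def next_notion(self, hints: Sequence[Hint]) -> Notion: ...

def play_game(message, C, D, is_done, reconstruct, encoding_cost):
    """One episode of a compression game: alternate turns, then score the hint stream."""
    hints, notions = [], []
    while True:
        hints.append(C.next_hint(message, notions))  # C sees the message and all notions so far
        if is_done(hints):                           # pre-agreed stopping condition
            break
        notions.append(D.next_notion(hints))         # D sees only the hint history
    m_hat = reconstruct(hints, notions)              # rebuild an estimate of the message
    return m_hat, encoding_cost(hints)               # a lossy game would also penalize m_hat != message
```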
Turning a Compression Game into a Compression Algorithm
It may not be immediately clear that you could turn such a game into a usable compression algorithm. The turn structure and the messages sent back from the decompressor to the compressor during the process seem to mess things up.
The usual way of thinking about compression is as a pair of functions, a compressor and a decompressor, that are approximate inverses of each other. Framed as a turn-based game, the natural setup seems to be to give the compressor and decompressor exactly one turn each, with information flowing only from the compressor and nothing coming back from the decompressor.
Also, in order to make a useful compression algorithm you need the decompressor player to play deterministically, since otherwise you wouldn't be guaranteed to get the same output file each time. That requirement effectively turns the two-player compression game into a single-player game, because the second player just follows the instructions of the first. That sort of compression game doesn't seem like it would be a very fun game to play (though potentially still a very fun game to design the rules of).
But once you have required that the decompressor behave deterministically, you actually no longer need to restrict the game to a single round of turns or to one-way communication. In principle, everything the decompressor knows at any point the compressor also knows. So if the behavior of the decompressor is deterministic and known beforehand, the compressor can simply run an emulation of the decompressor as they play in order to figure out what messages the decompressor would send back after each turn.
The compressor player could then output the entire stream of hints from the whole multi-turn game on their very first turn, since they can figure out how the game would play out without actually consulting player 2. So with the additional requirement that player 2 plays deterministically (and their strategy is known to player 1), the multi-turn, two-way communication compression game is equivalent to the single-turn, one-way compression setup that may seem more familiar.
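In code the flattening argument might look something like this, reusing the hypothetical interfaces from the sketch above: the compressor drives an emulated copy of the deterministic decompressor, and the real decompressor later replays its own turns from the hint stream alone.

```python
def compress(message, C, D, is_done):
    """Flatten the multi-turn game into a one-shot encoder by emulating D.

    Because D's strategy is deterministic and known in advance, C can recompute
    every notion locally and emit the entire hint stream in a single turn.
    """
    hints, notions = [], []
    while True:
        hints.append(C.next_hint(message, notions))
        if is_done(hints):
            return hints                          # the hint stream is the "compressed file"
        notions.append(D.next_notion(hints))      # emulated decompressor turn

def decompress(hints, D, reconstruct):
    """The real decompressor replays its own deterministic turns from the hints alone."""
    notions = [D.next_notion(hints[:i]) for i in range(1, len(hints))]
    return reconstruct(hints, notions)
```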
An Aside on Extra Notional Communication Channels
It is worth pausing for a moment to talk about the absence of a notional communication channel running from the compressor to the decompressor. After all, the very same argument about determinism allowing messages to be passed via emulation of the process that generates them still applies, so long as those messages don't directly depend on the input message to be compressed. This is certainly true, and earlier versions of my thinking about compression as a game with learnable strategies included this extra channel of information.
But during deterministic compression mode "notions" aren't really passed from the decompressor to the compressor (though they can be actually passed along when learning to play the game). These messages, and the whole turn-based structure, are just one possible way to do a mental decomposition of the compression process. I find that thinking about there being this extra communication channel from player 2 to player 1 helps me think clearly about what is going on and think creatively about what those messages could be.
While in principle you can also have an extra channel of notional messages passed from compressor to decompressor, I found that when I thought of these messages as the compressor's responsibility to generate, I constantly wanted to make them depend on the incoming message somehow. But that obviously isn't allowed, since the decompressor must be able to reconstruct them purely from the current hint stream history in order for the process to correspond to an implementable compression algorithm.
But if these messages must always be deterministically generable by the decompressor player at any point, then it seems clear that there is really no effective difference between messages notionally passed in one direction versus the other. Both players have full access to the current stream of such notional messages at all times, so the idea that one player or the other is responsible for generating some portion of them is really just a trick of the mind. To me it seems simpler and more clarifying to make all such messages the responsibility of the decompressor, which makes the allowed input domain of those messages obvious. But there are certainly plausible strategies for which it might make sense to think of some of these messages as being generated by the compressor player. Either way the expressive power of the game is the same.
Example Compression Games
Let's take a moment to put some standard compression algorithms into the form of a compression game.
A common framework for a compression algorithm is to utilize a probabilistic next symbol predictor which takes the recent symbols as input.
How does this look in terms of a compression game? One way to do it would be to make the hints emitted by the compressor be the rank of the correct next symbol, where the ranks are determined by the predicted conditional probabilities of the next symbol.
Although you could incorporate the calculation of those probabilities into the operation of the compressor I think it is much more natural to give that task to the decompressor. At each step the decompressor considers the current sequence of known symbols and makes a prediction for the conditional probability of the next symbol. The probability distribution over those symbols can then be passed back to the compressor as part of its notion message.
For convenience, you can have the decompressor also pass along its current estimate of the reconstructed message based on the hints provided so far. Each turn, before applying the next symbol predictor, the decompressor simply grabs the symbol whose probability rank was indicated by the most recent hint and appends it to its estimated message. That estimate can then be fed as context to predict probabilities for the next symbol, and both the probabilities and the estimated message are passed back to the compressor as the next notion. When the estimated message equals the true message the compressor emits a special hint to indicate the end of the game.
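Here is a rough sketch of this rank-based game, assuming we have some next-symbol predictor available as a function from a context string to a probability distribution. The class and function names, the toy predictor, and the dictionary-shaped notions are all just illustrative choices of mine, not part of any particular algorithm.

```python
END = -1  # special hint that tells the decompressor the message is complete

def ranked(probs):
    """Symbols ordered from most to least probable (ties broken alphabetically)."""
    return sorted(probs, key=lambda s: (-probs[s], s))

class RankDecompressor:
    """Runs the next-symbol predictor and reports its predictions and estimate as notions."""
    def __init__(self, predict_probs):
        self.predict_probs = predict_probs
        self.estimate = ""

    def next_notion(self, hints):
        # Decode the latest rank hint using the prediction for the current context,
        # then predict the next symbol and pass both pieces back to the compressor.
        probs = self.predict_probs(self.estimate)
        self.estimate += ranked(probs)[hints[-1]]
        return {"probs": self.predict_probs(self.estimate), "estimate": self.estimate}

class RankCompressor:
    """Emits the rank of the true next symbol under the decompressor's predictions."""
    def __init__(self, predict_probs):
        self.predict_probs = predict_probs  # only needed to bootstrap the very first turn

    def next_hint(self, message, notions):
        estimate = notions[-1]["estimate"] if notions else ""
        if estimate == message:
            return END
        # Use the probabilities from the decompressor's latest notion; on the first
        # turn there is no notion yet, so emulate the decompressor's first prediction.
        probs = notions[-1]["probs"] if notions else self.predict_probs("")
        return ranked(probs).index(message[len(estimate)])

def toy_predictor(context):
    """Hypothetical stand-in predictor over the alphabet {a, b, c}."""
    return {"a": 0.6, "b": 0.2, "c": 0.2} if context.endswith("a") else {"a": 0.4, "b": 0.3, "c": 0.3}

# Play one game.
message, hints, notions = "aababca", [], []
C, D = RankCompressor(toy_predictor), RankDecompressor(toy_predictor)
while True:
    hints.append(C.next_hint(message, notions))
    if hints[-1] == END:
        break
    notions.append(D.next_notion(hints))
print(hints)                    # [0, 0, 1, 0, 1, 2, 0, -1]
print(notions[-1]["estimate"])  # "aababca"
```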
That describes the structure of the compressor and decompressor, but a complete specification of the game also requires a way to score the hint stream. Even if the next symbol predictor is very accurate we won't achieve any compression simply by replacing the message with a sequence of hints indicating that the top ranked prediction was always correct. We need to somehow decide how many bits on average we should pay for each hint. Arithmetic coding is a method for turning a set of symbol probabilities (potentially changing over time) into a bit stream that uses those probabilities to encode a stream of actual symbols.
You can think of arithmetic coding as a method that lets you pay just $-\log_2(q)$ bits on average to encode an event with estimated probability $q$. If the true probability of those events really is $q$ then this achieves the Shannon limit (hooray!). If the true distribution $p \neq q$ then you will end up paying on average close to $E_{x \sim p}[-\log_2(q(x))]$ bits per symbol, which is the cross entropy between the true probability distribution and your estimated one.
To carry out scoring you could actually run the arithmetic coding procedure and take the length of the resulting bit stream as the encoding cost. But the achieved encoding length is just an upper bound on the cross entropy, and there is no reason you couldn't use the cross entropy $\sum_i -\log_2(q(x_i \mid x_{j<i}))$ directly. That gives a continuous and differentiable cost estimate and avoids the sharp edge cases where a particular sequence happens to need one bit more or one bit less to encode.
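As a sketch, if we record the predictive distribution that was in force at each step (for example by keeping the notions from the game above), the cross entropy score of a stream of rank hints can be computed directly, with no arithmetic coder involved:

```python
import math

def cross_entropy_score_bits(hints, probs_per_step):
    """Score a stream of rank hints by the cross entropy of the underlying predictions.

    probs_per_step[i] is the predictive distribution that was in force when the
    rank hint hints[i] was emitted (the END hint is excluded by the caller).
    """
    total = 0.0
    for rank, probs in zip(hints, probs_per_step):
        ordered = sorted(probs, key=lambda s: (-probs[s], s))  # same ranking as the game
        q = probs[ordered[rank]]       # probability the model gave the true symbol
        total += -math.log2(q)         # bits an arithmetic coder would charge on average
    return total
```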
Simpler, more directly algorithmic compression schemes can of course also easily be turned into compression games. For example, in run length encoding (RLE) the hints are tuples of a symbol and the number of times to copy that symbol into the output. Again in this scenario I think it is convenient to make the decompressor's notion message equal to the current estimate of the message, with the compressor stopping when the estimated message and the true message are equal.
This time we don't have access to any estimated symbol probabilities, so we can't directly use the cross entropy of a predictive model as an encoding cost. One natural encoding cost for run length encoding (or, for that matter, many possible compressors) is the length of the resulting hint sequence. If we want the encoding cost to be comparable across methods it is also important to multiply by an appropriate proportionality constant. For example, if our basic symbols are bytes and we are OK limiting ourselves to a maximum run length of 256, then we could encode each (symbol, run length) pair using 2 bytes and multiply the total number of hints by 16 to get a cost measured in bits.
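A minimal sketch of the RLE game and its fixed 16-bits-per-hint scoring might look like this (the names and the handling of the run-length cap are just illustrative choices):

```python
def rle_hints(message, max_run=256):
    """Compressor side: emit (symbol, run_length) hints left to right."""
    hints, i = [], 0
    while i < len(message):
        run = 1
        while i + run < len(message) and message[i + run] == message[i] and run < max_run:
            run += 1
        hints.append((message[i], run))
        i += run
    return hints

def rle_reconstruct(hints):
    """Decompressor side: its notion is simply the current reconstruction."""
    return "".join(symbol * run for symbol, run in hints)

def rle_encoding_cost_bits(hints):
    """Fixed encoder: one byte for the symbol, one byte for the run length."""
    return 16 * len(hints)

message = "aaaabbbcccd"
hints = rle_hints(message)               # [('a', 4), ('b', 3), ('c', 3), ('d', 1)]
assert rle_reconstruct(hints) == message
print(rle_encoding_cost_bits(hints))     # 64 bits
```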
Adaptive Empirical Scoring Methods
The selected scoring method is a proxy for the length of a compressed version of the hint stream, if we were to go to the trouble of actually implementing a dedicated encoder for it. The nature of that encoder is a very important consideration and significantly changes the structure of the game. In general you will get very different behaviors from encoders which are indifferent to the statistics of the actual hint stream than from more adaptive encoders. The strategy of using arithmetic coding on the forward prediction probabilities and the 2 byte RLE encoding strategy just mentioned are both fixed encoders. Such encoders may work well if we know the statistical properties the hint stream is likely to have and tailor the encoder to them. But it will often be more desirable to use an encoder that fits an empirical probability distribution to the hint stream in order to determine the encoding cost for each hint. For example, one could assign to each hint a probability equal to the recent frequency of that particular hint in the hint stream.
Such empirical entropy measures have the major advantage that they don't rely as heavily on the accuracy of any predictive probability distributions used by the compressor/decompressor. So even when our compressor/decompressor is implicitly using a certain probability distribution to predict the next symbol, it may actually be better to score hints using a smoothed empirical distribution learned from the hint stream itself.
Consider the case of the forward prediction based compressor from before, where the compressor outputs the rank order of the true symbols as hints. If the predictor gives accurate probabilities then encoding the ranks using an arithmetic coding scheme driven by those predictive probabilities will give optimal encoding results. But if there are inaccuracies in the probabilities yielded by the predictive model then we end up paying an overhead in the encoding cost. Suppose, for example, that we have a model that achieves very good predictive accuracy, ranking the true next symbol first 75% of the time, but which is also very overconfident, assigning a probability of 99.999% to all of its top predictions. The 75% of the time that the model turns out to be correct we pay almost no bits to encode that fact. But the 25% of the time that the model is wrong we will end up paying an absolute minimum of $-\log_2(1 - 0.99999) \approx 16.6$ bits, and possibly more depending on exactly how unlikely the model thought that symbol was. Adding those two costs together with the appropriate probability weightings gives an estimate of at minimum roughly 4.2 bits to encode each symbol. You can reduce the cost of such overconfidence by smoothing your probabilities towards uniform, but then you also start cutting into the maximum possible compression you can achieve.
Compare this to an empirical entropy scoring method where we simply let each hint cost the negative logarithm of the frequency with which that exact hint has appeared previously in the hint stream. In that case, since our model gives the top rank to the correct symbol 75% of the time, we should expect to pay roughly $-\log_2(0.75) \approx 0.4$ bits per correct top prediction and at least $-\log_2(0.25) = 2$ bits whenever the top prediction is wrong, yielding an overall average of roughly 0.8 bits per symbol. Effectively, for this ranking kind of hint stream we end up paying a number of bits determined by the empirical error rate of the predictive model on this particular message, which protects us from paying large encoding overheads when encoding messages that don't closely match the probabilities implied by the model. Such empirical entropy measures of the hint stream also have the advantage that they don't depend on the internals of the compressor/decompressor being used and so can be used in any situation.
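As a quick sanity check on those numbers, here is the arithmetic behind both estimates, under the assumed 75% accuracy and 99.999% top-prediction confidence:

```python
import math

p_correct = 0.75   # how often the model's top-ranked prediction is right
q_top = 0.99999    # probability the overconfident model assigns to its top pick

# Arithmetic coding driven by the model's own (overconfident) probabilities:
# correct symbols are nearly free, but each wrong symbol costs at least -log2(1 - q_top) bits.
model_driven = p_correct * -math.log2(q_top) + (1 - p_correct) * -math.log2(1 - q_top)

# Empirical scoring of the rank hints themselves: rank 0 appears 75% of the time,
# all other ranks together 25% (so 2 bits is a lower bound for the wrong ones).
empirical = p_correct * -math.log2(p_correct) + (1 - p_correct) * -math.log2(1 - p_correct)

print(f"model-driven: {model_driven:.1f} bits/symbol")  # ~4.2
print(f"empirical:    {empirical:.1f} bits/symbol")     # ~0.8
```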
When using such empirical measures of entropy it is important to either be careful to use only measures which rely on the past of the hint stream to estimate future probabilities, or to add some cost for encoding the parameters of the assumed underlying probability distribution as a necessary overhead separate from the cost of encoding each individual symbol. For example, consider the seemingly innocuous choice of letting the empirical hint probability model simply be the observed frequencies of hints over the whole hint stream. If you don't also account for the cost of encoding the possible hints and their probabilities, then a single hint consisting of the entire message to be transmitted would have a predicted probability of 1 and therefore an encoding cost of 0. That is fine as far as it goes, but for it to make sense one must also pay some number of bits to encode the table of hints, which in that particular case would of course make up all of the encoding cost for the hint stream.
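One simple way to satisfy the causality requirement is to charge each hint according to smoothed counts of the hints that came before it, so that a decoder could maintain exactly the same statistics. A sketch (the add-one smoothing and the `num_possible_hints` parameter are just one possible choice):

```python
import math
from collections import Counter

def causal_empirical_cost_bits(hints, num_possible_hints):
    """Charge each hint -log2 of its Laplace-smoothed frequency among *earlier* hints only.

    Because the probability used at step i depends only on hints[:i], a decoder
    could maintain exactly the same counts, so no probability table ever needs
    to be transmitted alongside the hint stream.
    """
    counts, total = Counter(), 0.0
    for i, hint in enumerate(hints):
        q = (counts[hint] + 1) / (i + num_possible_hints)  # add-one smoothing
        total += -math.log2(q)
        counts[hint] += 1
    return total

print(causal_empirical_cost_bits([0, 0, 1, 0, 1, 2, 0], num_possible_hints=3))  # ≈ 11.9 bits
```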
There is, of course, a standard set of systems that learn empirical distributions of data streams and then use them to come up with efficient ways of encoding those data streams: compression algorithms. So one may also opt for a scoring method where the hint stream is first serialized in any convenient manner, for example as a json object, and then fed to some standard compression program, for example gzip. The encoding cost is then the size of the resulting artifact. This may seem like cheating since we have used a compression algorithm to score a compression algorithm, but really it is just a nice way to keep from having to fuss over all the details of getting a robust entropy estimator just right for each and every possible hint language we might want to play with. If we ever find a compression game strategy that scores well under a json->gzip scoring method, we can be confident that it is possible to build a high performing compressor from that strategy by simply using gzip as a post-processing step.
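A sketch of that scoring method, using only the standard `json` and `gzip` modules:

```python
import gzip
import json

def gzip_score_bits(hints):
    """Serialize the hint stream as JSON, gzip it, and report the result's size in bits."""
    serialized = json.dumps(hints).encode("utf-8")
    return 8 * len(gzip.compress(serialized))

print(gzip_score_bits([0, 0, 1, 0, 1, 2, 0]))  # dominated by gzip's fixed header for such a tiny stream
```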
One may be tempted to think that using a more powerful compressor to carry out the scoring would also tend to yield superior results, but one must consider the fact that complex compressors also imply complex structures that achieve maximum compressibility. With a compression-based scorer the task of the compressor player is not to make the hint stream as short as possible, but rather to give the hint stream a structure that is as compressible as possible under that scorer. For this reason it may actually be desirable to err on the side of making the scorer utilize only very simple kinds of statistical structure, as this may simplify the problem of learning to make hint streams which mesh well with the scoring method.
Why Think of Compression as a Game
So we can turn most (if not all) existing compression algorithms into the form of a turn-based compression game. But why would we go to the trouble of thinking about the game as though it were multi-turn when the space of possible single-turn and multi-turn compression functions is the same either way?
I think there are a couple of different advantages. Mentally partitioning a compression algorithm into different responsibilities (hint generation, notion generation, and hint stream encoding) is very helpful for me personally. While it isn't too hard to come up with interesting compression algorithm ideas when thinking in terms of a monolithic compressor -> decompressor execution structure, I find it much easier to come up with a wealth of ideas when working with the richer structure that the compression game framing imparts to the problem.
I find that my thoughts go to much more interesting places when I start by asking "what could a language of hints or a language of notions possibly look like?" rather than "what are flexible and efficiently estimable probability distributions for my data?" That second mindset of explicitly modeling probability distributions is, I think, the default mindset for thinking about compression within an ML framework. But although there may always be some sort of implicit probability distribution underlying the rules of any compression game we can come up with, that doesn't mean we need to understand that distribution in order to make compressors that work well. In fact, if I try to think of rules for a compression game that might be fun for two humans to play, they correspond to completely analytically intractable probability distributions. Games like Chess and Go are also analytically intractable (that is part of what makes them fun to play), but that hasn't stopped us from learning to make AI that can play them.
Thinking about how to model the probability distribution of our data is very interesting and very important, but direct, explicit probabilistic modeling should not be the only learning technique in our repertoire when thinking about learning to compress. Compression games don't require thinking about things from a probabilistic perspective at all, though they also easily accommodate probabilistic thinking if desired. So compression games seem to me to be a more general framework for thinking about learning to compress than one built solely on learning predictive probabilities.
Similarly, in both the purely algorithmic setting and the probability distribution centered ML setting, it seems that either the compressor or the decompressor tends to be "dumb", by which I mean that it acts in a deterministic manner which is simple to understand and reason about. With the behavior of one of the two players in the compression game fixed, you can then reason from first principles about how the other player should play in order to achieve optimality.
In the algorithmic setting I would say the compressor tends to be where all the complexity lies, with the decompressor being a simple program that blindly follows the compressor's instructions. In the typical ML scenario, on the other hand, in my way of thinking of things the complexity is almost all in the decompressor. As I have already discussed, carrying out probabilistic predictions about the data content seems most naturally to be a task for the decompressor, with the results sent as notions to the compressor player, which makes the compressor's task simply to mindlessly compare the predictions to the actual message and indicate how well they line up.
But intuitively this seems like a huge waste of possible directions for optimizing performance. We shouldn't just fix the strategy of the compressor or decompressor and then only optimize the other one. If we simultaneously optimize both the compressor and decompressor strategies to work well with each other we may be able to achieve much better performance than if we only ever think about optimizing one or the other.