Data Structures for Context Model Statistics
I. Common applications
1. Collect all the non-unique substrings in the data, and count them. A more practical version has to renormalize the counters to compensate for data nonstationarity.
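As an illustration, one common renormalization is to halve all counters in a context once their total reaches a limit, so older statistics decay relative to recent ones (a minimal Python sketch; the names and the limit value are illustrative, not from the text):

```python
# Sketch of counter renormalization for nonstationary data:
# when the total count in a context reaches LIMIT, halve all
# counts, so old evidence fades relative to recent symbols.

LIMIT = 256                          # illustrative rescale threshold

def update(counts, symbol):
    counts[symbol] = counts.get(symbol, 0) + 1
    if sum(counts.values()) >= LIMIT:
        for s in list(counts):
            counts[s] >>= 1          # halve; older statistics decay
            if counts[s] == 0:
                del counts[s]        # drop symbols that vanished

counts = {}
for s in b"abracadabra" * 100:
    update(counts, s)
```

After processing, the counter total stays below the limit while the relative frequencies of recent symbols are approximately preserved.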
It is generally impossible to keep all the statistics in memory, so there should be a way to discard older statistics, e.g. by keeping only the contexts encountered in a "window" of some fixed size.
Besides plain substrings, there can be many other kinds of contexts, e.g. sparse ones, or contexts based on preprocessed data. So basically a good storage class should provide a way to sequentially process any number of streams and collect the statistics in specified windows. It should also provide the distribution data for the whole alphabet in a context on a single request. And this implies support for non-binary alphabets, including ones larger than a byte, if possible (e.g. unicode is widely available these days).
A simple example of a data type for which the bitwise approach is inconvenient is a lexicographically sorted wordlist - the word length can be modelled more precisely on its own, but using it to fix the probability of the word terminator would be complicated with a bitwise approach (the same applies to any text, actually).
2. For analysis and various data transformations it can be useful to be able to enumerate the contexts (preferably in lexicographic order) and to collect the histories for contexts.
II. Known implementations
1. Trees
Maybe the best data structure actually, but considerably complex to deal with. This especially concerns the removal of contexts falling out of the window when the statistics are not just simple occurrence counts. Certainly, a tree with all the necessary properties is possible, but its implementation would be very complex, with multiple node layouts for compactness.
Also, linear search for a symbol in the nodes of all context orders is slow, while a pointer-based binary tree is space-inefficient.
Of course it's still possible to come up with something, but it would be complex, as already said. So actual implementations (and these are rare these days; only PPMd is widely known) do use linear search and some ad hoc information removal techniques (e.g. nodes with a low number of occurrences can be removed when memory runs low, without considering whether the corresponding contexts appeared recently or not).
2. Direct indexing
The simplest option, but restricted to order-2/order-3 contexts at most, and even for these it doesn't really provide the best speed (due to cache misses and collisions).
Still, depending on the data type, it might be the best choice - e.g. for near-random files, or samples of analog origin, which can easily be quantized and transformed towards a uniform distribution.
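For instance, an order-2 direct-indexing scheme can be sketched as follows - each two-byte context maps straight to a table row, with no search or hashing involved (a Python sketch; names and sizes are illustrative):

```python
# Direct indexing sketch: an order-2 byte context indexes a flat
# table directly; no search, no collisions at the indexing level.

ORDER = 2
TABLE = [None] * (256 ** ORDER)      # one slot per possible context

def ctx_index(c1, c2):               # two preceding bytes -> row index
    return c1 * 256 + c2

def process(data):
    for i in range(ORDER, len(data)):
        idx = ctx_index(data[i - 2], data[i - 1])
        if TABLE[idx] is None:
            TABLE[idx] = {}
        d = TABLE[idx]
        d[data[i]] = d.get(data[i], 0) + 1   # count symbol in context

process(b"abcabcabd")
```

The price is the 256^order table size, which is exactly why this is limited to low orders.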
3. Hashed indexing
Fashionable these days, but it can't provide precise statistics for large amounts of data. Also, if used as an index into additional storage for alphabet distributions, it would actually be slower than a prefix tree (not to mention a suffix tree), so all the common implementations are bitwise.
As a note, the ever-increasing gap between RAM speed and CPU speed does not favour hashes at all, so special attention to cache friendliness is necessary.
Additionally, there's no reason to restrict oneself to static hashes. For example, despite the extra pointer lookup, a hash with pointers into dynamic storage for the actual data nodes might be faster than a plain one (due to better data locality), and even more compact (as dynamic allocation allows data nodes of variable size).
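A minimal sketch of such a pointer-based hash (all names are assumptions; FNV-1a hashing and open addressing are chosen just for illustration) - slots hold indices into dynamically allocated nodes rather than fixed-size in-table records:

```python
# Hash-with-pointers sketch: the hash table stores only node indices;
# the nodes themselves live in dynamic storage and can vary in size.

HASH_BITS = 16
SLOTS = [None] * (1 << HASH_BITS)      # hash slots hold node indices
NODES = []                             # dynamically allocated data nodes

def hash_ctx(ctx):
    h = 2166136261                     # 32-bit FNV-1a over context bytes
    for b in ctx:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h & ((1 << HASH_BITS) - 1)

def lookup(ctx, create=False):
    h = hash_ctx(ctx)
    while SLOTS[h] is not None:        # open addressing, linear probing
        node = NODES[SLOTS[h]]
        if node["ctx"] == ctx:
            return node
        h = (h + 1) & ((1 << HASH_BITS) - 1)
    if not create:
        return None
    NODES.append({"ctx": ctx, "counts": {}})   # node size can vary
    SLOTS[h] = len(NODES) - 1
    return NODES[-1]

node = lookup(b"ab", create=True)
node["counts"][ord("c")] = 1
```

In a C implementation the NODES pool would be a custom allocator, and the locality benefit comes from packing the hot per-context data into one node.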
Actually, the hashing idea can be combined with many structures that are commonly pointer-based, since hashing and "backward pointers" can be used instead. An interesting example is a substring tree where data from the collision location is used to add "randomness" to the rehashing procedure (we compare the new string with the substring in the current node, and use hashing to jump to another location based on the first unmatched symbol).
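The rehash-on-mismatch idea might be sketched like this (a toy Python version; the hash function, the jump formula and the sizes are all assumptions). Since slots are never cleared, a later lookup retraces the same collision path deterministically, so no child pointers are needed:

```python
# "Rehash on mismatch" sketch: slots store whole substrings; on a
# collision, the first unmatched symbol of the new string is mixed
# into the hash to jump to another slot.

SIZE = 1 << 12
TAB = [None] * SIZE                    # each slot: [substring, count]

def fnv(s):
    h = 2166136261                     # 32-bit FNV-1a
    for b in s:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def first_mismatch(a, b):
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            return i
    return min(len(a), len(b))         # one string is a prefix of the other

def walk(s):
    # follow the deterministic probe path for s; return the final slot
    h = fnv(s) % SIZE
    for _ in range(SIZE):              # bound the probes, just in case
        if TAB[h] is None or TAB[h][0] == s:
            return h
        i = first_mismatch(s, TAB[h][0])
        sym = s[i] if i < len(s) else 0
        h = (h * 31 + sym + 1) % SIZE  # jump, mixing in the unmatched symbol
    return h

def insert(s):
    h = walk(s)
    if TAB[h] is None:
        TAB[h] = [s, 1]
    elif TAB[h][0] == s:
        TAB[h][1] += 1

def count(s):
    h = walk(s)
    return TAB[h][1] if TAB[h] is not None and TAB[h][0] == s else 0

for w in (b"abc", b"abd", b"abc", b"xyz"):
    insert(w)
```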
Yet while much more interesting, dynamic hashed structures share the trees' flaw - potentially unbounded growth - and with hashes it's much harder to remove something "unnecessary", so maybe that's why they're not used in compressors.
4. Entropy-coded indexing
If the data is redundant, then the contexts in it are redundant too, and thus compressible. And compression is an approximation of an enumeration of possibilities. That is, if we have a set of N substrings with equal probabilities, then the best codes for them are the numbers from 0 to N-1.
So, given a perfect model of the context data, it would be possible to design a hash function such that hashing practically turns into direct access.
But then, a hash function can't be changed dynamically, so the hash function's model has to be encoded as well, and redundancy ensues. So in practice only low-order statistics could be used, and the result would be a hash with perhaps a better distribution, but a much slower hash function than usual.
Still, this approach might be useful for the cases when encoding is not directly involved - like BWT forward transform.
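For example, when the set of contexts is fixed in advance (as in an offline transform), simply sorting it yields a minimal perfect index - enumeration replaces hashing entirely (a sketch with illustrative names; the sorted list plays the role of the "perfect model"):

```python
# Enumeration-as-indexing sketch: with a known context set, the rank
# of a context in sorted order is a number in 0..N-1, so the "hash"
# is collision-free and the counter table has no empty slots.

import bisect

data = b"abracadabra"
ORDER = 2
contexts = sorted({data[i:i+ORDER] for i in range(len(data) - ORDER + 1)})

def ctx_code(ctx):
    i = bisect.bisect_left(contexts, ctx)
    assert contexts[i] == ctx         # must be a known context
    return i                          # its number in 0..N-1

counts = [0] * len(contexts)          # direct access by enumerated code
for i in range(len(data) - ORDER + 1):
    counts[ctx_code(data[i:i+ORDER])] += 1
```

The binary search here is the "much slower hash function" the text mentions; the payoff is a perfectly dense table.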
5. Context bitmasks
There's a more straightforward approach too: if we're able to enumerate the possibilities for a part of a context, then the number of variants of the whole context is limited as well.
In the most compact case, we can extend the context by one bit and fill a bitmask of N*2 bits, where N is the number of variants of the already indexed context part, and bit[f(X)*2+Y] is set if context XY exists.
Well, in practice a nibble step turned out to be more practical than a single bit. And it really allows for efficient context enumeration. But to increase the context order it's necessary to process the whole data, so the bitmasks have to be encoded before the actual data.
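The bit-level variant described above can be sketched like this (a toy Python version; f(X) is realized as the rank of X among the surviving contexts of the previous level, which keeps the enumeration in lexicographic order):

```python
# Context bitmask sketch: one mask per context length; bit f(X)*2+Y
# of the next mask is set iff the context X extended by bit Y occurs
# in the data. Only existing contexts survive to the next level.

def build_masks(bits, max_len):
    masks = []
    ctxs = [()]                           # level 0: one empty context
    for k in range(max_len):
        # all bit-substrings of length k+1 that occur in the data
        present = {tuple(bits[i:i+k+1]) for i in range(len(bits) - k)}
        mask = 0
        nxt = []
        for fX, X in enumerate(ctxs):     # f(X) = rank of X in the list
            for Y in (0, 1):
                if X + (Y,) in present:
                    mask |= 1 << (fX * 2 + Y)
                    nxt.append(X + (Y,))  # survives into the next level
        masks.append(mask)
        ctxs = nxt                        # stays lexicographically sorted
    return masks

masks = build_masks([0, 1, 1, 0, 1, 1], 2)
```

For the sample bits, level 1 sees both 0 and 1 (mask 0b11), and level 2 sees the pairs 01, 10, 11 but not 00 (mask 0b1110), showing how absent contexts are pruned from the enumeration.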
III. Composite approach
If one thinks about it, a hash is nice and simple while it's relatively small (say, 4M, to fit into L2 cache) and almost empty.
So what if we make a composite scheme, with a hash for "near" contexts and bitmask enumeration for "far" ones? That is, the data can be processed blockwise, with bitmask updates only after a block is fully processed. The contexts in the current block would be kept in the hash; then the block would be added to the window, which would be indexed anew into context bitmasks, and the hash cleared to process a new block.
Both the hash and bitmask implementations are quite simple, and recent contexts would be "cached" in the hash, so some reasonably fast bytewise data processing might be possible. This approach also allows for a nice trick, similar to segmentation - the bitmask statistics can be dropped if a high enough percentage of the block's contexts are new (which becomes visible thanks to the two-stage statistics), and even keeping multiple bitmask structures and using the "better" one might be possible.
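The blockwise flow might be skeletonized as follows (Python, with a sorted-context rank index standing in for the bitmask enumeration; all names are illustrative):

```python
# Composite-scheme skeleton: a hash collects the current block's
# contexts; after the block joins the window, the "far" index is
# rebuilt from scratch and the hash is cleared.

ORDER = 2

def collect(block, table):
    for i in range(ORDER, len(block)):
        ctx = bytes(block[i-ORDER:i])
        d = table.setdefault(ctx, {})        # "near" contexts in the hash
        d[block[i]] = d.get(block[i], 0) + 1

def rebuild_index(window):
    # stand-in for bitmask enumeration: sorted unique contexts -> ranks
    ctxs = sorted({bytes(window[i:i+ORDER])
                   for i in range(len(window) - ORDER + 1)})
    return {c: n for n, c in enumerate(ctxs)}

window = bytearray()
for block in (b"abcabc", b"abcxyz"):
    hash_stats = {}                          # cleared for each block
    collect(block, hash_stats)
    window += block                          # block joins the window
    index = rebuild_index(window)            # "far" contexts re-enumerated
```

A real implementation would also carry the last ORDER bytes of the previous block into the context of the next one; that is omitted here for brevity.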
There's an ambiguous part though - the recalculation of statistics. Rebuilding the bitmasks by itself requires multiple passes over the window data, but the necessary operations are very simple and cache-friendly, so we can hope that it would be fast enough with the right block size (which has yet to be chosen after some benchmarks).
But recalculating the statistics is much more time-consuming. Still, by keeping both the old and the newly rebuilt bitmasks, it would be possible to remap the statistics to the new context indexes as well.
However, this opens up a nice possibility in the form of optimal parsing of the data. Specifically, using the old statistics we can parse the data in the window, determine which context order fits each symbol best, and then compute the predictions according to that partitioning.
Actually, it may be worth trying the same two-stage approach with other data structures, such as two hashes, or a tree and a hash, as it has some benefits unrelated to the specific choice of data structure. But carefully designing a tree layout for a small block may be a waste of time, while a hash would be much simpler. And only the context enumeration approach allows remapping the statistics without actual recalculation, which obviously improves the speed.
However, the requirement for blockwise rebuilding is a property of the context bitmask scheme only, as it cannot be updated incrementally. So even with two hashes it's possible to implement some kind of optimal parsing (within the just-processed block) and content analysis.
The last link contains the source and win32 binaries for a coder based on context bitmasks (it's not exactly a compressor, as only the bitmask coding part contains some basic modelling). Its implementation is also slightly more complicated than described in II.5, as there are two bitmasks for each order - for unique and non-unique contexts - and only non-unique contexts are expanded to the next order, which somewhat reduces memory usage but really complicates the encoding parts.
2008-10-27 20:26:54 ICQ+discussion >
i didn't take a detailed look
but that might help
well, my case is slightly different
as expanding the contexts down to all unique is too much of a performance hit
though it's unavoidable for BWT
but you're right and its related to what i do
that is clear to me
but i never looked into that in detail
specifically, if i'd expand my context order until all contexts become unique
then my ctx would become a complete compressor without data coding part ;)
isn't that bwt?
not really ;)
but any context collection method can be used for BWT implementation
and guessing from the article, m03 is not BWT actually
because it doesn't process the BWT output sequentially
that's not important
you just need to restore the bwt output
did you know that method?
more or less, didn't see that article though
wonder if i should try turning my ctx into something like m03
it's certainly possible
i think, actually, i just have to increase the max order to 150 or whatever it was for book1
and maybe remove the non-unique bitmap coding for maxorder
guess i'd try... never thought about it, might be interesting...
that might be interesting, too
i tried and it compresses book1 into 270k
guess that bitmask model is really pitiful
that's pretty good for just 32mb blocks
2013-07-31 11:34:40 >