Context Modelling

o6mix56b1	Context Mixer with single o0-o6 submodels, using the new hash for nibble statistics (hash.inc).
o6mix58	o6mix3e2 with added unary order3 match coding, also features the new coder structure and SSE usage example.

Ok, there's a theory to explain the failure of the new hash. Specifically, its just that 0.5M of nodes per order is not enough... and that uses 2x more memory than previous hash even like that. Of course, I started with a single table for all data types, then separated the mixers (for compression improvement) and then separated the orders (same reason) - probably overwriting the whole nodes of different types is not healthy. And the previous scheme hashed the single counters, so that memory usage was less redundant.

However, its kinda different with the second failure - an external LZP filter was tested before and showed some improvement, so at least better speed was expected... but in the end the optimizer fixed the match model in such a way that its rarely used. But anyway, this experiment probably shows that bitwise processing of determinated contexts (aka binary contexts) is not that imprecise, and a real match model with matches of length 32+ is needed.

Still, there're some improvements in this release - the counter class was rewritten using templates (as planned), and the coder structure was changed to allow more freedom - probability estimation and model update was integrated with rangecoding before, and now things like using different models for different parts of data are possible without some tricky flags and such. Also the speed might be improved by changing the order from bit0.estimation, bit0.update,bit1.estimation,(...) to bit0.estimation,bit1.estimation, bit0.update,bit2.estimation,bit1.update,(...) - that would give CPU much more time to read the counters and maybe the nibble alignment won't be necessary anymore.

Also there's a new interpolated SSE class (with fixed update instead of combinatoric - the compression became slightly worse in the match model test - from 915899 to 916474), and an example of its usage too.

Codec	Datasize	Ctime	Dtime	Metric	Notes
Intel Core2 Q9450 3.68Ghz=460x8, DDR2 5-5-5-18 @ 920=460*2
v3_o6mix3d1	138634222	331.436	327.655	25269.6	bugfixed ver; http://ctxmodel.net/files/MIX/mix_v3.rar
v3_o6mix3e	139141975	259.485	259.533	24595.7	16bit counters
v3_o6mix3e2	138243249	258.875	261.031	24469.7	2x expanded hash tables
v3_o6mix3e21	138426074	243.579	244.625	24318.9	strict 64byte alignment, like in 3d
v3_o6mix3e22	138426074	254.076	254.140	24424.6	tables aligned (by 4096)
v3_o6mix3e23	139228039	256.110	257.186	24582.4	hash indexes aligned (by 64 bytes)

o6mix3e21_el32	132825613	248.781	251.953	23522.3	using Shkarin's LZP preprocessor with -l32 (and E8 before LZP)

v4_o6mix56b1	142027465	493.826	492.156	27607.2	nibble hash with collision detection
v4_o6mix58	139235023	261.780	264.672	24664.0	order2-3 match model test

2008-07-14 13:52:03 Shelwien >
I know how to improve compression, but improving the speed
at the same time is hard... even keeping it like it is now.

Btw, toffer, I'd like to test a CMM version without the match
model.

2008-07-14 14:48:26 osmanturan >
He wrote about a test result with and without match model for CMM. Look at here:

http://encode.ru/forum/showpost.php?p=1433&postcount=21

I believe that PAQish way match model does not give a huge compression gain.

2008-07-14 14:52:04 toffer >
Here it is:

http://freenet-homepage.de/toffer_86/cmm4_02a_080712_nomm.7z

The speed is not representative, since i only commented out the match length identification, some updates, etc... are left intact.

2008-07-14 14:58:14 Shelwien >
Nah, I want to test it on my dataset.

As to "PAQish way" - which one? Afaik PAQ8 just mixes
it among other predictions (like that compression could
be improved for me as well) and PAQ9a maybe does it
like I tried here - exclusively, without main model update
on match.

But still I think that it only can be implemented like it is
here, or else there won't be any speed gain.
I just have to make a real match model, for long matches,
instead of some deterministic o3 fix which I tried.

Though well, I still hoped that compression would be improved,
due to SSE and such.

2008-07-14 15:03:38 Shelwien >
Thanks, testing. I just want to see whether adding a proper
match model would be enough to beat CMM ;)

2008-07-14 15:06:52 toffer >
You should extend the hashing checksums to 16 bits. I see no reason in using signed ints here.

BTW your collision handling is broken - you use the low order bits of "ccc" to lookup an entry AND to detect collisions. It's likely that these values are identical (even if you mask out some bits). You should use the high bits of your hash index (which you don't use to address an entry) to check for collisions.

2008-07-14 15:09:18 osmanturan >
Matt said that PAQ9a is kind of failure by using LZP on prediction without main model update. I expect such that behaviour, because by using such that technique, we just break the context itself. So, compression hurts. This remainds me ilia's experience with LZP & BWT. They are similar for me.

The second choice is very slow - at least for me. Because, we are running a slow match model besides several contextual models (searching a for match slow enough, I think).

toffer has an idea about fast match model. Maybe he want to share us with detailed explanation :)

2008-07-14 15:09:25 toffer >
I hardly believe that you can beat CMM4 with just this, for several reasons:
* SSE
* symbol ranking techniques
* per model context merging
* stronger hashes
* dirty tricks like the "x86 model" :)

2008-07-14 15:18:49 Shelwien >
the bit15 flag allows to differentiate the unassigned nodes
from already used, it would be slower without that, and
I don't think it matters whether checksums are 15 or 16 bits.

And as to hash and checksum correlations, I don't think its
that much, though I easily can try the high bits or whatever.

Anyway, imho the main reason for bad ratio is small size of
the hash tables. The results are actually quite similar to
when I tried strictly aligning the nibble statistics with
the old hash.

2008-07-14 15:22:31 toffer >
> I don't think it matter whether checksums are 15 or 16 bits.

I agree here. I didn't look at all of the code only at the collision handling, since it seemed odd to me, that collision handling didn't gain compression.

> And as to hash and checksum correlations, I don't think its
> that much, though I easily can try the high bits or whatever.

If you do it like that, its almost the same as using no checksums at all.

2008-07-14 15:35:18 toffer >
Sorry, didn't look close enough:

short v, chk = (ccc & 0x7FFF); // the 15 low bits
uint k=num_rehash, j, i = ccc & mask1; // i though you use this as an index
ddd &= mask2;
i = ( (i+1) * coef ) >> (32-hash_size_log);
i ^= ddd; // that's the hashed index

Ok, that was my fault. What about something like this:

chk = i*(some odd constant)>>17;

For order 2 you could still use ccc&0x7fff, but for higher orders the checksum will depend on more than just the lowest 15 bits (which is in fact almost order 2 context).

2008-07-14 16:13:21 Shelwien >
4) 470949 - chk=(i^(i>>16)^ddd)&0x7FFF;
where i is actual o3-o4 context and ddd is o5-o6 context
if present.
Btw, order2 is still direct there, not hashed.

2008-07-14 16:25:42 toffer >
Looks like, i'm out of luck here - nothing seems to work :(

What about your match model benchmarks?

And are you going to use model switching?

2008-07-14 16:48:05 Shelwien >
CMM4_nomm benchmark added to http://ctxmodel.net/files/MIX/mix_v4.htm
So guess I have some work to do beside the match model ;)

As to model switching, I thought about storing the probabilities
and doing the mixing blockwise, so that I'd have all
the mixing intermediates and would be able to choose any
of them for the actual coding.

But there's no way to dynamically switch off the submodels
for speed improvement without some compression loss - as
estimation of model's performance requires processing the
data with that model. And encoding would be even slower
if we'd have to "undo" the changes in statistics,
but decoding would be faster.

2008-07-14 17:38:51 toffer >
I've experimented with entropy estimation per nibble and byte to select the models. It was relatively slow compared to my current solution. You lose some compression, that's true, but it's not that excessive.

I've found this heuristic to work considerably well: Only include the 4 "topmost" models (ordering: 643210) in the prediction process, e.g. if order 6 is present (which implies orders 4-0 have already been seen) only predict using orders 6432. The match model contributes without this restriction.

2008-07-14 17:40:34 toffer >
BTW - one of the things on my TODO list is to mix and encode the predictions blockwise, like you want to do. This might speed up the encoding process, too (cache associativity and linear prediction stack access).