Context Modelling


Mix v5:

sh_samples_1 benchmark results

o6mix59: context mixer with single o0-o6 submodels, basically v3/o6mix3e2 with the modified coder from v4/o6mix58.
o6mix59a: o6mix59 with secondary estimation (SSE) for the o6 context (before mixing); the bits of the current byte and the 8 MSBs of the o5 estimation are used as the SSE context (which is really dumb).

Here, o6mix59 went as expected - the only compression-related change there is the counter simplification (branch-avoiding logic was replaced with separate functions for updates on 0 and 1, using templates) - it's also the reason for the slight speed improvement compared to o6mix3e2.
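To illustrate the kind of counter change meant here, below is a minimal sketch and not the actual o6mix59 code. The `Counter` struct, the `kRate` shift and the function names are all hypothetical; the only thing taken from the post is the idea of replacing a branch-avoiding update with two template instantiations, one per bit value.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kRate = 4;  // assumed adaptation shift, not the tuned value

// Hypothetical 16-bit counter: p is P(bit==1) scaled to [0, 65535].
struct Counter {
    uint16_t p = 1 << 15;

    // Branch-avoiding form: both deltas are computed, a mask picks one.
    void update_branchless(int bit) {
        assert(bit == 0 || bit == 1);
        int up   = (65535 - p) >> kRate;  // step towards 1
        int down = p >> kRate;            // step towards 0
        p = static_cast<uint16_t>(p + (up & -bit) - (down & (bit - 1)));
    }

    // Template form: Bit is a compile-time constant, so each instantiation
    // keeps only one of the two formulas.
    template <int Bit>
    void update() {
        if (Bit) p = static_cast<uint16_t>(p + ((65535 - p) >> kRate));
        else     p = static_cast<uint16_t>(p - (p >> kRate));
    }
};

// The caller branches once per coded bit and then runs straight-line updates.
inline void update_all(Counter* c, int n, int bit) {
    if (bit) for (int i = 0; i < n; ++i) c[i].update<1>();
    else     for (int i = 0; i < n; ++i) c[i].update<0>();
}
```

Both forms produce the same counter values in this sketch; any speed difference would come only from how the compiler schedules the code.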

It also has somewhat worse compression in the main benchmark, but that's due to tuning differences... instead, this version shows better results in the SFC benchmark and (what matters the most ;) on its book1+wcc386 tuning target (465928 vs 466568).

As to the o6mix59a SSE experiment... well, its results are surprisingly good (considering the simple context and the SSE update being made static), but the slowdown is considerable, especially if we think about improving compression further with more such mappings.
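For readers unfamiliar with the term, an interpolated SSE stage can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the o6mix59a implementation: the class name, the 33-node table, the 12-bit interface and the fixed update rate `kRate` are all hypothetical, and `ctx` is just an opaque index here (in o6mix59a it is built from the bits of the current byte and the 8 MSBs of the o5 estimation).

```cpp
#include <cassert>
#include <vector>

// Hypothetical interpolated SSE stage. Probabilities on the interface are
// 12-bit (0..4095); table nodes are kept in a 16-bit scale (0..65536).
class InterpolatedSSE {
public:
    static const int kNodes = 33;  // assumed: 32 intervals of width 128
    static const int kRate  = 5;   // assumed static update rate

    explicit InterpolatedSSE(int contexts)
        : t_(contexts * kNodes), idx_(0), w_(0) {
        for (int c = 0; c < contexts; ++c)
            for (int j = 0; j < kNodes; ++j)
                t_[c * kNodes + j] = j * 2048;  // start as the identity mapping
    }

    // Refine p within context ctx by interpolating between two nodes.
    int refine(int p, int ctx) {
        assert(p >= 0 && p < 4096);
        idx_ = ctx * kNodes + (p >> 7);
        w_   = p & 127;
        return (t_[idx_] * (128 - w_) + t_[idx_ + 1] * w_) >> 11;
    }

    // Move both nodes used by the last refine() towards the coded bit.
    void update(int bit) {
        adjust(t_[idx_], bit);
        adjust(t_[idx_ + 1], bit);
    }

private:
    static void adjust(int& v, int bit) {
        if (bit) v += (65536 - v) >> kRate;
        else     v -= v >> kRate;
    }
    std::vector<int> t_;
    int idx_, w_;
};
```

Each such stage adds a table lookup, two multiplications and two node updates per coded bit, which gives an idea of why stacking more of these mappings costs speed.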

So the delayed counters are still the main development direction. There's also an idea to start with a single submodel again, but with a context mask allowing it to be anything between o0 and o6, and let the optimizer find the best primary submodel, then add more one by one and repeat the process (it was done the same way with o2-o6 in v0 btw, but with more restrictive context masks).
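The context-mask idea can be pictured with a small sketch. Everything here is an assumption for illustration: the 48-bit byte history, the helper names and the multiplicative hash are hypothetical and say nothing about how the actual masks in the mix sources are laid out.

```cpp
#include <cassert>
#include <cstdint>

// Byte history in one word, newest byte in the low 8 bits.
// 48 bits are enough for anything up to order-6.
inline uint64_t push_byte(uint64_t history, uint8_t c) {
    return ((history << 8) | c) & 0xFFFFFFFFFFFFull;
}

// The mask decides which history bits the submodel sees:
// 0 gives order-0, 0xFFFF gives order-2, 0xFFFFFFFFFFFF gives order-6,
// and masks with holes give contexts in between.
inline uint64_t masked_context(uint64_t history, uint64_t mask) {
    return history & mask;
}

// Hypothetical multiplicative hash into a table with 2^bits slots.
inline uint32_t slot(uint64_t ctx, int bits) {
    assert(bits >= 1 && bits <= 32);
    return static_cast<uint32_t>((ctx * 0x9E3779B97F4A7C15ull) >> (64 - bits));
}
```

With the mask as an ordinary parameter, an optimizer can flip its bits and measure the compressed size, which is what lets it pick the primary submodel instead of having the order fixed by hand.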

Intel Core2 Q9450 3.68GHz=460x8, DDR2 5-5-5-18 @ 920=460*2
Codec           Datasize   Ctime    Dtime    Metric   Notes
balz113-e       137842934  263.688   58.249  22384.1  http://encode.ru/balz/balz113.zip
balz113-ex      135284357  436.734   58.015  22155.1  -ex
cmm4-m-70       135276307  243.454  229.173  23672.1  http://freenet-homepage.de/toffer_86/cmm4_02a_080712_nomm.7z
cmm4_2a-70      130327852  236.968  244.390  23044.6  http://freenet-homepage.de/toffer_86/cmm4_02a_080712.7z
cmm4_2a-75      120225318  275.107  283.936  21899.7
ppmd-8          137974856  112.140  120.812  22878.8  PPMd Jr1 -m1980 -o8 http://compression.ru/ds/ppmdj1.rar
ppmd-6          140889604  101.126  109.297  23208.1  PPMd Jr1 -m1980 -o6

v3_o6mix3d1     138634222  331.436  327.655  25269.6  bugfixed ver; http://ctxmodel.net/files/MIX/mix_v3.rar
v3_o6mix3e      139141975  259.485  259.533  24595.7  16bit counters
v3_o6mix3e2     138243249  258.875  261.031  24469.7  2x expanded hash tables
v3_o6mix3e21    138426074  243.579  244.625  24318.9  strict 64byte alignment, like in 3d
v3_o6mix3e22    138426074  254.076  254.140  24424.6  tables aligned (by 4096)
v3_o6mix3e23    139228039  256.110  257.186  24582.4  hash indexes aligned (by 64 bytes)

o6mix3e21_el32  132825613  248.781  251.953  23522.3  using Shkarin's LZP preprocessor with -l32 (and E8 before LZP)

v4_o6mix56b1    142027465  493.826  492.156  27607.2  nibble hash with collision detection
v4_o6mix58      139235023  261.780  264.672  24664.0  order2-3 match model test: http://ctxmodel.net/files/MIX/mix_v4.rar

v5_o6mix59      138256704  257.016  257.890  24438.5  3e2 with coder from 58: http://ctxmodel.net/files/MIX/mix_v5.rar
v5_o6mix59a     135926958  286.483  287.546  24400.5  59 + interpolated SSE<6> over o6

2008-07-18 09:47:59 Anonymous       > 
Great work. What direction are you taking now? Speed or compression ratio?

2008-07-18 12:34:48 donotdisturb    > 
To improve speed, you can try not only avoiding if-then-else branches but also dependencies between values to allow out-of-order execution on modern CPUs. For ex. instead of shifting the context after each bit (cx += cx + bit), you can do 
(cx = 1 << 8 | chr) before encoding a byte and calculate the current context like (cx >> (bitpos+1) where bitpos=7...0 ) in the encode-byte function. You can also try to precalculate as much as possible before the range-coder call.
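The two ways of forming the bit context mentioned in this comment can be compared in a short sketch. The function names are hypothetical; the formulas are the ones quoted above. Note that the direct form needs the whole byte in advance, so it applies only when encoding.

```cpp
#include <cassert>
#include <cstdint>

// Incremental form: start at 1 and shift each coded bit in (cx += cx + bit).
// Returns the context seen just before coding bit number `bitpos`.
inline uint32_t ctx_incremental(uint8_t c, int bitpos) {
    assert(bitpos >= 0 && bitpos <= 7);
    uint32_t cx = 1;
    for (int i = 7; i > bitpos; --i) cx += cx + ((c >> i) & 1);
    return cx;
}

// Direct form: build (1 << 8 | chr) once, then shift it per bit position.
// There is no dependency on the context of the previous bit.
inline uint32_t ctx_direct(uint8_t c, int bitpos) {
    assert(bitpos >= 0 && bitpos <= 7);
    uint32_t cx = (1u << 8) | c;
    return cx >> (bitpos + 1);
}
```

Both functions return the same value for every byte and every bit position from 7 down to 0.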

2008-07-18 15:08:33 Shelwien        > 
> To improve speed, you can try not only avoiding 
> if-then-else branches but also dependencies between values 
> to allow out-of-order execution on modern CPUs.  
 
You're right, and there are things like that in rc_sh2d.inc
(rc_pre); in v5 I also made changes that would allow
overlapping the sequences of estimations and updates (i.e.
prefetching the counters/mixers, then doing the update for the
previous bit, then actually using the prefetched values).
 
> For ex. instead of shifting the context after each bit  
> (cx+=cx+bit), you can do (cx=1<<8|chr) before 
> encoding a byte and calculate the current context like 
> (cx>>(bitpos+1) where bitpos=7...0) in the encode-byte 
> function. 
 
That's the reason for bit encode calls looking like 
encode(c&(1<<7)). I tried explicitly setting this byte 
context in fpaq0pv4B, and it wasn't any faster than 
shifts - and no wonder, as it's just a single LEA or SHL
instruction which easily overlaps with other stuff. 
 
Also this context is known only in encoding, while 
there's no alternative in decoding because further 
bits are unknown. 
 
> You can also try to precalculate as much as 
> possible before the range-coder call. 
 
As I explained already, there was an idea (from fpaq0pv4B)
to turn encode_byte() into a template, make an array of
encode_byte<0>..encode_byte<255> and just call it by byte
value - the compiler would be able to optimize it really well,
though compilation would be considerably slower.
But with a complex model (unlike some order0) there'd be 
too much code in each instance of byte coding function, 
and it'd cause code cache problems. 
Also it's only applicable to encoding (decoder versions
with speculative steps are possible too, but that's too 
complicated for now). And there wasn't any demand for a 
reverse asymmetric compressor with twice faster 
compression than decompression ;).
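The encode_byte<0>..encode_byte<255> idea can be sketched like this. It is an illustration under assumptions, not the fpaq0pv4B code: `BitSink` is a hypothetical stand-in for the model plus range coder, and the table is built with std::index_sequence (C++14), which is just one convenient way to instantiate all 256 versions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for the real model and range coder: it only records
// (context, bit) pairs so that the sketch can be checked.
struct BitSink {
    std::vector<std::pair<uint32_t, int> > out;
    void encode(uint32_t ctx, int bit) {
        assert(bit == 0 || bit == 1);
        out.emplace_back(ctx, bit);
    }
};

// One instantiation per byte value: every context and every bit in the
// loop is a compile-time constant once the loop is unrolled.
template <int C>
void encode_byte(BitSink& s) {
    for (int i = 7; i >= 0; --i)
        s.encode((256 | C) >> (i + 1), (C >> i) & 1);
}

typedef void (*EncodeFn)(BitSink&);

// Builds the table of 256 function pointers at compile time.
template <std::size_t... I>
const EncodeFn* make_table(std::index_sequence<I...>) {
    static const EncodeFn table[] = { &encode_byte<static_cast<int>(I)>... };
    return table;
}

// Dispatch by byte value.
inline void encode_any(BitSink& s, uint8_t c) {
    static const EncodeFn* table = make_table(std::make_index_sequence<256>());
    table[c](s);
}
```

With a trivial sink this stays small, but with a full o0-o6 model inlined into each of the 256 instances the generated code grows accordingly, which is the code cache concern raised above.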

2008-07-22 11:13:12 toffer          > 
Have you finished the delayed counter approach?

2008-07-22 13:59:17 Shelwien        > 
Got stuck in SSE optimization instead ;)
Now messing with a single submodel (~o2, selected by 
optimizer) and SSE. Surprisingly, wasn't able to gain 
much speed from SSE (even though the initial implementation 
was clearly inefficient), but got some compression 
improvement. 
Btw, even with a single counter + SSE a coder takes ~70s 
to encode my test set (ppmd -o8 / CCM are ~120s). 
And I'm probably going to attach a delayed counter today,
so it would become even slower...

2008-07-22 14:34:57 toffer          > 
Let's see what happens :) 
 
Could you provide a completely tuned (target: SFC) order 2 model, without SSE/additional stuff. Just a plain order 2 codec with flat deco.? I'm working on a new counter model, somehow this is a delayed counter with only a single mapping for speed. I think i'll use 3 delayed bits and 12 bit precision for the probability. This will take some time (modify my optimizer...).

2008-07-22 15:21:55 Shelwien        > 
> Could you provide a completely tuned (target: SFC) order 2 
> model, without SSE/additional stuff. Just a plain order 2 
> codec with flat deco.?  
 
For now I can only provide this: http://ctxmodel.net/files/o6mix6.rar 
It's not exactly order2 (context mask 405FFF) and was optimized for
book1+wcc386. However I can turn it into real o2 and tune to SFC, 
but that would take until tomorrow. 
 
> I'm working on a new counter model, somehow this is a 
> delayed counter with only a single mapping for speed. 
 
You mean, a single mapping instead of separate estimation 
and update? Don't see how this could improve speed.
 
> I think i'll use 3 delayed bits and 12 bit precision for 
> the probability.  
 
Somehow this layout reminds me of something ;)

2008-07-22 15:37:22 toffer          > 
I don't need it that quick. 
 
Well, if you separate prediction and update you need to do two "counter updates". Merging this will take only one update. 
 
I'm unsure about an exact layout for each order. My intuition says that more delayed bits will do a better job on higher orders. I'll bruteforce the best setting. 
 
And you know where i've taken my first guess from :)

2008-07-22 17:15:28 Shelwien        > 
> I don't need it that quick. 
 
Ok... I'll make an SFC-optimized o2 then.
 
> Well, if you separate prediction and update you need to do 
> two "counter updates". Merging this will take only one 
> update. 
 
Right, but that's the point. Estimation mapping gives you the 
probability after adding all the delayed history, and update 
mapping only skips one bit of that history - they're completely 
different. 
I think you're better off with a usual counter and a static
mapping for estimation, if you can't afford two mappings. 
 
> I'm unsure about an exact layout for each order. My 
> intuition says that more delayed bits will do a better job 
> on higher orders. I'll bruteforce the best setting. 
 
That's not exactly about the number of delayed bits... but 
it might be useful to support long bit runs there. 
I'm not sure about history encoding for such a case though 
(really reminds me of state machine counters, and in fact 
the history might be lossy too), and imho it would be better 
to first analyze the statistics (on history distributions) 
instead of bruteforce.
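The difference between the two mappings discussed in this exchange can be shown with a toy delayed counter. This is only a sketch under assumptions: the struct, the 3 delayed bits and 12-bit probability (toffer's first guess above), the `kRate` shift and the use of a plain shift update as a stand-in for both tuned mappings are all hypothetical.

```cpp
#include <cassert>

constexpr int kDelay = 3;  // delayed bits, as in toffer's first guess
constexpr int kRate  = 4;  // assumed adaptation shift

// Stand-in for a tuned mapping: plain shift update of a 12-bit P(1).
inline int absorb(int p, int bit) {
    return bit ? p + ((4095 - p) >> kRate) : p - (p >> kRate);
}

struct DelayedCounter {
    int p = 2048;  // reflects every bit except the last kDelay ones
    int hist = 0;  // last kDelay bits, newest in bit 0 (starts as zeros)

    // Estimation mapping: probability after adding ALL delayed history.
    int estimate() const {
        int q = p;
        for (int i = kDelay - 1; i >= 0; --i) q = absorb(q, (hist >> i) & 1);
        return q;
    }

    // Update mapping: only the OLDEST delayed bit leaves the history
    // and is merged into p; the new bit enters the history.
    void update(int bit) {
        assert(bit == 0 || bit == 1);
        p = absorb(p, (hist >> (kDelay - 1)) & 1);
        hist = ((hist << 1) | bit) & ((1 << kDelay) - 1);
    }
};
```

With identical stand-in mappings this behaves exactly like an ordinary counter. Presumably the point of the real design is that the estimation step becomes a separately tuned table indexed by the delayed history, which a single merged mapping cannot express.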

2008-07-22 23:24:44 Shelwien        > 
Here's your o2: http://ctxmodel.net/files/o6mix6-t.rar

2008-07-23 09:45:30 toffer          > Thanks a lot! :) 

2008-07-23 09:46:03 toffer          > 
BTW I still haven't finished my new optimization model... hunting horrible bugs :(
