Context Modelling

<< ctxmodel.net

Mix v2:
sh_samples_1 benchmark results
SFC benchmark results

This version is basically a bugfix of v1 - the contexts (including hash functions etc) are no more calculated twice per bit. Also the parameters were tuned to book1+wcc386 again, because even though the SFC sample is beneficial for SFC compression, but the profile based on it seems to be far from "average".

o6mix2 o6mix with dual o0,o1 and single o2-o6, also contexts are stored when first computed, instead of doing that twice.
o6mix3 o6mix2 with hash functions changed to allow single context calculation per byte.
o6mix3a o6mix3 with single o0,o1 (2 or 9 counters and 2 of 8 mixers removed)

This time benchmarking went as expected (more or less) - results are better than v0 in sh_samples_1 test due to improved E8, and speed is somewhat better because of context calculation optimizations. o6mix3 compressing better than o6mix2 can probably be explained by tuning differences (o6mix3 had worse results on book1+wcc386).

But the fact that even o6mix3a is 4x slower than CCM is kinda depressing - only counters and mixers are left in the bit coding function.

Possible causes of this slowness include:

counter updates with additional table lookups;
bad cache hit rate at counter access (would try to add prefetching);
divisions in the mixer (unlikely as mixer update thresholds are quite high)

Mix v2 optimization progress - o6mix2 in red, o6mix3 in green and o6mix3a in blue. Vertical axis shows the compressed size of book1+wcc386 optimization target and horizontal axis shows the time in hours.

Intel Core2 Q9450 3.68Ghz=460x8, DDR2 5-5-5-18 @ 920=460*2
Codec Datasize Ctime Dtime Metric Notes
rar371-m1 174366298 17.642 16.486 27427.2 rar 3.71 -m1 -mdg - http://rarlab.com/rar/wrar371.exe
rar371-m2 148454375 49.873 16.124 23407.1 rar 3.71 -m2 -mdg
rar371-m3 144515166 65.158 15.813 22803.8 rar 3.71 -m3 -mdg
rar371-m4 139815031 72.656 38.593 22304.7 rar 3.71 -m4 -mdg
rar371-m4t 143892793 76.141 15.717 22716.6 rar 3.71 -m4 -mdg -mct-
rar371-m5 139875435 81.110 47.500 22411.6 rar 3.71 -m5 -mdg
rar371-m5t 143629349 88.437 17.157 22702.1 rar 3.71 -m5 -mdg -mct-

balz112-e 138914945 263.485 50.798 22476.9 http://encode.ru/balz/balz112.zip
balz112-ex 136358563 428.938 50.500 22240.0 -ex
balz113-e 137842934 263.688 58.249 22384.1 http://encode.ru/balz/balz113.zip
balz113-ex 135284357 436.734 58.015 22155.1 -ex
balz114-e 143781598 81.405 35.876 22906.0 http://encode.ru/balz/balz114.zip
balz114-ex 141820766 125.796 35.406 22639.4 -ex

ppms-4 151237030 77.921 83.063 24539.3 PPMs Jr1 -o4 http://compression.ru/ds/ppmsj.rar
ppmd-8 137974856 112.140 120.812 22878.8 PPMd Jr1 -m1980 -o8 http://compression.ru/ds/ppmdj1.rar
dlite-8 127536394 281.281 65.437 20863.2 Durilca'light -m1580 -o8 -t1 http://www.compression.ru/ds/durilca.rar
dlite-16 126660769 289.781 72.469 20805.2 -m1580 -o16 -t1
ppmj-10 117045960 749.500 337.968 22417.6 PPMonstr Jr1 -m1980 -o10 http://compression.ru/ds/ppmdj1.rar
paq9a-7 117303349 555.610 558.845 24472.7 paq9a -7 http://cs.fit.edu/~mmahoney/compression/paq9a.zip
cmm4-75 120002417 287.188 289.266 21930.2 http://freenet-homepage.de/toffer_86/cmm4_01f_080527-gcc4.3.7z

st2rc3-2 165794810 13.905 31.843 26237.8 http://shelwien.googlepages.com/st2rc_v3.rar
st2rc3e89-3 164499960 13.767 32.219 26039.1
st2rc6e89mm-2 159133531 52.763 74.029 25657.7

ccm-0 127465606 115.874 115.516 21187.5 CCM 1.30c -0 http://christian.martelock.googlepages.com/dl_ccm130c.zip
ccm-1 125991688 120.141 119.624 21002.6
ccm-2 124869661 121.641 121.094 20843.5
ccm-3 124017539 121.140 120.782 20706.7
ccm-4 123346433 120.859 120.500 20598.7
ccm-5 123152614 121.610 121.375 20578.0
ccm-6 123098025 124.672 124.796 20606.7
ccm-7 123066391 127.749 128.250 20639.4

o2mix-3 161322206 198.733 215.499 27560.3 http://ctxmodel.net/files/MIX/mix_v0.rar
o3mix-1 148964049 326.469 336.547 26967.6
o4mix-1 142328712 413.516 421.468 26867.1
o5mix-2 139873527 521.796 524.610 27623.1
o6mix-1 139534658 655.110 665.844 29115.8

v1_o5mix 141610055 507.281 519.439 27828.2 http://ctxmodel.net/files/MIX/mix_v1.rar
v1_o5mix-a 141657843 501.125 500.969 27644.9
v1_o6mix 140439276 635.315 645.016 29029.1
v1_o6mix-a 141373740 637.062 631.906 29045.8
v1_o6mix1a 141989779 526.610 505.046 27763.0
v1_o6mix1b 141665741 522.655 503.077 27688.7

v2_o6mix2 138865709 553.062 551.640 27767.2 http://ctxmodel.net/files/MIX/mix_v2.rar
v2_o6mix3 137947855 464.674 460.392 26622.9
v2_o6mix3a 138745569 397.406 391.891 25995.3

2008-07-10 12:00:23 toffer          > 
I guess your main "speed" problem still is the non-cached counter access... I wonder why you hardly refuse to use nibble aligned hashing?

2008-07-10 13:53:32 Shelwien        > 
Just tried to explicitly prefetch all the counters and mixers  
related to current context, before any actual access. 
Speed became 50% worse somehow. 
Now testing the model with 64-byte alignment - not that  
I refuse anything, but I think that counters are clustered 
well enough as they are. 
So I still think that the main cause of slow speed is table 
lookup in the counter update. 
 
Ah! Benchmark of aligned version is complete. 
v2_o6mix3c 138936605 393.987 389.343 
So yeah its faster, but not that much.

2008-07-10 14:39:01 Anonymous       > 
Could you post the code for the memory access?

2008-07-10 15:08:28 toffer          > 
I guess 64 byte alignment doesn't mean nibble based hashing?

2008-07-10 15:18:58 Anonymous       > 
o6mix3c source - http://ctxmodel.net/files/o6mix3c.rar 
Have a look at model.inc / ctxprepare()  
And 64 byte alignment does mean that counters for first 
nibble are packed into 64 bytes... and mixers for 5 bits.

2008-07-10 15:37:05 Anonymous       > 
I can tell you why your speed gain is tiny... the first nibble was almost always cached in the previous version. Now it is always cached, since you force 64 byte alignment. 
 
So the speedup is a result of cache alignment and the improved mixer storage.  
 
The main problem remains: the bits after the fist nibble still aren't cached at all. In CMM4 i'm doing two hashed memory accesses per model. You do 1 + 15 = 16. That's the point.

2008-07-10 15:37:52 toffer          > 
Could you add something like cookies to remember the names? It's really annoying...

2008-07-10 16:48:43 Shelwien        > 
Name storing added, but you'd have to type it at least once more ;)

2008-07-10 16:50:02 toffer          > 
There's a mistake: 
 
You do 1 + 15 = 16. Should read 1 + 4 (4 bits in a nibble :)

2008-07-10 16:50:19 toffer          > Wow, it works; Thanks!

2008-07-10 19:03:38 Shelwien        > 
v2_o6mix3d 138995584 329.845 327.954 ;-) 
Guess what? low nibble alignment did help.

2008-07-10 19:24:40 toffer          > 
Have you aligned the mixer, too? This is one thing i didn't took into accout. Ok, once again: during a byte coding step my mixer is completely cached. If your mixer is nibble-packed (two lookups per byte), you still do two. 
 
To sum things up, "random" cache line aligned accesses per byte: 
CMM4: 
 * hashes of 4 to 6 models: 2x4...2x6, max, 12 
 * mixing weights: 1 
 * SSE for each bit: 8 (cached well enough for redundant data) 
 * max: 21 
 
o6mix3d: 
 * hashes of the models (orders 1..6, i guess order0 will fit into cache): 2x7 = 14; 
 * hashes of mixers: 14 (if cache aligned) 
 * something else? 
 
Did you have a look at the call profile? Are there any non-inlined functions, like "void Update( int c, const word* MTb, int Mw ) {"? A complete byte coding loop is inlined in my implementation.

2008-07-10 23:06:31 Shelwien        > 
> Could you post the source, again? I can hardly believe this.  
 
Its a slightly tuned version, with somewhat better compression. 
http://ctxmodel.net/files/o6mix3d1.rar 
 
> What about profiling the cache misses? 
 
What exactly do you want to time or count? 
Well, I don't really know what more I can do easily enough. 
 
> Have you aligned the mixer, too?  
 
Of course. 
 
> Did you have a look at the call profile?  
 
There's no random function calls. The only variable 
thing there is memory access. 
 
> Are there any non-inlined functions, like "void Update( 
> int c, const word* MTb, int Mw ) {"? 
 
Functions declared inside the class are inline by default. 
 
> A complete byte coding loop is inlined in my implementation. 
 
I use a better compiler anyway, and with profile-guided  
optimization too.

2008-07-10 23:37:34 toffer          > 
Have a look at that: http://en.wikipedia.org/wiki/Valgrind 
 
If i'm not totally wrong, there's an Intel tool available, which can do similar stuff. I'll have a look tomorrow, good night :)

2008-07-11 07:21:28 Shelwien        > 
> http://en.wikipedia.org/wiki/Valgrind 
 
"an important limitation of Valgrind is its inability to 
detect bounds errors in the use of static or stack 
allocated data." 
Wonder if its able to process _any_ access to such data. 
 
> If i'm not totally wrong, there's an Intel tool available, 
> which can do similar stuff.  
 
There's VTune and some other tools, and recent CPUs have 
a lot of built-in performance counters (accessible via win32 afaik). 
But the question is what to measure and how to react to it. 
A tool is just a tool, it won't solve anything by itself. 
 
For example, I easily can measure something like this with RDTSC. 
Any suggestions on how to interpret it? 
 

2008-07-11 07:39:07 Shelwien        > 
Also here's another one with average clocks per function 

2008-07-11 14:23:31 toffer          > 
I can't really use these graphs to interpret anything, since i don't know how large their time consummation compared to the whole run time (nice graphs, btw) is. 
 
To valgrind - i don't think that data on the stack is the problem; the amount is tiny (compared to other data structures) and it should be accessed frequently, hence it should be cache efficient. 
 
What about a profile a la gprof's "annotated source listing"? 
 
I noticed one thing - you aligned the array indices to 16 element boundaries, that should be ok (16*4=64). But are you sure that the base memory of the arrays start at a 64 byte address? 
 
Pleas don't just say yes. You should really check it, i made some mistakes like that to, e.g. GCC's malloc/new implementation normally aligns memory block on *16* byte addresses

2008-07-11 14:39:05 toffer          > 
Found another issue: your nibble access isn't truly aligned on 64 byte chunks: 
 
nibble=(cxt)*15-(16-1)*15-1; 
 
This is:  
nibble=(ctx-16)*15 + 15; 
 
Which is basically ok, but you don't align on 16 byte boundaries, you use 15 byte boundaries. You'd better waste one byte per nibble and do: 
 
nibble=(ctx-16)*16 + 16;

2008-07-11 14:49:58 Shelwien        > 
> I can't really use these graphs to interpret anything, 
> since i don't know how large their time consummation 
> compared to the whole run time (nice graphs, btw) is. 
 
Sum of these 4 operations is ~95% of compressor's run time, 
including i/o. There's nothing much else to profile, and 
there's almost no branches. 
 
> What about a profile a la gprof's "annotated source listing"? 
 
Better do it yourself with the tools known to you. 
http://ctxmodel.net/files/o6mix3d1.rar 
I reuploaded the source with fixes for gcc compilation. 
 
> I noticed one thing - you aligned the array indices to 16 
> element boundaries, that should be ok (16*4=64). But are 
> you sure that the base memory of the arrays start at a 64 
> byte address? 
 
I'm sure that they're not actually. 
But there're other bugs anyway.  
Like, order0 which is actually order2 overlapping the 
next arrays %). 
I'm now tuning the fixed version of o6mix3d1  
(it has 0.3% worse complession but probably works faster), 
and a couple of new versions with 16bit counters 
(w/o T field) and fixed update - that is currently 1.8% 
worse in compression but much faster.

2008-07-11 14:50:48 toffer          > 
Contexts for coding the first nibble (mixers in a similar fashin!): 
 
  nibble = -1; 
  ctx = 1; 
 
  o0c =  0; 
  o1c =  ccc&255<<8;  // !!! 
  o2c =  (ccc&0xffff)*16; 
  o3c =  hashfun3( prevchar, O2_o3_mask )*16; 
  o4c =  hashfun4( prevchar, O2_o4_mask )*16; 
  o5c =  hashfun5( prevchar, O2_o5_mask ) ^ (ddd & O2_o5_mask2)*16; 
  o6c =  hashfun6( prevchar, O2_o6_mask ) ^ (ddd & O2_o6_mask2)*16; 
 
Proceeding to the second nibble: 
 
// that's sqrt(42) scaled and rounded to the nearest odd int :) 
nibble = ctx*278345695u*16-1; 
 
ctx = 1;  // indexes a nibble tree (bits 3-0). 
 
Using this along with 64 byte aligned data structures should make this cached right after the first access

2008-07-11 14:56:30 toffer          > 
Whoops, i forgot something: 
 
o0c += 16*(nibble-16)+16; 
o1c += 16*(nibble-16)+16;

2008-07-11 14:58:21 Shelwien        > 
You can check that yourself now ;) 
As to me, I'd benchmark it with and without alignment after 
finishing this optimization pass, but it looks like it has 
to run at least until tomorrow.

2008-07-11 14:58:38 toffer          > 
I hope you know, what i mean. Could you add some possibility to edit posts? This is another annoying thing, here :)

2008-07-11 15:03:03 Shelwien        > 
That'd have to wait for more visitors 
as that requires a proper registration etc. 
Just post your edited versions and I'd replace them.

2008-07-11 15:15:48 Shelwien        > 
Seems like I forgot to refresh the benchmark results. 
Check http://ctxmodel.net/files/MIX/mix_v2.htm 
Especially 3d vs 3d1 (3d has *16).

2008-07-11 15:44:03 toffer          > 
Well, i found something else: per model your T's maximal value is _very_ large, up to something like 8000. That's definitely another point... You might want to shrink this drastically, e.g. to at most 100. Or better leave it out. 
 
That adds one more random access per bit coding step, i bet this is the reason for your "much faster" 16 bit counters. 
 
PS: i just wanted to join the posts and correct some mistakes.

2008-07-11 16:03:06 Shelwien        > 
> Well, i found something else: per model your T's maximal 
> value is _very_ large, up to something like 8000.  
 
Yeah, its controlled by wrT parameters. 
But its like that because optimizer found that it improves 
compression. 
 
> That's definitely another point... You might want to 
> shrink this drastically, e.g. to at most 100.  
> Or better leave it out. 
> That adds one more random access per bit coding step, i 
> bet this is the reason for your "much faster" 16 bit 
> counters. 
 
Right, and I was talking about that from the start. 
But my current result with 16bit counters is 468784 
(on book1+wcc386) and 3d1 with buggy extra o2 removed 
makes 461793 - I think that's a significant difference. 
 
However I expected this too, and this is a necessary 
step before switching to delayed counters. 
 
> PS: i just wanted to join the posts and correct some mistakes. 
 
Just do it and repost, and I'd remove older copies.

2008-07-11 16:21:25 toffer          > 
Ok, i'll do it later. 
 
The slowness' reason still remains the same - uncached memory access. 
I mainly concentrated on the things i know from my own implementations: cross as few cachelines per symbol coding step as possible. It helped a lot. 
Now, you've also identified another bottleneck - as i did when looking at your counter limits. 
 
You can overcome the compression loss by using "per model context merging" (via a merging table and static mixing). Basically it's just a small (~100..300 entires) SSE table per model and static mixing. You know the idea, i've posted it somewhere here or on encode.ru.

2008-07-11 17:22:08 inikep          > 
Good work! 
 
I'm very interested in this talk, but I'm on my holidays in Georgia and I rarely have access to Internet. BTW your page is not working from Georgia so I'm using http://translate.google.com :)

2008-07-11 17:38:32 Shelwien        > 
> The slowness' reason still remains the same -  
> uncached memory access. 
 
I just looked at IntelC's asm output and its quite ugly - 
I'd probably have to rewrite the mixer and counter updates 
with templates. Also I don't like how it writes a 16bit 
value in update immediately after modifying the 32bit 
version of the same register. 
 
> I mainly concentrated on the things i know from my own 
> implementations: cross as few cachelines per symbol coding 
> step as possible. It helped a lot. 
 
Well, I added that nibble separation only because of your pressure ;) 
But actually I don't really like the idea of getting stuck at 
o6mix speed optimization. Quite a lot of things can be done there, 
but eg. an unary coder just won't have these problems from the start. 
 
You forgot probably, but I didn't really want to write this 
compressor because I had a fairly clear picture of how it 
would go even before starting. Its just that "inikep" kept 
insisting that adding more orders to my order2 coder would 
make it comparable to CCM right away ;). 
 
> You can overcome the compression loss by using "per model 
> context merging" (via a merging table and static mixing). 
> i've posted it somewhere here or on encode.ru. 
 
Yeah, you posted it here, and I thought that your static mixing 
can be merged either with the mapping or the dynamic mixing. 
 
But I'm just going to implement my delayed counters - 
they have a similar static mapping in estimation anyway. 
However switching all the contexts to delayed counters 
with optimized parameters would take quite a time - at 
least like a week I think. 
Guess I really need a 100Ghz CPU ;)

2008-07-11 17:50:59 inikep          > 
Currently, your page is working from Georgia. 
 
>Durilca'light results added to comparison, and its compression is comparable to CCM, but with 30% faster decoding. 
 
I believe that there is lack such powerful filters in CCM like in Durilca: 
http://encode.ru/forum/showpost.php?p=1508&postcount=20

2008-07-11 19:03:45 Shelwien        > 
> Why don't you just start to do a unary coder? 
 
I'm going to, eventually, but in distinction to o6mix, 
it requires some serious programming efforts as a hash 
doesn't fit it nor usual binary counters do - probability 
recalculation on rank swap would require some crazy precision, 
and would be slow. 
 
> The speed optimizations concerning cache alignment aren't 
> wrong at all - they help drastically. 
 
I argee; that's why I made that nibble context fix. 
But now I don't have any more ideas on improving the cache hit rate 
(well, I'm going to try 3:5 byte splitting instead of 4:4 
with 16bit counters). Of course, things like putting a mixer 
into the same cache line with 15 counters are possible, but 
that would cause a significant compression loss, and compression 
quality isn't that good even now. 
 
Also keeping symbol counters in a linear array instead of a tree 
is a much more efficient approach anyway, but that's hardly 
compatible with bitwise coding. 
 
> The speed gain of the 16 bit counters HEAVILY depends on 
> the reduction of random accesses 
 
Well, wait, we didn't see the benchmark yet. 
 
> If i've finished my exams i can aid you in optimization by 
> providing "good enough" starting estimates  
 
Not sure if that would actually work. 
But you can try improving the results of 3d1 or some other 
coder on some other file which I didn't use for tuning - 
I'd be surprised if you succeed though ;) 
 
> I get rather good results much faster than you,  
 
Actually its possible to do faster - eg. I could optimize 
different orders in parallel and merge the results and such. 
But here I'm optimizing the whole profile of 3-4 coder versions 
at once, so its not very efficient. 
And also I'm lazy and prefer to wait an extra day instead 
of manually guiding the optimization which could improve 
it too.

2008-07-11 21:10:03 inikep          > 
> Its just that "inikep" kept insisting that adding more 
> orders to my order2 coder would make it comparable to CCM 
> right away ;). 
 
And I was almost right :) You've achieved similar ratio to 
ppmd-8 and AFAIK still without a match model. My bigger 
concern was speed and I thought that it will be easier to 
achieve speed of CCM. So, toffer, you've also did a great 
job with your CMM.

2008-07-11 21:38:43 Shelwien        > 
@toffer: 
Btw, I just tested CMM's "x86 transform" (by hacking it out). 
And these are the results (for acrord32, mode 75): 
1314571 - no "x86 tranform" (MZ check disabled) 
1147078 - original result 
1188208 - my E8_v1 (ST2rc etc) 
1179265 - my E8_v2 (Mix v1+) 
Thus, questions: 
1. Is it really a "transform" (preprocessor) or a  
separate submodel 
2. Does it only process E8/E9 (E8 only?) 
 
Also, do you have a match model there?

2008-07-11 22:09:35 toffer          > 
For the CMM4 internals have a look at the describtion: 
 
http://cs.fit.edu/~mmahoney/compression/text.html#1727 
 
Yep, i've got a match model. The rest is basically a dirty hack, which changes some context hashes. Not really a submodel. It improves compression "for free": 
 
    switch (t) 
    { 
    case blocktransform::X86: // Some gappy stuff... somehow an ugly hack 
      hash[0] = 256*uint8(history>>16); 
      hash[3] = (1<<24|uint32(history&0xffff00))*5; 
      break; 
    default:  // as expected 
      hash[0] = 0; 
      hash[3] = (1<<24|uint32(history&0xffffff))*5; 
      break; 
    }

2008-07-11 23:20:21 toffer          > 
It's e8/e9 i tried to transform other relative addresses too, without a compression gain. To catch up something like this one needs an exe parser. GN8

2008-07-11 23:22:55 Shelwien        > check out http://compression.ru/sh/flt32.rar

2008-07-12 08:19:18 toffer          > 
Thanks! I already tried everything, expect 0xe8 xx xx xx xx 0x83  0xc4 xx. What is this? I haven't found anything like this in my instruction reference.

2008-07-12 09:19:59 Shelwien        > 
check some executables... 
0004F293: E89C510000 call 000054434  ---↓ (2) 
0004F298: 83C41C     add  esp,01C ;"∟" 
C++ executables remove function arguments from the stack like that.

2008-07-12 09:28:26 toffer          > 
Ah ok. Now i know why i was hardly able to find anything about these instructions within the call/jmp reference :) 

2008-07-12 10:03:51 Shelwien        > 
MixV3 shows 240-260s in sh_samples_1 benchmark 
(the version with 16bit counters). 
Also compression level is even slighly better than 
previously ;)

2008-07-12 10:31:00 toffer          > Good! What about the source?

2008-07-12 10:31:09 toffer          > What are your next plans?

2008-07-12 10:49:58 Shelwien        > 
I accidentally ran a benchmark at lower CPU frequency, so 
have to redo it. Also there's the SFC benchmark too. 
So v3 would be posted in less than a hour I think. 
As to sources... do I have to fix them again, or you'd be 
able to add the __min/__max definition and remove xmmintrin.h 
include by yourself? ;)

2008-07-12 10:51:15 toffer          > 
No, that'll be impossible :)  
 
In the previous version i wondered, why you included the mmx intristics anyway. You never used anything like this?! (and i thought you trust intelc)

2008-07-12 11:06:13 Shelwien        > 
These intel intrinsics are nonportable and rarely useful 
anyway, as its usually possible to write a portable code 
that would be properly vectorized. 
But here I experimented with prefetches, thus that header 
appeared.

2013-08-10 05:27:15                 > 

2013-10-10 18:13:54 toms            > You write very well!

2014-11-27 07:42:25                 > 

2015-01-12 08:14:46                 >