Context Modelling

<< ctxmodel.net

Mix v3:

sh_samples_1 benchmark results (sh_samples_1)

o6mix3d1 Context Mixer with single o0-o6 submodels. Note that previous versions had a buggy o2 instead of o0 actually.
o6mix3e o6mix3d1 switched to 16-bit counters and retuned.
o6mix3e2 hash tables were expanded to compensate for reduced counter size. (there was also 3e1 with inheritance, but it was a total failure)
o6mix3e21 alignment experiments - strict nibble alignment (like in 3d)
o6mix3e22 + tables aligned by memory page
o6mix3e23 + hash indexes aligned for linear cache line access

Well, as expected, only a compact layout of elements was really significant, while too strict alignment can cause a reverse effect (due to limited cache associativity?)

Also the speed improvement with 16bit counters went as expected, but no compression loss was a bit of a surprise

Further plans:

various speed optimizations (like a rewrite of mixer and counter updates with C++ templates - compiler doesn't optimize them good enough)
introducing a match model (there's too much compression loss in all benchmarks due to its absence)
switching to 12:3 delayed counters. Removing the "update speed" parameter from counters was a preparation for this actually, but it seems better to attach a match model first, to avoid wasting a lot of CPU time on delayed counter optimization (they have a whole lot of parameters).

Codec	Datasize	Ctime	Dtime	Metric	Notes
Intel Core2 Q9450 3.68Ghz=460x8, DDR2 5-5-5-18 @ 920=460*2
st2rc3-2	165794810	13.905	31.843	26237.8	http://shelwien.googlepages.com/st2rc_v3.rar
st2rc3e89-3	164499960	13.767	32.219	26039.1
st2rc6e89mm-2	159133531	52.763	74.029	25657.7

ppms-4	151237030	77.921	83.063	24539.3	PPMs Jr1 -o4 http://compression.ru/ds/ppmsj.rar
ppmd-8	137974856	112.140	120.812	22878.8	PPMd Jr1 -m1980 -o8 http://compression.ru/ds/ppmdj1.rar
dlite-8	127536394	281.281	65.437	20863.2	Durilca'light -m1580 -o8 -t1 http://www.compression.ru/ds/durilca.rar
dlite-16	126660769	289.781	72.469	20805.2	-m1580 -o16 -t1
ppmj-10	117045960	749.500	337.968	22417.6	PPMonstr Jr1 -m1980 -o10 http://compression.ru/ds/ppmdj1.rar
paq9a-7	117303349	555.610	558.845	24472.7	paq9a -7 http://cs.fit.edu/~mmahoney/compression/paq9a.zip
cmm4-old-75	120002417	287.188	289.266	21930.2	http://freenet-homepage.de/toffer_86/cmm4_01f_080527-gcc4.3.7z
cmm4-75	120002444	301.625	312.204	22174.0
ccm-5	123152614	121.610	121.375	20578.0

o2mix-3	161322206	198.733	215.499	27560.3	http://ctxmodel.net/files/MIX/mix_v0.rar
o3mix-1	148964049	326.469	336.547	26967.6
o4mix-1	142328712	413.516	421.468	26867.1
o5mix-2	139873527	521.796	524.610	27623.1
o6mix-1	139534658	655.110	665.844	29115.8

v1_o5mix	141610055	507.281	519.439	27828.2	http://ctxmodel.net/files/MIX/mix_v1.rar
v1_o5mix-a	141657843	501.125	500.969	27644.9
v1_o6mix	140439276	635.315	645.016	29029.1
v1_o6mix-a	141373740	637.062	631.906	29045.8
v1_o6mix1a	141989779	526.610	505.046	27763.0
v1_o6mix1b	141665741	522.655	503.077	27688.7

v2_o6mix2	138865709	553.062	551.640	27767.2	http://ctxmodel.net/files/MIX/mix_v2.rar
v2_o6mix3	137947855	464.674	460.392	26622.9
v2_o6mix3a	138745569	397.406	391.891	25995.3
v2_o6mix3b	138745569	547.141	547.359	27699.7	prefetches
v2_o6mix3c	138936605	393.987	389.343	25996.3	64-byte hash alignment
v2_o6mix3d	138995584	329.845	327.954	25327.4	nibble-aligned byte tree structure
v2_o6mix3d1	138725010	332.437	329.562	25303.8	more compact byte tree
v2_o6mix3d2	139125143	333.500	331.594	25387.7	3d1 + 64-byte offset aligment
v2_o6mix3d1a	138417350	335.375	332.938	25292.5	3d1 + small tuning

v3_o6mix3d1	138634222	331.436	327.655	25269.6	bugfixed ver; http://ctxmodel.net/files/MIX/mix_v3.rar
v3_o6mix3e	139141975	259.485	259.533	24595.7	16bit counters
v3_o6mix3e2	138243249	258.875	261.031	24469.7	2x expanded hash tables
v3_o6mix3e21	138426074	243.579	244.625	24318.9	strict 64byte alignment, like in 3d
v3_o6mix3e22	138426074	254.076	254.140	24424.6	tables aligned (by 4096)
v3_o6mix3e23	139228039	256.110	257.186	24582.4	hash indexes aligned (by 64 bytes)

2008-07-12 13:30:56 toffer          > 
Good work. I'll look into it, later. 
 
How do you plan to implement your match model? 

2008-07-12 14:19:57 Shelwien        > 
> How do you plan to implement your match model?  
 
There're two ways, basically: 
1) use the last symbol occured in the hashed context (LZP); 
2) store the _offset_ of last context occurence  
(and keep some data window) and try to follow the 
successful matches. 
 
Btw, I did some additional benchmarks with 
 http://ctxmodel.net/files/LZP-DS_v0.rar 
 http://ctxmodel.net/files/E8flt-v2.rar 
and added the results to 
 http://ctxmodel.net/files/MIX/mix_v3_SFC.htm 
 http://ctxmodel.net/files/MIX/mix_v3.htm

2008-07-12 14:26:58 toffer          > 
I know. I mean how do you want to convert a match length+contexts to a prediction? Which contexts do you select?

2008-07-12 14:52:42 Shelwien        > 
> how do you want to convert a match length+contexts  
> to a prediction? 
 
I don't plan to use forward match lengths (though that 
could speed up the encoder), but some match histories 
probably would be useful. 
 
And of course its supposed to work like unary coding - 
the match model predicts the symbol (eg. last symbol 
from current o6 context), then the match flag is encoded, 
and only on no match the current o6mix model would be used. 
 
However I think that I'd first try out xoring the value 
with estimation and processing the result with my current model. 
 
> Which contexts do you select? 
 
There's not much stuff practically useful in the fast 
compressor, basically that's last data bytes and match 
histories (in different contexts maybe). 
But more than one context can be used to produce the 
rank0 estimation - eg. like some sparse contexts and such.

2008-07-12 15:32:39 Shelwien        > 
Well, obviously xoring didn't work, but it was really funny - 
optimizer first disabled the new trick by zeroing the 
hash multiplier - I blocked that. But then it did the same 
with masks, and then with predictor update speed %)

2008-07-12 16:01:10 toffer          > 
Could you test CMM4 .2a on your testset and see if it's faster this time (like it should be)? 
 
http://freenet-homepage.de/toffer_86/cmm4_02a_080712.7z

2008-07-12 16:05:20 toffer          > 
Don't do it like this, since it gives horrible results. I tried something like this while collecting ideas for CMM4 (when is switched from 3 to 4). It was very fast, but the compression was poor, around 11500000 for the SFC. Also PAQ9a shows that the results. It's better to turn off some "normal" models, like orders 0-3 and use orders 4-6 and the match model to predict. This way you achieve a speedup, while still obtaining good enough compression.

2008-07-12 19:07:53 Shelwien        > 
CMM 02a added and looks like compression is 8.5% faster, 
decompression is 9% faster and ratio is 0.1% worse. 
http://ctxmodel.net/files/MIX/mix_v3.htm

2008-07-12 19:51:42 toffer          > 
Thanks! 
 
My caching mechanism will give a more significant speedup without loosing compression :) I expect 20-40% Dunno if i'll integrate the modified decomposition, since it hurts compression too often. But it offers about 20% faster compression. 
 
I quickly tried to modify your v3_o6mix3e21, but was unable to get a speed gain with proper cache alignment. I think your hashing mechanism simply doesn't pay off (no collision handling, etc...). 
 
BTW: which dataset do you use to tune your models/compressors?

2008-07-12 19:53:04 toffer          > 
And please fix this wierd file handling. I had to change the conversion of a FILE* to an uint64, since i'm using a 64 bit system.

2008-07-12 20:32:51 Shelwien        > 
> I think your hashing mechanism simply doesn't pay off (no 
> collision handling, etc...). 
 
Do you think that compression could be improved with collision  
detection? It would be certainly slower though. 
 
> which dataset do you use to tune your models/compressors? 
 
Concatenated book1 and wcc386, as I mentioned. 
Tried using a cut down version of SFC (concatenated samples 
of each SFC file), but it only improved compression for SFC (in v1). 
 
> And please fix this wierd file handling. I had to change 
> the conversion of a FILE* to an uint64, since i'm using a 
> 64 bit system. 
 
Ok... its just that i/o module can be replaced with win32 i/o there.

2008-07-12 20:47:01 toffer          > 
Yep, collision detection does improve compression. You can see the difference with CMM4 (.1f vs >=.2). For some models the lastest version only uses two instead of three nibble models per hash entry (the speed gain is caused by other things, lots of inlining, collision handling is really negliable!) compared to 0.1f. A Further increase (more than three) of the number of nibble models hardly helps to improve compression.  
 
I personally would completely change the hash layout anyway (two nibble models with a 1 byte exclusion hash, each). Somehow it feels strange how your model behaves with cache tuning. And the hashing tables still aren't aligned. I noticed that with collision handling you can easily merge (even all!) hashing tables. The effect of a larger collision domain conceals the compression loss, hence compression improves. 

2008-07-12 21:23:13 osmanturan      > 
I want to share something from my experience. I had spent some time on hashing mechanism of the BIT. Collision detecting is similar to CMM1. Here is the hash entry structure: 
[Key] [Priority] [Data] 
My current counters are 16 bits long. Data sections are reserved for a nibble. So, data structure is 32 bytes. Core2 Duo and never processors have 64 bytes cache lines. So, BIT tries to select 2 entries on collisions. Aligned test (I mean in 64 bytes boundry) performance is a bit poor. Instead, I use forward test (i.e. if I found a collision, instead of testing with index^1, I test with index+1). Another tiny compression gain comes from hashing function. I nearly test all of the well known hashing functions (PAQ, CMM1, some multiplactive hashes etc.). Then I thought, why I didn't test the CRC32 performance? It works really well. With CRC32 I get best compression performance. But note that, I only use CRC32 on for first nibble in the coding byte. I update hashes with multiplactive hash based on previous nibble context.

2008-07-12 21:30:28 toffer          > 
CRC32 isn't designed for hashing. I wonder how the performance difference is. Multiplicative hashes are fast and give good results. I compared them against several more expensive mixing and hashing functions, without any big differences. 
Do you mean linear probing (and crossing the cache line)?

2008-07-12 21:35:34 Shelwien        > 
> Yep, collision detection does improve compression.  
 
Ok, I'd think about it. 
 
> You can see the difference with CMM4 (.1f vs >=.2).  
 
Yeah, right ;) 1f had better compression in my tests ;) 
 
> And the hashing tables still aren't aligned.  
 
They're aligned by 16, that's good enough probably... 
You can compare 3e21 vs 3e22 in http://ctxmodel.net/files/MIX/mix_v3.htm 
The only difference there is table alignment added in 3e22. 
 
> I noticed that with collision handling you can 
> easily merge (even all!) hashing tables.  
 
That's not exactly right - o?P0 and w??W parameters 
(initial values for counters and mixers) are different, 
and its even more important now, after switching to 
static updates.

2008-07-12 21:39:25 osmanturan      > 
> CRC32 isn't designed for hashing. I wonder how the 
> performance difference is.  
Yes, I know. I just tested and it works. That's all. Best compression performance was achieved by combination of CRC32 and multiplactive hash (for first nibble CRC32, for updating the second one same as CMM1 - multiplicative). 
 
> Multiplicative hashes are fast and give good results. 
Yes, they are fast and give good result. But, I think per nibble based complex hash functions don't affect too much the whole compression performance. I gained some KB by using CRC32. I didn't notice notable speed loss. 
 
> I compared them against several more 
> expensive mixing and hashing functions, without any big 
> differences.  
Don't expect so much differences. Just around a few KB for SFC IIRC. 
 
> Do you mean linear probing (and crossing the cache line)? 
Yes. I mean crossing the cache line. Also, interesting thing crossing the cache line on collision detection also gained the compression (of course, this can be true with only CRC32, not sure)

2008-07-12 21:41:45 toffer          > 
I didn't took this into account. 
 
If you plan to include collision handling, 16 byte alignment isn't enough, if you want to stay within a cache line for probing. Linear probing might not be as bad as i expected, since linear access is cache friendly.

2008-07-12 21:49:42 toffer          > 
BTW one more suggestion - you can use a direct lookup table for the first nibble of order 2. It is small 2^16*2^4*2 = 2 Mb. You don't have any collisions here. That was one of my smaller changes from .1f to .2 :)

2008-07-12 21:54:38 osmanturan      > 
I just confused while reading your post? :D I didn't understand clearly: "you" mean me or Shelwien? 
 
Because, really simple collision handling is already using in BIT. I use 32 bytes data structure per hash entry. I tried to handle collision in the cache line (by avoid crossing the cache line), this was a little performance loss for both speed and compression. I just tried next entry on collision (linear probing???). It worked better (both speed and compression) 
 
> Linear probing might not be as bad as i expected, since 
> linear access is cache friendly. 
 
Caching mechanism seems to work perfect on forward memory access. Some pappers already confirmed that idea. My experiments also confirmed it. 
 
P.S. Forgive me if there is a misunderstood

2008-07-12 21:55:51 osmanturan      > Sorry, there was a misunderstood :(

2008-07-12 22:01:58 toffer          > 
You don't need to read papers to get it. It've read it in an AMD optimization manual :) 
 
I meant Shelwien.  
 
But this is an improvement, which suits for you, too. Why should you hash the first nibble for order 2, if you can perfectly look it up (without collisions!).

2008-07-12 22:20:01 Shelwien        > 
Actually my order2 is still fully direct ;) 
I didn't like the idea of nibble partitioning before, 
but now I see how it affects the caching, so guess 
I'd really make a new nibble-oriented hash. 
These hashes in o6mix are lame anyway, as I said before ;)

2008-07-13 20:17:23 toffer          > Any advance today :) ?

2008-07-14 09:13:59 toffer          > 
It will become a bit slower, for sure. But i only expect a compression gain? How do you handle collisions? 
I commented out the collision handling in CMM4: 
 
10.919.698 vs 10.532.067 (collision handling) 
 
How did you implement your match model? I hope not in the sense of PAQ9a?

2008-07-14 09:16:21 toffer          > (I tested on the SFC)

2013-08-19 12:51:29                 >