Context Modelling

<< ctxmodel.net

> Especially considering that it requires like a day of coding to make a alpha version ;)

Well, so here's the result: Mix v0 and the detailed benchmark is there.

Its basically just another mixer demo, starting with

mix(mix(mix(o0,o0'),mix(o1,o1')),mix(o2,o2'))

and lamely hashed higher orders (24M per order) are mixed over that. These are not exactly the traditional CM orders though, because the optimizer masked out half of the bits.

> [...] I can patch together an order6 coder [...]
> But even before doing it I know that both its speed and quality
> would be nothing special.

...And now we've got a coder comparable to rar and ppmd in compression, but at 10x slower speed (that can be somewhat fixed though): as expected.

In conclusion, here's what has to be done to reach the compression level of CCM/LPAQ/PAQ9.

Match model is a must, it seems;
Word model would be helpful too;
a match model probably would provide 2x the current speed, but even then only o2mix would be able to compete with CCM - that means 6 counters and 5 mixers at most. So delayed counters have to be used (or state machines, but I won't bother with that);
encoding from file instead of reading the file into memory, and also using the stream-oriented version of E8 would somewhat improve speed too;
the hashes probably could be fixed for better compression;

But further development of this coder is not really reasonable. Well, except for streamed E8 implementation which could be used elsewhere. Because eg. creating a new bytewise unary coder with tree-based statistics seems much more beneficial.

---

Intel Core2 Q9450 3.68Ghz=460x8, DDR2 5-5-5-18 @ 920=460*2
Codec Datasize Ctime Dtime Metric Notes
rar371-m1 174366298 17.642 16.486 27427.2 rar 3.71 -m1 -mdg - http://rarlab.com/rar/wrar371.exe
rar371-m2 148454375 49.873 16.124 23407.1 rar 3.71 -m2 -mdg
rar371-m3 144515166 65.158 15.813 22803.8 rar 3.71 -m3 -mdg
rar371-m4 139815031 72.656 38.593 22304.7 rar 3.71 -m4 -mdg
rar371-m4t 143892793 76.141 15.717 22716.6 rar 3.71 -m4 -mdg -mct-
rar371-m5 139875435 81.110 47.500 22411.6 rar 3.71 -m5 -mdg
rar371-m5t 143629349 88.437 17.157 22702.1 rar 3.71 -m5 -mdg -mct-

balz112-e 138914945 263.485 50.798 22476.9 http://encode.ru/balz/balz112.zip
balz112-ex 136358563 428.938 50.500 22240.0 -ex
balz113-e 137842934 263.688 58.249 22384.1 http://encode.ru/balz/balz113.zip
balz113-ex 135284357 436.734 58.015 22155.1 -ex
balz114-e 143781598 81.405 35.876 22906.0 http://encode.ru/balz/balz114.zip
balz114-ex 141820766 125.796 35.406 22639.4 -ex

ppms-4 151237030 77.921 83.063 24539.3 PPMs Jr1 -o4 http://compression.ru/ds/ppmsj.rar
ppmd-8 137974856 112.140 120.812 22878.8 PPMd Jr1 -m1980 -o8 http://compression.ru/ds/ppmdj1.rar
ppmj-10 117045960 749.500 337.968 22417.6 PPMonstr Jr1 -m1980 -o10 http://compression.ru/ds/ppmdj1.rar
paq9a-7 117303349 555.610 558.845 24472.7 paq9a -7 http://cs.fit.edu/~mmahoney/compression/paq9a.zip
cmm4-75 120002417 287.188 289.266 21930.2 http://freenet-homepage.de/toffer_86/cmm4_01f_080527-gcc4.3.7z

st2rc3-2 165794810 13.905 31.843 26237.8 http://shelwien.googlepages.com/st2rc_v3.rar
st2rc3e89-3 164499960 13.767 32.219 26039.1
st2rc6e89mm-2 159133531 52.763 74.029 25657.7

ccm-0 127465606 115.874 115.516 21187.5 CCM 1.30c -0 http://christian.martelock.googlepages.com/dl_ccm130c.zip
ccm-1 125991688 120.141 119.624 21002.6
ccm-2 124869661 121.641 121.094 20843.5
ccm-3 124017539 121.140 120.782 20706.7
ccm-4 123346433 120.859 120.500 20598.7
ccm-5 123152614 121.610 121.375 20578.0
ccm-6 123098025 124.672 124.796 20606.7
ccm-7 123066391 127.749 128.250 20639.4

o2mix-3 161322206 198.733 215.499 27560.3 http://ctxmodel.net/files/MIX/mix_v0.rar
o3mix-1 148964049 326.469 336.547 26967.6
o4mix-1 142328712 413.516 421.468 26867.1
o5mix-2 139873527 521.796 524.610 27623.1
o6mix-1 139534658 655.110 665.844 29115.8

2008-07-05 11:54:26 osmanturan      > 
> Match model is a must, it seems 
I don't think so for PAQish match model. toffer has good example of about with match model vs without match model. Look at this: 
 
http://www.encode.ru/forum/showpost.php?p=1433&postcount=21 
 
Surely, a better model can be made. Maybe, we can ask toffer for it's details. He really likes to share his experiences with any other people. Not like the others :-)

2008-07-05 12:08:26 toffer          > 
By accident i found that you've also posted things, which didn't appear at encode.ru, maybe private messaging, etc... Would you mind to post this stuff in the forum and maybe create a thread? It would be better if things don't spread around.

2008-07-05 12:44:32 inikep          > 
> at 10x slower speed (that can be somewhat fixed though) 
 
How much improvement do you think is possible? 
 
> only o2mix would be able to compete with CCM 
 
So we still don't know from CCM speed comes. 
 
> But further development of this coder is not really reasonable. 
 
I think that now is a good time to start discussion about a CCM/LPAQ. 
 
Thanks for accepting my "challenge".

2008-07-05 21:29:31 Shelwien        > 
@inikep: 
> > at 10x slower speed (that can be somewhat fixed though) 
> 
> How much improvement do you think is possible? 
 
1. A match model would allow to avoid all this mixing 
for 50-70% of data. (2x) 
2. Dual o0,o1,o2 may be not that useful with higher orders 
(1.5x) 
3. A single o2 instance takes 16M*(4+2)=96M. It should be 
hashed instead (1.2-1.5x) 
4. All counter arrays can be reduced in half by removing 
the access counters (along with dynamic update speed which 
has effect only at coding start) (1.2-1.5x) 
5. File i/o and streamed E8 (1.1x) 
6. Formal speed optimizations (eg. using the template 
scripts which I made for fpaq0pv4B). (1.1x) 
7. Division in the mixer can be replaced with a table lookup. 
8. "Precise" rc with 64bit multiplication is slow too. 
9. A compressor can store counter predictions for the 
block instead of mixing them right away, so an asymmetric 
implementation is possible. 
  
> > only o2mix would be able to compete with CCM 
>  
> So we still don't know from CCM speed comes. 
 
I'm still not really impressed, in fact. 
PPMd has the same speed, and no filters.  
Guess I should benchmark durilca'light. 
 
> > But further development of this coder  
> > is not really reasonable. 
>  
> I think that now is a good time to start discussion 
> about a CCM/LPAQ. 
 
Maybe. I'm especially interested in the paq9 results with 
my finnish dictionary - 
http://ctxmodel.net/files/MIX/mix_v0.htm ... and overall 
results distribution for that file. 
  
> Thanks for accepting my "challenge". 
 
The actual programming took like half an hour at most, 
its just a (automatic) parameter optimization which 
needed a day. 
Btw, I made a new optimization target - concatenated 
parts of each SFC file, and started o5 and o6 tuning 
again. It might be interesting, because the previous  
optimization was incremental (eg. o3mix used optimized 
parameters from o2mix and only o3 parameters was  
actually tuned in it... and so on), and this time _all_ 
the parameters are optimized by the same metric. 
 
@toffer: 
> By accident i found that you've also posted things, which 
> didn't appear at encode.ru, maybe private messaging, 
> etc... Would you mind to post this stuff in the forum and 
> maybe create a thread? It would be better if things don't 
> spread around. 
 
Well, I want to experience maintaining my own site and 
also I'm trying to make a blog engine to my liking, 
and its reasonable to take advantage of the content I'm writing. 
 
But of course you can cross-post anything to the forum 
if you like to discuss it there for some reason, but 
anyway there're only a few people who actually discuss  
such topics. 
 
Btw, mainly I don't like how its troublesome to make 
my posts in the forums to look as I intend them to, 
and also Ilia keeps his unstable hosting where DB server 
is down 10% of the time. 
 
@osmanturan: 
> > Match model is a must, it seems 
> I don't think so for PAQish match model. toffer has good 
> example of about with match model vs without match model. 
 
Well, I'm talking about a bytewise match model which  
would encode a flag for rank0 byte values, and only 
proceed with mixing and updating counters when its 
not a rank0 symbol. Its a first step of unary coding 
actually, and taking into account that rank0 hits 
for more than 50% of bytes... 
So its not a model specific for very long matches. 
but it would probably work good enough for long matches 
as well.

2008-07-07 11:26:05 inikep          > 
> A match model would allow to avoid all this mixing  
> for 50-70% of data. (2x)  
 
Nice, so altogether it's possible to get 6-10x. 
 
> I'm still not really impressed, in fact.  
> PPMd has the same speed, and no filters.  
 
I don't believe that filters do matter for CCM's speed. MFC contains over 400 files in 30+ formats. 
 
> Guess I should benchmark durilca'light. 
 
Accoring to maximum compression durilca'light have almost the same decompression speed (173 s) as PPMd, but much slower compression (366 s).

2008-07-07 11:36:05 inikep          > 
>I'm especially interested in the paq9 results with  
my finnish dictionary 
 
It seems that this file suits PAQ. The results with lpaq8 are similar?

2013-08-13 12:27:18                 > 

2013-08-14 20:30:24                 > 

2013-10-14 06:39:21                 > 

2014-01-06 21:26:13                 > 

2014-05-11 19:01:39                 > 

2014-07-15 13:49:58                 > 

2014-11-07 01:12:46                 > 

2014-11-14 23:33:03                 > 

2014-11-25 21:54:37                 > 

2015-01-10 19:29:03                 > 

2015-01-24 14:06:04                 >