Mix v2
sh_samples_1 benchmark results
SFC benchmark results

This version is basically a bugfix of v1 - the contexts (including hash functions etc.) are no longer calculated twice per bit. The parameters were also tuned to book1+wcc386 again, because even though tuning to the SFC sample is beneficial for SFC compression, the profile based on it seems to be far from "average".
This time benchmarking went as expected (more or less) - results are better than v0 in the sh_samples_1 test due to the improved E8, and speed is somewhat better thanks to the context calculation optimizations. o6mix3 compressing better than o6mix2 can probably be explained by tuning differences (o6mix3 had worse results on book1+wcc386). But the fact that even o6mix3a is 4x slower than CCM is kinda depressing - only counters and mixers are left in the bit coding function. Possible causes of this slowness are discussed in the comments below.
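For reference, a bit coding function of this kind boils down to roughly the following - a minimal sketch with illustrative names, scales and update rates, not o6mix's actual code:

#include <cstdint>

// Minimal sketch of a "counters and mixers only" bit coding step.
// All names, scales and rates here are illustrative assumptions.
struct Counter {          // per-context probability, 12-bit scale
  uint16_t p = 2048;      // P(bit==1) in 0..4095
  void update(int bit, int rate = 5) {
    if (bit) p += (4095 - p) >> rate;
    else     p -= p >> rate;
  }
};

struct Mixer2 {           // blends two predictions, trained on the coded bit
  int w0 = 16384, w1 = 16384;                          // 32768 == 1.0
  int mix(int p0, int p1) const { return (p0*w0 + p1*w1) >> 15; }
  void update(int bit, int p0, int p1, int pm) {
    int err = (bit << 12) - pm;                        // prediction error
    w0 += (p0 * err) >> 16;                            // gradient step
    w1 += (p1 * err) >> 16;
  }
};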
2008-07-10 12:00:23 toffer >
I guess your main "speed" problem is still the non-cached counter access... I wonder why you so stubbornly refuse to use nibble-aligned hashing?
2008-07-10 13:53:32 Shelwien >
Just tried to explicitly prefetch all the counters and mixers related to the current context, before any actual access. Speed somehow became 50% worse. Now testing the model with 64-byte alignment - not that I refuse anything, but I think the counters are clustered well enough as they are. So I still think that the main cause of the slow speed is the table lookup in the counter update. Ah! The benchmark of the aligned version is complete: v2_o6mix3c 138936605 393.987 389.343 So yeah, it's faster, but not by much.
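For reference, the explicit prefetch described above looks roughly like this (a sketch, not the actual o6mix code; counters_ptr and mixers_ptr stand in for the real model pointers):

#include <xmmintrin.h> // _mm_prefetch

// Sketch: request the cache lines of the current context's counters
// and mixers before the actual loads. As noted above, in o6mix this
// made things ~50% slower - the context hashes had to be computed up
// front anyway, so the prefetches mostly competed with demand loads.
inline void prefetch_context(const void* counters_ptr, const void* mixers_ptr) {
  _mm_prefetch(static_cast<const char*>(counters_ptr), _MM_HINT_T0);
  _mm_prefetch(static_cast<const char*>(mixers_ptr), _MM_HINT_T0);
}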
2008-07-10 14:39:01 Anonymous >
Could you post the code for the memory access?
2008-07-10 15:08:28 toffer >
I guess 64 byte alignment doesn't mean nibble based hashing?
2008-07-10 15:18:58 Anonymous >
o6mix3c source - http://ctxmodel.net/files/o6mix3c.rar Have a look at model.inc / ctxprepare(). And 64-byte alignment does mean that the counters for the first nibble are packed into 64 bytes... and the mixers for 5 bits.
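In other words, the layout is roughly the following (a sketch with made-up names, using C++11 alignas for brevity; a binary tree over one nibble has 1+2+4+8 = 15 nodes, so 15 4-byte counters plus padding fill exactly one cache line):

#include <cstdint>

// Sketch of the 64-byte nibble packing: once the line is loaded, all
// four bit lookups for that nibble hit the same cache line.
struct alignas(64) NibbleCounters {
  uint32_t node[15];  // bit-tree nodes, indexed by ctx = 1..15
  uint32_t pad;       // 60 -> 64 bytes
};
static_assert(sizeof(NibbleCounters) == 64, "one cache line per nibble");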
2008-07-10 15:37:05 Anonymous >
I can tell you why your speed gain is tiny... the first nibble was almost always cached in the previous version. Now it is always cached, since you force 64-byte alignment. So the speedup is a result of cache alignment and the improved mixer storage. The main problem remains: the bits after the first nibble still aren't cached at all. In CMM4 I'm doing two hashed memory accesses per model. You do 1 + 15 = 16. That's the point.
2008-07-10 15:37:52 toffer >
Could you add something like cookies to remember the names? It's really annoying...
2008-07-10 16:48:43 Shelwien >
Name storing added, but you'd have to type it at least once more ;)
2008-07-10 16:50:02 toffer >
There's a mistake: You do 1 + 15 = 16. Should read 1 + 4 (4 bits in a nibble :)
2008-07-10 16:50:19 toffer > Wow, it works; Thanks!
2008-07-10 19:03:38 Shelwien >
v2_o6mix3d 138995584 329.845 327.954 ;-) Guess what? Low-nibble alignment did help.
2008-07-10 19:24:40 toffer >
Have you aligned the mixer, too? This is one thing I didn't take into account. Ok, once again: during a byte coding step my mixer is completely cached. If your mixer is nibble-packed (two lookups per byte), you still do two. To sum things up, "random" cache-line-aligned accesses per byte:

CMM4:
* hashes of 4 to 6 models: 2x4...2x6, max 12
* mixing weights: 1
* SSE for each bit: 8 (cached well enough for redundant data)
* max: 21

o6mix3d:
* hashes of the models (orders 1..6; I guess order 0 will fit into the cache): 2x7 = 14
* hashes of the mixers: 14 (if cache aligned)
* something else?

Did you have a look at the call profile? Are there any non-inlined functions, like "void Update( int c, const word* MTb, int Mw ) {"? A complete byte coding loop is inlined in my implementation.
2008-07-10 23:06:31 Shelwien >
> Could you post the source, again? I can hardly believe this.

It's a slightly tuned version, with somewhat better compression: http://ctxmodel.net/files/o6mix3d1.rar

> What about profiling the cache misses?

What exactly do you want to time or count? I don't really know what more I can do easily enough.

> Have you aligned the mixer, too?

Of course.

> Did you have a look at the call profile?

There are no random function calls. The only variable thing there is the memory access.

> Are there any non-inlined functions, like "void Update(
> int c, const word* MTb, int Mw ) {"?

Functions declared inside the class are inline by default.

> A complete byte coding loop is inlined in my implementation.

I use a better compiler anyway, and with profile-guided optimization too.
2008-07-10 23:37:34 toffer >
Have a look at that: http://en.wikipedia.org/wiki/Valgrind If I'm not totally wrong, there's an Intel tool available which can do similar stuff. I'll have a look tomorrow, good night :)
2008-07-11 07:21:28 Shelwien >
> http://en.wikipedia.org/wiki/Valgrind

"an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack allocated data." I wonder if it's able to process _any_ access to such data.

> If I'm not totally wrong, there's an Intel tool available
> which can do similar stuff.

There's VTune and some other tools, and recent CPUs have a lot of built-in performance counters (accessible via win32, afaik). But the question is what to measure and how to react to it. A tool is just a tool; it won't solve anything by itself. For example, I can easily measure something like this with RDTSC. Any suggestions on how to interpret it?
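For reference, such a measurement is just a matter of wrapping the code of interest with timestamp-counter reads - a sketch using the __rdtsc intrinsic:

#include <intrin.h>  // __rdtsc on MSVC/IntelC; gcc has it in <x86intrin.h>

// Sketch: count CPU clocks around one operation. The TSC is affected by
// out-of-order execution and frequency changes, so the result is only a
// rough per-call figure; average it over many iterations.
unsigned long long measure_once() {
  unsigned long long t0 = __rdtsc();
  // ... the operation to measure, e.g. one counter update ...
  unsigned long long t1 = __rdtsc();
  return t1 - t0;
}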
2008-07-11 07:39:07 Shelwien >
Also here's another one with average clocks per function
2008-07-11 14:23:31 toffer >
I can't really use these graphs to interpret anything, since I don't know how large their time consumption is compared to the whole run time (nice graphs, btw). On Valgrind - I don't think that data on the stack is the problem; the amount is tiny (compared to the other data structures) and it should be accessed frequently, hence it should be cache efficient. What about a profile a la gprof's "annotated source listing"? I noticed one thing - you aligned the array indices to 16-element boundaries, which should be ok (16*4=64). But are you sure that the base addresses of the arrays start at a 64-byte boundary? Please don't just say yes. You should really check it; I made some mistakes like that too. E.g. GCC's malloc/new implementation normally aligns memory blocks on *16*-byte addresses.
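The usual workaround for that (a sketch, assuming nothing about o6mix's allocator) is to over-allocate and round the pointer up to the next 64-byte boundary:

#include <cstdlib>
#include <cstdint>

// Sketch: malloc/new may only guarantee 8/16-byte alignment, so
// allocate size+63 bytes and round up; 'raw' must be kept for free().
void* alloc64(std::size_t size, void*& raw) {
  raw = std::malloc(size + 63);
  if (!raw) return 0;
  uintptr_t p = (reinterpret_cast<uintptr_t>(raw) + 63) & ~uintptr_t(63);
  return reinterpret_cast<void*>(p);
}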
2008-07-11 14:39:05 toffer >
Found another issue: your nibble access isn't truly aligned on 64-byte chunks:

nibble = (ctx)*15 - (16-1)*15 - 1;

This is:

nibble = (ctx-16)*15 + 15;

which is basically ok, but you don't align on 16-element boundaries - you use a 15-element stride. You'd better waste one counter per nibble and do:

nibble = (ctx-16)*16 + 16;
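To spell out the arithmetic (assuming 4-byte counters): a 15-element stride is 60 bytes, so successive nibble blocks drift relative to the 64-byte lines and most of them straddle two lines, while a 16-element stride is exactly one line. A quick standalone check:

#include <cstdio>

// Sketch: count how many of 256 consecutive nibble blocks (15 counters
// of 4 bytes each) cross a 64-byte line boundary, for both strides.
int main() {
  const int strides[2] = { 15, 16 };
  for (int s = 0; s < 2; ++s) {
    int straddling = 0;
    for (int block = 0; block < 256; ++block) {
      int first = block * strides[s] * 4;   // first byte of the block
      int last = first + 15*4 - 1;          // last byte of its 15 counters
      if (first/64 != last/64) ++straddling;
    }
    std::printf("stride %2d: %3d of 256 blocks cross a line\n",
                strides[s], straddling);
  }
  return 0;
}

(With stride 15, 224 of the 256 blocks cross a line boundary; with stride 16, none do.)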
2008-07-11 14:49:58 Shelwien >
> I can't really use these graphs to interpret anything,
> since I don't know how large their time consumption is
> compared to the whole run time (nice graphs, btw).

The sum of these 4 operations is ~95% of the compressor's run time, including i/o. There's nothing much else to profile, and there are almost no branches.

> What about a profile a la gprof's "annotated source listing"?

Better do it yourself with the tools known to you: http://ctxmodel.net/files/o6mix3d1.rar I reuploaded the source with fixes for gcc compilation.

> I noticed one thing - you aligned the array indices to
> 16-element boundaries, which should be ok (16*4=64). But are
> you sure that the base addresses of the arrays start at a
> 64-byte boundary?

I'm sure that they're not, actually. But there are other bugs anyway. Like the order0, which is actually an order2 overlapping the next arrays %). I'm now tuning the fixed version of o6mix3d1 (it has 0.3% worse compression but probably works faster), and a couple of new versions with 16-bit counters (w/o the T field) and a fixed update - that one is currently 1.8% worse in compression but much faster.
2008-07-11 14:50:48 toffer >
Contexts for coding the first nibble (mixers in a similar fashion!):

nibble = -1;
ctx = 1;
o0c = 0;
o1c = ccc&255<<8; // !!!
o2c = (ccc&0xffff)*16;
o3c = hashfun3( prevchar, O2_o3_mask )*16;
o4c = hashfun4( prevchar, O2_o4_mask )*16;
o5c = hashfun5( prevchar, O2_o5_mask ) ^ (ddd & O2_o5_mask2)*16;
o6c = hashfun6( prevchar, O2_o6_mask ) ^ (ddd & O2_o6_mask2)*16;

Proceeding to the second nibble:

// that's sqrt(42) scaled and rounded to the nearest odd int :)
nibble = ctx*278345695u*16-1;
ctx = 1; // indexes a nibble tree (bits 3-0)

Using this along with 64-byte-aligned data structures should make everything cached right after the first access.
2008-07-11 14:56:30 toffer >
Whoops, I forgot something:

o0c += 16*(nibble-16)+16;
o1c += 16*(nibble-16)+16;
2008-07-11 14:58:21 Shelwien >
You can check that yourself now ;) As for me, I'd benchmark it with and without alignment after finishing this optimization pass, but it looks like it has to run at least until tomorrow.
2008-07-11 14:58:38 toffer >
I hope you know what I mean. Could you add some possibility to edit posts? That's another annoying thing here :)
2008-07-11 15:03:03 Shelwien >
That'll have to wait until there are more visitors, as it requires proper registration etc. Just post your edited versions and I'll replace them.
2008-07-11 15:15:48 Shelwien >
Seems like I forgot to refresh the benchmark results. Check http://ctxmodel.net/files/MIX/mix_v2.htm Especially 3d vs 3d1 (3d has *16).
2008-07-11 15:44:03 toffer >
Well, I found something else: per model, your T's maximal value is _very_ large, up to something like 8000. That's definitely another point... You might want to shrink this drastically, e.g. to at most 100. Or better, leave it out. That adds one more random access per bit coding step; I bet this is the reason for your "much faster" 16-bit counters. PS: I just wanted to join the posts and correct some mistakes.
2008-07-11 16:03:06 Shelwien >
> Well, I found something else: per model, your T's maximal
> value is _very_ large, up to something like 8000.

Yeah, it's controlled by the wrT parameters. But it's like that because the optimizer found that it improves compression.

> That's definitely another point... You might want to
> shrink this drastically, e.g. to at most 100.
> Or better, leave it out.
> That adds one more random access per bit coding step;
> I bet this is the reason for your "much faster" 16-bit
> counters.

Right, and I was talking about that from the start. But my current result with 16-bit counters is 468784 (on book1+wcc386), while 3d1 with the buggy extra o2 removed makes 461793 - I think that's a significant difference. However, I expected this too, and it's a necessary step before switching to delayed counters.

> PS: I just wanted to join the posts and correct some mistakes.

Just do it and repost, and I'll remove the older copies.
2008-07-11 16:21:25 toffer >
Ok, I'll do it later. The reason for the slowness is still the same - uncached memory access. I mainly concentrated on the things I know from my own implementations: cross as few cache lines per symbol coding step as possible. It helped a lot. Now you've also identified another bottleneck - as I did when looking at your counter limits. You can overcome the compression loss by using "per model context merging" (via a merging table and static mixing). Basically it's just a small (~100..300 entries) SSE table per model plus static mixing. You know the idea; I've posted it somewhere here or on encode.ru.
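Roughly, the idea looks like this (a sketch with illustrative sizes and quantization, not CMM4's actual code): a small adaptive table refines each model's counter prediction, and the mapped and raw predictions are blended with a fixed weight instead of going through the adaptive mixer:

#include <cstdint>

// Sketch of "per model context merging": a small SSE-like table maps a
// quantized counter state to a refined probability (0..4095).
struct MergeTable {
  uint16_t t[256];  // ~100..300 entries per model; 256 here for simplicity
  MergeTable() { for (int i = 0; i < 256; i++) t[i] = uint16_t(i*16); }
  int map(int q) const { return t[q & 255]; }  // q = quantized counter state
  void update(int q, int bit, int rate = 6) {
    uint16_t& p = t[q & 255];
    if (bit) p += (4095 - p) >> rate;
    else     p -= p >> rate;
  }
};

// Static mixing: a fixed blend, e.g. 3/4 mapped + 1/4 raw prediction.
inline int merge_mix(int p_raw, int p_mapped) {
  return (3*p_mapped + p_raw) >> 2;
}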
2008-07-11 17:22:08 inikep >
Good work! I'm very interested in this discussion, but I'm on holiday in Georgia and rarely have access to the Internet. BTW, your page is not working from Georgia, so I'm using http://translate.google.com :)
2008-07-11 17:38:32 Shelwien >
> The reason for the slowness is still the same -
> uncached memory access.

I just looked at IntelC's asm output and it's quite ugly - I'd probably have to rewrite the mixer and counter updates with templates. Also, I don't like how it writes a 16-bit value in the update immediately after modifying the 32-bit version of the same register.

> I mainly concentrated on the things I know from my own
> implementations: cross as few cache lines per symbol coding
> step as possible. It helped a lot.

Well, I added that nibble separation only because of your pressure ;) But actually I don't really like the idea of getting stuck at o6mix speed optimization. Quite a lot of things can be done there, but e.g. a unary coder just won't have these problems from the start. You probably forgot, but I didn't really want to write this compressor, because I had a fairly clear picture of how it would go even before starting. It's just that "inikep" kept insisting that adding more orders to my order2 coder would make it comparable to CCM right away ;).

> You can overcome the compression loss by using "per model
> context merging" (via a merging table and static mixing).
> I've posted it somewhere here or on encode.ru.

Yeah, you posted it here, and I thought that your static mixing could be merged either with the mapping or with the dynamic mixing. But I'm just going to implement my delayed counters - they have a similar static mapping in the estimation anyway. However, switching all the contexts to delayed counters with optimized parameters would take quite some time - at least a week, I think. Guess I really need a 100GHz CPU ;)
2008-07-11 17:50:59 inikep >
Currently, your page is working from Georgia.

> Durilca'light results added to comparison, and its compression is comparable to CCM, but with 30% faster decoding.

I believe CCM lacks the kind of powerful filters Durilca has: http://encode.ru/forum/showpost.php?p=1508&postcount=20
2008-07-11 19:03:45 Shelwien >
> Why don't you just start to do a unary coder?

I'm going to, eventually, but unlike o6mix it requires some serious programming effort, as neither a hash nor the usual binary counters fit it - probability recalculation on a rank swap would require some crazy precision, and would be slow.

> The speed optimizations concerning cache alignment aren't
> wrong at all - they help drastically.

I agree; that's why I made that nibble context fix. But now I don't have any more ideas for improving the cache hit rate (well, I'm going to try 3:5 byte splitting instead of 4:4 with the 16-bit counters). Of course, things like putting a mixer into the same cache line with the 15 counters are possible, but that would cause a significant compression loss, and the compression quality isn't that good even now. Also, keeping symbol counters in a linear array instead of a tree is a much more efficient approach anyway, but that's hardly compatible with bitwise coding.

> The speed gain of the 16 bit counters HEAVILY depends on
> the reduction of random accesses

Well, wait, we haven't seen the benchmark yet.

> If i've finished my exams i can aid you in optimization by
> providing "good enough" starting estimates

Not sure that would actually work. But you can try improving the results of 3d1 or some other coder on some other file which I didn't use for tuning - I'd be surprised if you succeeded though ;)

> I get rather good results much faster than you,

Actually it's possible to do it faster - e.g. I could optimize the different orders in parallel and merge the results, and such. But here I'm optimizing the whole profile of 3-4 coder versions at once, so it's not very efficient. And also I'm lazy and prefer to wait an extra day instead of manually guiding the optimization, which could also improve it.
2008-07-11 21:10:03 inikep >
> It's just that "inikep" kept insisting that adding more
> orders to my order2 coder would make it comparable to CCM
> right away ;).

And I was almost right :) You've achieved a ratio similar to ppmd-8, and AFAIK still without a match model. My bigger concern was speed, as I thought it would be easier to reach the speed of CCM. So, toffer, you've also done a great job with your CMM.
2008-07-11 21:38:43 Shelwien >
@toffer: Btw, I just tested CMM's "x86 transform" (by hacking it out). Here are the results (for acrord32, mode 75):

1314571 - no "x86 transform" (MZ check disabled)
1147078 - original result
1188208 - my E8_v1 (ST2rc etc)
1179265 - my E8_v2 (Mix v1+)

Thus, questions:
1. Is it really a "transform" (preprocessor), or a separate submodel?
2. Does it only process E8/E9 (E8 only?)

Also, do you have a match model there?
2008-07-11 22:09:35 toffer >
For the CMM4 internals, have a look at the description: http://cs.fit.edu/~mmahoney/compression/text.html#1727 Yep, I've got a match model. The rest is basically a dirty hack which changes some context hashes. Not really a submodel. It improves compression "for free":

switch (t) {
case blocktransform::X86:
  // Some gappy stuff... somehow an ugly hack
  hash[0] = 256*uint8(history>>16);
  hash[3] = (1<<24|uint32(history&0xffff00))*5;
  break;
default:
  // as expected
  hash[0] = 0;
  hash[3] = (1<<24|uint32(history&0xffffff))*5;
  break;
}
2008-07-11 23:20:21 toffer >
It's e8/e9. I tried to transform other relative addresses too, without any compression gain. To catch something like that, one needs an exe parser. GN8
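For reference, the basic e8/e9 filter under discussion looks roughly like this - a sketch of the encode direction only, without the MZ/PE checks and bounds tests that real filters do; variants also differ in whether the base added is the opcode's position or the next instruction's:

#include <cstdint>
#include <cstring>

// Sketch of the classic e8/e9 transform: rewrite the 32-bit relative
// call/jmp target as an absolute one, so repeated calls to the same
// function produce identical bytes. Decoding subtracts instead of adding.
void e8e9_encode(uint8_t* buf, std::size_t n) {
  for (std::size_t i = 0; i + 5 <= n; i++) {
    if (buf[i] == 0xE8 || buf[i] == 0xE9) {
      int32_t off;
      std::memcpy(&off, buf + i + 1, 4);  // unaligned-safe read
      off += int32_t(i);                  // relative -> absolute
      std::memcpy(buf + i + 1, &off, 4);
      i += 4;                             // skip the rewritten offset
    }
  }
}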
2008-07-11 23:22:55 Shelwien > check out http://compression.ru/sh/flt32.rar
2008-07-12 08:19:18 toffer >
Thanks! I already tried everything except 0xe8 xx xx xx xx 0x83 0xc4 xx. What is this? I haven't found anything like it in my instruction reference.
2008-07-12 09:19:59 Shelwien >
Check some executables...

0004F293: E89C510000 call 000054434 ---↓ (2)
0004F298: 83C41C add esp,01C ;"∟"

C++ executables remove function arguments from the stack like that - with cdecl-style calls the caller pops the arguments, so a call is typically followed by add esp,N.
2008-07-12 09:28:26 toffer >
Ah, ok. Now I know why I was hardly able to find anything about these instructions in the call/jmp reference :)
2008-07-12 10:03:51 Shelwien >
MixV3 shows 240-260s in the sh_samples_1 benchmark (the version with 16-bit counters). Also, the compression level is even slightly better than before ;)
2008-07-12 10:31:00 toffer > Good! What about the source?
2008-07-12 10:31:09 toffer > What are your next plans?
2008-07-12 10:49:58 Shelwien >
I accidentally ran a benchmark at a lower CPU frequency, so I have to redo it. And there's the SFC benchmark too. So v3 should be posted in less than an hour, I think. As for the sources... do I have to fix them again, or will you be able to add the __min/__max definitions and remove the xmmintrin.h include yourself? ;)
2008-07-12 10:51:15 toffer >
No, that'll be impossible :) In the previous version I wondered why you included the MMX intrinsics anyway. You never used anything like that?! (and I thought you trusted IntelC)
2008-07-12 11:06:13 Shelwien >
These Intel intrinsics are nonportable and rarely useful anyway, as it's usually possible to write portable code that gets properly vectorized. But here I experimented with prefetches, hence that header appeared.