sh_samples_1 benchmark results
Well, as expected, only a compact layout of elements was really significant, while overly strict alignment can cause the reverse effect (due to limited cache associativity?).
Also, the speed improvement with 16-bit counters went as expected, but the absence of any compression loss was a bit of a surprise.
2008-07-12 13:30:56 toffer >
Good work. I'll look into it later.
How do you plan to implement your match model?
2008-07-12 14:19:57 Shelwien >
> How do you plan to implement your match model?
There're two ways, basically:
1) use the last symbol that occurred in the hashed context (LZP);
2) store the _offset_ of the last context occurrence
(and keep some data window) and try to follow the match.
Btw, I did some additional benchmarks with
and added the results to
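Approach (1) above can be sketched roughly as follows. This is only an illustration under my own assumptions (table size, hash function, and all names are made up, not Shelwien's actual code): a small table remembers the last symbol seen after each hashed context, and that symbol becomes the match model's prediction.

```c
#include <stdint.h>

/* Illustrative LZP-style match table: one byte per hashed context,
 * holding the last symbol that followed that context. */
#define LZP_BITS 16
static uint8_t lzp_table[1u << LZP_BITS];

/* simple multiplicative hash of the packed context bytes (assumed) */
static uint32_t ctx_hash(uint32_t ctx) {
    return (ctx * 2654435761u) >> (32 - LZP_BITS);
}

/* predicted next symbol for the current context */
uint8_t lzp_predict(uint32_t ctx) {
    return lzp_table[ctx_hash(ctx)];
}

/* after coding the actual symbol, remember it for this context */
void lzp_update(uint32_t ctx, uint8_t sym) {
    lzp_table[ctx_hash(ctx)] = sym;
}
```

Approach (2) would instead store an offset into a kept data window and follow the data from there, which gives longer predictions at the cost of window memory.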
2008-07-12 14:26:58 toffer >
I know. I mean how do you want to convert a match length+contexts to a prediction? Which contexts do you select?
2008-07-12 14:52:42 Shelwien >
> how do you want to convert a match length+contexts
> to a prediction?
I don't plan to use forward match lengths (though that
could speed up the encoder), but some match histories
probably would be useful.
And of course it's supposed to work like unary coding -
the match model predicts the symbol (e.g. the last symbol
from the current o6 context), then the match flag is encoded,
and only on a mismatch is the current o6mix model used.
However, I think I'd first try xoring the value
with the estimation and processing the result with my current model.
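The unary-style flow described above could look roughly like this (purely illustrative names; the actual flag and symbol coding would of course go through adaptive models, not shown here):

```c
#include <stdint.h>

/* What the encoder sends per symbol: a match flag, and the symbol
 * itself only when the flag says "no match". */
typedef struct { int match; uint8_t sym; } Coded;

/* encoder side: the match model's prediction is compared first;
 * sym is only coded by the o6mix model when match == 0 */
Coded encode_step(uint8_t actual, uint8_t predicted) {
    Coded c;
    c.match = (actual == predicted);
    c.sym   = actual;
    return c;
}

/* decoder side: on a match the prediction IS the symbol, otherwise
 * the symbol is decoded from the fallback model */
uint8_t decode_step(int match, uint8_t predicted, uint8_t coded_sym) {
    return match ? predicted : coded_sym;
}
```

The point of the scheme is that on a match only one cheap binary flag is coded, and the full bytewise model is skipped entirely.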
> Which contexts do you select?
There's not much that's practically useful in a fast
compressor - basically just the last data bytes and match
histories (in different contexts, maybe).
But more than one context can be used to produce the
rank0 estimation - e.g. some sparse contexts and such.
2008-07-12 15:32:39 Shelwien >
Well, obviously the xoring didn't work, but it was really funny -
the optimizer first disabled the new trick by zeroing the
hash multiplier - I blocked that. But then it did the same
with the masks, and then with the predictor update speed %)
2008-07-12 16:01:10 toffer >
Could you test CMM4 .2a on your testset and see if it's faster this time (like it should be)?
2008-07-12 16:05:20 toffer >
Don't do it like this - it gives horrible results. I tried something similar while collecting ideas for CMM4 (when I switched from 3 to 4). It was very fast, but compression was poor, around 11,500,000 for the SFC. PAQ9a's results show this too. It's better to turn off some of the "normal" models, like orders 0-3, and use orders 4-6 plus the match model to predict. This way you achieve a speedup while still obtaining good enough compression.
2008-07-12 19:07:53 Shelwien >
Added CMM4 0.2a; it looks like compression is 8.5% faster,
decompression is 9% faster, and the ratio is 0.1% worse.
2008-07-12 19:51:42 toffer >
My caching mechanism will give a more significant speedup without losing compression :) I expect 20-40%. Dunno if I'll integrate the modified decomposition, since it hurts compression too often. But it offers about 20% faster compression.
I quickly tried to modify your v3_o6mix3e21, but was unable to get a speed gain with proper cache alignment. I think your hashing mechanism simply doesn't pay off (no collision handling, etc.).
BTW: which dataset do you use to tune your models/compressors?
2008-07-12 19:53:04 toffer >
And please fix this weird file handling. I had to change the conversion of a FILE* to a uint64, since I'm using a 64-bit system.
2008-07-12 20:32:51 Shelwien >
> I think your hashing mechanism simply doesn't pay off (no
> collision handling, etc...).
Do you think compression could be improved with collision
detection? It would certainly be slower, though.
> which dataset do you use to tune your models/compressors?
Concatenated book1 and wcc386, as I mentioned.
I tried a cut-down version of SFC (concatenated samples
of each SFC file), but it only improved compression for SFC (in v1).
> And please fix this wierd file handling. I had to change
> the conversion of a FILE* to an uint64, since i'm using a
> 64 bit system.
Ok... it's just that the i/o module can be replaced with win32 i/o there.
2008-07-12 20:47:01 toffer >
Yep, collision detection does improve compression. You can see the difference with CMM4 (.1f vs >=.2). For some models the latest version only uses two instead of three nibble models per hash entry, compared to 0.1f (the speed gain is caused by other things, lots of inlining - the cost of collision handling is really negligible!). Increasing the number of nibble models further (beyond three) hardly improves compression.
I personally would completely change the hash layout anyway (two nibble models, each with a 1-byte exclusion hash). Somehow it feels strange how your model behaves with cache tuning. And the hash tables still aren't aligned. I noticed that with collision handling you can easily merge (even all!) the hash tables. The effect of the larger collision domain conceals the compression loss, hence compression improves.
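The layout suggested above might look something like this - an assumed sketch, not toffer's actual CMM4 code (sizes and the replacement policy are my own illustration): each slot pairs one nibble model (15 bit-tree counters) with a 1-byte exclusion hash used as a collision check.

```c
#include <stdint.h>

/* One nibble model plus its exclusion/check byte. */
typedef struct {
    uint8_t check;        /* 1-byte exclusion hash of the full context */
    uint8_t counters[15]; /* 15 counters, one per node of the nibble's bit tree */
} NibbleSlot;             /* 16 bytes, so four slots per 64-byte cache line */

/* A hash entry holding two nibble models, as suggested above. */
typedef struct {
    NibbleSlot slot[2];
} HashEntry;              /* 32 bytes */

/* probe: accept a slot whose check byte matches the context's exclusion
 * hash; on a double miss, reuse slot 0 (a real policy would age/choose). */
NibbleSlot *find_slot(HashEntry *e, uint8_t check) {
    if (e->slot[0].check == check) return &e->slot[0];
    if (e->slot[1].check == check) return &e->slot[1];
    e->slot[0].check = check;
    return &e->slot[0];
}
```

The check byte is what makes merging tables workable: entries from different merged contexts that land on the same index are told apart (or evicted) by the exclusion hash instead of silently sharing counters.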
2008-07-12 21:23:13 osmanturan >
I want to share something from my experience. I spent some time on BIT's hashing mechanism. Collision detection is similar to CMM1. Here is the hash entry structure:
[Key] [Priority] [Data]
My current counters are 16 bits long. The data section covers one nibble, so the whole structure is 32 bytes. Core 2 Duo and newer processors have 64-byte cache lines, so BIT tries to select 2 entries on collisions. The aligned test (I mean on a 64-byte boundary) performs a bit poorly. Instead, I use a forward test (i.e. when I find a collision, instead of testing index^1, I test index+1). Another tiny compression gain comes from the hashing function. I tested nearly all of the well-known hash functions (PAQ, CMM1, some multiplicative hashes etc.). Then I thought: why not test CRC32? It works really well - with CRC32 I get the best compression. But note that I only use CRC32 for the first nibble of the coded byte; I update the hashes with a multiplicative hash based on the previous-nibble context.
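An illustrative reconstruction of the entry layout and forward probe described above (the field names, table size, and eviction rule are my assumptions, not BIT's actual code):

```c
#include <stdint.h>

/* [Key] [Priority] [Data] with 16-bit counters covering one nibble. */
typedef struct {
    uint8_t  key;           /* collision check byte */
    uint8_t  priority;      /* replacement priority (0 = empty, here) */
    uint16_t counters[15];  /* 15 x 16-bit counters for the nibble's bit tree */
} Entry;                    /* 32 bytes: two entries per 64-byte cache line */

#define TBITS 16
#define TMASK ((1u << TBITS) - 1)
static Entry tbl[1u << TBITS];

/* forward probe: on a key mismatch try index+1 (possibly crossing the
 * cache line) rather than the aligned index^1 neighbour */
Entry *find_entry(uint32_t hash, uint8_t key) {
    uint32_t i = hash & TMASK;
    for (int p = 0; p < 2; p++) {
        Entry *e = &tbl[(i + p) & TMASK];
        if (e->key == key || e->priority == 0) {
            e->key = key;
            e->priority = 1;   /* mark occupied; real code would age this */
            return e;
        }
    }
    /* both occupied by other contexts: evict the lower-priority one */
    Entry *a = &tbl[i], *b = &tbl[(i + 1) & TMASK];
    Entry *v = (a->priority <= b->priority) ? a : b;
    v->key = key;
    return v;
}
```

The index+1 probe trades a possible second cache line for a larger effective collision domain, which is presumably where the small compression gain comes from.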
2008-07-12 21:30:28 toffer >
CRC32 isn't designed for hashing. I wonder what the performance difference is. Multiplicative hashes are fast and give good results. I compared them against several more expensive mixing and hashing functions, without any big differences.
Do you mean linear probing (and crossing the cache line)?
2008-07-12 21:35:34 Shelwien >
> Yep, collision detection does improve compression.
Ok, I'll think about it.
> You can see the difference with CMM4 (.1f vs >=.2).
Yeah, right ;) 1f had better compression in my tests ;)
> And the hashing tables still aren't aligned.
They're aligned to 16, that's probably good enough...
You can compare 3e21 vs 3e22 in http://ctxmodel.net/files/MIX/mix_v3.htm
The only difference there is table alignment added in 3e22.
> I noticed that with collision handling you can
> easily merge (even all!) hashing tables.
That's not exactly right - the o?P0 and w??W parameters
(initial values for counters and mixers) are different,
and it's even more important now, after switching to
2008-07-12 21:39:25 osmanturan >
> CRC32 isn't designed for hashing. I wonder how the
> performance difference is.
Yes, I know. I just tested it and it works, that's all. The best compression was achieved by a combination of CRC32 and a multiplicative hash (CRC32 for the first nibble; for updating the second one, the same as CMM1 - multiplicative).
> Multiplicative hashes are fast and give good results.
Yes, they are fast and give good results. But I think complex per-nibble hash functions don't affect the overall compression much. I gained some KB by using CRC32, and didn't notice any notable speed loss.
> I compared them against several more
> expensive mixing and hashing functions, without any big
Don't expect much difference - just around a few KB for the SFC, IIRC.
> Do you mean linear probing (and crossing the cache line)?
Yes, I mean crossing the cache line. Also, interestingly, crossing the cache line on collision detection also gained some compression (of course, this may only hold with CRC32 - not sure).
2008-07-12 21:41:45 toffer >
I didn't take this into account.
If you plan to include collision handling, 16-byte alignment isn't enough if you want to stay within a cache line for probing. Linear probing might not be as bad as I expected, since linear access is cache-friendly.
2008-07-12 21:49:42 toffer >
BTW, one more suggestion - you can use a direct lookup table for the first nibble of order 2. It is small: 2^16*2^4*2 = 2 MB. You don't have any collisions there. That was one of my smaller changes from .1f to .2 :)
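The size arithmetic above works out as: 2^16 order-2 contexts (the two previous bytes) times 2^4 = 16 counter slots (15 bit-tree nodes for a nibble, rounded up) times 2 bytes per counter = 2 MB. A minimal sketch of the idea, with assumed names:

```c
#include <stdint.h>

/* Direct, collision-free order-2 table for the first nibble:
 * 2^16 contexts x 16 slots x 2-byte counters = 2 MB. */
static uint16_t o2_table[1u << 16][16];

/* context is simply the two previous bytes concatenated; node is the
 * usual bit-tree index (1..15) within the first nibble */
uint16_t *o2_counter(uint8_t b1, uint8_t b2, unsigned node) {
    unsigned ctx = ((unsigned)b1 << 8) | b2;
    return &o2_table[ctx][node & 15];
}
```

Because the context bytes index the table directly, no hashing (and hence no collision handling) is needed for this order at all.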
2008-07-12 21:54:38 osmanturan >
I got a bit confused while reading your post :D I didn't understand clearly: do you mean me or Shelwien?
Because really simple collision handling is already used in BIT. I use a 32-byte data structure per hash entry. I tried to handle collisions within the cache line (by avoiding crossing it), but this caused a small loss in both speed and compression. I just try the next entry on a collision (linear probing?). It worked better (both speed and compression).
> Linear probing might not be as bad as i expected, since
> linear access is cache friendly.
The caching mechanism seems to work best with forward memory access. Some papers already confirmed that idea, and my experiments confirmed it too.
P.S. Forgive me if there is a misunderstanding
2008-07-12 21:55:51 osmanturan > Sorry, there was a misunderstanding :(
2008-07-12 22:01:58 toffer >
You don't need to read papers to get it. I've read it in an AMD optimization manual :)
I meant Shelwien.
But this is an improvement which suits you, too. Why should you hash the first nibble of order 2 if you can look it up perfectly (without collisions!)?
2008-07-12 22:20:01 Shelwien >
Actually my order2 is still fully direct ;)
I didn't like the idea of nibble partitioning before,
but now I see how it affects caching, so I guess
I'll really make a new nibble-oriented hash.
These hashes in o6mix are lame anyway, as I said before ;)
2008-07-13 20:17:23 toffer > Any progress today? :)
2008-07-14 09:13:59 toffer >
It will become a bit slower, for sure. But I only expect a compression gain. How do you handle collisions?
I commented out the collision handling in CMM4:
10.919.698 vs 10.532.067 (collision handling)
How did you implement your match model? I hope not in the sense of PAQ9a?
2008-07-14 09:16:21 toffer > (I tested on the SFC)