Here, o6mix59 went as expected - the only compression-related change there is counter simplification (branch-avoiding logic was replaced with separate functions for updates on 0 and 1, using templates) - its also the reason of slight speed improvement comparing to o6mix3e2.
It also has somewhat worse compression in the main benchmark, but that's due to tuning differences... instead this version shows better results in SFC benchmark and (what matters the most ;) on its book1+wcc386 tuning target (465928 vs 466568).
As to o6mix59a SSE experiment... well, its results are almost surprisingly good (taking into account the simple context and SSE update turned static), but slowdown is considerable, especially if we'd think about further compression improvement using more of such mappings.
So the delayed counters are still the main development
direction. Also there's an idea to start with a single
submodel again, but with a context mask allowing it to be
anything between o0 and o6 and let optimizer
to find the best primary submodel, then add more one by
one and repeat the process (it was done the same way with
o2-o6 in v0 btw, but with more
restrictive context masks).
2008-07-18 09:47:59 Anonymous >
Great work. What direction are you taking now? Speed or Compression ratio.
2008-07-18 12:34:48 donotdisturb >
To improve speed, you can try not only avoiding if-then-else branches but also dependencies between values to allow out-of-order execution on modern CPUs. For ex. instead of shifting the context after each bit (cx += cx + bit), you can do
(cx = 1 << 8 | chr) before encoding a byte and calculate the current context like (cx >> (bitpos+1) where bitpos=7...0 ) in the encode-byte function. You can also try to precalculate as much as possible before the range-coder call.
2008-07-18 15:08:33 Shelwien >
> To improve speed, you can try not only avoiding
> if-then-else branches but also dependencies between values
> to allow out-of-order execution on modern CPUs.
You're right and there're things like that in rc_sh2d.inc
(rc_pre), and in v5 I made changes which would allow to
overlap the sequences of estimations and updates (ie
prefetching the counters/ mixers, then doing an update for
previous bit, then actually using the prefetched values)
> For ex. instead of shifting the context after each bit
> (cx+=cx+bit), you can do (cx=1<<8|chr) before
> encoding a byte and calculate the current context like
> (cx>>(bitpos+1) where bitpos=7...0) in the encode-byte
That's the reason for bit encode calls looking like
encode(c&(1<<7)). I tried explicitly setting this byte
context in fpaq0pv4B, and it wasn't any faster than
shifts - and no wonder, as its just a single LEA or SHL
instruction which easily overlaps with other stuff.
Also this context is known only in encoding, while
there's no alternative in decoding because further
bits are unknown.
> You can also try to precalculate as much as
> possible before the range-coder call.
As I explained already, there was an idea (from fpaq0pv4B)
to turn encode_byte() into template, and make an array of
encode_byte<0>..encode_byte<255> and just call it by byte
value - compiler would be able to optimize it really good,
though compilation would be considerably slow.
But with a complex model (unlike some order0) there'd be
too much code in each instance of byte coding function,
and it'd cause code cache problems.
Also its only applicable for encoding (decoder versions
with speculative steps are possible too, but that's too
complicated for now). And there wasn't any demand for a
reverse asymmetric compressor with twice faster
compression than decompression ;).
2008-07-22 11:13:12 toffer >
Have you finished the delayed counter approach?
2008-07-22 13:59:17 Shelwien >
Got stuck in sSE optimization instead ;)
Now messing with a single submodel (~o2, selected by
optimizer) and SSE. Surprisingly, wasn't able to gain
much speed from SSE (even though the initial implementation
was clearly inefficient), but got some compression
Btw, even with a single counter + SSE a coder takes ~70s
to encode my test set (ppmd -o8 / CCM are ~120s).
And I probably going to attach a delayed counter today,
so it would become even slower...
2008-07-22 14:34:57 toffer >
Let's see what happens :)
Could you provide a completely tuned (target: SFC) order 2 model, without SSE/additional stuff. Just a plain order 2 codec with flat deco.? I'm working on a new counter model, somehow this is a delayed counter with only a single mapping for speed. I think i'll use 3 delayed bits and 12 bit precision for the probability. This will take some time (modify my optimizer...).
2008-07-22 15:21:55 Shelwien >
> Could you provide a completely tuned (target: SFC) order 2
> model, without SSE/additional stuff. Just a plain order 2
> codec with flat deco.?
For now I can only provide this: http://ctxmodel.net/files/o6mix6.rar
Its not exactly order2 (context mask 405FFF) and was optimized to
book1+wcc386. However I can turn it into real o2 and tune to SFC,
but that would take until tomorrow.
> I'm working on a new counter model, somehow this is a
> delayed counter with only a single mapping for speed.
You mean, a single mapping instead of separate estimation
and update? Don't see how could this improve speed.
> I think i'll use 3 delayed bits and 12 bit precision for
> the probability.
Somehow this layout reminds me of something ;)
2008-07-22 15:37:22 toffer >
I don't need it that quick.
Well, if you separate prediction and update you need to do two "counter updates". Merging this will take only one update.
I'm unsure about an exact layout for each order. My intuition says that more delayed bits will do a better job on higher orders. I'll bruteforce the best setting.
And you know where i've taken my first guess from :)
2008-07-22 17:15:28 Shelwien >
> I don't need it that quick.
Ok... I'd make a SFC-optimized o2 then.
> Well, if you separate prediction and update you need to do
> two "counter updates". Merging this will take only one
Right, but that's the point. Estimation mapping gives you the
probability after adding all the delayed history, and update
mapping only skips one bit of that history - they're completely
I think you're better off with usual counter and a static
mapping for estimation, if you can't afford two mappings.
> I'm unsure about an exact layout for each order. My
> intuition says that more delayed bits will do a better job
> on higher orders. I'll bruteforce the best setting.
That's not exactly about the number of delayed bits... but
it might be useful to support long bit runs there.
I'm not sure about history encoding for such a case though
(really reminds me of state machine counters, and in fact
the history might be lossy too), and imho it would be better
to first analyze the statistics (on history distributions)
instead of bruteforce.
2008-07-22 23:24:44 Shelwien >
Here's your o2: http://ctxmodel.net/files/o6mix6-t.rar
2008-07-23 09:45:30 toffer > Thanks a lot! :)
2008-07-23 09:46:03 toffer >
BTW I still haven't finished my new optimization model... hunting horrible bugs :(
2013-07-26 09:50:16 >
2014-03-20 00:42:48 DonaldRiCh > x9527r-gb, , , , .