Context Modelling

<< ctxmodel.net

> My expirience shows that mixing itself isn't slow, but the weight adaption
> is time consuming (cmm2 dropped from ~700kb/s to 420 on my pc).

Not exactly sure what are you talking about, but in my implementation mixer uses the same update as a counter, so with mixing 2 models you need to update 3 counters, though there're trade-offs like PPM with partial updates.

But its still remains the question whether these kinds of bruteforce methods, like combining the estimations of multiple models etc, are worth their speed impact. Eg. is 2% smaller file worth 2x slower decompression speed or not. (It seems that even 100x slower compression speed is considered acceptable for almost any visible size reduction, if decoding speed remains the same.)

> Your mixing seems to be hierarchical, like mix( p0, mix( p1, p2 ) ),
> which gives two averages instead of three. Implicitly this gives higher
> weights to lower orders (the orders in the "outer mix", here p0),

It doesn't. Because it gives higher weights to the model which is more efficient entropy-wise. Also the mixers are initialized with the weight which doesn't include the first probability at all.

> try to average over more than two probabilities at once or swap the mixes.

I did it like that 5 years ago, thanks.
And now I think that its better to use the mathematically sensible mixer instead of shamanic ones.

> Has anybody tried this hirarchical stuff before?
> Maybe this can be made efficient?

Well, it is efficient.
I think you'd see it if you tried to reach the same compression level with "dumb" contexts like used in my examples (whole previous byte and such).

I did try to prove it another way, though, by stripping lpaq, but seems that is has too many "unsportsmanlike" tricks in it, so I did manage to disable models except order1 and 2 (despite the #defines ), and text/exe preprocessing, but there're too much stuff still left as even the hash-function and counters are somewhat tuned to the "real data".

> You shoud at least select a mixer by the coding order,
> i noticed that this is the most important mixing context
> (which should be obvious, after a second thought).

Well, thats only if you use the same mixer array for mixing different orders. Which isn't obvious. Instead, it's more obvious imho that you'd need different contexts for mixing different orders - at least different quantization.

> You do mixing like "counter" updates?
> What is a "counter" in your designations?
> Do you refer to a probability with adjustment by the error,
> or a counter like counting the occurences of 0/1 bits?

Well, by counter I mean a statistic counter like these "states". Actually my counter is derived from the same origin as states - integer number of occurences that is.

So, it's evolved like this:

1. P0 = bitcount[0]/(bitcount[0]+bitcount[1]);
   bitcount[bit]++;
2. P0 = bitcount[0]/(bitcount[0]+bitcount[1]);
   (bitcount[bit]*=(1-1/InvSpeed))++;
3. P0 = bitcount[0];
   (bitcount[0]*=(1-1/bitcount[1]))+=(bit?0:1/bitcoun t[1]);
   (bitcount[1]*=(1-1/InvSpeed))++;
4. P0 = bitcount[0];
   wr = Tbl[InvSpeed][bitcount[1]];
   (bitcount[0]*=(1-wr))+=(bit?0:wr);
   bitcount[1]++;

So [2] here is a nonstationarity countermeasure, and other changes are mostly speed optimizations. And Matt Mahoney made the next step here, by quantizing such a counter into a single byte - which takes away the freedom of adaptation speed controlling and even adds some redundancy. So of course his approach has its place in the industrial compression schemes (like h264), but I think that it really complicates the development of new models.

> I tried to make a mixer based on this idea with poor results, nice speed
> though. For mixing i use an optimization approach (minimizig the entropy),

There was this trick which I called "bruteforce adaptation", BFA - true entropy minimization, which accumulates all the possible codelengths for a given set of weight values and uses the weight with minimal length for actual encoding:

for( w=0,i=0; w<=1; w+=1/63,i++ ) {
  p = p1*(1-w) + p2*w;
  (len[i]*=(1-1/InvSpeed)) += -log2(bit?1-p);
}

Here's a working example from that time: http://compression.ru/sh/o1bfa.rar

And the new mixing idea is really simple: we update the mixed probability as if it was from a normal counter, and then "back-propagate" the change into reachable parameters... weight, here.

Well, it looks like this (here's a "translated" version from my sh_mixer.inc):

int w;
iMixer::Process( bit, p1, p2, InvSpeed, Limit ) {
  Counter tmp;
  tmp.P = p1+(p2-p1)*w; // mixed probability of 0
  tmp.Encode( bit ); // encode the bit with mixing result
  tmp.Update( bit, InvSpeed ); // update tmp counter like usual
  // speed opt - no sense to change weight when p1 and p2 are too close
  if( ABS(p2-p1)>Limit ) {
    // change weight so that we'll get an "updated" value from repeated mixing
    w = (tmp.P-p1)/(p2-p1);
    w = max(0,min(1,w));
  }
}

> PAQ's mixer is a neural network, but not that different from
> weighted averaging (expect the stretch/squash transformation preceeding and
> succseeding mixing) with adjustment based on the gradient. Matt modified
> the gradient descent to minimize ld( pbit ), rather than standard back
> propagation (minimizing RMSE). One problem you always face to is getting
> stuck in a local minimum.

I think that neural network origin is what makes Matt's mixer so inefficient in my example. That is, maybe he did manage somehow to make it output something useful (its always possible when you build a model from a given set of functions comparing to trying to insert a function into a working model), but it surely doesn't adapt well to some abstract probabilities, out of paq context.

Well, my order 2-1-0 mix looks like this:

// (de)coding a bit from "flag" variable.
// TTable_NN,MMM are counter update mode settings (kinda random here)
// also "100" in mixer calls is a limit value = 0.003 in 0..1 interval
p1 = order0[z].P;
p2 = order1[p][z].P;
p3 = order2[pp][p][z].P;
p4 = p1 + (p2-p1)*w1[p][z].w;
w2[p][z].Process( flag, TTable_5,1,100, p4,p3 );
order0[z].Update( flag, TTable_20, 700 );
order1[p][z].Update( flag, TTable_10, 600 );
order2[pp][p][z].Update( flag, TTable_8, 400 );
w1[p][z].Process( flag, TTable_5,1,100, p1,p2 );

> Something which comes to my mind now was an experiment to improve mixing
> speed, by only using 3 or 4 active models out of 6 to 8 in mixing, e.g.
> orders 654, when coding at order 6 and completly ignoring lower ones -
> which improved speed AND compression.
> Maybe this is exactly what you meant!?

I didn't, Though I'm pretty sure that you'd get a valid result from any set of estimations you'll run through my mixer.

> Mixer initialization isn't that critical in my experience.

It is quite significant with wide enough mixer context.

> I agree with you, that a slight compression gain compared to a huge amount
> of extra time isn't worth the effort.

Well, not sure what are you agreeing with.

In fact, I think that its always useful to have the best result you're able to reach within common sense restrictions (eg. no supercomputers etc). Because only when you have values to compare, you'd be able to make a correct decision on where to stop.

And the current problem is that paq8 isn't quite that. Its obviously more useful than some widespread archivers for data redundancy analysis, but it's too easy to make a very simple model with better results for custom data - and no wonder as paq8 uses lots of tricks for common files compression, but its compression classes, which aren't particularly good even in theory, are made even less efficient by all these speed and memory trade-offs.

> The general order-n models are somehow universal (since data is stored in bytes),
> but special models will definitley do better - like a bi- or trigram word for text,
> surrounding pixels for images and so on. I think a specialized model along with maybe
> order210 will give nice results; and higher orders are definitely not worth
> the effort here.

Well, it depends on the target.
That is, yes, if you aim for best places in composite comparisons (size+speed), but this approach completely fails in data analysis.

2013-08-13 07:56:12                 > 

2014-11-15 09:33:59                 > 

2014-11-26 20:22:09                 > 

2015-01-11 19:10:00                 >