Context Modelling

<< ctxmodel.net

My optimizer just randomly modifies parameters until it finds some metric improvement (btw, that metric isn't necessarily only a code length. In some cases it includes other components, like a function of model memory size). And parameter subsets may be optimized separately or not depending on the requirements, but either way it continues iteratively until there're no improvements for some long enough time.

> I don't expect you optimized the P0's for every (or some
> larger amount) of contexts within a model?

Why, of course its supposed to be optimized for every context separately. In this model for a codec which I'm currently supporting, there're 48 counter tables, 24 SSE1 tables and 12 SSE2 tables, each with its own optimized set of parameters (including context).

Also this is a specialized model for a known data type, so I'm sure that a proper universal model would have 10x of that (or more).

> One P0 set per update/prediction mapping for every model?

There's no real reason to use the same counter parameters for different contexts.

However I don't widely use delayed counters yet, I think that should wait until successful implementation of polynomial (or another) fast optimizer.

The problem there is that I cannot just replace all the counters with delayed ones, though that would surely be the best for compression. But each delayed counter requires two additional multiplications and two table lookups, so its too slow... and searching for a new balanced model structure with current methods would take a lot of time.

> Ok, interesting. Do you mean contexts or bit history quantization, or both?

Both.

Here's a context declaration and code generated for it:

Index y
ych: ch, 1!1
yfb: fb, 1!00100010
yp0: p01, 1!0000000000010000
y_s: As, -7!000000000000000
y_p: p, -7!010000000000001
y_q: q, -7!001000000000111
yr1: r1, -7!000000000000000
y_r: r , -7!100000000000000
yr2: r2, -7!000000000000000
y_col: col, 1!0000000000000000000

y = 0;
y = y*2 + ((ch>0+(1-1)));
y = y*3 + ((fb>2+(1-1))+(fb>6+(1-1)));
y = y*2 + ((p01>11+(1-1)));
y = y*3 + ((p>1+(-7-1))+(p>14+(-7-1)));
y = y*5 + ((q>2+(-7-1))+(q>12+(-7-1))+(q>13+(-7-1))+(q>14+(-7-1)));
y = y*2 + ((r >0+(-7-1)));

p,q,r are some adjacent values (its a table),
ch is a channel, col is a table column,
and fb,p01 are scaled probabilities from another counters.

> And show some results on "general data", i don't know which data you deal with.

I didn't make any "universal" models since ash, so no "general data" for you. Thing is, I really want to create a proper model for text, but it gets too complex.

> > I still don't understand how you're going eg. optimize a
> > counter under SSE with it.
>
> My optimization target is the estimated entropy H(source)

Seems like a misunderstanding again...

You see, if you have a prediction p from some model, already optimized by entropy of p sequence, that doesn't mean that SSE(p) sequence would be optimal right away.

In fact, my counters, which was optimized under SSE, are no good without SSE, And SSE is not a continuous function despite its interpolation, so you cannot really calculate a gradient for it.

> How did you optimize several orderN models *at once*? (like o0, o0')

I'm optimizing all parameters at once.

That is, all 44 parameters in the case of o2rcA. Of course its not always possible, or reasonable, but these coders are still toys comparing to the one for which I made my optimizer.

> Or is o0' just a guess? Optimizing this at once > would require to include the mixing in the opt. process.

Here's my optimization scheme:

1. A special class is used for parameter initialization, which calculates the actual parameter values from "bitmaps" like "01011000". This class also adds some unique signatures to these bitstrings, so they can be easily found in the executable.

2. Optimizer itself is a perl script, which toggles these bits in the executable and runs it and keeps the best result. I think its more or less what's called "genetic optimization".

> I observed that optimizing more than one parameter "class"
> at once (in parallel, like you do) can sometimes lead to
> oscillations (if the parameters from different classes
> influence each other).

My optimization process can get stuck in a local minimum, but can't oscillate. Optimizer attempts to modify the current best state to get even better result, so its only able to improve the overall result.

> How many iterations (.exe executions) did your optimizer
> take until your optimization target was reached when tuning
> the fpaq0pv4b order0 model (on the whole SFC)?

With a single counter its really fast, though original fpaq0pv4b uses a shift-counter and I didn't optimize it. But the later version in ST2rc_v3 has 5+12+8+15=40 bits of counter parameter set, and needed like 3 iterations through it, so thats like 120 executables runs.

> My laptop ran at 800mhz and the whole process took about 40minutes.

Well, SFC is compressed in ~10s by ST2rc_v3, so guess it takes 20mins=120*10/60 on my machine.

Anyway, I don't care when its at this scale - while it gets me the best parameters while I sleep.

However, I had to use 6 machines for 4 whole days to optimize a "real" model, and started to think about better approaches after that.

2013-08-08 15:51:28                 > 

2014-11-27 03:10:29                 > 

2015-01-12 03:13:08                 >