
> what I don't know is escape modelling and handling of 'pollution'.

Well, escaping is simple to explain (though not so simple to implement). It's all about separately modelling the symbols which haven't occurred in the context yet.

With the PPM approach (eg. PPMd) you just keep an explicit escape symbol in the alphabet and encode it when the encountered symbol is not in the alphabet. Then you switch to another context, mask out the already seen symbols, and continue like that until you either encode the symbol or run out of contexts. But with such a model the escape codes amount to a significant part of the compressed file, so they're an obvious target for further improvement, like secondary estimation (SEE is the origin of SSE, in fact).
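Roughly like this, as a toy sketch (not PPMd's actual structures or update rules; the contexts here aren't even keyed by history, and an ideal code length stands in for the rangecoder, the point is only the escape/exclusion flow):

```cpp
// Walk from the highest order down, code an escape in each context where the
// symbol is missing, exclude the symbols already seen there, and fall back to
// a flat order(-1) model at the end.
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

struct Context {
    std::vector<int> freq = std::vector<int>(256, 0); // per-symbol counts
    int esc = 1;                                      // escape count
};

// Accumulates the ideal code length instead of calling a real rangecoder.
void encodeSymbol(std::vector<Context>& ctxs, int sym, double& bits) {
    std::vector<bool> excluded(256, false);
    for (Context& c : ctxs) {                         // highest order first
        long long total = c.esc;
        for (int s = 0; s < 256; s++)
            if (!excluded[s]) total += c.freq[s];
        if (c.freq[sym] > 0) {                        // symbol is in the alphabet
            bits -= std::log2((double)c.freq[sym] / total);
            c.freq[sym] += 4;                         // simplistic update
            return;
        }
        bits -= std::log2((double)c.esc / total);     // code the escape
        for (int s = 0; s < 256; s++)                 // mask out seen symbols
            if (c.freq[s] > 0) excluded[s] = true;
        c.freq[sym] += 4;                             // add the new symbol here
    }
    bits += 8;                                        // order(-1): flat 8 bits
}

int main() {
    std::vector<Context> ctxs(3);  // stand-ins for three context orders
    double bits = 0;
    for (unsigned char c : std::string("abracadabra")) encodeSymbol(ctxs, c, bits);
    std::printf("total ~%.1f bits\n", bits);
}
```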

This approach also benefits from so-called "inheritance" = cutting off a part of the escape frequency, corresponding to the actual symbol's probability in the other context, and assigning it to the new symbol in the current context.
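Something like this, as a rough illustration only (the general idea, not PPMd's exact inheritance math):

```cpp
// Hypothetical helper (assumes escFreq > 1), not taken from any real coder:
// split the escape count between the escape and the newly added symbol, in
// proportion to the symbol's probability in the other context.
#include <algorithm>
#include <cstdio>

void inherit(int& escFreq, int& newSymFreq, int symFreqElsewhere, int totalElsewhere) {
    int f = std::max(1, (int)((long long)escFreq * symFreqElsewhere / totalElsewhere));
    f = std::min(f, escFreq - 1);   // leave at least one count for the escape
    newSymFreq = f;
    escFreq -= f;
}

int main() {
    int esc = 8, sym = 0;
    inherit(esc, sym, 30, 100);     // symbol had p=0.3 in the other context
    std::printf("new symbol starts with %d, escape keeps %d\n", sym, esc);
}
```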

Another possibility is to encode the escape flag separately, before processing the context alphabet. Recent results show that this allows for better compression at the cost of one extra coder call per symbol.
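For example, something like this (the bit model and its update rate are just placeholders here, not from any published coder): one adaptive binary coder call decides hit/miss, and only on a hit the symbol is coded within the current context's alphabet.

```cpp
#include <cmath>
#include <cstdio>

struct BitModel {
    int p = 2048;                               // P(escape) in 1/4096 units
    double code(int bit) {                      // returns ideal code length
        double pr = bit ? p / 4096.0 : 1 - p / 4096.0;
        p += bit ? (4096 - p) >> 5 : -(p >> 5); // simple adaptive update
        return -std::log2(pr);
    }
};

int main() {
    BitModel escFlag;
    double bits = 0;
    // symbol already in context -> flag=0, then code it among the seen ones;
    // new symbol -> flag=1, then fall back to a lower order (not shown here)
    bits += escFlag.code(0);
    bits += escFlag.code(1);
    std::printf("~%.2f bits for the two flags\n", bits);
}
```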

Then again, this is all about real PPM models, where only the highest order corresponds to the actual data, while all the others process the leftovers. That minimizes context updates and coder calls, so it's relatively fast.

While with CM you model the data with all the context orders simultaneously, so it's enough to leave some interval for the escape, onto which the mixed estimations from the other orders can be mapped. Normally CM shows better compression, but it requires mixing (at least static) and an update of all contexts to process each symbol, so it's much slower.
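A minimal illustration of the difference (static weights and made-up numbers): every order always contributes a probability for every symbol, so a symbol unseen in the high order is simply covered by the mass that the lower orders bring into the mix, instead of being reached through an explicit escape code.

```cpp
#include <cstdio>
#include <vector>

double mixedProb(const std::vector<std::vector<double>>& p,  // p[order][symbol]
                 const std::vector<double>& w, int sym) {
    double num = 0, den = 0;
    for (size_t o = 0; o < p.size(); o++) {
        num += w[o] * p[o][sym];
        den += w[o];
    }
    return num / den;
}

int main() {
    // two orders, alphabet of 3 symbols; symbol 2 is unseen in the high order
    std::vector<std::vector<double>> p = {
        {0.70, 0.29, 0.01},   // high-order estimate (0.01 ~ its "escape" share)
        {0.40, 0.30, 0.30},   // order-0 estimate
    };
    std::vector<double> w = {0.8, 0.2};
    std::printf("P(sym 2) = %.3f\n", mixedProb(p, w, 2));
}
```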

> what's 'pollution'? A typical example is PDFs: they contain a
> lot of text and a lot of random (compressed) data. That
> compressed data pollutes the statistics gathered on the text;
> additionally, the compressed data gets expanded by a large amount.
> What I want to achieve is to make some workaround for it.

One possible workaround is what I did in the order_test4 demo, where the rangecoder is not used immediately during encoding; instead, probabilities are collected first and discarded when they turn out to be redundant.
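The order_test4 source isn't reproduced here, but the basic mechanism can be sketched like this (block handling and the threshold are simplified): predictions for a block are buffered, the ideal code length is computed from them, and the rangecoder is only invoked if that beats storing the block raw.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // model's P(symbol) for a block it mispredicts (e.g. already-compressed data)
    std::vector<double> probs = {0.003, 0.002, 0.004, 0.001, 0.003};
    double estBits = 0;
    for (double p : probs) estBits += -std::log2(p);   // ideal coded size
    double rawBits = 8.0 * probs.size();                // cost of storing raw
    if (estBits < rawBits)
        std::printf("code the block: ~%.1f bits\n", estBits);
    else
        std::printf("store raw (%.0f bits), discard the predictions\n", rawBits);
}
```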

But this only prevents the expansion; to avoid the "statistics pollution" it's also possible to keep a log of statistics modifications and use it to undo the changes when redundancy is detected.
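A minimal sketch of such a modification log (the counter layout is assumed, not taken from any specific model): every counter touched during a tentative block is recorded, so the block's updates can be rolled back if the data turns out to be junk.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

struct UndoLog {
    std::vector<std::pair<int*, int>> saved;     // (counter address, old value)
    void touch(int* p) { saved.push_back({p, *p}); }
    void rollback() {
        for (auto it = saved.rbegin(); it != saved.rend(); ++it)
            *it->first = it->second;             // restore in reverse order
        saved.clear();
    }
    void commit() { saved.clear(); }
};

int main() {
    std::vector<int> freq(256, 0);
    UndoLog log;
    for (unsigned char c : {'x', 'y', 'x'}) {    // tentative updates
        log.touch(&freq[c]);
        freq[c]++;
    }
    log.rollback();                              // block rejected: undo them
    std::printf("freq['x'] after rollback = %d\n", freq['x']);
}
```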

It's also possible to keep a delayed copy of the full model at low orders, but I doubt that it would be any faster than the patches described above.

And another idea might be to perform a segmentation pass before the actual compression and roughly determine which segments don't fit the model.
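For instance (the block size and the 7.5 bpb threshold below are arbitrary), even a cheap order-0 entropy estimate per block already separates text-like segments from already-compressed ones reasonably well:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

double order0Bits(const std::vector<unsigned char>& block) {
    int cnt[256] = {0};
    for (unsigned char c : block) cnt[c]++;
    double bits = 0;
    for (int i = 0; i < 256; i++)
        if (cnt[i]) bits += cnt[i] * -std::log2((double)cnt[i] / block.size());
    return bits / block.size();                   // bits per byte
}

int main() {
    std::vector<unsigned char> text(4096, 'a');   // highly redundant block
    std::vector<unsigned char> rnd(4096);         // random-looking block
    for (size_t i = 0; i < rnd.size(); i++) rnd[i] = (unsigned char)(i * 2654435761u >> 24);
    for (auto* b : {&text, &rnd})
        std::printf("%.2f bpb -> %s\n", order0Bits(*b),
                    order0Bits(*b) > 7.5 ? "skip (random-ish)" : "model it");
}
```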

> if nobody knows an efficient solution for the 'pollution' problem
> then maybe I will skip it, as it's not very important.

I agree that it's wrong to provide support for random data while not having proper models for known formats. But that's probably not your reason.

> the most important thing for me is efficient escape modelling.

Well, again, it's simple.
You just define what exactly an "escape" is, and then model it, with secondary estimation, submodel mixing and whatever else, if simple statistics are not enough.
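For the escape case, secondary estimation can be as small as this (the table size and update rate are arbitrary here, and a real SSE usually interpolates between cells and adds more context into the index): the primary escape probability is quantized into a table index, and the table cell is adapted toward what actually happened.

```cpp
#include <cstdio>

struct SSE {
    int t[33];                                     // secondary P in 1/4096 units
    SSE() { for (int i = 0; i <= 32; i++) t[i] = i * 4096 / 32; }
    int& cell(int p) { return t[(p * 32 + 2048) / 4096]; }
    int predict(int p) { return cell(p); }         // refined probability
    void update(int p, int bit) {                  // adapt toward the outcome
        int& c = cell(p);
        c += ((bit ? 4096 : 0) - c) >> 5;
    }
};

int main() {
    SSE sse;
    int primary = 1000;                            // model's raw P(escape)
    for (int i = 0; i < 100; i++) {                // escapes happen often here
        int refined = sse.predict(primary);
        sse.update(primary, 1);
        if (i % 25 == 0) std::printf("refined P(escape) = %d/4096\n", refined);
    }
}
```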

