PPM? Into practice!
Dmitry Shkarin
Institute for Dynamics of Geospheres, Moscow, Russia
E-mail: dmitry.shkarin@mtu-net.ru
PPM is one of the most promising lossless data compression algorithms
using a Markov source model of order D. Its main feature is the coding
of a symbol that is new in the given context at one of the inner nodes
of the context tree; special escape symbols are used to describe this
node. In reality, the majority of symbols is encoded at inner nodes,
and the Markov model becomes rather conventional. Although the PPM
algorithm achieves the best results in comparison with others, it is
rarely used in practical applications owing to its high computational
complexity. This paper is devoted to a PPM implementation whose
complexity is comparable with widespread practical compression schemes
based on the LZ77, LZ78 and BWT algorithms. The proposed scheme is
named PPM with Information Inheritance (PPMII).
$1. PPM algorithm - basic definitions.
Let A be a discrete alphabet consisting of M >= 2 symbols;
x^n = x[1],...,x[n], x[i] @ A
be the first _n_ symbols of the message; |L| be the length of a
sequence _L_ or the cardinality of a set _L_. The probability of a
symbol _x[n+1] = a @ A_ for a source with memory depends on the current
context
s[d] = x[n],...,x[n-d+1] @ A^d, d <= D;
in the general case this context has variable length. The set of all
possible contexts can be represented as the nodes of an M-ary tree of
depth D. A context (sequence) _s_ describes a path from the tree root
to the current node, which is also denoted by _s_.
Usually the true conditional probabilities are unknown, and the coding
conditional probabilities _q(a|s)_ depend on characteristics of one or
several subsequences _x^n(s)_. These characteristics are the frequency
_f(a|s) = f(a|x^n(s))_ of symbol _a_ in _x^n(s)_, the alphabet
A(s) = A(s,x^n) = { a @ A : f(a|s) > 0 },
its cardinality _m(s) = |A(s)|_, etc. Already at small _D_, the number
of model states is large, the subsequences _x^n(s)_ are short on
average, and their statistics are insufficient for effective
compression. In particular, the task of _multialphabet coding_ for
unknown _A(s)_ is hindered.
The multialphabet properties of the PPM algorithm [1] are based on an
(implicit) assumption: the longer the common initial part of two
contexts, the more similar (on average) their conditional probability
distributions. The high efficiency of PPM means that this assumption
holds for the majority of coded sources.
Let _Z[n]_ be the set of nodes _s[d], d <= D_, on the current branch
x[n], x[n-1], ...
of the context tree; _s(a)_ be the context of maximal length for which
f(a|s) > 0; d(a) = |s(a)|
(if such a context is absent then _d(a) = -1_); _q*(a|s)_ and
_q*(esc|s)_ be the conditional probabilities used in _s_ for coding the
symbols _a @ A(s)_ and the escape symbol. The escape symbol signals the
appearance of a new symbol in _A(s)_.
The key feature of all PPM modifications is the representation of the
coding conditional probability of any symbol _x[n+1] @ A_ as
q( x[n+1] | x^n ) = Prod( q*(esc|s[i]), i=d+1..D ) * q*( x[n+1] | s[d] ),
d = d(a) (1)
The product of escape conditional probabilities describes to the
decoder a sequential descent from _s[D]_ to _s(a)_ (if _d(a) = D_ then
this product equals 1). Usually
q*(a|s) = t(a|s)/T(s), for all _a @ A#(s)_,
q*(esc|s) = t(esc|s)/T(s), (2)
where _A#(s[j]) = A(s[j]) \ A(s[j+1])_, _s[j+1]_ is the _child context_
of _s[j]_ on the current tree branch _Z[n]_ (and, vice versa, _s[j]_ is
the _parent context_ of _s[j+1]_), _t(.|s)_ are the _generalized symbol
frequencies_ for _x^n(s)_, and _T(s)_ is the sum of all generalized
frequencies. As a rule, _t(.|s)_ is written as _t(.|s) = f(.|s) +
c(.|s)_. For example, PPMC [2] corresponds to _c(a|s) = c(esc|s) = 0_,
and PPMD [3] to _c(a|s) = -1/2, c(esc|s) = -m/2_.
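The generalized frequencies (2) are simple to compute. Below is a minimal sketch of the PPMD variant, keeping counts in units of 1/2 so that the corrections _c(a|s) = -1/2_ and _c(esc|s) = -m/2_ stay integral; names and layout are illustrative, not taken from the PPMd sources.

```cpp
#include <cassert>

// PPMD-style generalized frequencies, stored in units of 1/2:
//   t(a|s)   = f(a|s) - 1/2  ->  2*f - 1
//   t(esc|s) = m(s)/2        ->  m
//   T(s)     = sum of all generalized frequencies
struct SymbolStat { int freq; };   // f(a|s), a plain occurrence count
struct Q { int num, den; };        // a coding probability num/den

static inline int ppmd_T(const SymbolStat* stats, int m) {
    int sum = 0;
    for (int i = 0; i < m; ++i) sum += 2 * stats[i].freq - 1;  // t(a|s)
    return sum + m;                                            // + t(esc|s)
}
static inline Q ppmd_q_symbol(const SymbolStat* stats, int m, int i) {
    return Q{2 * stats[i].freq - 1, ppmd_T(stats, m)};         // q*(a|s)
}
static inline Q ppmd_q_escape(const SymbolStat* stats, int m) {
    return Q{m, ppmd_T(stats, m)};                             // q*(esc|s)
}
```

For a context with f = {3, 1}, this gives q*(a0|s) = 5/8, q*(a1|s) = 1/8 and q*(esc|s) = 2/8, which matches t(a|s) = f(a|s) - 1/2 and t(esc|s) = m/2.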
After encoding the next symbol _x[n+1]_, it is necessary to update
(increase by one) its frequency in the tree nodes _s @ Z[n]_. The first
PPM variants [1] used _full updates_, in which the frequencies are
changed for all _s @ Z[n]_. Later PPM variants [2] use _update
exclusions_, in which the frequencies are changed only in the nodes
with length |s| >= |s(a)|. In some sense, full updates and update
exclusions correspond to two extreme cases.
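The two update policies can be sketched over the coded symbol's frequencies along the branch _Z[n]_, indexed 0..D by context length; this is a toy model of the real linked-tree update, with illustrative names.

```cpp
#include <vector>
#include <cassert>

// freq_on_branch[d] holds f(a|s[d]) for the coded symbol a in the
// context of length d on the current branch Z[n]; k = |s(a)| is the
// length of the longest context that already contained a.

// Full updates: f(a|s) grows in every context of the branch.
void full_update(std::vector<int>& freq_on_branch) {
    for (int& f : freq_on_branch) ++f;
}

// Update exclusions: only contexts with |s| >= |s(a)| are touched.
void update_exclusions(std::vector<int>& freq_on_branch, int k) {
    for (int d = k; d < (int)freq_on_branch.size(); ++d)
        ++freq_on_branch[d];
}
```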
During a descent along _Z[n]_ with the help of escape symbols, one must
exclude the symbols already checked in higher-order contexts
(exclusions). Let us call such symbols _masked symbols_, and contexts
containing only such symbols _masked contexts_. Let us call _s_ a
_binary context_ if _m(s) = 1_; in it only two probabilities are used,
i.e. _q(a|s) = 1 - q(esc|s)_.
$2. Evaluation of generalized frequencies of symbols.
1. Information inheritance.
The main difficulty of all explicit context-modeling schemes is the
insufficiency of statistics in higher-order contexts. Many ways to
overcome this problem have been proposed: weighting the predictions of
lower- and higher-order contexts (CTW, interpolated Markov model), or
coding a symbol only in contexts that have enough (by some criterion)
statistics (LOE, state selection). All these methods require large
computational resources and are unacceptable for our purposes. Instead,
we can take advantage of the similarity of the distribution functions
in parent and child contexts and set the initial value of a generalized
symbol frequency in a child context using the information about this
symbol gathered in the parent context. Such an approach has two merits:
firstly, the parent statistics are consulted only when a new symbol is
added to the child context, i.e. rarely enough, which admits solutions
of linear time complexity (not depending on the tree depth); secondly,
owing to this rare use of parent statistics, the model can quickly
adapt to variations in the character of the input data.
The following notation is used below: _s[i]_ is a new context
(T(s[i]) = 0) or a context to which a new symbol _a_ is to be added
(t(a|s[i]) = 0); _s[k]_ is the longest context that contains the
current symbol _a_ (s[k] = s(a)).
Addition of a new symbol to an old context will be our initial concern.
Locally, at the given point of the coded text, it would be optimal to
switch immediately to a PPM model of order _k_, i.e. to reduce the tree
depth to _k_; in this way, we eliminate the errors associated with
inaccurate estimation of the escape probabilities. On the other hand,
we need only statistics similar to the statistics in _s[i]_; for this
reason, we must perform the reduction of tree depth only along the
context branch _s[j], k <= j <= i_.
[...]
$3. Adaptive escape probability estimation (SEE).
[...] The running average _S_ is maintained as follows: at each step
the new value _d[j]_ is added and the current average ( = S/N0 ) is
subtracted, i.e. the change of _S_ is _dS = d[j] - S/N0_ at each step.
_N0_ is chosen as a power of two to eliminate the division operation.
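The running-average update can be sketched as follows; _N0_LOG_ is an assumed parameter (the text only requires that N0 be a power of two, so that the division becomes a shift).

```cpp
#include <cassert>

const int N0_LOG = 4;                 // assumed: N0 = 2^4 = 16

// One step of the running average: dS = d[j] - S/N0,
// with the division by N0 replaced by a right shift.
static inline int running_avg_update(int S, int d_j) {
    return S + d_j - (S >> N0_LOG);
}
// After many updates S settles near N0 * mean(d[j]), so the current
// average is recovered as S >> N0_LOG.
```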
The number _h_ of components of the vector _w(s)_ and their influence
on _q(esc|w)_ can be found only experimentally. They are enumerated
below in order of decreasing influence.
1) Of course, the escape probability depends to a great degree on the
generalized symbol frequency _q(a|s)_. This variable is quantized to
128 values.
2) PPM exploits the similarity of parent and child contexts; therefore,
the alphabet size _m(s[i-1])_ of the parent has a strong effect on the
escape probability of the child. This _w(s)_ component is quantized to
4 values.
3) Experiments show that in real sources highly predictable data blocks
are interleaved with poorly predictable ones. The sizes of these blocks
are small, approximately 3-5 symbols, which corresponds to the
segmentation of natural-language text into words and parts of words.
The probability of the previously encoded symbol in the previous
context is included in _w(s)_ to track switching between such blocks.
This variable is quantized to 2 values.
4) The current coded symbol correlates mostly with the previous symbol.
However, we cannot include the whole previous symbol in _w(s)_: the
number of SEE contexts would be too large and the frequency of each
separate SEE context would be too small. Only a one-bit flag is
included in _w(s)_; its value is set to 0 if the two higher bits of the
previous symbol are zero, and to 1 in the opposite case.
5) A long block of symbols of length _L_ is a sequence of input symbols
for which there were no escapes to the lower orders and the coding
probability of each of its _L_ symbols was larger than 1/2. It is quite
probable that a PPM model of smaller order is locally preferable inside
such a block. [...]
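Components such as these are typically combined into a single index that selects one of a small table of SEE contexts; below is a sketch under assumed quantizer shapes (the actual PPMd quantization tables differ).

```cpp
#include <cassert>

// Illustrative uniform quantizer: maps value in [0, max_value] to one
// of `buckets` bins.  The real PPMd quantization tables are nonuniform.
static inline int quantize(int value, int buckets, int max_value) {
    int q = value * buckets / (max_value + 1);
    return q < buckets ? q : buckets - 1;
}

// Packs the quantized w(s) components into one SEE table index,
// ordered by decreasing influence as in the enumeration above:
// 128 * 4 * 2 * 2 = 2048 SEE contexts for items 1)-4).
static inline int see_index(int q128, int parent_m4,
                            int prev_prob2, int hi_bits_flag) {
    return ((q128 * 4 + parent_m4) * 2 + prev_prob2) * 2 + hi_bits_flag;
}
```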
$4. Implementation details.
[...] The position in the coded string is remembered only for new
contexts (T(s) = 0).
 --------------------------------------------------------------------------
  CONTEXT structure
   m(s)>1:                               | m(s)=1:
    m(s)                      - 2 bytes  |  m(s)                 - 2 bytes
    T(s)                      - 2 bytes  |  TRANSITION structure - 6 bytes
    Link to TRANSITIONs array - 4 bytes  |  Link to suffix(es)   - 4 bytes
    Link to suffix(es)        - 4 bytes  |
 --------------------------------------------------------------------------
  TRANSITION structure
    a                         - 1 byte
    t(a|s)                    - 1 byte
    Link to _successor(s,a)_  - 4 bytes
 --------------------------------------------------------------------------
            Fig.1. Data structures of PPMII algorithm.
A graphical representation of the data structures used is given in
Fig.1. It can be seen from this figure that the proposed algorithm
spends 12 bytes for each repeated (binary or nonbinary) context and 6
bytes for each transition structure in a nonbinary context. For
comparison, one of the most memory-economical PPM/PPM* implementations
[6] requires 8 four-byte machine words for the nonbinary context
structure, 6 words for the binary context structure, and 6 words for
the transition structure in a nonbinary context.
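The Fig.1 layouts can be written down directly; the field names are illustrative (the real PPMd sources use different identifiers), and uint32_t stands in for the 4-byte links so the sizes can be checked on any platform.

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct Transition {         // 6 bytes per transition (Fig.1)
    uint8_t  symbol;        // a
    uint8_t  freq;          // t(a|s)
    uint32_t successor;     // link to successor(s,a)
};
struct Context {            // 12 bytes per context, m(s) > 1 case
    uint16_t m;             // m(s)
    uint16_t T;             // T(s)
    uint32_t transitions;   // link to TRANSITIONs array
    uint32_t suffix;        // link to suffix (parent) context
};
#pragma pack(pop)

static_assert(sizeof(Transition) == 6,  "6 bytes per transition");
static_assert(sizeof(Context)    == 12, "12 bytes per context");
```

In the binary case (m(s) = 1) the single Transition is stored in place of the T(s)/transitions pair, which keeps the context size at 12 bytes.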
The precision of the frequency representation is 1 for binary contexts
and 1/4 for nonbinary ones. The statistics in a nonbinary context are
scaled (all frequencies are halved) when one of the frequencies exceeds
the threshold 30. A simplified variant of the range coder [7] is used
as the entropy coder. The division operation in equation (5) is
approximated by a series of comparisons; other approximations used can
be found in [5].
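The halving step can be sketched as follows; rounding up on the halving is an assumption made here so that every retained symbol keeps a nonzero frequency.

```cpp
#include <cassert>

const int MAX_FREQ = 30;   // rescale when some frequency exceeds this

// Halves freqs[0..m-1] in place and returns the new total T(s).
// Rounding up (an assumption of this sketch) keeps counts positive.
int rescale(int* freqs, int m) {
    int total = 0;
    for (int i = 0; i < m; ++i) {
        freqs[i] = (freqs[i] + 1) / 2;
        total += freqs[i];
    }
    return total;
}
```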
$5. Complicated variant.
The PPMII algorithm demonstrates very good results, so it becomes
interesting to look at the maximal compression efficiency that a
similar approach can provide. In this paragraph, we will not limit
ourselves by the requirement of low computational complexity. This
modification of the initial scheme is named complicated PPMII (cPPMII).
Some improvements can be obtained merely by removing the introduced
simplifications. The approximation of (5), described at the end of $4,
is cancelled. Delayed addition of a new symbol to a context ($2.1) is
performed for all contexts, not just for binary ones. For the update
exclusions modification ($2.2), statistics are updated at any context
length, including the _|s(a)| = D_ case. Moreover, statistics are
updated for three contexts in the parent-child chain, with increments
1/2, 1/4 and 1/8. The precision of the frequency representation is
enlarged to 1/8. Other improvements require individual consideration.
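With frequency precision 1/8, the fractional increments above reduce to small integers; a sketch (illustrative, not the actual cPPMII update code):

```cpp
#include <cassert>

enum { FREQ_UNIT = 8 };    // precision 1/8: the value 1.0 is stored as 8

// Increments 1/2, 1/4, 1/8 for the three contexts of the parent-child
// chain, expressed in 1/8 units: 4, 2, 1.
static const int kChainInc[3] = {FREQ_UNIT / 2, FREQ_UNIT / 4, FREQ_UNIT / 8};

// Updates the coded symbol's frequency in up to three chain contexts,
// freq_in_chain[0] being the longest (child) context.
void update_chain(int* freq_in_chain, int n) {
    for (int i = 0; i < n && i < 3; ++i)
        freq_in_chain[i] += kChainInc[i];   // += 1/2, 1/4, 1/8
}
```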
1. Improving probability estimation for more probable symbols and for
less probable ones.
Statistics in the parents are accumulated faster than in "young" child
contexts (with small _T(s)_), so there is good reason to use the parent
statistics repeatedly. The generalized frequency of each of the more
probable symbols (a symbol _a1_ falls into this group when
_q(a1|s) >= q(aMPS|s)/2_, where _aMPS_ denotes the most probable symbol
(MPS) in the context) is corrected while the child context _s[i]_ is
still very young (_T(s[i])_ is small). [...]
2. Smoothed updates of quantized statistics.
[...] Then, at small _N(i)_, we can update the statistics not only for
the quantized value _i_, but also for _(i-1)_ and _(i+1)_, with some
small weight that decreases as _N(i)_ increases. For simplicity of
calculation, this weight is chosen equal to 2^-Ceiling(log2(N(i))), and
it is set to zero after _N(i)_ exceeds some threshold _N0_. In cPPMII,
this technique is applied to every variable that is quantized to more
than two values.
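The weight 2^-Ceiling(log2(N(i))) needs no floating point; a sketch with an assumed threshold _N0_:

```cpp
#include <cassert>

const int N0 = 32;   // assumed threshold after which the weight is dropped

// Returns the denominator 2^Ceiling(log2(n)) of the neighbour weight,
// or 0 when the weight is set to zero (n > N0).
int smoothing_denominator(int n) {
    if (n <= 0) return 1;
    if (n > N0) return 0;        // weight dropped past the threshold
    int d = 1;
    while (d < n) d <<= 1;       // smallest power of two >= n
    return d;
}
```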
3. Adaptive probability estimation for MPS.
The encoded message length is markedly affected by precise probability
estimation for the MPS; therefore, after the frequency correction (see
$5.1), an adaptive probability estimation for the MPS is performed. The
SEE model for the MPS is built similarly to the model for escape
symbols. The behavior of the generalized frequency _t(aMPS|s)_ is
modeled for contexts without masked symbols, and the behavior of the
conditional probability _q(aMPS|s)_ is modeled for contexts with masked
symbols.
For contexts without masked symbols, the vector _w(s)_ consists of:
1) the generalized symbol frequency _t(aMPS|s)_, quantized to 68
values;
2) a one-bit flag indicating whether statistics rescaling was
performed;
3) the result of comparing the current context length _|s|_ with the
average used context length <|s|> (averaging is performed over the last
128 symbols);
4) the same as items 4) and 6) in the $3.1 enumeration.
For contexts with masked symbols, the vector _w(s)_ consists of:
1) the probability estimation _q(aMPS|s)_, quantized to 40 values;
2) the result of comparing the average nonmasked symbol frequency with
the average masked symbol frequency;
3) a one-bit flag indicating whether only one symbol is masked;
4) the same as items 2) and 4) in the previous enumeration.
4. Additional SEE-context components.
The additional SEE-context fields for binary contexts are as follows:
1) the result of comparing _m(s[i-1])_ with the number of symbols in
the previous context;
2) the number of binary parent contexts, quantized to 2 values;
3) a flag built of the two higher bits of the symbol preceding the
already encoded symbol;
4) the same as item 3) in the first $5.3 enumeration.
The additional SEE-context fields for contexts with masked symbols are
as follows:
1) the same as item 1) in the first $5.3 enumeration;
2) the same as item 2) in the second $5.3 enumeration.
5. Adaptive escape frequency estimation for nonbinary contexts (without
masked symbols).
The escape frequency estimation model for these contexts is built
similarly to the one for contexts with masked symbols, i.e. the escape
frequency is modeled rather than the escape probability. The vector
_w(s)_ consists of the following components:
1) the alphabet size _m(s)_, quantized to 25 values;
2) the result of the calculation (4*m(s[i]) > 3*m(s[i-1]));
3) the same as items 4) - 6) in the $3.1 enumeration;
4) the same as items 2) and 3) in the first $5.3 enumeration;
5) the same as item 1) in the first $5.4 enumeration.
$6. Experimental results.
The PPMII algorithm and its complicated modification were implemented
in the C++ programming language, and the implementation is publicly
available at [5]. The PPMd.exe executable corresponds to the basic
algorithm, and PPMonstr.exe corresponds to cPPMII. All experiments were
carried out on the standard Calgary corpus [8].
 ----------------------------------------------------------
  Order |  PPMD |  + II | + UEM | + SEE1|  + EE1| + SEE2
 -------+-------+-------+-------+-------+-------+--------
    2   | 2.790 | 2.766 | 2.766 | 2.767 | 2.767 | 2.759
    3   | 2.427 | 2.387 | 2.387 | 2.382 | 2.379 | 2.366
    4   | 2.310 | 2.254 | 2.254 | 2.235 | 2.230 | 2.212
    5   | 2.290 | 2.215 | 2.211 | 2.185 | 2.178 | 2.158
    6   | 2.297 | 2.204 | 2.197 | 2.166 | 2.158 | 2.137
    8   | 2.319 | 2.196 | 2.186 | 2.150 | 2.142 | 2.118
   10   | 2.339 | 2.195 | 2.184 | 2.143 | 2.136 | 2.111
   16   | 2.369 | 2.194 | 2.182 | 2.137 | 2.130 | 2.104
 ----------------------------------------------------------
            Table 1. PPMII: step by step.
1. Evaluation of the contribution of each algorithm part.
Unweighted average bits per byte (bpb) are presented in Table 1 for
each PPM algorithm modification described in $$2-3. The columns are:
[ PPMD ] - original PPMD by P.G. Howard (author's implementation);
[+ II  ] - previous scheme plus information inheritance
           ($2.1, w = 1/4 in (5));
[+ UEM ] - previous scheme plus the update exclusions modification
           ($2.2);
[+ SEE1] - previous scheme plus the SEE model for binary contexts
           ($3.1, w = 1 for binary contexts);
[+ EE1 ] - previous scheme plus escape estimation for nonbinary
           contexts ($3.2);
[+ SEE2] - previous scheme plus the SEE model for contexts with masked
           symbols ($3.3).
 ----------------------------------------------------------------
        |          PPMII            |          cPPMII
  Order | Average | Time, | Memory, | Average | Time, | Memory,
        |   bpb   |  sec  |   MB    |   bpb   |  sec  |   MB
 -------+---------+-------+---------+---------+-------+---------
    2   |  2.759  |  3.18 |   0.6   |  2.716  |  8.51 |   1.1
    3   |  2.366  |  3.79 |   1.0   |  2.321  | 10.60 |   1.5
    4   |  2.212  |  4.51 |   1.9   |  2.170  | 12.46 |   2.4
    5   |  2.158  |  5.21 |   3.5   |  2.114  | 14.00 |   4.0
    6   |  2.137  |  5.88 |   5.6   |  2.090  | 15.21 |   6.1
    8   |  2.118  |  6.76 |  10.1   |  2.067  | 16.85 |  10.6
   10   |  2.111  |  7.25 |  13.3   |  2.057  | 17.57 |  13.8
   16   |  2.104  |  7.74 |  16.2   |  2.047  | 18.56 |  16.7
 -------+---------+-------+---------+---------+-------+---------
  For comparison:
        ZIP -9    | 2.764 | 5.93 |  0.5
        BZIP2 -8  | 2.368 | 5.87 |  6.0
        PPMZ2     | 2.139 |  n/a | >100
 ----------------------------------------------------------------
    Table 2. Integral characteristics of various compressors.
2. Time and memory requirements.
In the second experiment, the time and memory requirements of the PPMII
and cPPMII implementations were compared with widespread practical
implementations of the LZ77 (ZIP [9]) and BWT (BZIP2 [10]) algorithms
and also with the most powerful implementation of the PPM* algorithm
(PPMZ2 [11]). The measured compression efficiency (average bpb),
compression time (seconds) and maximal memory requirements (megabytes)
are presented in Table 2.
It can be seen from the table that the basic PPMII algorithm provides a
wide range of opportunities. At small _D_ (2-3), it gives compression
efficiency comparable to that of ZIP or BZIP2, with faster compression
and smaller memory requirements than BZIP2. At medium _D_ (4-6), PPMII
efficiency is noticeably better than that of ZIP and BZIP2, while the
time/memory requirements remain moderate. Lastly, at high _D_ (8-16),
PPMII outperforms the best of the described programs, PPMZ2, in all
characteristics. The complicated cPPMII gives even better efficiency,
but its low execution speed is not very promising.
 -------------------------------------------------------------------------
  File    |  CTW  |  ACB  | FSMX  | PPMZ2 | PPMII |cPPMII |cPPMII |cPPMII
          |       |       |       |       |   -8  |   -8  |  -16  |  -64
 ---------+-------+-------+-------+-------+-------+-------+-------+-------
  BIB     | 1.782 | 1.935 | 1.786 | 1.717 | 1.732 | 1.694 | 1.679 | 1.676
  BOOK1   | 2.158 | 2.317 | 2.184 | 2.195 | 2.192 | 2.136 | 2.135 | 2.135
  BOOK2   | 1.869 | 1.936 | 1.862 | 1.843 | 1.838 | 1.795 | 1.782 | 1.782
  GEO     | 4.608 | 4.555 | 4.458 | 4.576 | 4.349 | 4.163 | 4.159 | 4.158
  NEWS    | 2.322 | 2.317 | 2.285 | 2.205 | 2.205 | 2.160 | 2.142 | 2.137
  OBJ1    | 3.814 | 3.498 | 3.678 | 3.661 | 3.536 | 3.507 | 3.497 | 3.498
  OBJ2    | 2.473 | 2.201 | 2.283 | 2.241 | 2.206 | 2.154 | 2.118 | 2.110
  PAPER1  | 2.247 | 2.343 | 2.250 | 2.214 | 2.194 | 2.152 | 2.144 | 2.142
  PAPER2  | 2.190 | 2.337 | 2.213 | 2.184 | 2.181 | 2.130 | 2.124 | 2.124
  PIC     | 0.800 | 0.745 | 0.781 | 0.751 | 0.757 | 0.721 | 0.715 | 0.704
  PROGC   | 2.330 | 2.332 | 2.291 | 2.257 | 2.215 | 2.178 | 2.161 | 2.161
  PROGL   | 1.595 | 1.505 | 1.545 | 1.445 | 1.470 | 1.433 | 1.398 | 1.390
  PROGP   | 1.636 | 1.502 | 1.531 | 1.448 | 1.522 | 1.489 | 1.414 | 1.391
  TRANS   | 1.394 | 1.293 | 1.325 | 1.214 | 1.257 | 1.228 | 1.186 | 1.172
 ---------+-------+-------+-------+-------+-------+-------+-------+-------
  Average |
  bpb     | 2.230 | 2.201 | 2.177 | 2.139 | 2.118 | 2.067 | 2.047 | 2.041
 -------------------------------------------------------------------------
     Table 3. Compression efficiency for various compression schemes.
3. Compression efficiency.
In the last experiment, a more detailed comparison is performed between
the proposed algorithms and algorithms described in the literature. The
following schemes are presented in Table 3: the implementation [12] of
the CTW method [13] with binary decomposition (results taken from
[14]); the associative coder by G. Buyanovsky (ACB) [15]; the best of
the FSMX coders described by S. Bunton [6]; and PPMZ2 by C. Bloom [11].
The next column contains PPMII results for _D = 8_, and the last three
columns contain cPPMII results for _D = 8, 16, 64_. It is necessary to
emphasize that the implementation [12] of CTW uses a symbol
decomposition specially optimized for English texts.
Let us remark that, in contrast to other PPM schemes, PPMII works well
enough for nontextual data too. Now it is really universal data
compression ;-).
References.
1. Cleary, J.G. and Witten, I.H. (1984)
Data compression using adaptive coding and partial string matching.
IEEE Trans. on Comm., 32(4):396-402.
2. Moffat, A. (1990)
Implementing the PPM data compression scheme.
IEEE Trans. on Comm., 38(11):1917-1921.
3. Howard, P. G. (1993)
The Design and Analysis of Efficient Lossless Data Compression Systems.
PhD thesis, Brown University.
4. Bloom, C. (1996)
Solving the Problems of Context Modeling.
www.cbloom.com/papers/.
5. Shkarin, D. (2001)
PPMd - fast PPM compressor for textual data.
ftp.elf.stuba.sk/pub/pc/pack/ppmdh.rar.
6. Bunton, S. (1996)
On-Line Stochastic Processes in Data Compression.
PhD thesis, University of Washington.
7. Martin, G.N.N. (1979)
Range encoding: an algorithm for removing redundancy from a
digitised message.
Presented to the Video & Data Recording conference. Southampton.
8. The Calgary Compression Corpus.
ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/.
9. Info-ZIP Group (1999) Zip v.2.3 - compression and file packaging
utility.
www.cdrom.com/pub/infozip/.
10. Seward, J. (2000) BZip2 v.1.0 - block-sorting file compressor.
www.muraroa.demon.co.uk/.
11. Bloom, C. (1999) PPMZ2 - High Compression Markov Predictive Coder.
www.cbloom.com/src/.
12. Volf, P.A.J. (1996)
Text compression methods based on context weighting.
Technical report, Stan Ackermans Institute,
Eindhoven University of Technology.
13. Willems, F., Shtarkov, Y. and Tjalkens, T. (1995)
The context-tree weighting method: Basic properties.
IEEE Trans. on Inf. Theory, 41(3):653-664.
14. Volf, P.A.J. and Willems, F.M.J. (1998)
Switching between two universal source coding algorithms.
Proc. IEEE Data Compression Conf., pp. 491-500.
15. Buyanovsky, G. (1994)
Associative coding.
The Monitor, 8:10-22 (in Russian).