PPM? Into practice!

Dmitry Shkarin
Institute for Dynamics of Geospheres, Moscow, Russia
E-mail: dmitry.shkarin@mtu-net.ru

PPM is one of the most promising lossless data compression algorithms using a Markov source model of order D. Its main feature is that a symbol which is new in the given context is coded in one of the inner nodes of the context tree; special escape symbols are used to describe this node. In reality, the majority of symbols are encoded in inner nodes, and the Markov model becomes rather conventional. Although the PPM algorithm achieves the best results in comparison with others, it is rarely used in practical applications owing to its high computational complexity. This paper is devoted to a PPM implementation whose complexity is comparable with that of widespread practical compression schemes based on the LZ77, LZ78 and BWT algorithms. This scheme is named PPM with Information Inheritance (PPMII).

§1. PPM algorithm: basic definitions.

Let A be a discrete alphabet consisting of M >= 2 symbols; x^n = x[1],...,x[n], x[i] ∈ A, be the first n symbols of the message; |L| be the length of sequence L or the cardinality of set L. The probability of symbol x[n+1] = a ∈ A for a source with memory depends on the current context s[d] = x[n],...,x[n-d+1] ∈ A^d, d <= D; in the general case this context has variable length. The set of all possible contexts can be represented as the nodes of an M-ary tree of depth D. A context (sequence) s describes a path from the tree root to the current node, which is also denoted s. Usually the true conditional probabilities are unknown, and the coding conditional probabilities q(a|s) depend on characteristics of one or several subsequences x^n(s). These characteristics are the frequency f(a|s) = f(a|x^n(s)) of symbol a in x^n(s), the alphabet A(s) = A(s,x^n) = { a ∈ A : f(a|s) > 0 }, its cardinality m(s) = |A(s)|, etc.

Already at small D the number of model states is large, the subsequences x^n(s) are short on average, and their statistics are insufficient for effective compression. In particular, the task of multialphabet coding for unknown A(s) is hindered. The multialphabet properties of the PPM algorithm [1] are based on the (implicit) assumption that the longer the common initial part of two contexts, the more similar (on average) their conditional probability distributions. The high efficiency of PPM means that this assumption holds for the majority of coded sources.

Let Z[n] be the set of nodes s[d], d <= D, on the current branch x[n], x[n-1], ... of the context tree; s(a) be the context of maximal length for which f(a|s) > 0, and d(a) = |s(a)| (if such a context is absent, then d(a) = -1); q*(a|s) and q*(esc|s) be the conditional probabilities used in s for coding the symbols a ∈ A(s) and the escape symbol. The escape symbol signals the appearance of a symbol that is new in A(s). The key feature of all PPM modifications is the representation of the coding conditional probability of any symbol x[n+1] ∈ A as

    q(x[n+1] | x^n) = Prod( q*(esc|s[i]), i = d+1..D ) * q*(x[n+1] | s[d]),  d = d(x[n+1])     (1)

The product of escape conditional probabilities describes to the decoder a sequential descent from s[D] to s(a) (if d(a) = D, this product equals 1). Usually

    q*(a|s)   = t(a|s) / T(s),   for all a ∈ A#(s),
    q*(esc|s) = t(esc|s) / T(s),                                                              (2)

where A#(s[j]) = A(s[j]) \ A(s[j+1]), s[j+1] is a child context of s[j] on the current tree branch Z[n] (and, vice versa, s[j] is a parent context of s[j+1]), t(.|s) are generalized symbol frequencies for x^n(s), and T(s) is the sum of all generalized frequencies. As a rule, t(.|s) is written as t(.|s) = f(.|s) + c(.|s). For example, PPMC [2] corresponds to c(a|s) = c(esc|s) = 0, and PPMD [3] to c(a|s) = -1/2, c(esc|s) = -m/2.
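To make (1)-(2) concrete, the following sketch (illustrative C++, not the author's implementation; the Context class, the demo data and the fallback to a uniform byte distribution at order -1 are assumptions of this example) computes the coding probability of a symbol with the PPMD estimator c(a|s) = -1/2, c(esc|s) = -m/2; exclusion of masked symbols, discussed below, is omitted for brevity.

    // Illustrative sketch: coding probability (1) with the PPMD estimator (2),
    // i.e. t(a|s) = f(a|s) - 1/2 and t(esc|s) = m(s)/2, with T(s) = sum of f(.|s).
    #include <cstdio>
    #include <map>
    #include <vector>

    struct Context {                              // statistics gathered in x^n(s)
        std::map<char, int> f;                    // f(a|s); only symbols with f > 0 are stored
        int m() const { return (int)f.size(); }   // m(s) = |A(s)|
        int T() const {                           // T(s) = sum of frequencies
            int sum = 0;
            for (const auto& p : f) sum += p.second;
            return sum;
        }
    };

    // branch holds s[D], s[D-1], ..., s[0] along Z[n], longest context first
    double codingProbability(const std::vector<Context>& branch, char a) {
        double q = 1.0;
        for (const Context& s : branch) {
            if (s.m() == 0) continue;                  // empty context, nothing to code
            double T = s.T();
            auto it = s.f.find(a);
            if (it != s.f.end())                       // s = s(a): code the symbol here
                return q * (it->second - 0.5) / T;     // q*(a|s) = t(a|s)/T(s)
            q *= 0.5 * s.m() / T;                      // q*(esc|s) = (m(s)/2)/T(s)
        }
        return q / 256.0;                              // order -1: uniform over bytes
    }

    int main() {
        std::vector<Context> branch(3);
        branch[0].f = {{'a', 3}, {'b', 1}};                        // s[2]
        branch[1].f = {{'a', 5}, {'b', 2}, {'c', 1}};              // s[1]
        branch[2].f = {{'a', 9}, {'b', 4}, {'c', 2}, {'d', 1}};    // s[0]
        std::printf("q(a)=%.4f  q(c)=%.4f  q(z)=%.6f\n",
                    codingProbability(branch, 'a'),
                    codingProbability(branch, 'c'),
                    codingProbability(branch, 'z'));
        return 0;
    }

Each escape multiplies in (m(s)/2)/T(s), and the symbol itself contributes (f(a|s) - 1/2)/T(s) in the longest context that contains it, exactly as in (1).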
After encoding the next symbol x[n+1], it is necessary to update (to increase by one) its frequency in the tree nodes s ∈ Z[n]. The first PPM variants [1] used full updates, in which the frequencies are changed for all s ∈ Z[n]. Later PPM variants [2] use update exclusions, in which the frequencies are changed only in the nodes with length |s| >= |s(a)|. In some sense, full updates and update exclusions correspond to two extreme cases. During a descent along Z[n] with the help of escape symbols, one must eliminate the symbols already checked in higher-order contexts (exclusions). Let us call such symbols masked symbols, and contexts containing only such symbols masked contexts. Let us call s a binary context if m(s) = 1; in it only two probabilities are used, i.e. q(a|s) = 1 - q(esc|s).

§2. Evaluation of generalized symbol frequencies.

1. Information inheritance. The main difficulty of all explicit context-modeling schemes is the insufficiency of statistics in higher-order contexts. Many ways to overcome this problem have been proposed: weighting the predictions of lower- and higher-order contexts (CTW, interpolated Markov model), or coding a symbol only in contexts that have enough (by some criterion) statistics (LOE, state selection). All these methods require large computational resources and are unacceptable for our purposes. We can take advantage of the similarity of the distribution functions in parent and child contexts and set the initial value of the generalized symbol frequency in a child context using the information about this symbol gathered in the parent context. Such an approach has two merits: firstly, the reference to parent statistics occurs only when a new symbol is added to the child context, i.e. rarely enough, which allows solutions with linear time complexity (not depending on the tree depth); secondly, because the parent statistics are used only rarely afterwards, the model can quickly adapt to variations in the character of the input data.

The following notation is used below: s[i] is the new context (T(s[i]) = 0) or the context to which the new symbol a is to be added (t(a|s[i]) = 0); s[k] is the longest context that already contains the current symbol a (s[k] = s(a)). Addition of a new symbol to the old context will be our initial concern. Locally, at the given point of the coded text, it would be optimal to immediately use a PPM model of order k, i.e. to reduce the tree depth to k; in this way we eliminate the errors associated with inaccurate estimation of the escape probabilities. On the other hand, we need statistics similar only to the statistics in s[i]; for this reason, we must perform the reduction of tree depth only along the context branch s[j], k < j <= i. [...]

§3. Estimation of the escape probability.

[...] At each step, the new value d[j] is added to the running sum S and the mean value <d> (= S/N0) is subtracted from it, i.e. the change of S is dS = d[j] - S/N0; N0 is chosen as a power of two to eliminate the division operation.
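This shift-based averaging admits a very small implementation; the sketch below (illustrative C++; the class name, the SHIFT parameter and the initial value are choices made for this example, not taken from the paper) keeps S approximately equal to N0 times the recent mean, so both the update and the estimate need only shifts and additions.

    // Illustrative sketch of the adaptive running average: each step the new value d
    // is added and the mean <d> = S/N0 is subtracted, dS = d - S/N0, with N0 = 2^SHIFT.
    #include <cstdio>

    template <unsigned SHIFT>                 // N0 = 1 << SHIFT, a power of two
    class RunningMean {
        int S;                                // scaled sum, approximately N0 * <d>
    public:
        explicit RunningMean(int init) : S(init << SHIFT) {}
        void update(int d) { S += d - (S >> SHIFT); }   // dS = d - S/N0, no division
        int mean() const { return S >> SHIFT; }         // current estimate of <d>
    };

    int main() {
        RunningMean<4> avg(10);               // N0 = 16, initial mean 10
        const int data[] = {12, 14, 13, 40, 11, 12, 13, 12};
        for (int d : data) {
            avg.update(d);
            std::printf("d=%2d  mean=%d\n", d, avg.mean());
        }
        return 0;
    }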
The number h of components of the vector w(s), and their influence on q(esc|w), can only be found experimentally. They are enumerated below in decreasing order of influence.

1) Naturally, the escape probability depends to a great degree on the generalized symbol frequency q(a|s). This variable is quantized to 128 values.

2) PPM exploits the similarity of parent and child contexts; therefore the alphabet size m(s[i-1]) of the parent has a strong effect on the escape probability in the child. This w(s) component is quantized to 4 values.

3) Experiments show that in real sources highly predictable data blocks are interleaved with poorly predictable ones. The sizes of these blocks are small, approximately 3-5 symbols, which corresponds to the natural segmentation of language text into words and parts of words. The probability of the previously encoded symbol in the previous context is included into w(s) to track the switching between such blocks. This variable is quantized to 2 values.

4) The current coded symbol correlates mostly with the previous symbol. However, we cannot include the whole previous symbol into w(s), because the number of SEE-contexts would become too big and the frequency of each separate SEE-context would be too small. Only a one-bit flag is included into w(s); the flag is set to 0 if the two higher bits of the previous symbol are zero, and to 1 otherwise.

5) A long block of symbols of length L is a sequence of input symbols for which there were no escapes to lower orders and the coding probability of each of the L symbols of this sequence was larger than 1/2. It is quite probable that a PPM model with D [...]

§4. Implementation details.

[...] for already existing contexts (T(s) > 0); the position in the coded string is only remembered for new contexts (T(s) = 0).

    CONTEXT structure, m(s) > 1:               CONTEXT structure, m(s) = 1:
      m(s)                      - 2 bytes        m(s)                  - 1 byte
      T(s)                      - 2 bytes        TRANSITION structure  - 4 bytes
      Link to TRANSITIONs array - 4 bytes        Link to suffix(es)    - 4 bytes
      Link to suffix(es)        - 4 bytes

    TRANSITION structure:
      a                         - 1 byte
      t(a|s)                    - 1 byte
      Link to successor(s,a)    - 4 bytes

    Fig. 1. Data structures of the PPMII algorithm.

A graphical representation of the data structures used is given in Fig. 1. It can be seen from this figure that the proposed algorithm spends 12 bytes for each repeated (binary or nonbinary) context and 6 bytes for each transition structure in a nonbinary context. For comparison, one of the most memory-economical PPM/PPM* implementations [6] requires 8 four-byte machine words for the nonbinary context structure, 6 words for the binary context structure, and 6 words for the transition structure in a nonbinary context.

The precision of the frequency representation is 1 for binary contexts and 1/4 for nonbinary ones. The statistics in a nonbinary context are scaled (all frequencies are halved) when one of the frequencies exceeds the threshold 30. A simplified variant of the range coder [7] is used as the entropy coder. The division operation in equation (5) is approximated by a series of comparisons; other approximations used can be found in [5].
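For orientation, the Fig. 1 layout can be written down as packed structures; the sketch below is an illustration only (field and type names are invented here, and the real PPMd sources [5] differ in detail), showing how the 12-byte context and 6-byte transition records come about.

    // Illustrative sketch of the Fig. 1 record sizes: 12 bytes per (nonbinary)
    // context and 6 bytes per transition.  Not the author's actual declarations.
    #include <cstdint>
    #include <cstdio>

    #pragma pack(push, 1)
    struct Transition {            // one (symbol, frequency, successor) record
        uint8_t  symbol;           // a                          - 1 byte
        uint8_t  freq;             // t(a|s)                     - 1 byte
        uint32_t successor;        // link to successor(s,a)     - 4 bytes
    };                             // total: 6 bytes

    struct Context {               // nonbinary context, m(s) > 1
        uint16_t numSymbols;       // m(s)                       - 2 bytes
        uint16_t totalFreq;        // T(s)                       - 2 bytes
        uint32_t transitions;      // link to TRANSITIONs array  - 4 bytes
        uint32_t suffix;           // link to suffix context     - 4 bytes
    };                             // total: 12 bytes
    #pragma pack(pop)

    int main() {
        std::printf("sizeof(Transition)=%zu, sizeof(Context)=%zu\n",
                    sizeof(Transition), sizeof(Context));
        return 0;
    }

Keeping the links as 32-bit offsets into a preallocated memory pool is one way to hold these sizes independently of the machine's native pointer width.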
§5. Complicated variant.

The PPMII algorithm demonstrates very nice results, so it is interesting to look at the maximal compression efficiency that a similar approach can provide. In this section we do not limit ourselves by the requirement of low computational complexity. This modification of the initial scheme is named complicated PPMII (cPPMII).

Some improvements can be obtained by merely removing the simplifications introduced earlier. The approximation of (5) described at the end of §4 is cancelled. The delayed addition of a new symbol to the context (§2.1) is performed for all contexts, not just for binary ones. For the update exclusions modification (§2.2), the statistics are updated at any context length, including the case |s(a)| = D; moreover, the statistics are updated for three contexts in the parent-child chain, with increments 1/2, 1/4 and 1/8. The precision of the frequency representation is enlarged to 1/8. The other improvements require individual consideration.

1. Improving the probability estimation for more probable symbols and for less probable ones. Statistics in the parents are accumulated faster than in "young" child contexts (those with small T(s)), so there is good reason to use the parent statistics repeatedly. The generalized frequency of each of the more probable symbols (a symbol a1 falls into this group when q(a1|s) >= q(aMPS|s)/2, where aMPS denotes the most probable symbol (MPS) of the context) is corrected while the child s[i] is still very young (T(s[i]) [...]

2. [...] Then, at small N(i), we can update the statistics not only for the value i but also for (i-1) and (i+1), with some small weight that decreases as N(i) increases. For simplicity of calculation, this weight is chosen equal to 2^(-ceil(log2 N(i))), and it is set to zero after N(i) exceeds some threshold N0. In cPPMII, this technique is applied to every variable that is quantized to more than two values.

3. Adaptive probability estimation for the MPS. The encoded message length is markedly affected by a precise probability estimate for the MPS; therefore, after the frequency correction (see §5.1), an adaptive probability estimation for the MPS is performed. The SEE-model for the MPS is built similarly to the model for escape symbols: the behavior of the generalized frequency t(aMPS|s) is modeled for contexts without masked symbols, and the behavior of the conditional probability q(aMPS|s) is modeled for contexts with masked symbols.

For contexts without masked symbols, the vector w(s) consists of:
1) the generalized symbol frequency t(aMPS|s), quantized to 68 values;
2) a one-bit flag indicating whether statistics rescaling has been performed;
3) the result of comparing the current context length |s| with the average used context length (the averaging is performed over the last 128 symbols);
4) the same as items 4) and 6) of the §3.1 enumeration.

For contexts with masked symbols, the vector w(s) consists of:
1) the probability estimate q(aMPS|s), quantized to 40 values;
2) the result of comparing the average non-masked symbol frequency with the average masked symbol frequency;
3) a one-bit flag indicating whether only one symbol is masked;
4) the same as items 2) and 4) of the previous enumeration.

4. Additional SEE-context components. The additional SEE-context fields for binary contexts are:
1) the result of comparing m(s[i-1]) with the number of symbols in the previous context;
2) the number of binary parent contexts, quantized to 2 values;
3) a flag built from the two higher bits of the symbol preceding the already encoded symbol;
4) the same as item 3) of the first §5.3 enumeration.

The additional SEE-context fields for contexts with masked symbols are:
1) the same as item 1) of the first §5.3 enumeration;
2) the same as item 2) of the second §5.3 enumeration.

5. Adaptive escape frequency estimation for nonbinary contexts (without masked symbols). The escape frequency estimation model for these contexts is built similarly to the one for contexts with masked symbols, i.e. the escape frequency is modeled rather than the escape probability. The vector w(s) consists of the following components:
1) the alphabet size m(s), quantized to 25 values;
2) the result of the comparison 4*m(s[i]) > 3*m(s[i-1]);
3) the same as items 4)-6) of the §3.1 enumeration;
4) the same as items 2) and 3) of the first §5.3 enumeration;
5) the same as item 1) of the first §5.4 enumeration.
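Each SEE-model above amounts to a small table of adaptive cells indexed by the quantized components of w(s). The sketch below shows one way such an index can be assembled (illustrative C++; the particular components, quantization steps and table dimensions are assumptions of this example, not the exact PPMII/cPPMII model).

    // Illustrative sketch: selecting a SEE cell by quantized components of w(s).
    // The component set and quantizers are examples in the spirit of §3.1 only.
    #include <cstdio>

    static int quantizeParentSize(int m) {     // parent alphabet size -> 4 values
        if (m <= 1) return 0;
        if (m <= 3) return 1;
        if (m <= 7) return 2;
        return 3;
    }

    static int quantizeProb(int t, int T) {    // frequency ratio t/T -> 8 values
        int q = (8 * t) / (T > 0 ? T : 1);
        return q > 7 ? 7 : q;
    }

    struct SeeCell { int summ; };              // adaptive cell, e.g. maintained with
                                               // the shift-based average shown earlier

    static SeeCell seeTable[8][4][2][2];       // 8*4*2*2 = 128 SEE-contexts (toy size)

    static SeeCell& lookupSee(int t, int T, int parentM,
                              bool prevSymbolHighBits, bool prevProbHigh) {
        return seeTable[quantizeProb(t, T)]
                       [quantizeParentSize(parentM)]
                       [prevSymbolHighBits ? 1 : 0]
                       [prevProbHigh ? 1 : 0];
    }

    int main() {
        SeeCell& cell = lookupSee(3, 10, 5, true, false);
        cell.summ += 1;                        // toy update of the selected cell
        std::printf("selected cell summ=%d\n", cell.summ);
        return 0;
    }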
§6. Experimental results.

The PPMII algorithm and its complicated modification were implemented in the C++ programming language, and this implementation is publicly available at [5]. The executable file PPMd.exe corresponds to the basic algorithm, and PPMonstr.exe corresponds to cPPMII. All experiments were carried out on the standard Calgary corpus [8].

    Order   PPMD    + II    + UEM   + SEE1  + EE1   + SEE2
      2     2.790   2.766   2.766   2.767   2.767   2.759
      3     2.427   2.387   2.387   2.382   2.379   2.366
      4     2.310   2.254   2.254   2.235   2.230   2.212
      5     2.290   2.215   2.211   2.185   2.178   2.158
      6     2.297   2.204   2.197   2.166   2.158   2.137
      8     2.319   2.196   2.186   2.150   2.142   2.118
     10     2.339   2.195   2.184   2.143   2.136   2.111
     16     2.369   2.194   2.182   2.137   2.130   2.104

    Table 1. PPMII: step by step.

1. Evaluation of the contribution of each part of the algorithm. The unweighted average bits per byte (bpb) are presented in Table 1 for each modification of the PPM algorithm described in §§2-3. The columns are:

[PPMD]   - original PPMD by P.G. Howard (the present author's implementation);
[+ II]   - the previous scheme plus information inheritance (§2.1, w = 1/4 in (5));
[+ UEM]  - the previous scheme plus the update exclusions modification (§2.2);
[+ SEE1] - the previous scheme plus the SEE-model for binary contexts (§3.1, w = 1 for binary contexts);
[+ EE1]  - the previous scheme plus escape estimation for nonbinary contexts (§3.2);
[+ SEE2] - the previous scheme plus the SEE-model for contexts with masked symbols (§3.3).

                  PPMII                        cPPMII
    Order   Average  Time,  Memory,      Average  Time,  Memory,
            bpb      sec    MB           bpb      sec    MB
      2     2.759    3.18    0.6         2.716     8.51    1.1
      3     2.366    3.79    1.0         2.321    10.60    1.5
      4     2.212    4.51    1.9         2.170    12.46    2.4
      5     2.158    5.21    3.5         2.114    14.00    4.0
      6     2.137    5.88    5.6         2.090    15.21    6.1
      8     2.118    6.76   10.1         2.067    16.85   10.6
     10     2.111    7.25   13.3         2.057    17.57   13.8
     16     2.104    7.74   16.2         2.047    18.56   16.7

    For comparison:
      ZIP -9     2.764    5.93    0.5
      BZIP2 -8   2.368    5.87    6.0
      PPMZ2      2.139     n/a   >100

    Table 2. Integral characteristics of various compressors.

2. Time and memory requirements. In the second experiment, the time and memory requirements of the PPMII and cPPMII implementations were compared with those of widespread practical implementations of the LZ77 (ZIP [9]) and BWT (BZIP2 [10]) algorithms and also with the most powerful implementation of the PPM* algorithm (PPMZ2 [11]). The measured compression efficiency (average bpb), compression time (seconds) and maximal memory requirements (megabytes) are presented in Table 2. It can be seen from the table that the basic PPMII algorithm provides a wide range of options. At small D (2-3), it gives compression efficiency comparable to that of ZIP or BZIP2, with faster compression and smaller memory requirements than BZIP2.
At medium D (4-6), PPMII efficiency is noticeably better than that of ZIP and BZIP2, while the time and memory requirements remain moderate. Lastly, at high D (8-16), PPMII outperforms the best of the described programs, PPMZ2, in all characteristics. The complicated cPPMII gives even better efficiency, but its low execution speed makes it less appealing.

    File      CTW     ACB     FSMX    PPMZ2   PPMII   cPPMII  cPPMII  cPPMII
                                              -8      -8      -16     -64
    BIB       1.782   1.935   1.786   1.717   1.732   1.694   1.679   1.676
    BOOK1     2.158   2.317   2.184   2.195   2.192   2.136   2.135   2.135
    BOOK2     1.869   1.936   1.862   1.843   1.838   1.795   1.782   1.782
    GEO       4.608   4.555   4.458   4.576   4.349   4.163   4.159   4.158
    NEWS      2.322   2.317   2.285   2.205   2.205   2.160   2.142   2.137
    OBJ1      3.814   3.498   3.678   3.661   3.536   3.507   3.497   3.498
    OBJ2      2.473   2.201   2.283   2.241   2.206   2.154   2.118   2.110
    PAPER1    2.247   2.343   2.250   2.214   2.194   2.152   2.144   2.142
    PAPER2    2.190   2.337   2.213   2.184   2.181   2.130   2.124   2.124
    PIC       0.800   0.745   0.781   0.751   0.757   0.721   0.715   0.704
    PROGC     2.330   2.332   2.291   2.257   2.215   2.178   2.161   2.161
    PROGL     1.595   1.505   1.545   1.445   1.470   1.433   1.398   1.390
    PROGP     1.636   1.502   1.531   1.448   1.522   1.489   1.414   1.391
    TRANS     1.394   1.293   1.325   1.214   1.257   1.228   1.186   1.172
    Average
    bpb       2.230   2.201   2.177   2.139   2.118   2.067   2.047   2.041

    Table 3. Compression efficiency for various compression schemes.

3. Compression efficiency. In the last experiment, a more detailed comparison is performed between the proposed algorithms and algorithms described in the literature. The following schemes are presented in Table 3: the implementation [12] of the CTW method [13] with binary decomposition (the results were taken from [14]); the associative coder by G. Buyanovsky (ACB) [15]; the best of the FSMX coders described by S. Bunton [6]; and PPMZ2 by C. Bloom [11]. The next column contains the PPMII results for D = 8, and the last three columns contain the cPPMII results for D = 8, 16, 64. It is necessary to emphasize that the implementation [12] of CTW uses a symbol decomposition specially optimized for English texts. Let us remark that, in contrast to other PPM schemes, PPMII works well enough for nontextual data too. Now it really is universal data compression ;-).

References.

1. Cleary, J.G. and Witten, I.H. (1984) Data compression using adaptive coding and partial string matching. IEEE Trans. on Comm., 32(4):396-402.
2. Moffat, A. (1990) Implementing the PPM data compression scheme. IEEE Trans. on Comm., 38(11):1917-1921.
3. Howard, P.G. (1993) The Design and Analysis of Efficient Lossless Data Compression Systems. PhD thesis, Brown University.
4. Bloom, C. (1996) Solving the Problems of Context Modeling. www.cbloom.com/papers/.
5. Shkarin, D. (2001) PPMd - fast PPM compressor for textual data. ftp.elf.stuba.sk/pub/pc/pack/ppmdh.rar.
6. Bunton, S. (1996) On-Line Stochastic Processes in Data Compression. PhD thesis, University of Washington.
7. Martin, G.N.N. (1979) Range encoding: an algorithm for removing redundancy from a digitised message. Presented at the Video & Data Recording Conference, Southampton.
8. The Calgary Compression Corpus. ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/.
9. Info-ZIP Group (1999) Zip v.2.3 - compression and file packaging utility. www.cdrom.com/pub/infozip/.
10. Seward, J. (2000) BZip2 v.1.0 - block-sorting file compressor. www.muraroa.demon.co.uk/.
11. Bloom, C. (1999) PPMZ2 - High Compression Markov Predictive Coder. www.cbloom.com/src/.
12. Volf, P.A.J. (1996) Text compression methods based on context weighting. Technical report, Stan Ackermans Institute, Eindhoven University of Technology.
13. Willems, F., Shtarkov, Y. and Tjalkens, T. (1995) The context-tree weighting method: basic properties. IEEE Trans. on Inf. Theory, 41(3):653-664.
14. Volf, P.A.J. and Willems, F.M.J. (1998) Switching between two universal source coding algorithms. Proc. IEEE Data Compression Conf., pp. 491-500.
15. Buyanovsky, G. (1994) Associative coding. The Monitor, 8:10-22 (in Russian).