Collecting the Statistics on Context Occurences
CBS = Context Bitmask Set
1 bit contexts: mask 11, list 0,1 2 bit contexts: mask 1101, list 00,01,11 3 bit contexts: mask 010101, list 001,011,111
I. Tasks for storage structure evaluation
1. For a given context order, generate a list of symbol distributions within contexts of given order at each file position.
2. Enumerate the contexts in the file (up to shortest unique) and print number of occurences for each context
II. On optimal parsing
1. The parsed (or preprocessed) data isn't really compatible with main context tree (its a tree anyway, even if its stored as CBS) - a word's prefix or suffix doesn't necessarility have to be included if the word itself is. So guess that a secondary statistic storage would be required for parsed results.
2. The idea is to split the text at the points, where contexts don't predict the next data right. One example is when a contexts exists, but its symbol distribution entropy is greater than some threshold (like predicting the next word from the word suffix in a wordlist) - its where we should terminate a word. And the left edge can be determined in a similar way using a right context... getting statistics for it is troublesome though.
Anyway, the start and end of each word have to be turned into special symbols, and then these symbols have to be merged by statistical similarity - basically, we merge two special symbols when single one in their place works better (improves compression).
Also its not necessary to actually preprocess the data and insert some symbols (though even like that it could be useful), as they're just signs for context merging, so statistics from a corresponding context have to be mixed in with separator's probability.
So in the end, it looks like an extra statistical structure (or even two - besides usual "left" tree, also "right" tree, and word tree) would be required for data parsing.
A rough simpler implementation is still possible though, which maybe wouldn't detect the word's left edges, and would insert spaces in place of end-of-word separator.
III. On 2-stage statistic storage
It seems that some kind of incremental CBS is actually possible. To be specific, we know that no more new contexts than there are symbols in the block can appear while processing the block. Then, corresponding amount of bits can be reserved in all CBs and dynamically allocated on new context appearance. But of course, some workaround is required as we cannot afford to dynamically remap the bitmasks. And here's the idea:
1. Assuming that dual bitmasks are used (for all and only non-unique contexts), there's normally a single unused state (01 - context not present, but is non-unique). This can be used to mark the new contexts. (Though even like that the non-unique bit vectors would have to be masked with common bit vectors before index lookup).
2. Suffices of such new contexts cannot be allocated at the point determined by actual index (that'd require moving a lot of data), so relocation mechanism is required. And it can be implemented with a hash (single for all context orders), with a hash function based on context order and actual index. Its also possible to allocate a pointer per each missing context, but somehow that seems too redundant.
Btw, a specific CBS for nibble-based dual bitmask probably has to be created, to improve the performance - unique / non-unique vectors can be interleaved and an interface for simultaneous access to both implemented.
However, an implementation like that would have a flaw similar to PPMd tree - context order only can be incremented at each occurence, so its impossible to capture a long string at second occurence. The only way around that seems to require using a pointer per each indexed context (to the last context occurence in the data), which might even be okay, considering other requirements (pointer field can be used for counting after context becomes non-unique). But its inconvenient to keep all the file in the memory to be able to recover any context, and a trade-off might be to only keep these pointers for dynamic, relocated parts of CBs, ie only for new contexts from the current block.
After processing a block, the CBS would have to be remapped with a procedure similar to sorting - a scan to find the new contexts and tracking of new context suffix areas (luckily they seem to be continuous) and shifting the data, while filling the holes with relocated masks (and all that including actual statistics and such).
Concerning that, it looks like current design with a continuous bit vectors and index tables won't work anymore (due to memory reallocation issues). Well, it was redundant anyway.
Instead, 64k bit vectors can be allocated separately (along with internal indices, and both U and NU, so 8k(U)+8k(NU)+4k(index)=20kbyte, or maybe 24 if we'd index all the 16bit words instead of 32bit ones) and a global index can be used, with 32bit indices and block pointers. It'd also allow to insert the new blocks at the start of CB and thus forward memory access would be possible in remapping.
IV. Farewell CBS?
Looks like this approach becomes more and more complex, now with 2-level index, hashed relocations and blockwise remapping. And at the same time, especially with a "load" like counters and data pointers, it becomes more and more similar to a simple tree.
Now let's consider a nibble tree. In context's node we can store the data pointer in case of unique context, and nibble table pointer otherwise. Also a 16bit mask can be used to avoid redundancy - and allocator of 15 different node types would be necessary (by length). But instead, there'd be no need to separately allocate the tables by order, and context lookup seems near the same as for CBS, just with memory pointers instead of bit indices.
Well, of course a single bitmap would still be faster and more compact, but already inclusion of data pointer to capture whole match at 2nd occurence makes it no different from a tree, except for layout. And as to that, using the pointer from context node is certainly faster than getting it with a hash lookup. And then, there's the whole matter of remapping. Guess, nibble tree is the choice. Or even a byte tree right away - just with nibble masks.
2013-08-09 14:56:32 >
2013-08-19 11:05:43 >
2014-11-27 06:37:31 >
2015-01-12 05:52:05 >