o) We should really be looking up hashes both in our cache and in our cache
   of hashes from the peer. If it's only in the peer's cache, we can add it
   to ours and send it as if it were a new piece of data. This could either
   be a big speedup or a big slowdown for the encoding process. It depends
   on how often data we see from the peer turns out to be data we need to
   send back to the peer.
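
A minimal sketch of that two-cache lookup, using a plain std::map as a
stand-in for the real cache type (the names and interfaces here are
invented, not the actual XCodecCache API):

    #include <stdint.h>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>

    typedef std::map<uint64_t, std::string> HashCache;  /* hash -> chunk bytes */

    /*
     * Look the hash up in our cache first, then in the cache of hashes
     * learned from the peer; a peer-only hit is copied into our cache so
     * the data can then be encoded as if it were a new piece of data.
     */
    static const std::string *
    lookup_both(HashCache& ours, const HashCache& theirs, uint64_t hash)
    {
            HashCache::iterator it = ours.find(hash);
            if (it != ours.end())
                    return (&it->second);
            HashCache::const_iterator pit = theirs.find(hash);
            if (pit == theirs.end())
                    return (NULL);
            it = ours.insert(std::make_pair(hash, pit->second)).first;
            return (&it->second);
    }
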
XCodec 0.9.0 goals:
o) Stop using hashes like names and use actual names. This abstraction will
   allow us to minimize the cost of collisions, speed lookup, etc. It also
   means that different systems will be able to use different encode/hash
   algorithms for lookup based on their requirements.
o) Expand the protocol to have different ops for referencing hashes in our
   namespace vs. those of our peer... (See the sketch after this list.)
o) Exchange not just our UUIDs but a list of all of the UUIDs of other systems
   we're talking to, allowing us to also reference hashes in other namespaces
   that we share access to.
o) Also exchange other parameters, like size of the backref window, using the
   minimum between the two peers.
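
A minimal sketch of what namespace-aware reference ops might look like, per
the second and third goals above (opcode values, field names, and widths are
invented, not the real XCodec wire format):

    #include <stdint.h>

    enum ReferenceOp {
            REF_LOCAL  = 0x01,      /* hash defined in the sender's own namespace */
            REF_PEER   = 0x02,      /* hash defined in the receiver's namespace */
            REF_SHARED = 0x03       /* hash defined in a third namespace both peers know */
    };

    struct Reference {
            uint8_t  op_;           /* one of ReferenceOp */
            uint8_t  namespace_;    /* index into the exchanged UUID list (REF_SHARED only) */
            uint64_t hash_;         /* the hash (or name) being referenced */
    };
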
Past ideas:
o) Create a new XCodecTag that incorporates a hash and a counter and
   perhaps other things... (See the sketch after this list.)
o) The counter will increment for each collision (or perhaps just be a
   random number after the first collision and bail out if there's a
   collision on the random number) and add new variants of extract, etc.,
   that give a counter to append to the hash to get the Tag.
o) Allow either powers of two above 128 or multiples of 128 to be usable
   chunk sizes instead of just 128. Include, any time we define a tag, a
   bitmap of 128-byte blocks (or blocks of each size down to 128) within
   the chunk that are to be learned, too, so that we still deal well with
   changes. Eventually allow defining new e.g. 4K blocks based on old 4K
   blocks with a single 128-byte block difference?
o) Add a pass number to the Tag so we can do recursive encoding. Use a
   limited number of bits and put this above the opcode so that we can have
   separate back-reference windows, etc., for each pass and so that we can
   avoid escaping for subsequent passes, perhaps?
o) Deflate after recursive encoding.
o) A new encoder that can exploit all of those features, possibly keeping
   the old encoder around for applications that need low latency and high
   throughput.
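
A minimal sketch of a tag that pairs a hash with the collision counter and
pass number described above (field names and widths are illustrative only,
not an actual XCodec type):

    #include <stdint.h>

    struct XCodecTag {
            uint64_t hash_;         /* hash of the chunk */
            uint8_t  counter_;      /* incremented (or randomized) on collision */
            uint8_t  pass_;         /* pass number, for recursive encoding */
    };

    static inline bool
    tag_equal(const XCodecTag& a, const XCodecTag& b)
    {
            return (a.hash_ == b.hash_ && a.counter_ == b.counter_ &&
                    a.pass_ == b.pass_);
    }
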
To-do:

o) Add a 'count' field to the hash and allow incrementing it to do collision
   overflow. For this it'd be nice to have an interface that would return a
   range of matches in the dictionary. Put the count at the end to make this
   possible. Would need to change the encoding logic to use a different OP
   for these that took, say, a count or even just the full hash/identifier.
o) Only have N bytes outstanding at any given time (say 128k?) and add some
   type of ACK, perhaps? This is necessary to:
o) Write a garbage-collector for the dictionary. LRU? (See the sketch after
   this list.)
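
A minimal sketch of an LRU dictionary along the lines of the last item above
(std::string stands in for the real segment type; all names are invented):

    #include <stdint.h>
    #include <cstddef>
    #include <list>
    #include <map>
    #include <string>
    #include <utility>

    class LRUDictionary {
            typedef std::list<uint64_t> UseList;
            typedef std::map<uint64_t,
                             std::pair<std::string, UseList::iterator> > Map;

            size_t limit_;
            UseList use_;           /* most recently used at the front */
            Map map_;
    public:
            LRUDictionary(size_t limit) : limit_(limit), use_(), map_() { }

            const std::string *lookup(uint64_t hash)
            {
                    Map::iterator it = map_.find(hash);
                    if (it == map_.end())
                            return (NULL);
                    /* Move the entry to the front of the use list.  */
                    use_.splice(use_.begin(), use_, it->second.second);
                    return (&it->second.first);
            }

            void enter(uint64_t hash, const std::string& data)
            {
                    if (map_.find(hash) != map_.end())
                            return;
                    use_.push_front(hash);
                    map_.insert(std::make_pair(hash,
                        std::make_pair(data, use_.begin())));
                    if (map_.size() > limit_) {
                            /* Evict the least-recently-used entry.  */
                            map_.erase(use_.back());
                            use_.pop_back();
                    }
            }
    };
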
Possibly-bad future ideas:

o) Incorporate run-length encoding.
o) Incorporate occasional (figure out frequency) CRCs or such of the next N
   bytes of decoded data to make it possible to detect any hash mismatches,
   using a different hash function from any that goes into the hash. (See
   the sketch after this list.)
o) If the encoded version of a stream is larger than the source would be
   escaped, it'd be nice to just transmit it escaped and to have some way to
   tell the remote side how to pick out chunks to be taken as known to both
   parties in the future. One approach would be to send a list of offsets
   at which hashes were declared.
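
A minimal sketch of the periodic-CRC idea from the second item above, using
zlib's crc32() as the independent check (CHECK_INTERVAL and the framing of
the checksum in the stream are invented; only the check itself is shown):

    #include <stdint.h>
    #include <algorithm>
    #include <string>
    #include <zlib.h>

    static const size_t CHECK_INTERVAL = 16384;  /* bytes of decoded data per CRC */

    /*
     * Verify one window of decoded output against the CRC the encoder
     * computed over the same bytes before encoding them.
     */
    static bool
    verify_window(const std::string& decoded, size_t offset, uint32_t expected)
    {
            size_t len = std::min(CHECK_INTERVAL, decoded.size() - offset);
            uint32_t actual = crc32(0L, (const Bytef *)(decoded.data() + offset), len);
            return (actual == expected);
    }
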
%%%

Hash-set deduplication:

For a given number of hashes (say 64), put an unordered list (hash?) of each
group of 64 hashes that is encountered into a database.

When data is encoded, check whether its 64 hashes have appeared previously. If
they have, then use a compact encoding to list the order in which they appear
and the offsets within the list at which escaped or new data is to be inserted.
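
A minimal sketch of that set index: each group of 64 chunk hashes is reduced
to an order-independent fingerprint, and the fingerprints seen so far are
remembered so a later occurrence of the same set can be recognized regardless
of chunk order (the mixing function and the std::set stand-in for the
database are illustrative only):

    #include <stdint.h>
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <set>
    #include <vector>

    static const size_t GROUP_SIZE = 64;        /* hashes per group */

    /* Sort the hashes so the fingerprint does not depend on their order.  */
    static uint64_t
    fingerprint(std::vector<uint64_t> hashes)
    {
            std::sort(hashes.begin(), hashes.end());
            uint64_t fp = 14695981039346656037ULL;
            for (size_t i = 0; i < hashes.size(); i++)
                    fp = (fp ^ hashes[i]) * 1099511628211ULL;   /* FNV-style mix */
            return (fp);
    }

    /* Returns true if this group of hashes has been seen before.  */
    static bool
    seen_before(std::set<uint64_t>& index, const std::vector<uint64_t>& group)
    {
            assert(group.size() == GROUP_SIZE);
            uint64_t fp = fingerprint(group);
            if (index.find(fp) != index.end())
                    return (true);
            index.insert(fp);
            return (false);
    }
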
Eventually extend with one of the Computational Biology algorithms for finding
sequences missing an element or with one element changed so that we can do work
with deltas and offset sequences/sets more reliably.