commit d837490606 (version 3.0)
209 changed files with 19662 additions and 0 deletions

xcodec/FUTURE (new file, 79 lines)
@@ -0,0 +1,79 @@
o) We should really be looking up hashes both in our cache and in our cache
   of hashes from the peer. If a hash is only in the peer's cache, we can add
   it to ours and send it as if it were a new piece of data. This could either
   be a big speedup or a big slowdown for the encoding process. It depends on
   how often data we see from the peer turns out to be data we then need to
   send back to the peer.
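
A rough sketch of that dual lookup at encode time. The cache type, method
names, and hash width are stand-ins for illustration, not the real XCodec
interfaces:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    /* Stand-in cache type; the real caches map hashes to data segments.  */
    typedef std::unordered_map<uint64_t, std::string> HashCache;

    /*
     * Consult our own cache first, then the cache of hashes learned
     * from the peer.  Returns true if `hash' can be sent as a reference
     * rather than escaped data.
     */
    static bool
    lookup_for_encode(HashCache& ours, const HashCache& peers, uint64_t hash)
    {
        if (ours.find(hash) != ours.end())
            return (true);              /* Already known to us.  */

        HashCache::const_iterator it = peers.find(hash);
        if (it == peers.end())
            return (false);             /* Unknown; escape the data.  */

        /*
         * Only the peer knows this chunk: learn it ourselves and send
         * it as if it were a new piece of data.
         */
        ours[hash] = it->second;
        return (true);
    }
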
XCodec 0.9.0 goals:
o) Stop using hashes like names and use actual names. This abstraction will
   allow us to minimize the cost of collisions, speed lookup, etc. It also
   means that different systems will be able to use different encode/hash
   algorithms for lookup based on their requirements.
o) Expand the protocol to have different ops for referencing hashes in our
   namespace vs. those of our peer... (see the sketch after this list)
o) Exchange not just our UUIDs but a list of all of the UUIDs of other systems
   we're talking to, allowing us to also reference hashes in other namespaces
   that we share access to.
o) Also exchange other parameters, like the size of the backref window, using
   the minimum between the two peers.
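
One way the namespace ops and the exchanged UUID list might fit together on
the wire. The names, values, and field widths are invented for illustration
and are not part of the current protocol:

    #include <cstdint>

    /* Which namespace a hash reference lives in.  */
    enum XCodecOp {
        XCODEC_OP_ESCAPE    = 0x00,     /* Literal data follows.  */
        XCODEC_OP_REF_SELF  = 0x01,     /* Hash in the sender's namespace.  */
        XCODEC_OP_REF_PEER  = 0x02,     /* Hash in the receiver's namespace.  */
        XCODEC_OP_REF_OTHER = 0x03      /* Hash in a third-party namespace
                                           both sides have access to.  */
    };

    struct XCodecRef {
        uint8_t  op_;                   /* One of the XCodecOp values.  */
        uint8_t  uuid_index_;           /* XCODEC_OP_REF_OTHER only: index
                                           into the exchanged UUID list.  */
        uint64_t hash_;                 /* The hash within that namespace.  */
    };
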
Past ideas:
o) Create a new XCodecTag that incorporates a hash and a counter and
   perhaps other things...
o) The counter will increment for each collision (or perhaps just be a
   random number after the first collision and bail out if there's a
   collision on the random number) and add new variants of extract, etc.,
   that give a counter to append to the hash to get the Tag.
o) Allow either powers of two above 128 or multiples of 128 to be usable
   chunk sizes instead of just 128. Any time we define a tag, also include
   a bitmap of the 128-byte blocks (or blocks of each size down to 128)
   within the chunk that are to be learned, so that we still deal well with
   changes. Eventually allow defining new e.g. 4K blocks based on old 4K
   blocks with a single 128-byte block difference?
o) Add a pass number to the Tag so we can do recursive encoding. Use a
   limited number of bits and put this above the opcode so that we can have
   separate back-reference windows, etc., for each pass and so that we can
   avoid escaping for subsequent passes, perhaps? (see the sketch after
   this list)
o) Deflate after recursive encoding.
o) A new encoder that can exploit all of those features, possibly keeping
   the old encoder around for applications that need low latency and high
   throughput.
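
A possible bit layout for the pass number sitting above the opcode in the
Tag. The field widths are made up for illustration and match nothing in the
actual format:

    #include <cstdint>

    static const unsigned TAG_OP_BITS = 6;      /* Up to 64 opcodes.  */
    static const unsigned TAG_PASS_BITS = 2;    /* Up to 4 encoding passes.  */

    /* Pack a pass number above the opcode in a single tag byte.  */
    static uint8_t
    tag_byte(unsigned pass, unsigned op)
    {
        return ((uint8_t)(((pass & ((1u << TAG_PASS_BITS) - 1)) << TAG_OP_BITS) |
            (op & ((1u << TAG_OP_BITS) - 1))));
    }

    /* The decoder keys its back-reference window off the pass bits.  */
    static unsigned
    tag_pass(uint8_t byte)
    {
        return (byte >> TAG_OP_BITS);
    }
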
To-do:

o) Add a 'count' field to the hash and allow incrementing it to do collision
   overflow. For this it'd be nice to have an interface that would return a
   range of matches in the dictionary. Put the count at the end to make this
   possible. Would need to change the encoding logic to use a different OP
   for these that took, say, a count or even just the full hash/identifier.
   (see the sketch after this list)
o) Only have N bytes outstanding at any given time (say 128k?) and add some
   type of ACK, perhaps? This is necessary to:
o) Write a garbage-collector for the dictionary. LRU?
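
A sketch of the collision-overflow dictionary. Keying on (hash, count) with
the count last keeps all entries for a hash adjacent, so the range lookup
falls out of the ordered map. Everything here is illustrative, including
the 8-bit count:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>

    typedef std::pair<uint64_t, uint8_t> HashKey;   /* (hash, count).  */
    typedef std::map<HashKey, std::string> Dictionary;

    /* All entries sharing `hash', regardless of count.  */
    static std::pair<Dictionary::const_iterator, Dictionary::const_iterator>
    matches(const Dictionary& dict, uint64_t hash)
    {
        return (std::make_pair(dict.lower_bound(HashKey(hash, 0x00)),
            dict.upper_bound(HashKey(hash, 0xff))));
    }

    /*
     * Insert `data' under `hash', bumping the count past colliding
     * entries.  Returns false if every count is taken, in which case
     * the encoder would have to escape the data instead.
     */
    static bool
    enter(Dictionary& dict, uint64_t hash, const std::string& data,
        uint8_t& count)
    {
        for (unsigned c = 0; c < 0x100; c++) {
            HashKey key(hash, (uint8_t)c);
            Dictionary::const_iterator it = dict.find(key);
            if (it == dict.end() || it->second == data) {
                dict[key] = data;
                count = (uint8_t)c;
                return (true);
            }
        }
        return (false);
    }
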
Possibly-bad future ideas:

o) Incorporate run-length encoding.
o) Incorporate occasional (figure out frequency) CRCs or such of the next N
   bytes of decoded data to make it possible to detect any hash mismatches,
   using a hash function different from any that goes into the chunk hashes
   themselves. (see the sketch after this list)
o) If the encoded version of a stream is larger than the escaped source
   would be, it'd be nice to just transmit it escaped and to have some way
   to tell the remote side how to pick out chunks to be taken as known to
   both parties in the future. One approach would be to send a list of
   offsets at which hashes were declared.
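
A sketch of the decoder-side mismatch check. The window size and checksum
are stand-ins; the only requirement from the note is that the function
differ from the one used for the chunk hashes:

    #include <cstddef>
    #include <cstdint>

    static const size_t CHECK_WINDOW = 4096;    /* "the next N bytes".  */

    /* Trivial Adler-32-style sum, standing in for "CRCs or such".  */
    static uint32_t
    check_sum(const uint8_t *buf, size_t len)
    {
        uint32_t a = 1, b = 0;
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % 65521;
            b = (b + a) % 65521;
        }
        return ((b << 16) | a);
    }

    /*
     * False on a mismatch, meaning some earlier hash reference decoded
     * to the wrong data (e.g. an undetected collision).
     */
    static bool
    check_decoded(const uint8_t *decoded, size_t len, uint32_t expected)
    {
        return (len >= CHECK_WINDOW &&
            check_sum(decoded, CHECK_WINDOW) == expected);
    }
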
%%%

Hash-set deduplication:

For a given number of hashes (say 64), put an unordered list (a hash?) of
each group of 64 hashes that is encountered into a database.

When data is encoded, check whether its 64 hashes have appeared previously. If
they have, then use a compact encoding to list the order in which they appear
and the offsets within the list at which escaped or new data is to be inserted.
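
A sketch of both halves of that idea. The order-independent key and the
one-byte permutation encoding are assumptions for illustration; a real
scheme would want a collision-resistant way to key the set:

    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <vector>

    static const size_t XCODEC_SET_SIZE = 64;

    /* XOR is commutative, so any ordering of the same 64 hashes maps
       to the same database key.  */
    static uint64_t
    set_key(const std::vector<uint64_t>& hashes)
    {
        uint64_t key = 0;
        for (size_t i = 0; i < hashes.size(); i++)
            key ^= hashes[i];
        return (key);
    }

    /*
     * If the group's set has been seen before, emit one byte per chunk
     * giving its index within the sorted set: a compact encoding of the
     * order in which the hashes appear.  Offsets for escaped/new data
     * are left out of this sketch.
     */
    static bool
    encode_group(const std::set<uint64_t>& db,
        const std::vector<uint64_t>& group, std::vector<uint8_t>& out)
    {
        if (group.size() != XCODEC_SET_SIZE ||
            db.find(set_key(group)) == db.end())
            return (false);             /* Set not seen before.  */

        std::vector<uint64_t> sorted(group);
        std::sort(sorted.begin(), sorted.end());
        for (size_t i = 0; i < group.size(); i++)
            out.push_back((uint8_t)(std::lower_bound(sorted.begin(),
                sorted.end(), group[i]) - sorted.begin()));
        return (true);
    }
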

Eventually extend this with one of the computational-biology algorithms for
finding sequences missing an element or with one element changed, so that we
can work with deltas and offset sequences/sets more reliably.