    improve delta long block matching with big files · 84336696
    Nicolas Pitre committed
    Martin Koegler noted that create_delta() performs a new hash lookup
    after every block copy encoding, and block copies are currently limited to 64KB.
    
    In case of larger identical blocks, the next hash lookup would normally
    point to the next 64KB block in the reference buffer and multiple block
    copy operations will be consecutively encoded.
    
    It is however possible for the reference buffer to be sparsely indexed if
    hash buckets have been trimmed down in create_delta_index() when hashing
    of the reference buffer isn't well balanced.  In that case the hash
    lookup following a block copy might fail to match anything and the fact
    that the reference buffer still matches beyond the previous 64KB block
    will be missed.
    
    Let's rework the code so that buffer comparison isn't bounded to 64KB
    anymore.  The match size should be as large as possible up front and
    only then should multiple block copies be encoded to cover it all.
    Also, fewer hash lookups will be performed in the end.
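
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from diff-delta.c: the names `match_length`, `split_match`, `struct copy_op` and `MAX_COPY` are hypothetical, and the copy operations are kept as plain records rather than written in the actual delta encoding. The only detail taken from the message is the 64KB limit per block copy.

```c
#include <assert.h>
#include <stddef.h>

/* 64KB limit per block copy, as stated in the commit message. */
#define MAX_COPY 0x10000

/* Hypothetical record of one block copy (not the real delta encoding). */
struct copy_op {
    size_t offset;  /* start offset in the reference buffer */
    size_t size;    /* number of bytes to copy, at most MAX_COPY */
};

/*
 * Measure how far the reference and target buffers agree.
 * The comparison is bounded only by the buffer lengths, not by 64KB.
 */
static size_t match_length(const unsigned char *ref, size_t ref_len,
                           const unsigned char *tgt, size_t tgt_len)
{
    size_t max = ref_len < tgt_len ? ref_len : tgt_len;
    size_t n = 0;

    while (n < max && ref[n] == tgt[n])
        n++;
    return n;
}

/*
 * Cover one match of msize bytes with consecutive block copies of at
 * most MAX_COPY bytes each. No hash lookup is needed between chunks.
 * Returns the number of records written; stops early if ops is full.
 */
static size_t split_match(size_t offset, size_t msize,
                          struct copy_op *ops, size_t max_ops)
{
    size_t count = 0;

    assert(ops != NULL || max_ops == 0);

    while (msize > 0 && count < max_ops) {
        size_t chunk = msize < MAX_COPY ? msize : MAX_COPY;

        ops[count].offset = offset;
        ops[count].size = chunk;
        count++;

        offset += chunk;
        msize -= chunk;
    }
    return count;
}
```

With this shape, a single hash hit followed by one long comparison yields all the copy records for the match, so a sparsely indexed reference buffer no longer cuts the match short at the first 64KB boundary.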
    
    According to Martin, this patch should reduce his 92MB pack down to 75MB
    with the dataset he has.
    
    Tests performed on the Linux kernel repo show a slightly smaller pack and
    a slightly faster repack.
    Signed-off-by: Nicolas Pitre <nico@cam.org>
    Signed-off-by: Junio C Hamano <junkio@cox.net>
diff-delta.c 13.8 KB