    Handling large files with GIT · 492e0759
    Committed by Linus Torvalds
    On Tue, 14 Feb 2006, Junio C Hamano wrote:
    
    > Linus Torvalds <torvalds@osdl.org> writes:
    >
    > > If somebody is interested in making the "lots of filename changes" case go
    > > fast, I'd be more than happy to walk them through what they'd need to
    > > change. I'm just not horribly motivated to do it myself. Hint, hint.
    >
    > In case anybody is wondering, I share the same feeling.  I
    > cannot say I'd be "more than happy to" clean up potential
    > breakages during the development of such changes, but if the
    > change eventually would help certain use cases, I can be
    > persuaded to help debugging such a mess ;-).
    
    Actually, I got interested in seeing how hard this is, and wrote a simple
    first cut at doing a tree-optimized merger.
    
    Let me shout a bit first:
    
      THIS IS WORKING CODE, BUT BE CAREFUL: IT'S A TECHNOLOGY DEMONSTRATION
      RATHER THAN THE FINAL PRODUCT!
    
    With that out of the way, let me describe what this does (and then describe
    the missing parts).
    
    This is basically a three-way merge that works entirely on the "tree"
    level, rather than on the index. A lot of the _concepts_ are the same,
    though, and if you're familiar with the results of an index merge, some of
    the output will make more sense.
    
    You give it three trees: the base tree (tree 0), and the two branches to
    be merged (tree 1 and tree 2 respectively). It will then walk these three
    trees, and resolve them as it goes along.
    
    The interesting part is:
     - it can resolve whole sub-directories in one go, without actually even
       looking recursively at them. A whole subdirectory will resolve the same
       way as any individual files will (although that may need some
       modification, see later).
     - if it has a "content conflict", for subdirectories that means "try to
       do a recursive tree merge", while for non-subdirectories it's just a
       content conflict and we'll output the stage 1/2/3 information.
     - a successful merge will output a single stage 0 ("merged") entry,
       potentially for a whole subdirectory.
     - it outputs all the resolve information on stdout, so something like the
       recursive resolver can pretty easily parse it all.
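The walk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from this patch: a tree is modelled as a dict mapping a name to either a blob id (a string) or a subtree (a nested dict), and `merge_trees` and the `(stage, path, entry)` tuples are hypothetical names that do not reproduce the real git-merge-tree output format.

```python
def merge_trees(base, ours, theirs, prefix=""):
    """Three-way merge of trees, yielding (stage, path, entry) tuples.

    Assumed model (not git's real object format): a tree is a dict that maps
    a name to a blob id (str) or to a subtree (dict). A name that is missing
    from a tree reads as None.
    """
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        path = prefix + name
        if o == t:
            resolved = o          # both branches agree
        elif b == o:
            resolved = t          # only branch 2 changed
        elif b == t:
            resolved = o          # only branch 1 changed
        elif (isinstance(o, dict) and isinstance(t, dict)
              and isinstance(b, (dict, type(None)))):
            # "Content conflict" on a subdirectory: merge it recursively.
            yield from merge_trees(b or {}, o, t, path + "/")
            continue
        else:
            # Content conflict on a non-directory: emit stages 1/2/3.
            for stage, entry in ((1, b), (2, o), (3, t)):
                if entry is not None:
                    yield (stage, path, entry)
            continue
        if resolved is not None:
            # A single stage 0 entry, possibly covering a whole subdirectory.
            yield (0, path, resolved)
```

Note that comparing two dicts in Python walks the whole subtree, whereas git compares the object names of the two trees, which is what makes resolving an entire subdirectory in one step cheap.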
    
    Now, the caveats:
     - we probably need to be more careful about subdirectory resolves. The
       trivial case (both branches have the exact same subdirectory) is a
       trivial resolve, but the other cases ("branch1 matches base, branch2 is
       different") probably can't just be silently resolved to the "branch2"
       subdirectory state, since they might involve renames into - or out of -
       that subdirectory.
     - we do not track the current index file at all, so this does not do the
       "check that index matches branch1" logic that the three-way merge in
       git-read-tree does. The theory is that we'd do a full three-way merge
       (ignoring the index and working directory), and then to update the
       working tree, we'd do a two-way "git-read-tree branch1->result"
     - I didn't actually make it do all the trivial resolve cases that
       git-read-tree does. It's a technology demonstration.
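One way to read the first caveat is as a more conservative decision rule for directories. The sketch below is an assumption about how such a rule could look, not what the patch does: `classify` is a hypothetical helper, entries are opaque object ids (or None for "absent"), and the caller is assumed to know whether the entry is a tree.

```python
def classify(base, ours, theirs, is_tree):
    """Decide how to handle one entry in a three-way merge.

    Returns ("resolved", entry), ("recurse", None) or ("conflict", None).
    Hypothetical rule: a subdirectory is taken as a whole only when both
    branches are identical; any one-sided directory change is descended
    into instead, so that renames into or out of it can still be noticed.
    """
    if ours == theirs:
        return ("resolved", ours)       # trivial: both branches agree
    if is_tree:
        return ("recurse", None)        # do not silently take one side
    if base == ours:
        return ("resolved", theirs)     # only branch 2 changed the file
    if base == theirs:
        return ("resolved", ours)       # only branch 1 changed the file
    return ("conflict", None)           # both changed it differently
```

The cost of this variant is that a one-sided subdirectory change is no longer resolved in a single step, which gives up part of the speed advantage in exchange for safety.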
    
    Finally (a more serious caveat):
     - doing things through stdout may end up being so expensive that we'd
       need to do something else. In particular, it's likely that I should
       not actually output the "merge results", but instead output a "merge
       results as they _differ_ from branch1"
    
    However, I think this patch is already interesting enough that people who
    are interested in merging trees might want to look at it. Please keep in
    mind that tech _demo_ part, and in particular, keep in mind the final
    "serious caveat" part.
    
    In many ways, the really _interesting_ part of a merge is not the result,
    but how it _changes_ the branch we're merging into. That's particularly
    important as it should hopefully also mean that the output size for any
    reasonable case is minimal (and tracks what we actually need to do to the
    current state to create the final result).
    
    The code very much is organized so that doing the result as a "diff
    against branch1" should be quite easy/possible. I was actually going to do
    it, but I decided that it probably makes the output harder to read. I
    dunno.
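To make the "diff against branch1" idea concrete, here is a minimal sketch under the same kind of assumptions as any toy model: trees are nested dicts mapping names to blob ids (strings) or subtrees (dicts), and `merge_delta` and its `"update"` / `"remove"` / `"conflict"` labels are hypothetical, not an output format that git defines. The point it illustrates is that every case where the result equals what branch1 already has produces no output at all.

```python
def merge_delta(base, ours, theirs, prefix=""):
    """Yield only what the merge changes relative to `ours` (branch1).

    Assumed model: a tree is a dict of name -> blob id (str) or subtree
    (dict); a missing name reads as None.
    """
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        path = prefix + name
        if o == t or b == t:
            # Result is exactly what branch1 has: nothing to report.
            continue
        if b == o:
            # Only branch2 changed this entry: branch1 must follow it.
            if t is None:
                yield ("remove", path, None)
            else:
                yield ("update", path, t)
        elif (isinstance(o, dict) and isinstance(t, dict)
              and isinstance(b, (dict, type(None)))):
            # Both sides changed a subdirectory: look inside it.
            yield from merge_delta(b or {}, o, t, path + "/")
        else:
            # Both sides changed the entry differently.
            yield ("conflict", path, (b, o, t))
```

For a merge where branch2 touched only a handful of paths, the output is proportional to those paths rather than to the size of the tree.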
    
    Anyway, let's think about this kind of approach.. Note how the code itself
    is actually quite small and short, although it's probably pretty "dense".
    
    As an interesting test-case, I'd suggest this merge in the kernel:
    
    	git-merge-tree $(git-merge-base 4cbf876 7d2babc) 4cbf876 7d2babc
    
    which resolves beautifully (there are no actual file-level conflicts), and
    you can look at the output of that command to start thinking about what
    it does.
    
    The interesting part (perhaps) is that timing that command for me shows
    that it takes all of 0.004 seconds.. (the git-merge-base thing takes
    considerably more ;)
    
    The point is, we _can_ do the actual merge part really really quickly.
    
    		Linus
    
    PS. Final note: when I say that it is "WORKING CODE", that is obviously by
    my standards. IOW, I tested it once and it gave reasonable results - so it
    must be perfect.
    
    Whether it works for anybody else, or indeed for any other test-case, is
    not my problem ;)
    Signed-off-by: Junio C Hamano <junkio@cox.net>