1. 27 8月, 2020 1 次提交
    • J
      revision: set rev_input_given in handle_revision_arg() · 04a0e985
      Jeff King 提交于
      Commit 7ba82629 (revision: add rev_input_given flag, 2017-08-02) added
      a flag to rev_info to tell whether we got any revision arguments. As
      explained there, this is necessary because some revision arguments may
      not produce any pending traversal objects, but should still inhibit
      default behaviors (e.g., a glob that matches nothing).
      
      However, it only set the flag in the globbing code, but not for
      revisions we get on the command-line or via stdin. This leads to two
      problems:
      
        - the command-line code keeps its own separate got_rev_arg flag; this
          isn't wrong, but it's confusing and an extra maintenance burden
      
        - even specifically-named rev arguments might end up not adding any
          pending objects: if --ignore-missing is set, then specifying a
          missing object is a noop rather than an error.
      
      And that leads to some user-visible bugs:
      
        - when deciding whether a default rev like "HEAD" should kick in, we
          check both got_rev_arg and rev_input_given. That means that
          "--ignore-missing $ZERO_OID" works on the command-line (where we set
          got_rev_arg) but not on --stdin (where we don't)
      
        - when rev-list decides whether it should complain that it wasn't
          given a starting point, it relies on rev_input_given. So it can't
          even get the command-line "--ignore-missing $ZERO_OID" right
      
      Let's consistently set the flag if we got any revision argument. That
      lets us clean up the redundant got_rev_arg, and fixes both of those bugs
      (but note there are three new tests: we'll confirm the already working
      git-log command-line case).
      
      A few implementation notes:
      
        - conceptually we want to set the flag whenever handle_revision_arg()
          finds an actual revision arg ("handles" it, you might say). But it
          covers a ton of cases with early returns. Rather than annotating
          each one, we just wrap it and use its success exit-code to set the
          flag in one spot.
      
        - the new rev-list test is in t6018, which is titled to cover globs.
          This isn't exactly a glob, but it made sense to stick it with the
          other tests that handle the "even though we got a rev, we have no
          pending objects" case, which are globs.
      
        - the tests check for the oid of a missing object, which it's pretty
          clear --ignore-missing should ignore. You can see the same behavior
          with "--ignore-missing a-ref-that-does-not-exist", because
          --ignore-missing treats them both the same. That's perhaps less
          clearly correct, and we may want to change that in the future. But
          the way the code and tests here are written, we'd continue to do the
          right thing even if it does.
      Reported-by: NBryan Turner <bturner@atlassian.com>
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      04a0e985
  2. 12 8月, 2020 1 次提交
  3. 08 8月, 2020 1 次提交
  4. 05 8月, 2020 3 次提交
    • S
      revision: fix die() message for "--unpacked=" · f649aaaf
      Sergey Organov 提交于
      Get rid of the trailing dot and mark for translation.
      Signed-off-by: NSergey Organov <sorganov@gmail.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      f649aaaf
    • J
      revision: avoid leak when preparing bloom filter for "/" · 398e659e
      Jeff King 提交于
      If we're given an empty pathspec, we refuse to set up bloom filters, as
      described in f3c2a368 (revision: empty pathspecs should not use Bloom
      filters, 2020-07-01).
      
      But before the empty string check, we drop any trailing slash by
      allocating a new string without it. So a pathspec consisting only of "/"
      will allocate that string, but then still cause us to bail, leaking the
      new string. Let's make sure to free it.
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      398e659e
    • J
      revision: avoid out-of-bounds read/write on empty pathspec · fd9a631c
      Jeff King 提交于
      Running t4216 with ASan results in it complaining of an out-of-bounds
      read in prepare_to_use_bloom_filter(). The issue is this code to strip a
      trailing slash:
      
        last_index = pi->len - 1;
        if (pi->match[last_index] == '/') {
      
      because we have no guarantee that pi->len isn't zero. This can happen if
      the pathspec is ".", as we translate that to an empty string. And if
      that read of random memory does trigger the conditional, we'd then do an
      out-of-bounds write:
      
        path_alloc = xstrdup(pi->match);
        path_alloc[last_index] = '\0';
      
      Let's make sure to check the length before subtracting. Note that for an
      empty pathspec, we'd end up bailing from the function a few lines later,
      which makes it tempting to just:
      
        if (!pi->len)
                return;
      
      early here. But our code here is stripping a trailing slash, and we need
      to check for emptiness after stripping that slash, too. So we'd have two
      blocks, which would require repeating some cleanup code.
      
      Instead, just skip the trailing-slash for an empty string. Setting
      last_index at all in the case is awkward since it will have a nonsense
      value (and it uses an "int", which is a too-small type for a string
      anyway). So while we're here, let's:
      
        - drop last_index entirely; it's only used in two spots right next to
          each other and writing out "pi->len - 1" in both is actually easier
          to follow
      
        - use xmemdupz() to duplicate the string. This is slightly more
          efficient, but more importantly makes the intent more clear by
          allocating the correct-sized substring in the first place. It also
          eliminates any question of whether path_alloc is as long as
          pi->match (which it would not be if pi->match has any embedded NULs,
          though in practice this is probably impossible).
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      fd9a631c
  5. 31 7月, 2020 1 次提交
    • J
      strvec: rename struct fields · d70a9eb6
      Jeff King 提交于
      The "argc" and "argv" names made sense when the struct was argv_array,
      but now they're just confusing. Let's rename them to "nr" (which we use
      for counts elsewhere) and "v" (which is rather terse, but reads well
      when combined with typical variable names like "args.v").
      
      Note that we have to update all of the callers immediately. Playing
      tricks with the preprocessor is hard here, because we wouldn't want to
      rewrite unrelated tokens.
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      d70a9eb6
  6. 30 7月, 2020 1 次提交
    • J
      revision: add "--no-diff-merges" option to counteract "-m" · 6fae74b4
      Jeff King 提交于
      The "-m" option sets revs->ignore_merges to "0", but there's no way to
      undo it. This probably isn't something anybody overly cares about, since
      "1" is already the default, but it will serve as an escape hatch when we
      flip the default for ignore_merges to "0" in more situations.
      
      We'll also add a few extra niceties:
      
        - initialize the value to "-1" to indicate "not set", and then resolve
          it to the normal 0/1 bool in setup_revisions(). This lets any tweak
          functions, as well as setup_revisions() itself, avoid clobbering the
          user's preference (which until now they couldn't actually express).
      
        - since we now have --no-diff-merges, let's add the matching
          --diff-merges, which is just a synonym for "-m". Then we don't even
          need to document --no-diff-merges separately; it countermands the
          long form of "-m" in the usual way.
      
      The new test shows that this behaves just the same as the current
      behavior without "-m".
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      6fae74b4
  7. 29 7月, 2020 2 次提交
    • J
      strvec: convert remaining callers away from argv_array name · c972bf4c
      Jeff King 提交于
      We eventually want to drop the argv_array name and just use strvec
      consistently. There's no particular reason we have to do it all at once,
      or care about interactions between converted and unconverted bits.
      Because of our preprocessor compat layer, the names are interchangeable
      to the compiler (so even a definition and declaration using different
      names is OK).
      
      This patch converts all of the remaining files, as the resulting diff is
      reasonably sized.
      
      The conversion was done purely mechanically with:
      
        git ls-files '*.c' '*.h' |
        xargs perl -i -pe '
          s/ARGV_ARRAY/STRVEC/g;
          s/argv_array/strvec/g;
        '
      
      We'll deal with any indentation/style fallouts separately.
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      c972bf4c
    • J
      strvec: rename files from argv-array to strvec · dbbcd44f
      Jeff King 提交于
      This requires updating #include lines across the code-base, but that's
      all fairly mechanical, and was done with:
      
        git ls-files '*.c' '*.h' |
        xargs perl -i -pe 's/argv-array.h/strvec.h/'
      Signed-off-by: NJeff King <peff@peff.net>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      dbbcd44f
  8. 07 7月, 2020 1 次提交
  9. 02 7月, 2020 4 次提交
    • S
      commit-graph: check all leading directories in changed path Bloom filters · c525ce95
      SZEDER Gábor 提交于
      The file 'dir/subdir/file' can only be modified if its leading
      directories 'dir' and 'dir/subdir' are modified as well.
      
      So when checking modified path Bloom filters looking for commits
      modifying a path with multiple path components, then check not only
      the full path in the Bloom filters, but all its leading directories as
      well.  Take care to check these paths in "deepest first" order,
      because it's the full path that is least likely to be modified, and
      the Bloom filter queries can short circuit sooner.
      
      This can significantly reduce the average false positive rate, by
      about an order of magnitude or three(!), and can further speed up
      pathspec-limited revision walks.  The table below compares the average
      false positive rate and runtime of
      
        git rev-list HEAD -- "$path"
      
      before and after this change for 5000+ randomly* selected paths from
      each repository:
      
                          Average false           Average        Average
                          positive rate           runtime        runtime
                        before     after     before     after   difference
        ------------------------------------------------------------------
        git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
        linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
        tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
      
      *Path selection was done with the following pipeline:
      
      	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
      
      The improvements in runtime are much smaller than the improvements in
      average false positive rate, as we are clearly reaching diminishing
      returns here.  However, all these timings depend on that accessing
      tree objects is reasonably fast (warm caches).  If we had a partial
      clone and the tree objects had to be fetched from a promisor remote,
      e.g.:
      
        $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
        $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
              commit-graph write --reachable
        $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
        $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
              rev-list HEAD -- "$path"
      
      then checking all leading path component can reduce the runtime from
      over an hour to a few seconds (and this is with the clone and the
      promisor on the same machine).
      
      This adjusts the tracing values in t4216-log-bloom.sh, which provides a
      concrete way to notice the improvement.
      Helped-by: NTaylor Blau <me@ttaylorr.com>
      Helped-by: NRené Scharfe <l.s.r@web.de>
      Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      c525ce95
    • T
      revision: empty pathspecs should not use Bloom filters · f3c2a368
      Taylor Blau 提交于
      The prepare_to_use_bloom_filter() method was not intended to be called
      on an empty pathspec. However, 'git log -- .' and 'git log' are subtly
      different: the latter reports all commits while the former will simplify
      commits that do not change the root tree.
      
      This means that the path used to construct the bloom_key might be empty,
      and that value is not added to the Bloom filter during construction.
      That means that the results are likely incorrect!
      
      To resolve the issue, be careful about the length of the path and stop
      filling Bloom filters. To be completely sure we do not use them, drop
      the pointer to the bloom_filter_settings from the commit-graph. That
      allows our test to look at the trace2 logs to verify no Bloom filter
      statistics are reported.
      Signed-off-by: NTaylor Blau <me@ttaylorr.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      f3c2a368
    • D
      revision.c: fix whitespace · dc8e95ba
      Derrick Stolee 提交于
      Here, four spaces were used instead of tab characters.
      Reported-by: NTaylor Blau <me@ttaylorr.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      dc8e95ba
    • D
      bloom: fix logic in get_bloom_filter() · 94919742
      Derrick Stolee 提交于
      The get_bloom_filter() method is a bit complicated in some parts where
      it does not need to be. In particular, it needs to return a NULL filter
      only when compute_if_not_present is zero AND the filter data cannot be
      loaded from a commit-graph file. This currently happens by accident
      because the commit-graph does not load changed-path Bloom filters from
      an existing commit-graph when writing a new one. This will change in a
      later patch.
      
      Also clean up some style issues while we are here.
      
      One side-effect of returning a NULL filter is that the filters that are
      reported as "too large" will now be reported as NULL insead of length
      zero. This case was not properly covered before, so add a test. Further,
      remote the counting of the zero-length filters from revision.c and the
      trace2 logs.
      Helped-by: NRené Scharfe <l.s.r@web.de>
      Helped-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      94919742
  10. 23 6月, 2020 1 次提交
  11. 18 6月, 2020 2 次提交
  12. 09 6月, 2020 1 次提交
    • S
      diff.h: drop diff_tree_oid() & friends' return value · 0ee3cb88
      SZEDER Gábor 提交于
      ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
      is basically useless.  It's only caller diff_tree_oid() has only ever
      returned the return value of ll_diff_tree_oid() as-is [2], so its
      return value is just as useless.  Most of diff_tree_oid()'s callers
      simply ignore its return value, except:
      
        - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
          returns with its return value, but all of diff_root_tree_oid()'s
          callers ignore its return value.
      
        - rev_compare_tree() and rev_same_tree_as_empty() do look at the
          return value in a condition, but, since the return value is always
          0, the former's < 0 condition is never fulfilled, while the
          latter's >= 0 condition is always fulfilled.
      
      So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
      and diff_root_tree_oid(), and drop those conditions from
      rev_compare_tree() and rev_same_tree_as_empty() as well.
      
      [1] ll_diff_tree_oid() and its ancestors have been returning only 0
          ever since it was introduced as diff_tree() in 9174026c (Add
          "diff-tree" program to show which files have changed between two
          trees., 2005-04-09).
      [2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026c as
          well.
      Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      0ee3cb88
  13. 12 5月, 2020 3 次提交
    • D
      line-log: integrate with changed-path Bloom filters · f32dde8c
      Derrick Stolee 提交于
      The previous changes to the line-log machinery focused on making the
      first result appear faster. This was achieved by no longer walking the
      entire commit history before returning the early results. There is still
      another way to improve the performance: walk most commits much faster.
      Let's use the changed-path Bloom filters to reduce time spent computing
      diffs.
      
      Since the line-log computation requires opening blobs and checking the
      content-diff, there is still a lot of necessary computation that cannot
      be replaced with changed-path Bloom filters. The part that we can reduce
      is most effective when checking the history of a file that is deep in
      several directories and those directories are modified frequently. In
      this case, the computation to check if a commit is TREESAME to its first
      parent takes a large fraction of the time. That is ripe for improvement
      with changed-path Bloom filters.
      
      We must ensure that prepare_to_use_bloom_filters() is called in
      revision.c so that the bloom_filter_settings are loaded into the struct
      rev_info from the commit-graph. Of course, some cases are still
      forbidden, but in the line-log case the pathspec is provided in a
      different way than normal.
      
      Since multiple paths and segments could be requested, we compute the
      struct bloom_key data dynamically during the commit walk. This could
      likely be improved, but adds code complexity that is not valuable at
      this time.
      
      There are two cases to care about: merge commits and "ordinary" commits.
      Merge commits have multiple parents, but if we are TREESAME to our first
      parent in every range, then pass the blame for all ranges to the first
      parent. Ordinary commits have the same condition, but each is done
      slightly differently in the process_ranges_[merge|ordinary]_commit()
      methods. By checking if the changed-path Bloom filter can guarantee
      TREESAME, we can avoid that tree-diff cost. If the filter says "probably
      changed", then we need to run the tree-diff and then the blob-diff if
      there was a real edit.
      
      The Linux kernel repository is a good testing ground for the performance
      improvements claimed here. There are two different cases to test. The
      first is the "entire history" case, where we output the entire history
      to /dev/null to see how long it would take to compute the full line-log
      history. The second is the "first result" case, where we find how long
      it takes to show the first value, which is an indicator of how quickly a
      user would see responses when waiting at a terminal.
      
      To test, I selected the paths that were changed most frequently in the
      top 10,000 commits using this command (stolen from StackOverflow [1]):
      
      	git log --pretty=format: --name-only -n 10000 | sort | \
      		uniq -c | sort -rg | head -10
      
      which results in
      
          121 MAINTAINERS
           63 fs/namei.c
           60 arch/x86/kvm/cpuid.c
           59 fs/io_uring.c
           58 arch/x86/kvm/vmx/vmx.c
           51 arch/x86/kvm/x86.c
           45 arch/x86/kvm/svm.c
           42 fs/btrfs/disk-io.c
           42 Documentation/scsi/index.rst
      
      (along with a bogus first result). It appears that the path
      arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
      following results for the real command time:
      
      |                              | Entire History  | First Result    |
      | Path                         | Before | After  | Before | After  |
      |------------------------------|--------|--------|--------|--------|
      | MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
      | fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
      | arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
      | fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
      | arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
      | arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
      | fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
      | Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
      
      It is worth noting that the least speedup comes for the MAINTAINERS file
      which is
      
       * edited frequently,
       * low in the directory heirarchy, and
       * quite a large file.
      
      All of those points lead to spending more time doing the blob diff and
      less time doing the tree diff. Still, we see some improvement in that
      case and significant improvement in other cases. A 2-4x speedup is
      likely the more typical case as opposed to the small 5% change for that
      file.
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      f32dde8c
    • S
      line-log: try to use generation number-based topo-ordering · 002933f3
      SZEDER Gábor 提交于
      The previous patch made it possible to perform line-level filtering
      during history traversal instead of in an expensive preprocessing
      step, but it still requires some simpler preprocessing steps, notably
      topo-ordering.  However, nowadays we have commit-graphs storing
      generation numbers, which make it possible to incrementally traverse
      the history in topological order, without the preparatory limit_list()
      and sort_in_topological_order() steps; see b4542418 (revision.c:
      generation-based topo-order algorithm, 2018-11-01).
      
      This patch combines the two, so we can do both the topo-ordering and
      the line-level filtering during history traversal, eliminating even
      those simpler preprocessing steps, and thus further reducing the delay
      before showing the first commit modifying the given line range.
      
      The 'revs->limited' flag plays the central role in this, because, due
      to limitations of the current implementation, the generation
      number-based topo-ordering is only enabled when this flag remains
      unset.  Line-level log, however, always sets this flag in
      setup_revisions() ever since the feature was introduced in 12da1d1f
      (Implement line-history search (git log -L), 2013-03-28).  The reason
      for setting 'limited' is unclear, though, because the line-level log
      itself doesn't directly depend on it, and it doesn't affect how the
      limit_list() function limits the revision range.  However, there is an
      indirect dependency: the line-level log requires topo-ordering, and
      the "traditional" sort_in_topological_order() requires an already
      limited commit list since e6c3505b (Make sure we generate the whole
      commit list before trying to sort it topologically, 2005-07-06).  The
      new, generation numbers-based topo-ordering doesn't require a limited
      commit list anymore.
      
      So don't set 'revs->limited' for line-level log, unless it is really
      necessary, namely:
      
        - The user explicitly requested parent rewriting, because that is
          still done in the line_log_filter() preprocessing step (see
          previous patch), which requires sort_in_topological_order() and in
          turn limit_list() as well.
      
        - A commit-graph file is not available or it doesn't yet contain
          generation numbers.  In these cases we had to fall back on
          sort_in_topological_order() and in turn limit_list().  The
          existing condition with generation_numbers_enabled() has already
          ensured that the 'limited' flag is set in these cases; this patch
          just makes sure that the line-level log sets 'revs->topo_order'
          before that condition.
      
      While the reduced delay before showing the first commit is measurable
      in git.git, it takes a bigger repository to make it clearly noticable.
      In both cases below the line ranges were chosen so that they were
      modified rather close to the starting revisions, so the effect of this
      change is most noticable.
      
        # git.git
        $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
      
        Before:
      
          real    0m0.107s
          user    0m0.091s
          sys     0m0.013s
      
        After:
      
          real    0m0.058s
          user    0m0.050s
          sys     0m0.005s
      
        # linux.git
        $ time git --no-pager log \
          -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2
      
        Before:
      
          real   0m1.129s
          user   0m1.061s
          sys    0m0.069s
      
        After:
      
          real   0m0.096s
          user   0m0.087s
          sys    0m0.009s
      
      Additional testing by Derrick Stolee: Since this patch improves
      the performance for the first result, I repeated the experiment
      from the previous patch on the Linux kernel repository, reporting
      real time here:
      
          Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
           Before: 0.71 s
            After: 0.05 s
      
      Now, we have dropped the full topo-order of all ~910,000 commits
      before reporting the first result. The remaining performance
      improvements then are:
      
       1. Update the parent-rewriting logic to be incremental similar to
          how "git log --graph" behaves.
      
       2. Use changed-path Bloom filters to reduce the time spend in the
          tree-diff to see if the path(s) changed.
      Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      002933f3
    • S
      line-log: more responsive, incremental 'git log -L' · 3cb9d2b6
      SZEDER Gábor 提交于
      The current line-level log implementation performs a preprocessing
      step in prepare_revision_walk(), during which the line_log_filter()
      function filters and rewrites history to keep only commits modifying
      the given line range.  This preprocessing affects both responsiveness
      and correctness:
      
        - Git doesn't produce any output during this preprocessing step.
          Checking whether a commit modified the given line range is
          somewhat expensive, so depending on the size of the given revision
          range this preprocessing can result in a significant delay before
          the first commit is shown.
      
        - Limiting the number of displayed commits (e.g. 'git log -3 -L...')
          doesn't limit the amount of work during preprocessing, because
          that limit is applied during history traversal.  Alas, by that
          point this expensive preprocessing step has already churned
          through the whole revision range to find all commits modifying the
          revision range, even though only a few of them need to be shown.
      
        - It rewrites parents, with no way to turn it off.  Without the user
          explicitly requesting parent rewriting any parent object ID shown
          should be that of the immediate parent, just like in case of a
          pathspec-limited history traversal without parent rewriting.
      
          However, after that preprocessing step rewrote history, the
          subsequent "regular" history traversal (i.e. get_revision() in a
          loop) only sees commits modifying the given line range.
          Consequently, it can only show the object ID of the last ancestor
          that modified the given line range (which might happen to be the
          immediate parent, but many-many times it isn't).
      
      This patch addresses both the correctness and, at least for the common
      case, the responsiveness issues by integrating line-level log
      filtering into the regular revision walking machinery:
      
        - Make process_ranges_arbitrary_commit(), the static function in
          'line-log.c' deciding whether a commit modifies the given line
          range, public by removing the static keyword and adding the
          'line_log_' prefix, so it can be called from other parts of the
          revision walking machinery.
      
        - If the user didn't explicitly ask for parent rewriting (which, I
          believe, is the most common case):
      
          - Call this now-public function during regular history traversal,
            namely from get_commit_action() to ignore any commits not
            modifying the given line range.
      
            Note that while this check is relatively expensive, it must be
            performed before other, much cheaper conditions, because the
            tracked line range must be adjusted even when the commit will
            end up being ignored by other conditions.
      
          - Skip the line_log_filter() call, i.e. the expensive
            preprocessing step, in prepare_revision_walk(), because, thanks
            to the above points, the revision walking machinery is now able
            to filter out commits not modifying the given line range while
            traversing history.
      
            This way the regular history traversal sees the unmodified
            history, and is therefore able to print the object ids of the
            immediate parents of the listed commits.  The eliminated
            preprocessing step can greatly reduce the delay before the first
            commit is shown, see the numbers below.
      
        - However, if the user did explicitly ask for parent rewriting via
          '--parents' or a similar option, then stick with the current
          implementation for now, i.e. perform that expensive filtering and
          history rewriting in the preprocessing step just like we did
          before, leaving the initial delay as long as it was.
      
      I tried to integrate line-level log filtering with parent rewriting
      into the regular history traversal, but, unfortunately, several
      subtleties resisted... :)  Maybe someday we'll figure out how to do
      that, but until then at least the simple and common (i.e. without
      parent rewriting) 'git log -L:func:file' commands can benefit from the
      reduced delay.
      
      This change makes the failing 'parent oids without parent rewriting'
      test in 't4211-line-log.sh' succeed.
      
      The reduced delay is most noticable when there's a commit modifying
      the line range near the tip of a large-ish revision range:
      
        # no parent rewriting requested, no commit-graph present
        $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0
      
        Before:
      
          real    0m9.570s
          user    0m9.494s
          sys     0m0.076s
      
        After:
      
          real    0m0.718s
          user    0m0.674s
          sys     0m0.044s
      
      A significant part of the remaining delay is spent reading and parsing
      commit objects in limit_list().  With the help of the commit-graph we
      can eliminate most of that reading and parsing overhead, so here are
      the timing results of the same command as above, but this time using
      the commit-graph:
      
        Before:
      
          real    0m8.874s
          user    0m8.816s
          sys     0m0.057s
      
        After:
      
          real    0m0.107s
          user    0m0.091s
          sys     0m0.013s
      
      The next patch will further reduce the remaining delay.
      
      To be clear: this patch doesn't actually optimize the line-level log,
      but merely moves most of the work from the preprocessing step to the
      history traversal, so the commits modifying the line range can be
      shown as soon as they are processed, and the traversal can be
      terminated as soon as the given number of commits are shown.
      Consequently, listing the full history of a line range, potentially
      all the way to the root commit, will take the same time as before (but
      at least the user might start reading the output earlier).
      Furthermore, if the most recent commit modifying the line range is far
      away from the starting revision, then that initial delay will still be
      significant.
      
      Additional testing by Derrick Stolee: In the Linux kernel repository,
      the MAINTAINERS file was changed ~3,500 times across the ~915,000
      commits. In addition to that edit frequency, the file itself is quite
      large (~18,700 lines). This means that a significant portion of the
      computation is taken up by computing the patch-diff of the file. This
      patch improves the real time it takes to output the first result quite
      a bit:
      
      Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
       Before: 3.88 s
        After: 0.71 s
      
      If we drop the "-n 1" in the command, then there is no change in
      end-to-end process time. This is because the command still needs to
      walk the entire commit history, which negates the point of this
      patch. This is expected.
      
      As a note for future reference, the ~4.3 seconds in the old code
      spends ~2.6 seconds computing the patch-diffs, and the rest of the
      time is spent walking commits and computing diffs for which paths
      changed at each commit. The changed-path Bloom filters could improve
      the end-to-end computation time (i.e. no "-n 1" in the command).
      Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      3cb9d2b6
  14. 17 4月, 2020 1 次提交
    • D
      revision: complicated pathspecs disable filters · 8918e379
      Derrick Stolee 提交于
      The changed-path Bloom filters work only when we can compute an
      explicit Bloom filter key in advance. When a pathspec is given
      that allows case-insensitive checks or wildcard matching, we
      must disable the Bloom filter performance checks.
      
      By checking the pathspec in prepare_to_use_bloom_filters(), we
      avoid setting up the Bloom filter data and thus revert to the
      usual logic.
      
      Before this change, the following tests would fail*:
      
      	t6004-rev-list-path-optim.sh (Tests 6-7)
      	t6130-pathspec-noglob.sh (Tests 3-6)
      	t6131-pathspec-icase.sh (Tests 3-5)
      
      *These tests would fail when using GIT_TEST_COMMIT_GRAPH and
      GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
      environment variable was not set up correctly to write the changed-
      path Bloom filters in the test suite. That will be fixed in the
      next change.
      Helped-by: NTaylor Blau <me@ttaylorr.com>
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      8918e379
  15. 11 4月, 2020 1 次提交
    • D
      revision: --show-pulls adds helpful merges · 8d049e18
      Derrick Stolee 提交于
      The default file history simplification of "git log -- <path>" or
      "git rev-list -- <path>" focuses on providing the smallest set of
      commits that first contributed a change. The revision walk greatly
      restricts the set of walked commits by visiting only the first
      TREESAME parent of a merge commit, when one exists. This means
      that portions of the commit-graph are not walked, which can be a
      performance benefit, but can also "hide" commits that added changes
      but were ignored by a merge resolution.
      
      The --full-history option modifies this by walking all commits and
      reporting a merge commit as "interesting" if it has _any_ parent
      that is not TREESAME. This tends to be an over-representation of
      important commits, especially in an environment where most merge
      commits are created by pull request completion.
      
      Suppose we have a commit A and we create a commit B on top that
      changes our file. When we merge the pull request, we create a merge
      commit M. If no one else changed the file in the first-parent
      history between M and A, then M will not be TREESAME to its first
      parent, but will be TREESAME to B. Thus, the simplified history
      will be "B". However, M will appear in the --full-history mode.
      
      However, suppose that a number of topics T1, T2, ..., Tn were
      created based on commits C1, C2, ..., Cn between A and M as
      follows:
      
        A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
         \     \     \            \  /      /    /            /
          \     \__.. \            \/ ..__T1    /           Tn
           \           \__..       /\     ..__T2           /
            \_____________________B  \____________________/
      
      If the commits T1, T2, ... Tn did not change the file, then all of
      P1 through Pn will be TREESAME to their first parent, but not
      TREESAME to their second. This means that all of those merge commits
      appear in the --full-history view, with edges that immediately
      collapse into the lower history without introducing interesting
      single-parent commits.
      
      The --simplify-merges option was introduced to remove these extra
      merge commits. By noticing that the rewritten parents are reachable
      from their first parents, those edges can be simplified away. Finally,
      the commits now look like single-parent commits that are TREESAME to
      their "only" parent. Thus, they are removed and this issue does not
      cause issues anymore. However, this also ends up removing the commit
      M from the history view! Even worse, the --simplify-merges option
      requires walking the entire history before returning a single result.
      
      Many Git users are using Git alongside a Git service that provides
      code storage alongside a code review tool commonly called "Pull
      Requests" or "Merge Requests" against a target branch.  When these
      requests are accepted and merged, they typically create a merge
      commit whose first parent is the previous branch tip and the second
      parent is the tip of the topic branch used for the request. This
      presents a valuable order to the parents, but also makes that merge
      commit slightly special. Users may want to see not only which
      commits changed a file, but which pull requests merged those commits
      into their branch. In the previous example, this would mean the
      users want to see the merge commit "M" in addition to the single-
      parent commit "C".
      
      Users are even more likely to want these merge commits when they
      use pull requests to merge into a feature branch before merging that
      feature branch into their trunk.
      
      In some sense, users are asking for the "first" merge commit to
      bring in the change to their branch. As long as the parent order is
      consistent, this can be handled with the following rule:
      
        Include a merge commit if it is not TREESAME to its first
        parent, but is TREESAME to a later parent.
      
      These merges look like the merge commits that would result from
      running "git pull <topic>" on a main branch. Thus, the option to
      show these commits is called "--show-pulls". This has the added
      benefit of showing the commits created by closing a pull request or
      merge request on any of the Git hosting and code review platforms.
      
      To test these options, extend the standard test example to include
      a merge commit that is not TREESAME to its first parent. It is
      surprising that that option was not already in the example, as it
      is instructive.
      
      In particular, this extension demonstrates a common issue with file
      history simplification. When a user resolves a merge conflict using
      "-Xours" or otherwise ignoring one side of the conflict, they create
      a TREESAME edge that probably should not be TREESAME. This leads
      users to become frustrated and complain that "my change disappeared!"
      In my experience, showing them history with --full-history and
      --simplify-merges quickly reveals the problematic merge. As mentioned,
      this option is expensive to compute. The --show-pulls option
      _might_ show the merge commit (usually titled "resolving conflicts")
      more quickly. Of course, this depends on the user having the correct
      parent order, which is backwards when using "git pull master" from a
      topic branch.
      
      There are some special considerations when combining the --show-pulls
      option with --simplify-merges. This requires adding a new PULL_MERGE
      object flag to store the information from the initial TREESAME
      comparisons. This helps avoid dropping those commits in later filters.
      This is covered by a test, including how the parents can be simplified.
      Since "struct object" has already ruined its 32-bit alignment by using
      33 bits across parsed, type, and flags member, let's not make it worse.
      PULL_MERGE is used in revision.c with the same value (1u<<15) as
      REACHABLE in commit-graph.c. The REACHABLE flag is only used when
      writing a commit-graph file, and a revision walk using --show-pulls
      does not happen in the same process. Care must be taken in the future
      to ensure this remains the case.
      
      Update Documentation/rev-list-options.txt with significant details
      around this option. This requires updating the example in the
      History Simplification section to demonstrate some of the problems
      with TREESAME second parents.
      Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      8d049e18
  16. 08 4月, 2020 1 次提交
    • E
      format-patch: teach --no-encode-email-headers · 19d097e3
      Emma Brooks 提交于
      When commit subjects or authors have non-ASCII characters, git
      format-patch Q-encodes them so they can be safely sent over email.
      However, if the patch transfer method is something other than email (web
      review tools, sneakernet), this only serves to make the patch metadata
      harder to read without first applying it (unless you can decode RFC 2047
      in your head). git am as well as some email software supports
      non-Q-encoded mail as described in RFC 6531.
      
      Add --[no-]encode-email-headers and format.encodeEmailHeaders to let the
      user control this behavior.
      Signed-off-by: NEmma Brooks <me@pluvano.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      19d097e3
  17. 07 4月, 2020 2 次提交
    • G
      revision.c: add trace2 stats around Bloom filter usage · 42e50e78
      Garima Singh 提交于
      Add trace2 statistics around Bloom filter usage and behavior
      for 'git log -- path' commands that are hoping to benefit from
      the presence of computed changed paths Bloom filters.
      
      These statistics are great for performance analysis work and
      for formal testing, which we will see in the commit following
      this one.
      
      Helped-by: Derrick Stolee <dstolee@microsoft.com
      Helped-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Helped-by: NJonathan Tan <jonathantanmy@google.com>
      Signed-off-by: NGarima Singh <garima.singh@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      42e50e78
    • G
      revision.c: use Bloom filters to speed up path based revision walks · a56b9464
      Garima Singh 提交于
      Revision walk will now use Bloom filters for commits to speed up
      revision walks for a particular path (for computing history for
      that path), if they are present in the commit-graph file.
      
      We load the Bloom filters during the prepare_revision_walk step,
      currently only when dealing with a single pathspec. Extending
      it to work with multiple pathspecs can be explored and built on
      top of this series in the future.
      
      While comparing trees in rev_compare_trees(), if the Bloom filter
      says that the file is not different between the two trees, we don't
      need to compute the expensive diff. This is where we get our
      performance gains. The other response of the Bloom filter is '`:maybe',
      in which case we fall back to the full diff calculation to determine
      if the path was changed in the commit.
      
      We do not try to use Bloom filters when the '--walk-reflogs' option
      is specified. The '--walk-reflogs' option does not walk the commit
      ancestry chain like the rest of the options. Incorporating the
      performance gains when walking reflog entries would add more
      complexity, and can be explored in a later series.
      
      Performance Gains:
      We tested the performance of `git log -- <path>` on the git repo, the linux
      and some internal large repos, with a variety of paths of varying depths.
      
      On the git and linux repos:
      - we observed a 2x to 5x speed up.
      
      On a large internal repo with files seated 6-10 levels deep in the tree:
      - we observed 10x to 20x speed ups, with some paths going up to 28 times
        faster.
      
      Helped-by: Derrick Stolee <dstolee@microsoft.com
      Helped-by: NSZEDER Gábor <szeder.dev@gmail.com>
      Helped-by: NJonathan Tan <jonathantanmy@google.com>
      Signed-off-by: NGarima Singh <garima.singh@microsoft.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      a56b9464
  18. 14 12月, 2019 1 次提交
    • D
      notes: break set_display_notes() into smaller functions · 1d729751
      Denton Liu 提交于
      In 8164c961 (format-patch: use --notes behavior for format.notes,
      2019-12-09), we introduced set_display_notes() which was a monolithic
      function with three mutually exclusive branches. Break the function up
      into three small and simple functions that each are only responsible for
      one task.
      
      This family of functions accepts an `int *show_notes` instead of
      returning a value suitable for assignment to `show_notes`. This is for
      two reasons. First of all, this guarantees that the external
      `show_notes` variable changes in lockstep with the
      `struct display_notes_opt`. Second, this prompts future developers to be
      careful about doing something meaningful with this value. In fact, a
      NULL check is intentionally omitted because causing a segfault here
      would tell the future developer that they are meant to use the value for
      something meaningful.
      
      One alternative was making the family of functions accept a
      `struct rev_info *` instead of the `struct display_notes_opt *`, since
      the former contains the `show_notes` field as well. This does not work
      because we have to call git_config() before repo_init_revisions().
      However, if we had a `struct rev_info`, we'd need to initialize it before
      it gets assigned values from git_config(). As a result, we break the
      circular dependency by having standalone `int show_notes` and
      `struct display_notes_opt notes_opt` variables which temporarily hold
      values from git_config() until the values are copied over to `rev`.
      
      To implement this change, we need to get a pointer to
      `rev_info::show_notes`. Unfortunately, this is not possible with
      bitfields and only direct-assignment is possible. Change
      `rev_info::show_notes` to a non-bitfield int so that we can get its
      address.
      Signed-off-by: NDenton Liu <liu.denton@gmail.com>
      Signed-off-by: NJunio C Hamano <gitster@pobox.com>
      1d729751
  19. 10 12月, 2019 2 次提交
  20. 25 11月, 2019 2 次提交
  21. 20 11月, 2019 1 次提交
  22. 07 10月, 2019 7 次提交