提交 · cacab0c8566084901cd0f8ed6c9e1c4a5dc63677 · 李少辉-开发者 / git

27 8月, 2020 1 次提交

revision: set rev_input_given in handle_revision_arg() · 04a0e985

由 Jeff King 提交于 8月 26, 2020

Commit 7ba82629 (revision: add rev_input_given flag, 2017-08-02) added
a flag to rev_info to tell whether we got any revision arguments. As
explained there, this is necessary because some revision arguments may
not produce any pending traversal objects, but should still inhibit
default behaviors (e.g., a glob that matches nothing).

However, it only set the flag in the globbing code, but not for
revisions we get on the command-line or via stdin. This leads to two
problems:

  - the command-line code keeps its own separate got_rev_arg flag; this
    isn't wrong, but it's confusing and an extra maintenance burden

  - even specifically-named rev arguments might end up not adding any
    pending objects: if --ignore-missing is set, then specifying a
    missing object is a noop rather than an error.

And that leads to some user-visible bugs:

  - when deciding whether a default rev like "HEAD" should kick in, we
    check both got_rev_arg and rev_input_given. That means that
    "--ignore-missing $ZERO_OID" works on the command-line (where we set
    got_rev_arg) but not on --stdin (where we don't)

  - when rev-list decides whether it should complain that it wasn't
    given a starting point, it relies on rev_input_given. So it can't
    even get the command-line "--ignore-missing $ZERO_OID" right

Let's consistently set the flag if we got any revision argument. That
lets us clean up the redundant got_rev_arg, and fixes both of those bugs
(but note there are three new tests: we'll confirm the already working
git-log command-line case).

A few implementation notes:

  - conceptually we want to set the flag whenever handle_revision_arg()
    finds an actual revision arg ("handles" it, you might say). But it
    covers a ton of cases with early returns. Rather than annotating
    each one, we just wrap it and use its success exit-code to set the
    flag in one spot.

  - the new rev-list test is in t6018, which is titled to cover globs.
    This isn't exactly a glob, but it made sense to stick it with the
    other tests that handle the "even though we got a rev, we have no
    pending objects" case, which are globs.

  - the tests check for the oid of a missing object, which it's pretty
    clear --ignore-missing should ignore. You can see the same behavior
    with "--ignore-missing a-ref-that-does-not-exist", because
    --ignore-missing treats them both the same. That's perhaps less
    clearly correct, and we may want to change that in the future. But
    the way the code and tests here are written, we'd continue to do the
    right thing even if it does.
Reported-by: NBryan Turner <bturner@atlassian.com>
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

04a0e985

12 8月, 2020 1 次提交

revision: change "--diff-merges" option to require parameter · 6501580f

由 Sergey Organov 提交于 8月 06, 2020

--diff-merges=off is the only accepted form for now, a synonym for
--no-diff-merges.

This patch is a preparation for adding more values, as well as supporting
--diff-merges=<parent>, where <parent> is single parent number to output diff
against.
Signed-off-by: NSergey Organov <sorganov@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

6501580f

08 8月, 2020 1 次提交

rev-list: allow bisect and first-parent flags · 0fe305a5

由 Aaron Lipman 提交于 8月 07, 2020

Add first_parent_only parameter to find_bisection(), removing the
barrier that prevented combining the --bisect and --first-parent flags
when using git rev-list
Based-on-patch-by: NTiago Botelho <tiagonbotelho@hotmail.com>
Signed-off-by: NAaron Lipman <alipman88@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

0fe305a5

05 8月, 2020 3 次提交

revision: fix die() message for "--unpacked=" · f649aaaf

由 Sergey Organov 提交于 8月 05, 2020

Get rid of the trailing dot and mark for translation.
Signed-off-by: NSergey Organov <sorganov@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

f649aaaf

revision: avoid leak when preparing bloom filter for "/" · 398e659e

由 Jeff King 提交于 8月 04, 2020

If we're given an empty pathspec, we refuse to set up bloom filters, as
described in f3c2a368 (revision: empty pathspecs should not use Bloom
filters, 2020-07-01).

But before the empty string check, we drop any trailing slash by
allocating a new string without it. So a pathspec consisting only of "/"
will allocate that string, but then still cause us to bail, leaking the
new string. Let's make sure to free it.
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

398e659e

revision: avoid out-of-bounds read/write on empty pathspec · fd9a631c

由 Jeff King 提交于 8月 04, 2020

Running t4216 with ASan results in it complaining of an out-of-bounds
read in prepare_to_use_bloom_filter(). The issue is this code to strip a
trailing slash:

  last_index = pi->len - 1;
  if (pi->match[last_index] == '/') {

because we have no guarantee that pi->len isn't zero. This can happen if
the pathspec is ".", as we translate that to an empty string. And if
that read of random memory does trigger the conditional, we'd then do an
out-of-bounds write:

  path_alloc = xstrdup(pi->match);
  path_alloc[last_index] = '\0';

Let's make sure to check the length before subtracting. Note that for an
empty pathspec, we'd end up bailing from the function a few lines later,
which makes it tempting to just:

  if (!pi->len)
          return;

early here. But our code here is stripping a trailing slash, and we need
to check for emptiness after stripping that slash, too. So we'd have two
blocks, which would require repeating some cleanup code.

Instead, just skip the trailing-slash for an empty string. Setting
last_index at all in the case is awkward since it will have a nonsense
value (and it uses an "int", which is a too-small type for a string
anyway). So while we're here, let's:

  - drop last_index entirely; it's only used in two spots right next to
    each other and writing out "pi->len - 1" in both is actually easier
    to follow

  - use xmemdupz() to duplicate the string. This is slightly more
    efficient, but more importantly makes the intent more clear by
    allocating the correct-sized substring in the first place. It also
    eliminates any question of whether path_alloc is as long as
    pi->match (which it would not be if pi->match has any embedded NULs,
    though in practice this is probably impossible).
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

fd9a631c

31 7月, 2020 1 次提交

strvec: rename struct fields · d70a9eb6

由 Jeff King 提交于 7月 28, 2020

The "argc" and "argv" names made sense when the struct was argv_array,
but now they're just confusing. Let's rename them to "nr" (which we use
for counts elsewhere) and "v" (which is rather terse, but reads well
when combined with typical variable names like "args.v").

Note that we have to update all of the callers immediately. Playing
tricks with the preprocessor is hard here, because we wouldn't want to
rewrite unrelated tokens.
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

d70a9eb6

30 7月, 2020 1 次提交

revision: add "--no-diff-merges" option to counteract "-m" · 6fae74b4

由 Jeff King 提交于 7月 29, 2020

The "-m" option sets revs->ignore_merges to "0", but there's no way to
undo it. This probably isn't something anybody overly cares about, since
"1" is already the default, but it will serve as an escape hatch when we
flip the default for ignore_merges to "0" in more situations.

We'll also add a few extra niceties:

  - initialize the value to "-1" to indicate "not set", and then resolve
    it to the normal 0/1 bool in setup_revisions(). This lets any tweak
    functions, as well as setup_revisions() itself, avoid clobbering the
    user's preference (which until now they couldn't actually express).

  - since we now have --no-diff-merges, let's add the matching
    --diff-merges, which is just a synonym for "-m". Then we don't even
    need to document --no-diff-merges separately; it countermands the
    long form of "-m" in the usual way.

The new test shows that this behaves just the same as the current
behavior without "-m".
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

6fae74b4

29 7月, 2020 2 次提交

strvec: convert remaining callers away from argv_array name · c972bf4c

由 Jeff King 提交于 7月 28, 2020

We eventually want to drop the argv_array name and just use strvec
consistently. There's no particular reason we have to do it all at once,
or care about interactions between converted and unconverted bits.
Because of our preprocessor compat layer, the names are interchangeable
to the compiler (so even a definition and declaration using different
names is OK).

This patch converts all of the remaining files, as the resulting diff is
reasonably sized.

The conversion was done purely mechanically with:

  git ls-files '*.c' '*.h' |
  xargs perl -i -pe '
    s/ARGV_ARRAY/STRVEC/g;
    s/argv_array/strvec/g;
  '

We'll deal with any indentation/style fallouts separately.
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

c972bf4c

strvec: rename files from argv-array to strvec · dbbcd44f

由 Jeff King 提交于 7月 28, 2020

This requires updating #include lines across the code-base, but that's
all fairly mechanical, and was done with:

  git ls-files '*.c' '*.h' |
  xargs perl -i -pe 's/argv-array.h/strvec.h/'
Signed-off-by: NJeff King <peff@peff.net>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

dbbcd44f

07 7月, 2020 1 次提交

revision: disable min_age optimization with line-log · 01faa91c

由 René Scharfe 提交于 7月 04, 2020

If one of the options --before, --min-age or --until is given,
limit_list() filters out younger commits early on.  Line-log needs all
those commits to trace the movement of line ranges, though.  Skip this
optimization if both are used together.
Reported-by: NМария Долгополова <dolgopolovamariia@gmail.com>
Signed-off-by: NRené Scharfe <l.s.r@web.de>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

01faa91c

02 7月, 2020 4 次提交

commit-graph: check all leading directories in changed path Bloom filters · c525ce95

由 SZEDER Gábor 提交于 7月 01, 2020

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well.  Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
  linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
  tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

*Path selection was done with the following pipeline:

	git ls-tree -r --name-only HEAD | sort -R | head -n 5000

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.
Helped-by: NTaylor Blau <me@ttaylorr.com>
Helped-by: NRené Scharfe <l.s.r@web.de>
Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

c525ce95

revision: empty pathspecs should not use Bloom filters · f3c2a368

由 Taylor Blau 提交于 7月 01, 2020

The prepare_to_use_bloom_filter() method was not intended to be called
on an empty pathspec. However, 'git log -- .' and 'git log' are subtly
different: the latter reports all commits while the former will simplify
commits that do not change the root tree.

This means that the path used to construct the bloom_key might be empty,
and that value is not added to the Bloom filter during construction.
That means that the results are likely incorrect!

To resolve the issue, be careful about the length of the path and stop
filling Bloom filters. To be completely sure we do not use them, drop
the pointer to the bloom_filter_settings from the commit-graph. That
allows our test to look at the trace2 logs to verify no Bloom filter
statistics are reported.
Signed-off-by: NTaylor Blau <me@ttaylorr.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

f3c2a368

revision.c: fix whitespace · dc8e95ba

由 Derrick Stolee 提交于 7月 01, 2020

Here, four spaces were used instead of tab characters.
Reported-by: NTaylor Blau <me@ttaylorr.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

dc8e95ba

bloom: fix logic in get_bloom_filter() · 94919742

由 Derrick Stolee 提交于 7月 01, 2020

The get_bloom_filter() method is a bit complicated in some parts where
it does not need to be. In particular, it needs to return a NULL filter
only when compute_if_not_present is zero AND the filter data cannot be
loaded from a commit-graph file. This currently happens by accident
because the commit-graph does not load changed-path Bloom filters from
an existing commit-graph when writing a new one. This will change in a
later patch.

Also clean up some style issues while we are here.

One side-effect of returning a NULL filter is that the filters that are
reported as "too large" will now be reported as NULL insead of length
zero. This case was not properly covered before, so add a test. Further,
remote the counting of the zero-length filters from revision.c and the
trace2 logs.
Helped-by: NRené Scharfe <l.s.r@web.de>
Helped-by: NSZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

94919742

23 6月, 2020 1 次提交

worktree: drop get_worktrees() unused 'flags' argument · 03f2465b

由 Eric Sunshine 提交于 6月 19, 2020

get_worktrees() accepts a 'flags' argument, however, there are no
existing flags (the lone flag GWT_SORT_LINKED was recently retired) and
no behavior which can be tweaked. Therefore, drop the 'flags' argument.
Signed-off-by: NEric Sunshine <sunshine@sunshineco.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

03f2465b

18 6月, 2020 2 次提交

commit-graph: minimize commit_graph_data_slab access · c752ad09

由 Abhishek Kumar 提交于 6月 17, 2020

In an earlier patch, multiple struct acccesses to `graph_pos` and
`generation` were auto-converted to multiple method calls.

Since the values are fixed and commit-slab access costly, we would be
better off with storing the values as a local variable and reusing it.
Signed-off-by: NAbhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

c752ad09

commit: move members graph_pos, generation to a slab · c49c82aa

由 Abhishek Kumar 提交于 6月 17, 2020

We remove members `graph_pos` and `generation` from the struct commit.
The default assignments in init_commit_node() are no longer valid,
which is fine as the slab helpers return appropriate default values and
the assignments are removed.

We will replace existing use of commit->generation and commit->graph_pos
by commit_graph_data_slab helpers using
`contrib/coccinelle/commit.cocci'.
Signed-off-by: NAbhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

c49c82aa

09 6月, 2020 1 次提交

diff.h: drop diff_tree_oid() & friends' return value · 0ee3cb88

由 SZEDER Gábor 提交于 6月 05, 2020

ll_diff_tree_oid() has only ever returned 0 [1], so it's return value
is basically useless.  It's only caller diff_tree_oid() has only ever
returned the return value of ll_diff_tree_oid() as-is [2], so its
return value is just as useless.  Most of diff_tree_oid()'s callers
simply ignore its return value, except:

  - diff_root_tree_oid() is a thin wrapper around diff_tree_oid() and
    returns with its return value, but all of diff_root_tree_oid()'s
    callers ignore its return value.

  - rev_compare_tree() and rev_same_tree_as_empty() do look at the
    return value in a condition, but, since the return value is always
    0, the former's < 0 condition is never fulfilled, while the
    latter's >= 0 condition is always fulfilled.

So let's drop the return value of ll_diff_tree_oid(), diff_tree_oid()
and diff_root_tree_oid(), and drop those conditions from
rev_compare_tree() and rev_same_tree_as_empty() as well.

[1] ll_diff_tree_oid() and its ancestors have been returning only 0
    ever since it was introduced as diff_tree() in 9174026c (Add
    "diff-tree" program to show which files have changed between two
    trees., 2005-04-09).
[2] diff_tree_oid() traces back to diff-tree.c:main() in 9174026c as
    well.
Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

0ee3cb88

12 5月, 2020 3 次提交

line-log: integrate with changed-path Bloom filters · f32dde8c

由 Derrick Stolee 提交于 5月 11, 2020

The previous changes to the line-log machinery focused on making the
first result appear faster. This was achieved by no longer walking the
entire commit history before returning the early results. There is still
another way to improve the performance: walk most commits much faster.
Let's use the changed-path Bloom filters to reduce time spent computing
diffs.

Since the line-log computation requires opening blobs and checking the
content-diff, there is still a lot of necessary computation that cannot
be replaced with changed-path Bloom filters. The part that we can reduce
is most effective when checking the history of a file that is deep in
several directories and those directories are modified frequently. In
this case, the computation to check if a commit is TREESAME to its first
parent takes a large fraction of the time. That is ripe for improvement
with changed-path Bloom filters.

We must ensure that prepare_to_use_bloom_filters() is called in
revision.c so that the bloom_filter_settings are loaded into the struct
rev_info from the commit-graph. Of course, some cases are still
forbidden, but in the line-log case the pathspec is provided in a
different way than normal.

Since multiple paths and segments could be requested, we compute the
struct bloom_key data dynamically during the commit walk. This could
likely be improved, but adds code complexity that is not valuable at
this time.

There are two cases to care about: merge commits and "ordinary" commits.
Merge commits have multiple parents, but if we are TREESAME to our first
parent in every range, then pass the blame for all ranges to the first
parent. Ordinary commits have the same condition, but each is done
slightly differently in the process_ranges_[merge|ordinary]_commit()
methods. By checking if the changed-path Bloom filter can guarantee
TREESAME, we can avoid that tree-diff cost. If the filter says "probably
changed", then we need to run the tree-diff and then the blob-diff if
there was a real edit.

The Linux kernel repository is a good testing ground for the performance
improvements claimed here. There are two different cases to test. The
first is the "entire history" case, where we output the entire history
to /dev/null to see how long it would take to compute the full line-log
history. The second is the "first result" case, where we find how long
it takes to show the first value, which is an indicator of how quickly a
user would see responses when waiting at a terminal.

To test, I selected the paths that were changed most frequently in the
top 10,000 commits using this command (stolen from StackOverflow [1]):

	git log --pretty=format: --name-only -n 10000 | sort | \
		uniq -c | sort -rg | head -10

which results in

    121 MAINTAINERS
     63 fs/namei.c
     60 arch/x86/kvm/cpuid.c
     59 fs/io_uring.c
     58 arch/x86/kvm/vmx/vmx.c
     51 arch/x86/kvm/x86.c
     45 arch/x86/kvm/svm.c
     42 fs/btrfs/disk-io.c
     42 Documentation/scsi/index.rst

(along with a bogus first result). It appears that the path
arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the
following results for the real command time:

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

It is worth noting that the least speedup comes for the MAINTAINERS file
which is

 * edited frequently,
 * low in the directory heirarchy, and
 * quite a large file.

All of those points lead to spending more time doing the blob diff and
less time doing the tree diff. Still, we see some improvement in that
case and significant improvement in other cases. A 2-4x speedup is
likely the more typical case as opposed to the small 5% change for that
file.
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

f32dde8c

line-log: try to use generation number-based topo-ordering · 002933f3

由 SZEDER Gábor 提交于 5月 11, 2020

The previous patch made it possible to perform line-level filtering
during history traversal instead of in an expensive preprocessing
step, but it still requires some simpler preprocessing steps, notably
topo-ordering.  However, nowadays we have commit-graphs storing
generation numbers, which make it possible to incrementally traverse
the history in topological order, without the preparatory limit_list()
and sort_in_topological_order() steps; see b4542418 (revision.c:
generation-based topo-order algorithm, 2018-11-01).

This patch combines the two, so we can do both the topo-ordering and
the line-level filtering during history traversal, eliminating even
those simpler preprocessing steps, and thus further reducing the delay
before showing the first commit modifying the given line range.

The 'revs->limited' flag plays the central role in this, because, due
to limitations of the current implementation, the generation
number-based topo-ordering is only enabled when this flag remains
unset.  Line-level log, however, always sets this flag in
setup_revisions() ever since the feature was introduced in 12da1d1f
(Implement line-history search (git log -L), 2013-03-28).  The reason
for setting 'limited' is unclear, though, because the line-level log
itself doesn't directly depend on it, and it doesn't affect how the
limit_list() function limits the revision range.  However, there is an
indirect dependency: the line-level log requires topo-ordering, and
the "traditional" sort_in_topological_order() requires an already
limited commit list since e6c3505b (Make sure we generate the whole
commit list before trying to sort it topologically, 2005-07-06).  The
new, generation numbers-based topo-ordering doesn't require a limited
commit list anymore.

So don't set 'revs->limited' for line-level log, unless it is really
necessary, namely:

  - The user explicitly requested parent rewriting, because that is
    still done in the line_log_filter() preprocessing step (see
    previous patch), which requires sort_in_topological_order() and in
    turn limit_list() as well.

  - A commit-graph file is not available or it doesn't yet contain
    generation numbers.  In these cases we had to fall back on
    sort_in_topological_order() and in turn limit_list().  The
    existing condition with generation_numbers_enabled() has already
    ensured that the 'limited' flag is set in these cases; this patch
    just makes sure that the line-level log sets 'revs->topo_order'
    before that condition.

While the reduced delay before showing the first commit is measurable
in git.git, it takes a bigger repository to make it clearly noticable.
In both cases below the line ranges were chosen so that they were
modified rather close to the starting revisions, so the effect of this
change is most noticable.

  # git.git
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

  Before:

    real    0m0.107s
    user    0m0.091s
    sys     0m0.013s

  After:

    real    0m0.058s
    user    0m0.050s
    sys     0m0.005s

  # linux.git
  $ time git --no-pager log \
    -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2

  Before:

    real   0m1.129s
    user   0m1.061s
    sys    0m0.069s

  After:

    real   0m0.096s
    user   0m0.087s
    sys    0m0.009s

Additional testing by Derrick Stolee: Since this patch improves
the performance for the first result, I repeated the experiment
from the previous patch on the Linux kernel repository, reporting
real time here:

    Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
     Before: 0.71 s
      After: 0.05 s

Now, we have dropped the full topo-order of all ~910,000 commits
before reporting the first result. The remaining performance
improvements then are:

 1. Update the parent-rewriting logic to be incremental similar to
    how "git log --graph" behaves.

 2. Use changed-path Bloom filters to reduce the time spend in the
    tree-diff to see if the path(s) changed.
Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

002933f3

line-log: more responsive, incremental 'git log -L' · 3cb9d2b6

由 SZEDER Gábor 提交于 5月 11, 2020

The current line-level log implementation performs a preprocessing
step in prepare_revision_walk(), during which the line_log_filter()
function filters and rewrites history to keep only commits modifying
the given line range.  This preprocessing affects both responsiveness
and correctness:

  - Git doesn't produce any output during this preprocessing step.
    Checking whether a commit modified the given line range is
    somewhat expensive, so depending on the size of the given revision
    range this preprocessing can result in a significant delay before
    the first commit is shown.

  - Limiting the number of displayed commits (e.g. 'git log -3 -L...')
    doesn't limit the amount of work during preprocessing, because
    that limit is applied during history traversal.  Alas, by that
    point this expensive preprocessing step has already churned
    through the whole revision range to find all commits modifying the
    revision range, even though only a few of them need to be shown.

  - It rewrites parents, with no way to turn it off.  Without the user
    explicitly requesting parent rewriting any parent object ID shown
    should be that of the immediate parent, just like in case of a
    pathspec-limited history traversal without parent rewriting.

    However, after that preprocessing step rewrote history, the
    subsequent "regular" history traversal (i.e. get_revision() in a
    loop) only sees commits modifying the given line range.
    Consequently, it can only show the object ID of the last ancestor
    that modified the given line range (which might happen to be the
    immediate parent, but many-many times it isn't).

This patch addresses both the correctness and, at least for the common
case, the responsiveness issues by integrating line-level log
filtering into the regular revision walking machinery:

  - Make process_ranges_arbitrary_commit(), the static function in
    'line-log.c' deciding whether a commit modifies the given line
    range, public by removing the static keyword and adding the
    'line_log_' prefix, so it can be called from other parts of the
    revision walking machinery.

  - If the user didn't explicitly ask for parent rewriting (which, I
    believe, is the most common case):

    - Call this now-public function during regular history traversal,
      namely from get_commit_action() to ignore any commits not
      modifying the given line range.

      Note that while this check is relatively expensive, it must be
      performed before other, much cheaper conditions, because the
      tracked line range must be adjusted even when the commit will
      end up being ignored by other conditions.

    - Skip the line_log_filter() call, i.e. the expensive
      preprocessing step, in prepare_revision_walk(), because, thanks
      to the above points, the revision walking machinery is now able
      to filter out commits not modifying the given line range while
      traversing history.

      This way the regular history traversal sees the unmodified
      history, and is therefore able to print the object ids of the
      immediate parents of the listed commits.  The eliminated
      preprocessing step can greatly reduce the delay before the first
      commit is shown, see the numbers below.

  - However, if the user did explicitly ask for parent rewriting via
    '--parents' or a similar option, then stick with the current
    implementation for now, i.e. perform that expensive filtering and
    history rewriting in the preprocessing step just like we did
    before, leaving the initial delay as long as it was.

I tried to integrate line-level log filtering with parent rewriting
into the regular history traversal, but, unfortunately, several
subtleties resisted... :)  Maybe someday we'll figure out how to do
that, but until then at least the simple and common (i.e. without
parent rewriting) 'git log -L:func:file' commands can benefit from the
reduced delay.

This change makes the failing 'parent oids without parent rewriting'
test in 't4211-line-log.sh' succeed.

The reduced delay is most noticable when there's a commit modifying
the line range near the tip of a large-ish revision range:

  # no parent rewriting requested, no commit-graph present
  $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

  Before:

    real    0m9.570s
    user    0m9.494s
    sys     0m0.076s

  After:

    real    0m0.718s
    user    0m0.674s
    sys     0m0.044s

A significant part of the remaining delay is spent reading and parsing
commit objects in limit_list().  With the help of the commit-graph we
can eliminate most of that reading and parsing overhead, so here are
the timing results of the same command as above, but this time using
the commit-graph:

  Before:

    real    0m8.874s
    user    0m8.816s
    sys     0m0.057s

  After:

    real    0m0.107s
    user    0m0.091s
    sys     0m0.013s

The next patch will further reduce the remaining delay.

To be clear: this patch doesn't actually optimize the line-level log,
but merely moves most of the work from the preprocessing step to the
history traversal, so the commits modifying the line range can be
shown as soon as they are processed, and the traversal can be
terminated as soon as the given number of commits are shown.
Consequently, listing the full history of a line range, potentially
all the way to the root commit, will take the same time as before (but
at least the user might start reading the output earlier).
Furthermore, if the most recent commit modifying the line range is far
away from the starting revision, then that initial delay will still be
significant.

Additional testing by Derrick Stolee: In the Linux kernel repository,
the MAINTAINERS file was changed ~3,500 times across the ~915,000
commits. In addition to that edit frequency, the file itself is quite
large (~18,700 lines). This means that a significant portion of the
computation is taken up by computing the patch-diff of the file. This
patch improves the real time it takes to output the first result quite
a bit:

Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
 Before: 3.88 s
  After: 0.71 s

If we drop the "-n 1" in the command, then there is no change in
end-to-end process time. This is because the command still needs to
walk the entire commit history, which negates the point of this
patch. This is expected.

As a note for future reference, the ~4.3 seconds in the old code
spends ~2.6 seconds computing the patch-diffs, and the rest of the
time is spent walking commits and computing diffs for which paths
changed at each commit. The changed-path Bloom filters could improve
the end-to-end computation time (i.e. no "-n 1" in the command).
Signed-off-by: NSZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

3cb9d2b6

17 4月, 2020 1 次提交

revision: complicated pathspecs disable filters · 8918e379

由 Derrick Stolee 提交于 4月 16, 2020

The changed-path Bloom filters work only when we can compute an
explicit Bloom filter key in advance. When a pathspec is given
that allows case-insensitive checks or wildcard matching, we
must disable the Bloom filter performance checks.

By checking the pathspec in prepare_to_use_bloom_filters(), we
avoid setting up the Bloom filter data and thus revert to the
usual logic.

Before this change, the following tests would fail*:

	t6004-rev-list-path-optim.sh (Tests 6-7)
	t6130-pathspec-noglob.sh (Tests 3-6)
	t6131-pathspec-icase.sh (Tests 3-5)

*These tests would fail when using GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
environment variable was not set up correctly to write the changed-
path Bloom filters in the test suite. That will be fixed in the
next change.
Helped-by: NTaylor Blau <me@ttaylorr.com>
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

8918e379

11 4月, 2020 1 次提交

revision: --show-pulls adds helpful merges · 8d049e18

由 Derrick Stolee 提交于 4月 10, 2020

The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.

The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.

Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.

However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:

  A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
   \     \     \            \  /      /    /            /
    \     \__.. \            \/ ..__T1    /           Tn
     \           \__..       /\     ..__T2           /
      \_____________________B  \____________________/

If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.

The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.

Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch.  When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".

Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.

In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:

  Include a merge commit if it is not TREESAME to its first
  parent, but is TREESAME to a later parent.

These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.

To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.

In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.

There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.

Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.
Signed-off-by: NDerrick Stolee <dstolee@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

8d049e18

08 4月, 2020 1 次提交

format-patch: teach --no-encode-email-headers · 19d097e3

由 Emma Brooks 提交于 4月 08, 2020

When commit subjects or authors have non-ASCII characters, git
format-patch Q-encodes them so they can be safely sent over email.
However, if the patch transfer method is something other than email (web
review tools, sneakernet), this only serves to make the patch metadata
harder to read without first applying it (unless you can decode RFC 2047
in your head). git am as well as some email software supports
non-Q-encoded mail as described in RFC 6531.

Add --[no-]encode-email-headers and format.encodeEmailHeaders to let the
user control this behavior.
Signed-off-by: NEmma Brooks <me@pluvano.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

19d097e3

07 4月, 2020 2 次提交

revision.c: add trace2 stats around Bloom filter usage · 42e50e78

由 Garima Singh 提交于 4月 06, 2020

Add trace2 statistics around Bloom filter usage and behavior
for 'git log -- path' commands that are hoping to benefit from
the presence of computed changed paths Bloom filters.

These statistics are great for performance analysis work and
for formal testing, which we will see in the commit following
this one.

Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: NSZEDER Gábor <szeder.dev@gmail.com>
Helped-by: NJonathan Tan <jonathantanmy@google.com>
Signed-off-by: NGarima Singh <garima.singh@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

42e50e78

revision.c: use Bloom filters to speed up path based revision walks · a56b9464

由 Garima Singh 提交于 4月 06, 2020

Revision walk will now use Bloom filters for commits to speed up
revision walks for a particular path (for computing history for
that path), if they are present in the commit-graph file.

We load the Bloom filters during the prepare_revision_walk step,
currently only when dealing with a single pathspec. Extending
it to work with multiple pathspecs can be explored and built on
top of this series in the future.

While comparing trees in rev_compare_trees(), if the Bloom filter
says that the file is not different between the two trees, we don't
need to compute the expensive diff. This is where we get our
performance gains. The other response of the Bloom filter is '`:maybe',
in which case we fall back to the full diff calculation to determine
if the path was changed in the commit.

We do not try to use Bloom filters when the '--walk-reflogs' option
is specified. The '--walk-reflogs' option does not walk the commit
ancestry chain like the rest of the options. Incorporating the
performance gains when walking reflog entries would add more
complexity, and can be explored in a later series.

Performance Gains:
We tested the performance of `git log -- <path>` on the git repo, the linux
and some internal large repos, with a variety of paths of varying depths.

On the git and linux repos:
- we observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times
  faster.

Helped-by: Derrick Stolee <dstolee@microsoft.com
Helped-by: NSZEDER Gábor <szeder.dev@gmail.com>
Helped-by: NJonathan Tan <jonathantanmy@google.com>
Signed-off-by: NGarima Singh <garima.singh@microsoft.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

a56b9464

14 12月, 2019 1 次提交

notes: break set_display_notes() into smaller functions · 1d729751

由 Denton Liu 提交于 12月 11, 2019

In 8164c961 (format-patch: use --notes behavior for format.notes,
2019-12-09), we introduced set_display_notes() which was a monolithic
function with three mutually exclusive branches. Break the function up
into three small and simple functions that each are only responsible for
one task.

This family of functions accepts an `int *show_notes` instead of
returning a value suitable for assignment to `show_notes`. This is for
two reasons. First of all, this guarantees that the external
`show_notes` variable changes in lockstep with the
`struct display_notes_opt`. Second, this prompts future developers to be
careful about doing something meaningful with this value. In fact, a
NULL check is intentionally omitted because causing a segfault here
would tell the future developer that they are meant to use the value for
something meaningful.

One alternative was making the family of functions accept a
`struct rev_info *` instead of the `struct display_notes_opt *`, since
the former contains the `show_notes` field as well. This does not work
because we have to call git_config() before repo_init_revisions().
However, if we had a `struct rev_info`, we'd need to initialize it before
it gets assigned values from git_config(). As a result, we break the
circular dependency by having standalone `int show_notes` and
`struct display_notes_opt notes_opt` variables which temporarily hold
values from git_config() until the values are copied over to `rev`.

To implement this change, we need to get a pointer to
`rev_info::show_notes`. Unfortunately, this is not possible with
bitfields and only direct-assignment is possible. Change
`rev_info::show_notes` to a non-bitfield int so that we can get its
address.
Signed-off-by: NDenton Liu <liu.denton@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

1d729751

10 12月, 2019 2 次提交

notes: extract logic into set_display_notes() · 452538c3

由 Denton Liu 提交于 12月 09, 2019

Instead of open coding the logic that tweaks the variables in
`struct display_notes_opt` within handle_revision_opt(), abstract away the
logic into set_display_notes() so that it can be reused.
Signed-off-by: NDenton Liu <liu.denton@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

452538c3

notes: create init_display_notes() helper · e6e230ee

由 Denton Liu 提交于 12月 09, 2019

We currently open code the initialization for revs->notes_opt. Abstract
this away into a helper function so that the logic can be reused in a
future commit.

This is slightly wasteful as we memset the struct twice but this is only
run once so it shouldn't have any major effect.
Signed-off-by: NDenton Liu <liu.denton@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

e6e230ee

25 11月, 2019 2 次提交

revision: free topo_walk_info before creating a new one in init_topo_walk · 0aa0c2b2

由 Mike Hommey 提交于 11月 22, 2019

init_topo_walk doesn't reuse an existing topo_walk_info, and currently
leaks the one that might exist on the current rev_info if it was already
used for a topo walk beforehand.
Signed-off-by: NMike Hommey <mh@glandium.org>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

0aa0c2b2

revision: clear the topo-walk flags in reset_revision_walk · ffa1f28f

由 Mike Hommey 提交于 11月 22, 2019

Not doing so can lead to wrong topo-walks when using the revision walk API
consecutively.
Signed-off-by: NMike Hommey <mh@glandium.org>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

ffa1f28f

20 11月, 2019 1 次提交

revision: make get_revision_mark() return const pointer · 49825164

由 Denton Liu 提交于 11月 19, 2019

get_revision_mark() used to return a `char *`, even though all of the
strings it was returning were string literals. Make get_revision_mark()
return a `const char *` so that callers won't be tempted to modify the
returned string.
Signed-off-by: NDenton Liu <liu.denton@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

49825164

07 10月, 2019 7 次提交

hashmap: remove type arg from hashmap_{get,put,remove}_entry · 404ab78e

由 Eric Wong 提交于 10月 06, 2019

Since these macros already take a `keyvar' pointer of a known type,
we can rely on OFFSETOF_VAR to get the correct offset without
relying on non-portable `__typeof__' and `offsetof'.

Argument order is also rearranged, so `keyvar' and `member' are
sequential as they are used as: `keyvar->member'
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

404ab78e

OFFSETOF_VAR macro to simplify hashmap iterators · 23dee69f

由 Eric Wong 提交于 10月 06, 2019

While we cannot rely on a `__typeof__' operator being portable
to use with `offsetof'; we can calculate the pointer offset
using an existing pointer and the address of a member using
pointer arithmetic for compilers without `__typeof__'.

This allows us to simplify usage of hashmap iterator macros
by not having to specify a type when a pointer of that type
is already given.

In the future, list iterator macros (e.g. list_for_each_entry)
may also be implemented using OFFSETOF_VAR to save hackers the
trouble of using container_of/list_entry macros and without
relying on non-portable `__typeof__'.

v3: use `__typeof__' to avoid clang warnings
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

23dee69f

hashmap: introduce hashmap_free_entries · c8e424c9

由 Eric Wong 提交于 10月 06, 2019

`hashmap_free_entries' behaves like `container_of' and passes
the offset of the hashmap_entry struct to the internal
`hashmap_free_' function, allowing the function to free any
struct pointer regardless of where the hashmap_entry field
is located.

`hashmap_free' no longer takes any arguments aside from
the hashmap itself.
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

c8e424c9

hashmap: use *_entry APIs for iteration · 87571c3f

由 Eric Wong 提交于 10月 06, 2019

Inspired by list_for_each_entry in the Linux kernel.
Once again, these are somewhat compromised usability-wise
by compilers lacking __typeof__ support.
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

87571c3f

hashmap_cmp_fn takes hashmap_entry params · 939af16e

由 Eric Wong 提交于 10月 06, 2019

Another step in eliminating the requirement of hashmap_entry
being the first member of a struct.
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

939af16e

hashmap_get{,_from_hash} return "struct hashmap_entry *" · f23a4651

由 Eric Wong 提交于 10月 06, 2019

Update callers to use hashmap_get_entry, hashmap_get_entry_from_hash
or container_of as appropriate.

This is another step towards eliminating the requirement of
hashmap_entry being the first field in a struct.
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

f23a4651

hashmap_put takes "struct hashmap_entry *" · 26b455f2

由 Eric Wong 提交于 10月 06, 2019

This is less error-prone than "void *" as the compiler now
detects invalid types being passed.
Signed-off-by: NEric Wong <e@80x24.org>
Reviewed-by: NDerrick Stolee <stolee@gmail.com>
Signed-off-by: NJunio C Hamano <gitster@pobox.com>

26b455f2

李少辉-开发者 / git 与 Fork 源项目一致

李少辉-开发者 / git
与 Fork 源项目一致