1. 31 10月, 2018 10 次提交
    • A
      perf trace beauty: Use the mmap flags table generated from headers · 2f967f1d
      Arnaldo Carvalho de Melo 提交于
      Instead of requiring us to go on and edit sources to add new flag.
      
        # perf trace -e *mmap sleep 0.1
           0.025 ( 0.005 ms): sleep/29876 mmap(len: 163746, prot: READ, flags: PRIVATE, fd: 3) = 0x7faa68ad1000
           0.059 ( 0.004 ms): sleep/29876 mmap(len: 8192, prot: READ|WRITE, flags: PRIVATE|ANONYMOUS) = 0x7faa68acf000
           0.069 ( 0.006 ms): sleep/29876 mmap(len: 3889792, prot: EXEC|READ, flags: PRIVATE|DENYWRITE, fd: 3) = 0x7faa6851f000
           0.086 ( 0.009 ms): sleep/29876 mmap(addr: 0x7faa688cb000, len: 24576, prot: READ|WRITE, flags: PRIVATE|FIXED|DENYWRITE, fd: 3, off: 1753088) = 0x7faa688cb000
           0.101 ( 0.005 ms): sleep/29876 mmap(addr: 0x7faa688d1000, len: 14976, prot: READ|WRITE, flags: PRIVATE|FIXED|ANONYMOUS) = 0x7faa688d1000
           0.348 ( 0.005 ms): sleep/29876 mmap(len: 111950656, prot: READ, flags: PRIVATE, fd: 3) = 0x7faa61a5b000
        #
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-ggmoy6vxoygh5yim890ht0kf@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      2f967f1d
    • A
      perf beauty: Wire up the mmap flags table generator to the Makefile · fbd7458d
      Arnaldo Carvalho de Melo 提交于
      Now when we run 'make -C tools/perf O=/tmp/build/perf' we end up with:
      
        $ cat /tmp/build/perf/trace/beauty/generated/mmap_flags_array.c
        static const char *mmap_flags[] = {
      	[ilog2(0x40) + 1] = "32BIT",
      	[ilog2(0x01) + 1] = "SHARED",
      	[ilog2(0x02) + 1] = "PRIVATE",
      	[ilog2(0x10) + 1] = "FIXED",
      	[ilog2(0x20) + 1] = "ANONYMOUS",
      	[ilog2(0x100000) + 1] = "FIXED_NOREPLACE",
      	[ilog2(0x0100) + 1] = "GROWSDOWN",
      	[ilog2(0x0800) + 1] = "DENYWRITE",
      	[ilog2(0x1000) + 1] = "EXECUTABLE",
      	[ilog2(0x2000) + 1] = "LOCKED",
      	[ilog2(0x4000) + 1] = "NORESERVE",
      	[ilog2(0x8000) + 1] = "POPULATE",
      	[ilog2(0x10000) + 1] = "NONBLOCK",
      	[ilog2(0x20000) + 1] = "STACK",
      	[ilog2(0x40000) + 1] = "HUGETLB",
      	[ilog2(0x80000) + 1] = "SYNC",
        };
        $
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-t3fn7u3tjsupio6e6vkufx9m@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      fbd7458d
    • A
      perf beauty: Add a generator for MAP_ mmap's flag constants · 80ee5668
      Arnaldo Carvalho de Melo 提交于
      It'll use tools/{arch}/*,include copies of mman.h to generate a table to
      be used by tools, initially by the 'mmap' beautifiers in 'perf trace',
      but that could also be used to translate from a string constant to the
      integer value to be used in a eBPF or tracefs tracepoint filter.
      
      Tested for all archs using:
      
      $ for arch in `ls tools/arch/` ; \
      	do echo $arch ; tools/perf/trace/beauty/mmap_flags.sh $arch ; \
         done | less
      
      Example for alpha, an oddball, doesn't include any header, defines all
      its stuff:
      
        $ tools/perf/trace/beauty/mmap_flags.sh alpha
        static const char *mmap_flags[] = {
      	[ilog2(0x10) + 1] = "ANONYMOUS",
      	[ilog2(0x02000) + 1] = "DENYWRITE",
      	[ilog2(0x04000) + 1] = "EXECUTABLE",
      	[ilog2(0x100) + 1] = "FIXED",
      	[ilog2(0x01000) + 1] = "GROWSDOWN",
      	[ilog2(0x100000) + 1] = "HUGETLB",
      	[ilog2(0x08000) + 1] = "LOCKED",
      	[ilog2(0x40000) + 1] = "NONBLOCK",
      	[ilog2(0x10000) + 1] = "NORESERVE",
      	[ilog2(0x20000) + 1] = "POPULATE",
      	[ilog2(0x02) + 1] = "PRIVATE",
      	[ilog2(0x01) + 1] = "SHARED",
      	[ilog2(0x80000) + 1] = "STACK",
        };
        $
      
      Common case, my workstation, defines one entry (MAP_32BIT), then
      includes mman.h, which gets it to include mman-common.h too:
      
        $ tools/perf/trace/beauty/mmap_flags.sh
        static const char *mmap_flags[] = {
      	[ilog2(0x40) + 1] = "32BIT",
      	[ilog2(0x01) + 1] = "SHARED",
      	[ilog2(0x02) + 1] = "PRIVATE",
      	[ilog2(0x10) + 1] = "FIXED",
      	[ilog2(0x20) + 1] = "ANONYMOUS",
      	[ilog2(0x100000) + 1] = "FIXED_NOREPLACE",
      	[ilog2(0x0100) + 1] = "GROWSDOWN",
      	[ilog2(0x0800) + 1] = "DENYWRITE",
      	[ilog2(0x1000) + 1] = "EXECUTABLE",
      	[ilog2(0x2000) + 1] = "LOCKED",
      	[ilog2(0x4000) + 1] = "NORESERVE",
      	[ilog2(0x8000) + 1] = "POPULATE",
      	[ilog2(0x10000) + 1] = "NONBLOCK",
      	[ilog2(0x20000) + 1] = "STACK",
      	[ilog2(0x40000) + 1] = "HUGETLB",
      	[ilog2(0x80000) + 1] = "SYNC",
        };
        $ uname -m
        x86_64
        $
      
      Sparc, that defines a bunch then includes just mman-common.h:
      
        $ tools/perf/trace/beauty/mmap_flags.sh sparc
        static const char *mmap_flags[] = {
      	[ilog2(0x0800) + 1] = "DENYWRITE",
      	[ilog2(0x1000) + 1] = "EXECUTABLE",
      	[ilog2(0x0200) + 1] = "GROWSDOWN",
      	[ilog2(0x40000) + 1] = "HUGETLB",
      	[ilog2(0x100) + 1] = "LOCKED",
      	[ilog2(0x10000) + 1] = "NONBLOCK",
      	[ilog2(0x40) + 1] = "NORESERVE",
      	[ilog2(0x8000) + 1] = "POPULATE",
      	[ilog2(0x20000) + 1] = "STACK",
      	[ilog2(0x01) + 1] = "SHARED",
      	[ilog2(0x02) + 1] = "PRIVATE",
      	[ilog2(0x10) + 1] = "FIXED",
      	[ilog2(0x20) + 1] = "ANONYMOUS",
      	[ilog2(0x100000) + 1] = "FIXED_NOREPLACE",
        };
        [acme@jouet perf]$
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-xydeh491z8fkgglcmqnl5thj@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      80ee5668
    • A
      tools include uapi: Update asound.h copy · 89eb1f3b
      Arnaldo Carvalho de Melo 提交于
      To silence this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/sound/asound.h' differs from latest version at 'include/uapi/sound/asound.h'
        diff -u tools/include/uapi/sound/asound.h include/uapi/sound/asound.h
      
      Due to this cset:
      
        a9840151 ("ALSA: timer: fix wrong comment to refer to 'SNDRV_TIMER_PSFLG_*'")
      
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Takashi Sakamoto <o-takashi@sakamocchi.jp>
      Cc: Takashi Iwai <tiwai@suse.de>
      Link: https://lkml.kernel.org/n/tip-76gsvs0w2g0x723ivqa2xua3@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      89eb1f3b
    • A
      tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies · 8dd4c0f6
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        82b355d1 ("y2038: Remove newstat family from default syscall set")
      
      Which will make the syscall table used by 'perf trace' for arm64 to be
      updated from the changes in that patch.
      
      This silences these perf build warnings:
      
        Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/unistd.h' differs from latest version at 'arch/arm64/include/uapi/asm/unistd.h'
        diff -u tools/arch/arm64/include/uapi/asm/unistd.h arch/arm64/include/uapi/asm/unistd.h
        Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
        diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
      
      Cc: Kim Phillips <kim.phillips@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-3euy7c4yy5mvnp5bm16t9vqg@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      8dd4c0f6
    • A
      tools include uapi: Update linux/fs.h copy · 733ac4f9
      Arnaldo Carvalho de Melo 提交于
      To silence this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/fs.h' differs from latest version at 'include/uapi/linux/fs.h'
        diff -u tools/include/uapi/linux/fs.h include/uapi/linux/fs.h
      
      Due to just two comments added by:
      
        Fixes: 578bdaab ("crypto: speck - remove Speck")
      
      So nothing that entails changes in tools/, that so far uses fs.h to
      generate the mount and umount syscalls 'flags' argument integer->string
      tables with:
      
        $ tools/perf/trace/beauty/mount_flags.sh
        static const char *mount_flags[] = {
      	[4096 ? (ilog2(4096) + 1) : 0] = "BIND",
      <SNIP>
      	[30 + 1] = "ACTIVE",
      	[31 + 1] = "NOUSER",
        };
        $
        # trace -e mount,umount mount --bind /proc /mnt
           1.228 ( 2.581 ms): mount/1068 mount(dev_name: /mnt, dir_name: 0x55f011c354a0, type: 0x55f011c38170, flags: BIND) = 0
        # trace -e mount,umount umount /proc /mnt
        umount: /proc: target is busy.
           1.587 ( 0.010 ms): umount/1070 umount2(name: /proc) = -1 EBUSY Device or resource busy
           1.799 (12.660 ms): umount/1070 umount2(name: /mnt) = 0
        #
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Link: https://lkml.kernel.org/n/tip-c00bqzclscgah26z2g5zxm73@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      733ac4f9
    • D
      perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc} · e9024d51
      David S. Miller 提交于
      When processing using 'perf report -g caller', which is the default, we
      ended up reverting the callchain entries received from the kernel, but
      simply reverting throws away the information that tells that from a
      point onwards the addresses are for userspace, kernel, guest kernel,
      guest user, hypervisor.
      
      The idea is that if we are walking backwards, for each cluster of
      non-cpumode entries we have to first scan backwards for the next one and
      use that for the cluster.
      
      This seems silly and more expensive than it needs to be but it is enough
      for a initial fix.
      
      The code here is really complicated because it is intimately intertwined
      with the lbr and branch handling, as well as this callchain order,
      further fixes will be needed to properly take into account the cpumode
      in those cases.
      
      Another problem with ORDER_CALLER is that the NULL "0" IP that is at the
      end of most callchains shows up at the top of the histogram because
      every callchain contains it and with ORDER_CALLER it is the first entry.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Souvik Banerjee <souvik1997@gmail.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: stable@vger.kernel.org # 4.19
      Link: https://lkml.kernel.org/n/tip-2wt3ayp6j2y2f2xowixa8y6y@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e9024d51
    • L
      perf cs-etm: Correct CPU mode for samples · d6c9c05f
      Leo Yan 提交于
      Since commit edeb0c90 ("perf tools: Stop fallbacking to kallsyms for
      vdso symbols lookup"), the kernel address cannot be properly parsed to
      kernel symbol with command 'perf script -k vmlinux'.  The reason is
      CoreSight samples is always to set CPU mode as PERF_RECORD_MISC_USER,
      thus it fails to find corresponding map/dso in below flows:
      
        process_sample_event()
          `-> machine__resolve()
      	  `-> thread__find_map(thread, sample->cpumode, sample->ip, al);
      
      In this flow it needs to pass argument 'sample->cpumode' to tell what's
      the CPU mode, before it always passed PERF_RECORD_MISC_USER but without
      any failure until the commit edeb0c90 ("perf tools: Stop fallbacking
      to kallsyms for vdso symbols lookup") has been merged.  The reason is
      even with the wrong CPU mode the function thread__find_map() firstly
      fails to find map but it will rollback to find kernel map for vdso
      symbols lookup.  In the latest code it has removed the fallback code,
      thus if CPU mode is PERF_RECORD_MISC_USER then it cannot find map
      anymore with kernel address.
      
      This patch is to correct samples CPU mode setting, it creates a new
      helper function cs_etm__cpu_mode() to tell what's the CPU mode based on
      the address with the info from machine structure; this patch has a bit
      extension to check not only kernel and user mode, but also check for
      host/guest and hypervisor mode.  Finally this patch uses the function in
      instruction and branch samples and also apply in cs_etm__mem_access()
      for a minor polishing.
      Signed-off-by: NLeo Yan <leo.yan@linaro.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: coresight@lists.linaro.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: stable@kernel.org # v4.19
      Link: http://lkml.kernel.org/r/1540883908-17018-1-git-send-email-leo.yan@linaro.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      d6c9c05f
    • M
      perf unwind: Take pgoff into account when reporting elf to libdwfl · 1fe627da
      Milian Wolff 提交于
      libdwfl parses an ELF file itself and creates mappings for the
      individual sections. perf on the other hand sees raw mmap events which
      represent individual sections. When we encounter an address pointing
      into a mapping with pgoff != 0, we must take that into account and
      report the file at the non-offset base address.
      
      This fixes unwinding with libdwfl in some cases. E.g. for a file like:
      
      ```
      
      using namespace std;
      
      mutex g_mutex;
      
      double worker()
      {
          lock_guard<mutex> guard(g_mutex);
          uniform_real_distribution<double> uniform(-1E5, 1E5);
          default_random_engine engine;
          double s = 0;
          for (int i = 0; i < 1000; ++i) {
              s += norm(complex<double>(uniform(engine), uniform(engine)));
          }
          cout << s << endl;
          return s;
      }
      
      int main()
      {
          vector<std::future<double>> results;
          for (int i = 0; i < 10000; ++i) {
              results.push_back(async(launch::async, worker));
          }
          return 0;
      }
      ```
      
      Compile it with `g++ -g -O2 -lpthread cpp-locking.cpp  -o cpp-locking`,
      then record it with `perf record --call-graph dwarf -e
      sched:sched_switch`.
      
      When you analyze it with `perf script` and libunwind, you should see:
      
      ```
      cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120
              ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux)
                  7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so)
                  7f38e4252d0b arena_get2.part.4+0x2fb (/usr/lib/libc-2.28.so)
                  7f38e4255b1c tcache_init.part.6+0xec (/usr/lib/libc-2.28.so)
                  7f38e42569e5 __GI___libc_malloc+0x115 (inlined)
                  7f38e4241790 __GI__IO_file_doallocate+0x90 (inlined)
                  7f38e424fbbf __GI__IO_doallocbuf+0x4f (inlined)
                  7f38e424ee47 __GI__IO_file_overflow+0x197 (inlined)
                  7f38e424df36 _IO_new_file_xsputn+0x116 (inlined)
                  7f38e4242bfb __GI__IO_fwrite+0xdb (inlined)
                  7f38e463fa6d std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, long)+0x1cd (inlined)
                  7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> >::_M_put(char const*, long)+0x1cd (inlined)
                  7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::__write<char>(std::ostreambuf_iterator<char, std::char_traits<char> >, char const*, int)+0x1cd (inlined)
                  7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<c>
                  7f38e464bd70 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const+0x90 (inl>
                  7f38e464bd70 std::ostream& std::ostream::_M_insert<double>(double)+0x90 (/usr/lib/libstdc++.so.6.0.25)
                  563b9cb502f7 std::ostream::operator<<(double)+0xb7 (inlined)
                  563b9cb502f7 worker()+0xb7 (/ssd/milian/projects/kdab/rnd/hotspot/build/tests/test-clients/cpp-locking/cpp-locking)
                  563b9cb506fb double std::__invoke_impl<double, double (*)()>(std::__invoke_other, double (*&&)())+0x2b (inlined)
                  563b9cb506fb std::__invoke_result<double (*)()>::type std::__invoke<double (*)()>(double (*&&)())+0x2b (inlined)
                  563b9cb506fb decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<double (*)()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x2b (inlined)
                  563b9cb506fb std::thread::_Invoker<std::tuple<double (*)()> >::operator()()+0x2b (inlined)
                  563b9cb506fb std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<double>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<double (*)()> >, dou>
                  563b9cb506fb std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_>
                  563b9cb507e8 std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const+0x28 (inlined)
                  563b9cb507e8 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)+0x28 (/ssd/milian/>
                  7f38e46d24fe __pthread_once_slow+0xbe (/usr/lib/libpthread-2.28.so)
                  563b9cb51149 __gthread_once+0xe9 (inlined)
                  563b9cb51149 void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)>
                  563b9cb51149 std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)+0xe9 (inlined)
                  563b9cb51149 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double (*)()> >&&)::{lambda()#1}::op>
                  563b9cb51149 void std::__invoke_impl<void, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double>
                  563b9cb51149 std::__invoke_result<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double (*)()> >>
                  563b9cb51149 decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_>
                  563b9cb51149 std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<dou>
                  563b9cb51149 std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread>
                  7f38e45f0062 execute_native_thread_routine+0x12 (/usr/lib/libstdc++.so.6.0.25)
                  7f38e46caa9c start_thread+0xfc (/usr/lib/libpthread-2.28.so)
                  7f38e42ccb22 __GI___clone+0x42 (inlined)
      ```
      
      Before this patch, using libdwfl, you would see:
      
      ```
      cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120
              ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux)
                  7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so)
              a041161e77950c5c [unknown] ([unknown])
      ```
      
      With this patch applied, we get a bit further in unwinding:
      
      ```
      cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120
              ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux)
              ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux)
                  7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so)
                  7f38e4252d0b arena_get2.part.4+0x2fb (/usr/lib/libc-2.28.so)
                  7f38e4255b1c tcache_init.part.6+0xec (/usr/lib/libc-2.28.so)
                  7f38e42569e5 __GI___libc_malloc+0x115 (inlined)
                  7f38e4241790 __GI__IO_file_doallocate+0x90 (inlined)
                  7f38e424fbbf __GI__IO_doallocbuf+0x4f (inlined)
                  7f38e424ee47 __GI__IO_file_overflow+0x197 (inlined)
                  7f38e424df36 _IO_new_file_xsputn+0x116 (inlined)
                  7f38e4242bfb __GI__IO_fwrite+0xdb (inlined)
                  7f38e463fa6d std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, long)+0x1cd (inlined)
                  7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> >::_M_put(char const*, long)+0x1cd (inlined)
                  7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::__write<char>(std::ostreambuf_iterator<char, std::char_traits<char> >, char const*, int)+0x1cd (inlined)
                  7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<c>
                  7f38e464bd70 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const+0x90 (inl>
                  7f38e464bd70 std::ostream& std::ostream::_M_insert<double>(double)+0x90 (/usr/lib/libstdc++.so.6.0.25)
                  563b9cb502f7 std::ostream::operator<<(double)+0xb7 (inlined)
                  563b9cb502f7 worker()+0xb7 (/ssd/milian/projects/kdab/rnd/hotspot/build/tests/test-clients/cpp-locking/cpp-locking)
              6eab825c1ee3e4ff [unknown] ([unknown])
      ```
      
      Note that the backtrace is still stopping too early, when compared to
      the nice results obtained via libunwind. It's unclear so far what the
      reason for that is.
      
      Committer note:
      
      Further comment by Milian on the thread started on the Link: tag below:
      
       ---
      The remaining issue is due to a bug in elfutils:
      
      https://sourceware.org/ml/elfutils-devel/2018-q4/msg00089.html
      
      With both patches applied, libunwind and elfutils produce the same output for
      the above scenario.
       ---
      Signed-off-by: NMilian Wolff <milian.wolff@kdab.com>
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Link: http://lkml.kernel.org/r/20181029141644.3907-1-milian.wolff@kdab.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      1fe627da
    • A
      perf top: Do not use overwrite mode by default · 218d6111
      Arnaldo Carvalho de Melo 提交于
      Enabling --overwrite mode allows us to to use just the most recent
      records, which helps in high core count machines such as Knights
      Landing/Mill, but right now is being disabled by default as the pausing
      used in this technique is leading to loss of metadata events such as
      PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples,
      leading to lots of unknown samples appearing on the UI.
      
      Enabling this may be useful if you are in such machines and profiling a
      workload that doesn't creates short lived threads and/or doesn't uses
      many executable mmap operations.
      
      Work is being planed to solve this situation, till then, this will
      remain disabled by default.
      Reported-by: NDavid Miller <davem@davemloft.net>
      Acked-by: NKan Liang <kan.liang@intel.com>
      Link: https://lkml.kernel.org/r/4f84468f-37d9-cf1b-12c1-514ef74b6a48@linux.intel.com
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Fixes: ebebbf08 ("perf top: Switch default mode to overwrite mode")
      Link: https://lkml.kernel.org/n/tip-ehvf77vi1si9409r7p4wx788@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      218d6111
  2. 30 10月, 2018 11 次提交
    • A
      perf top: Allow disabling the overwrite mode · 4e303fbe
      Arnaldo Carvalho de Melo 提交于
      In ebebbf08 ("perf top: Switch default mode to overwrite mode") we
      forgot to leave a way to disable that new default, add a --overwrite
      option that can be disabled using --no-overwrite, since the code already
      in such a way that we can readily disable this mode.
      
      This is useful when investigating bugs with this mode like the recent
      report from David Miller where lots of unknown symbols appear due to
      disabling the events while processing them which disables all record
      types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve
      maps when we lose PERF_RECORD_MMAP records.
      
      This can be easily seen while building a kernel, when there are lots of
      short lived processes.
      Reported-by: NDavid Miller <davem@davemloft.net>
      Acked-by: NKan Liang <kan.liang@intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Fixes: ebebbf08 ("perf top: Switch default mode to overwrite mode")
      Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      4e303fbe
    • A
      perf trace: Beautify mount's first pathname arg · 23c07a23
      Arnaldo Carvalho de Melo 提交于
      The pathname beautifiers so far support just one augmented pathname per
      syscall, so do it just for mount's first arg, later this will get fixed.
      
      With:
      
        # perf probe -l
        probe:vfs_getname    (on getname_flags:73@acme/git/linux/fs/namei.c with pathname)
        #
      
      Later this will get added to augmented_syscalls.c (eBPF):
      
      In one xterm:
      
        # perf trace -e mount,umount
        2687.331 ( 3.544 ms): mount/8892 mount(dev_name: /mnt, dir_name: 0x561f9ac184a0, type: 0x561f9ac1b170, flags: BIND) = 0
        3912.126 ( 8.807 ms): umount/8895 umount2(name: /mnt) = 0
        ^C#
      
      In the other:
      
        $ sudo mount --bind /proc /mnt
        $ sudo umount /mnt
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-qsvhrm2es635cl4zicqjeth2@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      23c07a23
    • A
      perf trace: Beautify the umount's 'name' argument · 476c92ca
      Arnaldo Carvalho de Melo 提交于
      By using the SCA_FILENAME beautifier, that works when either the
      probe:vfs_getname probe is in place or with the eBPF program
      tools/perf/examples/bpf/augmented_syscalls.c:
      
        # perf probe -l
        probe:vfs_getname (on getname_flags:73@acme/git/linux/fs/namei.c with pathname)
        # perf trace -e umount
        9630.332 ( 9.521 ms): umount/8082 umount2(name: /mnt) = 0
        #
      
      The augmented syscalls one will be done in the next patch.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-hegbzlpd2nrn584l5jxn7sy2@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      476c92ca
    • A
      perf trace: Consider syscall aliases too · f932184e
      Arnaldo Carvalho de Melo 提交于
      When trying to trace the 'umount' syscall on x86_64 I noticed that it
      was failing:
      
        # trace -e umount umount /mnt
        event syntax error: 'umount'
                             \___ parser error
        Run 'perf list' for a list of valid events
      
         Usage: perf trace [<options>] [<command>]
            or: perf trace [<options>] -- <command> [<options>]
            or: perf trace record [<options>] [<command>]
            or: perf trace record [<options>] -- <command> [<options>]
      
            -e, --event <event>   event/syscall selector. use 'perf list' to list available events
        #
      
      This is because in the x86-64 we have it just as 'umount2':
      
        $ grep umount arch/x86/entry/syscalls/syscall_64.tbl
        166	common	umount2			__x64_sys_umount
        $
      
      So if the syscall name fails, try fallbacking to looking at the aliases
      we have in the syscall_fmts table to then re-lookup, now:
      
        # trace -e umount umount -f /mnt
        umount: /mnt: not mounted.
           1.759 ( 0.004 ms): umount/18365 umount2(name: 0x55fbfcbc4480, flags: 1) = -1 EINVAL Invalid argument
        #
      
      Time to beautify the flags arg :-)
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-ukweodgzbmjd25lfkgryeft1@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      f932184e
    • A
      perf trace beauty: Beautify mount/umount's 'flags' argument · 73d141ad
      Arnaldo Carvalho de Melo 提交于
        # trace -e mount mount -o ro -t debugfs nodev /mnt
           0.000 ( 1.040 ms): mount/27235 mount(dev_name: 0x5601cc8c64e0, dir_name: 0x5601cc8c6500, type: 0x5601cc8c6480, flags: RDONLY) = 0
        # trace -e mount mount -o remount,relatime -t debugfs nodev /mnt
           0.000 ( 2.946 ms): mount/27262 mount(dev_name: 0x55f4a73d64e0, dir_name: 0x55f4a73d6500, type: 0x55f4a73d6480, flags: REMOUNT|RELATIME) = 0
        # trace -e mount mount -o remount,strictatime -t debugfs nodev /mnt
           0.000 ( 2.934 ms): mount/27265 mount(dev_name: 0x5617f71d94e0, dir_name: 0x5617f71d9500, type: 0x5617f71d9480, flags: REMOUNT|STRICTATIME) = 0
        # trace -e mount mount -o remount,suid,silent -t debugfs nodev /mnt
           0.000 ( 0.049 ms): mount/27273 mount(dev_name: 0x55ad65df24e0, dir_name: 0x55ad65df2500, type: 0x55ad65df2480, flags: REMOUNT|SILENT) = 0
        # trace -e mount mount -o remount,rw,sync,lazytime -t debugfs nodev /mnt
           0.000 ( 2.684 ms): mount/27281 mount(dev_name: 0x561216055530, dir_name: 0x561216055550, type: 0x561216055510, flags: SYNCHRONOUS|REMOUNT|LAZYTIME) = 0
        # trace -e mount mount -o remount,dirsync -t debugfs nodev /mnt
           0.000 ( 3.512 ms): mount/27314 mount(dev_name: 0x55c4e7188480, dir_name: 0x55c4e7188530, type: 0x55c4e71884a0, flags: REMOUNT|DIRSYNC, data: 0x55c4e71884e0) = 0
        #
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-i5ncao73c0bd02qprgrq6wb9@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      73d141ad
    • A
      perf trace beauty: Allow syscalls to mask an argument before considering it · 496fd346
      Arnaldo Carvalho de Melo 提交于
      Take mount's 'flags' arg, to cope with this semantic, as defined in do_mount in fs/namespace.c:
      
        /*
         * Pre-0.97 versions of mount() didn't have a flags word.  When the
         * flags word was introduced its top half was required to have the
         * magic value 0xC0ED, and this remained so until 2.4.0-test9.
         * Therefore, if this magic number is present, it carries no
         * information and must be discarded.
         */
      
      We need to mask this arg, and then see if it is zero, when we simply
      don't print the arg name and value.
      
      The next patch will use this for mount's 'flag' arg.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-btue14k5jemayuykfrwsnh85@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      496fd346
    • A
      perf beauty: Introduce strarray__scnprintf_flags() · 579e5ff6
      Arnaldo Carvalho de Melo 提交于
      Generalizing pkey_alloc__scnprintf_access_rights(), so that we can use
      it with other flags-like arguments, such as mount's mountflags argument.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-o3ymi3104m8moaz9865g09w9@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      579e5ff6
    • A
      perf beauty: Switch from GPL v2.0 to LGPL v2.1 · 794f594e
      Arnaldo Carvalho de Melo 提交于
      The intention is to have this as a library, since it is not perf
      specific at all.
      
      I did the switch for the files where I'm the only contributor, with the
      exception of a few lines changed by Jiri Olsa.
      Acked-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-a04q6chdyjknm1hr305ulx8h@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      794f594e
    • A
      perf beauty: Add a generator for MS_ mount/umount's flag constants · ceaf8e5b
      Arnaldo Carvalho de Melo 提交于
      It'll use tools/include copy of linux/fs.h to generate a table to be
      used by tools, initially by the 'mount' and 'umount' beautifiers in
      'perf trace', but that could also be used to translate from a string
      constant to the integer value to be used in a eBPF or tracefs tracepoint
      filter.
      
      When used without any args it produces:
      
        $ tools/perf/trace/beauty/mount_flags.sh
        static const char *mount_flags[] = {
      	[1 ? (ilog2(1) + 1) : 0] = "RDONLY",
      	[2 ? (ilog2(2) + 1) : 0] = "NOSUID",
      	[4 ? (ilog2(4) + 1) : 0] = "NODEV",
      	[8 ? (ilog2(8) + 1) : 0] = "NOEXEC",
      	[16 ? (ilog2(16) + 1) : 0] = "SYNCHRONOUS",
      	[32 ? (ilog2(32) + 1) : 0] = "REMOUNT",
      	[64 ? (ilog2(64) + 1) : 0] = "MANDLOCK",
      	[128 ? (ilog2(128) + 1) : 0] = "DIRSYNC",
      	[1024 ? (ilog2(1024) + 1) : 0] = "NOATIME",
      	[2048 ? (ilog2(2048) + 1) : 0] = "NODIRATIME",
      	[4096 ? (ilog2(4096) + 1) : 0] = "BIND",
      	[8192 ? (ilog2(8192) + 1) : 0] = "MOVE",
      	[16384 ? (ilog2(16384) + 1) : 0] = "REC",
      	[32768 ? (ilog2(32768) + 1) : 0] = "SILENT",
      	[16 + 1] = "POSIXACL",
      	[17 + 1] = "UNBINDABLE",
      	[18 + 1] = "PRIVATE",
      	[19 + 1] = "SLAVE",
      	[20 + 1] = "SHARED",
      	[21 + 1] = "RELATIME",
      	[22 + 1] = "KERNMOUNT",
      	[23 + 1] = "I_VERSION",
      	[24 + 1] = "STRICTATIME",
      	[25 + 1] = "LAZYTIME",
      	[26 + 1] = "SUBMOUNT",
      	[27 + 1] = "NOREMOTELOCK",
      	[28 + 1] = "NOSEC",
      	[29 + 1] = "BORN",
      	[30 + 1] = "ACTIVE",
      	[31 + 1] = "NOUSER",
        };
        $
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-mgutbbkmip9gfnmd28ikg7xt@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      ceaf8e5b
    • A
      tools include uapi: Grab a copy of linux/fs.h · f443f38c
      Arnaldo Carvalho de Melo 提交于
      We'll use it to create tables for the 'flags' argument to the 'mount'
      and 'umount' syscalls.
      
      Add it to check_headers.sh so that when a new protocol gets added we get
      a notification during the build process.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Benjamin Peterson <benjamin@python.org>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-yacf9jvkwfwg2g95r2us3xb3@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      f443f38c
    • C
      perf/core: Clean up inconsisent indentation · 28fa741c
      Colin Ian King 提交于
      Replace a bunch of spaces with tab, cleans up indentation
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-janitors@vger.kernel.org
      Link: http://lkml.kernel.org/r/20181029233211.21475-1-colin.king@canonical.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      28fa741c
  3. 29 10月, 2018 1 次提交
  4. 28 10月, 2018 1 次提交
    • L
      i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array · b59dfdae
      Linus Torvalds 提交于
      Commit 9ee3e066 ("HID: i2c-hid: override HID descriptors for certain
      devices") added a new dmi_system_id quirk table to override certain HID
      report descriptors for some systems that lack them.
      
      But the table wasn't properly terminated, causing the dmi matching to
      walk off into la-la-land, and starting to treat random data as dmi
      descriptor pointers, causing boot-time oopses if you were at all
      unlucky.
      
      Terminate the array.
      
      We really should have some way to just statically check that arrays that
      should be terminated by an empty entry actually are so.  But the HID
      people really should have caught this themselves, rather than have me
      deal with an oops during the merge window.  Tssk, tssk.
      
      Cc: Julian Sax <jsbc@gmx.de>
      Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b59dfdae
  5. 27 10月, 2018 17 次提交
    • L
      Merge branch 'akpm' (patches from Andrew) · 345671ea
      Linus Torvalds 提交于
      Merge updates from Andrew Morton:
      
       - a few misc things
      
       - ocfs2 updates
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
        hugetlbfs: dirty pages as they are added to pagecache
        mm: export add_swap_extent()
        mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
        tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
        mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
        mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
        mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
        mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
        mm/gup: cache dev_pagemap while pinning pages
        Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
        mm: return zero_resv_unavail optimization
        mm: zero remaining unavailable struct pages
        tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
        tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
        tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
        tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
        mm/gup_benchmark.c: add additional pinning methods
        mm/gup_benchmark.c: time put_page()
        mm: don't raise MEMCG_OOM event due to failed high-order allocation
        mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
        ...
      345671ea
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 49040081
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
       "What better way to start off a weekend than with some networking bug
        fixes:
      
        1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed
           by David Ahern and Bjørn Mork.
      
        2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE
           properly in UDP, from Sean Tranchetti.
      
        3) Remove TCA_OPTIONS from policy validation, it turns out we don't
           consistently use nested attributes for this across all packet
           schedulers. From David Ahern.
      
        4) Fix SKB corruption in cadence driver, from Tristram Ha.
      
        5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit.
      
        6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
        net/neigh: fix NULL deref in pneigh_dump_table()
        net: allow traceroute with a specified interface in a vrf
        bridge: do not add port to router list when receives query with source 0.0.0.0
        net/smc: fix smc_buf_unuse to use the lgr pointer
        ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
        net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
        lan743x: Remove SPI dependency from Microchip group.
        drivers: net: remove <net/busy_poll.h> inclusion when not needed
        net: phy: genphy_10g_driver: Avoid NULL pointer dereference
        r8169: fix broken Wake-on-LAN from S5 (poweroff)
        octeontx2-af: Use GFP_ATOMIC under spin lock
        net: ethernet: cadence: fix socket buffer corruption problem
        net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
        net: sched: Remove TCA_OPTIONS from policy
        ice: Poll for link status change
        ice: Allocate VF interrupts and set queue map
        ice: Introduce ice_dev_onetime_setup
        net: hns3: Fix for warning uninitialized symbol hw_err_lst3
        octeontx2-af: Copy the right amount of memory
        net: udp: fix handling of CHECKSUM_COMPLETE packets
        ...
      49040081
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · a45dcff7
      Linus Torvalds 提交于
      Pull sparc fixes from David Miller:
       "Some more sparc fixups, mostly aimed at getting the allmodconfig build
        up and clean again"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: Rework xchg() definition to avoid warnings.
        sparc64: Export __node_distance.
        sparc64: Make corrupted user stacks more debuggable.
      a45dcff7
    • M
      hugetlbfs: dirty pages as they are added to pagecache · 22146c3c
      Mike Kravetz 提交于
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
      explicit code removes the pages, the appropriate accounting is not
      performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to pagecache.  This
      can easily be reproduced with fallocate as shown above.  Read faulted
      pages will eventually end up being marked dirty.  But there is a window
      where they are clean and could be impacted by code such as drop_caches.
      So, just dirty them all as they are added to the pagecache.
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMihcla Hocko <mhocko@suse.com>
      Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22146c3c
    • O
      mm: export add_swap_extent() · aa8aa8a3
      Omar Sandoval 提交于
      Btrfs currently does not support swap files because swap's use of bmap
      does not work with copy-on-write and multiple devices.  See 35054394
      ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").
      
      However, the swap code has a mechanism for the filesystem to manually add
      swap extents using add_swap_extent() from the ->swap_activate() aop.
      iomap has done this since 67482129 ("iomap: add a swapfile activation
      function").  Btrfs will do the same in a later patch, so export
      add_swap_extent().
      
      Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.comSigned-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa8aa8a3
    • O
      mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS · bc4ae27d
      Omar Sandoval 提交于
      The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
      through the filesystem, and to make swapoff() call ->swap_deactivate().
      For Btrfs, we want the latter but not the former, so split this flag into
      two.  This makes us always call ->swap_deactivate() if ->swap_activate()
      succeeded, not just if it didn't add any swap extents itself.
      
      This also resolves the issue of the very misleading name of SWP_FILE,
      which is only used for swap files over NFS.
      
      Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.comSigned-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc4ae27d
    • M
      tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE · 91cbacc3
      Michael Ellerman 提交于
      Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann
      Horn.  This would have caught the overlap bug reported by Daniel Micay.
      
      I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but
      instead of writing a selftest I spent my time bike-shedding whether it
      should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW ..  mea culpa.
      
      Link: http://lkml.kernel.org/r/20181013133929.28653-1-mpe@ellerman.id.auSigned-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Jason Evans <jasone@google.com>
      Cc: David Goldblatt <davidtgoldblatt@gmail.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91cbacc3
    • A
      mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() · 7eef5f97
      Andrea Arcangeli 提交于
      There should be no cache left by the time we overwrite the old transhuge
      pmd with the new one.  It's already too late to flush through the virtual
      address because we already copied the page data to the new physical
      address.
      
      So flush the cache before the data copy.
      
      Also delete the "end" variable to shutoff a "unused variable" warning on
      x86 where flush_cache_range() is a noop.
      
      Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7eef5f97
    • A
      mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9
      Andrea Arcangeli 提交于
      change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
      right away.  do_huge_pmd_numa_page() flushes the TLB before calling
      migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
      runs some CPU could still access the page through the TLB.
      
      change_huge_pmd() before arming the numa/protnone transhuge pmd calls
      mmu_notifier_invalidate_range_start().  So there's no need of
      mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
      sequence in migrate_misplaced_transhuge_page() too, because by the time
      migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
      invalidated in the secondary MMUs.  It has to or if a secondary MMU can
      still write to the page, the migrate_page_copy() would lose data.
      
      However an explicit mmu_notifier_invalidate_range() is needed before
      migrate_misplaced_transhuge_page() starts copying the data of the
      transhuge page or the below can happen for MMU notifier users sharing the
      primary MMU pagetables and only implementing ->invalidate_range:
      
      CPU0		CPU1		GPU sharing linux pagetables using
                                      only ->invalidate_range
      -----------	------------	---------
      				GPU secondary MMU writes to the page
      				mapped by the transhuge pmd
      change_pmd_range()
      mmu..._range_start()
      ->invalidate_range_start() noop
      change_huge_pmd()
      set_pmd_at(numa/protnone)
      pmd_unlock()
      		do_huge_pmd_numa_page()
      		CPU TLB flush globally (1)
      		CPU cannot write to page
      		migrate_misplaced_transhuge_page()
      				GPU writes to the page...
      		migrate_page_copy()
      				...GPU stops writing to the page
      CPU TLB flush (2)
      mmu..._range_end() (3)
      ->invalidate_range_stop() noop
      ->invalidate_range()
      				GPU secondary MMU is invalidated
      				and cannot write to the page anymore
      				(too late)
      
      Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
      too late, we also need a mmu_notifier_invalidate_range() before calling
      migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
      (3) also arrives too late.
      
      This requirement is the result of the lazy optimization in
      change_huge_pmd() that releases the pmd_lock without first flushing the
      TLB and without first calling mmu_notifier_invalidate_range().
      
      Even converting the removed mmu_notifier_invalidate_range_only_end() into
      a mmu_notifier_invalidate_range_end() would not have been enough to fix
      this, because it run after migrate_page_copy().
      
      After the hugepage data copy is done migrate_misplaced_transhuge_page()
      can proceed and call set_pmd_at without having to flush the TLB nor any
      secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
      flush, has to happen before the migrate_page_copy() is called or it would
      be a bug in the first place (and it was for drivers using
      ->invalidate_range()).
      
      KVM is unaffected because it doesn't implement ->invalidate_range().
      
      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
      uses the generic migrate_pages which transitions the pte from
      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
      and all mmu notifiers there before copying the page.
      
      Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7066f0f9
    • A
      mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · d7c33934
      Andrea Arcangeli 提交于
      Patch series "migrate_misplaced_transhuge_page race conditions".
      
      Aaron found a new instance of the THP MADV_DONTNEED race against
      pmdp_clear_flush* variants, that was apparently left unfixed.
      
      While looking into the race found by Aaron, I may have found two more
      issues in migrate_misplaced_transhuge_page.
      
      These race conditions would not cause kernel instability, but they'd
      corrupt userland data or leave data non zero after MADV_DONTNEED.
      
      I did only minor testing, and I don't expect to be able to reproduce this
      (especially the lack of ->invalidate_range before migrate_page_copy,
      requires the latest iommu hardware or infiniband to reproduce).  The last
      patch is noop for x86 and it needs further review from maintainers of
      archs that implement flush_cache_range() (not in CC yet).
      
      To avoid confusion, it's not the first patch that introduces the bug fixed
      in the second patch, even before removing the
      pmdp_huge_clear_flush_notify, that _notify suffix was called after
      migrate_page_copy already run.
      
      This patch (of 3):
      
      This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
      balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
      race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
      race).
      
      When the above three fixes where posted Dave asked
      https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
      but apparently this was missed.
      
      The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
      in a54a407f ("mm: Close races between THP migration and PMD numa
      clearing").
      
      The important part of such commit is only the part where the page lock is
      not released until the first do_huge_pmd_numa_page() finished disarming
      the pagenuma/protnone.
      
      The addition of pmdp_clear_flush() wasn't beneficial to such commit and
      there's no commentary about such an addition either.
      
      I guess the pmdp_clear_flush() in such commit was added just in case for
      safety, but it ended up introducing the MADV_DONTNEED race condition found
      by Aaron.
      
      At that point in time nobody thought of such kind of MADV_DONTNEED race
      conditions yet (they were fixed later) so the code may have looked more
      robust by adding the pmdp_clear_flush().
      
      This specific race condition won't destabilize the kernel, but it can
      confuse userland because after MADV_DONTNEED the memory won't be zeroed
      out.
      
      This also optimizes the code and removes a superfluous TLB flush.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
      Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAaron Tomlin <atomlin@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7c33934
    • C
      mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t · 026d1eaf
      Clark Williams 提交于
      The static lock quarantine_lock is used in quarantine.c to protect the
      quarantine queue datastructures.  It is taken inside quarantine queue
      manipulation routines (quarantine_put(), quarantine_reduce() and
      quarantine_remove_cache()), with IRQs disabled.  This is not a problem on
      a stock kernel but is problematic on an RT kernel where spin locks are
      sleeping spinlocks, which can sleep and can not be acquired with disabled
      interrupts.
      
      Convert the quarantine_lock to a raw spinlock_t.  The usage of
      quarantine_lock is confined to quarantine.c and the work performed while
      the lock is held is used for debug purpose.
      
      [bigeasy@linutronix.de: slightly altered the commit message]
      Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.deSigned-off-by: NClark Williams <williams@redhat.com>
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      026d1eaf
    • K
      mm/gup: cache dev_pagemap while pinning pages · df06b37f
      Keith Busch 提交于
      Getting pages from ZONE_DEVICE memory needs to check the backing device's
      live-ness, which is tracked in the device's dev_pagemap metadata.  This
      metadata is stored in a radix tree and looking it up adds measurable
      software overhead.
      
      This patch avoids repeating this relatively costly operation when
      dev_pagemap is used by caching the last dev_pagemap while getting user
      pages.  The gup_benchmark kernel self test reports this reduces time to
      get user pages to as low as 1/3 of the previous time.
      
      Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.comSigned-off-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df06b37f
    • M
      Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" · 9fd61bc9
      Masayoshi Mizuma 提交于
      commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") breaks movable_node kernel option because it changed
      the memory gap range to reserved memblock.  So, the node is marked as
      Normal zone even if the SRAT has Hot pluggable affinity.
      
          =====================================================================
          kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
          kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
          ...
          kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
          ...
          kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
          kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
          ...
          kernel: Movable zone start for each node
          kernel:  Node 3: 0x00001c0000000000
          kernel: Early memory node ranges
          ...
          =====================================================================
      
      The original issue is fixed by the former patches, so let's revert commit
      124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved").
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-4-msys.mizuma@gmail.comSigned-off-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9fd61bc9
    • P
      mm: return zero_resv_unavail optimization · ec393a0f
      Pavel Tatashin 提交于
      When checking for valid pfns in zero_resv_unavail(), it is not necessary
      to verify that pfns within pageblock_nr_pages ranges are valid, only the
      first one needs to be checked.  This is because memory for pages are
      allocated in contiguous chunks that contain pageblock_nr_pages struct
      pages.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.comSigned-off-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec393a0f
    • N
      mm: zero remaining unavailable struct pages · 907ec5fc
      Naoya Horiguchi 提交于
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So, the node is marked as Normal zone even if the SRAT
      has Hot pluggable affinity.
      
      First and second patch fix the original issue which the commit tried to
      fix, then revert the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This causes shrinking node 0's pfn range because it is calculated by the
      address range of memblock.memory.  So some of struct pages in the gap
      range are left uninitialized.
      
      We have a function zero_resv_unavail() which does zeroing the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable range, which fixes the reported
      issue.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: NOscar Salvador <osalvador@suse.de>
      Tested-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      907ec5fc
    • K
      tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option · 3821b76c
      Keith Busch 提交于
      Add a new option '-H' to the gup benchmark to help understand how hugetlb
      mapping pages compare with the default.
      
      Link: http://lkml.kernel.org/r/20181010195605.10689-6-keith.busch@intel.comSigned-off-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3821b76c
    • K
      tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option · 0dd8666a
      Keith Busch 提交于
      Add a new benchmark option, -S, to request MAP_SHARED.  This can be used
      to compare with MAP_PRIVATE, or for files that require this option, like
      dax.
      
      Link: http://lkml.kernel.org/r/20181010195605.10689-5-keith.busch@intel.comSigned-off-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0dd8666a