1. 29 3月, 2018 11 次提交
    • D
      Merge branch 'bpf-raw-tracepoints' · f6ef5658
      Daniel Borkmann 提交于
      Alexei Starovoitov says:
      
      ====================
      v7->v8:
      - moved 'u32 num_args' from 'struct tracepoint' into 'struct bpf_raw_event_map'
        that increases memory overhead, but can be optimized/compressed later.
        Now it's zero changes in tracepoint.[ch]
      
      v6->v7:
      - adopted Steven's bpf_raw_tp_map section approach to find tracepoint
        and corresponding bpf probe function instead of kallsyms approach.
        dropped kernel_tracepoint_find_by_name() patch
      
      v5->v6:
      - avoid changing semantics of for_each_kernel_tracepoint() function, instead
        introduce kernel_tracepoint_find_by_name() helper
      
      v4->v5:
      - adopted Daniel's fancy REPEAT macro in bpf_trace.c in patch 6
      
      v3->v4:
      - adopted Linus's CAST_TO_U64 macro to cast any integer, pointer, or small
        struct to u64. That nicely reduced the size of patch 1
      
      v2->v3:
      - with Linus's suggestion introduced generic COUNT_ARGS and CONCATENATE macros
        (or rather moved them from apparmor)
        that cleaned up patch 6
      - added patch 4 to refactor trace_iwlwifi_dev_ucode_error() from 17 args to 4
        Now any tracepoint with >12 args will have build error
      
      v1->v2:
      - simplified api by combing bpf_raw_tp_open(name) + bpf_attach(prog_fd) into
        bpf_raw_tp_open(name, prog_fd) as suggested by Daniel.
        That simplifies bpf_detach as well which is now simple close() of fd.
      - fixed memory leak in error path which was spotted by Daniel.
      - fixed bpf_get_stackid(), bpf_perf_event_output() called from raw tracepoints
      - added more tests
      - fixed allyesconfig build caught by buildbot
      
      v1:
      This patch set is a different way to address the pressing need to access
      task_struct pointers in sched tracepoints from bpf programs.
      
      The first approach simply added these pointers to sched tracepoints:
      https://lkml.org/lkml/2017/12/14/753
      which Peter nacked.
      Few options were discussed and eventually the discussion converged on
      doing bpf specific tracepoint_probe_register() probe functions.
      Details here:
      https://lkml.org/lkml/2017/12/20/929
      
      Patch 1 is kernel wide cleanup of pass-struct-by-value into
      pass-struct-by-reference into tracepoints.
      
      Patches 2 and 3 are minor cleanups to address allyesconfig build
      
      Patch 4 refactor trace_iwlwifi_dev_ucode_error from 17 to 4 args
      
      Patch 5 introduces COUNT_ARGS macro
      
      Patch 6 introduces BPF_RAW_TRACEPOINT api.
      the auto-cleanup and multiple concurrent users are must have
      features of tracing api. For bpf raw tracepoints it looks like:
        // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
        prog_fd = bpf_prog_load(...);
      
        // receive anon_inode fd for given bpf_raw_tracepoint
        // and attach bpf program to it
        raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool will automatically
      detach bpf program, unload it and unregister tracepoint probe.
      More details in patch 6.
      
      Patch 7 - trivial support in libbpf
      Patches 8, 9 - user space tests
      
      samples/bpf/test_overhead performance on 1 cpu:
      
      tracepoint    base  kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
      task_rename   1.1M   769K        947K            1.0M
      urandom_read  789K   697K        750K            755K
      ====================
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f6ef5658
    • A
      selftests/bpf: test for bpf_get_stackid() from raw tracepoints · 3bbe0869
      Alexei Starovoitov 提交于
      similar to traditional traceopint test add bpf_get_stackid() test
      from raw tracepoints
      and reduce verbosity of existing stackmap test
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      3bbe0869
    • A
      samples/bpf: raw tracepoint test · 4662a4e5
      Alexei Starovoitov 提交于
      add empty raw_tracepoint bpf program to test overhead similar
      to kprobe and traditional tracepoint tests
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4662a4e5
    • A
      libbpf: add bpf_raw_tracepoint_open helper · a0fe3e57
      Alexei Starovoitov 提交于
      add bpf_raw_tracepoint_open(const char *name, int prog_fd) api to libbpf
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a0fe3e57
    • A
      bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Alexei Starovoitov 提交于
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      >From bpf program point of view the access to the arguments look like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      kprobe+bpf infrastructure allows programs access function arguments.
      This feature allows programs access raw tracepoint arguments.
      
      Similar to proposed 'dynamic ftrace events' there are no abi guarantees
      to what the tracepoints arguments are and what their meaning is.
      The program needs to type cast args properly and use bpf_probe_read()
      helper to access struct fields when argument is a pointer.
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet is casting 32-bit 'act' field into 'u64'
      to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
      All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
      and in total this approach adds 7k bytes to .text.
      
      This approach gives the lowest possible overhead
      while calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second
      this is valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side the __bpf_raw_tp_map section of pointers to
      tracepoint definition and to __bpf_trace_*() probe function is used
      to find a tracepoint with "xdp_exception" name and
      corresponding __bpf_trace_xdp_exception() probe function
      which are passed to tracepoint_probe_register() to connect probe
      with tracepoint.
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
      Each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      __bpf_raw_tp_map section logic was contributed by Steven Rostedt
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c4f6699d
    • A
      macro: introduce COUNT_ARGS() macro · cf14f27f
      Alexei Starovoitov 提交于
      move COUNT_ARGS() macro from apparmor to generic header and extend it
      to count till twelve.
      
      COUNT() was an alternative name for this logic, but it's used for
      different purpose in many other places.
      
      Similarly for CONCATENATE() macro.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cf14f27f
    • A
      net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint · 4fe43c2c
      Alexei Starovoitov 提交于
      fix iwlwifi_dev_ucode_error tracepoint to pass pointer to a table
      instead of all 17 arguments by value.
      dvm/main.c and mvm/utils.c have 'struct iwl_error_event_table'
      defined with very similar yet subtly different fields and offsets.
      tracepoint is still common and using definition of 'struct iwl_error_event_table'
      from dvm/commands.h while copying fields.
      Long term this tracepoint probably should be split into two.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fe43c2c
    • A
      net/mac802154: disambiguate mac80215 vs mac802154 trace events · 14624a93
      Alexei Starovoitov 提交于
      two trace events defined with the same name and both unused.
      They conflict in allyesconfig build. Rename one of them.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      14624a93
    • A
      net/mediatek: disambiguate mt76 vs mt7601u trace events · d992ee6c
      Alexei Starovoitov 提交于
      two trace events defined with the same name and both unused.
      They conflict in allyesconfig build. Rename one of them.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d992ee6c
    • A
      treewide: remove large struct-pass-by-value from tracepoint arguments · c1055475
      Alexei Starovoitov 提交于
      - fix trace_hfi1_ctxt_info() to pass large struct by reference instead of by value
      - convert 'type array[]' tracepoint arguments into 'type *array',
        since compiler will warn that sizeof('type array[]') == sizeof('type *array')
        and later should be used instead
      
      The CAST_TO_U64 macro in the later patch will enforce that tracepoint
      arguments can only be integers, pointers, or less than 8 byte structures.
      Larger structures should be passed by reference.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c1055475
    • N
      bpf: Add sock_ops R/W access to ipv4 tos · 6f5c39fa
      Nikita V. Shirokov 提交于
      Sample usage for tos ...
      
        bpf_getsockopt(skops, SOL_IP, IP_TOS, &v, sizeof(v))
      
      ... where skops is a pointer to the ctx (struct bpf_sock_ops).
      Signed-off-by: NNikita V. Shirokov <tehnerd@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6f5c39fa
  2. 28 3月, 2018 2 次提交
  3. 26 3月, 2018 3 次提交
  4. 24 3月, 2018 7 次提交
  5. 23 3月, 2018 17 次提交
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 03fe2deb
      David S. Miller 提交于
      Fun set of conflict resolutions here...
      
      For the mac80211 stuff, these were fortunately just parallel
      adds.  Trivially resolved.
      
      In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
      function phy_disable_interrupts() earlier in the file, whilst in
      'net-next' the phy_error() call from this function was removed.
      
      In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
      'rt_table_id' member of rtable collided with a bug fix in 'net' that
      added a new struct member "rt_mtu_locked" which needs to be copied
      over here.
      
      The mlxsw driver conflict consisted of net-next separating
      the span code and definitions into separate files, whilst
      a 'net' bug fix made some changes to that moved code.
      
      The mlx5 infiniband conflict resolution was quite non-trivial,
      the RDMA tree's merge commit was used as a guide here, and
      here are their notes:
      
      ====================
      
          Due to bug fixes found by the syzkaller bot and taken into the for-rc
          branch after development for the 4.17 merge window had already started
          being taken into the for-next branch, there were fairly non-trivial
          merge issues that would need to be resolved between the for-rc branch
          and the for-next branch.  This merge resolves those conflicts and
          provides a unified base upon which ongoing development for 4.17 can
          be based.
      
          Conflicts:
                  drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f
                  (IB/mlx5: Fix cleanup order on unload) added to for-rc and
                  commit b5ca15ad (IB/mlx5: Add proper representors support)
                  add as part of the devel cycle both needed to modify the
                  init/de-init functions used by mlx5.  To support the new
                  representors, the new functions added by the cleanup patch
                  needed to be made non-static, and the init/de-init list
                  added by the representors patch needed to be modified to
                  match the init/de-init list changes made by the cleanup
                  patch.
          Updates:
                  drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
                  prototypes added by representors patch to reflect new function
                  names as changed by cleanup patch
                  drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
                  stage list to match new order from cleanup patch
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03fe2deb
    • L
      Merge branch 'akpm' (patches from Andrew) · f36b7534
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "13 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, thp: do not cause memcg oom for thp
        mm/vmscan: wake up flushers for legacy cgroups too
        Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
        mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
        mm/thp: do not wait for lock_page() in deferred_split_scan()
        mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
        x86/mm: implement free pmd/pte page interfaces
        mm/vmalloc: add interfaces to free unmapped page table
        h8300: remove extraneous __BIG_ENDIAN definition
        hugetlbfs: check for pgoff value overflow
        lockdep: fix fs_reclaim warning
        MAINTAINERS: update Mark Fasheh's e-mail
        mm/mempolicy.c: avoid use uninitialized preferred_node
      f36b7534
    • L
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 8401c72c
      Linus Torvalds 提交于
      Pull libnvdimm fixes from Dan Williams:
       "Two regression fixes, two bug fixes for older issues, two fixes for
        new functionality added this cycle that have userspace ABI concerns,
        and a small cleanup. These have appeared in a linux-next release and
        have a build success report from the 0day robot.
      
         * The 4.16 rework of altmap handling led to some configurations
           leaking page table allocations due to freeing from the altmap
           reservation rather than the page allocator.
      
           The impact without the fix is leaked memory and a WARN() message
           when tearing down libnvdimm namespaces. The rework also missed a
           place where error handling code needed to be removed that can lead
           to a crash if devm_memremap_pages() fails.
      
         * acpi_map_pxm_to_node() had a latent bug whereby it could
           misidentify the closest online node to a given proximity domain.
      
         * Block integrity handling was reworked several kernels back to allow
           calling add_disk() after setting up the integrity profile.
      
           The nd_btt and nd_blk drivers are just now catching up to fix
           automatic partition detection at driver load time.
      
         * The new peristence_domain attribute, a platform indicator of
           whether cpu caches are powerfail protected for example, is meant to
           be a single value enum and not a set of flags.
      
           This oversight was caught while reviewing new userspace code in
           libndctl to communicate the attribute.
      
           Fix this new enabling up so that we are not stuck with an unwanted
           userspace ABI"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        libnvdimm, nfit: fix persistence domain reporting
        libnvdimm, region: hide persistence_domain when unknown
        acpi, numa: fix pxm to online numa node associations
        x86, memremap: fix altmap accounting at free
        libnvdimm: remove redundant assignment to pointer 'dev'
        libnvdimm, {btt, blk}: do integrity setup before add_disk()
        kernel/memremap: Remove stale devres_free() call
      8401c72c
    • L
      Merge tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux · 9ec7ccc8
      Linus Torvalds 提交于
      Pull drm fixes from Dave Airlie:
       "A bunch of fixes all over the place (core, i915, amdgpu, imx, sun4i,
        ast, tegra, vmwgfx), nothing too serious or worrying at this stage.
      
         - one uapi fix to stop multi-planar images with getfb
      
         - Sun4i error path and clock fixes
      
         - udl driver mmap offset fix
      
         - i915 DP MST and GPU reset fixes
      
         - vmwgfx mutex and black screen fixes
      
         - imx array underflow fix and vblank fix
      
         - amdgpu: display fixes
      
         - exynos devicetree fix
      
         - ast mode fix"
      
      * tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux: (29 commits)
        drm/ast: Fixed 1280x800 Display Issue
        drm: udl: Properly check framebuffer mmap offsets
        drm/i915: Specify which engines to reset following semaphore/event lockups
        drm/vmwgfx: Fix a destoy-while-held mutex problem.
        drm/vmwgfx: Fix black screen and device errors when running without fbdev
        drm: Reject getfb for multi-plane framebuffers
        drm/amd/display: Add one to EDID's audio channel count when passing to DC
        drm/amd/display: We shouldn't set format_default on plane as atomic driver
        drm/amd/display: Fix FMT truncation programming
        drm/amd/display: Allow truncation to 10 bits
        drm/sun4i: hdmi: Fix another error handling path in 'sun4i_hdmi_bind()'
        drm/sun4i: hdmi: Fix an error handling path in 'sun4i_hdmi_bind()'
        drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
        drm/amd/display: fix dereferencing possible ERR_PTR()
        drm/amd/display: Refine disable VGA
        drm/tegra: Shutdown on driver unbind
        drm/tegra: dsi: Don't disable regulator on ->exit()
        drm/tegra: dc: Detach IOMMU group from domain only once
        dt-bindings: exynos: Document #sound-dai-cells property of the HDMI node
        drm/imx: move arming of the vblank event to atomic_flush
        ...
      9ec7ccc8
    • D
      mm, thp: do not cause memcg oom for thp · 9d3c3354
      David Rientjes 提交于
      Commit 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and
      madvised allocations") changed the page allocator to no longer detect
      thp allocations based on __GFP_NORETRY.
      
      It did not, however, modify the mem cgroup try_charge() path to avoid
      oom kill for either khugepaged collapsing or thp faulting.  It is never
      expected to oom kill a process to allocate a hugepage for thp; reclaim
      is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
      (and charging) should fallback instead of oom killing processes.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
      Fixes: 25160354 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d3c3354
    • A
      mm/vmscan: wake up flushers for legacy cgroups too · 1c610d5f
      Andrey Ryabinin 提交于
      Commit 726d061f ("mm: vmscan: kick flushers when we encounter dirty
      pages on the LRU") added flusher invocation to shrink_inactive_list()
      when many dirty pages on the LRU are encountered.
      
      However, shrink_inactive_list() doesn't wake up flushers for legacy
      cgroup reclaim, so the next commit bbef9384 ("mm: vmscan: remove old
      flusher wakeup from direct reclaim path") removed the only source of
      flusher's wake up in legacy mem cgroup reclaim path.
      
      This leads to premature OOM if there is too many dirty pages in cgroup:
          # mkdir /sys/fs/cgroup/memory/test
          # echo $$ > /sys/fs/cgroup/memory/test/tasks
          # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          # dd if=/dev/zero of=tmp_file bs=1M count=100
          Killed
      
          dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
      
          Call Trace:
           dump_stack+0x46/0x65
           dump_header+0x6b/0x2ac
           oom_kill_process+0x21c/0x4a0
           out_of_memory+0x2a5/0x4b0
           mem_cgroup_out_of_memory+0x3b/0x60
           mem_cgroup_oom_synchronize+0x2ed/0x330
           pagefault_out_of_memory+0x24/0x54
           __do_page_fault+0x521/0x540
           page_fault+0x45/0x50
      
          Task in /test killed as a result of limit of /test
          memory: usage 51200kB, limit 51200kB, failcnt 73
          memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
          kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
          Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
                  mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
      	    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
          Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
          Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
          oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Wake up flushers in legacy cgroup reclaim too.
      
      Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
      Fixes: bbef9384 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c610d5f
    • D
      Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Daniel Vacek 提交于
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit is meant to be a boot init
      speed up skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mapping on x86_64 (or more generally
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
      implementation also skips valid pfns which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: NDaniel Vacek <neelx@redhat.com>
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
    • K
      mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink() · b3cd54b2
      Kirill A. Shutemov 提交于
      shmem_unused_huge_shrink() gets called from reclaim path.  Waiting for
      page lock may lead to deadlock there.
      
      There was a bug report that may be attributed to this:
      
        http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      We can test for the PageTransHuge() outside the page lock as we only
      need protection against splitting the page under us.  Holding pin oni
      the page is enough for this.
      
      Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NEric Wheeler <linux-mm@lists.ewheeler.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3cd54b2
    • K
      mm/thp: do not wait for lock_page() in deferred_split_scan() · fa41b900
      Kirill A. Shutemov 提交于
      deferred_split_scan() gets called from reclaim path.  Waiting for page
      lock may lead to deadlock there.
      
      Replace lock_page() with trylock_page() and skip the page if we failed
      to lock it.  We will get to the page on the next scan.
      
      Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
      Fixes: 9a982250 ("thp: introduce deferred_split_huge_page()")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa41b900
    • K
      mm/khugepaged.c: convert VM_BUG_ON() to collapse fail · fece2029
      Kirill A. Shutemov 提交于
      khugepaged is not yet able to convert PTE-mapped huge pages back to PMD
      mapped.  We do not collapse such pages.  See check
      khugepaged_scan_pmd().
      
      But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
      somebody managed to instantiate THP in the range and then split the PMD
      back to PTEs we would have a problem --
      VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.
      
      It's possible since we drop mmap_sem during collapse to re-take for
      write.
      
      Replace the VM_BUG_ON() with graceful collapse fail.
      
      Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
      Fixes: b1caa957 ("khugepaged: ignore pmd tables with THP mapped with ptes")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fece2029
    • T
      x86/mm: implement free pmd/pte page interfaces · 28ee90fe
      Toshi Kani 提交于
      Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
      clear a given pud/pmd entry and free up lower level page table(s).
      
      The address range associated with the pud/pmd entry must have been
      purged by INVLPG.
      
      Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Reported-by: NLei Li <lious.lilei@hisilicon.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28ee90fe
    • T
      mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Toshi Kani 提交于
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K size, valid page table will build,
       2. iounmap it, pte0 will set to 0;
       3. ioremap the same address with 2M size, pgd/pmd is unchanged,
          then set the a new value for pmd;
       4. pte0 is leaked;
       5. CPU may meet exception because the old pmd is still in TLB,
          which will lead to kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with purged address on x86.  x86
      still has memory leak.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() to free a pte page is an
         overhead for regular vmap users as they do not need a pte page freed
         up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
         purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which work
      as workaround.
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: NLei Li <lious.lilei@hisilicon.com>
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6bdb751
    • A
      h8300: remove extraneous __BIG_ENDIAN definition · 1705f7c5
      Arnd Bergmann 提交于
      A bugfix I did earlier caused a build regression on h8300, which defines
      the __BIG_ENDIAN macro in a slightly different way than the generic
      code:
      
        arch/h8300/include/asm/byteorder.h:5:0: warning: "__BIG_ENDIAN" redefined
      
      We don't need to define it here, as the same macro is already provided
      by the linux/byteorder/big_endian.h, and that version does not conflict.
      
      While this is a v4.16 regression, my earlier patch also got backported
      to the 4.14 and 4.15 stable kernels, so we need the fixup there as well.
      
      Link: http://lkml.kernel.org/r/20180313120752.2645129-1-arnd@arndb.de
      Fixes: 101110f6 ("Kbuild: always define endianess in kconfig.h")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1705f7c5
    • M
      hugetlbfs: check for pgoff value overflow · 63489f8e
      Mike Kravetz 提交于
      A vma with vm_pgoff large enough to overflow a loff_t type when
      converted to a byte offset can be passed via the remap_file_pages system
      call.  The hugetlbfs mmap routine uses the byte offset to calculate
      reservations and file size.
      
      A sequence such as:
      
        mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
        remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);
      
      will result in the following when task exits/file closed,
      
        kernel BUG at mm/hugetlb.c:749!
        Call Trace:
          hugetlbfs_evict_inode+0x2f/0x40
          evict+0xcb/0x190
          __dentry_kill+0xcb/0x150
          __fput+0x164/0x1e0
          task_work_run+0x84/0xa0
          exit_to_usermode_loop+0x7d/0x80
          do_syscall_64+0x18b/0x190
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The overflowed pgoff value causes hugetlbfs to try to set up a mapping
      with a negative range (end < start) that leaves invalid state which
      causes the BUG.
      
      The previous overflow fix to this code was incomplete and did not take
      the remap_file_pages system call into account.
      
      [mike.kravetz@oracle.com: v3]
        Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
      [akpm@linux-foundation.org: include mmdebug.h]
      [akpm@linux-foundation.org: fix -ve left shift count on sh]
      Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
      Fixes: 045c7a3f ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NNic Losby <blurbdust@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63489f8e
    • T
      lockdep: fix fs_reclaim warning · 2e517d68
      Tetsuo Handa 提交于
      Dave Jones reported fs_reclaim lockdep warnings.
      
        ============================================
        WARNING: possible recursive locking detected
        4.15.0-rc9-backup-debug+ #1 Not tainted
        --------------------------------------------
        sshd/24800 is trying to acquire lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        but task is already holding lock:
         (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(fs_reclaim);
          lock(fs_reclaim);
      
         *** DEADLOCK ***
      
         May be due to missing lock nesting notation
      
        2 locks held by sshd/24800:
         #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
         #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30
      
        stack backtrace:
        CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
        Call Trace:
         dump_stack+0xbc/0x13f
         __lock_acquire+0xa09/0x2040
         lock_acquire+0x12e/0x350
         fs_reclaim_acquire.part.102+0x29/0x30
         kmem_cache_alloc+0x3d/0x2c0
         alloc_extent_state+0xa7/0x410
         __clear_extent_bit+0x3ea/0x570
         try_release_extent_mapping+0x21a/0x260
         __btrfs_releasepage+0xb0/0x1c0
         btrfs_releasepage+0x161/0x170
         try_to_release_page+0x162/0x1c0
         shrink_page_list+0x1d5a/0x2fb0
         shrink_inactive_list+0x451/0x940
         shrink_node_memcg.constprop.88+0x4c9/0x5e0
         shrink_node+0x12d/0x260
         try_to_free_pages+0x418/0xaf0
         __alloc_pages_slowpath+0x976/0x1790
         __alloc_pages_nodemask+0x52c/0x5c0
         new_slab+0x374/0x3f0
         ___slab_alloc.constprop.81+0x47e/0x5a0
         __slab_alloc.constprop.80+0x32/0x60
         __kmalloc_track_caller+0x267/0x310
         __kmalloc_reserve.isra.40+0x29/0x80
         __alloc_skb+0xee/0x390
         sk_stream_alloc_skb+0xb8/0x340
         tcp_sendmsg_locked+0x8e6/0x1d30
         tcp_sendmsg+0x27/0x40
         inet_sendmsg+0xd0/0x310
         sock_write_iter+0x17a/0x240
         __vfs_write+0x2ab/0x380
         vfs_write+0xfb/0x260
         SyS_write+0xb6/0x140
         do_syscall_64+0x1e5/0xc05
         entry_SYSCALL64_slow_path+0x25/0x25
      
      This warning is caused by commit d92a8cfc ("locking/lockdep:
      Rework FS_RECLAIM annotation") which replaced the use of
      lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
      and lockdep_trace_alloc() in slab_pre_alloc_hook() with
      fs_reclaim_acquire()/ fs_reclaim_release().
      
      Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
      __GFP_NOWARN to gfp_mask, and all reclaim path simply propagates
      __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is
      trying to grab the 'fake' lock again when __perform_reclaim() already
      grabbed the 'fake' lock.
      
      The
      
        /* this guy won't enter reclaim */
        if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
                return false;
      
      test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
      was added by commit cf40bd16 ("lockdep: annotate reclaim context
      (__GFP_NOFS)").  But that test is outdated because PF_MEMALLOC thread
      won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
      341ce06f ("page allocator: calculate the alloc_flags for allocation
      only once") added the PF_MEMALLOC safeguard (
      
        /* Avoid recursion of direct reclaim */
        if (p->flags & PF_MEMALLOC)
                goto nopage;
      
      in __alloc_pages_slowpath()).
      
      Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and
      allow __need_fs_reclaim() to return false.
      
      Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
      Fixes: d92a8cfc ("locking/lockdep: Rework FS_RECLAIM annotation")
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Tested-by: NDave Jones <davej@codemonkey.org.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e517d68
    • M
      MAINTAINERS: update Mark Fasheh's e-mail · 296cefee
      Mark Fasheh 提交于
      I'd like to use my personal e-mail for Ocfs2 requests and review.
      
      Link: http://lkml.kernel.org/r/20180311231356.9385-1-mfasheh@versity.comSigned-off-by: NMark Fasheh <mark@fasheh.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      296cefee
    • Y
      mm/mempolicy.c: avoid use uninitialized preferred_node · 8970a63e
      Yisheng Xie 提交于
      Alexander reported a use of uninitialized memory in __mpol_equal(),
      which is caused by incorrect use of preferred_node.
      
      When mempolicy in mode MPOL_PREFERRED with flags MPOL_F_LOCAL, it uses
      numa_node_id() instead of preferred_node, however, __mpol_equal() uses
      preferred_node without checking whether it is MPOL_F_LOCAL or not.
      
      [akpm@linux-foundation.org: slight comment tweak]
      Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
      Fixes: fc36b8d3 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
      Signed-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Reported-by: NAlexander Potapenko <glider@google.com>
      Tested-by: NAlexander Potapenko <glider@google.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8970a63e