1. 01 4月, 2017 1 次提交
    • M
      mm: move mm_percpu_wq initialization earlier · 597b7305
      Michal Hocko 提交于
      Yang Li has reported that drain_all_pages triggers a WARN_ON which means
      that this function is called earlier than the mm_percpu_wq is
      initialized on arm64 with CMA configured:
      
        WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
        Modules linked in:
        CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
        Hardware name: Freescale Layerscape 2088A RDB Board (DT)
        task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
        PC is at drain_all_pages+0x244/0x25c
        LR is at start_isolate_page_range+0x14c/0x1f0
        [...]
         drain_all_pages+0x244/0x25c
         start_isolate_page_range+0x14c/0x1f0
         alloc_contig_range+0xec/0x354
         cma_alloc+0x100/0x1fc
         dma_alloc_from_contiguous+0x3c/0x44
         atomic_pool_init+0x7c/0x208
         arm64_dma_init+0x44/0x4c
         do_one_initcall+0x38/0x128
         kernel_init_freeable+0x1a0/0x240
         kernel_init+0x10/0xfc
         ret_from_fork+0x10/0x20
      
      Fix this by moving the whole setup_vmstat which is an initcall right now
      to init_mm_internals which will be called right after the WQ subsystem
      is initialized.
      
      Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NYang Li <pku.leo@gmail.com>
      Tested-by: NYang Li <pku.leo@gmail.com>
      Tested-by: NXiaolong Ye <xiaolong.ye@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      597b7305
  2. 25 3月, 2017 2 次提交
    • D
      uapi: fix rdma/mlx5-abi.h userspace compilation errors · 812755d6
      Dmitry V. Levin 提交于
      Consistently use types from linux/types.h to fix the following
      rdma/mlx5-abi.h userspace compilation errors:
      
      /usr/include/rdma/mlx5-abi.h:69:25: error: 'u64' undeclared here (not in a function)
        MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,
      /usr/include/rdma/mlx5-abi.h:69:29: error: expected ',' or '}' before numeric constant
        MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,
      
      Include <linux/if_ether.h> to fix the following rdma/mlx5-abi.h
      userspace compilation error:
      
      /usr/include/rdma/mlx5-abi.h:286:12: error: 'ETH_ALEN' undeclared here (not in a function)
        __u8 dmac[ETH_ALEN];
      Signed-off-by: NDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      812755d6
    • B
      IB/core: Restore I/O MMU, s390 and powerpc support · 0957c29f
      Bart Van Assche 提交于
      Avoid that the following error message is reported on the console
      while loading an RDMA driver with I/O MMU support enabled:
      
      DMAR: Allocating domain for mlx5_0 failed
      
      Ensure that DMA mapping operations that use to_pci_dev() to
      access to struct pci_dev see the correct PCI device. E.g. the s390
      and powerpc DMA mapping operations use to_pci_dev() even with I/O
      MMU support disabled.
      
      This patch preserves the following changes of the DMA mapping updates
      patch series:
      - Introduction of dma_virt_ops.
      - Removal of ib_device.dma_ops.
      - Removal of struct ib_dma_mapping_ops.
      - Removal of an if-statement from each ib_dma_*() operation.
      - IB HW drivers no longer set dma_device directly.
      Reported-by: NSebastian Ott <sebott@linux.vnet.ibm.com>
      Reported-by: NParav Pandit <parav@mellanox.com>
      Fixes: commit 99db9494 ("IB/core: Remove ib_device.dma_device")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: parav@mellanox.com
      Tested-by: parav@mellanox.com
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0957c29f
  3. 24 3月, 2017 1 次提交
  4. 23 3月, 2017 1 次提交
  5. 22 3月, 2017 7 次提交
  6. 21 3月, 2017 2 次提交
  7. 20 3月, 2017 1 次提交
    • S
      generic syscalls: Wire up statx syscall · fdfe4a39
      Stafford Horne 提交于
      The new syscall statx is implemented as generic code, so enable it
      for architectures like openrisc which use the generic syscall table.
      
      Fixes: a528d35e ("statx: Add a system call to make enhanced file info available")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Signed-off-by: NStafford Horne <shorne@gmail.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      fdfe4a39
  8. 19 3月, 2017 2 次提交
    • M
      target: fix ALUA transition timeout handling · d7175373
      Mike Christie 提交于
      The implicit transition time tells initiators the min time
      to wait before timing out a transition. We currently schedule
      the transition to occur in tg_pt_gp_implicit_trans_secs
      seconds so there is no room for delays. If
      core_alua_do_transition_tg_pt_work->core_alua_update_tpg_primary_metadata
      needs to write out info to a remote file, then the initiator can
      easily time out the operation.
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      d7175373
    • M
      target: allow ALUA setup for some passthrough backends · 530c6891
      Mike Christie 提交于
      This patch allows passthrough backends to use the core/base LIO
      ALUA setup and state checks, but still handle the execution of
      commands.
      
      This will allow the target_core_user module to execute STPG and RTPG
      in userspace, and not have to duplicate the ALUA state checks, path
      information (needed so we can check if command is executable on
      specific paths) and setup (rtslib sets/updates the configfs ALUA
      interface like it does for iblock or file).
      
      For STPG, the target_core_user userspace daemon, tcmu-runner will
      still execute the STPG, and to update the core/base LIO state it
      will use the existing configfs interface. For RTPG, tcmu-runner
      will loop over configfs and/or cache the state.
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      530c6891
  9. 17 3月, 2017 3 次提交
  10. 16 3月, 2017 4 次提交
    • G
      crypto: ccp - Assign DMA commands to the channel's CCP · 7c468447
      Gary R Hook 提交于
      The CCP driver generally uses a round-robin approach when
      assigning operations to available CCPs. For the DMA engine,
      however, the DMA mappings of the SGs are associated with a
      specific CCP. When an IOMMU is enabled, the IOMMU is
      programmed based on this specific device.
      
      If the DMA operations are not performed by that specific
      CCP then addressing errors and I/O page faults will occur.
      
      Update the CCP driver to allow a specific CCP device to be
      requested for an operation and use this in the DMA engine
      support.
      
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Signed-off-by: NGary R Hook <gary.hook@amd.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      7c468447
    • D
      vmbus: remove hv_event_tasklet_disable/enable · dad72a1d
      Dexuan Cui 提交于
      With the recent introduction of per-channel tasklet, we need to update
      the way we handle the 3 concurrency issues:
      
      1. hv_process_channel_removal -> percpu_channel_deq vs.
         vmbus_chan_sched -> list_for_each_entry(..., percpu_list);
      
      2. vmbus_process_offer -> percpu_channel_enq/deq vs. vmbus_chan_sched.
      
      3. vmbus_close_internal vs. the per-channel tasklet vmbus_on_event;
      
      The first 2 issues can be handled by Stephen's recent patch
      "vmbus: use rcu for per-cpu channel list", and the third issue
      can be handled by calling tasklet_disable in vmbus_close_internal here.
      
      We don't need the original hv_event_tasklet_disable/enable since we
      now use per-channel tasklet instead of the previous per-CPU tasklet,
      and actually we must remove them due to the side effect now:
      vmbus_process_offer -> hv_event_tasklet_enable -> tasklet_schedule will
      start the per-channel callback prematurely, cauing NULL dereferencing
      (the channel may haven't been properly configured to run the callback yet).
      
      Fixes: 631e63a9 ("vmbus: change to per channel tasklet")
      Signed-off-by: NDexuan Cui <decui@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dad72a1d
    • S
      vmbus: use rcu for per-cpu channel list · 8200f208
      Stephen Hemminger 提交于
      The per-cpu channel list is now referred to in the interrupt
      routine. This is mostly safe since the host will not normally generate
      an interrupt when channel is being deleted but if it did then there
      would be a use after free problem.
      
      To solve, this use RCU protection on ther per-cpu list.
      
      Fixes: 631e63a9 ("vmbus: change to per channel tasklet")
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8200f208
    • E
      fscrypt: eliminate ->prepare_context() operation · 94840e3c
      Eric Biggers 提交于
      The only use of the ->prepare_context() fscrypt operation was to allow
      ext4 to evict inline data from the inode before ->set_context().
      However, there is no reason why this cannot be done as simply the first
      step in ->set_context(), and in fact it makes more sense to do it that
      way because then the policy modes and flags get validated before any
      real work is done.  Therefore, merge ext4_prepare_context() into
      ext4_set_context(), and remove ->prepare_context().
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      94840e3c
  11. 14 3月, 2017 4 次提交
  12. 13 3月, 2017 4 次提交
    • S
      netfilter: Force fake conntrack entry to be at least 8 bytes aligned · 170a1fb9
      Steven Rostedt (VMware) 提交于
      Since the nfct and nfctinfo have been combined, the nf_conn structure
      must be at least 8 bytes aligned, as the 3 LSB bits are used for the
      nfctinfo. But there's a fake nf_conn structure to denote untracked
      connections, which is created by a PER_CPU construct. This does not
      guarantee that it will be 8 bytes aligned and can break the logic in
      determining the correct nfctinfo.
      
      I triggered this on a 32bit machine with the following error:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000af4
      IP: nf_ct_deliver_cached_events+0x1b/0xfb
      *pdpt = 0000000031962001 *pde = 0000000000000000
      
      Oops: 0000 [#1] SMP
      [Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
        OK  ]
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      task: c126ec00 task.stack: c1258000
      EIP: nf_ct_deliver_cached_events+0x1b/0xfb
      EFLAGS: 00010202 CPU: 0
      EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
      ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
       DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
      Call Trace:
       <SOFTIRQ>
       ? ipv6_skip_exthdr+0xac/0xcb
       ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
       nf_hook_slow+0x22/0xc7
       nf_hook+0x9a/0xad [ipv6]
       ? ip6t_do_table+0x356/0x379 [ip6_tables]
       ? ip6_fragment+0x9e9/0x9e9 [ipv6]
       ip6_output+0xee/0x107 [ipv6]
       ? ip6_fragment+0x9e9/0x9e9 [ipv6]
       dst_output+0x36/0x4d [ipv6]
       NF_HOOK.constprop.37+0xb2/0xba [ipv6]
       ? icmp6_dst_alloc+0x2c/0xfd [ipv6]
       ? local_bh_enable+0x14/0x14 [ipv6]
       mld_sendpack+0x1c5/0x281 [ipv6]
       ? mark_held_locks+0x40/0x5c
       mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
       call_timer_fn+0x135/0x283
       ? detach_if_pending+0x55/0x55
       ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
       __run_timers+0x111/0x14b
       ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
       run_timer_softirq+0x1c/0x36
       __do_softirq+0x185/0x37c
       ? test_ti_thread_flag.constprop.19+0xd/0xd
       do_softirq_own_stack+0x22/0x28
       </SOFTIRQ>
       irq_exit+0x5a/0xa4
       smp_apic_timer_interrupt+0x2a/0x34
       apic_timer_interrupt+0x37/0x3c
      
      By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
      alignment as all cache line sizes are at least 8 bytes or more.
      
      Fixes: a9e419dc ("netfilter: merge ctinfo into nfct pointer storage area")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      170a1fb9
    • L
      netfilter: nf_tables: fix mismatch in big-endian system · 10596608
      Liping Zhang 提交于
      Currently, there are two different methods to store an u16 integer to
      the u32 data register. For example:
        u32 *dest = &regs->data[priv->dreg];
        1. *dest = 0; *(u16 *) dest = val_u16;
        2. *dest = val_u16;
      
      For method 1, the u16 value will be stored like this, either in
      big-endian or little-endian system:
        0          15           31
        +-+-+-+-+-+-+-+-+-+-+-+-+
        |   Value   |     0     |
        +-+-+-+-+-+-+-+-+-+-+-+-+
      
      For method 2, in little-endian system, the u16 value will be the same
      as listed above. But in big-endian system, the u16 value will be stored
      like this:
        0          15           31
        +-+-+-+-+-+-+-+-+-+-+-+-+
        |     0     |   Value   |
        +-+-+-+-+-+-+-+-+-+-+-+-+
      
      So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do
      compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong
      result in big-endian system, as 0~15 bits will always be zero.
      
      For the similar reason, when loading an u16 value from the u32 data
      register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
      the 2nd method will get the wrong value in the big-endian system.
      
      So introduce some wrapper functions to store/load an u8 or u16
      integer to/from the u32 data register, and use them in the right
      place.
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      10596608
    • D
      uapi: fix drm/omap_drm.h userspace compilation errors · 337ba7fb
      Dmitry V. Levin 提交于
      Consistently use types from linux/types.h like in other uapi drm/*_drm.h
      header files to fix the following drm/omap_drm.h userspace compilation
      errors:
      
      /usr/include/drm/omap_drm.h:36:2: error: unknown type name 'uint64_t'
        uint64_t param;   /* in */
      /usr/include/drm/omap_drm.h:37:2: error: unknown type name 'uint64_t'
        uint64_t value;   /* in (set_param), out (get_param) */
      /usr/include/drm/omap_drm.h:56:2: error: unknown type name 'uint32_t'
        uint32_t bytes;  /* (for non-tiled formats) */
      /usr/include/drm/omap_drm.h:58:3: error: unknown type name 'uint16_t'
         uint16_t width;
      /usr/include/drm/omap_drm.h:59:3: error: unknown type name 'uint16_t'
         uint16_t height;
      /usr/include/drm/omap_drm.h:65:2: error: unknown type name 'uint32_t'
        uint32_t flags;   /* in */
      /usr/include/drm/omap_drm.h:66:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* out */
      /usr/include/drm/omap_drm.h:67:2: error: unknown type name 'uint32_t'
        uint32_t __pad;
      /usr/include/drm/omap_drm.h:77:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* buffer handle (in) */
      /usr/include/drm/omap_drm.h:78:2: error: unknown type name 'uint32_t'
        uint32_t op;   /* mask of omap_gem_op (in) */
      /usr/include/drm/omap_drm.h:82:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* buffer handle (in) */
      /usr/include/drm/omap_drm.h:83:2: error: unknown type name 'uint32_t'
        uint32_t op;   /* mask of omap_gem_op (in) */
      /usr/include/drm/omap_drm.h:88:2: error: unknown type name 'uint32_t'
        uint32_t nregions;
      /usr/include/drm/omap_drm.h:89:2: error: unknown type name 'uint32_t'
        uint32_t __pad;
      /usr/include/drm/omap_drm.h:93:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* buffer handle (in) */
      /usr/include/drm/omap_drm.h:94:2: error: unknown type name 'uint32_t'
        uint32_t pad;
      /usr/include/drm/omap_drm.h:95:2: error: unknown type name 'uint64_t'
        uint64_t offset;  /* mmap offset (out) */
      /usr/include/drm/omap_drm.h:102:2: error: unknown type name 'uint32_t'
        uint32_t size;   /* virtual size for mmap'ing (out) */
      /usr/include/drm/omap_drm.h:103:2: error: unknown type name 'uint32_t'
        uint32_t __pad;
      
      Fixes: ef6503e8 ("drm: Kbuild: add omap_drm.h to the installed headers")
      Signed-off-by: NDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: NTomi Valkeinen <tomi.valkeinen@ti.com>
      337ba7fb
    • D
      bpf: improve read-only handling · 65869a47
      Daniel Borkmann 提交于
      Improve bpf_{prog,jit_binary}_{un,}lock_ro() by throwing a
      one-time warning in case of an error when the image couldn't
      be set read-only, and also mark struct bpf_prog as locked when
      bpf_prog_lock_ro() was called.
      
      Reason for the latter is that bpf_prog_unlock_ro() is called from
      various places including error paths, and we shouldn't mess with
      page attributes when really not needed.
      
      For bpf_jit_binary_unlock_ro() this is not needed as jited flag
      implicitly indicates this, thus for archs with ARCH_HAS_SET_MEMORY
      we're guaranteed to have a previously locked image. Overall, this
      should also help us to identify any further potential issues with
      set_memory_*() helpers.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65869a47
  13. 11 3月, 2017 3 次提交
  14. 10 3月, 2017 5 次提交
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
    • A
      userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED · 70ccb92f
      Andrea Arcangeli 提交于
      userfaultfd_remove() has to be execute before zapping the pagetables or
      UFFDIO_COPY could keep filling pages after zap_page_range returned,
      which would result in non zero data after a MADV_DONTNEED.
      
      However userfaultfd_remove() may have to release the mmap_sem.  This was
      handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
      potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
      
      The fix consists in revalidating the vma in case userfaultfd_remove()
      had to release the mmap_sem.
      
      This also optimizes away an unnecessary down_read/up_read in the
      MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
      
      It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
      userfaultfd_remove() will be defined as "true" at build time.
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ccb92f
    • Y
      mm/vmstats: add thp_split_pud event for clarity · ce9311cf
      Yisheng Xie 提交于
      We added support for PUD-sized transparent hugepages, however we count
      the event "thp split pud" into thp_split_pmd event.
      
      To separate the event count of thp split pud from pmd, add a new event
      named thp_split_pud.
      
      Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce9311cf
    • A
      include/linux/fs.h: fix unsigned enum warning with gcc-4.2 · cbfd0c10
      Arnd Bergmann 提交于
      With arm-linux-gcc-4.2, almost every file we build in the kernel ends up
      with this warning:
      
        include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is always false
      
      Later versions don't have this problem, but it's easy enough to work
      around.
      
      Link: http://lkml.kernel.org/r/20161216105634.235457-12-arnd@arndb.deSigned-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbfd0c10
    • A
      userfaultfd: non-cooperative: rollback userfaultfd_exit · dd0db88d
      Andrea Arcangeli 提交于
      Patch series "userfaultfd non-cooperative further update for 4.11 merge
      window".
      
      Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
      more testing.  I've been doing testing before and this was also tested
      by kbuild bot and exercised by the selftest, but this bug never
      reproduced before.
      
      I dropped userfaultfd_exit as result.  I dropped it because of
      implementation difficulty in receiving signals in __mmput and because I
      think -ENOSPC as result from the background UFFDIO_COPY should be enough
      already.
      
      Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
      wasn't exercised by the selftest and when I tried to exercise it, after
      moving it to a more correct place in __mmput where it would make more
      sense and where the vma list is stable, it resulted in the
      event_wait_completion in D state.  So then I added the second patch to
      be sure even if we call userfaultfd_event_wait_completion too late
      during task exit(), we won't risk to generate tasks in D state.  The
      same check exists in handle_userfault() for the same reason, except it
      makes a difference there, while here is just a robustness check and it's
      run under WARN_ON_ONCE.
      
      While looking at the userfaultfd_event_wait_completion() function I
      looked back at its callers too while at it and I think it's not ok to
      stop executing dup_fctx on the fcs list because we relay on
      userfaultfd_event_wait_completion to execute
      userfaultfd_ctx_put(fctx->orig) which is paired against
      userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
      list_add(fcs).  This change only takes care of fctx->orig but this area
      also needs further review looking for similar problems in fctx->new.
      
      The only patch that is urgent is the first because it's an use after
      free during a SMP race condition that affects all processes if
      CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
      impossible without SLUB poisoning enabled.
      
      This patch (of 3):
      
      I once reproduced this oops with the userfaultfd selftest, it's not
      easily reproducible and it requires SLUB poisoning to reproduce.
      
          general protection fault: 0000 [#1] SMP
          Modules linked in:
          CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
          task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
          RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
          RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
          RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
          RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
          R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
          FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Call Trace:
            do_exit+0x297/0xd10
            SyS_exit+0x17/0x20
            tracesys+0xdd/0xe2
          Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
          RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
           RSP <ffff8801f833fe80>
          ---[ end trace 9fecd6dcb442846a ]---
      
      In the debugger I located the "mm" pointer in the stack and walking
      mm->mmap->vm_next through the end shows the vma->vm_next list is fully
      consistent and it is null terminated list as expected.  So this has to
      be an SMP race condition where userfaultfd_exit was running while the
      vma list was being modified by another CPU.
      
      When userfaultfd_exit() run one of the ->vm_next pointers pointed to
      SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
      
      The reason is that it's not running in __mmput but while there are still
      other threads running and it's not holding the mmap_sem (it can't as it
      has to wait the even to be received by the manager).  So this is an use
      after free that was happening for all processes.
      
      One more implementation problem aside from the race condition:
      userfaultfd_exit has really to check a flag in mm->flags before walking
      the vma or it's going to slowdown the exit() path for regular tasks.
      
      One more implementation problem: at that point signals can't be
      delivered so it would also create a task in D state if the manager
      doesn't read the event.
      
      The major design issue: it overall looks superfluous as the manager can
      check for -ENOSPC in the background transfer:
      
      	if (mmget_not_zero(ctx->mm)) {
      [..]
      	} else {
      		return -ENOSPC;
      	}
      
      It's safer to roll it back and re-introduce it later if at all.
      
      [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
        Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd0db88d