1. 04 4月, 2017 1 次提交
  2. 03 4月, 2017 2 次提交
    • D
      KEYS: Add a system blacklist keyring · 734114f8
      David Howells 提交于
      Add the following:
      
       (1) A new system keyring that is used to store information about
           blacklisted certificates and signatures.
      
       (2) A new key type (called 'blacklist') that is used to store a
           blacklisted hash in its description as a hex string.  The key accepts
           no payload.
      
       (3) The ability to configure a list of blacklisted hashes into the kernel
           at build time.  This is done by setting
           CONFIG_SYSTEM_BLACKLIST_HASH_LIST to the filename of a list of hashes
           that are in the form:
      
      	"<hash>", "<hash>", ..., "<hash>"
      
           where each <hash> is a hex string representation of the hash and must
           include all necessary leading zeros to pad the hash to the right size.
      
      The above are enabled with CONFIG_SYSTEM_BLACKLIST_KEYRING.
      
      Once the kernel is booted, the blacklist keyring can be listed:
      
      	root@andromeda ~]# keyctl show %:.blacklist
      	Keyring
      	 723359729 ---lswrv      0     0  keyring: .blacklist
      	 676257228 ---lswrv      0     0   \_ blacklist: 123412341234c55c1dcc601ab8e172917706aa32fb5eaf826813547fdf02dd46
      
      The blacklist cannot currently be modified by userspace, but it will be
      possible to load it, for example, from the UEFI blacklist database.
      
      A later commit will make it possible to load blacklisted asymmetric keys in
      here too.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      734114f8
    • E
      security, keys: convert key.usage from atomic_t to refcount_t · fff29291
      Elena Reshetova 提交于
      refcount_t type and corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This allows to avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid Windsor <dwindsor@gmail.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      fff29291
  3. 28 3月, 2017 1 次提交
    • T
      LSM: Revive security_task_alloc() hook and per "struct task_struct" security blob. · e4e55b47
      Tetsuo Handa 提交于
      We switched from "struct task_struct"->security to "struct cred"->security
      in Linux 2.6.29. But not all LSM modules were happy with that change.
      TOMOYO LSM module is an example which want to use per "struct task_struct"
      security blob, for TOMOYO's security context is defined based on "struct
      task_struct" rather than "struct cred". AppArmor LSM module is another
      example which want to use it, for AppArmor is currently abusing the cred
      a little bit to store the change_hat and setexeccon info. Although
      security_task_free() hook was revived in Linux 3.4 because Yama LSM module
      wanted to release per "struct task_struct" security blob,
      security_task_alloc() hook and "struct task_struct"->security field were
      not revived. Nowadays, we are getting proposals of lightweight LSM modules
      which want to use per "struct task_struct" security blob.
      
      We are already allowing multiple concurrent LSM modules (up to one fully
      armored module which uses "struct cred"->security field or exclusive hooks
      like security_xfrm_state_pol_flow_match(), plus unlimited number of
      lightweight modules which do not use "struct cred"->security nor exclusive
      hooks) as long as they are built into the kernel. But this patch does not
      implement variable length "struct task_struct"->security field which will
      become needed when multiple LSM modules want to use "struct task_struct"->
      security field. Although it won't be difficult to implement variable length
      "struct task_struct"->security field, let's think about it after we merged
      this patch.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NJohn Johansen <john.johansen@canonical.com>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Acked-by: NCasey Schaufler <casey@schaufler-ca.com>
      Tested-by: NDjalal Harouni <tixxdz@gmail.com>
      Acked-by: NJosé Bollo <jobol@nonadev.net>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: José Bollo <jobol@nonadev.net>
      Signed-off-by: NJames Morris <james.l.morris@oracle.com>
      e4e55b47
  4. 25 3月, 2017 2 次提交
    • D
      uapi: fix rdma/mlx5-abi.h userspace compilation errors · 812755d6
      Dmitry V. Levin 提交于
      Consistently use types from linux/types.h to fix the following
      rdma/mlx5-abi.h userspace compilation errors:
      
      /usr/include/rdma/mlx5-abi.h:69:25: error: 'u64' undeclared here (not in a function)
        MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,
      /usr/include/rdma/mlx5-abi.h:69:29: error: expected ',' or '}' before numeric constant
        MLX5_LIB_CAP_4K_UAR = (u64)1 << 0,
      
      Include <linux/if_ether.h> to fix the following rdma/mlx5-abi.h
      userspace compilation error:
      
      /usr/include/rdma/mlx5-abi.h:286:12: error: 'ETH_ALEN' undeclared here (not in a function)
        __u8 dmac[ETH_ALEN];
      Signed-off-by: NDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      812755d6
    • B
      IB/core: Restore I/O MMU, s390 and powerpc support · 0957c29f
      Bart Van Assche 提交于
      Avoid that the following error message is reported on the console
      while loading an RDMA driver with I/O MMU support enabled:
      
      DMAR: Allocating domain for mlx5_0 failed
      
      Ensure that DMA mapping operations that use to_pci_dev() to
      access to struct pci_dev see the correct PCI device. E.g. the s390
      and powerpc DMA mapping operations use to_pci_dev() even with I/O
      MMU support disabled.
      
      This patch preserves the following changes of the DMA mapping updates
      patch series:
      - Introduction of dma_virt_ops.
      - Removal of ib_device.dma_ops.
      - Removal of struct ib_dma_mapping_ops.
      - Removal of an if-statement from each ib_dma_*() operation.
      - IB HW drivers no longer set dma_device directly.
      Reported-by: NSebastian Ott <sebott@linux.vnet.ibm.com>
      Reported-by: NParav Pandit <parav@mellanox.com>
      Fixes: commit 99db9494 ("IB/core: Remove ib_device.dma_device")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: parav@mellanox.com
      Tested-by: parav@mellanox.com
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0957c29f
  5. 23 3月, 2017 1 次提交
  6. 22 3月, 2017 7 次提交
  7. 21 3月, 2017 2 次提交
  8. 20 3月, 2017 1 次提交
    • S
      generic syscalls: Wire up statx syscall · fdfe4a39
      Stafford Horne 提交于
      The new syscall statx is implemented as generic code, so enable it
      for architectures like openrisc which use the generic syscall table.
      
      Fixes: a528d35e ("statx: Add a system call to make enhanced file info available")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Signed-off-by: NStafford Horne <shorne@gmail.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      fdfe4a39
  9. 19 3月, 2017 2 次提交
    • M
      target: fix ALUA transition timeout handling · d7175373
      Mike Christie 提交于
      The implicit transition time tells initiators the min time
      to wait before timing out a transition. We currently schedule
      the transition to occur in tg_pt_gp_implicit_trans_secs
      seconds so there is no room for delays. If
      core_alua_do_transition_tg_pt_work->core_alua_update_tpg_primary_metadata
      needs to write out info to a remote file, then the initiator can
      easily time out the operation.
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      d7175373
    • M
      target: allow ALUA setup for some passthrough backends · 530c6891
      Mike Christie 提交于
      This patch allows passthrough backends to use the core/base LIO
      ALUA setup and state checks, but still handle the execution of
      commands.
      
      This will allow the target_core_user module to execute STPG and RTPG
      in userspace, and not have to duplicate the ALUA state checks, path
      information (needed so we can check if command is executable on
      specific paths) and setup (rtslib sets/updates the configfs ALUA
      interface like it does for iblock or file).
      
      For STPG, the target_core_user userspace daemon, tcmu-runner will
      still execute the STPG, and to update the core/base LIO state it
      will use the existing configfs interface. For RTPG, tcmu-runner
      will loop over configfs and/or cache the state.
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      530c6891
  10. 17 3月, 2017 3 次提交
  11. 16 3月, 2017 4 次提交
    • G
      crypto: ccp - Assign DMA commands to the channel's CCP · 7c468447
      Gary R Hook 提交于
      The CCP driver generally uses a round-robin approach when
      assigning operations to available CCPs. For the DMA engine,
      however, the DMA mappings of the SGs are associated with a
      specific CCP. When an IOMMU is enabled, the IOMMU is
      programmed based on this specific device.
      
      If the DMA operations are not performed by that specific
      CCP then addressing errors and I/O page faults will occur.
      
      Update the CCP driver to allow a specific CCP device to be
      requested for an operation and use this in the DMA engine
      support.
      
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Signed-off-by: NGary R Hook <gary.hook@amd.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      7c468447
    • D
      vmbus: remove hv_event_tasklet_disable/enable · dad72a1d
      Dexuan Cui 提交于
      With the recent introduction of per-channel tasklet, we need to update
      the way we handle the 3 concurrency issues:
      
      1. hv_process_channel_removal -> percpu_channel_deq vs.
         vmbus_chan_sched -> list_for_each_entry(..., percpu_list);
      
      2. vmbus_process_offer -> percpu_channel_enq/deq vs. vmbus_chan_sched.
      
      3. vmbus_close_internal vs. the per-channel tasklet vmbus_on_event;
      
      The first 2 issues can be handled by Stephen's recent patch
      "vmbus: use rcu for per-cpu channel list", and the third issue
      can be handled by calling tasklet_disable in vmbus_close_internal here.
      
      We don't need the original hv_event_tasklet_disable/enable since we
      now use per-channel tasklet instead of the previous per-CPU tasklet,
      and actually we must remove them due to the side effect now:
      vmbus_process_offer -> hv_event_tasklet_enable -> tasklet_schedule will
      start the per-channel callback prematurely, cauing NULL dereferencing
      (the channel may haven't been properly configured to run the callback yet).
      
      Fixes: 631e63a9 ("vmbus: change to per channel tasklet")
      Signed-off-by: NDexuan Cui <decui@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dad72a1d
    • S
      vmbus: use rcu for per-cpu channel list · 8200f208
      Stephen Hemminger 提交于
      The per-cpu channel list is now referred to in the interrupt
      routine. This is mostly safe since the host will not normally generate
      an interrupt when channel is being deleted but if it did then there
      would be a use after free problem.
      
      To solve, this use RCU protection on ther per-cpu list.
      
      Fixes: 631e63a9 ("vmbus: change to per channel tasklet")
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8200f208
    • E
      fscrypt: eliminate ->prepare_context() operation · 94840e3c
      Eric Biggers 提交于
      The only use of the ->prepare_context() fscrypt operation was to allow
      ext4 to evict inline data from the inode before ->set_context().
      However, there is no reason why this cannot be done as simply the first
      step in ->set_context(), and in fact it makes more sense to do it that
      way because then the policy modes and flags get validated before any
      real work is done.  Therefore, merge ext4_prepare_context() into
      ext4_set_context(), and remove ->prepare_context().
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      94840e3c
  12. 14 3月, 2017 4 次提交
  13. 13 3月, 2017 4 次提交
    • S
      netfilter: Force fake conntrack entry to be at least 8 bytes aligned · 170a1fb9
      Steven Rostedt (VMware) 提交于
      Since the nfct and nfctinfo have been combined, the nf_conn structure
      must be at least 8 bytes aligned, as the 3 LSB bits are used for the
      nfctinfo. But there's a fake nf_conn structure to denote untracked
      connections, which is created by a PER_CPU construct. This does not
      guarantee that it will be 8 bytes aligned and can break the logic in
      determining the correct nfctinfo.
      
      I triggered this on a 32bit machine with the following error:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000af4
      IP: nf_ct_deliver_cached_events+0x1b/0xfb
      *pdpt = 0000000031962001 *pde = 0000000000000000
      
      Oops: 0000 [#1] SMP
      [Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
        OK  ]
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
      Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
      task: c126ec00 task.stack: c1258000
      EIP: nf_ct_deliver_cached_events+0x1b/0xfb
      EFLAGS: 00010202 CPU: 0
      EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
      ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
       DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
      Call Trace:
       <SOFTIRQ>
       ? ipv6_skip_exthdr+0xac/0xcb
       ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
       nf_hook_slow+0x22/0xc7
       nf_hook+0x9a/0xad [ipv6]
       ? ip6t_do_table+0x356/0x379 [ip6_tables]
       ? ip6_fragment+0x9e9/0x9e9 [ipv6]
       ip6_output+0xee/0x107 [ipv6]
       ? ip6_fragment+0x9e9/0x9e9 [ipv6]
       dst_output+0x36/0x4d [ipv6]
       NF_HOOK.constprop.37+0xb2/0xba [ipv6]
       ? icmp6_dst_alloc+0x2c/0xfd [ipv6]
       ? local_bh_enable+0x14/0x14 [ipv6]
       mld_sendpack+0x1c5/0x281 [ipv6]
       ? mark_held_locks+0x40/0x5c
       mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
       call_timer_fn+0x135/0x283
       ? detach_if_pending+0x55/0x55
       ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
       __run_timers+0x111/0x14b
       ? mld_dad_timer_expire+0x3e/0x3e [ipv6]
       run_timer_softirq+0x1c/0x36
       __do_softirq+0x185/0x37c
       ? test_ti_thread_flag.constprop.19+0xd/0xd
       do_softirq_own_stack+0x22/0x28
       </SOFTIRQ>
       irq_exit+0x5a/0xa4
       smp_apic_timer_interrupt+0x2a/0x34
       apic_timer_interrupt+0x37/0x3c
      
      By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
      alignment as all cache line sizes are at least 8 bytes or more.
      
      Fixes: a9e419dc ("netfilter: merge ctinfo into nfct pointer storage area")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      170a1fb9
    • L
      netfilter: nf_tables: fix mismatch in big-endian system · 10596608
      Liping Zhang 提交于
      Currently, there are two different methods to store an u16 integer to
      the u32 data register. For example:
        u32 *dest = &regs->data[priv->dreg];
        1. *dest = 0; *(u16 *) dest = val_u16;
        2. *dest = val_u16;
      
      For method 1, the u16 value will be stored like this, either in
      big-endian or little-endian system:
        0          15           31
        +-+-+-+-+-+-+-+-+-+-+-+-+
        |   Value   |     0     |
        +-+-+-+-+-+-+-+-+-+-+-+-+
      
      For method 2, in little-endian system, the u16 value will be the same
      as listed above. But in big-endian system, the u16 value will be stored
      like this:
        0          15           31
        +-+-+-+-+-+-+-+-+-+-+-+-+
        |     0     |   Value   |
        +-+-+-+-+-+-+-+-+-+-+-+-+
      
      So later we use "memcmp(&regs->data[priv->sreg], data, 2);" to do
      compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong
      result in big-endian system, as 0~15 bits will always be zero.
      
      For the similar reason, when loading an u16 value from the u32 data
      register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
      the 2nd method will get the wrong value in the big-endian system.
      
      So introduce some wrapper functions to store/load an u8 or u16
      integer to/from the u32 data register, and use them in the right
      place.
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      10596608
    • D
      uapi: fix drm/omap_drm.h userspace compilation errors · 337ba7fb
      Dmitry V. Levin 提交于
      Consistently use types from linux/types.h like in other uapi drm/*_drm.h
      header files to fix the following drm/omap_drm.h userspace compilation
      errors:
      
      /usr/include/drm/omap_drm.h:36:2: error: unknown type name 'uint64_t'
        uint64_t param;   /* in */
      /usr/include/drm/omap_drm.h:37:2: error: unknown type name 'uint64_t'
        uint64_t value;   /* in (set_param), out (get_param) */
      /usr/include/drm/omap_drm.h:56:2: error: unknown type name 'uint32_t'
        uint32_t bytes;  /* (for non-tiled formats) */
      /usr/include/drm/omap_drm.h:58:3: error: unknown type name 'uint16_t'
         uint16_t width;
      /usr/include/drm/omap_drm.h:59:3: error: unknown type name 'uint16_t'
         uint16_t height;
      /usr/include/drm/omap_drm.h:65:2: error: unknown type name 'uint32_t'
        uint32_t flags;   /* in */
      /usr/include/drm/omap_drm.h:66:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* out */
      /usr/include/drm/omap_drm.h:67:2: error: unknown type name 'uint32_t'
        uint32_t __pad;
      /usr/include/drm/omap_drm.h:77:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* buffer handle (in) */
      /usr/include/drm/omap_drm.h:78:2: error: unknown type name 'uint32_t'
        uint32_t op;   /* mask of omap_gem_op (in) */
      /usr/include/drm/omap_drm.h:82:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* buffer handle (in) */
      /usr/include/drm/omap_drm.h:83:2: error: unknown type name 'uint32_t'
        uint32_t op;   /* mask of omap_gem_op (in) */
      /usr/include/drm/omap_drm.h:88:2: error: unknown type name 'uint32_t'
        uint32_t nregions;
      /usr/include/drm/omap_drm.h:89:2: error: unknown type name 'uint32_t'
        uint32_t __pad;
      /usr/include/drm/omap_drm.h:93:2: error: unknown type name 'uint32_t'
        uint32_t handle;  /* buffer handle (in) */
      /usr/include/drm/omap_drm.h:94:2: error: unknown type name 'uint32_t'
        uint32_t pad;
      /usr/include/drm/omap_drm.h:95:2: error: unknown type name 'uint64_t'
        uint64_t offset;  /* mmap offset (out) */
      /usr/include/drm/omap_drm.h:102:2: error: unknown type name 'uint32_t'
        uint32_t size;   /* virtual size for mmap'ing (out) */
      /usr/include/drm/omap_drm.h:103:2: error: unknown type name 'uint32_t'
        uint32_t __pad;
      
      Fixes: ef6503e8 ("drm: Kbuild: add omap_drm.h to the installed headers")
      Signed-off-by: NDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: NTomi Valkeinen <tomi.valkeinen@ti.com>
      337ba7fb
    • D
      bpf: improve read-only handling · 65869a47
      Daniel Borkmann 提交于
      Improve bpf_{prog,jit_binary}_{un,}lock_ro() by throwing a
      one-time warning in case of an error when the image couldn't
      be set read-only, and also mark struct bpf_prog as locked when
      bpf_prog_lock_ro() was called.
      
      Reason for the latter is that bpf_prog_unlock_ro() is called from
      various places including error paths, and we shouldn't mess with
      page attributes when really not needed.
      
      For bpf_jit_binary_unlock_ro() this is not needed as jited flag
      implicitly indicates this, thus for archs with ARCH_HAS_SET_MEMORY
      we're guaranteed to have a previously locked image. Overall, this
      should also help us to identify any further potential issues with
      set_memory_*() helpers.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65869a47
  14. 11 3月, 2017 3 次提交
  15. 10 3月, 2017 3 次提交
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
    • A
      userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED · 70ccb92f
      Andrea Arcangeli 提交于
      userfaultfd_remove() has to be execute before zapping the pagetables or
      UFFDIO_COPY could keep filling pages after zap_page_range returned,
      which would result in non zero data after a MADV_DONTNEED.
      
      However userfaultfd_remove() may have to release the mmap_sem.  This was
      handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
      potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
      
      The fix consists in revalidating the vma in case userfaultfd_remove()
      had to release the mmap_sem.
      
      This also optimizes away an unnecessary down_read/up_read in the
      MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
      
      It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
      userfaultfd_remove() will be defined as "true" at build time.
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ccb92f
    • Y
      mm/vmstats: add thp_split_pud event for clarity · ce9311cf
      Yisheng Xie 提交于
      We added support for PUD-sized transparent hugepages, however we count
      the event "thp split pud" into thp_split_pmd event.
      
      To separate the event count of thp split pud from pmd, add a new event
      named thp_split_pud.
      
      Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce9311cf