1. 21 April 2018 (3 commits)
    • kasan: add no_sanitize attribute for clang builds · 12c8f25a
      Andrey Konovalov authored
      KASAN uses the __no_sanitize_address macro to disable instrumentation of
      particular functions.  Right now it's defined only for GCC builds, which
      causes false positives when clang is used.

      This patch adds a definition for clang.

      Note that clang revision 329612 or newer is required.
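
      A minimal sketch of the kind of definition involved (illustrative, not
      necessarily the verbatim diff):

        /* include/linux/compiler-clang.h (sketch) */
        #define __no_sanitize_address __attribute__((no_sanitize("address")))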
      
      [andreyknvl@google.com: remove redundant #ifdef CONFIG_KASAN check]
        Link: http://lkml.kernel.org/r/c79aa31a2a2790f6131ed607c58b0dd45dd62a6c.1523967959.git.andreyknvl@google.com
      Link: http://lkml.kernel.org/r/4ad725cc903f8534f8c8a60f0daade5e3d674f8d.1523554166.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Paul Lawrence <paullawrence@google.com>
      Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: safer lock nesting · 2e898e4c
      Greg Thelen authored
      lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
      the page's memcg is undergoing move accounting, which occurs when a
      process leaves its memcg for a new one that has
      memory.move_charge_at_immigrate set.
      
      unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
      the given inode is switching writeback domains.  Switches occur when
      enough writes are issued from a new domain.
      
      This existing pattern is thus suspicious:
          lock_page_memcg(page);
          unlocked_inode_to_wb_begin(inode, &locked);
          ...
          unlocked_inode_to_wb_end(inode, locked);
          unlock_page_memcg(page);
      
      If an inode switch and a process memcg move are both in flight, then
      unlocked_inode_to_wb_end() will unconditionally enable interrupts while
      still holding the lock_page_memcg() irq spinlock.  This suggests the
      possibility of deadlock if an interrupt occurs before unlock_page_memcg().
      
          truncate
          __cancel_dirty_page
          lock_page_memcg
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end
          <interrupts mistakenly enabled>
                                          <interrupt>
                                          end_page_writeback
                                          test_clear_page_writeback
                                          lock_page_memcg
                                          <deadlock>
          unlock_page_memcg
      
      Due to configuration limitations this deadlock is not currently possible
      because we don't mix cgroup writeback (a cgroupv2 feature) and
      memory.move_charge_at_immigrate (a cgroupv1 feature).
      
      If the kernel is hacked to always claim inode switching and memcg
      moving_account, then this script triggers lockup in less than a minute:
      
        cd /mnt/cgroup/memory
        mkdir a b
        echo 1 > a/memory.move_charge_at_immigrate
        echo 1 > b/memory.move_charge_at_immigrate
        (
          echo $BASHPID > a/cgroup.procs
          while true; do
            dd if=/dev/zero of=/mnt/big bs=1M count=256
          done
        ) &
        while true; do
          sync
        done &
        sleep 1h &
        SLEEP=$!
        while true; do
          echo $SLEEP > a/cgroup.procs
          echo $SLEEP > b/cgroup.procs
        done
      
      The deadlock does not seem possible, so it's debatable whether there's
      any reason to modify the kernel.  I suggest we do, to prevent future
      surprises.  And Wang Long said "this deadlock occurs three times in our
      environment", so there's more reason to apply this, even to stable.
      Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
      see "[PATCH for-4.4] writeback: safer lock nesting"
      https://lkml.org/lkml/2018/4/11/146
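
      For reference, a condensed sketch of the fix: the bare bool handed to
      unlocked_inode_to_wb_begin,end() becomes a cookie that also records the
      saved IRQ flags, so the end() side restores the previous IRQ state
      instead of unconditionally enabling interrupts:

        struct wb_lock_cookie {
                bool locked;
                unsigned long flags;
        };

        static inline void unlocked_inode_to_wb_begin(struct inode *inode,
                                                struct wb_lock_cookie *cookie)
        {
                rcu_read_lock();
                /* Paired with the store in the writeback switching path. */
                cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
                if (unlikely(cookie->locked))
                        spin_lock_irqsave(&inode->i_mapping->tree_lock,
                                          cookie->flags);
        }

        static inline void unlocked_inode_to_wb_end(struct inode *inode,
                                              struct wb_lock_cookie *cookie)
        {
                if (unlikely(cookie->locked))
                        spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
                                               cookie->flags);
                rcu_read_unlock();
        }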
      
      [gthelen@google.com: v4]
        Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
      [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
      Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
      Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>	[v4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fork: unconditionally clear stack on fork · e01e8063
      Kees Cook authored
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
      
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
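
      The mechanism, condensed (a sketch of the shape of the change, not the
      full diff): new thread stacks are allocated pre-zeroed, and stacks
      recycled from the vmap cache are scrubbed before reuse.

        /* include/linux/thread_info.h: fresh stacks come back zeroed */
        #define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)

        /* kernel/fork.c, cached vmap-stack reuse path (condensed): */
        /* Clear stale pointers from reused stack. */
        memset(s->addr, 0, THREAD_SIZE);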
      
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak
      
      I did some more with perf and cycle counts on running 100,000 execs of
      /bin/true.
      
      before:
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      
      after:
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 18 April 2018 (1 commit)
    • vlan: Fix reading memory beyond skb->tail in skb_vlan_tagged_multi · 7ce23672
      Toshiaki Makita authored
      Syzkaller spotted an old bug which leads to reading the skb beyond its
      tail by 4 bytes on vlan-tagged packets.  This happens because
      skb_vlan_tagged_multi() did not check skb_headlen.
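
      The fixed helper, lightly condensed; the key addition is the
      pskb_may_pull() check so the VLAN header is never read beyond the
      linear data:

        static inline bool skb_vlan_tagged_multi(struct sk_buff *skb)
        {
                __be16 protocol = skb->protocol;

                if (!skb_vlan_tag_present(skb)) {
                        struct vlan_ethhdr *veh;

                        if (likely(!eth_type_vlan(protocol)))
                                return false;

                        /* The fix: never dereference beyond skb_headlen(). */
                        if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN)))
                                return false;

                        veh = (struct vlan_ethhdr *)skb->data;
                        protocol = veh->h_vlan_encapsulated_proto;
                }

                return eth_type_vlan(protocol);
        }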
      
      BUG: KMSAN: uninit-value in eth_type_vlan include/linux/if_vlan.h:283 [inline]
      BUG: KMSAN: uninit-value in skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
      BUG: KMSAN: uninit-value in vlan_features_check include/linux/if_vlan.h:672 [inline]
      BUG: KMSAN: uninit-value in dflt_features_check net/core/dev.c:2949 [inline]
      BUG: KMSAN: uninit-value in netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
      CPU: 1 PID: 3582 Comm: syzkaller435149 Not tainted 4.16.0+ #82
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:17 [inline]
        dump_stack+0x185/0x1d0 lib/dump_stack.c:53
        kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
        __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
        eth_type_vlan include/linux/if_vlan.h:283 [inline]
        skb_vlan_tagged_multi include/linux/if_vlan.h:656 [inline]
        vlan_features_check include/linux/if_vlan.h:672 [inline]
        dflt_features_check net/core/dev.c:2949 [inline]
        netif_skb_features+0xd1b/0xdc0 net/core/dev.c:3009
        validate_xmit_skb+0x89/0x1320 net/core/dev.c:3084
        __dev_queue_xmit+0x1cb2/0x2b60 net/core/dev.c:3549
        dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
        packet_snd net/packet/af_packet.c:2944 [inline]
        packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
        sock_sendmsg_nosec net/socket.c:630 [inline]
        sock_sendmsg net/socket.c:640 [inline]
        sock_write_iter+0x3b9/0x470 net/socket.c:909
        do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
        do_iter_write+0x30d/0xd40 fs/read_write.c:932
        vfs_writev fs/read_write.c:977 [inline]
        do_writev+0x3c9/0x830 fs/read_write.c:1012
        SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
        SyS_writev+0x56/0x80 fs/read_write.c:1082
        do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x43ffa9
      RSP: 002b:00007fff2cff3948 EFLAGS: 00000217 ORIG_RAX: 0000000000000014
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043ffa9
      RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000003
      RBP: 00000000006cb018 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004018d0
      R13: 0000000000401960 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
        kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
        kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
        kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
        slab_post_alloc_hook mm/slab.h:445 [inline]
        slab_alloc_node mm/slub.c:2737 [inline]
        __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
        __kmalloc_reserve net/core/skbuff.c:138 [inline]
        __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
        alloc_skb include/linux/skbuff.h:984 [inline]
        alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
        sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
        packet_alloc_skb net/packet/af_packet.c:2803 [inline]
        packet_snd net/packet/af_packet.c:2894 [inline]
        packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
        sock_sendmsg_nosec net/socket.c:630 [inline]
        sock_sendmsg net/socket.c:640 [inline]
        sock_write_iter+0x3b9/0x470 net/socket.c:909
        do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
        do_iter_write+0x30d/0xd40 fs/read_write.c:932
        vfs_writev fs/read_write.c:977 [inline]
        do_writev+0x3c9/0x830 fs/read_write.c:1012
        SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
        SyS_writev+0x56/0x80 fs/read_write.c:1082
        do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
        entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 58e998c6 ("offloading: Force software GSO for multiple vlan tags.")
      Reported-and-tested-by: syzbot+0bbe42c764feafa82c5a@syzkaller.appspotmail.com
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 April 2018 (4 commits)
    • xen/sndif: Sync up with the canonical definition in Xen · cd6e992b
      Oleksandr Andrushchenko authored
      This is the sync up with the canonical definition of the sound
      protocol in Xen:
      
      1. The protocol version was referenced in the protocol description,
         but its definition was missing.  Fixed by adding a constant
         for the current protocol version.

      2. Some of the request descriptions had "reserved" fields
         missing: fixed by adding the corresponding entries.
      
      3. Extend the size of the requests and responses to 64 octets.
         Bump protocol version to 2.
      
      4. Add explicit back and front synchronization
         In order to provide explicit synchronization between backend and
         frontend the following changes are introduced in the protocol:
          - add new ring buffer for sending asynchronous events from
            backend to frontend to report number of bytes played by the
            frontend (XENSND_EVT_CUR_POS)
          - introduce trigger events for playback control: start/stop/pause/resume
          - add "req-" prefix to event-channel and ring-ref to unify naming
            of the Xen event channels for requests and events
      
      5. Add explicit back and front parameter negotiation
         In order to provide explicit stream parameter negotiation between
         backend and frontend the following changes are introduced in the protocol:
         add XENSND_OP_HW_PARAM_QUERY request to read/update
         configuration space for the parameters given: request passes
         desired parameter's intervals/masks and the response to this request
         returns allowed min/max intervals/masks to be used.
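
      An illustrative sketch in the style of xen/interface/io/sndif.h (not
      the verbatim header; consult the canonical file for real values and
      layouts):

        /* Protocol version constant, bumped to 2 by this sync-up. */
        #define XENSND_PROTOCOL_VERSION 2

        /* Requests are now a fixed 64 octets: an 8-octet header plus a
         * union padded to the full size. */
        struct xensnd_req {
                uint16_t id;
                uint8_t operation;
                uint8_t reserved[5];
                union {
                        struct xensnd_open_req open;
                        struct xensnd_rw_req rw;
                        uint8_t reserved[56];
                } op;
        };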
      Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
      Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
    • livepatch: Allow to call a custom callback when freeing shadow variables · 3b2c77d0
      Petr Mladek authored
      We might need to do some actions before the shadow variable is freed.
      For example, we might need to remove it from a list or free some data
      that it points to.
      
      This is already possible now. The user can get the shadow variable
      by klp_shadow_get(), do the necessary actions, and then call
      klp_shadow_free().
      
      This patch allows doing it in a more elegant way.  The user can implement
      the needed actions in a callback that is passed to klp_shadow_free()
      as a parameter.  The callback usually performs the reverse of
      the constructor callback that can be called by klp_shadow_*alloc().
      
      It is especially useful for klp_shadow_free_all(). There we need to do
      these extra actions for each found shadow variable with the given ID.
      
      Note that the memory used by the shadow variable itself is still released
      later by an RCU callback.  This is needed to protect the internal
      structures that keep all shadow variables.  But the destructor is called
      immediately.  The shadow variable must not be accessed after
      klp_shadow_free() is called anyway.  The user is responsible for ensuring
      this in any suitable way.

      Be aware that the destructor is called under klp_shadow_lock.  The same
      holds for the constructor in klp_shadow_alloc().
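
      The resulting API shape (a sketch based on the description above, not
      necessarily the verbatim header):

        typedef void (*klp_shadow_dtor_t)(void *obj, void *shadow_data);

        /* The destructor runs immediately, under klp_shadow_lock; the
         * shadow variable's own memory is freed later via RCU. */
        void klp_shadow_free(void *obj, unsigned long id,
                             klp_shadow_dtor_t dtor);
        void klp_shadow_free_all(unsigned long id, klp_shadow_dtor_t dtor);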
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    • livepatch: Initialize shadow variables safely by a custom callback · e91c2518
      Petr Mladek authored
      The existing API allows passing sample data used to initialize the
      shadow data.  It works well when the data are position-independent.
      But it fails miserably when we need to set a pointer to the shadow
      structure itself.
      
      Unfortunately, we might need to initialize the pointer surprisingly
      often because of struct list_head. It is even worse because the list
      might be hidden in other common structures, for example, struct mutex,
      struct wait_queue_head.
      
      For example, this was needed to fix races in the ALSA sequencer.  It
      required adding a mutex to struct snd_seq_client.  See commit b3defb79
      ("ALSA: seq: Make ioctls race-free") and commit d15d662e
      ("ALSA: seq: Fix racy pool initializations").
      
      This patch makes the API safer.  A custom constructor function and data
      are passed to the klp_shadow_*alloc() functions instead of the sample data.

      Note that ctor_data is no longer a template for shadow->data.  It might
      point to any data that might be necessary when the constructor is called.
      
      Also note that the constructor is called under klp_shadow_lock.  It is
      an internal spinlock that synchronizes alloc() vs. get() operations,
      see klp_shadow_get_or_alloc().  On one hand, this adds a risk of ABBA
      deadlocks.  On the other hand, it allows some operations to be done
      safely.  For example, we could add the new structure to an existing
      list; this must be done only once, when the structure is allocated.
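
      The resulting allocation API (a sketch based on the description above):

        typedef int (*klp_shadow_ctor_t)(void *obj, void *shadow_data,
                                         void *ctor_data);

        /* ctor runs under klp_shadow_lock; ctor_data is opaque input for
         * the constructor, not a template copied into shadow->data. */
        void *klp_shadow_alloc(void *obj, unsigned long id, size_t size,
                               gfp_t gfp_flags, klp_shadow_ctor_t ctor,
                               void *ctor_data);
        void *klp_shadow_get_or_alloc(void *obj, unsigned long id, size_t size,
                                      gfp_t gfp_flags, klp_shadow_ctor_t ctor,
                                      void *ctor_data);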
      Reported-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
    • textsearch: fix kernel-doc warnings and add kernel-api section · 5968a70d
      Randy Dunlap authored
      Make lib/textsearch.c usable as kernel-doc.
      Add textsearch() function family to kernel-api documentation.
      Fix kernel-doc warnings in <linux/textsearch.h>:
        ../include/linux/textsearch.h:65: warning: Incorrect use of kernel-doc format:
      	* get_next_block - fetch next block of data
        ../include/linux/textsearch.h:82: warning: Incorrect use of kernel-doc format:
      	* finish - finalize/clean a series of get_next_block() calls
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 16 April 2018 (1 commit)
  5. 14 April 2018 (8 commits)
  6. 13 April 2018 (4 commits)
  7. 12 April 2018 (19 commits)
    • page cache: use xa_lock · b93b0163
      Matthew Wilcox authored
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
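
      A representative call-site conversion (illustrative sketch, not a
      verbatim hunk from the patch):

        /* before */
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_delete(&mapping->page_tree, page->index);
        spin_unlock_irq(&mapping->tree_lock);

        /* after */
        xa_lock_irq(&mapping->i_pages);
        radix_tree_delete(&mapping->i_pages, page->index);
        xa_unlock_irq(&mapping->i_pages);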
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
      Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • xarray: add the xa_lock to the radix_tree_root · f6bb2a2c
      Matthew Wilcox authored
      This results in no change in structure size on 64-bit machines as it
      fits in the padding between the gfp_t and the void *.  32-bit machines
      will grow the structure from 8 to 12 bytes.  Almost all radix trees are
      protected with (at least) a spinlock, so as they are converted from
      radix trees to xarrays, the data structures will shrink again.
      
      Initialising the spinlock requires a name for the benefit of lockdep, so
      RADIX_TREE_INIT() now needs to know the name of the radix tree it's
      initialising, and so do IDR_INIT() and IDA_INIT().
      
      Also add the xa_lock() and xa_unlock() family of wrappers to make it
      easier to use the lock.  If we could rely on -fplan9-extensions in the
      compiler, we could avoid all of this syntactic sugar, but that wasn't
      added until gcc 4.6.
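
      A condensed sketch of the structural change and the wrappers:

        struct radix_tree_root {
                spinlock_t              xa_lock;
                gfp_t                   gfp_mask;
                struct radix_tree_node  __rcu *rnode;
        };

        /* Wrappers so callers need not spell out the field: */
        #define xa_lock(xa)        spin_lock(&(xa)->xa_lock)
        #define xa_unlock(xa)      spin_unlock(&(xa)->xa_lock)
        #define xa_lock_irq(xa)    spin_lock_irq(&(xa)->xa_lock)
        #define xa_unlock_irq(xa)  spin_unlock_irq(&(xa)->xa_lock)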
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • export __set_page_dirty · f82b3764
      Matthew Wilcox authored
      XFS currently contains a copy-and-paste of __set_page_dirty().  Export
      it from buffer.c instead.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix tree: use GFP_ZONEMASK bits of gfp_t for flags · fa290cda
      Matthew Wilcox authored
      Patch series "XArray", v9.  (First part thereof).
      
      This patchset is, I believe, appropriate for merging for 4.17.  It
      contains the XArray implementation, to eventually replace the radix
      tree, and converts the page cache to use it.
      
      This conversion keeps the radix tree and XArray data structures in sync
      at all times.  That allows us to convert the page cache one function at
      a time and should allow for easier bisection.  Other than renaming some
      elements of the structures, the data structures are fundamentally
      unchanged; a radix tree walk and an XArray walk will touch the same
      number of cachelines.  I have changes planned to the XArray data
      structure, but those will happen in future patches.
      
      Improvements the XArray has over the radix tree:
      
       - The radix tree provides operations like other trees do; 'insert' and
         'delete'. But what most users really want is an automatically
         resizing array, and so it makes more sense to give users an API that
         is like an array -- 'load' and 'store'. We still have an 'insert'
         operation for users that really want that semantic.
      
       - The XArray considers locking as part of its API. This simplifies a
         lot of users who formerly had to manage their own locking just for
         the radix tree. It also improves code generation as we can now tell
         RCU that we're holding a lock and it doesn't need to generate as much
         fencing code. The other advantage is that tree nodes can be moved
         (not yet implemented).
      
       - GFP flags are now parameters to calls which may need to allocate
         memory. The radix tree forced users to decide what the allocation
         flags would be at creation time. It's much clearer to specify them at
         allocation time.
      
       - Memory is not preloaded; we don't tie up dozens of pages on the off
         chance that the slab allocator fails. Instead, we drop the lock,
         allocate a new node and retry the operation. We have to convert all
         the radix tree, IDA and IDR preload users before we can realise this
         benefit, but I have not yet found a user which cannot be converted.
      
       - The XArray provides a cmpxchg operation. The radix tree forces users
         to roll their own (and at least four have).
      
       - Iterators take a 'max' parameter. That simplifies many users and will
         reduce the amount of iteration done.
      
       - Iteration can proceed backwards. We only have one user for this, but
         since it's called as part of the pagefault readahead algorithm, that
         seemed worth mentioning.
      
       - RCU-protected pointers are not exposed as part of the API. There are
         some fun bugs where the page cache forgets to use rcu_dereference()
         in the current codebase.
      
       - Value entries gain an extra bit compared to radix tree exceptional
         entries. That gives us the extra bit we need to put huge page swap
         entries in the page cache.
      
       - Some iterators now take a 'filter' argument instead of having
         separate iterators for tagged/untagged iterations.
      
      The page cache is improved by this:
      
       - Shorter, easier to read code
      
       - More efficient iterations
      
       - Reduction in size of struct address_space
      
       - Fewer walks from the top of the data structure; the XArray API
         encourages staying at the leaf node and conducting operations there.
      
      This patch (of 8):
      
      None of these bits may be used for slab allocations, so we can use them
      as radix tree flags as long as we mask them off before passing them to
      the slab allocator. Move the IDR flag from the high bits to the
      GFP_ZONEMASK bits.
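
      The trick, sketched (the flag name and allocation snippet are
      illustrative):

        /* Zone bits are never valid for slab allocations, so the tree can
         * borrow them as internal flag bits... */
        #define ROOT_IS_IDR     ((__force gfp_t)4)

        /* ...provided they are masked off before reaching the slab: */
        gfp_t alloc_gfp = root->gfp_mask & ~GFP_ZONEMASK;
        node = kmem_cache_alloc(radix_tree_node_cachep, alloc_gfp);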
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • linux/const.h: refactor _BITUL and _BITULL a bit · 21e7bc60
      Masahiro Yamada authored
      Minor cleanups made possible by _UL() and _ULL().
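
      The resulting definitions (a sketch matching the described change):

        #define _BITUL(x)   (_UL(1) << (x))
        #define _BITULL(x)  (_ULL(1) << (x))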
      
      Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • linux/const.h: move UL() macro to include/linux/const.h · 2dd8a62c
      Masahiro Yamada authored
      ARM, ARM64 and UniCore32 duplicate the definition of UL():
      
        #define UL(x) _AC(x, UL)
      
      This is not actually arch-specific, so it will be useful to move it to a
      common header.  Currently, we only have the uapi variant for
      linux/const.h, so I am creating include/linux/const.h.
      
      I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
      the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
      replaced in follow-up cleanups.  The underscore-prefixed ones should
      be used for exported headers.
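
      The resulting definitions (sketch; the underscore-prefixed variants
      live in the uapi header, the plain ones in the new kernel-internal
      header):

        /* include/uapi/linux/const.h */
        #define _UL(x)   (_AC(x, UL))
        #define _ULL(x)  (_AC(x, ULL))

        /* include/linux/const.h */
        #define UL(x)    (_UL(x))
        #define ULL(x)   (_ULL(x))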
      
      Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI · 2a6cc8a6
      Masahiro Yamada authored
      Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
      BIT() etc", v3.
      
      ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
      architectures may introduce it in the future.
      
      UL() is arch-agnostic, and useful. So let's move it to
      include/linux/const.h
      
      Currently, <asm/memory.h> must be included to use UL().  That pulls in
      a lot of unrelated bloat just to define some bit macros.
      
      I posted V2 one year ago.
      
      The previous posts are:
      https://patchwork.kernel.org/patch/9498273/
      https://patchwork.kernel.org/patch/9498275/
      https://patchwork.kernel.org/patch/9498269/
      https://patchwork.kernel.org/patch/9498271/
      
      At that time, what blocked this series was a comment from
      David Howells:
        You need to be very careful doing this.  Some userspace stuff
        depends on the guard macro names on the kernel header files.
      
      (https://patchwork.kernel.org/patch/9498275/)
      
      Looking at the code closer, I noticed this is not a problem.
      
      See the following line.
      https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40
      
      scripts/headers_install.sh rips off _UAPI prefix from guard macro names.
      
      I ran "make headers_install" and confirmed the result is what I expect.
      
      So, we can prefix the include guard of include/uapi/linux/const.h,
      and add a new include/linux/const.h.
      
      This patch (of 4):
      
      I am going to add include/linux/const.h for the kernel space.
      
      Add _UAPI to the include guard of include/uapi/linux/const.h to
      prepare for that.
      
      Please notice that the guard name of the exported header will be kept
      as-is.  So this commit has no impact on userspace, even if some
      userspace code depends on the guard macro names.
      
      scripts/headers_install.sh processes exported headers by SED, and
      rips off "_UAPI" from guard macro names.
      
        #ifndef _UAPI_LINUX_CONST_H
        #define _UAPI_LINUX_CONST_H
      
      will be turned into
      
        #ifndef _LINUX_CONST_H
        #define _LINUX_CONST_H
      
      Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs, elf: drop MAP_FIXED usage from elf_map · 4ed28639
      Michal Hocko authored
      Both load_elf_interp and load_elf_binary rely on elf_map to map segments
      at a controlled address, and they use MAP_FIXED to enforce that.  This
      is, however, dangerous: it is prone to silent data corruption, which
      can even be exploitable.
      
      Let's take CVE-2017-1000253 as an example.  At the time (before commit
      eab09532: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
      ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
      the stack top on 32b (legacy) memory layout (only 1GB away).  Therefore
      we could end up mapping over the existing stack with some luck.
      
      The issue has been fixed since then (a87938b2: "fs/binfmt_elf.c: fix
      bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved much
      further from the stack (eab09532 and later by c715b72c: "mm:
      revert x86_64 and arm64 ELF_ET_DYN_BASE base changes"), and excessive
      stack consumption early during execve was fully stopped by da029c11
      ("exec: Limit arg stack to at most 75% of _STK_LIM").  So we should be
      safe, and any attack should be impractical.  On the other hand, this is
      just too subtle an assumption, so it can break quite easily and be hard
      to spot.
      
      I believe that the MAP_FIXED usage in load_elf_binary (et al.) is still
      fundamentally dangerous.  Moreover, it shouldn't even be needed.  We are
      at the early process stage, so there shouldn't be any unrelated mappings
      (except for the stack and the loader), and mmap for a given address
      should succeed even without MAP_FIXED.  Something is terribly wrong if
      this is not the case, and we should rather fail than silently corrupt the
      underlying mapping.
      
      Address this issue by changing MAP_FIXED to the newly added
      MAP_FIXED_NOREPLACE.  This will mean that mmap will fail if there is an
      existing mapping clashing with the requested one without clobbering it.
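
      A condensed sketch of the change inside elf_map():

        /* before: silently unmaps anything already in [addr, addr + size) */
        map_addr = vm_mmap(filep, addr, size, prot,
                           type | MAP_FIXED, off);

        /* after: fails with -EEXIST instead of clobbering the mapping */
        map_addr = vm_mmap(filep, addr, size, prot,
                           type | MAP_FIXED_NOREPLACE, off);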
      
      [mhocko@suse.com: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      [avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
        Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
      Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce MAP_FIXED_NOREPLACE · a4ff8e86
      Michal Hocko authored
      Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
      
      This has started as a follow up discussion [3][4] resulting in the
      runtime failure caused by hardening patch [5] which removes MAP_FIXED
      from the elf loader because MAP_FIXED is inherently dangerous as it
      might silently clobber an existing underlying mapping (e.g.  stack).
      The reason for the failure is that some architectures enforce an
      alignment for the given address hint without MAP_FIXED used (e.g.  for
      shared or file backed mappings).
      
      One way around this would be excluding those archs which do alignment
      tricks from the hardening [6].  The patch is really trivial but it has
      been objected, rightfully so, that this screams for a more generic
      solution.  We basically want a non-destructive MAP_FIXED.
      
      The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
      address but unlike MAP_FIXED it fails with EEXIST if the given range
      conflicts with an existing one.  The flag is introduced as a completely
      new one rather than a MAP_FIXED extension because of the backward
      compatibility.  We really want a never-clobber semantic even on older
      kernels which do not recognize the flag.  Unfortunately mmap sucks
      wrt flags evaluation because we do not EINVAL on unknown flags.  On
      those kernels we would simply use the traditional hint based semantic so
      the caller can still get a different address (which sucks) but at least
      not silently corrupt an existing mapping.  I do not see a good way
      around that, except by not exposing the new semantic to userspace
      at all.
      
      It seems there are users who would like to have something like that.
      Jemalloc has been mentioned by Michael Ellerman [7]
      
      Florian Weimer has mentioned the following:
      : glibc ld.so currently maps DSOs without hints.  This means that the kernel
      : will map them right next to each other, and the offsets between them are
      : completely predictable.  We would like to change that and supply a random
      : address in a window of the address space.  If there is a conflict, we do
      : not want the kernel to pick a non-random address.  Instead, we would try
      : again with a random address.
      
      John Hubbard has mentioned a CUDA example:
      : a) Searches /proc/<pid>/maps for a "suitable" region of available
      : VA space.  "Suitable" generally means it has to have a base address
      : within a certain limited range (a particular device model might
      : have odd limitations, for example), it has to be large enough, and
      : alignment has to be large enough (again, various devices may have
      : constraints that lead us to do this).
      :
      : This is of course subject to races with other threads in the process.
      :
      : Let's say it finds a region starting at va.
      :
      : b) Next it does:
      :     p = mmap(va, ...)
      :
      : *without* setting MAP_FIXED, of course (so va is just a hint), to
      : attempt to safely reserve that region. If p != va, then in most cases,
      : this is a failure (almost certainly due to another thread getting a
      : mapping from that region before we did), and so this layer now has to
      : call munmap(), before returning a "failure: retry" to upper layers.
      :
      :     IMPROVEMENT: --> if instead, we could call this:
      :
      :             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
      :
      :         , then we could skip the munmap() call upon failure. This
      :         is a small thing, but it is useful here. (Thanks to Piotr
      :         Jaroszynski and Mark Hairgrove for helping me get that detail
      :         exactly right, btw.)
      :
      : c) After that, CUDA suballocates from p, via:
      :
      :      q = mmap(sub_region_start, ... MAP_FIXED ...)
      :
      : Interestingly enough, "freeing" is also done via MAP_FIXED, and
      : setting PROT_NONE to the subregion. Anyway, I just included (c) for
      : general interest.
      
      Atomic address range probing in multithreaded programs in general
      sounds like an interesting thing to me.
      
      The second patch simply replaces the MAP_FIXED use in the elf loader by
      MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
      should follow.  Actually, real MAP_FIXED usages should be documented
      properly, and they should be more of an exception.
      
      [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
      [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
      [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
      [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
      [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
      [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
      
      This patch (of 2):
      
      MAP_FIXED is used quite often to enforce mapping at the particular range.
      The main problem of this flag is, however, that it is inherently dangerous
      because it unmaps existing mappings covered by the requested range.  This
      can cause silent memory corruptions.  Some of them even with serious
      security implications.  While the current semantic might be really
      desirable in many cases, there are others which would want to enforce the
      given range but would rather see a failure than a silent memory corruption
      on a clashing range.  Please note that there is no guarantee that a given
      range is obeyed by mmap even when it is free - e.g.  arch-specific code is
      allowed to apply an alignment.
      
      Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
      behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
      request with a single exception that it fails with EEXIST if the requested
      address is already covered by an existing mapping.  We still rely on
      get_unmapped_area to handle all the arch-specific MAP_FIXED treatment and
      check for a conflicting vma after it returns.
      
      The flag is introduced as a completely new one rather than a MAP_FIXED
      extension because of the backward compatibility.  We really want a
      never-clobber semantic even on older kernels which do not recognize the
      flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
      EINVAL on unknown flags.  On those kernels we would simply use the
      traditional hint based semantic so the caller can still get a different
      address (which sucks) but at least not silently corrupt an existing
      mapping.  I do not see a good way around that.
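
      From userspace the new flag behaves as below (a sketch; it needs
      headers that define MAP_FIXED_NOREPLACE, and the explicit p != hint
      check covers old kernels that silently fall back to hint semantics):

        #include <errno.h>
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                void *hint = (void *)0x70000000;    /* address illustrative */
                void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                               -1, 0);

                if (p == MAP_FAILED) {
                        if (errno == EEXIST)
                                puts("range already occupied, retry elsewhere");
                        return 1;
                }
                if (p != hint) {        /* old kernel ignored the flag */
                        munmap(p, 4096);
                        return 1;
                }
                return 0;
        }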
      
      [mpe@ellerman.id.au: fix whitespace]
      [fail on clashing range with EEXIST as per Florian Weimer]
      [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
      Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Evans <jasone@google.com>
      Cc: David Goldblatt <davidtgoldblatt@gmail.com>
      Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/kfifo.h: fix comment · de99626c
      Valentin Vidic authored
      Clean up unusual formatting in the note about locking.
      
      Link: http://lkml.kernel.org/r/20180324002630.13046-1-Valentin.Vidic@CARNet.hr
      Signed-off-by: Valentin Vidic <Valentin.Vidic@CARNet.hr>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Sean Young <sean@mess.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/msg: introduce msgctl(MSG_STAT_ANY) · 23c8cec8
      Davidlohr Bueso authored
      There is a permission discrepancy when consulting msq ipc object
      metadata between /proc/sysvipc/msg (0444) and the MSG_STAT msgctl
      command.  The latter does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCES is returned via syscall but the
      info is displayed anyway in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the msq metadata), this behavior goes way back, and showing
      all the objects regardless of the permissions was most likely an
      oversight - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (i.e.
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new MSG_STAT_ANY command such that the msq ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
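
      Usage mirrors MSG_STAT: the first argument is an index into the
      kernel's internal array, but no read permission on the queue is
      required.  A sketch (needs headers defining MSG_STAT_ANY; the same
      pattern applies to the SEM_STAT_ANY and SHM_STAT_ANY commands from the
      sibling patches):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/msg.h>

        int main(void)
        {
                struct msginfo info;
                struct msqid_ds ds;
                int maxidx, i;

                /* MSG_INFO returns the index of the highest used entry. */
                maxidx = msgctl(0, MSG_INFO, (struct msqid_ds *)&info);
                for (i = 0; i <= maxidx; i++) {
                        int id = msgctl(i, MSG_STAT_ANY, &ds);
                        if (id >= 0)
                                printf("msqid %d: %lu bytes queued\n",
                                       id, (unsigned long)ds.msg_cbytes);
                }
                return 0;
        }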
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Robert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/sem: introduce semctl(SEM_STAT_ANY) · a280d6dc
      Davidlohr Bueso authored
      There is a permission discrepancy when consulting sem ipc object
      metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
      command.  The latter does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCES is returned via syscall but the
      info is displayed anyway in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the sma metadata), this behavior goes way back, and showing
      all the objects regardless of the permissions was most likely an
      oversight - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (i.e.
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new SEM_STAT_ANY command such that the sem ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Robert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/shm: introduce shmctl(SHM_STAT_ANY) · c21a6970
      Davidlohr Bueso authored
      Patch series "sysvipc: introduce STAT_ANY commands", v2.
      
      The following patches add the discussed (see [1]) new command for shm
      as well as for sem and msq, as they are subject to the same
      discrepancies for ipc object permission checks between the syscall and
      procfs interfaces.  These new commands are justified in that (1) we are
      stuck with these semantics, as changing the syscall or procfs can break
      userland; and (2) some users can benefit from the performance (for
      large numbers of shm segments, for example) of not having to parse the
      procfs interface.
      
      Once merged, I will submit the necessary manpage updates.  But I'm thinking
      something like:
      
      : diff --git a/man2/shmctl.2 b/man2/shmctl.2
      : index 7bb503999941..bb00bbe21a57 100644
      : --- a/man2/shmctl.2
      : +++ b/man2/shmctl.2
      : @@ -41,6 +41,7 @@
      :  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
      :  .\"	attaches to a segment that has already been marked for deletion.
      :  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
      : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
      :  .\"
      :  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
      :  .SH NAME
      : @@ -242,6 +243,18 @@ However, the
      :  argument is not a segment identifier, but instead an index into
      :  the kernel's internal array that maintains information about
      :  all shared memory segments on the system.
      : +.TP
      : +.BR SHM_STAT_ANY " (Linux-specific)"
      : +Return a
      : +.I shmid_ds
      : +structure as for
      : +.BR SHM_STAT .
      : +However, the
      : +.I shm_perm.mode
      : +is not checked for read access for
      : +.IR shmid ,
      : +resembling the behaviour of
      : +/proc/sysvipc/shm.
      :  .PP
      :  The caller can prevent or allow swapping of a shared
      :  memory segment with the following \fIcmd\fP values:
      : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
      :  kernel's internal array recording information about all
      :  shared memory segments.
      :  (This information can be used with repeated
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operations to obtain information about all shared memory segments
      :  on the system.)
      :  A successful
      : @@ -328,7 +341,7 @@ isn't accessible.
      :  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
      :  is not a valid command.
      :  Or: for a
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operation, the index value specified in
      :  .I shmid
      :  referred to an array slot that is currently unused.
      
      This patch (of 3):
      
      There is a permission discrepancy when consulting shm ipc object metadata
      between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
      latter does permission checks for the object vs S_IRUGO.  As such there can
      be cases where EACCES is returned via syscall but the info is displayed
      anyway in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the shm metadata), this behavior goes way back, and showing all
      the objects regardless of the permissions was most likely an oversight - so
      we are stuck with it.  Furthermore, modifying either the syscall or the
      procfs file can cause userspace programs to break (i.e. ipcs).  Some
      applications require getting the procfs info (without root privileges) and
      can be rather slow in comparison with a syscall -- up to 500x in some
      reported cases.
      
      This patch introduces a new SHM_STAT_ANY command such that the shm ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      [1] https://lkml.org/lkml/2017/12/19/220
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Robert Kettler <robert.kettler@outlook.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: pin stack limit during exec · c31dbb14
      Kees Cook authored
      Since the stack rlimit is used in multiple places during exec and it can
      be changed via other threads (via setrlimit()) or processes (via
      prlimit()), the assumption that the value doesn't change cannot be made.
      This leads to races with mm layout selection and argument size
      calculations.  This changes the exec path to use the rlimit stored in
      bprm instead of in current.  Before starting the thread, the bprm stack
      rlimit is stored back to current.
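
      A condensed sketch of the resulting flow (simplified; bprm->rlim_stack
      is the field described above):

        /* early in exec: snapshot the limit into the bprm */
        bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];

        /* ...layout and argument-size decisions read bprm->rlim_stack... */

        /* finalize_exec(), just before start_thread(): write it back */
        task_lock(current->group_leader);
        current->signal->rlim[RLIMIT_STACK] = bprm->rlim_stack;
        task_unlock(current->group_leader);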
      
      Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
      Fixes: 64701dee ("exec: Use sane stack rlimit under secureexec")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Greg KH <greg@kroah.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: introduce finalize_exec() before start_thread() · b8383831
      Kees Cook authored
      Provide a final callback into fs/exec.c before start_thread() takes
      over, to handle any last-minute changes, like the coming restoration of
      the stack limit.
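
      A sketch of the hook and its call site (the body starts out empty; the
      follow-up stack-limit restoration lands here):

        void finalize_exec(struct linux_binprm *bprm)
        {
                /* Last-minute fixups before the new program runs. */
        }
        EXPORT_SYMBOL(finalize_exec);

        /* e.g. near the end of load_elf_binary(): */
        finalize_exec(bprm);
        start_thread(regs, elf_entry, bprm->p);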
      
      Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: pass stack rlimit into mm layout functions · 8f2af155
      Kees Cook authored
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
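
      A sketch of the signature change:

        /* before */
        void arch_pick_mmap_layout(struct mm_struct *mm);

        /* after: the pinned limit is passed down explicitly */
        void arch_pick_mmap_layout(struct mm_struct *mm,
                                   struct rlimit *rlim_stack);

        /* call site during exec: */
        arch_pick_mmap_layout(current->mm, &bprm->rlim_stack);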
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • seq_file: allocate seq_file from kmem_cache · 09652320
      Alexey Dobriyan authored
      For fine-grained debugging and usercopy protection.
      
      Link: http://lkml.kernel.org/r/20180310085027.GA17121@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • task_struct: only use anon struct under randstruct plugin · 2cfe0d30
      Kees Cook authored
      The original intent for always adding the anonymous struct in
      task_struct was to make sure we had compiler coverage.
      
      However, this caused pathological padding of 40 bytes at the start of
      task_struct.  Instead, move the anonymous struct to being only used when
      struct layout randomization is enabled.
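
      A sketch of the approach (macro names follow the compiler headers;
      treat as illustrative): the anonymous-struct markers expand to nothing
      unless the randstruct plugin is enabled.

        #ifdef RANDSTRUCT_PLUGIN
        #define randomized_struct_fields_start  struct {
        #define randomized_struct_fields_end    } __randomize_layout;
        #else
        /* No anonymous struct, hence no pathological padding. */
        #define randomized_struct_fields_start
        #define randomized_struct_fields_end
        #endif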
      
      Link: http://lkml.kernel.org/r/20180327213609.GA2964@beast
      Fixes: 29e48ce8 ("task_struct: Allow randomized")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • uts: create "struct uts_namespace" from kmem_cache · 3ea056c5
      Alexey Dobriyan authored
      So "struct uts_namespace" can enjoy fine-grained SLAB debugging and
      usercopy protection.
      
      I'd prefer shorter name "utsns" but there is "user_namespace" already.
      
      Link: http://lkml.kernel.org/r/20180228215158.GA23146@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>