1. 13 4月, 2018 3 次提交
  2. 12 4月, 2018 37 次提交
    • A
      CIFS: add ONCE flag for cifs_dbg type · f2f176b4
      Aurelien Aptel 提交于
      * Since cifs_vfs_error was just using pr_debug_ratelimited like the rest
        of cifs_dbg, move it there too
      * Add a ONCE type flag to call the pr_xxx_once() debug function instead
        of the ratelimited ones.
      
      To convert existing printk_once() calls to this we can run:
      
          perl -i -pE \
            's/printk_once\s*\(([^" \n]+)(.*)/cifs_dbg(VFS|ONCE,$2/g' \
            fs/cifs/*.c
      Signed-off-by: NAurelien Aptel <aaptel@suse.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      f2f176b4
    • G
      cifs: Use ULL suffix for 64-bit constant · 3995bbf5
      Geert Uytterhoeven 提交于
      On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1):
      
          fs/cifs/inode.c: In function ‘simple_hashstr’:
          fs/cifs/inode.c:713: warning: integer constant is too large for ‘long’ type
      
      Fixes: 7ea884c7 ("smb3: Fix root directory when server returns inode number of zero")
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NAurelien Aptel <aaptel@suse.com>
      3995bbf5
    • S
      SMB3: Log at least once if tree connect fails during reconnect · c318e6c2
      Steve French 提交于
      Adding an extra debug message to show if a tree connect failure during
      reconnect (and made it a log once so it doesn't spam the logs).
      Saw a case recently where tree connect repeatedly returned
      access denied on reconnect and it wasn't as easy to spot as it
      should have been.
      Signed-off-by: NSteve French <smfrench@gmail.com>
      Reviewed-by: NAurelien Aptel <aaptel@suse.com>
      c318e6c2
    • G
      cifs: smb2pdu: Fix potential NULL pointer dereference · c0953f2e
      Gustavo A. R. Silva 提交于
      tcon->ses is being dereferenced before it is null checked, hence
      there is a potential null pointer dereference.
      
      Fix this by moving the pointer dereference after tcon->ses has
      been properly null checked.
      
      Addresses-Coverity-ID: 1467426 ("Dereference before null check")
      Fixes: 93012bf9 ("cifs: add server->vals->header_preamble_size")
      Signed-off-by: NGustavo A. R. Silva <garsilva@embeddedor.com>
      Reviewed-by: NAurelien Aptel <aaptel@suse.com>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      c0953f2e
    • L
      Merge branch 'akpm' (patches from Andrew) · 8837c70d
      Linus Torvalds 提交于
      Merge more updates from Andrew Morton:
      
       - almost all of the rest of MM
      
       - kasan updates
      
       - lots of procfs work
      
       - misc things
      
       - lib/ updates
      
       - checkpatch
      
       - rapidio
      
       - ipc/shm updates
      
       - the start of willy's XArray conversion
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (140 commits)
        page cache: use xa_lock
        xarray: add the xa_lock to the radix_tree_root
        fscache: use appropriate radix tree accessors
        export __set_page_dirty
        unicore32: turn flush_dcache_mmap_lock into a no-op
        arm64: turn flush_dcache_mmap_lock into a no-op
        mac80211_hwsim: use DEFINE_IDA
        radix tree: use GFP_ZONEMASK bits of gfp_t for flags
        linux/const.h: refactor _BITUL and _BITULL a bit
        linux/const.h: move UL() macro to include/linux/const.h
        linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI
        xen, mm: allow deferred page initialization for xen pv domains
        elf: enforce MAP_FIXED on overlaying elf segments
        fs, elf: drop MAP_FIXED usage from elf_map
        mm: introduce MAP_FIXED_NOREPLACE
        MAINTAINERS: update bouncing aacraid@adaptec.com addresses
        fs/dcache.c: add cond_resched() in shrink_dentry_list()
        include/linux/kfifo.h: fix comment
        ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops
        kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param
        ...
      8837c70d
    • M
      page cache: use xa_lock · b93b0163
      Matthew Wilcox 提交于
      Remove the address_space ->tree_lock and use the xa_lock newly added to
      the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
      since we don't really care that it's a tree.
      
      [willy@infradead.org: fix nds32, fs/dax.c]
        Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b93b0163
    • M
      xarray: add the xa_lock to the radix_tree_root · f6bb2a2c
      Matthew Wilcox 提交于
      This results in no change in structure size on 64-bit machines as it
      fits in the padding between the gfp_t and the void *.  32-bit machines
      will grow the structure from 8 to 12 bytes.  Almost all radix trees are
      protected with (at least) a spinlock, so as they are converted from
      radix trees to xarrays, the data structures will shrink again.
      
      Initialising the spinlock requires a name for the benefit of lockdep, so
      RADIX_TREE_INIT() now needs to know the name of the radix tree it's
      initialising, and so do IDR_INIT() and IDA_INIT().
      
      Also add the xa_lock() and xa_unlock() family of wrappers to make it
      easier to use the lock.  If we could rely on -fplan9-extensions in the
      compiler, we could avoid all of this syntactic sugar, but that wasn't
      added until gcc 4.6.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-8-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6bb2a2c
    • M
      fscache: use appropriate radix tree accessors · e5a95541
      Matthew Wilcox 提交于
      Don't open-code accesses to data structure internals.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5a95541
    • M
      export __set_page_dirty · f82b3764
      Matthew Wilcox 提交于
      XFS currently contains a copy-and-paste of __set_page_dirty().  Export
      it from buffer.c instead.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-6-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NJeff Layton <jlayton@kernel.org>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f82b3764
    • M
      unicore32: turn flush_dcache_mmap_lock into a no-op · d339d705
      Matthew Wilcox 提交于
      Unicore doesn't walk the VMA tree in its flush_dcache_page()
      implementation, so has no need to take the tree_lock.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-5-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d339d705
    • M
      arm64: turn flush_dcache_mmap_lock into a no-op · 427c896f
      Matthew Wilcox 提交于
      ARM64 doesn't walk the VMA tree in its flush_dcache_page()
      implementation, so has no need to take the tree_lock.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-4-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: NWill Deacon <will.deacon@arm.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      427c896f
    • M
      60a05271
    • M
      radix tree: use GFP_ZONEMASK bits of gfp_t for flags · fa290cda
      Matthew Wilcox 提交于
      Patch series "XArray", v9.  (First part thereof).
      
      This patchset is, I believe, appropriate for merging for 4.17.  It
      contains the XArray implementation, to eventually replace the radix
      tree, and converts the page cache to use it.
      
      This conversion keeps the radix tree and XArray data structures in sync
      at all times.  That allows us to convert the page cache one function at
      a time and should allow for easier bisection.  Other than renaming some
      elements of the structures, the data structures are fundamentally
      unchanged; a radix tree walk and an XArray walk will touch the same
      number of cachelines.  I have changes planned to the XArray data
      structure, but those will happen in future patches.
      
      Improvements the XArray has over the radix tree:
      
       - The radix tree provides operations like other trees do; 'insert' and
         'delete'. But what most users really want is an automatically
         resizing array, and so it makes more sense to give users an API that
         is like an array -- 'load' and 'store'. We still have an 'insert'
         operation for users that really want that semantic.
      
       - The XArray considers locking as part of its API. This simplifies a
         lot of users who formerly had to manage their own locking just for
         the radix tree. It also improves code generation as we can now tell
         RCU that we're holding a lock and it doesn't need to generate as much
         fencing code. The other advantage is that tree nodes can be moved
         (not yet implemented).
      
       - GFP flags are now parameters to calls which may need to allocate
         memory. The radix tree forced users to decide what the allocation
         flags would be at creation time. It's much clearer to specify them at
         allocation time.
      
       - Memory is not preloaded; we don't tie up dozens of pages on the off
         chance that the slab allocator fails. Instead, we drop the lock,
         allocate a new node and retry the operation. We have to convert all
         the radix tree, IDA and IDR preload users before we can realise this
         benefit, but I have not yet found a user which cannot be converted.
      
       - The XArray provides a cmpxchg operation. The radix tree forces users
         to roll their own (and at least four have).
      
       - Iterators take a 'max' parameter. That simplifies many users and will
         reduce the amount of iteration done.
      
       - Iteration can proceed backwards. We only have one user for this, but
         since it's called as part of the pagefault readahead algorithm, that
         seemed worth mentioning.
      
       - RCU-protected pointers are not exposed as part of the API. There are
         some fun bugs where the page cache forgets to use rcu_dereference()
         in the current codebase.
      
       - Value entries gain an extra bit compared to radix tree exceptional
         entries. That gives us the extra bit we need to put huge page swap
         entries in the page cache.
      
       - Some iterators now take a 'filter' argument instead of having
         separate iterators for tagged/untagged iterations.
      
      The page cache is improved by this:
      
       - Shorter, easier to read code
      
       - More efficient iterations
      
       - Reduction in size of struct address_space
      
       - Fewer walks from the top of the data structure; the XArray API
         encourages staying at the leaf node and conducting operations there.
      
      This patch (of 8):
      
      None of these bits may be used for slab allocations, so we can use them
      as radix tree flags as long as we mask them off before passing them to
      the slab allocator. Move the IDR flag from the high bits to the
      GFP_ZONEMASK bits.
      
      Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NJeff Layton <jlayton@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa290cda
    • M
      linux/const.h: refactor _BITUL and _BITULL a bit · 21e7bc60
      Masahiro Yamada 提交于
      Minor cleanups available by _UL and _ULL.
      
      Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21e7bc60
    • M
      linux/const.h: move UL() macro to include/linux/const.h · 2dd8a62c
      Masahiro Yamada 提交于
      ARM, ARM64 and UniCore32 duplicate the definition of UL():
      
        #define UL(x) _AC(x, UL)
      
      This is not actually arch-specific, so it will be useful to move it to a
      common header.  Currently, we only have the uapi variant for
      linux/const.h, so I am creating include/linux/const.h.
      
      I also added _UL(), _ULL() and ULL() because _AC() is mostly used in
      the form either _AC(..., UL) or _AC(..., ULL).  I expect they will be
      replaced in follow-up cleanups.  The underscore-prefixed ones should
      be used for exported headers.
      
      Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: NGuan Xuetao <gxt@mprc.pku.edu.cn>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2dd8a62c
    • M
      linux/const.h: prefix include guard of uapi/linux/const.h with _UAPI · 2a6cc8a6
      Masahiro Yamada 提交于
      Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(),
      BIT() etc", v3.
      
      ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL).  More
      architectures may introduce it in the future.
      
      UL() is arch-agnostic, and useful. So let's move it to
      include/linux/const.h
      
      Currently, <asm/memory.h> must be included to use UL().  It pulls in more
      bloats just for defining some bit macros.
      
      I posted V2 one year ago.
      
      The previous posts are:
      https://patchwork.kernel.org/patch/9498273/
      https://patchwork.kernel.org/patch/9498275/
      https://patchwork.kernel.org/patch/9498269/
      https://patchwork.kernel.org/patch/9498271/
      
      At that time, what blocked this series was a comment from
      David Howells:
        You need to be very careful doing this.  Some userspace stuff
        depends on the guard macro names on the kernel header files.
      
      (https://patchwork.kernel.org/patch/9498275/)
      
      Looking at the code closer, I noticed this is not a problem.
      
      See the following line.
      https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40
      
      scripts/headers_install.sh rips off _UAPI prefix from guard macro names.
      
      I ran "make headers_install" and confirmed the result is what I expect.
      
      So, we can prefix the include guard of include/uapi/linux/const.h,
      and add a new include/linux/const.h.
      
      This patch (of 4):
      
      I am going to add include/linux/const.h for the kernel space.
      
      Add _UAPI to the include guard of include/uapi/linux/const.h to
      prepare for that.
      
      Please notice the guard name of the exported one will be kept as-is.
      So, this commit has no impact to the userspace even if some userspace
      stuff depends on the guard macro names.
      
      scripts/headers_install.sh processes exported headers by SED, and
      rips off "_UAPI" from guard macro names.
      
        #ifndef _UAPI_LINUX_CONST_H
        #define _UAPI_LINUX_CONST_H
      
      will be turned into
      
        #ifndef _LINUX_CONST_H
        #define _LINUX_CONST_H
      
      Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a6cc8a6
    • P
      xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Pavel Tatashin 提交于
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because the xen's PagePinned() flag is getting erased from
      struct pages when they are initialized later in boot.
      
      Juergen fixed this problem by disabling deferred pages on xen pv
      domains.  It is desirable, however, to have this feature available as it
      reduces boot time.  This fix re-enables the feature for pv-dmains, and
      fixes the problem the following way:
      
      The fix is to delay setting PagePinned flag until struct pages for all
      allocated memory are initialized, i.e.  until after free_all_bootmem().
      
      A new x86_init.hyper op init_after_bootmem() is called to let xen know
      that boot allocator is done, and hence struct pages for all the
      allocated memory are now initialized.  If deferred page initialization
      is enabled, the rest of struct pages are going to be initialized later
      in boot once page_alloc_init_late() is called.
      
      xen_after_bootmem() walks page table's pages and marks them pinned.
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Tested-by: NJuergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f84f8d1
    • M
      elf: enforce MAP_FIXED on overlaying elf segments · ad55eac7
      Michal Hocko 提交于
      Anshuman has reported that with "fs, elf: drop MAP_FIXED usage from
      elf_map" applied, some ELF binaries in his environment fail to start
      with
      
       [   23.423642] 9148 (sed): Uhuuh, elf segment at 0000000010030000 requested but the memory is mapped already
       [   23.423706] requested [10030000, 10040000] mapped [10030000, 10040000] 100073 anon
      
      The reason is that the above binary has overlapping elf segments:
      
        LOAD           0x0000000000000000 0x0000000010000000 0x0000000010000000
                       0x0000000000013a8c 0x0000000000013a8c  R E    10000
        LOAD           0x000000000001fd40 0x000000001002fd40 0x000000001002fd40
                       0x00000000000002c0 0x00000000000005e8  RW     10000
        LOAD           0x0000000000020328 0x0000000010030328 0x0000000010030328
                       0x0000000000000384 0x00000000000094a0  RW     10000
      
      That binary has two RW LOAD segments, the first crosses a page border
      into the second
      
        0x1002fd40 (LOAD2-vaddr) + 0x5e8 (LOAD2-memlen) == 0x10030328 (LOAD3-vaddr)
      
      Handle this situation by enforcing MAP_FIXED when we establish a
      temporary brk VMA to handle overlapping segments.  All other mappings
      will still use MAP_FIXED_NOREPLACE.
      
      Link: http://lkml.kernel.org/r/20180213100440.GM3443@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mark Brown <broonie@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad55eac7
    • M
      fs, elf: drop MAP_FIXED usage from elf_map · 4ed28639
      Michal Hocko 提交于
      Both load_elf_interp and load_elf_binary rely on elf_map to map segments
      on a controlled address and they use MAP_FIXED to enforce that.  This is
      however dangerous thing prone to silent data corruption which can be
      even exploitable.
      
      Let's take CVE-2017-1000253 as an example.  At the time (before commit
      eab09532: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
      ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from
      the stack top on 32b (legacy) memory layout (only 1GB away).  Therefore
      we could end up mapping over the existing stack with some luck.
      
      The issue has been fixed since then (a87938b2: "fs/binfmt_elf.c: fix
      bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much
      further from the stack (eab09532 and later by c715b72c: "mm:
      revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive
      stack consumption early during execve fully stopped by da029c11
      ("exec: Limit arg stack to at most 75% of _STK_LIM").  So we should be
      safe and any attack should be impractical.  On the other hand this is
      just too subtle assumption so it can break quite easily and hard to
      spot.
      
      I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still
      fundamentally dangerous.  Moreover it shouldn't be even needed.  We are
      at the early process stage and so there shouldn't be unrelated mappings
      (except for stack and loader) existing so mmap for a given address should
      succeed even without MAP_FIXED.  Something is terribly wrong if this is
      not the case and we should rather fail than silently corrupt the
      underlying mapping.
      
      Address this issue by changing MAP_FIXED to the newly added
      MAP_FIXED_NOREPLACE.  This will mean that mmap will fail if there is an
      existing mapping clashing with the requested one without clobbering it.
      
      [mhocko@suse.com: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      [avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC]
        Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org
      Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrei Vagin <avagin@openvz.org>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ed28639
    • M
      mm: introduce MAP_FIXED_NOREPLACE · a4ff8e86
      Michal Hocko 提交于
      Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
      
      This has started as a follow up discussion [3][4] resulting in the
      runtime failure caused by hardening patch [5] which removes MAP_FIXED
      from the elf loader because MAP_FIXED is inherently dangerous as it
      might silently clobber an existing underlying mapping (e.g.  stack).
      The reason for the failure is that some architectures enforce an
      alignment for the given address hint without MAP_FIXED used (e.g.  for
      shared or file backed mappings).
      
      One way around this would be excluding those archs which do alignment
      tricks from the hardening [6].  The patch is really trivial but it has
      been objected, rightfully so, that this screams for a more generic
      solution.  We basically want a non-destructive MAP_FIXED.
      
      The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
      address but unlike MAP_FIXED it fails with EEXIST if the given range
      conflicts with an existing one.  The flag is introduced as a completely
      new one rather than a MAP_FIXED extension because of the backward
      compatibility.  We really want a never-clobber semantic even on older
      kernels which do not recognize the flag.  Unfortunately mmap sucks
      wrt flags evaluation because we do not EINVAL on unknown flags.  On
      those kernels we would simply use the traditional hint based semantic so
      the caller can still get a different address (which sucks) but at least
      not silently corrupt an existing mapping.  I do not see a good way
      around that.  Except we won't export expose the new semantic to the
      userspace at all.
      
      It seems there are users who would like to have something like that.
      Jemalloc has been mentioned by Michael Ellerman [7]
      
      Florian Weimer has mentioned the following:
      : glibc ld.so currently maps DSOs without hints.  This means that the kernel
      : will map right next to each other, and the offsets between them a completely
      : predictable.  We would like to change that and supply a random address in a
      : window of the address space.  If there is a conflict, we do not want the
      : kernel to pick a non-random address. Instead, we would try again with a
      : random address.
      
      John Hubbard has mentioned CUDA example
      : a) Searches /proc/<pid>/maps for a "suitable" region of available
      : VA space.  "Suitable" generally means it has to have a base address
      : within a certain limited range (a particular device model might
      : have odd limitations, for example), it has to be large enough, and
      : alignment has to be large enough (again, various devices may have
      : constraints that lead us to do this).
      :
      : This is of course subject to races with other threads in the process.
      :
      : Let's say it finds a region starting at va.
      :
      : b) Next it does:
      :     p = mmap(va, ...)
      :
      : *without* setting MAP_FIXED, of course (so va is just a hint), to
      : attempt to safely reserve that region. If p != va, then in most cases,
      : this is a failure (almost certainly due to another thread getting a
      : mapping from that region before we did), and so this layer now has to
      : call munmap(), before returning a "failure: retry" to upper layers.
      :
      :     IMPROVEMENT: --> if instead, we could call this:
      :
      :             p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
      :
      :         , then we could skip the munmap() call upon failure. This
      :         is a small thing, but it is useful here. (Thanks to Piotr
      :         Jaroszynski and Mark Hairgrove for helping me get that detail
      :         exactly right, btw.)
      :
      : c) After that, CUDA suballocates from p, via:
      :
      :      q = mmap(sub_region_start, ... MAP_FIXED ...)
      :
      : Interestingly enough, "freeing" is also done via MAP_FIXED, and
      : setting PROT_NONE to the subregion. Anyway, I just included (c) for
      : general interest.
      
      Atomic address range probing in the multithreaded programs in general
      sounds like an interesting thing to me.
      
      The second patch simply replaces MAP_FIXED use in elf loader by
      MAP_FIXED_NOREPLACE.  I believe other places which rely on MAP_FIXED
      should follow.  Actually real MAP_FIXED usages should be docummented
      properly and they should be more of an exception.
      
      [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
      [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
      [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
      [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
      [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
      [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
      
      This patch (of 2):
      
      MAP_FIXED is used quite often to enforce mapping at the particular range.
      The main problem of this flag is, however, that it is inherently dangerous
      because it unmaps existing mappings covered by the requested range.  This
      can cause silent memory corruptions.  Some of them even with serious
      security implications.  While the current semantic might be really
      desiderable in many cases there are others which would want to enforce the
      given range but rather see a failure than a silent memory corruption on a
      clashing range.  Please note that there is no guarantee that a given range
      is obeyed by the mmap even when it is free - e.g.  arch specific code is
      allowed to apply an alignment.
      
      Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
      behavior.  It has the same semantic as MAP_FIXED wrt.  the given address
      request with a single exception that it fails with EEXIST if the requested
      address is already covered by an existing mapping.  We still do rely on
      get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
      check for a conflicting vma after it returns.
      
      The flag is introduced as a completely new one rather than a MAP_FIXED
      extension because of the backward compatibility.  We really want a
      never-clobber semantic even on older kernels which do not recognize the
      flag.  Unfortunately mmap sucks wrt.  flags evaluation because we do not
      EINVAL on unknown flags.  On those kernels we would simply use the
      traditional hint based semantic so the caller can still get a different
      address (which sucks) but at least not silently corrupt an existing
      mapping.  I do not see a good way around that.
      
      [mpe@ellerman.id.au: fix whitespace]
      [fail on clashing range with EEXIST as per Florian Weimer]
      [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
      Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.orgReviewed-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Jason Evans <jasone@google.com>
      Cc: David Goldblatt <davidtgoldblatt@gmail.com>
      Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4ff8e86
    • J
      MAINTAINERS: update bouncing aacraid@adaptec.com addresses · 721d8b41
      Joe Perches 提交于
      Adaptec is now part of Microsemi.
      
      Commit 2a81ffdd ("MAINTAINERS: Update email address for aacraid")
      updated only one of the driver maintainer addresses.
      
      Update the other two sections as the aacraid@adaptec.com address
      bounces.
      
      Link: http://lkml.kernel.org/r/1522103936.12357.27.camel@perches.comSigned-off-by: NJoe Perches <joe@perches.com>
      Cc: Dave Carroll <david.carroll@microsemi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      721d8b41
    • N
      fs/dcache.c: add cond_resched() in shrink_dentry_list() · 32785c05
      Nikolay Borisov 提交于
      As previously reported (https://patchwork.kernel.org/patch/8642031/)
      it's possible to call shrink_dentry_list with a large number of dentries
      (> 10000).  This, in turn, could trigger the softlockup detector and
      possibly trigger a panic.  In addition to the unmount path being
      vulnerable to this scenario, at SuSE we've observed similar situation
      happening during process exit on processes that touch a lot of dentries.
      Here is an excerpt from a crash dump.  The number after the colon are
      the number of dentries on the list passed to shrink_dentry_list:
      
      PID 99760: 10722
      PID 107530: 215
      PID 108809: 24134
      PID 108877: 21331
      PID 141708: 16487
      
      So we want to kill between 15k-25k dentries without yielding.
      
      And one possible call stack looks like:
      
      4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
      5 [ffff8839ece41db0] evict at ffffffff811c3026
      6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
      7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
      8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
      9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
      10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
      11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
      12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
      13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909
      
      While some of the callers of shrink_dentry_list do use cond_resched,
      this is not sufficient to prevent softlockups.  So just move
      cond_resched into shrink_dentry_list from its callers.
      
      David said: I've found hundreds of occurrences of warnings that we emit
      when need_resched stays set for a prolonged period of time with the
      stack trace that is included in the change log.
      
      Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.comSigned-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32785c05
    • V
      include/linux/kfifo.h: fix comment · de99626c
      Valentin Vidic 提交于
      Clean up unusual formatting in the note about locking.
      
      Link: http://lkml.kernel.org/r/20180324002630.13046-1-Valentin.Vidic@CARNet.hrSigned-off-by: NValentin Vidic <Valentin.Vidic@CARNet.hr>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Sean Young <sean@mess.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de99626c
    • A
      ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops · a61fc2cb
      Andrew Morton 提交于
      This was added by the recent "ipc/shm.c: add split function to
      shm_vm_ops", but it is not necessary.
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a61fc2cb
    • W
      kernel/sysctl.c: add kdoc comments to do_proc_do{u}intvec_minmax_conv_param · 24704f36
      Waiman Long 提交于
      Kdoc comments are added to the do_proc_dointvec_minmax_conv_param and
      do_proc_douintvec_minmax_conv_param structures thare are used internally
      for range checking.
      
      The error codes returned by proc_dointvec_minmax() and
      proc_douintvec_minmax() are also documented.
      
      Link: http://lkml.kernel.org/r/1519926220-7453-3-git-send-email-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NLuis R. Rodriguez <mcgrof@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      24704f36
    • W
      fs/proc/proc_sysctl.c: fix typo in sysctl_check_table_array() · 64a11f3d
      Waiman Long 提交于
      Patch series "ipc: Clamp *mni to the real IPCMNI limit", v3.
      
      The sysctl parameters msgmni, shmmni and semmni have an inherent limit
      of IPC_MNI (32k).  However, users may not be aware of that because they
      can write a value much higher than that without getting any error or
      notification.  Reading the parameters back will show the newly written
      values which are not real.
      
      Enforcing the limit by failing sysctl parameter write, however, can
      break existing user applications.  To address this delemma, a new flags
      field is introduced into the ctl_table.  The value CTL_FLAGS_CLAMP_RANGE
      can be added to any ctl_table entries to enable a looser range clamping
      without returning any error.  For example,
      
        .flags = CTL_FLAGS_CLAMP_RANGE,
      
      This flags value are now used for the range checking of shmmni, msgmni
      and semmni without breaking existing applications.  If any out of range
      value is written to those sysctl parameters, the following warning will
      be printed instead.
      
        Kernel parameter "shmmni" was set out of range [0, 32768], clamped to 32768.
      
      Reading the values back will show 32768 instead of some fake values.
      
      This patch (of 6):
      
      Fix a typo.
      
      Link: http://lkml.kernel.org/r/1519926220-7453-2-git-send-email-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NLuis R. Rodriguez <mcgrof@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64a11f3d
    • D
      ipc/msg: introduce msgctl(MSG_STAT_ANY) · 23c8cec8
      Davidlohr Bueso 提交于
      There is a permission discrepancy when consulting msq ipc object
      metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl
      command.  The later does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCESS is returned via syscall but the
      info is displayed anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the msq metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      overlook - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (ie
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new MSG_STAT_ANY command such that the msq ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Reported-by: NRobert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23c8cec8
    • D
      ipc/sem: introduce semctl(SEM_STAT_ANY) · a280d6dc
      Davidlohr Bueso 提交于
      There is a permission discrepancy when consulting shm ipc object
      metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
      command.  The later does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCESS is returned via syscall but the
      info is displayed anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the sma metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      overlook - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (ie
      ipcs).  Some applications require getting the procfs info (without root
      privileges) and can be rather slow in comparison with a syscall -- up to
      500x in some reported cases for shm.
      
      This patch introduces a new SEM_STAT_ANY command such that the sem ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Reported-by: NRobert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a280d6dc
    • D
      ipc/shm: introduce shmctl(SHM_STAT_ANY) · c21a6970
      Davidlohr Bueso 提交于
      Patch series "sysvipc: introduce STAT_ANY commands", v2.
      
      The following patches adds the discussed (see [1]) new command for shm
      as well as for sems and msq as they are subject to the same
      discrepancies for ipc object permission checks between the syscall and
      via procfs.  These new commands are justified in that (1) we are stuck
      with this semantics as changing syscall and procfs can break userland;
      and (2) some users can benefit from performance (for large amounts of
      shm segments, for example) from not having to parse the procfs
      interface.
      
      Once merged, I will submit the necesary manpage updates.  But I'm thinking
      something like:
      
      : diff --git a/man2/shmctl.2 b/man2/shmctl.2
      : index 7bb503999941..bb00bbe21a57 100644
      : --- a/man2/shmctl.2
      : +++ b/man2/shmctl.2
      : @@ -41,6 +41,7 @@
      :  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
      :  .\"	attaches to a segment that has already been marked for deletion.
      :  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
      : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
      :  .\"
      :  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
      :  .SH NAME
      : @@ -242,6 +243,18 @@ However, the
      :  argument is not a segment identifier, but instead an index into
      :  the kernel's internal array that maintains information about
      :  all shared memory segments on the system.
      : +.TP
      : +.BR SHM_STAT_ANY " (Linux-specific)"
      : +Return a
      : +.I shmid_ds
      : +structure as for
      : +.BR SHM_STAT .
      : +However, the
      : +.I shm_perm.mode
      : +is not checked for read access for
      : +.IR shmid ,
      : +resembing the behaviour of
      : +/proc/sysvipc/shm.
      :  .PP
      :  The caller can prevent or allow swapping of a shared
      :  memory segment with the following \fIcmd\fP values:
      : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
      :  kernel's internal array recording information about all
      :  shared memory segments.
      :  (This information can be used with repeated
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operations to obtain information about all shared memory segments
      :  on the system.)
      :  A successful
      : @@ -328,7 +341,7 @@ isn't accessible.
      :  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
      :  is not a valid command.
      :  Or: for a
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operation, the index value specified in
      :  .I shmid
      :  referred to an array slot that is currently unused.
      
      This patch (of 3):
      
      There is a permission discrepancy when consulting shm ipc object metadata
      between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
      later does permission checks for the object vs S_IRUGO.  As such there can
      be cases where EACCESS is returned via syscall but the info is displayed
      anyways in the procfs files.
      
      While this might have security implications via info leaking (albeit no
      writing to the shm metadata), this behavior goes way back and showing all
      the objects regardless of the permissions was most likely an overlook - so
      we are stuck with it.  Furthermore, modifying either the syscall or the
      procfs file can cause userspace programs to break (ie ipcs).  Some
      applications require getting the procfs info (without root privileges) and
      can be rather slow in comparison with a syscall -- up to 500x in some
      reported cases.
      
      This patch introduces a new SHM_STAT_ANY command such that the shm ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      [1] https://lkml.org/lkml/2017/12/19/220
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Robert Kettler <robert.kettler@outlook.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c21a6970
    • C
      kernel/params.c: downgrade warning for unsafe parameters · edc41b3c
      Chris Wilson 提交于
      As using an unsafe module parameter is, by its very definition, an
      expected user action, emitting a warning is overkill.  Nothing has yet
      gone wrong, and we add a taint flag for any future oops should something
      actually go wrong.  So instead of having a user controllable pr_warn,
      downgrade it to a pr_notice for "a normal, but significant condition".
      
      We make use of unsafe kernel parameters in igt
      (https://cgit.freedesktop.org/drm/igt-gpu-tools/) (we have not yet
      succeeded in removing all such debugging options), which generates a
      warning and taints the kernel.  The warning is unhelpful as we then need
      to filter it out again as we check that every test themselves do not
      provoke any kernel warnings.
      
      Link: http://lkml.kernel.org/r/20180226151919.9674-1-chris@chris-wilson.co.uk
      Fixes: 91f9d330 ("module: make it possible to have unsafe, tainting module params")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: NJani Nikula <jani.nikula@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Petri Latvala <petri.latvala@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edc41b3c
    • R
      kernel/sysctl.c: fix sizeof argument to match variable name · 2d87b309
      Randy Dunlap 提交于
      Fix sizeof argument to be the same as the data variable name.  Probably
      a copy/paste error.
      
      Mostly harmless since both variables are unsigned int.
      
      Fixes kernel bugzilla #197371:
        Possible access to unintended variable in "kernel/sysctl.c" line 1339
      https://bugzilla.kernel.org/show_bug.cgi?id=197371
      
      Link: http://lkml.kernel.org/r/e0d0531f-361e-ef5f-8499-32743ba907e1@infradead.orgSigned-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Reported-by: NPetru Mihancea <petrum@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d87b309
    • I
      rapidio: use a reference count for struct mport_dma_req · bbd876ad
      Ioan Nicu 提交于
      Once the dma request is passed to the DMA engine, the DMA subsystem
      would hold a pointer to this structure and could call the completion
      callback after do_dma_request() has timed out.
      
      The current code deals with this by putting timed out SYNC requests to a
      pending list and freeing them later, when the mport cdev device is
      released.  This still does not guarantee that the DMA subsystem is
      really done with those transfers, so in theory
      dma_xfer_callback/dma_req_free could be called after
      mport_cdev_release_dma and could potentially access already freed
      memory.
      
      This patch simplifies the current handling by using a kref in the mport
      dma request structure, so that it gets freed only when nobody uses it
      anymore.
      
      This also simplifies the code a bit, as FAF transfers are now handled in
      the same way as SYNC and ASYNC transfers.  There is no need anymore for
      the pending list and for the dma workqueue which was used in case of FAF
      transfers, so we remove them both.
      
      Link: http://lkml.kernel.org/r/20180405203342.GA16191@nokia.comSigned-off-by: NIoan Nicu <ioan.nicu.ext@nokia.com>
      Acked-by: NAlexandre Bounine <alex.bou9@gmail.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Frank Kunz <frank.kunz@nokia.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nokia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbd876ad
    • V
      drivers/rapidio/rio-scan.c: fix typo in comment · b94bb1f6
      Vasyl Gomonovych 提交于
      Fix typo in the words 'receiver', 'specified', 'during'
      
      Link: http://lkml.kernel.org/r/20180321211035.8904-1-gomonovych@gmail.comSigned-off-by: NVasyl Gomonovych <gomonovych@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Alexandre Bounine <alexandre.bounine@idt.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b94bb1f6
    • K
      exec: pin stack limit during exec · c31dbb14
      Kees Cook 提交于
      Since the stack rlimit is used in multiple places during exec and it can
      be changed via other threads (via setrlimit()) or processes (via
      prlimit()), the assumption that the value doesn't change cannot be made.
      This leads to races with mm layout selection and argument size
      calculations.  This changes the exec path to use the rlimit stored in
      bprm instead of in current.  Before starting the thread, the bprm stack
      rlimit is stored back to current.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-4-git-send-email-keescook@chromium.org
      Fixes: 64701dee ("exec: Use sane stack rlimit under secureexec")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NBen Hutchings <ben.hutchings@codethink.co.uk>
      Reported-by: NAndy Lutomirski <luto@kernel.org>
      Reported-by: NBrad Spengler <spender@grsecurity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Greg KH <greg@kroah.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c31dbb14
    • K
      exec: introduce finalize_exec() before start_thread() · b8383831
      Kees Cook 提交于
      Provide a final callback into fs/exec.c before start_thread() takes
      over, to handle any last-minute changes, like the coming restoration of
      the stack limit.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-3-git-send-email-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8383831
    • K
      exec: pass stack rlimit into mm layout functions · 8f2af155
      Kees Cook 提交于
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.orgSigned-off-by: NKees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f2af155
    • A
      seq_file: account everything to kmemcg · d64d01a1
      Alexey Dobriyan 提交于
      All it takes to open a file and read 1 byte from it.
      
      seq_file will be allocated along with any private allocations, and more
      importantly seq file buffer which is 1 page by default.
      
      Link: http://lkml.kernel.org/r/20180310085252.GB17121@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d64d01a1