1. 17 4月, 2015 27 次提交
  2. 16 4月, 2015 8 次提交
    • J
      proc: remove use of seq_printf return value · 25ce3191
      Joe Perches 提交于
      The seq_printf return value, because it's frequently misused,
      will eventually be converted to void.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25ce3191
    • R
      lib/string_helpers.c: change semantics of string_escape_mem · 41416f23
      Rasmus Villemoes 提交于
      The current semantics of string_escape_mem are inadequate for one of its
      current users, vsnprintf().  If that is to honour its contract, it must
      know how much space would be needed for the entire escaped buffer, and
      string_escape_mem provides no way of obtaining that (short of allocating a
      large enough buffer (~4 times input string) to let it play with, and
      that's definitely a big no-no inside vsnprintf).
      
      So change the semantics for string_escape_mem to be more snprintf-like:
      Return the size of the output that would be generated if the destination
      buffer was big enough, but of course still only write to the part of dst
      it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
      It is then up to the caller to detect whether output was truncated and to
      append a '\0' if desired.  Also, we must output partial escape sequences,
      otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
      printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
      they previously contained.
      
      This also fixes a bug in the escaped_string() helper function, which used
      to unconditionally pass a length of "end-buf" to string_escape_mem();
      since the latter doesn't check osz for being insanely large, it would
      happily write to dst.  For example, kasprintf(GFP_KERNEL, "something and
      then %pE", ...); is an easy way to trigger an oops.
      
      In test-string_helpers.c, the -ENOMEM test is replaced with testing for
      getting the expected return value even if the buffer is too small.  We
      also ensure that nothing is written (by relying on a NULL pointer deref)
      if the output size is 0 by passing NULL - this has to work for
      kasprintf("%pE") to work.
      
      In net/sunrpc/cache.c, I think qword_add still has the same semantics.
      Someone should definitely double-check this.
      
      In fs/proc/array.c, I made the minimum possible change, but longer-term it
      should stop poking around in seq_file internals.
      
      [andriy.shevchenko@linux.intel.com: simplify qword_add]
      [andriy.shevchenko@linux.intel.com: add missed curly braces]
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41416f23
    • I
      kernel: conditionally support non-root users, groups and capabilities · 2813893f
      Iulia Manda 提交于
      There are a lot of embedded systems that run most or all of their
      functionality in init, running as root:root.  For these systems,
      supporting multiple users is not necessary.
      
      This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
      non-root users, non-root groups, and capabilities optional.  It is enabled
      under CONFIG_EXPERT menu.
      
      When this symbol is not defined, UID and GID are zero in any possible case
      and processes always have all capabilities.
      
      The following syscalls are compiled out: setuid, setregid, setgid,
      setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
      getgroups, setfsuid, setfsgid, capget, capset.
      
      Also, groups.c is compiled out completely.
      
      In kernel/capability.c, capable function was moved in order to avoid
      adding two ifdef blocks.
      
      This change saves about 25 KB on a defconfig build.  The most minimal
      kernels have total text sizes in the high hundreds of kB rather than
      low MB.  (The 25k goes down a bit with allnoconfig, but not that much.
      
      The kernel was booted in Qemu.  All the common functionalities work.
      Adding users/groups is not possible, failing with -ENOSYS.
      
      Bloat-o-meter output:
      add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NIulia Manda <iulia.manda21@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2813893f
    • C
      /proc/PID/status: show all sets of pid according to ns · e4bc3324
      Chen Hanxiao 提交于
      If some issues occurred inside a container guest, host user could not know
      which process is in trouble just by guest pid: the users of container
      guest only knew the pid inside containers.  This will bring obstacle for
      trouble shooting.
      
      This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:
      
      a) In init_pid_ns, nothing changed;
      
      b) In one pidns, will tell the pid inside containers:
        NStgid: 21776   5       1
        NSpid:  21776   5       1
        NSpgid: 21776   5       1
        NSsid:  21729   1       0
        ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.
      
      c) If pidns is nested, it depends on which pidns are you in.
        NStgid: 5       1
        NSpid:  5       1
        NSpgid: 5       1
        NSsid:  1       0
        ** Views from level 1
      
      [akpm@linux-foundation.org: add CONFIG_PID_NS ifdef]
      Signed-off-by: NChen Hanxiao <chenhanxiao@cn.fujitsu.com>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Tested-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Tested-by: NNathan Scott <nathans@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4bc3324
    • B
      dax: unify ext2/4_{dax,}_file_operations · be64f884
      Boaz Harrosh 提交于
      The original dax patchset split the ext2/4_file_operations because of the
      two NULL splice_read/splice_write in the dax case.
      
      In the vfs if splice_read/splice_write are NULL we then call
      default_splice_read/write.
      
      What we do here is make generic_file_splice_read aware of IS_DAX() so the
      original ext2/4_file_operations can be used as is.
      
      For write it appears that iter_file_splice_write is just fine.  It uses
      the regular f_op->write(file,..) or new_sync_write(file, ...).
      Signed-off-by: NBoaz Harrosh <boaz@plexistor.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be64f884
    • B
      dax: use pfn_mkwrite to update c/mtime + freeze protection · 0e3b210c
      Boaz Harrosh 提交于
      From: Yigal Korman <yigal@plexistor.com>
      
      [v1]
      Without this patch, c/mtime is not updated correctly when mmap'ed page is
      first read from and then written to.
      
      A new xfstest is submitted for testing this (generic/080)
      
      [v2]
      Jan Kara has pointed out that if we add the
      sb_start/end_pagefault pair in the new pfn_mkwrite we
      are then fixing another bug where: A user could start
      writing to the page while filesystem is frozen.
      Signed-off-by: NYigal Korman <yigal@plexistor.com>
      Signed-off-by: NBoaz Harrosh <boaz@plexistor.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e3b210c
    • D
      fs, jfs: remove slab object constructor · ee146245
      David Rientjes 提交于
      Mempools based on slab caches with object constructors are risky because
      element allocation can happen either from the slab cache itself, meaning
      the constructor is properly called before returning, or from the mempool
      reserve pool, meaning the constructor is not called before returning,
      depending on the allocation context.
      
      For this reason, we should disallow creating mempools based on slab caches
      that have object constructors.  Callers of mempool_alloc() will be
      responsible for properly initializing the returned element.
      
      Then, it doesn't matter if the element came from the slab cache or the
      mempool reserved pool.
      
      The only occurrence of a mempool being based on a slab cache with an
      object constructor in the tree is in fs/jfs/jfs_metapage.c.  Remove it and
      properly initialize the element in alloc_metapage().
      
      At the same time, META_free is never used, so remove it as well.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NDave Kleikamp <dave.kleikamp@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee146245
    • M
      hugetlbfs: accept subpool min_size mount option and setup accordingly · 7ca02d0a
      Mike Kravetz 提交于
      Make 'min_size=<value>' be an option when mounting a hugetlbfs.  This
      option takes the same value as the 'size' option.  min_size can be
      specified without specifying size.  If both are specified, min_size must
      be less that or equal to size else the mount will fail.  If min_size is
      specified, then at mount time an attempt is made to reserve min_size
      pages.  If the reservation fails, the mount fails.  At umount time, the
      reserved pages are released.
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ca02d0a
  3. 15 4月, 2015 5 次提交
    • D
      mm, mempool: do not allow atomic resizing · 11d83360
      David Rientjes 提交于
      Allocating a large number of elements in atomic context could quickly
      deplete memory reserves, so just disallow atomic resizing entirely.
      
      Nothing currently uses mempool_resize() with anything other than
      GFP_KERNEL, so convert existing callers to drop the gfp_mask.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: Steffen Maier <maier@linux.vnet.ibm.com>	[zfcp]
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Steve French <sfrench@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11d83360
    • K
      mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE · 204db6ed
      Kees Cook 提交于
      The arch_randomize_brk() function is used on several architectures,
      even those that don't support ET_DYN ASLR. To avoid bulky extern/#define
      tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for
      the architectures that support it, while still handling CONFIG_COMPAT_BRK.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      204db6ed
    • K
      mm: split ET_DYN ASLR from mmap ASLR · d1fd836d
      Kees Cook 提交于
      This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips,
      powerpc, and x86.  The problem is that if there is a leak of ASLR from
      the executable (ET_DYN), it means a leak of shared library offset as
      well (mmap), and vice versa.  Further details and a PoC of this attack
      is available here:
      
        http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html
      
      With this patch, a PIE linked executable (ET_DYN) has its own ASLR
      region:
      
        $ ./show_mmaps_pie
        54859ccd6000-54859ccd7000 r-xp  ...  /tmp/show_mmaps_pie
        54859ced6000-54859ced7000 r--p  ...  /tmp/show_mmaps_pie
        54859ced7000-54859ced8000 rw-p  ...  /tmp/show_mmaps_pie
        7f75be764000-7f75be91f000 r-xp  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75be91f000-7f75beb1f000 ---p  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75beb1f000-7f75beb23000 r--p  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75beb23000-7f75beb25000 rw-p  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75beb25000-7f75beb2a000 rw-p  ...
        7f75beb2a000-7f75beb4d000 r-xp  ...  /lib64/ld-linux-x86-64.so.2
        7f75bed45000-7f75bed46000 rw-p  ...
        7f75bed46000-7f75bed47000 r-xp  ...
        7f75bed47000-7f75bed4c000 rw-p  ...
        7f75bed4c000-7f75bed4d000 r--p  ...  /lib64/ld-linux-x86-64.so.2
        7f75bed4d000-7f75bed4e000 rw-p  ...  /lib64/ld-linux-x86-64.so.2
        7f75bed4e000-7f75bed4f000 rw-p  ...
        7fffb3741000-7fffb3762000 rw-p  ...  [stack]
        7fffb377b000-7fffb377d000 r--p  ...  [vvar]
        7fffb377d000-7fffb377f000 r-xp  ...  [vdso]
      
      The change is to add a call the newly created arch_mmap_rnd() into the
      ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR,
      as was already done on s390.  Removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE,
      which is no longer needed.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1fd836d
    • M
      fs/binfmt_elf.c: fix bug in loading of PIE binaries · a87938b2
      Michael Davidson 提交于
      With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down
      address allocation strategy, load_elf_binary() will attempt to map a PIE
      binary into an address range immediately below mm->mmap_base.
      
      Unfortunately, load_elf_ binary() does not take account of the need to
      allocate sufficient space for the entire binary which means that, while
      the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent
      PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are
      that is supposed to be the "gap" between the stack and the binary.
      
      Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this
      means that binaries with large data segments > 128MB can end up mapping
      part of their data segment over their stack resulting in corruption of the
      stack (and the data segment once the binary starts to run).
      
      Any PIE binary with a data segment > 128MB is vulnerable to this although
      address randomization means that the actual gap between the stack and the
      end of the binary is normally greater than 128MB.  The larger the data
      segment of the binary the higher the probability of failure.
      
      Fix this by calculating the total size of the binary in the same way as
      load_elf_interp().
      Signed-off-by: NMichael Davidson <md@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a87938b2
    • V
      cleancache: remove limit on the number of cleancache enabled filesystems · 3cb29d11
      Vladimir Davydov 提交于
      The limit equals 32 and is imposed by the number of entries in the
      fs_poolid_map and shared_fs_poolid_map.  Nowadays it is insufficient,
      because with containers on board a Linux host can have hundreds of
      active fs mounts.
      
      These maps were introduced by commit 49a9ab81 ("mm: cleancache:
      lazy initialization to allow tmem backends to build/run as modules") in
      order to allow compiling cleancache drivers as modules.  Real pool ids
      are stored in these maps while super_block->cleancache_poolid points to
      an entry in the map, so that on cleancache registration we can walk over
      all (if there are <= 32 of them, of course) cleancache-enabled super
      blocks and assign real pool ids.
      
      Actually, there is absolutely no need in these maps, because we can
      iterate over all super blocks immediately using iterate_supers.  This is
      not racy, because cleancache_init_ops is called from mount_fs with
      super_block->s_umount held for writing, while iterate_supers takes this
      semaphore for reading, so if we call iterate_supers after setting
      cleancache_ops, all super blocks that had been created before
      cleancache_register_ops was called will be assigned pool ids by the
      action function of iterate_supers while all newer super blocks will
      receive it in cleancache_init_fs.
      
      This patch therefore removes the maps and hence the artificial limit on
      the number of cleancache enabled filesystems.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Stefan Hengelein <ilendir@googlemail.com>
      Cc: Florian Schmaus <fschmaus@gmail.com>
      Cc: Andor Daam <andor.daam@googlemail.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3cb29d11