1. 14 Dec, 2014 40 commits
    • L
      Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block · 9ea18f8c
      Linus Torvalds committed
      Pull block layer driver updates from Jens Axboe:
      
       - NVMe updates:
              - The blk-mq conversion from Matias (and others)
      
              - A stack of NVMe bug fixes from the nvme tree, mostly from Keith.
      
              - Various bug fixes from me, fixing issues in both the blk-mq
                conversion and generic bugs.
      
              - Abort and CPU online fix from Sam.
      
              - Hot add/remove fix from Indraneel.
      
       - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)
      
       - With the generic IO stat accounting from 3.19/core, converting md,
         bcache, and rsxx to use those.  From Gu Zheng.
      
       - Boundary check for queue/irq mode for null_blk from Matias.  Fixes
         cases where invalid values could be given, causing the device to hang.
      
       - The xen blkfront pull request, with two bug fixes from Vitaly.
      
      * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
        NVMe: fix race condition in nvme_submit_sync_cmd()
        NVMe: fix retry/error logic in nvme_queue_rq()
        NVMe: Fix FS mount issue (hot-remove followed by hot-add)
        NVMe: fix error return checking from blk_mq_alloc_request()
        NVMe: fix freeing of wrong request in abort path
        xen/blkfront: remove redundant flush_op
        xen/blkfront: improve protection against issuing unsupported REQ_FUA
        NVMe: Fix command setup on IO retry
        null_blk: boundary check queue_mode and irqmode
        block/rsxx: use generic io stats accounting functions to simplify io stat accounting
        md: use generic io stats accounting functions to simplify io stat accounting
        drbd: use generic io stats accounting functions to simplify io stat accounting
        md/bcache: use generic io stats accounting functions to simplify io stat accounting
        NVMe: Update module version major number
        NVMe: fail pci initialization if the device doesn't have any BARs
        NVMe: add ->exit_hctx() hook
        NVMe: make setup work for devices that don't do INTx
        NVMe: enable IO stats by default
        NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
        NVMe: replace blk_put_request() with blk_mq_free_request()
        ...
    • L
      Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block · caf292ae
      Linus Torvalds committed
      Pull block driver core update from Jens Axboe:
       "This is the pull request for the core block IO changes for 3.19.  Not
        a huge round this time, mostly lots of little good fixes:
      
         - Fix a bug in sysfs blktrace interface causing a NULL pointer
           dereference, when enabled/disabled through that API.  From Arianna
           Avanzini.
      
         - Various updates/fixes/improvements for blk-mq:
      
              - A set of updates from Bart, mostly fixing bugs in the tag
                handling.
      
              - Cleanup/code consolidation from Christoph.
      
              - Extend queue_rq API to be able to handle batching issues of IO
                requests. NVMe will utilize this shortly. From me.
      
              - A few tag and request handling updates from me.
      
              - Cleanup of the preempt handling for running queues from Paolo.
      
              - Prevent running of unmapped hardware queues from Ming Lei.
      
              - Move the kdump memory limiting check to be in the correct
                location, from Shaohua.
      
              - Initialize all software queues at init time from Takashi. This
                prevents a kobject warning when CPUs are brought online that
                weren't online when a queue was registered.
      
         - Single writeback fix for I_DIRTY clearing from Tejun.  Queued with
           the core IO changes, since it's just a single fix.
      
         - Version X of the __bio_add_page() segment addition retry from
           Maurizio.  Hope the Xth time is the charm.
      
         - Documentation fixup for IO scheduler merging from Jan.
      
         - Introduce (and use) generic IO stat accounting helpers for non-rq
           drivers, from Gu Zheng.
      
         - Kill off artificial limiting of max sectors in a request from
           Christoph"
      
      * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
        bio: modify __bio_add_page() to accept pages that don't start a new segment
        blk-mq: Fix uninitialized kobject at CPU hotplugging
        blktrace: don't let the sysfs interface remove trace from running list
        blk-mq: Use all available hardware queues
        blk-mq: Micro-optimize bt_get()
        blk-mq: Fix a race between bt_clear_tag() and bt_get()
        blk-mq: Avoid that __bt_get_word() wraps multiple times
        blk-mq: Fix a use-after-free
        blk-mq: prevent unmapped hw queue from being scheduled
        blk-mq: re-check for available tags after running the hardware queue
        blk-mq: fix hang in bt_get()
        blk-mq: move the kdump check to blk_mq_alloc_tag_set
        blk-mq: cleanup tag free handling
        blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
        blk: introduce generic io stat accounting help function
        blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
        genhd: check for int overflow in disk_expand_part_tbl()
        blk-mq: add blk_mq_free_hctx_request()
        blk-mq: export blk_mq_free_request()
        blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
        ...
    • L
      Merge tag 'trace-seq-buf-3.19-v2' of... · 8f4385d5
      Linus Torvalds committed
      Merge tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
      
      Pull tracing fixlet from Steven Rostedt:
       "Remove unnecessary preempt_disable in printk()"
      
      * tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        printk: Do not disable preemption for accessing printk_func
    • L
      Merge tag 'trace-fixes-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 52bb4525
      Linus Torvalds committed
      Pull tracing fixes from Steven Rostedt:
       "Here's two fixes:
      
        1) Discovered by Fengguang Wu's tests.  I changed the parameters to
           the function graph x86 prepare_ftrace_return call but forgot to
           update the call from entry_32 (i386 version).  This patch corrects
           that.
      
        2) I was tracing some code and found that the sched_switch tracepoint
           was showing tasks in the INTERRUPTIBLE state as RUNNING.  This was
           due to the updates to convert preempt_count into a per_cpu
           variable.  The tracepoint logic was made to use the task's
           saved_preempt_count which could hold a stale "PREEMPT_ACTIVE",
           instead of using the current preempt_count() call"
      
      * tag 'trace-fixes-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing/sched: Check preempt_count() for current when reading task->state
        ftrace/x86: Update i386 call to prepare_ftrace_return()
    • L
      Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit · a99abce2
      Linus Torvalds committed
      Pull audit updates from Paul Moore:
       "Two small patches from the audit next branch; only one of which has
        any real significant code changes, the other is simply a MAINTAINERS
        update for audit.
      
        The single code patch is pretty small and rather straightforward, it
        changes the audit "version" number reported to userspace from an
        integer to a bitmap which is used to indicate the functionality of the
        running kernel.  This really doesn't have much impact on the kernel,
        but it will make life easier for the audit userspace folks.
      
        Thankfully we were still on a version number which allowed us to do
        this without breaking userspace"
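
A minimal sketch of how userspace might test such a feature bitmap. The flag names `FEATURE_BACKLOG_LIMIT` and `FEATURE_BACKLOG_WAIT_TIME` and the helper `has_feature()` are hypothetical and only illustrate the idea; they are not the kernel's actual definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits; the real kernel constants may differ. */
#define FEATURE_BACKLOG_LIMIT     (1u << 0)
#define FEATURE_BACKLOG_WAIT_TIME (1u << 1)

/* Test one capability bit instead of comparing version numbers. */
bool has_feature(uint32_t bitmap, uint32_t feature)
{
    return (bitmap & feature) != 0;
}
```

With a bitmap, each capability can be probed independently, which an ordered version integer cannot express.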
      
      * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
        audit: convert status version to a feature bitmap
        audit: add Paul Moore to the MAINTAINERS entry
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · e3aa91a7
      Linus Torvalds committed
      Pull crypto update from Herbert Xu:
       - The crypto API is now documented :)
       - Disallow arbitrary module loading through crypto API.
       - Allow get request with empty driver name through crypto_user.
       - Allow speed testing of arbitrary hash functions.
       - Add caam support for ctr(aes), gcm(aes) and their derivatives.
       - nx now supports concurrent hashing properly.
       - Add sahara support for SHA1/256.
       - Add ARM64 version of CRC32.
       - Misc fixes.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits)
        crypto: tcrypt - Allow speed testing of arbitrary hash functions
        crypto: af_alg - add user space interface for AEAD
        crypto: qat - fix problem with coalescing enable logic
        crypto: sahara - add support for SHA1/256
        crypto: sahara - replace tasklets with kthread
        crypto: sahara - add support for i.MX53
        crypto: sahara - fix spinlock initialization
        crypto: arm - replace memset by memzero_explicit
        crypto: powerpc - replace memset by memzero_explicit
        crypto: sha - replace memset by memzero_explicit
        crypto: sparc - replace memset by memzero_explicit
        crypto: algif_skcipher - initialize upon init request
        crypto: algif_skcipher - removed unneeded code
        crypto: algif_skcipher - Fixed blocking recvmsg
        crypto: drbg - use memzero_explicit() for clearing sensitive data
        crypto: drbg - use MODULE_ALIAS_CRYPTO
        crypto: include crypto- module prefix in template
        crypto: user - add MODULE_ALIAS
        crypto: sha-mb - remove a bogus NULL check
        crytpo: qat - Fix 64 bytes requests
        ...
    • L
      Merge branch 'akpm' (second patch-bomb from Andrew) · 78a45c6f
      Linus Torvalds committed
      Merge second patchbomb from Andrew Morton:
       - the rest of MM
       - misc fs fixes
       - add execveat() syscall
       - new ratelimit feature for fault-injection
       - decompressor updates
       - ipc/ updates
       - fallocate feature creep
       - fsnotify cleanups
       - a few other misc things
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (99 commits)
        cgroups: Documentation: fix trivial typos and wrong paragraph numberings
        parisc: percpu: update comments referring to __get_cpu_var
        percpu: update local_ops.txt to reflect this_cpu operations
        percpu: remove __get_cpu_var and __raw_get_cpu_var macros
        fsnotify: remove destroy_list from fsnotify_mark
        fsnotify: unify inode and mount marks handling
        fallocate: create FAN_MODIFY and IN_MODIFY events
        mm/cma: make kmemleak ignore CMA regions
        slub: fix cpuset check in get_any_partial
        slab: fix cpuset check in fallback_alloc
        shmdt: use i_size_read() instead of ->i_size
        ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
        ipc/msg: increase MSGMNI, remove scaling
        ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
        ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
        lib/decompress.c: consistency of compress formats for kernel image
        decompress_bunzip2: off by one in get_next_block()
        usr/Kconfig: make initrd compression algorithm selection not expert
        fault-inject: add ratelimit option
        ratelimit: add initialization macro
        ...
    • S
      cgroups: Documentation: fix trivial typos and wrong paragraph numberings · 29d293b6
      SeongJae Park committed
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      parisc: percpu: update comments referring to __get_cpu_var · 6ddb798f
      Christoph Lameter committed
      __get_cpu_var was removed. Update comments to refer to
      this_cpu_ptr() instead.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      percpu: update local_ops.txt to reflect this_cpu operations · 7d94a82e
      Christoph Lameter committed
      Update the documentation to reflect changes due to the availability of
      this_cpu operations.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      percpu: remove __get_cpu_var and __raw_get_cpu_var macros · 6c51ec4d
      Christoph Lameter committed
      No user is left in the kernel source tree.  Therefore we can drop the
      definitions.
      
      This is the final merge of the transition away from __get_cpu_var.  After
      this patch the kernel will not build if anyone uses __get_cpu_var.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      fsnotify: remove destroy_list from fsnotify_mark · 37d469e7
      Jan Kara committed
      destroy_list is used to track marks which must still wait for the end of
      the SRCU grace period before they can be freed.  However, by the time a
      mark is added to destroy_list it is no longer in the group's list of marks,
      and thus we can reuse fsnotify_mark->g_list for queueing into destroy_list.
      This saves two pointers for each fsnotify_mark.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      fsnotify: unify inode and mount marks handling · 0809ab69
      Jan Kara committed
      There's a lot of common code in inode and mount marks handling.  Factor it
      out to a common helper function.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      fallocate: create FAN_MODIFY and IN_MODIFY events · 820c12d5
      Heinrich Schuchardt committed
      The fanotify and the inotify API can be used to monitor changes of the
      file system.  System call fallocate() modifies files.  Hence it should
      trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
      events.  The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
      this flag allows arbitrary file content to be created from random data.
      
      This patch adds the missing call to fsnotify_modify().
      
      The FAN_MODIFY and IN_MODIFY events will be created when fallocate()
      succeeds.  They will even be created if the file length remains unchanged,
      e.g.  when calling fallocate() with the FALLOC_FL_KEEP_SIZE flag.
      
      This logic was primarily chosen to keep the coding simple.
      
      It resembles the logic of the write() system call.
      
      When we call write() we always create a FAN_MODIFY event, even in the case
      of overwriting with identical data.
      
      Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
      actually changed.
      
      Furthermore, even if the file size remains unchanged, fallocate() may
      influence whether a subsequent write() will succeed and hence the
      fallocate() call may be considered a modification.
      
      The fallocate(2) man page teaches: After a successful call, subsequent
      writes into the range specified by offset and len are guaranteed not to
      fail because of lack of disk space.
      
      So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
      different outcomes of a subsequent write depending on the values of offset
      and len.
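
The KEEP_SIZE behaviour described above can be sketched as follows. The helper name `prealloc_keep_size()` is hypothetical; the sketch assumes Linux with glibc, and it reports -1 when the underlying filesystem does not support fallocate().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Preallocate len bytes at offset without changing the file size.
 * Returns the file size after the call, or -1 on any error
 * (including filesystems that do not support fallocate()).
 */
off_t prealloc_keep_size(int fd, off_t offset, off_t len)
{
    struct stat st;

    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) != 0)
        return -1;
    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size;
}
```

On an empty file the reported size stays 0 after a successful call, even though blocks were reserved. That is the case in which a modify event is still generated.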
      Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      mm/cma: make kmemleak ignore CMA regions · 620951e2
      Thierry Reding committed
      kmemleak will add allocations as objects to a pool.  The memory allocated
      for each object in this pool is periodically searched for pointers to
      other allocated objects.  This only works for memory that is mapped into
      the kernel's virtual address space, which happens not to be the case for
      most CMA regions.
      
      Furthermore, CMA regions are typically used to store data transferred to
      or from a device and therefore don't contain pointers to other objects.
      
      Without this, the kernel crashes on the first execution of the
      scan_gray_list() because it tries to access highmem.  Perhaps a more
      appropriate fix would be to reject any object that can't map to a kernel
      virtual address?
      
      [akpm@linux-foundation.org: add comment]
      [akpm@linux-foundation.org: fix comment, per Catalin]
      [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      slub: fix cpuset check in get_any_partial · dee2f8aa
      Vladimir Davydov committed
      If we fail to allocate from the current node's stock, we look for free
      objects on other nodes before calling the page allocator (see
      get_any_partial).  While checking other nodes we respect cpuset
      constraints by calling cpuset_zone_allowed.  We enforce hardwall check.
      As a result, we will fall back to the page allocator even if there are some
      pages cached on other nodes, but the current cpuset doesn't have them set.
      However, the page allocator uses softwall check for kernel allocations,
      so it may allocate from one of the other nodes in this case.
      
      Therefore we should use softwall cpuset check in get_any_partial to
      conform with the cpuset check in the page allocator.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      slab: fix cpuset check in fallback_alloc · 061d7074
      Vladimir Davydov committed
      fallback_alloc is called on kmalloc if the preferred node doesn't have
      free or partial slabs and there's no pages on the node's free list
      (GFP_THISNODE allocations fail).  Before invoking the reclaimer it tries
      to locate a free or partial slab on other allowed nodes' lists.  While
      iterating over the preferred node's zonelist it skips those zones which
      hardwall cpuset check returns false for.  That means that for a task bound
      to a specific node using cpusets fallback_alloc will always ignore free
      slabs on other nodes and go directly to the reclaimer, which, however, may
      allocate from other nodes if cpuset.mem_hardwall is unset (default).  As a
      result, the lists of free slabs on other nodes may grow without bound,
      which is bad, because inactive slabs are only evicted by cache_reap at a
      very slow rate and cannot be dropped forcefully.
      
      To reproduce the issue, run a process that will walk over a directory tree
      with lots of files inside a cpuset bound to a node that constantly
      experiences memory pressure.  Look at num_slabs vs active_slabs growth as
      reported by /proc/slabinfo.
      
      To avoid this we should use softwall cpuset check in fallback_alloc.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      shmdt: use i_size_read() instead of ->i_size · 07a46ed2
      Dave Hansen committed
      Andrew Morton noted
      
      	http://lkml.kernel.org/r/20141104142027.a7a0d010772d84560b445f59@linux-foundation.org
      
      that the shmdt uses inode->i_size outside of i_mutex being held.
      There is one more case in shm.c in shm_destroy().  This converts
      both users over to use i_size_read().
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments · d3c97900
      Dave Hansen committed
      This is a highly-contrived scenario.  But, a single shmdt() call can be
      induced into unmapping memory from multiple shm segments.  Example code
      is here:
      
      	http://www.sr71.net/~dave/intel/shmfun.c
      
      The fix is pretty simple: Record the 'struct file' for the first VMA we
      encounter and then stick to it.  Decline to unmap anything not from the
      same file and thus the same segment.
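
The "record the first file and stick to it" rule can be sketched in plain C. `struct vma_sketch` and `detach_one_segment()` are hypothetical stand-ins for illustration, not the kernel's data structures or the patch itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: only the backing file identity. */
struct vma_sketch {
    const void *file;   /* plays the role of the VMA's 'struct file' */
    int unmapped;       /* set to 1 once this mapping is detached */
};

/*
 * Detach only the mappings backed by the same file as the first one.
 * Returns the number of mappings detached.
 */
size_t detach_one_segment(struct vma_sketch *vmas, size_t n)
{
    const void *first;
    size_t count = 0;

    if (n == 0)
        return 0;
    first = vmas[0].file;             /* record the first file seen */
    for (size_t i = 0; i < n; i++) {
        if (vmas[i].file != first)
            continue;                 /* different segment: leave it mapped */
        vmas[i].unmapped = 1;
        count++;
    }
    return count;
}
```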
      
      I found this by inspection and the odds of anyone hitting this in practice
      are pretty darn small.
      
      Lightly tested, but it's a pretty small patch.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipc/msg: increase MSGMNI, remove scaling · 0050ee05
      Manfred Spraul committed
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense, see the discussion with regards to SHMMAX.
      
      Therefore: increase MSGMNI to the maximum supported.
      
      And: If we ignore the risk of locking too much memory, then an automatic
      scaling of MSGMNI doesn't make sense.  Therefore the logic can be removed.
      
      The code preserves auto_msgmni to avoid breaking any user space applications
      that expect that the value exists.
      
      Notes:
      1) If an administrator must limit the memory allocations, then he can set
      MSGMNI as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      
      2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
      to control latency vs. throughput:
      If MSGMNB is large, then msgsnd() just returns and more messages can be queued
      before a task switch to a task that calls msgrcv() is forced.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM · e843e7d2
      Manfred Spraul committed
      a)
      
      SysV can be abused to allocate locked kernel memory.  For most systems, a
      small limit doesn't make sense, see the discussion with regards to SHMMAX.
      
      Therefore: Increase the sysv sem limits so that all known applications
      will work with these defaults.
      
      b)
      
      With regards to the maximum supported:
      Some of the specified hard limits are not correct anymore, therefore the
      patch updates the documentation.
      
      - SEMMNI must stay below IPCMNI, which is 32768.
        As for SHMMAX: Stay a bit below this limit.
      
      - SEMMSL was limited to 8k, to ensure that the kmalloc for the kernel array
        was limited to 16 kB (order=2)
      
        This doesn't apply anymore:
         - the allocation size isn't sizeof(short)*nsems anymore.
         - ipc_alloc falls back to vmalloc
      
      - SEMOPM should stay below 1000, to limit the kmalloc in semtimedop() to an
        order=1 allocation.
        Therefore: Leave it at 500 (order=0 allocation).
      
      Note:
      If an administrator must limit the memory allocations, then he can set the
      values as necessary.
      
      Or he can disable sysv entirely (as e.g. done by Android).
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ipc/sem.c: change memory barrier in sem_lock() to smp_rmb() · 2e094abf
      Manfred Spraul committed
      When I fixed bugs in the sem_lock() logic, I was more conservative than
      necessary.  Therefore it is safe to replace the smp_mb() with smp_rmb().
      And: With smp_rmb(), semop() syscalls are up to 10% faster.
      
      The race we must protect against is:
      
      	sem->lock is free
      	sma->complex_count = 0
      	sma->sem_perm.lock held by thread B
      
      thread A:
      
      A: spin_lock(&sem->lock)
      
      			B: sma->complex_count++; (now 1)
      			B: spin_unlock(&sma->sem_perm.lock);
      
      A: spin_is_locked(&sma->sem_perm.lock);
      A: XXXXX memory barrier
      A: if (sma->complex_count == 0)
      
      Thread A must read the increased complex_count value, i.e. the read must
      not be reordered with the read of sem_perm.lock done by spin_is_locked().
      
      Since it's about ordering of reads, smp_rmb() is sufficient.
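
A userspace analogue of the read ordering described above, sketched with C11 atomics. It assumes an acquire fence as a stand-in for smp_rmb(); `struct sem_array_sketch` and `fast_path_allowed()` are hypothetical names, and this is not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of the semaphore array. */
struct sem_array_sketch {
    atomic_bool global_locked;   /* plays the role of sma->sem_perm.lock */
    atomic_int complex_count;    /* plays the role of sma->complex_count */
};

/* Returns true if the per-semaphore fast path may be used. */
bool fast_path_allowed(struct sem_array_sketch *sma)
{
    /* First read: is the global lock currently held? */
    if (atomic_load_explicit(&sma->global_locked, memory_order_relaxed))
        return false;

    /*
     * Read barrier: keep the complex_count read below from being
     * reordered before the lock-state read above.
     */
    atomic_thread_fence(memory_order_acquire);

    /* Second read: must observe any increment made under the lock. */
    return atomic_load_explicit(&sma->complex_count,
                                memory_order_relaxed) == 0;
}
```

Only the order of the two loads matters here, which is why a read barrier is enough in place of a full barrier.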
      
      [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      lib/decompress.c: consistency of compress formats for kernel image · a060bfe0
      Haesung Kim committed
      The magic numbers of the compression formats for the kernel image are two
      bytes each.  They are written in hexadecimal, except for the gunzip magic
      numbers, which are written in octal.  The notation should be consistent for
      readability.  Therefore, the magic numbers for gunzip are now also defined
      in hexadecimal.
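
A small sketch showing that the octal and hexadecimal spellings of the gzip magic bytes denote the same values. `is_gzip_magic()` is a hypothetical helper, not a function from lib/decompress.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* gzip magic bytes: octal 037 0213 is the same pair as hex 0x1f 0x8b. */
bool is_gzip_magic(const unsigned char *buf, size_t len)
{
    return len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b;
}
```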
      Signed-off-by: Haesung Kim <matia.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      decompress_bunzip2: off by one in get_next_block() · b5c8afe5
      Dan Carpenter committed
      "origPtr" is used as an offset into the bd->dbuf[] array.  That array is
      allocated in start_bunzip() and has "bd->dbufSize" number of elements so
      the test here should be >= instead of >.
      
      Later we check "origPtr" again before using it as an offset so I don't
      know if this bug can be triggered in real life.
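
The boundary condition can be illustrated with a generic check. `index_out_of_range()` is a hypothetical helper, not the bunzip2 code.

```c
#include <stdbool.h>
#include <stddef.h>

/* An index into an array of 'size' elements is valid only if idx < size. */
bool index_out_of_range(size_t idx, size_t size)
{
    return idx >= size;   /* 'idx > size' would wrongly accept idx == size */
}
```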
      
      Fixes: bc22c17e ('bzip2/lzma: library support for gzip, bzip2 and lzma decompression')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Alain Knaff <alain@knaff.lu>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      usr/Kconfig: make initrd compression algorithm selection not expert · ec72c666
      Andi Kleen committed
      The kernel has support for (nearly) every compression algorithm known to
      man, each to handle some particular microscopic niche.
      
      Unfortunately all of these always get compiled in if you want to support
      INITRDs, and can be only disabled when CONFIG_EXPERT is set.
      
      I don't see why I need to set EXPERT just to properly configure the initrd
      compression algorithms, and not always include every possible algorithm.
      
      Usually the initrd is just compressed with gzip anyways, at least that's
      true on all distributions I use.
      
      Remove the dependencies for initrd compression on CONFIG_EXPERT.
      
      Make the various options just default y, which should be good enough to
      not break any previous configuration.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      fault-inject: add ratelimit option · 6adc4a22
      Dmitry Monakhov committed
      The current debug levels are not optimal.  In particular, if one wants to
      provoke a large number of faults (e.g. a broken-device simulator), then any
      verbose level will produce a huge number of identical log messages.  Add a
      ratelimit parameter for that purpose.
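
A minimal sketch of interval/burst rate limiting for log messages. `struct ratelimit_sketch` and `ratelimit_allow()` are hypothetical, and the caller supplies the clock value; this is not the kernel's ratelimit implementation.

```c
#include <stdbool.h>

/* Hypothetical limiter: at most 'burst' messages per 'interval' ticks. */
struct ratelimit_sketch {
    long interval;       /* window length in caller-defined time units */
    int burst;           /* messages allowed per window */
    long window_start;   /* start time of the current window */
    int printed;         /* messages already allowed in this window */
};

/* Returns true if a message may be logged at time 'now'. */
bool ratelimit_allow(struct ratelimit_sketch *rl, long now)
{
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;   /* start a new window */
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;             /* suppress the repeated message */
    rl->printed++;
    return true;
}
```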
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Acked-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
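The fault-inject commit above adds a ratelimit option so that a flood of injected faults does not produce a flood of identical messages. The sketch below is a userspace model of interval/burst rate limiting, written only to illustrate the idea. The names `rl_state`, `rl_init` and `rl_allow` are hypothetical and are not the kernel's API; time is passed in explicitly so the behaviour is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of interval/burst rate limiting.
 * Illustration only: these names are hypothetical, not kernel symbols.
 */
struct rl_state {
    uint64_t interval;  /* window length, in caller-defined time units */
    unsigned burst;     /* messages allowed per window */
    uint64_t begin;     /* start of the current window */
    unsigned printed;   /* messages emitted in the current window */
    unsigned missed;    /* messages suppressed in the current window */
};

static void rl_init(struct rl_state *rl, uint64_t interval, unsigned burst)
{
    rl->interval = interval;
    rl->burst = burst;
    rl->begin = 0;
    rl->printed = 0;
    rl->missed = 0;
}

/* Returns true if a message may be emitted at time `now` (now >= begin). */
static bool rl_allow(struct rl_state *rl, uint64_t now)
{
    /* Start a fresh window once the interval has elapsed. */
    if (now - rl->begin >= rl->interval) {
        rl->begin = now;
        rl->printed = 0;
        rl->missed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}
```

With an interval of 100 units and a burst of 2, the first two calls inside a window are allowed, further calls are counted as missed, and the budget is restored when the next window starts.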
    • D
      ratelimit: add initialization macro · 89e3f909
      Dmitry Monakhov committed
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • F
      fs/affs/file.c: remove obsolete pagesize check · 92cab82b
      Fabian Frederick committed
      The Linux kernel does not support page sizes below 4 KiB.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • F
      fs/affs/file.c: add support to O_DIRECT · 9abb4083
      Fabian Frederick committed
      Based on ext2_direct_IO.
      
      Tested by opening a file with O_DIRECT and by running sysbench/mariadb,
      which showed a 1% improvement in written queries (update_non_index test)
      on a volume created with mkaffs.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • F
      fs/affs/amigaffs.c: use va_format instead of buffer/vnsprintf · 1ee54b09
      Fabian Frederick committed
      -Remove ErrorBuffer and use %pV
      
      -Add __printf to enable argument mismatch warnings
      
      Original patch by Joe Perches.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • F
      fs/affs/file.c: forward declaration clean-up · 7633978b
      Fabian Frederick committed
      -Move file_operations to avoid forward declarations.
      
      -Remove unused declarations.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      gcov: enable GCOV_PROFILE_ALL from ARCH Kconfigs · 957e3fac
      Riku Voipio committed
      Following the suggestions from Andrew Morton and Stephen Rothwell,
      don't expand the ARCH list in kernel/gcov/Kconfig.  Instead,
      define an ARCH_HAS_GCOV_PROFILE_ALL bool that architectures
      can enable.
      
      Set ARCH_HAS_GCOV_PROFILE_ALL on the architectures where it was
      previously allowed, plus ARM64, which I tested.
      Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
      Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      kexec: remove unnecessary KERN_ERR from kexec.c · d5393955
      Masanari Iida committed
      Remove unnecessary KERN_ERR from pr_err() within kexec.c.
      Signed-off-by: Masanari Iida <standby24x7@gmail.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      sparc: hook up execveat system call · 38351a32
      David Drysdale committed
      Signed-off-by: David Drysdale <drysdale@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      syscalls: add selftest for execveat(2) · c9b26b81
      David Drysdale committed
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      x86: hook up execveat system call · 27d6ec7a
      David Drysdale committed
      Hook up x86-64, i386 and x32 ABIs.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale committed
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
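The execveat() commit above describes two things that a short sketch can make concrete: executing the file a descriptor refers to by passing an empty filename with AT_EMPTY_PATH, and the "/dev/fd/<fd>" style name under which the file is then reported. The helper names `execveat_display_name` and `exec_by_fd` below are hypothetical. The raw `syscall()` form is used on the assumption that the C library may not provide an execveat() wrapper, and the fallback value for `AT_EMPTY_PATH` assumes Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000  /* Linux value; assumed if headers hide it */
#endif

/*
 * Build the name the commit message describes: "/dev/fd/<fd>" for an empty
 * filename, "/dev/fd/<fd>/<filename>" otherwise.
 * Hypothetical helper for illustration; returns snprintf()'s result.
 */
static int execveat_display_name(char *buf, size_t len, int fd,
                                 const char *filename)
{
    if (filename[0] == '\0')
        return snprintf(buf, len, "/dev/fd/%d", fd);
    return snprintf(buf, len, "/dev/fd/%d/%s", fd, filename);
}

/*
 * Execute the file that `fd` refers to, as fexecve(3) would, without
 * going through /proc. Returns only on failure: -1 with errno set.
 */
static int exec_by_fd(int fd, char *const argv[], char *const envp[])
{
#ifdef SYS_execveat
    return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
#else
    (void)fd; (void)argv; (void)envp;
    errno = ENOSYS;  /* headers too old to know the syscall number */
    return -1;
#endif
}
```

A caller would typically open the executable with `open(path, O_RDONLY)`, fork, and call `exec_by_fd()` in the child. As the commit message notes, a script run this way needs its "/dev/fd/<fd>" name to stay reachable after exec, so the descriptor must not be O_CLOEXEC in that case.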
    • N
      fat: fix data past EOF resulting from fsx testsuite · c0ef0cc9
      Namjae Jeon committed
      Running fsx in direct I/O mode resulted in data-past-EOF errors.
      
        fsx ./file2 -Z -r 4096 -w 4096
        ...
        ..
        truncating to largest ever: 0x907c
        fallocating to largest ever: 0x11137
        truncating to largest ever: 0x2c6fe
        truncating to largest ever: 0x2cfdf
        fallocating to largest ever: 0x40000
        Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e
        ...
        ..
      
      The reason is that a truncate down does not zero the rest of the last
      block when the offset is not block-aligned.  The truncate path calls
      truncate_setsize()->truncate_inode_pages()->
      truncate_inode_pages_range(), which does handle the partial zero-out, but
      it retrieves the page using find_lock_page(), which only looks the page
      up in the cache.  So the zeroing does not happen in the direct I/O case.
      
      Add a truncate-page helper for the FAT filesystem, based on
      block_truncate_page, and invoke it to zero out the partial block when the
      offset is not aligned with the block size.
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
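The FAT fix above turns on one piece of arithmetic: when a file is truncated to an offset that is not block-aligned, the bytes from that offset to the end of its block must be zeroed. The sketch below shows only that arithmetic on an in-memory buffer. It is not the driver's code, the helper names are hypothetical, and the block sizes used in the examples are assumptions, since the commit message does not state one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Number of bytes between `offset` and the end of the block containing it,
 * or 0 if `offset` is already block-aligned. `blocksize` must be a power
 * of two. Hypothetical helper for illustration.
 */
static uint64_t tail_bytes_to_zero(uint64_t offset, uint32_t blocksize)
{
    uint64_t in_block = offset & ((uint64_t)blocksize - 1);
    return in_block ? blocksize - in_block : 0;
}

/*
 * Zero the part of an in-memory copy of the last block that lies at or
 * beyond the new end of file. `block` must hold `blocksize` bytes.
 */
static void zero_block_tail(unsigned char *block, uint64_t offset,
                            uint32_t blocksize)
{
    uint64_t tail = tail_bytes_to_zero(offset, blocksize);
    if (tail)
        memset(block + (blocksize - tail), 0, (size_t)tail);
}
```

For the EOF value 0x18628 from the fsx output and an assumed 4096-byte block, the offset within the block is 0x628, so 0x9d8 bytes would need zeroing. An aligned size such as 0x40000 needs none.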
    • J
      befs: remove dead code · f441ada0
      Jan Kara committed
      Coverity id: 1042674
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm/zbud: init user ops only when it is needed · 1dd61aa3
      Heesub Shin committed
      When zbud is initialized through the zpool wrapper, pool->ops, which
      points to user-defined operations, is always set regardless of whether it
      is specified by the upper layer.  This causes zbud_reclaim_page() to
      iterate its eviction loop without any gain.
      
      This patch sets the user-defined ops only when needed, so that
      zbud_reclaim_page() can bail out of the reclamation loop early if no
      user-defined operations are specified.
      Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sunae Seo <sunae.seo@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
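The zbud commit above can be pictured with a small userspace model: the pool records an ops pointer only when the upper layer supplied one, so the reclaim routine can return immediately instead of scanning pages it could never evict. All names below (`pool`, `pool_ops`, `pool_init`, `pool_reclaim`) are hypothetical stand-ins and the logic is a simplified sketch, not the zbud source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified userspace model; every name here is hypothetical. */
struct pool_ops {
    int (*evict)(void *pool, unsigned long handle);
};

struct pool {
    const struct pool_ops *ops;  /* NULL when the upper layer gave no ops */
    unsigned int lru_pages;      /* pages that could be scanned for eviction */
};

/* Stand-in eviction callback that always fails, so every scan is wasted. */
static int wrapper_evict(void *pool, unsigned long handle)
{
    (void)pool;
    (void)handle;
    return -ENOENT;
}

static const struct pool_ops wrapper_ops = { .evict = wrapper_evict };

/* Install the wrapper ops only if the upper layer supplied its own ops. */
static void pool_init(struct pool *p, const void *upper_ops,
                      unsigned int lru_pages)
{
    p->ops = upper_ops ? &wrapper_ops : NULL;
    p->lru_pages = lru_pages;
}

/* Returns -EINVAL at once when eviction cannot work; otherwise scans. */
static int pool_reclaim(struct pool *p, unsigned int retries,
                        unsigned int *scanned)
{
    *scanned = 0;
    if (!p->ops || !p->ops->evict || p->lru_pages == 0 || retries == 0)
        return -EINVAL;
    for (unsigned int i = 0; i < retries && i < p->lru_pages; i++) {
        (*scanned)++;
        if (p->ops->evict(p, i) == 0)
            return 0;
    }
    return -EAGAIN;
}
```

In this model a pool created without upper-layer ops scans nothing and fails fast, while a pool with ops walks its pages and, with the always-failing stand-in callback, reports that reclaim did not succeed.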