1. 12 6月, 2012 3 次提交
    • A
      PCI: export pci_user functions for use by other drivers · c63587d7
      Alex Williamson 提交于
      VFIO PCI support will make use of these for user-initiated
      PCI config accesses.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      c63587d7
    • A
      PCI: add ACS validation utility · ad805758
      Alex Williamson 提交于
      In a PCI environment, transactions aren't always required to reach
      the root bus before being re-routed.  Intermediate switches between
      an endpoint and the root bus can redirect DMA back downstream before
      things like IOMMUs have a chance to intervene.  Legacy PCI is always
      susceptible to this as it operates on a shared bus.  PCIe added a
      new capability to describe and control this behavior, Access Control
      Services, or ACS.
      
      The utility function pci_acs_enabled() allows us to test the ACS
      capabilities of an individual devices against a set of flags while
      pci_acs_path_enabled() tests a complete path from a given downstream
      device up to the specified upstream device.  We also include the
      ability to add device specific tests as it's likely we'll see
      devices that do not implement ACS, but want to indicate support
      for various capabilities in this space.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      ad805758
    • A
      PCI: add PCI DMA source ID quirk · 12ea6cad
      Alex Williamson 提交于
      DMA transactions are tagged with the source ID of the device making
      the request.  Occasionally hardware screws this up and uses the
      source ID of a different device (often the wrong function number of
      a multifunction device).  A specific Ricoh multifunction device is
      a prime example of this problem and included in this patch.
      
      Given a pci_dev, this function returns the pci_dev to use as the
      source ID for DMA.  When hardware works correctly, this returns
      the input device.  For the components of the Ricoh multifunction
      device, it returns the pci_dev for function 0.
      
      This will be used by IOMMU drivers for determining the boundaries
      of IOMMU groups as multiple devices using the same source ID must
      be contained within the same group.  This can also be used by
      existing streaming DMA paths for the same purpose.
      
      [bhelgaas: fold in pci_dev_get() for !CONFIG_PCI]
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      12ea6cad
  2. 08 6月, 2012 3 次提交
  3. 06 6月, 2012 3 次提交
  4. 04 6月, 2012 3 次提交
  5. 03 6月, 2012 1 次提交
    • L
      tty: Revert the tty locking series, it needs more work · f309532b
      Linus Torvalds 提交于
      This reverts the tty layer change to use per-tty locking, because it's
      not correct yet, and fixing it will require some more deep surgery.
      
      The main revert is d29f3ef3 ("tty_lock: Localise the lock"), but
      there are several smaller commits that built upon it, they also get
      reverted here. The list of reverted commits is:
      
        fde86d31 - tty: add lockdep annotations
        8f6576ad - tty: fix ldisc lock inversion trace
        d3ca8b64 - pty: Fix lock inversion
        b1d679af - tty: drop the pty lock during hangup
        abcefe5f - tty/amiserial: Add missing argument for tty_unlock()
        fd11b42e - cris: fix missing tty arg in wait_event_interruptible_tty call
        d29f3ef3 - tty_lock: Localise the lock
      
      The revert had a trivial conflict in the 68360serial.c staging driver
      that got removed in the meantime.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f309532b
  6. 02 6月, 2012 9 次提交
    • A
      new helper: signal_delivered() · efee984c
      Al Viro 提交于
      Does block_sigmask() + tracehook_signal_handler();  called when
      sigframe has been successfully built.  All architectures converted
      to it; block_sigmask() itself is gone now (merged into this one).
      
      I'm still not too happy with the signature, but that's a separate
      story (IMO we need a structure that would contain signal number +
      siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
      signal_delivered(), handle_signal() and probably setup...frame() -
      take one).
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      efee984c
    • A
      most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set · 77097ae5
      Al Viro 提交于
      Only 3 out of 63 do not.  Renamed the current variant to __set_current_blocked(),
      added set_current_blocked() that will exclude unblockable signals, switched
      open-coded instances to it.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      77097ae5
    • A
    • A
      new helper: sigmask_to_save() · b7f9a11a
      Al Viro 提交于
      replace boilerplate "should we use ->saved_sigmask or ->blocked?"
      with calls of obvious inlined helper...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b7f9a11a
    • A
      new helper: restore_saved_sigmask() · 51a7b448
      Al Viro 提交于
      first fruits of ..._restore_sigmask() helpers: now we can take
      boilerplate "signal didn't have a handler, clear RESTORE_SIGMASK
      and restore the blocked mask from ->saved_mask" into a common
      helper.  Open-coded instances switched...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      51a7b448
    • A
      new helpers: {clear,test,test_and_clear}_restore_sigmask() · 4ebefe3e
      Al Viro 提交于
      helpers parallel to set_restore_sigmask(), used in the next commits
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4ebefe3e
    • A
      HAVE_RESTORE_SIGMASK is defined on all architectures now · 754421c8
      Al Viro 提交于
      Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
      and picks default set_restore_sigmask() in linux/thread_info.h.  Kill the
      ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
      get merged.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      754421c8
    • M
      vfs: retry last component if opening stale dentry · 16b1c1cd
      Miklos Szeredi 提交于
      NFS optimizes away d_revalidates for last component of open.  This means that
      open itself can find the dentry stale.
      
      This patch allows the filesystem to return EOPENSTALE and the VFS will retry the
      lookup on just the last component if possible.
      
      If the lookup was done using RCU mode, including the last component, then this
      is not possible since the parent dentry is lost.  In this case fall back to
      non-RCU lookup.  Currently this is not used since NFS will always leave RCU
      mode.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      16b1c1cd
    • J
      fs: introduce inode operation ->update_time · c3b2da31
      Josef Bacik 提交于
      Btrfs has to make sure we have space to allocate new blocks in order to modify
      the inode, so updating time can fail.  We've gotten around this by having our
      own file_update_time but this is kind of a pain, and Christoph has indicated he
      would like to make xfs do something different with atime updates.  So introduce
      ->update_time, where we will deal with i_version an a/m/c time updates and
      indicate which changes need to be made.  The normal version just does what it
      has always done, updates the time and marks the inode dirty, and then
      filesystems can choose to do something different.
      
      I've gone through all of the users of file_update_time and made them check for
      errors with the exception of the fault code since it's complicated and I wasn't
      quite sure what to do there, also Jan is going to be pushing the file time
      updates into page_mkwrite for those who have it so that should satisfy btrfs and
      make it not a big deal to check the file_update_time() return code in the
      generic fault path. Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      c3b2da31
  7. 01 6月, 2012 18 次提交
    • A
      switch aio and shm to do_mmap_pgoff(), make do_mmap() static · e3fc629d
      Al Viro 提交于
      after all, 0 bytes and 0 pages is the same thing...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e3fc629d
    • A
      take security_mmap_file() outside of ->mmap_sem · 8b3ec681
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8b3ec681
    • C
      c/r: prctl: add ability to set new mm_struct::exe_file · b32dfe37
      Cyrill Gorcunov 提交于
      When we do restore we would like to have a way to setup a former
      mm_struct::exe_file so that /proc/pid/exe would point to the original
      executable file a process had at checkpoint time.
      
      For this the PR_SET_MM_EXE_FILE code is introduced.  This option takes a
      file descriptor which will be set as a source for new /proc/$pid/exe
      symlink.
      
      Note it allows to change /proc/$pid/exe if there are no VM_EXECUTABLE
      vmas present for current process, simply because this feature is a special
      to C/R and mm::num_exe_file_vmas become meaningless after that.
      
      To minimize the amount of transition the /proc/pid/exe symlink might have,
      this feature is implemented in one-shot manner.  Thus once changed the
      symlink can't be changed again.  This should help sysadmins to monitor the
      symlinks over all process running in a system.
      
      In particular one could make a snapshot of processes and ring alarm if
      there unexpected changes of /proc/pid/exe's in a system.
      
      Note -- this feature is available iif CONFIG_CHECKPOINT_RESTORE is set and
      the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
      request to change symlink will be rejected.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b32dfe37
    • C
      c/r: prctl: extend PR_SET_MM to set up more mm_struct entries · fe8c7f5c
      Cyrill Gorcunov 提交于
      During checkpoint we dump whole process memory to a file and the dump
      includes process stack memory.  But among stack data itself, the stack
      carries additional parameters such as command line arguments, environment
      data and auxiliary vector.
      
      So when we do restore procedure and once we've restored stack data itself
      we need to setup mm_struct::arg_start/end, env_start/end, so restored
      process would be able to find command line arguments and environment data
      it had at checkpoint time.  The same applies to auxiliary vector.
      
      For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
      ENV_END | AUXV) codes are introduced.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe8c7f5c
    • C
      syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov 提交于
      While doing the checkpoint-restore in the user space one need to determine
      whether various kernel objects (like mm_struct-s of file_struct-s) are
      shared between tasks and restore this state.
      
      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there's currently no ways for solving the 1st one.
      
      One of the ways for checking whether two tasks share e.g.  mm_struct is to
      provide some mm_struct ID of a task to its proc file, but showing such
      info considered to be not that good for security reasons.
      
      Thus after some debates we end up in conclusion that using that named
      'comparison' syscall might be the best candidate.  So here is it --
      __NR_kcmp.
      
      It takes up to 5 arguments - the pids of the two tasks (which
      characteristics should be compared), the comparison type and (in case of
      comparison of files) two file descriptors.
      
      Lookups for pids are done in the caller's PID namespace only.
      
      At moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d97b46a6
    • C
      aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() · ac34ebb3
      Christopher Yeoh 提交于
      A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
      changes made to support CMA in an earlier patch.
      
      Rather than having an additional check_access parameter to these
      functions, the first paramater type is overloaded to allow the caller to
      specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
      are valid, but do not check the memory that they point to.  This is used
      by process_vm_readv/writev where we need to validate that a iovec passed
      to the syscall is valid but do not want to check the memory that it points
      to at this point because it refers to an address space in another process.
      Signed-off-by: NChris Yeoh <yeohc@au1.ibm.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac34ebb3
    • S
      eventfd: change int to __u64 in eventfd_signal() · ee62c6b2
      Sha Zhengju 提交于
      eventfd_ctx->count is an __u64 counter which is allowed to reach
      ULLONG_MAX.  eventfd_write() adds a __u64 value to "count", but the kernel
      side eventfd_signal() only adds an int value to it.  Make them consistent.
      
      [akpm@linux-foundation.org: update interface documentation]
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee62c6b2
    • A
      rapidio: add DMA engine support for RIO data transfers · e42d98eb
      Alexandre Bounine 提交于
      Adds DMA Engine framework support into RapidIO subsystem.
      
      Uses DMA Engine DMA_SLAVE interface to generate data transfers to/from
      remote RapidIO target devices.
      
      Introduces RapidIO-specific wrapper for prep_slave_sg() interface with an
      extra parameter to pass target specific information.
      
      Uses scatterlist to describe local data buffer.  Address flat data buffer
      on a remote side.
      Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Acked-by: NVinod Koul <vinod.koul@linux.intel.com>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e42d98eb
    • K
      mqueue: separate mqueue default value from maximum value · cef0184c
      KOSAKI Motohiro 提交于
      Commit b231cca4 ("message queues: increase range limits") changed
      mqueue default value when attr parameter is specified NULL from hard
      coded value to fs.mqueue.{msg,msgsize}_max sysctl value.
      
      This made large side effect.  When user need to use two mqueue
      applications 1) using !NULL attr parameter and it require big message
      size and 2) using NULL attr parameter and only need small size message,
      app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
      memory size even though it doesn't need.
      
      Doug Ledford propsed to switch back it to static hard coded value.
      However it also has a compatibility problem.  Some applications might
      started depend on the default value is tunable.
      
      The solution is to separate default value from maximum value.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Acked-by: NDoug Ledford <dledford@redhat.com>
      Acked-by: NJoe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cef0184c
    • K
      mqueue: revert bump up DFLT_*MAX · e6315bb1
      KOSAKI Motohiro 提交于
      Mqueue limitation is slightly naieve parameter likes other ipcs because
      unprivileged user can consume kernel memory by using ipcs.
      
      Thus, too aggressive raise bring us security issue.  Example, current
      setting allow evil unprivileged user use 256GB (= 256 * 1024 * 1024*1024)
      and it's enough large to system will belome unresponsive.  Don't do that.
      
      Instead, every admin should adjust the knobs for their own systems.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDoug Ledford <dledford@redhat.com>
      Acked-by: NJoe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6315bb1
    • D
      ipc/mqueue: update maximums for the mqueue subsystem · 5b5c4d1a
      Doug Ledford 提交于
      Commit b231cca4 ("message queues: increase range limits") changed the
      maximum size of a message in a message queue from INT_MAX to 8192*128.
      Unfortunately, we had customers that relied on a size much larger than
      8192*128 on their production systems.  After reviewing POSIX, we found
      that it is silent on the maximum message size.  We did find a couple other
      areas in which it was not silent.  Fix up the mqueue maximums so that the
      customer's system can continue to work, and document both the POSIX and
      real world requirements in ipc_namespace.h so that we don't have this
      issue crop back up.
      
      Also, commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher not lower
      on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
      intentionally in place to limit the msg queue depth to one that was small
      enough to kmalloc an array of pointers (hence why we divided 128k by
      sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
      but to change our allocation to a vmalloc instead (at least for the large
      queue size case).  With that, it's possible to increase our allowed
      maximum to the POSIX requirements (or more if we choose).
      
      [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b5c4d1a
    • D
      ipc/mqueue: switch back to using non-max values on create · 858ee378
      Doug Ledford 提交于
      Commit b231cca4 ("message queues: increase range limits") changed
      how we create a queue that does not include an attr struct passed to
      open so that it creates the queue with whatever the maximum values are.
      However, if the admin has set the maximums to allow flexibility in
      creating a queue (aka, both a large size and large queue are allowed,
      but combined they create a queue too large for the RLIMIT_MSGQUEUE of
      the user), then attempts to create a queue without an attr struct will
      fail.  Switch back to using acceptable defaults regardless of what the
      maximums are.
      
      Note: so far, we only know of a few applications that rely on this
      behavior (specifically, set the maximums in /proc, then run the
      application which calls mq_open() without passing in an attr struct, and
      the application expects the newly created message queue to have the
      maximum sizes that were set in /proc used on the mq_open() call, and all
      of those applications that we know of are actually part of regression
      test suites that were coded to do something like this:
      
      for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
      	echo $size > /proc/sys/fs/mqueue/msgsize_max
      	mq_open || echo "Error opening mq with size $size"
      done
      
      These test suites that depend on any behavior like this are broken.  The
      concept that programs should rely upon the system wide maximum in order
      to get their desired results instead of simply using a attr struct to
      specify what they want is fundamentally unfriendly programming practice
      for any multi-tasking OS.
      
      Fixing this will break those few apps that we know of (and those app
      authors recognize the brokenness of their code and the need to fix it).
      However, the following patch "mqueue: separate mqueue default value"
      allows a workaround in the form of new knobs for the default msg queue
      creation parameters for any software out there that we don't already
      know about that might rely on this behavior at the moment.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      858ee378
    • D
      ipc/mqueue: cleanup definition names and locations · 93e6f119
      Doug Ledford 提交于
      Since commit b231cca4 ("message queues: increase range limits") on
      Oct 18, 2008, calls to mq_open() that did not pass in an attribute
      struct and expected to get default values for the size of the queue and
      the max message size now get the system wide maximums instead of
      hardwired defaults like they used to get.
      
      This was uncovered when one of the earlier patches in this patch set
      increased the default system wide maximums at the same time it increased
      the hard ceiling on the system wide maximums (a customer specifically
      needed the hard ceiling brought back up, the new ceiling that commit
      b231cca4 introduced was too low for their production systems).  By
      increasing the default maximums and not realising they were tied to any
      attempt to create a message queue without an attribute struct, I had
      inadvertently made it such that all message queue creation attempts
      without an attribute struct were failing because the new default
      maximums would create a queue that exceeded the default rlimit for
      message queue bytes.
      
      As a result, the system wide defaults were brought back down to their
      previous levels, and the system wide ceilings on the maximums were
      raised to meet the customer's needs.  However, the fact that the no
      attribute struct behavior of mq_open() could be broken by changing the
      system wide maximums for message queues was seen as fundamentally broken
      itself.  So we hardwired the no attribute case back like it used to be.
      But, then we realized that on the very off chance that some piece of
      software in the wild depended on that behavior, we could work around
      that issue by adding two new knobs to /proc that allowed setting the
      defaults for message queues created without an attr struct separately
      from the system wide maximums.
      
      What is not an option IMO is to leave the current behavior in place.  No
      piece of software should ever rely on setting the system wide maximums
      in order to get a desired message queue.  Such a reliance would be so
      fundamentally multitasking OS unfriendly as to not really be tolerable.
      Fortunately, we don't know of any software in the wild that uses this
      except for a regression test program that caught the issue in the first
      place.  If there is though, we have made accommodations with the two new
      /proc knobs (and that's all the accommodations such fundamentally broken
      software can be allowed)..
      
      This patch:
      
      The various defines for minimums and maximums of the sysctl controllable
      mqueue values are scattered amongst different files and named
      inconsistently.  Move them all into ipc_namespace.h and make them have
      consistent names.  Additionally, make the number of queues per namespace
      also have a minimum and maximum and use the same sysctl function as the
      other two settable variables.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93e6f119
    • M
      kexec: export kexec.h to user space · 29a5c67e
      maximilian attems 提交于
      Add userspace definitions, guard all relevant kernel structures.  While at
      it document stuff and remove now useless userspace hint.
      
      It is easy to add the relevant system call to respective libc's, but it
      seems pointless to have to duplicate the data structures.
      
      This is based on the kexec-tools headers, with the exception of just using
      int on return (succes or failure) and using size_t instead of 'unsigned
      long int' for the number of segments argument of kexec_load().
      Signed-off-by: Nmaximilian attems <max@stro.at>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Haren Myneni <hbabu@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29a5c67e
    • A
      cpu: introduce clear_tasks_mm_cpumask() helper · cb79295e
      Anton Vorontsov 提交于
      Many architectures clear tasks' mm_cpumask like this:
      
      	read_lock(&tasklist_lock);
      	for_each_process(p) {
      		if (p->mm)
      			cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
      	}
      	read_unlock(&tasklist_lock);
      
      Depending on the context, the code above may have several problems,
      such as:
      
      1. Working with task->mm w/o getting mm or grabing the task lock is
         dangerous as ->mm might disappear (exit_mm() assigns NULL under
         task_lock(), so tasklist lock is not enough).
      
      2. Checking for process->mm is not enough because process' main
         thread may exit or detach its mm via use_mm(), but other threads
         may still have a valid mm.
      
      This patch implements a small helper function that does things
      correctly, i.e.:
      
      1. We take the task's lock while whe handle its mm (we can't use
         get_task_mm()/mmput() pair as mmput() might sleep);
      
      2. To catch exited main thread case, we use find_lock_task_mm(),
         which walks up all threads and returns an appropriate task
         (with task lock held).
      
      Also, Per Peter Zijlstra's idea, now we don't grab tasklist_lock in
      the new helper, instead we take the rcu read lock. We can do this
      because the function is called after the cpu is taken down and marked
      offline, so no new tasks will get this cpu set in their mm mask.
      Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb79295e
    • O
      cred: remove task_is_dead() from __task_cred() validation · 43e13cc1
      Oleg Nesterov 提交于
      Commit 8f92054e ("CRED: Fix __task_cred()'s lockdep check and banner
      comment"):
      
          add the following validation condition:
      
              task->exit_state >= 0
      
          to permit the access if the target task is dead and therefore
          unable to change its own credentials.
      
      OK, but afaics currently this can only help wait_task_zombie() which calls
      __task_cred() without rcu lock.
      
      Remove this validation and change wait_task_zombie() to use task_uid()
      instead.  This means we do rcu_read_lock() only to shut up the lockdep,
      but we already do the same in, say, wait_task_stopped().
      
      task_is_dead() should die, task->exit_state != 0 means that this task has
      passed exit_notify(), only do_wait-like code paths should use this.
      
      Unfortunately, we can't kill task_is_dead() right now, it has already
      acquired buggy users in drivers/staging.  The fix already exists.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43e13cc1
    • B
      kmod: move call_usermodehelper_fns() to .c file and unexport all it's helpers · 785042f2
      Boaz Harrosh 提交于
      If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it
      we can avoid exporting all it's helper functions:
      	call_usermodehelper_setup
      	call_usermodehelper_setfns
      	call_usermodehelper_exec
      And make all of them static to kmod.c
      
      Since the optimizer will see all these as a single call site it will
      inline them inside call_usermodehelper_fns().  So we loose the call to
      _fns but gain 3 calls to the helpers.  (Not that it matters)
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      785042f2
    • B
      kmod: unexport call_usermodehelper_freeinfo() · ae3cef73
      Boaz Harrosh 提交于
      call_usermodehelper_freeinfo() is not used outside of kmod.c.  So unexport
      it, and make it static to kmod.c
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae3cef73