1. 02 Jun, 2012 9 commits
    • new helper: signal_delivered() · efee984c
      Al Viro committed
      Does block_sigmask() + tracehook_signal_handler();  called when
      sigframe has been successfully built.  All architectures converted
      to it; block_sigmask() itself is gone now (merged into this one).
      
      I'm still not too happy with the signature, but that's a separate
      story (IMO we need a structure that would contain signal number +
      siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
      signal_delivered(), handle_signal() and probably setup...frame() -
      take one).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set · 77097ae5
      Al Viro committed
      Only 3 out of 63 do not.  Renamed the current variant to __set_current_blocked(),
      added set_current_blocked() that will exclude unblockable signals, switched
      open-coded instances to it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
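The "exclude unblockable signals" step can be illustrated from userspace. This is only a minimal sketch of the idea, not the kernel code: `strip_unblockable()` and `install_blocked_mask()` are hypothetical names, and POSIX `sigprocmask()` stands in for the kernel-internal mask update.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Remove the two signals that can never be blocked from a mask. */
void strip_unblockable(sigset_t *mask)
{
    sigdelset(mask, SIGKILL);
    sigdelset(mask, SIGSTOP);
}

/* Hypothetical wrapper: sanitize a copy of the requested mask first,
 * then install it, so callers never have to open-code the sigdelset()
 * pair themselves. */
int install_blocked_mask(const sigset_t *requested)
{
    sigset_t mask = *requested;

    strip_unblockable(&mask);
    return sigprocmask(SIG_SETMASK, &mask, NULL);
}
```

Keeping the sanitizing step inside the common entry point is the same design choice the commit describes: the rare callers that need the raw behaviour use a separately named variant.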
    • A
    • new helper: sigmask_to_save() · b7f9a11a
      Al Viro committed
      replace boilerplate "should we use ->saved_sigmask or ->blocked?"
      with calls of obvious inlined helper...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new helper: restore_saved_sigmask() · 51a7b448
      Al Viro committed
      first fruits of ..._restore_sigmask() helpers: now we can take
      boilerplate "signal didn't have a handler, clear RESTORE_SIGMASK
      and restore the blocked mask from ->saved_mask" into a common
      helper.  Open-coded instances switched...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
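The logic behind sigmask_to_save() and restore_saved_sigmask() can be modelled on a toy structure. This is a hedged sketch, not the kernel implementation: `struct toy_task`, its plain integer masks and the boolean flag are assumptions made for illustration, with field names borrowed from the commit text.

```c
#include <stdbool.h>

/* Toy stand-in for per-task signal state; not the kernel's struct. */
struct toy_task {
    unsigned long blocked;        /* currently blocked signals */
    unsigned long saved_sigmask;  /* mask to be restored later */
    bool restore_sigmask;         /* stands in for the RESTORE_SIGMASK flag */
};

/* "Should we use ->saved_sigmask or ->blocked?" answered in one place. */
unsigned long *sigmask_to_save(struct toy_task *t)
{
    return t->restore_sigmask ? &t->saved_sigmask : &t->blocked;
}

/* The signal had no handler: clear the flag and restore the saved mask. */
void restore_saved_sigmask(struct toy_task *t)
{
    if (t->restore_sigmask) {
        t->restore_sigmask = false;
        t->blocked = t->saved_sigmask;
    }
}
```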
    • new helpers: {clear,test,test_and_clear}_restore_sigmask() · 4ebefe3e
      Al Viro committed
      helpers parallel to set_restore_sigmask(), used in the next commits
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • HAVE_RESTORE_SIGMASK is defined on all architectures now · 754421c8
      Al Viro committed
      Everyone either defines it in arch thread_info.h or has TIF_RESTORE_SIGMASK
      and picks default set_restore_sigmask() in linux/thread_info.h.  Kill the
      ifdefs, slap #error in linux/thread_info.h to catch breakage when new ones
      get merged.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: retry last component if opening stale dentry · 16b1c1cd
      Miklos Szeredi committed
      NFS optimizes away d_revalidates for last component of open.  This means that
      open itself can find the dentry stale.
      
      This patch allows the filesystem to return EOPENSTALE and the VFS will retry the
      lookup on just the last component if possible.
      
      If the lookup was done using RCU mode, including the last component, then this
      is not possible since the parent dentry is lost.  In this case fall back to
      non-RCU lookup.  Currently this is not used since NFS will always leave RCU
      mode.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: introduce inode operation ->update_time · c3b2da31
      Josef Bacik committed
      Btrfs has to make sure we have space to allocate new blocks in order to modify
      the inode, so updating time can fail.  We've gotten around this by having our
      own file_update_time but this is kind of a pain, and Christoph has indicated he
      would like to make xfs do something different with atime updates.  So introduce
      ->update_time, where we will deal with i_version and a/m/c time updates and
      indicate which changes need to be made.  The normal version just does what it
      has always done, updates the time and marks the inode dirty, and then
      filesystems can choose to do something different.
      
      I've gone through all of the users of file_update_time and made them check for
      errors with the exception of the fault code since it's complicated and I wasn't
      quite sure what to do there, also Jan is going to be pushing the file time
      updates into page_mkwrite for those who have it so that should satisfy btrfs and
      make it not a big deal to check the file_update_time() return code in the
      generic fault path. Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  2. 01 Jun, 2012 26 commits
    • switch aio and shm to do_mmap_pgoff(), make do_mmap() static · e3fc629d
      Al Viro committed
      after all, 0 bytes and 0 pages is the same thing...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • take security_mmap_file() outside of ->mmap_sem · 8b3ec681
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • c/r: prctl: add ability to set new mm_struct::exe_file · b32dfe37
      Cyrill Gorcunov committed
      When we do restore we would like to have a way to setup a former
      mm_struct::exe_file so that /proc/pid/exe would point to the original
      executable file a process had at checkpoint time.
      
      For this the PR_SET_MM_EXE_FILE code is introduced.  This option takes a
      file descriptor which will be set as a source for new /proc/$pid/exe
      symlink.
      
      Note that changing /proc/$pid/exe is allowed if there are no VM_EXECUTABLE
      vmas present for the current process, simply because this feature is
      special to C/R and mm::num_exe_file_vmas becomes meaningless after that.

      To minimize the number of transitions the /proc/pid/exe symlink might go
      through, this feature is implemented in a one-shot manner.  Thus once
      changed the symlink can't be changed again.  This should help sysadmins
      monitor the symlinks of all processes running in a system.

      In particular one could make a snapshot of processes and raise an alarm if
      there are unexpected changes of /proc/pid/exe's in a system.

      Note -- this feature is available iff CONFIG_CHECKPOINT_RESTORE is set and
      the caller must have CAP_SYS_RESOURCE capability granted, otherwise the
      request to change symlink will be rejected.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
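From userspace, the request described here might look roughly as follows. This is a sketch under assumptions: `set_exe_link()` is a hypothetical helper name, and the call is expected to fail unless the kernel has CONFIG_CHECKPOINT_RESTORE and the caller holds the required capability.

```c
#include <fcntl.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to make /proc/self/exe point at
 * `path`. Returns 0 on success, -1 on failure (errno is set by open()
 * or prctl()). */
int set_exe_link(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;

    /* The option takes a file descriptor as its argument. */
    rc = prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0UL, 0UL);
    close(fd);
    return rc;
}
```

Because the change is one-shot, a restorer would normally call this once, late in the restore sequence, and treat any error as fatal for that task.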
    • c/r: prctl: extend PR_SET_MM to set up more mm_struct entries · fe8c7f5c
      Cyrill Gorcunov committed
      During checkpoint we dump whole process memory to a file and the dump
      includes process stack memory.  But among stack data itself, the stack
      carries additional parameters such as command line arguments, environment
      data and auxiliary vector.
      
      So when we do the restore procedure, once we've restored the stack data
      itself we need to set up mm_struct::arg_start/end and env_start/end, so
      that the restored process is able to find the command line arguments and
      environment data it had at checkpoint time.  The same applies to the
      auxiliary vector.
      
      For this reason additional PR_SET_MM_(ARG_START | ARG_END | ENV_START |
      ENV_END | AUXV) codes are introduced.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov committed
      While doing checkpoint-restore in user space one needs to determine
      whether various kernel objects (like mm_struct-s or file_struct-s) are
      shared between tasks and restore this state.

      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there is currently no way of solving the 1st one.

      One of the ways of checking whether two tasks share e.g. an mm_struct is
      to provide some mm_struct ID of a task in its proc file, but showing such
      info is considered to be not that good for security reasons.

      Thus after some debate we came to the conclusion that a so-called
      'comparison' syscall might be the best candidate.  So here it is --
      __NR_kcmp.

      It takes up to 5 arguments - the pids of the two tasks (whose
      characteristics should be compared), the comparison type and (in case of
      comparison of files) two file descriptors.

      Lookups for pids are done in the caller's PID namespace only.

      At the moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
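A caller could use the five-argument form for the file comparison case roughly like this. It is a sketch that assumes a libc exposing `SYS_kcmp` and kernel headers providing `KCMP_FILE` in `<linux/kcmp.h>`; `same_open_file()` is a hypothetical wrapper name, and the call may fail with ENOSYS or EPERM on kernels or sandboxes where the syscall is unavailable.

```c
#define _GNU_SOURCE
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: compare fd1 in task pid1 with fd2 in task pid2.
 * Returns 0 when both descriptors refer to the same open file, a
 * positive value when they differ, and -1 on error (errno is set). */
long same_open_file(pid_t pid1, pid_t pid2, int fd1, int fd2)
{
    return syscall(SYS_kcmp, pid1, pid2, KCMP_FILE,
                   (unsigned long)fd1, (unsigned long)fd2);
}
```

A checkpoint tool would typically run such comparisons pairwise over the descriptors of related tasks and recreate the sharing it finds with dup() or the appropriate CLONE_ flags at restore time.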
    • aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() · ac34ebb3
      Christopher Yeoh committed
      A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
      changes made to support CMA in an earlier patch.
      
      Rather than having an additional check_access parameter to these
      functions, the first parameter type is overloaded to allow the caller to
      specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
      are valid, but do not check the memory that they point to.  This is used
      by process_vm_readv/writev where we need to validate that an iovec passed
      to the syscall is valid but do not want to check the memory that it points
      to at this point because it refers to an address space in another process.
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
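The idea of validating only the vector itself, never the memory behind it, can be sketched in userspace. This is an illustration, not the kernel routine: `check_iovec_only()` and the `SKETCH_MAX_SEGS` limit are assumptions, and the real checks performed by rw_copy_check_uvector() are not reproduced here.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical segment limit for this sketch only. */
#define SKETCH_MAX_SEGS 1024

/* Validate the shape of an iovec array: segment count and total length.
 * Returns the total byte count, or -1 if the vector is malformed.
 * iov_base is never dereferenced, so the buffers may belong to another
 * address space. */
long long check_iovec_only(const struct iovec *iov, size_t nr_segs)
{
    long long total = 0;
    size_t i;

    if (nr_segs > SKETCH_MAX_SEGS)
        return -1;

    for (i = 0; i < nr_segs; i++) {
        /* Reject a segment whose length would overflow the running total. */
        if (iov[i].iov_len > (size_t)(LLONG_MAX - total))
            return -1;
        total += (long long)iov[i].iov_len;
    }
    return total;
}
```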
    • eventfd: change int to __u64 in eventfd_signal() · ee62c6b2
      Sha Zhengju committed
      eventfd_ctx->count is an __u64 counter which is allowed to reach
      ULLONG_MAX.  eventfd_write() adds a __u64 value to "count", but the kernel
      side eventfd_signal() only adds an int value to it.  Make them consistent.
      
      [akpm@linux-foundation.org: update interface documentation]
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
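The 64-bit nature of the counter is visible from userspace, where a single write can add a value that would not fit in an int. A minimal sketch, with `eventfd_roundtrip()` as a hypothetical helper name:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Add `value` to a fresh eventfd counter and read the counter back.
 * Returns the value read, or 0 if any step fails. */
uint64_t eventfd_roundtrip(uint64_t value)
{
    uint64_t out = 0;
    int fd = eventfd(0, 0);

    if (fd < 0)
        return 0;

    /* Both directions transfer exactly 8 bytes. */
    if (write(fd, &value, sizeof value) != (ssize_t)sizeof value ||
        read(fd, &out, sizeof out) != (ssize_t)sizeof out)
        out = 0;

    close(fd);
    return out;
}
```

Calling it with a value such as `1ULL << 32` exercises an increment that a kernel-side path taking an int could not express.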
    • rapidio: add DMA engine support for RIO data transfers · e42d98eb
      Alexandre Bounine committed
      Adds DMA Engine framework support into RapidIO subsystem.
      
      Uses DMA Engine DMA_SLAVE interface to generate data transfers to/from
      remote RapidIO target devices.
      
      Introduces RapidIO-specific wrapper for prep_slave_sg() interface with an
      extra parameter to pass target specific information.
      
      Uses scatterlist to describe local data buffer.  Addresses a flat data
      buffer on the remote side.
      Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vinod Koul <vinod.koul@linux.intel.com>
      Cc: Li Yang <leoli@freescale.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mqueue: separate mqueue default value from maximum value · cef0184c
      KOSAKI Motohiro committed
      Commit b231cca4 ("message queues: increase range limits") changed the
      mqueue default value used when the attr parameter is NULL from a hard
      coded value to the fs.mqueue.{msg,msgsize}_max sysctl value.

      This had a large side effect.  When a user needs to run two mqueue
      applications, 1) one using a !NULL attr parameter and requiring a big
      message size and 2) one using a NULL attr parameter and needing only a
      small message size, app (1) requires raising fs.mqueue.msgsize_max and
      app (2) then consumes a large amount of memory even though it doesn't
      need it.

      Doug Ledford proposed switching it back to a static hard coded value.
      However that also has a compatibility problem.  Some applications might
      have started to depend on the default value being tunable.

      The solution is to separate the default value from the maximum value.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mqueue: revert bump up DFLT_*MAX · e6315bb1
      KOSAKI Motohiro committed
      Mqueue limits are a slightly naive parameter, like those of other ipcs,
      because an unprivileged user can consume kernel memory by using ipcs.

      Thus, raising them too aggressively brings us a security issue.  For
      example, the current setting allows an evil unprivileged user to use
      256GB (= 256 * 1024 * 1024 * 1024) and that is large enough to make the
      system become unresponsive.  Don't do that.

      Instead, every admin should adjust the knobs for their own systems.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/mqueue: update maximums for the mqueue subsystem · 5b5c4d1a
      Doug Ledford committed
      Commit b231cca4 ("message queues: increase range limits") changed the
      maximum size of a message in a message queue from INT_MAX to 8192*128.
      Unfortunately, we had customers that relied on a size much larger than
      8192*128 on their production systems.  After reviewing POSIX, we found
      that it is silent on the maximum message size.  We did find a couple other
      areas in which it was not silent.  Fix up the mqueue maximums so that the
      customer's system can continue to work, and document both the POSIX and
      real world requirements in ipc_namespace.h so that we don't have this
      issue crop back up.
      
      Also, commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher not lower
      on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
      intentionally in place to limit the msg queue depth to one that was small
      enough to kmalloc an array of pointers (hence why we divided 128k by
      sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
      but to change our allocation to a vmalloc instead (at least for the large
      queue size case).  With that, it's possible to increase our allowed
      maximum to the POSIX requirements (or more if we choose).
      
      [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
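The "128k divided by sizeof(long)" reasoning is simple arithmetic: the queue depth was capped at the number of pointer-sized slots that fit in one 128 KiB allocation. A worked sketch, where `max_queue_depth()` is a hypothetical name used only for illustration:

```c
#include <stddef.h>

/* Number of fixed-size slots that fit in a byte budget; 0 if the slot
 * size is 0. With a 128 KiB budget this gives 16384 slots for 8-byte
 * longs and 32768 slots for 4-byte longs. */
size_t max_queue_depth(size_t budget_bytes, size_t slot_size)
{
    return slot_size ? budget_bytes / slot_size : 0;
}
```

Raising the depth limit beyond this bound is what forces the move from a single contiguous allocation to vmalloc for large queues, as the message explains.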
    • ipc/mqueue: switch back to using non-max values on create · 858ee378
      Doug Ledford committed
      Commit b231cca4 ("message queues: increase range limits") changed
      how we create a queue that does not include an attr struct passed to
      open so that it creates the queue with whatever the maximum values are.
      However, if the admin has set the maximums to allow flexibility in
      creating a queue (aka, both a large size and large queue are allowed,
      but combined they create a queue too large for the RLIMIT_MSGQUEUE of
      the user), then attempts to create a queue without an attr struct will
      fail.  Switch back to using acceptable defaults regardless of what the
      maximums are.
      
      Note: so far, we only know of a few applications that rely on this
      behavior (specifically, set the maximums in /proc, then run the
      application which calls mq_open() without passing in an attr struct, and
      the application expects the newly created message queue to have the
      maximum sizes that were set in /proc used on the mq_open() call), and all
      of those applications that we know of are actually part of regression
      test suites that were coded to do something like this:
      
      for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
      	echo $size > /proc/sys/fs/mqueue/msgsize_max
      	mq_open || echo "Error opening mq with size $size"
      done
      
      These test suites that depend on any behavior like this are broken.  The
      concept that programs should rely upon the system wide maximum in order
      to get their desired results instead of simply using an attr struct to
      specify what they want is fundamentally unfriendly programming practice
      for any multi-tasking OS.
      
      Fixing this will break those few apps that we know of (and those app
      authors recognize the brokenness of their code and the need to fix it).
      However, the following patch "mqueue: separate mqueue default value"
      allows a workaround in the form of new knobs for the default msg queue
      creation parameters for any software out there that we don't already
      know about that might rely on this behavior at the moment.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/mqueue: cleanup definition names and locations · 93e6f119
      Doug Ledford committed
      Since commit b231cca4 ("message queues: increase range limits") on
      Oct 18, 2008, calls to mq_open() that did not pass in an attribute
      struct and expected to get default values for the size of the queue and
      the max message size now get the system wide maximums instead of
      hardwired defaults like they used to get.
      
      This was uncovered when one of the earlier patches in this patch set
      increased the default system wide maximums at the same time it increased
      the hard ceiling on the system wide maximums (a customer specifically
      needed the hard ceiling brought back up, the new ceiling that commit
      b231cca4 introduced was too low for their production systems).  By
      increasing the default maximums and not realising they were tied to any
      attempt to create a message queue without an attribute struct, I had
      inadvertently made it such that all message queue creation attempts
      without an attribute struct were failing because the new default
      maximums would create a queue that exceeded the default rlimit for
      message queue bytes.
      
      As a result, the system wide defaults were brought back down to their
      previous levels, and the system wide ceilings on the maximums were
      raised to meet the customer's needs.  However, the fact that the no
      attribute struct behavior of mq_open() could be broken by changing the
      system wide maximums for message queues was seen as fundamentally broken
      itself.  So we hardwired the no attribute case back like it used to be.
      But, then we realized that on the very off chance that some piece of
      software in the wild depended on that behavior, we could work around
      that issue by adding two new knobs to /proc that allowed setting the
      defaults for message queues created without an attr struct separately
      from the system wide maximums.
      
      What is not an option IMO is to leave the current behavior in place.  No
      piece of software should ever rely on setting the system wide maximums
      in order to get a desired message queue.  Such a reliance would be so
      fundamentally multitasking OS unfriendly as to not really be tolerable.
      Fortunately, we don't know of any software in the wild that uses this
      except for a regression test program that caught the issue in the first
      place.  If there is though, we have made accommodations with the two new
      /proc knobs (and that's all the accommodations such fundamentally broken
      software can be allowed).
      
      This patch:
      
      The various defines for minimums and maximums of the sysctl controllable
      mqueue values are scattered amongst different files and named
      inconsistently.  Move them all into ipc_namespace.h and make them have
      consistent names.  Additionally, make the number of queues per namespace
      also have a minimum and maximum and use the same sysctl function as the
      other two settable variables.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: export kexec.h to user space · 29a5c67e
      maximilian attems committed
      Add userspace definitions, guard all relevant kernel structures.  While at
      it document stuff and remove now useless userspace hint.
      
      It is easy to add the relevant system call to respective libc's, but it
      seems pointless to have to duplicate the data structures.
      
      This is based on the kexec-tools headers, with the exception of just using
      int on return (success or failure) and using size_t instead of 'unsigned
      long int' for the number of segments argument of kexec_load().
      Signed-off-by: maximilian attems <max@stro.at>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Haren Myneni <hbabu@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpu: introduce clear_tasks_mm_cpumask() helper · cb79295e
      Anton Vorontsov committed
      Many architectures clear tasks' mm_cpumask like this:
      
      	read_lock(&tasklist_lock);
      	for_each_process(p) {
      		if (p->mm)
      			cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
      	}
      	read_unlock(&tasklist_lock);
      
      Depending on the context, the code above may have several problems,
      such as:
      
      1. Working with task->mm w/o getting mm or grabbing the task lock is
         dangerous as ->mm might disappear (exit_mm() assigns NULL under
         task_lock(), so tasklist lock is not enough).
      
      2. Checking for process->mm is not enough because process' main
         thread may exit or detach its mm via use_mm(), but other threads
         may still have a valid mm.
      
      This patch implements a small helper function that does things
      correctly, i.e.:
      
      1. We take the task's lock while we handle its mm (we can't use
         get_task_mm()/mmput() pair as mmput() might sleep);
      
      2. To catch exited main thread case, we use find_lock_task_mm(),
         which walks up all threads and returns an appropriate task
         (with task lock held).
      
      Also, per Peter Zijlstra's idea, now we don't grab tasklist_lock in
      the new helper, instead we take the rcu read lock. We can do this
      because the function is called after the cpu is taken down and marked
      offline, so no new tasks will get this cpu set in their mm mask.
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cred: remove task_is_dead() from __task_cred() validation · 43e13cc1
      Oleg Nesterov committed
      Commit 8f92054e ("CRED: Fix __task_cred()'s lockdep check and banner
      comment"):
      
          add the following validation condition:
      
              task->exit_state >= 0
      
          to permit the access if the target task is dead and therefore
          unable to change its own credentials.
      
      OK, but afaics currently this can only help wait_task_zombie() which calls
      __task_cred() without rcu lock.
      
      Remove this validation and change wait_task_zombie() to use task_uid()
      instead.  This means we do rcu_read_lock() only to shut up the lockdep,
      but we already do the same in, say, wait_task_stopped().
      
      task_is_dead() should die, task->exit_state != 0 means that this task has
      passed exit_notify(), only do_wait-like code paths should use this.
      
      Unfortunately, we can't kill task_is_dead() right now, it has already
      acquired buggy users in drivers/staging.  The fix already exists.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmod: move call_usermodehelper_fns() to .c file and unexport all it's helpers · 785042f2
      Boaz Harrosh committed
      If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it
      we can avoid exporting all its helper functions:
      	call_usermodehelper_setup
      	call_usermodehelper_setfns
      	call_usermodehelper_exec
      And make all of them static to kmod.c
      
      Since the optimizer will see all these as a single call site it will
      inline them inside call_usermodehelper_fns().  So we lose the call to
      _fns but gain 3 calls to the helpers.  (Not that it matters)
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmod: unexport call_usermodehelper_freeinfo() · ae3cef73
      Boaz Harrosh committed
      call_usermodehelper_freeinfo() is not used outside of kmod.c.  So unexport
      it, and make it static to kmod.c
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fat: introduce special inode for managing the FSINFO block · 020ac5b6
      Artem Bityutskiy committed
      This patchset makes fatfs stop using the VFS '->write_super()' method
      for writing out the FSINFO block.
      
      The final goal is to get rid of the 'sync_supers()' kernel thread.  This
      kernel thread wakes up every 5 seconds (by default) and calls
      '->write_super()' for all mounted file-systems.  And the bad thing is that
      this is done even if all the superblocks are clean.  Moreover, some
      file-systems do not even need this and they do not register the
      '->write_super()' method at all (e.g., btrfs).
      
      So 'sync_supers()' most often just generates useless wake-ups and wastes
      power.  I am trying to make all file-systems independent of
      '->write_super()' and plan to remove 'sync_supers()' and '->write_super'
      completely once there are no more users.
      
      The '->write_super()' method is mostly used by baroque file-systems like
      hfs, udf, etc.  Modern file-systems like btrfs and xfs do not use it.
      This justifies removing this stuff from VFS completely and making every FS
      self-manage its own superblock.
      
      Tested with xfstests.
      
      This patch:
      
      Preparation for further changes.  It introduces a special inode
      ('fsinfo_inode') in FAT file-system which we'll later use for managing the
      FSINFO block.  Note, there is already one special inode ('fat_inode')
      which is used for managing the FAT tables.
      
      Introduce a new 'MSDOS_FSINFO_INO' constant for this special inode.  This
      is safe to do because the FAT file-system does not store inode numbers on
      the media but generates them at run-time.
      
      I've also cleaned up the comment for the existing 'MSDOS_ROOT_INO'
      constant while at it.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      020ac5b6
    • X
      introduce SIZE_MAX · a3860c1c
      Xi Wang committed
      ULONG_MAX is often used to check for integer overflow when calculating
      allocation size.  While ULONG_MAX happens to work on most systems, there
      is no guarantee that `size_t' must be the same size as `long'.
      
      This patch introduces SIZE_MAX, the maximum value of `size_t', to improve
      portability and readability for allocation size validation.
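      
      As a minimal userspace sketch of the kind of allocation-size check
      described here (the helper name alloc_array() is hypothetical and is not
      part of the patch), the element count can be compared against SIZE_MAX
      divided by the element size before the multiplication is performed:

```c
#include <stdint.h>   /* SIZE_MAX */
#include <stdlib.h>   /* malloc, free */

/*
 * Hypothetical helper, for illustration only: allocate room for n elements
 * of elem_size bytes each, returning NULL if n * elem_size would not fit
 * in a size_t.
 */
static void *alloc_array(size_t n, size_t elem_size)
{
	/* Divide first so the multiplication that might wrap is never done. */
	if (elem_size != 0 && n > SIZE_MAX / elem_size)
		return NULL;
	return malloc(n * elem_size);
}
```

      Using SIZE_MAX rather than ULONG_MAX keeps the bound tied to the type
      that actually holds the size, whatever the width of `long' is.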
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Acked-by: Alex Elder <elder@dreamhost.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3860c1c
    • J
      nfsd4: move rq_flavor into svc_cred · d5497fc6
      J. Bruce Fields committed
      Move the rq_flavor into struct svc_cred, and use it in setclientid and
      exchange_id comparisons as well.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      d5497fc6
    • J
      nfsd4: move principal name into svc_cred · 03a4e1f6
      J. Bruce Fields committed
      Instead of keeping the principal name associated with a request in a
      structure that's private to auth_gss and using an accessor function,
      move it to svc_cred.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      03a4e1f6
    • S
      SUNRPC: new svc_bind() routine introduced · 9793f7c8
      Stanislav Kinsbursky committed
      This new routine is responsible for service registration in a specified
      network context.
      
      The idea is to separate service creation from per-net operations.
      
      Note also: registering a service with svc_bind() can fail, in which case
      the service is destroyed, and during destruction it tries to unregister
      itself from rpcbind. In this case unregistration has to be skipped.
      Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      9793f7c8
    • T
      NFS: Ensure that setattr and getattr wait for O_DIRECT write completion · 1d59d61f
      Trond Myklebust committed
      Use the same mechanism as the block devices are using, but move the
      helper functions from fs/direct-io.c into fs/inode.c to remove the
      dependency on CONFIG_BLOCK.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Fred Isaman <iisaman@netapp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d59d61f
    • A
      split ->file_mmap() into ->mmap_addr()/->mmap_file() · e5467859
      Al Viro committed
      ... i.e. file-dependent and address-dependent checks.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e5467859
    • A
      split cap_mmap_addr() out of cap_file_mmap() · d007794a
      Al Viro committed
      ... switch callers.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d007794a
  3. 31 May, 2012 3 commits
    • N
      fsnotify: handle subfiles' perm events · a4f9a9a6
      Naohiro Aota committed
      Recently I've been working on fanotify and found the following strange
      behavior.
      
      I wrote a program that sets a fanotify mark on "/tmp/block" and replies
      FAN_DENY to all events it is notified of.
      
      fanotify_mask = FAN_ALL_EVENTS | FAN_ALL_PERM_EVENTS | FAN_EVENT_ON_CHILD:
      $ cd /tmp/block; cat foo
      cat: foo: Operation not permitted
      
      Operation on the file is blocked as expected.
      
      But,
      
      fanotify_mask = FAN_ALL_PERM_EVENTS | FAN_EVENT_ON_CHILD:
      $ cd /tmp/block; cat foo
      aaa
      
      It's not blocked anymore.  This is confusing behavior.  Also reading
      commit "fsnotify: call fsnotify_parent in perm events", it seems like
      fsnotify should handle subfiles' perm events as well as the other notify
      events.
      
      With this patch, regardless of whether FAN_ALL_EVENTS is set:
      $ cd /tmp/block; cat foo
      cat: foo: Operation not permitted
      
      Operation on the file is now blocked properly.
      
      FS_OPEN_PERM and FS_ACCESS_PERM are not listed in FS_EVENTS_POSS_ON_CHILD.
      Due to the fsnotify_inode_watches_children() check, if you specify only
      these events as fsnotify_mask, you don't get subfiles' perm events
      notified.
      
      This patch adds the events to FS_EVENTS_POSS_ON_CHILD so that they are
      notified even if only these events are specified in fsnotify_mask.
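      
      The effect of the mask change can be modelled in a few lines of
      userspace C. This is only a sketch of the check described above: the
      EV_* and POSS_ON_CHILD_* names and their bit values are illustrative
      assumptions, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; names and bit values are assumptions. */
#define EV_ACCESS       0x00000001u
#define EV_OPEN         0x00000020u
#define EV_OPEN_PERM    0x00010000u
#define EV_ACCESS_PERM  0x00020000u
#define EV_ON_CHILD     0x08000000u

/* Set of events that may be reported for a child, before and after. */
#define POSS_ON_CHILD_OLD (EV_ACCESS | EV_OPEN)
#define POSS_ON_CHILD_NEW (POSS_ON_CHILD_OLD | EV_OPEN_PERM | EV_ACCESS_PERM)

/*
 * Model of the check: a parent watches its children only if it asked for
 * child events at all AND at least one requested event is in the set of
 * events that can be reported for a child.
 */
static bool watches_children(uint32_t mask, uint32_t poss_on_child)
{
	if (!(mask & EV_ON_CHILD))
		return false;
	return (mask & poss_on_child) != 0;
}
```

      With a mask that contains only the two permission events plus the
      on-child flag, the old set makes the check fail and the new set makes
      it succeed, which matches the behavior change shown in the transcripts.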
      Signed-off-by: Naohiro Aota <naota@elisp.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a4f9a9a6
    • A
      bury __kernel_nlink_t, make internal nlink_t consistent · bb8ac181
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bb8ac181
    • J
      netdevice: Update netif_dbg for CONFIG_DYNAMIC_DEBUG · 0053ea9c
      Joe Perches committed
      Make netif_dbg use dynamic debugging whenever
      CONFIG_DYNAMIC_DEBUG is enabled.
      
      commit b558c96f
      ("dynamic_debug: make dynamic-debug supersede DEBUG ccflag")
      missed updating the netif_dbg variant.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0053ea9c
  4. 30 May, 2012 2 commits
    • M
      i2c: Split I2C_M_NOSTART support out of I2C_FUNC_PROTOCOL_MANGLING · 14674e70
      Mark Brown committed
      Since there are uses for I2C_M_NOSTART which are much more sensible and
      standard than most of the protocol mangling functionality (the main one
      being gather writes to devices where something like a register address
      needs to be inserted before a block of data) create a new I2C_FUNC_NOSTART
      for this feature and update all the users to use it.
      
      Also strengthen the recommendation against protocol mangling while we're
      at it.
      
      In the case of regmap-i2c we remove the requirement for mangling as
      I2C_M_NOSTART is the only mangling feature which is being used.
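      
      A gather write of the kind mentioned above can be sketched as two
      messages, the second one continuing without a new START. The sketch
      below is a self-contained userspace model: struct sk_msg, the SK_*
      constants and their values, and build_gather_write() are hypothetical
      stand-ins, not the kernel's i2c definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles on its own; values are illustrative. */
#define SK_M_NOSTART              0x4000u
#define SK_FUNC_PROTOCOL_MANGLING 0x00000004u
#define SK_FUNC_NOSTART           0x00000010u

struct sk_msg {
	uint16_t addr;   /* device address */
	uint16_t flags;  /* per-message flags */
	uint16_t len;    /* number of bytes in buf */
	uint8_t *buf;    /* payload */
};

/*
 * Build a two-message gather write: the register address first, then the
 * data block continued without a repeated START. Returns 0 on success, or
 * -1 if the adapter does not advertise the dedicated NOSTART capability.
 */
static int build_gather_write(uint32_t adapter_func, uint16_t addr,
			      uint8_t *reg, uint16_t reg_len,
			      uint8_t *data, uint16_t data_len,
			      struct sk_msg msgs[2])
{
	/* Only the dedicated capability bit is checked, not the mangling bit. */
	if (!(adapter_func & SK_FUNC_NOSTART))
		return -1;

	msgs[0] = (struct sk_msg){ .addr = addr, .flags = 0,
				   .len = reg_len, .buf = reg };
	msgs[1] = (struct sk_msg){ .addr = addr, .flags = SK_M_NOSTART,
				   .len = data_len, .buf = data };
	return 0;
}
```

      Checking a dedicated capability bit means a caller that only needs
      this one feature does not have to require full protocol mangling
      support from the adapter.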
      Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Acked-by: Wolfram Sang <w.sang@pengutronix.de>
      Signed-off-by: Jean Delvare <khali@linux-fr.org>
      14674e70
    • H
      watchdog: Add support for dynamically allocated watchdog_device structs · e907df32
      Hans de Goede committed
      If a driver's watchdog_device struct is part of a dynamically allocated
      struct (which it often will be), merely locking the module is not enough:
      even with a driver's module locked, the driver can be unbound from the
      device. Examples:
      1) The root user can unbind it through sysfs
      2) The i2c bus master driver being unloaded for an i2c watchdog
      
      I will gladly admit that these are corner cases, but we still need to handle
      them correctly.
      
      The fix for this consists of 2 parts:
      1) Add ref / unref operations, so that the driver can refcount the struct
         holding the watchdog_device struct and delay freeing it until any
         open filehandles referring to it are closed
      2) Most driver operations will do IO on the device and the driver should not
         do any IO on the device after it has been unbound. Rather than letting each
         driver deal with this internally, it is better to ensure at the watchdog
         core level that no operations (other than unref) will get called after
         the driver has called watchdog_unregister_device(). This actually is the
         bulk of this patch.
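      
      The two parts above can be illustrated with a small single-threaded
      userspace model. All names here (struct demo_wdt, demo_ref(),
      demo_unref(), demo_ping(), demo_unregister()) are hypothetical, and the
      sketch leaves out the locking a real implementation would need.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical driver-private struct that embeds the watchdog state. */
struct demo_wdt {
	int refcount;       /* one reference per holder: driver binding, open handle */
	bool unregistered;  /* set once the driver has unregistered the device */
	int pings;          /* stands in for IO on the hardware */
};

static int demo_freed;      /* counts frees so the lifetime can be observed */

static struct demo_wdt *demo_alloc(void)
{
	struct demo_wdt *w = calloc(1, sizeof(*w));

	if (w)
		w->refcount = 1;    /* reference held by the driver binding */
	return w;
}

/* Part 1: ref / unref delay freeing until the last holder is gone. */
static void demo_ref(struct demo_wdt *w)
{
	w->refcount++;
}

static void demo_unref(struct demo_wdt *w)
{
	if (--w->refcount == 0) {
		free(w);
		demo_freed++;
	}
}

/* Part 2: core-level gate, no device IO after unregistration. */
static int demo_ping(struct demo_wdt *w)
{
	if (w->unregistered)
		return -ENODEV;
	w->pings++;
	return 0;
}

/* Driver unbind: mark the device unregistered, drop the driver's reference. */
static void demo_unregister(struct demo_wdt *w)
{
	w->unregistered = true;
	demo_unref(w);
}
```

      In this model an open file handle takes a reference with demo_ref();
      after demo_unregister() the struct stays allocated until that handle
      calls demo_unref(), and any demo_ping() in between is refused instead
      of touching the device.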
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
      e907df32