1. 01 Jun, 2012 40 commits
    • ipc/mqueue: improve performance of send/recv · d6629859
      Doug Ledford committed
      The existing implementation of the POSIX message queue send and recv
      functions is, well, abysmal.  Even worse than abysmal.  I submitted a
      patch to increase the maximum POSIX message queue limit to 65536 due to
      customer needs, however, upon looking over the send/recv implementation, I
      realized that my customer needs help with that too even if they don't know
      it.  The basic problem is that, given the fairly typical use case scenario
      for a large queue of queueing lots of messages all at the same priority (I
      verified with my customer that this is indeed what their app does), the
      msg_insert routine is basically a frikkin' bubble sort.  I mean, whoa,
      that's *so* middle school.
      
      OK, OK, to not slam the original author too much, I'm sure they didn't
      envision a queue depth of 50,000+ messages.  No one would think that
      moving elements in an array, one at a time, and dereferencing each pointer
      in that array to check priority of the message being pointed to, again
      one at a time, for 50,000+ times would be good.  So let's assume that, as
      is typical, the users have found a way to break our code simply by using
      it in a way we didn't envision.  Fair enough.
      
      "So, just how broken is it?", you ask.  I wondered the same thing, so I
      wrote an app to let me know.  It's my next patch.  It gave me some
      interesting results.  Here's what it tested:
      
      Interference with other apps - In continuous mode, the app just sits there
      and hits a message queue forever, while you go do something productive on
      another terminal using other CPUs.  You then measure how long it takes you
      to do that something productive.  Then you restart the app in fake
      continuous mode, and it sits in a tight loop on a CPU while you repeat
      your tests.  The whole point of this is to keep one CPU tied up (so it
      can't be used in your other work) but in one case tied up hitting the
      mqueue code so we can see the effect of walking that 65,528 element array
      one pointer at a time on the global CPU cache.  If it's bad, then it will
      slow down your app on the other CPUs just by polluting cache mercilessly.
      In the fake case, it will be in a tight loop, but not polluting cache.
      
      Testing the mqueue subsystem directly - Here we just run a number of tests
      to see how the mqueue subsystem performs under different conditions.  A
      couple conditions are known to be worst case for the old system, and some
      routines, so this tests all of them.
      
      So, on to the results already:
      
      Subsystem/Test                  Old                         New
      
      Time to compile linux
      kernel (make -j12 on a
      6 core CPU)
        Running mqueue test     user 49m10.744s             user 45m26.294s
                                 sys  5m51.924s              sys  4m59.894s
                               total 55m02.668s            total 50m26.188s
      
        Running fake test       user 45m32.686s             user 45m18.552s
                                 sys  5m12.465s              sys  4m56.468s
                               total 50m45.151s            total 50m15.020s
      
        % slowdown from mqueue
          cache thrashing            ~8%                         ~.5%
      
      Avg time to send/recv (in nanoseconds per message)
        when queue empty            305/288                    349/318
        when queue full (65528 messages)
          constant priority      526589/823                    362/314
          increasing priority    403105/916                    495/445
          decreasing priority     73420/594                    482/409
          random priority        280147/920                    546/436
      
      Time to fill/drain queue (65528 messages, in seconds)
        constant priority         17.37/.12                    .13/.12
        increasing priority        4.14/.14                    .21/.18
        decreasing priority       12.93/.13                    .21/.18
        random priority            8.88/.16                    .22/.17
      
      So, I think the results speak for themselves.  It's possible this
      implementation could be improved by caching at least one priority level
      in the node tree (that would bring the queue empty performance more in
      line with the old implementation), but this works and is *so* much better
      than what we had, especially for the common case of a single priority in
      use, that further refinements can be in follow on patches.
      
      [akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
      [levinsasha928@gmail.com: use correct gfp flags in msg_insert]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
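The cost difference described in the send/recv commit above can be illustrated with a small userspace sketch. It is not the kernel code: the commit mentions a node tree keyed by priority, while this sketch swaps in a fixed array of per-priority FIFO buckets to keep the example short. The names `prio_queue`, `pq_send`, `pq_recv` and the `NPRIO` bound are hypothetical. The point it shows is that appending to a per-priority list costs the same regardless of queue depth, whereas shifting elements of a sorted array grows with the number of queued messages.

```c
#include <assert.h>
#include <stddef.h>

#define NPRIO 32                /* hypothetical, small priority range for the sketch */

struct msg {
	struct msg *next;           /* FIFO link within one priority level */
	unsigned int prio;
	int payload;
};

struct prio_queue {
	struct msg *head[NPRIO];    /* oldest message at each priority */
	struct msg *tail[NPRIO];    /* newest message at each priority */
	size_t count;
};

/* Append to the tail of the message's priority bucket: O(1), nothing is shifted. */
void pq_send(struct prio_queue *q, struct msg *m)
{
	assert(m->prio < NPRIO);
	m->next = NULL;
	if (q->tail[m->prio])
		q->tail[m->prio]->next = m;
	else
		q->head[m->prio] = m;
	q->tail[m->prio] = m;
	q->count++;
}

/* Remove the oldest message of the highest non-empty priority, or return NULL. */
struct msg *pq_recv(struct prio_queue *q)
{
	for (int p = NPRIO - 1; p >= 0; p--) {
		struct msg *m = q->head[p];

		if (!m)
			continue;
		q->head[p] = m->next;
		if (!q->head[p])
			q->tail[p] = NULL;
		q->count--;
		return m;
	}
	return NULL;
}
```

With a single priority in use, which the commit calls the common case, every send touches only one tail pointer. A tree keyed by priority would avoid the linear scan over empty buckets in `pq_recv` when the priority range is large.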
    • selftests: add mq_open_tests · 50069a58
      Doug Ledford committed
      Add a directory to house POSIX message queue subsystem specific tests.
      Add first test which checks the operation of mq_open() under various
      corner conditions.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mqueue: separate mqueue default value from maximum value · cef0184c
      KOSAKI Motohiro committed
      Commit b231cca4 ("message queues: increase range limits") changed the
      mqueue default values used when the attr parameter is NULL from
      hard-coded values to the fs.mqueue.{msg,msgsize}_max sysctl values.
      
      This had a large side effect.  Suppose a user needs to run two mqueue
      applications: 1) one that passes a non-NULL attr parameter and requires
      a big message size, and 2) one that passes a NULL attr parameter and
      only needs small messages.  App (1) requires raising
      fs.mqueue.msgsize_max, and app (2) then consumes a large amount of
      memory even though it doesn't need it.
      
      Doug Ledford proposed switching back to static hard-coded values.
      However, that also has a compatibility problem: some applications might
      have started to depend on the default value being tunable.
      
      The solution is to separate default value from maximum value.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
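The separation of defaults from maximums described above can be sketched as a small selection function. This is an illustrative userspace model, not the kernel implementation: `struct mq_limits`, `struct mq_request` and `mq_pick_attr` are hypothetical names standing in for the sysctl knobs and for the two relevant fields of the attr struct, and clamping the defaults to the maximums is an assumption about how the two sets of values interact.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the sysctl knobs: two maximums, two defaults. */
struct mq_limits {
	long msg_default, msg_max;
	long msgsize_default, msgsize_max;
};

/* Hypothetical stand-in for the queue depth and message size of an attr struct. */
struct mq_request {
	long maxmsg, msgsize;
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * Choose the attributes for a new queue.
 * Returns 0 and fills *out on success, -1 if an explicit request is out of range.
 */
int mq_pick_attr(const struct mq_limits *lim, const struct mq_request *req,
		 struct mq_request *out)
{
	assert(lim && out);

	if (!req) {
		/* No attr passed: use the defaults, assumed never to exceed the maximums. */
		out->maxmsg = min_long(lim->msg_default, lim->msg_max);
		out->msgsize = min_long(lim->msgsize_default, lim->msgsize_max);
		return 0;
	}

	/* Explicit attr: only the maximums matter. */
	if (req->maxmsg <= 0 || req->msgsize <= 0 ||
	    req->maxmsg > lim->msg_max || req->msgsize > lim->msgsize_max)
		return -1;

	*out = *req;
	return 0;
}
```

In this model, raising `msgsize_max` for the application that needs big messages leaves the attributes of NULL-attr queues unchanged, which is the side effect the commit sets out to remove.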
    • mqueue: don't use kmalloc with KMALLOC_MAX_SIZE · fd1f87d2
      KOSAKI Motohiro committed
      KMALLOC_MAX_SIZE is not a good threshold.  It is extremely high and
      problematic.  Unfortunately, some silly drivers depend on this and we
      can't change it.  But new code needn't use such extremely ugly high-order
      allocations.  They bring us awful fragmentation issues and system
      slowdown.
      Signed-off-by: KOSAKI Motohiro <mkosaki@jp.fujitsu.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mqueue: revert bump up DFLT_*MAX · e6315bb1
      KOSAKI Motohiro committed
      Like other IPC limits, the mqueue limits are slightly naive parameters,
      because an unprivileged user can consume kernel memory by using IPC
      objects.
      
      Thus, raising them too aggressively creates a security issue.  For
      example, the current setting allows a malicious unprivileged user to use
      256GB (= 256 * 1024 * 1024 * 1024), which is large enough to make the
      system unresponsive.  Don't do that.
      
      Instead, every admin should adjust the knobs for their own systems.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Joe Korty <joe.korty@ccur.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/mqueue: update maximums for the mqueue subsystem · 5b5c4d1a
      Doug Ledford committed
      Commit b231cca4 ("message queues: increase range limits") changed the
      maximum size of a message in a message queue from INT_MAX to 8192*128.
      Unfortunately, we had customers that relied on a size much larger than
      8192*128 on their production systems.  After reviewing POSIX, we found
      that it is silent on the maximum message size.  We did find a couple other
      areas in which it was not silent.  Fix up the mqueue maximums so that the
      customer's system can continue to work, and document both the POSIX and
      real world requirements in ipc_namespace.h so that we don't have this
      issue crop back up.
      
      Also, commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher not lower
      on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
      intentionally in place to limit the msg queue depth to one that was small
      enough to kmalloc an array of pointers (hence why we divided 128k by
      sizeof(long)).  If we wish to meet POSIX requirements, we have no choice
      but to change our allocation to a vmalloc instead (at least for the large
      queue size case).  With that, it's possible to increase our allowed
      maximum to the POSIX requirements (or more if we choose).
      
      [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
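The arithmetic behind the old HARD_MSGMAX value mentioned above (128k divided by the pointer size) can be worked through in a few lines. This is only a sketch of the reasoning: the helper names are hypothetical, and the pointer size is passed in as a parameter so that the 32-bit and 64-bit cases can be compared on any machine.

```c
#include <assert.h>
#include <stddef.h>

#define PTR_ARRAY_BUDGET (128 * 1024)   /* the 128k figure quoted in the commit */

/* Largest queue depth whose array of message pointers still fits in the budget. */
size_t max_depth_for_budget(size_t budget_bytes, size_t ptr_size)
{
	assert(ptr_size > 0);
	return budget_bytes / ptr_size;
}

/* Bytes needed for the pointer array of a queue with the given depth. */
size_t ptr_array_bytes(size_t depth, size_t ptr_size)
{
	return depth * ptr_size;
}

/* Nonzero if a queue this deep no longer fits the budget and needs the large-allocation path. */
int needs_large_alloc(size_t depth, size_t ptr_size, size_t budget_bytes)
{
	return ptr_array_bytes(depth, ptr_size) > budget_bytes;
}
```

For example, `max_depth_for_budget(PTR_ARRAY_BUDGET, 8)` gives 16384 slots with 8-byte pointers, so a depth of 65536 exceeds the budget, which is why the commit falls back to vmalloc for the large queue size case.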
    • ipc/mqueue: enforce hard limits · 02967ea0
      Doug Ledford committed
      In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
      In preparation for making more reasonable hard limits, start enforcing
      them even on CAP_SYS_RESOURCE.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/mqueue: switch back to using non-max values on create · 858ee378
      Doug Ledford committed
      Commit b231cca4 ("message queues: increase range limits") changed
      how we create a queue that does not include an attr struct passed to
      open so that it creates the queue with whatever the maximum values are.
      However, if the admin has set the maximums to allow flexibility in
      creating a queue (aka, both a large size and large queue are allowed,
      but combined they create a queue too large for the RLIMIT_MSGQUEUE of
      the user), then attempts to create a queue without an attr struct will
      fail.  Switch back to using acceptable defaults regardless of what the
      maximums are.
      
      Note: so far, we only know of a few applications that rely on this
      behavior (specifically, set the maximums in /proc, then run the
      application which calls mq_open() without passing in an attr struct, and
      the application expects the newly created message queue to have the
      maximum sizes that were set in /proc used on the mq_open() call), and all
      of those applications that we know of are actually part of regression
      test suites that were coded to do something like this:
      
      for size in 4096 65536 $((1024 * 1024)) $((16 * 1024 * 1024)); do
      	echo $size > /proc/sys/fs/mqueue/msgsize_max
      	mq_open || echo "Error opening mq with size $size"
      done
      
      These test suites that depend on any behavior like this are broken.  The
      concept that programs should rely upon the system wide maximum in order
      to get their desired results instead of simply using an attr struct to
      specify what they want is fundamentally unfriendly programming practice
      for any multi-tasking OS.
      
      Fixing this will break those few apps that we know of (and those app
      authors recognize the brokenness of their code and the need to fix it).
      However, the following patch "mqueue: separate mqueue default value"
      allows a workaround in the form of new knobs for the default msg queue
      creation parameters for any software out there that we don't already
      know about that might rely on this behavior at the moment.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc/mqueue: cleanup definition names and locations · 93e6f119
      Doug Ledford committed
      Since commit b231cca4 ("message queues: increase range limits") on
      Oct 18, 2008, calls to mq_open() that did not pass in an attribute
      struct and expected to get default values for the size of the queue and
      the max message size now get the system wide maximums instead of
      hardwired defaults like they used to get.
      
      This was uncovered when one of the earlier patches in this patch set
      increased the default system wide maximums at the same time it increased
      the hard ceiling on the system wide maximums (a customer specifically
      needed the hard ceiling brought back up, the new ceiling that commit
      b231cca4 introduced was too low for their production systems).  By
      increasing the default maximums and not realising they were tied to any
      attempt to create a message queue without an attribute struct, I had
      inadvertently made it such that all message queue creation attempts
      without an attribute struct were failing because the new default
      maximums would create a queue that exceeded the default rlimit for
      message queue bytes.
      
      As a result, the system wide defaults were brought back down to their
      previous levels, and the system wide ceilings on the maximums were
      raised to meet the customer's needs.  However, the fact that the no
      attribute struct behavior of mq_open() could be broken by changing the
      system wide maximums for message queues was seen as fundamentally broken
      itself.  So we hardwired the no attribute case back like it used to be.
      But, then we realized that on the very off chance that some piece of
      software in the wild depended on that behavior, we could work around
      that issue by adding two new knobs to /proc that allowed setting the
      defaults for message queues created without an attr struct separately
      from the system wide maximums.
      
      What is not an option IMO is to leave the current behavior in place.  No
      piece of software should ever rely on setting the system wide maximums
      in order to get a desired message queue.  Such a reliance would be so
      fundamentally multitasking OS unfriendly as to not really be tolerable.
      Fortunately, we don't know of any software in the wild that uses this
      except for a regression test program that caught the issue in the first
      place.  If there is though, we have made accommodations with the two new
      /proc knobs (and that's all the accommodations such fundamentally broken
      software can be allowed).
      
      This patch:
      
      The various defines for minimums and maximums of the sysctl controllable
      mqueue values are scattered amongst different files and named
      inconsistently.  Move them all into ipc_namespace.h and make them have
      consistent names.  Additionally, make the number of queues per namespace
      also have a minimum and maximum and use the same sysctl function as the
      other two settable variables.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Amerigo Wang <amwang@redhat.com>
      Cc: Joe Korty <joe.korty@ccur.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
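The failure described above, where queues created without an attribute struct exceeded the rlimit for message queue bytes, comes down to a multiplication. The sketch below assumes the accounting documented for kernels of that era in mq_overview(7): one pointer per message slot plus the message payload bytes. The helper names are hypothetical, and the sample values (a depth of 10, 8192-byte messages, a limit of 819200 bytes, and the larger 1024 x 1 MiB case) are illustrative assumptions, not figures taken from the commit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes charged against the message queue rlimit for one queue, assuming one
 * pointer per slot plus the payload bytes (an assumption, see the text above).
 */
unsigned long long mq_bytes_needed(unsigned long long maxmsg,
				   unsigned long long msgsize,
				   unsigned long long ptr_size)
{
	return maxmsg * ptr_size + maxmsg * msgsize;
}

/* Nonzero if a queue of that size fits under the given byte limit. */
int mq_fits_rlimit(unsigned long long bytes, unsigned long long limit)
{
	return bytes <= limit;
}
```

Under these assumptions a 10 x 8192 queue needs 82000 bytes and fits easily, while a 1024 x 1 MiB queue needs just over 1 GiB and is rejected, so tying the NULL-attr case to large maximums makes every such mq_open() fail for an unprivileged user.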
    • kexec: export kexec.h to user space · 29a5c67e
      maximilian attems committed
      Add userspace definitions, guard all relevant kernel structures.  While at
      it document stuff and remove now useless userspace hint.
      
      It is easy to add the relevant system call to respective libc's, but it
      seems pointless to have to duplicate the data structures.
      
      This is based on the kexec-tools headers, with the exception of just using
      int on return (success or failure) and using size_t instead of 'unsigned
      long int' for the number of segments argument of kexec_load().
      Signed-off-by: maximilian attems <max@stro.at>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Haren Myneni <hbabu@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/cpu.c: document clear_tasks_mm_cpumask() · e4cc2f87
      Anton Vorontsov committed
      Add more comments on clear_tasks_mm_cpumask(), plus add a runtime check:
      the function is only suitable for offlined CPUs, and if called
      inappropriately, the kernel should scream aloud.
      
      [akpm@linux-foundation.org: tweak comment: s/walks up/walks/, use 80 cols]
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • um: properly check all process' threads for a live mm · 2c922c51
      Anton Vorontsov committed
      kill_off_processes() might miss a valid process because checking
      for process->mm is not enough.  A process' main thread may exit or detach
      its mm via use_mm(), but other threads may still have a valid mm.
      
      To catch this we use find_lock_task_mm(), which walks up all threads and
      returns an appropriate task (with task lock held).
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • um: fix possible race on task->mm · 137d1a26
      Anton Vorontsov committed
      Checking for task->mm is dangerous as ->mm might disappear (exit_mm()
      assigns NULL under task_lock(), so tasklist lock is not enough).
      
      We can't use get_task_mm()/mmput() pair as mmput() might sleep, so let's
      take the task lock while we care about its mm.
      
      Note that we should also use find_lock_task_mm() to check all process'
      threads for a valid mm, but for uml we'll do it in a separate patch.
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • um: should hold tasklist_lock while traversing processes · 9bd0a077
      Anton Vorontsov committed
      Traversing the tasks requires holding tasklist_lock, otherwise it is
      unsafe.
      
      p.s.  However, I'm not sure that calling os_kill_ptraced_process() in the
      atomic context is correct.  It seems to work, but please take a closer
      look.
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • blackfin: fix possible deadlock in decode_address() · af1be5a5
      Anton Vorontsov committed
      Oleg Nesterov found an interesting deadlock possibility:
      
      > sysrq_showregs_othercpus() does smp_call_function(showacpu)
      > and showacpu() show_stack()->decode_address(). Now suppose that IPI
      > interrupts the task holding read_lock(tasklist).
      
      To fix this, blackfin should not grab the write_ variant of the
      tasklist lock, read_ one is enough.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • blackfin: a couple of task->mm handling fixes · 2214f707
      Anton Vorontsov committed
      The patch fixes two problems:
      
      1. Working with task->mm w/o getting mm or grabbing the task lock is
         dangerous as ->mm might disappear (exit_mm() assigns NULL under
         task_lock(), so tasklist lock is not enough).
      
         We can't use get_task_mm()/mmput() pair as mmput() might sleep,
         so we have to take the task lock while handling its mm.
      
      2. Checking for process->mm is not enough because process' main
         thread may exit or detach its mm via use_mm(), but other threads
         may still have a valid mm.
      
         To catch this we use find_lock_task_mm(), which walks up all
         threads and returns an appropriate task (with task lock held).
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sh: use clear_tasks_mm_cpumask() · 1198c8b9
      Anton Vorontsov committed
      Checking for process->mm is not enough because process' main thread may
      exit or detach its mm via use_mm(), but other threads may still have a
      valid mm.
      
      To fix this we would need to use find_lock_task_mm(), which would walk up
      all threads and return an appropriate task (with task lock held).
      
      clear_tasks_mm_cpumask() has the issue fixed, so let's use it.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • powerpc: use clear_tasks_mm_cpumask() · 73863ab0
      Anton Vorontsov committed
      Current CPU hotplug code has some task->mm handling issues:
      
      1. Working with task->mm w/o getting mm or grabbing the task lock is
         dangerous as ->mm might disappear (exit_mm() assigns NULL under
         task_lock(), so tasklist lock is not enough).
      
         We can't use get_task_mm()/mmput() pair as mmput() might sleep,
         so we must take the task lock while handling its mm.
      
      2. Checking for process->mm is not enough because process' main
         thread may exit or detach its mm via use_mm(), but other threads
         may still have a valid mm.
      
         To fix this we would need to use find_lock_task_mm(), which would
         walk up all threads and return an appropriate task (with task
         lock held).
      
      clear_tasks_mm_cpumask() has all the issues fixed, so let's use it.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arm: use clear_tasks_mm_cpumask() · 3eaa73bd
      Anton Vorontsov committed
      Checking for process->mm is not enough because process' main thread may
      exit or detach its mm via use_mm(), but other threads may still have a
      valid mm.
      
      To fix this we would need to use find_lock_task_mm(), which would walk up
      all threads and return an appropriate task (with task lock held).
      
      clear_tasks_mm_cpumask() has this issue fixed, so let's use it.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpu: introduce clear_tasks_mm_cpumask() helper · cb79295e
      Anton Vorontsov committed
      Many architectures clear tasks' mm_cpumask like this:
      
      	read_lock(&tasklist_lock);
      	for_each_process(p) {
      		if (p->mm)
      			cpumask_clear_cpu(cpu, mm_cpumask(p->mm));
      	}
      	read_unlock(&tasklist_lock);
      
      Depending on the context, the code above may have several problems,
      such as:
      
      1. Working with task->mm w/o getting mm or grabbing the task lock is
         dangerous as ->mm might disappear (exit_mm() assigns NULL under
         task_lock(), so tasklist lock is not enough).
      
      2. Checking for process->mm is not enough because process' main
         thread may exit or detach its mm via use_mm(), but other threads
         may still have a valid mm.
      
      This patch implements a small helper function that does things
      correctly, i.e.:
      
      1. We take the task's lock while we handle its mm (we can't use
         get_task_mm()/mmput() pair as mmput() might sleep);
      
      2. To catch exited main thread case, we use find_lock_task_mm(),
         which walks up all threads and returns an appropriate task
         (with task lock held).
      
      Also, per Peter Zijlstra's idea, we no longer grab tasklist_lock in
      the new helper, instead we take the rcu read lock. We can do this
      because the function is called after the cpu is taken down and marked
      offline, so no new tasks will get this cpu set in their mm mask.
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
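The second problem listed in the helper commit above (the main thread has lost its mm while other threads still hold one) can be illustrated with a userspace model. This is not kernel code and it omits all locking: the `model_*` types and functions are hypothetical stand-ins, with `model_find_task_mm()` playing the role that the commit assigns to find_lock_task_mm(), and a plain `unsigned long` standing in for the mm cpumask.

```c
#include <assert.h>
#include <stddef.h>

struct model_mm {
	unsigned long cpumask;          /* one bit per CPU */
};

struct model_thread {
	struct model_mm *mm;            /* NULL once the thread has exited or detached */
};

struct model_process {
	struct model_thread *threads;   /* threads[0] is the main thread */
	size_t nthreads;
};

/* Return the first thread that still has an mm, or NULL if none does. */
struct model_thread *model_find_task_mm(struct model_process *p)
{
	for (size_t i = 0; i < p->nthreads; i++)
		if (p->threads[i].mm)
			return &p->threads[i];
	return NULL;
}

/* Clear the given CPU from the mm mask of every process that still has an mm. */
void model_clear_tasks_mm_cpumask(struct model_process *procs, size_t nprocs,
				  unsigned int cpu)
{
	assert(cpu < 8 * sizeof(unsigned long));

	for (size_t i = 0; i < nprocs; i++) {
		struct model_thread *t = model_find_task_mm(&procs[i]);

		if (!t)
			continue;       /* no thread of this process has an mm */
		t->mm->cpumask &= ~(1UL << cpu);
	}
}
```

A loop that only tested the main thread's mm, as in the open-coded version quoted in the commit, would skip a process whose main thread has exited and leave the offlined CPU set in its mask.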
    • fork: call complete_vfork_done() after clearing child_tid and flushing rss-counters · f7505d64
      Konstantin Khlebnikov committed
      Child should wake up the parent from vfork() only after finishing all
      operations with shared mm.  There is no sense in using
      CLONE_CHILD_CLEARTID together with CLONE_VFORK, but it looks more accurate
      now.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc/smaps: show amount of nonlinear ptes in vma · bca15543
      Konstantin Khlebnikov committed
      Currently, nonlinear mappings can not be distinguished from ordinary
      mappings.  This patch adds a "Nonlinear: <size> kB" line to
      /proc/pid/smaps, where size is the amount of nonlinear ptes in the vma.
      The line appears only if VM_NONLINEAR is set.  This information may be
      useful beyond the checkpoint/restore project.
      
      Requested by Pavel Emelyanov.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc/smaps: carefully handle migration entries · b1d4d9e0
      Konstantin Khlebnikov committed
      Currently smaps reports migration entries as "swap"; as a result, "swap"
      can appear in shared mappings.
      
      This patch converts migration entries into pages and handles them as usual.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      proc: report file/anon bit in /proc/pid/pagemap · 052fb0d6
      Konstantin Khlebnikov committed
      This is an implementation of Andrew's proposal to extend the pagemap file
      bits to report what is missing about tasks' working set.
      
      The problem with the working set detection is multilateral.  In the criu
      (checkpoint/restore) project we dump the tasks' memory into image files
      and to do it properly we need to detect which pages inside mappings are
      really in use.  I thought the mincore syscall could help with this, but it
      did not.  First, it doesn't report swapped pages, thus we cannot find out
      which parts of anonymous mappings to dump.  Next, it does report pages from
      the page cache as present even if they are not mapped, and it doesn't
      indicate which pages have not been cow-ed.
      
      Note that the issue with swap pages is critical -- we must dump swap pages
      to the image file.  The issues with file pages are only an optimization --
      we could put all file pages into the image and this would be correct, but
      if we know that a page is not mapped or not cow-ed, we can leave it out of
      the dump file.  The dump would still be self-consistent, though
      significantly smaller in size (up to 10 times smaller on real apps).
      
      Andrew noticed that the proc pagemap file solves 2 of the 3 issues above --
      it reports whether a page is present or swapped and it doesn't report
      unmapped page cache pages.  But it doesn't distinguish cow-ed file pages
      from non-cow-ed ones.
      
      I would like to make the last unused bit in this file report whether the
      page mapped into the respective pte is PageAnon or not.
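
The decision logic described above can be sketched from the userspace side. The bit positions used here (63 present, 62 swapped, 61 file-page) follow the kernel's pagemap documentation as I understand it, not this commit message, so treat them as an assumption. `decode_entry`, `must_dump` and `read_entry` are hypothetical helper names, and the dump policy is a simplification of what a real checkpoint tool would need.

```python
import struct

PM_PRESENT = 1 << 63  # page is in RAM
PM_SWAP = 1 << 62     # page is in swap
PM_FILE = 1 << 61     # assumed position of the new bit: page is not PageAnon
ENTRY_SIZE = 8        # one 64-bit entry per virtual page


def decode_entry(raw):
    """Decode one 8-byte pagemap entry (native byte order) into flags."""
    (value,) = struct.unpack("=Q", raw)
    return {
        "present": bool(value & PM_PRESENT),
        "swapped": bool(value & PM_SWAP),
        "file": bool(value & PM_FILE),
    }


def must_dump(entry):
    """Simplified policy: always dump swapped pages; dump present pages
    only when they are anonymous (for example cow-ed) rather than file pages."""
    if entry["swapped"]:
        return True
    return entry["present"] and not entry["file"]


def read_entry(pid, vaddr, page_size=4096):
    """Read the entry for one virtual address; needs suitable permissions."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * ENTRY_SIZE)
        return decode_entry(f.read(ENTRY_SIZE))
```

With this split, a present file page that was never cow-ed is skipped because it can be re-read from the file, while swapped pages are always kept.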
      
      [comment stolen from Pavel Emelyanov's v1 patch]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      052fb0d6
    • J
      procfs: use more appropriate types when dumping /proc/N/stat · 715be1fc
      Jan Engelhardt committed
      - use int for priority and nice, since task_nice()/task_prio() return that
      
      - field 24: get_mm_rss() returns unsigned long
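
For readers consuming these fields from userspace, here is a minimal parsing sketch. The field numbers for priority (18) and nice (19) come from the proc(5) man page rather than this commit; `parse_stat` and `SAMPLE_STAT` are hypothetical, and the sample line is invented and truncated after field 24.

```python
# Invented example line; real lines carry many more fields after field 24.
SAMPLE_STAT = (
    "1234 (my (odd) prog) S 1 1234 1234 0 -1 4194560 100 0 2 0 "
    "5 3 0 0 15 -5 1 0 5000 10485760 321"
)


def parse_stat(line):
    """Extract pid, comm, priority, nice and rss from a /proc/N/stat line."""
    # comm may itself contain spaces and parentheses, so split on the LAST ')'.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state)

    def field(number):
        return rest[number - 3]

    return {
        "pid": int(pid_text),
        "comm": comm,
        "priority": int(field(18)),   # signed: may be negative
        "nice": int(field(19)),       # signed: may be negative
        "rss_pages": int(field(24)),  # unsigned page count
    }
```

Priority and nice are signed while the rss count is unsigned, which is the distinction the type changes above are about. Python integers accept both without special handling.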
      Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      715be1fc
    • A
      proc: pass "fd" by value in /proc/*/{fd,fdinfo} code · af5e6171
      Alexey Dobriyan committed
      Pass "fd" directly, not via pointer -- one less memory read.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5e6171
    • A
      proc: don't do dummy rcu_read_lock/rcu_read_unlock on error path · f05ed3f1
      Alexey Dobriyan committed
      rcu_read_lock()/rcu_read_unlock() is a nop for TINY_RCU, but is not a nop
      for, say, PREEMPT_RCU.
      
      proc_fill_cache() is called without the RCU lock held, so there is no need
      to lock/unlock on the error path; simply jump out of the loop.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f05ed3f1
    • C
      proc: use mm_access() instead of ptrace_may_access() · 2344bec7
      Cong Wang committed
      mm_access() handles this much better, and avoids some race conditions.
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2344bec7
    • C
      proc: remove mm_for_maps() · e7dcd999
      Cong Wang committed
      mm_for_maps() is a simple wrapper for mm_access(), and the name is
      misleading, so just remove it and use mm_access() directly.
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7dcd999
    • C
      proc: clean up /proc/<pid>/environ handling · b409e578
      Cong Wang committed
      Similar to e268337d ("proc: clean up and fix /proc/<pid>/mem
      handling"), move the permission check to open(); this simplifies the
      read() code.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b409e578
    • T
      stack usage: add pid to warning printk in check_stack_usage · 168eeccb
      Tim Bird committed
      In embedded systems, sometimes the same program (busybox) is the cause of
      multiple warnings.  Outputting the pid with the program name in the
      warning printk helps distinguish which instances of a program are using
      the stack most.
      
      This is a small patch, but useful.
      Signed-off-by: Tim Bird <tim.bird@am.sony.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      168eeccb
    • O
      cred: remove task_is_dead() from __task_cred() validation · 43e13cc1
      Oleg Nesterov committed
      Commit 8f92054e ("CRED: Fix __task_cred()'s lockdep check and banner
      comment"):
      
          add the following validation condition:
      
              task->exit_state >= 0
      
          to permit the access if the target task is dead and therefore
          unable to change its own credentials.
      
      OK, but afaics currently this can only help wait_task_zombie() which calls
      __task_cred() without rcu lock.
      
      Remove this validation and change wait_task_zombie() to use task_uid()
      instead.  This means we do rcu_read_lock() only to shut up the lockdep,
      but we already do the same in, say, wait_task_stopped().
      
      task_is_dead() should die, task->exit_state != 0 means that this task has
      passed exit_notify(), only do_wait-like code paths should use this.
      
      Unfortunately, we can't kill task_is_dead() right now, it has already
      acquired buggy users in drivers/staging.  The fix already exists.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43e13cc1
    • R
      kmod.c: fix kernel-doc warning · 9b3c98cd
      Randy Dunlap committed
      Warning(kernel/kmod.c:419): No description found for parameter 'depth'
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b3c98cd
    • B
      kmod: move call_usermodehelper_fns() to .c file and unexport all its helpers · 785042f2
      Boaz Harrosh committed
      If we move call_usermodehelper_fns() to kmod.c file and EXPORT_SYMBOL it
      we can avoid exporting all its helper functions:
      	call_usermodehelper_setup
      	call_usermodehelper_setfns
      	call_usermodehelper_exec
      And make all of them static to kmod.c
      
      Since the optimizer will see all these as a single call site it will
      inline them inside call_usermodehelper_fns().  So we lose the call to
      _fns but gain 3 calls to the helpers.  (Not that it matters)
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      785042f2
    • B
      kmod: convert two call sites to call_usermodehelper_fns() · 81ab6e7b
      Boaz Harrosh committed
      Both kernel/sys.c and security/keys/request_key.c were inlining the exact
      same code as call_usermodehelper_fns(), so simply convert these sites to
      use call_usermodehelper_fns() directly.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81ab6e7b
    • B
      kmod: unexport call_usermodehelper_freeinfo() · ae3cef73
      Boaz Harrosh committed
      call_usermodehelper_freeinfo() is not used outside of kmod.c.  So unexport
      it, and make it static to kmod.c
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae3cef73
    • N
      fat: use fat_msg_ratelimit() in fat__get_entry() · f0aac616
      Namjae Jeon committed
      If an application tries to look up (opendir/readdir/stat) 5000 files on a
      fatfs USB device and the device is unplugged, many messages are logged, as
      shown below.  This makes the application slow.  So use the new
      fat_msg_ratelimit() to decrease the messaging rate.
      
        #> ./file_lookup_testcase ./files_directory/
        usb 2-1.4: USB disconnect, device number 4
        FAT-fs (sda1): FAT read failed (blocknr 2631)
        FAT-fs (sda1): Directory bread(block 396816) failed
        FAT-fs (sda1): Directory bread(block 396817) failed
        FAT-fs (sda1): Directory bread(block 396818) failed
        FAT-fs (sda1): Directory bread(block 396819) failed
        FAT-fs (sda1): Directory bread(block 396820) failed
        FAT-fs (sda1): Directory bread(block 396821) failed
        FAT-fs (sda1): Directory bread(block 396822) failed
        FAT-fs (sda1): Directory bread(block 396823) failed
        FAT-fs (sda1): Directory bread(block 406824) failed
        FAT-fs (sda1): Directory bread(block 406825) failed
        FAT-fs (sda1): Directory bread(block 406826) failed
        FAT-fs (sda1): Directory bread(block 406827) failed
        FAT-fs (sda1): Directory bread(block 406828) failed
        FAT-fs (sda1): Directory bread(block 406829) failed
        FAT-fs (sda1): Directory bread(block 406830) failed
        FAT-fs (sda1): Directory bread(block 406831) failed
        FAT-fs (sda1): Directory bread(block 417696) failed
        FAT-fs (sda1): Directory bread(block 417697) failed
        FAT-fs (sda1): Directory bread(block 417698) failed
        FAT-fs (sda1): Directory bread(block 417699) failed
        FAT-fs (sda1): Directory bread(block 417700) failed
        FAT-fs (sda1): Directory bread(block 417701) failed
        FAT-fs (sda1): Directory bread(block 417702) failed
        FAT-fs (sda1): Directory bread(block 417703) failed
        FAT-fs (sda1): FAT read failed (blocknr 2631)
        FAT-fs (sda1): Directory bread(block 396816) failed
        FAT-fs (sda1): Directory bread(block 396817) failed
        FAT-fs (sda1): Directory bread(block 396818) failed
        FAT-fs (sda1): Directory bread(block 396819) failed
        FAT-fs (sda1): Directory bread(block 396820) failed
        FAT-fs (sda1): Directory bread(block 396821) failed
      Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
      Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0aac616
    • N
      fat: add fat_msg_ratelimit() · b742c341
      Namjae Jeon committed
      Add a fat_msg_ratelimit() to limit the message generation rate.
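
The general idea of limiting message generation can be illustrated with an interval-and-burst limiter. This is a generic sketch and not the implementation of fat_msg_ratelimit(): the class `RateLimit`, the helper `log_limited`, and the default interval and burst values are all assumptions chosen for illustration.

```python
import time


class RateLimit:
    """Allow at most `burst` messages per `interval` seconds."""

    def __init__(self, interval=5.0, burst=10, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock          # injectable so the limiter can be tested
        self.window_start = None
        self.emitted = 0
        self.suppressed = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window once the previous one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.emitted = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True
        self.suppressed += 1
        return False


def log_limited(limiter, sink, message):
    """Append the message to `sink` only if the limiter permits it."""
    if limiter.allow():
        sink.append(message)
```

Under a flood of identical failures, such as repeated "Directory bread(block N) failed" lines after a device is unplugged, only the first `burst` messages in each window reach the log.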
      Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
      Signed-off-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b742c341
    • A
      fat: switch to fsinfo_inode · 78491189
      Artem Bityutskiy committed
      Currently FAT file-system maps the VFS "superblock" abstraction to the
      FSINFO block.  The FSINFO block contains non-essential data about the
      amount of free clusters and the next free cluster.  FAT file-system can
      always find out this information by scanning the FAT table, but having it
      in the FSINFO block may speed things up sometimes.  So FAT file-system
      relies on the VFS superblock write-out services to make sure the FSINFO
      block is written out to the media from time to time.
      
      The whole "superblock write-out" VFS infrastructure is served by the
      'sync_supers()' kernel thread, which wakes up every 5 (by default) seconds
      and writes out all dirty superblocks using the '->write_super()' call-back.
      But the problem with this thread is that it wastes power by waking up the
      system every 5 seconds no matter what.  So we want to kill it completely,
      and thus we need to make file-systems stop using the '->write_super' VFS
      service, and then remove it together with the kernel thread.
      
      This patch switches the FAT FSINFO block management from
      '->write_super()'/'->s_dirt' to 'fsinfo_inode'/'->write_inode'.  Now,
      instead of setting the 's_dirt' flag, we just mark the special
      'fsinfo_inode' inode as dirty and let VFS invoke the '->write_inode'
      call-back when needed, where we write-out the FSINFO block.
      
      This patch also makes sure we do not mark the 'fsinfo_inode' inode as
      dirty if we are not FAT32 (FAT16 and FAT12 do not have the FSINFO block)
      or if we are in R/O mode.
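
The dirtying rules described above can be modelled in a few lines. This is a toy model under stated assumptions, not the kernel code: `FatVolume`, `mark_fsinfo_dirty` and `write_inode` are hypothetical stand-ins for the FAT superblock state and the '->write_inode' call-back.

```python
from dataclasses import dataclass


@dataclass
class FatVolume:
    fat_bits: int               # 12, 16 or 32
    read_only: bool
    fsinfo_dirty: bool = False  # models the dirty state of 'fsinfo_inode'
    writes: int = 0             # how many times FSINFO was written out


def mark_fsinfo_dirty(vol):
    """Mark the FSINFO inode dirty only on writable FAT32 volumes."""
    if vol.read_only or vol.fat_bits != 32:
        return  # FAT12/FAT16 have no FSINFO block; R/O volumes never write
    vol.fsinfo_dirty = True


def write_inode(vol):
    """Stand-in for the write-back call: write FSINFO if it is dirty."""
    if vol.fsinfo_dirty:
        vol.writes += 1
        vol.fsinfo_dirty = False
```

Because write-back is driven by the dirty flag, nothing has to wake up periodically for volumes that never change.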
      
      As a bonus, we can also remove the '->sync_fs()' and '->write_super()' FAT
      call-back functions because they are no longer needed.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78491189
    • A
      fat: mark superblock as dirty less often · 330fe3c4
      Artem Bityutskiy committed
      Preparation for further changes.  This patch touches a few functions in
      fatent.c and prevents them from marking the superblock as dirty
      unnecessarily often.  Namely, instead of marking it as dirty in the
      internal tight loops, do it only once at the end of the functions.  And
      instead of marking it as dirty while holding the FAT table lock, do it
      outside the lock.
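
The pattern being described, doing the expensive notification once and outside the lock, can be sketched as follows. The `Volume` class and its methods are hypothetical and only model the shape of the change, not the actual fatent.c functions.

```python
import threading


class Volume:
    def __init__(self, free_clusters):
        self.fat_lock = threading.Lock()
        self.free_clusters = free_clusters
        self.dirty_marks = 0  # counts how often the costly call runs

    def mark_dirty(self):
        self.dirty_marks += 1

    def free_chain_eager(self, count):
        """Old shape: mark dirty per cluster, while holding the lock."""
        with self.fat_lock:
            for _ in range(count):
                self.free_clusters += 1
                self.mark_dirty()

    def free_chain_deferred(self, count):
        """New shape: record that something changed, mark once after unlock."""
        changed = False
        with self.fat_lock:
            for _ in range(count):
                self.free_clusters += 1
                changed = True
        if changed:
            self.mark_dirty()
```

Freeing 100 clusters costs 100 dirty marks in the eager version and one in the deferred version, and the lock is held for less work.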
      
      The reason for this patch is that marking the superblock as dirty will
      soon become a somewhat heavier operation, so it is cleaner to do this
      only when it is necessary.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      330fe3c4