1. 24 3月, 2012 22 次提交
    • O
      usermodehelper: implement UMH_KILLABLE · d0bd587a
      Oleg Nesterov 提交于
      Implement UMH_KILLABLE, should be used along with UMH_WAIT_EXEC/PROC.
      The caller must ensure that subprocess_info->path/etc can not go away
      until call_usermodehelper_freeinfo().
      
      call_usermodehelper_exec(UMH_KILLABLE) does
      wait_for_completion_killable.  If it fails, it uses
      xchg(&sub_info->complete, NULL) to serialize with umh_complete() which
      does the same xhcg() to access sub_info->complete.
      
      If call_usermodehelper_exec wins, it can safely return.  umh_complete()
      should get NULL and call call_usermodehelper_freeinfo().
      
      Otherwise we know that umh_complete() was already called, in this case
      call_usermodehelper_exec() falls back to wait_for_completion() which
      should succeed "very soon".
      
      Note: UMH_NO_WAIT == -1 but it obviously should not be used with
      UMH_KILLABLE.  We delay the neccessary cleanup to simplify the back
      porting.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0bd587a
    • D
      ptrace: remove PTRACE_SEIZE_DEVEL bit · ee00560c
      Denys Vlasenko 提交于
      PTRACE_SEIZE code is tested and ready for production use, remove the
      code which requires special bit in data argument to make PTRACE_SEIZE
      work.
      
      Strace team prepares for a new release of strace, and we would like to
      ship the code which uses PTRACE_SEIZE, preferably after this change goes
      into released kernel.
      Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee00560c
    • D
      ptrace: renumber PTRACE_EVENT_STOP so that future new options and events can match · 5cdf389a
      Denys Vlasenko 提交于
      PTRACE_EVENT_foo and PTRACE_O_TRACEfoo used to match.
      
      New PTRACE_EVENT_STOP is the first event which has no corresponding
      PTRACE_O_TRACE option.  If we will ever want to add another such option,
      its PTRACE_EVENT's value will collide with PTRACE_EVENT_STOP's value.
      
      This patch changes PTRACE_EVENT_STOP value to prevent this.
      
      While at it, added a comment - the one atop PTRACE_EVENT block, saying
      "Wait extended result codes for the above trace options", is not true
      for PTRACE_EVENT_STOP.
      Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5cdf389a
    • D
      ptrace: simplify PTRACE_foo constants and PTRACE_SETOPTIONS code · 86b6c1f3
      Denys Vlasenko 提交于
      Exchange PT_TRACESYSGOOD and PT_PTRACE_CAP bit positions, which makes
      PT_option bits contiguous and therefore makes code in
      ptrace_setoptions() much simpler.
      
      Every PTRACE_O_TRACEevent is defined to (1 << PTRACE_EVENT_event)
      instead of using explicit numeric constants, to ensure we don't mess up
      relationship between bit positions and event ids.
      
      PT_EVENT_FLAG_SHIFT was not particularly useful, PT_OPT_FLAG_SHIFT with
      value of PT_EVENT_FLAG_SHIFT-1 is easier to use.
      
      PT_TRACE_MASK constant is nuked, the only its use is replaced by
      (PTRACE_O_MASK << PT_OPT_FLAG_SHIFT).
      Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86b6c1f3
    • O
      ptrace: don't send SIGTRAP on exec if SEIZED · b1845ff5
      Oleg Nesterov 提交于
      ptrace_event(PTRACE_EVENT_EXEC) sends SIGTRAP if PT_TRACE_EXEC is not
      set.  This is because this SIGTRAP predates PTRACE_O_TRACEEXEC option,
      we do not need/want this with PT_SEIZED which can set the options during
      attach.
      Suggested-by: NPedro Alves <palves@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Chris Evans <scarybeasts@gmail.com>
      Cc: Indan Zupancic <indan@nul.nu>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1845ff5
    • O
      ptrace: the killed tracee should not enter the syscall · 15cab952
      Oleg Nesterov 提交于
      Another old/known problem.  If the tracee is killed after it reports
      syscall_entry, it starts the syscall and debugger can't control this.
      This confuses the users and this creates the security problems for
      ptrace jailers.
      
      Change tracehook_report_syscall_entry() to return non-zero if killed,
      this instructs syscall_trace_enter() to abort the syscall.
      Reported-by: NChris Evans <scarybeasts@gmail.com>
      Tested-by: NIndan Zupancic <indan@nul.nu>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Pedro Alves <palves@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15cab952
    • H
      poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Hans Verkuil 提交于
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      directly.
      
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      
      For qproc it was especially important to change its name since the
      behavior of that field changes with this patch since this function pointer
      can now be NULL when that wasn't possible in the past.
      
      Any driver accessing the qproc or key fields directly will now fail to compile.
      
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL, drivers can never rely on it being one or
      the other as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      
      There are only three things a driver can do with the wait argument:
      
      1) obtain the key field:
      
      	events = wait ? wait->key : ~0;
      
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
         unnecessarily.
      
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
      
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
      
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
      
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: NJonathan Corbet <corbet@lwn.net>
      Reviewed-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      626cf236
    • D
      crc32: bolt on crc32c · 46c5801e
      Darrick J. Wong 提交于
      Reuse the existing crc32 code to stamp out a crc32c implementation.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Bob Pearson <rpearson@systemfabricworks.com>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46c5801e
    • J
      include/ and checkpatch: prefer __scanf to __attribute__((format(scanf,...) · 6061d949
      Joe Perches 提交于
      It's equivalent to __printf, so prefer __scanf.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6061d949
    • K
      leds-lm3530: support pwm input mode · bb982009
      Kim, Milo 提交于
      * add 'struct lm3530_pwm_data' in the platform data
        The pwm data is the platform specific functions which generate the pwm.
        The pwm data is only valid when brightness is pwm input mode.
        Functions should be implemented by the pwm driver.
        pwm_set_intensity() : set duty of pwm.
        pwm_get_intensity() : get current the brightness.
      
      * brightness control by pwm
        If the control mode is pwm, then brightness is changed by the duty of
        pwm=.  So pwm platform function should be called in lm3530_brightness_set().
      
      * do not update brightness register when pwm input mode
        In pwm input mode, brightness register is not used.
        If any value is updated in this register, then the led will be off.
      
      * when input mode is changed, set duty of pwm to 0 if unnecessary.
      Signed-off-by: NMilo(Woogyom) Kim <milo.kim@ti.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb982009
    • K
      drivers/leds/leds-lp5521.c: support led pattern data · 011af7bc
      Kim, Milo 提交于
      The lp5521 has autonomous operation mode without external control.
      Using lp5521_platform_data, various led patterns can be configurable.
      For supporting this feature, new functions and device attribute are
      added.
      
      Structure of lp5521_led_pattern: 3 channels are supported - red, green
      and blue.  Pattern(s) of each channel and numbers of pattern(s) are
      defined in the pla= tform data.  Pattern data are hexa codes which
      include pattern commands such like set pwm, wait, ramp up/down, branch
      and so on.
      
      Pattern mode functions:
       * lp5521_clear_program_memory
      	Before running new led pattern, program memory should be cleared.
       * lp5521_write_program_memory
      	Pattern data updated in the program memory via the i2c.
       * lp5521_get_pattern
      	Get pattern from predefined in the platform data.
       * lp5521_run_led_pattern
      	Stop current pattern or run new pattern.
      	Transition time is required between different operation mode.
      
      Device attribute - 'led_pattern': To load specific led pattern, new device
      attribute is added.
      
      When the lp5521 driver is unloaded, stop current led pattern mode.
      
      Documentation updated : description about how to define the led patterns
      and example.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NMilo(Woogyom) Kim <milo.kim@ti.com>
      Acked-by: NLinus Walleij <linus.walleij@linaro.org>
      Cc: Arun MURTHY <arun.murthy@stericsson.com>
      Cc: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      011af7bc
    • K
      drivers/leds/leds-lp5521.c: add 'update_config' in the lp5521_platform_data · 3b49aacd
      Kim, Milo 提交于
      The value of CONFIG register(Addr 08h) is configurable.  For supporting
      this feature, update_config is added in the platform data.  If
      'update_config' is not defined, the default value is 'LP5521_PWRSAVE_EN |
      LP5521_CP_MODE_AUTO | LP5521_R_TO_BATT'.
      
      To define CONFIG register in the platform data, the bit definitions were
      mo= ved to the header file.
      
      Documentation updated : description about 'update_config' and example.
      Signed-off-by: NMilo(Woogyom) Kim <milo.kim@ti.com>
      Acked-by: NLinus Walleij <linus.walleij@linaro.org>
      Cc: Arun MURTHY <arun.murthy@stericsson.com>
      Cc: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b49aacd
    • K
      drivers/leds/leds-lp5521.c: add 'name' in the lp5521_led_config · 5ae4e8a7
      Kim, Milo 提交于
      The name of each led channel can be configurable.  For the compatibility,
      the name is set to default value(xx:channelN) when 'name' is not defined.
      Signed-off-by: NMilo(Woogyom) Kim <milo.kim@ti.com>
      Acked-by: NLinus Walleij <linus.walleij@linaro.org>
      Cc: Arun MURTHY <arun.murthy@stericsson.com>
      Cc: Srinidhi Kasagar <srinidhi.kasagar@stericsson.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ae4e8a7
    • A
      bitops: introduce for_each_clear_bit() · 03f4a822
      Akinobu Mita 提交于
      Introduce for_each_clear_bit() and for_each_clear_bit_from().  They are
      similar to for_each_set_bit() and list_for_each_set_bit_from(), but they
      iterate over all the cleared bits in a memory region.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Stefano Panella <stefano.panella@csr.com>
      Cc: David Vrabel <david.vrabel@csr.com>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03f4a822
    • A
      bitops: remove for_each_set_bit_cont() · 0a329d2d
      Akinobu Mita 提交于
      Remove for_each_set_bit_cont() after confirming that no one uses
      for_each_set_bit_cont() anymore.
      
      [sfr@canb.auug.org.au: regmap: cope with bitops API change]
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a329d2d
    • A
      bitops: rename for_each_set_bit_cont() in favor of analogous list.h function · 307b1cd7
      Akinobu Mita 提交于
      This renames for_each_set_bit_cont() to for_each_set_bit_from() because
      it is analogous to list_for_each_entry_from() in list.h rather than
      list_for_each_entry_continue().
      
      This doesn't remove for_each_set_bit_cont() for now.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      307b1cd7
    • K
      backlight: new backlight driver for LP855x devices · 7be865ab
      Kim, Milo 提交于
      THis driver supports TI LP8550/LP8551/LP8552/LP8553/LP8556 backlight
      devices.
      
      The brightness can be controlled by the I2C or PWM input.  The lp855x
      driver provides both modes.  For the PWM control, pwm-specific functions
      can be defined in the platform data.  And some information can be read
      via the sysfs(lp855x device attributes).
      
      For details, please refer to Documentation/backlight/lp855x-driver.txt.
      
      [axel.lin@gmail.com: add missing mutex_unlock in lp855x_read_byte() error path]
      [axel.lin@gmail.com: check platform data in lp855x_probe()]
      [axel.lin@gmail.com: small cleanups]
      [dan.carpenter@oracle.com: silence a compiler warning]
      [axel.lin@gmail.com: use id->driver_data to differentiate lp855x chips]
      [akpm@linux-foundation.org: simplify boolean return expression]
      Signed-off-by: NMilo(Woogyom) Kim <milo.kim@ti.com>
      Signed-off-by: NAxel Lin <axel.lin@gmail.com>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7be865ab
    • L
      prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision · ebec18a6
      Lennart Poettering 提交于
      Userspace service managers/supervisors need to track their started
      services.  Many services daemonize by double-forking and get implicitly
      re-parented to PID 1.  The service manager will no longer be able to
      receive the SIGCHLD signals for them, and is no longer in charge of
      reaping the children with wait().  All information about the children is
      lost at the moment PID 1 cleans up the re-parented processes.
      
      With this prctl, a service manager process can mark itself as a sort of
      'sub-init', able to stay as the parent for all orphaned processes
      created by the started services.  All SIGCHLD signals will be delivered
      to the service manager.
      
      Receiving SIGCHLD and doing wait() is in cases of a service-manager much
      preferred over any possible asynchronous notification about specific
      PIDs, because the service manager has full access to the child process
      data in /proc and the PID can not be re-used until the wait(), the
      service-manager itself is in charge of, has happened.
      
      As a side effect, the relevant parent PID information does not get lost
      by a double-fork, which results in a more elaborate process tree and
      'ps' output:
      
      before:
        # ps afx
        253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
        328 ?        S      0:00 /usr/sbin/modem-manager
        608 ?        Sl     0:00 /usr/libexec/colord
        658 ?        Sl     0:00 /usr/libexec/upowerd
        819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
        916 ?        Sl     0:00 /usr/libexec/udisks-daemon
        917 ?        S      0:00  \_ udisks-daemon: not polling any devices
      
      after:
        # ps afx
        294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
        449 ?        S      0:00  \_ /usr/sbin/modem-manager
        635 ?        Sl     0:00  \_ /usr/libexec/colord
        705 ?        Sl     0:00  \_ /usr/libexec/upowerd
        959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
        960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
        977 ?        Sl     0:00  \_ /usr/libexec/packagekitd
      
      This prctl is orthogonal to PID namespaces.  PID namespaces are isolated
      from each other, while a service management process usually requires the
      services to live in the same namespace, to be able to talk to each
      other.
      
      Users of this will be the systemd per-user instance, which provides
      init-like functionality for the user's login session and D-Bus, which
      activates bus services on-demand.  Both need init-like capabilities to
      be able to properly keep track of the services they start.
      
      Many thanks to Oleg for several rounds of review and insights.
      
      [akpm@linux-foundation.org: fix comment layout and spelling]
      [akpm@linux-foundation.org: add lengthy code comment from Oleg]
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLennart Poettering <lennart@poettering.net>
      Signed-off-by: NKay Sievers <kay.sievers@vrfy.org>
      Acked-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebec18a6
    • J
      consolidate WARN_...ONCE() static variables · 7ccaba53
      Jan Beulich 提交于
      Due to the alignment of following variables, these typically consume
      more than just the single byte that 'bool' requires, and as there are a
      few hundred instances, the cache pollution (not so much the waste of
      memory) sums up.  Put these variables into their own section, outside of
      any half way frequently used memory range.
      
      Do the same also to the __warned variable of rcu_lockdep_assert().
      (Don't, however, include the ones used by printk_once() and alike, as
      they can potentially be hot.)
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ccaba53
    • B
      headers: include linux/types.h where appropriate · 10db4e1e
      Bobby Powers 提交于
      This addresses some header check warnings.  DRM headers which include
      "drm.h" have been excluded, as they indirectly include types.h.
      Signed-off-by: NBobby Powers <bobbypowers@gmail.com>
      Cc: Chris Ball <cjb@laptop.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10db4e1e
    • C
      nmi watchdog: do not use cpp symbol in Kconfig · d314d74c
      Cong Wang 提交于
      ARCH_HAS_NMI_WATCHDOG is a macro defined by arch, but config
      HARDLOCKUP_DETECTOR depends on it.  This is wrong, ARCH_HAS_NMI_WATCHDOG
      has to be a Kconfig config, and arch's need it should select it
      explicitly.
      Signed-off-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Acked-by: NMike Frysinger <vapier@gentoo.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d314d74c
    • M
      magic.h: move some FS magic numbers into magic.h · b502bd11
      Muthu Kumar 提交于
      - Move open-coded filesystem magic numbers into magic.h
      
      - Rearrange magic.h so that the filesystem-related constants are grouped
        together.
      Signed-off-by: NMuthukumar R <muthur@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b502bd11
  2. 22 3月, 2012 18 次提交
    • N
      thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE · d8c37c48
      Naoya Horiguchi 提交于
      These macros will be used in a later patch, where all usages are expected
      to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE.  But to
      detect unexpected usages, we convert the existing BUG() to BUILD_BUG().
      
      [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8c37c48
    • K
      memcg: fix performance of mem_cgroup_begin_update_page_stat() · 4331f7d3
      KAMEZAWA Hiroyuki 提交于
      mem_cgroup_begin_update_page_stat() should be very fast because it's
      called very frequently.  Now, it needs to look up page_cgroup and its
      memcg....this is slow.
      
      This patch adds a global variable to check "any memcg is moving or not".
      With this, the caller doesn't need to visit page_cgroup and memcg.
      
      Here is a test result.  A test program makes page faults onto a file,
      MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the
      range by madvise() and page fault again.  This program causes 26214400
      times of page fault onto a file(size was 1G.) and shows shows the cost of
      mem_cgroup_begin_update_page_stat().
      
      Before this patch for mem_cgroup_begin_update_page_stat()
      
          [kamezawa@bluextal test]$ time ./mmap 1G
      
          real    0m21.765s
          user    0m5.999s
          sys     0m15.434s
      
          27.46%     mmap  mmap               [.] reader
          21.15%     mmap  [kernel.kallsyms]  [k] page_fault
           9.17%     mmap  [kernel.kallsyms]  [k] filemap_fault
           2.96%     mmap  [kernel.kallsyms]  [k] __do_fault
           2.83%     mmap  [kernel.kallsyms]  [k] __mem_cgroup_begin_update_page_stat
      
      After this patch
      
          [root@bluextal test]# time ./mmap 1G
      
          real    0m21.373s
          user    0m6.113s
          sys     0m15.016s
      
      In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.
      
      Note: we may be able to remove this optimization in future if
            we can get pointer to memcg directly from struct page.
      
      [akpm@linux-foundation.org: don't return a void]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4331f7d3
    • K
      memcg: remove PCG_FILE_MAPPED · 2ff76f11
      KAMEZAWA Hiroyuki 提交于
      With the new lock scheme for updating memcg's page stat, we don't need a
      flag PCG_FILE_MAPPED which was duplicated information of page_mapped().
      
      [hughd@google.com: cosmetic fix]
      [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ff76f11
    • K
      memcg: use new logic for page stat accounting · 89c06bd5
      KAMEZAWA Hiroyuki 提交于
      Now, page-stat-per-memcg is recorded into per page_cgroup flag by
      duplicating page's status into the flag.  The reason is that memcg has a
      feature to move a page from a group to another group and we have race
      between "move" and "page stat accounting",
      
      Under current logic, assume CPU-A and CPU-B.  CPU-A does "move" and CPU-B
      does "page stat accounting".
      
      When CPU-A goes 1st,
      
                  CPU-A                           CPU-B
                                          update "struct page" info.
          move_lock_mem_cgroup(memcg)
          see pc->flags
          copy page stat to new group
          overwrite pc->mem_cgroup.
          move_unlock_mem_cgroup(memcg)
                                          move_lock_mem_cgroup(mem)
                                          set pc->flags
                                          update page stat accounting
                                          move_unlock_mem_cgroup(mem)
      
      stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
      (CPU-A) doesn't see changes in "struct page" information.
      
      But it's costly to have the same information both in 'struct page' and
      'struct page_cgroup'.  And, there is a potential problem.
      
      For example, assume we have PG_dirty accounting in memcg.
      PG_..is a flag for struct page.
      PCG_ is a flag for struct page_cgroup.
      (This is just an example. The same problem can be found in any
       kind of page stat accounting.)
      
      	  CPU-A                               CPU-B
            TestSet PG_dirty
            (delay)                        TestClear PG_dirty
                                           if (TestClear(PCG_dirty))
                                                memcg->nr_dirty--
            if (TestSet(PCG_dirty))
                memcg->nr_dirty++
      
      Here, memcg->nr_dirty = +1, this is wrong.  This race was reported by Greg
      Thelen <gthelen@google.com>.  Now, only FILE_MAPPED is supported but
      fortunately, it's serialized by page table lock and this is not real bug,
      _now_,
      
      If this potential problem is caused by having duplicated information in
      struct page and struct page_cgroup, we may be able to fix this by using
      original 'struct page' information.  But we'll have a problem in "move
      account"
      
      Assume we use only PG_dirty.
      
               CPU-A                   CPU-B
          TestSet PG_dirty
          (delay)                    move_lock_mem_cgroup()
                                     if (PageDirty(page))
                                            new_memcg->nr_dirty++
                                     pc->mem_cgroup = new_memcg;
                                     move_unlock_mem_cgroup()
          move_lock_mem_cgroup()
          memcg = pc->mem_cgroup
          new_memcg->nr_dirty++
      
      accounting information may be double-counted.  This was original reason to
      have PCG_xxx flags but it seems PCG_xxx has another problem.
      
      I think we need a bigger lock as
      
           move_lock_mem_cgroup(page)
           TestSetPageDirty(page)
           update page stats (without any checks)
           move_unlock_mem_cgroup(page)
      
      This fixes both of problems and we don't have to duplicate page flag into
      page_cgroup.  Please note: move_lock_mem_cgroup() is held only when there
      are possibility of "account move" under the system.  So, in most path,
      status update will go without atomic locks.
      
      This patch introduces mem_cgroup_begin_update_page_stat() and
      mem_cgroup_end_update_page_stat() both should be called at modifying
      'struct page' information if memcg takes care of it.  as
      
           mem_cgroup_begin_update_page_stat()
           modify page information
           mem_cgroup_update_page_stat()
           => never check any 'struct page' info, just update counters.
           mem_cgroup_end_update_page_stat().
      
      This patch is slow because we need to call begin_update_page_stat()/
      end_update_page_stat() regardless of accounted will be changed or not.  A
      following patch adds an easy optimization and reduces the cost.
      
      [akpm@linux-foundation.org: s/lock/locked/]
      [hughd@google.com: fix deadlock by avoiding stat lock when anon]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89c06bd5
    • K
      memcg: remove PCG_MOVE_LOCK flag from page_cgroup · 312734c0
      KAMEZAWA Hiroyuki 提交于
      PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting
      pc->mem_cgroup and page statistics accounting per memcg.  This lock helps
      to avoid the race but the race is very rare because moving tasks between
      cgroup is not a usual job.  So, it seems using 1bit per page is too
      costly.
      
      This patch changes this lock as per-memcg spinlock and removes
      PCG_MOVE_LOCK.
      
      If smaller lock is required, we'll be able to add some hashes but I'd like
      to start from this.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      312734c0
    • K
      memcg: kill dead prev_priority stubs · a710920c
      Konstantin Khlebnikov 提交于
      This code was removed in 25edde03 ("vmscan: kill prev_priority
      completely")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a710920c
    • K
      memcg: remove PCG_CACHE page_cgroup flag · b2402857
      KAMEZAWA Hiroyuki 提交于
      We record 'the page is cache' with the PCG_CACHE bit in page_cgroup.
      Here, "CACHE" means anonymous user pages (and SwapCache).  This doesn't
      include shmem.
      
      Considering callers, at charge/uncharge, the caller should know what the
      page is and we don't need to record it by using one bit per page.
      
      This patch removes PCG_CACHE bit and make callers of
      mem_cgroup_charge_statistics() to specify what the page is.
      
      About page migration: Mapping of the used page is not touched during migra
      tion (see page_remove_rmap) so we can rely on it and push the correct
      charge type down to __mem_cgroup_uncharge_common from end_migration for
      unused page.  The force flag was misleading was abused for skipping the
      needless page_mapped() / PageCgroupMigration() check, as we know the
      unused page is no longer mapped and cleared the migration flag just a few
      lines up.  But doing the checks is no biggie and it's not worth adding
      another flag just to skip them.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [hughd@google.com: fix PageAnon uncharging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2402857
    • H
      cgroup: revert ss_id_lock to spinlock · 42aee6c4
      Hugh Dickins 提交于
      Commit c1e2ee2d ("memcg: replace ss->id_lock with a rwlock") has now
      been seen to cause the unfair behavior we should have expected from
      converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose
      get_new_cssid() is waiting for the wlock, while there are 19 tasks using
      the rlock in css_get_next() to get on with their memcg workload (in an
      artificial test, admittedly).  Yet lib/idr.c was made suitable for RCU
      way back: revert that commit, restoring ss->id_lock to a spinlock.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42aee6c4
    • H
      memcg: replace MEM_CONT by MEM_RES_CTLR · 31a79235
      Hugh Dickins 提交于
      Correct an #endif comment in memcontrol.h from MEM_CONT to MEM_RES_CTLR.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31a79235
    • K
      page_alloc.c: remove add_from_early_node_map() · 8d13bddd
      Kautuk Consul 提交于
      add_from_early_node_map() is unused.
      Signed-off-by: NKautuk Consul <consul.kautuk@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d13bddd
    • S
      hugetlbfs: fix alignment of huge page requests · 40716e29
      Steven Truelove 提交于
      When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
      PAGE_SIZE, but this is not sufficient.
      
      Modify hugetlb_file_setup() to align requests to the huge page size, and
      to accept an address argument so that all alignment checks can be
      performed in hugetlb_file_setup(), rather than in its callers.  Change
      newseg() and mmap_pgoff() to match the new prototype and eliminate a now
      redundant alignment check.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NSteven Truelove <steven.truelove@utoronto.ca>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40716e29
    • D
      mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() · 05af2e10
      David Rientjes 提交于
      sync_mm_rss() can only be used for current to avoid race conditions in
      iterating and clearing its per-task counters.  Remove the task argument
      for it and its helper function, __sync_task_rss_stat(), to avoid thinking
      it can be used safely for anything other than current.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05af2e10
    • D
      hugepages: fix use after free bug in "quota" handling · 90481622
      David Gibson 提交于
      hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
      general quota handling code, and they don't much resemble its behaviour.
      Rather than being about maintaining limits on on-disk block usage by
      particular users, they are instead about maintaining limits on in-memory
      page usage (including anonymous MAP_PRIVATE copied-on-write pages)
      associated with a particular hugetlbfs filesystem instance.
      
      Worse, they work by having callbacks to the hugetlbfs filesystem code from
      the low-level page handling code, in particular from free_huge_page().
      This is a layering violation of itself, but more importantly, if the
      kernel does a get_user_pages() on hugepages (which can happen from KVM
      amongst others), then the free_huge_page() can be delayed until after the
      associated inode has already been freed.  If an unmount occurs at the
      wrong time, even the hugetlbfs superblock where the "quota" limits are
      stored may have been freed.
      
      Andrew Barry proposed a patch to fix this by having hugepages, instead of
      storing a pointer to their address_space and reaching the superblock from
      there, had the hugepages store pointers directly to the superblock,
      bumping the reference count as appropriate to avoid it being freed.
      Andrew Morton rejected that version, however, on the grounds that it made
      the existing layering violation worse.
      
      This is a reworked version of Andrew's patch, which removes the extra, and
      some of the existing, layering violation.  It works by introducing the
      concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
      finite logical pool of hugepages to allocate from.  hugetlbfs now creates
      a subpool for each filesystem instance with a page limit set, and a
      pointer to the subpool gets added to each allocated hugepage, instead of
      the address_space pointer used now.  The subpool has its own lifetime and
      is only freed once all pages in it _and_ all other references to it (i.e.
      superblocks) are gone.
      
      subpools are optional - a NULL subpool pointer is taken by the code to
      mean that no subpool limits are in effect.
      
      Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
      quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
      http://marc.info/?l=linux-mm&m=126928970510627&w=1
      
      v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
      alloc_huge_page() - since it already takes the vma, it is not necessary.
      Signed-off-by: NAndrew Barry <abarry@cray.com>
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90481622
    • D
      hugetlb: cleanup hugetlb.h · a1d776ee
      David Gibson 提交于
      Make a couple of small cleanups to linux/include/hugetlb.h.  The
      set_file_hugepages() function, which was not used anywhere is removed,
      and the hugetlbfs_config and hugetlbfs_inode_info structures with its
      HUGETLBFS_I helper function are moved into inode.c, the only place they
      were used.
      
      These structures are really linked to the hugetlbfs filesystem
      specifically not to hugepage mm handling in general, so they belong in
      the filesystem code not in a generally available header.
      
      It would be nice to move the hugetlbfs_sb_info (superblock) structure in
      there as well, but it's currently needed in a number of places via the
      hstate_vma() and hstate_inode().
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Andrew Barry <abarry@cray.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1d776ee
    • M
      cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Mel Gorman 提交于
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating at large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
    • D
      mm, memcg: pass charge order to oom killer · e845e199
      David Rientjes 提交于
      The oom killer typically displays the allocation order at the time of oom
      as a part of its diangostic messages (for global, cpuset, and mempolicy
      ooms).
      
      The memory controller may also pass the charge order to the oom killer so
      it can emit the same information.  This is useful in determining how large
      the memory allocation is that triggered the oom killer.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e845e199
    • K
      mm: drain percpu lru add/rotate page-vectors on cpu hot-unplug · f0cb3c76
      Konstantin Khlebnikov 提交于
      This cpu hotplug hook was accidentally removed in commit 00a62ce9
      ("mm: fix Committed_AS underflow on large NR_CPUS environment")
      
      The visible effect of this accident: some pages are borrowed in per-cpu
      page-vectors.  Truncate can deal with it, but these pages cannot be
      reused while this cpu is offline.  So this is like a temporary memory
      leak.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0cb3c76
    • D
      thp: allow a hwpoisoned head page to be put back to LRU · 385de357
      Dean Nelson 提交于
      Andrea Arcangeli pointed out to me that a check in __memory_failure()
      which was intended to prevent THP tail pages from being checked for the
      absence of the PG_lru flag (something that is always the case), was also
      preventing THP head pages from being checked.
      
      A THP head page could actually benefit from the call to shake_page() by
      ending up being put back to a LRU, provided it had been waiting in a
      pagevec array.
      
      Andrea suggested that the "!PageTransCompound(p)" in the if-statement
      should be replaced by a "!PageTransTail(p)", thus allowing THP head pages
      to be checked and possibly shaken.
      Signed-off-by: NDean Nelson <dnelson@redhat.com>
      Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      385de357