1. 27 Jul, 2011 40 commits
    • T
      exec: do not call request_module() twice from search_binary_handler() · 91219352
      Tetsuo Handa committed
      Currently, search_binary_handler() tries to load binary loader module
      using request_module() if a loader for the requested program is not yet
      loaded.  But second attempt of request_module() does not affect the result
      of search_binary_handler().
      
      If request_module() triggered recursion, calling request_module() twice
      causes 2 to the power of MAX_KMOD_CONCURRENT (= 50) repetitions.  It is
      not an infinite loop but is sufficient for users to consider as a hang up.
      
      Therefore, this patch stops calling request_module() twice, which leaves 1
      to the power of MAX_KMOD_CONCURRENT repetitions in case of recursion.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Richard Weinberger <richard@nod.at>
      Tested-by: Richard Weinberger <richard@nod.at>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
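The arithmetic in the request_module() commit above can be sketched as follows. This only models the repetition count, assuming every recursion level makes the same number of nested attempts; it is not the kernel code path, and `deepest_level_attempts` is a hypothetical helper name.

```c
#include <stdint.h>

/* Depth cap quoted in the commit message (MAX_KMOD_CONCURRENT = 50). */
#define DEMO_MAX_KMOD_CONCURRENT 50

/*
 * Hypothetical helper: how many times the deepest recursion level is
 * reached when each level makes calls_per_level nested attempts.
 * The result is calls_per_level raised to the power of depth.
 */
uint64_t deepest_level_attempts(unsigned int calls_per_level,
                                unsigned int depth)
{
    uint64_t total = 1;

    for (unsigned int i = 0; i < depth; i++)
        total *= calls_per_level;
    return total;
}
```

With two attempts per level and a depth of 50 the count is 2^50, roughly 10^15, while a single attempt per level keeps it at 1.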
    • M
      fs/exec.c: use BUILD_BUG_ON for VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP · aacb3d17
      Michal Hocko committed
      Commit a8bef8ff ("mm: migration: avoid race between
      shift_arg_pages() and rmap_walk() during migration by not migrating
      temporary stacks") introduced a BUG_ON() to ensure that VM_STACK_FLAGS
      and VM_STACK_INCOMPLETE_SETUP do not overlap.  The check is a compile
      time one, so BUILD_BUG_ON is more appropriate.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
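A userspace sketch of the idea behind the BUILD_BUG_ON commit above: when both operands are constants, the overlap check can be evaluated by the compiler instead of at run time. It uses C11 `_Static_assert` as a stand-in for the kernel macro, and the flag values are made up for illustration; the real VM_* constants are defined in the kernel headers.

```c
/* Illustrative values only, not the kernel's actual flag bits. */
#define DEMO_VM_STACK_FLAGS            0x0133u
#define DEMO_VM_STACK_INCOMPLETE_SETUP 0x0400u

/* Compile-time check: the build fails if the two masks share a bit. */
_Static_assert((DEMO_VM_STACK_FLAGS & DEMO_VM_STACK_INCOMPLETE_SETUP) == 0,
               "stack flags must not overlap the incomplete-setup marker");

/* Runtime view of the same condition, present only for demonstration. */
unsigned int demo_flag_overlap(void)
{
    return DEMO_VM_STACK_FLAGS & DEMO_VM_STACK_INCOMPLETE_SETUP;
}
```

Changing either constant so that the masks overlap makes compilation fail, which is the property the commit is after.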
    • V
      proc: fix a race in do_io_accounting() · 293eb1e7
      Vasiliy Kulikov committed
      If an inode's mode permits opening /proc/PID/io and the resulting file
      descriptor is kept across execve() of a setuid or similar binary, the
      ptrace_may_access() check tries to prevent using this fd against the
      task with escalated privileges.
      
      Unfortunately, there is a race in the check against execve().  If
      execve() is processed after the ptrace check, but before the actual io
      information gathering, io statistics will be gathered from the
      privileged process.  At least in theory this might lead to gathering
      sensitive information (like ssh/ftp password length) that wouldn't be
      available otherwise.
      
      Holding task->signal->cred_guard_mutex while gathering the io
      information should protect against the race.
      
      The order of locking is similar to the one inside of ptrace_attach():
      first goes cred_guard_mutex, then lock_task_sighand().
      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      procfs: return ENOENT on opening a being-removed proc entry · d2857e79
      Daisuke Ogino committed
      Change the return value to ENOENT.  This value is then returned when
      opening a proc entry that has been removed.  For example,
      open("/proc/bus/pci/XX/YY") when the corresponding device is being
      hot-removed.
      Signed-off-by: Daisuke Ogino <ogino.daisuke@jp.fujitsu.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      h8300/m68k/xtensa: __FD_ISSET should return 0/1 · 5296f6d3
      Andrew Morton committed
      Harmonise these return values with other architectures.  In some cases
      this affects all compilers and in other cases non-gcc compilers only.
      
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
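What "should return 0/1" means for the __FD_ISSET commit above can be sketched in plain C. A bare `word & bit` yields either 0 or the bit's value, so callers comparing the result against 1 would misbehave; comparing against zero normalises it. The `demo_` names and the four-word set are assumptions for illustration, not the per-architecture kernel definitions.

```c
#include <limits.h>

#define DEMO_NFDBITS ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical stand-in for an fd_set: a small array of bit words. */
struct demo_fd_set {
    unsigned long bits[4];
};

/* Mark descriptor fd as present in the set. */
void demo_fd_set_bit(unsigned int fd, struct demo_fd_set *set)
{
    set->bits[fd / DEMO_NFDBITS] |= 1UL << (fd % DEMO_NFDBITS);
}

/* Returns exactly 0 or 1, never the raw masked word. */
int demo_fd_isset(unsigned int fd, const struct demo_fd_set *set)
{
    return (set->bits[fd / DEMO_NFDBITS] & (1UL << (fd % DEMO_NFDBITS))) != 0;
}
```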
    • O
      do_coredump: fix the "ispipe" error check · 99b64567
      Oleg Nesterov committed
      do_coredump() assumes that if format_corename() fails it should return
      -ENOMEM.  This is not true, for example cn_print_exe_file() can propagate
      the error from d_path.  Even if it was true, this is too fragile.  Change
      the code to check "ispipe < 0".
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      coredump: escape / in hostname and comm · 2c563731
      Jiri Slaby committed
      Change every occurrence of / in comm and hostname to !.  If the process
      changes its name to contain /, the core is not dumped (if the directory
      tree doesn't exist like that).  The same with hostname being something
      like myhost/3.  Fix this behaviour by using the escape loop used in %E.
      (We extract it to a separate function.)
      
      Now both with comm == myprocess/1 and hostname == myhost/1, the core is
      dumped like (kernel.core_pattern='core.%p.%e.%h'):
      core.2349.myprocess!1.myhost!1
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
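The escaping described in the coredump commit above amounts to a single pass over the string. The sketch below is a userspace illustration with a hypothetical function name, not the helper the patch extracts in fs/exec.c.

```c
/*
 * Replace every '/' with '!' in place so the string can be used as a
 * single path component in a core file name.
 */
void demo_escape_slashes(char *s)
{
    for (; *s != '\0'; s++)
        if (*s == '/')
            *s = '!';
}
```

Applied to "myprocess/1" and "myhost/1" this produces the "myprocess!1" and "myhost!1" components of the file name shown in the commit message.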
    • J
      coredump: use task comm instead of (unknown) · 3141c8b1
      Jiri Slaby committed
      If we don't know the file corresponding to the binary (i.e.  exe_file is
      unknown), use "task->comm (path unknown)" instead of simple "(unknown)"
      as suggested by ak.
      
      The fallback is the same as %e except it will append "(path unknown)".
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ptrace: unify show_regs() prototype · 0e9a6cb5
      Mike Frysinger committed
      [ oleg@redhat.com: no need to declare show_regs() in ptrace.h, sched.h does this ]
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      cpusets: randomize node rotor used in cpuset_mem_spread_node() · 778d3b0f
      Michal Hocko committed
      [ This patch has already been accepted as commit 0ac0c0d0 but later
        reverted (commit 35926ff5) because it introduced arch specific
        __node_random which was defined only for x86 code so it broke other
        archs.  This is a followup without any arch specific code.  Other than
        that there are no functional changes.]
      
      Some workloads that create a large number of small files tend to assign
      too many pages to node 0 (multi-node systems).  Part of the reason is
      that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
      at node 0 for newly created tasks.
      
      This patch changes the rotor to be initialized to a random node number
      of the cpuset.
      
      [akpm@linux-foundation.org: fix layout]
      [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
      [mhocko@suse.cz: Make it arch independent]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Menage <menage@google.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memcg: get rid of percpu_charge_mutex lock · 8521fc50
      Michal Hocko committed
      percpu_charge_mutex protects from multiple simultaneous per-cpu charge
      caches draining because we might end up having too many work items.  At
      least this was the case until commit 26fe6168 ("memcg: fix percpu
      cached charge draining frequency") when we introduced a more targeted
      draining for async mode.
      
      Now that also sync draining is targeted we can safely remove mutex
      because we will not send more work than the current number of CPUs.
      FLUSHING_CACHED_CHARGE protects from sending the same work multiple
      times and stock->nr_pages == 0 protects from pointless sending a work if
      there is obviously nothing to be done.  This is of course racy but we
      can live with it as the race window is really small (we would have to
      see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still
      non-zero).
      
      The only remaining place where we can race is synchronous mode when we
      rely on FLUSHING_CACHED_CHARGE test which might have been set by other
      drainer on the same group but we should wait in that case as well.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memcg: add mem_cgroup_same_or_subtree() helper · 3e92041d
      Michal Hocko committed
      We check whether two given groups are the same or at least in the
      same subtree of a hierarchy at several places.  Let's make a helper for
      it to make code easier to read.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
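The check that the mem_cgroup_same_or_subtree() commit above factors out can be pictured as a walk up the parent chain. This is a simplified sketch under the assumption that each group stores only a parent pointer; the kernel helper works on cgroup structures and also has to respect the hierarchy setting, which is not modelled here. All `demo_` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type: each group only knows its parent. */
struct demo_group {
    const struct demo_group *parent;
};

/* True if child is root itself or sits anywhere below root. */
bool demo_same_or_subtree(const struct demo_group *root,
                          const struct demo_group *child)
{
    for (const struct demo_group *g = child; g != NULL; g = g->parent)
        if (g == root)
            return true;
    return false;
}
```

Call sites that previously spelled out "same group, or an ancestor relationship" can then use one readable predicate.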
    • M
      memcg: unify sync and async per-cpu charge cache draining · d38144b7
      Michal Hocko committed
      Currently we have two ways to drain per-CPU caches for charges.
      drain_all_stock_sync will synchronously drain all caches while
      drain_all_stock_async will asynchronously drain only those that refer to
      a given memory cgroup or its subtree in hierarchy.  Targeted async
      draining has been introduced by 26fe6168 (memcg: fix percpu cached
      charge draining frequency) to reduce the cpu workers number.
      
      sync draining is currently triggered only from mem_cgroup_force_empty
      which is triggered only by userspace (mem_cgroup_force_empty_write) or
      when a cgroup is removed (mem_cgroup_pre_destroy).  Although these are
      not usually frequent operations it still makes some sense to do targeted
      draining as well, especially if the box has many CPUs.
      
      This patch unifies both methods to use the single code (drain_all_stock)
      which relies on the original async implementation and just adds
      flush_work to wait on all caches that are still under work for the sync
      mode.  We are using FLUSHING_CACHED_CHARGE bit check to prevent from
      waiting on a work that we haven't triggered.  Please note that both sync
      and async functions are currently protected by percpu_charge_mutex so we
      cannot race with other drainers.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memcg: do not try to drain per-cpu caches without pages · d1a05b69
      Michal Hocko committed
      drain_all_stock_async tries to optimize a work to be done on the work
      queue by excluding any work for the current CPU because it assumes that
      the context we are called from already tried to charge from that cache
      and it's failed so it must be empty already.
      
      While the assumption is correct we can optimize it even more by checking
      the current number of pages in the cache.  This will also reduce a work
      on other CPUs with an empty stock.
      
      For the current CPU we can simply call drain_local_stock rather than
      deferring it to the work queue.
      
      [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: add memory.vmscan_stat · 82f9d486
      KAMEZAWA Hiroyuki committed
      The commit log of 0ae5e89c ("memcg: count the soft_limit reclaim
      in...") says it adds scanning stats to memory.stat file.  But it doesn't
      because we considered we needed to reach a consensus for such new APIs.
      
      This patch is a trial to add memory.vmscan_stat.  This shows
        - the number of scanned pages (total, anon, file)
        - the number of rotated pages (total, anon, file)
        - the number of freed pages (total, anon, file)
        - the elapsed time (including sleep/pause time)
      
        for both direct and soft reclaim.
      
      The biggest difference from Ying's original one is that this file
      can be reset by some write, as
      
        # echo 0 > ...../memory.vmscan_stat
      
      Example of output is here. This is a result after make -j 6 kernel
      under 300M limit.
      
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
        scanned_pages_by_limit 9471864
        scanned_anon_pages_by_limit 6640629
        scanned_file_pages_by_limit 2831235
        rotated_pages_by_limit 4243974
        rotated_anon_pages_by_limit 3971968
        rotated_file_pages_by_limit 272006
        freed_pages_by_limit 2318492
        freed_anon_pages_by_limit 962052
        freed_file_pages_by_limit 1356440
        elapsed_ns_by_limit 351386416101
        scanned_pages_by_system 0
        scanned_anon_pages_by_system 0
        scanned_file_pages_by_system 0
        rotated_pages_by_system 0
        rotated_anon_pages_by_system 0
        rotated_file_pages_by_system 0
        freed_pages_by_system 0
        freed_anon_pages_by_system 0
        freed_file_pages_by_system 0
        elapsed_ns_by_system 0
        scanned_pages_by_limit_under_hierarchy 9471864
        scanned_anon_pages_by_limit_under_hierarchy 6640629
        scanned_file_pages_by_limit_under_hierarchy 2831235
        rotated_pages_by_limit_under_hierarchy 4243974
        rotated_anon_pages_by_limit_under_hierarchy 3971968
        rotated_file_pages_by_limit_under_hierarchy 272006
        freed_pages_by_limit_under_hierarchy 2318492
        freed_anon_pages_by_limit_under_hierarchy 962052
        freed_file_pages_by_limit_under_hierarchy 1356440
        elapsed_ns_by_limit_under_hierarchy 351386416101
        scanned_pages_by_system_under_hierarchy 0
        scanned_anon_pages_by_system_under_hierarchy 0
        scanned_file_pages_by_system_under_hierarchy 0
        rotated_pages_by_system_under_hierarchy 0
        rotated_anon_pages_by_system_under_hierarchy 0
        rotated_file_pages_by_system_under_hierarchy 0
        freed_pages_by_system_under_hierarchy 0
        freed_anon_pages_by_system_under_hierarchy 0
        freed_file_pages_by_system_under_hierarchy 0
        elapsed_ns_by_system_under_hierarchy 0
      
      total_xxxx is for hierarchy management.
      
      This will be useful for further memcg developments and needs to be
      developed before we do some complicated rework on LRU/softlimit
      management.
      
      This patch adds a new struct memcg_scanrecord into scan_control struct.
      sc->nr_scanned et al. are not designed for exporting information.  For
      example, nr_scanned is reset frequently and incremented by 2 when scanning
      mapped pages.
      
      To avoid complexity, I added a new param in scan_control which is for
      exporting scanning score.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andrew Bresticker <abrestic@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: fix behavior of mem_cgroup_resize_limit() · 108b6a78
      Daisuke Nishimura committed
      Commit 22a668d7 ("memcg: fix behavior under memory.limit equals to
      memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
      when mem_limit == memsw_limit.  The flag is checked at the beginning of
      reclaim, and "noswap" is set if the flag is true, because using swap is
      meaningless in this case.
      
      This works well in most cases, but when we try to shrink mem_limit,
      which is the same as memsw_limit now, we might fail to shrink mem_limit
      because swap isn't used.
      
      This patch fixes this behavior by:
       - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
       - if it is set, not setting the "noswap" flag even if memsw_is_minimum is true.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
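The decision described in the mem_cgroup_resize_limit() commit above reduces to a small boolean rule. The sketch below is a simplification with hypothetical names and an assumed flag value; the real reclaim path takes further conditions into account that the commit message does not spell out.

```c
#include <stdbool.h>

/* Assumed bit value; the commit only names MEM_CGROUP_RECLAIM_SHRINK. */
#define DEMO_RECLAIM_SHRINK 0x1u

/*
 * Skip swap only when the memory limit equals the memory+swap limit
 * and the reclaim was not started in order to shrink the limit.
 */
bool demo_noswap(bool memsw_is_minimum, unsigned int reclaim_flags)
{
    bool shrink = (reclaim_flags & DEMO_RECLAIM_SHRINK) != 0;

    return memsw_is_minimum && !shrink;
}
```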
    • K
      memcg: fix vmscan count in small memcgs · 4508378b
      KAMEZAWA Hiroyuki committed
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      fixes the memcg/kswapd behavior against small targets and prevents the
      vmscan priority from getting too high.
      
      But the implementation is too naive and adds another problem to small
      memcgs.  It always forces a scan of 32 pages of file/anon and doesn't
      handle swappiness and other rotate_info.  It makes vmscan scan the anon
      LRU regardless of swappiness and makes reclaim bad.  This patch fixes it
      by adjusting the scanning count with regard to swappiness et al.
      
      At a test "cat 1G file under 300M limit." (swappiness=20)
       before patch
              scanned_pages_by_limit 360919
              scanned_anon_pages_by_limit 180469
              scanned_file_pages_by_limit 180450
              rotated_pages_by_limit 31
              rotated_anon_pages_by_limit 25
              rotated_file_pages_by_limit 6
              freed_pages_by_limit 180458
              freed_anon_pages_by_limit 19
              freed_file_pages_by_limit 180439
              elapsed_ns_by_limit 429758872
       after patch
              scanned_pages_by_limit 180674
              scanned_anon_pages_by_limit 24
              scanned_file_pages_by_limit 180650
              rotated_pages_by_limit 35
              rotated_anon_pages_by_limit 24
              rotated_file_pages_by_limit 11
              freed_pages_by_limit 180634
              freed_anon_pages_by_limit 0
              freed_file_pages_by_limit 180634
              elapsed_ns_by_limit 367119089
              scanned_pages_by_system 0
      
      The number of scanned anon pages decreased (as expected), and the elapsed
      time was reduced.  With this patch, small memcgs will work better.
      (*) Because the amount of file cache is much bigger than anon,
          reclaim_stat's rotate-scan counter makes vmscan scan files more.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memcg: change memcg_oom_mutex to spinlock · 1af8efe9
      Michal Hocko committed
      memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
      for oom_control.  None of the critical sections which it protects sleep
      (eventfd_signal works from atomic context and the rest are simple linked
      list resp.  oom_lock atomic operations).
      
      Mutex is also too heavyweight for those code paths because it triggers a
      lot of scheduling.  It also makes convoying effects more visible
      when we have a large number of OOM kills because we take the lock
      multiple times during mem_cgroup_handle_oom so we have multiple places
      where many processes can sleep.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memcg: make oom_lock 0 and 1 based rather than counter · 79dfdacc
      Michal Hocko committed
      Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
      counter which is incremented by mem_cgroup_oom_lock when we are about to
      handle memcg OOM situation.  mem_cgroup_handle_oom falls back to a sleep
      if oom_lock > 1 to prevent from multiple oom kills at the same time.
      The counter is then decremented by mem_cgroup_oom_unlock called from the
      same function.
      
      This works correctly but it can lead to serious starvations when we have
      many processes triggering OOM and many CPUs available for them (I have
      tested with 16 CPUs).
      
      Consider a process (call it A) which gets the oom_lock (the first one
      that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
      processes that are blocked on the mutex.  While A releases the mutex and
      calls mem_cgroup_out_of_memory others will wake up (one after another)
      and increase the counter and fall into sleep (memcg_oom_waitq).
      
      Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
      decreases oom_lock and wakes other tasks (if releasing memory by
      somebody else - e.g.  killed process - hasn't done it yet).
      
      A testcase would look like:
        Assume malloc XXX is a program allocating XXX Megabytes of memory
        which touches all allocated pages in a tight loop
        # swapoff SWAP_DEVICE
        # cgcreate -g memory:A
        # cgset -r memory.oom_control=0   A
        # cgset -r memory.limit_in_bytes= 200M
        # for i in `seq 100`
        # do
        #     cgexec -g memory:A   malloc 10 &
        # done
      
      The main problem here is that all processes still race for the mutex and
      there is no guarantee that we will get counter back to 0 for those that
      got back to mem_cgroup_handle_oom.  In the end the whole convoy
      in/decreases the counter but we do not get to 1 that would enable
      killing so nothing useful can be done.  The time is basically unbounded
      because it highly depends on scheduling and ordering on mutex (I have
      seen this taking hours...).
      
      This patch replaces the counter by a simple {un}lock semantic.  As
      mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have to
      make sure that nobody else races with us which is guaranteed by the
      memcg_oom_mutex.
      
      We have to be careful while locking subtrees because we can encounter a
      subtree which is already locked.  Consider this hierarchy:
      
                A
              /   \
             B     \
            /\      \
           C  D     E
      
      B - C - D tree might be already locked.  While we want to enable locking
      E subtree because OOM situations cannot influence each other we
      definitely do not want to allow locking A.
      
      Therefore we have to refuse lock if any subtree is already locked and
      clear up the lock for all nodes that have been set up to the failure
      point.
      
      On the other hand we have to make sure that the rest of the world will
      recognize that a group is under OOM even though it doesn't have a lock.
      Therefore we have to introduce under_oom variable which is incremented
      and decremented for the whole subtree when we enter resp.  leave
      mem_cgroup_handle_oom.  under_oom, unlike oom_lock, doesn't need to be
      updated under memcg_oom_mutex because its users only check a single
      group and they use atomic operations for that.
      
      This can be checked easily by the following test case:
      
        # cgcreate -g memory:A
        # cgset -r memory.use_hierarchy=1 A
        # cgset -r memory.oom_control=1   A
        # cgset -r memory.limit_in_bytes= 100M
        # cgset -r memory.memsw.limit_in_bytes= 100M
        # cgcreate -g memory:A/B
        # cgset -r memory.oom_control=1 A/B
        # cgset -r memory.limit_in_bytes=20M
        # cgset -r memory.memsw.limit_in_bytes=20M
        # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
        # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
      
      While B gets oom_lock A will not get it.  Both of them go into sleep and
      wait for an external action.  We can make the limit higher for A to
      enforce waking it up
      
        # cgset -r memory.memsw.limit_in_bytes=300M A
        # cgset -r memory.limit_in_bytes=300M A
      
      malloc in A has to wake up even though it doesn't have oom_lock.
      
      Finally, the unlock path is very easy because we always unlock only the
      subtree we have locked previously while we always decrement under_oom.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
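The subtree locking rule from the oom_lock commit above (refuse the lock if any group in the subtree is already locked, and undo whatever was set before the failure) can be sketched on the A/B/C/D/E hierarchy from the commit message. This is a single-threaded illustration with hypothetical names and a fixed array in place of the cgroup tree; the kernel performs these steps under memcg_oom_mutex, and under_oom is not modelled.

```c
#include <stdbool.h>

#define DEMO_NGROUPS 5

/* A is the root, B and E are its children, C and D are children of B. */
enum { G_A, G_B, G_C, G_D, G_E };
static const int demo_parent[DEMO_NGROUPS] = { -1, G_A, G_B, G_B, G_A };
static bool demo_oom_lock[DEMO_NGROUPS];

/* True if group g is root itself or lies below it. */
static bool demo_in_subtree(int root, int g)
{
    for (; g != -1; g = demo_parent[g])
        if (g == root)
            return true;
    return false;
}

/* Lock every group under root, or lock nothing if one is already locked. */
bool demo_oom_trylock(int root)
{
    int g, failed = -1;

    for (g = 0; g < DEMO_NGROUPS; g++) {
        if (!demo_in_subtree(root, g))
            continue;
        if (demo_oom_lock[g]) {
            failed = g;
            break;
        }
        demo_oom_lock[g] = true;
    }
    if (failed == -1)
        return true;

    /* Roll back only the groups this attempt locked before the failure. */
    for (g = 0; g < failed; g++)
        if (demo_in_subtree(root, g))
            demo_oom_lock[g] = false;
    return false;
}

/* Unlock the subtree that a successful demo_oom_trylock() locked. */
void demo_oom_unlock(int root)
{
    for (int g = 0; g < DEMO_NGROUPS; g++)
        if (demo_in_subtree(root, g))
            demo_oom_lock[g] = false;
}
```

Locking B and then E succeeds because the two subtrees are disjoint, while a later attempt on A is refused and leaves the existing locks on B, C, D and E untouched.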
    • K
      memcg: consolidate memory cgroup lru stat functions · bb2a0de9
      KAMEZAWA Hiroyuki committed
      In mm/memcontrol.c, there are many lru stat functions as..
      
        mem_cgroup_zone_nr_lru_pages
        mem_cgroup_node_nr_file_lru_pages
        mem_cgroup_nr_file_lru_pages
        mem_cgroup_node_nr_anon_lru_pages
        mem_cgroup_nr_anon_lru_pages
        mem_cgroup_node_nr_unevictable_lru_pages
        mem_cgroup_nr_unevictable_lru_pages
        mem_cgroup_node_nr_lru_pages
        mem_cgroup_nr_lru_pages
        mem_cgroup_get_local_zonestat
      
      Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
      This seems bad. This patch consolidates all functions into
      
        mem_cgroup_zone_nr_lru_pages()
        mem_cgroup_node_nr_lru_pages()
        mem_cgroup_nr_lru_pages()
      
      For these functions, "which LRU?" information is passed by a mask.
      
      example:
        mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
      
      And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
      
      example:
        mem_cgroup_nr_lru_pages(mem, ALL_LRU)
      
      BTW, considering layout of NUMA memory placement of counters, this patch seems
      to be better.
      
      Now, when we gather all LRU information, we scan in the following order
          for_each_lru -> for_each_node -> for_each_zone.
      
      This means we'll touch cache lines in different node in turn.
      
      After patch, we'll scan
          for_each_node -> for_each_zone -> for_each_lru(mask)
      
      Then, we'll gather information in the same cacheline at once.
      
      [akpm@linux-foundation.org: fix warnings, build error]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
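The mask-based interface and the node, zone, LRU iteration order from the LRU stat consolidation commit above might look roughly like this in isolation. The enum names, array sizes and the flat counter array are assumptions made for the sketch; the kernel reads per-zone memcg structures instead.

```c
#define DEMO_BIT(n) (1u << (n))

enum demo_lru {
    DEMO_LRU_INACTIVE_ANON,
    DEMO_LRU_ACTIVE_ANON,
    DEMO_LRU_INACTIVE_FILE,
    DEMO_LRU_ACTIVE_FILE,
    DEMO_LRU_UNEVICTABLE,
    DEMO_NR_LRU
};

#define DEMO_ALL_LRU_ANON \
    (DEMO_BIT(DEMO_LRU_INACTIVE_ANON) | DEMO_BIT(DEMO_LRU_ACTIVE_ANON))
#define DEMO_ALL_LRU_FILE \
    (DEMO_BIT(DEMO_LRU_INACTIVE_FILE) | DEMO_BIT(DEMO_LRU_ACTIVE_FILE))
#define DEMO_ALL_LRU (DEMO_BIT(DEMO_NR_LRU) - 1)

#define DEMO_NODES 2
#define DEMO_ZONES 2

/*
 * counts[node][zone][lru]: the counters of one zone sit next to each
 * other, so the innermost loop stays within the same memory region.
 */
unsigned long demo_nr_lru_pages(
    unsigned long counts[DEMO_NODES][DEMO_ZONES][DEMO_NR_LRU],
    unsigned int lru_mask)
{
    unsigned long total = 0;

    for (int node = 0; node < DEMO_NODES; node++)
        for (int zone = 0; zone < DEMO_ZONES; zone++)
            for (int lru = 0; lru < DEMO_NR_LRU; lru++)
                if (lru_mask & DEMO_BIT(lru))
                    total += counts[node][zone][lru];
    return total;
}
```

One function parameterised by a mask replaces the family of per-LRU variants, and combined masks such as DEMO_ALL_LRU_ANON cover the common groupings.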
    • K
      memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() · 1f4c025b
      KAMEZAWA Hiroyuki committed
      Each memory cgroup has a 'swappiness' value which can be accessed by
      get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
      and swappiness is passed by argument.  It's propagated by scan_control.
      
      get_swappiness() is a static function but some planned updates will need
      to get swappiness from files other than memcontrol.c.  This patch exports
      get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
      swappiness argument from try_to_free...  and drop swappiness from
      scan_control.  Only memcg uses it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • T
      rtc: fix hrtimer deadlock · b830ac1d
      Thomas Gleixner committed
      Ben reported a lockup related to rtc. The lockup happens due to:
      
      CPU0                                        CPU1
      
      rtc_irq_set_state()                         __run_hrtimer()
        spin_lock_irqsave(&rtc->irq_task_lock)    rtc_handle_legacy_irq();
                                                    spin_lock(&rtc->irq_task_lock);
        hrtimer_cancel()
          while (callback_running);
      
      So the running callback never finishes as it's blocked on
      rtc->irq_task_lock.
      
      Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while
      waiting for the callback.  Fix this for both rtc_irq_set_state() and
      rtc_irq_set_freq().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Ben Greear <greearb@candelatech.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b830ac1d
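The locking pattern described in the "rtc: fix hrtimer deadlock" entry above can be sketched in userspace roughly as follows. This is a single-threaded model, not the kernel code: `struct fake_rtc` and every `fake_*` helper are hypothetical stand-ins, and only the return convention of `fake_try_to_cancel()` is meant to mirror `hrtimer_try_to_cancel()` (-1 while the callback runs, 1 if an active timer was cancelled, 0 if it was idle).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel objects; not the real API. */
struct fake_rtc {
	bool lock_held;        /* models rtc->irq_task_lock */
	bool callback_running; /* models a timer callback in flight on another CPU */
	bool timer_active;     /* models the periodic hrtimer being armed */
};

static void fake_lock(struct fake_rtc *rtc)
{
	assert(!rtc->lock_held); /* single-threaded model: never lock twice */
	rtc->lock_held = true;
}

static void fake_unlock(struct fake_rtc *rtc)
{
	rtc->lock_held = false;
}

/* -1: callback currently running, 1: active timer cancelled, 0: timer was idle. */
static int fake_try_to_cancel(struct fake_rtc *rtc)
{
	if (rtc->callback_running)
		return -1;
	if (rtc->timer_active) {
		rtc->timer_active = false;
		return 1;
	}
	return 0;
}

/* Models the other CPU: the callback can only finish while the lock is free. */
static void fake_other_cpu_progress(struct fake_rtc *rtc)
{
	if (!rtc->lock_held)
		rtc->callback_running = false;
}

/*
 * Disable the timer without waiting for the callback while holding the lock.
 * Returns how many times the lock had to be dropped and retaken.
 */
static int fake_irq_disable(struct fake_rtc *rtc)
{
	int retries = 0;

	for (;;) {
		fake_lock(rtc);
		if (fake_try_to_cancel(rtc) >= 0)
			break;
		fake_unlock(rtc);             /* let the callback take the lock */
		fake_other_cpu_progress(rtc); /* the kernel would spin briefly here */
		retries++;
	}
	fake_unlock(rtc);
	return retries;
}
```

The point of the sketch is that the wait for the callback happens with the lock released, so the callback is never blocked by the task that is waiting for it.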
    • T
      rtc: limit frequency · 431e2bcc
      Thomas Gleixner committed
      Due to the hrtimer self-rearming mode, a user can DoS the machine simply
      because it's starved by hrtimer events.
      
      The RTC hrtimer is self-rearming.  We really need to limit the frequency
      to something sensible.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      431e2bcc
    • T
      rtc: handle errors correctly in rtc_irq_set_state() · 2c4f57d1
      Thomas Gleixner committed
      The code checks the correctness of the parameters, but unconditionally
      arms/disarms the hrtimer.
      
      The result is that a random task might arm/disarm the rtc timer and
      surprise the real owner by either generating events or by stopping them.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c4f57d1
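The bug class in the "rtc: handle errors correctly in rtc_irq_set_state()" entry above is a check whose result does not gate the side effect. A minimal sketch of the corrected shape, assuming a hypothetical `struct fake_rtc` and illustrative error codes (neither is taken from the kernel source):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the rtc state; not the kernel structure. */
struct fake_rtc {
	const void *irq_task; /* the task that registered for periodic irqs */
	bool timer_armed;     /* models the hrtimer being armed */
};

/*
 * Validate first and return on error, so the timer state is only touched
 * on behalf of the registered owner. Error codes are illustrative.
 */
static int fake_irq_set_state(struct fake_rtc *rtc, const void *task, bool enabled)
{
	assert(rtc != NULL);

	if (rtc->irq_task == NULL)
		return -ENXIO;  /* nobody registered: nothing to arm */
	if (rtc->irq_task != task)
		return -EACCES; /* caller is not the owner */

	rtc->timer_armed = enabled; /* reached only when the checks passed */
	return 0;
}
```

With the early returns in place, a call from a task that is not the owner leaves `timer_armed` untouched.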
    • M
      mn10300, exec: remove redundant set_fs(USER_DS) · b45d59fb
      Mathias Krause committed
      The address limit is already set in flush_old_exec() so this
      set_fs(USER_DS) is redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b45d59fb
    • J
      drivers/base/power/opp.c: fix dev_opp initial value · fc92805a
      Jonghwan Choi committed
      The initial value of dev_opp should be ERR_PTR(), because IS_ERR() is
      used to check for errors.
      Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc92805a
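Why the initial value matters in the dev_opp entry above: `IS_ERR()` only recognises pointers in the error range, so a lookup that falls through with `NULL` is not reported as an error. The sketch below re-implements `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` in userspace in the spirit of the kernel's err.h; `struct fake_dev_opp`, `fake_find()` and the choice of `-ENODEV` are assumptions for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace re-implementations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical table entry; not the kernel's struct device_opp. */
struct fake_dev_opp {
	int id;
};

/*
 * Linear lookup that returns `initial` unchanged when nothing matches,
 * which is why the initial value decides what callers see on a miss.
 */
static struct fake_dev_opp *fake_find(struct fake_dev_opp *table, size_t n,
				      int id, struct fake_dev_opp *initial)
{
	struct fake_dev_opp *found = initial;

	assert(table != NULL);
	for (size_t i = 0; i < n; i++) {
		if (table[i].id == id) {
			found = &table[i];
			break;
		}
	}
	return found;
}
```

A miss with a `NULL` initial value passes an `IS_ERR()` check and would then be dereferenced by the caller, while a miss with `ERR_PTR(-ENODEV)` is caught.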
    • M
      frv, exec: remove redundant set_fs(USER_DS) · adc400f6
      Mathias Krause committed
      The address limit is already set in flush_old_exec() so those calls to
      set_fs(USER_DS) are redundant.
      
      Also removed the dead code in flush_thread().
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      adc400f6
    • L
      Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus · 6fd4ce88
      Linus Torvalds committed
      * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (31 commits)
        MIPS: Close races in TLB modify handlers.
        MIPS: Add uasm UASM_i_SRL_SAFE macro.
        MIPS: RB532: Use hex_to_bin()
        MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platforms
        MIPS: PowerTV: Provide cpu-feature-overrides.h
        MIPS: Remove pointless return statement from empty void functions.
        MIPS: Limit fixrange_init() to the FIXMAP region
        MIPS: Install handlers for software IRQs
        MIPS: Move FIXADDR_TOP into spaces.h
        MIPS: Add SYNC after cacheflush
        MIPS: pfn_valid() is broken on low memory HIGHMEM systems
        MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
        MIPS: topdown mmap support
        MIPS: Remove redundant addr_limit assignment on exec.
        MIPS: AR7: Replace __attribute__((__packed__)) with __packed
        MIPS: AR7: Remove 'space before tabs' in platform.c
        MIPS: Lantiq: Add missing clk_enable and clk_disable functions.
        MIPS: AR7: Fix trailing semicolon bug in clock.c
        MAINTAINERS: Update MIPS entry.
        MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definition
        ...
      6fd4ce88
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · ba5b56cb
      Linus Torvalds committed
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
        ceph: document unlocked d_parent accesses
        ceph: explicitly reference rename old_dentry parent dir in request
        ceph: document locking for ceph_set_dentry_offset
        ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
        ceph: protect d_parent access in ceph_d_revalidate
        ceph: protect access to d_parent
        ceph: handle racing calls to ceph_init_dentry
        ceph: set dir complete frag after adding capability
        rbd: set blk_queue request sizes to object size
        ceph: set up readahead size when rsize is not passed
        rbd: cancel watch request when releasing the device
        ceph: ignore lease mask
        ceph: fix ceph_lookup_open intent usage
        ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
        ceph: fix bad parent_inode calc in ceph_lookup_open
        ceph: avoid carrying Fw cap during write into page cache
        libceph: don't time out osd requests that haven't been received
        ceph: report f_bfree based on kb_avail rather than diffing.
        ceph: only queue capsnap if caps are dirty
        ceph: fix snap writeback when racing with writes
        ...
      ba5b56cb
    • S
      gma500: udelay(20000) is too long again · 243dd280
      Stephen Rothwell committed
      so replace it with mdelay(20).
      
      Fixes build error:
      
        ERROR: "__bad_udelay" [drivers/staging/gma500/psb_gfx.ko] undefined!
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      243dd280
    • R
      USB / Renesas: Fix build issue related to struct scatterlist · 9c646cfc
      Rafael J. Wysocki committed
      Fix build issue caused by undefined struct scatterlist in
      drivers/usb/renesas_usbhs/fifo.c.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c646cfc
    • R
      MMC / TMIO: Fix build issue related to struct scatterlist · 6c0cbef6
      Rafael J. Wysocki committed
      Fix build issue caused by undefined struct scatterlist in
      drivers/mmc/host/tmio_mmc.c.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c0cbef6
    • L
      Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 · 2ac232f3
      Linus Torvalds committed
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
        jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
        ext3.txt: update the links in the section "useful links" to the latest ones
        ext3: Fix data corruption in inodes with journalled data
        ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
        ext3: Fix compilation with -DDX_DEBUG
        quota: Remove unused declaration
        jbd: Use WRITE_SYNC in journal checkpoint.
        jbd: Fix oops in journal_remove_journal_head()
        ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
        ext3/ioctl.c: silence sparse warnings about different address spaces
        ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
        ext3: Improve truncate error handling
        ext3: use proper little-endian bitops
        ext2: include fs.h into ext2_fs.h
        ext3: Fix oops in ext3_try_to_allocate_with_rsv()
        jbd: fix a bug of leaking jh->b_jcount
        jbd: remove dependency on __GFP_NOFAIL
        ext3: Convert ext3 to new truncate calling convention
        jbd: Add fixed tracepoints
        ext3: Add fixed tracepoints
      
      Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
      new fixed tracepoints.
      2ac232f3
    • S
      ceph: document unlocked d_parent accesses · d79698da
      Sage Weil committed
      For the most part we don't care about racing with rename when directing
      MDS requests; either the old or new parent is fine.  Document that, and
      do some minor cleanup.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      d79698da
    • S
      ceph: explicitly reference rename old_dentry parent dir in request · 41b02e1f
      Sage Weil committed
      We carry a pin on the parent directory for the rename source and dest
      dentries.  For the source it's r_locked_dir; we need to explicitly
      reference the old_dentry parent as well, since the dentry's d_parent may
      change between when the request was created and pinned and when it is
      freed.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      41b02e1f
    • S
      4f177264
    • S
      ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug · e5f86dc3
      Sage Weil committed
      Have caller pass in a safely-obtained reference to the parent directory
      for calculating a dentry's hash value.
      
      While we're here, simplify the flow through ceph_encode_fh() so that there
      is a single exit point and cleanup.
      
      Also fix a bug with the dentry hash calculation: calculate the hash for the
      dentry we were given, not its parent.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      e5f86dc3
    • S
      ceph: protect d_parent access in ceph_d_revalidate · bf1c6aca
      Sage Weil committed
      Protect d_parent with d_lock.  Carry a reference.  Simplify the flow so
      that there is a single exit point and cleanup.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      bf1c6aca
    • S
      ceph: protect access to d_parent · 5f21c96d
      Sage Weil committed
      d_parent is protected by d_lock: use it when looking up a dentry's parent
      directory inode.  Also take a reference and drop it in the caller to avoid
      a use-after-free.
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      5f21c96d
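The "ceph: protect access to d_parent" entry above describes a lock, reference, unlock sequence, with the caller dropping the reference later. A single-threaded userspace sketch of that sequence follows; `struct fake_dentry`, `struct fake_inode` and the `fake_*` helpers are hypothetical, and a plain integer stands in for the spinlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the VFS structures. */
struct fake_inode {
	int refcount;
};

struct fake_dentry {
	int d_lock; /* 1 while "locked"; stand-in for the d_lock spinlock */
	struct fake_dentry *d_parent;
	struct fake_inode *d_inode;
};

static void fake_lock(struct fake_dentry *d)
{
	assert(d->d_lock == 0);
	d->d_lock = 1;
}

static void fake_unlock(struct fake_dentry *d)
{
	d->d_lock = 0;
}

/*
 * Read d_parent only while holding d_lock, and take a reference on the
 * parent's inode before unlocking so it stays valid if a rename changes
 * d_parent afterwards.
 */
static struct fake_inode *fake_get_parent_inode(struct fake_dentry *dentry)
{
	struct fake_inode *dir;

	fake_lock(dentry);
	dir = dentry->d_parent->d_inode;
	dir->refcount++;
	fake_unlock(dentry);
	return dir;
}

/* The caller drops the reference once it is done with the inode. */
static void fake_put_inode(struct fake_inode *inode)
{
	assert(inode->refcount > 0);
	inode->refcount--;
}
```

Every successful `fake_get_parent_inode()` is paired with one `fake_put_inode()` in the caller, which is the discipline the commit message describes.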