1. 10 10月, 2013 6 次提交
  2. 28 9月, 2013 1 次提交
  3. 27 9月, 2013 1 次提交
  4. 25 9月, 2013 5 次提交
    • R
      of: clean-up ifdefs in of_irq.h · b0b8c960
      Rob Herring 提交于
      Much of of_irq.h is needlessly ifdef'ed. Clean this up and minimize the
      amount ifdef'ed code. This fixes some  build warnings when CONFIG_OF
      is not enabled (seen on i386 and x86_64):
      
      include/linux/of_irq.h:82:7: warning: 'struct device_node' declared inside parameter list [enabled by default]
      include/linux/of_irq.h:82:7: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
      include/linux/of_irq.h:87:47: warning: 'struct device_node' declared inside parameter list [enabled by default]
      
      Compile tested on i386, sparc and arm.
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Grant Likely <grant.likely@linaro.org>
      Signed-off-by: NRob Herring <rob.herring@calxeda.com>
      b0b8c960
    • A
      revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code" · 0608f43d
      Andrew Morton 提交于
      Revert commit 3b38722e ("memcg, vmscan: integrate soft reclaim
      tighter with zone shrinking code")
      
      I merged this prematurely - Michal and Johannes still disagree about the
      overall design direction and the future remains unclear.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0608f43d
    • A
      revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim" · b1aff7fc
      Andrew Morton 提交于
      Revert commit a5b7c87f ("vmscan, memcg: do softlimit reclaim also
      for targeted reclaim")
      
      I merged this prematurely - Michal and Johannes still disagree about the
      overall design direction and the future remains unclear.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1aff7fc
    • A
      revert "memcg: enhance memcg iterator to support predicates" · 694fbc0f
      Andrew Morton 提交于
      Revert commit de57780d ("memcg: enhance memcg iterator to support
      predicates")
      
      I merged this prematurely - Michal and Johannes still disagree about the
      overall design direction and the future remains unclear.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      694fbc0f
    • M
      watchdog: update watchdog_thresh properly · 9809b18f
      Michal Hocko 提交于
      watchdog_tresh controls how often nmi perf event counter checks per-cpu
      hrtimer_interrupts counter and blows up if the counter hasn't changed
      since the last check.  The counter is updated by per-cpu
      watchdog_hrtimer hrtimer which is scheduled with 2/5 watchdog_thresh
      period which guarantees that hrtimer is scheduled 2 times per the main
      period.  Both hrtimer and perf event are started together when the
      watchdog is enabled.
      
      So far so good.  But...
      
      But what happens when watchdog_thresh is updated from sysctl handler?
      
      proc_dowatchdog will set a new sampling period and hrtimer callback
      (watchdog_timer_fn) will use the new value in the next round.  The
      problem, however, is that nobody tells the perf event that the sampling
      period has changed so it is ticking with the period configured when it
      has been set up.
      
      This might result in an ear ripping dissonance between perf and hrtimer
      parts if the watchdog_thresh is increased.  And even worse it might lead
      to KABOOM if the watchdog is configured to panic on such a spurious
      lockup.
      
      This patch fixes the issue by updating both nmi perf even counter and
      hrtimers if the threshold value has changed.
      
      The nmi one is disabled and then reinitialized from scratch.  This has
      an unpleasant side effect that the allocation of the new event might
      fail theoretically so the hard lockup detector would be disabled for
      such cpus.  On the other hand such a memory allocation failure is very
      unlikely because the original event is deallocated right before.
      
      It would be much nicer if we just changed perf event period but there
      doesn't seem to be any API to do that right now.  It is also unfortunate
      that perf_event_alloc uses GFP_KERNEL allocation unconditionally so we
      cannot use on_each_cpu() and do the same thing from the per-cpu context.
      The update from the current CPU should be safe because
      perf_event_disable removes the event atomically before it clears the
      per-cpu watchdog_ev so it cannot change anything under running handler
      feet.
      
      The hrtimer is simply restarted (thanks to Don Zickus who has pointed
      this out) if it is queued because we cannot rely it will fire&adopt to
      the new sampling period before a new nmi event triggers (when the
      treshold is decreased).
      
      [akpm@linux-foundation.org: the UP version of __smp_call_function_single ended up in the wrong place]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Fabio Estevam <festevam@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9809b18f
  5. 22 9月, 2013 1 次提交
  6. 21 9月, 2013 2 次提交
  7. 20 9月, 2013 5 次提交
    • M
      dm mpath: disable WRITE SAME if it fails · f84cb8a4
      Mike Snitzer 提交于
      Workaround the SCSI layer's problematic WRITE SAME heuristics by
      disabling WRITE SAME in the DM multipath device's queue_limits if an
      underlying device disabled it.
      
      The WRITE SAME heuristics, with both the original commit 5db44863
      ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit
      66c28f97 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling
      WRITE SAME(10) even without successfully determining it is supported.
      After the first failed WRITE SAME the SCSI layer will disable WRITE SAME
      for the device (by setting sdkp->device->no_write_same which results in
      'max_write_same_sectors' in device's queue_limits to be set to 0).
      
      When a device is stacked ontop of such a SCSI device any changes to that
      SCSI device's queue_limits do not automatically propagate up the stack.
      As such, a DM multipath device will not have its WRITE SAME support
      disabled.  This causes the block layer to continue to issue WRITE SAME
      requests to the mpath device which causes paths to fail and (if mpath IO
      isn't configured to queue when no paths are available) it will result in
      actual IO errors to the upper layers.
      
      This fix doesn't help configurations that have additional devices
      stacked ontop of the mpath device (e.g. LVM created linear DM devices
      ontop).  A proper fix that restacks all the queue_limits from the bottom
      of the device stack up will need to be explored if SCSI will continue to
      use this model of optimistically allowing op codes and then disabling
      them after they fail for the first time.
      
      Before this patch:
      
      EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
      device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
      device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121
      end_request: critical target error, dev dm-6, sector 528
      dm-6: WRITE SAME failed. Manually zeroing.
      device-mapper: multipath: Failing path 8:112.
      end_request: I/O error, dev dm-6, sector 4616
      dm-6: WRITE SAME failed. Manually zeroing.
      end_request: I/O error, dev dm-6, sector 4616
      end_request: I/O error, dev dm-6, sector 5640
      end_request: I/O error, dev dm-6, sector 6664
      end_request: I/O error, dev dm-6, sector 7688
      end_request: I/O error, dev dm-6, sector 524288
      Buffer I/O error on device dm-6, logical block 65536
      lost page write due to I/O error on dm-6
      JBD2: Error -5 detected when updating journal superblock for dm-6-8.
      end_request: I/O error, dev dm-6, sector 524296
      Aborting journal on device dm-6-8.
      end_request: I/O error, dev dm-6, sector 524288
      Buffer I/O error on device dm-6, logical block 65536
      lost page write due to I/O error on dm-6
      JBD2: Error -5 detected when updating journal superblock for dm-6-8.
      
      # cat /sys/block/sdh/queue/write_same_max_bytes
      0
      # cat /sys/block/dm-6/queue/write_same_max_bytes
      33553920
      
      After this patch:
      
      EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null)
      device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121)
      device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121
      end_request: critical target error, dev dm-6, sector 528
      dm-6: WRITE SAME failed. Manually zeroing.
      
      # cat /sys/block/sdh/queue/write_same_max_bytes
      0
      # cat /sys/block/dm-6/queue/write_same_max_bytes
      0
      
      It should be noted that WRITE SAME support wasn't enabled in DM
      multipath until v3.10.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: stable@vger.kernel.org # 3.10+
      f84cb8a4
    • P
      perf: Fix capabilities bitfield compatibility in 'struct perf_event_mmap_page' · fa731587
      Peter Zijlstra 提交于
      Solve the problems around the broken definition of perf_event_mmap_page::
      cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
      fixed by:
      
        860f085b ("perf: Fix broken union in 'struct perf_event_mmap_page'")
      
      The problem with the fix (merged in v3.12-rc1 and not yet released
      officially), noticed by Vince Weaver is that the new behavior is
      not detectable by new user-space, and that due to the reuse of the
      field names it's easy to mis-compile a binary if old headers are used
      on a new kernel or new headers are used on an old kernel.
      
      To solve all that make this change explicit, detectable and self-contained,
      by iterating the ABI the following way:
      
       - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
         confuse old user-space binaries. RDPMC will be marked as unavailable
         to old binaries but that's within the ABI, this is a capability bit.
      
       - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
         libraries can reliably detect that bit 0 is deprecated and perma-zero
         without having to check the kernel version.
      
       - Use bits 2, 3, 4 for the newly defined, correct functionality:
      
      	cap_user_rdpmc		: 1, /* The RDPMC instruction can be used to read counts */
      	cap_user_time		: 1, /* The time_* fields are used */
      	cap_user_time_zero	: 1, /* The time_zero field is used */
      
       - Rename all the bitfield names in perf_event.h to be different from the
         old names, to make sure it's not possible to mis-compile it
         accidentally with old assumptions.
      
      The 'size' field can then be used in the future to add new fields and it
      will act as a natural ABI version indicator as well.
      
      Also adjust tools/perf/ userspace for the new definitions, noticed by
      Adrian Hunter.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Also-Fixed-by: NAdrian Hunter <adrian.hunter@intel.com>
      Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fa731587
    • P
      perf: Update ABI comment · c5ecceef
      Peter Zijlstra 提交于
      For some mysterious reason the sample_id field of PERF_RECORD_MMAP went AWOL.
      Reported-by: NVince Weaver <vince@deater.net>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c5ecceef
    • D
      Revert "drm: mark context support as a legacy subsystem" · c21eb21c
      Dave Airlie 提交于
      This reverts commit 7c510133.
      
      Well looks like not enough digging was done, libdrm_nouveau before 2.4.33
      used contexts,
      
      292da616fe1f936ca78a3fa8e1b1b19883e343b6 nouveau: pull in major libdrm rewrite
      
      got rid of them,
      Reported-by: NPaul Zimmerman <Paul.Zimmerman@synopsys.com>
      Reported-by: NMikael Pettersson <mikpe@it.uu.se>
      Signed-off-by: NDave Airlie <airlied@redhat.com>
      c21eb21c
    • A
      ip: generate unique IP identificator if local fragmentation is allowed · 703133de
      Ansis Atteka 提交于
      If local fragmentation is allowed, then ip_select_ident() and
      ip_select_ident_more() need to generate unique IDs to ensure
      correct defragmentation on the peer.
      
      For example, if IPsec (tunnel mode) has to encrypt large skbs
      that have local_df bit set, then all IP fragments that belonged
      to different ESP datagrams would have used the same identificator.
      If one of these IP fragments would get lost or reordered, then
      peer could possibly stitch together wrong IP fragments that did
      not belong to the same datagram. This would lead to a packet loss
      or data corruption.
      Signed-off-by: NAnsis Atteka <aatteka@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      703133de
  8. 18 9月, 2013 1 次提交
  9. 17 9月, 2013 2 次提交
    • P
      KVM: mmu: allow page tables to be in read-only slots · ba6a3541
      Paolo Bonzini 提交于
      Page tables in a read-only memory slot will currently cause a triple
      fault because the page walker uses gfn_to_hva and it fails on such a slot.
      
      OVMF uses such a page table; however, real hardware seems to be fine with
      that as long as the accessed/dirty bits are set.  Save whether the slot
      is readonly, and later check it when updating the accessed and dirty bits.
      Reviewed-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Reviewed-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ba6a3541
    • J
      netfilter: ipset: Consistent userspace testing with nomatch flag · 0f1799ba
      Jozsef Kadlecsik 提交于
      The "nomatch" commandline flag should invert the matching at testing,
      similarly to the --return-nomatch flag of the "set" match of iptables.
      Until now it worked with the elements with "nomatch" flag only. From
      now on it works with elements without the flag too, i.e:
      
       # ipset n test hash:net
       # ipset a test 10.0.0.0/24 nomatch
       # ipset t test 10.0.0.1
       10.0.0.1 is NOT in set test.
       # ipset t test 10.0.0.1 nomatch
       10.0.0.1 is in set test.
      
       # ipset a test 192.168.0.0/24
       # ipset t test 192.168.0.1
       192.168.0.1 is in set test.
       # ipset t test 192.168.0.1 nomatch
       192.168.0.1 is NOT in set test.
      
       Before the patch the results were
      
       ...
       # ipset t test 192.168.0.1
       192.168.0.1 is in set test.
       # ipset t test 192.168.0.1 nomatch
       192.168.0.1 is in set test.
      Signed-off-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      0f1799ba
  10. 16 9月, 2013 1 次提交
  11. 13 9月, 2013 15 次提交
    • K
      HID: provide a helper for validating hid reports · 331415ff
      Kees Cook 提交于
      Many drivers need to validate the characteristics of their HID report
      during initialization to avoid misusing the reports. This adds a common
      helper to perform validation of the report exisitng, the field existing,
      and the expected number of values within the field.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: NBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      331415ff
    • M
      Remove GENERIC_HARDIRQ config option · 0244ad00
      Martin Schwidefsky 提交于
      After the last architecture switched to generic hard irqs the config
      options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
      for !CONFIG_GENERIC_HARDIRQS can be removed.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      0244ad00
    • M
      netfilter: nf_conntrack: use RCU safe kfree for conntrack extensions · c13a84a8
      Michal Kubeček 提交于
      Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced
      RCU protection for freeing extension data when reallocation
      moves them to a new location. We need the same protection when
      freeing them in nf_ct_ext_free() in order to prevent a
      use-after-free by other threads referencing a NAT extension data
      via bysource list.
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      c13a84a8
    • K
      thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() · c0292554
      Kirill A. Shutemov 提交于
      do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
      to handle fallback path.
      
      Let's consolidate code back by introducing VM_FAULT_FALLBACK return
      code.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0292554
    • K
      truncate: drop 'oldsize' truncate_pagecache() parameter · 7caef267
      Kirill A. Shutemov 提交于
      truncate_pagecache() doesn't care about old size since commit
      cedabed4 ("vfs: Fix vmtruncate() regression").  Let's drop it.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7caef267
    • C
      mm: make lru_add_drain_all() selective · 5fbc4616
      Chris Metcalf 提交于
      make lru_add_drain_all() only selectively interrupt the cpus that have
      per-cpu free pages that can be drained.
      
      This is important in nohz mode where calling mlockall(), for example,
      otherwise will interrupt every core unnecessarily.
      
      This is important on workloads where nohz cores are handling 10 Gb traffic
      in userspace.  Those CPUs do not enter the kernel and place pages into LRU
      pagevecs and they really, really don't want to be interrupted, or they
      drop packets on the floor.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5fbc4616
    • S
      memcg: add per cgroup writeback pages accounting · 3ea67d06
      Sha Zhengju 提交于
      Add memcg routines to count writeback pages, later dirty pages will also
      be accounted.
      
      After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
      accounting"), we can use 'struct page' flag to test page state instead
      of per page_cgroup flag.  But memcg has a feature to move a page from a
      cgroup to another one and may have race between "move" and "page stat
      accounting".  So in order to avoid the race we have designed a new lock:
      
               mem_cgroup_begin_update_page_stat()
               modify page information        -->(a)
               mem_cgroup_update_page_stat()  -->(b)
               mem_cgroup_end_update_page_stat()
      
      It requires both (a) and (b)(writeback pages accounting) to be pretected
      in mem_cgroup_{begin/end}_update_page_stat().  It's full no-op for
      !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu
      read lock in the most cases (no task is moving), and spin_lock_irqsave
      on top in the slow path.
      
      There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
      And the lock order is:
      	--> memcg->move_lock
      	  --> mapping->tree_lock
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea67d06
    • S
      memcg: remove MEMCG_NR_FILE_MAPPED · 68b4876d
      Sha Zhengju 提交于
      While accounting memcg page stat, it's not worth to use
      MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
      complexity and presumed performance overhead.  We can use
      MEM_CGROUP_STAT_FILE_MAPPED directly.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68b4876d
    • S
      memcg: rename RESOURCE_MAX to RES_COUNTER_MAX · 6de5a8bf
      Sha Zhengju 提交于
      RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de5a8bf
    • S
      memcg: correct RESOURCE_MAX to ULLONG_MAX · 34ff8dc0
      Sha Zhengju 提交于
      Current RESOURCE_MAX is ULONG_MAX, but the value we used to set resource
      limit is unsigned long long, so we can set bigger value than that which is
      strange.  The XXX_MAX should be reasonable max value, bigger than that
      should be overflow.
      
      Notice that this change will affect user output of default *.limit_in_bytes:
      before change:
      
        $ cat /cgroup/memory/memory.limit_in_bytes
        9223372036854775807
      
      after change:
      
        $ cat /cgroup/memory/memory.limit_in_bytes
        18446744073709551615
      
      But it doesn't alter the API in term of input - we can still use "echo -1
      > *.limit_in_bytes" to reset the numbers to "unlimited".
      Signed-off-by: NSha Zhengju <handai.szj@taobao.com>
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34ff8dc0
    • J
      mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner 提交于
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NazurIt <azurit@pobox.sk>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
    • J
      mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Johannes Weiner 提交于
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      519e5247
    • J
      arch: mm: pass userspace fault flag to generic fault handler · 759496ba
      Johannes Weiner 提交于
      Unlike global OOM handling, memory cgroup code will invoke the OOM killer
      in any OOM situation because it has no way of telling faults occuring in
      kernel context - which could be handled more gracefully - from
      user-triggered faults.
      
      Pass a flag that identifies faults originating in user space from the
      architecture-specific fault handlers to generic code so that memcg OOM
      handling can be improved.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      759496ba
    • M
      memcg: enhance memcg iterator to support predicates · de57780d
      Michal Hocko 提交于
      The caller of the iterator might know that some nodes or even subtrees
      should be skipped but there is no way to tell iterators about that so the
      only choice left is to let iterators to visit each node and do the
      selection outside of the iterating code.  This, however, doesn't scale
      well with hierarchies with many groups where only few groups are
      interesting.
      
      This patch adds mem_cgroup_iter_cond variant of the iterator with a
      callback which gets called for every visited node.  There are three
      possible ways how the callback can influence the walk.  Either the node is
      visited, it is skipped but the tree walk continues down the tree or the
      whole subtree of the current group is skipped.
      
      [hughd@google.com: fix memcg-less page reclaim]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de57780d
    • M
      vmscan, memcg: do softlimit reclaim also for targeted reclaim · a5b7c87f
      Michal Hocko 提交于
      Soft reclaim has been done only for the global reclaim (both background
      and direct).  Since "memcg: integrate soft reclaim tighter with zone
      shrinking code" there is no reason for this limitation anymore as the soft
      limit reclaim doesn't use any special code paths and it is a part of the
      zone shrinking code which is used by both global and targeted reclaims.
      
      From the semantic point of view it is natural to consider soft limit
      before touching all groups in the hierarchy tree which is touching the
      hard limit because soft limit tells us where to push back when there is a
      memory pressure.  It is not important whether the pressure comes from the
      limit or imbalanced zones.
      
      This patch simply enables soft reclaim unconditionally in
      mem_cgroup_should_soft_reclaim so it is enabled for both global and
      targeted reclaim paths.  mem_cgroup_soft_reclaim_eligible needs to learn
      about the root of the reclaim to know where to stop checking soft limit
      state of parents up the hierarchy.  Say we have
      
      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)
      
      B is the source of the outside memory pressure now for D but we shouldn't
      soft reclaim it because it is behaving well under B subtree and we can
      still reclaim from C (pressumably it is over the limit).
      mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
      hierarchy at B (root of the memory pressure).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5b7c87f