1. September 13, 2013: 40 commits
    • Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux · 951a730a
      Committed by Linus Torvalds
      Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux
      
      Pull blackfin updates from Steven Miao.
      
      * tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux:
        blackfin: Ignore generated uImages
        blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x.
        bf609: adv7343: add S-Video and Component output support
        bf609: add adv7343 video encoder support
        clock: add stmmac clock for ethernet driver
        blackfin: scb: Add SCB1 to SCB9 config options and data.
        blackfin: scb: Add system crossbar init code.
      951a730a
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 0898d2aa
      Committed by Linus Torvalds
      Pull crypto fixes from Herbert Xu:
       "This fixes a 7+ year race condition in the crypto API that causes
        sporadic crashes when multiple threads load the same algorithm.
      
        It also fixes the crct10dif algorithm again to prevent boot failures
        on systems where the initramfs tool ignores module softdeps"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: crct10dif - Add fallback for broken initrds
        crypto: api - Fix race condition in larval lookup
      0898d2aa
    • blackfin: Ignore generated uImages · 08b67faa
      Committed by Mark Brown
      We have the build infrastructure to generate uImages so we should ignore
      the resulting generated files.
      Signed-off-by: Mark Brown <broonie@linaro.org>
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      08b67faa
    • blackfin: Add STMMAC platform data to enable dwmac1000 driver on BF60x. · 1d899fd6
      Committed by Sonic Zhang
      - Enable GMAC
      - Set proper DMA PBL
      - Disable DMA store-and-forward mode
      - Select PTP input clock from MII clock.
      Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
      Signed-off-by: Steven Miao <realmz6@gmail.com>
      1d899fd6
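
      The dwmac1000 glue on a board like this is configured through struct
      plat_stmmacenet_data. A minimal sketch of the kind of platform data the
      commit describes (field values are illustrative assumptions, not taken
      from the actual BF60x board file):

        /* Sketch only: values here are assumptions for illustration. */
        static struct stmmac_dma_cfg eth_dma_cfg = {
                .pbl = 2,                        /* "Set proper DMA PBL" */
        };

        static struct plat_stmmacenet_data eth_private_data = {
                .has_gmac = 1,                   /* "Enable GMAC" */
                .force_sf_dma_mode = 0,          /* "Disable DMA store-and-forward mode" */
                .dma_cfg = &eth_dma_cfg,
        };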
    • bf609: adv7343: add S-Video and Component output support · e5786092
      e5786092
    • bf609: add adv7343 video encoder support · 4940c53d
      Committed by Scott Jiang
      Signed-off-by: Scott Jiang <scott.jiang.linux@gmail.com>
      4940c53d
    • clock: add stmmac clock for ethernet driver · 3036dccf
      Committed by Steven Miao
      Signed-off-by: Steven Miao <realmz6@gmail.com>
      3036dccf
    • blackfin: scb: Add SCB1 to SCB9 config options and data. · 206f060c
      206f060c
    • blackfin: scb: Add system crossbar init code. · 24a70cf2
      Committed by Steven Miao
      On Blackfin CPUs that have an SCB (system crossbar), developers can
      change the SCB priorities in the kernel configuration.
      Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
      Signed-off-by: Steven Miao <realmz6@gmail.com>
      24a70cf2
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 5a7d8a28
      Committed by Linus Torvalds
      Pull MIPS updates from Ralf Baechle:
       "This has been sitting in -next for a while with no objections and all
        MIPS defconfigs except one are building fine; that one platform got
        broken by another patch in your tree and I'm going to submit a patch
        separately.
      
         - a handful of fixes that didn't make 3.11
         - a few bits of Octeon 3 support with more to come for a later
           release
         - platform enhancements for Octeon, ath79, Lantiq, Netlogic and
           Ralink SOCs
         - a GPIO driver for the Octeon
         - some dusting off of the DECstation code
         - the usual dose of cleanups"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (65 commits)
        MIPS: DMA: Fix BUG due to smp_processor_id() in preemptible code
        MIPS: kexec: Fix random crashes while loading crashkernel
        MIPS: kdump: Skip walking indirection page for crashkernels
        MIPS: DECstation HRT calibration bug fixes
        MIPS: Export copy_from_user_page() (needed by lustre)
        MIPS: Add driver for the built-in PCI controller of the RT3883 SoC
        MIPS: DMA: For BMIPS5000 cores flush region just like non-coherent R10000
        MIPS: ralink: Add support for reset-controller API
        MIPS: ralink: mt7620: Add cpu-feature-override header
        MIPS: ralink: mt7620: Add spi clock definition
        MIPS: ralink: mt7620: Add wdt clock definition
        MIPS: ralink: mt7620: Improve clock frequency detection
        MIPS: ralink: mt7620: This SoC has EHCI and OHCI hosts
        MIPS: ralink: mt7620: Add verbose ram info
        MIPS: ralink: Probe clocksources from OF
        MIPS: ralink: Add support for systick timer found on newer ralink SoC
        MIPS: ralink: Add support for periodic timer irq
        MIPS: Netlogic: Built-in DTB for XLP2xx SoC boards
        MIPS: Netlogic: Add support for USB on XLP2xx
        MIPS: Netlogic: XLP2xx update for I2C controller
        ...
      5a7d8a28
    • Merge tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs · e0ea4045
      Committed by Linus Torvalds
      Pull xfs update #2 from Ben Myers:
       "Here we have defrag support for v5 superblock, a number of bugfixes
        and a cleanup or two.
      
         - defrag support for CRC filesystems
         - fix endian warning in xlog_recover_get_buf_lsn
         - fixes for sparse warnings
         - fix for assert in xfs_dir3_leaf_hdr_from_disk
         - fix for log recovery of remote symlinks
         - fix for log recovery of btree root splits
         - fixes for memory allocation failures with ACLs
         - fix for assert in xfs_buf_item_relse
         - fix for assert in xfs_inode_buf_verify
         - fix an assignment in an assert that should be a test in
           xfs_bmbt_change_owner
         - remove dead code in xlog_recover_inode_pass2"
      
      * tag 'xfs-for-linus-v3.12-rc1-2' of git://oss.sgi.com/xfs/xfs:
        xfs: remove dead code from xlog_recover_inode_pass2
        xfs: = vs == typo in ASSERT()
        xfs: don't assert fail on bad inode numbers
        xfs: aborted buf items can be in the AIL.
        xfs: factor all the kmalloc-or-vmalloc fallback allocations
        xfs: fix memory allocation failures with ACLs
        xfs: ensure we copy buffer type in da btree root splits
        xfs: set remote symlink buffer type for recovery
        xfs: recovery of swap extents operations for CRC filesystems
        xfs: swap extents operations for CRC filesystems
        xfs: check magic numbers in dir3 leaf verifier first
        xfs: fix some minor sparse warnings
        xfs: fix endian warning in xlog_recover_get_buf_lsn()
      e0ea4045
    • Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 48efe453
      Committed by Linus Torvalds
      Pull SCSI target updates from Nicholas Bellinger:
       "Lots of activity again this round for I/O performance optimizations
        (per-cpu IDA pre-allocation for vhost + iscsi/target), and the
        addition of new fabric independent features to target-core
        (COMPARE_AND_WRITE + EXTENDED_COPY).
      
        The main highlights include:
      
         - Support for iscsi-target login multiplexing across individual
           network portals
         - Generic Per-cpu IDA logic (kent + akpm + clameter)
         - Conversion of vhost to use per-cpu IDA pre-allocation for
           descriptors, SGLs and userspace page pointer list
         - Conversion of iscsi-target + iser-target to use per-cpu IDA
           pre-allocation for descriptors
         - Add support for generic COMPARE_AND_WRITE (atomic test-and-set)
           emulation for virtual backend drivers
         - Add support for generic EXTENDED_COPY (CopyOffload) emulation for
           virtual backend drivers.
         - Add support for fast memory registration mode to iser-target (Vu)
      
        The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of
        particular significance, as they make us the first and only open source
        target to support the full set of VAAI primitives.
      
        Currently Linux clients are lacking upstream support to actually
        utilize these primitives.  However, with server-side support now in
        place and folks like MKP + ZAB working on the client, this logic, once
        reserved for the highest end of storage arrays, can now be run in VMs
        on their laptops"
      
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits)
        target/iscsi: Bump versions to v4.1.0
        target: Update copyright ownership/year information to 2013
        iscsi-target: Bump default TCP listen backlog to 256
        target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out
        iscsi-target: Bump default CmdSN Depth to 64
        iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set
        iscsi-target: Add thread_set->ts_activate_sem + use common deallocate
        iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE
        target: remove unused including <linux/version.h>
        iser-target: introduce fast memory registration mode (FRWR)
        iser-target: generalize rdma memory registration and cleanup
        iser-target: move rdma wr processing to a shared function
        target: Enable global EXTENDED_COPY setup/release
        target: Add Third Party Copy (3PC) bit in INQUIRY response
        target: Enable EXTENDED_COPY setup in spc_parse_cdb
        target: Add support for EXTENDED_COPY copy offload emulation
        target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check
        target: Add global device list for EXTENDED_COPY
        target: Make helpers non static for EXTENDED_COPY command setup
        target: Make spc_parse_naa_6h_vendor_specific non static
        ...
      48efe453
    • Merge branch 'akpm' (patches from Andrew Morton) · ac4de954
      Committed by Linus Torvalds
      Merge more patches from Andrew Morton:
       "The rest of MM.  Plus one misc cleanup"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
        mm/Kconfig: add MMU dependency for MIGRATION.
        kernel: replace strict_strto*() with kstrto*()
        mm, thp: count thp_fault_fallback anytime thp fault fails
        thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
        thp: do_huge_pmd_anonymous_page() cleanup
        thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
        mm: cleanup add_to_page_cache_locked()
        thp: account anon transparent huge pages into NR_ANON_PAGES
        truncate: drop 'oldsize' truncate_pagecache() parameter
        mm: make lru_add_drain_all() selective
        memcg: document cgroup dirty/writeback memory statistics
        memcg: add per cgroup writeback pages accounting
        memcg: check for proper lock held in mem_cgroup_update_page_stat
        memcg: remove MEMCG_NR_FILE_MAPPED
        memcg: reduce function dereference
        memcg: avoid overflow caused by PAGE_ALIGN
        memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
        memcg: correct RESOURCE_MAX to ULLONG_MAX
        mm: memcg: do not trap chargers with full callstack on OOM
        mm: memcg: rework and document OOM waiting and wakeup
        ...
      ac4de954
    • mm/Kconfig: add MMU dependency for MIGRATION. · de32a817
      Committed by Chen Gang
      MIGRATION must depend on MMU, or allmodconfig for the nommu sh
      architecture fails to build:
      
          CC      mm/migrate.o
        mm/migrate.c: In function 'remove_migration_pte':
        mm/migrate.c:134:3: error: implicit declaration of function 'pmd_trans_huge' [-Werror=implicit-function-declaration]
           if (pmd_trans_huge(*pmd))
           ^
        mm/migrate.c:149:2: error: implicit declaration of function 'is_swap_pte' [-Werror=implicit-function-declaration]
          if (!is_swap_pte(pte))
          ^
        ...
      
      Also make CMA depend on MMU; otherwise, on NOMMU, selecting CMA would
      force-select MIGRATION.
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de32a817
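
      A sketch of the corresponding mm/Kconfig change (reconstructed from the
      commit message; the surrounding option text is an assumption about that
      era's Kconfig and may differ in detail):

        config MIGRATION
                bool "Page migration"
                def_bool y
                depends on (NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION || CMA) && MMU

        config CMA
                bool "Contiguous Memory Allocator"
                depends on HAVE_MEMBLOCK && MMU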
    • kernel: replace strict_strto*() with kstrto*() · 6072ddc8
      Committed by Jingoo Han
      The strict_strto*() helpers are obsolete and their use is discouraged;
      convert the remaining callers to kstrto*().
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6072ddc8
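
      The conversion pattern, as a minimal sketch (buffer and variable names
      are hypothetical):

        unsigned long val;
        int ret;

        /* before: ret = strict_strtoul(buf, 10, &val); */
        ret = kstrtoul(buf, 10, &val);  /* returns 0 on success, -errno on failure */
        if (ret)
                return ret;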
    • mm, thp: count thp_fault_fallback anytime thp fault fails · 17766dde
      Committed by David Rientjes
      Currently, thp_fault_fallback in vmstat only gets incremented if a
      hugepage allocation fails.  If current's memcg hits its limit or the page
      fault handler returns an error, it is incorrectly accounted as a
      successful thp_fault_alloc.
      
      Count thp_fault_fallback anytime the page fault handler falls back to
      using regular pages and only count thp_fault_alloc when a hugepage has
      actually been faulted.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17766dde
    • thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() · c0292554
      Committed by Kirill A. Shutemov
      do_huge_pmd_anonymous_page() contains a copy-pasted piece of
      handle_mm_fault() to handle the fallback path.

      Let's consolidate the code by introducing a VM_FAULT_FALLBACK return
      code.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0292554
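
      The consolidated call site in handle_mm_fault() then takes roughly this
      shape (a sketch of the pattern, not a verbatim excerpt from the patch):

        if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
                int ret = do_huge_pmd_anonymous_page(mm, vma, address,
                                                     pmd, flags);
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
                /* VM_FAULT_FALLBACK: fall through to the regular 4k path */
        }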
    • thp: do_huge_pmd_anonymous_page() cleanup · 128ec037
      Committed by Kirill A. Shutemov
      Minor cleanup: unindent most of the function's code by inverting one
      condition.  This is preparation for the next patch.
      
      No functional changes.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      128ec037
    • thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() · 3122359a
      Committed by Kirill A. Shutemov
      It's confusing that mk_huge_pmd() has semantics different from mk_pte()
      and mk_pmd().  I spent some time debugging an issue caused by this
      inconsistency.

      Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
      prototype to match mk_pte().
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3122359a
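
      With the adjusted prototype, callers apply the write bit themselves,
      mirroring the mk_pte() convention (a sketch of the intended usage):

        pmd_t entry;

        entry = mk_huge_pmd(page, vma->vm_page_prot);        /* (page, prot), like mk_pte() */
        entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);  /* write bit set by the caller */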
    • mm: cleanup add_to_page_cache_locked() · 66a0c8ee
      Committed by Kirill A. Shutemov
      Make add_to_page_cache_locked() cleaner:
      
       - unindent most of the function's code by inverting one condition;
       - streamline the no-error code path;
       - move the insertion error path out of the normal code path;
       - call radix_tree_preload_end() earlier.
      
      No functional changes.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66a0c8ee
    • thp: account anon transparent huge pages into NR_ANON_PAGES · 3cd14fcd
      Committed by Kirill A. Shutemov
      We use NR_ANON_PAGES as the basis for reporting AnonPages to userspace.
      There's not much sense in excluding transparent huge pages from that
      counter only to add them back in when printing to userspace.

      Let's account transparent huge pages in NR_ANON_PAGES in the first place.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Ning Qu <quning@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3cd14fcd
    • truncate: drop 'oldsize' truncate_pagecache() parameter · 7caef267
      Committed by Kirill A. Shutemov
      truncate_pagecache() doesn't care about old size since commit
      cedabed4 ("vfs: Fix vmtruncate() regression").  Let's drop it.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7caef267
    • mm: make lru_add_drain_all() selective · 5fbc4616
      Committed by Chris Metcalf
      Make lru_add_drain_all() selectively interrupt only the CPUs that have
      per-cpu pagevec pages that can be drained.
      
      This is important in nohz mode where calling mlockall(), for example,
      otherwise will interrupt every core unnecessarily.
      
      This is important on workloads where nohz cores are handling 10 Gb traffic
      in userspace.  Those CPUs do not enter the kernel and place pages into LRU
      pagevecs and they really, really don't want to be interrupted, or they
      drop packets on the floor.
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5fbc4616
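
      The selective drain, in outline (a sketch close to the patch's approach;
      the exact set of pagevecs checked is abbreviated):

        static struct cpumask has_work;
        int cpu;

        cpumask_clear(&has_work);
        for_each_online_cpu(cpu) {
                struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

                /* queue drain work only where a pagevec actually holds pages */
                if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
                    need_activate_page_drain(cpu)) {
                        INIT_WORK(work, lru_add_drain_per_cpu);
                        schedule_work_on(cpu, work);
                        cpumask_set_cpu(cpu, &has_work);
                }
        }

        for_each_cpu(cpu, &has_work)
                flush_work(&per_cpu(lru_add_drain_work, cpu));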
    • memcg: document cgroup dirty/writeback memory statistics · 9cb2dc1c
      Committed by Sha Zhengju
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cb2dc1c
    • memcg: add per cgroup writeback pages accounting · 3ea67d06
      Committed by Sha Zhengju
      Add memcg routines to count writeback pages; dirty pages will be
      accounted later.

      After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
      accounting"), we can use the 'struct page' flags to test page state
      instead of a per-page_cgroup flag.  But memcg can move a page from one
      cgroup to another, so there is a potential race between "move" and
      "page stat accounting".  In order to avoid the race we have designed a
      new lock:

               mem_cgroup_begin_update_page_stat()
               modify page information        -->(a)
               mem_cgroup_update_page_stat()  -->(b)
               mem_cgroup_end_update_page_stat()

      Both (a) and (b) (the writeback page accounting) must be protected by
      mem_cgroup_{begin/end}_update_page_stat().  This is a complete no-op for
      !CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an
      rcu read lock in most cases (no task is moving), and spin_lock_irqsave
      on top in the slow path.

      There are two writeback interfaces to modify:
      test_{clear/set}_page_writeback().  The lock order is:
      	--> memcg->move_lock
      	  --> mapping->tree_lock
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea67d06
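
      How the accounting slots into the writeback path under that lock, as a
      sketch (following the memcg page-stat API of this series; treat the
      details as illustrative):

        bool locked;
        unsigned long flags;
        int ret;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        ret = test_set_page_writeback(page);           /* (a) modify page information */
        if (!ret)                                      /* page was not yet in writeback */
                mem_cgroup_inc_page_stat(page,         /* (b) page stat accounting */
                                         MEM_CGROUP_STAT_WRITEBACK);
        mem_cgroup_end_update_page_stat(page, &locked, &flags);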
    • memcg: check for proper lock held in mem_cgroup_update_page_stat · 658b72c5
      Committed by Sha Zhengju
      We should call mem_cgroup_begin_update_page_stat() before
      mem_cgroup_update_page_stat() to get the proper locks, but the latter
      doesn't verify that the proper locking is actually in use, which would
      be hard to do.  As suggested by Michal Hocko, we can at least test for
      rcu_read_lock_held(), because RCU is held if !mem_cgroup_disabled().
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      658b72c5
    • memcg: remove MEMCG_NR_FILE_MAPPED · 68b4876d
      Committed by Sha Zhengju
      When accounting memcg page stats, it is not worth using
      MEMCG_NR_FILE_MAPPED as an extra layer of indirection, given the
      complexity and presumed performance overhead.  We can use
      MEM_CGROUP_STAT_FILE_MAPPED directly.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Fengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68b4876d
    • memcg: reduce function dereference · 1a36e59d
      Committed by Sha Zhengju
      This function dereferences res far too often, so optimize it.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a36e59d
    • memcg: avoid overflow caused by PAGE_ALIGN · 3af33516
      Committed by Sha Zhengju
      PAGE_ALIGN rounds up to the next page boundary, so the aligned value can
      overflow, for example when writing the maximum value back to
      *.limit_in_bytes:
      
        $ cat /cgroup/memory/memory.limit_in_bytes
        18446744073709551615
      
        # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
        bash: echo: write error: Invalid argument
      
      Some user programs depend on this behaviour (libcg, for example, reads
      the value in a snapshot and later uses it to reset the cgroup), and the
      failure causes confusion, so we need to fix it.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3af33516
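
      The arithmetic behind the failure, as a sketch (PAGE_ALIGN written out
      for a 4 KiB page size):

        /* PAGE_ALIGN(x) is essentially ((x) + PAGE_SIZE - 1) & PAGE_MASK */
        unsigned long long limit = 18446744073709551615ULL;  /* ULLONG_MAX, as read back */
        unsigned long long aligned = (limit + 4096 - 1) & ~4095ULL;
        /* limit + 4095 wraps around to 4094, so "aligned" becomes 0 and the
         * write is rejected with -EINVAL ("Invalid argument"). */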
    • memcg: rename RESOURCE_MAX to RES_COUNTER_MAX · 6de5a8bf
      Committed by Sha Zhengju
      RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6de5a8bf
    • memcg: correct RESOURCE_MAX to ULLONG_MAX · 34ff8dc0
      Committed by Sha Zhengju
      RESOURCE_MAX is currently ULONG_MAX, but the value used to set a
      resource limit is unsigned long long, so a limit larger than
      RESOURCE_MAX can be set, which is strange.  A *_MAX constant should be
      the reasonable maximum value; anything bigger should overflow.
      
      Notice that this change affects the user-visible output of the default
      *.limit_in_bytes:
      before change:
      
        $ cat /cgroup/memory/memory.limit_in_bytes
        9223372036854775807
      
      after change:
      
        $ cat /cgroup/memory/memory.limit_in_bytes
        18446744073709551615
      
      But it doesn't alter the API in terms of input: we can still use "echo -1
      > *.limit_in_bytes" to reset the number to "unlimited".
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Jeff Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34ff8dc0
    • mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Committed by Johannes Weiner
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Similarly, any other task
      that enters the charge path at this point goes to a waitqueue right then
      and there and sleeps until the OOM situation is resolved.  The problem is
      that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need in order to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM-invoking task will retry the charge indefinitely while the
      OOM-killed task, blocked on the i_mutex it needs, cannot release any
      resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
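
      In outline, the two-step scheme looks like this (a sketch assembled from
      the description above; function signatures are illustrative, not quoted
      from the patch):

        /* charge path: never sleep here with locks held */
        if (oom) {
                mem_cgroup_oom(memcg, gfp_mask);  /* only records the OOMing memcg
                                                     in current's task_struct */
                return -ENOMEM;                   /* fully unwind the fault stack */
        }

        /* called after the page fault stack has been unwound: */
        void pagefault_out_of_memory(void)
        {
                /* sleep on the memcg's OOM waitqueue or restart the fault;
                   no locks are held at this point */
                if (mem_cgroup_oom_synchronize())
                        return;
                /* otherwise fall through to the global OOM killer */
        }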
    • mm: memcg: rework and document OOM waiting and wakeup · fb2a6fc5
      Committed by Johannes Weiner
      The memcg OOM handler open-codes a sleeping lock for OOM serialization
      (trylock, wait, repeat) because the required locking is so specific to
      memcg hierarchies.  However, it would be nice if this construct were
      clearly recognizable and not as obfuscated as it is right now.  Clean
      up as follows:
      
      1. Remove the return value of mem_cgroup_oom_unlock()
      
      2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
      
      3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
         makes it more obvious that the task has to be on the waitqueue
         before attempting to OOM-trylock the hierarchy, to not miss any
         wakeups before going to sleep.  It just didn't matter until now
         because it was all lumped together into the global memcg_oom_lock
         spinlock section.
      
      4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
         It is protected by the hierarchical OOM-lock.
      
      5. The memcg_oom_lock spinlock is only required to propagate the OOM
         lock in any given hierarchy atomically.  Restrict its scope to
         mem_cgroup_oom_(trylock|unlock).
      
      6. Do not wake up the waitqueue unconditionally at the end of the
         function.  Only the lockholder has to wake up the next in line
         after releasing the lock.
      
         Note that the lockholder kicks off the OOM-killer, which in turn
         leads to wakeups from the uncharges of the exiting task.  But a
         contender is not guaranteed to see them if it enters the OOM path
         after the OOM kills but before the lockholder releases the lock.
         Thus there has to be an explicit wakeup after releasing the lock.
      
      7. Put the OOM task on the waitqueue before marking the hierarchy as
         under OOM as that is the point where we start to receive wakeups.
         No point in listening before being on the waitqueue.
      
      8. Likewise, unmark the hierarchy before finishing the sleep, for
         symmetry.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb2a6fc5
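
      Taken together, the sleep path ends up shaped roughly like this (a
      sketch of the ordering described in points 3-6; owait stands for the
      task's OOM waitqueue entry, and other details are abbreviated):

        /* get on the waitqueue first, so no wakeup between a failed
         * trylock and going to sleep can be missed */
        prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);

        locked = mem_cgroup_oom_trylock(memcg);
        if (locked)
                mem_cgroup_oom_notify(memcg);    /* outside the spinlock scope */

        if (locked && !memcg->oom_kill_disable) {
                finish_wait(&memcg_oom_waitq, &owait.wait);
                mem_cgroup_out_of_memory(memcg, mask, order);
        } else {
                schedule();                      /* wait for the lockholder */
                finish_wait(&memcg_oom_waitq, &owait.wait);
        }

        if (locked) {
                mem_cgroup_oom_unlock(memcg);
                memcg_oom_recover(memcg);        /* explicit wakeup for contenders */
        }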
    • mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Committed by Johannes Weiner
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      519e5247
    • x86: finish user fault error path with fatal signal · 3a13c4d7
      Committed by Johannes Weiner
      The x86 fault handler bails in the middle of error handling when the
      task has a fatal signal pending.  For a subsequent patch this is a
      problem in OOM situations because it relies on pagefault_out_of_memory()
      being called even when the task has been killed, to perform proper
      per-task OOM state unwinding.
      
      Shortcutting the fault like this is a rather minor optimization that
      saves a few instructions in rare cases.  Just remove it for
      user-triggered faults.
      
      Use the opportunity to split the fault retry handling from actual fault
      errors and add locking documentation that reads surprisingly similarly
      to ARM's.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a13c4d7
    • arch: mm: pass userspace fault flag to generic fault handler · 759496ba
      Committed by Johannes Weiner
      Unlike global OOM handling, memory cgroup code will invoke the OOM
      killer in any OOM situation because it has no way of telling faults
      occurring in kernel context - which could be handled more gracefully -
      from user-triggered faults.
      
      Pass a flag that identifies faults originating in user space from the
      architecture-specific fault handlers to generic code so that memcg OOM
      handling can be improved.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      759496ba
    • arch: mm: do not invoke OOM killer on kernel fault OOM · 87134102
      Committed by Johannes Weiner
      Kernel faults are expected to handle OOM conditions gracefully (gup,
      uaccess etc.), so they should never invoke the OOM killer.  Reserve this
      for faults triggered in user context when it is the only option.
      
      Most architectures already do this, fix up the remaining few.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87134102
    • arch: mm: remove obsolete init OOM protection · 94bce453
      Committed by Johannes Weiner
      The memcg code can trap tasks in the context of the failing allocation
      until an OOM situation is resolved.  Those tasks can hold all kinds of
      locks (fs, mm) at this point, which makes the situation prone to
      deadlock.
      
      This series converts memcg OOM handling into a two step process that is
      started in the charge context, but any waiting is done after the fault
      stack is fully unwound.
      
      Patches 1-4 prepare architecture handlers to support the new memcg
      requirements, but in doing so they also remove old cruft and unify
      out-of-memory behavior across architectures.
      
      Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
      faults, because they can gracefully unwind the stack with -ENOMEM.  OOM
      handling is restricted to user triggered faults that have no other
      option.
      
      Patch 6 reworks memcg's hierarchical OOM locking to make it a little
      more obvious what is going on in there: reduce locked regions, rename
      locking functions, reorder and document.
      
      Patch 7 implements the two-part OOM handling such that tasks are never
      trapped with the full charge stack in an OOM situation.
      
      This patch:
      
      Back before smart OOM killing, when faulting tasks were killed directly on
      allocation failures, the arch-specific fault handlers needed special
      protection for the init process.
      
      Now that all fault handlers call into the generic OOM killer (see commit
      609838cf: "mm: invoke oom-killer from remaining unconverted page
      fault handlers"), which already provides init protection, the
      arch-specific leftovers can be removed.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc bits]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94bce453
    • memcg: trivial cleanups · f894ffa8
      Committed by Andrew Morton
      Clean up some mess made by the "Soft limit rework" series, and a few other
      things.
      
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f894ffa8
    • memcg, vmscan: do not fall into reclaim-all pass too quickly · e975de99
      Committed by Michal Hocko
      shrink_zone starts with a soft reclaim pass and then falls back to
      regular reclaim if nothing has been scanned.  This behavior is natural
      but there is a catch.  Memcg iterators, when used with the reclaim
      cookie, are designed to help prevent over-reclaim by interleaving
      reclaimers (per node-zone-priority), so the tree walk might miss many
      (or even all) nodes in the hierarchy, e.g. when direct reclaimers race
      with each other or with kswapd in the global case, or when multiple
      allocators hit the limit in the target reclaim case.  To make it even
      more complicated, targeted reclaim doesn't do the whole tree walk,
      because it stops reclaiming once it has reclaimed sufficient pages.  As
      a result, groups over the limit might be missed, nothing is scanned,
      and reclaim falls back to the reclaim-all mode.
      
      This patch checks for an incomplete tree walk in shrink_zone.  If no
      group has been visited and the hierarchy is soft-reclaimable, then we
      must have missed some groups, in which case __shrink_zone is called
      again.  This doesn't guarantee any progress, of course, because the
      current reclaimer might still be racing with others, but it at least
      gives the walk a chance to start without a big risk of reclaim
      latencies.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e975de99
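
      The retry, in outline (a sketch following the description above;
      mem_cgroup_should_soft_reclaim() and the __shrink_zone() return value
      come from this series, other details are abbreviated):

        static void shrink_zone(struct zone *zone, struct scan_control *sc)
        {
                bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
                int groups_scanned;

                groups_scanned = __shrink_zone(zone, sc, do_soft_reclaim);

                /* The soft reclaim walk visited no group at all -- racing
                 * reclaimers may have consumed the whole walk.  Retry once
                 * instead of falling straight into the reclaim-all pass. */
                if (do_soft_reclaim && !groups_scanned)
                        __shrink_zone(zone, sc, do_soft_reclaim);
        }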