1. 15 Jan, 2016 24 commits
    • J
      mm: page_alloc: generalize the dirty balance reserve · a8d01437
      Johannes Weiner committed
      The dirty balance reserve that dirty throttling has to consider is
      merely memory not available to userspace allocations.  There is nothing
      writeback-specific about it.  Generalize the name so that it's reusable
      outside of that context.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8d01437
    • M
      mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e
      Michal Hocko committed
      page_cache_read has been historically using page_cache_alloc_cold to
      allocate a new page.  This means that mapping_gfp_mask is used as the
      base for the gfp_mask.  Many filesystems set this mask to
      GFP_NOFS to prevent fs recursion issues.  page_cache_read is called
      from the vm_operations_struct::fault() context during the page fault.
      This context doesn't need the reclaim protection normally.
      
      ceph and ocfs2 which call filemap_fault from their fault handlers seem
      to be OK because they are not taking any fs lock before invoking generic
      implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
      reclaim recursion POV because this lock serializes truncate and punch
      hole with the page faults and it doesn't get involved in the reclaim.
      
      There is simply no reason to deliberately use a weaker allocation
      context when __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
      might even be harmful.  There is a push to fail GFP_NOFS allocations
      rather than loop within the allocator indefinitely with a very limited
      reclaim ability.  Once we start failing those requests the OOM killer
      might be triggered prematurely because the page cache allocation failure
      is propagated up the page fault path and ends up in
      pagefault_out_of_memory.
      
      We cannot play with mapping_gfp_mask directly because that would be racy
      wrt. parallel page faults and it might interfere with other users who
      really rely on NOFS semantics from the stored gfp_mask.  The mask also
      belongs to the inode proper, so changing it here would even be a layering
      violation.  What we can do instead is to push the gfp_mask into struct
      vm_fault and allow the fs layer to overwrite it should the callback need
      to be called with a different allocation context.
      
      Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
      because this should be safe from the page fault path normally.  Why do
      we care about mapping_gfp_mask at all then? Because this doesn't hold
      only reclaim protection flags but it also might contain zone and
      movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
      to respect those.
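
The default described above can be modeled in a few lines. This is a userspace sketch only, not kernel code: the `Gfp` class, its bit values, and the helper name `fault_gfp_mask` are assumptions made for illustration and do not match the real definitions in the kernel headers.

```python
from enum import IntFlag


class Gfp(IntFlag):
    # Illustrative bit positions only; these are NOT the kernel's values.
    IO = 0x01       # stands in for __GFP_IO
    FS = 0x02       # stands in for __GFP_FS
    MOVABLE = 0x04  # stands in for __GFP_MOVABLE
    DMA32 = 0x08    # stands in for GFP_DMA32


def fault_gfp_mask(mapping_gfp_mask: Gfp) -> Gfp:
    """Default mask for the fault path: keep the mapping's bits, add FS and IO."""
    return mapping_gfp_mask | Gfp.FS | Gfp.IO


# A NOFS-style mapping mask: IO allowed, FS not, plus zone/movability bits.
nofs_mapping = Gfp.IO | Gfp.MOVABLE | Gfp.DMA32
mask = fault_gfp_mask(nofs_mapping)
```

The zone and movability bits of the mapping survive, while the reclaim restriction is lifted for this one allocation without touching the stored mask.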
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Jan Kara <jack@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c20cd45e
    • D
      mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman committed
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was the libstagefright
      vulnerabilities that affected Android, specifically the information
      provided by Google's Project Zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented there specifically targeted the limited
      randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
      the mmap ASLR after just over 10 minutes (128 tries at 5 seconds
      apiece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
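
The figures in this paragraph follow from a simple expected-value calculation, reproduced here as a sketch. It assumes the attacker tries candidate offsets without repetition, so that on average half of the 2^bits space is searched; the function name is hypothetical.

```python
def expected_bruteforce_seconds(entropy_bits: int, seconds_per_try: float) -> float:
    """Average time to guess a random offset drawn from 2**entropy_bits values."""
    expected_tries = 2 ** entropy_bits / 2  # on average, half the space is searched
    return expected_tries * seconds_per_try


t8 = expected_bruteforce_seconds(8, 5)    # 128 tries at 5 seconds each
t16 = expected_bruteforce_seconds(16, 5)  # 32768 tries at 5 seconds each

print(f"8 bits:  {t8 / 60:.1f} minutes")
print(f"16 bits: {t16 / 3600:.1f} hours")
```

Each extra bit of entropy doubles the expected time, which is why moving from 8 to 16 bits turns minutes into days.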
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When deciding whether or not to change the default value, a system
      developer should consider that the mmap_base address could be placed
      anywhere up to 2^(value) pages away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
      Signed-off-by: Daniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d07e2259
    • V
      memcg: do not allow to disable tcp accounting after limit is set · 9ee11ba4
      Vladimir Davydov committed
      There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
      and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
      is never cleared while the latter can be cleared by unsetting the limit.
      This makes it possible to disable tcp socket accounting for new sockets
      after it was enabled, by writing -1 to memory.kmem.tcp.limit_in_bytes,
      while still guaranteeing that the memcg_socket_limit_enabled static key
      will be decremented on memcg destruction.
      
      This functionality looks dubious, because it is not clear what a use
      case would be.  By enabling tcp accounting a user accepts the price.  If
      they then find the performance degradation unacceptable, they can always
      restart their workload with tcp accounting disabled.  It does not seem
      there is any need to flip it while the workload is running.
      
      Besides, it contradicts how the kmem accounting API works: writing
      any value to memory.kmem.limit_in_bytes enables kmem accounting for the
      cgroup in question, after which it cannot be disabled.  Therefore one
      might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
      enables socket accounting w/o limiting it, which might be useful by
      itself, but it isn't true.
      
      Since this API peculiarity is not documented anywhere, I propose to drop
      it.  This will allow simplifying the code by dropping cg_proto->flags.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ee11ba4
    • D
      mm, vmalloc: remove VM_VPAGES · 244d63ee
      David Rientjes committed
      VM_VPAGES is unnecessary; it's easier to check is_vmalloc_addr() when
      reading /proc/vmallocinfo.
      
      [akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      244d63ee
    • J
      mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand committed
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  In preparation for extending the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
    • V
      mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings · 48131e03
      Vlastimil Babka committed
      Following the previous patch, further reduction of /proc/pid/smaps cost
      is possible for private writable shmem mappings with unpopulated areas
      where the page walk invokes the .pte_hole function.  We can use radix
      tree iterator for each such area instead of calling find_get_entry() in
      a loop.  This is possible at the extra maintenance cost of introducing
      another shmem function shmem_partial_swap_usage().
      
      To demonstrate the difference, I have measured this on a process that
      creates a private writable 2GB mapping of a partially swapped out
      /dev/shm/file (which cannot employ the optimizations from the previous
      patch) and doesn't populate it at all.  I time how long it takes to
      cat /proc/pid/smaps of this process 100 times.
      
      Before this patch:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      After this patch:
      
      real    0m1.176s
      user    0m0.180s
      sys     0m0.684s
      
      The time is similar to the case where a radix tree iterator is employed
      on the whole mapping.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48131e03
    • V
      mm, proc: reduce cost of /proc/pid/smaps for shmem mappings · 6a15a370
      Vlastimil Babka committed
      The previous patch has improved swap accounting for shmem mappings, which
      however made /proc/pid/smaps more expensive for shmem mappings, as we
      consult the radix tree for each pte_none entry, so the overall complexity
      is O(n*log(n)).
      
      We can reduce this significantly for mappings that cannot contain COWed
      pages, because then we can either use the statistics that the shmem object
      itself tracks (if the mapping contains the whole object, or the swap
      usage of the whole object is zero), or use the radix tree iterator,
      which is much more effective than repeated find_get_entry() calls.
      
      This patch therefore introduces a function shmem_swap_usage(vma) and
      makes /proc/pid/smaps use it when possible.  Only for writable private
      mappings of shmem objects (i.e. tmpfs files) with the shmem object
      itself (partially) swapped out do we have to resort to the
      find_get_entry() approach.
      
      Hopefully such mappings are relatively uncommon.
      
      To demonstrate the difference, I have measured this on a process that
      creates a 2GB mapping and dirties single pages with a stride of 2MB, and
      timed how long it takes to cat /proc/pid/smaps of this process 100
      times.
      
      Private writable mapping of a /dev/shm/file (the most complex case):
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
      (which needs to employ the radix tree iterator).
      
      real    0m1.351s
      user    0m0.096s
      sys     0m0.768s
      
      Same, but with /dev/shm/file not swapped (so no radix tree walk needed)
      
      real    0m0.935s
      user    0m0.128s
      sys     0m0.344s
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      The cost is now much closer to the private anonymous mapping case, unless
      the shmem mapping is private and writable.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a15a370
    • Y
      mm/mmzone.c: memmap_valid_within() can be boolean · 5b80287a
      Yaowei Bai committed
      Make memmap_valid_within() return bool, since this function only ever
      returns one or zero.
      
      No functional change.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b80287a
    • Y
      mm/zonelist: enumerate zonelists array index · c00eb15a
      Yaowei Bai committed
      Hardcoding the index into the zonelists array in gfp_zonelist() is not a
      good idea; let's enumerate it to improve readability.
      
      No functional change.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c00eb15a
    • Y
      include/linux/mmzone.h: remove unused is_unevictable_lru() · 06640290
      Yaowei Bai committed
      Since commit a0b8cab3 ("mm: remove lru parameter from
      __pagevec_lru_add and remove parts of pagevec API") there's no
      user of this function anymore, so remove it.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06640290
    • Y
      mm/memblock.c: memblock_is_memory()/reserved() can be boolean · b4ad0c7e
      Yaowei Bai committed
      Make memblock_is_memory() and memblock_is_reserved() return bool to
      improve readability, since these functions only ever return one or
      zero.
      
      No functional change.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4ad0c7e
    • Y
      include/linux/hugetlb.h: is_file_hugepages() can be boolean · 719ff321
      Yaowei Bai committed
      Make is_file_hugepages() return bool to improve readability, since this
      function only ever returns one or zero.
      
      This patch also removes the if condition so that is_file_hugepages()
      returns the result directly.
      
      No functional change.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      719ff321
    • Y
      mm: change mm_vmscan_lru_shrink_inactive() proto types · ba5e9579
      yalin wang committed
      Move the node_id, zone_idx and shrink flags into the trace function, so
      that we don't need to calculate these args if the trace is disabled.
      This also leaves the function with fewer arguments.
      Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba5e9579
    • J
      mm/page_isolation.c: add new tracepoint, test_pages_isolated · 0f0848e5
      Joonsoo Kim committed
      CMA allocation should be guaranteed to succeed, but sometimes it can
      fail in the current implementation.  To track down the problem, we need
      to know which page is problematic and this new tracepoint will report
      it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f0848e5
    • N
      mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5
      Nathan Zimmer committed
      When running the SPECint_rate gcc on some very large boxes it was
      noticed that the system was spending lots of time in
      mpol_shared_policy_lookup().  The gamess benchmark can also show it and
      is what I mostly used to chase down the issue, since I found its setup
      to be easier.
      
      To be clear the binaries were on tmpfs because of disk I/O requirements.
      We then used text replication to avoid icache misses and having all the
      copies banging on the memory where the instruction code resides.  This
      results in us hitting a bottleneck in mpol_shared_policy_lookup() since
      lookup is serialised by the shared_policy lock.
      
      I have only reproduced this on very large (3k+ cores) boxes.  The
      problem starts showing up at just a few hundred ranks getting worse
      until it threatens to livelock once it gets large enough.  For example
      on the gamess benchmark at 128 ranks this area consumes only ~1% of
      time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
      90%.
      
      To alleviate the contention in this area I converted the spinlock to an
      rwlock.  This allows a large number of lookups to happen simultaneously.
      The results were quite good, reducing this consumption at max ranks to
      around 2%.
      
      [akpm@linux-foundation.org: tidy up code comments]
      Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a8c7bb5
    • C
      mm: add PHYS_PFN, use it in __phys_to_pfn() · 8f235d1a
      Chen Gang committed
      __phys_to_pfn and __pfn_to_phys are symmetric, and PHYS_PFN and PFN_PHYS
      are symmetric:
      
       - y = (phys_addr_t)x << PAGE_SHIFT
      
       - y >> PAGE_SHIFT = (phys_addr_t)x
      
       - (unsigned long)(y >> PAGE_SHIFT) = x
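
The symmetry listed above can be checked with a small userspace model. It assumes 4 KiB pages (PAGE_SHIFT of 12), and the lowercase function names are stand-ins for the kernel macros rather than their actual definitions.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages; the real value is per-architecture


def pfn_phys(pfn: int) -> int:
    """Model of PFN_PHYS: page frame number to physical address."""
    return pfn << PAGE_SHIFT


def phys_pfn(phys: int) -> int:
    """Model of PHYS_PFN: physical address to page frame number."""
    return phys >> PAGE_SHIFT


x = 0x1234
y = pfn_phys(x)             # y = x << PAGE_SHIFT
round_trip = phys_pfn(y)    # y >> PAGE_SHIFT gives x back
```

The round trip is exact only when starting from a page frame number; going from an arbitrary physical address to a pfn and back drops the offset within the page.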
      
      [akpm@linux-foundation.org: use macro arg name `x']
      [arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f235d1a
    • Y
      mm/vmscan.c: change trace_mm_vmscan_writepage() proto type · 3aa23851
      yalin wang committed
      Move trace_reclaim_flags() into the trace function, so that we don't need
      to calculate these flags if the trace is disabled.
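
The idea of deferring the flag computation until tracing is known to be enabled can be illustrated with a toy model. This is not how kernel tracepoints are implemented; the `WritepageTracepoint` class, the `reclaim_flags` helper, and the page dictionary are all hypothetical and exist only to show the call pattern.

```python
flag_computations = 0  # counts how often the (pretend) expensive helper runs


def reclaim_flags(page: dict) -> int:
    """Hypothetical stand-in for trace_reclaim_flags(): derive flags from a page."""
    global flag_computations
    flag_computations += 1
    return 0x1 if page.get("file_backed") else 0x2


class WritepageTracepoint:
    """Toy tracepoint: argument derivation happens only when tracing is on."""

    def __init__(self) -> None:
        self.enabled = False
        self.events = []

    def __call__(self, page: dict) -> None:
        if not self.enabled:
            return  # disabled: the flags are never computed
        self.events.append((page["pfn"], reclaim_flags(page)))


trace_writepage = WritepageTracepoint()
page = {"pfn": 42, "file_backed": True}

trace_writepage(page)                   # tracing off: helper is not called
calls_while_disabled = flag_computations

trace_writepage.enabled = True
trace_writepage(page)                   # tracing on: helper runs once
calls_while_enabled = flag_computations
```

The caller passes only the page; whatever is derived from it is computed inside the tracepoint, so the disabled path costs a single check.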
      Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3aa23851
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov committed
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d097056
    • V
      slab: add SLAB_ACCOUNT flag · 230e9fc2
      Vladimir Davydov committed
      Currently, if we want to account all objects of a particular kmem cache,
      we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
      inconvenient.  This patch introduces SLAB_ACCOUNT flag which if passed
      to kmem_cache_create will force accounting for every allocation from
      this cache even if __GFP_ACCOUNT is not passed.
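
The accounting rule described in this paragraph amounts to an OR of a per-cache flag and a per-allocation flag. The sketch below models that decision only; the bit values, the `KmemCache` class, and `is_accounted` are assumptions for illustration and are not the kernel's definitions.

```python
from dataclasses import dataclass

GFP_ACCOUNT = 0x1   # hypothetical bit value standing in for __GFP_ACCOUNT
SLAB_ACCOUNT = 0x1  # hypothetical bit value standing in for SLAB_ACCOUNT


@dataclass
class KmemCache:
    """Minimal stand-in for a kmem cache: a name and its creation flags."""
    name: str
    flags: int = 0


def is_accounted(cache: KmemCache, gfp_flags: int = 0) -> bool:
    """An allocation is accounted if either the cache or the call asks for it."""
    return bool(cache.flags & SLAB_ACCOUNT) or bool(gfp_flags & GFP_ACCOUNT)


plain_cache = KmemCache("plain")
accounted_cache = KmemCache("accounted", flags=SLAB_ACCOUNT)
```

With the cache-level flag set, callers no longer need to pass the per-allocation flag on every call.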
      
      This patch does not make any of the existing caches use this flag - it
      will be done later in the series.
      
      Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
      SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
      hence cannot have different sets of SLAB_* flags.  Thus using this flag
      will probably reduce the number of merged slabs even if kmem accounting
      is not used (only compiled in).
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      230e9fc2
    • V
      memcg: only account kmem allocations marked as __GFP_ACCOUNT · a9bb7e62
      Vladimir Davydov committed
      The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to
      be fragile and difficult to maintain, because there seem to be many more
      allocations that should not be accounted than those that should be.
      Besides, falsely accounting an allocation might result in much worse
      consequences than not accounting at all, namely increased memory
      consumption due to pinned dead kmem caches.
      
      So this patch switches kmem accounting to the white-list policy: now only
      those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
      memcg.  Currently, no kmem allocations are marked like this.  The
      following patches will mark several kmem allocations that are known to
      be easily triggered from userspace and therefore should be accounted to
      memcg.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9bb7e62
    • V
      Revert "gfp: add __GFP_NOACCOUNT" · 20b5c303
      Vladimir Davydov committed
      This reverts commit 8f4fc071 ("gfp: add __GFP_NOACCOUNT").
      
      The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to
      be fragile and difficult to maintain, because there seem to be many more
      allocations that should not be accounted than those that should be.
      Besides, falsely accounting an allocation might result in much worse
      consequences than not accounting at all, namely increased memory
      consumption due to pinned dead kmem caches.
      
      So it was decided to switch to the white-list policy.  This patch
      reverts bits introducing the black-list policy.  The white-list policy
      will be introduced later in the series.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20b5c303
    • A
      include/linux/dcache.h: remove semicolons from HASH_LEN_DECLARE · 2bd03e49
      Andrew Morton committed
      A little cleanup - the invocation site provides the semicolon.
      
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bd03e49
    • J
      fsnotify: destroy marks with call_srcu instead of dedicated thread · c510eff6
      Jeff Layton committed
      At the time that this code was originally written, call_srcu didn't
      exist, so this thread was required to ensure that we waited for that
      SRCU grace period to settle before finally freeing the object.
      
      It does exist now however and we can much more efficiently use call_srcu
      to handle this.  That also allows us to potentially use srcu_barrier to
      ensure that all of the callbacks have run before proceeding.
      In order to conserve space, we union the rcu_head with the g_list.
      
      This will be necessary for nfsd which will allocate marks from a
      dedicated slabcache.  We have to be able to ensure that all of the
      objects are destroyed before destroying the cache.  That's fairly
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Reviewed-by: Jan Kara <jack@suse.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c510eff6
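The pattern described in this commit (queue a free callback instead of waking a dedicated thread, wait on a barrier before tearing down the cache, and share storage between the list node and the callback head) can be sketched in userspace C. Everything here is an assumption made for illustration: `call_deferred` and `deferred_barrier` are toy stand-ins that only imitate the roles of call_srcu and srcu_barrier, and `struct mark_sketch` is not the kernel's fsnotify mark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins; these are NOT the kernel's list_head or rcu_head. */
struct list_node { struct list_node *next, *prev; };
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/*
 * Hypothetical mark. Once it is unlinked from its group list, g_list is
 * no longer needed, so the same storage can hold the callback head.
 */
struct mark_sketch {
	int id;
	union {
		struct list_node g_list;
		struct cb_head rcu;
	};
};

static struct cb_head *pending;	/* callbacks waiting for the "grace period" */
static int freed_count;

/* Stand-in for call_srcu(): queue the callback and return immediately. */
static void call_deferred(struct cb_head *head, void (*func)(struct cb_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Stand-in for srcu_barrier(): run every queued callback before returning. */
static void deferred_barrier(void)
{
	while (pending) {
		struct cb_head *head = pending;

		pending = head->next;	/* read next before the callback frees head */
		head->func(head);
	}
}

/* Callback: recover the enclosing mark from its embedded head and free it. */
static void mark_free_cb(struct cb_head *head)
{
	struct mark_sketch *mark = (struct mark_sketch *)
		((char *)head - offsetof(struct mark_sketch, rcu));

	free(mark);
	freed_count++;
}

/* Allocates a zeroed mark; returns NULL on allocation failure. */
static struct mark_sketch *mark_alloc(int id)
{
	struct mark_sketch *mark = calloc(1, sizeof(*mark));

	if (mark)
		mark->id = id;
	return mark;
}
```

In this toy model, a caller that wants to destroy the backing cache would call `deferred_barrier()` first, which mirrors the ordering requirement the commit message states for nfsd.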
  2. 12 Jan, 2016 9 commits
  3. 11 Jan, 2016 7 commits
    • M
      [media] Postpone the addition of MEDIA_IOC_G_TOPOLOGY · be0270ec
      Mauro Carvalho Chehab committed
      A few open questions remain regarding this ioctl:
      
      1) will the names of the new structs contain _v2_?
      2) what's the best alternative to avoid compat32 issues?
      
      Due to that, let's postpone the addition of this new ioctl to
      the next Kernel version, to give people more time to discuss it.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      be0270ec
    • M
      [media] media-entity: add a function to create multiple links · b01cc9ce
      Mauro Carvalho Chehab committed
      Sometimes it is desirable to create 1:n, n:1 or even
      n:n links between different entities with the same
      function.
      
      This is actually needed to support DVB devices that
      have multiple frontends. While we could implement such a
      function internally in the DVB core, it is generic
      enough to live in media-entity, and it could be
      useful in other places.
      
      So, add such a function.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      b01cc9ce
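As a rough illustration of 1:n, n:1 and n:n linking by function, the sketch below links every entity that has one function to every entity that has another. The types and the name `create_links_by_function` are hypothetical and much simpler than the kernel's media-entity code; the sketch makes no claim about the signature of the function this commit adds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the kernel's struct media_entity. */
struct entity_sketch { int id; int function; };
struct link_sketch { int source_id; int sink_id; };

/*
 * Link every entity whose function is source_fn to every entity whose
 * function is sink_fn. Returns the number of links written to out, or
 * -1 if out (capacity cap) is too small to hold them all.
 */
static int create_links_by_function(const struct entity_sketch *ents, size_t n,
				    int source_fn, int sink_fn,
				    struct link_sketch *out, size_t cap)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (ents[i].function != source_fn)
			continue;
		for (size_t j = 0; j < n; j++) {
			/* Never link an entity to itself. */
			if (i == j || ents[j].function != sink_fn)
				continue;
			if (count == cap)
				return -1;
			out[count].source_id = ents[i].id;
			out[count].sink_id = ents[j].id;
			count++;
		}
	}
	return (int)count;
}
```

With two entities of the source function and two of the sink function, the loop yields the four links of an n:n topology; with a single source it degenerates to 1:n.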
    • M
      [media] uapi/media.h: Use u32 for the number of graph objects · 7c9d6731
      Mauro Carvalho Chehab committed
      While we need to keep a u64 alignment to avoid compat32 issues,
      having the number of entities/pads/links/interfaces represented
      by a u64 is inconsistent with the ID number, which is a u32.
      
      In order to make them consistent, change those quantities to u32.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      7c9d6731
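One way to keep u64 alignment while storing counts as u32 is to pair each count with an explicit 32-bit reserved field. The struct below is a hypothetical layout written for illustration, not the actual uapi definition; field names such as `reserved1` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not the actual uapi struct. */
struct topology_sketch {
	uint64_t version;
	uint32_t num_entities;	/* u32, the same width as an object ID */
	uint32_t reserved1;	/* explicit padding: next field stays 8-byte aligned */
	uint64_t ptr_entities;	/* user pointer carried as a fixed-width u64 */
	uint32_t num_links;
	uint32_t reserved2;
	uint64_t ptr_links;
};

/*
 * With explicit padding, the layout does not depend on how the ABI
 * aligns 64-bit members, so 32-bit and 64-bit builds agree.
 */
static_assert(offsetof(struct topology_sketch, ptr_entities) % 8 == 0,
	      "ptr_entities must be 8-byte aligned");
static_assert(offsetof(struct topology_sketch, ptr_links) % 8 == 0,
	      "ptr_links must be 8-byte aligned");
static_assert(sizeof(struct topology_sketch) == 40,
	      "no implicit padding expected");
```

The reserved fields cost four bytes each but remove any need for the compiler to insert padding on its own.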
    • M
      [media] media-entity.h: document the remaining functions · 630c0e80
      Mauro Carvalho Chehab committed
      Two ancillary functions are missing comments.
      
      Although they are used only internally in media-entity.c,
      document them for completeness.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      630c0e80
    • M
      [media] media-device.h: use just one u32 counter for object ID · 05b3b77c
      Mauro Carvalho Chehab committed
      Instead of using one u32 counter per type for object IDs, use
      just one counter. With this change, it makes sense to simplify
      the debug logs too.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      05b3b77c
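A minimal sketch of the single-counter idea, assuming a device structure that owns one shared u32 counter. The names `media_device_sketch`, `gobj_sketch` and `gobj_sketch_init` are hypothetical, and the sketch does not reproduce how the kernel encodes the final ID.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object types, for illustration only. */
enum gobj_type_sketch { GOBJ_ENTITY, GOBJ_PAD, GOBJ_LINK };

/* One counter shared by all object types, instead of one per type. */
struct media_device_sketch {
	uint32_t id;
};

struct gobj_sketch {
	enum gobj_type_sketch type;
	uint32_t id;
};

/*
 * Assign the next ID from the device-wide counter. The type is recorded
 * on the object but plays no part in choosing the counter.
 */
static void gobj_sketch_init(struct media_device_sketch *dev,
			     enum gobj_type_sketch type,
			     struct gobj_sketch *obj)
{
	obj->type = type;
	obj->id = ++dev->id;
}
```

Because every object draws from the same sequence, an ID alone identifies an object in this model, which is also why a debug log could print just the ID rather than a per-type pair.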
    • M
      [media] media-entity.h fix documentation for several parameters · 03e49338
      Mauro Carvalho Chehab committed
      Several parameters added by the media_ent_enum patches
      were documented with the wrong argument names:
      	include/media/media-device.h:333: warning: No description found for parameter 'entity_internal_idx_max'
      	include/media/media-device.h:354: warning: No description found for parameter 'ent_enum'
      	include/media/media-device.h:354: warning: Excess function parameter 'e' description in 'media_entity_enum_init'
      	include/media/media-entity.h:397: warning: No description found for parameter 'ent_enum'
      	include/media/media-entity.h:397: warning: Excess function parameter 'e' description in 'media_entity_enum_zero'
      	include/media/media-entity.h:409: warning: No description found for parameter 'ent_enum'
      	include/media/media-entity.h:409: warning: Excess function parameter 'e' description in 'media_entity_enum_set'
      	include/media/media-entity.h:424: warning: No description found for parameter 'ent_enum'
      	include/media/media-entity.h:424: warning: Excess function parameter 'e' description in 'media_entity_enum_clear'
      	include/media/media-entity.h:441: warning: No description found for parameter 'ent_enum'
      	include/media/media-entity.h:441: warning: Excess function parameter 'e' description in 'media_entity_enum_test'
      	include/media/media-entity.h:458: warning: No description found for parameter 'ent_enum'
      	include/media/media-entity.h:458: warning: Excess function parameter 'e' description in 'media_entity_enum_test_and_set'
      	include/media/media-entity.h:474: warning: No description found for parameter 'ent_enum'
      	include/media/media-entity.h:474: warning: Excess function parameter 'e' description in 'media_entity_enum_empty'
      	include/media/media-entity.h:474: warning: Excess function parameter 'entity' description in 'media_entity_enum_empty'
      	include/media/media-entity.h:489: warning: No description found for parameter 'ent_enum1'
      	include/media/media-entity.h:489: warning: No description found for parameter 'ent_enum2'
      	include/media/media-entity.h:489: warning: Excess function parameter 'e' description in 'media_entity_enum_intersects'
      	include/media/media-entity.h:489: warning: Excess function parameter 'f' description in 'media_entity_enum_intersects'
      
      Fix them.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      03e49338
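The warnings above arise when a kernel-doc `@name` line does not match the parameter name in the function signature. The sketch below shows comments whose `@ent_enum` and `@idx` entries match their signatures. The struct and both functions are hypothetical simplifications written for this illustration, not the kernel's media_entity_enum helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified enumeration: one bit per entity index. */
struct ent_enum_sketch {
	uint64_t bmap;
};

/**
 * ent_enum_sketch_set - Mark a single entity in the enumeration
 * @ent_enum: Entity enumeration
 * @idx: Index of the entity to mark, 0 to 63
 */
static void ent_enum_sketch_set(struct ent_enum_sketch *ent_enum,
				unsigned int idx)
{
	assert(idx < 64);
	ent_enum->bmap |= UINT64_C(1) << idx;
}

/**
 * ent_enum_sketch_test - Test whether an entity is marked
 * @ent_enum: Entity enumeration
 * @idx: Index of the entity to test, 0 to 63
 *
 * Return: true if the entity at @idx is marked, false otherwise.
 */
static bool ent_enum_sketch_test(const struct ent_enum_sketch *ent_enum,
				 unsigned int idx)
{
	assert(idx < 64);
	return ent_enum->bmap & (UINT64_C(1) << idx);
}
```

If the comment said `@e` while the signature said `ent_enum`, the documentation build would report both a missing description for `ent_enum` and an excess description for `e`, which is the pair of warnings listed for each function above.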
    • M
      [media] DocBook: document media_entity_graph_walk_cleanup() · aa360d3d
      Mauro Carvalho Chehab committed
      This function was added recently, but wasn't documented.
      Add documentation for it.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      aa360d3d