1. 21 1月, 2016 12 次提交
  2. 17 1月, 2016 1 次提交
  3. 16 1月, 2016 18 次提交
  4. 15 1月, 2016 9 次提交
    • K
      mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov 提交于
      When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
      testing the RLIMIT_DATA value to figure out if we're allowed to assign
      new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
      commited that RLIMIT_DATA in a form it's implemented now doesn't do
      anything useful because most of user-space libraries use mmap() syscall
      for dynamic memory allocations.
      
      Linus suggested to convert RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be strange beast like readonly-private or VM_IO
      area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84638335
    • P
      hugetlb: make mm and fs code explicitly non-modular · 3e89e1c5
      Paul Gortmaker 提交于
      The Kconfig currently controlling compilation of this code is:
      
      config HUGETLBFS
              bool "HugeTLB file system support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Lets remove the modular code that is essentially orphaned, so that when
      reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular case,
      the init ordering gets moved to earlier levels when we use the more
      appropriate initcalls here.
      
      Originally I had the fs part and the mm part as separate commits, just
      by happenstance of the nature of how I detected these non-modular use
      cases.  But that can possibly introduce regressions if the patch merge
      ordering puts the fs part 1st -- as the 0-day testing reported a splat
      at mount time.
      
      Investigating with "initcall_debug" showed that the delta was
      init_hugetlbfs_fs being called _before_ hugetlb_init instead of after.  So
      both the fs change and the mm change are here together.
      
      In addition, it worked before due to luck of link order, since they were
      both in the same initcall category.  So we now have the fs part using
      fs_initcall, and the mm part using subsys_initcall, which puts it one
      bucket earlier.  It now passes the basic sanity test that failed in
      earlier 0-day testing.
      
      We delete the MODULE_LICENSE tag and capture that information at the top
      of the file alongside author comments, etc.
      
      We don't replace module.h with init.h since the file already has that.
      Also note that MODULE_ALIAS is a no-op for non-modular code.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Reported-by: Nkernel test robot <ying.huang@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NDavidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e89e1c5
    • O
      mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd() · 0e41e277
      Oleg Nesterov 提交于
      clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY),
      VM_SOFTDIRTY was already cleared before walk_page_range().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e41e277
    • J
      proc: meminfo: estimate available memory more conservatively · 84ad5802
      Johannes Weiner 提交于
      The MemAvailable item in /proc/meminfo is to give users a hint of how
      much memory is allocatable without causing swapping, so it excludes the
      zones' low watermarks as unavailable to userspace.
      
      However, for a userspace allocation, kswapd will actually reclaim until
      the free pages hit a combination of the high watermark and the page
      allocator's lowmem protection that keeps a certain amount of DMA and
      DMA32 memory from userspace as well.
      
      Subtract the full amount we know to be unavailable to userspace from the
      number of free pages when calculating MemAvailable.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84ad5802
    • A
      fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO) · b832861c
      Andrew Morton 提交于
      bdev_write_page() is used by swapout and by writepage where we cannot
      use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
      here.
      
      blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
      bugs were harmed in the making of this patch.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b832861c
    • J
      mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status · 8cee852e
      Jerome Marchand 提交于
      There are several shortcomings with the accounting of shared memory
      (SysV shm, shared anonymous mapping, mapping of a tmpfs file).  The
      values in /proc/<pid>/status and <...>/statm don't allow to distinguish
      between shmem memory and a shared mapping to a regular file, even though
      theirs implication on memory usage are quite different: during reclaim,
      file mapping can be dropped or written back on disk, while shmem needs a
      place in swap.
      
      Also, to distinguish the memory occupied by anonymous and file mappings,
      one has to read the /proc/pid/statm file, which has a field for the file
      mappings (again, including shmem) and total memory occupied by these
      mappings (i.e.  equivalent to VmRSS in the <...>/status file.  Getting
      the value for anonymous mappings only is thus not exactly user-friendly
      (the statm file is intended to be rather efficiently machine-readable).
      
      To address both of these shortcomings, this patch adds a breakdown of
      VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
      RssShmem, making use of the previous preparatory patch.  These fields
      tell the user the memory occupied by private anonymous pages, mapped
      regular files and shmem, respectively.  Other existing fields in /status
      and /statm files are left without change.  The /statm file can be
      extended in the future, if there's a need for that.
      
      Example (part of) /proc/pid/status output including the new Rss* fields:
      
      VmPeak:  2001008 kB
      VmSize:  2001004 kB
      VmLck:         0 kB
      VmPin:         0 kB
      VmHWM:      5108 kB
      VmRSS:      5108 kB
      RssAnon:              92 kB
      RssFile:            1324 kB
      RssShmem:           3692 kB
      VmData:      192 kB
      VmStk:       136 kB
      VmExe:         4 kB
      VmLib:      1784 kB
      VmPTE:      3928 kB
      VmPMD:        20 kB
      VmSwap:        0 kB
      HugetlbPages:          0 kB
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8cee852e
    • J
      mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand 提交于
      Currently looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem pages
      are mapped to /dev/zero), even though their implication in actual memory
      use is quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation to extend the userspace interfaces,
      this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
      shmem pages separately from MM_FILEPAGES.  The next patch will expose it
      to userspace - this patch doesn't change the exported values yet, by
      adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
      used before.  The only user-visible change after this patch is the OOM
      killer message that separates the reported "shmem-rss" from "file-rss".
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eca56ff9
    • V
      mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings · 48131e03
      Vlastimil Babka 提交于
      Following the previous patch, further reduction of /proc/pid/smaps cost
      is possible for private writable shmem mappings with unpopulated areas
      where the page walk invokes the .pte_hole function.  We can use radix
      tree iterator for each such area instead of calling find_get_entry() in
      a loop.  This is possible at the extra maintenance cost of introducing
      another shmem function shmem_partial_swap_usage().
      
      To demonstrate the diference, I have measured this on a process that
      creates a private writable 2GB mapping of a partially swapped out
      /dev/shm/file (which cannot employ the optimizations from the prvious
      patch) and doesn't populate it at all.  I time how long does it take to
      cat /proc/pid/smaps of this process 100 times.
      
      Before this patch:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      After this patch:
      
      real    0m1.176s
      user    0m0.180s
      sys     0m0.684s
      
      The time is similar to the case where a radix tree iterator is employed
      on the whole mapping.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48131e03
    • V
      mm, proc: reduce cost of /proc/pid/smaps for shmem mappings · 6a15a370
      Vlastimil Babka 提交于
      The previous patch has improved swap accounting for shmem mapping, which
      however made /proc/pid/smaps more expensive for shmem mappings, as we
      consult the radix tree for each pte_none entry, so the overal complexity
      is O(n*log(n)).
      
      We can reduce this significantly for mappings that cannot contain COWed
      pages, because then we can either use the statistics tha shmem object
      itself tracks (if the mapping contains the whole object, or the swap
      usage of the whole object is zero), or use the radix tree iterator,
      which is much more effective than repeated find_get_entry() calls.
      
      This patch therefore introduces a function shmem_swap_usage(vma) and
      makes /proc/pid/smaps use it when possible.  Only for writable private
      mappings of shmem objects (i.e.  tmpfs files) with the shmem object
      itself (partially) swapped outwe have to resort to the find_get_entry()
      approach.
      
      Hopefully such mappings are relatively uncommon.
      
      To demonstrate the diference, I have measured this on a process that
      creates a 2GB mapping and dirties single pages with a stride of 2MB, and
      time how long does it take to cat /proc/pid/smaps of this process 100
      times.
      
      Private writable mapping of a /dev/shm/file (the most complex case):
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
      (which needs to employ the radix tree iterator).
      
      real    0m1.351s
      user    0m0.096s
      sys     0m0.768s
      
      Same, but with /dev/shm/file not swapped (so no radix tree walk needed)
      
      real    0m0.935s
      user    0m0.128s
      sys     0m0.344s
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      The cost is now much closer to the private anonymous mapping case, unless
      the shmem mapping is private and writable.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a15a370