1. 16 January 2016, 16 commits
  2. 15 January 2016, 24 commits
    • mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov committed
      When inspecting the vague code inside the prctl(PR_SET_MM_MEM) call
      (which tests the RLIMIT_DATA value to figure out whether we're allowed
      to assign new @start_brk, @brk, @start_data and @end_data in
      mm_struct), it became clear that RLIMIT_DATA, as currently
      implemented, doesn't do anything useful: most user-space libraries use
      the mmap() syscall for dynamic memory allocations.
      
      Linus suggested converting the RLIMIT_DATA rlimit into something
      suitable for anonymous memory accounting.  This patch goes further;
      the changes are bundled together as:
      
       * keep vma counting even if CONFIG_PROC_FS=n; it will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way the code looks cleaner: code/stack/data classification now
      depends only on the vm_flags state (see the sketch after the limits
      below):
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that may include strange beasts such as readonly-private
      or VM_IO areas.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
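
      A minimal sketch of how this classification can look in code, assuming
      a vm_stat_account() helper of the shape described above (illustrative,
      not the verbatim patch):

          /* illustrative sketch of the vm_flags-based classification */
          void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
          {
                  mm->total_vm += npages;                 /* VmSize */

                  if ((flags & (VM_EXEC | VM_WRITE)) == VM_EXEC)
                          mm->exec_vm += npages;          /* code: VmExe + VmLib */
                  else if (flags & (VM_GROWSUP | VM_GROWSDOWN))
                          mm->stack_vm += npages;         /* stack: VmStk */
                  else if ((flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                          mm->data_vm += npages;          /* data: VmData */
                  /* the else-if ordering supplies the "& !stack" condition */
          }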
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: make mm and fs code explicitly non-modular · 3e89e1c5
      Paul Gortmaker committed
      The Kconfig currently controlling compilation of this code is:
      
      config HUGETLBFS
              bool "HugeTLB file system support"
      
      ...meaning that it currently is not being built as a module by anyone.
      
      Let's remove the modular code that is essentially orphaned, so that
      when reading the driver there is no doubt it is builtin-only.
      
      Since module_init translates to device_initcall in the non-modular case,
      the init ordering gets moved to earlier levels when we use the more
      appropriate initcalls here.
      
      Originally I had the fs part and the mm part as separate commits,
      simply because of how I happened to detect these non-modular use
      cases.  But that could introduce regressions if the patch merge
      ordering put the fs part first; 0-day testing reported a splat at
      mount time in exactly that situation.
      
      Investigating with "initcall_debug" showed that the delta was
      init_hugetlbfs_fs being called _before_ hugetlb_init instead of after.  So
      both the fs change and the mm change are here together.
      
      In addition, it worked before due to luck of link order, since they were
      both in the same initcall category.  So we now have the fs part using
      fs_initcall, and the mm part using subsys_initcall, which puts it one
      bucket earlier.  It now passes the basic sanity test that failed in
      earlier 0-day testing.
      
      We delete the MODULE_LICENSE tag and capture that information at the top
      of the file alongside author comments, etc.
      
      We don't replace module.h with init.h since the file already has that.
      Also note that MODULE_ALIAS is a no-op for non-modular code.
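
      A sketch of the resulting initcall annotations, assuming the two init
      functions named above (illustrative, not the verbatim diff):

          /* mm/hugetlb.c: runs in the "subsys" bucket, one level earlier */
          static int __init hugetlb_init(void)
          {
                  /* ... set up hstates, sysctls, ... */
                  return 0;
          }
          subsys_initcall(hugetlb_init);          /* was module_init() */

          /* fs/hugetlbfs/inode.c: runs later, in the "fs" bucket */
          static int __init init_hugetlbfs_fs(void)
          {
                  /* ... register_filesystem(), ... */
                  return 0;
          }
          fs_initcall(init_hugetlbfs_fs);         /* was module_init() */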
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Reported-by: kernel test robot <ying.huang@linux.intel.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd() · 0e41e277
      Oleg Nesterov committed
      clear_soft_dirty_pmd() is only called via
      clear_refs_write(CLEAR_REFS_SOFT_DIRTY), and on that path VM_SOFTDIRTY
      has already been cleared before walk_page_range() runs, so there is no
      need to clear it again.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: meminfo: estimate available memory more conservatively · 84ad5802
      Johannes Weiner committed
      The MemAvailable item in /proc/meminfo is meant to give users a hint
      of how much memory is allocatable without causing swapping, so it
      excludes the zones' low watermarks as unavailable to userspace.
      
      However, for a userspace allocation, kswapd will actually reclaim until
      the free pages hit a combination of the high watermark and the page
      allocator's lowmem protection that keeps a certain amount of DMA and
      DMA32 memory from userspace as well.
      
      Subtract the full amount we know to be unavailable to userspace from the
      number of free pages when calculating MemAvailable.
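
      The change boils down to subtracting the allocator's total reserve
      rather than only the summed low watermarks; roughly (a sketch of the
      fs/proc/meminfo.c call site, not the verbatim diff):

          /* totalreserve_pages sums each zone's high watermark plus its
           * largest lowmem_reserve, i.e. everything the page allocator
           * withholds from userspace allocations */
          available = i.freeram - totalreserve_pages;  /* was: i.freeram - wmark_low */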
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO) · b832861c
      Andrew Morton committed
      bdev_write_page() is used by swapout and by writepage where we cannot
      use __GFP_FS or __GFP_IO.  So it is misleading to mention GFP_KERNEL
      here.
      
      blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no
      bugs were harmed in the making of this patch.
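
      The change itself is essentially a one-liner; roughly (a sketch of the
      bdev_write_page() call site in fs/block_dev.c):

          result = blk_queue_enter(bdev->bd_queue, GFP_NOIO);  /* was: GFP_KERNEL */
          if (result)
                  return result;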
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status · 8cee852e
      Jerome Marchand committed
      There are several shortcomings in the accounting of shared memory
      (SysV shm, shared anonymous mappings, mappings of tmpfs files).  The
      values in /proc/<pid>/status and <...>/statm don't allow one to
      distinguish between shmem memory and a shared mapping to a regular
      file, even though their implications for memory usage are quite
      different: during reclaim, a file mapping can be dropped or written
      back to disk, while shmem needs a place in swap.
      
      Also, to distinguish the memory occupied by anonymous and file
      mappings, one has to read the /proc/pid/statm file, which has a field
      for the file mappings (again, including shmem) and the total memory
      occupied by these mappings (i.e. the equivalent of VmRSS in the
      <...>/status file).  Getting the value for anonymous mappings only is
      thus not exactly user-friendly (the statm file is intended to be
      rather efficiently machine-readable).
      
      To address both of these shortcomings, this patch adds a breakdown of
      VmRSS in /proc/<pid>/status via new fields RssAnon, RssFile and
      RssShmem, making use of the previous preparatory patch.  These fields
      tell the user the memory occupied by private anonymous pages, mapped
      regular files and shmem, respectively.  Other existing fields in /status
      and /statm files are left without change.  The /statm file can be
      extended in the future, if there's a need for that.
      
      Example (part of) /proc/pid/status output including the new Rss* fields:
      
      VmPeak:  2001008 kB
      VmSize:  2001004 kB
      VmLck:         0 kB
      VmPin:         0 kB
      VmHWM:      5108 kB
      VmRSS:      5108 kB
      RssAnon:              92 kB
      RssFile:            1324 kB
      RssShmem:           3692 kB
      VmData:      192 kB
      VmStk:       136 kB
      VmExe:         4 kB
      VmLib:      1784 kB
      VmPTE:      3928 kB
      VmPMD:        20 kB
      VmSwap:        0 kB
      HugetlbPages:          0 kB
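
      The new fields can be consumed like any other /proc/<pid>/status line;
      a minimal (hypothetical) userspace reader:

          #include <stdio.h>
          #include <string.h>

          /* print VmRSS and its new RssAnon/RssFile/RssShmem breakdown
           * for the current process (kernels carrying this patch) */
          int main(void)
          {
                  char line[256];
                  FILE *f = fopen("/proc/self/status", "r");

                  if (!f)
                          return 1;
                  while (fgets(line, sizeof(line), f)) {
                          if (strncmp(line, "VmRSS:", 6) == 0 ||
                              strncmp(line, "Rss", 3) == 0)
                                  fputs(line, stdout);
                  }
                  fclose(f);
                  return 0;
          }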
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, shmem: add internal shmem resident memory accounting · eca56ff9
      Jerome Marchand committed
      Currently, looking at /proc/<pid>/status or statm, there is no way to
      distinguish shmem pages from pages mapped to a regular file (shmem
      pages are mapped to /dev/zero), even though their implications for
      actual memory use are quite different.
      
      The internal accounting currently counts shmem pages together with
      regular files.  As a preparation for extending the userspace
      interfaces, this patch adds an MM_SHMEMPAGES counter to mm_rss_stat to
      account for shmem pages separately from MM_FILEPAGES.  The next patch
      will expose it to userspace; this patch doesn't change the exported
      values yet, as it adds MM_SHMEMPAGES to MM_FILEPAGES at the places
      where MM_FILEPAGES was used before.  The only user-visible change
      after this patch is the OOM killer message, which now separates the
      reported "shmem-rss" from "file-rss".
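
      A sketch of the new counter and the compatibility summing described
      above (the enum layout follows the description; the helper name is
      illustrative, not from the diff):

          enum {
                  MM_FILEPAGES,   /* resident pages of regular files */
                  MM_ANONPAGES,   /* resident anonymous pages */
                  MM_SWAPENTS,    /* anonymous swap entries */
                  MM_SHMEMPAGES,  /* resident shared/shmem pages */
                  NR_MM_COUNTERS
          };

          /* exported values keep folding shmem into "file" for now */
          static unsigned long get_mm_filepages_compat(struct mm_struct *mm)
          {
                  return get_mm_counter(mm, MM_FILEPAGES) +
                         get_mm_counter(mm, MM_SHMEMPAGES);
          }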
      
      [vbabka@suse.cz: forward-porting, tweak changelog]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings · 48131e03
      Vlastimil Babka committed
      Following the previous patch, a further reduction of the
      /proc/pid/smaps cost is possible for private writable shmem mappings
      with unpopulated areas where the page walk invokes the .pte_hole
      function.  We can use a radix tree iterator for each such area instead
      of calling find_get_entry() in a loop.  This comes at the extra
      maintenance cost of introducing another shmem function,
      shmem_partial_swap_usage(), sketched below.
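
      A hedged sketch of what shmem_partial_swap_usage() can look like;
      simplified (no resched or retry handling), not the verbatim
      implementation:

          /* one radix tree walk over [start, end) counting swap entries,
           * instead of a find_get_entry() lookup per page */
          unsigned long shmem_partial_swap_usage(struct address_space *mapping,
                                                 pgoff_t start, pgoff_t end)
          {
                  struct radix_tree_iter iter;
                  void **slot;
                  unsigned long swapped = 0;

                  rcu_read_lock();
                  radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                          if (iter.index >= end)
                                  break;
                          /* swapped-out shmem pages are stored as
                           * exceptional (swap) entries in the tree */
                          if (radix_tree_exceptional_entry(*slot))
                                  swapped++;
                  }
                  rcu_read_unlock();

                  return swapped << PAGE_SHIFT;
          }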
      
      To demonstrate the difference, I have measured this on a process that
      creates a private writable 2GB mapping of a partially swapped-out
      /dev/shm/file (which cannot employ the optimizations from the previous
      patch) and doesn't populate it at all.  I time how long it takes to
      cat /proc/pid/smaps of this process 100 times.
      
      Before this patch:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      After this patch:
      
      real    0m1.176s
      user    0m0.180s
      sys     0m0.684s
      
      The time is similar to the case where a radix tree iterator is employed
      on the whole mapping.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, proc: reduce cost of /proc/pid/smaps for shmem mappings · 6a15a370
      Vlastimil Babka committed
      The previous patch improved swap accounting for shmem mappings, but it
      made /proc/pid/smaps more expensive for them, as we consult the radix
      tree for each pte_none entry, so the overall complexity is
      O(n*log(n)).
      
      We can reduce this significantly for mappings that cannot contain
      COWed pages, because then we can either use the statistics the shmem
      object itself tracks (if the mapping contains the whole object, or the
      swap usage of the whole object is zero), or use the radix tree
      iterator, which is much more efficient than repeated find_get_entry()
      calls.
      
      This patch therefore introduces a function shmem_swap_usage(vma) and
      makes /proc/pid/smaps use it when possible.  Only for writable private
      mappings of shmem objects (i.e. tmpfs files) with the shmem object
      itself (partially) swapped out do we have to resort to the
      find_get_entry() approach.
      
      Hopefully such mappings are relatively uncommon.
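
      A sketch of the decision logic in shmem_swap_usage(vma) as described
      (simplified, assuming the shmem_partial_swap_usage() helper sketched
      earlier in this log; not the verbatim code):

          unsigned long shmem_swap_usage(struct vm_area_struct *vma)
          {
                  struct inode *inode = file_inode(vma->vm_file);
                  struct shmem_inode_info *info = SHMEM_I(inode);
                  unsigned long swapped = READ_ONCE(info->swapped);

                  if (!swapped)
                          return 0;       /* nothing swapped out at all */

                  /* mapping covers the whole object: use its statistics */
                  if (!vma->vm_pgoff &&
                      vma->vm_end - vma->vm_start >= i_size_read(inode))
                          return swapped << PAGE_SHIFT;

                  /* otherwise walk the radix tree over the mapped range */
                  return shmem_partial_swap_usage(inode->i_mapping,
                                  vma->vm_pgoff,
                                  vma->vm_pgoff + vma_pages(vma));
          }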
      
      To demonstrate the difference, I have measured this on a process that
      creates a 2GB mapping and dirties single pages with a stride of 2MB,
      and timed how long it takes to cat /proc/pid/smaps of this process 100
      times.
      
      Private writable mapping of a /dev/shm/file (the most complex case):
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      Shared mapping covering almost all of a partially swapped
      /dev/shm/file (which needs to employ the radix tree iterator):
      
      real    0m1.351s
      user    0m0.096s
      sys     0m0.768s
      
      Same, but with /dev/shm/file not swapped (so no radix tree walk is needed):
      
      real    0m0.935s
      user    0m0.128s
      sys     0m0.344s
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      The cost is now much closer to the private anonymous mapping case, unless
      the shmem mapping is private and writable.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, proc: account for shmem swap in /proc/pid/smaps · c261e7d9
      Vlastimil Babka committed
      Currently, /proc/pid/smaps always shows "Swap: 0 kB" for shmem-backed
      mappings, even if the mapped portion does contain pages that were
      swapped out.  This is because, unlike private anonymous mappings,
      shmem does not change the pte to a swap entry but to pte_none when
      swapping the page out.  In the smaps page walk, such a page thus looks
      like it was never faulted in.
      
      This patch changes smaps_pte_entry() to determine the swap status of
      such pte_none entries for shmem mappings, similarly to how
      mincore_page() does it.  Swapped-out shmem pages are thus accounted
      for.  For private mappings of tmpfs files that COWed some of the
      pages, the swapped-out status of the original shmem pages is naturally
      ignored.  If some of the private copies were also swapped out, they
      are accounted via their page table swap entries, so the resulting
      reported swap usage is the sum of both swapped-out private copies and
      swapped-out shmem pages that were not COWed.  No double accounting can
      thus happen.
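
      A sketch of the lookup added to the smaps walker; the helper name is
      hypothetical and the logic is simplified from the description:

          /* resolve a pte_none entry of a shmem mapping the way
           * mincore_page() does, and account it as swap if the object
           * holds a swap entry at that offset */
          static void smaps_account_shmem_hole(struct vm_area_struct *vma,
                                               unsigned long addr,
                                               struct mem_size_stats *mss)
          {
                  pgoff_t pgoff = linear_page_index(vma, addr);
                  struct page *page;

                  page = find_get_entry(vma->vm_file->f_mapping, pgoff);
                  if (radix_tree_exceptional_entry(page))
                          mss->swap += PAGE_SIZE; /* swapped-out shmem page */
                  else if (page)
                          page_cache_release(page); /* drop find_get_entry() ref */
          }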
      
      The accounting is arguably still not as precise as for private
      anonymous mappings, since we will now also count pages that the
      process in question never accessed, but that another process populated
      and then let become swapped out.  I believe it is still less confusing
      and subtle than not showing any swap usage for shmem mappings at all.
      The swapped-out counter might be of interest to users who would like
      to prevent future swap-ins during a performance-critical operation and
      pre-fault the pages at their convenience.  Especially for larger
      swapped-out regions, the cost of swap-in is much higher than a fresh
      page allocation, so differentiating between pte_none and swapped out
      is important for those use cases.
      
      One downside of this patch is that it makes /proc/pid/smaps more
      expensive for shmem mappings, as we consult the radix tree for each
      pte_none entry, so the overall complexity is O(n*log(n)).  I have
      measured this on a process that creates a 2GB mapping and dirties
      single pages with a stride of 2MB, and timed how long it takes to cat
      /proc/pid/smaps of this process 100 times.
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      Mapping of a /dev/shm/file:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      The difference is rather substantial, so the next patch will reduce the
      cost for shared or read-only mappings.
      
      In a less controlled experiment, I've gathered pids of processes on my
      desktop that have either '/dev/shm/*' or 'SYSV*' in smaps.  This
      included the Chrome browser and some KDE processes.  Again, I've run cat
      /proc/pid/smaps on each 100 times.
      
      Before this patch:
      
      real    0m9.050s
      user    0m0.518s
      sys     0m8.066s
      
      After this patch:
      
      real    0m9.221s
      user    0m0.541s
      sys     0m8.187s
      
      This suggests low impact on average systems.
      
      Note that this patch doesn't attempt to adjust the SwapPss field for
      shmem mappings, which would need extra work to determine who else could
      have the pages mapped.  Thus the value stays zero except for COWed
      swapped out pages in a shmem mapping, which are accounted as usual.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5
      Nathan Zimmer committed
      When running the SPECint_rate gcc benchmark on some very large boxes,
      it was noticed that the system was spending lots of time in
      mpol_shared_policy_lookup().  The gamess benchmark also shows it, and
      is what I mostly used to chase down the issue, since I found its setup
      easier.
      
      To be clear, the binaries were on tmpfs because of disk I/O
      requirements.  We then used text replication to avoid icache misses
      and to keep all the copies from banging on the memory where the
      instruction code resides.  This results in a bottleneck in
      mpol_shared_policy_lookup(), since lookups are serialised by the
      shared_policy lock.
      
      I have only reproduced this on very large (3k+ cores) boxes.  The
      problem starts showing up at just a few hundred ranks getting worse
      until it threatens to livelock once it gets large enough.  For example
      on the gamess benchmark at 128 ranks this area consumes only ~1% of
      time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
      90%.
      
      To alleviate the contention in this area, I converted the spinlock to
      an rwlock.  This allows a large number of lookups to happen
      simultaneously.  The results were quite good, reducing this
      consumption at max ranks to around 2%.
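
      Roughly, the read side of the lookup becomes (a sketch close to the
      mm/mempolicy.c shape, not the verbatim diff):

          struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
                                                      unsigned long idx)
          {
                  struct mempolicy *pol = NULL;
                  struct sp_node *sn;

                  if (!sp->root.rb_node)
                          return NULL;
                  read_lock(&sp->lock);           /* was: spin_lock(&sp->lock) */
                  sn = sp_lookup(sp, idx, idx + 1);
                  if (sn) {
                          mpol_get(sn->policy);   /* take a ref under the lock */
                          pol = sn->policy;
                  }
                  read_unlock(&sp->lock);
                  return pol;
          }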
      
      [akpm@linux-foundation.org: tidy up code comments]
      Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov committed
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems override the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to the "account everything" approach
      and keep most workloads within bounds.  Malevolent users will be able
      to breach the limit, but this was possible even with the former
      "account everything" approach (simply because it did not, in fact,
      account everything).
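
      For reference, the two opt-in forms look roughly like this
      (representative examples, not verbatim hunks from the patch):

          /* cache-level opt-in: every object from this cache is charged
           * to the allocating task's memcg */
          files_cachep = kmem_cache_create("files_cache",
                          sizeof(struct files_struct), 0,
                          SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);

          /* allocation-level opt-in for an individual kmalloc() */
          fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL | __GFP_ACCOUNT);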
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "kernfs: do not account ino_ida allocations to memcg" · b2a209ff
      Vladimir Davydov committed
      Currently, all kmem allocations (namely every kmem_cache_alloc, kmalloc,
      alloc_kmem_pages call) are accounted to memory cgroup automatically.
      Callers have to explicitly opt out if they don't want/need accounting
      for some reason.  Such a design decision leads to several problems:
      
       - kmalloc users are highly sensitive to failures, many of them
         implicitly rely on the fact that kmalloc never fails, while memcg
         makes failures quite plausible.
      
       - A lot of objects are shared among different containers by design.
         Accounting such objects to one of containers is just unfair.
         Moreover, it might lead to pinning a dead memcg along with its kmem
         caches, which aren't tiny, which might result in noticeable increase
         in memory consumption for no apparent reason in the long run.
      
       - There are tons of short-lived objects. Accounting them to memcg will
         only result in slight noise and won't change the overall picture, but
         we still have to pay accounting overhead.
      
      For more info, see
      
       - http://lkml.kernel.org/r/20151105144002.GB15111%40dhcp22.suse.cz
       - http://lkml.kernel.org/r/20151106090555.GK29259@esperanza
      
      Therefore this patch set switches to a white-list policy.  kmalloc
      users now have to explicitly opt in by passing the __GFP_ACCOUNT flag.
      
      Currently, the list of accounted objects is quite limited and only
      includes those allocations that (1) are known to be easily triggered
      from userspace and (2) can fail gracefully (for the full list see patch
      no.  6) and it still misses many object types.  However, accounting only
      those objects should be a satisfactory approximation of the behavior we
      used to have for most sane workloads.
      
      This patch (of 6):
      
      Revert 499611ed ("kernfs: do not account ino_ida allocations
      to memcg").
      
      The black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out
      to be fragile and difficult to maintain, because there seem to be many
      more allocations that should not be accounted than those that should
      be.  Besides, falsely accounting an allocation might have much worse
      consequences than not accounting it at all, namely increased memory
      consumption due to pinned dead kmem caches.
      
      So it was decided to switch to the white-list policy.  This patch reverts
      bits introducing the black-list policy.  The white-list policy will be
      introduced later in the series.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: clean up redundant lksb flags in dlmcommon.h · 3c973b0e
      Joseph Qi committed
      lksb flags are defined in both dlmapi.h and dlmcommon.h, so remove the
      duplicate definitions from dlmcommon.h.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: dlm: remove redundant code · 98e141f2
      Junxiao Bi committed
      Found this during patch review; remove the redundant code to make
      things clearer and save a little CPU time.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del · 074a6c65
      Joseph Qi committed
      In ocfs2_orphan_del(), currently the entry is found and deleted first,
      and the orphan dir dinode is accessed only afterwards.  This is a
      problem if ocfs2_journal_access_di() fails: the entry has been removed
      from the orphan dir, but the inode hasn't actually been deleted.  In
      other words, the file goes missing without ever being deleted.  So
      access the orphan dinode first, as unlink and rename do; a sketch
      follows.
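
      A sketch of the reordered sequence (variable names follow the usual
      ocfs2 style; illustrative, not the verbatim diff):

          /* get journal access to the orphan dir dinode *before* deleting
           * the entry, so a failure leaves both entry and inode intact */
          status = ocfs2_journal_access_di(handle,
                          INODE_CACHE(orphan_dir_inode), orphan_dir_bh,
                          OCFS2_JOURNAL_ACCESS_WRITE);
          if (status < 0) {
                  mlog_errno(status);
                  goto leave;     /* entry still present: nothing lost */
          }

          /* now it is safe to remove the entry from the orphan dir */
          status = ocfs2_delete_entry(handle, orphan_dir_inode, &lookup);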
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: do not insert a new mle when another process is already migrating · 32e49326
      xuejiufei committed
      When two processes are migrating the same lockres,
      dlm_add_migration_mle() returns -EEXIST but nevertheless inserts a new
      mle into the hash list.  dlm_migrate_lockres() will then detach the
      old mle and free the new one, which is still in the hash list; that
      corrupts the list.
      Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: ignore cleaning the migration mle that is inuse · bef5502d
      xuejiufei committed
      We have found that the migration source triggers a BUG because the
      refcount of the mle is already zero before the put, when the target
      goes down during migration.  The situation is as follows:
      
      dlm_migrate_lockres
        dlm_add_migration_mle
        dlm_mark_lockres_migrating
        dlm_get_mle_inuse
        <<<<<< Now the refcount of the mle is 2.
        dlm_send_one_lockres and wait for the target to become the
        new master.
        <<<<<< o2hb detect the target down and clean the migration
        mle. Now the refcount is 1.
      
      dlm_migrate_lockres() is then woken and puts the mle twice when it
      finds that the target went down, which triggers the BUG with the
      following message:
      
        "ERROR: bad mle: ".
      Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: do not lock/unlock() inode DLM lock · 1cce4df0
      Goldwyn Rodrigues committed
      The DLM does not cache locks, so blocking lock and unlock calls will
      only make performance worse where contention on the locks is high.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix slot overwritten if storage link down during mount · 1247017f
      jiangyiwen committed
      The following case will lead to slot overwritten.
      
      N1                               N2
      mount ocfs2 volume, find and
      allocate slot 0, then set
      osb->slot_num to 0, begin to
      write slot info to disk
                                       mount ocfs2 volume, wait for super lock
      write block fail because of
      storage link down, unlock
      super lock
                                       got super lock and also allocate slot 0
                                       then unlock super lock
      
      mount fail and then dismount,
      since osb->slot_num is 0, try to
      put invalid slot to disk. And it
      will succeed if storage link
      restores.
                                       N2 slot info is now overwritten
      
      Once another node, say N3, mounts, it will find and allocate slot 0
      again, which leads to a mount hang because the journal has already
      been locked by N2.  So when writing the slot info fails, invalidate
      the slot in advance to avoid overwriting it.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: return appropriate value when dlm_grab() returns NULL · c372f219
      Xue jiufei committed
      dlm_grab() may return NULL when the node is unmounting.  During code
      review, we found that some dlm handlers return an error to the caller
      when dlm_grab() returns NULL, which can make the caller BUG or cause
      other problems.  Here is an example:
      
      Node 1                                 Node 2
      receives migration message
      from node 3, and send
      migrate request to others
                                           start unmounting
      
                                           receives migrate request
                                           from node 1 and call
                                           dlm_migrate_request_handler()
      
                                           unmount thread unregisters
                                           domain handlers and removes
                                           dlm_context from dlm_domains
      
                                           dlm_migrate_request_handlers()
                                           returns -EINVAL to node 1
      Node 1 exits migration without clearing the migration state or sending
      an assert master message to node 3, which causes node 3 to hang.
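
      A sketch of the fix pattern in the affected handlers (the handler
      signature follows the o2net convention; the body is illustrative):

          int dlm_migrate_request_handler(struct o2net_msg *msg, u32 len,
                                          void *data, void **ret_data)
          {
                  struct dlm_ctxt *dlm = data;
                  int ret = 0;

                  /* domain is being unmounted: ignore the message rather
                   * than report an error the sender treats as fatal */
                  if (!dlm_grab(dlm))
                          return 0;       /* was: return -EINVAL */

                  /* ... normal migrate-request handling ... */

                  dlm_put(dlm);
                  return ret;
          }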
      Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean up redundant NULL check before iput · 72865d92
      Joseph Qi committed
      Since iput() takes care of the NULL check itself, a NULL check before
      calling it is redundant.  So clean them up.
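
      The cleanup is the classic pattern (a representative hunk, not the
      verbatim diff):

          -       if (inode)
          -               iput(inode);
          +       iput(inode);            /* iput() ignores NULL on its own */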
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker · b5560143
      jiangyiwen committed
      Commit f3f85464 ("ocfs2_dlm: Ensure correct ordering of set/clear
      refmap bit on lockres") still left a race, so the ordering cannot be
      guaranteed to be exactly correct.
      
      Node1               Node2                    Node3
      umount, migrate
      lockres to Node2
                          migrate finished,
                          send migrate request
                          to Node3
                                                    received migrate request,
                                                    create a migration_mle,
                                                    respond to Node2.
                          set DLM_LOCK_RES_SETREF_INPROG
                          and send assert master to
                          Node3
                                                    delete migration_mle in
                                                    assert_master_handler,
                                                    Node3 umount without response
                                                    dlm_thread purge
                                                    this lockres, send drop
                                                    deref message to Node2
                          found the flag of
                          DLM_LOCK_RES_SETREF_INPROG
                          is set, dispatch
                          dlm_deref_lockres_worker to
                          clear refmap, but in function of
                          dlm_deref_lockres_worker,
                          only if node in refmap it wait
                          DLM_LOCK_RES_SETREF_INPROG
                          to be cleared. So worker is
                          done successfully
      
                                                    purge lockres, send
                                                    assert master response
                                                    to Node1, and finish umount
                          set Node3 in refmap, and it
                          won't be cleared forever, thus
                          lead to umount hung
      
      So, in dlm_deref_lockres_worker(), wait unconditionally until
      DLM_LOCK_RES_SETREF_INPROG is cleared before touching the refmap.
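
      A sketch of the adjusted worker (simplified; the locking shape follows
      the dlm code, not the verbatim diff):

          /* inside dlm_deref_lockres_worker() */
          spin_lock(&res->spinlock);
          /* wait unconditionally, not only when the node is already set
           * in the refmap */
          __dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_SETREF_INPROG);
          if (test_bit(node, res->refmap)) {
                  dlm_lockres_clear_refmap_bit(dlm, res, node);
                  cleared = 1;
          }
          spin_unlock(&res->spinlock);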
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: constify ocfs2_extent_tree_operations structures · 9e62dc09
      Julia Lawall committed
      The ocfs2_extent_tree_operations structures are never modified, so
      declare them as const.
      
      Done with the help of Coccinelle.
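
      The Coccinelle-driven change is mechanical; a representative hunk
      (illustrative):

          -static struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
          +static const struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
                  .eo_set_last_eb_blk     = ocfs2_dinode_set_last_eb_blk,
                  .eo_get_last_eb_blk     = ocfs2_dinode_get_last_eb_blk,
                  ...
           };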
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>