1. 06 Nov 2015 (7 commits)
  2. 02 Nov 2015 (1 commit)
    • mm: get rid of 'vmalloc_info' from /proc/meminfo · a5ad88ce
      Committed by Linus Torvalds
      It turns out that at least some versions of glibc end up reading
      /proc/meminfo at every single startup, because glibc wants to know the
      amount of memory the machine has.  And while that's arguably insane,
      it's just how things are.
      
      And it turns out that it's not all that expensive most of the time, but
      the vmalloc information statistics (amount of virtual memory used in the
      vmalloc space, and the biggest remaining chunk) can be rather expensive
      to compute.
      
      The 'get_vmalloc_info()' function actually showed up on my profiles as
      4% of the CPU usage of "make test" in the git source repository, because
      the git tests are lots of very short-lived shell-scripts etc.
      
      It turns out that apparently this same silly vmalloc info gathering
      shows up on the facebook servers too, according to Dave Jones.  So it's
      not just "make test" for git.
      
      We had two patches to just cache the information (one by me, one by
      Ingo) to mitigate this issue, but the whole vmalloc information is of
      rather dubious value to begin with, and people who *actually* want to
      know what the situation is wrt the vmalloc area should just look at the
      much more complete /proc/vmallocinfo instead.
      
      In fact, according to my testing - and perhaps more importantly,
      according to that big search engine in the sky: Google - there is
      nothing out there that actually cares about those two expensive fields:
      VmallocUsed and VmallocChunk.
      
      So let's try to just remove them entirely.  Actually, this just removes
      the computation and reports the numbers as zero for now, just to try to
      be minimally intrusive.
      
      If this breaks anything, we'll obviously have to re-introduce the code
      to compute this all and add the caching patches on top.  But if given
      the option, I'd really prefer to just remove this bad idea entirely
      rather than add even more code to work around our historical mistake
      that likely nobody really cares about.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5ad88ce
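
      A minimal sketch of what the /proc/meminfo side looks like after this
      change (the format string is illustrative, not the exact
      meminfo_proc_show() hunk):

        /* Keep the fields for compatibility, but stop calling
         * get_vmalloc_info() and just report zero. */
        seq_printf(m, "VmallocUsed:    %8lu kB\n", 0UL);
        seq_printf(m, "VmallocChunk:   %8lu kB\n", 0UL);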
  3. 23 Oct 2015 (4 commits)
  4. 21 Oct 2015 (1 commit)
  5. 17 Oct 2015 (6 commits)
    • mm,thp: introduce flush_pmd_tlb_range · 12ebc158
      Committed by Vineet Gupta
      ARCHes with special requirements for evicting THP backing TLB entries
      can implement this.
      
      Even otherwise, it can help optimize TLB flushes in the THP regime:
      the stock flush_tlb_range() typically has an optimization to nuke the
      entire TLB if the flush span exceeds a certain threshold, which will
      likely be true for a single huge page.  Thus a single THP flush would
      invalidate the entire TLB, which is not desirable.
      
      e.g. see arch/arc: flush_pmd_tlb_range
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20151009100816.GC7873@node
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      12ebc158
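
      A minimal sketch of the generic fallback, assuming the usual
      __HAVE_ARCH_* override convention; architectures like arc provide
      their own definition instead:

        #ifndef __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
        #define flush_pmd_tlb_range(vma, addr, end)  flush_tlb_range(vma, addr, end)
        #endif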
    • mm,thp: reduce ifdef'ery for THP in generic code · bd5e88ad
      Committed by Vineet Gupta
      - pgtable-generic.c: Fold individual #ifdef for each helper into a top
        level #ifdef. Makes code more readable
      
      - Converted the stub helpers for !THP to BUILD_BUG() vs. runtime BUG()
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/20151009133450.GA8597@node
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      bd5e88ad
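
      A rough sketch of the pattern (pmdp_collapse_flush() is just one
      example helper; the real list lives in pgtable-generic.c and the
      generic pgtable header):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
                                         unsigned long address, pmd_t *pmdp);
        #else
        static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
                                                unsigned long address,
                                                pmd_t *pmdp)
        {
                BUILD_BUG();    /* fail at build time, not at runtime */
                return *pmdp;   /* unreachable */
        }
        #endif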
    • mm: group pte related helpers together · 52585bcc
      Committed by Vineet Gupta
      This reduces/simplifies the diff for the next patch which moves THP
      specific code.
      
      No semantic changes!
      
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: http://lkml.kernel.org/r/1442918096-17454-9-git-send-email-vgupta@synopsys.com
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      52585bcc
    • mm, dax: fix DAX deadlocks · 0f90cc66
      Committed by Ross Zwisler
      The following two locking commits in the DAX code:
      
      commit 84317297 ("dax: fix race between simultaneous faults")
      commit 46c043ed ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
      
      introduced a number of deadlocks and other issues which need to be fixed
      for the v4.3 kernel.  The list of issues in DAX after these commits
      (some newly introduced by the commits, some preexisting) can be found
      here:
      
        https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").
      
      This undoes most of the changes introduced by those two commits,
      essentially returning us to the DAX locking scheme that was used in
      v4.2.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f90cc66
    • memcg: convert threshold to bytes · 424cdc14
      Committed by Shaohua Li
      page_counter_memparse() returns pages for the threshold, while
      mem_cgroup_usage() returns bytes for memory usage.  Convert the
      threshold to bytes.
      
      Fixes: 3e32cb2e ("memcg: rename cgroup_event to mem_cgroup_event")
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      424cdc14
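
      A minimal sketch of the conversion, assuming 'threshold' holds the
      page count returned by page_counter_memparse():

        /* usage is compared in bytes, the parsed threshold is in pages */
        threshold <<= PAGE_SHIFT;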
    • mm, fs: obey gfp_mapping for add_to_page_cache() · 063d99b4
      Committed by Michal Hocko
      Commit 6afdb859 ("mm: do not ignore mapping_gfp_mask in page cache
      allocation paths") has caught some users of hardcoded GFP_KERNEL used in
      the page cache allocation paths.  This, however, wasn't complete and
      there were others which went unnoticed.
      
      Dave Chinner has reported the following deadlock for xfs on a loop device:
      : With the recent merge of the loop device changes, I'm now seeing
      : an XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
      :
      : The deadlock is as follows:
      :
      : kloopd1: loop_queue_read_work
      :       xfs_file_iter_read
      :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
      :       page cache read (GFP_KERNEL)
      :       radix tree alloc
      :       memory reclaim
      :       reclaim XFS inodes
      :       log force to unpin inodes
      :       <wait for log IO completion>
      :
      : xfs-cil/loop1: <does log force IO work>
      :       xlog_cil_push
      :       xlog_write
      :       <loop issuing log writes>
      :               xlog_state_get_iclog_space()
      :               <blocks due to all log buffers under write io>
      :               <waits for IO completion>
      :
      : kloopd1: loop_queue_write_work
      :       xfs_file_write_iter
      :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
      :       <wait for inode to be unlocked>
      :
      : i.e. the kloopd, with its split read and write work queues, has
      : introduced a dependency through memory reclaim, i.e. that writes
      : need to be able to progress for reads to make progress.
      :
      : The problem, fundamentally, is that mpage_readpages() does a
      : GFP_KERNEL allocation, rather than paying attention to the inode's
      : mapping gfp mask, which is set to GFP_NOFS.
      :
      : This didn't use to happen, because the loop device used to issue
      : reads through the splice path and that does:
      :
      :       error = add_to_page_cache_lru(page, mapping, index,
      :                       GFP_KERNEL & mapping_gfp_mask(mapping));
      
      This changed with commit aa4d8616 ("block: loop: switch to VFS
      ITER_BVEC").
      
      This patch changes mpage_readpage{s} to follow the gfp mask set for the
      mapping.  There are, however, other places which are doing basically
      the same.
      
      lustre:ll_dir_filler is doing GFP_KERNEL from a function which
      apparently uses GFP_NOFS for its other allocations, so let's make this
      consistent.
      
      cifs:readpages_get_pages is called from cifs_readpages, and
      __cifs_readpages_from_fscache, called from the same path, already
      obeys the mapping gfp.
      
      ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well, even
      though it uses mapping_gfp_mask for the page allocation.
      
      ext4_mpage_readpages is called from the page cache allocation path,
      the same as read_pages and read_cache_pages.
      
      As I noted in my previous post, I cannot say I would be happy about
      sprinkling mapping_gfp_mask all over the place.  It sounds like we
      should drop the gfp_mask argument altogether and use it internally in
      __add_to_page_cache_locked; that would require all the filesystems to
      use the mapping gfp consistently, which I am not sure is the case
      here.  From a quick glance it seems that some filesystems use it all
      the time while others are selective.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      063d99b4
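
      A rough sketch of the pattern (not the exact hunks in mpage.c or the
      other filesystems): derive the gfp mask from the mapping instead of
      hardcoding GFP_KERNEL, so that GFP_NOFS mappings never recurse into
      filesystem reclaim:

        gfp_t gfp = GFP_KERNEL & mapping_gfp_mask(mapping);
        struct page *page = __page_cache_alloc(gfp);

        if (page)
                error = add_to_page_cache_lru(page, mapping, index, gfp);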
  6. 16 Oct 2015 (1 commit)
    • vmstat: explicitly schedule per-cpu work on the CPU we need it to run on · 176bed1d
      Committed by Linus Torvalds
      The vmstat code uses "schedule_delayed_work_on()" to do the initial
      startup of the delayed work on the right CPU, but then once it was
      started it would use the non-cpu-specific "schedule_delayed_work()" to
      re-schedule it on that CPU.
      
      That just happened to schedule it on the same CPU historically (well, in
      almost all situations), but the code _requires_ this work to be per-cpu,
      and should say so explicitly rather than depend on the non-cpu-specific
      scheduling to schedule on the current CPU.
      
      The timer code is being changed to not be as single-minded in always
      running things on the calling CPU.
      
      See also commit 874bbfe6 ("workqueue: make sure delayed work run in
      local cpu") that for now maintains the local CPU guarantees just in case
      there are other broken users that depended on the accidental behavior.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      176bed1d
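
      A sketch of the resulting reschedule in vmstat_update() (the exact
      arguments are an assumption):

        schedule_delayed_work_on(smp_processor_id(),
                                 this_cpu_ptr(&vmstat_work),
                                 round_jiffies_relative(sysctl_stat_interval));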
  7. 15 Oct 2015 (1 commit)
    • block: don't release bdi while request_queue has live references · b02176f3
      Committed by Tejun Heo
      bdi's are initialized in two steps, bdi_init() and bdi_register(), but
      destroyed in a single step by bdi_destroy() which, for a bdi embedded
      in a request_queue, is called during blk_cleanup_queue() which makes
      the queue invisible and starts the draining of remaining usages.
      
      A request_queue's user can access the congestion state of the embedded
      bdi as long as it holds a reference to the queue.  As such, it may
      access the congested state of a queue which finished
      blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
      Because the congested state was embedded in backing_dev_info which in
      turn is embedded in request_queue, accessing the congested state after
      bdi_destroy() was called was fine.  The bdi was destroyed but the
      memory region for the congested state remained accessible till the
      queue got released.
      
      a13f35e8 ("writeback: don't embed root bdi_writeback_congested in
      bdi_writeback") changed the situation.  Now, the root congested state
      which is expected to be pinned while request_queue remains accessible
      is separately reference counted and the base ref is put during
      bdi_destroy().  This means that the root congested state may go away
      prematurely while the queue is between bdi_destroy() and
      blk_cleanup_queue(), which was detected by Andrey's KASAN tests.
      
      The root cause of this problem is that bdi doesn't distinguish the two
      steps of destruction, unregistration and release, and now the root
      congested state actually requires a separate release step.  To fix the
      issue, this patch separates out bdi_unregister() and bdi_exit() from
      bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
      and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
      simple wrapper calling the two steps back-to-back.
      
      While at it, the prototype of bdi_destroy() is moved right below
      bdi_setup_and_register() so that the counterpart operations are
      located together.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
      Cc: stable@vger.kernel.org # v4.2+
      Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com>
      Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
      Reviewed-by: Jan Kara <jack@suse.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b02176f3
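
      A minimal sketch of the resulting split, per the description above:

        void bdi_destroy(struct backing_dev_info *bdi)
        {
                bdi_unregister(bdi);    /* also called from blk_cleanup_queue() */
                bdi_exit(bdi);          /* also called from blk_release_queue() */
        }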
  8. 13 Oct 2015 (4 commits)
    • writeback: fix incorrect calculation of available memory for memcg domains · c5edf9cd
      Committed by Tejun Heo
      For memcg domains, the amount of available memory was calculated as
      
       min(the amount currently in use + headroom according to memcg,
           total clean memory)
      
      This isn't quite correct as what should be capped by the amount of
      clean memory is the headroom, not the sum of memory in use and
      headroom.  For example, if a memcg domain has a significant amount of
      dirty memory, the above can lead to a value which is lower than the
      current amount in use which doesn't make much sense.  In most
      circumstances, the above leads to a number which is somewhat but not
      drastically lower.
      
      As the amount of memory which can be readily allocated to the memcg
      domain is capped by the amount of system-wide clean memory which is
      not already assigned to the memcg itself, the number we want is
      
       the amount currently in use +
       min(headroom according to memcg, clean memory elsewhere in the system)
      
      This patch updates mem_cgroup_wb_stats() to return the number of
      filepages and headroom instead of the calculated available pages.
      mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
      above calculation from file, headroom, dirty and globally clean pages.
      
      v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
          to build failure when !CGROUP_WRITEBACK.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c5edf9cd
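
      A minimal sketch of the corrected calculation (variable names are
      illustrative, not the exact mdtc_calc_avail() code):

        /* clean memory elsewhere = globally clean pages minus this
         * memcg's own clean file pages */
        unsigned long own_clean   = filepages - min(filepages, memcg_dirty);
        unsigned long other_clean = global_clean - min(global_clean, own_clean);

        /* cap only the headroom, not in-use + headroom */
        avail = filepages + min(headroom, other_clean);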
    • writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions · d60d1bdd
      Committed by Tejun Heo
      MDTC_INIT() is used to initialize dirty_throttle_control for memcg
      domains.  It used DTC_INIT_COMMON() to initialize mdtc->wb and
      ->wb_completions, which is incorrect as DTC_INIT_COMMON() sets the
      latter to wb->completions instead of wb->memcg_completions.  This can
      lead to wildly incorrect results when calculating the proportion of
      dirty memory the memcg domain should get.
      
      Remove DTC_INIT_COMMON() and update MDTC_INIT() to initialize
      mdtc->wb_completions to wb->memcg_completions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d60d1bdd
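
      A sketch of the corrected initializer (the exact layout is an
      assumption):

        #define MDTC_INIT(__wb, __gdtc) { .wb = (__wb),                 \
                                          .gdtc = (__gdtc),             \
                                          .wb_completions = &(__wb)->memcg_completions }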
    • writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Committed by Tejun Heo
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds an RCU protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b817525a
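
      A sketch of the replacement iteration pattern (the list member name
      'bdi_node' is an assumption):

        struct bdi_writeback *wb;

        rcu_read_lock();
        list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
                if (!wb_has_dirty_io(wb))
                        continue;
                /* queue writeback work for this wb, dying or not */
        }
        rcu_read_unlock();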
    • writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration · 9ad18ab9
      Committed by Tejun Heo
      laptop_mode_timer_fn() was using bdi_for_each_wb() without the
      required RCU locking leading to the following warning.
      
       WARNING: CPU: 0 PID: 0 at include/linux/backing-dev.h:415 laptop_mode_timer_fn+0x106/0x170()
       ...
       Call Trace:
        <IRQ>  [<ffffffff81480cdc>] dump_stack+0x4e/0x82
        [<ffffffff81051912>] warn_slowpath_common+0x82/0xc0
        [<ffffffff81051a0a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8115f0e6>] laptop_mode_timer_fn+0x106/0x170
        [<ffffffff810ca8e3>] call_timer_fn+0xb3/0x2f0
        [<ffffffff810cad25>] run_timer_softirq+0x205/0x370
        [<ffffffff81056854>] __do_softirq+0xd4/0x460
        [<ffffffff81056d69>] irq_exit+0x89/0xa0
        [<ffffffff8185a892>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff81858a44>] apic_timer_interrupt+0x84/0x90
       ...
      
      Fix it by adding rcu_read_lock() around the iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a06fd6b1 ("writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's")
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9ad18ab9
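
      A sketch of the fix (the iterator arguments are assumptions about the
      surrounding laptop_mode_timer_fn() code):

        rcu_read_lock();
        bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
                if (wb_has_dirty_io(wb))
                        wb_start_writeback(wb, nr_pages, true,
                                           WB_REASON_LAPTOP_TIMER);
        rcu_read_unlock();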
  9. 07 Oct 2015 (1 commit)
    • Revert "fs: do not prefault sys_write() user buffer pages" · 00a3d660
      Committed by Linus Torvalds
      This reverts commit 998ef75d.
      
      The commit itself does not appear to be buggy per se, but it is exposing
      a bug in ext4 (and Ted thinks ext3 too, but we solved that by getting
      rid of it).  It's too late in the release cycle to really worry about
      this, even if Dave Hansen has a patch that may actually fix the
      underlying ext4 problem.  We can (and should) revisit this for the next
      release.
      
      The problem is that moving the prefaulting later now exposes a special
      case with partially successful writes that isn't handled correctly.  And
      the prefaulting likely isn't normally even that much of a performance
      issue - it looks like at least one reason Dave saw this in his
      performance tests is that he also ran them on Skylake that now supports
      the new SMAP code, which makes the normally very cheap user space
      prefaulting noticeably more expensive.
      Bisected-and-acked-by: Ted Ts'o <tytso@mit.edu>
      Analyzed-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00a3d660
  10. 04 Oct 2015 (1 commit)
  11. 02 Oct 2015 (6 commits)
    • dmapool: fix overflow condition in pool_find_page() · 676bd991
      Committed by Robin Murphy
      If a DMA pool lies at the very top of the dma_addr_t range (as may
      happen with an IOMMU involved), the calculated end address of the pool
      wraps around to zero, and page lookup always fails.
      
      Tweak the relevant calculation to be overflow-proof.
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Sakari Ailus <sakari.ailus@iki.fi>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      676bd991
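
      A sketch of one overflow-proof formulation (the exact form of the fix
      is an assumption): compare the offset into the pool instead of a
      computed end address that can wrap to zero:

        list_for_each_entry(page, &pool->page_list, page_list) {
                if (dma < page->dma)
                        continue;
                if (dma - page->dma < pool->allocation)
                        return page;
        }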
    • memcg: remove pcp_counter_lock · ef510194
      Committed by Greg Thelen
      Commit 733a572e ("memcg: make mem_cgroup_read_{stat|event}() iterate
      possible cpus instead of online") removed the last use of the per memcg
      pcp_counter_lock but forgot to remove the variable.
      
      Kill the vestigial variable.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef510194
    • memcg: make mem_cgroup_read_stat() unsigned · 484ebb3b
      Committed by Greg Thelen
      mem_cgroup_read_stat() returns a page count by summing per cpu page
      counters.  The summing is racy wrt.  updates, so a transient negative
      sum is possible.  Callers don't want negative values:
      
       - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
         This could confuse dirty throttling.
      
       - oom reports and memory.stat shouldn't show confusing negative usage.
      
       - tree_usage() already avoids negatives.
      
      Avoid returning negative page counts from mem_cgroup_read_stat() and
      convert it to unsigned.
      
      [akpm@linux-foundation.org: fix old typo while we're in there]
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      484ebb3b
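
      A minimal sketch of the pattern (the per-cpu layout details are an
      assumption): keep the racy signed sum internal and clamp it so that
      callers only ever see a non-negative, unsigned count:

        static unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
                                                  int idx)
        {
                long val = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        val += per_cpu(memcg->stat->count[idx], cpu);
                if (val < 0)    /* transient, racy vs. concurrent updaters */
                        val = 0;
                return val;
        }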
    • memcg: fix dirty page migration · 0610c25d
      Committed by Greg Thelen
      The problem starts with a file backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg so
      the memcg's dirty page count is decremented which causes underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages which leads to aggressive throttling of
      buffered writes by processes in non root memcg.
      
      This issue:
       - can harm performance of non root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcg, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0610c25d
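
      A sketch of the helpers (exact form and placement are assumptions);
      they keep migrate.c free of #ifdef CONFIG_MEMCG:

        #ifdef CONFIG_MEMCG
        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                return page->mem_cgroup;
        }
        static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
        {
                page->mem_cgroup = memcg;
        }
        #else
        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                return NULL;
        }
        static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
        {
        }
        #endif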
    • mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault · 2f84a899
      Committed by Mel Gorman
      SunDong reported the following on
      
        https://bugzilla.kernel.org/show_bug.cgi?id=103841
      
      	I think I find a linux bug, I have the test cases is constructed. I
      	can stable recurring problems in fedora22(4.0.4) kernel version,
      	arch for x86_64.  I construct transparent huge page, when the parent
      	and child process with MAP_SHARE, MAP_PRIVATE way to access the same
      	huge page area, it has the opportunity to lead to huge page copy on
      	write failure, and then it will munmap the child corresponding mmap
      	area, but then the child mmap area with VM_MAYSHARE attributes, child
      	process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
      	functions (vma - > vm_flags & VM_MAYSHARE).
      
      There were a number of problems with the report (e.g.  it's hugetlbfs that
      triggers this, not transparent huge pages) but it was fundamentally
      correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
      looks like this
      
      	 vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
      	 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
      	 prot 8000000000000027 anon_vma           (null) vm_ops ffffffff8182a7a0
      	 pgoff 0 file ffff88106bdb9800 private_data           (null)
      	 flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
      	 ------------
      	 kernel BUG at mm/hugetlb.c:462!
      	 SMP
      	 Modules linked in: xt_pkttype xt_LOG xt_limit [..]
      	 CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
      	 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
      	 set_vma_resv_flags+0x2d/0x30
      
      The VM_BUG_ON is correct because private and shared mappings have
      different reservation accounting but the warning clearly shows that the
      VMA is shared.
      
      When a private COW fails to allocate a new page then only the process
      that created the VMA gets the page -- all the children unmap the page.
      If the children access that data in the future then they get killed.
      
      The problem is that the same file is mapped shared and private.  During
      the COW, the allocation fails, the VMAs are traversed to unmap the other
      private pages but a shared VMA is found and the bug is triggered.  This
      patch identifies such VMAs and skips them.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: SunDong <sund_sky@126.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f84a899
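
      A minimal sketch of the skip in the unmap loop (the iterator name is
      an assumption about the surrounding unmap_ref_private() code):

        /* Shared VMAs have their own reserves and do not affect
         * MAP_PRIVATE accounting, so leave them alone here. */
        if (iter_vma->vm_flags & VM_MAYSHARE)
                continue;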
    • mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1) · 03a2d2a3
      Committed by Joonsoo Kim
      Commit description is copied from the original post of this bug:
      
        http://comments.gmane.org/gmane.linux.kernel.mm/135349
      
      Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
      larger cache size than the size index INDEX_NODE mapping.  In kernels
      3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.
      
      However, sometimes we can't get the right output we expected via
      kmalloc_size(INDEX_NODE + 1), causing a BUG().
      
      The mapping table in the latest kernel is like:
          index = {0,   1,  2 ,  3,  4,   5,   6,   n}
           size = {0,   96, 192, 8, 16,  32,  64,   2^n}
      The mapping table before 3.10 is like this:
          index = {0 , 1 , 2,   3,  4 ,  5 ,  6,   n}
          size  = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}
      
      The problem on my mips64 machine is as follows:
      
      (1) With DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC
          && DEBUG_SPINLOCK configured, sizeof(struct kmem_cache_node) will
          be "150", and the macro INDEX_NODE turns out to be "2":
          #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
      
      (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.
      
      (3) Then the "if (size >= kmalloc_size(INDEX_NODE + 1))" test will lead
          to "size = PAGE_SIZE".
      
      (4) Then the "if (size >= (PAGE_SIZE >> 3))" test will be satisfied and
          "flags |= CFLGS_OFF_SLAB" will be covered.
      
      (5) The "if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will go
          to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and the
          result here may be NULL during kernel bootup.
      
      (6) Finally,"BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the
          BUG info as the following shows (may be only mips64 has this problem):
      
      This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
      the BUG by adding a 'size >= 256' check to guarantee that all necessary
      small sized slabs are initialized regardless of the sequence of slab
      sizes in the mapping table.
      
      Fixes: e3366016 ("slab: Use common kmalloc_index/kmalloc_size...")
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Liuhailong <liu.hailong6@zte.com.cn>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03a2d2a3
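
      A sketch of the added guard (its exact placement inside the slab
      creation path is an assumption; the surrounding test is the one
      quoted in step (4) above):

        if (size >= 256 && size >= (PAGE_SIZE >> 3))
                flags |= CFLGS_OFF_SLAB;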
  12. 23 Sep 2015 (3 commits)
  13. 18 Sep 2015 (2 commits)
  14. 12 Sep 2015 (1 commit)
  15. 11 Sep 2015 (1 commit)