1. 16 Sep, 2009 15 commits
    • HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs · cae681fc
      Committed by Andi Kleen
      Useful for some testing scenarios, although specific testing is often
      better done through MADV_POISON.

      This can be done with the x86-level MCE injector too, but this interface
      allows injection independently of low-level x86 changes.

      v2: Add module license (Haicheng Li)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Add madvise() based injector for hardware poisoned pages v4 · 9893e49d
      Committed by Andi Kleen
      Impact: optional, useful for debugging
      
      Add a new madvise subcommand to inject poison for some
      pages in a process' address space.  This is useful for
      testing the poison page handling.
      
      This patch can allow root to tie up large amounts of memory.
      I got feedback from container developers and they didn't see any
      problem.
      
      v2: Use write flag for get_user_pages to make sure to always get
      a fresh page
      v3: Don't request write mapping (Fengguang Wu)
      v4: Move MADV_* number to avoid conflict with KSM (Hugh Dickins)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
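The injector described in this commit is reached from user space through madvise(). The following is a minimal sketch, not code from the patch: `page_floor` and `inject_poison` are hypothetical helper names, and the fallback value 100 is an assumption for headers that lack the constant. Current Linux headers spell the constant MADV_HWPOISON, while the commit message calls it MADV_POISON. The call is expected to need root privileges and a kernel built with memory-failure support, and it really does take the page out of service, so it should only be tried on a test machine.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: fall back to the value used by current Linux uapi headers
 * when the C library does not expose the constant. */
#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100
#endif

/* Round an address down to the start of its page.
 * page_size must be a non-zero power of two. */
uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Hypothetical helper: ask the kernel to treat the page containing addr
 * as hardware-poisoned. Returns 0 on success, -1 with errno set on failure
 * (for example when the caller lacks privileges). */
int inject_poison(void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    uintptr_t start = page_floor((uintptr_t)addr, (uintptr_t)page_size);
    return madvise((void *)start, (size_t)page_size, MADV_HWPOISON);
}
```

A test program would typically mmap() a page, touch it so that it is backed by real memory, call `inject_poison()` on it, and then observe how the kernel reacts when the page is accessed again.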
    • HWPOISON: Enable .remove_error_page for migration aware file systems · aa261f54
      Committed by Andi Kleen
      Enable removing of corrupted pages through truncation
      for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs.
      These should cover most server needs.
      
      I chose the set of migration aware file systems for this
      for now, assuming they have been especially audited.
      But in general it should be safe for all file systems
      on the data area that support read/write and truncate.
      
      Caveat: the hardware error handler does not take i_mutex
      for now before calling the truncate function. Is that ok?
      
      Cc: tytso@mit.edu
      Cc: hch@infradead.org
      Cc: mfasheh@suse.com
      Cc: aia21@cantab.net
      Cc: hugh.dickins@tiscali.co.uk
      Cc: swhiteho@redhat.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: The high level memory error handler in the VM v7 · 6a46079c
      Committed by Andi Kleen
      Add the high level memory handler that poisons pages
      that got corrupted by hardware (typically by a two bit flip in a DIMM
      or a cache) on the Linux level. The goal is to prevent everyone
      from accessing these pages in the future.

      This is done at the VM level by marking a page hwpoisoned
      and doing the appropriate action based on the type of page
      it is.

      The code that does this is portable and lives in mm/memory-failure.c.
      
      To quote the overview comment:
      
      High level machine check handler. Handles pages reported by the
      hardware as being corrupted usually due to a 2bit ECC memory or cache
      failure.
      
      This focuses on pages detected as corrupted in the background.
      When the current CPU tries to consume corruption the currently
      running process can just be killed directly instead. This implies
      that if the error cannot be handled for some reason it's safe to
      just ignore it because no corruption has been consumed yet. Instead
      when that happens another machine check will happen.
      
      Handles page cache pages in various states. The tricky part
      here is that we can access any page asynchronous to other VM
      users, because memory failures could happen anytime and anywhere,
      possibly violating some of their assumptions. This is why this code
      has to be extremely careful. Generally it tries to use normal locking
      rules, as in get the standard locks, even if that means the
      error handling takes potentially a long time.
      
      Some of the operations here are somewhat inefficient and have non
      linear algorithmic complexity, because the data structures have not
      been optimized for this case. This is in particular the case
      for the mapping from a vma to a process. Since this case is expected
      to be rare we hope we can get away with this.
      
      There are in principle two strategies to kill processes on poison:
      - just unmap the data and wait for an actual reference before
      killing
      - kill as soon as corruption is detected.
      Both have advantages and disadvantages and should be used
      in different situations. Right now both are implemented and can
      be switched with a new sysctl, vm.memory_failure_early_kill.
      The default is early kill.
      
      The patch does some rmap data structure walking on its own to collect
      processes to kill. This is unusual because normally all rmap data structure
      knowledge is in rmap.c only. I put it here for now to keep
      everything together, and rmap knowledge has been seeping out anyway.
      
      Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
      Nick Piggin (who did a lot of great work) and others.
      
      Cc: npiggin@suse.de
      Cc: riel@redhat.com
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
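The kill strategy named in this commit message is selected through the vm.memory_failure_early_kill sysctl, which procfs normally exposes as /proc/sys/vm/memory_failure_early_kill. The sketch below shows one way a tool could read the current setting. `read_kill_policy` is a hypothetical helper, and treating 1 as early kill and 0 as kill-on-access is an assumption based on the sysctl name and the description above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read a 0/1 policy value from a sysctl file.
 * Returns 1 (assumed: early kill), 0 (assumed: kill on access),
 * or -1 if the file is missing or holds anything else. */
int read_kill_policy(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);

    return (value == 0 || value == 1) ? value : -1;
}
```

On a kernel with this feature enabled, calling `read_kill_policy("/proc/sys/vm/memory_failure_early_kill")` should report the active mode. Taking the path as a parameter keeps the helper usable on systems where the file does not exist.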
    • HWPOISON: shmem: call set_page_dirty() with locked page · 6746aff7
      Committed by Wu Fengguang
      The dirtying of page and set_page_dirty() can be moved into the page lock.
      
      - In shmem_write_end(), the page was dirtied while the page lock was held,
        but it's being marked dirty just after dropping the page lock.
      - In shmem_symlink(), both dirtying and marking can be moved into page lock.
      
      It's valuable for the hwpoison code to know whether one bad page can be dropped
      without losing data. It mainly judges by testing the PG_dirty bit after taking
      the page lock. So it becomes important that the dirtying of page and the
      marking of dirtiness are both done inside the page lock. Which is a common
      practice, but sadly not a rule.
      
      The noticeable exceptions are
      - mapped pages
      - pages with buffer_heads
      The above pages could go dirty at any time. Fortunately the hwpoison will
      unmap the page and release the buffer_heads beforehand anyway.
      
      Many other types of pages (e.g. metadata pages) can also be dirtied at will by
      their owners; the hwpoison code cannot do meaningful things to them anyway.
      Only the dirtiness of pagecache pages owned by regular files is of interest.
      
      v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Define a new error_remove_page address space op for async truncation · 25718736
      Committed by Andi Kleen
      Truncating metadata pages is not safe right now, because
      we haven't yet audited all file systems.

      To enable truncation only for data address spaces, define
      a new address_space callback, error_remove_page.

      This is used for the memory-failure.c memory error handling.

      It can then be set to truncate_inode_page().

      This patch just defines the new operation and adds documentation.

      Callers and users come in follow-on patches.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Add invalidate_inode_page · 83f78668
      Committed by Wu Fengguang
      Add a simple way to invalidate a single page.
      This is just a refactoring of the truncate.c code.
      Originally from Fengguang, modified by Andi Kleen.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Refactor truncate to allow direct truncating of page v2 · 750b4987
      Committed by Nick Piggin
      Extract truncate_inode_page() out of the truncate path so that
      it can be used by memory-failure.c.

      [AK: description, headers, fix typos]
      v2: Some white space changes from Fengguang Wu
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: check and isolate corrupted free pages v2 · 2a7684a2
      Committed by Wu Fengguang
      If memory corruption hits the free buddy pages, we can safely ignore them.
      No one will access them until page allocation time, then prep_new_page()
      will automatically check and isolate PG_hwpoison page for us (for 0-order
      allocation).
      
      This patch expands prep_new_page() to check every component page in a high
      order page allocation, in order to completely stop PG_hwpoison pages from
      being recirculated.
      
      Note that the common case, allocating only a single page, doesn't
      do any more work than before. Allocating > order 0 does a bit more work,
      but that's relatively uncommon.

      This simple implementation may drop some innocent neighbor pages; hopefully
      that is not a big problem, because the event should be rare enough.
      
      This patch adds some runtime costs to high order page users.
      
      [AK: Improved description]
      
      v2: Andi Kleen:
      Port to -mm code
      Move check into separate function.
      Don't dump stack in bad_pages for hwpoisoned pages.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
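The per-component check described above can be pictured with a small user-space model. This is a toy sketch rather than the kernel's prep_new_page(): `struct toy_page` and `block_is_clean` are hypothetical, and a plain boolean stands in for the PG_hwpoison page flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the flag this sketch needs. */
struct toy_page {
    bool hwpoison;
};

/* Return true if none of the (1 << order) pages starting at first is
 * poisoned. An order-0 request inspects exactly one page, so the common
 * case does no extra work. */
bool block_is_clean(const struct toy_page *first, unsigned int order)
{
    assert(first != NULL);

    size_t count = (size_t)1 << order;
    for (size_t i = 0; i < count; i++) {
        if (first[i].hwpoison)
            return false; /* whole block rejected, healthy neighbours included */
    }
    return true;
}
```

Rejecting the whole block is what makes the approach simple, and it is also why some healthy neighbour pages can be lost along with the poisoned one.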
    • HWPOISON: Handle hardware poisoned pages in try_to_unmap · 888b9f7c
      Committed by Andi Kleen
      When a page has the poison bit set, replace the PTE with a poison entry.
      This causes the right error handling to be done later when a process runs
      into it.

      v2: add a new flag to not do that (needed for the memory-failure handler
      later) (Fengguang)
      v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan)
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Use bitmask/action code for try_to_unmap behaviour · 14fa31b8
      Committed by Andi Kleen
      try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
      which are selected by magic flag variables. The logic is not very
      straightforward, because each of these flags changes multiple behaviours
      (e.g. migration turns off aging, not only sets up migration ptes etc.)
      Also the different flags interact in magic ways.

      A later patch in this series adds another mode to try_to_unmap, so
      this quickly becomes unmanageable.

      Replace the different flags with an action code (migration, munlock, munmap)
      and some additional flags as modifiers (ignore mlock, ignore aging).
      This makes the logic more straightforward and allows easier extension
      to new behaviours. Change all the callers to declare what they want to
      do.
      
      This patch is supposed to be a nop in behaviour. If anyone can prove
      it is not that would be a bug.
      
      Cc: Lee.Schermerhorn@hp.com
      Cc: npiggin@suse.de
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
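The "action code plus modifier flags" layout described above can be sketched as a single integer whose low byte holds the action and whose higher bits hold independent modifiers. The names below are illustrative assumptions and are not taken from the patch; only the general encoding technique is shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: low 8 bits select one action, higher bits are
 * modifiers that can be OR-ed onto any action. */
enum unmap_flags {
    UNMAP_ACT_UNMAP     = 0,      /* normal unmap */
    UNMAP_ACT_MIGRATION = 1,      /* set up migration entries */
    UNMAP_ACT_MUNLOCK   = 2,      /* munlock mode */
    UNMAP_ACT_MASK      = 0xff,

    UNMAP_IGNORE_MLOCK  = 1 << 8, /* modifier: ignore mlock */
    UNMAP_IGNORE_AGING  = 1 << 9  /* modifier: ignore aging */
};

/* Extract the action code, discarding all modifier bits. */
int unmap_action(int flags)
{
    return flags & UNMAP_ACT_MASK;
}

/* Test one modifier bit. Passing an action code here is a caller bug. */
bool unmap_has(int flags, int modifier)
{
    assert((modifier & UNMAP_ACT_MASK) == 0);
    return (flags & modifier) != 0;
}
```

A caller states its intent in one expression, for example `UNMAP_ACT_MIGRATION | UNMAP_IGNORE_MLOCK | UNMAP_IGNORE_AGING`, and a new mode only needs a new action value rather than another interacting boolean.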
    • HWPOISON: Add poison check to page fault handling · a3b947ea
      Committed by Andi Kleen
      Bail out early when hardware poisoned pages are found in page fault handling.
      Since they are poisoned they should not be mapped freshly into processes,
      because that would cause another (potentially deadly) machine check.

      This is generally handled in the same way as OOM, just a different
      error code is returned to the architecture code.

      v2: Do a page unlock if needed (Fengguang Wu)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Add basic support for poisoned pages in fault handler v3 · d1737fdb
      Committed by Andi Kleen
      - Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
      architectures have to explicitly enable poison page support, so
      this is forward compatible with all architectures. They only need
      to add it when they enable poison page support.
      - Add poison page handling in the swap-in fault code

      v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
      v3: Really use delayacct_clear_flag (Hidehiro Kawai)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
    • HWPOISON: Add support for poison swap entries v2 · a7420aa5
      Committed by Andi Kleen
      Memory migration uses special swap entry types to trigger special actions on
      page faults. Extend this mechanism to also support poisoned swap entries, to
      trigger poison handling on page faults. This allows follow-on patches to
      prevent processes from faulting in poisoned pages again.

      v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
      v3: Better overflow fix (Hidehiro Kawai)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
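The idea of reserving swap entry types for special purposes can be illustrated with a toy encoding. This is a sketch under stated assumptions, not the kernel's swp_entry_t layout: the 5-bit type field, the 32-bit entry width, the count of three reserved types, and all names are hypothetical. It also shows why each reserved type reduces the number of real swap files that can be addressed, which is the kind of limit the MAX_SWAPFILES changelog entries refer to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TYPE_BITS          5u
#define TYPE_COUNT         (1u << TYPE_BITS)           /* 32 type values */
#define OFFSET_BITS        (32u - TYPE_BITS)
#define SPECIAL_TYPES      3u  /* assumed: two migration types, one poison type */
#define MAX_REAL_SWAPFILES (TYPE_COUNT - SPECIAL_TYPES)
#define TYPE_POISON        (TYPE_COUNT - 1u)           /* top value reserved */

typedef uint32_t toy_swp_entry;

/* Pack a type and an offset into one entry value. */
toy_swp_entry make_entry(unsigned int type, uint32_t offset)
{
    assert(type < TYPE_COUNT);
    assert(offset < (1u << OFFSET_BITS));
    return ((uint32_t)type << OFFSET_BITS) | offset;
}

/* Recover the type field from an entry. */
unsigned int entry_type(toy_swp_entry entry)
{
    return entry >> OFFSET_BITS;
}

/* A fault handler would branch on this instead of reading from swap. */
bool is_poison_entry(toy_swp_entry entry)
{
    return entry_type(entry) == TYPE_POISON;
}
```

With this layout a page table entry that decodes to the poison type can be recognised at fault time without touching any swap device.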
    • HWPOISON: Export some rmap vma locking to outside world · 10be22df
      Committed by Andi Kleen
      Needed for later patch that walks rmap entries on its own.

      This used to be very frowned upon, but memory-failure.c does
      some rather specialized rmap walking and rmap has been stable
      for quite some time, so I think it's ok now to export it.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
  2. 14 Sep, 2009 8 commits
  3. 11 Sep, 2009 7 commits
    • kmemleak: Improve the "Early log buffer exceeded" error message · addd72c1
      Committed by Catalin Marinas
      Based on a suggestion from Jaswinder, clarify what the user would need
      to do to avoid this error message from kmemleak.
      Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
    • writeback: check for registered bdi in flusher add and inode dirty · 500b067c
      Committed by Jens Axboe
      Also a debugging aid. We want to catch dirty inodes being added to
      backing devices that don't do writeback.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: add name to backing_dev_info · d993831f
      Committed by Jens Axboe
      This enables us to track who does what and print info. Its main use
      is catching dirty inodes on the default_backing_dev_info, so we can
      fix that up.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: add some debug inode list counters to bdi stats · f09b00d3
      Committed by Jens Axboe
      Add some debug entries to be able to inspect the internal state of
      the writeback details.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: get rid of pdflush completely · d0bceac7
      Committed by Jens Axboe
      It is now unused, so kill it off.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: switch to per-bdi threads for flushing data · 03ba3782
      Committed by Jens Axboe
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. An
      SSD-based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as do random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Committed by Jens Axboe
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  4. 10 Sep, 2009 1 commit
  5. 09 Sep, 2009 4 commits
  6. 08 Sep, 2009 3 commits
  7. 06 Sep, 2009 2 commits