1. 12 7月, 2008 2 次提交
  2. 01 7月, 2008 1 次提交
    • J
      Properly notify block layer of sync writes · 18ce3751
      Jens Axboe 提交于
      fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
      then immediately wait on them. Conceptually, that makes them sync writes
      and we should treat them as such so that the IO schedulers can handle
      them appropriately.
      
      This patch fixes a write starvation issue that Lin Ming reported, where
      xx is stuck for more than 2 minutes because of a large number of
      synchronous IO in the system:
      
      INFO: task kjournald:20558 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
      kjournald     D ffff810010820978  6712 20558      2
      ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
      ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
      0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
      Call Trace:
      [<ffffffff803ba6f2>] kobject_get+0x12/0x17
      [<ffffffff80247537>] getnstimeofday+0x2f/0x83
      [<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
      [<ffffffff8066d195>] io_schedule+0x5d/0x9f
      [<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
      [<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
      [<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
      [<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
      [<ffffffff80243909>] wake_bit_function+0x0/0x23
      [<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
      [<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
      [<ffffffff8023a676>] lock_timer_base+0x26/0x4b
      [<ffffffff8030300a>] kjournald+0xc1/0x1fb
      [<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
      [<ffffffff80302f49>] kjournald+0x0/0x1fb
      [<ffffffff802437bb>] kthread+0x47/0x74
      [<ffffffff8022de51>] schedule_tail+0x28/0x5d
      [<ffffffff8020cac8>] child_rip+0xa/0x12
      [<ffffffff80243774>] kthread+0x0/0x74
      [<ffffffff8020cabe>] child_rip+0x0/0x12
      
      Lin Ming confirms that this patch fixes the issue. I've run tests with
      it for the past week and no ill effects have been observed, so I'm
      proposing it for inclusion into 2.6.26.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      18ce3751
  3. 30 4月, 2008 1 次提交
  4. 29 4月, 2008 2 次提交
  5. 28 4月, 2008 7 次提交
    • O
      Add balance_dirty_pages_ratelimited() to cont_expand_zero() · 061e9746
      OGAWA Hirofumi 提交于
      On the systems, ftruncate() which expand size for FAT became the cause
      of OOM.  The cont_expand_zero() filled all memory with dirty pages,
      and since disk is very slow, limit of page scanning was exceeded, then
      it triggered OOM.
      
      This adds balance_dirty_pages_ratelimited() to avoid filling memory
      with dirty pages.
      Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      061e9746
    • M
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman 提交于
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • M
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman 提交于
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a substraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • M
      mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman 提交于
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro is introduced called for_each_zone_zonelist() that interates
      through each zone allowed by the GFP flags in the selected zonelist.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
    • M
      mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman 提交于
      Introduce a node_zonelist() helper function.  It is used to lookup the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e88460d
    • M
      mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman 提交于
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The forth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dac1d27b
    • C
      Remove set_migrateflags() · 488514d1
      Christoph Lameter 提交于
      Migrate flags must be set on slab creation as agreed upon when the antifrag
      logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
      unmovable and others in the reclaimable section depending on which flag was
      active when a new slab page was allocated.
      
      This likely slid in somehow when antifrag was merged. Remove it.
      
      The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
      SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
      effect there.
      
      Radix tree allocations are not directly reclaimable but they are allocated
      with __GFP_RECLAIMABLE set on each allocation.  We now set
      SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
      tree slabs are consistently placed in the reclaimable section.  Radix tree
      slabs will also be accounted as such.
      
      There is then no user left of set_migratepages. So remove it.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      488514d1
  6. 05 4月, 2008 1 次提交
    • L
      Be more careful about marking buffers dirty · 1be62dc1
      Linus Torvalds 提交于
      Mikulas Patocka noted that the optimization where we check if a buffer
      was already dirty (and we avoid re-dirtying it) was not really SMP-safe.
      
      Since the read of the old status was not synchronized with anything, an
      aggressive CPU re-ordering of memory accesses might have moved that read
      up to before the data was even written to the buffer, and another CPU
      that cleaned it again, causing the newly dirty state to never actually
      hit the disk.
      
      Admittedly this would probably never trigger in practice, but it's still
      wrong.
      
      Mikulas sent a patch that fixed the problem, but I dislike the subtlety
      of the whole optimization, so this is an alternate fix that is more
      explicit about the particular SMP ordering for the optimization, and
      separates out the speculative reads of the buffer state into its own
      conditional (and makes the memory barrier only happen if we are likely
      to actually hit the optimized case in the first place).
      
      I considered removing the optimization entirely, but Andrew argued for
      it's continued existence. I'm a push-over.
      
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be62dc1
  7. 29 3月, 2008 1 次提交
    • D
      vfs: fix data leak in nobh_write_end() · 5b41e74a
      Dmitri Monakhov 提交于
      Current nobh_write_end() implementation ignore partial writes(copied < len)
      case if page was fully mapped and simply mark page as Uptodate, which is
      totally wrong because area [pos+copied, pos+len) wasn't updated explicitly in
      previous write_begin call.  It simply contains garbage from pagecache and
      result in data leakage.
      
      #TEST_CASE_BEGIN:
      ~~~~~~~~~~~~~~~~
      In fact issue triggered by classical testcase
      	open("/mnt/test", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
      	ftruncate(3, 409600)                    = 0
      	writev(3, [{"a", 1}, {NULL, 4095}], 2)  = 1
      ##TESTCASE_SOURCE:
      ~~~~~~~~~~~~~~~~~
      #include <stdio.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <sys/uio.h>
      #include <sys/mman.h>
      #include <errno.h>
      int main(int argc, char **argv)
      {
      	int fd,  ret;
      	void* p;
      	struct iovec iov[2];
      	fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0666);
      	ftruncate(fd, 409600);
      	iov[0].iov_base="a";
      	iov[0].iov_len=1;
      	iov[1].iov_base=NULL;
      	iov[1].iov_len=4096;
      	ret = writev(fd, iov, sizeof(iov)/sizeof(struct iovec));
      	printf("writev  = %d, err = %d\n", ret, errno);
      	return 0;
      }
      ##TESTCASE RESULT:
      ~~~~~~~~~~~~~~~~~~
      [root@ts63 ~]# mount | grep mnt2
      /dev/mapper/test on /mnt2 type ext2 (rw,nobh)
      [root@ts63 ~]#  /tmp/writev /mnt2/test
      writev  = 1, err = 0
      [root@ts63 ~]# hexdump -C /mnt2/test
      
      00000000  61 65 62 6f 6f 74 00 00  f0 b9 b4 59 3a 00 00 00  |aeboot.....Y:...|
      00000010  20 00 00 00 00 00 00 00  21 00 00 00 00 00 00 00  | .......!.......|
      00000020  df df df df df df df df  df df df df df df df df  |................|
      00000030  3a 00 00 00 2a 00 00 00  21 00 00 00 00 00 00 00  |:...*...!.......|
      00000040  60 c0 8c 00 00 00 00 00  40 4a 8d 00 00 00 00 00  |`.......@J......|
      00000050  00 00 00 00 00 00 00 00  41 00 00 00 00 00 00 00  |........A.......|
      00000060  74 69 6d 65 20 64 64 20  69 66 3d 2f 64 65 76 2f  |time dd if=/dev/|
      00000070  6c 6f 6f 70 30 20 20 6f  66 3d 2f 64 65 76 2f 6e  |loop0  of=/dev/n|
      skip..
      00000f50  00 00 00 00 00 00 00 00  31 00 00 00 00 00 00 00  |........1.......|
      00000f60  6d 6b 66 73 2e 65 78 74  33 20 2f 64 65 76 2f 76  |mkfs.ext3 /dev/v|
      00000f70  7a 76 67 2f 74 65 73 74  20 2d 62 34 30 39 36 00  |zvg/test -b4096.|
      00000f80  a0 fe 8c 00 00 00 00 00  21 00 00 00 00 00 00 00  |........!.......|
      00000f90  23 31 32 30 35 39 35 30  34 30 34 00 3a 00 00 00  |#1205950404.:...|
      00000fa0  20 00 8d 00 00 00 00 00  21 00 00 00 00 00 00 00  | .......!.......|
      00000fb0  d0 cf 8c 00 00 00 00 00  10 d0 8c 00 00 00 00 00  |................|
      00000fc0  00 00 00 00 00 00 00 00  41 00 00 00 00 00 00 00  |........A.......|
      00000fd0  6d 6f 75 6e 74 20 2f 64  65 76 2f 76 7a 76 67 2f  |mount /dev/vzvg/|
      00000fe0  74 65 73 74 20 20 2f 76  7a 20 2d 6f 20 64 61 74  |test  /vz -o dat|
      00000ff0  61 3d 77 72 69 74 65 62  61 63 6b 00 00 00 00 00  |a=writeback.....|
      00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
      
      As you can see file's page contains garbage from pagecache instead of zeros.
      #TEST_CASE_END
      
      Attached patch:
      - Add sanity check BUG_ON in order to prevent incorrect usage by caller,
        This is function invariant because page can has buffers and in no zero
        *fadata pointer at the same time.
      - Always attach buffers to page is it is partial write case.
      - Always switch back to generic_write_end if page has buffers.
        This is reasonable because if page already has buffer then generic_write_begin
        was called previously.
      Signed-off-by: NDmitri Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NNick Piggin <npiggin@suse.de>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b41e74a
  8. 20 3月, 2008 1 次提交
    • R
      fs: fix kernel-doc notation warnings · a6b91919
      Randy Dunlap 提交于
      Fix kernel-doc notation warnings in fs/.
      
      Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
       *	mark_files_ro
      Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
       *	lease_get_mtime
      Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
       *	lease_get_mtime
      Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
       * lookup_one_len:  filesystem helper to lookup single pathname component
      Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
       * bh_uptodate_or_lock: Test whether the buffer is uptodate
      Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
       * bh_submit_read: Submit a locked buffer for reading
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
       * writeback_acquire: attempt to get exclusive writeback access to a device
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
       * writeback_in_progress: determine whether there is writeback in progress
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
       * writeback_release: relinquish exclusive writeback access against a device.
      Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
      Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
      Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
       * void journal_invalidatepage()
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6b91919
  9. 05 3月, 2008 1 次提交
  10. 04 3月, 2008 1 次提交
  11. 09 2月, 2008 3 次提交
    • J
      buffer_head: fix private_list handling · 535ee2fb
      Jan Kara 提交于
      There are two possible races in handling of private_list in buffer cache.
      
      1) When fsync_buffers_list() processes a private_list, it clears
         b_assoc_mapping and moves buffer to its private list.  Now
         drop_buffers() comes, sees a buffer is on list so it calls
         __remove_assoc_queue() which complains about b_assoc_mapping being
         cleared (as it cannot propagate possible IO error).  This race has been
         actually observed in the wild.
      
      2) When fsync_buffers_list() processes a private_list,
         mark_buffer_dirty_inode() can be called on bh which is already on the
         private list of fsync_buffers_list().  As buffer is on some list (note
         that the check is performed without private_lock), it is not readded to
         the mapping's private_list and after fsync_buffers_list() finishes, we
         have a dirty buffer which should be on private_list but it isn't.  This
         race has not been reported, probably because most (but not all) callers
         of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
         fsync().
      
      Fix these issues by not clearing b_assoc_map when fsync_buffers_list()
      moves buffer to a dedicated list and by reinserting buffer in private_list
      when it is found dirty after we have submitted buffer for IO.  We also
      change the tests whether a buffer is on a private list from
      !list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map so that they are
      single word reads and hence lockless checks are safe.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      535ee2fb
    • H
      fs: remove fastcall, it is always empty · fc9b52cd
      Harvey Harrison 提交于
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc9b52cd
    • N
      rewrite rd · 9db5579b
      Nick Piggin 提交于
      This is a rewrite of the ramdisk block device driver.
      
      The old one is really difficult because it effectively implements a block
      device which serves data out of its own buffer cache.  It relies on the dirty
      bit being set, to pin its backing store in cache, however there are non
      trivial paths which can clear the dirty bit (eg.  try_to_free_buffers()),
      which had recently lead to data corruption.  And in general it is completely
      wrong for a block device driver to do this.
      
      The new one is more like a regular block device driver.  It has no idea about
      vm/vfs stuff.  It's backing store is similar to the buffer cache (a simple
      radix-tree of pages), but it doesn't know anything about page cache (the pages
      in the radix tree are not pagecache pages).
      
      There is one slight downside -- direct block device access and filesystem
      metadata access goes through an extra copy and gets stored in RAM twice.
      However, this downside is only slight, because the real buffercache of the
      device is now reclaimable (because we're not playing crazy games with it), so
      under memory intensive situations, footprint should effectively be the same --
      maybe even a slight advantage to the new driver because it can also reclaim
      buffer heads.
      
      The fact that it now goes through all the regular vm/fs paths makes it
      much more useful for testing, too.
      
         text    data     bss     dec     hex filename
         2837     849     384    4070     fe6 drivers/block/rd.o
         3528     371      12    3911     f47 drivers/block/brd.o
      
      Text is larger, but data and bss are smaller, making total size smaller.
      
      A few other nice things about it:
      - Similar structure and layout to the new loop device handlinag.
      - Dynamic ramdisk creation.
      - Runtime flexible buffer head size (because it is no longer part of the
        ramdisk code).
      - Boot / load time flexible ramdisk size, which could easily be extended
        to a per-ramdisk runtime changeable size (eg. with an ioctl).
      - Can use highmem for the backing store.
      
      [akpm@linux-foundation.org: fix build]
      [byron.bbradley@gmail.com: make rd_size non-static]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NByron Bradley <byron.bbradley@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9db5579b
  12. 06 2月, 2008 2 次提交
    • C
      bufferhead: revert constructor removal · b98938c3
      Christoph Lameter 提交于
      The constructor for buffer_head slabs was removed recently.  We need the
      constructor back in slab defrag in order to insure that slab objects always
      have a definite state even before we allocated them.
      
      I think we mistakenly merged the removal of the constuctor into a cleanup
      patch.  You (ie: akpm) had a test that showed that the removal of the
      constructor led to a small regression.  The prior state makes things easier
      for slab defrag.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b98938c3
    • C
      Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user · eebd2aa3
      Christoph Lameter 提交于
      Simplify page cache zeroing of segments of pages through 3 functions
      
      zero_user_segments(page, start1, end1, start2, end2)
      
              Zeros two segments of the page. It takes the position where to
              start and end the zeroing which avoids length calculations and
      	makes code clearer.
      
      zero_user_segment(page, start, end)
      
              Same for a single segment.
      
      zero_user(page, start, length)
      
              Length variant for the case where we know the length.
      
      We remove the zero_user_page macro. Issues:
      
      1. Its a macro. Inline functions are preferable.
      
      2. The KM_USER0 macro is only defined for HIGHMEM.
      
         Having to treat this special case everywhere makes the
         code needlessly complex. The parameter for zeroing is always
         KM_USER0 except in one single case that we open code.
      
      Avoiding KM_USER0 makes a lot of code not having to be dealing
      with the special casing for HIGHMEM anymore. Dealing with
      kmap is only necessary for HIGHMEM configurations. In those
      configurations we use KM_USER0 like we do for a series of other
      functions defined in highmem.h.
      
      Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
      function could not be a macro. zero_user_* functions introduced
      here can be be inline because that constant is not used when these
      functions are called.
      
      Also extract the flushing of the caches to be outside of the kmap.
      
      [akpm@linux-foundation.org: fix nfs and ntfs build]
      [akpm@linux-foundation.org: fix ntfs build some more]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eebd2aa3
  13. 29 1月, 2008 1 次提交
  14. 21 10月, 2007 1 次提交
    • N
      nobh: nobh_write_end fix · efdc3131
      Nick Piggin 提交于
      This path mustn't have been tested :( I did attempt to exercise it
      by injecting failures here, but I suspect PageMappedToDisk may have
      been getting in the way. Will need more of a look, although I think
      nobh mode is OK for an -rc1 (it shouldn't eat anyone's data).
      
      Commit 03158cd7 ("fs: restore nobh")
      introcduced a NULL deref.  Spotted by the Coverity checker.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      efdc3131
  15. 17 10月, 2007 10 次提交
    • F
      writeback: remove pages_skipped accounting in __block_write_full_page() · 1f7decf6
      Fengguang Wu 提交于
      Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
      
      > The following strange behavior can be observed:
      >
      > 1. large file is written
      > 2. after 30 seconds, nr_dirty goes down by 1024
      > 3. then for some time (< 30 sec) nothing happens (disk idle)
      > 4. then nr_dirty again goes down by 1024
      > 5. repeat from 3. until whole file is written
      >
      > So basically a 4Mbyte chunk of the file is written every 30 seconds.
      > I'm quite sure this is not the intended behavior.
      
      It can be produced by the following test scheme:
      
      # cat bin/test-writeback.sh
      grep nr_dirty /proc/vmstat
      echo 1 > /proc/sys/fs/inode_debug
      dd if=/dev/zero of=/var/x bs=1K count=204800&
      while true; do grep nr_dirty /proc/vmstat; sleep 1; done
      
      # bin/test-writeback.sh
      nr_dirty 19207
      nr_dirty 19207
      nr_dirty 30924
      204800+0 records in
      204800+0 records out
      209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
      nr_dirty 47150
      nr_dirty 47141
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47205
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47215
      nr_dirty 47216
      nr_dirty 47216
      nr_dirty 47216
      nr_dirty 47154
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47134
      nr_dirty 47134
      nr_dirty 47135
      nr_dirty 47135
      nr_dirty 47135
      nr_dirty 46097 <== -1038
      nr_dirty 46098
      nr_dirty 46098
      nr_dirty 46098
      [...]
      nr_dirty 46091
      nr_dirty 46092
      nr_dirty 46092
      nr_dirty 45069 <== -1023
      nr_dirty 45056
      nr_dirty 45056
      nr_dirty 45056
      [...]
      nr_dirty 37822
      nr_dirty 36799 <== -1023
      [...]
      nr_dirty 36781
      nr_dirty 35758 <== -1023
      [...]
      nr_dirty 34708
      nr_dirty 33672 <== -1024
      [...]
      nr_dirty 33692
      nr_dirty 32669 <== -1023
      
      % ls -li /var/x
      847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
      
      % dmesg|grep 847824  # generated by a debug printk
      [  529.263184] redirtied inode 847824 line 548
      [  564.250872] redirtied inode 847824 line 548
      [  594.272797] redirtied inode 847824 line 548
      [  629.231330] redirtied inode 847824 line 548
      [  659.224674] redirtied inode 847824 line 548
      [  689.219890] redirtied inode 847824 line 548
      [  724.226655] redirtied inode 847824 line 548
      [  759.198568] redirtied inode 847824 line 548
      
      # line 548 in fs/fs-writeback.c:
      543                 if (wbc->pages_skipped != pages_skipped) {
      544                         /*
      545                          * writeback is not making progress due to locked
      546                          * buffers.  Skip this inode for now.
      547                          */
      548                         redirty_tail(inode);
      549                 }
      
      More debug efforts show that __block_write_full_page()
      never has the chance to call submit_bh() for that big dirty file:
      the buffer head is *clean*. So basicly no page io is issued by
      __block_write_full_page(), hence pages_skipped goes up.
      
      Also the comment in generic_sync_sb_inodes():
      
      544                         /*
      545                          * writeback is not making progress due to locked
      546                          * buffers.  Skip this inode for now.
      547                          */
      
      and the comment in __block_write_full_page():
      
      1713                 /*
      1714                  * The page was marked dirty, but the buffers were
      1715                  * clean.  Someone wrote them back by hand with
      1716                  * ll_rw_block/submit_bh.  A rare case.
      1717                  */
      
      do not quite agree with each other. The page writeback should be skipped for
      'locked buffer', but here it is 'clean buffer'!
      
      This patch fixes this bug. Though I'm not sure why __block_write_full_page()
      is called only to do nothing and who actually issued the writeback for us.
      
      This is the two possible new behaviors after the patch:
      
      1) pretty nice: wait 30s and write ALL:)
      2) not so good:
      	- during the dd: ~16M
      	- after 30s:      ~4M
      	- after 5s:       ~4M
      	- after 5s:     ~176M
      
      The next patch will fix case (2).
      
      Cc: David Chinner <dgc@sgi.com>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f7decf6
    • P
      mm: count reclaimable pages per BDI · c9e51e41
      Peter Zijlstra 提交于
      Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9e51e41
    • M
      Group short-lived and reclaimable kernel allocations · e12ba74d
      Mel Gorman 提交于
      This patch marks a number of allocations that are either short-lived such as
      network buffers or are reclaimable such as inode allocations.  When something
      like updatedb is called, long-lived and unmovable kernel allocations tend to
      be spread throughout the address space which increases fragmentation.
      
      This patch groups these allocations together as much as possible by adding a
      new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
      reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
      them and re-reading the information from elsewhere.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e12ba74d
    • N
      fs: restore nobh · 03158cd7
      Nick Piggin 提交于
      Implement nobh in new aops.  This is a bit tricky.  FWIW, nobh_truncate is
      now implemented in a way that does not create blocks in sparse regions,
      which is a silly thing for it to have been doing (isn't it?)
      
      ext2 survives fsx and fsstress. jfs is converted as well... ext3
      should be easy to do (but not done yet).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03158cd7
    • N
    • N
      fs: new cont helpers · 89e10787
      Nick Piggin 提交于
      Rework the generic block "cont" routines to handle the new aops.  Supporting
      cont_prepare_write would take quite a lot of code to support, so remove it
      instead (and we later convert all filesystems to use it).
      
      write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
      generic_cont_expand, so filesystems can avoid the old hacks they used.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89e10787
    • N
      fs: introduce write_begin, write_end, and perform_write aops · afddba49
      Nick Piggin 提交于
      These are intended to replace prepare_write and commit_write with more
      flexible alternatives that are also able to avoid the buffered write
      deadlock problems efficiently (which prepare_write is unable to do).
      
      [mark.fasheh@oracle.com: API design contributions, code review and fixes]
      [akpm@linux-foundation.org: various fixes]
      [dmonakhov@sw.ru: new aop block_write_begin fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NMark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: NDmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afddba49
    • N
      fs: fix data-loss on error · 637aff46
      Nick Piggin 提交于
      New buffers against uptodate pages are simply be marked uptodate, while the
      buffer_new bit remains set.  This causes error-case code to zero out parts of
      those buffers because it thinks they contain stale data: wrong, they are
      actually uptodate so this is a data loss situation.
      
      Fix this by actually clearning buffer_new and marking the buffer dirty.  It
      makes sense to always clear buffer_new before setting a buffer uptodate.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      637aff46
    • N
      fs: fix nobh error handling · a4b0672d
      Nick Piggin 提交于
      nobh mode error handling is not just pretty slack, it's wrong.
      
      One cannot zero out the whole page to ensure new blocks are zeroed, because
      it just brings the whole page "uptodate" with zeroes even if that may not
      be the correct uptodate data.  Also, other parts of the page may already
      contain dirty data which would get lost by zeroing it out.  Thirdly, the
      writeback of zeroes to the new blocks will also erase existing blocks.  All
      these conditions are pagecache and/or filesystem corruption.
      
      The problem comes about because we didn't keep track of which buffers
      actually are new or old.  However it is not enough just to keep only this
      state, because at the point we start dirtying parts of the page (new
      blocks, with zeroes), the handling of IO errors becomes impossible without
      buffers because the page may only be partially uptodate, in which case the
      page flags allone cannot capture the state of the parts of the page.
      
      So allocate all buffers for the page upfront, but leave them unattached so
      that they don't pick up any other references and can be freed when we're
      done.  If the error path is hit, then zero the new buffers as the regular
      buffer path does, then attach the buffers to the page so that it can
      actually be written out correctly and be subject to the normal IO error
      handling paths.
      
      As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
      systems.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4b0672d
    • D
      mm: add end_buffer_read helper function · 68671f35
      Dmitry Monakhov 提交于
      Move duplicated code from end_buffer_read_XXX methods to separate helper
      function.
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68671f35
  16. 10 10月, 2007 1 次提交
  17. 20 7月, 2007 1 次提交
  18. 19 7月, 2007 1 次提交
    • D
      [FS] Implement block_page_mkwrite. · 54171690
      David Chinner 提交于
      Many filesystems need a ->page-mkwrite callout to correctly
      set up pages that have been written to by mmap. This is especially
      important when mmap is writing into holes as it allows filesystems
      to correctly account for and allocate space before the mmap
      write is allowed to proceed.
      
      Protection against truncate races is provided by locking the page
      and checking to see whether the page mapping is correct and whether
      it is beyond EOF so we don't end up allowing allocations beyond
      the current EOF or changing EOF as a result of a mmap write.
      
      SGI-PV: 940392
      SGI-Modid: 2.6.x-xfs-melb:linux:29146a
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NTim Shimmin <tes@sgi.com>
      54171690
  19. 18 7月, 2007 2 次提交
    • N
      fs: introduce some page/buffer invariants · 787d2214
      Nick Piggin 提交于
      It is a bug to set a page dirty if it is not uptodate unless it has
      buffers.  If the page has buffers, then the page may be dirty (some buffers
      dirty) but not uptodate (some buffers not uptodate).  The exception to this
      rule is if the set_page_dirty caller is racing with truncate or invalidate.
      
      A buffer can not be set dirty if it is not uptodate.
      
      If either of these situations occurs, it indicates there could be some data
      loss problem.  Some of these warnings could be a harmless one where the
      page or buffer is set uptodate immediately after it is dirtied, however we
      should fix those up, and enforce this ordering.
      
      Bring the order of operations for truncate into line with those of
      invalidate.  This will prevent a page from being able to go !uptodate while
      we're holding the tree_lock, which is probably a good thing anyway.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      787d2214
    • A
      Lumpy Reclaim V4 · 5ad333eb
      Andy Whitcroft 提交于
      When we are out of memory of a suitable size we enter reclaim.  The current
      reclaim algorithm targets pages in LRU order, which is great for fairness at
      order-0 but highly unsuitable if you desire pages at higher orders.  To get
      pages of higher order we must shoot down a very high proportion of memory;
      >95% in a lot of cases.
      
      This patch set adds a lumpy reclaim algorithm to the allocator.  It targets
      groups of pages at the specified order anchored at the end of the active and
      inactive lists.  This encourages groups of pages at the requested orders to
      move from active to inactive, and active to free lists.  This behaviour is
      only triggered out of direct reclaim when higher order pages have been
      requested.
      
      This patch set is particularly effective when utilised with an
      anti-fragmentation scheme which groups pages of similar reclaimability
      together.
      
      This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
      the foundation.  Credit to Mel Gorman for sanitity checking.
      
      Mel said:
      
        The patches have an application with hugepage pool resizing.
      
        When lumpy-reclaim is used used with ZONE_MOVABLE, the hugepages pool can
        be resized with greater reliability.  Testing on a desktop machine with 2GB
        of RAM showed that growing the hugepage pool with ZONE_MOVABLE on it's own
        was very slow as the success rate was quite low.  Without lumpy-reclaim,
        each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
        With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.
      
      [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
      [bunk@stusta.de: static declarations for internal functions]
      [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Bob Picco <bob.picco@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ad333eb