1. 20 3月, 2009 1 次提交
  2. 19 2月, 2009 1 次提交
  3. 18 2月, 2009 1 次提交
    • J
      block: revert part of 18ce3751 · 78f707bf
      Jens Axboe 提交于
      The above commit added WRITE_SYNC and switched various places to using
      that for committing writes that will be waited upon immediately after
      submission. However, this causes a performance regression with AS and CFQ
      for ext3 at least, since sync_dirty_buffer() will submit some writes with
      WRITE_SYNC while ext3 has sumitted others dependent writes without the sync
      flag set. This causes excessive anticipation/idling in the IO scheduler
      because sync and async writes get interleaved, causing a big performance
      regression for the below test case (which is meant to simulate sqlite
      like behaviour).
      
      ---- test case ----
      
      int main(int argc, char **argv)
      {
      
      	int fdes, i;
      	FILE *fp;
      	struct timeval start;
      	struct timeval end;
      	struct timeval res;
      
      	gettimeofday(&start, NULL);
      	for (i=0; i<ROWS; i++) {
      		fp = fopen("test_file", "a");
      		fprintf(fp, "Some Text Data\n");
      		fdes = fileno(fp);
      		fsync(fdes);
      		fclose(fp);
      	}
      	gettimeofday(&end, NULL);
      
      	timersub(&end, &start, &res);
      	fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
      			(res.tv_sec*1000000 + res.tv_usec)/1000);
      
      	return 0;
      }
      
      -------------------
      
      Thanks to Sean.White@APCC.com for tracking down this performance
      regression and providing a test case.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      78f707bf
  4. 07 2月, 2009 1 次提交
    • D
      vfs: Don't call attach_nobh_buffers() with an empty list · d4cf109f
      Dave Kleikamp 提交于
      This is a modification of a patch by Bill Pemberton <wfp5p@virginia.edu>
      
      nobh_write_end() could call attach_nobh_buffers() with head == NULL.
      This would result in a trap when attach_nobh_buffers() attempted to
      access bh->b_this_page.
      
      This can be illustrated by running the writev01 testcase from LTP on jfs.
      
      This error was introduced by commit 5b41e74a "vfs: fix data leak in
      nobh_write_end()".  That patch did not take into account that if
      PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
      buffers will be allocated for the page.  In that case, we won't have to
      worry about a failed write leaving unitialized data in the page.
      
      Of course, head != NULL implies !page_has_buffers(page), so no need to
      test both.
      Signed-off-by: NDave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Bill Pemberton <wfp5p@virginia.edu>
      Cc: Dmitri Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4cf109f
  5. 14 1月, 2009 1 次提交
  6. 10 1月, 2009 2 次提交
    • T
      filesystem freeze: implement generic freeze feature · fcccf502
      Takashi Sato 提交于
      The ioctls for the generic freeze feature are below.
      o Freeze the filesystem
        int ioctl(int fd, int FIFREEZE, arg)
          fd: The file descriptor of the mountpoint
          FIFREEZE: request code for the freeze
          arg: Ignored
          Return value: 0 if the operation succeeds. Otherwise, -1
      
      o Unfreeze the filesystem
        int ioctl(int fd, int FITHAW, arg)
          fd: The file descriptor of the mountpoint
          FITHAW: request code for unfreeze
          arg: Ignored
          Return value: 0 if the operation succeeds. Otherwise, -1
          Error number: If the filesystem has already been unfrozen,
                        errno is set to EINVAL.
      
      [akpm@linux-foundation.org: fix CONFIG_BLOCK=n]
      Signed-off-by: NTakashi Sato <t-sato@yk.jp.nec.com>
      Signed-off-by: NMasayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
      Cc: <xfs-masters@oss.sgi.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcccf502
    • T
      filesystem freeze: add error handling of write_super_lockfs/unlockfs · c4be0c1d
      Takashi Sato 提交于
      Currently, ext3 in mainline Linux doesn't have the freeze feature which
      suspends write requests.  So, we cannot take a backup which keeps the
      filesystem's consistency with the storage device's features (snapshot and
      replication) while it is mounted.
      
      In many case, a commercial filesystem (e.g.  VxFS) has the freeze feature
      and it would be used to get the consistent backup.
      
      If Linux's standard filesystem ext3 has the freeze feature, we can do it
      without a commercial filesystem.
      
      So I have implemented the ioctls of the freeze feature.
      I think we can take the consistent backup with the following steps.
      1. Freeze the filesystem with the freeze ioctl.
      2. Separate the replication volume or create the snapshot
         with the storage device's feature.
      3. Unfreeze the filesystem with the unfreeze ioctl.
      4. Take the backup from the separated replication volume
         or the snapshot.
      
      This patch:
      
      VFS:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that they can return an error.
      Rename write_super_lockfs and unlockfs of the super block operation
      freeze_fs and unfreeze_fs to avoid a confusion.
      
      ext3, ext4, xfs, gfs2, jfs:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that write_super_lockfs returns an error if needed,
      and unlockfs always returns 0.
      
      reiserfs:
      Changed the type of write_super_lockfs and unlockfs from "void"
      to "int" so that they always return 0 (success) to keep a current behavior.
      Signed-off-by: NTakashi Sato <t-sato@yk.jp.nec.com>
      Signed-off-by: NMasayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
      Cc: <xfs-masters@oss.sgi.com>
      Cc: <linux-ext4@vger.kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4be0c1d
  7. 07 1月, 2009 1 次提交
  8. 05 1月, 2009 1 次提交
    • N
      fs: symlink write_begin allocation context fix · 54566b2c
      Nick Piggin 提交于
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in ints write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
      random example).
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54566b2c
  9. 29 12月, 2008 1 次提交
    • K
      block: Supress Buffer I/O errors when SCSI REQ_QUIET flag set · 08bafc03
      Keith Mannthey 提交于
      Allow the scsi request REQ_QUIET flag to be propagated to the buffer
      file system layer. The basic ideas is to pass the flag from the scsi
      request to the bio (block IO) and then to the buffer layer.  The buffer
      layer can then suppress needless printks.
      
      This patch declutters the kernel log by removed the 40-50 (per lun)
      buffer io error messages seen during a boot in my multipath setup . It
      is a good chance any real errors will be missed in the "noise" it the
      logs without this patch.
      
      During boot I see blocks of messages like
      "
      __ratelimit: 211 callbacks suppressed
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242847
      Buffer I/O error on device sdm, logical block 1
      Buffer I/O error on device sdm, logical block 5242878
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242879
      Buffer I/O error on device sdm, logical block 5242872
      "
      in my logs.
      
      My disk environment is multipath fiber channel using the SCSI_DH_RDAC
      code and multipathd.  This topology includes an "active" and "ghost"
      path for each lun. IO's to the "ghost" path will never complete and the
      SCSI layer, via the scsi device handler rdac code, quick returns the IOs
      to theses paths and sets the REQ_QUIET scsi flag to suppress the scsi
      layer messages.
      
       I am wanting to extend the QUIET behavior to include the buffer file
      system layer to deal with these errors as well. I have been running this
      patch for a while now on several boxes without issue.  A few runs of
      bonnie++ show no noticeable difference in performance in my setup.
      
      Thanks for John Stultz for the quiet_error finalization.
      Submitted-by: NKeith Mannthey <kmannth@us.ibm.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      08bafc03
  10. 28 11月, 2008 1 次提交
    • J
      udf: Fix BUG_ON() in destroy_inode() · 52b19ac9
      Jan Kara 提交于
      udf_clear_inode() can leave behind buffers on mapping's i_private list (when
      we truncated preallocation). Call invalidate_inode_buffers() so that the list
      is properly cleaned-up before we return from udf_clear_inode(). This is ugly
      and suggest that we should cleanup preallocation earlier than in clear_inode()
      but currently there's no such call available since drop_inode() is called under
      inode lock and thus is unusable for disk operations.
      Signed-off-by: NJan Kara <jack@suse.cz>
      52b19ac9
  11. 20 10月, 2008 1 次提交
  12. 27 8月, 2008 1 次提交
  13. 05 8月, 2008 1 次提交
  14. 31 7月, 2008 1 次提交
  15. 29 7月, 2008 1 次提交
    • H
      vfs: pagecache usage optimization for pagesize!=blocksize · 8ab22b9a
      Hisashi Hifumi 提交于
      When we read some part of a file through pagecache, if there is a
      pagecache of corresponding index but this page is not uptodate, read IO
      is issued and this page will be uptodate.
      
      I think this is good for pagesize == blocksize environment but there is
      room for improvement on pagesize != blocksize environment.  Because in
      this case a page can have multiple buffers and even if a page is not
      uptodate, some buffers can be uptodate.
      
      So I suggest that when all buffers which correspond to a part of a file
      that we want to read are uptodate, use this pagecache and copy data from
      this pagecache to user buffer even if a page is not uptodate.  This can
      reduce read IO and improve system throughput.
      
      I wrote a benchmark program and got result number with this program.
      
      This benchmark do:
      
        1: mount and open a test file.
      
        2: create a 512MB file.
      
        3: close a file and umount.
      
        4: mount and again open a test file.
      
        5: pwrite randomly 300000 times on a test file.  offset is aligned
           by IO size(1024bytes).
      
        6: measure time of preading randomly 100000 times on a test file.
      
      The result was:
      	2.6.26
              330 sec
      
      	2.6.26-patched
              226 sec
      
      Arch:i386
      Filesystem:ext3
      Blocksize:1024 bytes
      Memory: 1GB
      
      On ext3/4, a file is written through buffer/block.  So random read/write
      mixed workloads or random read after random write workloads are optimized
      with this patch under pagesize != blocksize environment.  This test result
      showed this.
      
      The benchmark program is as follows:
      
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <time.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mount.h>
      
      #define LEN 1024
      #define LOOP 1024*512 /* 512MB */
      
      main(void)
      {
      	unsigned long i, offset, filesize;
      	int fd;
      	char buf[LEN];
      	time_t t1, t2;
      
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	memset(buf, 0, LEN);
      	fd = open("/root/test1/testfile", O_CREAT|O_RDWR|O_TRUNC);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      	for (i = 0; i < LOOP; i++)
      		write(fd, buf, LEN);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      	if (mount("/dev/sda1", "/root/test1/", "ext3", 0, 0) < 0) {
      		perror("cannot mount\n");
      		exit(1);
      	}
      	fd = open("/root/test1/testfile", O_RDWR);
      	if (fd < 0) {
      		perror("cannot open file\n");
      		exit(1);
      	}
      
      	filesize = LEN * LOOP;
      	for (i = 0; i < 300000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pwrite(fd, buf, LEN, offset);
      	}
      	printf("start test\n");
      	time(&t1);
      	for (i = 0; i < 100000; i++){
      		offset = (random() % filesize) & (~(LEN - 1));
      		pread(fd, buf, LEN, offset);
      	}
      	time(&t2);
      	printf("%ld sec\n", t2-t1);
      	close(fd);
      	if (umount("/root/test1/") < 0) {
      		perror("cannot umount\n");
      		exit(1);
      	}
      }
      Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ab22b9a
  16. 27 7月, 2008 3 次提交
  17. 12 7月, 2008 2 次提交
  18. 01 7月, 2008 1 次提交
    • J
      Properly notify block layer of sync writes · 18ce3751
      Jens Axboe 提交于
      fsync_buffers_list() and sync_dirty_buffer() both issue async writes and
      then immediately wait on them. Conceptually, that makes them sync writes
      and we should treat them as such so that the IO schedulers can handle
      them appropriately.
      
      This patch fixes a write starvation issue that Lin Ming reported, where
      xx is stuck for more than 2 minutes because of a large number of
      synchronous IO in the system:
      
      INFO: task kjournald:20558 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
      kjournald     D ffff810010820978  6712 20558      2
      ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
      ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
      0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
      Call Trace:
      [<ffffffff803ba6f2>] kobject_get+0x12/0x17
      [<ffffffff80247537>] getnstimeofday+0x2f/0x83
      [<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
      [<ffffffff8066d195>] io_schedule+0x5d/0x9f
      [<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
      [<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
      [<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
      [<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
      [<ffffffff80243909>] wake_bit_function+0x0/0x23
      [<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
      [<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
      [<ffffffff8023a676>] lock_timer_base+0x26/0x4b
      [<ffffffff8030300a>] kjournald+0xc1/0x1fb
      [<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
      [<ffffffff80302f49>] kjournald+0x0/0x1fb
      [<ffffffff802437bb>] kthread+0x47/0x74
      [<ffffffff8022de51>] schedule_tail+0x28/0x5d
      [<ffffffff8020cac8>] child_rip+0xa/0x12
      [<ffffffff80243774>] kthread+0x0/0x74
      [<ffffffff8020cabe>] child_rip+0x0/0x12
      
      Lin Ming confirms that this patch fixes the issue. I've run tests with
      it for the past week and no ill effects have been observed, so I'm
      proposing it for inclusion into 2.6.26.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      18ce3751
  19. 26 6月, 2008 1 次提交
  20. 30 4月, 2008 1 次提交
  21. 29 4月, 2008 2 次提交
  22. 28 4月, 2008 7 次提交
    • O
      Add balance_dirty_pages_ratelimited() to cont_expand_zero() · 061e9746
      OGAWA Hirofumi 提交于
      On the systems, ftruncate() which expand size for FAT became the cause
      of OOM.  The cont_expand_zero() filled all memory with dirty pages,
      and since disk is very slow, limit of page scanning was exceeded, then
      it triggered OOM.
      
      This adds balance_dirty_pages_ratelimited() to avoid filling memory
      with dirty pages.
      Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      061e9746
    • M
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman 提交于
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • M
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman 提交于
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a substraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • M
      mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman 提交于
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro is introduced called for_each_zone_zonelist() that interates
      through each zone allowed by the GFP flags in the selected zonelist.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
    • M
      mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman 提交于
      Introduce a node_zonelist() helper function.  It is used to lookup the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e88460d
    • M
      mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman 提交于
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The forth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dac1d27b
    • C
      Remove set_migrateflags() · 488514d1
      Christoph Lameter 提交于
      Migrate flags must be set on slab creation as agreed upon when the antifrag
      logic was reviewed.  Otherwise some slabs of a slabcache will end up in the
      unmovable and others in the reclaimable section depending on which flag was
      active when a new slab page was allocated.
      
      This likely slid in somehow when antifrag was merged. Remove it.
      
      The buffer_heads are always allocated with __GFP_RECLAIMABLE because the
      SLAB_RECLAIM_ACCOUNT option is set.  The set_migrateflags() never had any
      effect there.
      
      Radix tree allocations are not directly reclaimable but they are allocated
      with __GFP_RECLAIMABLE set on each allocation.  We now set
      SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix
      tree slabs are consistently placed in the reclaimable section.  Radix tree
      slabs will also be accounted as such.
      
      There is then no user left of set_migratepages. So remove it.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      488514d1
  23. 05 4月, 2008 1 次提交
    • L
      Be more careful about marking buffers dirty · 1be62dc1
      Linus Torvalds 提交于
      Mikulas Patocka noted that the optimization where we check if a buffer
      was already dirty (and we avoid re-dirtying it) was not really SMP-safe.
      
      Since the read of the old status was not synchronized with anything, an
      aggressive CPU re-ordering of memory accesses might have moved that read
      up to before the data was even written to the buffer, and another CPU
      that cleaned it again, causing the newly dirty state to never actually
      hit the disk.
      
      Admittedly this would probably never trigger in practice, but it's still
      wrong.
      
      Mikulas sent a patch that fixed the problem, but I dislike the subtlety
      of the whole optimization, so this is an alternate fix that is more
      explicit about the particular SMP ordering for the optimization, and
      separates out the speculative reads of the buffer state into its own
      conditional (and makes the memory barrier only happen if we are likely
      to actually hit the optimized case in the first place).
      
      I considered removing the optimization entirely, but Andrew argued for
      it's continued existence. I'm a push-over.
      
      Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1be62dc1
  24. 29 3月, 2008 1 次提交
    • D
      vfs: fix data leak in nobh_write_end() · 5b41e74a
      Dmitri Monakhov 提交于
      Current nobh_write_end() implementation ignore partial writes(copied < len)
      case if page was fully mapped and simply mark page as Uptodate, which is
      totally wrong because area [pos+copied, pos+len) wasn't updated explicitly in
      previous write_begin call.  It simply contains garbage from pagecache and
      result in data leakage.
      
      #TEST_CASE_BEGIN:
      ~~~~~~~~~~~~~~~~
      In fact issue triggered by classical testcase
      	open("/mnt/test", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
      	ftruncate(3, 409600)                    = 0
      	writev(3, [{"a", 1}, {NULL, 4095}], 2)  = 1
      ##TESTCASE_SOURCE:
      ~~~~~~~~~~~~~~~~~
      #include <stdio.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <sys/uio.h>
      #include <sys/mman.h>
      #include <errno.h>
      int main(int argc, char **argv)
      {
      	int fd,  ret;
      	void* p;
      	struct iovec iov[2];
      	fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0666);
      	ftruncate(fd, 409600);
      	iov[0].iov_base="a";
      	iov[0].iov_len=1;
      	iov[1].iov_base=NULL;
      	iov[1].iov_len=4096;
      	ret = writev(fd, iov, sizeof(iov)/sizeof(struct iovec));
      	printf("writev  = %d, err = %d\n", ret, errno);
      	return 0;
      }
      ##TESTCASE RESULT:
      ~~~~~~~~~~~~~~~~~~
      [root@ts63 ~]# mount | grep mnt2
      /dev/mapper/test on /mnt2 type ext2 (rw,nobh)
      [root@ts63 ~]#  /tmp/writev /mnt2/test
      writev  = 1, err = 0
      [root@ts63 ~]# hexdump -C /mnt2/test
      
      00000000  61 65 62 6f 6f 74 00 00  f0 b9 b4 59 3a 00 00 00  |aeboot.....Y:...|
      00000010  20 00 00 00 00 00 00 00  21 00 00 00 00 00 00 00  | .......!.......|
      00000020  df df df df df df df df  df df df df df df df df  |................|
      00000030  3a 00 00 00 2a 00 00 00  21 00 00 00 00 00 00 00  |:...*...!.......|
      00000040  60 c0 8c 00 00 00 00 00  40 4a 8d 00 00 00 00 00  |`.......@J......|
      00000050  00 00 00 00 00 00 00 00  41 00 00 00 00 00 00 00  |........A.......|
      00000060  74 69 6d 65 20 64 64 20  69 66 3d 2f 64 65 76 2f  |time dd if=/dev/|
      00000070  6c 6f 6f 70 30 20 20 6f  66 3d 2f 64 65 76 2f 6e  |loop0  of=/dev/n|
      skip..
      00000f50  00 00 00 00 00 00 00 00  31 00 00 00 00 00 00 00  |........1.......|
      00000f60  6d 6b 66 73 2e 65 78 74  33 20 2f 64 65 76 2f 76  |mkfs.ext3 /dev/v|
      00000f70  7a 76 67 2f 74 65 73 74  20 2d 62 34 30 39 36 00  |zvg/test -b4096.|
      00000f80  a0 fe 8c 00 00 00 00 00  21 00 00 00 00 00 00 00  |........!.......|
      00000f90  23 31 32 30 35 39 35 30  34 30 34 00 3a 00 00 00  |#1205950404.:...|
      00000fa0  20 00 8d 00 00 00 00 00  21 00 00 00 00 00 00 00  | .......!.......|
      00000fb0  d0 cf 8c 00 00 00 00 00  10 d0 8c 00 00 00 00 00  |................|
      00000fc0  00 00 00 00 00 00 00 00  41 00 00 00 00 00 00 00  |........A.......|
      00000fd0  6d 6f 75 6e 74 20 2f 64  65 76 2f 76 7a 76 67 2f  |mount /dev/vzvg/|
      00000fe0  74 65 73 74 20 20 2f 76  7a 20 2d 6f 20 64 61 74  |test  /vz -o dat|
      00000ff0  61 3d 77 72 69 74 65 62  61 63 6b 00 00 00 00 00  |a=writeback.....|
      00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
      
      As you can see file's page contains garbage from pagecache instead of zeros.
      #TEST_CASE_END
      
      Attached patch:
      - Add sanity check BUG_ON in order to prevent incorrect usage by caller,
        This is function invariant because page can has buffers and in no zero
        *fadata pointer at the same time.
      - Always attach buffers to page is it is partial write case.
      - Always switch back to generic_write_end if page has buffers.
        This is reasonable because if page already has buffer then generic_write_begin
        was called previously.
      Signed-off-by: NDmitri Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NNick Piggin <npiggin@suse.de>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b41e74a
  25. 20 3月, 2008 1 次提交
    • R
      fs: fix kernel-doc notation warnings · a6b91919
      Randy Dunlap 提交于
      Fix kernel-doc notation warnings in fs/.
      
      Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line:
       *	mark_files_ro
      Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
       *	lease_get_mtime
      Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line:
       *	lease_get_mtime
      Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line:
       * lookup_one_len:  filesystem helper to lookup single pathname component
      Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line:
       * bh_uptodate_or_lock: Test whether the buffer is uptodate
      Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line:
       * bh_submit_read: Submit a locked buffer for reading
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line:
       * writeback_acquire: attempt to get exclusive writeback access to a device
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line:
       * writeback_in_progress: determine whether there is writeback in progress
      Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line:
       * writeback_release: relinquish exclusive writeback access against a device.
      Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections
      Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections
      Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line:
       * void journal_invalidatepage()
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6b91919
  26. 05 3月, 2008 1 次提交
  27. 04 3月, 2008 1 次提交
  28. 09 2月, 2008 2 次提交
    • J
      buffer_head: fix private_list handling · 535ee2fb
      Jan Kara 提交于
      There are two possible races in handling of private_list in buffer cache.
      
      1) When fsync_buffers_list() processes a private_list, it clears
         b_assoc_mapping and moves buffer to its private list.  Now
         drop_buffers() comes, sees a buffer is on list so it calls
         __remove_assoc_queue() which complains about b_assoc_mapping being
         cleared (as it cannot propagate possible IO error).  This race has been
         actually observed in the wild.
      
      2) When fsync_buffers_list() processes a private_list,
         mark_buffer_dirty_inode() can be called on bh which is already on the
         private list of fsync_buffers_list().  As buffer is on some list (note
         that the check is performed without private_lock), it is not readded to
         the mapping's private_list and after fsync_buffers_list() finishes, we
         have a dirty buffer which should be on private_list but it isn't.  This
         race has not been reported, probably because most (but not all) callers
         of mark_buffer_dirty_inode() hold i_mutex and thus are serialized with
         fsync().
      
      Fix these issues by not clearing b_assoc_map when fsync_buffers_list()
      moves buffer to a dedicated list and by reinserting buffer in private_list
      when it is found dirty after we have submitted buffer for IO.  We also
      change the tests whether a buffer is on a private list from
      !list_empty(&bh->b_assoc_buffers) to bh->b_assoc_map so that they are
      single word reads and hence lockless checks are safe.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      535ee2fb
    • H
      fs: remove fastcall, it is always empty · fc9b52cd
      Harvey Harrison 提交于
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc9b52cd