1. 17 10月, 2007 40 次提交
    • J
      introduce I_SYNC · 1c0eeaf5
      Joern Engel 提交于
      I_LOCK was used for several unrelated purposes, which caused deadlock
      situations in certain filesystems as a side effect.  One of the purposes
      now uses the new I_SYNC bit.
      
      Also document the various bits and change their order from historical to
      logical.
      
      [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
      Signed-off-by: NJoern Engel <joern@wohnheim.fh-wedel.de>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: David Chinner <dgc@sgi.com>
      Cc: Anton Altaparmakov <aia21@cam.ac.uk>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c0eeaf5
    • F
      writeback: introduce writeback_control.more_io to indicate more io · 2e6883bd
      Fengguang Wu 提交于
      After making dirty a 100M file, the normal behavior is to start the writeback
      for all data after 30s delays.  But sometimes the following happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analyze shows that the internal io dispatch queues goes like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      
      1) initial state with a 100M file and a 1K file
      2) 4M written, nr_to_write <= 0, so write more
      3) 1K written, nr_to_write > 0, no more writes(BUG)
      
      nr_to_write > 0 in (3) fools the upper layer to think that data have all been
      written out.  The big dirty file is actually still sitting in s_more_io.  We
      cannot simply splice s_more_io back to s_io as soon as s_io becomes empty, and
      let the loop in generic_sync_sb_inodes() continue: this may starve newly
      expired inodes in s_dirty.  It is also not an option to draw inodes from both
      s_more_io and s_dirty, an let the loop go on: this might lead to live locks,
      and might also starve other superblocks in sync time(well kupdate may still
      starve some superblocks, that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0 does
      not necessarily mean that "all data are written".  This patch introduces a
      flag writeback_control.more_io to indicate this situation.  With it the big
      dirty file no longer has to wait for the next kupdate invocation 5s later.
      
      Cc: David Chinner <dgc@sgi.com>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e6883bd
    • F
      writeback: remove pages_skipped accounting in __block_write_full_page() · 1f7decf6
      Fengguang Wu 提交于
      Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
      
      > The following strange behavior can be observed:
      >
      > 1. large file is written
      > 2. after 30 seconds, nr_dirty goes down by 1024
      > 3. then for some time (< 30 sec) nothing happens (disk idle)
      > 4. then nr_dirty again goes down by 1024
      > 5. repeat from 3. until whole file is written
      >
      > So basically a 4Mbyte chunk of the file is written every 30 seconds.
      > I'm quite sure this is not the intended behavior.
      
      It can be produced by the following test scheme:
      
      # cat bin/test-writeback.sh
      grep nr_dirty /proc/vmstat
      echo 1 > /proc/sys/fs/inode_debug
      dd if=/dev/zero of=/var/x bs=1K count=204800&
      while true; do grep nr_dirty /proc/vmstat; sleep 1; done
      
      # bin/test-writeback.sh
      nr_dirty 19207
      nr_dirty 19207
      nr_dirty 30924
      204800+0 records in
      204800+0 records out
      209715200 bytes (210 MB) copied, 1.58363 seconds, 132 MB/s
      nr_dirty 47150
      nr_dirty 47141
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47205
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47214
      nr_dirty 47215
      nr_dirty 47216
      nr_dirty 47216
      nr_dirty 47216
      nr_dirty 47154
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47143
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47142
      nr_dirty 47134
      nr_dirty 47134
      nr_dirty 47135
      nr_dirty 47135
      nr_dirty 47135
      nr_dirty 46097 <== -1038
      nr_dirty 46098
      nr_dirty 46098
      nr_dirty 46098
      [...]
      nr_dirty 46091
      nr_dirty 46092
      nr_dirty 46092
      nr_dirty 45069 <== -1023
      nr_dirty 45056
      nr_dirty 45056
      nr_dirty 45056
      [...]
      nr_dirty 37822
      nr_dirty 36799 <== -1023
      [...]
      nr_dirty 36781
      nr_dirty 35758 <== -1023
      [...]
      nr_dirty 34708
      nr_dirty 33672 <== -1024
      [...]
      nr_dirty 33692
      nr_dirty 32669 <== -1023
      
      % ls -li /var/x
      847824 -rw-r--r-- 1 root root 200M 2007-08-12 04:12 /var/x
      
      % dmesg|grep 847824  # generated by a debug printk
      [  529.263184] redirtied inode 847824 line 548
      [  564.250872] redirtied inode 847824 line 548
      [  594.272797] redirtied inode 847824 line 548
      [  629.231330] redirtied inode 847824 line 548
      [  659.224674] redirtied inode 847824 line 548
      [  689.219890] redirtied inode 847824 line 548
      [  724.226655] redirtied inode 847824 line 548
      [  759.198568] redirtied inode 847824 line 548
      
      # line 548 in fs/fs-writeback.c:
      543                 if (wbc->pages_skipped != pages_skipped) {
      544                         /*
      545                          * writeback is not making progress due to locked
      546                          * buffers.  Skip this inode for now.
      547                          */
      548                         redirty_tail(inode);
      549                 }
      
      More debug efforts show that __block_write_full_page()
      never has the chance to call submit_bh() for that big dirty file:
      the buffer head is *clean*. So basicly no page io is issued by
      __block_write_full_page(), hence pages_skipped goes up.
      
      Also the comment in generic_sync_sb_inodes():
      
      544                         /*
      545                          * writeback is not making progress due to locked
      546                          * buffers.  Skip this inode for now.
      547                          */
      
      and the comment in __block_write_full_page():
      
      1713                 /*
      1714                  * The page was marked dirty, but the buffers were
      1715                  * clean.  Someone wrote them back by hand with
      1716                  * ll_rw_block/submit_bh.  A rare case.
      1717                  */
      
      do not quite agree with each other. The page writeback should be skipped for
      'locked buffer', but here it is 'clean buffer'!
      
      This patch fixes this bug. Though I'm not sure why __block_write_full_page()
      is called only to do nothing and who actually issued the writeback for us.
      
      This is the two possible new behaviors after the patch:
      
      1) pretty nice: wait 30s and write ALL:)
      2) not so good:
      	- during the dd: ~16M
      	- after 30s:      ~4M
      	- after 5s:       ~4M
      	- after 5s:     ~176M
      
      The next patch will fix case (2).
      
      Cc: David Chinner <dgc@sgi.com>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f7decf6
    • F
      writeback: fix ntfs with sb_has_dirty_inodes() · 08d8e974
      Fengguang Wu 提交于
      NTFS's if-condition on dirty inodes is not complete.  Fix it with
      sb_has_dirty_inodes().
      
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Ken Chen <kenchen@google.com>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08d8e974
    • F
      writeback: fix time ordering of the per superblock inode lists 8 · 2c136579
      Fengguang Wu 提交于
      Streamline the management of dirty inode lists and fix time ordering bugs.
      
      The writeback logic used to move not-yet-expired dirty inodes from s_dirty to
      s_io, *only to* move them back.  The move-inodes-back-and-forth thing is a
      mess, which is eliminated by this patch.
      
      The new scheme is:
      - s_dirty acts as a time ordered io delaying queue;
      - s_io/s_more_io together acts as an io dispatching queue.
      
      On kupdate writeback, we pull some inodes from s_dirty to s_io at the start of
      every full scan of s_io.  Otherwise  (i.e. for sync/throttle/background
      writeback), we always pull from s_dirty on each run (a partial scan).
      
      Note that the line
      	list_splice_init(&sb->s_more_io, &sb->s_io);
      is moved to queue_io() to leave s_io empty. Otherwise a big dirtied file will
      sit in s_io for a long time, preventing new expired inodes to get in.
      
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c136579
    • K
      writeback: fix periodic superblock dirty inode flushing · 0e0f4fc2
      Ken Chen 提交于
      Current -mm tree has bucketful of bug fixes in periodic writeback path.
      However, we still hit a glitch where dirty pages on a given inode aren't
      completely flushed to the disk, and system will accumulate large amount of
      dirty pages beyond what dirty_expire_interval is designed for.
      
      The problem is __sync_single_inode() will move an inode to sb->s_dirty list
      even when there are more pending dirty pages on that inode.  If there is
      another inode with a small number of dirty pages, we hit a case where the loop
      iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
      Thus leaving the inode that has large amount of dirty pages behind and it has
      to wait for another dirty_writeback_interval before we flush it again.  We
      effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
      If the rate of dirtying is sufficiently high, the system will start
      accumulate a large number of dirty pages.
      
      So fix it by having another sb->s_more_io list on which to park the inode
      while we iterate through sb->s_io and to allow each dirty inode which resides
      on that sb to have an equal chance of flushing some amount of dirty pages.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e0f4fc2
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 7 · 670e4def
      Andrew Morton 提交于
      This one fixes four bugs.
      
      There are a few situation in there where writeback decides it is going to skip
      over a blockdev inode on the kernel-internal blockdev superblock.  It
      presently does this by moving the blockdev inode onto the tail of the blockdev
      superblock's s_dirty.  But
      
      a) this screws up s_dirty's reverse-time-orderedness and
      
      b) refiling the blockdev for writeback in another 30 second is rude.  We
         should try again sooner than that.
      
      Fix all this up by using redirty_head(): move the blockdev inode onto the head
      of the blockdev superblock's s_dirty list for prompt writeback.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      670e4def
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 6 · 65cb9b47
      Andrew Morton 提交于
      Recycling the previous changelog:
      
        When the writeback function is operating in writeback-for-flushing mode
        (as opposed to writeback-for-integrity) and it encounters an I_LOCKed inode,
        it will skip writing that inode.  This is done for throughput and latency:
        move on to another inode rather than blocking for this one.
      
        Writeback skips this inode by moving it off s_io and onto s_dirty, so that
        writeback can proceed with the other inodes on s_io.
      
        However that inode movement can corrupt s_dirty's
        reverse-time-orderedness.  Fix that by using the new redirty_tail(), which
        will update the refiled inode's dirtied_when field.
      
        Note: the behaviour in here is a bit rude: if kupdate happens to come
        across a locked inode then it will defer writeback of that inode for another
        30 seconds.  We'll address that in the next patch.
      
      Address that here.  What we do is to move the skipped inode to the _head_ of
      s_dirty, immediately eligible for writeout again.  Instead of deferring that
      writeout for another 30 seconds.
      
      One would think that this might cause a livelock: we keep on trying to write
      the same locked inode.  But it won't because:
      
      a) if that was the case, it would _already_ be happening on the
         balance_dirty_pages codepath.  Because balance_dirty_pages() doesn't care
         about inode timestamps.
      
      b) if we skipped this inode then we won't have done any writeback.  The
         higher-level writeback paths will see that wbc.nr_to_write didn't change
         and they'll then back off and take a nap.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65cb9b47
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 5 · c6945e77
      Andrew Morton 提交于
      When the writeback function is operating in writeback-for-flushing mode (as
      opposed to writeback-for-integrity) and it encounters an I_LOCKed inode, it
      will skip writing that inode.  This is done for throughput and latency: move
      on to another inode rather than blocking for this one.
      
      Writeback skips this inode by moving it off s_io and onto s_dirty, so that
      writeback can proceed with the other inodes on s_io.
      
      However that inode movement can corrupt s_dirty's reverse-time-orderedness.
      Fix that by using the new redirty_tail(), which will update the refiled
      inode's dirtied_when field.
      
      Note: the behaviour in here is a bit rude: if kupdate happens to come across a
      locked inode then it will defer writeback of that inode for another 30
      seconds.  We'll address that in the next patch.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6945e77
    • A
      writeback: fix comment, use helper function · 1b43ef91
      Andrew Morton 提交于
      There's a comment in there which claims that the inode is left on s_io
      if nfs chickened out of writing some data.
      
      But that's not been true for three years.
      9290280ced13c85689adeffa587e9a53bd3a5873 fixed a livelock by moving these
      inodes back onto s_dirty.  Fix the comment.
      
      In the second leg of the `if', use redirty_tail() rather than open-coding it.
      
      Add weaselly comment indicating lack of confidence in the code and lack of the
      fortitude which would be needed to fiddle with it.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b43ef91
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 4 · c986d1e2
      Andrew Morton 提交于
      When the kupdate function has tried to write back an expired inode it will
      then check to see whether some of the inode's pages are still dirty.
      
      This can happen when the filesystem decided to not write a page for some
      reason.  But it does _not_ occur due to redirtyings: a redirtying will set
      I_DIRTY_PAGES.
      
      What we need to do here is to set I_DIRTY_PAGES to reflect reality and to then
      put the inode onto the _head_ of s_dirty for consideration on the next kupdate
      pass, in five seconds time.
      
      Problem is, the code failed to modify the inode's timestamp when pushing the
      inode onto thehead of s_dirty.
      
      The patch:
      
      If there are no other inodes on s_dirty then we leave the inode's timestamp
      alone: it is already expired.
      
      If there _are_ other inodes on s_dirty then we arrange for this inode to get
      the same timestamp as the inode which is at the head of s_dirty, thus
      preserving the s_dirty ordering.  But we only need to do this if this inode
      purports to have been dirtied before the one at head-of-list.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c986d1e2
    • A
      writeback: fix time ordering of the per superblock dirty inode lists 3 · f57b9b7b
      Andrew Morton 提交于
      While writeback is working against a dirty inode it does a check after trying
      to write some of the inode's pages:
      
      "did the lower layers skip some of the inode's dirty pages because they were
      locked (or under writeback, or whatever)"
      
      If this turns out to be true, we must move the inode back onto s_dirty and
      redirty it.  The reason for doing this is that fsync() and friends only check
      the s_dirty list, and those functions want to know about those pages which
      were locked, so they can be waited upon and, if necessary, rewritten.
      
      Problem is, that redirtying was putting the inode onto the tail of s_dirty
      without updating its timestamp.  This causes a violation of s_dirty ordering.
      
      Fix this by updating inode->dirtied_when when moving the inode onto s_dirty.
      
      But the code is still a bit buggy?  If the inode was _already_ dirty then we
      don't need to move it at all.  Oh well, hopefully it doesn't matter too much,
      as that was a redirtying, which was very recent anwyay.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f57b9b7b
    • A
      writeback: fix time ordering of the per superblock dirty inode lists: memory-backed inodes · 9852a0e7
      Andrew Morton 提交于
      For reasons which escape me, inodes which are dirty against a ram-backed
      filesystem are managed in the same way as inodes which are backed by real
      devices.
      
      Probably we could optimise things here.  But given that we skip the entire
      supeblock as son as we hit the first dirty inode, there's not a lot to be
      gained.
      
      And the code does need to handle one particular non-backed superblock: the
      kernel's fake internal superblock which holds all the blockdevs.
      
      Still.  At present when the code encounters an inode which is dirty against a
      memory-backed filesystem it will skip that inode by refiling it back onto
      s_dirty.  But it fails to update the inode's timestamp when doing so which at
      least makes the debugging code upset.
      
      Fix.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9852a0e7
    • A
      writeback: fix time-ordering of the per-superblock dirty-inode lists · 6610a0bc
      Andrew Morton 提交于
      When writeback has finished writing back an inode it looks to see if that
      inode is still dirty.  If it is, that means that a process redirtied the inode
      while its writeback was in progress.
      
      What we need to do here is to refile the redirtied inode onto the s_dirty
      list.
      
      But we're doing that wrongly: it could be that this inode was redirtied
      _before_ the last inode on s_dirty.  We're blindly appending this inode to the
      list, after an inode which might be less-recently-dirtied, thus violating the
      list's ordering.
      
      So we must either insertion-sort this inode into the correct place, or we must
      update this inode's dirtied_when field when appending it to the reverse-sorted
      s_dirty list, to preserve the reverse-time-ordering.
      
      This patch does the latter: if this inode was dirtied less recently than the
      tail inode then copy the tail inode's timestamp into this inode.
      
      This means that in rare circumstances, some inodes will be writen back later
      than they should have been.  But the time slip will be small.
      
      Cc: Mike Waychison <mikew@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6610a0bc
    • E
      reiserfs: do not repair wrong journal params · cf3d0b81
      Edward Shishkin 提交于
      When mounting a file system with wrong journal params do not try to repair
      them, suggest fsck instead.
      Signed-off-by: NEdward Shishkin <edward@namesys.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf3d0b81
    • E
      ext3: lighten up resize transaction requirements · 1ad6ecf9
      Eric Sandeen 提交于
      When resizing online, setup_new_group_blocks attempts to reserve a
      potentially very large transaction, depending on the current filesystem
      geometry.  For some journal sizes, there may not be enough room for this
      transaction, and the online resize will fail.
      
      The patch below resizes & restarts the transaction as necessary while
      setting up the new group, and should work with even the smallest journal.
      
      Tested with something like:
      
      [root@newbox ~]# dd if=/dev/zero of=fsfile bs=1024 count=32768
      [root@newbox ~]# mkfs.ext3 -b 1024 fsfile 16384
      [root@newbox ~]# mount -o loop fsfile mnt/
      [root@newbox ~]# resize2fs /dev/loop0
      resize2fs 1.40.2 (12-Jul-2007)
      Filesystem at /dev/loop0 is mounted on /root/mnt; on-line resizing required
      old desc_blocks = 1, new_desc_blocks = 1
      Performing an on-line resize of /dev/loop0 to 32768 (1k) blocks.
      resize2fs: No space left on device While trying to add group #2
      [root@newbox ~]# dmesg | tail -n 1
      JBD: resize2fs wants too many credits (258 > 256)
      [root@newbox ~]#
      
      With the below change, it works.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Acked-by: NAndreas Dilger <adilger@clusterfs.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ad6ecf9
    • U
      F_DUPFD_CLOEXEC implementation · 22d2b35b
      Ulrich Drepper 提交于
      One more small change to extend the availability of creation of file
      descriptors with FD_CLOEXEC set.  Adding a new command to fcntl() requires
      no new system call and the overall impact on code size if minimal.
      
      If this patch gets accepted we will also add this change to the next
      revision of the POSIX spec.
      
      To test the patch, use the following little program.  Adjust the value of
      F_DUPFD_CLOEXEC appropriately.
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      #ifndef F_DUPFD_CLOEXEC
      # define F_DUPFD_CLOEXEC 12
      #endif
      
      int
      main (int argc, char *argv[])
      {
        if  (argc > 1)
          {
            if (fcntl (3, F_GETFD) == 0)
      	{
      	  puts ("descriptor not closed");
      	  exit (1);
      	}
            if (errno != EBADF)
      	{
      	  puts ("error not EBADF");
      	  exit (1);
      	}
      
            exit (0);
          }
        int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0);
        if (fd == -1 && errno == EINVAL)
          {
            puts ("F_DUPFD_CLOEXEC not supported");
            return 0;
          }
        if (fd != 3)
          {
            puts ("program called with descriptors other than 0,1,2");
            return 1;
          }
      
        execl ("/proc/self/exe", "/proc/self/exe", "1", NULL);
        puts ("execl failed");
        return 1;
      }
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Signed-off-by: NUlrich Drepper <drepper@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22d2b35b
    • R
      spin_lock_unlocked cleanups · f7a75f0a
      Roel Kluin 提交于
      Replace some SPIN_LOCK_UNLOCKED with DEFINE_SPINLOCK
      Signed-off-by: NRoel Kluin <12o3l@tiscali.nl>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7a75f0a
    • F
      Break ELF_PLATFORM and stack pointer randomization dependency · d68c9d6a
      Franck Bui-Huu 提交于
      Currently arch_align_stack() is used by fs/binfmt_elf.c to randomize
      stack pointer inside a page. But this happens only if ELF_PLATFORM
      symbol is defined.
      
      ELF_PLATFORM is normally set if the architecture wants ld.so to load
      implementation specific libraries for optimization. And currently a
      lot of architectures just yield this symbol to NULL.
      
      This is the case for MIPS architecture where ELF_PLATFORM is NULL but
      arch_align_stack() has been redefined to do stack inside page
      randomization. So in this case no randomization is actually done.
      
      This patch breaks this dependency which seems to be useless and allows
      platforms such MIPS to do the randomization.
      Signed-off-by: NFranck Bui-Huu <fbuihuu@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d68c9d6a
    • D
      rename signalfd_siginfo fields · 96358de6
      Davide Libenzi 提交于
      For Michael Kerrisk request, the following patch renames signalfd_siginfo
      fields in order to keep them consistent with the siginfo_t ones.
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96358de6
    • E
      ext3: remove #ifdef CONFIG_EXT3_INDEX · 059590f4
      Eric Sandeen 提交于
      CONFIG_EXT3_INDEX is not an exposed config option in the kernel, and it is
      unconditionally defined in ext3_fs.h.  tune2fs is already able to turn off
      dir indexing, so at this point it's just cluttering up the code.  Remove
      it.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      059590f4
    • A
      fs: correct SuS compliance for open of large file without options · a9c62a18
      Alan Cox 提交于
      The early LFS work that Linux uses favours EFBIG in various places. SuSv3
      specifically uses EOVERFLOW for this as noted by Michael (Bug 7253)
      
      [EOVERFLOW]
          The named file is a regular file and the size of the file cannot be
      represented correctly in an object of type off_t. We should therefore
      transition to the proper error return code
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Cc: Theodore Tso <tytso@mit.edu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c62a18
    • D
      anon-inodes use open coded atomic_inc for the shared inode · 96fdc72d
      Davide Libenzi 提交于
      Since we know the shared inode count is always >0, we can avoid igrab()
      and use an open coded atomic_inc().
      
      This also fixes a bug noticed by Yan Zheng <yanzheng@21cn.com>: were checking
      for an IS_ERR() return from igrab(), but it actually returns NULL on error.
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Yan Zheng <yanzheng@21cn.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96fdc72d
    • J
      Don't truncate /proc/PID/environ at 4096 characters · 315e28c8
      James Pearson 提交于
      /proc/PID/environ currently truncates at 4096 characters, patch based on
      the /proc/PID/mem code.
      Signed-off-by: NJames Pearson <james-p@moving-picture.com>
      Cc: Anton Arapov <aarapov@redhat.com>
      Cc: Jan Engelhardt <jengelh@computergmbh.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      315e28c8
    • W
      fs/udf/balloc.c: mark a variable as uninitialized_var() · 3ad90ec0
      WANG Cong 提交于
      Kill a may-be-used-uninitialized warning.
      Signed-off-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ad90ec0
    • J
      menuconfig: transform Network Filesystems menu · ea0985ad
      Jan Engelhardt 提交于
      Turn Network File Systems into a menuconfig so that it can be disabled at
      once.
      
      (Note: I added a "default y". If you do not like that, speak up.)
      Signed-off-by: NJan Engelhardt <jengelh@gmx.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Eric Van Hensbergen <ericvh@hera.kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea0985ad
    • J
      menuconfig: transform NLS and DLM menus · a77b6456
      Jan Engelhardt 提交于
      Changes NLS and DLM menus into a 'menuconfig' object so that it can be
      disabled at once without having to enter the menu first to disable the config
      option.
      Signed-off-by: NJan Engelhardt <jengelh@gmx.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a77b6456
    • A
      change inotifyfs magic as the same magic is used for futexfs · fd5eea42
      Andrey Mirkin 提交于
      Right now futexfs and inotifyfs have one magic 0xBAD1DEA, that looks a
      little bit confusing.  Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as
      magic for inotifyfs.
      Signed-off-by: NAndrey Mirkin <major@openvz.org>
      Acked-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd5eea42
    • O
      increase AT_VECTOR_SIZE to terminate saved_auxv properly · 4f9a58d7
      Olaf Hering 提交于
      include/asm-powerpc/elf.h has 6 entries in ARCH_DLINFO.  fs/binfmt_elf.c
      has 14 unconditional NEW_AUX_ENT entries and 2 conditional NEW_AUX_ENT
      entries.  So in the worst case, saved_auxv does not get an AT_NULL entry at
      the end.
      
      The saved_auxv array must be terminated with an AT_NULL entry.  Make the
      size of mm_struct->saved_auxv arch dependend, based on the number of
      ARCH_DLINFO entries.
      Signed-off-by: NOlaf Hering <olh@suse.de>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f9a58d7
    • D
    • C
      UDF: coding style fixups · b1e7a4b1
      Cyrill Gorcunov 提交于
      This patch does additional coding style fixup.  Initially the code is being
      distorted by Lindent (in my patches sent not very long ago) and fixed in
      the followup patches but this stuff was accidently missed.
      
      New and old compiled files were compared with cmp to check for being
      identically.  So the patch will not break the kernel.
      Signed-off-by: NCyrill Gorcunov <gorcunov@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1e7a4b1
    • B
      fs/isofs/namei.c: Remove uninitialized local vars warning · cd215237
      Borislav Petkov 提交于
      shut up those:
      fs/isofs/namei.c: In function 'isofs_lookup':
      fs/isofs/namei.c:161: warning: 'offset' may be used uninitialized in this function
      fs/isofs/namei.c:161: warning: 'block' may be used uninitialized in this function
      
      By the way, they get overwritten at the end of isofs_find_entry().
      Signed-off-by: NBorislav Petkov <bbpetkov@yahoo.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd215237
    • L
      reiserfs: workaround for dead loop in finish_unfinished · fb46f341
      Lepton Wu 提交于
      There is possible dead loop in finish_unfinished function.
      
      In most situation, the call chain iput -> ...  -> reiserfs_delete_inode ->
      remove_save_link will success.  But for some reason such as data
      corruption, reiserfs_delete_inode fails on reiserfs_do_truncate ->
      search_for_position_by_key.
      
      Then remove_save_link won't be called.  We always get the same
      "save_link_key" in the while loop in finish_unfinished function.  The
      following patch adds a check for the possible dead loop and just remove
      save link when deap loop.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NLepton Wu <ytht.net@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: "Vladimir V. Saveliev" <vs@namesys.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb46f341
    • D
      add consts where appropriate in fs/nls/* · b9ec0339
      Denys Vlasenko 提交于
      Add const modifiers to a few struct nls_table's member pointers in
      include/linux/nls.h and adds a lot of const's in fs/nls/*.c files.
      
      Resulting changes as visible by size:
      
         text    data     bss     dec     hex filename
       113612  481216    2368  597196   91ccc nls.org/built-in.o
       593548    3296     288  597132   91c8c nls/built-in.o
      
      Apparently compiler managed to optimize code a bit better
      because of const-ness.
      
      No other changes are made.
      Signed-off-by: NDenys Vlasenko <vda.linux@googlemail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9ec0339
    • D
      shrink_dcache_sb speedup · 37c42524
      Denis V. Lunev 提交于
      This patch makes shrink_dcache_sb consistent with dentry pruning policy.
      
      On the first pass we iterate over dentry unused list and prepare some
      dentries for removal.
      
      However, since the existing code moves evicted dentries to the beginning of
      the LRU it can happen that fresh dentries from other superblocks will be
      inserted *before* our dentries.
      
      This can result in significant slowdown of shrink_dcache_sb().  Moreover,
      for virtual filesystems like unionfs which can call dput() during dentries
      kill existing code results in O(n^2) complexity.
      
      We observed 2 minutes shrink_dcache_sb() with only 35000 dentries.
      
      To avoid this effects we propose to isolate sb dentries at the end
      of LRU list.
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Signed-off-by: NKirill Korotaev <dev@openvz.org>
      Signed-off-by: NAndrey Mirkin <amirkin@openvz.org>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37c42524
    • L
      reiserfs: fix kernel panic on corrupted directory · 80eb68d2
      Lepton Wu 提交于
      When reading corrupted reiserfs directory data, d_reclen could be a
      negative number or a big positive number, this can lead to kernel panic or
      oop.  The following patch adds a sanity check.
      Signed-off-by: NLepton Wu <ytht.net@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: "Vladimir V. Saveliev" <vs@namesys.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80eb68d2
    • D
      KEYS: Make request_key() and co fundamentally asynchronous · 76181c13
      David Howells 提交于
      Make request_key() and co fundamentally asynchronous to make it easier for
      NFS to make use of them.  There are now accessor functions that do
      asynchronous constructions, a wait function to wait for construction to
      complete, and a completion function for the key type to indicate completion
      of construction.
      
      Note that the construction queue is now gone.  Instead, keys under
      construction are linked in to the appropriate keyring in advance, and that
      anyone encountering one must wait for it to be complete before they can use
      it.  This is done automatically for userspace.
      
      The following auxiliary changes are also made:
      
       (1) Key type implementation stuff is split from linux/key.h into
           linux/key-type.h.
      
       (2) AF_RXRPC provides a way to allocate null rxrpc-type keys so that AFS does
           not need to call key_instantiate_and_link() directly.
      
       (3) Adjust the debugging macros so that they're -Wformat checked even if
           they are disabled, and make it so they can be enabled simply by defining
           __KDEBUG to be consistent with other code of mine.
      
       (3) Documentation.
      
      [alan@lxorguk.ukuu.org.uk: keys: missing word in documentation]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAlan Cox <alan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76181c13
    • C
      try to reap reiserfs pages left around by invalidatepage · 398c95bd
      Chris Mason 提交于
      reiserfs_invalidatepage will refuse to free pages if they have been logged
      in data=journal mode, or were pinned down by a data=ordered operation.  For
      data=journal, this is fairly easy to trigger just with fsx-linux, and it
      results in a large number of pages hanging around on the LRUs with
      page->mapping == NULL.
      
      Calling try_to_free_buffers when reiserfs decides it is done with the page
      allows it to be freed earlier, and with much less VM thrashing.  Lock
      ordering rules mean that reiserfs can't call lock_page when it is releasing
      the buffers, so TestSetPageLocked is used instead.  Contention on these
      pages should be rare, so it should be sufficient most of the time.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Cc: "Vladimir V. Saveliev" <vs@namesys.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      398c95bd
    • J
      dcache: trivial comment fix · bc154b1e
      J. Bruce Fields 提交于
      As it stands this comment is confusing, and not quite grammatical.
      Signed-off-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc154b1e
    • J
      quota: send messages via netlink · 8e893469
      Jan Kara 提交于
      Implement sending of quota messages via netlink interface.  The advantage
      is that in userspace we can better decide what to do with the message - for
      example display a dialogue in your X session or just write the message to
      the console.  As a bonus, we can get rid of problems with console locking
      deep inside filesystem code once we remove the old printing mechanism.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e893469