1. 24 Aug, 2010 2 commits
    • D
      xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE · 5b3eed75
      Dave Chinner committed
      Under heavy parallel metadata loads (e.g. dbench), we can fail
      to mark all the inodes in a cluster being freed as XFS_ISTALE as we
      skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on.
      When this happens and the inode cluster buffer has already been
      marked stale and freed, inode reclaim can try to write the inode out
      as it is dirty and not marked stale. This can result in writing the
      metadata to a freed extent, or in the case it has already
      been overwritten trigger a magic number check failure and return an
      EUCLEAN error such as:
      
      Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117
      
      Fix this by ensuring that we hoover up all in memory inodes in the
      cluster and mark them XFS_ISTALE when freeing the cluster.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5b3eed75
    • D
      xfs: unlock items before allowing the CIL to commit · d17c701c
      Dave Chinner committed
      When we commit a transaction using delayed logging, we need to
      unlock the items in the transaction before we unlock the CIL context
      and allow it to be checkpointed. If we unlock them after we release
      the CIL context lock, the CIL can checkpoint and complete before
      we free the log items. This breaks stale buffer item unlock and
      unpin processing as there is an implicit assumption that the unlock
      will occur before the unpin.
      
      Also, some log items need to store the LSN of the transaction commit
      in the item (inodes and EFIs) and so can race with other transaction
      completions if we don't prevent the CIL from checkpointing before
      the unlock occurs.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      d17c701c
  2. 18 Aug, 2010 20 commits
    • R
      nilfs2: wait for discard to finish · 1cb0c924
      Ryusuke Konishi committed
      nilfs_discard_segment() doesn't wait for completion of discard
      requests.  This specifies BLKDEV_IFL_WAIT flag when calling
      blkdev_issue_discard() in order to fix the sync failure.
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Christoph Hellwig <hch@lst.de>
      1cb0c924
    • T
      NFS: Fix an Oops in the NFSv4 atomic open code · 0a377cff
      Trond Myklebust committed
      Adam Lackorzynski reports:
      
      with 2.6.35.2 I'm getting this reproducible Oops:
      
      [  110.825396] BUG: unable to handle kernel NULL pointer dereference at
      (null)
      [  110.828638] IP: [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4
      [  110.828638] PGD be89f067 PUD bf18f067 PMD 0
      [  110.828638] Oops: 0000 [#1] SMP
      [  110.828638] last sysfs file: /sys/class/net/lo/operstate
      [  110.828638] CPU 2
      [  110.828638] Modules linked in: rtc_cmos rtc_core rtc_lib amd64_edac_mod
      i2c_amd756 edac_core i2c_core dm_mirror dm_region_hash dm_log dm_snapshot
      sg sr_mod usb_storage ohci_hcd mptspi tg3 mptscsih mptbase usbcore nls_base
      [last unloaded: scsi_wait_scan]
      [  110.828638]
      [  110.828638] Pid: 11264, comm: setchecksum Not tainted 2.6.35.2 #1
      [  110.828638] RIP: 0010:[<ffffffff811247b7>]  [<ffffffff811247b7>]
      encode_attrs+0x1a/0x2a4
      [  110.828638] RSP: 0000:ffff88003bf5b878  EFLAGS: 00010296
      [  110.828638] RAX: ffff8800bddb48a8 RBX: ffff88003bf5bb18 RCX:
      0000000000000000
      [  110.828638] RDX: ffff8800be258800 RSI: 0000000000000000 RDI:
      ffff88003bf5b9f8
      [  110.828638] RBP: 0000000000000000 R08: ffff8800bddb48a8 R09:
      0000000000000004
      [  110.828638] R10: 0000000000000003 R11: ffff8800be779000 R12:
      ffff8800be258800
      [  110.828638] R13: ffff88003bf5b9f8 R14: ffff88003bf5bb20 R15:
      ffff8800be258800
      [  110.828638] FS:  0000000000000000(0000) GS:ffff880041e00000(0063)
      knlGS:00000000556bd6b0
      [  110.828638] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
      [  110.828638] CR2: 0000000000000000 CR3: 00000000be8ef000 CR4:
      00000000000006e0
      [  110.828638] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [  110.828638] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
      0000000000000400
      [  110.828638] Process setchecksum (pid: 11264, threadinfo
      ffff88003bf5a000, task ffff88003f232210)
      [  110.828638] Stack:
      [  110.828638]  0000000000000000 ffff8800bfbcf920 0000000000000000
      0000000000000ffe
      [  110.828638] <0> 0000000000000000 0000000000000000 0000000000000000
      0000000000000000
      [  110.828638] <0> 0000000000000000 0000000000000000 0000000000000000
      0000000000000000
      [  110.828638] Call Trace:
      [  110.828638]  [<ffffffff81124c1f>] ? nfs4_xdr_enc_setattr+0x90/0xb4
      [  110.828638]  [<ffffffff81371161>] ? call_transmit+0x1c3/0x24a
      [  110.828638]  [<ffffffff813774d9>] ? __rpc_execute+0x78/0x22a
      [  110.828638]  [<ffffffff81371a91>] ? rpc_run_task+0x21/0x2b
      [  110.828638]  [<ffffffff81371b7e>] ? rpc_call_sync+0x3d/0x5d
      [  110.828638]  [<ffffffff8111e284>] ? _nfs4_do_setattr+0x11b/0x147
      [  110.828638]  [<ffffffff81109466>] ? nfs_init_locked+0x0/0x32
      [  110.828638]  [<ffffffff810ac521>] ? ifind+0x4e/0x90
      [  110.828638]  [<ffffffff8111e2fb>] ? nfs4_do_setattr+0x4b/0x6e
      [  110.828638]  [<ffffffff8111e634>] ? nfs4_do_open+0x291/0x3a6
      [  110.828638]  [<ffffffff8111ed81>] ? nfs4_open_revalidate+0x63/0x14a
      [  110.828638]  [<ffffffff811056c4>] ? nfs_open_revalidate+0xd7/0x161
      [  110.828638]  [<ffffffff810a2de4>] ? do_lookup+0x1a4/0x201
      [  110.828638]  [<ffffffff810a4733>] ? link_path_walk+0x6a/0x9d5
      [  110.828638]  [<ffffffff810a42b6>] ? do_last+0x17b/0x58e
      [  110.828638]  [<ffffffff810a5fbe>] ? do_filp_open+0x1bd/0x56e
      [  110.828638]  [<ffffffff811cd5e0>] ? _atomic_dec_and_lock+0x30/0x48
      [  110.828638]  [<ffffffff810a9b1b>] ? dput+0x37/0x152
      [  110.828638]  [<ffffffff810ae063>] ? alloc_fd+0x69/0x10a
      [  110.828638]  [<ffffffff81099f39>] ? do_sys_open+0x56/0x100
      [  110.828638]  [<ffffffff81027a22>] ? ia32_sysret+0x0/0x5
      [  110.828638] Code: 83 f1 01 e8 f5 ca ff ff 48 83 c4 50 5b 5d 41 5c c3 41
      57 41 56 41 55 49 89 fd 41 54 49 89 d4 55 48 89 f5 53 48 81 ec 18 01 00 00
      <8b> 06 89 c2 83 e2 08 83 fa 01 19 db 83 e3 f8 83 c3 18 a8 01 8d
      [  110.828638] RIP  [<ffffffff811247b7>] encode_attrs+0x1a/0x2a4
      [  110.828638]  RSP <ffff88003bf5b878>
      [  110.828638] CR2: 0000000000000000
      [  112.840396] ---[ end trace 95282e83fd77358f ]---
      
      We need to ensure that the O_EXCL flag is turned off if the user doesn't
      set O_CREAT.
      
      Cc: stable@kernel.org
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      0a377cff
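The fix described in this commit message amounts to masking one flag when another is absent. A minimal user-space sketch of that rule follows; `sanitize_open_flags()` is a hypothetical helper name chosen for illustration and is not taken from the NFS code.

```c
#include <fcntl.h>

/*
 * Hypothetical helper (not the kernel's function): O_EXCL only has a
 * defined meaning together with O_CREAT, so drop it when the caller
 * did not ask for file creation.
 */
static int sanitize_open_flags(int flags)
{
	if (!(flags & O_CREAT))
		flags &= ~O_EXCL;
	return flags;
}
```

With this rule, `O_RDONLY | O_EXCL` is reduced to `O_RDONLY`, while `O_CREAT | O_EXCL` passes through unchanged.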
    • N
      fs: brlock vfsmount_lock · 99b7db7b
      Nick Piggin committed
      fs: brlock vfsmount_lock
      
      Use a brlock for the vfsmount lock. It must be taken for write whenever
      modifying the mount hash or associated fields, and may be taken for read when
      performing mount hash lookups.
      
      A new lock is added for the mnt-id allocator, so it doesn't need to take
      the heavy vfsmount write-lock.
      
      The number of atomics should remain the same for fastpath rlock cases, though
      code would be slightly slower due to per-cpu access. Scalability will not be
      much improved in common cases yet, due to other locks (ie. dcache_lock) getting
      in the way. However path lookups crossing mountpoints should be one case where
      scalability is improved (currently requiring the global lock).
      
      The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
      Altix system (high latency to remote nodes), a simple umount microbenchmark
      (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
      took 6.8s, afterwards took 7.1s, about 5% slower.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      99b7db7b
    • N
      fs: scale files_lock · 6416ccb7
      Nick Piggin committed
      fs: scale files_lock
      
      Improve scalability of files_lock by adding per-cpu, per-sb files lists,
      protected with an lglock. The lglock provides fast access to the per-cpu lists
      to add and remove files. It also provides a snapshot of all the per-cpu lists
      (although this is very slow).
      
      One difficulty with this approach is that a file can be removed from the list
      by another CPU. We must track which per-cpu list the file is on with a new
      variable in the file struct (packed into a hole on 64-bit archs). Scalability
      could suffer if files are frequently removed from different cpu's list.
      
      However loads with frequent removal of files imply short interval between
      adding and removing the files, and the scheduler attempts to avoid moving
      processes too far away. Also, even in the case of cross-CPU removal, the
      hardware has much more opportunity to parallelise cacheline transfers with N
      cachelines than with 1.
      
      A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
      degenerates to contending on a single lock, which is no worse than before. When
      more than one CPU are allocating files, even if they are always freed by
      different CPUs, there will be more parallelism than the single-lock case.
      
      Testing results:
      
      On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
      to remove the file, the number of times it is removed by the same CPU that
      added it, and the number of times it is removed by the same node that added it.
      
      Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
      kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
      dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)
      
      So a file is removed from the same CPU it was added by over 90% of the time.
      It remains within the same node 95% of the time.
      
      Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.
      
                      throughput
      2.6.34-rc2      24.5
      +patch          24.9
      
                      us      sys     idle    IO wait (in %)
      2.6.34-rc2      51.25   28.25   17.25   3.25
      +patch          53.75   18.5    19      8.75
      
      So significantly less CPU time spent in kernel code, higher idle time and
      slightly higher throughput.
      
      Single threaded performance difference was within the noise of microbenchmarks.
      That is not to say penalty does not exist, the code is larger and more memory
      accesses required so it will be slightly slower.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6416ccb7
    • N
      tty: fix fu_list abuse · d996b62a
      Nick Piggin committed
      tty: fix fu_list abuse
      
      tty code abuses fu_list, which causes a bug in remount,ro handling.
      
      If a tty device node is opened on a filesystem, then the last link to the inode
      removed, the filesystem will be allowed to be remounted readonly. This is
      because fs_may_remount_ro does not find the 0 link tty inode on the file sb
      list (because the tty code incorrectly removed it to use for its own purpose).
      This can result in a filesystem with errors after it is marked "clean".
      
      Taking idea from Christoph's initial patch, allocate a tty private struct
      at file->private_data and put our required list fields in there, linking
      file and tty. This makes tty nodes behave the same way as other device nodes
      and avoid meddling with the vfs, and avoids this bug.
      
      The error handling is not trivial in the tty code, so for this bugfix, I take
      the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
      This is not a problem because our allocator doesn't fail small allocs as a rule
      anyway. So proper error handling is left as an exercise for tty hackers.
      
      [ Arguably filesystem's device inode would ideally be divorced from the
      driver's pseudo inode when it is opened, but in practice it's not clear whether
      that will ever be worth implementing. ]
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d996b62a
    • N
      fs: cleanup files_lock locking · ee2ffa0d
      Nick Piggin committed
      fs: cleanup files_lock locking
      
      Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
      manipulate the per-sb files list; unexport the files_lock spinlock.
      
      Cc: linux-kernel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ee2ffa0d
    • N
      fs: remove extra lookup in __lookup_hash · b04f784e
      Nick Piggin committed
      fs: remove extra lookup in __lookup_hash
      
      Optimize lookup for create operations, where no dentry should often be
      common-case. In cases where it is not, such as unlink, the added overhead
      is much smaller than the removed.
      
      Also, move comments about __d_lookup racyness to the __d_lookup call site.
      d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
      vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
      
      - We are interested in how the RCU lookup works here, particularly with
        renames. Make that explicit, and point to the document where it is explained
        in more detail.
      - RCU is pretty standard now, and macros make implementations pretty mindless.
        If we want to know about RCU barrier details, we look in RCU code.
      - Delete some boring legacy comments because we don't care much about how the
        code used to work, more about the interesting parts of how it works now. So
        comments about lazy LRU may be interesting, but would better be done in the
        LRU or refcount management code.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b04f784e
    • N
      fs: fs_struct rwlock to spinlock · 2a4419b5
      Nick Piggin committed
      fs: fs_struct rwlock to spinlock
      
      struct fs_struct.lock is an rwlock with the read-side used to protect root and
      pwd members while taking references to them. Taking a reference to a path
      typically requires just 2 atomic ops, so the critical section is very small.
      Parallel read-side operations would have cacheline contention on the lock, the
      dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
      real parallelism increase.
      
      Replace it with a spinlock to avoid one or two atomic operations in typical
      path lookup fastpath.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2a4419b5
    • N
      fs: dentry allocation consolidation · baa03890
      Nick Piggin committed
      fs: dentry allocation consolidation
      
      There are 2 duplicate copies of code in dentry allocation in path lookup.
      Consolidate them into a single function.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      baa03890
    • N
      fs: fix do_lookup false negative · 2e2e88ea
      Nick Piggin committed
      fs: fix do_lookup false negative
      
      In do_lookup, if we initially find no dentry, we take the directory i_mutex and
      re-check the lookup. If we find a dentry there, then we revalidate it if
      needed. However if that revalidate asks for the dentry to be invalidated, we
      return -ENOENT from do_lookup. What should happen instead is an attempt to
      allocate and lookup a new dentry.
      
      This is probably not noticed because it is rare. It is only reached if a
      concurrent create races in first (in which case, the dentry probably won't be
      invalidated anyway), or if the racy __d_lookup has failed due to a
      false-negative (which is very rare).
      
      Fix this by removing code and have it use the normal reval path.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2e2e88ea
    • A
      mbcache: Limit the maximum number of cache entries · 3a48ee8a
      Andreas Gruenbacher committed
      Limit the maximum number of mb_cache entries depending on the number of
      hash buckets: if the only limit to the number of cache entries is the
      available memory the hash chains can grow very long, taking a long time
      to search.
      
      At least partially solves https://bugzilla.lustre.org/show_bug.cgi?id=22771.
      Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3a48ee8a
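The idea of tying the entry limit to the number of hash buckets can be sketched as below. This is only an illustration under stated assumptions: the struct, the function names and the factor of 16 entries per bucket are all hypothetical and are not taken from the mbcache source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed average chain length allowed per bucket (illustrative value). */
#define ENTRIES_PER_BUCKET 16

/* Hypothetical, simplified cache bookkeeping; not the kernel's mb_cache. */
struct entry_cache {
	size_t bucket_count;	/* number of hash buckets */
	size_t entry_count;	/* entries currently cached */
	size_t max_entries;	/* cap derived from bucket_count */
};

static void cache_init(struct entry_cache *c, size_t bucket_count)
{
	c->bucket_count = bucket_count;
	c->entry_count = 0;
	/* Bound the average chain length instead of relying on memory pressure. */
	c->max_entries = bucket_count * ENTRIES_PER_BUCKET;
}

/* Returns false when the cap is reached; the caller would then evict or skip. */
static bool cache_try_add(struct entry_cache *c)
{
	if (c->entry_count >= c->max_entries)
		return false;
	c->entry_count++;
	return true;
}
```

Because the cap scales with the bucket count, the average hash chain stays bounded no matter how much memory is available.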
    • A
      hostfs ->follow_link() braino · 3b6036d1
      Al Viro committed
      we want the assignment to err done inside the if () to be
      visible after it, so (re)declaring err inside if () body
      is wrong.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3b6036d1
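The bug class named in this commit message is variable shadowing: a declaration inside the `if ()` body creates a second `err`, so the assignment never reaches the outer one. A minimal sketch, with hypothetical function names that do not come from hostfs:

```c
#include <errno.h>

/* Buggy shape: the inner declaration shadows the outer err. */
static int lookup_buggy(int fail)
{
	int err = 0;

	if (fail) {
		int err = -ENOMEM;	/* a new variable; the outer err is untouched */
		(void)err;
	}
	return err;			/* always 0, the failure is lost */
}

/* Fixed shape: assign to the existing variable instead of redeclaring it. */
static int lookup_fixed(int fail)
{
	int err = 0;

	if (fail)
		err = -ENOMEM;
	return err;
}
```

Compilers can flag this pattern; GCC and Clang both accept `-Wshadow` for that purpose.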
    • A
      hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy · 850a496f
      Al Viro committed
      ... not harmless in this case - we have a string in the end of buffer
      already.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      850a496f
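Why the difference matters here: `strncpy()` pads the destination with NUL bytes up to the given size, so it overwrites anything already stored later in the buffer, whereas an `strlcpy()`-style copy writes only the string and one terminator. The sketch below illustrates that with a user-space stand-in; `copy_bounded()` is a hypothetical helper written for this example because `strlcpy()` is not available in every C library, and the buffer layout is an assumption, not the hostfs one.

```c
#include <string.h>

#define BUF_SIZE 16
#define TAIL_OFF 11	/* "tail" plus its NUL occupies bytes 11..15 */

/* strlcpy-like helper (hypothetical): copies at most size-1 bytes, no padding. */
static size_t copy_bounded(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size != 0) {
		size_t n = len < size - 1 ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Prepare a buffer that already holds a string at its end. */
static void fill(char *buf)
{
	memset(buf, 'x', BUF_SIZE);
	memcpy(buf + TAIL_OFF, "tail", 5);
}

static int tail_survives_strncpy(void)
{
	char buf[BUF_SIZE];

	fill(buf);
	strncpy(buf, "ab", BUF_SIZE);	/* zero-fills bytes 2..15 */
	return strcmp(buf + TAIL_OFF, "tail") == 0;
}

static int tail_survives_bounded(void)
{
	char buf[BUF_SIZE];

	fill(buf);
	copy_bounded(buf, "ab", BUF_SIZE);	/* touches bytes 0..2 only */
	return strcmp(buf + TAIL_OFF, "tail") == 0;
}
```

In this layout the string at the end of the buffer is wiped by `strncpy()` and preserved by the bounded copy.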
    • C
      remove SWRITE* I/O types · 9cb569d6
      Christoph Hellwig committed
      These flags aren't real I/O types, but tell ll_rw_block to always
      lock the buffer instead of giving up on a failed trylock.
      
      Instead add a new write_dirty_buffer helper that implements this semantic
      and use it from the existing SWRITE* callers.  Note that the ll_rw_block
      code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
      this patch fixes.
      
      In the ufs code clean up the helper that used to call ll_rw_block
      to mirror sync_dirty_buffer, which is the function it implements for
      compound buffers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9cb569d6
    • C
      kill BH_Ordered flag · 87e99511
      Christoph Hellwig committed
      Instead of abusing a buffer_head flag just add a variant of
      sync_dirty_buffer which allows passing the exact type of write
      flag required.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      87e99511
    • J
      vfs: update ctime when changing the file's permission by setfacl · dad5eb6d
      Jan Kara committed
      generic_acl_set didn't update the ctime of the file when its permission was
      changed.
      
      Steps to reproduce:
       # touch aaa
       # stat -c %Z aaa
       1275289822
       # setfacl -m  'u::x,g::x,o::x' aaa
       # stat -c %Z aaa
       1275289822                         <- unchanged
      
      But, according to the spec of the ctime, vfs must update it.
      
      Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
      
      CC: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      dad5eb6d
    • A
      cramfs: only unlock new inodes · b845ff8f
      Alexander Shishkin committed
      Commit 77b8a75f introduced a warning at fs/inode.c:692 unlock_new_inode(),
      caused by unlock_new_inode() being called on existing inodes as well.
      
      This patch changes setup_inode() to only call unlock_new_inode() for I_NEW
      inodes.
      Signed-off-by: Alexander Shishkin <virtuoso@slind.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b845ff8f
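The guard this commit describes can be modelled in a few lines. The sketch below is a simplified user-space model, not the cramfs code: the struct, the flag value and both function names are hypothetical stand-ins for `I_NEW`, `unlock_new_inode()` and `setup_inode()`.

```c
/* Stand-in for the kernel's I_NEW state bit (value chosen for illustration). */
#define MODEL_I_NEW (1u << 0)

/* Hypothetical, minimal inode model used only for this sketch. */
struct inode_model {
	unsigned int state;
	int unlock_calls;	/* counts how often the unlock helper ran */
};

/* Models unlock_new_inode(): only valid for inodes that are still new. */
static void model_unlock_new(struct inode_model *inode)
{
	inode->state &= ~MODEL_I_NEW;
	inode->unlock_calls++;
}

/* Models the fixed setup path: existing (cached) inodes are left alone. */
static void model_setup(struct inode_model *inode)
{
	if (inode->state & MODEL_I_NEW) {
		/* ...fill in the freshly allocated inode here... */
		model_unlock_new(inode);
	}
}
```

Calling `model_setup()` twice on the same inode runs the unlock step once, which is the behaviour the patch is after.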
    • S
      fix reiserfs_evict_inode end_writeback second call · f4ae2faa
      Sergey Senozhatsky committed
      reiserfs_evict_inode calls end_writeback two times hitting
      kernel BUG at fs/inode.c:298 because inode->i_state is I_CLEAR already.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f4ae2faa
    • D
      Make do_execve() take a const filename pointer · d7627467
      David Howells committed
      Make do_execve() take a const filename pointer so that kernel_execve() compiles
      correctly on ARM:
      
      arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type
      
      This also requires the argv and envp arguments to be consted twice, once for
      the pointer array and once for the strings the array points to.  This is
      because do_execve() passes a pointer to the filename (now const) to
      copy_strings_kernel().  A simpler alternative would be to cast the filename
      pointer in do_execve() when it's passed to copy_strings_kernel().
      
      do_execve() may not change any of the strings it is passed as part of the argv
      or envp lists as some of them are in .rodata, so marking these strings as
      const should be fine.
      
      Further kernel_execve() and sys_execve() need to be changed to match.
      
      This has been test built on x86_64, frv, arm and mips.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7627467
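"Consted twice" refers to a parameter type of the form `const char *const *`: the array of pointers is read-only and so are the strings it points to. A small sketch of a function taking such an argument follows; `count_args()` and `demo_argv` are hypothetical names used only for illustration and do not appear in the kernel.

```c
#include <stddef.h>

/*
 * argv is const in both places: the function may neither replace the
 * pointers in the array nor modify the characters of any string.
 */
static size_t count_args(const char *const *argv)
{
	size_t n = 0;

	while (argv[n] != NULL)
		n++;
	return n;
}

/* A fully const, NULL-terminated argument vector. */
static const char *const demo_argv[] = { "/sbin/init", "--verbose", NULL };
```

`count_args(demo_argv)` walks the vector without any cast, and the compiler rejects accidental writes through `argv` inside the function.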
    • T
      NFS: Fix the selection of security flavours in Kconfig · df486a25
      Trond Myklebust committed
      Randy Dunlap reports:
      
      ERROR: "svc_gss_principal" [fs/nfs/nfs.ko] undefined!
      
      
      because in fs/nfs/Kconfig, NFS_V4 selects RPCSEC_GSS_KRB5
      and/or in fs/nfsd/Kconfig, NFSD_V4 selects RPCSEC_GSS_KRB5.
      
      RPCSEC_GSS_KRB5 does 5 selects, but none of these is enforced/followed
      by the fs/nfs[d]/Kconfig configs:
      
      	select SUNRPC_GSS
      	select CRYPTO
      	select CRYPTO_MD5
      	select CRYPTO_DES
      	select CRYPTO_CBC
      Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      df486a25
  3. 16 Aug, 2010 3 commits
    • R
      nilfs2: fix false warning saying one of two super blocks is broken · ea1a16f7
      Ryusuke Konishi committed
      After applying commit b2ac86e1, the following message appeared
      after unclean shutdown:
      
      > NILFS warning: broken superblock. using spare superblock.
      
      This turns out to be a false message due to the change which updates
      two super blocks alternately.  The secondary super block now can be
      selected if it's newer than the primary one.
      
      This kills the false warning by suppressing it if another super block
      is not actually broken.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      ea1a16f7
    • R
      nilfs2: fix list corruption after ifile creation failure · af4e3631
      Ryusuke Konishi committed
      If nilfs_attach_checkpoint() gets a memory allocation failure during
      creation of ifile, it will return without removing nilfs_sb_info
      struct from ns_supers list.  When a concurrently mounted snapshot is
      unmounted or another new snapshot is mounted after that, this causes
      kernel oops as below:
      
      > BUG: unable to handle kernel NULL pointer dereference at (null)
      > IP: [<f83662ff>] nilfs_find_sbinfo+0x74/0xa4 [nilfs2]
      > *pde = 00000000
      > Oops: 0000 [#1] SMP
      <snip>
      > Call Trace:
      >  [<f835dc29>] ? nilfs_get_sb+0x165/0x532 [nilfs2]
      >  [<c1173c87>] ? ida_get_new_above+0x16d/0x187
      >  [<c109a7f8>] ? alloc_vfsmnt+0x7e/0x10a
      >  [<c1070790>] ? kstrdup+0x2c/0x40
      >  [<c1089041>] ? vfs_kern_mount+0x96/0x14e
      >  [<c108913d>] ? do_kern_mount+0x32/0xbd
      >  [<c109b331>] ? do_mount+0x642/0x6a1
      >  [<c101a415>] ? do_page_fault+0x0/0x2d1
      >  [<c1099c00>] ? copy_mount_options+0x80/0xe2
      >  [<c10705d8>] ? strndup_user+0x48/0x67
      >  [<c109b3f1>] ? sys_mount+0x61/0x90
      >  [<c10027cc>] ? sysenter_do_call+0x12/0x22
      
      This fixes the problem.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable@kernel.org
      af4e3631
    • L
      mm: fix up some user-visible effects of the stack guard page · d7824370
      Linus Torvalds committed
      This commit makes the stack guard page somewhat less visible to user
      space. It does this by:
      
       - not showing the guard page in /proc/<pid>/maps
      
         It looks like lvm-tools will actually read /proc/self/maps to figure
         out where all its mappings are, and effectively do a specialized
         "mlockall()" in user space.  By not showing the guard page as part of
         the mapping (by just adding PAGE_SIZE to the start for grows-up
         pages), lvm-tools ends up not being aware of it.
      
       - by also teaching the _real_ mlock() functionality not to try to lock
         the guard page.
      
         That would just expand the mapping down to create a new guard page,
         so there really is no point in trying to lock it in place.
      
      It would perhaps be nice to show the guard page specially in
      /proc/<pid>/maps (or at least mark grow-down segments some way), but
      let's not open ourselves up to more breakage by user space from programs
      that depend on the exact details of the 'maps' file.
      
      Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
      source code to see what was going on with the whole new warning.
      
      Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be>
      Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7824370
  4. 15 Aug, 2010 1 commit
  5. 14 Aug, 2010 3 commits
  6. 13 Aug, 2010 4 commits
  7. 12 Aug, 2010 7 commits
    • J
      mm: fix writeback_in_progress() · 81d73a32
      Jan Kara committed
      Commit 83ba7b07 ("writeback: simplify the write back thread queue")
      broke writeback_in_progress() as in that commit we started to remove work
      items from the list at the moment we start working on them and not at the
      moment they are finished.  Thus if the flusher thread was doing some work
      but there was no other work queued, writeback_in_progress() returned
      false.  This could in particular cause unnecessary queueing of background
      writeback from balance_dirty_pages() or writeout work from
      writeback_sb_if_idle().
      
      This patch fixes the problem by introducing a bit in the bdi state which
      indicates that the flusher thread is processing some work and uses this
      bit for writeback_in_progress() test.
      
      NOTE: Both callsites of writeback_in_progress() (namely,
      writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
      need a different information than what writeback_in_progress() provides.
      They would need to know whether *the kind of writeback they are going to
      submit* is already queued.  But this information isn't that simple to
      provide so let's fix writeback_in_progress() for the time being.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      81d73a32
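The difference between the broken and the fixed test can be shown with a small single-threaded model. Everything in it is an assumption made for illustration: the struct, the bit name and the helper names are hypothetical, and real kernel code would use atomic bit operations rather than plain integer updates.

```c
#include <stdbool.h>

/* Hypothetical state bit meaning "the flusher is processing a work item". */
#define MODEL_WB_RUNNING (1u << 0)

/* Simplified model of a backing device's writeback bookkeeping. */
struct bdi_model {
	unsigned int state;	/* state bits */
	int queued;		/* work items still on the list */
};

/* Broken test: only looks at the list, which is emptied when work starts. */
static bool in_progress_old(const struct bdi_model *bdi)
{
	return bdi->queued > 0;
}

/* Fixed test: looks at the state bit set for the duration of the work. */
static bool in_progress_new(const struct bdi_model *bdi)
{
	return (bdi->state & MODEL_WB_RUNNING) != 0;
}

/* The flusher dequeues an item at the moment it starts working on it. */
static void flusher_start_work(struct bdi_model *bdi)
{
	bdi->state |= MODEL_WB_RUNNING;
	if (bdi->queued > 0)
		bdi->queued--;
}

static void flusher_finish_work(struct bdi_model *bdi)
{
	bdi->state &= ~MODEL_WB_RUNNING;
}
```

With one queued item, the old test reports idle as soon as the flusher starts on it, while the new test keeps reporting activity until the work is finished.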
    • W
      writeback: merge for_kupdate and !for_kupdate cases · a50aeb40
      Wu Fengguang committed
      Unify the logic for kupdate and non-kupdate cases.  There won't be
      starvation because the inodes requeued into b_more_io will later be
      spliced _after_ the remaining inodes in b_io, hence won't stand in the way
      of other inodes in the next run.
      
      It avoids unnecessary redirty_tail() calls, hence the update of
      i_dirtied_when.  The timestamp update is undesirable because it could
      later delay the inode's periodic writeback, or may exclude the inode from
      the data integrity sync operation (which checks timestamp to avoid extra
      work and livelock).
      
      ===
      How the redirty_tail() comes about:
      
      It was a long story..  This redirty_tail() was introduced with
      wbc.more_io.  The initial patch for more_io actually does not have the
      redirty_tail(), and when it's merged, several 100% iowait bug reports
      arose:
      
      reiserfs:
              http://lkml.org/lkml/2007/10/23/93
      
      jfs:
              commit 29a424f2
              JFS: clear PAGECACHE_TAG_DIRTY for no-write pages
      
      ext2:
              http://www.spinics.net/linux/lists/linux-ext4/msg04762.html
      
      They are all old bugs hidden in various filesystems that become "visible"
      with the more_io patch.  At the time, the ext2 bug is thought to be
      "trivial", so not fixed.  Instead the following updated more_io patch with
      redirty_tail() is merged:
      
      	http://www.spinics.net/linux/lists/linux-ext4/msg04507.html
      
      This will in general prevent 100% on ext2 and possibly other unknown FS bugs.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Martin Bligh <mbligh@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a50aeb40
    • W
      writeback: fix queue_io() ordering · 4ea879b9
      Wu Fengguang committed
      This was not a bug, since b_io is empty for kupdate writeback.  The next
      patch will do requeue_io() for non-kupdate writeback, so let's fix it.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Martin Bligh <mbligh@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ea879b9
    • W
      writeback: don't redirty tail an inode with dirty pages · 23539afc
      Wu Fengguang committed
      Avoid delaying writeback for an expired inode with lots of dirty pages, but
      no active dirtier at the moment.  Previously we only did that for the
      kupdate case.
      
      Any filesystem that does delayed allocation or unwritten extent conversion
      after IO completion will cause this - for example, XFS.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23539afc
    • W
      writeback: avoid unnecessary calculation of bdi dirty thresholds · 16c4042f
      Wu Fengguang committed
      Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
      that the latter can be avoided when under global dirty background
      threshold (which is the normal state for most systems).
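      The benefit of the split can be sketched in plain userspace C.  This is
      only an illustration under stated assumptions, not the kernel code: the
      struct, the *_sketch function names, the ratio parameters and the call
      counter are all hypothetical, and the real threshold arithmetic is more
      involved.  It shows only that the per-device limit is never computed
      while the global dirty count is at or below the background threshold.

```c
#include <stdbool.h>

/* Hypothetical container for the two global thresholds (in pages). */
struct dirty_limits_sketch {
	unsigned long background;
	unsigned long dirty;
};

/* Counts how often the per-device limit is computed (illustration only). */
static unsigned long bdi_limit_calls;

/* Cheap part: global thresholds derived from simple percentages. */
static struct dirty_limits_sketch
global_limits_sketch(unsigned long dirtyable_pages,
		     unsigned int bg_ratio, unsigned int dirty_ratio)
{
	struct dirty_limits_sketch l;

	l.background = dirtyable_pages * bg_ratio / 100;
	l.dirty = dirtyable_pages * dirty_ratio / 100;
	return l;
}

/* Costlier part: this device's share of the global dirty limit. */
static unsigned long
bdi_limit_sketch(unsigned long dirty_limit, unsigned int share_percent)
{
	bdi_limit_calls++;
	return dirty_limit * share_percent / 100;
}

/* Returns true if the writer should be throttled on this device. */
static bool
needs_throttle_sketch(unsigned long dirtyable_pages,
		      unsigned int bg_ratio, unsigned int dirty_ratio,
		      unsigned long nr_dirty,
		      unsigned int bdi_share_percent,
		      unsigned long bdi_dirty)
{
	struct dirty_limits_sketch l =
		global_limits_sketch(dirtyable_pages, bg_ratio, dirty_ratio);

	/* Common case: under the background threshold, skip the bdi math. */
	if (nr_dirty <= l.background)
		return false;

	return bdi_dirty > bdi_limit_sketch(l.dirty, bdi_share_percent);
}
```

      With this layout, a caller that is under the background threshold returns
      before bdi_limit_sketch() runs, so bdi_limit_calls stays unchanged on that
      path.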
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16c4042f
    • W
      AFS: Implement an autocell mount capability [ver #2] · bec5eb61
      wanglei committed
      Implement the ability for the root directory of a mounted AFS filesystem to
      accept lookups of arbitrary directory names, to interpret the names as the names
      of cells, to look the cell names up in the DNS for AFSDB records and to mount
      the root.cell volume of the nominated cell on the pseudo-directory created by
      lookup.
      
      This facility is requested by passing:
      
      	-o autocell
      
      to the mountpoint for which this is desired, usually the /afs mount.
      
      To use this facility, a DNS upcall program is required for AFSDB records.  This
      can be obtained from:
      
      	http://people.redhat.com/~dhowells/afs/dns.afsdb.c
      
      It should be compiled with -lresolv and -lkeyutils and installed as, say:
      
      	/usr/sbin/dns.afsdb
      
      Then the following line needs to be added to /sbin/request-key.conf:
      
      	create	dns_resolver afsdb:*	*	/usr/sbin/dns.afsdb %k
      
      This can be tested by mounting AFS, say:
      
      	insmod dns_resolver.ko
      	insmod af-rxrpc.ko
      	insmod kafs.ko rootcell=grand.central.org
      	mount -t afs "#grand.central.org:root.cell." /afs -o autocell
      
      and doing:
      
      	ls /afs/grand.central.org/
      
      which should show:
      
      	archive/  cvs/  doc/  local/  project/  service/  software/  user/  www/
      
      if it works.
      Signed-off-by: Wang Lei <wang840925@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Steve French <sfrench@us.ibm.com>
      bec5eb61
    • W
      DNS: If the DNS server returns an error, allow that to be cached [ver #2] · 4a2d7892
      Wang Lei committed
      If the DNS server returns an error, allow that to be cached in the DNS resolver
      key in lieu of a value.  Userspace passes the desired error number as an option
      in the payload:
      
      	"#dnserror=<number>"
      
      Userspace must map h_errno from the name resolution routines to an appropriate
      Linux error before passing it up.  Something like the following mapping is
      recommended:
      
      	[HOST_NOT_FOUND]	= ENODATA,
      	[TRY_AGAIN]		= EAGAIN,
      	[NO_RECOVERY]		= ECONNREFUSED,
      	[NO_DATA]		= ENODATA,
      
      in lieu of Linux errors specifically for representing name service errors.  The
      filesystem must map these errors appropriately before passing them to
      userspace.  AFS is made to map ENODATA and EAGAIN to EDESTADDRREQ for the
      return to userspace; ECONNREFUSED is allowed to stand as is.
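
      The two mappings described above can be sketched in userspace C as
      follows.  This is a minimal illustration, not the code of the upcall
      program or of AFS: the function names are hypothetical, the errors are
      handled as positive errno values for simplicity, and the option is
      assumed to carry the error as a positive decimal number.

```c
#include <errno.h>
#include <netdb.h>
#include <stdio.h>

/*
 * Map an h_errno value to the Linux error recommended in the commit
 * message.  Returns 0 for values outside that table.
 */
static int dns_h_errno_to_errno(int herr)
{
	switch (herr) {
	case HOST_NOT_FOUND:
		return ENODATA;
	case TRY_AGAIN:
		return EAGAIN;
	case NO_RECOVERY:
		return ECONNREFUSED;
	case NO_DATA:
		return ENODATA;
	default:
		return 0;
	}
}

/*
 * Format the "#dnserror=<number>" option into buf.  Returns the string
 * length, or -1 if herr is unmapped or buf is too small.
 */
static int dns_format_error_option(int herr, char *buf, size_t len)
{
	int err = dns_h_errno_to_errno(herr);
	int n;

	if (err == 0)
		return -1;

	n = snprintf(buf, len, "#dnserror=%d", err);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}

/*
 * Filesystem-side remapping as described for AFS: ENODATA and EAGAIN
 * become EDESTADDRREQ; any other error, including ECONNREFUSED, is
 * passed through unchanged.
 */
static int fs_map_dns_error_sketch(int err)
{
	if (err == ENODATA || err == EAGAIN)
		return EDESTADDRREQ;
	return err;
}
```

      A caller would pass the h_errno left behind by a failed lookup to
      dns_format_error_option() and hand the resulting string to the key as an
      option in its payload.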
      
      The error can be seen in /proc/keys as a negative number after the description
      of the key.  Compare, for example, the following key entries:
      
      2f97238c I--Q--     1  53s 3f010000     0     0 dns_resol afsdb:grand.centrall.org: -61
      338bfbbe I--Q--     1  59m 3f010000     0     0 dns_resol afsdb:grand.central.org: 37
      
      If the error option is supplied in the payload, the main part of the payload is
      discarded.  The key should have an expiry time set by userspace.
      Signed-off-by: Wang Lei <wang840925@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Steve French <sfrench@us.ibm.com>
      4a2d7892