1. 30 4月, 2012 12 次提交
    • D
      ext4: Calculate and verify checksums for htree nodes · dbe89444
      Darrick J. Wong 提交于
      Calculate and verify the checksum for directory index tree (htree)
      node blocks.  The checksum is stored in the last 4 bytes of the htree
      block and requires the dx_entry array to stop 1 dx_entry short of the
      end of the block.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      dbe89444
    • D
      ext4: verify and calculate checksums for extent tree blocks · 7ac5990d
      Darrick J. Wong 提交于
      Calculate and verify the checksum for each extent tree block.  The
      checksum is located in the space immediately after the last possible
      ext4_extent in the block.  The space is is typically the last 4-8
      bytes in the block.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7ac5990d
    • D
      ext4: calculate and verify block bitmap checksum · fa77dcfa
      Darrick J. Wong 提交于
      Compute and verify the checksum of the block bitmap; this checksum is
      stored in the block group descriptor.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      fa77dcfa
    • D
      ext4: calculate and verify checksums for inode bitmaps · 41a246d1
      Darrick J. Wong 提交于
      Compute and verify the checksum of the inode bitmap; the checkum is
      stored in the block group descriptor.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      41a246d1
    • D
      ext4: calculate and verify inode checksums · 814525f4
      Darrick J. Wong 提交于
      This patch introduces to ext4 the ability to calculate and verify
      inode checksums.  This requires the use of a new ro compatibility flag
      and some accompanying e2fsprogs patches to provide the relevant
      features in tune2fs and e2fsck.  The inode generation changes have
      been integrated into this patch.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      814525f4
    • D
      ext4: calculate and verify superblock checksum · a9c47317
      Darrick J. Wong 提交于
      Calculate and verify the superblock checksum.  Since the UUID and
      block group number are embedded in each copy of the superblock, we
      need only checksum the entire block.  Refactor some of the code to
      eliminate open-coding of the checksum update call.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a9c47317
    • D
      ext4: load the crc32c driver if necessary · 0441984a
      Darrick J. Wong 提交于
      Obtain a reference to the cryptoapi and crc32c if we mount a
      filesystem with metadata checksumming enabled.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0441984a
    • D
      ext4: record the checksum algorithm in use in the superblock · d25425f8
      Darrick J. Wong 提交于
      Record the type of checksum algorithm we're using for metadata in the
      superblock, in case we ever want/need to change the algorithm.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d25425f8
    • D
      ext4: change on-disk layout to support extended metadata checksumming · e6153918
      Darrick J. Wong 提交于
      Define flags and change structure definitions to allow checksumming of
      ext4 metadata.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e6153918
    • D
      ext4: create a new BH_Verified flag to avoid unnecessary metadata validation · f8489128
      Darrick J. Wong 提交于
      Create a new BH_Verified flag to indicate that we've verified all the
      data in a buffer_head for correctness.  This allows us to bypass
      expensive verification steps when they are not necessary without
      missing them when they are.
      Signed-off-by: NDarrick J. Wong <djwong@us.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f8489128
    • L
      autofs: make the autofsv5 packet file descriptor use a packetized pipe · 64f371bc
      Linus Torvalds 提交于
      The autofs packet size has had a very unfortunate size problem on x86:
      because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
      because the packet data was not 8-byte aligned, the size of the autofsv5
      packet structure differed between 32-bit and 64-bit modes despite
      looking otherwise identical (300 vs 304 bytes respectively).
      
      We first fixed that up by making the 64-bit compat mode know about this
      problem in commit a32744d4 ("autofs: work around unhappy compat
      problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
      64-bit kernel because everything then worked the same way as on a 32-bit
      kernel.
      
      But it turned out that 'automount' had actually known and worked around
      this problem in user space, so fixing the kernel to do the proper 32-bit
      compatibility handling actually *broke* 32-bit automount on a 64-bit
      kernel, because it knew that the packet sizes were wrong and expected
      those incorrect sizes.
      
      As a result, we ended up reverting that compatibility mode fix, and
      thus breaking systemd again, in commit fcbf94b9.
      
      With both automount and systemd doing a single read() system call, and
      verifying that they get *exactly* the size they expect but using
      different sizes, it seemed that fixing one of them inevitably seemed to
      break the other.  At one point, a patch I seriously considered applying
      from Michael Tokarev did a "strcmp()" to see if it was automount that
      was doing the operation.  Ugly, ugly.
      
      However, a prettier solution exists now thanks to the packetized pipe
      mode.  By marking the communication pipe as being packetized (by simply
      setting the O_DIRECT flag), we can always just write the bigger packet
      size, and if user-space does a smaller read, it will just get that
      partial end result and the extra alignment padding will simply be thrown
      away.
      
      This makes both automount and systemd happy, since they now get the size
      they asked for, and the kernel side of autofs simply no longer needs to
      care - it could pad out the packet arbitrarily.
      
      Of course, if there is some *other* user of autofs (please, please,
      please tell me it ain't so - and we haven't heard of any) that tries to
      read the packets with multiple writes, that other user will now be
      broken - the whole point of the packetized mode is that one system call
      gets exactly one packet, and you cannot read a packet in pieces.
      Tested-by: NMichael Tokarev <mjt@tls.msk.ru>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Thomas Meyer <thomas@m3y3r.de>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64f371bc
    • L
      pipes: add a "packetized pipe" mode for writing · 9883035a
      Linus Torvalds 提交于
      The actual internal pipe implementation is already really about
      individual packets (called "pipe buffers"), and this simply exposes that
      as a special packetized mode.
      
      When we are in the packetized mode (marked by O_DIRECT as suggested by
      Alan Cox), a write() on a pipe will not merge the new data with previous
      writes, so each write will get a pipe buffer of its own.  The pipe
      buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
      will tell the reader side to break the read at that boundary (and throw
      away any partial packet contents that do not fit in the read buffer).
      
      End result: as long as you do writes less than PIPE_BUF in size (so that
      the pipe doesn't have to split them up), you can now treat the pipe as a
      packet interface, where each read() system call will read one packet at
      a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
      sufficient, since bigger than that doesn't guarantee atomicity anyway),
      and the return value of the read() will naturally give you the size of
      the packet.
      
      NOTE! We do not support zero-sized packets, and zero-sized reads and
      writes to a pipe continue to be no-ops.  Also note that big packets will
      currently be split at write time, but that the size at which that
      happens is not really specified (except that it's bigger than PIPE_BUF).
      Currently that limit is the system page size, but we might want to
      explicitly support bigger packets some day.
      
      The main user for this is going to be the autofs packet interface,
      allowing us to stop having to care so deeply about exact packet sizes
      (which have had bugs with 32/64-bit compatibility modes).  But user
      space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
      fail with an EINVAL on kernels that do not support this interface.
      Tested-by: NMichael Tokarev <mjt@tls.msk.ru>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Miller <davem@davemloft.net>
      Cc: Ian Kent <raven@themaw.net>
      Cc: Thomas Meyer <thomas@m3y3r.de>
      Cc: stable@kernel.org  # needed for systemd/autofs interaction fix
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9883035a
  2. 28 4月, 2012 8 次提交
    • L
      Revert "autofs: work around unhappy compat problem on x86-64" · fcbf94b9
      Linus Torvalds 提交于
      This reverts commit a32744d4.
      
      While that commit was technically the right thing to do, and made the
      x86-64 compat mode work identically to native 32-bit mode (and thus
      fixing the problem with a 32-bit systemd install on a 64-bit kernel), it
      turns out that the automount binaries had workarounds for this compat
      problem.
      
      Now, the workarounds are disgusting: doing an "uname()" to find out the
      architecture of the kernel, and then comparing it for the 64-bit cases
      and fixing up the size of the read() in automount for those.  And they
      were confused: it's not actually a generic 64-bit issue at all, it's
      very much tied to just x86-64, which has different alignment for an
      'u64' in 64-bit mode than in 32-bit mode.
      
      But the end result is that fixing the compat layer actually breaks the
      case of a 32-bit automount on a x86-64 kernel.
      
      There are various approaches to fix this (including just doing a
      "strcmp()" on current->comm and comparing it to "automount"), but I
      think that I will do the one that teaches pipes about a special "packet
      mode", which will allow user space to not have to care too deeply about
      the padding at the end of the autofs packet.
      
      That change will make the compat workaround unnecessary, so let's revert
      it first, and get automount working again in compat mode.  The
      packetized pipes will then fix autofs for systemd.
      Reported-and-requested-by: NMichael Tokarev <mjt@tls.msk.ru>
      Cc: Ian Kent <raven@themaw.net>
      Cc: stable@kernel.org # for 3.3
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcbf94b9
    • C
      Btrfs: reduce lock contention during extent insertion · dc7fdde3
      Chris Mason 提交于
      We're spending huge amounts of time on lock contention during
      end_io processing because we unconditionally assume we are overwriting
      an existing extent in the file for each IO.
      
      This checks to see if we are outside i_size, and if so, it uses a
      less expensive readonly search of the btree to look for existing
      extents.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      dc7fdde3
    • C
      Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir · fede766f
      Chris Mason 提交于
      Btrfs has an optimization where it will preallocate dentries during
      readdir to fill in enough information to open the inode without an extra
      lookup.
      
      But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
      that leads to deadlocks because our readdir code has tree locks held.
      
      For now, disable this optimization.  We'll fix the gfp mask in the next
      merge window.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      fede766f
    • D
      Btrfs: Fix space checking during fs resize · 7654b724
      Daniel J Blueman 提交于
      Fix out-of-space checking, addressing a warning and potential resource
      leak when resizing the filesystem down while allocating blocks.
      Signed-off-by: NDaniel J Blueman <daniel@quora.org>
      Reviewed-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7654b724
    • S
      Btrfs: fix block_rsv and space_info lock ordering · 1f699d38
      Stefan Behrens 提交于
      may_commit_transaction() calls
              spin_lock(&space_info->lock);
              spin_lock(&delayed_rsv->lock);
      and update_global_block_rsv() calls
              spin_lock(&block_rsv->lock);
              spin_lock(&sinfo->lock);
      
      Lockdep complains about this at run time.
      Everywhere except in update_global_block_rsv(), the space_info lock is
      the outer lock, therefore the locking order in update_global_block_rsv()
      is changed.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1f699d38
    • D
      Btrfs: Prevent root_list corruption · 1daf3540
      Daniel J Blueman 提交于
      I was seeing root_list corruption on unmount during fs resize in 3.4-rc4; add
      correct locking to address this.
      Signed-off-by: NDaniel J Blueman <daniel@quora.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1daf3540
    • J
      Btrfs: fix repair code for RAID10 · 3e74317a
      Jan Schmidt 提交于
      btrfs_map_block sets mirror_num, so that the repair code knows eventually
      which device gave us the read error. For RAID10, mirror_num must be 1 or 2.
      Before this fix mirror_num was incorrectly related to our stripe index.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3e74317a
    • J
      Btrfs: do not start delalloc inodes during sync · 996d282c
      Josef Bacik 提交于
      btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and
      start writing them out, but it doesn't splice the list or anything so as
      long as somebody is doing work on the box you could end up in this section
      _forever_.  So just remove it, it's not needed anyway since sync will start
      writeback on all inodes anyway, all we need to do is wait for ordered
      extents and then we can commit the transaction.  In my horrible torture test
      sync goes from taking 4 minutes to about 1.5 minutes.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      996d282c
  3. 26 4月, 2012 4 次提交
    • W
      revert "proc: clear_refs: do not clear reserved pages" · 63f61a6f
      Will Deacon 提交于
      Revert commit 85e72aa5 ("proc: clear_refs: do not clear reserved
      pages"), which was a quick fix suitable for -stable until ARM had been
      moved over to the gate_vma mechanism:
      
      https://lkml.org/lkml/2012/1/14/55
      
      With commit f9d4861f ("ARM: 7294/1: vectors: use gate_vma for vectors user
      mapping"), ARM does now use the gate_vma, so the PageReserved check can be
      removed from the proc code.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63f61a6f
    • A
      hugetlbfs: lockdep annotate root inode properly · 65ed7601
      Aneesh Kumar K.V 提交于
      This fixes the below reported false lockdep warning.  e096d0c7
      ("lockdep: Add helper function for dir vs file i_mutex annotation") added
      a similar annotation for every other inode in hugetlbfs but missed the
      root inode because it was allocated by a separate function.
      
      For HugeTLB fs we allow taking i_mutex in mmap.  HugeTLB fs doesn't
      support file write and its file read callback is modified in a05b0855
      ("hugetlbfs: avoid taking i_mutex from hugetlbfs_read()") to not take
      i_mutex.  Hence for HugeTLB fs with regular files we really don't take
      i_mutex with mmap_sem held.
      
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.4.0-rc1+ #322 Not tainted
       -------------------------------------------------------
       bash/1572 is trying to acquire lock:
        (&mm->mmap_sem){++++++}, at: [<ffffffff810f1618>] might_fault+0x40/0x90
      
       but task is already holding lock:
        (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&sb->s_type->i_mutex_key#12){+.+.+.}:
              [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
              [<ffffffff816a2f5e>] __mutex_lock_common+0x48/0x350
              [<ffffffff816a3325>] mutex_lock_nested+0x2a/0x31
              [<ffffffff811fb8e1>] hugetlbfs_file_mmap+0x7d/0x104
              [<ffffffff810f859a>] mmap_region+0x272/0x47d
              [<ffffffff810f8a39>] do_mmap_pgoff+0x294/0x2ee
              [<ffffffff810f8b65>] sys_mmap_pgoff+0xd2/0x10e
              [<ffffffff8103d19e>] sys_mmap+0x1d/0x1f
              [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      
       -> #0 (&mm->mmap_sem){++++++}:
              [<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
              [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
              [<ffffffff810f1645>] might_fault+0x6d/0x90
              [<ffffffff81125d62>] filldir+0x6a/0xc2
              [<ffffffff81133a83>] dcache_readdir+0x5c/0x222
              [<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
              [<ffffffff811260b6>] sys_getdents+0x79/0xc9
              [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&sb->s_type->i_mutex_key#12);
                                      lock(&mm->mmap_sem);
                                      lock(&sb->s_type->i_mutex_key#12);
         lock(&mm->mmap_sem);
      
        *** DEADLOCK ***
      
       1 lock held by bash/1572:
        #0:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
      
       stack backtrace:
       Pid: 1572, comm: bash Not tainted 3.4.0-rc1+ #322
       Call Trace:
        [<ffffffff81699a3c>] print_circular_bug+0x1f8/0x209
        [<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
        [<ffffffff810f38aa>] ? handle_pte_fault+0x5ff/0x614
        [<ffffffff8109e622>] ? mark_lock+0x2d/0x258
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff816a3249>] ? __mutex_lock_common+0x333/0x350
        [<ffffffff810f1645>] might_fault+0x6d/0x90
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff81125d62>] filldir+0x6a/0xc2
        [<ffffffff81133a83>] dcache_readdir+0x5c/0x222
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
        [<ffffffff811260b6>] sys_getdents+0x79/0xc9
        [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65ed7601
    • G
      fs/buffer.c: remove BUG() in possible but rare condition · 61065a30
      Glauber Costa 提交于
      While stressing the kernel with with failing allocations today, I hit the
      following chain of events:
      
      alloc_page_buffers():
      
      	bh = alloc_buffer_head(GFP_NOFS);
      	if (!bh)
      		goto no_grow; <= path taken
      
      grow_dev_page():
              bh = alloc_page_buffers(page, size, 0);
              if (!bh)
                      goto failed;  <= taken, consequence of the above
      
      and then the failed path BUG()s the kernel.
      
      The failure is inserted a litte bit artificially, but even then, I see no
      reason why it should be deemed impossible in a real box.
      
      Even though this is not a condition that we expect to see around every
      time, failed allocations are expected to be handled, and BUG() sounds just
      too much.  As a matter of fact, grow_dev_page() can return NULL just fine
      in other circumstances, so I propose we just remove it, then.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61065a30
    • J
      epoll: clear the tfile_check_list on -ELOOP · 13d51807
      Jason Baron 提交于
      An epoll_ctl(,EPOLL_CTL_ADD,,) operation can return '-ELOOP' to prevent
      circular epoll dependencies from being created.  However, in that case we
      do not properly clear the 'tfile_check_list'.  Thus, add a call to
      clear_tfile_check_list() for the -ELOOP case.
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Reported-by: NYurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru>
      Cc: Nelson Elhage <nelhage@nelhage.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Tested-by: NAlexandra N. Kossovsky <Alexandra.Kossovsky@oktetlabs.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13d51807
  4. 25 4月, 2012 2 次提交
  5. 24 4月, 2012 4 次提交
  6. 22 4月, 2012 2 次提交
  7. 21 4月, 2012 8 次提交