1. 23 6月, 2015 1 次提交
    • B
      xfs: don't truncate attribute extents if no extents exist · f66bf042
      Brian Foster 提交于
      The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that
      attribute blocks exist to invalidate. It is possible to have an
      attribute fork without extents, however. Consider the case where the
      attribute fork is created towards the beginning of xfs_attr_set() but
      some part of the subsequent attribute set fails.
      
      If an inode in such a state hits xfs_attr_inactive(), it eventually
      calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a
      filesystem corruption warning, returns an error that bubbles back up to
      xfs_attr_inactive(), and leads to destruction of the in-core attribute
      fork without an on-disk reset. If the inode happens to make it back
      through xfs_inactive() in this state (e.g., via a concurrent bulkstat
      that cycles the inode from the reclaim state and releases it), i_afp
      might not exist when xfs_bmapi_read() is called and causes a NULL
      dereference panic.
      
      A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB)
      reproduces these problems. The behavior is a regression caused by:
      
      6dfe5a04 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
      
      ... which removed logic that avoided the attribute extent truncate when
      no extents exist. Restore this logic to ensure the attribute fork is
      destroyed and reset correctly if it exists without any allocated
      extents.
      
      cc: stable@vger.kernel.org # 3.12 to 4.0.x
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f66bf042
  2. 22 6月, 2015 6 次提交
  3. 29 5月, 2015 7 次提交
    • B
      xfs: fix broken i_nlink accounting for whiteout tmpfile inode · 22419ac9
      Brian Foster 提交于
      XFS uses the internal tmpfile() infrastructure for the whiteout inode
      used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates
      the inode, drops di_nlink, adds the inode to the agi unlinked list,
      calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode,
      and then finishes the common inode setup (e.g., clear I_NEW and unlock).
      
      The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was
      pulled up out of that function as part of the following commit to
      resolve a deadlock issue:
      
      	330033d6 xfs: fix tmpfile/selinux deadlock and initialize security
      
      As a result, callers of xfs_create_tmpfile() are responsible for either
      calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout
      tmpfile allocation helper does neither. As a result, the vfs ->i_nlink
      becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links
      it back into the source dentry and calls xfs_bumplink().
      
      Update the assert in xfs_rename() to help detect this problem in the
      future and update xfs_rename_alloc_whiteout() to decrement the link
      count as part of the manual tmpfile inode setup.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      22419ac9
    • D
      xfs: xfs_iozero can return positive errno · cddc1162
      Dave Chinner 提交于
      It was missed when we converted everything in XFs to use negative error
      numbers, so fix it now. Bug introduced in 3.17 by commit 2451337d ("xfs: global
      error sign conversion"), and should go back to stable kernels.
      
      Thanks to Brian Foster for noticing it.
      
      cc: <stable@vger.kernel.org> # 3.17, 3.18, 3.19, 4.0
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      cddc1162
    • D
      xfs: xfs_attr_inactive leaves inconsistent attr fork state behind · 6dfe5a04
      Dave Chinner 提交于
      xfs_attr_inactive() is supposed to clean up the attribute fork when
      the inode is being freed. While it removes attribute fork extents,
      it completely ignores attributes in local format, which means that
      there can still be active attributes on the inode after
      xfs_attr_inactive() has run.
      
      This leads to problems with concurrent inode writeback - the in-core
      inode attribute fork is removed without locking on the assumption
      that nothing will be attempting to access the attribute fork after a
      call to xfs_attr_inactive() because it isn't supposed to exist on
      disk any more.
      
      To fix this, make xfs_attr_inactive() completely remove all traces
      of the attribute fork from the inode, regardless of it's state.
      Further, also remove the in-core attribute fork structure safely so
      that there is nothing further that needs to be done by callers to
      clean up the attribute fork. This means we can remove the in-core
      and on-disk attribute forks atomically.
      
      Also, on error simply remove the in-memory attribute fork. There's
      nothing that can be done with it once we have failed to remove the
      on-disk attribute fork, so we may as well just blow it away here
      anyway.
      
      cc: <stable@vger.kernel.org> # 3.12 to 4.0
      Reported-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      6dfe5a04
    • D
      xfs: extent size hints can round up extents past MAXEXTLEN · 6dea405e
      Dave Chinner 提交于
      This results in BMBT corruption, as seen by this test:
      
      # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc
      ....
      # mount /dev/vdc /mnt/scratch
      # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo
      
      which results in this failure on a debug kernel:
      
      XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211
      ....
      Call Trace:
       [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100
       [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20
       [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120
       [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70
       [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70
       [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0
       [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170
       [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400
       [<ffffffff81ddcc29>] ? down_write+0x29/0x60
       [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310
       [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120
       [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570
       [<ffffffff811cd680>] vfs_fallocate+0x140/0x260
       [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80
       [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17
      
      The tracepoint that indicates the extent that triggered the assert
      failure is:
      
      xfs_iext_insert:   idx 0 offset 0 block 16777224 count 2097152 flag 1
      
      Clearly indicating that the extent length is greater than MAXEXTLEN,
      which is 2097151. A prior trace point shows the allocation was an
      exact size match and that a length greater than MAXEXTLEN was asked
      for:
      
      xfs_alloc_size_done:  agno 1 agbno 8 minlen 2097152 maxlen 2097152
      					    ^^^^^^^        ^^^^^^^
      
      We don't see this problem with extent size hints through the IO path
      because we can't do single IOs large enough to trigger MAXEXTLEN
      allocation. fallocate(), OTOH, is not limited in it's allocation
      sizes and so needs help here.
      
      The issue is that the extent size hint alignment is rounding up the
      extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking
      into account extent size hints when calculating the maximum extent
      length to allocate. xfs_bmapi_reserve_delalloc() is already doing
      this, but direct extent allocation is not.
      
      Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is
      wrong, and it works only because delayed allocation extents are not
      limited in size to MAXEXTLEN in the in-core extent tree. hence this
      calculation does not work for direct allocation, and the delalloc
      code needs fixing. This may, in fact be the underlying bug that
      occassionally causes transaction overruns in delayed allocation
      extent conversion, so now we know it's wrong we should fix it, too.
      Many thanks to Brian Foster for finding this problem during review
      of this patch.
      
      Hence the fix, after much code reading, is to allow
      xfs_bmap_extsize_align() to align partial extents when full
      alignment would extend the alignment past MAXEXTLEN. We can safely
      do this because all callers have higher layer allocation loops that
      already handle short allocations, and so will simply run another
      allocation to cover the remainder of the requested allocation range
      that we ignored during alignment. The advantage of this approach is
      that it also removes the need for callers to do anything other than
      limit their requests to MAXEXTLEN - they don't really need to be
      aware of extent size hints at all.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      6dea405e
    • D
      xfs: inode and free block counters need to use __percpu_counter_compare · 8c1903d3
      Dave Chinner 提交于
      Because the counters use a custom batch size, the comparison
      functions need to be aware of that batch size otherwise the
      comparison does not work correctly. This leads to ASSERT failures
      on generic/027 like this:
      
       XFS: Assertion failed: 0, file: fs/xfs/xfs_mount.c, line: 1099
       ------------[ cut here ]------------
      ....
       Call Trace:
        [<ffffffff81522a39>] xfs_mod_icount+0x99/0xc0
        [<ffffffff815285cb>] xfs_trans_unreserve_and_mod_sb+0x28b/0x5b0
        [<ffffffff8152f941>] xfs_log_commit_cil+0x321/0x580
        [<ffffffff81528e17>] xfs_trans_commit+0xb7/0x260
        [<ffffffff81503d4d>] xfs_bmap_finish+0xcd/0x1b0
        [<ffffffff8151da41>] xfs_inactive_ifree+0x1e1/0x250
        [<ffffffff8151dbe0>] xfs_inactive+0x130/0x200
        [<ffffffff81523a21>] xfs_fs_evict_inode+0x91/0xf0
        [<ffffffff811f3958>] evict+0xb8/0x190
        [<ffffffff811f433b>] iput+0x18b/0x1f0
        [<ffffffff811e8853>] do_unlinkat+0x1f3/0x320
        [<ffffffff811d548a>] ? filp_close+0x5a/0x80
        [<ffffffff811e999b>] SyS_unlinkat+0x1b/0x40
        [<ffffffff81e0892e>] system_call_fastpath+0x12/0x71
      
      This is a regression introduced by commit 501ab323 ("xfs: use generic
      percpu counters for inode counter").
      
      This patch fixes the same problem for both the inode counter and the
      free block counter in the superblocks.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8c1903d3
    • D
      percpu_counter: batch size aware __percpu_counter_compare() · 80188b0d
      Dave Chinner 提交于
      XFS uses non-stanard batch sizes for avoiding frequent global
      counter updates on it's allocated inode counters, as they increment
      or decrement in batches of 64 inodes. Hence the standard percpu
      counter batch of 32 means that the counter is effectively a global
      counter. Currently Xfs uses a batch size of 128 so that it doesn't
      take the global lock on every single modification.
      
      However, Xfs also needs to compare accurately against zero, which
      means we need to use percpu_counter_compare(), and that has a
      hard-coded batch size of 32, and hence will spuriously fail to
      detect when it is supposed to use precise comparisons and hence
      the accounting goes wrong.
      
      Add __percpu_counter_compare() to take a custom batch size so we can
      use it sanely in XFS and factor percpu_counter_compare() to use it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      80188b0d
    • G
      xfs: use percpu_counter_read_positive for mp->m_icount · 74f9ce1c
      George Wang 提交于
      Function percpu_counter_read just return the current counter, which can be
      negative. This will cause the checking of "allocated inode
      counts <= m_maxicount" false positive. Use percpu_counter_read_positive can
      solve this problem, and be consistent with the purpose to introduce percpu
      mechanism to xfs.
      Signed-off-by: NGeorge Wang <xuw2015@gmail.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      74f9ce1c
  4. 04 5月, 2015 8 次提交
  5. 03 5月, 2015 3 次提交
    • J
      ext4: fix growing of tiny filesystems · 2c869b26
      Jan Kara 提交于
      The estimate of necessary transaction credits in ext4_flex_group_add()
      is too pessimistic. It reserves credit for sb, resize inode, and resize
      inode dindirect block for each group added in a flex group although they
      are always the same block and thus it is enough to account them only
      once. Also the number of modified GDT block is overestimated since we
      fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.
      
      Make the estimation more precise. That reduces number of requested
      credits enough that we can grow 20 MB filesystem (which has 1 MB
      journal, 79 reserved GDT blocks, and flex group size 16 by default).
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      2c869b26
    • D
      ext4: move check under lock scope to close a race. · 280227a7
      Davide Italiano 提交于
      fallocate() checks that the file is extent-based and returns
      EOPNOTSUPP in case is not. Other tasks can convert from and to
      indirect and extent so it's safe to check only after grabbing
      the inode mutex.
      Signed-off-by: NDavide Italiano <dccitaliano@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      280227a7
    • L
      ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d
      Lukas Czerner 提交于
      Currently it is possible to lose whole file system block worth of data
      when we hit the specific interaction with unwritten and delayed extents
      in status extent tree.
      
      The problem is that when we insert delayed extent into extent status
      tree the only way to get rid of it is when we write out delayed buffer.
      However there is a limitation in the extent status tree implementation
      so that when inserting unwritten extent should there be even a single
      delayed block the whole unwritten extent would be marked as delayed.
      
      At this point, there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when a we write
      into said unwritten extent we will convert it to written, but it still
      remains delayed.
      
      When we try to write into that block later ext4_da_map_blocks() will set
      the buffer new and delayed and map it to invalid block which causes
      the rest of the block to be zeroed loosing already written data.
      
      For now we can fix this by simply not allowing to set delayed status on
      written extent in the extent status tree. Also add WARN_ON() to make
      sure that we notice if this happens in the future.
      
      This problem can be easily reproduced by running the following xfs_io.
      
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
      
      This can be theoretically also reproduced by at random by running fsx,
      but it's not very reliable, though on machines with bigger page size
      (like ppc) this can be seen more often (especially xfstest generic/127)
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d2dc317d
  6. 02 5月, 2015 10 次提交
  7. 01 5月, 2015 5 次提交
    • L
      Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 64887b68
      Linus Torvalds 提交于
      Pull btrfs fixes from Chris Mason:
       "A few more btrfs fixes.
      
        These range from corners Filipe found in the new free space cache
        writeback to a grab bag of fixes from the list"
      
      * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
        Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
        btrfs: unlock i_mutex after attempting to delete subvolume during send
        btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
        btrfs: fix race on ENOMEM in alloc_extent_buffer
        btrfs: handle ENOMEM in btrfs_alloc_tree_block
        Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
        Btrfs: don't check for delalloc_bytes in cache_save_setup
        Btrfs: fix deadlock when starting writeback of bg caches
        Btrfs: fix race between start dirty bg cache writeout and bg deletion
      64887b68
    • L
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 036f351e
      Linus Torvalds 提交于
      Pull arm64 fixes from Will Deacon:
       "Not too much here, but we've addressed a couple of nasty issues in the
        dma-mapping code as well as adding the halfword and byte variants of
        load_acquire/store_release following on from the CSD locking bug that
        you fixed in the core.
      
         - fix perf devicetree warnings at probe time
      
         - fix memory leak in __dma_free()
      
         - ensure DMA buffers are always zeroed
      
         - show IRQ trigger in /proc/interrupts (for parity with ARM)
      
         - implement byte and halfword access for smp_{load_acquire,store_release}"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: perf: Fix the pmu node name in warning message
        arm64: perf: don't warn about missing interrupt-affinity property for PPIs
        arm64: add missing PAGE_ALIGN() to __dma_free()
        arm64: dma-mapping: always clear allocated buffers
        ARM64: Enable CONFIG_GENERIC_IRQ_SHOW_LEVEL
        arm64: add missing data types in smp_load_acquire/smp_store_release
      036f351e
    • S
      powerpc/powernv: Restore non-volatile CRs after nap · 0aab3747
      Sam Bobroff 提交于
      Patches 7cba160a "powernv/cpuidle: Redesign idle states management"
      and 77b54e9f "powernv/powerpc: Add winkle support for offline cpus"
      use non-volatile condition registers (cr2, cr3 and cr4) early in the system
      reset interrupt handler (system_reset_pSeries()) before it has been determined
      if state loss has occurred. If state loss has not occurred, control returns via
      the power7_wakeup_noloss() path which does not restore those condition
      registers, leaving them corrupted.
      
      Fix this by restoring the condition registers in the power7_wakeup_noloss()
      case.
      
      This is apparent when running a KVM guest on hardware that does not
      support winkle or sleep and the guest makes use of secondary threads. In
      practice this means Power7 machines, though some early unreleased Power8
      machines may also be susceptible.
      
      The secondary CPUs are taken off line before the guest is started and
      they call pnv_smp_cpu_kill_self(). This checks support for sleep
      states (in this case there is no support) and power7_nap() is called.
      
      When the CPU is woken, power7_nap() returns and because the CPU is
      still off line, the main while loop executes again. The sleep states
      support test is executed again, but because the tested values cannot
      have changed, the compiler has optimized the test away and instead we
      rely on the result of the first test, which has been left in cr3
      and/or cr4. With the result overwritten, the wrong branch is taken and
      power7_winkle() is called on a CPU that does not support it, leading
      to it stalling.
      
      Fixes: 7cba160a ("powernv/cpuidle: Redesign idle states management")
      Fixes: 77b54e9f ("powernv/powerpc: Add winkle support for offline cpus")
      [mpe: Massage change log a bit more]
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0aab3747
    • G
      powerpc/eeh: Delay probing EEH device during hotplug · d91dafc0
      Gavin Shan 提交于
      Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
      devices in early stage, which is reasonable to pSeries platform.
      However, it's wrong for PowerNV platform because the PE# isn't
      determined until the resources (IO and MMIO) are assigned to
      PE in hotplug case. So we have to delay probing EEH devices
      for PowerNV platform until the PE# is assigned.
      
      Fixes: ff57b454 ("powerpc/eeh: Do probe on pci_dn")
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d91dafc0
    • G
      powerpc/eeh: Fix race condition in pcibios_set_pcie_reset_state() · 1ae79b78
      Gavin Shan 提交于
      When asserting reset in pcibios_set_pcie_reset_state(), the PE
      is enforced to (hardware) frozen state in order to drop unexpected
      PCI transactions (except PCI config read/write) automatically by
      hardware during reset, which would cause recursive EEH error.
      However, the (software) frozen state EEH_PE_ISOLATED is missed.
      When users get 0xFF from PCI config or MMIO read, EEH_PE_ISOLATED
      is set in PE state retrival backend. Unfortunately, nobody (the
      reset handler or the EEH recovery functinality in host) will clear
      EEH_PE_ISOLATED when the PE has been passed through to guest.
      
      The patch sets and clears EEH_PE_ISOLATED properly during reset
      in function pcibios_set_pcie_reset_state() to fix the issue.
      
      Fixes: 28158cd1 ("Enhance pcibios_set_pcie_reset_state()")
      Reported-by: NCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Tested-by: NCarol L. Soto <clsoto@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1ae79b78