1. 19 Aug 2015 (1 commit)
    • xfs: disentangle EFI release from the extent count · 5e4b5386
      Committed by Brian Foster
      Release of the EFI either occurs based on the reference count or the
      extent count. The extent count used is either the count tracked in
      the EFI or EFD, depending on the particular situation. In either
      case, the count is initialized to the final value and thus always
      matches the current efi_next_extent value once the EFI is completely
      constructed.  For example, the EFI extent count is increased as the
      extents are logged in xfs_bmap_finish() and the full free list is
      always completely processed. Therefore, the count is guaranteed to
      be complete once the EFI transaction is committed. The EFD uses the
      efd_nextents counter to release the EFI. This counter is initialized
      to the count of the EFI when the EFD is created. Thus the EFD, as
      currently used, has no concept of partial EFI release based on
      extent count.
      
      Given that the EFI extent count is always released in whole, use of
      the extent count for reference counting is unnecessary. Remove this
      level of the API and release the EFI based on the core reference
      count. The efi_next_extent counter remains because it is still used
      to track the slot to log the next extent to free.
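
      A minimal sketch of reference-count-only release (illustrative, not the
      literal diff; the AIL removal and shutdown handling in the real code are
      simplified):

      	STATIC void
      	xfs_efi_release(
      		struct xfs_efi_log_item	*efip)
      	{
      		ASSERT(atomic_read(&efip->efi_refcount) > 0);
      		if (atomic_dec_and_test(&efip->efi_refcount)) {
      			/* last reference: pull the item from the AIL and free it */
      			xfs_trans_ail_remove(&efip->efi_item, SHUTDOWN_LOG_IO_ERROR);
      			xfs_efi_item_free(efip);
      		}
      	}
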
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 24 Jun 2015 (2 commits)
  3. 23 Jun 2015 (1 commit)
    • xfs: don't truncate attribute extents if no extents exist · f66bf042
      Committed by Brian Foster
      The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that
      attribute blocks exist to invalidate. It is possible to have an
      attribute fork without extents, however. Consider the case where the
      attribute fork is created towards the beginning of xfs_attr_set() but
      some part of the subsequent attribute set fails.
      
      If an inode in such a state hits xfs_attr_inactive(), it eventually
      calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a
      filesystem corruption warning, returns an error that bubbles back up to
      xfs_attr_inactive(), and leads to destruction of the in-core attribute
      fork without an on-disk reset. If the inode happens to make it back
      through xfs_inactive() in this state (e.g., via a concurrent bulkstat
      that cycles the inode from the reclaim state and releases it), i_afp
      might not exist when xfs_bmapi_read() is called and causes a NULL
      dereference panic.
      
      A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB)
      reproduces these problems. The behavior is a regression caused by:
      
      6dfe5a04 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
      
      ... which removed logic that avoided the attribute extent truncate when
      no extents exist. Restore this logic to ensure the attribute fork is
      destroyed and reset correctly if it exists without any allocated
      extents.
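
      A sketch of the restored guard (illustrative; the out_cancel label and the
      surrounding transaction setup are assumed context, not part of the quoted
      change):

      	/*
      	 * Only invalidate and truncate attribute blocks when the fork
      	 * actually has extents; an extent-less, non-local fork falls
      	 * through to the in-core fork teardown and on-disk reset.
      	 */
      	if (xfs_inode_hasattr(dp) &&
      	    dp->i_d.di_aformat != XFS_DINODE_FMT_LOCAL) {
      		error = xfs_attr3_root_inactive(&trans, dp);
      		if (error)
      			goto out_cancel;

      		error = xfs_itruncate_extents(&trans, dp, XFS_ATTR_FORK, 0);
      		if (error)
      			goto out_cancel;
      	}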
      
      cc: stable@vger.kernel.org # 3.12 to 4.0.x
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  4. 22 Jun 2015 (10 commits)
  5. 04 Jun 2015 (13 commits)
  6. 02 Jun 2015 (2 commits)
    • writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Committed by Tejun Heo
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
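
      Roughly, a file that needs the full API now looks like this (the file name
      is illustrative):

      	/* fs/example.c: needs the full backing-dev API, so it includes the
      	 * header directly instead of relying on blkdev.h to pull it in. */
      	#include <linux/backing-dev.h>
      	#include <linux/blkdev.h>	/* now only brings in backing-dev-defs.h */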
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • memcg: add per cgroup dirty page accounting · c4843a75
      Committed by Greg Thelen
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
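
      A slightly fuller sketch of the pattern, loosely modelled on the dirty-set
      path (__set_page_dirty_nobuffers()), where the stat update happens on the
      clean-to-dirty transition; the three-argument form of
      mem_cgroup_update_page_stat() shown here is an assumption, and details such
      as radix-tree tagging are omitted:

      	struct mem_cgroup *memcg;

      	memcg = mem_cgroup_begin_page_stat(page);
      	if (!TestSetPageDirty(page)) {
      		/* clean -> dirty: account globally and in the memcg */
      		__inc_zone_page_state(page, NR_FILE_DIRTY);
      		mem_cgroup_update_page_stat(memcg, MEM_CGROUP_STAT_DIRTY, 1);
      	}
      	mem_cgroup_end_page_stat(memcg);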
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 01 Jun 2015 (4 commits)
    • xfs: Clean up xfs_trans_dup_dqinfo · 339e4f66
      Committed by Nan Jia
      Fixed two missing spaces.
      Signed-off-by: Nan Jia <jiananmail@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't cast string literals · 39e56d92
      Committed by Eric Sandeen
      The commit:
      
      a9273ca5 xfs: convert attr to use unsigned names
      
      added these (unsigned char *) casts, but with the casts in place the
      _SIZE macros return 7 (the size of a pointer minus one) rather than
      the length of the string.  This is harmless in the kernel, because the _SIZE
      macros are not used, but as we sync up with userspace, this will
      matter.
      
      I don't think the cast is necessary; i.e. assigning the string
      literal to an unsigned char *, or passing it to a function
      expecting an unsigned char *, should be ok, right?
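
      A small worked example of why the cast breaks the _SIZE macros (macro
      names here are hypothetical; the real ones live in the attr headers):

      	#define GOOD_PREFIX		"user"
      	#define GOOD_PREFIX_SIZE	(sizeof(GOOD_PREFIX) - 1)	/* 4: string length */

      	#define BAD_PREFIX		((unsigned char *)"user")
      	#define BAD_PREFIX_SIZE		(sizeof(BAD_PREFIX) - 1)	/* 7 on 64-bit: pointer size - 1 */
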
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
    • xfs: fix quota block reservation leak when tp allocates and frees blocks · 7f884dc1
      Committed by Brian Foster
      Al Viro reports that generic/231 fails frequently on XFS and bisected
      the problem to the following commit:
      
      	5d11fb4b xfs: rework zero range to prevent invalid i_size updates
      
      ... which is just the first commit that happens to cause fsx to
      reproduce the problem. fsx reproduces via zero range calls. The
      aforementioned commit overhauls zero range to use hole punch and
      fallocate. As it turns out, the problem is reproducible on demand using
      basic hole punch as follows:
      
      $ mkfs.xfs -f -m crc=1,finobt=1 <dev>
      $ mount <dev> /mnt -o uquota
      $ xfs_io -f -c "falloc 0 50m" /mnt/file
      $ for i in $(seq 1 20); do xfs_io -c "fpunch ${i}m 32k" /mnt/file; done
      $ rm -f /mnt/file
      $ repquota -us /mnt
      ...
      User            used    soft    hard  grace    used  soft  hard  grace
      ----------------------------------------------------------------------
      root      --     32K      0K      0K              3     0     0
      
      A file is allocated with a single 50m extent. The extent count increases
      via hole punches until the bmap converts to btree format. The file is
      removed but quota reports 32k of space usage for the user. This
      reservation is effectively leaked for the lifetime of the mount.
      
      The reason this occurs is because the quota block reservation tracking
      is confused when a transaction happens to free and allocate blocks at
      the same time. Consider the following sequence of events:
      
      - tp is allocated from xfs_free_file_space() and reserves several blocks
        for btree management. Blocks are reserved against the dquot and marked
        as such in the transaction (qtrx->qt_blk_res).
      - 8 blocks are accounted free when the 32k range is punched out.
        xfs_trans_mod_dquot() is called with XFS_TRANS_DQ_BCOUNT and sets
        ->qt_bcount_delta to -8.
      - Subsequently, a block is allocated against the same transaction by
        xfs_bmap_extents_to_btree() for btree conversion. A call to
        xfs_trans_mod_dquot() increases qt_blk_res_used to 1 and qt_bcount_delta
        to -7.
      - The transaction is dup'd and committed by xfs_bmap_finish().
        xfs_trans_dup_dqinfo() sets the first transaction up such that it has a
        matching qt_blk_res and qt_blk_res_used of 1. The remaining unused
        reservation is transferred to the duplicate tp.
      
      When the transactions are committed, the dquots are fixed up in
      xfs_trans_apply_dquot_deltas() according to one of two methods:
      
      1.) If the transaction holds a block reservation (->qt_blk_res != 0),
      _only_ the unused portion reservation is unaccounted from the dquot.
      Note that the tp duplication behavior of xfs_bmap_finish() makes it such
      that qt_blk_res is typically 0 for tp's with unused reservation.
      2.) Otherwise, the dquot is fixed up based on the block delta
      (->qt_bcount_delta) created by the transaction.
      
      Therefore, if a transaction has a negative qt_bcount_delta and positive
      qt_blk_res_used, the former set of blocks that have been removed from
      the file are never factored out of the in-core dquot reservation.
      Instead, *_apply_dquot_deltas() sees 1 block used out of a 1 block
      reservation and believes there is nothing to fix up. The on-disk
      d_bcount is updated independently from qt_bcount_delta, and thus is
      correct (and allows the quota usage to correct on remount).
      
      To deal with this situation, we effectively want the "used reservation"
      part of the transaction to be consistent with any freed blocks with
      respect to quota tracking. For example, if 8 blocks are freed, the
      subsequent single block allocation does not need to consume the initial
      reservation made by the tp. Instead, it simply borrows one from the
      previously freed. One possible implementation of such borrowing is to
      avoid the blks_res_used increment when bcount_delta is negative. This
      alone is flawed logic in that it only handles the case where blocks are
      freed before allocated, however.
      
      Rather than add more complexity to manage synchronization between
      bcount_delta and blks_res_used, kill the latter entirely. blk_res_used
      is only updated in one place and always in sync with delta_bcount.
      Therefore, the net block reservation consumption of the transaction is
      always available from bcount_delta. Calculate the reservation
      consumption on the fly where necessary based on whether the tp has a
      reservation and results in a positive net block delta on the inode.
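
      A minimal sketch of the on-the-fly calculation (illustrative; the helper
      name is made up and the real code works directly on the qtrx fields):

      	/*
      	 * With qt_blk_res_used gone, the blocks consumed from a transaction's
      	 * reservation follow from the net count delta: a zero or negative
      	 * delta means the tp freed at least as many blocks as it allocated,
      	 * so none of the reservation was used.
      	 */
      	static int64_t
      	xfs_quota_blk_res_used(int64_t qt_blk_res, int64_t qt_bcount_delta)
      	{
      		if (!qt_blk_res || qt_bcount_delta <= 0)
      			return 0;
      		return qt_bcount_delta;
      	}
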
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
    • xfs: always log the inode on unwritten extent conversion · 2e588a46
      Committed by Brian Foster
      The fsync() requirements for crash consistency on XFS are to flush file
      data and force any in-core inode updates to the log. We currently check
      whether the inode is pinned to identify whether the log needs to be
      forced, since a non-zero pin count generally represents an inode that
      has transactions awaiting a flush to the on-disk log.
      
      This is not sufficient in all cases, however. Reports of xfstests test
      generic/311 failures on ppc64/s390x hosts have identified failures to
      fsync outstanding inode modifications due to the inode not being pinned
      at the time of the fsync. This occurs because certain bmap updates can
      complete by logging bmapbt buffers but without ever dirtying (and thus
      pinning) the core inode. The following is a specific incarnation of this
      problem:
      
      $ mount $dev /mnt -o noatime,nobarrier
      $ for i in $(seq 0 2 31); do \
              xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
      	done
      $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
      	hexdump /mnt/file; \
      	./xfstests-dev/src/godown /mnt
      ...
      0000000 0000 0000 0000 0000 0000 0000 0000 0000
      *
      0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      *
      0014000 0000 0000 0000 0000 0000 0000 0000 0000
      *
      00f8000
      $ umount /mnt; mount ...
      $ hexdump /mnt/file
      0000000 0000 0000 0000 0000 0000 0000 0000 0000
      *
      00f8000
      
      In short, the unwritten extent conversion for the last write is lost
      despite the fact that an fsync executed before the filesystem was
      shutdown. Note that this is impossible to reproduce on v5 supers due to
      unconditional time callbacks for di_changecount and highly difficult to
      reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
      frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
      reduces timer granularity enough to increase the odds that time updates
      are skipped and allows this to reproduce within a handful of attempts.
      
      To deal with this problem, unconditionally log the core in the unwritten
      extent conversion path. Fix up logflags after the extent conversion to
      keep the extent update code consistent with the other extent update
      helpers. This fixup is not necessary for the other (hole, delay) extent
      helpers because they execute in the block allocation codepath, which
      already logs the inode for other reasons (e.g., for di_nblocks).
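
      A sketch of the conversion-path change (illustrative; the exact logflags
      fixup in the real patch depends on the fork format and cursor state):

      	int logflags = XFS_ILOG_CORE;	/* always log the inode core */

      	if (whichfork == XFS_DATA_FORK && !cur)
      		logflags |= XFS_ILOG_DEXT;	/* in-core extent list changed */

      	xfs_trans_log_inode(tp, ip, logflags);
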
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  8. 29 May 2015 (7 commits)
    • xfs: enable sparse inode chunks for v5 superblocks · 22ce1e14
      Committed by Brian Foster
      Enable mounting of filesystems with sparse inode support enabled. Add
      the incompat. feature bit to the *_ALL mask.
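
      Illustratively, the change amounts to adding the sparse-inodes incompat
      bit to the mask of features the kernel is willing to mount (the bit value
      shown here is an assumption):

      	#define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
      	#define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
      	#define XFS_SB_FEAT_INCOMPAT_ALL \
      			(XFS_SB_FEAT_INCOMPAT_FTYPE | \
      			 XFS_SB_FEAT_INCOMPAT_SPINODES)
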
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() · 09b56604
      Committed by Brian Foster
      xfs_ifree_cluster() is called to mark all in-memory inodes and inode
      buffers as stale. This occurs after we've removed the inobt records and
      dropped any references of inobt data. xfs_ifree_cluster() uses the
      starting inode number to walk the namespace of inodes expected for a
      single chunk, one cluster buffer at a time. The cluster buffer disk
      addresses are calculated by decoding the sequential inode numbers
      expected from the chunk.
      
      The problem with this approach is that if the inode chunk being removed
      is a sparse chunk, not all of the buffer addresses that are calculated
      as part of this sequence may be inode clusters. Attempting to acquire
      the buffer based on expected inode characteristics (i.e., cluster length)
      can lead to errors and is generally incorrect.
      
      We already use a couple variables to carry requisite state from
      xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a
      new internal structure to carry the existing parameters through these
      functions. Add an alloc field that represents the physical allocation
      bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster()
      to check each inode against the bitmap and skip the clusters that were
      never allocated as real inodes on disk.
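
      A sketch of the per-cluster skip test (illustrative; the helper and its
      arguments are assumptions, with the allocation bitmap carried over from
      xfs_difree()):

      	/*
      	 * A cluster is either fully allocated or fully sparse, so the
      	 * allocation-bitmap bit of its first inode decides whether a
      	 * cluster buffer exists on disk at all.
      	 */
      	static bool
      	xfs_ifree_cluster_allocated(xfs_agino_t chunk_agino,	/* chunk start */
      				    xfs_agino_t cluster_agino,	/* cluster start */
      				    uint64_t alloc_bitmap)	/* one bit per inode */
      	{
      		unsigned int ioffset = cluster_agino - chunk_agino;

      		return (alloc_bitmap >> ioffset) & 1;
      	}
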
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: only free allocated regions of inode chunks · 10ae3dc7
      Committed by Brian Foster
      An inode chunk is currently added to the transaction free list based on
      a simple fsb conversion and hardcoded chunk length. The nature of sparse
      chunks is such that the physical chunk of inodes on disk may consist of
      one or more discontiguous parts. Blocks that reside in the holes of the
      inode chunk are not inodes and could be allocated to any other use or
      not allocated at all.
      
      Refactor the existing xfs_bmap_add_free() call into the
      xfs_difree_inode_chunk() helper. The new helper uses the existing
      calculation if a chunk is not sparse. Otherwise, use the inobt record
      holemask to free the contiguous regions of the chunk.
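
      A simplified sketch of the sparse branch (illustrative; the real helper
      coalesces adjacent allocated regions into as few frees as possible, and
      this version assumes each holemask bit spans at least one filesystem
      block):

      	static void
      	xfs_difree_sparse_chunk(struct xfs_mount *mp, struct xfs_bmap_free *flist,
      				xfs_fsblock_t chunk_fsb, uint16_t holemask)
      	{
      		/* fs blocks covered by one holemask bit (4 inodes per bit) */
      		int blks_per_bit = XFS_INODES_PER_HOLEMASK_BIT / mp->m_sb.sb_inopblock;
      		int bit;

      		for (bit = 0; bit < XFS_INOBT_HOLEMASK_BITS; bit++) {
      			if (holemask & (1 << bit))
      				continue;	/* hole: never allocated on disk */
      			xfs_bmap_add_free(chunk_fsb + bit * blks_per_bit,
      					  blks_per_bit, flist, mp);
      		}
      	}
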
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: filter out sparse regions from individual inode allocation · 26dd5217
      Committed by Brian Foster
      Inode allocation from an existing record with free inodes traditionally
      selects the first inode available according to the ir_free mask. With
      sparse inode chunks, the ir_free mask could refer to an unallocated
      region. We must mask the unallocated regions out of ir_free before using
      it to select a free inode in the chunk.
      
      Update the xfs_inobt_first_free_inode() helper to find the first free
      inode available within the allocated regions of the inode chunk.
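
      A sketch of the updated helper (close in spirit to the change;
      xfs_inobt_irec_to_allocmask() is the holemask-to-bitmap helper described
      in a later entry of this series):

      	STATIC int
      	xfs_inobt_first_free_inode(struct xfs_inobt_rec_incore *rec)
      	{
      		xfs_inofree_t realfree = rec->ir_free;

      		/* mask out never-allocated regions before picking a bit */
      		if (xfs_inobt_issparse(rec->ir_holemask))
      			realfree &= xfs_inobt_irec_to_allocmask(rec);

      		return xfs_lowbit64(realfree);
      	}
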
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: randomly do sparse inode allocations in DEBUG mode · 1cdadee1
      Committed by Brian Foster
      Sparse inode allocations generally only occur when full inode chunk
      allocation fails. This requires some level of filesystem space usage and
      fragmentation.
      
      For filesystems formatted with sparse inode chunks enabled, do random
      sparse inode chunk allocs when compiled in DEBUG mode to increase test
      coverage.
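
      Roughly what this looks like (illustrative; the 50% probability and the
      sparse_alloc label are assumptions):

      	#ifdef DEBUG
      		/* occasionally jump straight to the sparse allocation path on
      		 * DEBUG kernels so it sees coverage without real ENOSPC pressure */
      		if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
      		    (prandom_u32() & 1))
      			goto sparse_alloc;
      	#endif
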
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: allocate sparse inode chunks on full chunk allocation failure · 56d1115c
      Committed by Brian Foster
      xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
      chunk. If all else fails, reduce the allocation to the sparse length and
      alignment and attempt to allocate a sparse inode chunk.
      
      If sparse chunk allocation succeeds, check whether an inobt record
      already exists that can track the chunk. If so, inherit and update the
      existing record. Otherwise, insert a new record for the sparse chunk.
      
      Create helpers to align sparse chunk inode records and insert or update
      existing records in the inode btrees. The xfs_inobt_insert_sprec()
      helper implements the merge or update semantics required for sparse
      inode records with respect to both the inobt and finobt. To update the
      inobt, either insert a new record or merge with an existing record. To
      update the finobt, use the updated inobt record to either insert or
      replace an existing record.
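
      A sketch of how two sparse records covering the same chunk combine
      (illustrative; field names follow struct xfs_inobt_rec_incore, but this
      is not the literal merge routine):

      	static void
      	xfs_inobt_merge_sprec(struct xfs_inobt_rec_incore *dst,
      			      const struct xfs_inobt_rec_incore *src)
      	{
      		dst->ir_holemask &= src->ir_holemask;	/* a hole only where both have one */
      		dst->ir_free |= src->ir_free;		/* union of free inodes */
      		dst->ir_count += src->ir_count;		/* total inodes physically allocated */
      		dst->ir_freecount += src->ir_freecount;
      	}
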
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: helper to convert holemask to inode alloc. bitmap · 4148c347
      Committed by Brian Foster
      The inobt record holemask field is a condensed data type designed to fit
      into the existing on-disk record and is zero based (allocated regions
      are set to 0, sparse regions are set to 1) to provide backwards
      compatibility. This makes the type somewhat complex for use in higher
      level inode manipulations such as individual inode allocation, etc.
      
      Rather than foist the complexity of dealing with this field on every bit
      of logic that requires inode granular information, create a helper to
      convert the holemask to an inode allocation bitmap. The inode allocation
      bitmap is inode granularity similar to the inobt record free mask and
      indicates which inodes of the chunk are physically allocated on disk,
      irrespective of whether the inode is considered allocated or free by the
      filesystem.
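
      A compact sketch of the conversion (illustrative; the constants are the
      on-disk ones: 16 holemask bits, each covering 4 inodes of a 64-inode
      chunk):

      	static uint64_t
      	xfs_holemask_to_alloc_bitmap(uint16_t holemask)
      	{
      		const int ipb = XFS_INODES_PER_HOLEMASK_BIT;	/* 64 / 16 = 4 */
      		uint64_t bitmap = 0;
      		int bit;

      		for (bit = 0; bit < XFS_INOBT_HOLEMASK_BITS; bit++) {
      			if (holemask & (1 << bit))
      				continue;	/* sparse region: bits stay clear */
      			/* set ipb consecutive inode bits for this allocated region */
      			bitmap |= ((1ULL << ipb) - 1) << (bit * ipb);
      		}
      		return bitmap;
      	}
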
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>