1. 19 Aug, 2015 8 commits
    • xfs: fix broken icreate log item cancellation · fc0d1656
      Brian Foster committed
      Inode cluster buffers are invalidated and cancelled when inode chunks
      are freed to notify log recovery that previous logged updates to the
      metadata buffer should be skipped. This ensures that log recovery does
      not overwrite buffers that might have already been reused.
      
      On v4 filesystems, inode chunk allocation and inode updates are logged
      via the cluster buffers and thus cancellation is easily detected via
      buffer cancellation items. v5 filesystems use the new icreate
      transaction, which uses logical logging and ordered buffers to log a
      full inode chunk allocation at once. The resulting icreate item often
      spans multiple inode cluster buffers.
      
      Log recovery checks for cancelled buffers when processing icreate log
      items, but it has a couple of problems. First, it uses the full length of
      the inode chunk rather than the cluster size. Second, it uses the length
      in FSB units rather than BB units. Either of these problems prevents
      icreate recovery from identifying cancelled buffers and thus inode
      initialization proceeds unconditionally.
      
      Update xlog_recover_do_icreate_pass2() to iterate the icreate range in
      cluster sized increments and check each increment for cancellation.
      Since icreate is currently only used for the minimum atomic inode chunk
      allocation, we expect that either all or none of the buffers will be
      cancelled. Cancel the icreate if at least one buffer is cancelled to
      avoid making a bad situation worse by initializing a partial inode
      chunk, but detect such anomalies and warn the user.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
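The per-cluster check described in the commit above can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the kernel code: `struct cancel_rec`, `is_cancelled()` and `icreate_should_cancel()` are hypothetical stand-ins for the recovery code's buffer cancellation lookup, and all lengths are assumed to be in basic block (BB) units already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the recovery cancel table: one record per
 * cancelled buffer, keyed by start block and length (both in BB units). */
struct cancel_rec {
	uint64_t blkno;
	uint32_t len;
};

/* A buffer counts as cancelled only if both start and length match. */
static bool is_cancelled(const struct cancel_rec *tbl, size_t n,
			 uint64_t blkno, uint32_t len)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].blkno == blkno && tbl[i].len == len)
			return true;
	return false;
}

/*
 * Walk the icreate range in cluster-sized steps. Returns true if the
 * icreate should be skipped, i.e. at least one cluster buffer was
 * cancelled. *partial is set when only some clusters were cancelled,
 * the anomaly the commit says should produce a warning.
 */
static bool icreate_should_cancel(const struct cancel_rec *tbl, size_t n,
				  uint64_t start, uint32_t chunk_len,
				  uint32_t cluster_len, bool *partial)
{
	uint32_t nbufs, cancelled = 0;

	assert(cluster_len > 0 && chunk_len % cluster_len == 0);
	nbufs = chunk_len / cluster_len;

	for (uint32_t i = 0; i < nbufs; i++)
		if (is_cancelled(tbl, n, start + (uint64_t)i * cluster_len,
				 cluster_len))
			cancelled++;

	*partial = cancelled > 0 && cancelled < nbufs;
	return cancelled > 0;
}
```

Looking up the whole chunk length in one call, as the old check effectively did, never matches a per-cluster cancel record, which is why the loop steps by `cluster_len`.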
    • xfs: icreate log item recovery and cancellation tracepoints · 78d57e45
      Brian Foster committed
      Various log items have recovery tracepoints to identify whether a
      particular log item is recovered or cancelled. Add the equivalent
      tracepoints for the icreate transaction.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't leave EFIs on AIL on mount failure · f0b2efad
      Brian Foster committed
      Log recovery occurs in two phases at mount time. In the first phase,
      EFIs and EFDs are processed and potentially cancelled out. EFIs without
      EFD objects are inserted into the AIL for processing and recovery in the
      second phase. xfs_mountfs() runs various other operations between the
      phases and is thus subject to failure. If failure occurs after the first
      phase but before the second, pending EFIs sit on the AIL, pin it and
      cause the mount to hang.
      
      Update the mount sequence to ensure that pending EFIs are cancelled in
      the event of failure. Add a recovery cancellation mechanism to iterate
      the AIL and cancel all EFI items when requested. Plumb cancellation
      support through the log mount finish helper and update xfs_mountfs() to
      invoke cancellation in the event of failure after recovery has started.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: use EFI refcount consistently in log recovery · e32a1d1f
      Brian Foster committed
      The EFI is initialized with a reference count of 2. One for the EFI to
      ensure the item makes it to the AIL and one for the subsequently created
      EFD to release the EFI once the EFD is committed. Log recovery uses the
      EFI in a similar manner, but implements a hack to remove both references
      in one call once the EFD is handled.
      
      Update log recovery to use EFI reference counting in a manner consistent
      with the log. When an EFI is encountered during recovery, an EFI item is
      allocated and inserted to the AIL directly. Since the EFI reference is
      typically dropped when the EFI is unpinned and this is analogous with
      AIL insertion, drop the EFI reference at this point.
      
      When a corresponding EFD is encountered in the log, this indicates that
      the extents were freed, no processing is required and the EFI can be
      dropped. Update xlog_recover_efd_pass2() to simply drop the EFD
      reference at this point rather than open code the AIL removal and EFI
      free.
      
      Remaining EFIs (i.e., with no corresponding EFD) are processed in
      xlog_recover_finish(). An EFD transaction is allocated and the extents
      are freed, which transfers ownership of the EFI reference to the EFD
      item in the log.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: ensure EFD trans aborts on log recovery extent free failure · 6bc43af3
      Brian Foster committed
      Log recovery attempts to free extents with leftover EFIs in the AIL
      after initial processing. If the extent free fails (e.g., due to
      unrelated fs corruption), the transaction is cancelled, though it
      might not be dirtied at the time. If this is the case, the EFD does
      not abort and thus does not release the EFI. This can lead to hangs
      as the EFI pins the AIL.
      
      Update xlog_recover_process_efi() to log the EFD in the transaction
      before xfs_free_extent() errors are handled to ensure the
      transaction is dirty, aborts the EFD and releases the EFI on error.
      Since this is a requirement for EFD processing (and consistent with
      xfs_bmap_finish()), update the EFD logging helper to do the extent
      free and unconditionally log the EFD. This encodes the required EFD
      logging behavior into the helper and reduces the likelihood of
      errors down the road.
      
      [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
       failure.]
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
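The ordering the commit above relies on (log the EFD before looking at the extent free result) can be illustrated with a small model. Everything here is a hypothetical simplification: `struct trans`, `struct efd`, `free_extent_and_log()` and the two `free_*` callbacks are illustrative names, not the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal transaction and EFD state. */
struct trans {
	bool dirty;		/* set once any item has been logged */
};

struct efd {
	unsigned int next_extent;	/* extents logged so far */
};

typedef int (*free_fn)(uint64_t start, uint32_t len);

/* Stand-ins for an extent free that succeeds or fails. */
static int free_ok(uint64_t start, uint32_t len)
{
	(void)start; (void)len;
	return 0;
}

static int free_fail(uint64_t start, uint32_t len)
{
	(void)start; (void)len;
	return -1;
}

/*
 * Free the extent and log it in the EFD unconditionally, so the
 * transaction is dirty even when the free fails. The error is returned
 * only after the EFD has been logged.
 */
static int free_extent_and_log(struct trans *tp, struct efd *efd,
			       free_fn do_free, uint64_t start, uint32_t len)
{
	int error = do_free(start, len);

	tp->dirty = true;
	efd->next_extent++;
	return error;
}

/* In this model, cancelling only runs item abort handlers (and so lets
 * the EFD release its EFI) when the transaction is dirty. */
static bool cancel_aborts_items(const struct trans *tp)
{
	return tp->dirty;
}
```

With this ordering a failed free still leaves the transaction dirty, so cancelling it reaches the EFD's abort path instead of leaving the EFI pinned in the AIL.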
    • xfs: fix efi/efd error handling to avoid fs shutdown hangs · 8d99fe92
      Brian Foster committed
      Freeing an extent in XFS involves logging an EFI (extent free
      intention), freeing the actual extent, and logging an EFD (extent
      free done). The EFI object is created with a reference count of 2:
      one for the current transaction and one for the subsequently created
      EFD. Under normal circumstances, the first reference is dropped when
      the EFI is unpinned and the second reference is dropped when the EFD
      is committed to the on-disk log.
      
      In the event of errors or filesystem shutdown, there are various
      potential cleanup scenarios depending on the state of the EFI/EFD.
      The cleanup scenarios are confusing and racy, as demonstrated by the
      following test sequence:
      
      	# mount $dev $mnt
      	# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
      		-f punch=1 -f creat=1 -f unlink=1 &
      	# sleep 5
      	# killall -9 fsstress; wait
      	# godown -f $mnt
      	# umount
      
      ... in which the final umount can hang due to the AIL being pinned
      indefinitely by one or more EFI items. This can occur due to several
      conditions. For example, if the shutdown occurs after the EFI is
      committed to the on-disk log and the EFD is committed to the CIL, but
      before the EFD is committed to the log, the EFD iop_committed() abort
      handler does not drop its reference to the EFI. Alternatively,
      manual error injection in the xfs_bmap_finish() codepath shows that
      if an error occurs after the EFI transaction is committed but before
      the EFD is constructed and logged, the EFI is never released from
      the AIL.
      
      Update the EFI/EFD item handling code to use a more straightforward
      and reliable approach to error handling. If an error occurs after
      the EFI transaction is committed and before the EFD is constructed,
      release the EFI explicitly from xfs_bmap_finish(). If the EFI
      transaction is cancelled, release the EFI in the unlock handler.
      
      Once the EFD is constructed, it is responsible for releasing the EFI
      under any circumstances (including whether the EFI item aborts due
      to log I/O error). Update the EFD item handlers to release the EFI
      if the transaction is cancelled or aborts due to log I/O error.
      Finally, update xfs_bmap_finish() to log at least one EFD extent to
      the transaction before xfs_free_extent() errors are handled to
      ensure the transaction is dirty and EFD item error handling is
      triggered.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
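The two-reference lifetime described in the commit above can be modelled in a few lines of C. This is a sketch only: `struct efi`, `efi_init()`, `efi_release()`, `efd_finish()` and the `efd_outcome` enum are hypothetical names chosen for illustration, and "freed" stands in for removal from the AIL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified EFI: a reference count plus a flag that
 * records whether the item has been freed (removed from the AIL). */
struct efi {
	int refcount;
	bool freed;
};

/* An EFI starts with two references: one dropped when the EFI is
 * unpinned, one owned by the EFD that is created later. */
static void efi_init(struct efi *efi)
{
	efi->refcount = 2;
	efi->freed = false;
}

/* Drop one reference; the final drop frees the item. */
static void efi_release(struct efi *efi)
{
	assert(efi->refcount > 0);
	if (--efi->refcount == 0)
		efi->freed = true;
}

enum efd_outcome { EFD_COMMITTED, EFD_CANCELLED, EFD_ABORTED };

/* Once an EFD exists, it drops its EFI reference however its
 * transaction ends: commit, cancel, or abort on log I/O error. */
static void efd_finish(struct efi *efi, enum efd_outcome outcome)
{
	(void)outcome;	/* every outcome releases the EFI */
	efi_release(efi);
}
```

In this model the error window before the EFD exists is covered by the caller invoking `efi_release()` directly, matching the explicit release from xfs_bmap_finish() that the commit message describes.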
    • xfs: return committed status from xfs_trans_roll() · d43ac29b
      Brian Foster committed
      Some callers need to make error handling decisions based on whether
      the current transaction successfully committed or not. Rename
      xfs_trans_roll(), add a new parameter and provide a wrapper to
      preserve existing callers.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
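The rename-plus-wrapper pattern from the commit above looks roughly like this. The names and the toy `struct trans` are assumptions made for illustration; the real function operates on kernel transaction structures and its exact signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy transaction: flags that force a failure in either
 * the commit step or the follow-up reservation step of a roll. */
struct trans {
	bool fail_commit;
	bool fail_reserve;
};

/*
 * Extended variant: *committed reports whether the commit step
 * succeeded, even when a later step of the roll fails.
 */
static int trans_roll_committed(struct trans *tp, bool *committed)
{
	*committed = false;
	if (tp->fail_commit)
		return -1;
	*committed = true;
	if (tp->fail_reserve)
		return -1;
	return 0;
}

/* Wrapper that keeps the old signature for callers that do not care
 * about the committed status. */
static int trans_roll(struct trans *tp)
{
	bool committed;

	return trans_roll_committed(tp, &committed);
}
```

A caller that must distinguish "failed before commit" from "failed after commit" uses the extended variant; everyone else keeps calling the wrapper unchanged.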
    • xfs: disentangle EFI release from the extent count · 5e4b5386
      Brian Foster committed
      Release of the EFI either occurs based on the reference count or the
      extent count. The extent count used is either the count tracked in
      the EFI or EFD, depending on the particular situation. In either
      case, the count is initialized to the final value and thus always
      matches the current efi_next_extent value once the EFI is completely
      constructed.  For example, the EFI extent count is increased as the
      extents are logged in xfs_bmap_finish() and the full free list is
      always completely processed. Therefore, the count is guaranteed to
      be complete once the EFI transaction is committed. The EFD uses the
      efd_nextents counter to release the EFI. This counter is initialized
      to the count of the EFI when the EFD is created. Thus the EFD, as
      currently used, has no concept of partial EFI release based on
      extent count.
      
      Given that the EFI extent count is always released in whole, use of
      the extent count for reference counting is unnecessary. Remove this
      level of the API and release the EFI based on the core reference
      count. The efi_next_extent counter remains because it is still used
      to track the slot to log the next extent to free.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 24 Jun, 2015 2 commits
  3. 23 Jun, 2015 1 commit
    • xfs: don't truncate attribute extents if no extents exist · f66bf042
      Brian Foster committed
      The xfs_attr3_root_inactive() call from xfs_attr_inactive() assumes that
      attribute blocks exist to invalidate. It is possible to have an
      attribute fork without extents, however. Consider the case where the
      attribute fork is created towards the beginning of xfs_attr_set() but
      some part of the subsequent attribute set fails.
      
      If an inode in such a state hits xfs_attr_inactive(), it eventually
      calls xfs_dabuf_map() and possibly xfs_bmapi_read(). The former emits a
      filesystem corruption warning, returns an error that bubbles back up to
      xfs_attr_inactive(), and leads to destruction of the in-core attribute
      fork without an on-disk reset. If the inode happens to make it back
      through xfs_inactive() in this state (e.g., via a concurrent bulkstat
      that cycles the inode from the reclaim state and releases it), i_afp
      might not exist when xfs_bmapi_read() is called and causes a NULL
      dereference panic.
      
      A '-p 2' fsstress run to ENOSPC on a relatively small fs (1GB)
      reproduces these problems. The behavior is a regression caused by:
      
      6dfe5a04 xfs: xfs_attr_inactive leaves inconsistent attr fork state behind
      
      ... which removed logic that avoided the attribute extent truncate when
      no extents exist. Restore this logic to ensure the attribute fork is
      destroyed and reset correctly if it exists without any allocated
      extents.
      
      cc: stable@vger.kernel.org # 3.12 to 4.0.x
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  4. 22 Jun, 2015 10 commits
  5. 04 Jun, 2015 13 commits
  6. 02 Jun, 2015 2 commits
    • writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo committed
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen committed
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics; they are all wall clock or cycle counts.  The read and write
      fault benchmarks measure only fault time and do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected, anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 01 Jun, 2015 4 commits
    • xfs: Clean up xfs_trans_dup_dqinfo · 339e4f66
      Nan Jia committed
      Fixed two missing spaces.
      Signed-off-by: Nan Jia <jiananmail@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: don't cast string literals · 39e56d92
      Eric Sandeen committed
      The commit:
      
      a9273ca5 xfs: convert attr to use unsigned names
      
      added these (unsigned char *) casts, but then the _SIZE macros
      return "7" - size of a pointer minus one - not the length of
      the string.  This is harmless in the kernel, because the _SIZE
      macros are not used, but as we sync up with userspace, this will
      matter.
      
      I don't think the cast is necessary; i.e. assigning the string
      literal to an unsigned char *, or passing it to a function
      expecting an unsigned char *, should be ok, right?
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
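The `sizeof` pitfall described in the commit above is easy to reproduce in isolation. The macro names and the string below are hypothetical and chosen only for illustration; the point is that `sizeof` sees a pointer once the literal is cast, and an array when it is not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute name, with and without the pointer cast. */
#define NAME_CAST	((unsigned char *)"EXAMPLE_ATTR_NAME")
#define NAME_PLAIN	"EXAMPLE_ATTR_NAME"

/* sizeof on the cast expression measures the pointer, so this is
 * "size of a pointer minus one" regardless of the string. */
#define NAME_CAST_SIZE	(sizeof(NAME_CAST) - 1)

/* sizeof on the bare literal measures the char array including the
 * terminating NUL, so subtracting one gives the string length. */
#define NAME_PLAIN_SIZE	(sizeof(NAME_PLAIN) - 1)

static size_t cast_size(void)
{
	return NAME_CAST_SIZE;
}

static size_t plain_size(void)
{
	return NAME_PLAIN_SIZE;
}
```

On a platform with 8-byte pointers `cast_size()` yields 7, the value the commit message mentions, while `plain_size()` yields the 17 characters of the example string.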
    • xfs: fix quota block reservation leak when tp allocates and frees blocks · 7f884dc1
      Brian Foster committed
      Al Viro reports that generic/231 fails frequently on XFS and bisected
      the problem to the following commit:
      
      	5d11fb4b xfs: rework zero range to prevent invalid i_size updates
      
      ... which is just the first commit that happens to cause fsx to
      reproduce the problem. fsx reproduces via zero range calls. The
      aforementioned commit overhauls zero range to use hole punch and
      fallocate. As it turns out, the problem is reproducible on demand using
      basic hole punch as follows:
      
      $ mkfs.xfs -f -m crc=1,finobt=1 <dev>
      $ mount <dev> /mnt -o uquota
      $ xfs_io -f -c "falloc 0 50m" /mnt/file
      $ for i in $(seq 1 20); do xfs_io -c "fpunch ${i}m 32k" /mnt/file; done
      $ rm -f /mnt/file
      $ repquota -us /mnt
      ...
      User            used    soft    hard  grace    used  soft  hard  grace
      ----------------------------------------------------------------------
      root      --     32K      0K      0K              3     0     0
      
      A file is allocated with a single 50m extent. The extent count increases
      via hole punches until the bmap converts to btree format. The file is
      removed but quota reports 32k of space usage for the user. This
      reservation is effectively leaked for the lifetime of the mount.
      
      This occurs because the quota block reservation tracking is confused
      when a transaction happens to free and allocate blocks at the same
      time. Consider the following sequence of events:
      
      - tp is allocated from xfs_free_file_space() and reserves several blocks
        for btree management. Blocks are reserved against the dquot and marked
        as such in the transaction (qtrx->qt_blk_res).
      - 8 blocks are accounted free when the 32k range is punched out.
        xfs_trans_mod_dquot() is called with XFS_TRANS_DQ_BCOUNT and sets
        ->qt_bcount_delta to -8.
      - Subsequently, a block is allocated against the same transaction by
        xfs_bmap_extents_to_btree() for btree conversion. A call to
        xfs_trans_mod_dquot() increases qt_blk_res_used to 1 and qt_bcount_delta
        to -7.
      - The transaction is dup'd and committed by xfs_bmap_finish().
        xfs_trans_dup_dqinfo() sets the first transaction up such that it has a
        matching qt_blk_res and qt_blk_res_used of 1. The remaining unused
        reservation is transferred to the duplicate tp.
      
      When the transactions are committed, the dquots are fixed up in
      xfs_trans_apply_dquot_deltas() according to one of two methods:
      
      1.) If the transaction holds a block reservation (->qt_blk_res != 0),
      _only_ the unused portion of the reservation is unaccounted from the
      dquot. Note that the tp duplication behavior of xfs_bmap_finish() makes
      it such that qt_blk_res is typically 0 for tp's with unused reservation.
      2.) Otherwise, the dquot is fixed up based on the block delta
      (->qt_bcount_delta) created by the transaction.
      
      Therefore, if a transaction has a negative qt_bcount_delta and positive
      qt_blk_res_used, the former set of blocks that have been removed from
      the file are never factored out of the in-core dquot reservation.
      Instead, *_apply_dquot_deltas() sees 1 block used out of a 1 block
      reservation and believes there is nothing to fix up. The on-disk
      d_bcount is updated independently from qt_bcount_delta, and thus is
      correct (and allows the quota usage to correct on remount).
      
      To deal with this situation, we effectively want the "used reservation"
      part of the transaction to be consistent with any freed blocks with
      respect to quota tracking. For example, if 8 blocks are freed, the
      subsequent single block allocation does not need to consume the initial
      reservation made by the tp. Instead, it simply borrows one from the
      previously freed blocks. One possible implementation of such borrowing
      is to avoid the blk_res_used increment when bcount_delta is negative.
      This alone is flawed logic in that it only handles the case where blocks
      are freed before blocks are allocated, however.
      
      Rather than add more complexity to manage synchronization between
      bcount_delta and blk_res_used, kill the latter entirely. blk_res_used
      is only updated in one place and always in sync with bcount_delta.
      Therefore, the net block reservation consumption of the transaction is
      always available from bcount_delta. Calculate the reservation
      consumption on the fly where necessary based on whether the tp has a
      reservation and results in a positive net block delta on the inode.
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
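The "calculate on the fly" rule from the final paragraph of the commit above can be written as a tiny helper. This is a sketch under assumptions: `struct dq_trx` is a trimmed-down, hypothetical version of the per-dquot transaction state, and `blk_res_used()` is an illustrative name rather than a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down per-dquot transaction state. */
struct dq_trx {
	uint64_t blk_res;	/* blocks reserved by the transaction */
	int64_t  bcount_delta;	/* net blocks allocated (+) or freed (-) */
};

/*
 * Reservation consumed by the transaction, derived from the net block
 * delta instead of a separately maintained counter. Reservation is
 * only consumed when the tp holds one and the net delta is positive.
 */
static uint64_t blk_res_used(const struct dq_trx *q)
{
	if (q->blk_res != 0 && q->bcount_delta > 0)
		return (uint64_t)q->bcount_delta;
	return 0;
}
```

For the sequence in the commit message (8 blocks freed, then 1 allocated) the net delta is -7, so no reservation is treated as consumed and the freed blocks are no longer hidden behind a "fully used" reservation.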
    • xfs: always log the inode on unwritten extent conversion · 2e588a46
      Brian Foster committed
      The fsync() requirements for crash consistency on XFS are to flush file
      data and force any in-core inode updates to the log. We currently check
      whether the inode is pinned to identify whether the log needs to be
      forced, since a non-zero pin count generally represents an inode that
      has transactions awaiting a flush to the on-disk log.
      
      This is not sufficient in all cases, however. Reports of xfstests test
      generic/311 failures on ppc64/s390x hosts have identified failures to
      fsync outstanding inode modifications due to the inode not being pinned
      at the time of the fsync. This occurs because certain bmap updates can
      complete by logging bmapbt buffers but without ever dirtying (and thus
      pinning) the core inode. The following is a specific incarnation of this
      problem:
      
      $ mount $dev /mnt -o noatime,nobarrier
      $ for i in $(seq 0 2 31); do \
              xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
      	done
      $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
      	hexdump /mnt/file; \
      	./xfstests-dev/src/godown /mnt
      ...
      0000000 0000 0000 0000 0000 0000 0000 0000 0000
      *
      0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      *
      0014000 0000 0000 0000 0000 0000 0000 0000 0000
      *
      00f8000
      $ umount /mnt; mount ...
      $ hexdump /mnt/file
      0000000 0000 0000 0000 0000 0000 0000 0000 0000
      *
      00f8000
      
      In short, the unwritten extent conversion for the last write is lost
      despite the fact that an fsync executed before the filesystem was
      shut down. Note that this is impossible to reproduce on v5 supers due to
      unconditional time callbacks for di_changecount and highly difficult to
      reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
      frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
      reduces timer granularity enough to increase the odds that time updates
      are skipped and allows this to reproduce within a handful of attempts.
      
      To deal with this problem, unconditionally log the core in the unwritten
      extent conversion path. Fix up logflags after the extent conversion to
      keep the extent update code consistent with the other extent update
      helpers. This fixup is not necessary for the other (hole, delay) extent
      helpers because they execute in the block allocation codepath, which
      already logs the inode for other reasons (e.g., for di_nblocks).
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>