1. 15 Jun, 2012 4 commits
    • xfs: make largest supported offset less shouty · 32972383
      Committed by Dave Chinner
      XFS_MAXIOFFSET() is just a simple macro that resolves to
      mp->m_maxioffset. It doesn't need to exist, and it just makes the
      code unnecessarily loud and shouty.
      
      Make it quiet and easy to read.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: m_maxioffset is redundant · d2c28191
      Committed by Dave Chinner
      The m_maxioffset field in the struct xfs_mount contains the same
      value as the superblock s_maxbytes field. There is no need to carry
      two copies of this limit around, so use the VFS superblock version.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix debug_object WARN at xfs_alloc_vextent() · 0f2cf9d3
      Committed by Jeff Liu
      Fengguang reports:
      
      [  780.529603] XFS (vdd): Ending clean mount
      [  781.454590] ODEBUG: object is on stack, but not annotated
      [  781.455433] ------------[ cut here ]------------
      [  781.455433] WARNING: at /c/kernel-tests/sound/lib/debugobjects.c:301 __debug_object_init+0x173/0x1f1()
      [  781.455433] Hardware name: Bochs
      [  781.455433] Modules linked in:
      [  781.455433] Pid: 26910, comm: kworker/0:2 Not tainted 3.4.0+ #51
      [  781.455433] Call Trace:
      [  781.455433]  [<ffffffff8106bc84>] warn_slowpath_common+0x83/0x9b
      [  781.455433]  [<ffffffff8106bcb6>] warn_slowpath_null+0x1a/0x1c
      [  781.455433]  [<ffffffff814919a5>] __debug_object_init+0x173/0x1f1
      [  781.455433]  [<ffffffff81491c65>] debug_object_init+0x14/0x16
      [  781.455433]  [<ffffffff8108842a>] __init_work+0x20/0x22
      [  781.455433]  [<ffffffff8134ea56>] xfs_alloc_vextent+0x6c/0xd5
      
      Use INIT_WORK_ONSTACK in xfs_alloc_vextent instead of INIT_WORK.
      Reported-by: Wu Fengguang <wfg@linux.intel.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: xfs_vm_writepage clear iomap_valid when !buffer_uptodate (REV2) · 7d0fa3ec
      Committed by Alain Renaud
      On filesystems with a block size smaller than PAGE_SIZE we currently have
      a problem with unwritten extents.  If we have a multi-block page for
      which an unwritten extent has been allocated, and only some of the
      buffers have been written to, and they are not contiguous, we can expose
      stale data from disk in the blocks between the writes after extent
      conversion.
      
      Example of a page with unwritten and real data.
      buffer  content
      0       empty  b_state = 0
      1       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      2       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      3       empty  b_state = 0
      4       empty  b_state = 0
      5       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      6       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      7       empty  b_state = 0
      
      Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7
      empty.  Currently buffers 1, 2, 5, and 6 are added to a single ioend,
      and when IO has completed, extent conversion creates a real extent from
      block 1 through block 6, leaving 0 and 7 unwritten.  However buffers 3
      and 4 were not written to disk, so stale data is exposed from those
      blocks on a subsequent read.
      
      Fix this by setting iomap_valid = 0 when we find a buffer that is not
      Uptodate.  This ensures that buffers 5 and 6 are not added to the same
      ioend as buffers 1 and 2.  Later these blocks will be converted into two
      separate real extents, leaving the blocks in between unwritten.
      Signed-off-by: Alain Renaud <arenaud@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
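The grouping rule in the commit above can be illustrated with a small userspace model. This is only a sketch of the decision being described, not the kernel code: `count_ioend_runs`, `reset_on_not_uptodate` and `example_page` are hypothetical names, and a "run" stands in for an ioend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of one page's buffers: uptodate[i] is true when buffer i holds
 * written data. Returns how many separate runs (stand-ins for ioends) the
 * written buffers are grouped into.
 *
 * reset_on_not_uptodate = false models the old behaviour: the cached
 * mapping stays valid across unwritten gaps, so all written buffers share
 * one run. reset_on_not_uptodate = true models the fix: a buffer that is
 * not uptodate invalidates the mapping, so the next written buffer starts
 * a new run.
 */
static size_t count_ioend_runs(const bool *uptodate, size_t nbufs,
                               bool reset_on_not_uptodate)
{
    bool mapping_valid = false;
    size_t runs = 0;

    assert(uptodate != NULL || nbufs == 0);

    for (size_t i = 0; i < nbufs; i++) {
        if (!uptodate[i]) {
            if (reset_on_not_uptodate)
                mapping_valid = false;
            continue;
        }
        if (!mapping_valid) {
            runs++;
            mapping_valid = true;
        }
    }
    return runs;
}

/* The page from the example: buffers 1, 2, 5 and 6 are written. */
static const bool example_page[8] = {
    false, true, true, false, false, true, true, false
};
```

With the example page, the old rule yields one run spanning the unwritten gap, while the reset rule yields two runs (buffers 1-2 and buffers 5-6), which is the behaviour the commit message asks for.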
  2. 02 Jun, 2012 1 commit
    • fs: introduce inode operation ->update_time · c3b2da31
      Committed by Josef Bacik
      Btrfs has to make sure we have space to allocate new blocks in order to modify
      the inode, so updating time can fail.  We've gotten around this by having our
      own file_update_time but this is kind of a pain, and Christoph has indicated he
      would like to make xfs do something different with atime updates.  So introduce
      ->update_time, where we will deal with i_version and a/m/c time updates and
      indicate which changes need to be made.  The normal version just does what it
      has always done, updates the time and marks the inode dirty, and then
      filesystems can choose to do something different.
      
      I've gone through all of the users of file_update_time and made them check for
      errors with the exception of the fault code since it's complicated and I wasn't
      quite sure what to do there, also Jan is going to be pushing the file time
      updates into page_mkwrite for those who have it so that should satisfy btrfs and
      make it not a big deal to check the file_update_time() return code in the
      generic fault path. Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  3. 30 May, 2012 2 commits
  4. 21 May, 2012 3 commits
    • xfs: add trace points for log forces · 14c26c6a
      Committed by Dave Chinner
      To enable easy tracing of the location of log forces and the
      frequency of them via perf, add a pair of trace points to the log
      force functions.  This will help debug where excessive log forces
      are being issued from by simple perf commands like:
      
      # ~/perf/perf top -e xfs:xfs_log_force -G -U
      
      Which gives this sort of output:
      
      Events: 141  xfs:xfs_log_force
      -  100.00%  [kernel]  [k] xfs_log_force
         - xfs_log_force
              87.04% xfsaild
                 kthread
                 kernel_thread_helper
            - 12.87% xfs_buf_lock
                 _xfs_buf_find
                 xfs_buf_get
                 xfs_trans_get_buf
                 xfs_da_do_buf
                 xfs_da_get_buf
                 xfs_dir2_data_init
                 xfs_dir2_leaf_addname
                 xfs_dir_createname
                 xfs_create
                 xfs_vn_mknod
                 xfs_vn_create
                 vfs_create
                 do_last.isra.41
                 path_openat
                 do_filp_open
                 do_sys_open
                 sys_open
                 system_call_fastpath
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sig.com>
    • xfs: fix memory reclaim deadlock on agi buffer · 3ba31603
      Committed by Peter Watkins
      Note xfs_iget can be called while holding a locked agi buffer. If
      it goes into memory reclaim then inode teardown may try to lock the
      same buffer. Prevent the deadlock by calling radix_tree_preload
      with GFP_NOFS.
      Signed-off-by: Peter Watkins <treestem@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix delalloc quota accounting on failure · ea562ed6
      Committed by Dave Chinner
      xfstest 270 was causing quota reservations way beyond what was sane
      (ten to hundreds of TB) for a 4GB filesystem. There's a sign problem
      in the error handling path of xfs_bmapi_reserve_delalloc() because
      xfs_trans_unreserve_quota_nblks() simply negates the value passed -
      which doesn't work for an unsigned variable. This causes
      reservations of close to 2^32 blocks instead of removing a
      reservation of a handful of blocks.
      
      Fix the same problem in the other xfs_trans_unreserve_quota_nblks()
      callers where unsigned integer variables are used, too.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
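The sign problem described in the commit above is easy to reproduce in plain C. The following is a minimal sketch, assuming a 32-bit `int` and a 64-bit signed delta; `apply_delta`, `unreserve_buggy` and `unreserve_fixed` are hypothetical names, not XFS functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a reservation counter adjusted by a signed delta. */
static int64_t apply_delta(int64_t reserved, int64_t delta)
{
    return reserved + delta;
}

/* Buggy pattern: negate a 32-bit unsigned count, then widen it. */
static int64_t unreserve_buggy(int64_t reserved, uint32_t nblks)
{
    /* -nblks is computed modulo 2^32, so it is a large positive value. */
    return apply_delta(reserved, -nblks);
}

/* Fixed pattern: widen to a signed type first, then negate. */
static int64_t unreserve_fixed(int64_t reserved, uint32_t nblks)
{
    return apply_delta(reserved, -(int64_t)nblks);
}
```

Starting from a reservation of 100 blocks, "unreserving" 4 blocks with the buggy pattern adds 2^32 - 4 blocks instead of removing 4, which matches the "close to 2^32 blocks" symptom reported in the commit message.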
  5. 16 May, 2012 1 commit
    • xfs: protect xfs_sync_worker with s_umount semaphore · 1307bbd2
      Committed by Ben Myers
      xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing
      work during mount and unmount.  This flag can be cleared by unmount
      after the xfs_sync_worker checks it but before the work is completed.
      This has caused crashes in the completion handler for the dummy
      transaction committed by xfs_sync_worker:
      
      PID: 27544  TASK: ffff88013544e040  CPU: 3   COMMAND: "kworker/3:0"
       #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9
       #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053
       #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8
       #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48
       #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d
       #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e
       #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee
       #7 [ffff88016fdffc60] page_fault at ffffffff813ac635
          [exception RIP: xlog_get_lowest_lsn+0x30]
          RIP: ffffffffa04a9910  RSP: ffff88016fdffd10  RFLAGS: 00010246
          RAX: ffffc90014e48000  RBX: ffff88014d879980  RCX: ffff88014d879980
          RDX: ffff8802214ee4c0  RSI: 0000000000000000  RDI: 0000000000000000
          RBP: ffff88016fdffd10   R8: ffff88014d879a80   R9: 0000000000000000
          R10: 0000000000000001  R11: 0000000000000000  R12: ffff8802214ee400
          R13: ffff88014d879980  R14: 0000000000000000  R15: ffff88022fd96605
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs]
       #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs]
      
      Protect xfs_sync_worker by using the s_umount semaphore at the read
      level to provide exclusion with unmount while work is progressing.
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  6. 15 May, 2012 29 commits
    • xfs: introduce SEEK_DATA/SEEK_HOLE support · 3fe3e6b1
      Committed by Jeff Liu
      This patch adds lseek(2) SEEK_DATA/SEEK_HOLE functionality to xfs.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
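From userspace, the functionality added by the commit above is reached through lseek(2). The sketch below shows one way an application might walk the data regions of a file on Linux; `next_data_region` and `count_data_bytes` are hypothetical helper names, and the fallback definitions of `SEEK_DATA` and `SEEK_HOLE` use the Linux values in case the C library headers do not expose them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3   /* Linux value */
#define SEEK_HOLE 4   /* Linux value */
#endif

/*
 * Find the first data region at or after 'start'.
 * On success returns 0 and stores the region as [*data, *hole).
 * Returns -1 if lseek() fails, e.g. with ENXIO when there is no data at or
 * after 'start'.
 */
static int next_data_region(int fd, off_t start, off_t *data, off_t *hole)
{
    off_t d, h;

    assert(data != NULL && hole != NULL);

    d = lseek(fd, start, SEEK_DATA);
    if (d == (off_t)-1)
        return -1;
    h = lseek(fd, d, SEEK_HOLE);
    if (h == (off_t)-1)
        return -1;
    *data = d;
    *hole = h;
    return 0;
}

/* Sum the bytes covered by data regions in the file at 'path'; -1 on error. */
static off_t count_data_bytes(const char *path)
{
    off_t pos = 0, total = 0, data, hole;
    int saved_errno;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while (next_data_region(fd, pos, &data, &hole) == 0) {
        if (hole <= data) {      /* defensive: never loop without progress */
            errno = EIO;
            break;
        }
        total += hole - data;
        pos = hole;
    }
    saved_errno = errno;         /* close() may overwrite errno */
    close(fd);
    /* ENXIO means "no more data", which is the normal end of the walk. */
    return saved_errno == ENXIO ? total : -1;
}
```

On a filesystem without native support the kernel's generic fallback treats the whole file as data, so the helper still returns a sensible upper bound rather than failing.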
    • xfs: make xfs_extent_busy_trim not static · e700a06c
      Committed by Ben Myers
      Commit e459df5, 'xfs: move busy extent handling to it's own file'
      moved some code from xfs_alloc.c into xfs_extent_busy.c for
      convenience in userspace code merges.  One of the functions moved is
      xfs_extent_busy_trim (formerly xfs_alloc_busy_trim) which is defined
      STATIC.  Unfortunately this function is still used in xfs_alloc.c, and
      this results in an undefined symbol in xfs.ko.
      
      Make xfs_extent_busy_trim not static and add its prototype to
      xfs_extent_busy.h.
      Signed-off-by: Ben Myers <bpm@sgi.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
    • xfs: make XBF_MAPPED the default behaviour · 611c9946
      Committed by Dave Chinner
      Rather than specifying XBF_MAPPED for almost all buffers, introduce
      XBF_UNMAPPED for the couple of users that use unmapped buffers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: flush outstanding buffers on log mount failure · d4f3512b
      Committed by Dave Chinner
      When we fail to mount the log in xfs_mountfs(), we tear down all the
      infrastructure we have already allocated. However, the process of
      mounting the log may have progressed to the point of reading,
      caching and modifying buffers in memory. Hence before we can free
      all the infrastructure, we have to flush and remove all the buffers
      from memory.
      
      Problem first reported by Eric Sandeen, later a different incarnation
      was reported by Ben Myers.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Properly exclude IO type flags from buffer flags · 12bcb3f7
      Committed by Dave Chinner
      Recent event tracing during a debugging session showed that flags
      that define the IO type for a buffer are leaking into the flags on
      the buffer incorrectly. Fix the flag exclusion mask in
      xfs_buf_alloc() to avoid problems that may be caused by such
      leakage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: clean up xfs_bit.h includes · ad1e95c5
      Committed by Dave Chinner
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: move xfs_do_force_shutdown() and kill xfs_rw.c · 2af51f3a
      Committed by Dave Chinner
      xfs_do_force_shutdown now is the only thing in xfs_rw.c. There is no
      need to keep it in its own file anymore, so move it to xfs_fsops.c
      next to xfs_fs_goingdown() and kill xfs_rw.c.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: move xfs_get_extsz_hint() and kill xfs_rw.h · 2a0ec1d9
      Committed by Dave Chinner
      The only thing left in xfs_rw.h is a function prototype for an inode
      function.  Move that to xfs_inode.h, and kill xfs_rw.h.
      
      Also move the function implementing the prototype from xfs_rw.c to
      xfs_inode.c so we only have one function left in xfs_rw.c
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: move xfs_fsb_to_db to xfs_bmap.h · fd50092c
      Committed by Dave Chinner
      This is the only remaining useful function in xfs_rw.h, so move it
      to a header file responsible for block mapping functions that the
      callers already include. Soon we can get rid of xfs_rw.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: clean up busy extent naming · 4ecbfe63
      Committed by Dave Chinner
      Now that the busy extent tracking has been moved out of the
      allocation files, clean up the namespace it uses to
      "xfs_extent_busy" rather than a mix of "xfs_busy" and
      "xfs_alloc_busy".
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: move busy extent handling to it's own file · efc27b52
      Committed by Dave Chinner
      To make it easier to handle userspace code merges, move all the busy
      extent handling out of the allocation code and into its own file.
      The userspace code does not need the busy extent code, so this
      simplifies the merging of the kernel code into the userspace
      xfsprogs library.
      
      Because the busy extent code has been almost completely rewritten
      over the past couple of years, also update the copyright on this new
      file to include the authors that made all those changes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: move xfsagino_t to xfs_types.h · 60a34607
      Committed by Dave Chinner
      Untangle the header file includes a bit by moving the definition of
      xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
      xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
      xfs_ag.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: use iolock on XFS_IOC_ALLOCSP calls · bc4010ec
      Committed by Dave Chinner
      fsstress has a particularly effective way of stopping debug XFS
      kernels. We keep seeing assert failures due to finding delayed
      allocation extents where there should be none. This shows up when
      extracting extent maps and we are holding all the locks we should be
      to prevent races, so it really makes no sense to see these errors.
      
      After checking that fsstress does not use mmap, it occurred to me
      that fsstress uses something that no sane application uses - the
      XFS_IOC_ALLOCSP ioctl interfaces for preallocation. These interfaces
      do allocation of blocks beyond EOF without using preallocation, and
      then call setattr to extend and zero the allocated blocks.
      
      The problem here is this is a buffered write, and hence the
      allocation is a delayed allocation. Unlike the buffered IO path, the
      allocation and zeroing are not serialised using the IOLOCK. Hence
      the ALLOCSP operation can race with operations holding the iolock to
      prevent buffered IO operations from occurring.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: kill XBF_DONTBLOCK · aa5c158e
      Committed by Dave Chinner
      Just about all callers of xfs_buf_read() and xfs_buf_get() use XBF_DONTBLOCK.
      This is used to make memory allocation use GFP_NOFS rather than GFP_KERNEL to
      avoid recursion through memory reclaim back into the filesystem.
      
      All the blocking get calls in growfs occur inside a transaction, even though
      they are not part of the transaction, so all allocation will be GFP_NOFS due to
      the task flag PF_TRANS being set. The blocking read calls occur during log
      recovery, so they will probably be unaffected by converting to GFP_NOFS
      allocations.
      
      Hence make XBF_DONTBLOCK behaviour always occur for buffers and kill the flag.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: kill xfs_read_buf() · 7ca790a5
      Committed by Dave Chinner
      xfs_read_buf() is effectively the same as xfs_trans_read_buf() when called
      outside a transaction context. The error handling is slightly different in that
      xfs_read_buf stales the errored buffer it gets back, but there is probably good
      reason for xfs_trans_read_buf() to do this as well.
      
      Hence update xfs_trans_read_buf() to the same error handling as xfs_read_buf(),
      and convert all the callers of xfs_read_buf() to use the former function. We can
      then remove xfs_read_buf().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: kill XBF_LOCK · a8acad70
      Committed by Dave Chinner
      Buffers are always returned locked from the lookup routines. Hence
      we don't need to tell the lookup routines to return locked buffers,
      or to try and lock them. Remove XBF_LOCK from all the callers and
      from internal buffer cache usage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: kill xfs_buf_btoc · 795cac72
      Committed by Dave Chinner
      xfs_buf_btoc and friends are simple macros that do basic block
      to page index conversion and vice versa. These aren't widely used,
      and we use open coded masking and shifting everywhere else. Hence
      remove the macros and open code the work they do.
      
      Also, use of PAGE_CACHE_{SIZE|SHIFT|MASK} for these macros is now
      incorrect - we are using pages directly and not the page cache, so
      use PAGE_{SIZE|MASK|SHIFT} instead.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
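The "open coded masking and shifting" mentioned in the commit above is the usual page arithmetic. A minimal sketch follows, assuming a 4 KiB page; the `DEMO_` constants and both function names are hypothetical and stand in for the kernel's PAGE_SHIFT, PAGE_SIZE and PAGE_MASK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page geometry for this sketch; the kernel supplies the real values. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((size_t)1 << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Number of pages needed to hold 'bytes', rounding up. */
static size_t bytes_to_pages(size_t bytes)
{
    assert(bytes <= SIZE_MAX - (DEMO_PAGE_SIZE - 1));
    return (bytes + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Offset of 'addr' within its page. */
static size_t page_offset(size_t addr)
{
    return addr & ~DEMO_PAGE_MASK;
}
```

Each helper is a single shift or mask, which is why wrapping them in dedicated macros adds little beyond another name to remember.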
    • xfs: use blocks for storing the desired IO size · aa0e8833
      Committed by Dave Chinner
      Now that we pass block counts everywhere, and index buffers by block
      number and length in units of blocks, convert the desired IO size
      into block counts rather than bytes. Convert the code to use block
      counts, and those that need byte counts get converted at the time of
      use.
      
      Rename the b_desired_count variable to something closer to its
      purpose - b_io_length - as it is only used to specify the length of
      an IO for a subset of the buffer.  The only time this is used is for
      log IO - both writing iclogs and during log recovery. In all other
      cases, the b_io_length matches b_length, and hence a lot of code
      confuses the two. e.g. the buf item code uses the io count
      exclusively when it should be using the buffer length. Fix these
      appropriately as they are found.
      
      Also, remove the XFS_BUF_{SET_}COUNT() macros that are just wrappers
      around the desired IO length. They only serve to make the code
      shouty loud, don't actually add any real value, and are often used
      incorrectly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: use blocks for counting length of buffers · 4e94b71b
      Committed by Dave Chinner
      Now that we pass block counts everywhere, and index buffers by block
      number, track the length of the buffer in units of blocks rather
      than bytes. Convert the code to use block counts, and those that
      need byte counts get converted at the time of use.
      
      Also, remove the XFS_BUF_{SET_}SIZE() macros that are just wrappers
      around the buffer length. They only serve to make the code shouty
      loud and don't actually add any real value.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
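Tracking lengths in blocks and converting to bytes only at the point of use can be sketched as follows. XFS basic blocks are 512 bytes; everything prefixed `DEMO_` or `demo_`, and the helper names, are local to this illustration rather than the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 512-byte basic blocks, expressed as a shift. */
#define DEMO_BBSHIFT 9
#define DEMO_BBSIZE  (UINT64_C(1) << DEMO_BBSHIFT)

/* Bytes to basic blocks, rounding up to a whole block. */
static uint64_t bytes_to_bb(uint64_t bytes)
{
    return (bytes + DEMO_BBSIZE - 1) >> DEMO_BBSHIFT;
}

/* Basic blocks to bytes. */
static uint64_t bb_to_bytes(uint64_t bbs)
{
    return bbs << DEMO_BBSHIFT;
}

/* Hypothetical buffer descriptor that stores its address and length in blocks. */
struct demo_buf {
    uint64_t bn;         /* starting block number */
    uint64_t length_bb;  /* length in basic blocks */
};

/* Convert to bytes only where a byte count is actually needed. */
static uint64_t demo_buf_bytes(const struct demo_buf *bp)
{
    assert(bp != NULL);
    return bb_to_bytes(bp->length_bb);
}
```

Keeping one unit in the descriptor means callers that already hold a block count pass it straight through, and the shift happens once at the byte-oriented boundary.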
    • xfs: kill b_file_offset · de1cbee4
      Committed by Dave Chinner
      Seeing as we pass block numbers around everywhere in the buffer
      cache now, it makes no sense to index everything by byte offset.
      Replace all the byte offset indexing with block number based
      indexing, and replace all uses of the byte offset with direct
      conversion from the block index.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: clean up buffer get/read call API · e70b73f8
      Committed by Dave Chinner
      The xfs_buf_get/read API is not consistent in the units it uses, and
      does not use appropriate or consistent units/types for the
      variables.
      
      Convert the API to use disk addresses and block counts for all
      buffer get and read calls. Use consistent naming for all the
      functions and their declarations, and convert the internal functions
      to use disk addresses and block counts to avoid the need to convert them
      from one type to another and back again.
      
      Fix all the callers to use disk addresses and block counts. In many
      cases, this removes an additional conversion from the function call
      as the callers already have a block count.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: use kmem_zone_zalloc for buffers · bf813cdd
      Committed by Dave Chinner
      To replace the alloc/memset pair.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix incorrect b_offset initialisation · ead360c5
      Committed by Dave Chinner
      Because we no longer use the page cache for buffering, there is no
      direct block number to page offset relationship anymore.
      xfs_buf_get_pages is still setting up b_offset as if there was some
      relationship, and that is leading to incorrectly setting up
      *uncached* buffers that don't overwrite b_offset once they've had
      pages allocated.
      
      For cached buffers, the first block of the buffer is always at offset
      zero into the allocated memory. This is true for sub-page sized
      buffers, as well as for multiple-page buffers.
      
      For uncached buffers, b_offset is only non-zero when we are
      associating specific memory to the buffers, and that is set
      correctly by the code setting up the buffer.
      
      Hence remove the setting of b_offset in xfs_buf_get_pages, because
      it is now always the wrong thing to do.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: check for buffer errors before waiting · 0e95f19a
      Committed by Dave Chinner
      If we call xfs_buf_iowait() on a buffer that failed dispatch due to
      an IO error, it will wait forever for an IO that does not exist.
      This is handled in xfs_buf_read, but there is other code that calls
      xfs_buf_iowait directly that doesn't.
      
      Rather than make the call sites have to handle checking for dispatch
      errors and then checking for completion errors, make
      xfs_buf_iowait() check for dispatch errors on the buffer before
      waiting. This means we handle both dispatch and completion errors
      with one set of error handling at the caller sites.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
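The control flow described in the commit above, checking for a dispatch error before blocking, can be modelled in a few lines. This is a toy userspace sketch, not the kernel implementation: `struct demo_buf`, `demo_wait_for_completion` and `demo_buf_iowait` are hypothetical, and the "wait" is just a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer: 'error' may be set at dispatch or at completion. */
struct demo_buf {
    int  error;       /* 0 on success, negative errno-style value on failure */
    bool io_pending;  /* true while an IO is dispatched and not yet complete */
    int  waits;       /* how many times we actually blocked (illustration only) */
};

/* Stand-in for blocking until the IO completes. */
static void demo_wait_for_completion(struct demo_buf *bp)
{
    bp->waits++;
    bp->io_pending = false;
}

/*
 * Wait for IO on 'bp' and return its error state. If dispatch already
 * failed there is no IO to wait for, so return the error immediately.
 */
static int demo_buf_iowait(struct demo_buf *bp)
{
    assert(bp != NULL);
    if (!bp->error)
        demo_wait_for_completion(bp);
    return bp->error;
}
```

Because the wait routine returns the error in both cases, a caller needs only one error check after the call instead of one after dispatch and another after completion.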
    • xfs: fix buffer lookup race on allocation failure · fe2429b0
      Committed by Dave Chinner
      When memory allocation fails to add the page array or the pages to
      a buffer during xfs_buf_get(), the buffer is left in the cache in a
      partially initialised state. There is enough state left for the next
      lookup on that buffer to find the buffer, and for the buffer to then
      be used without finishing the initialisation.  As a result, when an
      attempt to do IO on the buffer occurs, it fails with EIO because
      there are no pages attached to the buffer.
      
      We cannot remove the buffer from the cache immediately and free it,
      because there may already be a racing lookup that is blocked on the
      buffer lock. Hence the moment we unlock the buffer to then free it,
      the other user is woken and we have a use-after-free situation.
      
      To avoid this race condition altogether, allocate the pages for the
      buffer before we insert it into the cache.  This then means that we
      don't have an allocation failure case to deal with after the buffer is
      already present in the cache, and hence avoid the problem
      altogether.  In most cases we won't have racing inserts for the same
      buffer, and so won't increase the memory pressure allocation before
      insertion may entail.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Use preallocation for inodes with extsz hints · aff3a9ed
      Committed by Dave Chinner
      xfstest 229 exposes a problem with buffered IO, delayed allocation
      and extent size hints. That is, when we do delayed allocation during
      buffered IO, we reserve space for the extent size hint alignment and
      allocate the physical space to align the extent, but we do not zero
      the regions of the extent that aren't written by the write(2)
      syscall. The result is that we expose stale data in unwritten
      regions of the extent size hints.
      
      There are two ways to fix this. The first is to detect that we are
      doing unaligned writes, check if there is already a mapping or data
      over the extent size hint range, and if not zero the page cache
      first before then doing the real write. This can be very expensive
      for large extent size hints, especially if the subsequent writes
      fill the entire extent size before the data is written to disk.
      
      The second, and simpler way, is simply to turn off delayed
      allocation when the extent size hint is set and use preallocation
      instead. This results in unwritten extents being laid down on disk
      and so only the written portions will be converted. This matches the
      behaviour for direct IO, and will also work for the real time
      device. The disadvantage of this approach is that for small extent
      size hints we can get file fragmentation, but in general extent size
      hints are fairly large (e.g. stripe width sized) so this isn't a big
      deal.
      
      Implement the second approach as it is simple and effective.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: limit specualtive delalloc to maxioffset · 3ed9116e
      Committed by Dave Chinner
      Speculative delayed allocation beyond EOF near the maximum supported
      file offset can result in creating delalloc extents beyond
      mp->m_maxioffset (8EB). These can never be trimmed during
      xfs_free_eof_blocks() because they are beyond mp->m_maxioffset, and
      that results in assert failures in xfs_fs_destroy_inode() due to
      delalloc blocks still being present. xfstests 071 exposes this
      problem.
      
      Limit speculative delalloc to mp->m_maxioffset to avoid this
      problem.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: don't assert on delalloc regions beyond EOF · 58e20770
      Committed by Dave Chinner
      When we are doing speculative delayed allocation beyond EOF,
      conversion of the region allocated beyond EOF is dependent on the
      largest free space extent available. If the largest free extent is
      smaller than the delalloc range, then after allocation we leave
      a delalloc extent that starts beyond EOF. This extent cannot *ever*
      be converted by flushing data, and so will remain there until either
      the EOF moves into the extent or it is truncated away.
      
      Hence if xfs_getbmap() runs on such an inode and is asked to return
      extents beyond EOF, it will assert fail on this extent even though
      there is nothing xfs_getbmap() can do to convert it to a real
      extent. Hence we should simply report these delalloc extents rather
      than assert that there should be none.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: prevent needless mount warning causing test failures · 81158e0c
      Committed by Dave Chinner
      Often mounting small filesystems with small logs will emit a warning
      such as:
      
      XFS (vdb): Invalid block length (0x2000) for buffer
      
      during log recovery. This causes tests to randomly fail because this
      output causes the clean filesystem checks on test completion to
      think the filesystem is inconsistent.
      
      The cause of the error is simply that log recovery is asking for a
      buffer size that is larger than the log when zeroing the tail. This
      is because the buffer size is rounded up, and if the right head and
      tail conditions exist then the buffer size can be larger than the log.
      Limit the variable size xlog_get_bp() callers to requesting buffers
      smaller than the log.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
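The interaction between rounding up and the size of the log, as described in the commit above, can be sketched as a round-then-clamp helper. This is an illustration under stated assumptions, not the log recovery code: `round_up_pow2` and `demo_log_buf_blocks` are hypothetical names, a power-of-two alignment is assumed, and the request is clamped to at most the log size.

```c
#include <assert.h>
#include <stdint.h>

/* Round 'nbblks' up to a multiple of 'align' (align must be a power of two). */
static uint64_t round_up_pow2(uint64_t nbblks, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (nbblks + align - 1) & ~(align - 1);
}

/*
 * Size of the buffer to request, in blocks: the wanted size rounded up to
 * the alignment, but never more than the log itself holds.
 */
static uint64_t demo_log_buf_blocks(uint64_t wanted, uint64_t align,
                                    uint64_t log_blocks)
{
    uint64_t n = round_up_pow2(wanted, align);

    return n > log_blocks ? log_blocks : n;
}
```

For a 12-block log and an 8-block alignment, a request for 9 blocks rounds up to 16 and is then clamped back to 12, which is the small-log case where an unclamped request would exceed the log.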