1. 12 Feb, 2013 6 commits
    • f2fs: add un/freeze_fs into super_operations · d6212a5f
      Changman Lee committed
      This patch supports the FIFREEZE and FITHAW ioctls, which are used to
      snapshot a filesystem. Before f2fs_freeze is called, all writers are
      suspended and sync_fs has completed, so f2fs itself has nothing more to do.
      Only the background gc operation must be skipped until unfreeze, because it
      generates dirty nodes and data.
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: clean up the add_orphan_inode func · a2617dc6
      majianpeng committed
      In the code
      > prev = list_entry(orphan->list.prev, typeof(*prev), list);
      if orphan->list.prev == head, list_entry() cannot return a valid prev,
      because head is a bare list head rather than a member of an orphan entry.
      Instead, we can use the parameter 'this' as the insertion point.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
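The point about list_entry() and the list head can be illustrated with a small userspace sketch. Everything here is an assumption made for illustration: `struct orphan_entry`, `add_orphan_sorted()` and the minimal `list_head` helpers are simplified stand-ins, not the f2fs code. The sketch walks the list with a plain `list_head` cursor (`this`) and inserts before it, so it never needs to convert the head into an entry.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical entry type, used only for this illustration. */
struct orphan_entry {
    unsigned long ino;
    struct list_head list;
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

/* Insert new_node immediately before pos. */
static void list_add_tail(struct list_head *new_node, struct list_head *pos)
{
    new_node->prev = pos->prev;
    new_node->next = pos;
    pos->prev->next = new_node;
    pos->prev = new_node;
}

/*
 * Keep the list sorted by ino. The cursor 'this' is a plain list_head, so
 * when the loop ends without a match it is simply the head itself, and
 * inserting before the head appends to the list. list_entry() is only
 * applied to nodes known to be embedded in an orphan_entry.
 */
static void add_orphan_sorted(struct list_head *head, struct orphan_entry *e)
{
    struct list_head *this;

    assert(head != NULL && e != NULL);
    for (this = head->next; this != head; this = this->next) {
        struct orphan_entry *cur = list_entry(this, struct orphan_entry, list);
        if (cur->ino > e->ino)
            break;
    }
    list_add_tail(&e->list, this);
}
```

Calling list_entry() on the head would instead compute an address inside whatever structure happens to contain the head, which is the failure mode the commit message describes.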
    • f2fs: fix disable_ext_identify option spelling · aa43507f
      Alejandro Martinez Ruiz committed
      There is a typo in the ->show_options function for disable_ext_identify.
      Fix it to match the spelling from the documentation.
      Signed-off-by: Alejandro Martinez Ruiz <alex@nowcomputing.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: cover global locks for reserve_new_block · bd43df02
      Jaegeuk Kim committed
      The fill_zero() from fallocate() calls get_new_data_page(), which in turn
      calls reserve_new_block().
      The reserve_new_block() should be covered by *DATA_NEW*, one of the global
      locks. Also, before taking the lock, we should check free sections by calling
      f2fs_balance_fs().

      If we break this rule, f2fs can lose control of free space management and
      can also fall into an infinite loop, as in the following scenario.
      
      [f2fs_sync_fs()]             [fallocate()]
       - write_checkpoint()        - fill_zero()
        - block_operations()        - get_new_data_page()
         : grab NODE_NEW             - get_dnode_of_data()
                                      : get locked dirty node page
          - sync_node_pages()
                                      : try to grab NODE_NEW for data allocation
           : trylock and skip the dirty node page
         : call sync_node_pages() repeatedly in order to flush all the dirty node
           pages!
      
      In order to avoid this, we should grab another global lock such as DATA_NEW
      before calling get_new_data_page() in fill_zero().
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
    • f2fs: prevent checkpoint once any IO failure is detected · 577e3495
      Jaegeuk Kim committed
      This patch enhances the checkpoint routine to cope with IO errors.

      Basically, f2fs detects IO errors in end_io_write, and the errors can occur
      during any of the data, node, and meta page writes.

      In the previous code, when an IO error occurred during a write, f2fs set a
      flag, CP_ERROR_FLAG, in the raw checkpoint buffer, which would be written to
      disk. Afterwards, write_checkpoint() would check the flag and remount f2fs in
      read-only (ro) mode.

      However, even after f2fs is remounted in ro mode, dirty checkpoint pages can
      still be written to disk by the flusher or kswapd in the background.
      In such a case, after a cold reboot, f2fs would restore the checkpoint data
      carrying CP_ERROR_FLAG, which disables write_checkpoint and remounts f2fs in
      ro mode again.

      Therefore, let's prevent any checkpoint page (meta) writes once an IO error
      has occurred, and remount f2fs in ro mode right away at that moment.
      Reported-by: Oliver Winker <oliver@oli1170.net>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
    • f2fs: save device node number into f2fs_inode · 7d79e75f
      Changman Lee committed
      This patch stores inode->i_rdev into on-disk inode structure.
      
      Alun reported that:
       aspire tmp # mount -t f2fs /dev/sdb mnt
       aspire tmp # mknod mnt/sda1 b 8 1
       aspire tmp # mknod mnt/null c 1 3
       aspire tmp # mknod mnt/console c 5 1
       aspire tmp # ls -l mnt
       total 2
       crw-r--r-- 1 root root 5, 1 Jan 22 18:44 console
       crw-r--r-- 1 root root 1, 3 Jan 22 18:44 null
       brw-r--r-- 1 root root 8, 1 Jan 22 18:44 sda1
       aspire tmp # umount mnt
       aspire tmp # mount -t f2fs /dev/sdb mnt
       aspire tmp # ls -l mnt
       total 2
       crw-r--r-- 1 root root 0, 0 Jan 22 18:44 console
       crw-r--r-- 1 root root 0, 0 Jan 22 18:44 null
       brw-r--r-- 1 root root 0, 0 Jan 22 18:44 sda1
      
      In this report, f2fs lost the major/minor numbers of device files after umount.
      The cause turned out to be that f2fs does not store inode->i_rdev in the
      on-disk inode data structure.

      So, as other file systems do, f2fs now stores i_rdev in the i_addr fields
      of the on-disk inode structure, without any on-disk layout changes.
      Note that this bug is limited to device files made by mknod().
      Reported-and-Tested-by: Alun Jones <alun.linux@ty-penguin.org.uk>
      Signed-off-by: Changman Lee <cm224.lee@samsung.com>
      Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
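As a rough illustration of storing a device number in an existing address slot, the sketch below packs major/minor into one 32-bit word using the same bit layout as the kernel's new_encode_dev() helper. The `struct disk_inode` type, the choice of slot 0, and the `save_rdev()` helper are assumptions made for this example; the commit message does not say which i_addr slot or encoding f2fs actually uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack major/minor into 32 bits: low 8 bits of minor, then 12 bits of
 * major, then the remaining high bits of minor (the layout used by the
 * kernel's new_encode_dev()).
 */
static uint32_t encode_dev(uint32_t major, uint32_t minor)
{
    assert(major < (1u << 12) && minor < (1u << 20));
    return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

static uint32_t decode_major(uint32_t dev)
{
    return (dev & 0xfff00u) >> 8;
}

static uint32_t decode_minor(uint32_t dev)
{
    return (dev & 0xffu) | ((dev >> 12) & 0xfff00u);
}

/* Hypothetical, heavily simplified on-disk inode: only the address array. */
struct disk_inode {
    uint32_t i_addr[4];
};

/* Assumption: slot 0 is free for device files, which have no data blocks. */
static void save_rdev(struct disk_inode *di, uint32_t major, uint32_t minor)
{
    di->i_addr[0] = encode_dev(major, minor);
}
```

With this layout, the block device from the report (major 8, minor 1) is stored as 0x801 and survives a simulated write and read back of the inode, instead of reading back as 0, 0.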
  2. 07 Feb, 2013 1 commit
  3. 06 Feb, 2013 7 commits
    • Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata · eb6b88d9
      Jan Schmidt committed
      When btrfs_qgroup_reserve returned a failure, we were missing a counter
      operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
      messages about outstanding extents and space_info->bytes_may_use != 0.
      Additionally, the error handling code didn't take into account that we
      dropped the inode lock which might require more cleanup.
      
      Luckily, all the cleanup code we need is already there and can be shared
      with reserve_metadata_bytes, which is exactly what this patch does.
      Reported-by: Lev Vainblat <lev@zadarastorage.com>
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix possible stale data exposure · 59fe4f41
      Josef Bacik committed
      We specifically do not update the disk i_size if there are ordered extents
      outstanding for any area between the current disk_i_size and our ordered
      extent so that we do not expose stale data.  The problem is the check we
      have only checks if the ordered extent starts at or after the current
      disk_i_size, which doesn't take into account an ordered extent that starts
      before the current disk_i_size and ends past the disk_i_size.  Fix this by
      checking if the extent ends past the disk_i_size.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
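The difference between the old and the new check comes down to one interval comparison, sketched below. The `struct ordered_extent` type and both function names are simplified assumptions for illustration; the real btrfs code walks a tree of ordered extents rather than testing a single one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified ordered extent: a byte range in the file. */
struct ordered_extent {
    uint64_t file_offset;
    uint64_t len;
};

/* Old check: only extents that START at or after disk_i_size count. */
static bool blocks_update_old(const struct ordered_extent *oe,
                              uint64_t disk_i_size)
{
    assert(oe->len > 0);
    return oe->file_offset >= disk_i_size;
}

/* New check: any extent that ENDS past disk_i_size counts. */
static bool blocks_update_new(const struct ordered_extent *oe,
                              uint64_t disk_i_size)
{
    assert(oe->len > 0);
    return oe->file_offset + oe->len > disk_i_size;
}
```

An extent covering bytes 4096 to 12287 with disk_i_size at 8192 is the case the old check misses: it starts before disk_i_size but ends past it.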
    • Btrfs: fix missing i_size update · 5d1f4020
      Josef Bacik committed
      If we have an ordered extent before the ordered extent we are currently
      completing that is after the current disk_i_size we will put our i_size
      update into that ordered extent so that we do not expose stale data.  The
      problem is that if our disk i_size is updated past the previous ordered
      extent we won't update the i_size with the pending i_size update.  So check
      the pending i_size update and if it's above the current disk i_size we need
      to go ahead and try to update.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix race between snapshot deletion and getting inode · 6f1c3605
      Liu Bo committed
      While running the snapshot test script created by Mitch and David,
      the race between autodefrag and snapshot deletion can lead to
      corruption of the dead_roots list, so that we can crash in
      btrfs_clean_old_snapshots().

      Besides autodefrag, scrub also does the same thing, i.e. reads the
      root first and then gets the inode.

      Here is the story (taking autodefrag as an example):
      (1) When we delete a snapshot or subvolume, it sets its root's
      refs to zero and does an iput() on its own inode, and if this inode happens
      to be the only active in-memory one in the root's inode rbtree, it adds
      itself to the global dead_roots list for later cleanup.

      (2) After (1), the autodefrag thread may read another inode for defrag,
      and that inode is in the deleted snapshot/subvolume, but all of this
      happens without checking whether the root is still valid (refs > 0).  So the
      end result is adding the deleted snapshot/subvolume's root to the global
      dead_roots list AGAIN.

      Fortunately, we already have an srcu lock to avoid the race, i.e. subvol_srcu.

      So all we need to do is take the lock to protect 'read root and get inode',
      since we synchronize to wait for the rcu grace period before adding something
      to the global dead_roots list.
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix missing release of the space/qgroup reservation in start_transaction() · 843fcf35
      Miao Xie committed
      When we fail to start a transaction, we need to release the reserved free space
      and qgroup space. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() · 0a3404dc
      Miao Xie committed
      If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
      decrease ->sync_writers, because we have not increased it. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not merge logged extents if we've removed them from the tree · 222c81dc
      Josef Bacik committed
      You can run into this problem where if somebody is fsyncing and writing out
      the existing extents you will have removed the extent map from the em tree,
      but it's still valid for the current fsync so we go ahead and write it.  The
      problem is we unconditionally try to merge it back into the em tree, but if
      we've removed it from the em tree that will cause use after free problems.
      Fix this to only merge if we are still a part of the tree.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  4. 05 Feb, 2013 2 commits
    • nilfs2: fix very long mount time issue · a9bae189
      Vyacheslav Dubeyko committed
      There exists a situation when GC can work in the background alone, without
      any other filesystem activity, for a significant time.

      The nilfs_clean_segments() method calls nilfs_segctor_construct(), which
      updates superblocks when both the NILFS_SC_SUPER_ROOT and
      THE_NILFS_DISCONTINUED flags are set.  But when GC is working alone,
      nilfs_clean_segments() is called with the THE_NILFS_DISCONTINUED flag unset.
      As a result, the superblocks are not updated during all this time, and in
      the case of SPOR the superblocks keep very old values of the last super
      root placement.

      SYMPTOMS:

      Trying to mount a NILFS2 volume after SPOR in such an environment ends with
      a very long mount time (it can reach several hours in some cases).
      
      REPRODUCING PATH:
      
      1. Use an external USB HDD, disable automount, and do not generate any
         additional filesystem activity on the NILFS2 volume.

      2. Generate a temporary file of about 100 - 500 GB (for example,
         dd if=/dev/zero of=<file_name> bs=1073741824 count=200).  The size of
         the file determines how long GC will run.

      3. Delete the file.

      4. Start GC manually with the command "nilfs-clean -p 0".  When GC is
         started this way, the superblocks are updated only once, at the end.
         So, to simulate SPOR, wait a while (15 - 40 minutes) and then simply
         switch off the USB HDD manually.

      5. Switch the USB HDD on again and try to mount the NILFS2 volume.  As a
         result, mounting the NILFS2 volume will take a very long time.
      
      REPRODUCIBILITY: 100%
      
      FIX:
      
      This patch adds a check that the superblocks need to be updated and sets
      the THE_NILFS_DISCONTINUED flag before the nilfs_clean_segments() call.
      Reported-by: Sergey Alexandrov <splavgm@gmail.com>
      Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dlm: check the write size from user · d4b0bcf3
      David Teigland committed
      Return EINVAL from write if the size is larger than
      allowed.  Do this before allocating kernel memory for
      the bogus size, which could lead to OOM.
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Tested-by: Jana Saout <jana@saout.de>
      Signed-off-by: David Teigland <teigland@redhat.com>
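The ordering described above (validate the size first, allocate second) can be sketched in userspace C as follows. `MAX_WRITE_SIZE` and `handle_write()` are hypothetical names and the limit value is arbitrary; the actual limit used by dlm is not stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical upper bound on a single write; the real limit may differ. */
#define MAX_WRITE_SIZE 4096

/*
 * Copy a caller-supplied buffer into a freshly allocated one.
 * The size is validated BEFORE any allocation, so a bogus count can
 * never drive the allocator. Returns 0 on success or a negative errno.
 */
static int handle_write(const char *buf, size_t count, char **out)
{
    char *copy;

    assert(out != NULL);
    if (count > MAX_WRITE_SIZE)
        return -EINVAL;

    copy = malloc(count + 1);
    if (!copy)
        return -ENOMEM;

    memcpy(copy, buf, count);
    copy[count] = '\0';
    *out = copy;
    return 0;
}
```

The caller owns the returned buffer and must free() it; on any error, `*out` is left untouched.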
  5. 02 Feb, 2013 1 commit
  6. 31 Jan, 2013 2 commits
    • NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session · c489ee29
      Trond Myklebust committed
      NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
      usually means that the server is busy handling an unfinished RPC
      request. Just sleep for a second and then retry.
      We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
      value. If the NFS server has outstanding callbacks, we just want to
      similarly sleep & retry.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
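The "sleep for a second and retry" handling can be sketched as a small generic loop. All names below (`enum reply`, `retry_while_busy()`, the fake server and fake sleep) are assumptions made for illustration and do not come from the NFS client; in particular, the real client may schedule the retry through its state manager rather than loop in place.

```c
#include <assert.h>

/* Hypothetical stand-ins for the server replies discussed above. */
enum reply { REPLY_OK = 0, REPLY_DELAY, REPLY_BACK_CHAN_BUSY, REPLY_FATAL };

typedef enum reply (*session_op)(void *ctx);
typedef void (*sleep_fn)(unsigned seconds);

/*
 * Call op() until it returns something other than "busy", sleeping one
 * second between attempts, for at most max_tries attempts.
 */
static enum reply retry_while_busy(session_op op, void *ctx,
                                   sleep_fn do_sleep, int max_tries)
{
    enum reply r = REPLY_FATAL;

    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        r = op(ctx);
        if (r != REPLY_DELAY && r != REPLY_BACK_CHAN_BUSY)
            break;
        do_sleep(1);
    }
    return r;
}

/* Fake server for demonstration: busy a fixed number of times, then OK. */
struct fake_server { int busy_replies_left; int calls; };

static enum reply fake_destroy_session(void *ctx)
{
    struct fake_server *srv = ctx;

    srv->calls++;
    if (srv->busy_replies_left > 0) {
        srv->busy_replies_left--;
        return REPLY_DELAY;
    }
    return REPLY_OK;
}

/* Fake sleep that only records how long the caller asked to wait. */
static int slept_seconds;
static void fake_sleep(unsigned seconds) { slept_seconds += (int)seconds; }
```

Passing the sleep function as a parameter keeps the sketch portable and lets the retry behaviour be checked without actually waiting.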
    • NFS: Don't silently fail setattr() requests on mountpoints · ab225417
      Trond Myklebust committed
      Ensure that any setattr and getattr requests for junctions and/or
      mountpoints are sent to the server. Ever since commit
      0ec26fd0 (vfs: automount should ignore LOOKUP_FOLLOW), we have
      silently dropped any setattr requests to a server-side mountpoint.
      For referrals, we have silently dropped both getattr and setattr
      requests.
      
      This patch restores the original behaviour for setattr on mountpoints,
      and tries to do the same for referrals, provided that we have a
      filehandle...
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
  7. 29 Jan, 2013 7 commits
    • xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages() · 65e3aa77
      Torsten Kaiser committed
      Commit fb595814 removed
      xfs_flushinval_pages() and changed its callers to use
      filemap_write_and_wait() and truncate_pagecache_range() directly.

      But in xfs_swap_extents() this change accidentally switched the argument
      for 'tip' to 'ip'. This patch switches it back to 'tip'.
      Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Fix possible use-after-free with AIO · 4b05d09c
      Jan Kara committed
      A running AIO pins the inode in memory through its file reference. Once
      the AIO is completed with aio_complete(), the file reference is put and the
      inode can be freed from memory. So we have to be sure that calling
      aio_complete() is the last thing we do with the inode.

      CC: xfs@oss.sgi.com
      CC: Ben Myers <bpm@sgi.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: fix shutdown hang on invalid inode during create · 9f87832a
      Dave Chinner committed
      When the new inode verify in xfs_iread() fails, the create
      transaction is aborted and a shutdown occurs. The subsequent unmount
      then hangs in xfs_wait_buftarg() on a buffer that has an elevated
      hold count. Debug showed that it was an AGI buffer getting stuck:
      
      [   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      
      The trace of this buffer leading up to the shutdown (trimmed for
      brevity) looks like:
      
      xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
      xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
      xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
      xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
      xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
      xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      
      And that is the AGI buffer from cold cache read into memory to
      transaction abort. You can see at transaction abort the bli is dirty
      and only has a single reference. The item is not pinned, and it's
      not in the AIL. Hence the only reference to it is this transaction.
      
      The problem is that the xfs_buf_item_unlock() call is dropping the
      last reference to the xfs_buf_log_item attached to the buffer (which
      holds a reference to the buffer), but it is not freeing the
      xfs_buf_log_item. Hence nothing will ever release the buffer, and
      the unmount hangs waiting for this reference to go away.
      
      The fix is simple - xfs_buf_item_unlock needs to detect the last
      reference going away in this case and free the xfs_buf_log_item to
      release the reference it holds on the buffer.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
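The rule behind the fix (whoever drops the last reference to the log item must free it, which in turn releases the item's hold on the buffer) can be sketched with two toy reference counts. `struct buf`, `struct log_item`, `item_attach()` and `item_put()` are hypothetical simplifications, not the XFS structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy buffer with a hold count, standing in for the AGI buffer. */
struct buf { int hold; };

/* Toy log item that keeps one hold on its buffer while it exists. */
struct log_item {
    int refcount;
    struct buf *bp;
};

/* Create an item with one reference; it takes a hold on the buffer. */
static struct log_item *item_attach(struct buf *bp)
{
    struct log_item *li = malloc(sizeof(*li));

    if (!li)
        return NULL;
    li->refcount = 1;
    li->bp = bp;
    bp->hold++;
    return li;
}

/*
 * Drop one reference. When the last reference goes away, release the
 * buffer hold and free the item; returns true in that case. Skipping
 * this step is what leaves the buffer hold elevated forever.
 */
static bool item_put(struct log_item *li)
{
    assert(li->refcount > 0);
    if (--li->refcount > 0)
        return false;
    li->bp->hold--;
    free(li);
    return true;
}
```

In this model, an unmount that waits for the buffer's hold count to fall back to its baseline can only finish once the final item_put() has run.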
    • xfs: limit speculative prealloc near ENOSPC thresholds · f2a45956
      Dave Chinner committed
      There is a window on small filesystems where speculative
      preallocation can be larger than the ENOSPC throttling thresholds,
      resulting in speculative preallocation trying to reserve more space
      than there is space available. This causes immediate ENOSPC to be
      triggered, prealloc to be turned off and flushing to occur. On the
      next write (i.e. next 4k page), we do exactly the same thing, and so
      effectively drive into synchronous 4k writes by triggering ENOSPC
      flushing on every page while in the window between the prealloc size
      and the ENOSPC prealloc throttle threshold.

      Fix this by checking to see if the prealloc size would consume all
      free space, and throttle it appropriately to avoid premature
      ENOSPC...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
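One simple way to keep a speculative preallocation from exceeding the free space is to halve it until it fits, as in the sketch below. This is only an illustration of the idea under assumed names (`throttle_prealloc()` is hypothetical); the commit message is truncated and does not spell out the exact throttling rule XFS applies.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shrink a speculative preallocation (in blocks) by halving it until it
 * no longer exceeds the free space. A result of 0 means "do not
 * preallocate at all".
 */
static uint64_t throttle_prealloc(uint64_t want_blocks, uint64_t free_blocks)
{
    while (want_blocks > free_blocks)
        want_blocks >>= 1;

    assert(want_blocks <= free_blocks);
    return want_blocks;
}
```

Halving keeps a power-of-two request a power of two, so a 1024-block request against 300 free blocks is cut to 256 blocks rather than failing with ENOSPC on every page.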
    • xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end · eb178619
      Dave Chinner committed
      When _xfs_buf_find is passed an out of range address, it will fail
      to find a relevant struct xfs_perag and oops with a null
      dereference. This can happen when trying to walk a filesystem with a
      metadata inode that has a partially corrupted extent map (i.e. the
      block number returned is corrupt, but is otherwise intact) and we
      try to read from the corrupted block address.
      
      In this case, just fail the lookup. If it is readahead being issued,
      it will simply not be done, but if it is a real read that fails we
      will get an error being reported.  Ideally this case should result
      in an EFSCORRUPTED error being reported, but we cannot return an
      error through xfs_buf_read() or xfs_buf_get() so this lookup failure
      may result in ENOMEM or EIO errors being reported instead.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: pull up stack_switch check into xfs_bmapi_write · d26978dd
      Brian Foster committed
      The stack_switch check currently occurs in __xfs_bmapi_allocate,
      which means the stack switch only occurs when xfs_bmapi_allocate()
      is called in a loop. Pull the check up before the loop in
      xfs_bmapi_write() such that the first iteration of the loop has
      consistent behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic · 1bee12b8
      Eric Sandeen committed
      Commit 98021821 changed the return value from EWRONGFS (aka EINVAL)
      to EFSCORRUPTED, which doesn't seem to be handled properly by
      the root filesystem probe.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  8. 28 Jan, 2013 5 commits
  9. 25 Jan, 2013 8 commits
  10. 23 Jan, 2013 1 commit