1. 11 Jan, 2012 2 commits
  2. 16 Dec, 2011 1 commit
    • Btrfs: fix how we do delalloc reservations and how we free reservations on error · 660d3f6c
      Josef Bacik committed
      Running xfstests 269 with some tracing my scripts kept spitting out errors about
      releasing bytes that we didn't actually have reserved.  This took me down a huge
      rabbit hole and it turns out the way we deal with reserved_extents is wrong,
      we need to only be setting it if the reservation succeeds, otherwise the free()
      method will come in and unreserve space that isn't actually reserved yet, which
      can lead to other warnings and such.  The math was all working out right in the
      end, but it caused all sorts of other issues in addition to making my scripts
      yell and scream and generally make it impossible for me to track down the
      original issue I was looking for.  The other problem is with our error handling
      in the reservation code.  There are two cases that we need to deal with
      
      1) We raced with free.  In this case free won't free anything because csum_bytes
      is modified before we drop the lock in our reservation path, so free rightly
      doesn't release any space because the reservation code may be depending on that
      reservation.  However if we fail, we need the reservation side to do the free at
      that point since that space is no longer in use.  So as it stands the code was
      doing this fine and it worked out, except in case #2
      
      2) We don't race with free.  Nobody comes in and changes anything, and our
      reservation fails.  In this case we didn't reserve anything anyway and we just
      need to clean up csum_bytes but not free anything.  So we keep track of
      csum_bytes before we drop the lock and if it hasn't changed we know we can just
      decrement csum_bytes and carry on.
      
      Because of the case where we can race with free()'s since we have to drop our
      spin_lock to do the reservation, I'm going to serialize all reservations with
      the i_mutex.  We already get this for free in the heavy use paths, truncate and
      file write all hold the i_mutex, just needed to add it to page_mkwrite and
      various ioctl/balance things.  With this patch my space leak scripts no longer
      scream bloody murder.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      660d3f6c
  3. 15 Dec, 2011 1 commit
    • Btrfs: fix ctime update of on-disk inode · 306424cc
      Li Zefan committed
      To reproduce the bug:
      
          # touch /mnt/tmp
          # stat /mnt/tmp | grep Change
          Change: 2011-12-09 09:32:23.412105981 +0800
          # chattr +i /mnt/tmp
          # stat /mnt/tmp | grep Change
          Change: 2011-12-09 09:32:43.198105295 +0800
          # umount /mnt
          # mount /dev/loop1 /mnt
          # stat /mnt/tmp | grep Change
          Change: 2011-12-09 09:32:23.412105981 +0800
      
      We should update ctime of in-memory inode before calling
      btrfs_update_inode().
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      306424cc
  4. 01 Dec, 2011 1 commit
  5. 20 Nov, 2011 2 commits
    • Btrfs: prefix resize related printks with btrfs: · 5bb14682
      Arnd Hannemann committed
      For the user it is confusing to find something like:
      [10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
      in kernel log, because it doesn't point directly to btrfs.
      
      This patch prefixes those messages with "btrfs:" like other btrfs
      related printks.
      Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      5bb14682
    • btrfs: Fix up 32/64-bit compatibility for new ioctls · 745c4d8e
      Jeff Mahoney committed
      This patch casts to unsigned long before casting to a pointer and fixes
      the following warnings:
      fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
      fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      745c4d8e
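The warnings above appear when a 64-bit integer field and a pointer are converted directly on a build where pointers are narrower than 64 bits. The following is a minimal userspace sketch of the idiom, assuming `uintptr_t` as a stand-in for the `unsigned long` the commit message mentions; the helper names `ptr_to_u64` and `u64_to_ptr` are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helpers (not from the kernel patch): carry a pointer in a
 * fixed 64-bit field, as an ioctl argument structure might, by going
 * through a pointer-sized integer first.
 */
static inline uint64_t ptr_to_u64(const void *p)
{
    /* pointer -> pointer-sized integer -> 64-bit integer */
    return (uint64_t)(uintptr_t)p;
}

static inline void *u64_to_ptr(uint64_t v)
{
    /* 64-bit integer -> pointer-sized integer -> pointer */
    return (void *)(uintptr_t)v;
}
```

Going through the pointer-sized type makes the width change an explicit integer conversion, which is what silences the pointer/integer size-mismatch warnings.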
  6. 06 Nov, 2011 2 commits
  7. 21 Oct, 2011 6 commits
    • btrfs: return EINVAL if start > total_bytes in fitrim ioctl · f4c697e6
      Lukas Czerner committed
      We should return EINVAL if the start is beyond the end of the file
      system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
      check for it.
      
      Also in the btrfs_trim_fs() it is possible that len+start might overflow
      if big values are passed. Fix it by decrementing the len so that start+len
      is equal to the file system size in the worst case.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      f4c697e6
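The two checks described in this commit can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the kernel code: `clamp_trim_range` is a hypothetical helper, and `total_bytes` stands for the file system size.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel function): validate a trim request
 * and shrink *len so that start + *len never exceeds total_bytes.
 */
static int clamp_trim_range(uint64_t start, uint64_t *len, uint64_t total_bytes)
{
    /* A start beyond the end of the file system is rejected outright. */
    if (start > total_bytes)
        return -EINVAL;

    /*
     * total_bytes - start cannot underflow here, so comparing against it
     * avoids ever computing start + *len, which could wrap around.
     */
    if (*len > total_bytes - start)
        *len = total_bytes - start;

    return 0;
}
```

Comparing `*len` against the remaining space, rather than testing `start + *len` against the total, is the design choice that keeps the check itself free of overflow.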
    • Btrfs: honor extent thresh during defragmentation · 008873ea
      Li Zefan committed
      We won't defrag an extent, if it's bigger than the threshold we
      specified and there's no small extent before it, but actually
      the code doesn't work this way.
      
      There are three bugs:
      
      - When should_defrag_range() decides we should keep on defragmenting
        an extent, last_len is not incremented. (old bug)
      
      - The length passed to should_defrag_range() is not the length
        we're going to defrag. (new bug)
      
      - We always defrag 256K bytes data, and a big extent can be part of
        this range. (new bug)
      
      For a file with 4 extents:
      
              | 4K | 4K | 256K | 256K |
      
      The result of defrag with (the default) 256K extent thresh should be:
      
              | 264K | 256K |
      
      but with those bugs, we'll get:
      
              | 520K |
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      008873ea
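The intended rule can be modelled in a few lines of userspace C. This is a simplified sketch, not the logic of should_defrag_range(): it assumes an extent counts as "small" when its length is below the threshold, that an extent is rewritten when it is small or directly follows a small extent, and that adjacent rewritten extents end up as one extent. The function name `model_defrag` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model (an assumption, not the kernel code). Writes the
 * resulting extent lengths to out, which must hold n entries, and
 * returns how many extents remain.
 */
static size_t model_defrag(const uint64_t *ext, size_t n, uint64_t thresh,
                           uint64_t *out)
{
    size_t count = 0;
    int prev_small = 0;     /* was the previous extent below the threshold? */
    int prev_defragged = 0; /* was the previous extent rewritten? */

    for (size_t i = 0; i < n; i++) {
        int small = ext[i] < thresh;
        int defrag = small || prev_small;

        if (defrag && prev_defragged)
            out[count - 1] += ext[i]; /* joins the run being rewritten */
        else
            out[count++] = ext[i];    /* starts a new resulting extent */

        prev_small = small;
        prev_defragged = defrag;
    }
    return count;
}
```

With the four extents from the example (4K, 4K, 256K, 256K) and a 256K threshold, this model yields two extents of 264K and 256K, matching the expected result in the commit message.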
    • Btrfs: fix wrong max_to_defrag in btrfs_defrag_file() · 5ca49660
      Li Zefan committed
      It's off-by-one, and thus we may skip the last page while defragmenting.
      
      An example case:
      
        # create /mnt/file with 2 4K file extents
        # btrfs fi defrag /mnt/file
        # sync
        # filefrag /mnt/file
        /mnt/file: 2 extents found
      
      So it's not defragmented.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      5ca49660
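The kind of off-by-one described here usually comes from confusing the index of the last page with the number of pages. The sketch below shows the arithmetic only; it is not the code from btrfs_defrag_file(), it assumes 4K pages, and the names `SKETCH_PAGE_SHIFT`, `last_page_index` and `pages_to_defrag` are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K pages */

/* Index of the last page of a file of isize bytes (isize must be > 0). */
static uint64_t last_page_index(uint64_t isize)
{
    return (isize - 1) >> SKETCH_PAGE_SHIFT;
}

/* Number of pages covering the file: indices are zero-based, so add one. */
static uint64_t pages_to_defrag(uint64_t isize)
{
    return last_page_index(isize) + 1;
}
```

For the example file with two 4K extents (8192 bytes), the last page index is 1 while the page count is 2, so using the index as the count would leave the final page out.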
    • Btrfs: use i_size_read() in btrfs_defrag_file() · 151a31b2
      Li Zefan committed
      Don't use inode->i_size directly, since we're not holding i_mutex.
      
      This also fixes another bug, that i_size can change after it's checked
      against 0 and then (i_size - 1) can be negative.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      151a31b2
    • Btrfs: fix defragmentation regression · cbcc8326
      Li Zefan committed
      There's an off-by-one bug:
      
        # create a file with lots of 4K file extents
        # btrfs fi defrag /mnt/file
        # sync
        # filefrag -v /mnt/file
        Filesystem type is: 9123683e
        File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
         ext logical physical expected length flags
           0       0     3372              64
           1      64     3136     3435      1
           2      65     3436     3136     64
           3     129     3201     3499      1
           4     130     3500     3201     64
           5     194     3266     3563      1
           6     195     3564     3266     64
           7     259     3331     3627      1
           8     260     3628     3331     40 eof
      
      After this patch:
      
        ...
        # filefrag -v /mnt/file
        Filesystem type is: 9123683e
        File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
         ext logical physical expected length flags
           0       0     3372             300 eof
        /mnt/file: 1 extent found
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      cbcc8326
    • btrfs: fix memory leak in btrfs_defrag_file · 60ccf82f
      Diego Calleja committed
      kmemleak found this:
      unreferenced object 0xffff8801b64af968 (size 512):
        comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s)
        hex dump (first 32 bytes):
          00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff  ................
          80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff  ................
        backtrace:
          [<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0
          [<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240
          [<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20
          [<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210
          [<ffffffff812479d7>] cleaner_kthread+0x177/0x190
          [<ffffffff81075c7d>] kthread+0x8d/0xa0
          [<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      "pages" is not always freed. Fix it by removing the unnecessary additional return.
      Signed-off-by: Diego Calleja <diegocg@gmail.com>
      60ccf82f
  8. 20 Oct, 2011 2 commits
    • Btrfs: only inherit btrfs specific flags when creating files · e27425d6
      Josef Bacik committed
      Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
      weren't supposed to.  There isn't any specific documentation on this so I'm
      taking the test as the standard of how things work, and having S_APPEND set on a
      directory doesn't mean that S_APPEND gets inherited by its children according to
      this test.  So only inherit btrfs specific things.  This will let us set
      compress/nocompress on specific directories and everything in the directories
      will inherit this flag, same with nodatacow.  With this patch test 79 passes.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      e27425d6
    • Btrfs: use the inode's mapping mask for allocating pages · 3b16a4e3
      Josef Bacik committed
      Johannes pointed out we were allocating only kernel pages for doing writes,
      which is kind of a big deal if you are on 32bit and have more than a gig of ram.
      So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
      don't re-enter.  Thanks,
      Reported-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      3b16a4e3
  9. 11 Oct, 2011 2 commits
    • Btrfs: make sure not to defrag extents past i_size · f7f43cc8
      Chris Mason committed
      The btrfs file defrag code will loop through the extents and
      force COW on them.  But if there is a concurrent truncate in the middle of
      the defrag, it might end up defragging the same range over and over
      again.
      
      The problem is that writepage won't go through and do anything on pages
      past i_size, so the cow won't happen, so the file will appear to still
      be fragmented.  defrag will end up hitting the same extents again and
      again.
      
      In the worst case, the truncate can actually live lock with the defrag
      because the defrag keeps creating new ordered extents which the truncate
      code keeps waiting on.
      
      The fix here is to make defrag check for i_size inside the main loop,
      instead of just once before the looping starts.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f7f43cc8
    • Btrfs: fix recursive auto-defrag · 2a0f7f57
      Li Zefan committed
      Follow those steps:
      
        # mount -o autodefrag /dev/sda7 /mnt
        # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
        # sync
        # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc
      
      and then it'll go into a loop: writeback -> defrag -> writeback ...
      
      It's because writeback writes [8K, 200K] and then writes [0, 8K].
      
      I tried to make writeback know if the pages are dirtied by defrag,
      but the patch was a bit intrusive. Here I simply set writeback_index
      when we defrag a file.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      2a0f7f57
  10. 29 Sep, 2011 1 commit
  11. 21 Sep, 2011 1 commit
  12. 18 Sep, 2011 3 commits
  13. 11 Sep, 2011 2 commits
  14. 17 Aug, 2011 1 commit
  15. 02 Aug, 2011 1 commit
  16. 28 Jul, 2011 2 commits
    • Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik committed
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do its allocation and you
      get those lovely warnings.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9e0baf60
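The interleaving in the table above can be replayed deterministically. The sketch below is a single-threaded userspace replay using C11 `<stdatomic.h>` in place of the kernel atomic API; it performs the listed operations in the listed order, starting from 1 outstanding and 1 reserved extent. The function name `replay_race` is hypothetical, and the code illustrates the race rather than reproducing the btrfs reservation code.

```c
#include <stdatomic.h>

/*
 * Replays the Reserver/Releaser interleaving step by step. Stores the
 * final counter values through the out parameters and returns how many
 * extents the reserver actually reserved.
 */
static int replay_race(int *outstanding_end, int *reserved_end)
{
    atomic_int outstanding;
    atomic_int reserved;
    int want, have, to_reserve, expected;

    atomic_init(&outstanding, 1);
    atomic_init(&reserved, 1);

    /* Releaser: atomic_dec(outstanding), now 0 */
    atomic_fetch_sub(&outstanding, 1);

    /* Reserver: atomic_read(outstanding) + 1 gives 1 */
    want = atomic_load(&outstanding) + 1;
    /* Reserver: atomic_read(reserved) gives 1 */
    have = atomic_load(&reserved);
    /* They are the same, so nothing is actually reserved. */
    to_reserve = want > have ? want - have : 0;

    /* Releaser: atomic_cmpxchg(reserved, 1, 0), then frees 1 extent */
    expected = 1;
    atomic_compare_exchange_strong(&reserved, &expected, 0);

    /* Reserver: atomic_inc(outstanding), atomic_add(0, reserved) */
    atomic_fetch_add(&outstanding, 1);
    atomic_fetch_add(&reserved, to_reserve);

    *outstanding_end = atomic_load(&outstanding);
    *reserved_end = atomic_load(&reserved);
    return to_reserve;
}
```

The replay ends with one outstanding extent and zero reserved extents, and the reserver has reserved nothing: each operation is atomic on its own, but the read-compare-update sequence as a whole is not.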
    • Btrfs: use find_or_create_page instead of grab_cache_page · a94733d0
      Josef Bacik committed
      grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
      GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
      need GFP_NOFS so we don't deadlock.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      a94733d0
  17. 21 Jul, 2011 1 commit
  18. 16 Jun, 2011 1 commit
    • Btrfs: protect the pending_snapshots list with trans_lock · 8351583e
      Josef Bacik committed
      Currently there is nothing protecting the pending_snapshots list on the
      transaction.  We only hold the directory mutex that we are snapshotting and a
      read lock on the subvol_sem, so we could race with somebody else creating a
      snapshot in a different directory and end up with list corruption.  So protect
      this list with the trans_lock.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      8351583e
  19. 11 Jun, 2011 1 commit
  20. 04 Jun, 2011 1 commit
  21. 27 May, 2011 1 commit
  22. 24 May, 2011 5 commits
    • Btrfs: using rcu lock in the reader side of devices list · 1f78160c
      Xiao Guangrong committed
      fs_devices->devices is only updated on remove and add device paths, so we can
      use rcu to protect it in the reader side
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1f78160c
    • btrfs: Ensure the tree search ioctl returns the right number of records · e2156867
      Hugo Mills committed
      Btrfs's tree search ioctl has a field to indicate that no more than a
      given number of records should be returned. The ioctl doesn't honour
      this, as the tested value is not incremented until the end of the
      copy_to_sk function. This patch removes an unnecessary local variable,
      and updates the num_found counter as each key is found in the tree.
      Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e2156867
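The counting pattern this commit describes, updating the found counter as each key is copied so that the limit check always sees the current value, can be sketched as below. This is a hypothetical stand-in, not copy_to_sk(): keys are modelled as plain `int` values, and `copy_leaf`, `max_items` and the buffer layout are assumptions made for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a per-leaf copy routine. Copies items from
 * leaf into buf until max_items have been found in total. Returns 1 if
 * it stopped because the limit was reached with items still pending,
 * and 0 if the whole leaf was copied.
 */
static int copy_leaf(const int *leaf, size_t nritems, int *buf,
                     size_t max_items, size_t *num_found)
{
    for (size_t i = 0; i < nritems; i++) {
        if (*num_found >= max_items)
            return 1;

        buf[*num_found] = leaf[i];
        (*num_found)++; /* counted immediately, not at the end of the leaf */
    }
    return 0;
}
```

If the counter were only updated after the loop, the limit check inside the loop would compare against a stale value and a single leaf could return more records than requested.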
    • Btrfs: kill BTRFS_I(inode)->block_group · d82a6f1d
      Josef Bacik committed
      Originally this was going to be used as a way to give hints to the allocator,
      but frankly we can get much better hints elsewhere and it's not even used at all
      for anything useful.  In addition to being completely useless, when we initialize
      an inode we try and find a freeish block group to set as the inode's block group,
      and with a completely full 40gb fs this takes _forever_, so I imagine with say
      1tb fs this is just unbearable.  So just axe the thing altogether, we don't
      need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
      inode lookup in my testcase.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      d82a6f1d
    • Btrfs: kill trans_mutex · a4abeea4
      Josef Bacik committed
      We use trans_mutex for lots of things, here's a basic list
      
      1) To serialize trans_handles joining the currently running transaction
      2) To make sure that no new trans handles are started while we are committing
      3) To protect the dead_roots list and the transaction lists
      
      Really the serializing trans_handles joining is not too hard, and can really get
      bogged down in acquiring a reference to the transaction.  So replace the
      trans_mutex with a trans_lock spinlock and use it to do the following
      
      1) Protect fs_info->running_transaction.  All trans handles have to do is check
      this, and then take a reference of the transaction and keep on going.
      2) Protect the fs_info->trans_list.  This doesn't get used too much, basically
      it just holds the current transactions, which will usually just be the currently
      committing transaction and the currently running transaction at most.
      3) Protect the dead roots list.  This is only ever processed by splicing the
      list so this is relatively simple.
      4) Protect the fs_info->reloc_ctl stuff.  This is very lightweight and was using
      the trans_mutex before, so this is a pretty straightforward change.
      5) Protect fs_info->no_trans_join.  Because we don't hold the trans_lock over
      the entirety of the commit we need to have a way to block new people from
      creating a new transaction while we're doing our work.  So we set no_trans_join
      and in join_transaction we test to see if that is set, and if it is we do a
      wait_on_commit.
      6) Make the transaction use count atomic so we don't need to take locks to
      modify it when we're dropping references.
      7) Add a commit_lock to the transaction to make sure multiple people trying to
      commit the same transaction don't race and commit at the same time.
      8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
      trans.
      
      I have tested this with xfstests, but obviously it is a pretty hairy change so
      lots of testing is greatly appreciated.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      a4abeea4
    • Btrfs: take away the num_items argument from btrfs_join_transaction · 7a7eaa40
      Josef Bacik committed
      I keep forgetting that btrfs_join_transaction() just ignores the num_items
      argument, which leads me to sending pointless patches and looking stupid :).  So
      just kill the num_items argument from btrfs_join_transaction and
      btrfs_start_ioctl_transaction, since neither of them use it.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      7a7eaa40