1. 29 Jul, 2009 2 commits
  2. 28 Jul, 2009 3 commits
    • Btrfs: Fix async caching interaction with unmount · f25784b3
      Yan Zheng committed
      - Don't stop the caching thread until btrfs_commit_super returns.
      
      - If caching is interrupted by umount, set last to (u64)-1.
        Otherwise the un-scanned range of the block group will be considered
        a free extent.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: change how we unpin extents · 68b38550
      Josef Bacik committed
      We are racy with async block caching and unpinning extents.  This patch makes
      things much less complicated by only unpinning the extent if the block group is
      cached.  We check the block_group->cached var under the block_group->lock spin
      lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
      and unpin the extent and add the free space back.  If it is not set to this, we
      start the caching of the block group so the next time we unpin extents we can
      unpin the extent.  This keeps us from racing with the async caching threads,
      lets us kill the fs wide async thread counter, and keeps us from having to set
      DELALLOC bits for every extent we hit if there are caching kthreads going.
      
      One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
      instead of just looking for LOCKED extents, we also look for DIRTY extents,
      since we could have left some extents pinned in the previous transaction that
      will never get freed now that we are unmounting, which would cause us to leak
      memory.  So btrfs_free_super_mirror_extents has been changed to
      btrfs_free_pinned_extents, and it will clear the extents locked for the super
      mirror, and any remaining pinned extents that may be present.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
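The unpin rule described in 68b38550 can be sketched as a toy Python model. This is not kernel code: only `cached`, the per-block-group lock, and `BTRFS_CACHE_FINISHED` come from the commit message; the class layout, the `NO`/`STARTED` states, and the helper names are assumptions made for illustration.

```python
import threading
from enum import Enum


class CacheState(Enum):
    NO = 0        # hypothetical name: caching not started
    STARTED = 1   # hypothetical name: caching in progress
    FINISHED = 2  # stands in for BTRFS_CACHE_FINISHED


class BlockGroup:
    def __init__(self):
        self.lock = threading.Lock()  # models block_group->lock
        self.cached = CacheState.NO   # models block_group->cached
        self.pinned = 0               # pinned byte counter
        self.free_space = []          # (start, length) ranges handed back


def start_caching(bg):
    # Stand-in for kicking off the caching thread; caller holds bg.lock.
    if bg.cached is CacheState.NO:
        bg.cached = CacheState.STARTED


def try_unpin(bg, start, length):
    """Unpin only if the block group is fully cached; otherwise start caching."""
    with bg.lock:
        if bg.cached is CacheState.FINISHED:
            bg.pinned -= length
            bg.free_space.append((start, length))
            return True
        start_caching(bg)
        return False
```

An extent that is not unpinned simply stays pinned until a later pass finds the block group fully cached, which is why the model needs no global count of running caching threads.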
    • Btrfs: Correct redundant test in add_inode_ref · 631c07c8
      Julia Lawall committed
      dir has already been tested.  It seems that this test should be on the
      recently returned value inode.
      
      A simplified version of the semantic match that finds this problem is as
      follows: (http://www.emn.fr/x-info/coccinelle/)
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  3. 25 Jul, 2009 4 commits
    • Btrfs: find smallest available device extent during chunk allocation · 9779b72f
      Chris Mason committed
      Allocating new block groups is easy when the disk has plenty of space.
      But things get difficult as the disk fills up, especially if
      the FS has been run through btrfs-vol -b.  The balance operation
      is likely to make the total bytes available on the device greater
      than the largest extent we'll actually be able to allocate.
      
      But the device extent allocation code incorrectly assumes that a device
      with 5G free will be able to allocate a 5G extent.  It isn't normally a
      problem because device extents don't get freed unless btrfs-vol -b
      is run.
      
      This fixes the device extent allocator to remember the largest free
      extent it can find, and then uses that value as a fallback.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
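The fallback described in 9779b72f can be illustrated with a small Python sketch. It assumes the free holes on a device are already known as `(start, length)` pairs; the function name and return shape are hypothetical and do not mirror the kernel's find_free_dev_extent signature.

```python
def pick_dev_extent(holes, wanted):
    """Return (start, length) for a new device extent.

    holes: list of (start, length) free ranges on one device.
    Prefer the first hole that can hold `wanted` bytes. If none can,
    fall back to the largest hole seen during the scan. Return None
    when the device has no free range at all.
    """
    largest = None
    for start, length in holes:
        if length >= wanted:
            return (start, wanted)
        if largest is None or length > largest[1]:
            largest = (start, length)
    return largest
```

With 5G free split into a 2G hole and a 3G hole, a 5G request cannot be met, so the sketch returns the 3G hole and lets the caller size the allocation down.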
    • Btrfs: clear all space_info->full after removing a block group · 283bb197
      Chris Mason committed
      Btrfs allocates individual extents from block groups, and each
      block group has a specific type.  It may hold metadata or data,
      and may be mirrored, striped, etc.
      
      When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
      we free block groups.  Once a block group is freed, the space it was
      using on the device may be available for use by new block groups.
      
      btrfs_remove_block_group was clearing the flag that said
      'our devices are full, don't even try to allocate new block groups',
      but it was only clearing that flag for a specific type of block group.
      
      This commit clears the full flag for all of the types of block groups,
      making it much more likely that we'll be able to balance space when
      the drive is close to full.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: make flushoncommit mount option correctly wait on ordered_extents · ebecd3d9
      Sage Weil committed
      The commit_transaction call to wait_ordered_extents when snap_pending
      passes nocow_only=1 to process only NOCOW or PREALLOC extents.  This isn't
      correct for the 'flushoncommit' mode, as it skips extents we just started
      IO on in start_delalloc_inodes.
      
      So, in the flushoncommit case, wait on all ordered extents.  Otherwise,
      only pass the nocow_only flag to wait_ordered_extents if snap_pending.
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Avoid delayed reference update looping · d717aa1d
      Yan Zheng committed
      btrfs_split_leaf and btrfs_del_items can end up in a loop
      where one is constantly splitting a given leaf and the other
      is constantly merging it back with the adjacent nodes.
      
      There is a better fix for this, but in the interest of something
      small, this patch just changes btrfs_del_items back to balancing less
      often.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  4. 24 Jul, 2009 6 commits
    • Btrfs: Fix ordering of key field checks in btrfs_previous_item · 0a4eefbb
      Yan Zheng committed
      Check objectid of item before checking the item type, otherwise we may return
      zero for a key that is actually too low.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
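Why the check order in 0a4eefbb matters can be shown with a simplified Python model of a backward key walk. Keys are `(objectid, type, offset)` tuples in sorted order. The function returns an index or `None` rather than the kernel's 0/1 convention, and its name and parameters are assumptions for illustration only.

```python
def previous_item(keys, pos, min_objectid, wanted_type):
    """Walk backwards from keys[pos - 1] looking for an item of wanted_type.

    The objectid bound is tested BEFORE the type. If the type were tested
    first, an item with a matching type but an objectid below the bound
    would be reported as found.
    """
    i = pos - 1
    while i >= 0:
        objectid, ktype, _offset = keys[i]
        if objectid < min_objectid:
            return None  # walked past the lowest acceptable objectid
        if ktype == wanted_type:
            return i
        i -= 1
    return None
```

For keys `[(5, 1, 0), (7, 2, 0)]`, a search for type 1 with a minimum objectid of 7 correctly finds nothing; testing the type first would wrongly return the item belonging to objectid 5.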
    • Btrfs: find_free_dev_extent doesn't handle holes at the start of the device · 1fcbac58
      Yan Zheng committed
      find_free_dev_extent does not properly handle the case where
      the device is not completely free, and there is a free extent
      at the beginning of the device.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
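The case fixed in 1fcbac58 is a free range that precedes the first allocated device extent. A minimal Python sketch of hole enumeration that covers it follows; the `(start, length)` representation and the function name are assumptions, not the kernel's data structures.

```python
def free_holes(allocated, search_start, dev_size):
    """List free (start, length) ranges in [search_start, dev_size).

    allocated: sorted, non-overlapping (start, length) device extents.
    The cursor starts at search_start, so a gap before the first
    allocated extent is reported like any other hole.
    """
    holes = []
    cursor = search_start
    for start, length in allocated:
        if start > cursor:
            holes.append((cursor, start - cursor))
        cursor = max(cursor, start + length)
    if cursor < dev_size:
        holes.append((cursor, dev_size - cursor))
    return holes
```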
    • Btrfs: Remove code duplication in comp_keys · 20736aba
      Diego Calleja committed
      comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
      call it.
      Signed-off-by: Diego Calleja <diegocg@gmail.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
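The delegation in 20736aba can be modelled in Python as follows. The sketch assumes an on-disk key packed as a little-endian 64-bit objectid, an 8-bit type, and a 64-bit offset; the Python function names merely echo the kernel ones and are not the real implementations.

```python
import struct
from collections import namedtuple

Key = namedtuple("Key", "objectid type offset")

DISK_KEY = struct.Struct("<QBQ")  # assumed packed little-endian layout


def comp_cpu_keys(k1, k2):
    """Three-way compare by objectid, then type, then offset."""
    a, b = tuple(k1), tuple(k2)
    return (a > b) - (a < b)


def comp_keys(disk_key, cpu_key):
    """Decode the on-disk key once, then reuse the CPU-key comparison."""
    return comp_cpu_keys(Key(*DISK_KEY.unpack(disk_key)), cpu_key)
```

Keeping a single comparison routine means the field ordering is defined in one place only.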
    • Btrfs: async block group caching · 817d52f8
      Josef Bacik committed
      This patch moves the caching of the block group off to a kthread in order to
      allow people to allocate sooner.  Instead of blocking up behind the caching
      mutex, we instead kick off the caching kthread, and then attempt to make an
      allocation.  If we cannot, we wait on the block group's caching waitqueue; the
      caching kthread wakes the waiting threads every time it finds 2 meg worth of
      space, and then again when it's finished caching.  This is how I tested the
      speedup from this:
      
      mkfs the disk
      mount the disk
      fill the disk up with fs_mark
      unmount the disk
      mount the disk
      time touch /mnt/foo
      
      Without my changes this took 11 seconds on my box, with these changes it now
      takes 1 second.
      
      Another change that's been put in place is we lock the super mirrors in the
      pinned extent map in order to keep us from adding that stuff as free space when
      caching the block group.  This doesn't really change anything else as far as the
      pinned extent map is concerned, since for actual pinned extents we use
      EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
      those extents to keep from leaking memory.
      
      I've also added a check where when we are reading block groups from disk, if the
      amount of space used == the size of the block group, we go ahead and mark the
      block group as cached.  This drastically reduces the amount of time it takes to
      cache the block groups.  Using the same test as above, except doing a dd to a
      file and then unmounting, it used to take 33 seconds to umount, now it takes 3
      seconds.
      
      This version uses the commit_root in the caching kthread, and then keeps track
      of how many async caching threads are running at any given time so if one of the
      async threads is still running as we cross transactions we can wait until it's
      finished before handling the pinned extents.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
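The wake-up policy described in 817d52f8 (wake waiters for every 2 meg of free space found, then once more at the end) can be sketched single-threaded in Python. The `wake` callback stands in for waking the waitqueue; the function name and the extent representation are hypothetical.

```python
WAKE_BYTES = 2 * 1024 * 1024  # "2 meg" from the commit message


def cache_block_group(free_extents, wake):
    """Scan free extents in order, calling wake(total_found) as space appears.

    free_extents: iterable of (start, length) discovered by the scan.
    wake: callable standing in for waking threads blocked on allocation.
    """
    total = 0
    since_wake = 0
    for _start, length in free_extents:
        total += length
        since_wake += length
        if since_wake >= WAKE_BYTES:
            wake(total)      # waiters can retry their allocation now
            since_wake = 0
    wake(total)              # final wake-up once caching has finished
    return total
```

Waking early is what lets an allocation proceed before the whole block group has been scanned, which is where the reported drop from 11 seconds to 1 second comes from.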
    • Btrfs: use hybrid extents+bitmap rb tree for free space · 96303081
      Josef Bacik committed
      Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
      tracking free space.  As free space gets fragmented, we end up with thousands of
      entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
      we currently don't ever flush free space cache back to disk this gets to be a
      bit unwieldy on large fs's with lots of fragmentation.
      
      This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
      space cache.  Initially we calculate a threshold of extent entries we can
      handle, which is however many extent entries we can cram into 16k of ram.  The
      maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
      will be 32k of RAM, which scales much better than we did before.
      
      Once we pass the extent threshold, we start adding bitmaps and using those
      instead for tracking the free space.  This patch also makes it so that any free
      space that's less than 4 * sectorsize we go ahead and put into a bitmap.  This is
      nice since we try and allocate out of the front of a block group, so if the
      front of a block group is heavily fragmented and then has a huge chunk of free
      space at the end, we go ahead and add the fragmented areas to bitmaps and use a
      normal extent entry to track the big chunk at the back of the block group.
      
      I've also taken the opportunity to revamp how we search for free space.
      Previously we indexed free space via an offset indexed rb tree and a bytes
      indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
      indexed rb tree.  This cuts the number of tree operations we were doing
      previously down by half, and gives us a little bit of a better allocation
      pattern since we will always start from a specific offset and search forward
      from there, instead of searching for the size we need and try and get it as
      close as possible to the offset we want.
      
      I've given this a healthy amount of testing pre-new format stuff, as well as
      post-new format stuff.  I've booted up my fedora box which is installed on btrfs
      with this patch and ran with it for a few days without issues.  I've not seen
      any performance regressions in any of my tests.
      
      Since the last patch Yan Zheng fixed a problem where we could have overlapping
      entries, so updating their offset inline would cause problems.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
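The extent-versus-bitmap decision in 96303081 can be approximated in Python, going only by the commit message. The 16k budget and the `4 * sectorsize` cutoff are from the text; the per-entry size passed in as `entry_size` is a hypothetical parameter, and the real kernel heuristic may differ in detail.

```python
EXTENT_BUDGET_BYTES = 16 * 1024  # RAM allowed for extent entries per block group


def extent_threshold(entry_size):
    """How many extent entries fit into the 16k budget."""
    return EXTENT_BUDGET_BYTES // entry_size


def use_bitmap(extent_entries, length, sectorsize, entry_size):
    """Decide whether a free range of `length` bytes goes into a bitmap.

    Small fragments always go to a bitmap. Larger ranges use a plain
    extent entry until the entry count reaches the threshold.
    """
    if length < 4 * sectorsize:
        return True
    return extent_entries >= extent_threshold(entry_size)
```

With an assumed 64-byte entry the threshold is 256 extent entries; beyond that, every new free range is tracked in bitmaps.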
    • jfs: Fix early release of acl in jfs_get_acl · 4a19fb11
      Stefan Bader committed
      BugLink: http://bugs.launchpad.net/ubuntu/+bug/396780
      
      Commit 073aaa1b "helpers for acl
      caching + switch to those" introduced new helper functions for
      acl handling but seems to have introduced a regression for jfs as
      the acl is released before returning it to the caller, instead of
      leaving this for the caller to do.
      This causes the acl object to be used after freeing it, leading
      to kernel panics in completely different places.
      
      Thanks to Christophe Dumez for reporting and bisecting into this.
      Reported-by: Christophe Dumez <dchris@gmail.com>
      Tested-by: Christophe Dumez <dchris@gmail.com>
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Acked-by: Andy Whitcroft <apw@canonical.com>
      Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
  5. 23 Jul, 2009 9 commits
  6. 22 Jul, 2009 15 commits
    • Btrfs: make sure all dirty blocks are written at commit time · 4a8c9a62
      Yan Zheng committed
      Writing dirty block groups may allocate new blocks, and so may add new delayed
      back refs. btrfs_run_delayed_refs may in turn make some block groups dirty.
      
      commit_cowonly_roots does not handle the recursion properly, and some dirty
      blocks can be left unwritten at commit time. This patch moves
      btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
      the code not break out of the loop until there are no dirty block groups or
      delayed back refs.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
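The loop structure described in 4a8c9a62 amounts to running to a fixed point: keep processing until neither dirty block groups nor delayed refs remain. A toy Python model follows; the `ToyTransaction` class and its `budget` field, which bounds how often writing generates new work, are inventions for illustration.

```python
class ToyTransaction:
    def __init__(self, dirty_groups, delayed_refs, budget=3):
        self.dirty_groups = dirty_groups
        self.delayed_refs = delayed_refs
        self.budget = budget  # hypothetical: writes that still allocate blocks

    def run_delayed_refs(self):
        # Processing delayed refs may dirty a block group.
        if self.delayed_refs:
            self.dirty_groups += 1
        self.delayed_refs = 0

    def write_dirty_block_groups(self):
        # Writing block groups may allocate blocks, adding a delayed ref.
        if self.dirty_groups and self.budget > 0:
            self.delayed_refs += 1
            self.budget -= 1
        self.dirty_groups = 0


def commit(txn):
    """Loop until there are no dirty block groups and no delayed refs."""
    rounds = 0
    while txn.dirty_groups or txn.delayed_refs:
        txn.run_delayed_refs()
        txn.write_dirty_block_groups()
        rounds += 1
    return rounds
```

A single pass would stop with a delayed ref still pending; testing both conditions in the loop header is what guarantees nothing is left unwritten.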
    • Btrfs: fix locking issue in btrfs_find_next_key · 33c66f43
      Yan Zheng committed
      When walking up the tree, btrfs_find_next_key assumes the upper level tree
      block is properly locked. This isn't always true even when path->keep_locks is 1.
      This is because btrfs_find_next_key may advance path->slots[] several times
      instead of only once.
      
      When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
      we can't guarantee the original value of 'path->slots[level]' is
      'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
      'level + 1' isn't locked.
      
      This patch fixes the issue by explicitly checking the locking state,
      re-searching the tree if it's not locked.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf · e457afec
      Yan Zheng committed
      If 1 is returned by btrfs_search_slot, the path already points to the
      first item with 'key > searching key'. So increasing path->slots[0] by
      one is superfluous in that case.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
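The return convention behind e457afec can be mimicked with Python's `bisect` on a sorted list of keys. This models only the "0 means exact match, 1 means slot of the first greater key" behaviour described in the commit; the function names are assumptions and no tree is involved.

```python
import bisect


def search_slot(keys, key):
    """Return (0, slot) on an exact match, else (1, slot of first key > key)."""
    slot = bisect.bisect_left(keys, key)
    if slot < len(keys) and keys[slot] == key:
        return 0, slot
    return 1, slot


def next_slot_after(keys, last_key):
    """Slot of the first item strictly after last_key."""
    ret, slot = search_slot(keys, last_key)
    if ret == 0:
        slot += 1  # exact match: step past the item we already saw
    # ret == 1: slot already points past last_key, so do not increment again
    return slot
```

For keys `[10, 20, 30]` and a last key of 15, an unconditional increment would skip the item 20.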
    • Btrfs: properly update space information after shrinking device. · bf1fb512
      Yan Zheng committed
      Change 'goto done' to 'break' for the case where all device extents have
      been freed, so that the code that updates space information will be executed.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix definition of struct btrfs_extent_inline_ref · 1bec1aed
      Yan Zheng committed
      Use __le64 instead of u64 in the on-disk structure definition.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • NFSv4: Fix a problem whereby a buggy server can oops the kernel · d953126a
      Trond Myklebust committed
      We just had a case in which a buggy server occasionally returns the wrong
      attributes during an OPEN call. While the client does catch this sort of
      condition in nfs4_open_done(), and causes the nfs4_atomic_open() to return
      -EISDIR, the logic in nfs_atomic_lookup() is broken, since it causes a
      fallback to an ordinary lookup instead of just returning the error.
      
      When the buggy server then returns a regular file for the fallback lookup,
      the VFS allows the open, and bad things start to happen, since the open
      file doesn't have any associated NFSv4 state.
      
      The fix is firstly to return the EISDIR/ENOTDIR errors immediately, and
      secondly to ensure that we are always careful when dereferencing the
      nfs_open_context state pointer.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: Fix an NFSv4 mount regression · fccba804
      Trond Myklebust committed
      Commit 008f55d0 (nfs41: recover lease in
      _nfs4_lookup_root) forces the state manager to always run on mount. This is
      a bug in the case of NFSv4.0, which doesn't require us to send a
      setclientid until we want to grab file state.
      
      In any case, this is completely the wrong place to be doing state
      management. Moving that code into nfs4_init_session...
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • NFSv4: Fix an Oops in nfs4_free_lock_state · b64aec8d
      Trond Myklebust committed
      The oops http://www.kerneloops.org/raw.php?rawid=537858&msgid= appears to
      be due to the nfs4_lock_state->ls_state field being uninitialised. This
      happens if the call to nfs4_free_lock_state() is triggered at the end of
      nfs4_get_lock_state().
      
      The fix is to move the initialisation of ls_state into the allocator.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • inotify: use GFP_NOFS under potential memory pressure · f44aebcc
      Eric Paris committed
      inotify can have watches removed under filesystem reclaim.
      
      =================================
      [ INFO: inconsistent lock state ]
      2.6.31-rc2 #16
      ---------------------------------
      inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
      khubd/217 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (iprune_mutex){+.+.?.}, at: [<c10ba899>] invalidate_inodes+0x20/0xe3
      {IN-RECLAIM_FS-W} state was registered at:
        [<c10536ab>] __lock_acquire+0x2c9/0xac4
        [<c1053f45>] lock_acquire+0x9f/0xc2
        [<c1308872>] __mutex_lock_common+0x2d/0x323
        [<c1308c00>] mutex_lock_nested+0x2e/0x36
        [<c10ba6ff>] shrink_icache_memory+0x38/0x1b2
        [<c108bfb6>] shrink_slab+0xe2/0x13c
        [<c108c3e1>] kswapd+0x3d1/0x55d
        [<c10449b5>] kthread+0x66/0x6b
        [<c1003fdf>] kernel_thread_helper+0x7/0x10
        [<ffffffff>] 0xffffffff
      
      Two things are needed to fix this.  First we need a method to tell
      fsnotify_create_event() to use GFP_NOFS and second we need to stop using
      one global IN_IGNORED event and allocate them one at a time.  This solves
      current issues with multiple IN_IGNORED on a queue having tail drop
      problems and simplifies the allocations since we don't have to worry about
      two tasks operating on the IGNORED event concurrently.
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • fsnotify: fix inotify tail drop check with path entries · c05594b6
      Eric Paris committed
      fsnotify drops new events when they are the same as the tail event on the
      queue to be sent to userspace.  The problem is that if the event comes with
      a path we forget to break out of the switch statement and fall into the
      code path which matches on events that do not have any type of file backed
      information (things like IN_UNMOUNT and IN_Q_OVERFLOW).  The problem is
      that this code thinks all such events should be dropped.  Fix is to add a
      break.
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • inotify: check filename before dropping repeat events · 4a148ba9
      Eric Paris committed
      inotify drops events if the last event on the queue is the same as the
      current event.  But it does 2 things wrong.  First it is comparing old->inode
      with new->inode.  But after an event is put on the queue the ->inode is no
      longer allowed to be used.  It's possible that between the last event and this
      new event the inode could be reused and we would falsely match the inode's
      memory address between two differing events.
      
      The second problem is that when a file is removed fsnotify is passed the
      negative dentry for the removed object rather than the positive dentry from
      immediately before the removal.  This means the (broken) inotify tail drop code
      was matching the NULL ->inode of differing events.
      
      The fix is to check the file name which is stored with events when doing the
      tail drop instead of wrongly checking the address of the stored ->inode.
      Reported-by: Scott James Remnant <scott@ubuntu.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
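The tail-drop comparison from 4a148ba9 can be sketched in Python as a value comparison that includes the file name. The `Event` fields other than `name` are assumptions chosen for illustration; the actual kernel event structure is not shown in the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    mask: int    # hypothetical: event type bits
    watch: int   # hypothetical: which watch the event belongs to
    name: str    # file name stored with the event


def should_drop(tail: Optional[Event], new: Event) -> bool:
    """Drop `new` only if it duplicates the queue's tail event by value.

    Comparing stored values, including the name, avoids relying on the
    identity of an inode object that may have been reused or may be
    NULL for a just-removed file.
    """
    return tail is not None and tail == new
```

Two deletions of different files under the same watch therefore stay distinct, whereas an inode-pointer comparison could treat them as duplicates.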
    • fsnotify: use def_bool in kconfig instead of letting the user choose · 520dc2a5
      Eric Paris committed
      fsnotify doesn't give the user anything.  If someone chooses inotify or
      dnotify it should build fsnotify; if they don't select one it shouldn't be
      built.  This patch changes fsnotify to be a def_bool=n and makes everything
      else select it.  Also fixes the issue people complained about on lwn where
      gdm hung because they didn't have inotify and they didn't get the inotify
      build option.
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • inotify: fix error paths in inotify_update_watch · 7e790dd5
      Eric Paris committed
      inotify_update_watch could leave things in a horrid state on a number of
      error paths.  We could try to remove idr entries that didn't exist, we
      could send an IN_IGNORED to userspace for watches that don't exist, and a
      bit of other stupidity.  Clean these up by doing the idr addition before we
      put the mark on the inode since we can clean that up on error and getting
      off the inode's mark list is hard.
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • inotify: do not leak inode marks in inotify_add_watch · 75fe2b26
      Eric Paris committed
      inotify_add_watch had a couple of problems.  The biggest being that if
      inotify_add_watch was called on the same inode twice (to update or change the
      event mask) a reference was taken on the original inode mark by
      fsnotify_find_mark_entry but was not being dropped at the end of the
      inotify_add_watch call.  Thus if inotify_rm_watch was called, although the mark
      was removed from the inode, the refcnt wouldn't hit zero and we would leak
      memory.
      Reported-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • inotify: drop user watch count when a watch is removed · 5549f7cd
      Eric Paris committed
      The inotify rewrite forgot to drop the inotify watch use count when a watch
      was removed.  This means that a single inotify fd can only ever register a
      maximum of /proc/sys/fs/max_user_watches even if some of those had been
      freed.
      Signed-off-by: Eric Paris <eparis@redhat.com>
  7. 21 Jul, 2009 1 commit