1. 21 Oct, 2011 19 commits
    • GFS2: Correctly set goal block after allocation · ccad4e14
      Committed by Steven Whitehouse
      The new goal block should be set to the end of the newly
      allocated extent, not the start of it.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
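The goal-block fix described above can be illustrated with a minimal sketch. The `struct extent` type and `next_goal()` helper are hypothetical names made up for this example, not GFS2 code, and the choice of the last block in the extent (rather than one past it) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a newly allocated extent. */
struct extent {
    uint64_t start; /* first block of the extent */
    uint32_t len;   /* number of blocks in the extent */
};

/*
 * Return the goal for the next allocation: the last block of the
 * extent just allocated, so the next search starts where this one
 * ended instead of rescanning blocks that are already in use.
 */
static uint64_t next_goal(const struct extent *e)
{
    assert(e->len > 0); /* an empty extent has no end block */
    return e->start + e->len - 1;
}
```

With an extent starting at block 100 and 8 blocks long, the goal becomes block 107 rather than block 100.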
    • GFS2: Fix AIL flush issue during fsync · b5b24d7a
      Committed by Steven Whitehouse
      Unfortunately, it is not enough to just ignore locked buffers during
      the AIL flush from fsync. We need to be able to ignore all buffers
      which are locked, dirty or pinned at this stage as they might have
      been added subsequent to the log flush earlier in the fsync function.
      
      In addition, this means that we no longer need to rely on i_mutex to
      keep out writes during fsync, so we can, as a side-effect, remove
      that protection too.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Tested-By: Abhijith Das <adas@redhat.com>
    • GFS2: Use cached rgrp in gfs2_rlist_add() · 70b0c365
      Committed by Steven Whitehouse
      Each block which is deallocated requires a call to gfs2_rlist_add(),
      and each of those calls was calling gfs2_blk2rgrpd() in order to
      figure out which rgrp the block belonged in. This can be sped up
      by making use of the rgrp cached in the inode. We also reset this
      cached rgrp in case the block has changed rgrp. This should provide
      a big reduction in gfs2_blk2rgrpd() calls during deallocation.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Call do_strip() directly from recursive_scan() · d56fa8a1
      Committed by Steven Whitehouse
      The recursive_scan() function only ever takes a single "bc"
      argument, so we might as well just call do_strip() directly
      from recursive_scan() rather than pass it in as an argument.
      
      Also the "data" argument is always a struct strip_mine, so
      we can pass that in, rather than using a void pointer.
      
      This also moves do_strip() ahead of recursive_scan() so that
      we don't need to add a prototype.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Remove obsolete assert · 534029e2
      Committed by Steven Whitehouse
      Given that a resource group has been locked, there is no reason why
      we should not be able to allocate as many blocks as are free. The
      al_requested parameter should really be considered as a minimum
      number of blocks to be available. Should this limit be overshot,
      there are other mechanisms which will prevent over allocation.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Cache the most recently used resource group in the inode · 54335b1f
      Committed by Steven Whitehouse
      This means that after the initial allocation for any inode, the
      last used resource group is cached in the inode for future use.
      This drastically reduces the number of lookups of resource
      groups in the common case, and thus the contention on that
      data structure.
      
      The allocation algorithm is the same as previously, except that we
      always check to see if the goal block is within the cached rgrp
      first before going to the rbtree to look one up.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
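The "check the cached rgrp first" idea from the commit above might look roughly like the following sketch. All names here (`struct rgrp`, `struct inode_cache`, `find_rgrp()`) are hypothetical, and a linear table scan stands in for the rbtree lookup purely to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical resource group: a contiguous range of blocks. */
struct rgrp {
    uint64_t first;   /* first block covered */
    uint64_t nblocks; /* number of blocks covered */
};

/* Hypothetical per-inode state holding the last used rgrp. */
struct inode_cache {
    const struct rgrp *cached;
    unsigned long slow_lookups; /* how often the slow path ran */
};

static int rgrp_contains(const struct rgrp *r, uint64_t blk)
{
    return r != NULL && blk >= r->first && blk - r->first < r->nblocks;
}

/* Slow path: a linear scan standing in for the rbtree lookup. */
static const struct rgrp *slow_lookup(const struct rgrp *tbl, size_t n,
                                      uint64_t blk)
{
    for (size_t i = 0; i < n; i++)
        if (rgrp_contains(&tbl[i], blk))
            return &tbl[i];
    return NULL;
}

/* Fast path first: only fall back to the shared structure on a miss. */
static const struct rgrp *find_rgrp(struct inode_cache *ic,
                                    const struct rgrp *tbl, size_t n,
                                    uint64_t goal)
{
    if (rgrp_contains(ic->cached, goal))
        return ic->cached;
    ic->slow_lookups++;
    ic->cached = slow_lookup(tbl, n, goal); /* refresh the cache */
    return ic->cached;
}
```

Repeated allocations near the same goal block hit the cached entry, so the shared lookup structure is touched only when the goal moves to a different resource group.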
    • GFS2: Make resource groups "append only" during life of fs · 8339ee54
      Committed by Steven Whitehouse
      Since we have ruled out supporting online filesystem shrink,
      it is possible to make the resource group list append only
      during the life of a super block. This gives several benefits:
      
      Firstly, we only need to read new rindex elements as they are added
      rather than needing to reread the whole rindex file each time one
      element is added.
      
      Secondly, the rindex glock can be held for much shorter periods of
      time, and is completely removed from the fast path for allocations.
      The lock is taken in shared mode only when updating the resource
      groups when the first allocation occurs, and after a grow has
      taken place.
      
      Thirdly, this results in a reduction in code size, and everything
      gets a lot simpler to understand in this area.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme · 7c9ca621
      Committed by Bob Peterson
      Here is an update of Bob's original rbtree patch which, in addition, also
      resolves the rather strange ref counting that was being done relating to
      the bitmap blocks.
      
      Originally we had a dual system for journaling resource groups. The metadata
      blocks were journaled and also the rgrp itself was added to a list. The reason
      for adding the rgrp to the list in the journal was so that the "repolish
      clones" code could be run to update the free space, and potentially send any
      discard requests when the log was flushed. This was done by comparing the
      "cloned" bitmap with what had been written back on disk during the transaction
      commit.
      
      Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
      until the journal had been flushed. For that reason, there was a rather
      complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
      both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
      count on the buffers.
      
      However, the journal maintains a reference count on the buffers anyway, since
      they are being journaled as metadata buffers. So by moving the code which deals
      with the post-journal accounting for bitmap blocks to the metadata journaling
      code, we can entirely dispense with the rather strange buffer ref counting
      scheme and also the requirement to journal the rgrps.
      
      The net result of all this is that the ->sd_rindex_spin is left to do exactly
      one job, and that is to look after the rbtree of rgrps.
      
      This patch is designed to be a stepping stone towards using RCU for the rbtree
      of resource groups, however the reduction in the number of uses of the
      ->sd_rindex_spin is likely to have benefits for multi-threaded workloads,
      anyway.
      
      The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
      be removed in future in favour of calling the functions directly where required
      in the code. That will allow locking of resource groups without needing to
      actually read them in - something that could be useful in speeding up statfs.
      
      In the meantime though it is valid to dereference ->bi_bh only when the rgrp
      is locked. This is basically the same rule as before, modulo the references not
      being valid until the following journal flush.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
    • GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added · 9453615a
      Committed by Steven Whitehouse
      We need to take the inode's glock whenever the inode's size
      is referenced, otherwise it might not be uptodate. Even
      though generic_file_llseek_unlocked() doesn't implement
      SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
      size in those cases, so we need to add them to the list
      of origins which need the glock.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
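The rule in the lseek commit above (any origin that consults the inode size needs the glock) can be sketched as a small classifier. The function name `whence_needs_size()` is hypothetical, and the fallback definitions use the Linux values for `SEEK_DATA` and `SEEK_HOLE` in case the C library headers do not expose them.

```c
#include <stdio.h> /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Fallbacks with the Linux values, for headers that hide them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Returns 1 if the seek origin references the file size (and so, on
 * a cluster filesystem, needs the size to be up to date first),
 * 0 if it does not, and -1 for an unknown origin.
 */
static int whence_needs_size(int whence)
{
    switch (whence) {
    case SEEK_END:
    case SEEK_DATA:
    case SEEK_HOLE:
        return 1;
    case SEEK_SET:
    case SEEK_CUR:
        return 0;
    default:
        return -1;
    }
}
```

In this sketch a caller would take the lock only when the function returns 1, leaving plain `SEEK_SET` and `SEEK_CUR` seeks lock-free.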
    • GFS2: Clean up gfs2_create · 9a63edd1
      Committed by Steven Whitehouse
      If we pass through knowledge of whether the creation is intended to be
      exclusive or not, then we can deal with that in gfs2_create_inode
      and remove one set of locking. Also this removes the loop in
      gfs2_create and simplifies the code a bit.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
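As a rough illustration of why passing the "exclusive" flag down removes the need for a retry loop, consider the sketch below. It is not the GFS2 implementation: `struct entry` and `create_entry()` are hypothetical, and the lookup result is passed in as a pointer to keep the example self-contained.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical directory entry. */
struct entry {
    int ino;
};

/*
 * Decide the outcome of a create in one place. 'existing' is the
 * result of a lookup done under the same lock (NULL if the name is
 * free), 'fresh' is the newly built entry to use otherwise.
 * Returns 0 and sets *out on success, or -EEXIST when an exclusive
 * create finds the name already taken.
 */
static int create_entry(struct entry *existing, struct entry *fresh,
                        int excl, struct entry **out)
{
    if (existing != NULL) {
        if (excl)
            return -EEXIST;
        *out = existing; /* non-exclusive create: reuse what is there */
        return 0;
    }
    *out = fresh;
    return 0;
}
```

Because the function that holds the lock knows whether the create is exclusive, the caller does not have to drop the lock, look the name up again, and loop.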
    • GFS2: Use ->dirty_inode() · ab9bbda0
      Committed by Steven Whitehouse
      The aim of this patch is to use the newly enhanced ->dirty_inode()
      super block operation to deal with atime updates, rather than
      piggy backing that code into ->write_inode() as is currently
      done.
      
      The net result is a simplification of the code in various places
      and a reduction of the number of gfs2_dinode_out() calls since
      this is now implied by ->dirty_inode().
      
      Some of the mark_inode_dirty() calls have been moved under glocks
      in order to take advantage of then being able to avoid locking in
      ->dirty_inode() when we already have suitable locks.
      
      One consequence is that generic_write_end() now correctly deals
      with file size updates, so that we do not need a separate check
      for that afterwards. This also, indirectly, means that fdatasync
      should work correctly on GFS2 - the current code always syncs the
      metadata whether it needs to or not.
      
      Has survived testing with postmark (with and without atime) and
      also fsx.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix bug trap and journaled data fsync · f1818529
      Committed by Steven Whitehouse
      Journaled data requires that a complete flush of all dirty data for
      the file is done, in order that the ail flush which comes after
      will succeed.
      
      Also the recently enhanced bug trap can trigger falsely in case
      an ail flush from fsync races with a page read. This updates the
      bug trap such that it will ignore buffers which are locked and
      only trigger on dirty and/or pinned buffers when the ail flush
      is run from fsync. The original bug trap is retained when ail
      flush is run from ->go_sync().
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix inode allocation error path · 40ac218f
      Committed by Steven Whitehouse
      If we have got far enough through the inode allocation code
      path that an inode has already been allocated, then we must
      call iput to dispose of it, if an error occurs during a
      later part of the process. This will always be the final iput
      since there will be no other references to the inode.
      
      Unlike when the inode has been unlinked, its block state will
      be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
      to skip the test in ->evict_inode() for this one case in order
      to ensure that it will be deallocated correctly. This patch adds
      a new flag in order to ensure that this will happen correctly.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Make atime checks more efficient · 1d4ec642
      Committed by Steven Whitehouse
      We do not need to start a transaction unless the atime
      check has proved positive. Also if we are going to flush
      the complete ail list anyway, we might as well skip the
      writeback for this specific inode's metadata, since that
      will be done as part of the ail writeback process in an
      order offering potentially more efficient I/O.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix bug-trap in ail flush code · 75549186
      Committed by Steven Whitehouse
      The assert was being tested under the wrong lock, a
      legacy of the original code. Also, if it does trigger,
      the resulting information was not always a lot of help.
      
      This moves the assert under the correct lock and also
      prints out more useful information for tracking down the
      source of the problem.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Split data write & wait in fsync · 2f0264d5
      Committed by Steven Whitehouse
      Now that the data writing is part of fsync proper, we can split
      the waiting part out and do it later on. This reduces the
      number of waits that we do during fsync on average.
      
      There is also no need to take the i_mutex unless we are flushing
      metadata to disk, so we can move that to within the metadata
      flushing code.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Clean up dir hash table reading · 4c28d338
      Committed by Steven Whitehouse
      Since there is now only a single caller to gfs2_dir_read_data()
      and it has a number of constant arguments, we can factor
      those out. Also some tests relating to the inode size were
      being done twice.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · fd11e153
      Committed by Linus Torvalds
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc: Add alignment flag to PCI expansion resources
        sparc: Avoid calling sigprocmask()
        sparc: Use set_current_blocked()
        sparc32,leon: SRMMU MMU Table probe fix
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 505f48b5
      Committed by Linus Torvalds
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        fib_rules: fix unresolved_rules counting
        r8169: fix wrong eee setting for rlt8111evl
        r8169: fix driver shutdown WoL regression.
        ehea: Change maintainer to me
        pptp: pptp_rcv_core() misses pskb_may_pull() call
        tproxy: copy transparent flag when creating a time wait
        pptp: fix skb leak in pptp_xmit()
        bonding: use local function pointer of bond->recv_probe in bond_handle_frame
        smsc911x: Add support for SMSC LAN89218
        tg3: negate USE_PHYLIB flag check
        netconsole: enable netconsole can make net_device refcnt incorrent
        bluetooth: Properly clone LSM attributes to newly created child connections
        l2tp: fix a potential skb leak in l2tp_xmit_skb()
        bridge: fix hang on removal of bridge via netlink
        x25: Prevent skb overreads when checking call user data
        x25: Handle undersized/fragmented skbs
        x25: Validate incoming call user data lengths
        udplite: fast-path computation of checksum coverage
        IPVS netns shutdown/startup dead-lock
        netfilter: nf_conntrack: fix event flooding in GRE protocol tracker
  2. 20 Oct, 2011 6 commits
  3. 19 Oct, 2011 14 commits
  4. 18 Oct, 2011 1 commit
    • cputimer: Cure lock inversion · bcd5cff7
      Committed by Peter Zijlstra
      There's a lock inversion between the cputimer->lock and rq->lock;
      notably the two callchains involved are:
      
       update_rlimit_cpu()
         sighand->siglock
         set_process_cpu_timer()
           cpu_timer_sample_group()
             thread_group_cputimer()
               cputimer->lock
               thread_group_cputime()
                 task_sched_runtime()
                   ->pi_lock
                   rq->lock
      
       scheduler_tick()
         rq->lock
         task_tick_fair()
           update_curr()
             account_group_exec()
               cputimer->lock
      
      Where the first one is enabling a CLOCK_PROCESS_CPUTIME_ID timer, and
      the second one is keeping it up-to-date.
      
      This problem was introduced by e8abccb7 ("posix-cpu-timers: Cure
      SMP accounting oddities").
      
      Cure the problem by removing the cputimer->lock and rq->lock nesting.
      This leaves concurrent enablers doing duplicate work, but the time
      wasted should be on the same order as that otherwise wasted spinning on
      the lock, and the greater-than assignment filter should ensure we
      preserve monotonicity.
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: Simon Kirby <sim@hostway.ca>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Link: http://lkml.kernel.org/r/1318928713.21167.4.camel@twins
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
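The "greater-than assignment filter" mentioned in the cputimer commit can be sketched as follows. The struct and function names are hypothetical stand-ins, not the kernel's definitions. The idea is that when several enablers race and each computes its own sample, a stored total is only ever overwritten by a larger value, so the stored totals never go backwards.

```c
#include <stdint.h>

/* Hypothetical accumulated CPU-time totals for a thread group. */
struct cputime_totals {
    uint64_t utime;            /* user time */
    uint64_t stime;            /* system time */
    uint64_t sum_exec_runtime; /* total scheduled runtime */
};

/*
 * Merge sample 'b' into the stored totals 'a', field by field,
 * keeping the larger value. A stale (smaller) sample from a
 * concurrent enabler therefore cannot move a total backwards.
 */
static void update_gt(struct cputime_totals *a,
                      const struct cputime_totals *b)
{
    if (b->utime > a->utime)
        a->utime = b->utime;
    if (b->stime > a->stime)
        a->stime = b->stime;
    if (b->sum_exec_runtime > a->sum_exec_runtime)
        a->sum_exec_runtime = b->sum_exec_runtime;
}
```

In a real implementation the merge itself would still run under the timer's own lock; the point of the sketch is only that the sample can be computed outside that lock without risking a non-monotonic result.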