1. 22 May, 2010 2 commits
  2. 19 May, 2010 10 commits
    • ocfs2: Silence a gcc warning. · 18d3a98f
      Joel Becker committed
      ocfs2_block_group_claim_bits() is never called with min_bits=0, but we
      shouldn't leave status undefined if it ever is.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Don't retry xattr set in case value extension fails. · 5f5261ac
      Tao Ma committed
      In a normal xattr set, the sequence is inode, then xattr block,
      and finally xattr bucket if we hit ENOSPC. But there is a corner
      case.
      Suppose we set an xattr whose value will be stored in a cluster,
      and there is no xattr block yet. We reserve 1 xattr block and
      1 cluster for setting it. Now if value extension fails (the volume
      is almost full and we can't allocate the cluster because of the
      check in ocfs2_test_bg_bit_allocatable), ENOSPC is returned. So
      we try to create a bucket (this time there is a chance that
      the reserved cluster will be used), and when we try value extension
      again, a kernel bug happens. We did hit it. See the bug below.
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1251
      
      This patch avoids this by adding a set_abort flag to
      ocfs2_xattr_set_ctxt. If ENOSPC happens in value extension,
      we check whether it is caused by a real ENOSPC or just by a
      full inode or xattr block. In the first case, we set set_abort
      so that we don't try any further. It is safe to exit directly here
      since it really is ENOSPC.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break · d9ef7522
      Wengang Wang committed
      Currently we process a dirty lockres with the lockres->spinlock taken. During
      that processing, we may need to take dlm->ast_lock. This breaks the lock
      ordering of dlm->ast_lock (taken first) and lockres->spinlock (taken second).
      
      This patch fixes the problem.
      Since we can't release lockres->spinlock, we have to take dlm->ast_lock
      just before taking the lockres->spinlock and release it after lockres->spinlock
      is released. We also use __dlm_queue_bast()/__dlm_queue_ast(), the nolock versions,
      in dlm_shuffle_lists(). There are not too many locks on a lockres, so there is no
      performance harm.
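
The lock ordering described in this message can be illustrated with a small userspace sketch. It uses POSIX mutexes rather than kernel spinlocks, and every name in it (`ast_lock`, `res_lock`, `queue_ast_nolock`, `shuffle_lists`) is a hypothetical stand-in, not the o2dlm code itself.

```c
#include <pthread.h>

/* Hypothetical stand-ins for dlm->ast_lock and lockres->spinlock. */
pthread_mutex_t ast_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;
int queued_asts;

/* "nolock" variant: the caller must already hold ast_lock. */
void queue_ast_nolock(void)
{
    queued_asts++;
}

/* Process one resource while respecting the order ast_lock -> res_lock. */
int shuffle_lists(int pending)
{
    pthread_mutex_lock(&ast_lock);   /* outer lock is taken first */
    pthread_mutex_lock(&res_lock);   /* inner lock is taken second */
    for (int i = 0; i < pending; i++)
        queue_ast_nolock();          /* must not re-take ast_lock here */
    pthread_mutex_unlock(&res_lock);
    pthread_mutex_unlock(&ast_lock); /* released after the inner lock */
    return queued_asts;
}
```

Holding the outer lock for the whole pass is the trade-off the message accepts: it widens the critical section slightly in exchange for a consistent lock order.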
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Reset xattr value size after xa_cleanup_value_truncate(). · d5a7df06
      Tao Ma committed
      In ocfs2_prepare_xattr_entry, if we fail to grow an existing value,
      xa_cleanup_value_truncate() will leave the old entry in place.  Thus, we
      reset its value size.  However, if we were allocating a new value, we
      must not reset the value size or we will BUG().  This resolves
      oss.oracle.com bug 1247.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • fs/ocfs2/dlm: Use kstrdup · 316ce2ba
      Julia Lawall committed
      Use kstrdup when the goal of an allocation is to copy a string into the
      allocated region.
      
      The semantic patch that makes this change is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      expression from,to;
      expression flag,E1,E2;
      statement S;
      @@
      
      -  to = kmalloc(strlen(from) + 1,flag);
      +  to = kstrdup(from, flag);
         ... when != \(from = E1 \| to = E1 \)
         if (to==NULL || ...) S
         ... when != \(from = E2 \| to = E2 \)
      -  strcpy(to, from);
      // </smpl>
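
As a rough userspace analogue of the transformation above, the sketch below contrasts the open-coded allocate-then-copy pattern with a single duplication call. It uses `malloc()` and POSIX `strdup()` in place of `kmalloc()` and `kstrdup()`; the function names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* Before: allocate strlen + 1 bytes, check the result, then copy. */
char *dup_open_coded(const char *from)
{
    char *to = malloc(strlen(from) + 1);

    if (to == NULL)
        return NULL;
    strcpy(to, from);
    return to;
}

/* After: one helper call does the sizing, allocation, and copy. */
char *dup_with_helper(const char *from)
{
    return strdup(from);
}
```

Both functions return a heap copy that the caller must free; the second simply removes the chance of getting the length arithmetic wrong.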
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • fs/ocfs2/dlm: Drop memory allocation cast · 3914ed0c
      Julia Lawall committed
      Drop cast on the result of kmalloc and similar functions.
      
      The semantic patch that makes this change is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      type T;
      @@
      
      - (T *)
        (\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\|
         kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...))
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Ocfs2: Optimize punching-hole code. · c1631d4a
      Tristan Ye committed
      This patch simplifies the logic of handling existing holes and
      skipping extent blocks and removes some confusing comments.
      
      The patch survived the fill_verify_holes testcase in ocfs2-test.
      It also passed my manual sanity check and stress tests with enormous
      extent records.
      
      Currently punching a hole on a file with 3+ extent tree depth was
      really a performance disaster.  It can even take several hours,
      though we may not hit this in real life with such a huge extent
      number.
      
      One simple way to improve the performance is quite straightforward.
      From the logic of truncate, we can punch the hole from hole_end to
      hole_start, which reduces the overhead of btree operations in a
      significant way, such as tree rotation and moving.
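
The intuition behind punching from hole_end back to hole_start can be shown with a deliberately simplified analogy: a flat array of records instead of a b-tree. This is only a sketch of why removal order matters, with hypothetical function names; it does not model ocfs2's extent tree or its rotations.

```c
#include <stddef.h>
#include <string.h>

/* Remove records front to back: each removal shifts all survivors left. */
size_t moves_front_first(int *recs, size_t n)
{
    size_t moved = 0;

    while (n > 0) {
        memmove(recs, recs + 1, (n - 1) * sizeof(*recs));
        moved += n - 1;   /* count the records that had to be relocated */
        n--;
    }
    return moved;
}

/* Remove records back to front: nothing ever needs to be shifted. */
size_t moves_back_first(int *recs, size_t n)
{
    size_t moved = 0;

    (void)recs;
    while (n > 0)
        n--;              /* dropping the last record leaves the rest in place */
    return moved;
}
```

With n records, the front-first order relocates n(n-1)/2 records in total while the back-first order relocates none, which mirrors the reduced tree maintenance the message describes.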
      
      The following is the test result when punching a hole from byte 0 to the
      end of a 1G file. The file consists of 256k extent records, and each record
      covers 4k of data (just one cluster, since clustersize is 4k):
      
      ===========================================================================
       * Original punching-hole mechanism:
      ===========================================================================
      
         I waited 1 hour for its completion, unfortunately it's still ongoing.
      
      ===========================================================================
       * Patched punching-hole mechanism:
      ===========================================================================
      
         real 0m2.518s
         user 0m0.000s
         sys  0m2.445s
      
      That means we've gained up to a 1000-times improvement in performance in this
      case, whee! It's fairly cool, and it looks like the performance gain will
      grow as the number of extent records grows.
      
      The patch is based on my 2 previous patches, which optimized the truncate
      code and fixed hole punching to handle CoW.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public. · ee149a7c
      Tristan Ye committed
      The original reason for pulling ocfs2_find_cpos_for_left_leaf() out of
      alloc.c was to benefit the punching-holes optimization patch. However, it
      can also be used by other functions in the future that want to do the
      same job.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing. · e8aec068
      Tristan Ye committed
      Based on the previous patch optimizing truncate, the bugfix for
      refcount trees when punching holes is fairly easy and
      straightforward, since most of the refcounting work we need to take
      into account has already been done in ocfs2_remove_btree_range().
      
      This patch performs CoW for refcounted extents when the hole being punched
      has a start or end offset in the middle of a cluster, which means
      partial zeroing of that cluster will be performed soon.
      
      The patch has been tested and fixes the following bug:
      
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1216
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead. · 78f94673
      Tristan Ye committed
      Truncate is just a special case of punching holes (from new i_size to
      end), so we can take advantage of the existing
      ocfs2_remove_btree_range() to reduce the complexity and redundancy in
      alloc.c.  The goal here is to make truncate more generic and
      straightforward.
      
      Several functions only used by ocfs2_commit_truncate() will simply be
      removed.
      
      ocfs2_remove_btree_range() was originally used by the hole punching
      code, which didn't take refcount trees into account (definitely a bug).
      We therefore need to change that func a bit to handle refcount trees.
      It must take the refcount lock, calculate and reserve blocks for
      refcount tree changes, and decrease refcounts at the end.  We replace 
      ocfs2_lock_allocators() here by adding a new func
      ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to
      reserve.  This will not hurt any other code using
      ocfs2_remove_btree_range() (such as dir truncate and hole punching).
      
      I merged the following steps into one patch since they may be
      logically doing one thing, though I know it looks a little bit fat
      to review.
      
      1). Remove redundant code used by ocfs2_commit_truncate(), since we're
          moving to ocfs2_remove_btree_range anyway.
      
      2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for purpose of
          accepting some extra blocks to reserve.
      
      3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our
          needs.  It's safe to do this since it's only being called by
          truncate.
      
      4). Change ocfs2_remove_btree_range() a bit to take refcount case into
          account.
      
      5). Finally, we change ocfs2_commit_truncate() to call
          ocfs2_remove_btree_range() in a proper way.
      
      The patch has passed normal sanity checks; stress tests with heavier
      workloads are still expected.
      
      Based on this patch, fixing the punching holes bug will be fairly easy.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  3. 11 May, 2010 2 commits
    • ocfs2: Block signals for mkdir/link/symlink/O_CREAT. · 547ba7c8
      Joel Becker committed
      Once file or link creation gets going, it can't be interrupted by a
      signal.  They're not idempotent.
      
      This blocks signals in ocfs2_mknod(), ocfs2_link(), and ocfs2_symlink()
      once we start actually changing things.  ocfs2_mknod() covers mknod(),
      creat(), mkdir(), and open(O_CREAT).
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Wrap signal blocking in void functions. · e4b963f1
      Joel Becker committed
      ocfs2 sometimes needs to block signals around dlm operations, but it
      currently does it with sigprocmask().  Even worse, it's checking the
      error code of sigprocmask().  The in-kernel sigprocmask() can only error
      if you get the SIG_* argument wrong.  We don't.
      
      Wrap the sigprocmask() calls with ocfs2_[un]block_signals().  These
      functions are void, but they will BUG() if somehow sigprocmask() returns
      an error.
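
A userspace sketch of the wrapper idea follows. It uses the POSIX `sigprocmask()` call and `abort()` as a stand-in for `BUG()`; the names `block_signals` and `unblock_signals` are illustrative and are not the kernel's ocfs2_[un]block_signals() implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>

/* Block all signals and save the previous mask; treat failure as a bug. */
void block_signals(sigset_t *oldset)
{
    sigset_t blocked;

    sigfillset(&blocked);
    if (sigprocmask(SIG_BLOCK, &blocked, oldset) != 0)
        abort();   /* userspace stand-in for BUG() */
}

/* Restore the mask that block_signals() saved. */
void unblock_signals(sigset_t *oldset)
{
    if (sigprocmask(SIG_SETMASK, oldset, NULL) != 0)
        abort();
}
```

Because the wrappers return void, callers no longer carry error paths for a failure that can only come from passing a bad `SIG_*` argument.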
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  4. 06 May, 2010 18 commits
    • ocfs2/dlm: Increase o2dlm lockres hash size · 0467ae95
      Sunil Mushran committed
      Lockres hash size of 16KB is far too small for large filesystems (where we
      have hundreds of thousands of lock resources stored in the table).
      This patch increases it to 128KB.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Make ocfs2_extend_trans() really extend. · c901fb00
      Tao Ma committed
      In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's
      blocks. But if jbd2_journal_extend() fails, it will only restart
      with the new number of blocks.  This tends to be awkward since
      in most cases we want additional reserved blocks. It makes our code
      harder to maintain since the caller can't be sure all the original
      blocks will not be accessed and dirtied again.  There are 15 callers
      of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add
      h_buffer_credits before they call ocfs2_extend_trans().  This makes
      ocfs2_extend_trans() really extend atop the original block count.
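
The difference between the old and new semantics can be sketched with a toy model. The `struct handle`, `try_extend()`, and `restart()` below are hypothetical simplifications that only track a credit count; they are not the jbd2 API.

```c
/* Hypothetical journal handle: only tracks its remaining block credits. */
struct handle {
    int credits;
};

/* Stand-in for an in-place extend: fails when the journal lacks room. */
int try_extend(struct handle *h, int nblocks, int room)
{
    if (nblocks > room)
        return 1;
    h->credits += nblocks;
    return 0;
}

/* Stand-in for a restart: the handle starts over with exactly nblocks. */
void restart(struct handle *h, int nblocks)
{
    h->credits = nblocks;
}

/* Old behavior: on failure, restart with only the newly requested blocks. */
void extend_trans_old(struct handle *h, int nblocks, int room)
{
    if (try_extend(h, nblocks, room))
        restart(h, nblocks);
}

/* New behavior: on failure, restart with the original credits plus nblocks. */
void extend_trans_new(struct handle *h, int nblocks, int room)
{
    int old_credits = h->credits;

    if (try_extend(h, nblocks, room))
        restart(h, old_credits + nblocks);
}
```

With the new form, a caller ends up with the same number of credits whether the extend succeeded in place or fell back to a restart, so it no longer has to add its current credits by hand.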
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2/trivial: Code cleanup for allocation reservation. · 3e4218df
      Tao Ma committed
      Two tiny cleanups for allocation reservation.
      1. Remove some extra code in ocfs2_local_alloc_find_clear_bits.
      2. Remove an unused variable in ocfs2_find_resv_lhs.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: make ocfs2_adjust_resv_from_alloc simple. · b065556a
      Tao Ma committed
      When we allocate some bits from the reservation, we always
      allocate from the r_start(see ocfs2_resmap_resv_bits).
      So there should be no reason to check between r_start
      and start. And I don't think we will change this behaviour
      later by allocating from some bits after r_start.  Why not make
      ocfs2_adjust_resv_from_alloc simple for now?
      
      The only chance we have to adjust the reservation is when we haven't
      reached the end. With this patch, the function is more readable.
      
      Note:
      btw, this patch also fixes an original bug in the function
      which I haven't found before.
      	if (end < ocfs2_resv_end(resv))
      		rhs = end - ocfs2_resv_end(resv);
      This code is of course buggy. ;)
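
One plausible reading of why the quoted snippet is buggy: when `end` is smaller than the reservation end, `end - ocfs2_resv_end(resv)` has its operands reversed, and with unsigned values it wraps around to a huge number. The sketch below assumes unsigned fields and uses a hypothetical `struct resv`; it illustrates the arithmetic and a simplified adjust step, not the actual patched function.

```c
/* Hypothetical window covering bits [r_start, r_start + r_len - 1]. */
struct resv {
    unsigned int r_start;
    unsigned int r_len;
};

unsigned int resv_end(const struct resv *r)
{
    return r->r_start + r->r_len - 1;
}

/* The quoted expression: wraps around whenever end < resv_end(r). */
unsigned int rhs_buggy(const struct resv *r, unsigned int end)
{
    return end - resv_end(r);
}

/* Bits left over to the right of the allocation. */
unsigned int rhs_fixed(const struct resv *r, unsigned int end)
{
    return resv_end(r) - end;
}

/* Shrink the window after bits [r_start, end] were allocated from its front. */
void adjust_from_alloc(struct resv *r, unsigned int end)
{
    unsigned int old_end = resv_end(r);

    if (end >= old_end) {      /* window fully used */
        r->r_len = 0;
        return;
    }
    r->r_start = end + 1;
    r->r_len = old_end - r->r_start + 1;
}
```

Since allocations always start at `r_start`, only the right-hand remainder needs to be recomputed, which is the simplification the message argues for.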
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Make nointr a default mount option · 4b37fcb7
      Sunil Mushran committed
      OCFS2 has never really supported intr. This patch acknowledges this reality
      and makes nointr the default mount option. In a later patch, we intend to
      support intr.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE · 5c80d4c9
      Sunil Mushran committed
      o2dlm join and leave messages are more than informational as they are
      required for debugging locking issues. This patch changes them from
      KERN_INFO to KERN_NOTICE.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • o2net: log socket state changes · 23fd9abd
      Srinivas Eeda committed
      This patch logs socket state changes that lead to socket shutdown.
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: print node # when tcp fails · a5196ec5
      Wengang Wang committed
      Print the node number of a peer node if sending it a message failed.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Add dir_resv_level mount option · 83f92318
      Mark Fasheh committed
      The default behavior for directory reservations stays the same, but we add a
      mount option so people can tweak the size of directory reservations
      according to their workloads.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: change default reservation window sizes · b07f8f24
      Mark Fasheh committed
      The default reservation size of 4 (32-bit windows) is a bit too ambitious.
      Scale it back to 16 bits (resv_level=2). I have been testing various sizes
      on a 4-node cluster which runs a mixed workload that is heavily threaded.
      With a 256MB local alloc, I get *roughly* the following levels of average file
      fragmentation:
      
      resv_level=0	70%
      resv_level=1	21%
      resv_level=2	23%
      resv_level=3	24%
      resv_level=4	60%
      resv_level=5	did not test
      resv_level=6	60%
      
      resv_level=2 seemed like a good compromise between not letting windows be
      too small, but not so big that heavier workloads will immediately suffer
      without tuning.
      
      This patch also changes the behavior of directory reservations - they now
      track file reservations.  The previous compromise of giving directory
      windows only 8 bits wound up fragmenting more at some window sizes because
      file allocations had smaller unused windows to poach from.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: increase the default size of local alloc windows · 6b82021b
      Mark Fasheh committed
      I have observed that the current size of 8M gives us pretty poor
      fragmentation on multi-threaded workloads which do lots of writes.
      
      Generally, I can increase the size of local alloc windows and observe a
      marked decrease in fragmentation, even up and beyond window sizes of 512
      megabytes. This makes sense for a couple reasons - larger local alloc means
      more room for reservation windows. On multi-node workloads the larger local
      alloc helps as well because we don't have to do window slides as often.
      
      Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
      longer used and the comment above it was out of date.
      
      To test fragmentation, I used a workload which launched 4 threads that did
      4k writes into a series of about 140 alternating files.
      
      With resv_level=2, and a 4k/4k file system I observed the following average
      fragmentation for various localalloc= parameters:
      
      localalloc=	avg. fragmentation
      	8		48
      	32		16
      	64		10
      	120		7
      
      On larger cluster sizes, the difference is more dramatic.
      
      The new default size tops out at 256M, which we'll only get for cluster
      sizes of 32K and above.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: clean up localalloc mount option size parsing · 73c8a800
      Mark Fasheh committed
      This patch pulls the local alloc sizing code into localalloc.c and provides
      a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
      except that I correctly calculate the maximum local alloc size. The old code
      in ocfs2_parse_options() calculated the max size as:
      
      ocfs2_local_alloc_size(sb) * 8
      
      which is correct, in bits. Unfortunately though the option passed in is in
      megabytes. Ultimately, this bug made no real difference - the shrink code
      would catch a too-large size and bring it down to something reasonable.
      Still, it's less than efficient as-is.
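
The unit mismatch described above (a limit computed in bits compared against an option given in megabytes) can be made concrete with a small conversion sketch. It assumes one bitmap bit per cluster; the helper name `la_bits_to_mb` and the sample numbers are hypothetical.

```c
/*
 * Convert a local alloc capacity in bits (one bit per cluster) into
 * whole megabytes, given the cluster size as a power-of-two shift.
 */
unsigned long long la_bits_to_mb(unsigned long long bits,
                                 unsigned int clustersize_bits)
{
    unsigned long long bytes = bits << clustersize_bits;

    return bytes >> 20;   /* 2^20 bytes per megabyte */
}
```

For example, 8192 bits with 4k clusters (a shift of 12) is 32 megabytes, so comparing a `localalloc=` value directly against 8192 would allow far more than the bitmap can actually hold.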
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: remove ocfs2_local_alloc_in_range() · a57c8fd2
      Mark Fasheh committed
      Inodes are always allocated from the global bitmap now so we don't need this
      any more. Also, the existing implementation bounces reservations around
      needlessly.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: allocate btree internal block groups from the global bitmap · 33d5d380
      Mark Fasheh committed
      Otherwise, the need for a very large contiguous allocation tends to
      wreak havoc on many inode allocation reservations on the local alloc, thus
      ruining any chances for contiguousness.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: use allocation reservations for directory data · e3b4a97d
      Mark Fasheh committed
      Use the reservations system for unindexed dir tree allocations. We don't
      bother with the indexed tree as reads from it are mostly random anyway.
      Directory reservations are marked separately, to allow the reservations code
      a chance to optimize their window sizes. This patch allocates only 8 bits
      for directory windows as they generally are not expected to grow as quickly
      as file data. Future improvements to dir window sizing can trivially be
      made.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: use allocation reservations during file write · 4fe370af
      Mark Fasheh committed
      Add a per-inode reservations structure and pass it through to the
      reservations code.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: allocation reservations · d02f00cc
      Mark Fasheh committed
      This patch improves Ocfs2 allocation policy by allowing an inode to
      reserve a portion of the local alloc bitmap for itself. The reserved
      portion (allocation window) is advisory in that other allocation
      windows might steal it if the local alloc bitmap becomes
      full. Otherwise, the reservations are honored and guaranteed to be
      free. When the local alloc window is moved to a different portion of
      the bitmap, existing reservations are discarded.
      
      Reservation windows are represented internally by a red-black
      tree. Within that tree, each node represents the reservation window of
      one inode. An LRU of active reservations is also maintained. When new
      data is written, we allocate it from the inode's window. When all bits
      in a window are exhausted, we allocate a new one as close to the
      previous one as possible. Should we not find free space, an existing
      reservation is pulled off the LRU and cannibalized.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: Make ocfs2_journal_dirty() void. · ec20cec7
      Joel Becker committed
      jbd[2]_journal_dirty_metadata() only returns 0.  It's been returning 0
      since before the kernel moved to git.  There is no point in checking
      this error.
      
      ocfs2_journal_dirty() has been faithfully returning the status since the
      beginning.  All over ocfs2, we have blocks of code checking this
      can't-fail status.  In the past few years, we've tried to avoid adding these
      checks, because they are pointless.  But anyone who looks at our code
      assumes they are needed.
      
      Finally, ocfs2_journal_dirty() is made a void function.  All error
      checking is removed from other files.  We'll BUG_ON() the status of
      jbd2_journal_dirty_metadata() just in case they change it someday.  They
      won't.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  5. 04 May, 2010 1 commit
  6. 01 May, 2010 1 commit
  7. 24 April, 2010 6 commits