1. 01 Feb, 2018 1 commit
  2. 07 Sep, 2017 1 commit
  3. 23 Feb, 2017 1 commit
    • ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock · 439a36b8
      Eric Ren committed
      We are in the situation that we have to avoid recursive cluster locking,
      but there is no way to check if a cluster lock has been taken by a process
      already.
      
      Mostly, we can avoid recursive locking by writing code carefully.
      However, we found that it's very hard to handle the routines that are
      invoked directly by vfs code.  For instance:
      
        const struct inode_operations ocfs2_file_iops = {
            .permission     = ocfs2_permission,
            .get_acl        = ocfs2_iop_get_acl,
            .set_acl        = ocfs2_iop_set_acl,
        };
      
      Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):
      
        do_sys_open
         may_open
          inode_permission
           ocfs2_permission
            ocfs2_inode_lock() <=== first time
             generic_permission
              get_acl
               ocfs2_iop_get_acl
                ocfs2_inode_lock() <=== recursive one
      
      A deadlock will occur if a remote EX request comes in between the two
      ocfs2_inode_lock() calls.  Briefly, the deadlock forms as follows:
      
      On one hand, the OCFS2_LOCK_BLOCKED flag of this lockres is set in
      BAST (ocfs2_generic_handle_bast) when a downconvert is started on behalf
      of the remote EX lock request.  On the other hand, the recursive cluster
      lock (the second one) is blocked in __ocfs2_cluster_lock() because of
      OCFS2_LOCK_BLOCKED.  But the downconvert never completes.  Why?  Because
      there is no chance for the first cluster lock on this node to be
      unlocked: we block ourselves in the code path.
      
      The idea to fix this issue is mostly taken from gfs2 code.
      
      1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
         of the pids of the processes that have taken the cluster lock of this
         lock resource;
      
      2. introduce a new flag for ocfs2_inode_lock_full:
         OCFS2_META_LOCK_GETBH; it means just get back the disk inode bh for
         us if we already hold the cluster lock.
      
      3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
         got the cluster lock in the upper code path.
      
      The tracking logic should be used by some of the ocfs2 vfs callbacks,
      to solve the recursive locking issue caused by the fact that vfs
      routines can call into each other.
      
      The performance penalty of processing the holder list should only be
      seen in a few cases where the tracking logic is used, such as get/set
      acl.
      
      You may ask: what if the first time we got a PR lock, and the second time
      we want an EX lock?  Fortunately, this case never happens in the real
      world, as far as I can see, including permission check and
      (get|set)_(acl|attr), and the gfs2 code also does so.
      
      [sfr@canb.auug.org.au remove some inlines]
      Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
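The holder-tracking idea from items 1 and 3 above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `struct lock_res`, `struct lock_holder`, `add_holder`, `remove_holder` and `is_locked_by` are hypothetical names, a plain `int` stands in for the pid, and nothing protects the list against concurrent access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry on a lock resource's holder list. */
struct lock_holder {
    int pid;                     /* process that took the cluster lock */
    struct lock_holder *next;
};

/* Hypothetical stand-in for a lock resource with an l_holders-style list. */
struct lock_res {
    struct lock_holder *holders;
};

/* Record that process `pid` now holds the cluster lock on `res`. */
void add_holder(struct lock_res *res, struct lock_holder *holder, int pid)
{
    assert(res != NULL && holder != NULL);
    holder->pid = pid;
    holder->next = res->holders;
    res->holders = holder;
}

/* Unlink `holder` from the list when its cluster lock is released. */
void remove_holder(struct lock_res *res, struct lock_holder *holder)
{
    struct lock_holder **pp = &res->holders;

    while (*pp != NULL) {
        if (*pp == holder) {
            *pp = holder->next;
            holder->next = NULL;
            return;
        }
        pp = &(*pp)->next;
    }
}

/* Return 1 if `pid` is already on the holder list, 0 otherwise.
 * A caller can use this to skip taking the cluster lock a second time. */
int is_locked_by(const struct lock_res *res, int pid)
{
    for (const struct lock_holder *h = res->holders; h != NULL; h = h->next)
        if (h->pid == pid)
            return 1;
    return 0;
}
```

In the call chain shown in the commit message, the inner callback would check `is_locked_by()` first and only take the lock when it returns 0.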
  4. 13 Dec, 2016 1 commit
  5. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
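The first two rewrite rules in the commit message are safe because the shift amount is zero whenever the two shift constants are equal. A minimal sketch of that reasoning follows. The value 12 for `PAGE_SHIFT` is an assumption for illustration (the real value is architecture-specific), and `old_index_to_page` / `new_index_to_page` are hypothetical helper names.

```c
#include <assert.h>

/* Assumed value for illustration; the real PAGE_SHIFT depends on the architecture. */
#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)

/* The removed macros, modelled here as plain aliases of the PAGE_* ones. */
#define PAGE_CACHE_SHIFT  PAGE_SHIFT
#define PAGE_CACHE_SIZE   PAGE_SIZE

/* Old style: scale an index by the difference between the two shifts. */
unsigned long old_index_to_page(unsigned long index)
{
    /* The rewrite is only valid because the two shifts are equal. */
    assert(PAGE_CACHE_SHIFT == PAGE_SHIFT);
    return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);   /* shift by 0 */
}

/* New style after the rewrite: the scaling disappears entirely. */
unsigned long new_index_to_page(unsigned long index)
{
    return index;
}
```

Both helpers return the same value for every input, which is what lets the semantic patch replace the shifted expression with the bare expression.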
  6. 26 Mar, 2016 1 commit
    • ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local · 35ddf78e
      jiangyiwen committed
      This patch fixes a deadlock, as follows:
      
        Node 1                Node 2                  Node 3
      1)volume a and b are    only mount vol a        only mount vol b
        mounted
      
      2)                      start to mount b        start to mount a
      
      3)                      check hb of Node 3      check hb of Node 2
                              in vol a, qs_holds++    in vol b, qs_holds++
      
      4) -------------------- all nodes' network down --------------------
      
      5)                      progress of mount b     the same situation as
                              failed, and then call   Node 2
                              ocfs2_dismount_volume.
                              but the process is hung,
                              since there is a work
                              in ocfs2_wq that cannot be
                              completed. This work is
                              about vol a, because
                              ocfs2_wq is a global wq.
                              BTW, this work which is
                              scheduled in ocfs2_wq is
                              ocfs2_orphan_scan_work,
                              and the context in this work
                              needs to take the inode lock
                              of orphan_dir. Because the
                              lockres owner is Node 1 and
                              all nodes' networks went down
                              at the same time, it can't
                              get the inode lock.
      
      6)                      Why can't this node be fenced
                              when the network is disconnected?
                              Because the mount process
                              is hung, which keeps qs_holds
                              from dropping to 0.
      
      All works in ocfs2_wq are tied to a particular super block, so the
      solution is to change ocfs2_wq from global to local.  In other words,
      move it into struct ocfs2_super.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Xue jiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 05 Sep, 2015 1 commit
    • ocfs2: add errors=continue · 7d0fb914
      Goldwyn Rodrigues committed
      OCFS2 is often used in high-availability systems.  However, ocfs2
      converts the filesystem to read-only at the drop of a hat.  This may
      not be necessary, since turning the filesystem read-only would affect
      other running processes as well, decreasing availability.
      
      This patch adds errors=continue, which returns EIO to the calling
      process and terminates further processing so that the filesystem is not
      corrupted further.  However, the filesystem is not converted to
      read-only.
      
      As a future plan, I intend to create a small utility or extend
      fsck.ocfs2 to fix small errors such as in the inode.  The input to the
      utility such as the inode can come from the kernel logs so we don't have
      to schedule a downtime for fixing small-enough errors.
      
      The patch changes the ocfs2_error to return an error.  The error
      returned depends on the mount option set.  If none is set, the default
      is to turn the filesystem read-only.
      
      Perhaps errors=continue is not the best option name.  Historically it is
      used for making an attempt to progress in the current process itself.
      Should we call it errors=eio? or errors=killproc? Suggestions/Comments
      welcome.
      
      Sources are available at:
        https://github.com/goldwynr/linux/tree/error-cont
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
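The policy described in this commit message, where the error handler returns an error that depends on the mount option, might look roughly like the sketch below. Everything here is hypothetical: `enum errors_policy`, `struct fake_super` and `handle_fs_error` are illustration names, and the `-EROFS` return for the read-only case is an assumption, since the message only states that `errors=continue` yields EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical policy values mirroring the errors= mount option. */
enum errors_policy {
    ERRORS_REMOUNT_RO,   /* default: turn the filesystem read-only */
    ERRORS_CONTINUE      /* report EIO, keep the filesystem writable */
};

/* Hypothetical, heavily reduced stand-in for a mounted filesystem. */
struct fake_super {
    enum errors_policy policy;
    int read_only;
};

/* Handle an on-disk error and return the code the caller should propagate. */
int handle_fs_error(struct fake_super *sb)
{
    assert(sb != NULL);

    if (sb->policy == ERRORS_CONTINUE)
        return -EIO;         /* only the calling process sees the failure */

    sb->read_only = 1;       /* default policy affects every user of the fs */
    return -EROFS;           /* assumed return value for the read-only case */
}
```

The design point is that the handler returns a value instead of only changing global state, so each call site can stop its own processing.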
  8. 25 Jun, 2015 1 commit
  9. 13 Mar, 2015 1 commit
    • ocfs2: make append_dio an incompat feature · 18d585f0
      Mark Fasheh committed
      It turns out that making this feature ro_compat isn't quite enough to
      prevent accidental corruption on mount from older kernels.  Ocfs2 (like
      other file systems) will process orphaned inodes even when the user mounts
      in 'ro' mode.  So for the case of a filesystem not knowing the append_dio
      feature, mounting the filesystem could result in orphaned-for-dio files
      being deleted, which we clearly don't want.
      
      So instead, turn this into an incompat flag.
      
      Btw, this is kind of my fault - initially I asked that we add a flag to
      cover the feature and even suggested that we use an ro flag.  It wasn't
      until I was looking through our commits for v4.0-rc1 that I realized we
      actually want this to be incompat.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
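The difference between a ro_compat and an incompat feature bit, as this commit message relies on it, can be sketched as a mount-time check. This is a simplified model, not the OCFS2 code: `struct feature_sets`, `enum mount_result` and `check_features` are hypothetical names, and the bit values used with them are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of comparing on-disk features with kernel support. */
enum mount_result { MOUNT_REFUSED, MOUNT_RO_ONLY, MOUNT_OK };

/* Feature bitmasks, either as stored on disk or as supported by a kernel. */
struct feature_sets {
    uint32_t incompat;
    uint32_t ro_compat;
};

/* An unknown incompat bit blocks the mount completely; an unknown
 * ro_compat bit still allows a read-only mount. */
enum mount_result check_features(struct feature_sets on_disk,
                                 struct feature_sets supported)
{
    if (on_disk.incompat & ~supported.incompat)
        return MOUNT_REFUSED;
    if (on_disk.ro_compat & ~supported.ro_compat)
        return MOUNT_RO_ONLY;
    return MOUNT_OK;
}
```

With the ro_compat variant an older kernel still gets `MOUNT_RO_ONLY`, and as the message explains, orphan processing runs even on a read-only mount. Only the incompat variant keeps such a kernel away from the volume.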
  10. 17 Feb, 2015 3 commits
  11. 11 Feb, 2015 1 commit
  12. 11 Dec, 2014 1 commit
  13. 05 Jun, 2014 1 commit
    • ocfs2: fix umount hang while shutting down truncate log · a9e9acae
      Xue jiufei committed
      Revert commit 75f82eaa ("ocfs2: fix NULL pointer dereference when
      dismount and ocfs2rec simultaneously") because it may cause a umount
      hang while shutting down the truncate log.
      
      fix NULL pointer dereference when dismount and ocfs2rec simultaneously
      
      The situation is as follows:
      ocfs2_dismount_volume
      -> ocfs2_recovery_exit
        -> free osb->recovery_map
      -> ocfs2_truncate_shutdown
        -> lock global bitmap inode
          -> ocfs2_wait_for_recovery
            -> check whether osb->recovery_map->rm_used is zero
      
      Because osb->recovery_map is already freed, rm_used can hold an
      arbitrary value, which may cause umount to hang.
      
      To prevent the NULL pointer dereference while getting sys_root_inode, we
      use an osb_tl_disable flag to stop scheduling osb_truncate_log_wq after
      truncate log shutdown.
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 04 Apr, 2014 3 commits
    • ocfs2: avoid system inode ref confusion by adding mutex lock · 43b10a20
      jiangyiwen committed
      The following case may leave the reference count of a system inode
      in confusion.
      
      A thread                            B thread
      ocfs2_get_system_file_inode
      ->get_local_system_inode
      ->_ocfs2_get_system_file_inode
                                          because of *arr == NULL,
                                          ocfs2_get_system_file_inode
                                          ->get_local_system_inode
                                          ->_ocfs2_get_system_file_inode
      gets first ref thru
      _ocfs2_get_system_file_inode,
      gets second ref thru igrab and
      set *arr = inode
                                          at this moment, B thread also gets
                                          two refs, which leads to one extra
                                          inode ref.
      
      So add a mutex lock to prevent multiple threads from each taking two
      inode refs at the same time.
      Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock · 8ed6b237
      Goldwyn Rodrigues committed
      The following patches are reverted in this patch because these patches
      caused performance regression in the remote unlink() calls.
      
        ea455f8a - ocfs2: Push out dropping of dentry lock to ocfs2_wq
        f7b1aa69 - ocfs2: Fix deadlock on umount
        5fd13189 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount
      
      Previous patches in this series removed the possible deadlocks from
      downconvert thread so the above patches shouldn't be needed anymore.
      
      The regression is caused because these patches delay the iput() in case
      of dentry unlocks.  This also delays the unlocking of the open lockres.
      The open lockres is required to test if the inode can be wiped from
      disk or not.  When the deleting node does not get the open lock, it
      marks the inode as orphan (even though it is not in use by another
      node/process) and causes a journal checkpoint.  This delays operations
      following the inode eviction.  It also moves the inode to the orphan
      directory, which causes more I/O and a lot of unnecessary orphans.
      
      The following script can be used to generate the load causing issues:
      
        declare -a create
        declare -a remove
        declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
        unique="`mktemp -u XXXXX`"
        script="/tmp/idontknow-${unique}.sh"
        cat <<EOF > "${script}"
        for n in {1..8}; do mkdir -p test/dir\${n}
          eval touch test/dir\${n}/foo{1.."\$1"}
        done
        EOF
        chmod 700 "${script}"
      
        function fcreate ()
        {
          exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
        }
      
        function fremove ()
        {
          exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
        }
      
        function fcp ()
        {
          exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
        }
      
        echo -------------------------------------------------
        echo "| # files | create #s | copy #s | remove #s |"
        echo -------------------------------------------------
        for ((x=0; x < ${#iterations[*]} ; x++)) do
          create[$x]="`fcreate ${iterations[$x]}`"
          copy[$x]="`fcp ${iterations[$x]}`"
          remove[$x]="`fremove`"
          printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
        done
        rm "${script}"
        echo "------------------------"
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: implement delayed dropping of last dquot reference · e3a767b6
      Jan Kara committed
      We cannot drop last dquot reference from downconvert thread as that
      creates the following deadlock:
      
      NODE 1                                  NODE2
      holds dentry lock for 'foo'
      holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                              dquot_initialize(bar)
                                                ocfs2_dquot_acquire()
                                                  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                  ...
      downconvert thread (triggered from another
      node or a different process from NODE2)
        ocfs2_dentry_post_unlock()
          ...
          iput(foo)
            ocfs2_evict_inode(foo)
              ocfs2_clear_inode(foo)
                dquot_drop(inode)
                  ...
                  ocfs2_dquot_release()
                    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                     - blocks
                                                  finds we need more space in
                                                  quota file
                                                  ...
                                                  ocfs2_extend_no_holes()
                                                    ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                      - deadlocks waiting for
                                                        downconvert thread
      
      We solve the problem by postponing dropping of the last dquot reference to
      a workqueue if it happens from the downconvert thread.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 22 Jan, 2014 1 commit
    • ocfs2: add clustername to cluster connection · c74a3bdd
      Goldwyn Rodrigues committed
      This is an effort to remove ocfs2_controld.pcmk and bring ocfs2 DLM
      handling up to date with respect to DLM (>=4.0.1) and corosync
      (2.3.x).  AFAIK, cman is also being phased out in favor of a unified
      corosync cluster stack.
      
      fs/dlm performs all the functions with respect to fencing and node
      management and provides the APIs to do so for ocfs2.  For all future
      references, DLM stands for fs/dlm code.
      
      The advantages are:
       + No need to run an additional userspace daemon (ocfs2_controld)
       + No controld device handling and controld protocol
       + Shifting responsibilities of node management to DLM layer
      
      For backward compatibility, we are keeping the controld handling code.
      Once enough time has passed we can remove a significant portion of the
      code.  This was tested by using the kernel with changes on older
      unmodified tools.  The kernel used ocfs2_controld as expected, and
      displayed the appropriate warning message.
      
      This feature requires modification in the userspace ocfs2-tools.  The
      changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
      nocontrold.  Currently, not many checks are present in the userspace code,
      but that will change soon.
      
      This patch (of 6):
      
      Add clustername to cluster connection.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 04 Jul, 2013 1 commit
  17. 02 Dec, 2011 1 commit
    • ocfs2: avoid unaligned access to dqc_bitmap · 93925579
      Akinobu Mita committed
      The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
      but not 64-bit aligned.  The dqc_bitmap is accessed by ocfs2_set_bit(),
      ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit().  These
      are wrapper macros for ext2_*_bit() which need to take an unsigned long
      aligned address (though some architectures are able to handle unaligned
      addresses correctly).
      
      So some 64-bit architectures may not be able to access the dqc_bitmap
      correctly.
      
      This patch avoids such unaligned access by using other wrapper functions
      for ext2_*_bit().  The code is taken from fs/ext4/mballoc.c, which also
      needs to handle unaligned bitmap access.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Joel Becker <jlbec@evilplan.org>
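One common way to avoid this kind of unaligned access is to round the bitmap address down to an `unsigned long` boundary and add the skipped bytes to the bit number. The sketch below shows that technique in userspace C under stated assumptions. It is not the code from the patch or from fs/ext4/mballoc.c, and `correct_addr_and_bit`, `set_bit_unaligned` and `test_bit_unaligned` are hypothetical names. The bit numbering matches a byte-wise little-endian bitmap only on little-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Round addr down to an unsigned long boundary and fold the skipped
 * bytes into *bit, so later accesses are naturally aligned. */
static unsigned long *correct_addr_and_bit(unsigned int *bit, void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    uintptr_t misalign = a & (sizeof(unsigned long) - 1);

    assert(addr != NULL);
    *bit += (unsigned int)(misalign * 8);
    return (unsigned long *)(a - misalign);
}

/* Set bit number `bit` of the bitmap starting at the possibly unaligned addr. */
void set_bit_unaligned(unsigned int bit, void *addr)
{
    unsigned long *p = correct_addr_and_bit(&bit, addr);

    p[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}

/* Return 1 if bit number `bit` of the bitmap at addr is set, else 0. */
int test_bit_unaligned(unsigned int bit, void *addr)
{
    unsigned long *p = correct_addr_and_bit(&bit, addr);

    return (int)((p[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL);
}
```

The caller must make sure the bytes between the rounded-down address and the bitmap belong to the same object, as they do for a field inside a larger on-disk structure.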
  18. 01 Jun, 2011 1 commit
  19. 24 Mar, 2011 1 commit
  20. 20 Feb, 2011 1 commit
    • ocfs2: Use hrtimer to track ocfs2 fs lock stats · 5bc970e8
      Sunil Mushran committed
      Patch makes use of the hrtimer to track times in ocfs2 lock stats.
      
      The patch is a bit involved to ensure no additional impact on the memory
      footprint. The size of ocfs2_inode_cache remains 1280 bytes on 32-bit systems.
      
      A related change was to modify the unit of the max wait time from nanosec to
      microsec allowing us to track max time larger than 4 secs. This change
      necessitated the bumping of the output version in the debugfs file,
      locking_state, from 2 to 3.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <jlbec@evilplan.org>
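The "larger than 4 secs" limit mentioned above is what a 32-bit counter gives when it counts nanoseconds. That the field is 32 bits wide is an assumption inferred from the stated limit, and `max_seconds_u32` is a hypothetical helper used only to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Largest whole number of seconds a 32-bit counter can represent
 * when it counts `ticks_per_second` units per second. */
uint32_t max_seconds_u32(uint32_t ticks_per_second)
{
    assert(ticks_per_second != 0);
    return UINT32_MAX / ticks_per_second;
}
```

Calling it with 1000000000 (nanoseconds) gives 4, while 1000000 (microseconds) gives 4294, a little over 71 minutes.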
  21. 16 Dec, 2010 1 commit
    • ocfs2: Try to free truncate log when meeting ENOSPC in write. · 50308d81
      Tao Ma committed
      Recently, one of our colleagues hit a problem: if we write/delete
      a 32MB file repeatedly, we eventually get an ENOSPC.
      The corresponding bug is 1288.
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1288
      
      The real problem is that although we have freed the clusters,
      they are still in the truncate log, where they are accumulated so
      that we can free them all at once.
      
      This patch tries to resolve it. In case we see -ENOSPC
      in ocfs2_write_begin_no_lock, we check whether the truncate
      log has enough clusters for our need; if yes, we try to
      flush the truncate log at that point and try again. This method
      is inspired by Mark Fasheh <mfasheh@suse.com>. Thanks.
      
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
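The retry described in this commit message can be modelled in a few lines of userspace C. This is a toy sketch and not the OCFS2 write path: `struct fake_volume`, `try_alloc`, `flush_truncate_log` and `alloc_with_retry` are hypothetical names, and clusters are reduced to simple counters.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: free clusters, plus freed clusters that are
 * still parked in the truncate log. */
struct fake_volume {
    unsigned int free_clusters;
    unsigned int truncate_log_clusters;
};

/* Try to take `want` clusters; fail with -ENOSPC if too few are free. */
static int try_alloc(struct fake_volume *v, unsigned int want)
{
    if (v->free_clusters < want)
        return -ENOSPC;
    v->free_clusters -= want;
    return 0;
}

/* Return everything held in the truncate log to the free pool. */
static void flush_truncate_log(struct fake_volume *v)
{
    v->free_clusters += v->truncate_log_clusters;
    v->truncate_log_clusters = 0;
}

/* On -ENOSPC, flush the truncate log and retry once, but only if the
 * log holds enough clusters to satisfy the request. */
int alloc_with_retry(struct fake_volume *v, unsigned int want)
{
    int ret;

    assert(v != NULL);
    ret = try_alloc(v, want);
    if (ret == -ENOSPC && v->truncate_log_clusters >= want) {
        flush_truncate_log(v);
        ret = try_alloc(v, want);
    }
    return ret;
}
```

Checking the log size before flushing avoids paying for a flush that could not make the allocation succeed.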
  22. 19 Nov, 2010 1 commit
  23. 13 Nov, 2010 1 commit
  24. 12 Oct, 2010 1 commit
    • ocfs2: Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes. · 7bdb0d18
      Tristan Ye committed
      Currently, the default behavior of O_DIRECT writes is to allow
      concurrent writing among nodes to the same file, with no cluster
      coherency guaranteed (no EX lock held).  This can leave stale data in
      the cache for buffered reads on other nodes.
      
      The new mount option offers a choice between two different
      behaviors for O_DIRECT writes:
      
          * coherency=full, as the default value, will disallow
                            concurrent O_DIRECT writes by taking
                            EX locks.
      
          * coherency=buffered, allow concurrent O_DIRECT writes
                                without EX lock among nodes, which
                                gains high performance at risk of
                                getting stale data on other nodes.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  25. 08 Oct, 2010 1 commit
    • ocfs2: Add support for heartbeat=global mount option · 2c442719
      Sunil Mushran committed
      Adds support for heartbeat=global mount option. It ensures that the heartbeat
      mode passed matches the one enabled on disk.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
  26. 10 Oct, 2010 1 commit
    • ocfs2: Add an incompat feature flag OCFS2_FEATURE_INCOMPAT_CLUSTERINFO · 98f486f2
      Sunil Mushran committed
      OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
      both userspace and o2cb cluster stacks. It also allows us to extend cluster
      info to include stack flags.
      
      This patch also adds stackflags to sb->s_cluster_info. It also introduces a
      clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
      global heartbeat mode.
      
      This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
      clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
  27. 10 Sep, 2010 2 commits
    • ocfs2: Cache system inodes of other slots. · b4d693fc
      Tao Ma committed
      During orphan scan, if we are slot 0, and we are replaying
      orphan_dir:0001, the general process is that for every file
      in this dir:
      1. we will iget orphan_dir:0001, since there is no inode for it.
         we will have to create an inode and read it from the disk.
      2. do the normal work, such as delete_inode and remove it from
         the dir if it is allowed.
      3. call iput orphan_dir:0001 when we are done. In this case,
         since we have no dcache for this inode, i_count will
         reach 0, and VFS will have to call clear_inode and in
         ocfs2_clear_inode we will checkpoint the inode which will let
         ocfs2_cmt and journald begin to work.
      4. We loop back to 1 for the next file.
      
      So you see, for every deleted file, we actually have to read the
      orphan dir from the disk and checkpoint the journal. This is very
      time consuming and causes a lot of journal checkpoint I/O.
      A better solution is to hold another reference to these
      inodes in ocfs2_super. Then, if there is no other race among
      nodes (which would make dlmglue checkpoint the inode), for step 3,
      clear_inode won't be called and for step 1, we only need to
      read the inode the first time. This is a big win for us.
      
      So this patch will try to cache system inodes of other slots so
      that we will have one more reference for these inodes and avoid
      the extra inode read and journal checkpoint.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Reorganize data elements to reduce struct sizes · 83fd9c7f
      Goldwyn Rodrigues committed
      Thanks for the comments. I have incorporated them all.
      
      CONFIG_OCFS2_FS_STATS is enabled and CONFIG_DEBUG_LOCK_ALLOC is disabled.
      Statistics now look like -
      ocfs2_write_ctxt: 2144 - 2136 = 8
      ocfs2_inode_info: 1960 - 1848 = 112
      ocfs2_journal: 168 - 160 = 8
      ocfs2_lock_res: 336 - 304 = 32
      ocfs2_refcount_tree: 512 - 472 = 40
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  28. 06 May, 2010 4 commits
    • ocfs2: Add dir_resv_level mount option · 83f92318
      Mark Fasheh committed
      The default behavior for directory reservations stays the same, but we add a
      mount option so people can tweak the size of directory reservations
      according to their workloads.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: increase the default size of local alloc windows · 6b82021b
      Mark Fasheh committed
      I have observed that the current size of 8M gives us pretty poor
      fragmentation on multi-threaded workloads which do lots of writes.
      
      Generally, I can increase the size of local alloc windows and observe a
      marked decrease in fragmentation, even up and beyond window sizes of 512
      megabytes. This makes sense for a couple reasons - larger local alloc means
      more room for reservation windows. On multi-node workloads the larger local
      alloc helps as well because we don't have to do window slides as often.
      
      Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
      longer used and the comment above it was out of date.
      
      To test fragmentation, I used a workload which launched 4 threads that did
      4k writes into a series of about 140 alternating files.
      
      With resv_level=2, and a 4k/4k file system I observed the following average
      fragmentation for various localalloc= parameters:
      
      localalloc=	avg. fragmentation
      	8		48
      	32		16
      	64		10
      	120		7
      
      On larger cluster sizes, the difference is more dramatic.
      
      The new default size tops out at 256M, which we'll only get for cluster
      sizes of 32K and above.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: clean up localalloc mount option size parsing · 73c8a800
      Mark Fasheh committed
      This patch pulls the local alloc sizing code into localalloc.c and provides
      a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
      except that I correctly calculate the maximum local alloc size. The old code
      in ocfs2_parse_options() calculated the max size as:
      
      ocfs2_local_alloc_size(sb) * 8
      
      which is correct, in bits. Unfortunately though the option passed in is in
      megabytes. Ultimately, this bug made no real difference - the shrink code
      would catch a too-large size and bring it down to something reasonable.
      Still, it's less than efficient as-is.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
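The unit mismatch described above, a maximum computed in bits compared against an option given in megabytes, comes down to one conversion. The sketch below assumes each bit of the local alloc bitmap tracks one cluster. `la_bits_to_megabytes` is a hypothetical helper, not the function added by the patch, and the 4096-byte bitmap size used with it is an arbitrary example value.

```c
#include <assert.h>
#include <stdint.h>

/* Convert the capacity of a local alloc bitmap into megabytes.
 * bitmap_bytes:      size of the bitmap in bytes
 * cluster_size_bits: log2 of the cluster size (12 for 4K, 20 for 1M) */
uint64_t la_bits_to_megabytes(uint64_t bitmap_bytes, uint32_t cluster_size_bits)
{
    uint64_t max_bits;

    assert(cluster_size_bits >= 12 && cluster_size_bits <= 20);
    max_bits = bitmap_bytes * 8;                    /* assumed: one bit per cluster */
    return (max_bits << cluster_size_bits) >> 20;   /* bytes to megabytes */
}
```

For a 4096-byte bitmap and 4K clusters, the bit count is 32768 while the capacity is 128 megabytes, which shows how far apart the two units are.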
    • ocfs2: allocation reservations · d02f00cc
      Mark Fasheh committed
      This patch improves Ocfs2 allocation policy by allowing an inode to
      reserve a portion of the local alloc bitmap for itself. The reserved
      portion (allocation window) is advisory in that other allocation
      windows might steal it if the local alloc bitmap becomes
      full. Otherwise, the reservations are honored and guaranteed to be
      free. When the local alloc window is moved to a different portion of
      the bitmap, existing reservations are discarded.
      
      Reservation windows are represented internally by a red-black
      tree. Within that tree, each node represents the reservation window of
      one inode. An LRU of active reservations is also maintained. When new
      data is written, we allocate it from the inode's window. When all bits
      in a window are exhausted, we allocate a new one as close to the
      previous one as possible. Should we not find free space, an existing
      reservation is pulled off the LRU and cannibalized.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
  29. 24 Mar, 2010 1 commit
    • ocfs2: Clear undo bits when local alloc is freed · b4414eea
      Mark Fasheh committed
      When the local alloc file changes windows, unused bits are freed back to the
      global bitmap. By definition, those bits can not be in use by any file. Also,
      the local alloc will never have been able to allocate those bits if they
      were part of a previous truncate. Therefore it makes sense that we should
      clear unused local alloc bits in the undo buffer so that they can be used
      immediately.
      
      [ Modified to call it ocfs2_release_clusters() -- Joel ]
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  30. 22 Apr, 2010 1 commit
  31. 13 Apr, 2010 1 commit
  32. 03 Mar, 2010 1 commit