1. 17 Jul, 2010 1 commit
  2. 16 Jul, 2010 3 commits
    • jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions · 13ceef09
      Committed by Jan Kara
      OCFS2 uses t_commit trigger to compute and store checksum of the just
      committed blocks. When a buffer has b_frozen_data, checksum is computed
      for it instead of b_data but this can result in an old checksum being
      written to the filesystem in the following scenario:
      
      1) transaction1 is opened
      2) handle1 is opened
      3) journal_access(handle1, bh)
          - This sets jh->b_transaction to transaction1
      4) modify(bh)
      5) journal_dirty(handle1, bh)
      6) handle1 is closed
      7) start committing transaction1, opening transaction2
      8) handle2 is opened
      9) journal_access(handle2, bh)
          - This copies off b_frozen_data to make it safe for transaction1 to commit.
            jh->b_next_transaction is set to transaction2.
      10) jbd2_journal_write_metadata() checksums b_frozen_data
      11) the journal correctly writes b_frozen_data to the disk journal
      12) handle2 is closed
          - There was no dirty call for the bh on handle2, so it is never queued for
            any more journal operation
      13) Checkpointing finally happens, and it just spools the bh via normal buffer
          writeback.  This will write b_data, on which the trigger never ran and
          which thus contains a wrong (old) checksum.
      
      This patch fixes the problem by calling the trigger at the moment data is
      frozen for journal commit - i.e., either when b_frozen_data is created by
      do_get_write_access or just before we write a buffer to the log if
      b_frozen_data does not exist. We also rename the trigger to t_frozen as
      that better describes when it is called.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
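The fix described in this message can be pictured with a small userspace model. This is only a sketch: `toy_buffer`, `toy_frozen_trigger`, `toy_freeze` and the one-byte checksum are hypothetical stand-ins, not jbd2 or ocfs2 code. It assumes the trigger runs on the live data just before the frozen copy is taken, so the copy written to the journal and the copy written later at checkpoint both carry the current checksum.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Toy stand-in for a journaled buffer; not the real struct journal_head. */
struct toy_buffer {
	uint8_t data[BLOCK_SIZE];	/* live copy (plays the role of b_data) */
	uint8_t *frozen;		/* copy for the committing transaction (b_frozen_data) */
};

/* Hypothetical trigger: store a one-byte checksum in the last byte. */
void toy_frozen_trigger(uint8_t *block)
{
	uint8_t sum = 0;

	for (size_t i = 0; i < BLOCK_SIZE - 1; i++)
		sum += block[i];
	block[BLOCK_SIZE - 1] = sum;
}

/*
 * Freeze the buffer for commit. The trigger runs on the live data first,
 * so both the frozen copy and the live copy carry the new checksum.
 */
int toy_freeze(struct toy_buffer *buf)
{
	toy_frozen_trigger(buf->data);
	buf->frozen = malloc(BLOCK_SIZE);
	if (!buf->frozen)
		return -1;
	memcpy(buf->frozen, buf->data, BLOCK_SIZE);
	return 0;
}

/* Returns 1 if the stored checksum matches the block contents. */
int toy_checksum_ok(const uint8_t *block)
{
	uint8_t sum = 0;

	for (size_t i = 0; i < BLOCK_SIZE - 1; i++)
		sum += block[i];
	return block[BLOCK_SIZE - 1] == sum;
}
```

In this model, step 13 of the scenario (writing the live copy at checkpoint) no longer sees a stale checksum, because the checksum was computed when the data was frozen rather than at commit time on the frozen copy only.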
    • ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node · a39953dd
      Committed by Wengang Wang
      For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
      before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
      dlm_migration_can_proceed() for that purpose.  However, if the node is
      down, dlm_migration_can_proceed() will also return "go ahead".  In this
      rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
      the BUG_ON() that trips over this condition.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Don't duplicate pages past i_size during CoW. · f5e27b6d
      Committed by Tao Ma
      During CoW, the pages after i_size don't contain valid data, so there's
      no need to read and duplicate them.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  3. 13 Jul, 2010 6 commits
    • ocfs2: tighten up strlen() checking · e372357b
      Committed by Dan Carpenter
      This function is only called from one place and it's like this:
      	dlm_register_domain(conn->cc_name, dlm_key, &fs_version);
      
      The "conn->cc_name" is 64 characters long.  If strlen(conn->cc_name)
      were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
      strlen() doesn't count the NUL terminator.
      
      In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
      64 character buffers.  The only exception is nd_name from struct
      o2nm_node.
      
      Anyway, I looked into it and in this case the domain string comes from
      osb->uuid_str in ocfs2_setup_osb_uuid().  That's 32 characters plus the
      NUL terminator, which easily fits into O2NM_MAX_NAME_LEN.  This patch
      doesn't change how the code works, but I think it makes the code a
      little cleaner.
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
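The off-by-one reasoning in this message can be shown in a few lines. This is a minimal sketch, assuming a 64-byte name buffer as the message describes; `TOY_MAX_NAME_LEN` and `toy_name_fits` are hypothetical names, not the kernel's.

```c
#include <string.h>

#define TOY_MAX_NAME_LEN 64	/* mirrors the 64-byte buffers in the message */

/*
 * Returns 1 if name, including its NUL terminator, fits in a buffer of
 * TOY_MAX_NAME_LEN bytes. strlen() does not count the terminator, so a
 * length equal to the buffer size is already one byte too long.
 */
int toy_name_fits(const char *name)
{
	return strlen(name) < TOY_MAX_NAME_LEN;
}
```

A 63-character name passes the check, while a 64-character name is rejected even though it "looks" like it matches the limit.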
    • ocfs2: Make xattr reflink work with new local alloc reservation. · 121a39bb
      Committed by Tao Ma
      The new reservation code in local alloc has added the limitation
      that the caller must handle the case where the local alloc
      doesn't give us enough contiguous clusters. This breaks the old
      xattr reflink code.

      So this patch updates the xattr reflink code so that it can
      handle the case where local alloc gives us one cluster at a time.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: make xattr extension work with new local alloc reservation. · a78f9f46
      Committed by Tao Ma
      The old ocfs2_xattr_extent_allocation is too optimistic about
      the clusters we can get. If the file system is too fragmented,
      ocfs2_add_clusters_in_btree will return EAGAIN and we need to
      allocate clusters once again.

      So this patch changes it to a while loop so that we can allocate
      clusters until we reach clusters_to_add.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
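The retry loop this message describes might look roughly like the following. It is a userspace sketch, not the ocfs2 code: `toy_add_clusters` is a hypothetical allocator that hands out at most `max_per_call` clusters and reports `-EAGAIN` on a partial allocation, and the no-progress guard is an addition of the sketch.

```c
#include <errno.h>

/* Hypothetical allocator: partial success is reported as -EAGAIN. */
int toy_add_clusters(unsigned int wanted, unsigned int max_per_call,
		     unsigned int *got)
{
	*got = wanted < max_per_call ? wanted : max_per_call;
	return *got < wanted ? -EAGAIN : 0;
}

/* Keep allocating until clusters_to_add clusters have been obtained. */
int toy_extend(unsigned int clusters_to_add, unsigned int max_per_call,
	       unsigned int *calls)
{
	*calls = 0;
	while (clusters_to_add) {
		unsigned int got = 0;
		int ret = toy_add_clusters(clusters_to_add, max_per_call, &got);

		(*calls)++;
		if (ret && ret != -EAGAIN)
			return ret;	/* a real failure */
		if (got == 0)
			return -ENOSPC;	/* no progress; do not spin forever */
		clusters_to_add -= got;
	}
	return 0;
}
```

With a fragmented allocator that yields two clusters per call, a request for five clusters completes in three iterations instead of failing on the first partial result.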
    • ocfs2: Remove the redundant cpu_to_le64. · 0a463b74
      Committed by Tao Ma
      In ocfs2_block_group_alloc, we set c_blkno from bg->bg_blkno.
      But bg->bg_blkno has already been converted to little endian
      in ocfs2_block_group_fill. So remove the extra cpu_to_le64.
      Reported-by: Marcos Matsunaga <Marcos.Matsunaga@oracle.com>
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
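Why a second conversion is harmful can be seen by modelling what `cpu_to_le64()` has to do on a big-endian CPU, namely swap the bytes. The helper below is a hypothetical userspace model, not the kernel macro. On a little-endian CPU the conversion is a no-op, which is presumably why such a bug can go unnoticed there.

```c
#include <stdint.h>

/* Model of cpu_to_le64() on a big-endian CPU: reverse the byte order. */
uint64_t toy_cpu_to_le64_on_be(uint64_t v)
{
	uint64_t out = 0;

	for (int i = 0; i < 8; i++) {
		out = (out << 8) | (v & 0xff);	/* move the low byte to the front */
		v >>= 8;
	}
	return out;
}
```

Applying the swap twice yields the original CPU-order value, so a field converted twice ends up stored in CPU order rather than in the little-endian on-disk order.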
    • ocfs2/dlm: don't access beyond bitmap size · f471c9df
      Committed by Wengang Wang
      dlm->recovery_map is defined as
      	unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
      
      We should treat O2NM_MAX_NODES as the bitmap size in bits.
      This patch fixes a bit operation that takes O2NM_MAX_NODES + 1 as the bitmap size.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
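The point about the bitmap size can be illustrated with a toy bit search. This sketch is not the kernel's `find_next_bit()`; `TOY_MAX_NODES` is a hypothetical value chosen so the bitmap fills whole words, and the macros only mimic `BITS_TO_LONGS`. The size argument bounds which bits are examined, so passing a size one larger than the array was declared for would let the loop index past the last word.

```c
#include <limits.h>

#define TOY_MAX_NODES 64	/* hypothetical value, for illustration only */
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITS_TO_LONGS(n) (((n) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Return the first set bit at or after start, or size if there is none. */
unsigned int toy_find_next_bit(const unsigned long *map, unsigned int size,
			       unsigned int start)
{
	for (unsigned int bit = start; bit < size; bit++)
		if (map[bit / TOY_BITS_PER_LONG] &
		    (1UL << (bit % TOY_BITS_PER_LONG)))
			return bit;
	return size;
}
```

Callers compare the result against the same size they passed in to detect "no bit found", which is another reason the size must match the declared bitmap exactly.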
    • ocfs2: No need to zero pages past i_size. · 693c241a
      Committed by Joel Becker
      When ocfs2 fills a hole, it does so by allocating clusters.  When a
      cluster is larger than the write, ocfs2 must zero the portions of the
      cluster outside of the write.  If the clustersize is smaller than a
      pagecache page, this is handled by the normal pagecache mechanisms, but
      when the clustersize is larger than a page, ocfs2's write code will zero
      the pages adjacent to the write.  This makes sure the entire cluster is
      zeroed correctly.
      
      Currently ocfs2 behaves exactly the same when writing past i_size.
      However, this means ocfs2 is writing zeroed pages for portions of a new
      cluster that are beyond i_size.  The page writeback code isn't expecting
      this.  It treats all pages past the one containing i_size as left behind
      due to a previous truncate operation.
      
      Thankfully, ocfs2 calculates the number of pages it will be working on
      up front.  The rest of the write code merely honors the original
      calculation.  We can simply trim the number of pages to only cover the
      actual file data.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
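The trimming of the page count described in this message can be sketched as plain arithmetic. This is an illustration under stated assumptions, not the ocfs2 write path: `toy_pages_for_write` is a hypothetical helper, pages are assumed to be 4 KiB, and `last_byte` stands for the larger of the end of the write and i_size.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/*
 * Number of pages to work on for a write into a freshly allocated cluster.
 * start_page:        index of the first page of the cluster
 * pages_per_cluster: number of pages in one cluster
 * last_byte:         max(end of write, i_size); must be at least 1
 */
uint64_t toy_pages_for_write(uint64_t start_page, uint64_t pages_per_cluster,
			     uint64_t last_byte)
{
	/* Index one past the last page that holds file data. */
	uint64_t end_index = ((last_byte - 1) >> TOY_PAGE_SHIFT) + 1;

	if (start_page + pages_per_cluster > end_index)
		return end_index > start_page ? end_index - start_page : 0;
	return pages_per_cluster;
}
```

For a 32 KiB cluster (eight pages) at the start of a file whose data ends at byte 5000, only the first two pages are touched; once the data extends past the cluster, all eight are.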
  4. 09 Jul, 2010 2 commits
    • ocfs2: Zero the tail cluster when extending past i_size. · 5693486b
      Committed by Joel Becker
      ocfs2's allocation unit is the cluster.  This can be larger than a block
      or even a memory page.  This means that a file may have many blocks in
      its last extent that are beyond the block containing i_size.  There also
      may be more unwritten extents after that.
      
      When ocfs2 grows a file, it zeros the entire cluster in order to ensure
      future i_size growth will see cleared blocks.  Unfortunately,
      block_write_full_page() drops the pages past i_size.  This means that
      ocfs2 is actually leaking garbage data into the tail end of that last
      cluster.  This is a bug.
      
      We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
      when a write or truncate is past i_size.  They will use
      ocfs2_zero_extend() to ensure the data is properly zeroed.
      
      Older versions of ocfs2_zero_extend() simply zeroed every block between
      i_size and the zeroing position.  This presumes three things:
      
      1) There is allocation for all of these blocks.
      2) The extents are not unwritten.
      3) The extents are not refcounted.
      
      (1) and (2) hold true for non-sparse filesystems, which used to be the
      only users of ocfs2_zero_extend().  (3) is another bug.
      
      Since we're now using ocfs2_zero_extend() for sparse filesystems as
      well, we teach ocfs2_zero_extend() to check every extent between
      i_size and the zeroing position.  If the extent is unwritten, it is
      ignored.  If it is refcounted, it is CoWed.  Then it is zeroed.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
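The per-extent rule in the last paragraph of this message (ignore unwritten extents, CoW refcounted ones, then zero) can be sketched over a flat array of extents. The types and flag names here are hypothetical and much simpler than the ocfs2 extent tree; the sketch only records which action each extent would receive.

```c
#include <stddef.h>

enum { TOY_UNWRITTEN = 1 << 0, TOY_REFCOUNTED = 1 << 1 };

struct toy_extent {
	unsigned int flags;
	int cowed;	/* set when the extent was copied before zeroing */
	int zeroed;	/* set when the extent was zeroed */
};

/* Apply the per-extent rule to every extent in the range being zeroed. */
void toy_zero_extend(struct toy_extent *ext, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (ext[i].flags & TOY_UNWRITTEN)
			continue;		/* reads already return zeros */
		if (ext[i].flags & TOY_REFCOUNTED) {
			ext[i].cowed = 1;	/* break sharing first */
			ext[i].flags &= ~TOY_REFCOUNTED;
		}
		ext[i].zeroed = 1;
	}
}
```

Doing the CoW before zeroing matters because zeroing a shared extent in place would also clear data that other files still reference.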
    • ocfs2: When zero extending, do it by page. · a4bfb4cf
      Committed by Joel Becker
      ocfs2_zero_extend() does its zeroing block by block, but it calls a
      function named ocfs2_write_zero_page().  Let's have
      ocfs2_write_zero_page() handle the page level.  From
      ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Cc: stable@kernel.org
  5. 28 Jun, 2010 1 commit
  6. 16 Jun, 2010 3 commits
    • ocfs2: Limit default local alloc size within bitmap range. · 1739da40
      Committed by Tao Ma
      In commit 6b82021b, we increased
      our local alloc size and calculate how many megabytes we can
      get according to group size and volume size.
      But we also need to check the maximum number of bits a local alloc
      block bitmap can hold. With bs=512, cs=32K, and a 160G local volume,
      it calculates 96MB while the maximum local alloc size is only
      76M. So the bitmap will overflow and corrupt the system truncate
      log file. See bug
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
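The clamp this message calls for is simple arithmetic. In the sketch below, `toy_clamp_la_mb` is a hypothetical helper, and the bitmap capacity of 2432 bits used in the usage note is back-computed from the message's own figures (76M at a 32K cluster size), not taken from the ocfs2 source.

```c
/*
 * Clamp a requested local alloc size (in MB) to what the bitmap can hold.
 * bitmap_bits:   bits available in the local alloc bitmap (one per cluster)
 * cluster_bytes: cluster size in bytes
 */
unsigned int toy_clamp_la_mb(unsigned int wanted_mb, unsigned int bitmap_bits,
			     unsigned int cluster_bytes)
{
	unsigned long long max_bytes =
		(unsigned long long)bitmap_bits * cluster_bytes;
	unsigned int max_mb = (unsigned int)(max_bytes >> 20);

	return wanted_mb > max_mb ? max_mb : wanted_mb;
}
```

With 2432 bits and 32K clusters the bitmap covers exactly 76MB, so a computed default of 96MB is reduced to 76MB, while a smaller request passes through unchanged.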
    • ocfs2: Move orphan scan work to ocfs2_wq. · 40f165f4
      Committed by Tao Ma
      We used to run orphan scan work in the default work queue, but there
      is a corner case that can deadlock the system.  The scenario is:
      1. Set the heartbeat threshold to 200.  This gives a good chance of
         an orphan scan running before our quorum decision.
      2. Mount node 1.
      3. After 1~2 minutes, mount node 2 (to make the bug easier to
         reproduce, add maxcpus=1 to the kernel command line).
      4. Node 1 does orphan scan work.
      5. Node 2 does orphan scan work.
      6. Node 1 does orphan scan work.  After this, node 1 holds the orphan
         scan lock while node 2 knows node 1 is the master.
      7. ifdown eth2 on node 2 (eth2 is the ocfs2 interconnect).

      Now when node 2 begins its orphan scan, the system queue is blocked.

      The root cause is that both orphan scan work and quorum decision work
      use the system event work queue.  Orphan scan can block the event
      work queue (in dlm_wait_for_node_death), leaving no chance for the
      quorum decision work to proceed.

      This patch resolves it by moving orphan scan work to ocfs2_wq.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • fs/ocfs2/dlm: Add missing spin_unlock · 6469272c
      Committed by Julia Lawall
      Add a spin_unlock missing on the error path.  Unlock as in the other code
      that leads to the leave label.
      
      The semantic match that finds this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      expression E1;
      @@
      
      * spin_lock(E1,...);
        <+... when != E1
        if (...) {
          ... when != E1
      *   return ...;
        }
        ...+>
      * spin_unlock(E1,...);
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
  7. 28 May, 2010 2 commits
  8. 25 May, 2010 1 commit
  9. 24 May, 2010 4 commits
  10. 22 May, 2010 10 commits
  11. 19 May, 2010 7 commits
    • ocfs2: Silence a gcc warning. · 18d3a98f
      Committed by Joel Becker
      ocfs2_block_group_claim_bits() is never called with min_bits=0, but we
      shouldn't leave status undefined if it ever is.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Don't retry xattr set in case value extension fails. · 5f5261ac
      Committed by Tao Ma
      In a normal xattr set, the set sequence is inode, xattr block,
      and finally xattr bucket if we hit ENOSPC.  But there is a
      corner case.
      Suppose we set an xattr whose value will be stored in a cluster,
      and there is no xattr block yet.  We reserve 1 xattr block and
      1 cluster for setting it.  Now if value extension fails (the
      volume is almost full and we can't allocate the cluster because
      of the check in ocfs2_test_bg_bit_allocatable), ENOSPC is
      returned.  We then try to create a bucket (this time there is a
      chance that the reserved cluster will be used), and when we try
      value extension again, a kernel bug is triggered.  We did hit
      this; see the bug below.
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1251

      This patch avoids this by adding a set_abort in
      ocfs2_xattr_set_ctxt.  If ENOSPC happens in value extension, we
      check whether it is caused by a real ENOSPC or just by a full
      inode or xattr block.  In the first case, we set set_abort so
      that we don't try any further.  It is safe to exit directly here
      since it really is ENOSPC.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break · d9ef7522
      Committed by Wengang Wang
      Currently we process a dirty lockres with the lockres->spinlock taken.
      During that processing, we may need to take dlm->ast_lock. This breaks the
      lock ordering of dlm->ast_lock (taken first) and lockres->spinlock (taken second).

      This patch fixes the problem.
      Since we can't release lockres->spinlock, we have to take dlm->ast_lock
      just before taking the lockres->spinlock and release it after lockres->spinlock
      is released. We also use __dlm_queue_bast()/__dlm_queue_ast(), the nolock versions,
      in dlm_shuffle_lists(). There are not many locks on a lockres, so there is no
      performance harm.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • ocfs2: Reset xattr value size after xa_cleanup_value_truncate(). · d5a7df06
      Committed by Tao Ma
      In ocfs2_prepare_xattr_entry, if we fail to grow an existing value,
      xa_cleanup_value_truncate() will leave the old entry in place.  Thus, we
      reset its value size.  However, if we were allocating a new value, we
      must not reset the value size or we will BUG().  This resolves
      oss.oracle.com bug 1247.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • fs/ocfs2/dlm: Use kstrdup · 316ce2ba
      Committed by Julia Lawall
      Use kstrdup when the goal of an allocation is to copy a string into the
      allocated region.
      
      The semantic patch that makes this change is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      expression from,to;
      expression flag,E1,E2;
      statement S;
      @@
      
      -  to = kmalloc(strlen(from) + 1,flag);
      +  to = kstrdup(from, flag);
         ... when != \(from = E1 \| to = E1 \)
         if (to==NULL || ...) S
         ... when != \(from = E2 \| to = E2 \)
      -  strcpy(to, from);
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
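The transformation in the semantic patch replaces an allocate-then-copy pair with one call. A userspace analogue is sketched below; `toy_strdup` is a hypothetical stand-in, since the kernel's `kstrdup()` additionally takes allocation (GFP) flags and uses the kernel allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for kstrdup(): allocate and copy in one step. */
char *toy_strdup(const char *from)
{
	size_t len = strlen(from) + 1;	/* include the NUL terminator */
	char *to = malloc(len);

	if (to)
		memcpy(to, from, len);
	return to;	/* NULL on allocation failure, as callers must check */
}
```

Folding the length computation and the copy into one helper removes the chance of the two getting out of step, for example by forgetting the `+ 1`.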
    • fs/ocfs2/dlm: Drop memory allocation cast · 3914ed0c
      Committed by Julia Lawall
      Drop cast on the result of kmalloc and similar functions.
      
      The semantic patch that makes this change is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @@
      type T;
      @@
      
      - (T *)
        (\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\|
         kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...))
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
    • Ocfs2: Optimize punching-hole code. · c1631d4a
      Committed by Tristan Ye
      This patch simplifies the logic of handling existing holes and
      skipping extent blocks and removes some confusing comments.
      
      The patch survived the fill_verify_holes testcase in ocfs2-test.
      It also passed my manual sanity check and stress tests with enormous
      extent records.
      
      Currently, punching a hole in a file with an extent tree depth of 3 or
      more is a performance disaster.  It can even take several hours,
      though we may not hit this in real life with such a huge number of
      extents.
      
      One simple way to improve the performance is quite straightforward.
      From the logic of truncate, we can punch the hole from hole_end to
      hole_start, which reduces the overhead of btree operations in a
      significant way, such as tree rotation and moving.
      
      Following is the test result when punching a hole from 0 to the file end
      in bytes, on a 1G file.  The 1G file consists of 256k extent records, each
      covering 4k of data (just one cluster; the clustersize is 4k):
      
      ===========================================================================
       * Original punching-hole mechanism:
      ===========================================================================
      
         I waited 1 hour for its completion, unfortunately it's still ongoing.
      
      ===========================================================================
       * Patched punching-hole mechanism:
      ===========================================================================
      
         real 0m2.518s
         user 0m0.000s
         sys  0m2.445s
      
      That means we've gained up to 1000 times improvement on performance in this
      case, whee! It's fairly cool, and it looks like the performance gain will
      grow as the number of extent records grows.

      The patch is based on my former 2 patches, which were about optimizing the
      truncate code and a fixup to handle CoW when punching a hole.
      Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
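The benefit of punching from hole_end back to hole_start can be felt with a flat-array analogy. This is only an analogy: the real cost in the message comes from btree rotation and record moving, whereas `toy_punch_moves` below is a hypothetical helper that counts how many records must shift left when records of a sorted array are deleted one by one.

```c
#include <stddef.h>

/*
 * Count record moves needed to delete all n records one at a time,
 * either always removing the last record (from_end != 0) or always
 * removing the first one (from_end == 0).
 */
size_t toy_punch_moves(size_t n, int from_end)
{
	size_t moves = 0;

	while (n > 0) {
		size_t victim = from_end ? n - 1 : 0;

		moves += n - 1 - victim;	/* records to the right shift left */
		n--;
	}
	return moves;
}
```

Deleting from the front costs on the order of n squared moves in this model, while deleting from the end costs none, which mirrors the direction of the improvement the message reports.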