1. 08 October 2016, 1 commit
  2. 27 July 2016, 1 commit
  3. 26 March 2016, 2 commits
    • ocfs2: fix ip_unaligned_aio deadlock with dio work queue · e63890f3
      Committed by Ryan Ding
      In the current implementation of unaligned aio+dio, the lock ordering
      behaves as follows:
      
      in user process context:
        -> call io_submit()
          -> get i_mutex
      		<== window1
            -> get ip_unaligned_aio
              -> submit direct io to block device
          -> release i_mutex
        -> io_submit() return
      
      in dio work queue context (the work queue is created in __blockdev_direct_IO):
        -> release ip_unaligned_aio
      		<== window2
          -> get i_mutex
            -> clear unwritten flag & change i_size
          -> release i_mutex
      
      The dio work queue has a limited number of threads, 256 by default.  If
      all 256 threads are at the 'window2' stage above while a user process is
      at the 'window1' stage, the system deadlocks: the user process holds
      i_mutex while waiting for the ip_unaligned_aio lock, a completed direct
      bio holds the ip_unaligned_aio mutex while waiting for a dio work queue
      thread to be scheduled, and all the dio work queue threads are waiting
      for i_mutex in 'window2'.
      
      This case has only been seen in a test that sends a large number (more
      than 256) of aios in one io_submit() call.
      
      My design is to remove the ip_unaligned_aio lock and make the unaligned
      aio a sync io instead.  Just like the ip_unaligned_aio lock, this
      serializes the unaligned aio dio.  (An illustrative sketch follows this
      entry.)
      
      [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e63890f3
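      The deadlock above involves a bounded completion work queue plus a
      second per-inode lock.  Below is a minimal user-space sketch of the
      direction of the fix; it is not ocfs2 code, and fake_inode,
      dio_complete() and unaligned_dio_write() are hypothetical names, with
      i_mutex and the "clear unwritten flag" step standing in for the
      kernel-side objects named in the commit.  The point is that the
      submitter waits for the unaligned dio to finish before dropping
      i_mutex, so no lock is ever handed across to a limited pool of
      work-queue threads.

      /*
       * Illustrative user-space sketch only, not ocfs2 code.  Instead of
       * protecting unaligned AIO with a second per-inode lock whose release
       * is deferred to a bounded completion work queue, the submitter waits
       * for the I/O to complete before dropping i_mutex.
       */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct fake_inode {
          pthread_mutex_t i_mutex;      /* stands in for inode->i_mutex  */
          pthread_cond_t  dio_done;     /* completion signalled by "bio" */
          bool            dio_in_flight;
      };

      /* Pretend completion context (what the dio work queue would run). */
      static void *dio_complete(void *arg)
      {
          struct fake_inode *ino = arg;

          pthread_mutex_lock(&ino->i_mutex);
          ino->dio_in_flight = false;   /* clear unwritten flag, etc.    */
          pthread_cond_signal(&ino->dio_done);
          pthread_mutex_unlock(&ino->i_mutex);
          return NULL;
      }

      /* Unaligned write path: submit, then wait synchronously. */
      static void unaligned_dio_write(struct fake_inode *ino)
      {
          pthread_t completion;

          pthread_mutex_lock(&ino->i_mutex);
          ino->dio_in_flight = true;
          pthread_create(&completion, NULL, dio_complete, ino); /* "submit" */

          /* Sync-io behaviour: block here until completion has run, rather
           * than returning with work still queued behind a second lock. */
          while (ino->dio_in_flight)
              pthread_cond_wait(&ino->dio_done, &ino->i_mutex);
          pthread_mutex_unlock(&ino->i_mutex);

          pthread_join(completion, NULL);
      }

      int main(void)
      {
          struct fake_inode ino = {
              .i_mutex  = PTHREAD_MUTEX_INITIALIZER,
              .dio_done = PTHREAD_COND_INITIALIZER,
          };

          unaligned_dio_write(&ino);
          puts("unaligned dio completed synchronously");
          return 0;
      }

      Because the waiter sits inside pthread_cond_wait(), i_mutex is released
      while the completion runs, so the bounded-thread-pool cycle described in
      the commit cannot form in this simplified model.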
    • ocfs2: record UNWRITTEN extents when populate write desc · 4506cfb6
      Committed by Ryan Ding
      To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
      
      There is still one issue in the direct write procedure.
      
      phase 1: alloc extent with UNWRITTEN flag
      phase 2: submit direct data to disk, add zero page to page cache
      phase 3: clear UNWRITTEN flag when data has been written to disk
      
      Suppose there are two direct writes, A (0~3KB) and B (4~7KB), to the
      same cluster 0~7KB (cluster size 8KB).  Write request A reaches phase 2
      first and zeroes the region 4~7KB.  Before request A enters phase 3,
      request B reaches phase 2 and zeroes the region 0~3KB.  In effect,
      request B stomps on request A.
      
      To resolve this issue, request B must know that this cluster is already
      being zeroed, to prevent it from stomping on the previous write request.
      
      This patch adds the function ocfs2_unwritten_check() to do this job.  It
      records every cluster that is under direct write (in the
      'ip_unwritten_list' member of the inode info), and prevents a later
      direct write to the same cluster from doing the zeroing work again.
      (A simplified sketch of the idea follows this entry.)
      Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4506cfb6
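      For illustration only, here is a small single-threaded C sketch of the
      idea behind ocfs2_unwritten_check(): a per-inode list records which
      clusters are already being zero-filled by an in-flight direct write,
      and a second writer consults it before redoing the zeroing.  The
      unwritten_rec type and unwritten_check() are hypothetical stand-ins;
      the real list ('ip_unwritten_list') lives in the inode info and is
      protected by the inode's locks.

      /* Illustrative sketch only, not the ocfs2 implementation. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct unwritten_rec {
          uint32_t cluster;                  /* cluster being zeroed */
          struct unwritten_rec *next;
      };

      static struct unwritten_rec *ip_unwritten_list;  /* per-inode in reality */

      /* Returns true if this writer should do the zeroing, false if another
       * in-flight direct write has already claimed the cluster. */
      static bool unwritten_check(uint32_t cluster)
      {
          struct unwritten_rec *rec;

          for (rec = ip_unwritten_list; rec; rec = rec->next)
              if (rec->cluster == cluster)
                  return false;              /* someone else is zeroing it */

          rec = malloc(sizeof(*rec));
          rec->cluster = cluster;
          rec->next = ip_unwritten_list;
          ip_unwritten_list = rec;
          return true;                       /* we own the zeroing work */
      }

      int main(void)
      {
          /* Writes A and B target the same 8KB cluster. */
          printf("A zeroes cluster 0: %s\n", unwritten_check(0) ? "yes" : "no");
          printf("B zeroes cluster 0: %s\n", unwritten_check(0) ? "yes" : "no");
          return 0;
      }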
  4. 23 March 2016, 1 commit
  5. 06 November 2015, 1 commit
  6. 05 September 2015, 1 commit
    • ocfs2: fix race between dio and recover orphan · 512f62ac
      Committed by Joseph Qi
      During direct io the inode is first added to the orphan directory and
      then deleted from it.  There is a race window in which the orphan entry
      can be deleted twice, triggering the BUG when validating
      OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.
      
      ocfs2_direct_IO_write
          ...
          ocfs2_add_inode_to_orphan
          >>>>>>>> race window.
                   1) another node may rm the file and then go down; this node
                   then takes care of orphan recovery and clears the flag
                   OCFS2_DIO_ORPHANED_FL.
                   2) since the rw lock is unlocked, it may race with another
                   orphan recovery and append dio.
          ocfs2_del_inode_from_orphan
      
      So take the inode mutex when recovering orphans, and move the rw unlock
      to the end of the aio write in the append-dio case.  (A simplified
      sketch follows this entry.)
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Weiwei Wang <wangww631@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      512f62ac
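      As a rough illustration of the fix, the user-space sketch below makes
      both the dio write path and orphan recovery check-and-clear a "DIO
      orphaned" flag only while holding the inode mutex, so the orphan entry
      cannot be removed twice.  It is not ocfs2 code; del_inode_from_orphan(),
      orphan_recovery() and dio_orphaned are hypothetical stand-ins for the
      kernel-side names in the commit.

      /* Illustrative user-space sketch only, not ocfs2 code. */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t inode_mutex = PTHREAD_MUTEX_INITIALIZER;
      static bool dio_orphaned;          /* stands in for OCFS2_DIO_ORPHANED_FL */

      static void del_inode_from_orphan(const char *who)
      {
          pthread_mutex_lock(&inode_mutex);
          if (dio_orphaned) {            /* check and clear atomically */
              dio_orphaned = false;
              printf("%s removed the orphan entry\n", who);
          } else {
              printf("%s: already removed, nothing to do\n", who);
          }
          pthread_mutex_unlock(&inode_mutex);
      }

      static void *orphan_recovery(void *arg)
      {
          (void)arg;
          del_inode_from_orphan("recovery");
          return NULL;
      }

      int main(void)
      {
          pthread_t t;

          /* Direct write adds the inode to the orphan dir ... */
          pthread_mutex_lock(&inode_mutex);
          dio_orphaned = true;
          pthread_mutex_unlock(&inode_mutex);

          /* ... meanwhile another node's recovery may run concurrently ... */
          pthread_create(&t, NULL, orphan_recovery, NULL);

          /* ... and the write path removes it when the dio finishes. */
          del_inode_from_orphan("dio write");

          pthread_join(t, NULL);
          return 0;
      }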
  7. 17 February 2015, 1 commit
  8. 10 November 2014, 1 commit
  9. 10 October 2014, 1 commit
  10. 04 April 2014, 3 commits
  11. 08 May 2013, 1 commit
  12. 28 July 2011, 1 commit
    • ocfs2: serialize unaligned aio · a11f7e63
      Committed by Mark Fasheh
      Fix a corruption that can happen when we have two or more outstanding
      aios to an overlapping unaligned region.  Ext4
      (e9e3bcec) and xfs recently had to fix
      similar issues.
      
      In our case what happens is that we can have an outstanding aio on a
      region, and if a write comes in with some bytes overlapping the original
      aio we may decide to read that region into a page before continuing
      (typically because of buffered-io fallback).  Since we have no ordering
      guarantees with the aio, we can read stale or bad data into the page and
      then write it back out.
      
      If the i/o is page and block aligned, then we avoid this issue as there
      won't be any need to read data from disk.
      
      I took the same approach as Eric in the ext4 patch and introduced some
      serialization of unaligned async direct i/o.  I don't expect this to
      have an effect on the most common cases of AIO.  Unaligned aio will be
      slower, but that's far more acceptable than data corruption.  (A
      simplified sketch of the serialization follows this entry.)
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <jlbec@evilplan.org>
      a11f7e63
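      A minimal sketch of the serialization idea follows; it is not the
      kernel patch.  io_is_unaligned(), direct_aio_write() and the 4096-byte
      BLOCK_SIZE are hypothetical, and the real code holds the serializing
      lock until the aio completes rather than within a single function call.

      /* Illustrative sketch only: serialize only the unaligned dio. */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      #define BLOCK_SIZE 4096u

      static pthread_mutex_t ip_unaligned_aio = PTHREAD_MUTEX_INITIALIZER;

      static bool io_is_unaligned(size_t pos, size_t count)
      {
          return (pos % BLOCK_SIZE) != 0 || (count % BLOCK_SIZE) != 0;
      }

      static void direct_aio_write(size_t pos, size_t count)
      {
          bool unaligned = io_is_unaligned(pos, count);

          if (unaligned)
              pthread_mutex_lock(&ip_unaligned_aio);   /* serialize */

          printf("dio write pos=%zu count=%zu (%s)\n",
                 pos, count, unaligned ? "serialized" : "concurrent");
          /* ... submit bio; the real code drops the lock on completion,
           *     here we drop it right away for brevity ... */

          if (unaligned)
              pthread_mutex_unlock(&ip_unaligned_aio);
      }

      int main(void)
      {
          direct_aio_write(0, BLOCK_SIZE);     /* aligned: no lock  */
          direct_aio_write(512, 1024);         /* unaligned: locked */
          return 0;
      }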
  13. 11 September 2010, 1 commit
    • Track negative entries v3 · 5e98d492
      Committed by Goldwyn Rodrigues
      Track negative dentries by recording the generation number of the parent
      directory in d_fsdata.  The generation number of the parent directory is
      recorded in the inode_info and is incremented every time the lock on the
      directory is dropped.
      
      If the generation numbers of the parent directory and the negative
      dentry match, there is no need to perform the revalidate; otherwise a
      revalidate is forced.  This improves performance in situations where
      nodes look for the same non-existent file multiple times.  (A simplified
      sketch follows this entry.)
      
      Thanks Mark for explaining the DLM sequence.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      5e98d492
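      The sketch below illustrates the negative-dentry check described above,
      with simplified stand-in types (dir_info, neg_dentry) and a hypothetical
      need_revalidate() helper; it is not the kernel implementation.

      /* Illustrative sketch only, not kernel code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct dir_info {
          uint64_t generation;   /* bumped whenever the dir lock is dropped */
      };

      struct neg_dentry {
          uint64_t d_fsdata;     /* parent generation when it went negative */
      };

      static bool need_revalidate(const struct neg_dentry *d,
                                  const struct dir_info *parent)
      {
          return d->d_fsdata != parent->generation;
      }

      int main(void)
      {
          struct dir_info parent = { .generation = 7 };
          struct neg_dentry d = { .d_fsdata = 7 };

          printf("revalidate? %s\n", need_revalidate(&d, &parent) ? "yes" : "no");

          parent.generation++;   /* another node changed the directory */
          printf("revalidate? %s\n", need_revalidate(&d, &parent) ? "yes" : "no");
          return 0;
      }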
  14. 10 September 2010, 1 commit
  15. 10 August 2010, 2 commits
  16. 06 May 2010, 1 commit
  17. 24 April 2010, 1 commit
  18. 05 September 2009, 6 commits
  19. 04 April 2009, 2 commits
    • ocfs2: fix rare stale inode errors when exporting via nfs · 6ca497a8
      Committed by Wengang Wang
      For nfs exporting, ocfs2_get_dentry() returns the dentry for an fh.
      ocfs2_get_dentry() may read from disk when the inode is not in memory,
      without any cross-cluster lock.  This leads to the file system loading a
      stale inode.
      
      This patch fixes the above problem.
      
      The solution is that, when the inode is not in memory, we take the
      cluster lock (PR) of the alloc inode the inode in question was allocated
      from (this makes the node on which the deletion was done sync the alloc
      inode) before reading out the inode itself.  Then we check the bitmap in
      the group the inode was allocated from to see whether the bit is clear.
      If it is clear, the inode is stale.  If the bit is set, we then check
      the generation as the existing code does.
      
      We have to read the inode in question from disk first to learn its
      alloc slot and alloc bit.  If it is not stale, we read it out using
      ocfs2_iget(); the second read should then be from cache.
      
      We also have to add a per-superblock nfs_sync_lock to cover the lock for
      the alloc inode and the one for the inode in question, because
      ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
      nfs_sync_lock is locked in EX mode in ocfs2_get_dentry() and in PR mode
      in ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can
      run concurrently in the normal case.  (A simplified sketch of the
      staleness check follows this entry.)
      
      [mfasheh@suse.com: build warning fixes and comment cleanups]
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Acked-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      6ca497a8
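      The sketch below illustrates only the staleness check, not the ocfs2
      code: before the file handle's generation is even compared, the
      allocator bitmap bit for the inode is consulted (under the alloc
      inode's cluster lock in the real code).  The struct fh, group_bitmap
      and handle_is_valid() names are hypothetical.

      /* Illustrative sketch only, not ocfs2 code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct fh { uint64_t blkno; uint32_t generation; };

      /* Pretend allocator group bitmap: bit N set => inode slot N in use. */
      static uint8_t group_bitmap[1] = { 0x01 };

      static bool bit_set(unsigned int bit)
      {
          return group_bitmap[bit / 8] & (1u << (bit % 8));
      }

      /* Returns true if the handle refers to a live inode. */
      static bool handle_is_valid(const struct fh *fh, unsigned int alloc_bit,
                                  uint32_t disk_generation)
      {
          /* The real code takes the alloc inode's cluster lock (PR) here so
           * a node that freed the inode has synced the allocator first. */
          if (!bit_set(alloc_bit))
              return false;              /* freed on another node: stale */

          return fh->generation == disk_generation;
      }

      int main(void)
      {
          struct fh fh = { .blkno = 1234, .generation = 42 };

          printf("bit 0 (set):   %s\n",
                 handle_is_valid(&fh, 0, 42) ? "valid" : "stale");
          printf("bit 1 (clear): %s\n",
                 handle_is_valid(&fh, 1, 42) ? "valid" : "stale");
          return 0;
      }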
    • ocfs2: Optimize inode allocation by remembering last group · 13821151
      Committed by Tao Ma
      In ocfs2, the inode block search looks for the "emptiest" inode
      group to allocate from. So if an inode alloc file has many equally
      (or almost equally) empty groups, new inodes will tend to get
      spread out amongst them, which in turn can put them all over the
      disk. This is undesirable because directory operations on conceptually
      "nearby" inodes force a large number of seeks.
      
      So we add ip_last_used_group to the core directory inodes, which records
      the last used allocation group.  Another field named ip_last_used_slot
      is also added in case inode stealing happens.  When claiming a new
      inode, we pass in the directory's inode so that the allocation can use
      this information.  (A simplified sketch follows this entry.)
      For more details, please see
      http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      13821151
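      Below is an illustrative sketch of the allocation hint, not the ocfs2
      allocator: the directory keeps the index of the group it last allocated
      from and reuses it while that group has free inodes, falling back to
      the "emptiest group" search otherwise.  pick_group() and the group
      array are hypothetical stand-ins; ip_last_used_group is the field named
      in the commit.

      /* Illustrative sketch only, not the ocfs2 allocator. */
      #include <stdint.h>
      #include <stdio.h>

      #define NR_GROUPS 4

      struct alloc_group { int free; };

      struct dir_inode {
          int ip_last_used_group;       /* -1 means "no hint yet" */
      };

      static struct alloc_group groups[NR_GROUPS] = {
          { .free = 10 }, { .free = 50 }, { .free = 50 }, { .free = 50 },
      };

      static int pick_group(struct dir_inode *dir)
      {
          int g = dir->ip_last_used_group;

          /* Prefer the hinted group if it still has space. */
          if (g < 0 || groups[g].free == 0) {
              int best = 0;
              for (int i = 1; i < NR_GROUPS; i++)   /* emptiest group */
                  if (groups[i].free > groups[best].free)
                      best = i;
              g = best;
          }

          groups[g].free--;
          dir->ip_last_used_group = g;
          return g;
      }

      int main(void)
      {
          struct dir_inode dir = { .ip_last_used_group = -1 };

          /* Successive creates in the same directory stay in one group. */
          for (int i = 0; i < 3; i++)
              printf("inode %d -> group %d\n", i, pick_group(&dir));
          return 0;
      }

      Keeping related inodes in one group is what avoids the seek-heavy
      spread described at the start of the commit message.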
  20. 06 January 2009, 2 commits
    • ocfs2: Implementation of local and global quota file handling · 9e33d69f
      Committed by Jan Kara
      For each quota type, each node has a local quota file in which it stores
      the changes users have made to disk usage via this node.  Once in a
      while this information is synced to the global file (and thus with the
      other nodes) so that limit enforcement at least approximately works.
      
      Global quota files contain all the information about usage and limits.
      They are mostly handled by the generic VFS code (which implements a trie
      of structures inside a quota file).  We only have to provide functions
      to convert the structures from the on-disk format to the in-memory one.
      We also have to provide wrappers for the various quota functions that
      start transactions and acquire the necessary cluster locks before the
      actual IO is started.  (A simplified sketch of the local/global split
      follows this entry.)
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      9e33d69f
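      As a conceptual sketch only (not the ocfs2 quota code), the snippet
      below shows why enforcement is approximate between syncs: each node
      accumulates per-user deltas locally and only folds them into the shared
      global usage when it syncs.  local_quota, global_quota and
      sync_to_global() are hypothetical names.

      /* Conceptual sketch only, not the ocfs2 quota implementation. */
      #include <stdint.h>
      #include <stdio.h>

      struct global_quota {
          int64_t usage;      /* cluster-wide bytes used by this user */
          int64_t limit;
      };

      struct local_quota {
          int64_t delta;      /* changes made via this node since last sync */
      };

      static void local_account(struct local_quota *lq, int64_t bytes)
      {
          lq->delta += bytes; /* cheap: no cluster lock needed */
      }

      static void sync_to_global(struct local_quota *lq, struct global_quota *gq)
      {
          /* The real code takes the cluster lock on the global quota file
           * and starts a transaction before applying the delta. */
          gq->usage += lq->delta;
          lq->delta = 0;
      }

      int main(void)
      {
          struct global_quota gq = { .usage = 1000, .limit = 5000 };
          struct local_quota lq = { 0 };

          local_account(&lq, 300);
          local_account(&lq, 200);
          sync_to_global(&lq, &gq);

          printf("global usage after sync: %lld / %lld\n",
                 (long long)gq.usage, (long long)gq.limit);
          return 0;
      }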
    • ocfs2: Wrap inode block reads in a dedicated function. · b657c95c
      Committed by Joel Becker
      The ocfs2 code currently reads inodes off disk with a simple
      ocfs2_read_block() call.  Each place that does this has a different set
      of sanity checks it performs.  Some check only the signature.  A couple
      validate the block number (the block read vs di->i_blkno).  A couple
      others check for VALID_FL.  Only one place validates i_fs_generation.  A
      couple check nothing.  Even when an error is found, they don't all do
      the same thing.
      
      We wrap inode reading into ocfs2_read_inode_block().  This validates all
      of the above fields, going readonly if they are invalid (they should
      never be).  ocfs2_read_inode_block_full() is provided for the places
      that want to pass read_block flags.  Every caller passes a struct inode
      with a valid ip_blkno, so we don't need a separate blkno argument
      either.
      
      We will remove the validation checks from the rest of the code in a
      later commit, as they are no longer necessary.  (A simplified sketch of
      the consolidated checks follows this entry.)
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      b657c95c
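      The sketch below illustrates the consolidated validation the commit
      describes, with simplified stand-in structures (fake_dinode,
      read_inode_block()).  The real ocfs2_read_inode_block() also performs
      the actual disk read and, as the commit notes, goes readonly when a
      field is invalid.

      /* Illustrative sketch only, not ocfs2 code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define FAKE_VALID_FL 0x1

      struct fake_dinode {
          char     i_signature[8];      /* e.g. "INODE01" */
          uint64_t i_blkno;
          uint32_t i_flags;
          uint32_t i_fs_generation;
      };

      /* Returns 0 on success, -1 if the block is not a valid inode for blkno. */
      static int read_inode_block(uint64_t blkno, uint32_t fs_generation,
                                  const struct fake_dinode *di)
      {
          if (strcmp(di->i_signature, "INODE01"))
              return -1;                /* bad signature    */
          if (di->i_blkno != blkno)
              return -1;                /* wrong block      */
          if (!(di->i_flags & FAKE_VALID_FL))
              return -1;                /* not in use       */
          if (di->i_fs_generation != fs_generation)
              return -1;                /* from an old mkfs */
          return 0;
      }

      int main(void)
      {
          struct fake_dinode di = {
              .i_signature = "INODE01",
              .i_blkno = 100, .i_flags = FAKE_VALID_FL,
              .i_fs_generation = 7,
          };

          printf("good block:  %d\n", read_inode_block(100, 7, &di));
          printf("wrong blkno: %d\n", read_inode_block(101, 7, &di));
          return 0;
      }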
  21. 15 October 2008, 1 commit
  22. 14 October 2008, 2 commits
    • ocfs2: Switch over to JBD2. · 2b4e30fb
      Committed by Joel Becker
      ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
      limiting our maximum filesystem size.
      
      It's a pretty trivial change.  Most functions are just renamed.  The
      only functional change is moving to Jan's inode-based ordered data mode.
      It's better, too.
      
      Because JBD2 reads and writes JBD journals, this is compatible with any
      existing filesystem.  It can even interact with JBD-based ocfs2 as long
      as the journal is formatted for JBD.
      
      We provide a compatibility option so that paranoid people can still use
      JBD for the time being.  This will go away shortly.
      
      [ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
        ocfs2_truncate_for_delete(). --Mark ]
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      2b4e30fb
    • ocfs2: Add extended attribute support · cf1d6c76
      Committed by Tiger Yang
      This patch implements storing extended attributes either in the inode or
      in a single external block.  We only store EAs in-inode when the
      blocksize is > 512 or the inode block has free space for them.  When an
      EA's value is larger than 80 bytes, we store the value in a b-tree
      outside the inode or block.  (A simplified sketch of the placement
      policy follows this entry.)
      Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      cf1d6c76
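      As a rough sketch of the placement policy described above (not the
      ocfs2 code), the function below chooses between in-inode storage, a
      single external block, and a b-tree for values larger than 80 bytes.
      place_xattr() and its size parameters are hypothetical; the 80-byte
      threshold is the one quoted in the commit.

      /* Illustrative sketch only, not the ocfs2 xattr code. */
      #include <stddef.h>
      #include <stdio.h>

      enum xattr_home { XATTR_IN_INODE, XATTR_IN_BLOCK, XATTR_IN_BTREE };

      static enum xattr_home place_xattr(size_t value_len,
                                         size_t inode_free_bytes,
                                         size_t needed_bytes)
      {
          if (value_len > 80)
              return XATTR_IN_BTREE;    /* large value: b-tree     */
          if (inode_free_bytes >= needed_bytes)
              return XATTR_IN_INODE;    /* fits in the inode block */
          return XATTR_IN_BLOCK;        /* single external block   */
      }

      int main(void)
      {
          printf("small, inode has room : %d\n", place_xattr(16, 256, 40));
          printf("small, inode is full  : %d\n", place_xattr(16, 0, 40));
          printf("large value (>80B)    : %d\n", place_xattr(200, 256, 220));
          return 0;
      }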
  23. 26 January 2008, 3 commits
  24. 13 October 2007, 1 commit
  25. 03 May 2007, 1 commit
  26. 27 April 2007, 1 commit
    • ocfs2: Cache extent records · 83418978
      Committed by Mark Fasheh
      The extent map code was ripped out earlier because of an inability to deal
      with holes. This patch adds back a simpler caching scheme requiring far less
      code.
      
      Our old extent map caching was designed back when metadata block caching
      in Ocfs2 didn't work very well, resulting in many disk reads.  These
      days our metadata caching is much better, resulting in no unnecessary
      disk reads.  As a result, extent caching doesn't have to be as fancy,
      nor does it have to cache as many extents.  Keeping the last 3 extents
      seen should be sufficient to give us a small performance boost on some
      streaming workloads.  (A simplified sketch follows this entry.)
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      83418978
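      A small illustrative sketch of a "last three extents seen" cache
      follows; it is not the ocfs2 extent map code.  extent_rec here is a
      simplified record, extent_cache_insert()/extent_cache_lookup() are
      hypothetical helpers, and the real cache is per-inode and protected by
      locks.

      /* Illustrative sketch only, not the ocfs2 extent map code. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct extent_rec {
          uint32_t cpos;        /* first logical cluster covered */
          uint32_t clusters;    /* length in clusters            */
          uint64_t blkno;       /* physical start block          */
      };

      #define EXTENT_CACHE_SLOTS 3

      static struct extent_rec cache[EXTENT_CACHE_SLOTS];
      static unsigned int next_slot;

      static void extent_cache_insert(struct extent_rec rec)
      {
          cache[next_slot] = rec;               /* overwrite the oldest slot */
          next_slot = (next_slot + 1) % EXTENT_CACHE_SLOTS;
      }

      static bool extent_cache_lookup(uint32_t cpos, struct extent_rec *out)
      {
          for (int i = 0; i < EXTENT_CACHE_SLOTS; i++) {
              if (cache[i].clusters &&
                  cpos >= cache[i].cpos &&
                  cpos <  cache[i].cpos + cache[i].clusters) {
                  *out = cache[i];
                  return true;                  /* no tree walk needed */
              }
          }
          return false;                         /* fall back to the b-tree */
      }

      int main(void)
      {
          struct extent_rec hit;

          extent_cache_insert((struct extent_rec){ .cpos = 0, .clusters = 8, .blkno = 500 });
          extent_cache_insert((struct extent_rec){ .cpos = 8, .clusters = 4, .blkno = 900 });

          printf("cpos 9:  %s\n", extent_cache_lookup(9, &hit)  ? "cached" : "miss");
          printf("cpos 32: %s\n", extent_cache_lookup(32, &hit) ? "cached" : "miss");
          return 0;
      }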