1. 12 Dec, 2009 · 2 commits
  2. 10 Dec, 2009 · 11 commits
    • exofs: Multi-device mirror support · 04dc1e88
      Boaz Harrosh committed
      This patch changes the on-disk format; it is accompanied by a parallel
      patch to mkfs.exofs that enables multi-device capabilities.
      
      After this patch, old exofs will refuse to mount a newly formatted FS and
      new exofs will refuse an old format. This is done by moving the magic
      field offset inside the FSCB. A new FSCB *version* field was added. In
      the future, exofs will refuse to mount an unmatched FSCB version. To
      upgrade or downgrade an exofs, one must use the mkfs.exofs --upgrade
      option before mounting.
      
      Introduced a new object that contains a *device-table*. This object
      contains the default *data-map* and a linear array of device
      information, which identifies the devices used in the filesystem. This
      object is only written offline, by mkfs.exofs. This is why it is kept
      separate from the FSCB, since the latter is written to while mounted.
      
      The same partition number and the same object number are used on all
      devices; only the device varies.
      
      * Define the new format, then load the device table at mount time and
        make sure everything is supported.
      
      * Change the I/O engine to support Mirror IO, i.e. write the same data
        to multiple devices and read from a random device to spread the
        read-load from multiple clients (TODO: stripe read)
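
The mirror behaviour described in the second bullet can be sketched in user-space C. This is only an illustration, not the exofs I/O engine: `struct toy_dev`, `DEV_SIZE`, `mirror_write()` and `mirror_read()` are hypothetical names, and each "device" is just an in-memory buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEV_SIZE 64	/* toy device capacity in bytes (assumption) */

/* Hypothetical stand-in for one OSD device holding a single object. */
struct toy_dev {
	unsigned char data[DEV_SIZE];
};

/* Mirror write: the same bytes go to the same offset on every device. */
static int mirror_write(struct toy_dev *devs, unsigned ndev,
			size_t off, const void *buf, size_t len)
{
	assert(ndev > 0);
	if (off > DEV_SIZE || len > DEV_SIZE - off)
		return -1;	/* reject before touching any device */
	for (unsigned i = 0; i < ndev; i++)
		memcpy(devs[i].data + off, buf, len);
	return 0;
}

/* Mirror read: every replica is valid, so pick one at random to spread load. */
static int mirror_read(const struct toy_dev *devs, unsigned ndev,
		       size_t off, void *buf, size_t len)
{
	assert(ndev > 0);
	if (off > DEV_SIZE || len > DEV_SIZE - off)
		return -1;
	unsigned pick = (unsigned)rand() % ndev;
	memcpy(buf, devs[pick].data + off, len);
	return 0;
}
```

Because every replica holds identical data, the read result does not depend on which device is picked; the random choice only affects where the load lands.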
      
      Implementation notes:
       A few points introduced in the previous patch should be mentioned here:
      
      * Special care was taken so that absolutely all operations that have any
        chance of failing are done before any osd-request is executed. This
        limits the need for data-consistency recovery to real IO errors only.
      
      * Each IO state has a kref. It starts at 1, and every osd-request
        executed increments the kref; once all requests have been executed
        the first ref is dropped. At IO-done, each request completion
        decrements the kref, and the last one to return executes the internal
        _last_io() routine. _last_io() calls the registered io_state_done. In
        sync mode the caller does not supply a done method, which indicates a
        synchronous request: the caller is put to sleep and a special
        io_state_done is registered that will wake the caller. Even in sync
        mode, all operations are executed in parallel.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
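
The reference-counting scheme from the implementation notes of this commit can be sketched in user-space C with C11 atomics. This is a simplified model, not the kernel code: `struct toy_ios` and the `toy_*` functions are hypothetical, an `atomic_int` stands in for the kref, and "executing a request" is reduced to taking a reference.

```c
#include <assert.h>
#include <stdatomic.h>

struct toy_ios;
typedef void (*toy_done_fn)(struct toy_ios *ios);

/* Hypothetical, simplified IO state; not the exofs structure. */
struct toy_ios {
	atomic_int refs;	/* plays the role of the kref */
	toy_done_fn done;	/* plays the role of io_state_done */
	int done_calls;		/* bookkeeping for the example only */
};

static void toy_ios_init(struct toy_ios *ios, toy_done_fn done)
{
	atomic_init(&ios->refs, 1);	/* the submitter's initial reference */
	ios->done = done;
	ios->done_calls = 0;
}

/* Drop one reference; whoever drops the last one runs the done callback. */
static void toy_ios_put(struct toy_ios *ios)
{
	int old = atomic_fetch_sub(&ios->refs, 1);

	assert(old > 0);
	if (old == 1 && ios->done)
		ios->done(ios);
}

/* Called once per request, before it is handed to the device. */
static void toy_request_start(struct toy_ios *ios)
{
	atomic_fetch_add(&ios->refs, 1);
}

/* Called from each request's completion path. */
static void toy_request_complete(struct toy_ios *ios)
{
	toy_ios_put(ios);
}

/* Called by the submitter after all requests have been started. */
static void toy_submit_finished(struct toy_ios *ios)
{
	toy_ios_put(ios);
}

/* Example done callback: just counts how often it ran. */
static void toy_count_done(struct toy_ios *ios)
{
	ios->done_calls++;
}
```

The initial reference keeps the state alive while requests are still being issued, so a request that completes early cannot trigger the done callback before the submitter has finished.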
    • exofs: Move all operations to an io_engine · 06886a5a
      Boaz Harrosh committed
      In anticipation of multi-device operations, we separate osd operations
      into an abstract I/O API. Currently only one device is used, but later,
      when adding more devices, we will drive all devices in parallel according
      to a "data_map" that describes how data is arranged on multiple devices.
      The file system level operates, as before, as if there is one object
      (inode-number) and an i_size. The io engine will split this into the same
      object-number on multiple devices.
      
      At first we introduce the Mirror (raid 1) layout. Eventually we intend
      to fully implement the pNFS-Objects data-map, including
      raid 0,4,5,6 over mirrored devices, over multiple device-groups, and
      more. See: http://tools.ietf.org/html/draft-ietf-nfsv4-pnfs-obj-12
      
      * Define an io_state based API for accessing osd storage devices
        in an abstract way.
        Usage:
      	First a caller allocates an io state with:
      		exofs_get_io_state(struct exofs_sb_info *sbi,
      				   struct exofs_io_state** ios);
      
      	Then calls one of:
      		exofs_sbi_create(struct exofs_io_state *ios);
      		exofs_sbi_remove(struct exofs_io_state *ios);
      		exofs_sbi_write(struct exofs_io_state *ios);
      		exofs_sbi_read(struct exofs_io_state *ios);
      		exofs_oi_truncate(struct exofs_i_info *oi, u64 new_len);
      
      	And when done:
      		exofs_put_io_state(struct exofs_io_state *ios);
      
      * Convert all source files to use this new API
      * Convert from bio_alloc to bio_kmalloc
      * In the io engine we make use of the now-fixed osd_req_decode_sense
      
      There are no functional changes or on-disk additions after this patch.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
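
The allocate / operate / release order listed under "Usage" in this commit can be sketched as follows. The function names and parameter lists are the ones quoted in the commit message; everything else is an assumption made so the example compiles in user space: the struct bodies are stubs, the `int` and `void` return types are guesses, and `toy_write_object()` is a hypothetical caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Stub types: the real structures live in the exofs sources. */
struct exofs_sb_info { int live_states; };
struct exofs_io_state { struct exofs_sb_info *sbi; };

/* Stub with the parameter list quoted in the commit message. */
static int exofs_get_io_state(struct exofs_sb_info *sbi,
			      struct exofs_io_state **ios)
{
	*ios = calloc(1, sizeof(**ios));
	if (!*ios)
		return -1;
	(*ios)->sbi = sbi;
	sbi->live_states++;
	return 0;
}

/* Stub: pretends the write succeeded. */
static int exofs_sbi_write(struct exofs_io_state *ios)
{
	assert(ios && ios->sbi);
	return 0;
}

/* Stub: releases what exofs_get_io_state() allocated. */
static void exofs_put_io_state(struct exofs_io_state *ios)
{
	ios->sbi->live_states--;
	free(ios);
}

/* Hypothetical caller showing the allocate / operate / release order. */
static int toy_write_object(struct exofs_sb_info *sbi)
{
	struct exofs_io_state *ios;
	int ret = exofs_get_io_state(sbi, &ios);

	if (ret)
		return ret;
	ret = exofs_sbi_write(ios);
	exofs_put_io_state(ios);	/* released on success and on failure */
	return ret;
}
```

The same shape would apply to the other operations named in the list; only the middle call changes.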
    • exofs: move osd.c to ios.c · 8ce9bdd1
      Boaz Harrosh committed
      If I do a "git mv" together with a massive code change
      and commit in one patch, git loses the rename and
      records a delete/new instead. This is bad because I want
      a rename recorded so that later rebased/cherry-picked patches
      to the old name will work. Also, the --follow history is lost.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: statfs blocks is sectors not FS blocks · cae012d8
      Boaz Harrosh committed
      Even though exofs has a 4k block size, statfs blocks
      are counted in sectors (512 bytes).
      
      Also, if the target returns 0 for capacity, then make it
      ULLONG_MAX; df does not like zero-size filesystems.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
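
The unit conversion described in this commit amounts to a shift by 9 bits. A minimal sketch, assuming the target reports capacity in bytes; `toy_statfs_blocks()` and `TOY_SECTOR_SHIFT` are hypothetical names, not the exofs code:

```c
#include <assert.h>
#include <limits.h>

#define TOY_SECTOR_SHIFT 9	/* 512-byte sectors */

static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
	      "capacity needs 64 bits");

/* Convert a byte capacity reported by the target into statfs block units. */
static unsigned long long toy_statfs_blocks(unsigned long long capacity_bytes)
{
	if (capacity_bytes == 0)
		capacity_bytes = ULLONG_MAX;	/* avoid reporting a zero-size FS */
	return capacity_bytes >> TOY_SECTOR_SHIFT;
}
```

With this convention one 4k filesystem block counts as eight statfs blocks.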
    • exofs: Prints on mount and unmout · 19fe294f
      Boaz Harrosh committed
      It is important to print in the logs when a filesystem was
      mounted and eventually unmounted.
      
      Print the osd-device's osd_name and pid the FS was
      mounted/unmounted on.
      
      TODO: How to also print the namespace path the filesystem was
            mounted on?
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: refactor exofs_i_info initialization into common helper · 9cfdc7aa
      Boaz Harrosh committed
      There are two places that initialize inodes: exofs_iget() and
      exofs_new_inode()
      
      As more members of exofs_i_info that need initialization are
      added this code will grow. (soon)
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: dbg-print less · fe33cc1e
      Boaz Harrosh committed
      Inner-loop printing is converted to EXOFS_DBG2, which is #defined
      to nothing.
      
      It is now almost bearable to just leave debug on. Every operation
      is printed once, with the most relevant info (I hope).
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • exofs: More sane debug print · 58311c43
      Boaz Harrosh committed
      Debug prints should be somewhat useful without actually
      reading the source code.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
    • ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem) · fab3a549
      Theodore Ts'o committed
      Fix the following potential circular locking dependency between
      mm->mmap_sem and ei->i_data_sem:
      
          =======================================================
          [ INFO: possible circular locking dependency detected ]
          2.6.32-04115-gec044c5 #37
          -------------------------------------------------------
          ureadahead/1855 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81107224>] might_fault+0x5c/0xac
      
          but task is already holding lock:
           (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&ei->i_data_sem){++++..}:
                 [<ffffffff81099bfa>] __lock_acquire+0xb67/0xd0f
                 [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
                 [<ffffffff81516633>] down_read+0x51/0x84
                 [<ffffffff811a2414>] ext4_get_blocks+0x50/0x2a5
                 [<ffffffff811a3453>] ext4_get_block+0xab/0xef
                 [<ffffffff81154f39>] do_mpage_readpage+0x198/0x48d
                 [<ffffffff81155360>] mpage_readpages+0xd0/0x114
                 [<ffffffff811a104b>] ext4_readpages+0x1d/0x1f
                 [<ffffffff810f8644>] __do_page_cache_readahead+0x12f/0x1bc
                 [<ffffffff810f86f2>] ra_submit+0x21/0x25
                 [<ffffffff810f0cfd>] filemap_fault+0x19f/0x32c
                 [<ffffffff81107b97>] __do_fault+0x55/0x3a2
                 [<ffffffff81109db0>] handle_mm_fault+0x327/0x734
                 [<ffffffff8151aaa9>] do_page_fault+0x292/0x2aa
                 [<ffffffff81518205>] page_fault+0x25/0x30
                 [<ffffffff812a34d8>] clear_user+0x38/0x3c
                 [<ffffffff81167e16>] padzero+0x20/0x31
                 [<ffffffff81168b47>] load_elf_binary+0x8bc/0x17ed
                 [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
                 [<ffffffff81166d64>] load_script+0x1b8/0x1cc
                 [<ffffffff81130e95>] search_binary_handler+0xc2/0x259
                 [<ffffffff8113255f>] do_execve+0x1ce/0x2cf
                 [<ffffffff81027494>] sys_execve+0x43/0x5a
                 [<ffffffff8102918a>] stub_execve+0x6a/0xc0
      
          -> #0 (&mm->mmap_sem){++++++}:
                 [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
                 [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
                 [<ffffffff81107251>] might_fault+0x89/0xac
                 [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
                 [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
                 [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
                 [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
                 [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
                 [<ffffffff811392ca>] sys_ioctl+0x56/0x79
                 [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b
      
          other info that might help us debug this:
      
          1 lock held by ureadahead/1855:
           #0:  (&ei->i_data_sem){++++..}, at: [<ffffffff811be1fd>] ext4_fiemap+0x11b/0x159
      
          stack backtrace:
          Pid: 1855, comm: ureadahead Not tainted 2.6.32-04115-gec044c5 #37
          Call Trace:
           [<ffffffff81098c70>] print_circular_bug+0xa8/0xb7
           [<ffffffff81099aa4>] __lock_acquire+0xa11/0xd0f
           [<ffffffff8102f229>] ? sched_clock+0x9/0xd
           [<ffffffff81099e7e>] lock_acquire+0xdc/0x102
           [<ffffffff81107224>] ? might_fault+0x5c/0xac
           [<ffffffff81107251>] might_fault+0x89/0xac
           [<ffffffff81107224>] ? might_fault+0x5c/0xac
           [<ffffffff81124b44>] ? __kmalloc+0x13b/0x18c
           [<ffffffff81139382>] fiemap_fill_next_extent+0x95/0xda
           [<ffffffff811bcb43>] ext4_ext_fiemap_cb+0x138/0x157
           [<ffffffff811bca0b>] ? ext4_ext_fiemap_cb+0x0/0x157
           [<ffffffff811be069>] ext4_ext_walk_space+0x178/0x1f1
           [<ffffffff811be21e>] ext4_fiemap+0x13c/0x159
           [<ffffffff81107224>] ? might_fault+0x5c/0xac
           [<ffffffff811390e6>] do_vfs_ioctl+0x348/0x4d6
           [<ffffffff8129f6d0>] ? __up_read+0x8d/0x95
           [<ffffffff81517fb5>] ? retint_swapgs+0x13/0x1b
           [<ffffffff811392ca>] sys_ioctl+0x56/0x79
           [<ffffffff81028cb2>] system_call_fastpath+0x16/0x1b
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Do not override ext2 or ext3 if built they are built as modules · a214238d
      Theodore Ts'o committed
      The CONFIG_EXT4_USE_FOR_EXT23 option must not try to take over the
      ext2 or ext3 file systems if those file system drivers are
      configured to be built as modules.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • jbd2: Export jbd2_log_start_commit to fix ext4 build · 3b799d15
      Theodore Ts'o committed
      This fixes:
      ERROR: "jbd2_log_start_commit" [fs/ext4/ext4.ko] undefined!
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  3. 08 Dec, 2009 · 1 commit
  4. 07 Dec, 2009 · 1 commit
    • ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT · 4a58579b
      Akira Fujita committed
      This patch fixes three problems in the handling of the
      EXT4_IOC_MOVE_EXT ioctl:
      
      1. The current EXT4_IOC_MOVE_EXT code checks only for read access on the
      original and donor files, which allows illegal write access to the
      donor file, since the donor file is overwritten by the original file's
      data.  To fix this problem, change the access mode checks for the
      original (r->r/w) and donor (r->w) files.
      
      2.  Disallow the use of donor files that have the setuid or setgid bit
      set.
      
      3.  Call mnt_want_write() and mnt_drop_write() before and after
      calling ext4_move_extents() to get write access to the mount.
      Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  5. 09 Dec, 2009 · 7 commits
  6. 08 Dec, 2009 · 1 commit
  7. 07 Dec, 2009 · 1 commit
  8. 05 Dec, 2009 · 2 commits
  9. 04 Dec, 2009 · 4 commits
  10. 03 Dec, 2009 · 10 commits
    • writeback: remove unused nonblocking and congestion checks · 0d99519e
      Wu Fengguang committed
      - no one is calling wb_writeback and write_cache_pages with
        wbc.nonblocking=1 any more
      - lumpy pageout will want to do nonblocking writeback without the
        congestion wait
      
      So remove the congestion checks as suggested by Chris.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Alex Elder <aelder@sgi.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: introduce wbc.for_background · b17621fe
      Wu Fengguang committed
      It will lower the flush priority for NFS, and maybe more in future.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • writeback: remove the always false bdi_cap_writeback_dirty() test · 951c30d1
      Wu Fengguang committed
      This is dead code because no bdi flush thread will be started for
      !bdi_cap_writeback_dirty bdi.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • GFS2: Fix glock refcount issues · 26bb7505
      Steven Whitehouse committed
      This patch fixes some ref counting issues. Firstly, by moving
      the point at which we drop the ref count after a dlm lock
      operation has completed, we ensure that we never call
      gfs2_glock_hold() on a lock with a zero ref count.
      
      Secondly, by using atomic_dec_and_lock() in gfs2_glock_put(),
      we ensure that at no time will a glock with zero ref count
      appear on the lru_list. That means that we can remove the
      check for this in our shrinker (which was racy).
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
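
The second fix in this commit relies on the "decrement, and take the lock only if the count hits zero" idiom. A user-space model of that idiom with C11 atomics is sketched below. It is not the GFS2 or kernel implementation: `toy_dec_and_lock()`, `struct toy_lock` (a trivial spinlock) and `struct toy_glock` are hypothetical, and the LRU list is reduced to a boolean.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Tiny spinlock so the example needs nothing beyond the C11 headers. */
struct toy_lock { atomic_flag held; };

static void toy_lock_acquire(struct toy_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;	/* spin */
}

static void toy_lock_release(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Decrement *cnt. Returns true, with the lock held, only if the count
 * reached zero; otherwise returns false with the lock not held.
 */
static bool toy_dec_and_lock(atomic_int *cnt, struct toy_lock *l)
{
	int old = atomic_load(cnt);

	assert(old > 0);
	/* Fast path: not the last reference, no lock needed. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
	}
	/* Slow path: the count may hit zero, so decide under the lock. */
	toy_lock_acquire(l);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;
	toy_lock_release(l);
	return false;
}

/* Hypothetical object that may sit on an LRU list while referenced. */
struct toy_glock {
	atomic_int ref;
	bool on_lru;
};

static void toy_glock_put(struct toy_glock *gl, struct toy_lock *lru_lock)
{
	if (toy_dec_and_lock(&gl->ref, lru_lock)) {
		gl->on_lru = false;	/* unlinked while the LRU lock is held */
		toy_lock_release(lru_lock);
	}
}
```

Because the final decrement and the unlink both happen under the list lock, anyone walking the list with that lock held never observes an entry whose count is zero.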
    • writeback: remove unused nonblocking and congestion checks (gfs2) · c29cd900
      Wu Fengguang committed
      No one is calling wb_writeback and write_cache_pages with
      wbc.nonblocking=1 any more. And lumpy pageout will want to do
      nonblocking writeback without the congestion wait.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: drop rindex glock to refresh rindex list · 9ae3c6de
      Benjamin Marzinski committed
      When a gfs2 filesystem is grown, it needs to rebuild the rindex list to be able
      to use the new space.  gfs2 does this when the rindex is marked not uptodate,
      which happens when the rindex glock is dropped.  However, on a single node
      setup, there is never any reason to drop the rindex glock, so gfs2 never
      invalidates the rindex. This patch makes gfs2 automatically drop the
      rindex glock after the filesystem grows, so it can refresh the rindex list.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Tag all metadata with jid · 0ab7d13f
      Steven Whitehouse committed
      There are two spare fields in the header common to all GFS2
      metadata. One is just the right size to fit a journal id
      in it, and this patch updates the journal code so that each
      time a metadata block is modified, we tag it with the journal
      id of the node which is performing the modification.
      
      The reason for this is that it should make it much easier to
      debug issues which arise if we can tell which node was the
      last to modify a particular metadata block.
      
      Since the field is updated before the block is written into
      the journal, each journal should only contain metadata which
      is tagged with its own journal id. The one exception to this
      is the journal header block, which might have a different node's
      id in it, if that journal was recovered by another node in the
      cluster.
      
      Thus each journal will contain a record of which nodes recovered
      it, via the journal header.
      
      The other field in the metadata header could potentially be
      used to hold information about what kind of operation was
      performed, but for the time being we just zero it on each
      transaction so that if we use it for that in future, we'll
      know that the information (where it exists) is reliable.
      
      I did consider using the other field to hold the journal
      sequence number, however since in GFS2's journaling we write
      the modified data into the journal and not the original
      data, this gives no information as to what action caused the
      modification, so I think we can probably come up with a better
      use for those 64 bits in the future.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Locking order fix in gfs2_check_blk_state · 2c776349
      Steven Whitehouse committed
      In some cases we already have the rindex lock when
      we enter this function.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Remove dirent_first() function · 1579343a
      Steven Whitehouse committed
      This function only had one caller left, and that caller only
      called it for leaf blocks, hence one branch of the "if" was
      never taken. In addition, the call to get_leaf had already
      verified the metadata type, so the function can be reduced
      to a single line of code in its caller.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Display nobarrier option in /proc/mounts · cdcfde62
      Steven Whitehouse committed
      Since the default is barriers on, this only displays the
      nobarrier option when that is active.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>