1. 14 Oct, 2008 16 commits
    • ocfs2: Make high level btree extend code generic · 0eb8d47e
      Tao Ma committed
      Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
      function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
      ocfs2_do_cluster_allocation() now, but the latter can be used for other
      btree types as well.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: Abstract ocfs2_extent_tree in b-tree operations. · e7d4cb6b
      Tao Ma committed
      The old extent tree operations assume that the ocfs2_extent_list in
      ocfs2_dinode is the tree root. Since xattr will also use
      ocfs2_extent_list to store large values for an xattr entry, we
      refactor the tree operations so that xattr can use them directly.
      
      The refactoring includes 4 steps:
      1. Abstract set/get of last_eb_blk and update_clusters since they may
         be stored in different locations for dinode and xattr.
      2. Add a new structure named ocfs2_extent_tree to indicate the
         extent tree the operation will work on.
      3. Remove all the use of fe_bh and di, use root_bh and root_el in
         extent tree instead. So now all the fe_bh is replaced with
         et->root_bh, el with root_el accordingly.
      4. Make ocfs2_lock_allocators generic. Until now it could only be used
         for file extend allocation, but the whole function is useful when we
         want to store large EAs.
      
      Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
      for anything other than truncate inode data btrees.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
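The first three refactoring steps above can be illustrated with a small user-space sketch. It is not the ocfs2 code: the struct layouts, the `extent_tree_*` names and the `record_growth` helper are all hypothetical, chosen only to show how per-root accessors let generic code run against either a dinode root or an xattr value root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for the two kinds of tree root. */
struct dinode_root {
    uint64_t last_eb_blk;
    uint32_t clusters;
};

struct xattr_value_root {
    uint32_t clusters;      /* same information, different location */
    uint64_t last_eb_blk;
};

struct extent_tree;

/* Step 1: the accessors that differ per root type. */
struct extent_tree_ops {
    void     (*set_last_eb_blk)(struct extent_tree *et, uint64_t blkno);
    uint64_t (*get_last_eb_blk)(struct extent_tree *et);
    void     (*update_clusters)(struct extent_tree *et, uint32_t added);
};

/* Step 2: one handle describing the tree an operation works on. */
struct extent_tree {
    const struct extent_tree_ops *ops;
    void *root;             /* stands in for the data behind root_bh */
};

static void di_set_last(struct extent_tree *et, uint64_t blkno)
{
    ((struct dinode_root *)et->root)->last_eb_blk = blkno;
}

static uint64_t di_get_last(struct extent_tree *et)
{
    return ((struct dinode_root *)et->root)->last_eb_blk;
}

static void di_update_clusters(struct extent_tree *et, uint32_t added)
{
    ((struct dinode_root *)et->root)->clusters += added;
}

static void xv_set_last(struct extent_tree *et, uint64_t blkno)
{
    ((struct xattr_value_root *)et->root)->last_eb_blk = blkno;
}

static uint64_t xv_get_last(struct extent_tree *et)
{
    return ((struct xattr_value_root *)et->root)->last_eb_blk;
}

static void xv_update_clusters(struct extent_tree *et, uint32_t added)
{
    ((struct xattr_value_root *)et->root)->clusters += added;
}

static const struct extent_tree_ops dinode_ops = {
    di_set_last, di_get_last, di_update_clusters
};

static const struct extent_tree_ops xattr_ops = {
    xv_set_last, xv_get_last, xv_update_clusters
};

void extent_tree_init_dinode(struct extent_tree *et, struct dinode_root *di)
{
    et->ops = &dinode_ops;
    et->root = di;
}

void extent_tree_init_xattr(struct extent_tree *et, struct xattr_value_root *xv)
{
    et->ops = &xattr_ops;
    et->root = xv;
}

/* Step 3: generic code touches only the handle, never a dinode directly. */
void extent_tree_record_growth(struct extent_tree *et, uint64_t new_last_blk,
                               uint32_t added_clusters)
{
    assert(et != NULL && et->ops != NULL);
    et->ops->set_last_eb_blk(et, new_last_blk);
    et->ops->update_clusters(et, added_clusters);
}
```

A caller would initialise the handle once for its root type and then pass it to the generic helper, which is roughly the role et->root_bh and root_el play in the commit message.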
    • ocfs2: Use ocfs2_extent_list instead of ocfs2_dinode. · 811f933d
      Tao Ma committed
      ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
      ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
      they are all limited to an inode btree because they use a struct
      ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
      (the part of an ocfs2_dinode they actually use) so that the xattr btree code
      can use these functions.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: Modify ocfs2_num_free_extents for future xattr usage. · 231b87d1
      Tao Ma committed
      ocfs2_num_free_extents() is used to find the number of free extent records
      in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
      use this for extended attribute trees in the future, so genericize the
      interface to take a buffer head. A future patch will allow that buffer_head
      to contain any structure rooting an ocfs2 btree.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: track local alloc state via debugfs · 9a8ff578
      Mark Fasheh committed
      A per-mount debugfs file, "local_alloc", is created which when read will
      expose live state of the node's local alloc file. Performance impact is
      minimal, only a bit of memory overhead per mount point. Still, the code is
      hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug
      local alloc performance problems on a live system.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: throttle back local alloc when low on disk space · 9c7af40b
      Mark Fasheh committed
      Ocfs2's local allocator disables itself for the duration of a mount point
      when it has trouble allocating a large enough area from the primary bitmap.
      That can cause performance problems, especially for disks which were only
      temporarily full or fragmented. This patch allows for the allocator to
      shrink its window first, before being disabled. Later, it can also be
      re-enabled so that any performance drop is minimized.
      
      To do this, we allow the value of osb->local_alloc_bits to be shrunk when
      needed. The default value is recorded in a mostly read-only variable so that
      we can re-initialize when required.
      
      Locking had to be updated so that we could protect changes to
      local_alloc_bits. Mostly this involves protecting various local alloc values
      with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
      is used when the local allocator has shrunk, but is not disabled. If the
      available space dips below 1 megabyte, the local alloc file is disabled. In
      either case, local alloc is re-enabled 30 seconds after the event, or when
      an appropriate amount of bits is seen in the primary bitmap.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
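The shrink-then-disable behaviour described above can be sketched as a small state machine. This is an illustration only: the `la_*` names and struct are hypothetical, halving the window on each failure is an assumption (the message only says the window shrinks), and the osb spinlock and the 30-second timer are left out.

```c
#include <assert.h>

/* Hypothetical states; LA_THROTTLED mirrors the OCFS2_LA_THROTTLED idea. */
enum la_state { LA_ENABLED, LA_THROTTLED, LA_DISABLED };

struct local_alloc {
    enum la_state state;
    unsigned int default_bits;  /* configured window size, mostly read-only */
    unsigned int bits;          /* current, possibly shrunk, window size */
    unsigned int min_bits;      /* below this the allocator is disabled */
};

void la_init(struct local_alloc *la, unsigned int default_bits,
             unsigned int min_bits)
{
    assert(default_bits >= min_bits);
    la->state = LA_ENABLED;
    la->default_bits = default_bits;
    la->bits = default_bits;
    la->min_bits = min_bits;
}

/*
 * Called when the primary bitmap could not supply a window of la->bits.
 * Assumption: shrink by halving; disable once the window would drop
 * below the minimum.
 */
void la_alloc_failed(struct local_alloc *la)
{
    unsigned int halved = la->bits / 2;

    if (la->state == LA_DISABLED)
        return;

    if (halved < la->min_bits) {
        la->state = LA_DISABLED;
    } else {
        la->bits = halved;
        la->state = LA_THROTTLED;
    }
}

/* Called after the timeout, or when enough free bits are seen again. */
void la_reenable(struct local_alloc *la)
{
    la->bits = la->default_bits;
    la->state = LA_ENABLED;
}
```

Keeping the default in its own field is what lets `la_reenable()` restore the full window regardless of how far it was shrunk.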
    • ocfs2: Track local alloc bits internally · ebcee4b5
      Mark Fasheh committed
      Do this instead of tracking absolute local alloc size. This avoids
      needless re-calculation of bits from bytes in localalloc.c. Additionally,
      the value is now in a more natural unit for internal file system bitmap
      work.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • ocfs2: POSIX file locks support · 53da4939
      Mark Fasheh committed
      This is actually pretty easy since fs/dlm already handles the bulk of the
      work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the
      underlying lock manager, so I only had to add the right calls.
      
      Cluster-aware POSIX locks ("plocks") can be turned off by the same means as
      UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume.
      Internally, the file system uses two sets of file_operations, depending on
      whether cluster-aware plocks are required. This turns out to be easier than
      implementing local-only versions of ->lock.
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    • vfs: Use const for kernel parser table · a447c093
      Steven Whitehouse committed
      This is a much better version of a previous patch to make the parser
      tables constant. Rather than changing the typedef, we put the "const" in
      all the various places where it's required, allowing the __initconst
      exception for nfsroot which was the cause of the previous trouble.
      
      This was posted for review some time ago and I believe it's been in -mm
      since then.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Alexander Viro <aviro@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Simplify devpts_pty_kill · a6f37daa
      Sukadev Bhattiprolu committed
      When creating a new pty, save the pty's inode in the tty->driver_data.
      Use this inode in pty_kill() to identify the devpts instance. Since
      we now have the inode for the pty, we can skip get_node() lookup and
      remove the unused get_node().
      
      TODO:
      	- check if the mutex_lock is needed in pty_kill().
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Simplify devpts_pty_new() · 89a52e10
      Sukadev Bhattiprolu committed
      devpts_pty_new() is called when setting up a new pty and will not
      have an existing dentry or inode for the pty. So don't bother
      looking for an existing dentry - just create a new one.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Simplify devpts_get_tty() · 527b3e47
      Sukadev Bhattiprolu committed
      As pointed out by H. Peter Anvin, since the inode for the pty is known,
      we don't need to look it up.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Add an instance parameter devpts interfaces · 15f1a633
      Sukadev Bhattiprolu committed
      Pass-in 'inode' or 'tty' parameter to devpts interfaces.  With multiple
      devpts instances, these parameters will be used in subsequent patches
      to identify the instance of devpts mounted. The parameters also help
      simplify devpts implementation.
      
      Changelog[v3]:
      	- minor changes due to merge with ttydev updates
      	- rename parameters to emphasize they are ptmx or pts inodes
      	- pass-in tty_struct * to devpts_pty_kill() (this will help
      	  cleanup the get_node() call in a subsequent patch)
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tty: Redo current tty locking · 934e6ebf
      Alan Cox committed
      Currently it is sometimes locked by the tty mutex and sometimes by the
      sighand lock. The latter is in fact correct, and now that we can hand back
      referenced objects we can fix this up without problems around sleeping functions.
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tty: the vhangup syscall is racy · 2cb5998b
      Alan Cox committed
      We now have the infrastructure to sort this out but rather than teaching
      the syscall tty lock rules we move the hard work into a tty helper
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tty: Make get_current_tty use a kref · 452a00d2
      Alan Cox committed
      We now return a kref covered tty reference. That ensures the tty structure
      doesn't go away when you have a return from get_current_tty. This is not
      enough to protect you from most of the resources being freed behind your
      back - yet.
      
      [Updated to include fixes for SELinux problems found by Andrew Morton and
       an s390 leak found while debugging the former]
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
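The idea of returning a counted reference can be shown with a minimal single-threaded sketch. The kernel uses struct kref with kref_get()/kref_put(); the plain integer counter, the `tty_obj` and `task` structs and every function name below are hypothetical simplifications, and nothing here is atomic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tty structure with an embedded refcount. */
struct tty_obj {
    int refcount;
    int released;   /* set when the last reference is dropped */
};

struct task {
    struct tty_obj *tty;    /* the task's controlling tty, may be NULL */
};

/* Take an extra reference; NULL is passed through untouched. */
struct tty_obj *tty_obj_get(struct tty_obj *tty)
{
    if (tty != NULL)
        tty->refcount++;
    return tty;
}

/* Drop a reference; the object is released when the count hits zero. */
void tty_obj_put(struct tty_obj *tty)
{
    if (tty == NULL)
        return;
    assert(tty->refcount > 0);
    if (--tty->refcount == 0)
        tty->released = 1;  /* a real release callback would free here */
}

/*
 * Sketch of the new contract: the caller receives a counted reference
 * and must pair it with tty_obj_put() when done.
 */
struct tty_obj *get_current_tty_sketch(struct task *t)
{
    return tty_obj_get(t->tty);
}
```

The structure stays alive while the caller holds its reference, which matches the guarantee stated above; as the message notes, this alone does not protect the other resources hanging off the tty.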
  2. 13 Oct, 2008 4 commits
  3. 12 Oct, 2008 1 commit
  4. 11 Oct, 2008 4 commits
    • ext4: add an option to control error handling on file data · 5bf5683a
      Hidehiro Kawai committed
      If the journal doesn't abort when it gets an IO error in file data
      blocks, the file data corruption will spread silently.  Because
      most applications and commands do buffered writes without fsync(),
      they don't notice the IO error.  It's scary for mission critical
      systems.  On the other hand, if the journal aborts whenever it gets
      an IO error in file data blocks, the system will easily become
      inoperable.  So this patch introduces a filesystem option to
      determine whether it aborts the journal or just calls printk() when
      it gets an IO error in file data.
      
      If you mount an ext4 fs with the data_err=abort option, it aborts on file
      data write error.  If you mount it with data_err=ignore, it doesn't
      abort, it just calls printk().  data_err=ignore is the default.
      
      Here is the corresponding patch of the ext3 version:
      http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
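The policy selected by the option can be sketched in user-space C. Only the option strings data_err=abort and data_err=ignore and the default come from the message above; the parser, the `journal_sketch` struct and the function names are hypothetical and do not reflect how ext4 parses mount options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum data_err_mode { DATA_ERR_IGNORE, DATA_ERR_ABORT };

/* Hypothetical journal state, reduced to what the policy needs. */
struct journal_sketch {
    enum data_err_mode mode;
    int aborted;    /* non-zero once the journal has been aborted */
    int warnings;   /* counts messages that would go to printk() */
};

void journal_sketch_init(struct journal_sketch *j)
{
    j->mode = DATA_ERR_IGNORE;  /* data_err=ignore is the default */
    j->aborted = 0;
    j->warnings = 0;
}

/* Returns 0 if opt was recognised, -1 otherwise (mode left unchanged). */
int parse_data_err(const char *opt, enum data_err_mode *mode)
{
    assert(opt != NULL && mode != NULL);
    if (strcmp(opt, "data_err=abort") == 0) {
        *mode = DATA_ERR_ABORT;
        return 0;
    }
    if (strcmp(opt, "data_err=ignore") == 0) {
        *mode = DATA_ERR_IGNORE;
        return 0;
    }
    return -1;
}

/* Called when writing a file data block failed with an IO error. */
void on_data_write_error(struct journal_sketch *j)
{
    j->warnings++;              /* both modes report the error */
    if (j->mode == DATA_ERR_ABORT)
        j->aborted = 1;         /* stop journaling to limit the damage */
}
```

With the default mode an error only bumps the warning count; after parsing "data_err=abort" the same error also aborts the journal.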
    • jbd2: don't dirty original metadata buffer on abort · 7ad7445f
      Hidehiro Kawai committed
      Currently, original metadata buffers are dirtied when they are
      unfiled whether the journal has aborted or not.  Eventually these
      buffers will be written back to the filesystem by pdflush.  This
      means some metadata buffers are written to the filesystem without
      journaling if the journal aborts.  So if a journal abort and a
      system crash happen at the same time, the filesystem would be left
      in an inconsistent state.  Additionally, replaying journaled metadata
      can partly overwrite the latest metadata on the filesystem, because
      if the journal gets aborted, journaled metadata are preserved and
      replayed during the next mount so as not to lose uncheckpointed
      metadata.  This would also break the consistency of the filesystem.
      
      This patch prevents original metadata buffers from being dirtied
      on abort by clearing BH_JBDDirty flag from those buffers.  Thus,
      no metadata buffers are written to the filesystem without journaling.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add checks for errors from jbd2 · 7ffe1ea8
      Hidehiro Kawai committed
      If the journal has aborted due to a checkpointing failure, we
      have to keep the contents of the journal space.  Otherwise, the
      filesystem will lose uncheckpointed metadata completely and
      become inconsistent.  To avoid this, we need to keep needs_recovery
      flag if checkpoint has failed.
      
      With this patch, ext4_put_super() detects a checkpointing failure
      from the return value of journal_destroy(), then it invokes
      ext4_abort() to make the filesystem read only and keep
      needs_recovery flag.  Errors from jbd2_journal_flush() are also
      handled by this patch in some places.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: fix error handling for checkpoint io · 44519faf
      Hidehiro Kawai committed
      When a checkpointing IO fails, current JBD2 code doesn't check the
      error and continues journaling.  This means the latest metadata can be
      lost from both the journal and filesystem.
      
      This patch leaves the failed metadata blocks in the journal space
      and aborts journaling in the case of jbd2_log_do_checkpoint().
      To achieve this, we need to do:
      
      1. don't remove the failed buffer from the checkpoint list in
         the case of __try_to_free_cp_buf() because it may be released or
         overwritten by a later transaction
      2. jbd2_log_do_checkpoint() is the last chance, remove the failed
         buffer from the checkpoint list and abort the journal
      3. when checkpointing fails, don't update the journal super block to
         prevent the journaled contents from being cleaned.  For safety,
         don't update j_tail and j_tail_sequence either
      4. when checkpointing fails, notify the ext4 layer of this error so
         that ext4 doesn't clear the needs_recovery flag, otherwise the
         journaled contents are ignored and cleaned in the recovery phase
      5. if the recovery fails, keep the needs_recovery flag
      6. prevent jbd2_cleanup_journal_tail() from being called between
         __jbd2_journal_drop_transaction() and jbd2_journal_abort()
         (a possible race issue between jbd2_log_do_checkpoint()s called by
         jbd2_journal_flush() and __jbd2_log_wait_for_space())
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  5. 13 Oct, 2008 1 commit
    • jbd2: abort when failed to log metadata buffers · 77e841de
      Hidehiro Kawai committed
      If we fail to write metadata buffers to the journal space but
      succeed in writing the commit record, stale data can be written
      back to the filesystem as metadata in the recovery phase.
      
      To avoid this, when we fail to write out metadata buffers,
      abort the journal before writing the commit record.
      
      We can also avoid this kind of corruption by using the journal
      checksum feature because it can detect invalid metadata blocks in the
      journal and prevent them from being replayed.  So we don't need to care
      about asynchronous commit record writeout with a checksum.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
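The ordering rule in this commit (abort before the commit record is written if any metadata write failed) can be sketched as follows. The `txn_journal` struct, the `COMMIT_FAILED` constant and `commit_transaction_sketch()` are hypothetical names for illustration, not jbd2 code, and the checksum alternative is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define COMMIT_FAILED (-1)  /* hypothetical error code for this sketch */

/* Hypothetical journal state, reduced to what the ordering rule needs. */
struct txn_journal {
    int aborted;            /* non-zero once the journal is aborted */
    int commit_records;     /* number of commit records written */
};

/*
 * Commit one transaction. metadata_write_failed says whether any of the
 * transaction's metadata buffers failed to reach the journal space.
 */
int commit_transaction_sketch(struct txn_journal *j, int metadata_write_failed)
{
    assert(j != NULL);

    if (j->aborted)
        return COMMIT_FAILED;   /* nothing more is logged after an abort */

    if (metadata_write_failed) {
        /*
         * Abort first: without a commit record, recovery will not
         * replay this transaction's possibly stale blocks.
         */
        j->aborted = 1;
        return COMMIT_FAILED;
    }

    j->commit_records++;        /* only now is the commit record written */
    return 0;
}
```

A failed metadata write therefore never produces a commit record, and later commits on the aborted journal are refused as well.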
  6. 11 Oct, 2008 2 commits
  7. 10 Oct, 2008 11 commits
  8. 09 Oct, 2008 1 commit