1. 30 Jul 2010 (1 commit)
  2. 29 Jul 2010 (8 commits)
    • Revert "GFS2: recovery stuck on transaction lock" · 7cdee5db
      Authored by Steven Whitehouse
      This reverts commit b7dc2df5.
      
      The initial patch didn't quite work since it doesn't cover all
      the possible routes by which the GLF_FROZEN flag might be set.
      A revised fix is coming up in the next patch.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      7cdee5db
    • GFS2: Make "try" lock not try quite so hard · d5341a92
      Authored by Steven Whitehouse
      This looks like a big change, but in reality it is only a single line of
      actual code change; the rest is just moving a function to before its new
      caller. The "try" flag for glocks is a rather subtle and delicate setting,
      since it requires that the state machine tries just hard enough to ensure
      that it has a good chance of getting the requested lock, but not so hard
      that the request can end up blocked behind another.
      
      The patch adds an additional check which will fail any queued try
      lock if there is another request blocking it which is neither granted
      and compatible, nor already in progress. The check is made only after
      all pending locks which may be granted have been granted.
      
      I've checked this with the reproducer for the reported flock bug which
      this is intended to fix, and it now passes.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      d5341a92
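      The policy described above can be sketched roughly as follows. This is an
      illustrative, simplified model (its request structure, field names, and
      compatibility rule are made up, not GFS2's real glock code): once every
      request that can be granted has been granted, a still-queued "try" request
      is failed if some other request blocks it.

      #include <linux/list.h>
      #include <linux/types.h>

      enum lk_state { LK_SHARED, LK_EXCLUSIVE };

      struct lk_req {
              struct list_head list;
              enum lk_state state;
              bool granted;
              bool in_progress;
              bool try_req;           /* request carries a "try" flag */
              bool failed;            /* try request rejected */
      };

      /* Two requests are compatible only if both ask for a shared lock. */
      static bool lk_compatible(enum lk_state a, enum lk_state b)
      {
              return a == LK_SHARED && b == LK_SHARED;
      }

      /* Run only after every request that can be granted has been granted. */
      static void fail_blocked_try_requests(struct list_head *queue)
      {
              struct lk_req *req, *blocker;

              list_for_each_entry(req, queue, list) {
                      if (!req->try_req || req->granted || req->failed)
                              continue;
                      list_for_each_entry(blocker, queue, list) {
                              if (blocker == req)
                                      continue;
                              /* A blocker is any other request that is neither
                               * granted-and-compatible nor already in progress. */
                              if (!(blocker->granted &&
                                    lk_compatible(blocker->state, req->state)) &&
                                  !blocker->in_progress) {
                                      req->failed = true;
                                      break;
                              }
                      }
              }
      }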
    • GFS2: remove dependency on __GFP_NOFAIL · 4244b52e
      Authored by David Rientjes
      The k[mc]allocs in dir_split_leaf() and dir_double_exhash() are failable,
      so remove __GFP_NOFAIL from their masks.
      
      Cc: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      4244b52e
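      As a rough illustration of the change (function and variable names below
      are invented, not GFS2's actual code), dropping __GFP_NOFAIL means the
      allocation may now fail and the error must be propagated to the caller:

      #include <linux/slab.h>
      #include <linux/errno.h>

      /* Before: kcalloc(entries, sizeof(void *), GFP_NOFS | __GFP_NOFAIL)
       * could loop in the allocator forever.  After: handle the failure. */
      static int alloc_hash_table(void ***table, unsigned int entries)
      {
              *table = kcalloc(entries, sizeof(void *), GFP_NOFS);
              if (!*table)
                      return -ENOMEM;
              return 0;
      }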
    • GFS2: Simplify gfs2_write_alloc_required · 461cb419
      Authored by Bob Peterson
      The function gfs2_write_alloc_required always returned zero as its
      return code, so the return code carried no information.  Given that,
      we can use the return value to report whether or not the dinode needs
      a block allocation, rather than passing that result back through a
      pointer argument, which in turn simplifies a bunch of error checking.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      461cb419
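      A sketch of the interface change, with illustrative prototypes (the real
      function operates on a struct gfs2_inode rather than a bare inode):

      #include <linux/fs.h>
      #include <linux/types.h>

      /* Old shape: the answer came back through an output parameter and the
       * return value was always 0:
       *
       *     int write_alloc_required(struct inode *inode, loff_t offset,
       *                              unsigned int len, int *alloc_required);
       *
       * New shape: the answer is the return value. */
      static int write_alloc_required(struct inode *inode, loff_t offset,
                                      unsigned int len)
      {
              /* Returns nonzero when the write needs new blocks allocated;
               * the real test walks the inode's metadata (stubbed here). */
              return 1;
      }

      static int example_caller(struct inode *inode, loff_t offset,
                                unsigned int len)
      {
              if (write_alloc_required(inode, offset, len)) {
                      /* reserve blocks before doing the write */
              }
              return 0;
      }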
    • GFS2: Wait for journal id on mount if not specified on mount command line · ba6e9364
      Authored by Steven Whitehouse
      This patch implements a wait for the journal id in the case that it has
      not been specified on the command line. This is to allow the future
      removal of the mount.gfs2 helper. The journal id would instead be
      directly communicated by gfs_controld to the file system. Here is a
      comparison of the two systems:
      
      Current:
      1. mount calls mount.gfs2
      2. mount.gfs2 connects to gfs_controld to retrieve the journal id
      3. mount.gfs2 adds the journal id to the mount command line and calls
      the mount system call
      4. gfs_controld receives the status of the mount request via a uevent
      
      Proposed:
      1. mount calls the mount system call (no mount.gfs2 helper)
      2. gfs_controld receives a uevent for a gfs2 fs which it doesn't know
      about already
      3. gfs_controld assigns a journal id to it via sysfs
      4. the mount system call then completes as normal (sending a uevent
      according to status)
      
      The advantage of the proposed system is that it is completely backward
      compatible with the current system both at the kernel and at the
      userland levels. The "first" parameter can also be set the same way,
      with the restriction that it must be set before the journal id is
      assigned.
      
      In addition, if mount becomes stuck waiting for a reply from
      gfs_controld which never arrives, then it is killable and will abort the
      mount gracefully.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      ba6e9364
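      The killable wait could look roughly like this: a minimal sketch built on
      wait_event_killable(), with the structure and field names invented for
      illustration (the real patch waits for a sysfs write from gfs_controld):

      #include <linux/wait.h>
      #include <linux/sched.h>

      struct mount_run {
              wait_queue_head_t jid_wait;
              int journal_id;            /* -1 until gfs_controld assigns one */
      };

      static int wait_for_journal_id(struct mount_run *mr)
      {
              int error;

              /* Sleeps until the journal id is assigned, but a fatal signal
               * (e.g. the user killing mount) aborts the wait so the mount
               * can fail gracefully. */
              error = wait_event_killable(mr->jid_wait, mr->journal_id >= 0);
              if (error)
                      return error;
              return mr->journal_id;
      }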
    • GFS2: Use nobh_writepage · 30116ff6
      Authored by Steven Whitehouse
      Use nobh_writepage rather than calling mpage_writepage directly.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      30116ff6
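      In outline, the change swaps one library call for another inside the
      ->writepage() callback. This is a generic sketch; gfs2's real callback and
      get_block routine do more around this, and the names below are illustrative.

      #include <linux/buffer_head.h>
      #include <linux/writeback.h>

      static int example_get_block(struct inode *inode, sector_t lblock,
                                   struct buffer_head *bh_result, int create)
      {
              /* map lblock to a disk block in bh_result; stubbed out here */
              return 0;
      }

      static int example_writepage(struct page *page,
                                   struct writeback_control *wbc)
      {
              /* Previously: return mpage_writepage(page, example_get_block, wbc); */
              return nobh_writepage(page, example_get_block, wbc);
      }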
    • ecryptfs: Bugfix for error related to ecryptfs_hash_buckets · a6f80fb7
      Authored by Andre Osterhues
      The function ecryptfs_uid_hash wrongly assumes that the
      second parameter to hash_long() is the number of hash
      buckets instead of the number of hash bits.
      This patch fixes that and renames the variable
      ecryptfs_hash_buckets to ecryptfs_hash_bits to make it
      clearer.
      
      Fixes: CVE-2010-2492
      Signed-off-by: Andre Osterhues <aosterhues@escrypt.com>
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6f80fb7
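      The bug class in miniature (illustrative names): hash_long()'s second
      argument is a bit count, not a bucket count.

      #include <linux/hash.h>

      #define UID_HASH_BITS    10
      #define UID_HASH_BUCKETS (1 << UID_HASH_BITS)

      static inline unsigned long uid_hash(unsigned long uid)
      {
              /* Wrong: hash_long(uid, UID_HASH_BUCKETS) treats 1024 as a number
               * of bits.  Right: pass the bit count; the result is then an
               * index in [0, UID_HASH_BUCKETS). */
              return hash_long(uid, UID_HASH_BITS);
      }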
    • GFS2: Use kmalloc when possible for ->readdir() · d2a97a4e
      Authored by Steven Whitehouse
      If we don't need a huge amount of memory in ->readdir() then
      we can use kmalloc rather than vmalloc to allocate it. This
      should cut down on the greater overheads associated with
      vmalloc for smaller directories.
      
      We may be able to eliminate vmalloc entirely at some stage,
      but this is easy to do right away.
      
      We also use GFP_NOFS to avoid any issues with respect to deleting
      inodes while under a glock and, following a suggestion from Linus,
      factor out the allocation and deallocation into helpers.
      
      I've given this a test with a variety of different sized
      directories and it seems to work ok.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2a97a4e
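      The factored-out helpers might look roughly like this. The names are
      illustrative, but the pattern is the one described above: kmalloc for
      small buffers, vmalloc only as a fallback, GFP_NOFS throughout so reclaim
      cannot re-enter the filesystem while a glock is held (using the
      three-argument __vmalloc() of this era).

      #include <linux/slab.h>
      #include <linux/vmalloc.h>
      #include <linux/mm.h>

      static void *dir_buf_alloc(size_t size)
      {
              void *ptr = NULL;

              if (size <= KMALLOC_MAX_SIZE)
                      ptr = kmalloc(size, GFP_NOFS | __GFP_NOWARN);
              if (!ptr)
                      ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);
              return ptr;
      }

      static void dir_buf_free(void *ptr)
      {
              if (is_vmalloc_addr(ptr))
                      vfree(ptr);
              else
                      kfree(ptr);
      }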
  3. 28 Jul 2010 (2 commits)
  4. 27 Jul 2010 (3 commits)
  5. 25 Jul 2010 (1 commit)
  6. 24 Jul 2010 (4 commits)
  7. 23 Jul 2010 (2 commits)
    • ceph: avoid dcache readdir for snapdir · a0dff78d
      Authored by Sage Weil
      We should always go to the MDS for readdir on the hidden snapdir.  The
      set of snapshots can change at any time; the client can't trust its cache
      for that.
      Signed-off-by: Sage Weil <sage@newdream.net>
      a0dff78d
    • CIFS: Fix a malicious redirect problem in the DNS lookup code · 4c0c03ca
      Authored by David Howells
      Fix the security problem in the CIFS filesystem DNS lookup code in which a
      malicious redirect could be installed by a random user by simply adding a
      result record into one of their keyrings with add_key() and then invoking a
      CIFS CFS lookup [CVE-2010-2524].
      
      This is done by creating an internal keyring specifically for the caching of
      DNS lookups.  To enforce the use of this keyring, the module init routine
      creates a set of override credentials with the keyring installed as the thread
      keyring and instructs request_key() to only install lookup result keys in that
      keyring.
      
      The override is then applied around the call to request_key().
      
      This has some additional benefits when a kernel service uses this module to
      request a key:
      
       (1) The result keys are owned by root, not the user that caused the lookup.
      
       (2) The result keys don't pop up in the user's keyrings.
      
       (3) The result keys don't come out of the quota of the user that caused the
           lookup.
      
      The keyring can be viewed as root by doing cat /proc/keys:
      
      2a0ca6c3 I-----     1 perm 1f030000     0     0 keyring   .dns_resolver: 1/4
      
      It can then be listed with 'keyctl list' by root.
      
      	# keyctl list 0x2a0ca6c3
      	1 key in keyring:
      	726766307: --alswrv     0     0 dns_resolver: foo.bar.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
      Acked-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c0c03ca
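      The credential-override pattern described above looks roughly like this.
      This is a sketch only: the real code registers its own "dns_resolver" key
      type and builds dns_resolver_cache at module init (override creds whose
      thread keyring is the private keyring); here a generic user key type and
      invented function names stand in.

      #include <linux/cred.h>
      #include <linux/key.h>
      #include <keys/user-type.h>

      /* Override credentials whose thread keyring is the module-private
       * .dns_resolver keyring; assumed to be set up once at module init. */
      static const struct cred *dns_resolver_cache;

      static struct key *dns_resolve(const char *hostname)
      {
              const struct cred *saved_cred;
              struct key *rkey;

              /* Run the lookup with the module's credentials so the result key
               * lands in the private keyring: owned by root, invisible in the
               * caller's keyrings, and outside the caller's key quota. */
              saved_cred = override_creds(dns_resolver_cache);
              rkey = request_key(&key_type_user, hostname, NULL);
              revert_creds(saved_cred);

              return rkey;
      }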
  8. 22 Jul 2010 (1 commit)
  9. 20 Jul 2010 (5 commits)
    • xfs: track AGs with reclaimable inodes in per-ag radix tree · 16fd5367
      Authored by Dave Chinner
      https://bugzilla.kernel.org/show_bug.cgi?id=16348
      
      When the filesystem grows to a large number of allocation groups,
      the summing of reclaimable inodes gets expensive. In many cases,
      most AGs won't have any reclaimable inodes and so we are wasting CPU
      time aggregating over these AGs. This is particularly important for
      the inode shrinker that gets called frequently under memory
      pressure.
      
      To avoid the overhead, track AGs with reclaimable inodes in the
      per-ag radix tree so that we can find all the AGs with reclaimable
      inodes via a simple gang tag lookup. This involves setting the tag
      when the first reclaimable inode is tracked in the AG, and removing
      the tag when the last reclaimable inode is removed from the tree.
      Then the summation process becomes a loop walking the radix tree
      summing AGs with the reclaim tag set.
      
      This significantly reduces the overhead of scanning - a 6400 AG
      filesystem now only uses about 25% of a CPU in kswapd while slab
      reclaim progresses, instead of being permanently stuck at 100% CPU
      and making little progress. Clean filesystems will see
      no overhead and the overhead only increases linearly with the number
      of dirty AGs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      16fd5367
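      The tagging scheme, reduced to its essentials (illustrative structures,
      not XFS's per-ag code; all locking omitted): set the tag when the first
      reclaimable inode appears in an AG, clear it when the last one goes, and
      let the summation walk only the tagged AGs via a gang tag lookup.

      #include <linux/radix-tree.h>
      #include <linux/kernel.h>

      #define TAG_RECLAIMABLE 0

      struct ag_info {
              unsigned long agno;
              int nr_reclaimable;        /* reclaimable inodes in this AG */
      };

      static void ag_inode_set_reclaimable(struct radix_tree_root *ag_tree,
                                           struct ag_info *ag)
      {
              if (ag->nr_reclaimable++ == 0)
                      radix_tree_tag_set(ag_tree, ag->agno, TAG_RECLAIMABLE);
      }

      static void ag_inode_clear_reclaimable(struct radix_tree_root *ag_tree,
                                             struct ag_info *ag)
      {
              if (--ag->nr_reclaimable == 0)
                      radix_tree_tag_clear(ag_tree, ag->agno, TAG_RECLAIMABLE);
      }

      /* The summation now visits only AGs carrying the reclaim tag. */
      static long count_reclaimable_inodes(struct radix_tree_root *ag_tree)
      {
              struct ag_info *batch[16];
              unsigned long next_agno = 0;
              unsigned int i, found;
              long total = 0;

              while ((found = radix_tree_gang_lookup_tag(ag_tree, (void **)batch,
                                              next_agno, ARRAY_SIZE(batch),
                                              TAG_RECLAIMABLE)) > 0) {
                      for (i = 0; i < found; i++)
                              total += batch[i]->nr_reclaimable;
                      next_agno = batch[found - 1]->agno + 1;
              }
              return total;
      }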
    • xfs: convert inode shrinker to per-filesystem contexts · 70e60ce7
      Authored by Dave Chinner
      Now that the shrinker passes us a context, wire up a shrinker context
      per filesystem. This allows us to remove the global mount list and the
      locking problems it introduced. It also means that a shrinker call
      does not need to traverse clean filesystems before finding a
      filesystem with reclaimable inodes.  This significantly reduces
      scanning overhead when lots of filesystems are present.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      70e60ce7
    • Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE · 2ebc3464
      Authored by Dan Rosenberg
      1.  The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
      whether the donor file is append-only before writing to it.
      
      2.  The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
      overflow that allows a user to specify an out-of-bounds range to copy
      from the source file (if off + len wraps around).  I haven't been able
      to successfully exploit this, but I'd imagine that a clever attacker
      could use this to read things he shouldn't.  Even if it's not
      exploitable, it couldn't hurt to be safe.
      Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      2ebc3464
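      The two checks amount to something like this (an illustrative helper, not
      the actual btrfs ioctl code):

      #include <linux/fs.h>
      #include <linux/errno.h>

      static int clone_range_checks(struct inode *dst, struct inode *src,
                                    u64 off, u64 len)
      {
              /* 1. Don't allow the ioctl to write into an append-only file. */
              if (IS_APPEND(dst))
                      return -EPERM;

              /* 2. Reject a source range whose end wraps around (integer
               *    overflow) or runs past the end of the source file. */
              if (off + len < off || off + len > i_size_read(src))
                      return -EINVAL;

              return 0;
      }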
    • Btrfs: fix CLONE ioctl destination file size expansion to block boundary · b5384d48
      Authored by Sage Weil
      The CLONE and CLONE_RANGE ioctls round up the range of extents being
      cloned to the block size when the range to clone extends to the end of file
      (this is always the case with CLONE).  It was then using that offset when
      extending the destination file's i_size.  Fix this by not setting i_size
      beyond the originally requested ending offset.
      
      This bug was introduced by a22285a6 (2.6.35-rc1).
      Signed-off-by: Sage Weil <sage@newdream.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b5384d48
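      In essence the fix clamps the i_size update to the offset the caller asked
      for rather than the block-aligned length used for cloning the extents
      (illustrative helper; the caller is assumed to hold the destination
      inode's mutex):

      #include <linux/fs.h>

      static void clone_update_size(struct inode *dst, u64 destoff, u64 olen)
      {
              /* olen is the originally requested length, not the value rounded
               * up to the block size for the extent copy. */
              u64 endoff = destoff + olen;

              if (endoff > i_size_read(dst))
                      i_size_write(dst, endoff);
      }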
    • Btrfs: fix split_leaf double split corner case · 99d8f83c
      Authored by Chris Mason
      split_leaf was not properly balancing leaves when it was forced to
      split a leaf twice.  This commit adds an extra push left and right
      before forcing the double split in hopes of getting the slot where
      we want to insert at either the start or end of the leaf.
      
      If the extra pushes do work, then we are able to avoid splitting twice
      and we keep the tree properly balanced.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      99d8f83c
  10. 19 Jul 2010 (2 commits)
    • [S390] dasd: use correct label location for diag fba disks · cffab6bc
      Authored by Peter Oberparleiter
      Partition boundary calculation fails for DASD FBA disks under the
      following conditions:
      - disk is formatted with CMS FORMAT with a blocksize of more than
        512 bytes
      - all of the disk is reserved to a single CMS file using CMS RESERVE
      - the disk is accessed using the DIAG mode of the DASD driver
      
      Under these circumstances, the partition detection code tries to
      read the CMS label block containing partition-relevant information
      from logical block offset 1, while it is in fact located at physical
      block offset 1.
      
      Fix this problem by using the correct CMS label block location
      depending on the device type as determined by the DASD SENSE ID
      information.
      Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      cffab6bc
    • mm: add context argument to shrinker callback · 7f8275d0
      Authored by Dave Chinner
      The current shrinker implementation requires the registered callback
      to have global state to work from. This makes it difficult to shrink
      caches that are not global (e.g. per-filesystem caches). Pass the shrinker
      structure to the callback so that users can embed the shrinker structure
      in the context the shrinker needs to operate on and get back to it in the
      callback via container_of().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      7f8275d0
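      With the context argument in place, a per-filesystem shrinker can be wired
      up roughly as below. Only struct shrinker, register_shrinker() and
      DEFAULT_SEEKS are the kernel's own names; everything else is illustrative,
      and the callback shape shown reflects the post-patch interface as this
      commit describes it.

      #include <linux/mm.h>
      #include <linux/kernel.h>

      struct example_mount {
              struct shrinker inode_shrinker;
              /* ... per-filesystem reclaim state ... */
      };

      static int example_shrink(struct shrinker *shrink, int nr_to_scan,
                                gfp_t gfp_mask)
      {
              struct example_mount *mp =
                      container_of(shrink, struct example_mount, inode_shrinker);

              /* reclaim up to nr_to_scan inodes belonging to mp, then return
               * (an estimate of) how many reclaimable objects remain */
              return 0;
      }

      static void example_register_shrinker(struct example_mount *mp)
      {
              mp->inode_shrinker.shrink = example_shrink;
              mp->inode_shrinker.seeks = DEFAULT_SEEKS;
              register_shrinker(&mp->inode_shrinker);
      }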
  11. 17 Jul 2010 (3 commits)
  12. 16 Jul 2010 (3 commits)
    • jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions · 13ceef09
      Authored by Jan Kara
      OCFS2 uses the t_commit trigger to compute and store the checksum of the
      just-committed blocks. When a buffer has b_frozen_data, the checksum is computed
      for it instead of b_data but this can result in an old checksum being
      written to the filesystem in the following scenario:
      
      1) transaction1 is opened
      2) handle1 is opened
      3) journal_access(handle1, bh)
          - This sets jh->b_transaction to transaction1
      4) modify(bh)
      5) journal_dirty(handle1, bh)
      6) handle1 is closed
      7) start committing transaction1, opening transaction2
      8) handle2 is opened
      9) journal_access(handle2, bh)
          - This copies off b_frozen_data to make it safe for transaction1 to commit.
            jh->b_next_transaction is set to transaction2.
      10) jbd2_journal_write_metadata() checksums b_frozen_data
      11) the journal correctly writes b_frozen_data to the disk journal
      12) handle2 is closed
          - There was no dirty call for the bh on handle2, so it is never queued for
            any more journal operation
      13) Checkpointing finally happens, and it just spools the bh via normal buffer
      writeback.  This will write b_data, which was never triggered on and thus
      contains a wrong (old) checksum.
      
      This patch fixes the problem by calling the trigger at the moment data is
      frozen for journal commit - i.e., either when b_frozen_data is created by
      do_get_write_access or just before we write a buffer to the log if
      b_frozen_data does not exist. We also rename the trigger to t_frozen as
      that better describes when it is called.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      13ceef09
    • ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node · a39953dd
      Authored by Wengang Wang
      For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
      before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
      dlm_migration_can_proceed() for that purpose.  However, if the node is
      down, dlm_migration_can_proceed() will also return "go ahead".  In this
      rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
      the BUG_ON() that trips over this condition.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      a39953dd
    • ocfs2: Don't duplicate pages past i_size during CoW. · f5e27b6d
      Authored by Tao Ma
      During CoW, the pages after i_size don't contain valid data, so there's
      no need to read and duplicate them.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      f5e27b6d
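      The clamp described above is a one-liner in spirit (an illustrative loop,
      not ocfs2's actual CoW code):

      #include <linux/fs.h>
      #include <linux/pagemap.h>

      static void cow_duplicate_pages(struct inode *inode, loff_t start, loff_t end)
      {
              loff_t pos, isize = i_size_read(inode);

              /* Pages past i_size hold no valid data: don't read or copy them. */
              if (end > isize)
                      end = isize;

              for (pos = start; pos < end; pos += PAGE_CACHE_SIZE) {
                      /* read the source page and copy it into the new extent */
              }
      }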
  13. 15 Jul 2010 (5 commits)
    • GFS2: rename causes kernel Oops · 728a756b
      Authored by Bob Peterson
      This patch fixes a kernel Oops in the GFS2 rename code.
      
      The problem was in the way the gfs2 directory code was trying
      to re-use sentinel directory entries.
      
      In the failing case, gfs2's rename function was renaming a
      file to another name that had the same non-trivial length.
      The file being renamed happened to be the first directory
      entry on the leaf block.
      
      First, the rename code (gfs2_rename in ops_inode.c) found the
      original directory entry and decided it could do its job by
      simply replacing the directory entry with another.  Therefore
      it determined correctly that no block allocations were needed.
      
      Next, the rename code deleted the old directory entry prior to
      replacing it with the new name.  Therefore, the soon-to-be
      replaced directory entry was temporarily made into a directory
      entry "sentinel" or a place holder at the start of a leaf block.
      
      Lastly, it went to re-add the replacement directory entry in
      that leaf block.  However, when gfs2_dirent_find_space was
      looking for space in the leaf block, it used the wrong value
      for the sentinel.  That threw off its calculations, so later
      it decided it couldn't really re-use the sentinel and therefore
      had to allocate a new leaf block.  But because it had previously decided
      to re-use the directory entry, it had not taken the time to
      grab a new block allocation for the inode.  Therefore, the
      inode's i_alloc pointer was still NULL and it crashed trying to
      reference it.
      
      In the case of sentinel directory entries, the entire dirent is
      reused, not just the "free space" portion of it, and therefore
      the function gfs2_dirent_find_space should use the value 0
      rather than GFS2_DIRENT_SIZE(0) for the actual dirent size.
      
      Fixing this calculation enables the reproducer programs to work
      properly.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      728a756b
    • GFS2: BUG in gfs2_adjust_quota · 8b421601
      Authored by Abhijith Das
      HighMem pages on i686 do not get mapped to the buffer_heads and this was
      causing a NULL pointer dereference when we were trying to memset page buffers
      to zero.
      We now use zero_user() that kmaps the page and directly manipulates page data.
      This patch also fixes a boundary condition that was incorrect.
      Signed-off-by: Abhi Das <adas@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      8b421601
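      The HighMem-safe zeroing boils down to this substitution (illustrative
      wrapper around the real zero_user() helper, which kmaps the page around
      the memset):

      #include <linux/highmem.h>

      static void zero_quota_bytes(struct page *page, unsigned int offset,
                                   unsigned int nbytes)
      {
              /* Broken on i686 HighMem: memset(page_address(page) + offset, 0,
               * nbytes), since page_address() can be NULL for a HighMem page.
               * zero_user() maps the page temporarily and clears the range. */
              zero_user(page, offset, nbytes);
      }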
    • GFS2: Fix kernel NULL pointer dereference by dlm_astd · b1becbde
      Authored by Bob Peterson
      This patch fixes a problem in an error path when looking
      up dinodes.  There are two sister-functions, gfs2_inode_lookup
      and gfs2_process_unlinked_inode.  Both functions acquire and
      hold the i_iopen glock for the dinode being looked up. The last
      thing they try to do is hold the i_gl glock for the dinode.
      If that glock fails for some reason, the error path was
      incorrectly calling gfs2_glock_put for the i_iopen glock twice.
      This resulted in the glock being prematurely freed.  The
      "minimum hold time" usually kept the glock in memory, but the
      lock interface to dlm (aka lock_dlm) freed its memory for the
      glock.  In some circumstances, it would cause dlm's dlm_astd daemon
      to try to call the bast function for the freed lock_dlm memory,
      which resulted in a NULL pointer dereference.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      b1becbde
    • GFS2: recovery stuck on transaction lock · b7dc2df5
      Authored by Bob Peterson
      This patch fixes bugzilla bug #590878: GFS2: recovery stuck on
      transaction lock.  We set the frozen flag on the glock when we receive
      a completion that cannot be delivered due to blocked locks. At that
      point we check to see whether the first waiting holder has the noexp
      flag set. If the noexp lock is queued later, then we need to unfreeze
      the glock at that point in time, namely, in the glock work function.
      
      This patch was originally written by Steve Whitehouse, but since
      he's on holiday, I'm submitting it.  It's been well tested with a
      complex recovery test called revolver.
      Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      b7dc2df5
    • GFS2: O_TRUNC not working on stuffed files across cluster · a8bf2bc2
      Authored by Bob Peterson
      This patch replaces a statement that got dropped out by accident.
      Without the patch, truncates on stuffed (very small) files cause
      those files to have an unpredictable size.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      a8bf2bc2