1. 20 Jan, 2010 1 commit
    • T
      eCryptfs: Use notify_change for truncating lower inodes · 5f3ef64f
      Tyler Hicks committed
      When truncating inodes in the lower filesystem, eCryptfs directly
      invoked vmtruncate(). As Christoph Hellwig pointed out, vmtruncate() is
      a filesystem helper function, but filesystems may need to do more than
      just a call to vmtruncate().
      
      This patch moves the lower inode truncation out of ecryptfs_truncate()
      and renames the function to truncate_upper().  truncate_upper() updates
      an iattr for the lower inode to indicate if the lower inode needs to be
      truncated upon return.  ecryptfs_setattr() then calls notify_change(),
      using the updated iattr for the lower inode, to complete the truncation.
      
      For eCryptfs functions needing to truncate, ecryptfs_truncate() is
      reintroduced as a simple way to truncate the upper inode to a specified
      size and then truncate the lower inode accordingly.
      
      https://bugs.launchpad.net/bugs/451368
      Reported-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Dustin Kirkland <kirkland@canonical.com>
      Cc: ecryptfs-devel@lists.launchpad.net
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
  2. 17 Jan, 2010 7 commits
  3. 16 Jan, 2010 9 commits
    • E
      inotify: only warn once for inotify problems · 976ae32b
      Eric Paris committed
      inotify will WARN() if it finds that the idr and the fsnotify internals
      somehow got out of sync.  It was only supposed to do this once but due
      to this stupid bug it would warn every single time a problem was
      detected.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
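The warn-once behaviour this commit restores can be sketched in plain userspace C. This is only an illustration of the pattern, not the kernel's WARN_ONCE() macro; `warn_once` and `warn_count` are hypothetical names introduced for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical counter so callers can observe how many warnings were printed. */
static int warn_count;

/*
 * Print a warning the first time `cond` is true and stay silent afterwards.
 * The condition is returned so the call can sit inside an if ().
 */
static bool warn_once(bool cond, const char *msg)
{
    static bool warned;    /* persists across calls */

    assert(msg != NULL);
    if (cond && !warned) {
        warned = true;
        warn_count++;
        fprintf(stderr, "WARNING: %s\n", msg);
    }
    return cond;
}
```

Calling `warn_once(true, "idr out of sync")` repeatedly prints a single line; the bug described above corresponds to the `warned` flag never taking effect.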
    • E
      inotify: do not reuse watch descriptors · 9e572cc9
      Eric Paris committed
      Since commit 7e790dd5 ("inotify: fix
      error paths in inotify_update_watch") inotify changed the manner in which
      it gave watch descriptors back to userspace.  Previous to this commit
      inotify acted like the following:
      
        inotify_add_watch(X, Y, Z) = 1
        inotify_rm_watch(X, 1);
        inotify_add_watch(X, Y, Z) = 2
      
      but after this patch inotify would return watch descriptors like so:
      
        inotify_add_watch(X, Y, Z) = 1
        inotify_rm_watch(X, 1);
        inotify_add_watch(X, Y, Z) = 1
      
      which I saw as equivalent to opening an fd where
      
        open(file) = 1;
        close(1);
        open(file) = 1;
      
      seemed perfectly reasonable.  The issue is that quite a bit of userspace
      apparently relies on the behavior in which watch descriptors will not be
      quickly reused.  KDE relies on it, I know some selinux packages rely on
      it, and I have heard complaints from other random sources such as debian
      bug 558981.
      
      Although the man page implies what we do is ok, we broke userspace so
      this patch almost reverts us to the old behavior.  It is still slightly
      racy and I have patches that would fix that, but they are rather large
      and this will fix it for all real world cases.  The race is as follows:
      
       - task1 creates a watch and blocks in idr_new_watch() before it updates
         the hint.
       - task2 creates a watch and updates the hint.
       - task1 updates the hint with its older wd
       - task removes the watch created by task2
       - task adds a new watch and will reuse the wd originally given to task2
      
      it requires moving some locking around the hint (last_wd) but this should
      solve it for the real world and be -stable safe.
      
      As a side effect this patch papers over a bug in the lib/idr code which
      is causing a large number of WARNs to pop up on people's systems and many
      reports on kerneloops.org.  I'm working on the root cause of that idr
      bug separately but this should make inotify immune to that issue.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
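The "do not quickly reuse descriptors" behaviour can be modelled with a small userspace allocator that always searches above a hint. This is a simplified sketch, not the inotify or idr code: `wd_table`, `wd_alloc`, `wd_free` and `MAX_WD` are hypothetical, there is no locking (so the race described above is not modelled), and wrap-around once the table is exhausted is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WD 64    /* hypothetical table size for the sketch */

struct wd_table {
    bool used[MAX_WD + 1];    /* index 0 unused; descriptors start at 1 */
    int last_wd;              /* hint: most recently allocated descriptor */
};

/* Allocate the lowest free descriptor strictly above the hint; -1 if none. */
static int wd_alloc(struct wd_table *t)
{
    assert(t != NULL);
    for (int wd = t->last_wd + 1; wd <= MAX_WD; wd++) {
        if (!t->used[wd]) {
            t->used[wd] = true;
            t->last_wd = wd;    /* later allocations start after this one */
            return wd;
        }
    }
    return -1;
}

/* Release a descriptor; the hint is left alone so the value is not reused soon. */
static void wd_free(struct wd_table *t, int wd)
{
    assert(t != NULL);
    if (wd >= 1 && wd <= MAX_WD)
        t->used[wd] = false;
}
```

With this model, add/remove/add yields descriptors 1 and then 2, matching the older behaviour in the first listing of the commit message.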
    • D
      xfs: xfs_swap_extents needs to handle dynamic fork offsets · e09f9860
      Dave Chinner committed
      When swapping extents, we can corrupt inodes by swapping data forks
      that are in incompatible formats.  This is caused by the two inodes
      having different fork offsets due to the presence of an attribute
      fork on an attr2 filesystem.  xfs_fsr tries to be smart about
      setting the fork offset, but the trick it plays only works on attr1
      (old fixed format attribute fork) filesystems.
      
      Changing the way xfs_fsr sets up the attribute fork will prevent
      this situation from ever occurring, so in the kernel code we can get
      by with a preventative fix - check that the data fork in the
      defragmented inode is in a format valid for the inode it is being
      swapped into.  This will lead to files that will silently and
      potentially repeatedly fail defragmentation, so issue a warning to
      the log when this particular failure occurs to let us know that
      xfs_fsr needs updating/fixing.
      
      To help identify how to improve xfs_fsr to avoid this issue, add
      trace points for the inodes being swapped so that we can determine
      why the swap was rejected and to confirm that the code is making the
      right decisions and modifications when swapping forks.
      
      A further complication is that even when the swap is allowed to
      proceed, if the fork offset differs between the two inodes then the
      value for the maximum number of extents the data fork can hold can be
      wrong. Make sure these are also set correctly after the swap occurs.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • D
      xfs: fix missing error check in xfs_rtfree_range · 3daeb42c
      Dave Chinner committed
      When xfs_rtfind_forw() returns an error, the block is returned
      uninitialised.  xfs_rtfree_range() is not checking the error return,
      so could be using an uninitialised block number for modifying bitmap
      summary info.
      
      The problem was found by gcc when compiling the *userspace* libxfs
      code - it is a copy of the kernel code with the exact same bug.
      gcc gives an uninitialised variable warning on the userspace code
      but not on the kernel code. You gotta love the consistency (Mmmm,
      slightly chewy today!).
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
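The class of bug fixed here (using an out-parameter after a failed call) can be shown with a minimal userspace sketch. `find_forward` and `free_range` are hypothetical stand-ins, not the XFS functions, and the negative-errno convention is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical search helper: on failure it returns a negative errno and
 * leaves *found untouched, so the caller's variable stays uninitialised.
 */
static int find_forward(long start, long limit, long *found)
{
    assert(found != NULL);
    if (start > limit)
        return -EINVAL;
    *found = limit;
    return 0;
}

/* Only use `end` after checking that find_forward() succeeded. */
static int free_range(long start, long limit, long *freed_to)
{
    long end;    /* deliberately not initialised, as in the bug */
    int error;

    assert(freed_to != NULL);
    error = find_forward(start, limit, &end);
    if (error)
        return error;    /* the kind of check the patch adds */

    *freed_to = end;
    return 0;
}
```

Without the `if (error)` test, `*freed_to` would be written from an indeterminate value, which is the situation gcc flagged in the userspace build.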
    • D
      xfs: fix stale inode flush avoidance · 4b6a4688
      Dave Chinner committed
      When reclaiming stale inodes, we need to guarantee that inodes are
      unpinned before returning with a "clean" status. If we don't we can
      reclaim inodes that are pinned, leading to use after free in the
      transaction subsystem as transactions complete.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • D
      xfs: Remove inode iolock held check during allocation · 126976c7
      Dave Chinner committed
      lockdep complains about the lock not being initialised as we do an
      ASSERT based check that the lock is not held before we initialise it
      to catch inodes freed with the lock held.
      
      lockdep does this check for us in the lock initialisation code, so
      remove the ASSERT to stop the lockdep warning.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • D
      xfs: reclaim all inodes by background tree walks · 57817c68
      Dave Chinner committed
      We cannot do direct inode reclaim without taking the flush lock to
      ensure that we do not reclaim an inode under IO. We check the inode
      is clean before doing direct reclaim, but this is not good enough
      because the inode flush code marks the inode clean once it has
      copied the in-core dirty state to the backing buffer.
      
      It is the flush lock that determines whether the inode is still
      under IO, even though it is marked clean, and the inode is still
      required at IO completion so we can't reclaim it even though it is
      clean in core. Hence the requirement that we need to take the flush
      lock even on clean inodes because this guarantees that the inode
      writeback IO has completed and it is safe to reclaim the inode.
      
      With delayed write inode flushing, we could end up waiting a long
      time on the flush lock even for a clean inode. The background
      reclaim already handles this efficiently, so avoid all the problems
      by killing the direct reclaim path altogether.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • D
      xfs: Avoid inodes in reclaim when flushing from inode cache · 018027be
      Dave Chinner committed
      The reclaim code will handle flushing of dirty inodes before reclaim
      occurs, so avoid them when determining whether an inode is a
      candidate for flushing to disk when walking the radix trees.  This
      is based on a test patch from Christoph Hellwig.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • D
      xfs: reclaim inodes under a write lock · c8e20be0
      Dave Chinner committed
      Make the inode tree reclaim walk exclusive to avoid races with
      concurrent sync walkers and lookups. This is a version of a patch
      posted by Christoph Hellwig that avoids all the code duplication.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  4. 14 Jan, 2010 7 commits
  5. 13 Jan, 2010 1 commit
  6. 12 Jan, 2010 2 commits
    • M
      smaps: fix wrong rss count · 7f53a09e
      Minchan Kim committed
      A long time ago we counted the zero page as file_rss, and
      vm_normal_page did not return NULL for it.
      
      But now that we have reinstated ZERO_PAGE, vm_normal_page's
      implementation can return NULL for the zero page, and we no longer
      count it in file_rss.
      
      As a result, RSS and PSS no longer match.  For consistency, let's
      ignore the zero page in smaps_pte_range.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      proc: partially revert "procfs: provide stack information for threads" · 1306d603
      KOSAKI Motohiro committed
      Commit d899bf7b (procfs: provide stack information for threads) added
      stack information to /proc/{pid}/status, but it causes a large
      performance regression, because both taking mmap_sem and walking the
      page tables are heavy operations.  Unfortunately /proc/{pid}/status is
      also used by the ps command, and ps is one of the most important
      components.
      
      With many processes running, ps performance is:
      
      [before d899bf7b]
      
      % perf stat ps >/dev/null
      
       Performance counter stats for 'ps':
      
           4090.435806  task-clock-msecs         #      0.032 CPUs
                   229  context-switches         #      0.000 M/sec
                     0  CPU-migrations           #      0.000 M/sec
                   234  page-faults              #      0.000 M/sec
            8587565207  cycles                   #   2099.425 M/sec
            9866662403  instructions             #      1.149 IPC
            3789415411  cache-references         #    926.409 M/sec
              30419509  cache-misses             #      7.437 M/sec
      
         128.859521955  seconds time elapsed
      
      [after d899bf7b]
      
      % perf stat  ps  > /dev/null
      
       Performance counter stats for 'ps':
      
           4305.081146  task-clock-msecs         #      0.028 CPUs
                   480  context-switches         #      0.000 M/sec
                     2  CPU-migrations           #      0.000 M/sec
                   237  page-faults              #      0.000 M/sec
            9021211334  cycles                   #   2095.480 M/sec
           10605887536  instructions             #      1.176 IPC
            3612650999  cache-references         #    839.160 M/sec
              23917502  cache-misses             #      5.556 M/sec
      
         152.277819582  seconds time elapsed
      
      Thus, this patch reverts it.  Fortunately /proc/{pid}/task/{tid}/smaps
      provides almost the same information, so we can use that instead.
      
      Commit d899bf7b introduced two features:
      
       1) Add the [thread stack: xxxx] annotation to
          /proc/{pid}/task/{tid}/maps.
       2) Add StackUsage field to /proc/{pid}/status.
      
      I only revert (2), because I haven't seen (1) cause a regression.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 11 Jan, 2010 6 commits
    • J
      quota: Fix dquot_transfer for filesystems different from ext4 · 05b5d898
      Jan Kara committed
      Commit fd8fbfc1 modified the way we find the amount of reserved space
      belonging to an inode. The amount of reserved space is checked
      from dquot_transfer and thus inode_reserved_space gets called
      even for filesystems that don't provide get_reserved_space callback
      which results in a BUG.
      
      Fix the problem by checking get_reserved_space callback and return 0 if
      the filesystem does not provide it.
      
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
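The fix described above amounts to guarding an optional callback before calling it. A minimal userspace sketch of that pattern follows; the `demo_*` types and helpers are hypothetical and only loosely modelled on the quota operations mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>

struct demo_inode;

/* Hypothetical per-filesystem operations; the callback is optional. */
struct demo_quota_ops {
    long long *(*get_reserved_space)(struct demo_inode *inode);
};

struct demo_inode {
    long long reserved;
    const struct demo_quota_ops *ops;
};

/* Example callback for a filesystem that does track reserved space. */
static long long *demo_get_reserved(struct demo_inode *inode)
{
    return &inode->reserved;
}

/* Return the reserved space, or 0 when the filesystem has no callback. */
static long long demo_inode_reserved_space(struct demo_inode *inode)
{
    assert(inode != NULL);
    if (!inode->ops || !inode->ops->get_reserved_space)
        return 0;
    return *inode->ops->get_reserved_space(inode);
}
```

A filesystem that leaves the callback NULL now reports 0 reserved bytes instead of dereferencing a missing function pointer.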
    • S
      GFS2: Use MAX_LFS_FILESIZE for meta inode size · ba198098
      Steven Whitehouse committed
      Using ~0ULL was causing sign issues in filemap_fdatawrite_range, so
      use MAX_LFS_FILESIZE instead.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
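The sign problem can be reproduced in a few lines of userspace C. This sketch assumes a 64-bit two's-complement `long long` (the conversion of an out-of-range unsigned value is implementation-defined in older C standards), and `LLONG_MAX` merely stands in for a positive upper limit; the kernel's actual MAX_LFS_FILESIZE definition is not reproduced here. `is_negative_offset` is a hypothetical helper.

```c
#include <assert.h>
#include <limits.h>

static_assert(sizeof(long long) * CHAR_BIT == 64,
              "sketch assumes a 64-bit long long");

/*
 * Report whether a size turns negative once it is interpreted as a signed
 * 64-bit file offset, which is how offset-based interfaces see it.
 */
static int is_negative_offset(unsigned long long size)
{
    long long off = (long long)size;

    return off < 0;
}
```

On common compilers `is_negative_offset(~0ULL)` is true because the all-ones pattern reads as -1, while a limit that fits in the signed type stays positive.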
    • D
      xfs: Ensure we force all busy extents in range to disk · fd45e478
      Dave Chinner committed
      When we search for and find a busy extent during allocation we
      force the log out to ensure the extent free transaction is on
      disk before the allocation transaction. The current implementation
      has a subtle bug in it--it does not handle multiple overlapping
      ranges.
      
      That is, if we free lots of little extents into a single
      contiguous extent, then allocate the contiguous extent, the busy
      search code stops searching at the first extent it finds that
      overlaps the allocated range. It then uses the commit LSN of the
      transaction to force the log out to.
      
      Unfortunately, the other busy ranges might have more recent
      commit LSNs than the first busy extent that is found, and this
      results in xfs_alloc_search_busy() returning before all the
      extent free transactions are on disk for the range being
      allocated. This can lead to potential metadata corruption or
      stale data exposure after a crash because log replay won't replay
      all the extent free transactions that cover the allocation range.
      Modified-by: Alex Elder <aelder@sgi.com>
      
      (Dropped the "found" argument from the xfs_alloc_busysearch trace
      event.)
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
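The failure mode (stopping at the first overlapping busy extent) can be illustrated with a userspace sketch that scans every overlapping entry and keeps the most recent commit LSN. This is one possible way to cover all overlaps, shown for illustration only; it is not claimed to be what the patch does, and `busy_extent`, `overlaps` and `busy_search_max_lsn` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct busy_extent {    /* hypothetical busy-extent record */
    uint64_t start;
    uint64_t len;
    uint64_t commit_lsn;
};

/* True if [b->start, b->start + b->len) intersects [start, start + len). */
static bool overlaps(const struct busy_extent *b, uint64_t start, uint64_t len)
{
    return b->start < start + len && start < b->start + b->len;
}

/*
 * Return the highest commit LSN among ALL busy extents overlapping the
 * range, or 0 if none overlap.
 */
static uint64_t busy_search_max_lsn(const struct busy_extent *busy, size_t n,
                                    uint64_t start, uint64_t len)
{
    uint64_t max_lsn = 0;

    assert(busy != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!overlaps(&busy[i], start, len))
            continue;
        /* Do not return on the first hit: a later entry may be newer. */
        if (busy[i].commit_lsn > max_lsn)
            max_lsn = busy[i].commit_lsn;
    }
    return max_lsn;
}
```

Forcing the log to the value returned here would cover every extent free transaction in the range, whereas stopping at the first match could pick an older LSN.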
    • D
      xfs: Don't flush stale inodes · 44e08c45
      Dave Chinner committed
      Because inodes remain in cache much longer than inode buffers do
      under memory pressure, we can get the situation where we have
      stale, dirty inodes being reclaimed but the backing storage has
      been freed.  Hence we should never, ever flush XFS_ISTALE inodes
      to disk as there is no guarantee that the backing buffer is in
      cache and still marked stale when the flush occurs.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • C
      xfs: fix timestamp handling in xfs_setattr · d6d59bad
      Christoph Hellwig committed
      We currently have some rather odd code in xfs_setattr for
      updating the a/c/mtime timestamps:
      
       - first we do a non-transaction update if all three are updated
         together
       - second we implicitly update the ctime for various changes
         instead of relying on the ATTR_CTIME flag
       - third we set the timestamps to the current time instead of the
         arguments in the iattr structure in many cases.
      
      This patch makes sure we update it in a consistent way:
      
       - always transactional
       - ctime is only updated if ATTR_CTIME is set or we do a size
         update, which is a special case
       - always to the times passed in from the caller instead of the
         current time
      
      The only non-size caller of xfs_setattr that doesn't come from
      the VFS is updated to set ATTR_CTIME and pass in a valid ctime
      value.
      Reported-by: Eric Blake <ebb9@byu.net>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
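The "use the caller's times, and only for the flags that are set" rule can be sketched as follows. This is a userspace illustration with hypothetical `DEMO_ATTR_*` bits and `demo_*` types; the kernel's ATTR_* values, the transaction handling and the size-update special case are all left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's ATTR_* values are not reproduced. */
#define DEMO_ATTR_ATIME 0x1u
#define DEMO_ATTR_MTIME 0x2u
#define DEMO_ATTR_CTIME 0x4u

struct demo_times {
    long atime, mtime, ctime;
};

struct demo_iattr {
    unsigned int valid;     /* which fields of `t` the caller filled in */
    struct demo_times t;    /* times supplied by the caller */
};

/* Copy only the requested timestamps, always from the caller's values. */
static void demo_apply_times(struct demo_times *inode,
                             const struct demo_iattr *ia)
{
    assert(inode != NULL && ia != NULL);
    if (ia->valid & DEMO_ATTR_ATIME)
        inode->atime = ia->t.atime;
    if (ia->valid & DEMO_ATTR_MTIME)
        inode->mtime = ia->t.mtime;
    if (ia->valid & DEMO_ATTR_CTIME)    /* no implicit ctime update */
        inode->ctime = ia->t.ctime;
}
```

Nothing in the helper reads the current time, so a caller that wants ctime changed has to set the flag and pass a value, as the last paragraph of the message describes.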
    • C
      xfs: use DECLARE_EVENT_CLASS · ea9a4888
      Christoph Hellwig committed
      Using DECLARE_EVENT_CLASS allows us to share trace event code
      instead of duplicating it in the binary.  This was not available
      before 2.6.33 so it had to be done as a separate step once the
      prerequisite was merged.
      
      This only requires changes to xfs_trace.h and the results are
      rather impressive:
      
      hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
      text	   data	    bss	    dec	    hex	filename
       607732	  41884	   3616	 653232	  9f7b0	fs/xfs/xfs.o
      1026732	  41884	   3808	1072424	 105d28	fs/xfs/xfs.o.old
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
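The saving shown by the `size` output can be quantified with a small helper. The figures passed in come from the listing above; `percent_reduction` is a hypothetical name used only for this calculation.

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
static double percent_reduction(unsigned long before, unsigned long after)
{
    assert(before > 0 && after <= before);
    return 100.0 * (double)(before - after) / (double)before;
}
```

Applied to the `dec` column (1072424 down to 653232) this gives a reduction of roughly 39%, and applied to the `text` column (1026732 down to 607732) roughly 41%.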
  8. 09 Jan, 2010 1 commit
  9. 08 Jan, 2010 3 commits
  10. 07 Jan, 2010 3 commits