1. 08 9月, 2005 1 次提交
  2. 14 7月, 2005 1 次提交
    • A
      [PATCH] Fix soft lockup due to NTFS: VFS part and explanation · 88bd5121
      Anton Altaparmakov 提交于
      Something has changed in the core kernel such that we now get concurrent
      inode write outs, one e.g via pdflush and one via sys_sync or whatever.
      This causes a nasty deadlock in ntfs.  The only clean solution
      unfortunately requires a minor vfs api extension.
      
      First the deadlock analysis:
      
      Prerequisive knowledge: NTFS has a file $MFT (inode 0) loaded at mount
      time.  The NTFS driver uses the page cache for storing the file contents as
      usual.  More interestingly this file contains the table of on-disk inodes
      as a sequence of MFT_RECORDs.  Thus NTFS driver accesses the on-disk inodes
      by accessing the MFT_RECORDs in the page cache pages of the loaded inode
      $MFT.
      
      The situation: VFS inode X on a mounted ntfs volume is dirty.  For same
      inode X, the ntfs_inode is dirty and thus corresponding on-disk inode,
      which is as explained above in a dirty PAGE_CACHE_PAGE belonging to the
      table of inodes ($MFT, inode 0).
      
      What happens:
      
      Process 1: sys_sync()/umount()/whatever...  calls __sync_single_inode() for
      $MFT -> do_writepages() -> write_page for the dirty page containing the
      on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
      clears PageUptodate() on the page to prevent anyone else getting hold of it
      whilst it does the write out (this is necessary as the on-disk inode needs
      "fixups" applied before the write to disk which are removed again after the
      write and PageUptodate is then set again).  It then analyses the page
      looking for dirty on-disk inodes and when it finds one it calls
      ntfs_may_write_mft_record() to see if it is safe to write this on-disk
      inode.  This then calls ilookup5() to check if the corresponding VFS inode
      is in icache().  This in turn calls ifind() which waits on the inode lock
      via wait_on_inode whilst holding the global inode_lock.
      
      Process 2: pdflush results in a call to __sync_single_inode for the same
      VFS inode X on the ntfs volume.  This locks the inode (I_LOCK) then calls
      write-inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
      the page (in page cache of table of inodes $MFT, inode 0) containing the
      on-disk inode.  This page has PageUptodate() clear because of Process 1
      (see above) so read_cache_page() blocks when tries to take the page lock
      for the page so it can call ntfs_read_page().
      
      Thus Process 1 is holding the page lock on the page containing the on-disk
      inode X and it is waiting on the inode X to be unlocked in ifind() so it
      can write the page out and then unlock the page.
      
      And Process 2 is holding the inode lock on inode X and is waiting for the
      page to be unlocked so it can call ntfs_readpage() or discover that
      Process 1 set PageUptodate() again and use the page.
      
      Thus we have a deadlock due to ifind() waiting on the inode lock.
      
      The only sensible solution: NTFS does not care whether the VFS inode is
      locked or not when it calls ilookup5() (it doesn't use the VFS inode at
      all, it just uses it to find the corresponding ntfs_inode which is of
      course attached to the VFS inode (both are one single struct); and it uses
      the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
      hence we want a modified ilookup5_nowait() which is the same as ilookup5()
      but it does not wait on the inode lock.
      
      Without such functionality I would have to keep my own ntfs_inode cache in
      the NTFS driver just so I can find ntfs_inodes independent of their VFS
      inodes which would be slow, memory and cpu cycle wasting, and incredibly
      stupid given the icache already exists in the VFS.
      
      Below is a patch that does the ilookup5_nowait() implementation in
      fs/inode.c and exports it.
      
      ilookup5_nowait.diff:
      
      Introduce ilookup5_nowait() which is basically the same as ilookup5() but
      it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
      done in ifind()).
      
      This is needed to avoid a nasty deadlock in NTFS.
      Signed-off-by: NAnton Altaparmakov <aia21@cantab.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      88bd5121
  3. 13 7月, 2005 3 次提交
    • R
      [PATCH] inotify · 0eeca283
      Robert Love 提交于
      inotify is intended to correct the deficiencies of dnotify, particularly
      its inability to scale and its terrible user interface:
      
              * dnotify requires the opening of one fd per each directory
                that you intend to watch. This quickly results in too many
                open files and pins removable media, preventing unmount.
              * dnotify is directory-based. You only learn about changes to
                directories. Sure, a change to a file in a directory affects
                the directory, but you are then forced to keep a cache of
                stat structures.
              * dnotify's interface to user-space is awful.  Signals?
      
      inotify provides a more usable, simple, powerful solution to file change
      notification:
      
              * inotify's interface is a system call that returns a fd, not SIGIO.
      	  You get a single fd, which is select()-able.
              * inotify has an event that says "the filesystem that the item
                you were watching is on was unmounted."
              * inotify can watch directories or files.
      
      Inotify is currently used by Beagle (a desktop search infrastructure),
      Gamin (a FAM replacement), and other projects.
      
      See Documentation/filesystems/inotify.txt.
      Signed-off-by: NRobert Love <rml@novell.com>
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0eeca283
    • A
      [PATCH] bugfix: two read_inode() calls without clear_inode() call between · 4120db47
      Artem B. Bityuckiy 提交于
      Bug symptoms
      ~~~~~~~~~~~~
      For the same inode VFS calls read_inode() twice and doesn't call
      clear_inode() between the two read_inode() invocations.
      
      Bug description
      ~~~~~~~~~~~~~~~
      Suppose we have an inode which has zero reference count but is still in
      the inode cache. Suppose kswapd invokes shrink_icache_memory() to free
      some RAM. In prune_icache() inodes are removed from i_hash. prune_icache
      () is then going to call clear_inode(), but drops the inode_lock
      spinlock before this. If in this moment another task calls iget() for an
      inode which was just removed from i_hash by prune_icache(), then iget()
      invokes read_inode() for this inode, because it is *already removed*
      from i_hash.
      
      The end result is: we call iget(#N) then iput(#N); inode #N has zero
      i_count now and is in the inode cache; kswapd starts. kswapd removes the
      inode #N from i_hash ans is preempted; we call iget(#N) again;
      read_inode() is invoked as the result; but we expect clear_inode()
      before.
      
      Fix
      ~~~~~~~
      To fix the bug I remove inodes from i_hash later, when clear_inode() is
      actually called. I remove them from i_hash under spinlock protection.
      Since the i_state is set to I_FREEING, it is safe to do this. The others
      will sleep waiting for the inode state change.
      
      I also postpone removing inodes from i_sb_list. It is not compulsory to
      do so but I do it for readability reasons. Inodes are added/removed to
      the lists together everywhere in the code and there is no point to
      change this rule. This is harmless because the only user of i_sb_list
      which somehow may interfere with me (invalidate_list()) is excluded by
      the iprune_sem mutex.
      
      The same race is possible in invalidate_list() so I do the same for it.
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4120db47
    • M
      [PATCH] __wait_on_freeing_inode fix · 168a9fd6
      Miklos Szeredi 提交于
      This patch fixes queer behavior in __wait_on_freeing_inode().
      
      If I_LOCK was not set it called yield(), effectively busy waiting for the
      removal of the inode from the hash.  This change was introduced within
      "[PATCH] eliminate inode waitqueue hashtable" Changeset 1.1938.166.16 last
      october by wli.
      
      The solution is to restore the old behavior, of unconditionally waiting on
      the waitqueue.  It doesn't matter if I_LOCK is not set initally, the task
      will go to sleep, and wake up when wake_up_inode() is called from
      generic_delete_inode() after removing the inode from the hash chain.
      
      Comment is also updated to better reflect current behavior.
      
      This condition is very hard to trigger normally (simultaneous clear_inode()
      with iget()) so probably only heavy stress testing can reveal any change of
      behavior.
      Signed-off-by: NMiklos Szeredi <miklos@szeredi.hu>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      168a9fd6
  4. 08 7月, 2005 1 次提交
    • M
      [PATCH] export generic_drop_inode() to modules · cb2c0233
      Mark Fasheh 提交于
      OCFS2 wants to mark an inode which has been orphaned by another node so
      that during final iput it takes the correct path through the VFS and can
      pass through the OCFS2 delete_inode callback.  Since i_nlink can get out of
      date with other nodes, the best way I see to accomplish this is by clearing
      i_nlink on those inodes at drop_inode time.  Other than this small amount
      of work, nothing different needs to happen, so I think it would be cleanest
      to be able to just call generic_drop_inode at the end of the OCFS2
      drop_inode callback.
      Signed-off-by: NMark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cb2c0233
  5. 24 6月, 2005 1 次提交
    • A
      [PATCH] fix for prune_icache()/forced final iput() races · 991114c6
      Alexander Viro 提交于
      Based on analysis and a patch from Russ Weight <rweight@us.ibm.com>
      
      There is a race condition that can occur if an inode is allocated and then
      released (using iput) during the ->fill_super functions.  The race
      condition is between kswapd and mount.
      
      For most filesystems this can only happen in an error path when kswapd is
      running concurrently.  For isofs, however, the error can occur in a more
      common code path (which is how the bug was found).
      
      The logic here is "we want final iput() to free inode *now* instead of
      letting it sit in cache if fs is going down or had not quite come up".  The
      problem is with kswapd seeing such inodes in the middle of being killed and
      happily taking over.
      
      The clean solution would be to tell kswapd to leave those inodes alone and
      let our final iput deal with them.  I.e.  add a new flag
      (I_FORCED_FREEING), set it before write_inode_now() there and make
      prune_icache() leave those alone.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      991114c6
  6. 06 5月, 2005 2 次提交
  7. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4