1. 07 Jan, 2006 2 commits
    • NLM: Further cancel fixes · 64a318ee
      J. Bruce Fields committed
       If the server receives an NLM cancel call and finds no waiting lock to
       cancel, then chances are the lock has already been applied, and the client
       just hadn't yet processed the NLM granted callback before it sent the
       cancel.
      
       The Open Group text, for example, permits a server to return either success
       (LCK_GRANTED) or failure (LCK_DENIED) in this case.  But returning an error
       seems more helpful; the client may be able to use it to recognize that a
       race has occurred and to recover from the race.
      
       So, modify the relevant functions to return an error in this case.
      Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    • [PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store · f6b3ec23
      Badari Pulavarty committed
      Here is the patch to implement madvise(MADV_REMOVE) - which frees up a
      given range of pages & its associated backing store.  Current
      implementation supports only shmfs/tmpfs and other filesystems return
      -ENOSYS.
      
      "Some app allocates large tmpfs files, then when some task quits and some
      client disconnect, some memory can be released.  However the only way to
      release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli
      
      Databases want to use this feature to drop a section of their bufferpool
      (shared memory segments) - without writing back to disk/swap space.
      
      This feature is also useful for supporting hot-plug memory on UML.
      
      Concerns raised by Andrew Morton:
      
      - "We have no plan for holepunching!  If we _do_ have such a plan (or
        might in the future) then what would the API look like?  I think
        sys_holepunch(fd, start, len), so we should start out with that."
      
      - Using madvise is very weird, because people will ask "why do I need to
        mmap my file before I can stick a hole in it?"
      
      - None of the other madvise operations call into the filesystem in this
        manner.  A broad question is: is this capability an MM operation or a
        filesystem operation?  truncate, for example, is a filesystem operation
        which sometimes has MM side-effects.  madvise is an mm operation and with
        this patch, it gains FS side-effects, only they're really, really
        significant ones."
      
      Comments:
      
      - Andrea suggested the fs operation too but then it's more efficient to
        have it as a mm operation with fs side effects, because they don't
        immediately know fd and physical offset of the range.  It's possible to
        fixup in userland and to use the fs operation but it's more expensive,
        the vmas are already in the kernel and we can use them.
      
      Short term plan &  Future Direction:
      
      - We seem to need this interface only for shmfs/tmpfs files in the short
        term.  We have to add hooks into the filesystem for correctness and
        completeness.  This is what this patch does.
      
      - In the future, plan is to support both fs and mmap apis also.  This
        also involves (other) filesystem specific functions to be implemented.
      
      - Current patch doesn't support VM_NONLINEAR - which can be addressed in
        the future.
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 04 Jan, 2006 1 commit
    • [PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE · 994fc28c
      Zach Brown committed
      readpage(), prepare_write(), and commit_write() callers are updated to
      understand the special return code AOP_TRUNCATED_PAGE in the style of
      writepage() and WRITEPAGE_ACTIVATE.  AOP_TRUNCATED_PAGE tells the caller that
      the callee has unlocked the page and that the operation should be tried again
      with a new page.  OCFS2 uses this to detect and work around a lock inversion in
      its aop methods.  There should be no change in behaviour for methods that don't
      return AOP_TRUNCATED_PAGE.
      
      WRITEPAGE_ACTIVATE is also prepended with AOP_ for consistency and they are
      made enums so that kerneldoc can be used to document their semantics.
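
      As an illustration (not the patch's code), the retry convention described
      above can be sketched in userspace with a mock page and a mock callee.
      The numeric enum values are an assumption about the kernel header, and
      struct mock_page, read_with_retry() and flaky_readpage() are hypothetical
      names.

```c
#include <stddef.h>

/* Assumed values; the real definitions live in the kernel's fs.h. */
enum positive_aop_returns {
    AOP_WRITEPAGE_ACTIVATE = 0x80000,
    AOP_TRUNCATED_PAGE     = 0x80001,
};

/* Hypothetical stand-in for struct page: only tracks the lock bit. */
struct mock_page {
    int locked;
};

typedef int (*readpage_fn)(struct mock_page *page, void *ctx);

/*
 * Caller side: obtain a locked page, call the method, and start over with
 * a new page whenever the callee reports AOP_TRUNCATED_PAGE.  Returns the
 * callee's result, or -1 if max_tries attempts were all truncated.
 */
int read_with_retry(readpage_fn readpage, void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        struct mock_page page = { .locked = 1 };  /* find and lock a page */
        int ret = readpage(&page, ctx);

        if (ret != AOP_TRUNCATED_PAGE)
            return ret;
        /* The callee already unlocked the page; retry with a new one. */
    }
    return -1;
}

/*
 * Callee side: unlock the page and ask for a retry while the counter in
 * ctx is positive, then succeed.
 */
int flaky_readpage(struct mock_page *page, void *ctx)
{
    int *remaining = ctx;

    page->locked = 0;
    if (*remaining > 0) {
        (*remaining)--;
        return AOP_TRUNCATED_PAGE;
    }
    return 0;
}
```

      A callee that never returns AOP_TRUNCATED_PAGE passes through the loop
      once, which matches the statement that other methods see no change in
      behaviour.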
      Signed-off-by: Zach Brown <zach.brown@oracle.com>
  3. 09 Nov, 2005 2 commits
  4. 08 Nov, 2005 6 commits
  5. 07 Nov, 2005 2 commits
  6. 31 Oct, 2005 1 commit
  7. 28 Oct, 2005 1 commit
    • [PATCH] gfp_t: fs/* · 27496a8c
      Al Viro committed
       - ->releasepage() annotated (s/int/gfp_t), instances updated
       - missing gfp_t in fs/* added
       - fixed misannotation from the original sweep caught by bitwise checks:
         XFS used __nocast both for gfp_t and for flags used by XFS allocator.
         The latter left with unsigned int __nocast; we might want to add a
         different type for those but for now let's leave them alone.  That,
         BTW, is a case when __nocast use had been actively confusing - it had
         been used in the same code for two different and similar types, with
         no way to catch misuses.  Switch of gfp_t to bitwise had caught that
         immediately...
      
      One tricky bit is left alone to be dealt with later - mapping->flags is
      a mix of gfp_t and error indications.  Left alone for now.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 11 Sep, 2005 1 commit
  9. 10 Sep, 2005 1 commit
    • [PATCH] files: files struct with RCU · ab2af1f5
      Dipankar Sarma committed
      Patch to eliminate struct files_struct.file_lock spinlock on the reader side
      and use rcu refcounting rcuref_xxx api for the f_count refcounter.  The
      updates to the fdtable are done by allocating a new fdtable structure and
      setting files->fdt to point to the new structure.  The fdtable structure is
      protected by RCU thereby allowing lock-free lookup.  For fd arrays/sets that
      are vmalloced, we use keventd to free them since RCU callbacks can't sleep.  A
      global list of fdtable to be freed is not scalable, so we use a per-cpu list.
      If keventd is already handling the current cpu's work, we use a timer to defer
      queueing of that work.
      
      Since the last publication, this patch has been re-written to avoid using
      explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
      primitives instead.  This required that the fd information is kept in a
      separate structure (fdtable) and updated atomically.
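
      As an illustration (a userspace analogy, not the kernel code), the
      publish-and-lookup scheme can be sketched with C11 atomics: a release
      store stands in for rcu_assign_pointer() and an acquire load stands in
      conservatively for rcu_dereference().  The grace period and the deferred
      freeing are deliberately left out, and all type and function names below
      are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of the fdtable: a size plus an array of slots. */
struct fdtable_sketch {
    size_t max_fds;
    int *fd;
};

struct files_sketch {
    _Atomic(struct fdtable_sketch *) fdt;
};

/* Reader: lock-free lookup through the currently published table. */
int lookup_fd(struct files_sketch *files, size_t idx)
{
    struct fdtable_sketch *fdt =
        atomic_load_explicit(&files->fdt, memory_order_acquire);

    if (fdt == NULL || idx >= fdt->max_fds)
        return -1;
    return fdt->fd[idx];
}

/*
 * Updater: build a complete new table, copy the old slots, then publish it
 * with a single pointer store.  The old table is handed back through
 * old_out because it may only be freed once no reader can still see it.
 * Returns 0 on success and -1 on allocation failure.
 */
int expand_fdtable(struct files_sketch *files, size_t new_max,
                   struct fdtable_sketch **old_out)
{
    struct fdtable_sketch *old =
        atomic_load_explicit(&files->fdt, memory_order_relaxed);
    struct fdtable_sketch *new_fdt = malloc(sizeof *new_fdt);

    if (new_fdt == NULL)
        return -1;
    new_fdt->fd = malloc(new_max * sizeof *new_fdt->fd);
    if (new_fdt->fd == NULL) {
        free(new_fdt);
        return -1;
    }
    new_fdt->max_fds = new_max;
    for (size_t i = 0; i < new_max; i++)
        new_fdt->fd[i] = (old != NULL && i < old->max_fds) ? old->fd[i] : -1;

    /* Publish: readers see either the old table or the finished new one. */
    atomic_store_explicit(&files->fdt, new_fdt, memory_order_release);
    *old_out = old;
    return 0;
}
```

      Keeping the size and the array in one separately allocated structure is
      what lets a single pointer store switch both at once.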
      Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  10. 08 Sep, 2005 4 commits
  11. 20 Aug, 2005 1 commit
    • Fix nasty ncpfs symlink handling bug. · cc314eef
      Linus Torvalds committed
      This bug could cause oopses and page state corruption, because ncpfs
      used the generic page-cache symlink handling functions.  But those
      functions only work if the page cache is guaranteed to be "stable", ie a
      page that was installed when the symlink walk was started has to still
      be installed in the page cache at the end of the walk.
      
      We could have fixed ncpfs to not use the generic helper routines, but it
      is in many ways much cleaner to instead improve on the symlink walking
      helper routines so that they don't require that absolute stability.
      
      We do this by allowing "follow_link()" to return an error-pointer as a
      cookie, which is fed back to the cleanup "put_link()" routine.  This
      also simplifies NFS symlink handling.
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. 28 Jul, 2005 1 commit
    • [PATCH] stale POSIX lock handling · c293621b
      Peter Staubach committed
      I believe that there is a problem with the handling of POSIX locks, which
      the attached patch should address.
      
      The problem appears to be a race between fcntl(2) and close(2).  A
      multithreaded application could close a file descriptor at the same time as
      it is trying to acquire a lock using the same file descriptor.  I would
      suggest that that multithreaded application is not providing the proper
      synchronization for itself, but the OS should still behave correctly.
      
      SUS3 (Single UNIX Specification Version 3, read: POSIX) indicates that when
      a file descriptor is closed, that all POSIX locks on the file, owned by the
      process which closed the file descriptor, should be released.
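
      As an illustration (not part of the original patch), the close semantics
      described above can be observed from userspace.  The sketch opens one
      file twice, takes a write lock through the first descriptor, closes the
      second, and uses a forked child to probe with F_GETLK.  The function
      names and the /tmp path template are hypothetical.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that asks, via F_GETLK, whether a write lock on the whole
 * file would conflict with a lock held by another process (the parent).
 * Returns 1 if a conflicting lock exists, 0 if not, -1 on error.
 */
static int held_by_another_process(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &probe) != 0)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}

/*
 * Returns 1 if the lock taken through fd1 was visible before fd2 was
 * closed and gone afterwards, 0 otherwise, and -1 on setup errors.
 */
int close_drops_posix_locks(void)
{
    char path[] = "/tmp/posix_lock_demoXXXXXX";
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int fd2 = open(path, O_RDWR);
    /* l_start and l_len are 0, so the lock covers the whole file. */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int result = -1;

    if (fd2 >= 0 && fcntl(fd1, F_SETLK, &fl) == 0) {
        int before = held_by_another_process(path);

        /* Closing ANY descriptor for the file drops this process's locks. */
        close(fd2);
        fd2 = -1;

        int after = held_by_another_process(path);
        if (before >= 0 && after >= 0)
            result = (before == 1 && after == 0);
    }

    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return result;
}
```

      The lock disappears even though fd1, the descriptor it was taken
      through, is still open.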
      
      The trick here is when those locks are released.  The current code releases
      all locks which exist when close is processing, but any locks in progress
      are handled when the last reference to the open file is released.
      
      There are three cases to consider.
      
      One is the simple case, a multithreaded (mt) process has a file open and
      races to close it and acquire a lock on it.  In this case, the close will
      release one reference to the open file and when the fcntl is done, it will
      release the other reference.  For this situation, no locks should exist on
      the file when both the close and fcntl operations are done.  The current
      system will handle this case because the last reference to the open file is
      being released.
      
      The second case is when the mt process has dup(2)'d the file descriptor.
      The close will release one reference to the file and the fcntl, when done,
      will release another, but there will still be at least one more reference
      to the open file.  One could argue that the existence of a lock on the file
      after the close has completed is okay, because it was acquired after the
      close operation and there is still a way for the application to release the
      lock on the file, using an existing file descriptor.
      
      The third case is when the mt process has forked, after opening the file
      and either before or after becoming an mt process.  In this case, each
      process would hold a reference to the open file.  For each process, this
      degenerates to first case above.  However, the lock continues to exist
      until both processes have released their references to the open file.  This
      lock could block other lock requests.
      
      The changes to release the lock when the last reference to the open file
      is released aren't quite right because they would allow the lock to exist
      as long as there was a reference to the open file.  This is too long.
      
      The new proposed solution is to add support in the fcntl code path to
      detect a race with close and then to release the lock which was just
      acquired when such a race is detected.  This causes locks to be released
      in a timely fashion and for the system to conform to the POSIX semantic
      specification.
      
      This was tested by instrumenting a kernel to detect the handling locks and
      then running a program which generates case #3 above.  A dangling lock
      could be reliably generated.  When the changes to detect the close/fcntl
      race were added, a dangling lock could no longer be generated.
      
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  13. 14 Jul, 2005 1 commit
    • [PATCH] Fix soft lockup due to NTFS: VFS part and explanation · 88bd5121
      Anton Altaparmakov committed
      Something has changed in the core kernel such that we now get concurrent
      inode write outs, one e.g via pdflush and one via sys_sync or whatever.
      This causes a nasty deadlock in ntfs.  The only clean solution
      unfortunately requires a minor vfs api extension.
      
      First the deadlock analysis:
      
      Prerequisite knowledge: NTFS has a file $MFT (inode 0) loaded at mount
      time.  The NTFS driver uses the page cache for storing the file contents as
      usual.  More interestingly this file contains the table of on-disk inodes
      as a sequence of MFT_RECORDs.  Thus NTFS driver accesses the on-disk inodes
      by accessing the MFT_RECORDs in the page cache pages of the loaded inode
      $MFT.
      
      The situation: VFS inode X on a mounted ntfs volume is dirty.  For same
      inode X, the ntfs_inode is dirty and thus corresponding on-disk inode,
      which is as explained above in a dirty PAGE_CACHE_PAGE belonging to the
      table of inodes ($MFT, inode 0).
      
      What happens:
      
      Process 1: sys_sync()/umount()/whatever...  calls __sync_single_inode() for
      $MFT -> do_writepages() -> write_page for the dirty page containing the
      on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
      clears PageUptodate() on the page to prevent anyone else getting hold of it
      whilst it does the write out (this is necessary as the on-disk inode needs
      "fixups" applied before the write to disk which are removed again after the
      write and PageUptodate is then set again).  It then analyses the page
      looking for dirty on-disk inodes and when it finds one it calls
      ntfs_may_write_mft_record() to see if it is safe to write this on-disk
      inode.  This then calls ilookup5() to check if the corresponding VFS inode
      is in icache().  This in turn calls ifind() which waits on the inode lock
      via wait_on_inode whilst holding the global inode_lock.
      
      Process 2: pdflush results in a call to __sync_single_inode for the same
      VFS inode X on the ntfs volume.  This locks the inode (I_LOCK) then calls
      write-inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
      the page (in page cache of table of inodes $MFT, inode 0) containing the
      on-disk inode.  This page has PageUptodate() clear because of Process 1
      (see above) so read_cache_page() blocks when it tries to take the page
      lock for the page so it can call ntfs_readpage().
      
      Thus Process 1 is holding the page lock on the page containing the on-disk
      inode X and it is waiting on the inode X to be unlocked in ifind() so it
      can write the page out and then unlock the page.
      
      And Process 2 is holding the inode lock on inode X and is waiting for the
      page to be unlocked so it can call ntfs_readpage() or discover that
      Process 1 set PageUptodate() again and use the page.
      
      Thus we have a deadlock due to ifind() waiting on the inode lock.
      
      The only sensible solution: NTFS does not care whether the VFS inode is
      locked or not when it calls ilookup5() (it doesn't use the VFS inode at
      all, it just uses it to find the corresponding ntfs_inode which is of
      course attached to the VFS inode (both are one single struct); and it uses
      the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
      hence we want a modified ilookup5_nowait() which is the same as ilookup5()
      but it does not wait on the inode lock.
      
      Without such functionality I would have to keep my own ntfs_inode cache in
      the NTFS driver just so I can find ntfs_inodes independent of their VFS
      inodes which would be slow, memory and cpu cycle wasting, and incredibly
      stupid given the icache already exists in the VFS.
      
      Below is a patch that does the ilookup5_nowait() implementation in
      fs/inode.c and exports it.
      
      ilookup5_nowait.diff:
      
      Introduce ilookup5_nowait() which is basically the same as ilookup5() but
      it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
      done in ifind()).
      
      This is needed to avoid a nasty deadlock in NTFS.
      Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  14. 13 Jul, 2005 1 commit
    • [PATCH] inotify · 0eeca283
      Robert Love committed
      inotify is intended to correct the deficiencies of dnotify, particularly
      its inability to scale and its terrible user interface:
      
              * dnotify requires the opening of one fd per each directory
                that you intend to watch. This quickly results in too many
                open files and pins removable media, preventing unmount.
              * dnotify is directory-based. You only learn about changes to
                directories. Sure, a change to a file in a directory affects
                the directory, but you are then forced to keep a cache of
                stat structures.
              * dnotify's interface to user-space is awful.  Signals?
      
      inotify provides a more usable, simple, powerful solution to file change
      notification:
      
              * inotify's interface is a system call that returns a fd, not SIGIO.
                You get a single fd, which is select()-able.
              * inotify has an event that says "the filesystem that the item
                you were watching is on was unmounted."
              * inotify can watch directories or files.
      
      Inotify is currently used by Beagle (a desktop search infrastructure),
      Gamin (a FAM replacement), and other projects.
      
      See Documentation/filesystems/inotify.txt.
      Signed-off-by: Robert Love <rml@novell.com>
      Cc: John McCutchan <ttb@tentacle.dhs.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  15. 08 Jul, 2005 1 commit
    • [PATCH] export generic_drop_inode() to modules · cb2c0233
      Mark Fasheh committed
      OCFS2 wants to mark an inode which has been orphaned by another node so
      that during final iput it takes the correct path through the VFS and can
      pass through the OCFS2 delete_inode callback.  Since i_nlink can get out of
      date with other nodes, the best way I see to accomplish this is by clearing
      i_nlink on those inodes at drop_inode time.  Other than this small amount
      of work, nothing different needs to happen, so I think it would be cleanest
      to be able to just call generic_drop_inode at the end of the OCFS2
      drop_inode callback.
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  16. 28 Jun, 2005 1 commit
    • [PATCH] Update cfq io scheduler to time sliced design · 22e2c507
      Jens Axboe committed
      This updates the CFQ io scheduler to the new time sliced design (cfq
      v3).  It provides full process fairness, while giving excellent
      aggregate system throughput even for many competing processes.  It
      supports io priorities, either inherited from the cpu nice value or set
      directly with the ioprio_get/set syscalls.  The latter closely mimic
      set/getpriority.
      
      This import is based on my latest from -mm.
      Signed-off-by: Jens Axboe <axboe@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  17. 24 Jun, 2005 8 commits
  18. 23 Jun, 2005 1 commit
  19. 21 Jun, 2005 1 commit
    • [PATCH] libfs: add simple attribute files · acaefc25
      Arnd Bergmann committed
      Based on the discussion about spufs attributes, this is my suggestion
      for a more generic attribute file support that can be used by both
      debugfs and spufs.
      
      Simple attribute files behave similarly to sequential files from
      a kernel programmer's perspective in that a standard set of file
      operations is provided and only an open operation needs to
      be written that registers file specific get() and set() functions.
      
      These operations are defined as
      
      void foo_set(void *data, u64 val); and
      u64 foo_get(void *data);
      
      where data is the inode->u.generic_ip pointer of the file and the
      operations just need to make sense of that pointer. The infrastructure
      makes sure this works correctly with concurrent access and partial
      read calls.
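
      As an illustration (a userspace mock, not the libfs implementation), the
      get()/set() pattern can be sketched as a small structure that formats
      the value on read and parses it on write.  The foo_get()/foo_set()
      signatures follow the text above with uint64_t standing in for u64;
      struct simple_attr_sketch, attr_read() and attr_write() are hypothetical
      names.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-file state: the two callbacks, a format, and the data. */
struct simple_attr_sketch {
    uint64_t (*get)(void *data);
    void (*set)(void *data, uint64_t val);
    const char *fmt;    /* printf format taking one uint64_t argument */
    void *data;         /* plays the role of inode->u.generic_ip */
};

/* Read side: fetch the value and format it as text into buf. */
int attr_read(const struct simple_attr_sketch *attr, char *buf, size_t size)
{
    if (attr->get == NULL)
        return -1;
    return snprintf(buf, size, attr->fmt, attr->get(attr->data));
}

/* Write side: parse a number (decimal, octal or hex) and store it. */
int attr_write(const struct simple_attr_sketch *attr, const char *buf)
{
    if (attr->set == NULL)
        return -1;
    attr->set(attr->data, strtoull(buf, NULL, 0));
    return 0;
}

/* Example callbacks: the data pointer refers to a plain 64-bit counter. */
uint64_t foo_get(void *data)
{
    return *(uint64_t *)data;
}

void foo_set(void *data, uint64_t val)
{
    *(uint64_t *)data = val;
}
```

      This mock ignores the concurrency and partial-read handling that the
      real infrastructure is said to provide.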
      
      A macro named DEFINE_SIMPLE_ATTRIBUTE is provided to further simplify
      using the attributes.
      
      This patch already contains the changes for debugfs to use attributes
      for its internal file operations.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  20. 06 May, 2005 1 commit
  21. 01 May, 2005 2 commits
    • [PATCH] DocBook: fix some descriptions · 67be2dd1
      Martin Waitz committed
      Some KernelDoc descriptions are updated to match the current code.
      No code changes.
      Signed-off-by: Martin Waitz <tali@admingilde.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] DocBook: changes and extensions to the kernel documentation · 4dc3b16b
      Pavel Pisa committed
      I have recompiled the Linux kernel 2.6.11.5 documentation for myself and our
      university students again.  The documentation could be extended to cover more
      sources that carry structured comments in recent 2.6 kernels, and I have
      tried to proceed with that task.  I have done this several times since 2.6.0,
      and it gets boring to make the same changes again and again.  The Linux
      kernel compiles after the changes for i386 and ARM targets.  I have added
      references to some more files to the kernel-api book, and some section names
      as well.  So please check that the changes do not break anything and that
      the categories are not too skewed.
      
      I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
      by kernel convention.  Most of the other changes are modifications in the
      comments to make kernel-doc happy, accept some parameters description and do
      not bail out on errors.  Changed <pid> to @pid in the description, moved some
      #ifdef before comments to correct function to comments bindings, etc.
      
      You can see result of the modified documentation build at
        http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
      
      Some more sources are ready to be included in the kernel-doc generated
      documentation.  These sources have been added to kernel-api for now.  Some
      more section names were added, and probably some more chaos was introduced
      as a result of the quick cleanup work.
      Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
      Signed-off-by: Martin Waitz <tali@admingilde.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>