1. 06 Apr 2012, 1 commit
  2. 02 Apr 2012, 1 commit
  3. 22 Mar 2012, 1 commit
    • vfs: remove unused superblock helpers · 9d547c35
      Authored by Artem Bityutskiy
      Remove the 'sb_mark_dirty()', 'sb_mark_clean()' and 'sb_is_dirty()'
      helpers, which are not used. I introduced them 2 years ago with the
      intention of making all file-systems use them in order to be able to
      optimize 'sync_supers()'.  However, Al Viro vetoed my patches in the
      end and asked me to push superblock management down to file-systems
      and get rid of the 's_dirt' flag completely, as well as kill
      'sync_supers()' altogether. Thus, remove the helpers.
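      For reference, the removed helpers were trivial wrappers around the
      superblock's 's_dirt' flag, roughly (a sketch from memory, not the
      exact tree):
      
      	/* Approximate form of the removed helpers. */
      	static inline void sb_mark_dirty(struct super_block *sb)
      	{
      		sb->s_dirt = 1;
      	}
      	static inline void sb_mark_clean(struct super_block *sb)
      	{
      		sb->s_dirt = 0;
      	}
      	static inline int sb_is_dirty(struct super_block *sb)
      	{
      		return sb->s_dirt;
      	}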
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 21 Mar 2012, 3 commits
  5. 14 Mar 2012, 1 commit
  6. 05 Mar 2012, 1 commit
    • BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Authored by Paul Gortmaker
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
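      For example, a hypothetical header following the rule:
      
      	/* foo.h - hypothetical header illustrating the pattern */
      	#include <linux/bug.h>	/* for BUG_ON() used below */
      
      	struct foo { int len; };
      
      	static inline void foo_set_len(struct foo *f, int len)
      	{
      		BUG_ON(len < 0);	/* BUG_ON is defined in linux/bug.h */
      		f->len = len;
      	}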
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  7. 14 Feb 2012, 1 commit
  8. 24 Jan 2012, 2 commits
  9. 13 Jan 2012, 4 commits
    • vfs: cache request_queue in struct block_device · 87192a2a
      Authored by Andi Kleen
      This makes it possible to get from the inode to the request_queue with one
      less cache miss.  Used in a follow-on optimization.
      
      The lifetime of the pointer is the same as the gendisk.
      
      This assumes that the queue will always stay the same in the gendisk while
      it's visible to block_devices.  I think that's safe, correct?
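      A sketch of the idea (field and accessor names are approximations of
      the actual change):
      
      	/* Sketch: with the gendisk's queue cached in struct
      	 * block_device, the lookup becomes a single dereference: */
      	static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
      	{
      		return bdev->bd_queue;	/* was: bdev->bd_disk->queue */
      	}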
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Authored by Mel Gorman
      This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
      that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
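      
      The resulting set of modes looks roughly like this (a sketch of what
      became enum migrate_mode; the comments are approximate):
      
      	/* MIGRATE_ASYNC      - never block
      	 * MIGRATE_SYNC_LIGHT - allow blocking on most operations,
      	 *                      but not on page writeback
      	 * MIGRATE_SYNC       - may block, including on writeback */
      	enum migrate_mode {
      		MIGRATE_ASYNC,
      		MIGRATE_SYNC_LIGHT,
      		MIGRATE_SYNC,
      	};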
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage · b969c4ab
      Authored by Mel Gorman
      Asynchronous compaction is used when allocating transparent hugepages to
      avoid blocking for long periods of time.  Due to reports of stalling,
      there was a debate on disabling synchronous compaction, but this severely
      impacted allocation success rates.  Part of the reason was that many dirty
      pages are skipped in asynchronous compaction by the following check:
      
      	if (PageDirty(page) && !sync &&
      		mapping->a_ops->migratepage != migrate_page)
      			rc = -EBUSY;
      
      This skips over all mapping aops using buffer_migrate_page() even though
      it is possible to migrate some of these pages without blocking.  This
      patch updates the ->migratepage callback with a "sync" parameter.  It is
      the responsibility of the callback to fail gracefully if migration would
      block.
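      The shape of the updated callback is roughly the following (a sketch
      assuming a boolean 'sync' flag as described above, not the exact diff;
      the companion sync-light commit later turns this into a migrate mode):
      
      	/* Sketch (abridged): ->migratepage gains a sync flag so each
      	 * implementation can return -EBUSY instead of blocking. */
      	struct address_space_operations {
      		/* ... other callbacks ... */
      		int (*migratepage)(struct address_space *mapping,
      				   struct page *newpage, struct page *page,
      				   bool sync);
      	};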
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: limit paths · 28d82dc1
      Authored by Jason Baron
      The current epoll code can be tickled to run basically indefinitely, both
      in the loop detection path check (on ep_insert()) and in the wakeup paths.
      The programs that tickle this behavior set up deeply linked networks of
      epoll file descriptors that cause the epoll algorithms to traverse them
      indefinitely.  A couple of these sample programs have been previously
      posted in this thread: https://lkml.org/lkml/2011/2/25/297.
      
      To fix the loop detection path check algorithms, I simply keep track of
      the epoll nodes that have already been visited.  Thus, the loop detection
      becomes proportional to the number of epoll file descriptors and links.
      This dramatically decreases the run-time of the loop check algorithm.  In
      one diabolical case I tried, it reduced the run-time from 15 minutes (all
      in kernel time) to 0.3 seconds.
      
      Fixing the wakeup paths could be done at wakeup time in a similar manner,
      by keeping track of nodes that have already been visited, but the
      complexity is harder, since there can be multiple wakeups on different
      CPUs.  Thus, I've opted to limit the number of possible wakeup paths when
      the paths are created.
      
      This is accomplished by noting that the end file descriptors that are
      found during the loop detection pass (from the newly added link) are
      actually the sources for wakeup events.  I keep a list of these file
      descriptors and limit the number and length of the paths that emanate
      from these 'source file descriptors'.  In the current implementation I
      allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
      length 4 and 10 of length 5.  Note that it is sufficient to check the
      'source file descriptors' reachable from the newly added link, since no
      other 'source file descriptors' will have newly added links.  This allows
      us to check only the wakeup paths that may have gotten too long, and not
      re-check all possible wakeup paths on the system.
      
      In terms of the path limit selection, I think it's first worth noting that
      the most common case for epoll is probably the model where you have 1
      epoll file descriptor that is monitoring n 'source file
      descriptors'.  In this case, each 'source file descriptor' has 1 path of
      length 1.  Thus, I believe that the limits I'm proposing are quite
      reasonable and in fact may be too generous.  I'm hoping that the
      proposed limits will not cause any workloads that currently work to
      fail.
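      
      A sketch of what such a limit table and check can look like (identifier
      names are approximations, not the exact patch):
      
      	/* Allowed number of wakeup paths per path length,
      	 * indexed by depth - 1. */
      	#define PATH_ARR_SIZE 5
      	static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
      	static int path_count[PATH_ARR_SIZE];
      
      	static int path_count_inc(int nests)
      	{
      		if (++path_count[nests] > path_limits[nests])
      			return -1;	/* too many paths: reject the link */
      		return 0;
      	}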
      
      In terms of locking, I have extended the use of the 'epmutex' to all
      epoll_ctl add and remove operations.  Currently it's only used in a subset
      of the add paths.  I need to hold the epmutex so that we can correctly
      traverse a coherent graph to check the number of paths.  I believe that
      this additional locking is probably ok, since it's in the setup/teardown
      paths and doesn't affect the running paths, but it certainly is going to
      add some extra overhead.  Also worth noting is that the epmutex was
      recently added to the epoll_ctl add operations in the initial path loop
      detection code, using the argument that it was not on a critical path.
      
      Another thing to note here, is the length of epoll chains that is allowed.
      Currently, eventpoll.c defines:
      
      /* Maximum number of nesting allowed inside epoll sets */
      #define EP_MAX_NESTS 4
      
      This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
      + 1).  However, this limit is currently only enforced during the loop
      check detection code, and only when the epoll file descriptors are added
      in a certain order.  Thus, this limit is currently easily bypassed.  The
      newly added check for wakeup paths strictly limits the wakeup paths to a
      length of 5, regardless of the order in which ep's are linked together.
      Thus, a side-effect of the new code is a more consistent enforcement of
      the graph depth.
      
      Thus far, I've tested this using the sample programs previously
      mentioned, which now either return quickly or return -EINVAL.  I've also
      tested using the piptest.c epoll tester, which showed no difference in
      performance.  I've also created a number of different epoll networks and
      tested that they behave as expected.
      
      I believe this solves the original diabolical test cases, while still
      preserving the sane epoll nesting.
      Signed-off-by: Jason Baron <jbaron@redhat.com>
      Cc: Nelson Elhage <nelhage@ksplice.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 11 Jan 2012, 1 commit
  11. 07 Jan 2012, 8 commits
  12. 04 Jan 2012, 9 commits
  13. 09 Dec 2011, 1 commit
  14. 07 Dec 2011, 1 commit
    • fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Authored by Al Viro
      The __d_path() API is asking for trouble, and in the case of apparmor's
      d_namespace_path() it got just that.  The root cause is that when __d_path()
      misses the root it had been told to look for, it stores the location of the
      most remote ancestor in *root.  Without grabbing references.  Sure, at the
      moment of the call it had been pinned down by what we have in *path.  But if
      we raced with umount -l, we could have very well stopped at a vfsmount/dentry
      that got freed as soon as prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As a result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_path().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it that it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
      	* seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets a path not reachable
      from root, it returns SEQ_SKIP.  The only caller has been adjusted (i.e. it
      stopped ignoring the return value as it used to do).
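      
      The resulting division of labour, as rough prototypes (a sketch, not
      the exact declarations):
      
      	/* d_path()          - normal userspace-visible path
      	 * d_absolute_path() - path ignoring the chroot jail
      	 * __d_path()        - path relative to a given root;
      	 *                     NULL if unreachable from it. */
      	char *d_absolute_path(const struct path *path, char *buf, int buflen);
      	char *__d_path(const struct path *path, const struct path *root,
      		       char *buf, int buflen);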
      Reviewed-by: John Johansen <john.johansen@canonical.com>
      Acked-by: John Johansen <john.johansen@canonical.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
  15. 17 Nov 2011, 1 commit
    • new helper: mount_subtree() · ea441d11
      Authored by Al Viro
      Takes a vfsmount and a relative path, does a lookup within that vfsmount
      (possibly triggering automounts) and returns the result as the root
      of a subtree suitable for return by ->mount() (i.e. with a reference to
      the dentry and an active reference to its superblock grabbed, and the
      superblock locked exclusive).
      
      btrfs and nfs switched to it instead of open-coding the sucker.
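      
      Roughly, the helper's prototype (a sketch):
      
      	/* Looks up 'name' under 'mnt' and returns the dentry that
      	 * ->mount() can hand back as the root of the subtree. */
      	struct dentry *mount_subtree(struct vfsmount *mnt, const char *name);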
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  16. 02 Nov 2011, 2 commits
  17. 01 Nov 2011, 2 commits
    • treewide: use __printf not __attribute__((format(printf,...))) · b9075fa9
      Authored by Joe Perches
      Standardize the style for compiler-based printf format verification.
      Standardize the location of __printf too.
      
      Done via script and a little typing.
      
      $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
        grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
        xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'
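      
      For example, with a hypothetical declaration foo(), the conversion
      looks like this:
      
      	/* Before: open-coded attribute. */
      	__attribute__((format(printf, 2, 3)))
      	int foo(int arg, const char *fmt, ...);
      
      	/* After: the __printf() shorthand (defined in the compiler
      	 * header the script excludes); 2 = position of the format
      	 * string, 3 = position of the first vararg. */
      	__printf(2, 3)
      	int foo(int arg, const char *fmt, ...);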
      
      [akpm@linux-foundation.org: revert arch bits]
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Cross Memory Attach · fcf63409
      Authored by Christopher Yeoh
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
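      
      A minimal (hypothetical) userspace sketch of the read side; the helper
      name read_remote() is illustrative:
      
      	#define _GNU_SOURCE
      	#include <sys/types.h>
      	#include <sys/uio.h>
      
      	/* Read 'len' bytes at 'remote_addr' in process 'pid' into
      	 * 'local_buf'.  Returns bytes read, or -1 on error. */
      	static ssize_t read_remote(pid_t pid, void *remote_addr,
      				   void *local_buf, size_t len)
      	{
      		struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
      		struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
      
      		return process_vm_readv(pid, &local, 1, &remote, 1, 0);
      	}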
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - It does not allow specifying iovecs for both src and dest; assuming
          preadv or pwritev were implemented, either the area read from or
          the area written to would need to be contiguous.
        - Currently mem_read allows only processes that are currently
          ptrace'ing the target, and are still able to ptrace the target, to
          read from the target.  This check could possibly be moved to the
          open call, but it's not clear exactly what race this restriction is
          stopping (the reason appears to have been lost).
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
          domain socket is a bit ugly from a userspace point of view,
          especially when you may have hundreds if not (eventually) thousands
          of processes that all need to do this with each other.
        - It doesn't allow for some future uses of the interface we would
          like to consider adding in the future (see below).
        - Interestingly, reading from /proc/pid/mem currently actually
          involves two copies!  (But this could be fixed pretty easily.)
      
      As mentioned previously, use of vmsplice instead was considered, but it
      has problems.  Since you need the reader and writer working
      co-operatively, if the pipe is not drained then you block.  This requires
      some wrapping to do non-blocking on the send side, or polling on the
      receive side.  In all-to-all communication it requires ordering,
      otherwise you can deadlock.  And in the example of many MPI tasks writing
      to one MPI task, vmsplice serialises the copying.
      
      There are some cases of MPI collectives where even a single-copy
      interface does not get us the performance gain we could have.  For
      example, in an MPI_Reduce, rather than copy the data from the source we
      would like to instead use it directly in a math op (say the reduce is
      doing a sum), as this would save us doing a copy.  We don't need to keep
      a copy of the data from the source.  I haven't implemented this, but I
      think this interface could in the future do all this through the use of
      the flags - e.g. one could specify the math operation and type, and the
      kernel, rather than just copying the data, would apply the specified
      operation between the source and destination and store it in the
      destination.
      
      Although we don't have a "second user" of the interface yet (though I've
      had some nibbles from people who may be interested in using it for
      intra-process messaging which is not MPI), this interface is something
      which hardware vendors are already doing for their custom drivers to
      implement fast local communication.  And so in addition to being useful
      for OpenMPI, it would mean the driver maintainers don't have to fix
      things up when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc; other architectures should
      mainly (I think) just need to add syscall numbers for process_vm_readv
      and process_vm_writev.  There are 32-bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
      
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>