1. 11 5月, 2012 1 次提交
    • J
      block: don't mark buffers beyond end of disk as mapped · 080399aa
      Jeff Moyer 提交于
      Hi,
      
      We have a bug report open where a squashfs image mounted on ppc64 would
      exhibit errors due to trying to read beyond the end of the disk.  It can
      easily be reproduced by doing the following:
      
      [root@ibm-p750e-02-lp3 ~]# ls -l install.img
      -rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
      [root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
      [root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
      dd: reading `/dev/loop0': Input/output error
      277376+0 records in
      277376+0 records out
      142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s
      
      In dmesg, you'll find the following:
      
      squashfs: version 4.0 (2009/01/31) Phillip Lougher
      [   43.106012] attempt to access beyond end of device
      [   43.106029] loop0: rw=0, want=277410, limit=277408
      [   43.106039] Buffer I/O error on device loop0, logical block 138704
      [   43.106053] attempt to access beyond end of device
      [   43.106057] loop0: rw=0, want=277412, limit=277408
      [   43.106061] Buffer I/O error on device loop0, logical block 138705
      [   43.106066] attempt to access beyond end of device
      [   43.106070] loop0: rw=0, want=277414, limit=277408
      [   43.106073] Buffer I/O error on device loop0, logical block 138706
      [   43.106078] attempt to access beyond end of device
      [   43.106081] loop0: rw=0, want=277416, limit=277408
      [   43.106085] Buffer I/O error on device loop0, logical block 138707
      [   43.106089] attempt to access beyond end of device
      [   43.106093] loop0: rw=0, want=277418, limit=277408
      [   43.106096] Buffer I/O error on device loop0, logical block 138708
      [   43.106101] attempt to access beyond end of device
      [   43.106104] loop0: rw=0, want=277420, limit=277408
      [   43.106108] Buffer I/O error on device loop0, logical block 138709
      [   43.106112] attempt to access beyond end of device
      [   43.106116] loop0: rw=0, want=277422, limit=277408
      [   43.106120] Buffer I/O error on device loop0, logical block 138710
      [   43.106124] attempt to access beyond end of device
      [   43.106128] loop0: rw=0, want=277424, limit=277408
      [   43.106131] Buffer I/O error on device loop0, logical block 138711
      [   43.106135] attempt to access beyond end of device
      [   43.106139] loop0: rw=0, want=277426, limit=277408
      [   43.106143] Buffer I/O error on device loop0, logical block 138712
      [   43.106147] attempt to access beyond end of device
      [   43.106151] loop0: rw=0, want=277428, limit=277408
      [   43.106154] Buffer I/O error on device loop0, logical block 138713
      [   43.106158] attempt to access beyond end of device
      [   43.106162] loop0: rw=0, want=277430, limit=277408
      [   43.106166] attempt to access beyond end of device
      [   43.106169] loop0: rw=0, want=277432, limit=277408
      ...
      [   43.106307] attempt to access beyond end of device
      [   43.106311] loop0: rw=0, want=277470, limit=2774
      
      Squashfs manages to read in the end block(s) of the disk during the
      mount operation.  Then, when dd reads the block device, it leads to
      block_read_full_page being called with buffers that are beyond end of
      disk, but are marked as mapped.  Thus, it would end up submitting read
      I/O against them, resulting in the errors mentioned above.  I fixed the
      problem by modifying init_page_buffers to only set the buffer mapped if
      it fell inside of i_size.
      
      Cheers,
      Jeff
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NNick Piggin <npiggin@kernel.dk>
      
      --
      
      Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      080399aa
  2. 06 4月, 2012 1 次提交
  3. 02 4月, 2012 1 次提交
  4. 22 3月, 2012 1 次提交
    • A
      vfs: remove unused superblock helpers · 9d547c35
      Artem Bityutskiy 提交于
      Remove the 'sb_mark_dirty()', 'sb_mark_clean()' and 'sb_is_dirty()'
      helpers which are not used. I introduced them 2 years and the
      intention was to make all file-systems use them in order to be able to
      optimize 'sync_supers()'.  However, Al Viro vetoed my patches at the
      end and asked me to push superblock management down to file-systems
      and get rid of the 's_dirt' flag completely, as well as kill
      'sync_supers()' altogether. Thus, remove the helpers.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9d547c35
  5. 21 3月, 2012 3 次提交
  6. 14 3月, 2012 1 次提交
  7. 05 3月, 2012 1 次提交
    • P
      BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Paul Gortmaker 提交于
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
  8. 14 2月, 2012 1 次提交
  9. 24 1月, 2012 2 次提交
  10. 13 1月, 2012 4 次提交
    • A
      vfs: cache request_queue in struct block_device · 87192a2a
      Andi Kleen 提交于
      This makes it possible to get from the inode to the request_queue with one
      less cache miss.  Used in followon optimization.
      
      The livetime of the pointer is the same as the gendisk.
      
      This assumes that the queue will always stay the same in the gendisk while
      it's visible to block_devices.  I think that's safe correct?
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      87192a2a
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman 提交于
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
    • M
      mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage · b969c4ab
      Mel Gorman 提交于
      Asynchronous compaction is used when allocating transparent hugepages to
      avoid blocking for long periods of time.  Due to reports of stalling,
      there was a debate on disabling synchronous compaction but this severely
      impacted allocation success rates.  Part of the reason was that many dirty
      pages are skipped in asynchronous compaction by the following check;
      
      	if (PageDirty(page) && !sync &&
      		mapping->a_ops->migratepage != migrate_page)
      			rc = -EBUSY;
      
      This skips over all mapping aops using buffer_migrate_page() even though
      it is possible to migrate some of these pages without blocking.  This
      patch updates the ->migratepage callback with a "sync" parameter.  It is
      the responsibility of the callback to fail gracefully if migration would
      block.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b969c4ab
    • J
      epoll: limit paths · 28d82dc1
      Jason Baron 提交于
      The current epoll code can be tickled to run basically indefinitely in
      both loop detection path check (on ep_insert()), and in the wakeup paths.
      The programs that tickle this behavior set up deeply linked networks of
      epoll file descriptors that cause the epoll algorithms to traverse them
      indefinitely.  A couple of these sample programs have been previously
      posted in this thread: https://lkml.org/lkml/2011/2/25/297.
      
      To fix the loop detection path check algorithms, I simply keep track of
      the epoll nodes that have been already visited.  Thus, the loop detection
      becomes proportional to the number of epoll file descriptor and links.
      This dramatically decreases the run-time of the loop check algorithm.  In
      one diabolical case I tried it reduced the run-time from 15 mintues (all
      in kernel time) to .3 seconds.
      
      Fixing the wakeup paths could be done at wakeup time in a similar manner
      by keeping track of nodes that have already been visited, but the
      complexity is harder, since there can be multiple wakeups on different
      cpus...Thus, I've opted to limit the number of possible wakeup paths when
      the paths are created.
      
      This is accomplished, by noting that the end file descriptor points that
      are found during the loop detection pass (from the newly added link), are
      actually the sources for wakeup events.  I keep a list of these file
      descriptors and limit the number and length of these paths that emanate
      from these 'source file descriptors'.  In the current implemetation I
      allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
      length 4 and 10 of length 5.  Note that it is sufficient to check the
      'source file descriptors' reachable from the newly added link, since no
      other 'source file descriptors' will have newly added links.  This allows
      us to check only the wakeup paths that may have gotten too long, and not
      re-check all possible wakeup paths on the system.
      
      In terms of the path limit selection, I think its first worth noting that
      the most common case for epoll, is probably the model where you have 1
      epoll file descriptor that is monitoring n number of 'source file
      descriptors'.  In this case, each 'source file descriptor' has a 1 path of
      length 1.  Thus, I believe that the limits I'm proposing are quite
      reasonable and in fact may be too generous.  Thus, I'm hoping that the
      proposed limits will not prevent any workloads that currently work to
      fail.
      
      In terms of locking, I have extended the use of the 'epmutex' to all
      epoll_ctl add and remove operations.  Currently its only used in a subset
      of the add paths.  I need to hold the epmutex, so that we can correctly
      traverse a coherent graph, to check the number of paths.  I believe that
      this additional locking is probably ok, since its in the setup/teardown
      paths, and doesn't affect the running paths, but it certainly is going to
      add some extra overhead.  Also, worth noting is that the epmuex was
      recently added to the ep_ctl add operations in the initial path loop
      detection code using the argument that it was not on a critical path.
      
      Another thing to note here, is the length of epoll chains that is allowed.
      Currently, eventpoll.c defines:
      
      /* Maximum number of nesting allowed inside epoll sets */
      #define EP_MAX_NESTS 4
      
      This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
      + 1).  However, this limit is currently only enforced during the loop
      check detection code, and only when the epoll file descriptors are added
      in a certain order.  Thus, this limit is currently easily bypassed.  The
      newly added check for wakeup paths, stricly limits the wakeup paths to a
      length of 5, regardless of the order in which ep's are linked together.
      Thus, a side-effect of the new code is a more consistent enforcement of
      the graph depth.
      
      Thus far, I've tested this, using the sample programs previously
      mentioned, which now either return quickly or return -EINVAL.  I've also
      testing using the piptest.c epoll tester, which showed no difference in
      performance.  I've also created a number of different epoll networks and
      tested that they behave as expectded.
      
      I believe this solves the original diabolical test cases, while still
      preserving the sane epoll nesting.
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Cc: Nelson Elhage <nelhage@ksplice.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28d82dc1
  11. 11 1月, 2012 1 次提交
  12. 07 1月, 2012 8 次提交
  13. 04 1月, 2012 9 次提交
  14. 09 12月, 2011 1 次提交
  15. 07 12月, 2011 1 次提交
    • A
      fix apparmor dereferencing potentially freed dentry, sanitize __d_path() API · 02125a82
      Al Viro 提交于
      __d_path() API is asking for trouble and in case of apparmor d_namespace_path()
      getting just that.  The root cause is that when __d_path() misses the root
      it had been told to look for, it stores the location of the most remote ancestor
      in *root.  Without grabbing references.  Sure, at the moment of call it had
      been pinned down by what we have in *path.  And if we raced with umount -l, we
      could have very well stopped at vfsmount/dentry that got freed as soon as
      prepend_path() dropped vfsmount_lock.
      
      It is safe to compare these pointers with pre-existing (and known to be still
      alive) vfsmount and dentry, as long as all we are asking is "is it the same
      address?".  Dereferencing is not safe and apparmor ended up stepping into
      that.  d_namespace_path() really wants to examine the place where we stopped,
      even if it's not connected to our namespace.  As the result, it looked
      at ->d_sb->s_magic of a dentry that might've been already freed by that point.
      All other callers had been careful enough to avoid that, but it's really
      a bad interface - it invites that kind of trouble.
      
      The fix is fairly straightforward, even though it's bigger than I'd like:
      	* prepend_path() root argument becomes const.
      	* __d_path() is never called with NULL/NULL root.  It was a kludge
      to start with.  Instead, we have an explicit function - d_absolute_root().
      Same as __d_path(), except that it doesn't get root passed and stops where
      it stops.  apparmor and tomoyo are using it.
      	* __d_path() returns NULL on path outside of root.  The main
      caller is show_mountinfo() and that's precisely what we pass root for - to
      skip those outside chroot jail.  Those who don't want that can (and do)
      use d_path().
      	* __d_path() root argument becomes const.  Everyone agrees, I hope.
      	* apparmor does *NOT* try to use __d_path() or any of its variants
      when it sees that path->mnt is an internal vfsmount.  In that case it's
      definitely not mounted anywhere and dentry_path() is exactly what we want
      there.  Handling of sysctl()-triggered weirdness is moved to that place.
      	* if apparmor is asked to do pathname relative to chroot jail
      and __d_path() tells it we it's not in that jail, the sucker just calls
      d_absolute_path() instead.  That's the other remaining caller of __d_path(),
      BTW.
              * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
      the normal seq_file logics will take care of growing the buffer and redoing
      the call of ->show() just fine).  However, if it gets path not reachable
      from root, it returns SEQ_SKIP.  The only caller adjusted (i.e. stopped
      ignoring the return value as it used to do).
      Reviewed-by: NJohn Johansen <john.johansen@canonical.com>
      ACKed-by: NJohn Johansen <john.johansen@canonical.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      02125a82
  16. 17 11月, 2011 1 次提交
    • A
      new helper: mount_subtree() · ea441d11
      Al Viro 提交于
      takes vfsmount and relative path, does lookup within that vfsmount
      (possibly triggering automounts) and returns the result as root
      of subtree suitable for return by ->mount() (i.e. a reference to
      dentry and an active reference to its superblock grabbed, superblock
      locked exclusive).
      
      btrfs and nfs switched to it instead of open-coding the sucker.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ea441d11
  17. 02 11月, 2011 2 次提交
  18. 01 11月, 2011 1 次提交