1. 13 11月, 2010 1 次提交
    • T
      block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo 提交于
      Over time, block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combination of open and claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access as
      in-kernel open + claim sequence can disturb the existing exclusive
      open even before the block layer knows the current open if for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if is @FMODE_EXCL specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most of bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at the open time (as
      it should).  This cleans up code and removes some, both valid and
      invalid, but unnecessary all the same, corner cases.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Brown <neilb@suse.de>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: NMike Snitzer <snitzer@redhat.com>
      Acked-by: NPhilipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
  2. 27 10月, 2010 1 次提交
  3. 17 9月, 2010 1 次提交
    • C
      block: remove BLKDEV_IFL_WAIT · dd3932ed
      Christoph Hellwig 提交于
      All the blkdev_issue_* helpers can only sanely be used for synchronous
      caller.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      dd3932ed
  4. 10 9月, 2010 5 次提交
    • C
      swap: do not send discards as barriers · 349f429e
      Christoph Hellwig 提交于
      The swap code already uses synchronous discards, no need to add I/O barriers.
      
      tj: superflous newlines removed.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NHugh Dickins <hughd@google.com>
      Tested-by: NNigel Cunningham <nigel@tuxonice.net>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      349f429e
    • H
      swap: discard while swapping only if SWAP_FLAG_DISCARD · 33994466
      Hugh Dickins 提交于
      Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
      show a shift since I last tested in December: in part because of firmware
      updates, in part because of the necessary move from barriers to awaiting
      completion at the block layer.  While discard at swapon still shows as
      slightly beneficial on both, discarding 1MB swap cluster when allocating
      is now disadvanteous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).
      
      Surrender: discard as presently implemented is more hindrance than help
      for swap; but might prove useful on other devices, or with improvements.
      So continue to do the discard at swapon, but make discard while swapping
      conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
      only the lower 16 bits of int flags).
      
      We can add a --discard or -d to swapon(8), and a "discard" to swap in
      /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33994466
    • C
      swap: do not send discards as barriers · 8f2ae0fa
      Christoph Hellwig 提交于
      The swap code already uses synchronous discards, no need to add I/O
      barriers.
      
      This fixes the worst of the terrible slowdown in swap allocation for
      hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
      eliminate that regression.
      
      [tj@kernel.org: superflous newlines removed]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NNigel Cunningham <nigel@tuxonice.net>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f2ae0fa
    • H
      swap: prevent reuse during hibernation · b73d7fce
      Hugh Dickins 提交于
      Move the hibernation check from scan_swap_map() into try_to_free_swap():
      to catch not only the common case when hibernation's allocation itself
      triggers swap reuse, but also the less likely case when concurrent page
      reclaim (shrink_page_list) might happen to try_to_free_swap from a page.
      
      Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
      reclaim from going to swap: check that to prevent swap reuse too.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ondrej Zary <linux@rainbow-software.org>
      Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b73d7fce
    • H
      swap: revert special hibernation allocation · 910321ea
      Hugh Dickins 提交于
      Please revert 2.6.36-rc commit d2997b10
      "hibernation: freeze swap at hibernation".  It complicated matters by
      adding a second swap allocation path, just for hibernation; without in any
      way fixing the issue that it was intended to address - page reclaim after
      fixing the hibernation image might free swap from a page already imaged as
      swapcache, letting its swap be reallocated to store a different page of
      the image: resulting in data corruption if the imaged page were freed as
      clean then swapped back in.  Pages freed to si->swap_map were still in
      danger of being reallocated by the alternative allocation path.
      
      I guess it inadvertently fixed slow SSD swap allocation for hibernation,
      as reported by Nigel Cunningham: by missing out the discards that occur on
      the usual swap allocation path; but that was unintentional, and needs a
      separate fix.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Ondrej Zary <linux@rainbow-software.org>
      Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nigel Cunningham <nigel@tuxonice.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      910321ea
  5. 10 8月, 2010 2 次提交
  6. 19 5月, 2010 2 次提交
  7. 29 4月, 2010 1 次提交
  8. 13 3月, 2010 1 次提交
    • D
      memcg: move charges of anonymous swap · 02491447
      Daisuke Nishimura 提交于
      This patch is another core part of this move-charge-at-task-migration
      feature.  It enables moving charges of anonymous swaps.
      
      To move the charge of swap, we need to exchange swap_cgroup's record.
      
      In current implementation, swap_cgroup's record is protected by:
      
        - page lock: if the entry is on swap cache.
        - swap_lock: if the entry is not on swap cache.
      
      This works well in usual swap-in/out activity.
      
      But this behavior make the feature of moving swap charge check many
      conditions to exchange swap_cgroup's record safely.
      
      So I changed modification of swap_cgroup's recored(swap_cgroup_record())
      to use xchg, and define a new function to cmpxchg swap_cgroup's record.
      
      This patch also enables moving charge of non pte_present but not uncharged
      swap caches, which can be exist on swap-out path, by getting the target
      pages via find_get_page() as do_mincore() does.
      
      [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
      [akpm@linux-foundation.org: fix typos]
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02491447
  9. 07 3月, 2010 4 次提交
    • H
      mm: add comment on swap_duplicate's error code · 08259d58
      Hugh Dickins 提交于
      swap_duplicate()'s loop appears to miss out on returning the error code
      from __swap_duplicate(), except when that's -ENOMEM.  In fact this is
      intentional: prior to -ENOMEM for swap_count_continuation,
      swap_duplicate() was void (and the case only occurs when copy_one_pte()
      hits a corrupt pte).  But that's surprising behaviour, which certainly
      deserves a comment.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NHuang Shijie <shijie8@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08259d58
    • H
      mm/swapfile.c: fix swapon size off-by-one · ad2bd7e0
      Hugh Dickins 提交于
      There's an off-by-one disagreement between mkswap and swapon about the
      meaning of swap_header last_page: mkswap (in all versions I've looked at:
      util-linux-ng and BusyBox and old util-linux; probably as far back as
      1999) consistently means the offset (in page units) of the last page of
      the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
      strangely takes it to mean the size (in page units) of the swap area.
      
      This disagreement is the safe way round; but it's worrying people, and
      loses us one page of swap.
      
      The fix is not just to add one to nr_good_pages: we need to get maxpages
      (the size of the swap_map array) right before that; and though that is an
      unsigned long, be careful not to overflow the unsigned int p->max which
      later holds it (probably why header uses __u32 last_page instead of size).
      
      Why did we subtract one from the maximum swp_offset to calculate maxpages?
       Though it was probably me who made that change in 2.4.10, I don't get it:
      and now we should be adding one (without risk of overflow in this case).
      
      Fix the handling of swap_header badpages: it could have overrun the
      swap_map when very large swap area used on a more limited architecture.
      
      Remove pre-initializations of swap_header, nr_good_pages and maxpages:
      those date from when sys_swapon was supporting other versions of header.
      Reported-by: NNitin Gupta <ngupta@vflare.org>
      Reported-by: NJarkko Lavinen <jarkko.lavinen@nokia.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2bd7e0
    • K
      mm: count swap usage · b084d435
      KAMEZAWA Hiroyuki 提交于
      A frequent questions from users about memory management is what numbers of
      swap ents are user for processes.  And this information will give some
      hints to oom-killer.
      
      Besides we can count the number of swapents per a process by scanning
      /proc/<pid>/smaps, this is very slow and not good for usual process
      information handler which works like 'ps' or 'top'.  (ps or top is now
      enough slow..)
      
      This patch adds a counter of swapents to mm_counter and update is at each
      swap events.  Information is exported via /proc/<pid>/status file as
      
      [kamezawa@bluextal memory]$ cat /proc/self/status
      Name:   cat
      State:  R (running)
      Tgid:   2910
      Pid:    2910
      PPid:   2823
      TracerPid:      0
      Uid:    500     500     500     500
      Gid:    500     500     500     500
      FDSize: 256
      Groups: 500
      VmPeak:    82696 kB
      VmSize:    82696 kB
      VmLck:         0 kB
      VmHWM:       432 kB
      VmRSS:       432 kB
      VmData:      172 kB
      VmStk:        84 kB
      VmExe:        48 kB
      VmLib:      1568 kB
      VmPTE:        40 kB
      VmSwap:        0 kB <=============== this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b084d435
    • K
      mm: clean up mm_counter · d559db08
      KAMEZAWA Hiroyuki 提交于
      Presently, per-mm statistics counter is defined by macro in sched.h
      
      This patch modifies it to
        - defined in mm.h as inlinf functions
        - use array instead of macro's name creation.
      
      This patch is for reducing patch size in future patch to modify
      implementation of per-mm counter.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d559db08
  10. 16 12月, 2009 11 次提交
  11. 03 11月, 2009 1 次提交
  12. 02 10月, 2009 1 次提交
  13. 22 9月, 2009 1 次提交
  14. 16 9月, 2009 1 次提交
    • A
      HWPOISON: Add support for poison swap entries v2 · a7420aa5
      Andi Kleen 提交于
      Memory migration uses special swap entry types to trigger special actions on
      page faults. Extend this mechanism to also support poisoned swap entries, to
      trigger poison handling on page faults. This allows follow-on patches to
      prevent processes from faulting in poisoned pages again.
      
      v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
      v3: Better overflow fix (Hidehiro Kawai)
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      a7420aa5
  15. 14 9月, 2009 1 次提交
    • C
      block: use blkdev_issue_discard in blk_ioctl_discard · 746cd1e7
      Christoph Hellwig 提交于
      blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
      the only difference between the two is that blkdev_issue_discard needs to
      send a barrier discard request and blk_ioctl_discard a non-barrier one,
      and blk_ioctl_discard needs to wait on the request.  To facilitates this
      add a flags argument to blkdev_issue_discard to control both aspects of the
      behaviour.  This will be very useful later on for using the waiting
      funcitonality for other callers.
      
      Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      746cd1e7
  16. 30 7月, 2009 1 次提交
  17. 19 6月, 2009 1 次提交
    • K
      memcg: fix swap accounting · 8a9478ca
      KAMEZAWA Hiroyuki 提交于
      This patch fixes mis-accounting of swap usage in memcg.
      
      In the current implementation, memcg's swap account is uncharged only when
      swap is completely freed.  But there are several cases where swap cannot
      be freed cleanly.  For handling that, this patch changes that memcg
      uncharges swap account when swap has no references other than cache.
      
      By this, memcg's swap entry accounting can be fully synchronous with the
      application's behavior.
      
      This patch also changes memcg's hooks for swap-out.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NBalbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a9478ca
  18. 17 6月, 2009 3 次提交
  19. 22 2月, 2009 1 次提交