1. 13 2月, 2015 5 次提交
    • M
      zram: check bd_openers instead of bd_holders · 2b269ce6
      Minchan Kim 提交于
      bd_holders is increased only when user open the device file as FMODE_EXCL
      so if something opens zram0 as !FMODE_EXCL and request I/O while another
      user reset zram0, we can see following warning.
      
        zram0: detected capacity change from 0 to 64424509440
        Buffer I/O error on dev zram0, logical block 180823, lost async page write
        Buffer I/O error on dev zram0, logical block 180824, lost async page write
        Buffer I/O error on dev zram0, logical block 180825, lost async page write
        Buffer I/O error on dev zram0, logical block 180826, lost async page write
        Buffer I/O error on dev zram0, logical block 180827, lost async page write
        Buffer I/O error on dev zram0, logical block 180828, lost async page write
        Buffer I/O error on dev zram0, logical block 180829, lost async page write
        Buffer I/O error on dev zram0, logical block 180830, lost async page write
        Buffer I/O error on dev zram0, logical block 180831, lost async page write
        Buffer I/O error on dev zram0, logical block 180832, lost async page write
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
        Modules linked in:
        CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x45/0x57
          warn_slowpath_common+0x8a/0xc0
          warn_slowpath_null+0x1a/0x20
          __blkdev_put+0x1d7/0x210
          blkdev_put+0x50/0x130
          blkdev_close+0x25/0x30
          __fput+0xdf/0x1e0
          ____fput+0xe/0x10
          task_work_run+0xa7/0xe0
          do_notify_resume+0x49/0x60
          int_signal+0x12/0x17
        ---[ end trace 274fbbc5664827d2 ]---
      
      The warning comes from bdev_write_node in blkdev_put path.
      
         static void bdev_write_inode(struct inode *inode)
         {
              spin_lock(&inode->i_lock);
              while (inode->i_state & I_DIRTY) {
                      spin_unlock(&inode->i_lock);
                      WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
                      spin_lock(&inode->i_lock);
              }
              spin_unlock(&inode->i_lock);
         }
      
      The reason is dd process encounters I/O fails due to sudden block device
      disappear so in filemap_check_errors in __writeback_single_inode returns
      -EIO.
      
      If we check bd_openers instead of bd_holders, we could address the
      problem.  When I see the brd, it already have used it rather than
      bd_holders so although I'm not a expert of block layer, it seems to be
      better.
      
      I can make following warning with below simple script.  In addition, I
      added msleep(2000) below set_capacity(zram->disk, 0) after applying your
      patch to make window huge(Kudos to Ganesh!)
      
      script:
      
         echo $((60<<30)) > /sys/block/zram0/disksize
         setsid dd if=/dev/zero of=/dev/zram0 &
         sleep 1
         setsid echo 1 > /sys/block/zram0/reset
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b269ce6
    • S
      zram: rework reset and destroy path · a096cafc
      Sergey Senozhatsky 提交于
      We need to return set_capacity(disk, 0) from reset_store() back to
      zram_reset_device(), a catch by Ganesh Mahendran.  Potentially, we can
      race set_capacity() calls from init and reset paths.
      
      The problem is that zram_reset_device() is also getting called from
      zram_exit(), which performs operations in misleading reversed order -- we
      first create_device() and then init it, while zram_exit() perform
      destroy_device() first and then does zram_reset_device().  This is done to
      remove sysfs group before we reset device, so we can continue with device
      reset/destruction not being raced by sysfs attr write (f.e.  disksize).
      
      Apart from that, destroy_device() releases zram->disk (but we still have
      ->disk pointer), so we cannot acces zram->disk in later
      zram_reset_device() call, which may cause additional errors in the future.
      
      So, this patch rework and cleanup destroy path.
      
      1) remove several unneeded goto labels in zram_init()
      
      2) factor out zram_init() error path and zram_exit() into
         destroy_devices() function, which takes the number of devices to
         destroy as its argument.
      
      3) remove sysfs group in destroy_devices() first, so we can reorder
         operations -- reset device (as expected) goes before disk destroy and
         queue cleanup.  So we can always access ->disk in zram_reset_device().
      
      4) and, finally, return set_capacity() back under ->init_lock.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a096cafc
    • S
      zram: fix umount-reset_store-mount race condition · ba6b17d6
      Sergey Senozhatsky 提交于
      Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
      avoid ->bd_holders race condition:
      
              CPU0                            CPU1
      umount /* zram->init_done is true */
      reset_store()
      bdev->bd_holders == 0                   mount
      ...                                     zram_make_request()
      zram_reset_device()
      
      However, his solution required some considerable amount of code movement,
      which we can avoid.
      
      Apart from using bdev->bd_mutex in reset_store(), this patch also
      simplifies zram_reset_device().
      
      zram_reset_device() has a bool parameter reset_capacity which tells it
      whether disk capacity and itself disk should be reset.  There are two
      zram_reset_device() callers:
      
      -- zram_exit() passes reset_capacity=false
      -- reset_store() passes reset_capacity=true
      
      So we can move reset_capacity-sensitive work out of zram_reset_device()
      and perform it unconditionally in reset_store().  This also lets us drop
      reset_capacity parameter from zram_reset_device() and pass zram pointer
      only.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba6b17d6
    • G
      zram: free meta table in zram_meta_free · 1fec1172
      Ganesh Mahendran 提交于
      zram_meta_alloc() and zram_meta_free() are a pair.  In
      zram_meta_alloc(), meta table is allocated.  So it it better to free it
      in zram_meta_free().
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fec1172
    • S
      zram: clean up zram_meta_alloc() · b8179958
      Sergey Senozhatsky 提交于
      A trivial cleanup of zram_meta_alloc() error handling.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8179958
  2. 14 12月, 2014 4 次提交
  3. 14 11月, 2014 1 次提交
  4. 30 10月, 2014 1 次提交
  5. 10 10月, 2014 4 次提交
    • S
      zram: use notify_free to account all free notifications · 015254da
      Sergey Senozhatsky 提交于
      `notify_free' device attribute accounts the number of slot free
      notifications and internally represents the number of zram_free_page()
      calls.  Slot free notifications are sent only when device is used as a
      swap device, hence `notify_free' is used only for swap devices.  Since
      f4659d8e (zram: support REQ_DISCARD) ZRAM handles yet another one
      free notification (also via zram_free_page() call) -- REQ_DISCARD
      requests, which are sent by a filesystem, whenever some data blocks are
      discarded.  However, there is no way to know the number of notifications
      in the latter case.
      
      Use `notify_free' to account the number of pages freed by
      zram_bio_discard() and zram_slot_free_notify().  Depending on usage
      scenario `notify_free' represents:
      
       a) the number of pages freed because of slot free notifications, which is
         equal to the number of swap_slot_free_notify() calls, so there is no
         behaviour change
      
       b) the number of pages freed because of REQ_DISCARD notifications
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      015254da
    • M
      zram: report maximum used memory · 461a8eee
      Minchan Kim 提交于
      Normally, zram user could get maximum memory usage zram consumed via
      polling mem_used_total with sysfs in userspace.
      
      But it has a critical problem because user can miss peak memory usage
      during update inverval of polling.  For avoiding that, user should poll it
      with shorter interval(ie, 0.0000000001s) with mlocking to avoid page fault
      delay when memory pressure is heavy.  It would be troublesome.
      
      This patch adds new knob "mem_used_max" so user could see the maximum
      memory usage easily via reading the knob and reset it via "echo 0 >
      /sys/block/zram0/mem_used_max".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: NDavid Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      461a8eee
    • M
      zram: zram memory size limitation · 9ada9da9
      Minchan Kim 提交于
      Since zram has no control feature to limit memory usage, it makes hard to
      manage system memrory.
      
      This patch adds new knob "mem_limit" via sysfs to set up the a limit so
      that zram could fail allocation once it reaches the limit.
      
      In addition, user could change the limit in runtime so that he could
      manage the memory more dynamically.
      
      Initial state is no limit so it doesn't break old behavior.
      
      [akpm@linux-foundation.org: fix typo, per Sergey]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ada9da9
    • M
      zsmalloc: change return value unit of zs_get_total_size_bytes · 722cdc17
      Minchan Kim 提交于
      zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with
      *byte unit* but zsmalloc operates *page unit* rather than byte unit so
      let's change the API so benefit we could get is that reduce unnecessary
      overhead (ie, change page unit with byte unit) in zsmalloc.
      
      Since return type is pages, "zs_get_total_pages" is better than
      "zs_get_total_size_bytes".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      722cdc17
  6. 05 10月, 2014 1 次提交
    • M
      block: disable entropy contributions for nonrot devices · b277da0a
      Mike Snitzer 提交于
      Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
      QUEUE_FLAG_NONROT.
      
      Historically, all block devices have automatically made entropy
      contributions.  But as previously stated in commit e2e1a148 ("block: add
      sysfs knob for turning off disk entropy contributions"):
          - On SSD disks, the completion times aren't as random as they
            are for rotational drives. So it's questionable whether they
            should contribute to the random pool in the first place.
          - Calling add_disk_randomness() has a lot of overhead.
      
      There are more reliable sources for randomness than non-rotational block
      devices.  From a security perspective it is better to err on the side of
      caution than to allow entropy contributions from unreliable "random"
      sources.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b277da0a
  7. 30 8月, 2014 1 次提交
  8. 07 8月, 2014 2 次提交
    • W
      zram: replace global tb_lock with fine grain lock · d2d5e762
      Weijie Yang 提交于
      Currently, we use a rwlock tb_lock to protect concurrent access to the
      whole zram meta table.  However, according to the actual access model,
      there is only a small chance for upper user to access the same
      table[index], so the current lock granularity is too big.
      
      The idea of optimization is to change the lock granularity from whole
      meta table to per table entry (table -> table[index]), so that we can
      protect concurrent access to the same table[index], meanwhile allow the
      maximum concurrency.
      
      With this in mind, several kinds of locks which could be used as a
      per-entry lock were tested and compared:
      
      Test environment:
      x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
      kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.
      
      iozone test:
      iozone -t 4 -R -r 16K -s 200M -I +Z
      (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)
      
            Test       base      CAS    spinlock    rwlock   bit_spinlock
      -------------------------------------------------------------------
       Initial write  1381094   1425435   1422860   1423075   1421521
             Rewrite  1529479   1641199   1668762   1672855   1654910
                Read  8468009  11324979  11305569  11117273  10997202
             Re-read  8467476  11260914  11248059  11145336  10906486
        Reverse Read  6821393   8106334   8282174   8279195   8109186
         Stride read  7191093   8994306   9153982   8961224   9004434
         Random read  7156353   8957932   9167098   8980465   8940476
      Mixed workload  4172747   5680814   5927825   5489578   5972253
        Random write  1483044   1605588   1594329   1600453   1596010
              Pwrite  1276644   1303108   1311612   1314228   1300960
               Pread  4324337   4632869   4618386   4457870   4500166
      
      To enhance the possibility of access the same table[index] concurrently,
      set zram a small disksize(10MB) and let threads run with large loop
      count.
      
      fio test:
      fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
      --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
      --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
      --name=seq-read --rw=read --stonewall --name=seq-readwrite
      --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
      (10MB zram raw block device, take the average of 10 tests, KB/s)
      
          Test     base     CAS    spinlock    rwlock  bit_spinlock
      -------------------------------------------------------------
      seq-write   933789   999357   1003298    995961   1001958
       seq-read  5634130  6577930   6380861   6243912   6230006
         seq-rw  1405687  1638117   1640256   1633903   1634459
        rand-rw  1386119  1614664   1617211   1609267   1612471
      
      All the optimization methods show a higher performance than the base,
      however, it is hard to say which method is the most appropriate.
      
      On the other hand, zram is mostly used on small embedded system, so we
      don't want to increase any memory footprint.
      
      This patch pick the bit_spinlock method, pack object size and page_flag
      into an unsigned long table.value, so as to not increase any memory
      overhead on both 32-bit and 64-bit system.
      
      On the third hand, even though different kinds of locks have different
      performances, we can ignore this difference, because: if zram is used as
      zram swapfile, the swap subsystem can prevent concurrent access to the
      same swapslot; if zram is used as zram-blk for set up filesystem on it,
      the upper filesystem and the page cache also prevent concurrent access
      of the same block mostly.  So we can ignore the different performances
      among locks.
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2d5e762
    • M
      zram: use size_t instead of u16 · 023b409f
      Minchan Kim 提交于
      Some architectures (eg, hexagon and PowerPC) could use PAGE_SHIFT of 16
      or more.  In these cases u16 is not sufficiently large to represent a
      compressed page's size so use size_t.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      023b409f
  9. 24 7月, 2014 1 次提交
  10. 04 7月, 2014 1 次提交
  11. 05 6月, 2014 1 次提交
    • W
      zram: correct offset usage in zram_bio_discard · 38515c73
      Weijie Yang 提交于
      We want to skip the physical block(PAGE_SIZE) which is partially covered
      by the discard bio, so we check the remaining size and subtract it if
      there is a need to goto the next physical block.
      
      The current offset usage in zram_bio_discard is incorrect, it will cause
      its upper filesystem breakdown.  Consider the following scenario:
      
      On some architecture or config, PAGE_SIZE is 64K for example, filesystem
      is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a
      offset = 4K and size=72K, normally, it should not really discard any
      physical block as it partially cover two physical blocks.  However, with
      the current offset usage, it will discard the second physical block and
      free its memory, which will cause filesystem breakdown.
      
      This patch corrects the offset usage in zram_bio_discard.
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38515c73
  12. 08 4月, 2014 17 次提交
  13. 04 3月, 2014 1 次提交