1. 13 2月, 2015 6 次提交
    • M
      zram: remove init_lock in zram_make_request · 08eee69f
      Minchan Kim 提交于
      Admin could reset zram during I/O operation going on so we have used
      zram->init_lock as read-side lock in I/O path to prevent sudden zram
      meta freeing.
      
      However, the init_lock is really troublesome.  We can't do call
      zram_meta_alloc under init_lock due to lockdep splat because
      zram_rw_page is one of the function under reclaim path and hold it as
      read_lock while other places in process context hold it as write_lock.
      So, we have used allocation out of the lock to avoid lockdep warn but
      it's not good for readability and fainally, I met another lockdep splat
      between init_lock and cpu_hotplug from kmem_cache_destroy during working
      zsmalloc compaction.  :(
      
      Yes, the ideal is to remove horrible init_lock of zram in rw path.  This
      patch removes it in rw path and instead, add atomic refcount for meta
      lifetime management and completion to free meta in process context.
      It's important to free meta in process context because some of resource
      destruction needs mutex lock, which could be held if we releases the
      resource in reclaim context so it's deadlock, again.
      
      As a bonus, we could remove init_done check in rw path because
      zram_meta_get will do a role for it, instead.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08eee69f
    • M
      zram: check bd_openers instead of bd_holders · 2b269ce6
      Minchan Kim 提交于
      bd_holders is increased only when user open the device file as FMODE_EXCL
      so if something opens zram0 as !FMODE_EXCL and request I/O while another
      user reset zram0, we can see following warning.
      
        zram0: detected capacity change from 0 to 64424509440
        Buffer I/O error on dev zram0, logical block 180823, lost async page write
        Buffer I/O error on dev zram0, logical block 180824, lost async page write
        Buffer I/O error on dev zram0, logical block 180825, lost async page write
        Buffer I/O error on dev zram0, logical block 180826, lost async page write
        Buffer I/O error on dev zram0, logical block 180827, lost async page write
        Buffer I/O error on dev zram0, logical block 180828, lost async page write
        Buffer I/O error on dev zram0, logical block 180829, lost async page write
        Buffer I/O error on dev zram0, logical block 180830, lost async page write
        Buffer I/O error on dev zram0, logical block 180831, lost async page write
        Buffer I/O error on dev zram0, logical block 180832, lost async page write
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 1996 at fs/block_dev.c:57 __blkdev_put+0x1d7/0x210()
        Modules linked in:
        CPU: 11 PID: 1996 Comm: dd Not tainted 3.19.0-rc6-next-20150202+ #1125
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x45/0x57
          warn_slowpath_common+0x8a/0xc0
          warn_slowpath_null+0x1a/0x20
          __blkdev_put+0x1d7/0x210
          blkdev_put+0x50/0x130
          blkdev_close+0x25/0x30
          __fput+0xdf/0x1e0
          ____fput+0xe/0x10
          task_work_run+0xa7/0xe0
          do_notify_resume+0x49/0x60
          int_signal+0x12/0x17
        ---[ end trace 274fbbc5664827d2 ]---
      
      The warning comes from bdev_write_node in blkdev_put path.
      
         static void bdev_write_inode(struct inode *inode)
         {
              spin_lock(&inode->i_lock);
              while (inode->i_state & I_DIRTY) {
                      spin_unlock(&inode->i_lock);
                      WARN_ON_ONCE(write_inode_now(inode, true)); <========= here.
                      spin_lock(&inode->i_lock);
              }
              spin_unlock(&inode->i_lock);
         }
      
      The reason is dd process encounters I/O fails due to sudden block device
      disappear so in filemap_check_errors in __writeback_single_inode returns
      -EIO.
      
      If we check bd_openers instead of bd_holders, we could address the
      problem.  When I see the brd, it already have used it rather than
      bd_holders so although I'm not a expert of block layer, it seems to be
      better.
      
      I can make following warning with below simple script.  In addition, I
      added msleep(2000) below set_capacity(zram->disk, 0) after applying your
      patch to make window huge(Kudos to Ganesh!)
      
      script:
      
         echo $((60<<30)) > /sys/block/zram0/disksize
         setsid dd if=/dev/zero of=/dev/zram0 &
         sleep 1
         setsid echo 1 > /sys/block/zram0/reset
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b269ce6
    • S
      zram: rework reset and destroy path · a096cafc
      Sergey Senozhatsky 提交于
      We need to return set_capacity(disk, 0) from reset_store() back to
      zram_reset_device(), a catch by Ganesh Mahendran.  Potentially, we can
      race set_capacity() calls from init and reset paths.
      
      The problem is that zram_reset_device() is also getting called from
      zram_exit(), which performs operations in misleading reversed order -- we
      first create_device() and then init it, while zram_exit() perform
      destroy_device() first and then does zram_reset_device().  This is done to
      remove sysfs group before we reset device, so we can continue with device
      reset/destruction not being raced by sysfs attr write (f.e.  disksize).
      
      Apart from that, destroy_device() releases zram->disk (but we still have
      ->disk pointer), so we cannot acces zram->disk in later
      zram_reset_device() call, which may cause additional errors in the future.
      
      So, this patch rework and cleanup destroy path.
      
      1) remove several unneeded goto labels in zram_init()
      
      2) factor out zram_init() error path and zram_exit() into
         destroy_devices() function, which takes the number of devices to
         destroy as its argument.
      
      3) remove sysfs group in destroy_devices() first, so we can reorder
         operations -- reset device (as expected) goes before disk destroy and
         queue cleanup.  So we can always access ->disk in zram_reset_device().
      
      4) and, finally, return set_capacity() back under ->init_lock.
      
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a096cafc
    • S
      zram: fix umount-reset_store-mount race condition · ba6b17d6
      Sergey Senozhatsky 提交于
      Ganesh Mahendran was the first one who proposed to use bdev->bd_mutex to
      avoid ->bd_holders race condition:
      
              CPU0                            CPU1
      umount /* zram->init_done is true */
      reset_store()
      bdev->bd_holders == 0                   mount
      ...                                     zram_make_request()
      zram_reset_device()
      
      However, his solution required some considerable amount of code movement,
      which we can avoid.
      
      Apart from using bdev->bd_mutex in reset_store(), this patch also
      simplifies zram_reset_device().
      
      zram_reset_device() has a bool parameter reset_capacity which tells it
      whether disk capacity and itself disk should be reset.  There are two
      zram_reset_device() callers:
      
      -- zram_exit() passes reset_capacity=false
      -- reset_store() passes reset_capacity=true
      
      So we can move reset_capacity-sensitive work out of zram_reset_device()
      and perform it unconditionally in reset_store().  This also lets us drop
      reset_capacity parameter from zram_reset_device() and pass zram pointer
      only.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba6b17d6
    • G
      zram: free meta table in zram_meta_free · 1fec1172
      Ganesh Mahendran 提交于
      zram_meta_alloc() and zram_meta_free() are a pair.  In
      zram_meta_alloc(), meta table is allocated.  So it it better to free it
      in zram_meta_free().
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fec1172
    • S
      zram: clean up zram_meta_alloc() · b8179958
      Sergey Senozhatsky 提交于
      A trivial cleanup of zram_meta_alloc() error handling.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8179958
  2. 14 12月, 2014 5 次提交
  3. 14 11月, 2014 1 次提交
  4. 30 10月, 2014 1 次提交
  5. 10 10月, 2014 4 次提交
    • S
      zram: use notify_free to account all free notifications · 015254da
      Sergey Senozhatsky 提交于
      `notify_free' device attribute accounts the number of slot free
      notifications and internally represents the number of zram_free_page()
      calls.  Slot free notifications are sent only when device is used as a
      swap device, hence `notify_free' is used only for swap devices.  Since
      f4659d8e (zram: support REQ_DISCARD) ZRAM handles yet another one
      free notification (also via zram_free_page() call) -- REQ_DISCARD
      requests, which are sent by a filesystem, whenever some data blocks are
      discarded.  However, there is no way to know the number of notifications
      in the latter case.
      
      Use `notify_free' to account the number of pages freed by
      zram_bio_discard() and zram_slot_free_notify().  Depending on usage
      scenario `notify_free' represents:
      
       a) the number of pages freed because of slot free notifications, which is
         equal to the number of swap_slot_free_notify() calls, so there is no
         behaviour change
      
       b) the number of pages freed because of REQ_DISCARD notifications
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Chao Yu <chao2.yu@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      015254da
    • M
      zram: report maximum used memory · 461a8eee
      Minchan Kim 提交于
      Normally, zram user could get maximum memory usage zram consumed via
      polling mem_used_total with sysfs in userspace.
      
      But it has a critical problem because user can miss peak memory usage
      during update inverval of polling.  For avoiding that, user should poll it
      with shorter interval(ie, 0.0000000001s) with mlocking to avoid page fault
      delay when memory pressure is heavy.  It would be troublesome.
      
      This patch adds new knob "mem_used_max" so user could see the maximum
      memory usage easily via reading the knob and reset it via "echo 0 >
      /sys/block/zram0/mem_used_max".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: NDavid Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      461a8eee
    • M
      zram: zram memory size limitation · 9ada9da9
      Minchan Kim 提交于
      Since zram has no control feature to limit memory usage, it makes hard to
      manage system memrory.
      
      This patch adds new knob "mem_limit" via sysfs to set up the a limit so
      that zram could fail allocation once it reaches the limit.
      
      In addition, user could change the limit in runtime so that he could
      manage the memory more dynamically.
      
      Initial state is no limit so it doesn't break old behavior.
      
      [akpm@linux-foundation.org: fix typo, per Sergey]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ada9da9
    • M
      zsmalloc: change return value unit of zs_get_total_size_bytes · 722cdc17
      Minchan Kim 提交于
      zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with
      *byte unit* but zsmalloc operates *page unit* rather than byte unit so
      let's change the API so benefit we could get is that reduce unnecessary
      overhead (ie, change page unit with byte unit) in zsmalloc.
      
      Since return type is pages, "zs_get_total_pages" is better than
      "zs_get_total_size_bytes".
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: <juno.choi@lge.com>
      Cc: <seungho1.park@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: David Horner <ds2horner@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      722cdc17
  6. 05 10月, 2014 1 次提交
    • M
      block: disable entropy contributions for nonrot devices · b277da0a
      Mike Snitzer 提交于
      Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
      QUEUE_FLAG_NONROT.
      
      Historically, all block devices have automatically made entropy
      contributions.  But as previously stated in commit e2e1a148 ("block: add
      sysfs knob for turning off disk entropy contributions"):
          - On SSD disks, the completion times aren't as random as they
            are for rotational drives. So it's questionable whether they
            should contribute to the random pool in the first place.
          - Calling add_disk_randomness() has a lot of overhead.
      
      There are more reliable sources for randomness than non-rotational block
      devices.  From a security perspective it is better to err on the side of
      caution than to allow entropy contributions from unreliable "random"
      sources.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b277da0a
  7. 30 8月, 2014 1 次提交
  8. 07 8月, 2014 4 次提交
    • W
      zram: replace global tb_lock with fine grain lock · d2d5e762
      Weijie Yang 提交于
      Currently, we use a rwlock tb_lock to protect concurrent access to the
      whole zram meta table.  However, according to the actual access model,
      there is only a small chance for upper user to access the same
      table[index], so the current lock granularity is too big.
      
      The idea of optimization is to change the lock granularity from whole
      meta table to per table entry (table -> table[index]), so that we can
      protect concurrent access to the same table[index], meanwhile allow the
      maximum concurrency.
      
      With this in mind, several kinds of locks which could be used as a
      per-entry lock were tested and compared:
      
      Test environment:
      x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
      kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.
      
      iozone test:
      iozone -t 4 -R -r 16K -s 200M -I +Z
      (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)
      
            Test       base      CAS    spinlock    rwlock   bit_spinlock
      -------------------------------------------------------------------
       Initial write  1381094   1425435   1422860   1423075   1421521
             Rewrite  1529479   1641199   1668762   1672855   1654910
                Read  8468009  11324979  11305569  11117273  10997202
             Re-read  8467476  11260914  11248059  11145336  10906486
        Reverse Read  6821393   8106334   8282174   8279195   8109186
         Stride read  7191093   8994306   9153982   8961224   9004434
         Random read  7156353   8957932   9167098   8980465   8940476
      Mixed workload  4172747   5680814   5927825   5489578   5972253
        Random write  1483044   1605588   1594329   1600453   1596010
              Pwrite  1276644   1303108   1311612   1314228   1300960
               Pread  4324337   4632869   4618386   4457870   4500166
      
      To enhance the possibility of access the same table[index] concurrently,
      set zram a small disksize(10MB) and let threads run with large loop
      count.
      
      fio test:
      fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
      --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
      --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
      --name=seq-read --rw=read --stonewall --name=seq-readwrite
      --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
      (10MB zram raw block device, take the average of 10 tests, KB/s)
      
          Test     base     CAS    spinlock    rwlock  bit_spinlock
      -------------------------------------------------------------
      seq-write   933789   999357   1003298    995961   1001958
       seq-read  5634130  6577930   6380861   6243912   6230006
         seq-rw  1405687  1638117   1640256   1633903   1634459
        rand-rw  1386119  1614664   1617211   1609267   1612471
      
      All the optimization methods show a higher performance than the base,
      however, it is hard to say which method is the most appropriate.
      
      On the other hand, zram is mostly used on small embedded system, so we
      don't want to increase any memory footprint.
      
      This patch pick the bit_spinlock method, pack object size and page_flag
      into an unsigned long table.value, so as to not increase any memory
      overhead on both 32-bit and 64-bit system.
      
      On the third hand, even though different kinds of locks have different
      performances, we can ignore this difference, because: if zram is used as
      zram swapfile, the swap subsystem can prevent concurrent access to the
      same swapslot; if zram is used as zram-blk for set up filesystem on it,
      the upper filesystem and the page cache also prevent concurrent access
      of the same block mostly.  So we can ignore the different performances
      among locks.
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NDavidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2d5e762
    • M
      zram: use size_t instead of u16 · 023b409f
      Minchan Kim 提交于
      Some architectures (eg, hexagon and PowerPC) could use PAGE_SHIFT of 16
      or more.  In these cases u16 is not sufficiently large to represent a
      compressed page's size so use size_t.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      023b409f
    • S
      zram: remove unused SECTOR_SIZE define · a830eff7
      Sergey Senozhatsky 提交于
      Drop SECTOR_SIZE define, because it's not used.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a830eff7
    • S
      zram: rename struct `table' to `zram_table_entry' · cb8f2eec
      Sergey Senozhatsky 提交于
      Andrew Morton has recently noted that `struct table' actually represents
      table entry and, thus, should be renamed.  Rename to `zram_table_entry'.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb8f2eec
  9. 24 7月, 2014 1 次提交
  10. 04 7月, 2014 1 次提交
  11. 05 6月, 2014 1 次提交
    • W
      zram: correct offset usage in zram_bio_discard · 38515c73
      Weijie Yang 提交于
      We want to skip the physical block(PAGE_SIZE) which is partially covered
      by the discard bio, so we check the remaining size and subtract it if
      there is a need to goto the next physical block.
      
      The current offset usage in zram_bio_discard is incorrect, it will cause
      its upper filesystem breakdown.  Consider the following scenario:
      
      On some architecture or config, PAGE_SIZE is 64K for example, filesystem
      is set up on zram disk without PAGE_SIZE aligned, a discard bio leads to a
      offset = 4K and size=72K, normally, it should not really discard any
      physical block as it partially cover two physical blocks.  However, with
      the current offset usage, it will discard the second physical block and
      free its memory, which will cause filesystem breakdown.
      
      This patch corrects the offset usage in zram_bio_discard.
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38515c73
  12. 08 4月, 2014 14 次提交
    • J
      zram: support REQ_DISCARD · f4659d8e
      Joonsoo Kim 提交于
      zram is ram based block device and can be used by backend of filesystem.
      When filesystem deletes a file, it normally doesn't do anything on data
      block of that file.  It just marks on metadata of that file.  This
      behavior has no problem on disk based block device, but has problems on
      ram based block device, since we can't free memory used for data block.
      To overcome this disadvantage, there is REQ_DISCARD functionality.  If
      block device support REQ_DISCARD and filesystem is mounted with discard
      option, filesystem sends REQ_DISCARD to block device whenever some data
      blocks are discarded.  All we have to do is to handle this request.
      
      This patch implements to flag up QUEUE_FLAG_DISCARD and handle this
      REQ_DISCARD request.  With it, we can free memory used by zram if it isn't
      used.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4659d8e
    • S
      zram: use scnprintf() in attrs show() methods · 56b4e8cb
      Sergey Senozhatsky 提交于
      sysfs.txt documentation lists the following requirements:
      
       - The buffer will always be PAGE_SIZE bytes in length. On i386, this
         is 4096.
      
       - show() methods should return the number of bytes printed into the
         buffer. This is the return value of scnprintf().
      
       - show() should always use scnprintf().
      
      Use scnprintf() in show() functions.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56b4e8cb
    • M
      zram: propagate error to user · 60a726e3
      Minchan Kim 提交于
      When we initialized zcomp with single, we couldn't change
      max_comp_streams without zram reset but current interface doesn't show
      any error to user and even it changes max_comp_streams's value without
      any effect so it would make user very confusing.
      
      This patch prevents max_comp_streams's change when zcomp was initialized
      as single zcomp and emit the error to user(ex, echo).
      
      [akpm@linux-foundation.org: don't return with the lock held, per Sergey]
      [fengguang.wu@intel.com: fix coccinelle warnings]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60a726e3
    • S
      zram: return error-valued pointer from zcomp_create() · fcfa8d95
      Sergey Senozhatsky 提交于
      Instead of returning just NULL, return ERR_PTR from zcomp_create() if
      compressing backend creation has failed.  ERR_PTR(-EINVAL) for unsupported
      compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or
      compression stream) error.
      
      Perform IS_ERR() check of returned from zcomp_create() value in
      disksize_store() and set return code to PTR_ERR().
      
      Change suggested by Jerome Marchand.
      
      [akpm@linux-foundation.org: clean up error recovery flow]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcfa8d95
    • S
      zram: move comp allocation out of init_lock · d61f98c7
      Sergey Senozhatsky 提交于
      While fixing lockdep spew of ->init_lock reported by Sasha Levin [1],
      Minchan Kim noted [2] that it's better to move compression backend
      allocation (using GPF_KERNEL) out of the ->init_lock lock, same way as
      with zram_meta_alloc(), in order to prevent the same lockdep spew.
      
      [1] https://lkml.org/lkml/2014/2/27/337
      [2] https://lkml.org/lkml/2014/3/3/32Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d61f98c7
    • S
      zram: add lz4 algorithm backend · 6e76668e
      Sergey Senozhatsky 提交于
      Introduce LZ4 compression backend and make it available for selection.
      LZ4 support is optional and requires user to set ZRAM_LZ4_COMPRESS config
      option.  The default compression backend is LZO.
      
      TEST
      
      (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
      ext4 file system, 3 compression streams)
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             Test           LZO           LZ4
      ----------------------------------------------
        Initial write   1642744.62    1317005.09
              Rewrite   2498980.88    1800645.16
                 Read   3957026.38    5877043.75
              Re-read   3950997.38    5861847.00
         Reverse Read   2937114.56    5047384.00
          Stride read   2948163.19    4929587.38
          Random read   3292692.69    4880793.62
       Mixed workload   1545602.62    3502940.38
         Random write   2448039.75    1758786.25
               Pwrite   1670051.03    1338329.69
                Pread   2530682.00    5097177.62
               Fwrite   3232085.62    3275942.56
                Fread   6306880.25    6645271.12
      
      So on my system LZ4 is slower in write-only tests, while it performs
      better in read-only and mixed (reads + writes) tests.
      
      Official LZ4 benchmarks available here http://code.google.com/p/lz4/
      (linux kernel uses revision r90).
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e76668e
    • S
      zram: make compression algorithm selection possible · e46b8a03
      Sergey Senozhatsky 提交于
      Add and document `comp_algorithm' device attribute.  This attribute allows
      to show supported compression and currently selected compression
      algorithms:
      
      	cat /sys/block/zram0/comp_algorithm
      	[lzo] lz4
      
      and change selected compression algorithm:
      	echo lzo > /sys/block/zram0/comp_algorithm
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e46b8a03
    • S
      zram: add set_max_streams knob · fe8eb122
      Sergey Senozhatsky 提交于
      This patch allows to change max_comp_streams on initialised zcomp.
      
      Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams()
      and zcomp_strm_single_set_max_streams() callbacks to change streams limit
      for zcomp_strm_multi and zcomp_strm_single, accordingly.  set_max_streams
      for single steam zcomp does nothing.
      
      If user has lowered the limit, then zcomp_strm_multi_set_max_streams()
      attempts to immediately free extra streams (as much as it can, depending
      on idle streams availability).
      
      Note, this patch does not allow to change stream 'policy' from single to
      multi stream (or vice versa) on already initialised compression backend.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe8eb122
    • S
      zram: add multi stream functionality · beca3ec7
      Sergey Senozhatsky 提交于
      Existing zram (zcomp) implementation has only one compression stream
      (buffer and algorithm private part), so in order to prevent data
      corruption only one write (compress operation) can use this compression
      stream, forcing all concurrent write operations to wait for stream lock
      to be released.  This patch changes zcomp to keep a compression streams
      list of user-defined size (via sysfs device attr).  Each write operation
      still exclusively holds compression stream, the difference is that we
      can have N write operations (depending on size of streams list)
      executing in parallel.  See TEST section later in commit message for
      performance data.
      
      Introduce struct zcomp_strm_multi and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_multi has a list of idle
      zcomp_strm structs, spinlock to protect idle list and wait queue, making
      it possible to perform parallel compressions.
      
      The following set of functions added:
      - zcomp_strm_multi_find()/zcomp_strm_multi_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
        create and destroy zcomp_strm_multi
      
      zcomp ->strm_find() and ->strm_release() callbacks are set during
      initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
      correspondingly.
      
      Each time zcomp issues a zcomp_strm_multi_find() call, the following set
      of operations performed:
      
      - spin lock strm_lock
      - if idle list is not empty, remove zcomp_strm from idle list, spin
        unlock and return zcomp stream pointer to caller
      - if idle list is empty, current adds itself to wait queue. it will be
        awaken by zcomp_strm_multi_release() caller.
      
      zcomp_strm_multi_release():
      - spin lock strm_lock
      - add zcomp stream to idle list
      - spin unlock, wake up sleeper
      
      Minchan Kim reported that spinlock-based locking scheme has demonstrated
      a severe perfomance regression for single compression stream case,
      comparing to mutex-based (see https://lkml.org/lkml/2014/2/18/16)
      
      base                      spinlock                    mutex
      
      ==Initial write           ==Initial write             ==Initial  write
      records:  5               records:  5                 records:   5
      avg:      1642424.35      avg:      699610.40         avg:       1655583.71
      std:      39890.95(2.43%) std:      232014.19(33.16%) std:       52293.96
      max:      1690170.94      max:      1163473.45        max:       1697164.75
      min:      1568669.52      min:      573429.88         min:       1553410.23
      ==Rewrite                 ==Rewrite                   ==Rewrite
      records:  5               records:  5                 records:   5
      avg:      1611775.39      avg:      501406.64         avg:       1684419.11
      std:      17144.58(1.06%) std:      15354.41(3.06%)   std:       18367.42
      max:      1641800.95      max:      531356.78         max:       1706445.84
      min:      1593515.27      min:      488817.78         min:       1655335.73
      
      When only one compression stream available, mutex with spin on owner
      tends to perform much better than frequent wait_event()/wake_up().  This
      is why single stream implemented as a special case with mutex locking.
      
      Introduce and document zram device attribute max_comp_streams.  This
      attr shows and stores current zcomp's max number of zcomp streams
      (max_strm).  Extend zcomp's zcomp_create() with `max_strm' parameter.
      `max_strm' limits the number of zcomp_strm structs in compression
      backend's idle list (max_comp_streams).
      
      max_comp_streams used during initialisation as follows:
      -- passing to zcomp_create() max_strm equals to 1 will initialise zcomp
      using single compression stream zcomp_strm_single (mutex-based locking).
      -- passing to zcomp_create() max_strm greater than 1 will initialise zcomp
      using multi compression stream zcomp_strm_multi (spinlock-based locking).
      
      default max_comp_streams value is 1, meaning that zram with single stream
      will be initialised.
      
      Later patch will introduce configuration knob to change max_comp_streams
      on already initialised and used zcomp.
      
      TEST
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test           base       1 strm (mutex)     3 strm (spinlock)
      -----------------------------------------------------------------------
       Initial write      589286.78       583518.39          718011.05
             Rewrite      604837.97       596776.38         1515125.72
        Random write      584120.11       595714.58         1388850.25
              Pwrite      535731.17       541117.38          739295.27
              Fwrite     1418083.88      1478612.72         1484927.06
      
      Usage example:
      set max_comp_streams to 4
              echo 4 > /sys/block/zram0/max_comp_streams
      
      show current max_comp_streams (default value is 1).
              cat /sys/block/zram0/max_comp_streams
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      beca3ec7
    • S
      zram: factor out single stream compression · 9cc97529
      Sergey Senozhatsky 提交于
      This is preparation patch to add multi stream support to zcomp.
      
      Introduce struct zcomp_strm_single and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_single implements single compession
      stream, same way as current zcomp implementation.  This moves zcomp_strm
      stream control and locking from zcomp, so compressing backend zcomp is not
      aware of required locking.
      
      Single and multi streams require different locking schemes.  Minchan Kim
      reported that spinlock-based locking scheme (which is used in multi stream
      implementation) has demonstrated a severe perfomance regression for single
      compression stream case, comparing to mutex-based.  see
      https://lkml.org/lkml/2014/2/18/16
      
      The following set of functions added:
      - zcomp_strm_single_find()/zcomp_strm_single_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_single_create()/zcomp_strm_single_destroy()
        create and destroy zcomp_strm_single
      
      New ->strm_find() and ->strm_release() callbacks added to zcomp, which are
      set to zcomp_strm_single_find() and zcomp_strm_single_release() during
      initialisation.  Instead of direct locking and zcomp_strm access from
      zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
      and ->strm_release() correspondingly.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cc97529
    • S
      zram: use zcomp compressing backends · b7ca232e
      Sergey Senozhatsky 提交于
      Do not perform direct LZO compress/decompress calls, initialise
      and use zcomp LZO backend (single compression stream) instead.
      
      [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7ca232e
    • S
      zram: introduce compressing backend abstraction · e7e1ef43
      Sergey Senozhatsky 提交于
      ZRAM performs direct LZO compression algorithm calls, making it the one
      and only option.  While LZO is generally performs well, LZ4 algorithm
      tends to have a faster decompression (see http://code.google.com/p/lz4/
      for full report)
      
      	Name            Ratio  C.speed D.speed
      	                        MB/s    MB/s
      	LZ4 (r101)      2.084    422    1820
      	LZO 2.06        2.106    414     600
      
      Thus, users who have mostly read (decompress) usage scenarious or mixed
      workflow (writes with relatively high read ops number) will benefit from
      using LZ4 compression backend.
      
      Introduce compressing backend abstraction zcomp in order to support
      multiple compression algorithms with the following set of operations:
      
              .create
              .destroy
              .compress
              .decompress
      
      Schematically zram write() usually contains the following steps:
      0) preparation (decompression of partioal IO, etc.)
      1) lock buffer_lock mutex (protects meta compress buffers)
      2) compress (using meta compress buffers)
      3) alloc and map zs_pool object
      4) copy compressed data (from meta compress buffers) to object allocated by 3)
      5) free previous pool page, assign a new one
      6) unlock buffer_lock mutex
      
      As we can see, compressing buffers must remain untouched from 1) to 4),
      because, otherwise, concurrent write() can overwrite data.  At the same
      time, zram_meta must be aware of a) specific compression algorithm memory
      requirements and b) necessary locking to protect compression buffers.  To
      remove requirement a) new struct zcomp_strm introduced, which contains a
      compress/decompress `buffer' and compression algorithm `private' part.
      While struct zcomp implements zcomp_strm stream handling and locking and
      removes requirement b) from zram meta.  zcomp ->create() and ->destroy(),
      respectively, allocate and deallocate algorithm specific zcomp_strm
      `private' part.
      
      Every zcomp has zcomp stream and mutex to protect its compression stream.
      Stream usage semantics remains the same -- only one write can hold stream
      lock and use its buffers.  zcomp_strm_find() turns caller into exclusive
      user of a stream (holding stream mutex until zram release stream), and
      zcomp_strm_release() makes zcomp stream available (unlock the stream
      mutex).  Hence no concurrent write (compression) operations possible at
      the moment.
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test            base           patched
      --------------------------------------------------
        Initial write      597992.91       591660.58
              Rewrite      609674.34       616054.97
                 Read     2404771.75      2452909.12
              Re-read     2459216.81      2470074.44
         Reverse Read     1652769.66      1589128.66
          Stride read     2202441.81      2202173.31
          Random read     2236311.47      2276565.31
       Mixed workload     1423760.41      1709760.06
         Random write      579584.08       615933.86
               Pwrite      597550.02       594933.70
                Pread     1703672.53      1718126.72
               Fwrite     1330497.06      1461054.00
                Fread     3922851.00      3957242.62
      
      Usage examples:
      
      	comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
      
      which initialises compressing backend if requested algorithm is supported.
      
      Compress:
      	zstrm = zcomp_strm_find(comp)
      	zcomp_compress(comp, zstrm, src, &dst_len)
      	[..] /* copy compressed data */
      	zcomp_strm_release(comp, zstrm)
      
      Decompress:
      	zcomp_decompress(comp, src, src_len, dst);
      
      Free compessing backend and its zcomp stream:
      	zcomp_destroy(comp)
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7e1ef43
    • S
      zram: delete zram_init_device() · b67d1ec1
      Sergey Senozhatsky 提交于
      allocate new `zram_meta' in disksize_store() only for uninitialised zram
      device, saving a number of allocations and deallocations in case if
      disksize_store() was called on currently used device.  at the same time
      zram_meta stack variable is not necessary, because we can set ->meta
      directly.  there is also no need in setting QUEUE_FLAG_NONROT queue on
      every disksize_store(), set it once during device creation.
      
      [minchan@kernel.org: handle zram->meta alloc fail case]
      [minchan@kernel.org: prevent lockdep spew of init_lock]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67d1ec1
    • S
      zram: move zram size warning to documentation · e64cd51d
      Sergey Senozhatsky 提交于
      Move zram warning about disksize and size of memory correlation to zram
      documentation.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e64cd51d