  1. Aug 09, 2014 (1 commit)
  2. Aug 07, 2014 (4 commits)
    • zram: replace global tb_lock with fine grain lock · d2d5e762
      Committed by Weijie Yang
      Currently, we use a rwlock tb_lock to protect concurrent access to the
      whole zram meta table.  However, according to the actual access pattern,
      there is only a small chance that upper layers access the same
      table[index] concurrently, so the current lock granularity is too coarse.
      
      The idea of the optimization is to change the lock granularity from the
      whole meta table to a per-entry lock (table -> table[index]), so that we
      still protect concurrent access to the same table[index] while allowing
      the maximum concurrency.
      
      With this in mind, several kinds of locks which could be used as a
      per-entry lock were tested and compared:
      
      Test environment:
      x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
      kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.
      
      iozone test:
      iozone -t 4 -R -r 16K -s 200M -I +Z
      (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)
      
            Test       base      CAS    spinlock    rwlock   bit_spinlock
      -------------------------------------------------------------------
       Initial write  1381094   1425435   1422860   1423075   1421521
             Rewrite  1529479   1641199   1668762   1672855   1654910
                Read  8468009  11324979  11305569  11117273  10997202
             Re-read  8467476  11260914  11248059  11145336  10906486
        Reverse Read  6821393   8106334   8282174   8279195   8109186
         Stride read  7191093   8994306   9153982   8961224   9004434
         Random read  7156353   8957932   9167098   8980465   8940476
      Mixed workload  4172747   5680814   5927825   5489578   5972253
        Random write  1483044   1605588   1594329   1600453   1596010
              Pwrite  1276644   1303108   1311612   1314228   1300960
               Pread  4324337   4632869   4618386   4457870   4500166
      
      To increase the likelihood of concurrent access to the same table[index],
      set zram to a small disksize (10MB) and let the threads run with a large
      loop count.
      
      fio test:
      fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
      --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
      --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
      --name=seq-read --rw=read --stonewall --name=seq-readwrite
      --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
      (10MB zram raw block device, take the average of 10 tests, KB/s)
      
          Test     base     CAS    spinlock    rwlock  bit_spinlock
      -------------------------------------------------------------
      seq-write   933789   999357   1003298    995961   1001958
       seq-read  5634130  6577930   6380861   6243912   6230006
         seq-rw  1405687  1638117   1640256   1633903   1634459
        rand-rw  1386119  1614664   1617211   1609267   1612471
      
      All of the locking methods perform better than the baseline; however, it
      is hard to say which method is the most appropriate.
      
      On the other hand, zram is mostly used on small embedded systems, so we
      don't want to increase the memory footprint.
      
      This patch picks the bit_spinlock method and packs the object size and
      page flags into an unsigned long table.value, so that no extra memory
      overhead is added on either 32-bit or 64-bit systems.
      
      Finally, even though the different kinds of locks perform differently,
      the difference can be ignored: if zram is used as a swap device, the
      swap subsystem prevents concurrent access to the same swap slot; if
      zram is used as a block device with a filesystem on top, the filesystem
      and the page cache mostly prevent concurrent access to the same block.
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
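      As a rough sketch of the packing described above (the struct, macro and
      helper names here are illustrative, not taken from the patch): the low
      bits of table[index].value hold the compressed object size, the higher
      bits hold per-entry flags, and one flag bit doubles as a bit spinlock,
      so no separate lock word is needed on 32-bit or 64-bit systems.

      #include <linux/types.h>
      #include <linux/bit_spinlock.h>

      struct zram_table_entry {
              unsigned long handle;
              unsigned long value;    /* [flags | object size], no separate lock */
      };

      #define ZRAM_FLAG_SHIFT  24                      /* low 24 bits: object size */
      #define ZRAM_ACCESS      (ZRAM_FLAG_SHIFT + 1)   /* flag bit used as the lock */

      static void zram_lock_entry(struct zram_table_entry *entry)
      {
              bit_spin_lock(ZRAM_ACCESS, &entry->value);
      }

      static void zram_unlock_entry(struct zram_table_entry *entry)
      {
              bit_spin_unlock(ZRAM_ACCESS, &entry->value);
      }

      static size_t zram_entry_obj_size(struct zram_table_entry *entry)
      {
              return entry->value & ((1UL << ZRAM_FLAG_SHIFT) - 1);
      }
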
    • zram: use size_t instead of u16 · 023b409f
      Committed by Minchan Kim
      Some architectures (e.g., hexagon and PowerPC) can use a PAGE_SHIFT of
      16 or more.  In these cases u16 is not large enough to represent a
      compressed page's size, so use size_t instead.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Weijie Yang <weijie.yang@samsung.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
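      A standalone illustration (plain C, not zram code) of why u16 overflows
      once PAGE_SHIFT reaches 16: an incompressible page is stored as-is, so
      the recorded size can equal PAGE_SIZE, which no longer fits in 16 bits.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              unsigned int page_shift = 16;               /* e.g. some hexagon/PowerPC configs */
              size_t page_size = (size_t)1 << page_shift; /* 65536 */
              uint16_t clen16 = (uint16_t)page_size;      /* wraps to 0: u16 max is 65535 */
              size_t clen = page_size;                    /* size_t keeps the real value */

              printf("page_size=%zu u16=%u size_t=%zu\n", page_size, clen16, clen);
              return 0;
      }
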
    • zram: remove unused SECTOR_SIZE define · a830eff7
      Committed by Sergey Senozhatsky
      Drop the SECTOR_SIZE define, because it is not used.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: rename struct `table' to `zram_table_entry' · cb8f2eec
      Committed by Sergey Senozhatsky
      Andrew Morton has recently noted that `struct table' actually represents
      a table entry and, thus, should be renamed.  Rename it to
      `zram_table_entry'.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Jul 24, 2014 (1 commit)
  4. Jul 10, 2014 (1 commit)
  5. Jul 04, 2014 (1 commit)
  6. Jun 25, 2014 (1 commit)
  7. Jun 23, 2014 (1 commit)
    • rbd: handle parent_overlap on writes correctly · 9638556a
      Committed by Ilya Dryomov
      The following check in rbd_img_obj_request_submit()
      
          rbd_dev->parent_overlap <= obj_request->img_offset
      
      allows the fall through to the non-layered write case even if both
      parent_overlap and obj_request->img_offset belong to the same RADOS
      object.  This leads to data corruption, because the area to the left of
      parent_overlap ends up unconditionally zero-filled instead of being
      populated with parent data.  Suppose we want to write 1M to offset 6M
      of image bar, which is a clone of foo@snap; object_size is 4M,
      parent_overlap is 5M:
      
          rbd_data.<id>.0000000000000001
           ---------------------|----------------------|------------
          | should be copyup'ed | should be zeroed out | write ...
           ---------------------|----------------------|------------
         4M                    5M                     6M
                          parent_overlap    obj_request->img_offset
      
      4..5M should be copyup'ed from foo, yet it is zero-filled, just like
      5..6M is.
      
      Given that the only striping mode the kernel client currently supports
      is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
      parent_overlap up to the next object boundary for the purposes of the
      overlap check.
      
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
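      A minimal sketch of that rounding (the helper name is hypothetical, not
      the actual rbd code): with chunking, rounding the overlap up to the next
      object boundary makes the check operate on whole objects, so a partially
      covered object is still treated as layered and gets copyup'ed.

      #include <stdint.h>

      /* round parent_overlap up to the next object boundary
       * (object_size is a power of two) */
      static uint64_t rbd_rounded_overlap(uint64_t parent_overlap,
                                          uint64_t object_size)
      {
              return (parent_overlap + object_size - 1) & ~(object_size - 1);
      }

      /*
       * With the commit's example (object_size = 4M, parent_overlap = 5M) the
       * rounded overlap is 8M, so a write at img_offset = 6M no longer falls
       * through to the plain write path and 4..5M is copyup'ed from the parent.
       */
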
  8. Jun 18, 2014 (1 commit)
  9. Jun 17, 2014 (1 commit)
  10. Jun 14, 2014 (1 commit)
  11. Jun 13, 2014 (2 commits)
  12. Jun 12, 2014 (1 commit)
  13. Jun 11, 2014 (3 commits)
  14. Jun 07, 2014 (2 commits)
  15. Jun 06, 2014 (6 commits)
    • block: add blk_rq_set_block_pc() · f27b087b
      Committed by Jens Axboe
      With the optimizations around not clearing the full request at alloc
      time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
      up to the user allocating the request.
      
      Add a blk_rq_set_block_pc() that sets the command type to
      REQ_TYPE_BLOCK_PC, and properly initializes the members associated
      with this type of request. Update callers to use this function instead
      of manipulating rq->cmd_type directly.
      
      Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
      attempt.
      Signed-off-by: Jens Axboe <axboe@fb.com>
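      An approximate sketch of the helper's shape and of a caller, assembled
      from the description above and the struct request layout of that era
      (not copied from the patch, so treat the exact fields as assumptions):

      #include <linux/blkdev.h>

      void blk_rq_set_block_pc(struct request *rq)
      {
              rq->cmd_type = REQ_TYPE_BLOCK_PC;
              rq->__data_len = 0;
              rq->__sector = (sector_t)-1;
              rq->bio = rq->biotail = NULL;
              memset(rq->__cmd, 0, sizeof(rq->__cmd));
      }

      /* caller side (e.g. a SCSI ioctl path): allocate the request first,
       * then mark and initialize it as a BLOCK_PC request */
      rq = blk_get_request(q, WRITE, GFP_KERNEL);
      blk_rq_set_block_pc(rq);
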
    • rbd: fix ida/idr memory leak · ffe312cf
      Committed by Ilya Dryomov
      ida_destroy() needs to be called on module exit to release ida caches.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
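      A minimal sketch of the pattern (the ida name is illustrative): an ida
      keeps internal caches, so the module exit path should call ida_destroy()
      to release them.

      #include <linux/module.h>
      #include <linux/idr.h>

      static DEFINE_IDA(rbd_dev_id_ida);      /* illustrative name */

      static void __exit rbd_exit(void)
      {
              /* ... unregister the block driver, sysfs entries, etc. ... */
              ida_destroy(&rbd_dev_id_ida);   /* free the ida's cached nodes */
      }
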
    • rbd: use reference counts for image requests · 0f2d5be7
      Committed by Alex Elder
      Each image request contains a reference count, but to date it has
      not actually been used.  (I think this was just an oversight.) A
      recent report involving rbd failing an assertion shed light on why
      and where we need to use these reference counts.
      
      Every OSD request associated with an object request uses
      rbd_osd_req_callback() as its callback function.  That function will
      call a helper function (dependent on the type of OSD request) that
      will set the object request's "done" flag if appropriate.  If that
      "done" flag is set, the object request is passed to
      rbd_obj_request_complete().
      
      In rbd_obj_request_complete(), requests are processed in sequential
      order.  So if an object request completes before one of its
      predecessors in the image request, the completion is deferred.
      Otherwise, if it's a completing object's "turn" to be completed, it
      is passed to rbd_img_obj_end_request(), which records the result of
      the operation, accumulates transferred bytes, and so on.  Next, the
      successor to this request is checked and if it is marked "done",
      (deferred) completion processing is performed on that request, and
      so on.  If the last object request in an image request is completed,
      rbd_img_request_complete() is called, which (typically) destroys
      the image request.
      
      There is a race here, however.  The instant an object request is
      marked "done" it can be provided (by a thread handling completion of
      one of its predecessor operations) to rbd_img_obj_end_request(),
      which (for the last request) can then lead to the image request
      getting torn down.  And this can happen *before* that object has
      itself entered rbd_img_obj_end_request().  As a result, once it
      *does* enter that function, the image request (and even the object
      request itself) may have been freed and become invalid.
      
      All that's necessary to avoid this is to properly count references
      to the image requests.  We tear down an image request's object
      requests all at once--only when the entire image request has
      completed.  So there's no need for an image request to count
      references for its object requests.  However, we don't want an
      image request to go away until the last of its object requests
      has passed through rbd_img_obj_callback().  In other words,
      we don't want rbd_img_request_complete() to necessarily
      result in the image request being destroyed, because it may
      get called before we've finished processing on all of its
      object requests.
      
      So the fix is to add a reference to an image request for
      each of its object requests.  The reference can be viewed
      as representing an object request that has not yet finished
      its call to rbd_img_obj_callback().  That is emphasized by
      getting the reference right after assigning that as the image
      object's callback function.  The corresponding release of that
      reference is done at the end of rbd_img_obj_callback(), which
      every image object request passes through exactly once.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
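      A sketch of the counting scheme described above (hedged: the get/put
      helper names follow the commit text, the struct layout is elided): each
      object request takes a reference on its image request right after
      rbd_img_obj_callback() is installed as its callback, and drops it at the
      end of that callback, so the image request cannot be torn down while any
      object request is still inside it.

      #include <linux/kref.h>
      #include <linux/slab.h>

      struct rbd_img_request {
              struct kref kref;
              /* ... remaining fields elided ... */
      };

      static void rbd_img_request_destroy(struct kref *kref)
      {
              struct rbd_img_request *img_request =
                      container_of(kref, struct rbd_img_request, kref);

              /* tear down the object requests, then free the image request */
              kfree(img_request);
      }

      static void rbd_img_request_get(struct rbd_img_request *img_request)
      {
              kref_get(&img_request->kref);
      }

      static void rbd_img_request_put(struct rbd_img_request *img_request)
      {
              kref_put(&img_request->kref, rbd_img_request_destroy);
      }
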
    • rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() · b30a01f2
      Committed by Ilya Dryomov
      The osd_request, along with the r_request and r_reply messages attached
      to it, is leaked in __rbd_dev_header_watch_sync() if the requested image
      doesn't exist.  This is because lingering requests are special and get
      an extra ref in the reply path.  Fix it by unregistering the linger
      request on the error path and splitting __rbd_dev_header_watch_sync()
      into two functions to keep it maintainable.
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
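      A hedged sketch of the error-path idea (structure only; the real code
      lives in __rbd_dev_header_watch_sync() and uses rbd's own wrappers, and
      the libceph calls here are assumptions from that kernel era): a watch is
      registered as a lingering OSD request, which holds an extra reference,
      so a failed watch setup must also unregister the linger.

      #include <linux/ceph/osd_client.h>

      static int setup_watch(struct ceph_osd_client *osdc,
                             struct ceph_osd_request *osd_req)
      {
              int ret;

              ceph_osdc_set_request_linger(osdc, osd_req);

              ret = ceph_osdc_start_request(osdc, osd_req, false);
              if (ret)
                      return ret;

              ret = ceph_osdc_wait_request(osdc, osd_req);
              if (ret) {
                      /* e.g. the image doesn't exist: drop the linger ref too */
                      ceph_osdc_unregister_linger_request(osdc, osd_req);
                      return ret;
              }

              return 0;
      }
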
    • rbd: make sure we have latest osdmap on 'rbd map' · 30ba1f02
      Committed by Ilya Dryomov
      Given an existing idle mapping (img1), mapping an image (img2) in
      a newly created pool (pool2) fails:
      
          $ ceph osd pool create pool1 8 8
          $ rbd create --size 1000 pool1/img1
          $ sudo rbd map pool1/img1
          $ ceph osd pool create pool2 8 8
          $ rbd create --size 1000 pool2/img2
          $ sudo rbd map pool2/img2
          rbd: sysfs write failed
          rbd: map failed: (2) No such file or directory
      
      This is because client instances are shared by default and we don't
      request an osdmap update when bumping a ref on an existing client.  The
      fix is to use the mon_get_version request to see if the osdmap we have
      is the latest, and block until the requested update is received if it's
      not.
      
      Fixes: http://tracker.ceph.com/issues/8184
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
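      A hedged sketch of the approach (the libceph helper names are
      assumptions recalled from that kernel era, not taken from this patch):
      ask the monitors for the newest osdmap epoch via mon_get_version and,
      if the local map is older, subscribe to the next map and wait for it
      before resolving the pool.

      #include <linux/ceph/libceph.h>
      #include <linux/ceph/mon_client.h>

      static int wait_for_latest_osdmap(struct ceph_client *client)
      {
              unsigned long timeout = client->options->mount_timeout * HZ;
              u64 newest_epoch;
              int ret;

              /* a mon_get_version("osdmap") returns the cluster's newest map epoch */
              ret = ceph_monc_do_get_version(&client->monc, "osdmap", &newest_epoch);
              if (ret)
                      return ret;

              if (client->osdc.osdmap->epoch >= newest_epoch)
                      return 0;       /* our map is already current */

              /* otherwise subscribe to the next osdmap and block until it arrives */
              ceph_monc_request_next_osdmap(&client->monc);
              return ceph_monc_wait_osdmap(&client->monc, newest_epoch, timeout);
      }
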
    • rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 461f758a
      Committed by Duan Jiong
      This patch fixes a coccinelle warning about using IS_ERR and PTR_ERR
      where PTR_ERR_OR_ZERO suffices.
      Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
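      The coccinelle pattern in question, shown generically (not the specific
      rbd call site):

      /* before */
      if (IS_ERR(ptr))
              return PTR_ERR(ptr);
      return 0;

      /* after */
      return PTR_ERR_OR_ZERO(ptr);
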
  16. Jun 05, 2014 (4 commits)
  17. Jun 04, 2014 (9 commits)