1. 30 Mar, 2013 · 1 commit
    • rbd: don't zero-fill non-image object requests · 6e2a4505
      Committed by Alex Elder
      A result of ENOENT from a read request for an object that's part of
      an rbd image indicates that there is a hole in that portion of the
      image.  Similarly, a short read for such an object indicates that
      the remainder of the read should be interpreted as a full read with
      zeros filling out the end of the request.
      
      This behavior is not correct for objects that are not backing rbd
      image data.  Currently rbd_img_obj_request_callback() assumes it
      should be done for all objects.
      
      Change rbd_img_obj_request_callback() so it only does this zeroing
      for image objects.  Encapsulate that special handling in its own
      function.  Add an assertion that the image object request is a bio
      request, since we assume that (and we currently don't support any
      other types).
      
      This resolves a problem identified here:
          http://tracker.ceph.com/issues/4559
      
      The regression was introduced by bf0d5f50.
      Reported-by: Dan van der Ster <dan@vanderster.com>
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
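The zero-fill rule described in this commit can be illustrated with a small userspace sketch. It is not the rbd code: `struct obj_read`, its fields, and `obj_read_fixup()` are hypothetical names chosen for illustration, and the only behavior assumed is what the message states (ENOENT means a hole, a short read is padded with zeros, and neither applies to non-image objects).

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the state of one object read request. */
struct obj_read {
    char   *buf;       /* destination buffer for the read */
    size_t  length;    /* number of bytes requested */
    size_t  xferred;   /* number of bytes actually transferred */
    int     result;    /* 0 on success or a negative errno */
    int     is_image;  /* nonzero if the object backs rbd image data */
};

/* Apply the hole and short-read handling, but only for image objects. */
static void obj_read_fixup(struct obj_read *r)
{
    if (!r->is_image)
        return;                          /* leave other results untouched */

    if (r->result == -ENOENT) {
        /* A missing image object is a hole: the whole read is zeros. */
        memset(r->buf, 0, r->length);
        r->result = 0;
        r->xferred = r->length;
    } else if (r->result == 0 && r->xferred < r->length) {
        /* Short read: zero-fill the tail and report a full read. */
        memset(r->buf + r->xferred, 0, r->length - r->xferred);
        r->xferred = r->length;
    }
}
```

For a non-image object the same ENOENT result is passed through unchanged, so the caller can still see that the object does not exist.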
  2. 23 Mar, 2013 · 1 commit
  3. 28 Feb, 2013 · 7 commits
  4. 27 Feb, 2013 · 3 commits
    • libceph: update osd request/reply encoding · 1b83bef2
      Committed by Sage Weil
      Use the new version of the encoding for osd requests and replies.  In the
      process, update the way we are tracking request ops and reply lengths and
      results in the struct ceph_osd_request.  Update the rbd and fs/ceph users
      appropriately.
      
      The main changes are:
       - we keep pointers into the request memory for fields we need to update
         each time the request is sent out over the wire
       - we keep information about the result in an array in the request struct
         where the users can easily get at it.
      Signed-off-by: Sage Weil <sage@inktank.com>
      Reviewed-by: Alex Elder <elder@inktank.com>
    • rbd: pass length, not op for osd completions · c47f9371
      Committed by Alex Elder
      The only thing type-specific osd completion functions do with their
      osd op parameter is (in some cases) extract the number of bytes
      transferred from it.  In the other cases, the xferred bytes field
      is not used, and total message data transfer byte count (which may
      well be zero) is used.
      
      Just set the object request transfer count in the main osd request
      callback function and provide that to the other routines.  There is
      then no longer any need to pass the op pointer to the type-specific
      completion routines, so drop those parameters.
      
      Stop doing anything with the total message data length.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • rbd: move rbd_osd_trivial_callback() · 39bf2c5d
      Committed by Alex Elder
      This function is slightly out of place, probably the result
      of an errant automatic merge or something.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  5. 26 Feb, 2013 · 5 commits
  6. 23 Feb, 2013 · 1 commit
  7. 22 Feb, 2013 · 6 commits
    • loopdev: ignore negative offset when calculate loop device size · b7a1da69
      Committed by Guo Chao
      A negative offset may make the loop device size larger than the
      backing file size.
      
       $ fallocate -l 1M a
       $ losetup --offset 0xffffffffffff0000 /dev/loop0 a
       $ blockdev --getsize64 /dev/loop0
       1114112
       $ ls -l a
       -rw-r--r-- 1 root root 1048576 Jan 23 12:46 a
       $ cat /dev/loop0
       cat: /dev/loop0: Input/output error
      
      It makes no sense to do that. Only apply the offset when it's positive.
      
      Also fix a typo in the comment.
      Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: M. Hindess <hindessm@uk.ibm.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
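The numbers in the shell session above can be reproduced with a small sketch of the size calculation. This is a userspace approximation, not the driver code: `loop_size_sectors()` and its parameters are hypothetical, and 512-byte sectors are assumed. The offset 0xffffffffffff0000 is -65536 when read as a signed 64-bit value, so subtracting it unconditionally gives 1048576 + 65536 = 1114112 bytes, which is the size blockdev reported.

```c
#include <stdint.h>

/*
 * Sketch of a loop device size calculation that ignores negative offsets.
 * All sizes are in bytes; the result is in 512-byte sectors.
 */
static int64_t loop_size_sectors(int64_t file_size, int64_t offset,
                                 int64_t sizelimit)
{
    int64_t size = file_size;

    if (offset > 0)              /* only a positive offset shrinks the device */
        size -= offset;
    if (size < 0)                /* offset lies beyond the end of the file */
        return 0;
    if (sizelimit > 0 && sizelimit < size)
        size = sizelimit;

    return size >> 9;            /* bytes to 512-byte sectors */
}
```

With the guard in place, a 1 MiB backing file and an offset of -65536 yield 2048 sectors, which is exactly the size of the file.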
    • loopdev: remove an user triggerable oops · b1a66504
      Committed by Guo Chao
      When loopdev is built as a module and we pass an invalid parameter,
      loop_init() returns directly without deregistering the misc device, which
      causes an oops the next time the loop module is inserted because we left
      some garbage in the misc device list.
      
      Test case:
      sudo modprobe loop max_part=1024
      (failed due to invalid parameter)
      sudo modprobe loop
      (oops)
      
      Clean up nicely to avoid such oops.
      Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: M. Hindess <hindessm@uk.ibm.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
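The cleanup pattern this commit describes (undo the registration before returning an error) can be sketched as follows. Everything here is a hypothetical stand-in: `fake_misc_register()`, `fake_misc_deregister()`, `fake_loop_init()`, and the limit of 256 are illustrative only and are not taken from the driver.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel's list of registered misc devices. */
static int misc_registered;

static int  fake_misc_register(void)   { misc_registered++; return 0; }
static void fake_misc_deregister(void) { misc_registered--; }

/* Sketch of an init routine that unwinds its registration on failure. */
static int fake_loop_init(int max_part)
{
    int err = fake_misc_register();

    if (err < 0)
        return err;

    if (max_part > 256) {        /* assumed limit, for illustration only */
        err = -EINVAL;
        goto misc_out;           /* do not return with the device registered */
    }
    return 0;

misc_out:
    fake_misc_deregister();
    return err;
}
```

Routing every failure through a single label keeps the registration and its undo paired, so a later successful load does not find stale state.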
    • loopdev: move common code into loop_figure_size() · 7b0576a3
      Committed by Guo Chao
      Update the block device size in accordance with the gendisk size and let
      userspace know about the change in loop_figure_size(). This is a cleanup
      that removes code common to loop_figure_size()'s two callers.
      Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: M. Hindess <hindessm@uk.ibm.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • loopdev: update block device size in loop_set_status() · 541c742a
      Committed by Guo Chao
      The loop device driver sometimes fails to impose the size limit on the
      device. To reproduce, repeatedly issue the following two commands:
      
      losetup --offset 7517244416 --sizelimit 3224971264 /dev/loop0 backed_file
      blockdev --getsize64 /dev/loop0
      
      blockdev reports the file size instead of sizelimit several times out of 100.
      
      The problems are:
      
      	- losetup sets up the device with two ioctls:
      		  LOOP_SET_FD and LOOP_SET_STATUS64.
      
      	- LOOP_SET_STATUS64 only updates the size of the gendisk.
      
      The block device size is updated lazily when the device comes into use. If
      udev rushes in between the two ioctls, it brings in a block device whose
      size is the backing file size. If the device is not released after the
      LOOP_SET_STATUS64 ioctl, blockdev will not see the updated size.
      
      Update the block device size in the LOOP_SET_STATUS64 ioctl.
      Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
      Reported-by: M. Hindess <hindessm@uk.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • loopdev: fix a deadlock · 5370019d
      Committed by Guo Chao
      bd_mutex and lo_ctl_mutex can be held in different order.
      
      Path #1:
      
      blkdev_open
       blkdev_get
        __blkdev_get (hold bd_mutex)
         lo_open (hold lo_ctl_mutex)
      
      Path #2:
      
      blkdev_ioctl
       lo_ioctl (hold lo_ctl_mutex)
        lo_set_capacity (hold bd_mutex)
      
      Lockdep does not report it, because path #2 actually holds a subclass of
      lo_ctl_mutex.  This subclass seems to have crept into the code by mistake.
      The patch author actually just mentioned it in the changelog, see commit
      f028f3b2 ("loop: fix circular locking in loop_clr_fd()"), also see:
      
      	http://marc.info/?l=linux-kernel&m=123806169129727&w=2
      
      Path #2 holds bd_mutex to call bd_set_size(). I've protected it
      with i_mutex in a previous patch, so drop bd_mutex at this site.
      Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: M. Hindess <hindessm@uk.ibm.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • drivers/block/swim3.c: fix null pointer dereference · 7414d4f6
      Committed by Cong Ding
      The use of pointer fs should be after the null check.
      Signed-off-by: Cong Ding <dinggnu@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 20 Feb, 2013 · 9 commits
    • libceph: drop return value from page vector copy routines · 903bb32e
      Committed by Alex Elder
      The return values provided for ceph_copy_to_page_vector() and
      ceph_copy_from_page_vector() serve no purpose, so get rid of them.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • rbd: ignore result of ceph_copy_from_page_vector() · 23ed6e13
      Committed by Alex Elder
      The result of ceph_copy_from_page_vector() is simply the length
      argument it is provided.
      
      This is called by rbd_obj_method_sync(), which returns the result if
      it's non-negative.  But we always either ignore or overwrite that
      return value.  So explicitly ignore what's returned by the copy
      function, and have rbd_obj_method_sync() always return either a
      negative errno or 0.
      
      We also return the result of ceph_copy_from_page_vector() in
      rbd_obj_read_sync().  There we still want to return the number of
      bytes transferred, but we can use the value we already have in hand
      rather than what ceph_copy_from_page_vector() provides.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • rbd: prevent bytes transferred overflow · 1ceae7ef
      Committed by Alex Elder
      In rbd_obj_read_sync(), verify the number of bytes transferred won't
      exceed what can be represented by a size_t before using it to
      indicate the number of bytes to copy to the result buffer.
      
      (The real motivation for this is to prepare for the next patch.)
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
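The check described above, that a 64-bit byte count fits in a size_t before it is used as a copy length, might look like the following in portable C. The helper name `xferred_to_size()` and the choice of -EOVERFLOW as the error are assumptions made for this sketch, not details from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a 64-bit transfer count to size_t, refusing values that cannot
 * be represented. On platforms where size_t is 64 bits wide the check
 * can never fail; it matters where size_t is narrower.
 */
static int xferred_to_size(uint64_t xferred, size_t *out)
{
    if (xferred > (uint64_t)SIZE_MAX)
        return -EOVERFLOW;

    *out = (size_t)xferred;
    return 0;
}
```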
    • libceph: allow STAT osd operations · fbfab539
      Committed by Alex Elder
      Add support for CEPH_OSD_OP_STAT operations in the osd client
      and in rbd.
      
      This operation sends no data to the osd; everything required is
      encoded in the identity of the target object.
      
      The result will be ENOENT if the object doesn't exist.  If it does
      exist and no other error occurs, the server returns the size and last
      modification time of the target object as output data (in little
      endian format).  The size is a 64-bit unsigned integer and the time is
      a ceph_timespec structure (two unsigned 32-bit integers, representing
      a seconds and a nanoseconds value).
      
      This resolves:
          http://tracker.ceph.com/issues/4007
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
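The reply layout described in this message (a little-endian 64-bit size followed by two little-endian 32-bit values for seconds and nanoseconds) can be decoded with a short sketch. `struct stat_reply`, `decode_stat_reply()`, and the helper names are hypothetical, and the sketch assumes the 16 bytes are packed back to back with no padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of a STAT reply. */
struct stat_reply {
    uint64_t size;        /* object size in bytes */
    uint32_t mtime_sec;   /* modification time, seconds */
    uint32_t mtime_nsec;  /* modification time, nanoseconds */
};

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 64-bit value as two 32-bit halves. */
static uint64_t get_le64(const unsigned char *p)
{
    return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
}

/* Returns 0 on success, -1 if the buffer is too short. */
static int decode_stat_reply(const unsigned char *buf, size_t len,
                             struct stat_reply *out)
{
    if (len < 16)
        return -1;

    out->size       = get_le64(buf);
    out->mtime_sec  = get_le32(buf + 8);
    out->mtime_nsec = get_le32(buf + 12);
    return 0;
}
```

Decoding byte by byte avoids any dependence on host endianness or struct layout.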
    • rbd: add parentheses to object request iterator macros · ef06f4d3
      Committed by Alex Elder
      The for_each_obj_request*() macros should parenthesize their uses of
      the ireq parameter.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
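Why macro parameters need parentheses can be shown with a generic example. The macros and the struct below are hypothetical and are not the rbd iterator macros. They only demonstrate the precedence problem that parenthesizing a parameter avoids.

```c
/* Hypothetical request structure, for illustration only. */
struct img_request {
    int obj_request_count;
};

/*
 * Parenthesized parameter: works for any pointer expression.
 * Without the inner parentheses, OBJ_COUNT(reqs + 1) would expand to
 * reqs + 1->obj_request_count, which does not compile.
 */
#define OBJ_COUNT(ireq)  ((ireq)->obj_request_count)

/* The same problem with arithmetic, where it compiles but misbehaves. */
#define SCALE_BAD(n)   (n * 2)     /* SCALE_BAD(1 + 2) is 1 + 2 * 2 */
#define SCALE_GOOD(n)  ((n) * 2)   /* SCALE_GOOD(1 + 2) is (1 + 2) * 2 */
```

The arithmetic case is the more dangerous one in practice, because it compiles cleanly and silently produces the wrong value.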
    • xen-blkback: use balloon pages for persistent grants · 087ffecd
      Committed by Roger Pau Monne
      With current persistent grants implementation we are not freeing the
      persistent grants after we disconnect the device. Since grant map
      operations change the mfn of the allocated page, and we can no longer
      pass it to __free_page without setting the mfn to a sane value, use
      balloon grant pages instead, as the gntdev device does.
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Cc: stable@vger.kernel.org
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen-blkfront: drop the use of llist_for_each_entry_safe · f84adf49
      Committed by Konrad Rzeszutek Wilk
      Replace llist_for_each_entry_safe with a while loop.
      
      llist_for_each_entry_safe can trigger a bug in GCC 4.1, so it's best
      to remove it and use a while loop and do the deletion manually.
      
      Specifically this bug can be triggered by hot-unplugging a disk, either
      by doing xm block-detach or by save/restore cycle.
      
      BUG: unable to handle kernel paging request at fffffffffffffff0
      IP: [<ffffffffa0047223>] blkif_free+0x63/0x130 [xen_blkfront]
      The crash call trace is:
      	...
      bad_area_nosemaphore+0x13/0x20
      do_page_fault+0x25e/0x4b0
      page_fault+0x25/0x30
      ? blkif_free+0x63/0x130 [xen_blkfront]
      blkfront_resume+0x46/0xa0 [xen_blkfront]
      xenbus_dev_resume+0x6c/0x140
      pm_op+0x192/0x1b0
      device_resume+0x82/0x1e0
      dpm_resume+0xc9/0x1a0
      dpm_resume_end+0x15/0x30
      do_suspend+0x117/0x1e0
      
      When drilling down to the assembler code, on newer GCC it does
      .L29:
              cmpq    $-16, %r12      #, persistent_gnt check
              je      .L30    	#, out of the loop
      .L25:
      	... code in the loop
              testq   %r13, %r13      # n
              je      .L29    	#, back to the top of the loop
              cmpq    $-16, %r12      #, persistent_gnt check
              movq    16(%r12), %r13  # <variable>.node.next, n
              jne     .L25    	#,	back to the top of the loop
      .L30:
      
      While on GCC 4.1, it is:
      L78:
      	... code in the loop
      	testq   %r13, %r13      # n
              je      .L78    #,	back to the top of the loop
              movq    16(%rbx), %r13  # <variable>.node.next, n
              jmp     .L78    #,	back to the top of the loop
      
      Which basically means that the loop exit condition, instead of
      being:
      
      	&(pos)->member != NULL;
      
      is:
      	;
      
      which makes the loop unbounded.
      
      Since xen-blkfront is the only user of the llist_for_each_entry_safe
      macro, remove it from llist.h.
      
      Orabug: 16263164
      CC: stable@vger.kernel.org
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
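The replacement approach, a plain while loop that tests the node pointer itself and deletes entries manually, can be sketched in userspace C. The types below are simplified stand-ins (`struct node` for the list node, `struct grant` for a persistent grant) and `free_all_grants()` is a hypothetical name; this is not the xen-blkfront code.

```c
#include <stddef.h>
#include <stdlib.h>

struct node {                 /* stand-in for a singly linked list node */
    struct node *next;
};

struct grant {                /* stand-in for a persistent grant entry */
    int gref;
    struct node node;         /* embedded at a nonzero offset */
};

/* Recover the containing grant from a pointer to its embedded node. */
#define node_to_grant(n) \
    ((struct grant *)((char *)(n) - offsetof(struct grant, node)))

/* Free every entry on the list; returns how many were freed. */
static int free_all_grants(struct node *first)
{
    int freed = 0;

    /* Terminate on the node pointer, not on a derived entry pointer. */
    while (first != NULL) {
        struct grant *g = node_to_grant(first);

        first = first->next;  /* fetch the successor before freeing */
        free(g);
        freed++;
    }
    return freed;
}
```

Testing the raw node pointer sidesteps the pattern in the macro, where the exit condition was expressed through the address of a member of the containing entry.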
    • xen/blkback: Don't trust the handle from the frontend. · 01c681d4
      Committed by Konrad Rzeszutek Wilk
      The 'handle' is the device that the request is from. For the lifetime
      of the ring we copy it from a request to a response so that the frontend
      is not surprised by it. But we do not need it: when we start processing
      I/Os we have our own 'struct phys_req', which holds only the most
      essential information about the request. In fact 'vbd_translate' ends up
      overwriting preq.dev with a value from the backend.
      
      This assignment of preq.dev with the 'handle' value is superfluous,
      so let's not do it.
      
      Cc: stable@vger.kernel.org
      Acked-by: Jan Beulich <jbeulich@suse.com>
      Acked-by: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen-blkback: do not leak mode property · 9d092603
      Committed by Jan Beulich
      "be->mode" is obtained from xenbus_read(), which does a kmalloc() for
      the message body. The short string is never released, so do it along
      with freeing "be" itself, and make sure the string isn't kept when
      backend_changed() doesn't complete successfully (which made it
      desirable to slightly re-structure that function, so that the error
      cleanup can be done in one place).
      Reported-by: Olaf Hering <olaf@aepfle.de>
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  9. 19 Feb, 2013 · 2 commits
  10. 15 Feb, 2013 · 1 commit
  11. 14 Feb, 2013 · 4 commits
    • rbd: add barriers near done flag operations · 07741308
      Committed by Alex Elder
      Somehow, I missed this little item in Documentation/atomic_ops.txt:
          *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***
      
      Create and use some helper functions that include the proper memory
      barriers for manipulating the done field.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
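A userspace analogue of "helper functions that include the proper memory barriers" can be written with C11 atomics. This is only an analogy: the kernel's atomic_set()/atomic_read() and its barrier primitives are different APIs, and the structure and function names below are hypothetical. The sketch uses release and acquire ordering so that a reader who sees the done flag also sees the result written before it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request with a result that is published via a done flag. */
struct obj_request {
    int        result;
    atomic_int done;
};

/* Mark the request done; earlier writes become visible to acquirers. */
static void obj_request_done_set(struct obj_request *req)
{
    atomic_store_explicit(&req->done, 1, memory_order_release);
}

/* Test the flag; a true return orders later reads after the store. */
static bool obj_request_done_test(struct obj_request *req)
{
    return atomic_load_explicit(&req->done, memory_order_acquire) != 0;
}
```

Wrapping the flag in helpers keeps the ordering requirement in one place instead of relying on every call site to remember it.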
    • rbd: turn off interrupts for open/remove locking · a14ea269
      Committed by Alex Elder
      This commit:
          bc7a62ee5 rbd: prevent open for image being removed
      added checking for removing rbd before allowing an open, and used
      the same request spinlock for protecting that and updating the open
      count as is used for the request queue.
      
      However it used the non-irq protected version of the spinlocks.
      Fix that.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
    • libceph: don't require r_num_pages for bio requests · 9cbb1d72
      Committed by Alex Elder
      There is a check in the completion path for osd requests that
      ensures the number of pages allocated is enough to hold the amount
      of incoming data expected.
      
      For bio requests coming from rbd the "number of pages" is not really
      meaningful (although total length would be).  So stop requiring that
      nr_pages be supplied for bio requests.  This is done by checking
      whether the pages pointer is null before checking the value of
      nr_pages.
      
      Note that this value is passed on to the messenger, but there it's
      only used for debugging--it's never used for validation.
      
      While here, change another spot that used r_pages in a debug message
      inappropriately, and also invalidate the r_con_filling_msg pointer
      after dropping a reference to it.
      
      This resolves:
          http://tracker.ceph.com/issues/3875
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
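The relaxed check described above (only compare the page count when a page vector is actually present) can be sketched like this. `struct osd_req`, its fields, and `reply_buffer_ok()` are hypothetical simplifications, not the libceph definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a request's incoming data buffers. */
struct osd_req {
    void         **pages;      /* page vector, or NULL for bio requests */
    unsigned int   num_pages;  /* meaningful only when pages is non-NULL */
    void          *bio;        /* bio list, used instead of pages by rbd */
};

/* Check that the request can hold want_pages pages of incoming data. */
static bool reply_buffer_ok(const struct osd_req *req, unsigned int want_pages)
{
    /* Validate the page count only when a page vector was supplied. */
    if (req->pages != NULL && req->num_pages < want_pages)
        return false;

    return true;
}
```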
    • rbd: don't take extra bio reference for osd client · 1e32d34c
      Committed by Alex Elder
      Currently, if the OSD client finds an osd request has had a bio list
      attached to it, it drops a reference to it (or rather, to the first
      entry on that list) when the request is released.
      
      The code that added that reference (i.e., the rbd client) is
      therefore required to take an extra reference to that first bio
      structure.
      
      The osd client doesn't really do anything with the bio pointer other
      than transfer it from the osd request structure to outgoing (for
      writes) and ingoing (for reads) messages.  So it really isn't the
      right place to be taking or dropping references.
      
      Furthermore, the rbd client already holds references to all bio
      structures it passes to the osd client, and holds them until the
      request is completed.  So there's no need for this extra reference
      whatsoever.
      
      So remove the bio_put() call in ceph_osdc_release_request(), as
      well as its matching bio_get() call in rbd_osd_req_create().
      
      This change could lead to a crash if an old libceph.ko was used with
      a new rbd.ko.  Add a compatibility check at rbd initialization time to
      avoid this possibility.
      
      This resolves:
          http://tracker.ceph.com/issues/3798 and
          http://tracker.ceph.com/issues/3799
      Signed-off-by: Alex Elder <elder@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>