1. 29 May, 2014 · 1 commit
    • xen-blkback: defer freeing blkif to avoid blocking xenwatch · 814d04e7
      Authored by Valentin Priescu
      Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O
      requests to finish. If the VBD is attached to a hot-swappable disk, then
      xenwatch can hang for a long period of time, stalling other watches.
      
       INFO: task xenwatch:39 blocked for more than 120 seconds.
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782
       ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040
       ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040
       Call Trace:
       [<ffffffff810b0782>] ? irq_to_desc+0x12/0x20
       [<ffffffff8128f761>] ? list_del+0x11/0x40
       [<ffffffff8147a080>] ? wait_for_common+0x60/0x160
       [<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50
       [<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
       [<ffffffff8147a26a>] schedule+0x3a/0x60
       [<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback]
       [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
       [<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback]
       [<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90
       [<ffffffff81345cb7>] __device_release_driver+0x77/0xd0
       [<ffffffff81346488>] device_release_driver+0x28/0x40
       [<ffffffff813456e8>] bus_remove_device+0x78/0xe0
       [<ffffffff81342c9f>] device_del+0x12f/0x1a0
       [<ffffffff81342d2d>] device_unregister+0x1d/0x60
       [<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback]
       [<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback]
       [<ffffffff8130ae50>] ? cmp_dev+0x60/0x60
       [<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0
       [<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120
       [<ffffffff8130b4cb>] frontend_changed+0xb/0x10
       [<ffffffff81309c82>] xenwatch_thread+0xf2/0x130
       [<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
       [<ffffffff81309b90>] ? xenbus_directory+0x80/0x80
       [<ffffffff810799d6>] kthread+0x96/0xa0
       [<ffffffff81485934>] kernel_thread_helper+0x4/0x10
       [<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b
       [<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6
       [<ffffffff81485930>] ? gs_change+0x13/0x13
      
      With this patch, when there is still pending I/O, the actual disconnect
      is done by the last reference holder (last pending I/O request). In this
      case, xenwatch doesn't block indefinitely.
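
      The mechanism can be pictured as a plain reference count where whichever
      holder drops the last reference performs the teardown. The following is
      only a minimal sketch of that idea; the helper names (xen_blkif_get,
      xen_blkif_put, xen_blkif_free) and fields are assumptions, not
      necessarily the patch's exact symbols:

      	struct xen_blkif {
      		atomic_t refcnt;
      		/* ... rings, stats, pending-I/O bookkeeping ... */
      	};

      	static void xen_blkif_free(struct xen_blkif *blkif); /* actual teardown */

      	static void xen_blkif_get(struct xen_blkif *blkif)
      	{
      		atomic_inc(&blkif->refcnt);
      	}

      	/* Called by xenwatch on disconnect AND by each completing I/O
      	 * request; whichever drops the last reference does the teardown,
      	 * so xenwatch never sleeps waiting for in-flight requests. */
      	static void xen_blkif_put(struct xen_blkif *blkif)
      	{
      		if (atomic_dec_and_test(&blkif->refcnt))
      			xen_blkif_free(blkif);
      	}
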
      Signed-off-by: Valentin Priescu <priescuv@amazon.com>
      Reviewed-by: Steven Kady <stevkady@amazon.com>
      Reviewed-by: Steven Noonan <snoonan@amazon.com>
      Reviewed-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      814d04e7
  2. 12 Feb, 2014 · 1 commit
  3. 08 Feb, 2014 · 3 commits
  4. 18 Jun, 2013 · 1 commit
    • xen/blkback: Check for insane amounts of request on the ring (v6). · 8e3f8755
      Authored by Konrad Rzeszutek Wilk
      Check that the ring does not have an insane number of requests
      (more than could fit on the ring).
      
      If we detect this case we will stop processing the requests
      and wait until the XenBus disconnects the ring.
      
      The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
      many responses we have created in the past (rsp_prod_pvt) vs
      requests consumed (req_cons) and whether said difference is greater or
      equal to the size of the ring, does not catch this case.
      
      What the condition does check is whether there is a need to process
      more, as we still have a backlog of responses to finish. Note that both
      of those values (rsp_prod_pvt and req_cons) are not exposed on the
      shared ring.
      
      To understand this problem, a mini crash course in the ring protocol's
      response/request updates is in order.
      
      There are four entries tracking the ring: req_prod and rsp_prod;
      req_event and rsp_event. We are only concerned with the first two,
      which are at the heart of this bug.
      
      req_prod is a value incremented by the frontend for each request put
      on the ring. Conversely, rsp_prod is a value incremented by the backend
      for each response put on the ring (rsp_prod gets set from rsp_prod_pvt
      when pushing the responses onto the ring). Both values can
      wrap and are modulo the size of the ring (32 in the block case).
      Please see RING_GET_REQUEST and RING_GET_RESPONSE for more details.
      
      The culprit here is that if the difference between the
      req_prod and req_cons is greater than the ring size we have a problem.
      Fortunately for us, the '__do_block_io_op' loop:
      
      	rc = blk_rings->common.req_cons;
      	rp = blk_rings->common.sring->req_prod;
      
      	while (rc != rp) {
      
      		..
      		blk_rings->common.req_cons = ++rc; /* before make_response() */
      
      	}
      
      will loop up to the point where rc == rp. The macro inside the
      loop (RING_GET_REQUEST) is smart and indexes based on the modulo
      of the ring size. If the frontend has provided a bogus req_prod value,
      we will loop until rc == rp, which means we could repeatedly process
      requests (or responses) that have already been processed.
      
      The reason RING_REQUEST_CONS_OVERFLOW is not helping here is
      because it only tracks how many responses we have internally produced
      and whether we should process more. The astute reader will
      notice that the macro RING_REQUEST_CONS_OVERFLOW takes two
      arguments; more on this later.
      
      For example, if we were to enter this function with these values:
      
      	blk_rings->common.sring->req_prod = X + 31415
      		(X is the value from the last time __do_block_io_op was called)
      	blk_rings->common.req_cons = X
      	blk_rings->common.rsp_prod_pvt = X
      
      The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
      is doing:
      
      	req_cons - rsp_prod_pvt >= 32
      
      Which is,
      	X - X >= 32 or 0 >= 32
      
      And that is false, so we continue looping (this bug).
      
      If we reuse said macro RING_REQUEST_CONS_OVERFLOW and pass in rp
      (sring->req_prod) instead of rc, then the macro can do the check:
      
           req_prod - rsp_prod_pvt >= 32
      
      Which is,
             X + 31415 - X >= 32, or 31415 >= 32
      
      which is true, so we can error out and break out of the function.
      
      Unfortunately, the difference between rsp_prod_pvt and req_prod can
      legitimately be exactly 32 (which would trip the macro's >= check).
      This condition exists when the backend is lagging behind with the
      responses and still has not finished responding to all of them (so
      make_response has not been called), and rsp_prod_pvt + 32 == req_cons.
      This ends up with us not being able to use said macro.
      
      Hence we introduce a new macro, RING_REQUEST_PROD_OVERFLOW, which does
      a simple check of:
      
          req_prod - rsp_prod_pvt > RING_SIZE
      
      And with the X values from above:
      
         X + 31415 - X > 32
      
      Returns true. Also note that if the ring is full (which is where
      RING_REQUEST_CONS_OVERFLOW would trigger), we would not hit the
      same condition:
      
         X + 32 - X > 32
      
      Which is false.
      
      Let's use that macro.
      Note that in v5 of this patchset the macro was different; we used an
      earlier version.
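
      For illustration, the check can sit before the loop shown above. This
      sketch paraphrases the merged macro and the bail-out path; the exact
      warning text and return value here are assumptions:

      	#define RING_REQUEST_PROD_OVERFLOW(_r, _prod)		\
      		(((_prod) - (_r)->rsp_prod_pvt) > RING_SIZE(_r))

      	/* In __do_block_io_op, before consuming any requests: */
      	rp = blk_rings->common.sring->req_prod;
      	if (RING_REQUEST_PROD_OVERFLOW(&blk_rings->common, rp)) {
      		pr_warn("Frontend provided bogus ring requests (%d - %d = %d)\n",
      			rp, blk_rings->common.rsp_prod_pvt,
      			rp - blk_rings->common.rsp_prod_pvt);
      		return -EACCES;	/* stop until XenBus tears the ring down */
      	}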
      
      Cc: stable@vger.kernel.org
      [v1: Move the check outside the loop]
      [v2: Add a pr_warn as suggested by David]
      [v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
      [v4: Move wake_up after kthread_stop as suggested by Jan]
      [v5: Use RING_REQUEST_PROD_OVERFLOW instead]
      [v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      
      8e3f8755
  5. 07 May, 2013 · 1 commit
  6. 19 Apr, 2013 · 1 commit
    • xen-block: implement indirect descriptors · 402b27f9
      Authored by Roger Pau Monne
      Indirect descriptors introduce a new block operation
      (BLKIF_OP_INDIRECT) that passes grant references instead of segments
      in the request. These grant references are filled with arrays of
      blkif_request_segment_aligned; this way we can send more segments in a
      request.
      
      The proposed implementation sets the maximum number of indirect grefs
      (frames filled with blkif_request_segment_aligned) to 256 in the
      backend and 32 in the frontend. The value in the frontend has been
      chosen experimentally, and the backend value has been set to a sane
      value that allows expanding the maximum number of indirect descriptors
      in the frontend if needed.
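
      As a rough sketch of the wire format being described (field names,
      padding, and the array-size constant are illustrative; the
      authoritative layout is the blkif.h added by the patch):

      	struct blkif_request_indirect {
      		uint8_t        indirect_op;   /* the real op: BLKIF_OP_{READ,WRITE} */
      		uint16_t       nr_segments;   /* segments across all indirect pages */
      		uint64_t       id;
      		blkif_sector_t sector_number;
      		blkif_vdev_t   handle;
      		uint16_t       _pad;          /* keep the request 64-bit aligned */
      		/* Grants to pages that are themselves arrays of
      		 * blkif_request_segment_aligned entries. */
      		grant_ref_t    indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
      	} __attribute__((__packed__));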
      
      The migration code has changed from the previous implementation, in
      which we simply remapped the segments on the shared ring. Now the
      maximum number of segments allowed in a request can change depending
      on the backend, so we have to requeue all the requests in the ring and
      in the queue and split the bios in them if they are bigger than the
      new maximum number of segments.
      
      [v2: Fixed minor comments by Konrad]
      [v1: Added padding to make the indirect request 64-bit aligned.
       Added some BUGs, comments; fixed number of indirect pages in
       blkif_get_x86_{32/64}_req. Added description of the indirect operation
       in blkif.h]
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      [v3: Fixed spaces and tabs mix-ups]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      402b27f9
  7. 18 Apr, 2013 · 3 commits
    • xen-blkback: make the queue of free requests per backend · bf0720c4
      Authored by Roger Pau Monne
      Remove the last dependency from blkbk by moving the list of free
      requests to blkif. This change reduces the contention on the list of
      available requests.
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: xen-devel@lists.xen.org
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      bf0720c4
    • xen-blkback: implement LRU mechanism for persistent grants · 3f3aad5e
      Authored by Roger Pau Monne
      This mechanism allows blkback to change the number of grants
      persistently mapped at run time.
      
      The algorithm uses a simple LRU mechanism that removes (if needed) the
      persistent grants that have not been used since the last LRU run, or,
      if all grants have been used, removes the first grants in the list
      (those that are not in use).
      
      The algorithm allows the user to change the maximum number of
      persistent grants, by changing max_persistent_grants in sysfs.
      
      Since we are storing the persistent grants used inside the request
      struct (to be able to mark them as "unused" when unmapping), we no
      longer need the bitmap (unmap_seg).
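
      One LRU pass can be sketched as below. This is a simplification under
      assumed names (persistent_gnt_c, in_use, used_since_last_scan,
      unmap_persistent_gnt); the real worker also coordinates with the
      per-request bookkeeping described above, and the limit would be tuned
      through the max_persistent_grants parameter mentioned earlier:

      	static void purge_persistent_grants(struct xen_blkif *blkif)
      	{
      		struct persistent_gnt *gnt, *tmp;

      		list_for_each_entry_safe(gnt, tmp,
      					 &blkif->persistent_gnts, node) {
      			if (blkif->persistent_gnt_c <= max_persistent_grants)
      				break;		/* back under the limit */
      			if (gnt->in_use)
      				continue;	/* never unmap an active grant */
      			if (gnt->used_since_last_scan) {
      				/* Recently used: spare it, but clear the flag
      				 * so it is a candidate on the next LRU run. */
      				gnt->used_since_last_scan = false;
      				continue;
      			}
      			list_del(&gnt->node);
      			unmap_persistent_gnt(blkif, gnt);
      			blkif->persistent_gnt_c--;
      		}
      	}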
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: xen-devel@lists.xen.org
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3f3aad5e
    • xen-blkback: use balloon pages for all mappings · c6cc142d
      Authored by Roger Pau Monne
      Using balloon pages for all granted pages allows us to simplify the
      logic in blkback, especially in the xen_blkbk_map function, since now
      we can decide if we want to map a grant persistently or not after we
      have actually mapped it. This could not be done before because
      persistent grants used ballooned pages, whereas non-persistent grants
      used pages from the kernel.
      
      This patch also introduces several changes. The first is that the
      list of free pages is no longer global; each blkback instance now has
      its own list of free pages that can be used to map grants. Also, a
      run-time parameter (max_buffer_pages) has been added in order to tune
      the maximum number of free pages each blkback instance will keep in
      its buffer.
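
      A sketch of the per-instance pool follows (field names assumed, error
      handling trimmed): a mapping grabs a page from the instance's own free
      list and only falls back to the balloon helper when the list is empty.

      	static int get_free_page(struct xen_blkif *blkif, struct page **page)
      	{
      		unsigned long flags;

      		spin_lock_irqsave(&blkif->free_pages_lock, flags);
      		if (list_empty(&blkif->free_pages)) {
      			spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
      			/* Pool empty: take a fresh ballooned page. */
      			return alloc_xenballooned_pages(1, page, false);
      		}
      		*page = list_first_entry(&blkif->free_pages, struct page, lru);
      		list_del(&(*page)->lru);
      		blkif->free_pages_num--;
      		spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
      		return 0;
      	}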
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Cc: xen-devel@lists.xen.org
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      c6cc142d
  8. 20 Mar, 2013 · 1 commit
    • xen-blkback: don't store dev_bus_addr · ffb1dabd
      Authored by Roger Pau Monne
      The dev_bus_addr returned in the grant ref map operation is the mfn of
      the passed page; there is no need to store it in the persistent grant
      entry, since we can always get it provided that we have the page.
      
      This reduces the memory overhead of persistent grants in blkback.
      
      While at it, rename 'seg[i].buf' to 'seg[i].offset', which makes much
      more sense: we use that value in bio_add_page, whose fourth argument
      expects the offset.
      
      We hadn't used the physical address as part of this at all.
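
      For context, here is the shape of the call this refers to; the fourth
      argument of bio_add_page is a byte offset within the page, which is
      exactly what 'seg[i].offset' now names (the length expression here is
      illustrative):

      	/* int bio_add_page(struct bio *bio, struct page *page,
      	 *                  unsigned int len, unsigned int offset); */
      	bio_add_page(bio, pages[i],
      		     seg[i].nsec << 9,	/* sectors to bytes */
      		     seg[i].offset);	/* offset in the page, not an address */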
      Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: xen-devel@lists.xen.org
      [v1: s/buf/offset/]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      ffb1dabd
  9. 12 Mar, 2013 · 2 commits
  10. 30 Oct, 2012 · 2 commits
    • xen/blkback: Persistent grant maps for xen blk drivers · 0a8704a5
      Authored by Roger Pau Monne
      This patch implements persistent grants for the xen-blk{front,back}
      mechanism. The effect of this change is to reduce the number of unmap
      operations performed, since they cause a (costly) TLB shootdown. This
      allows the I/O performance to scale better when a large number of VMs
      are performing I/O.
      
      Previously, the blkfront driver was supplied a bvec[] from the request
      queue. This was granted to dom0; dom0 performed the I/O and wrote
      directly into the grant-mapped memory and unmapped it; blkfront then
      removed foreign access for that grant. The cost of unmapping scales
      badly with the number of CPUs in Dom0. An experiment showed that when
      Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
      ramdisk, the IPIs from performing unmaps become a bottleneck at 5
      guests (at which point 650,000 IOPS are being performed in total).
      than 5 guests are used, the performance declines. By 10 guests, only
      400,000 IOPS are being performed.
      
      This patch improves performance by only unmapping when the connection
      between blkfront and back is broken.
      
      On startup blkfront notifies blkback that it is using persistent
      grants, and blkback will do the same. If blkback is not capable of
      persistent mapping, blkfront will still use the same grants, since it
      is compatible with the previous protocol, and simplifies the code
      complexity in blkfront.
      
      To perform a read, in persistent mode, blkfront uses a separate pool
      of pages that it maps to dom0. When a request comes in, blkfront
      transmutes the request so that blkback will write into one of these
      free pages. Blkback keeps note of which grefs it has already
      mapped. When a new ring request comes to blkback, it looks to see if
      it has already mapped that page. If so, it will not map it again. If
      the page hasn't been previously mapped, it is mapped now, and a record
      is kept of this mapping. Blkback proceeds as usual. When blkfront is
      notified that blkback has completed a request, it memcpy's from the
      shared memory into the supplied bvec. A record is kept that the
      {gref, page} tuple is mapped and not in flight.
      
      Writes are similar, except that the memcpy is performed from the
      supplied bvecs into the shared pages, before the request is put onto
      the ring.
      
      Blkback stores a mapping of grefs=>{page mapped to by gref} in
      a red-black tree. As the grefs are not known a priori, and provide no
      guarantees on their ordering, we have to perform a search
      through this tree to find the page for every gref we receive. This
      operation takes O(log n) time in the worst case. In blkfront, grants
      are stored using a singly linked list.
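
      The lookup side of that tree can be sketched as a textbook rbtree walk
      keyed on the gref (a minimal sketch; the struct layout is assumed):

      	struct persistent_gnt {
      		grant_ref_t	gnt;
      		struct page	*page;
      		struct rb_node	node;
      	};

      	static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
      							 grant_ref_t gref)
      	{
      		struct rb_node *n = root->rb_node;

      		while (n) {
      			struct persistent_gnt *data =
      				rb_entry(n, struct persistent_gnt, node);

      			if (gref < data->gnt)
      				n = n->rb_left;
      			else if (gref > data->gnt)
      				n = n->rb_right;
      			else
      				return data;	/* already mapped: reuse it */
      		}
      		return NULL;	/* first sighting: map and insert */
      	}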
      
      The maximum number of grants that blkback will persistently map is
      currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
      prevent a malicious guest from attempting a DoS by supplying fresh
      grefs and causing the Dom0 kernel to map excessively. If a guest
      is using persistent grants and exceeds the maximum number of grants to
      map persistently, the newly passed grefs will be mapped and unmapped.
      Using this approach, we can have requests that mix persistent and
      non-persistent grants, and we need to handle them correctly.
      This allows us to set the maximum number of persistent grants to a
      value lower than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
      doing so will lead to unpredictable performance.
      
      In writing this patch, the question arises as to whether the additional
      cost of performing memcpys in the guest (to/from the pool of granted
      pages) outweighs the gains of not performing TLB shootdowns. The answer
      to that question is `no'. There appears to be very little, if any,
      additional cost to the guest of using persistent grants. There is
      perhaps a small saving from the reduced number of hypercalls
      performed in granting and ending foreign access.
      Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
      Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [v1: Fixed up the misuse of bool as int]
      0a8704a5
    • xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool · 1f999572
      Authored by Oliver Chick
      
      Changing the type of the bdev parameters to unsigned int :1, rather than bool.
      This is more consistent with the types of other features in the block drivers.
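
      Concretely, the change amounts to this kind of declaration swap
      (a sketch; surrounding fields elided):

      	struct xen_vbd {
      		/* ... */
      		/* was: bool flush_support; bool discard_secure; */
      		unsigned int	flush_support:1;
      		unsigned int	discard_secure:1;
      	};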
      Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1f999572
  11. 31 May, 2012 · 1 commit
  12. 24 Mar, 2012 · 1 commit
    • xen/blkback: Squash the discard support for 'file' and 'phy' type. · 4dae7670
      Authored by Konrad Rzeszutek Wilk
      The only reason for the distinction was the special case of 'file'
      (which is assumed to be a loopback device): reaching inside the
      loopback device, finding the underlying file, and calling fallocate on
      it. Fortunately, "xen-blkback: convert hole punching to discard request
      on loop devices" removes that use case, so we now base the discard
      support on blk_queue_discard(q) and extract all appropriate
      parameters from the 'struct request_queue'.
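
      A sketch of what "extract from the request_queue" looks like, using the
      standard block-layer accessors (the xenstore keys follow the blkif
      protocol; transaction setup and error handling omitted):

      	struct request_queue *q = bdev_get_queue(bdev);

      	if (blk_queue_discard(q)) {
      		/* Advertise discard to the frontend straight from the
      		 * queue limits; no 'file' vs 'phy' special-casing. */
      		xenbus_printf(xbt, dev->nodename, "discard-granularity",
      			      "%u", q->limits.discard_granularity);
      		xenbus_printf(xbt, dev->nodename, "discard-alignment",
      			      "%u", q->limits.discard_alignment);
      	}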
      
      CC: Li Dongyang <lidongyang@novell.com>
      Acked-by: Jan Beulich <JBeulich@suse.com>
      [v1: Dropping pointless initializer and keeping blank line]
      [v2: Remove the kfree as it is not used anymore]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      4dae7670
  13. 19 Nov, 2011 · 2 commits
  14. 26 Oct, 2011 · 1 commit
  15. 13 Oct, 2011 · 3 commits
  16. 15 Sep, 2011 · 1 commit
  17. 22 Aug, 2011 · 1 commit
  18. 13 May, 2011 · 9 commits
  19. 12 May, 2011 · 1 commit
  20. 06 May, 2011 · 1 commit
  21. 20 Apr, 2011 · 2 commits
  22. 19 Apr, 2011 · 1 commit