1. 11 July 2014 (2 commits)
    • drbd: close race when detaching from disk · ba3c6fb8
      Committed by Lars Ellenberg
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
      IP: bd_release+0x21/0x70
      Process drbd_w_t7146
      Call Trace:
       close_bdev_exclusive
       drbd_free_ldev		[drbd]
       drbd_ldev_destroy	[drbd]
       w_after_state_ch	[drbd]
      
      Race probably went like this:
        state.disk = D_FAILED
      
      ... first one to hit zero during D_FAILED:
         put_ldev() /* ----------------> 0 */
           i = atomic_dec_return()
           if (i == 0)
             if (state.disk == D_FAILED)
               schedule_work(go_diskless)
                                      /* 1 <------ */ get_ldev_if_state()
         go_diskless()
            do_some_pre_cleanup()                     corresponding put_ldev():
            force_state(D_DISKLESS)   /* 0 <------ */ i = atomic_dec_return()
                                                      if (i == 0)
              atomic_inc() /* ---------> 1 */
              state.disk = D_DISKLESS
              schedule_work(after_state_ch)           /* execution pre-empted by IRQ ? */
      
         after_state_ch()
           put_ldev()
             i = atomic_dec_return()  /* 0 */
             if (i == 0)
               if (state.disk == D_DISKLESS)            if (state.disk == D_DISKLESS)
                 drbd_ldev_destroy()                      drbd_ldev_destroy();
      
      Trying to fix this by checking the disk state *before* the
      atomic_dec_return(), which implies memory barriers, and by inserting
      extra memory barriers around the state assignment in __drbd_set_state().
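      A minimal user-space C analogue of the ordering the fix relies on; the struct,
      helpers, and main() below are illustrative stand-ins, not the actual drbd code:

        #include <stdatomic.h>
        #include <stdio.h>

        enum disk_state { D_FAILED, D_DISKLESS };

        struct dev {
                atomic_int local_cnt;              /* stand-in for mdev->local_cnt */
                _Atomic enum disk_state disk;      /* stand-in for mdev->state.disk */
        };

        static void ldev_destroy(struct dev *d) { puts("destroy ldev"); }
        static void go_diskless(struct dev *d)  { puts("schedule go_diskless"); }

        static void put_ldev(struct dev *d)
        {
                /* Snapshot the state *before* the decrement: the broken variant
                 * read it afterwards and could observe D_DISKLESS already set on
                 * another CPU, so both paths ended up destroying the ldev. */
                enum disk_state ds = atomic_load(&d->disk);
                int i = atomic_fetch_sub(&d->local_cnt, 1) - 1; /* like atomic_dec_return() */

                if (i == 0) {
                        if (ds == D_DISKLESS)
                                ldev_destroy(d);   /* last internal reference gone */
                        else if (ds == D_FAILED)
                                go_diskless(d);    /* last application reference gone */
                }
        }

        int main(void)
        {
                struct dev d = { .local_cnt = 1, .disk = D_FAILED };
                put_ldev(&d);                      /* prints "schedule go_diskless" */
                return 0;
        }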
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
    • drbd: fix resync finished detection · 5ab7d2c0
      Committed by Lars Ellenberg
      This fixes one recent regression,
      and one long-standing bug.
      
      The bug:
      drbd_try_clear_on_disk_bm() assumed that all "count" bits have to be
      accounted in the resync extent corresponding to the start sector.
      
      Since we allow application requests to cross our "extent" boundaries,
      this assumption is no longer true, resulting in possible misaccounting,
      scary messages
      ("BAD! sector=12345s enr=6 rs_left=-7 rs_failed=0 count=58 cstate=..."),
      and potentially, if the last bit to be cleared during resync would
      reside in a previously misaccounted resync extent, the resync would never
      be recognized as finished, but would be "stalled" forever, even though
      all blocks are in sync again and all bits have been cleared...
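      A minimal sketch of the per-extent accounting the fix requires; the extent
      size and helper names are illustrative, not drbd's real constants or code:

        #include <stdio.h>

        #define BITS_PER_EXTENT 4096u              /* illustrative, not drbd's value */

        static void account_cleared(unsigned ext, unsigned bits)
        {
                printf("extent %u: rs_left -= %u\n", ext, bits);
        }

        /* Split a cleared bit range along extent boundaries instead of charging
         * all "count" bits to the extent of the start sector. */
        static void clear_bits(unsigned start_bit, unsigned count)
        {
                while (count) {
                        unsigned ext   = start_bit / BITS_PER_EXTENT;
                        unsigned room  = BITS_PER_EXTENT - (start_bit % BITS_PER_EXTENT);
                        unsigned chunk = count < room ? count : room;

                        account_cleared(ext, chunk);
                        start_bit += chunk;
                        count     -= chunk;
                }
        }

        int main(void)
        {
                clear_bits(4090, 10);  /* straddles a boundary: 6 bits in extent 0, 4 in extent 1 */
                return 0;
        }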
      
      The regression was introduced by
          drbd: get rid of atomic update on disk bitmap works
      
      For an "empty" resync (rs_total == 0), we must not "finish" the
      resync on the SyncSource before the SyncTarget knows all relevant
      information (sync uuid).  We need to wait for the full round-trip,
      the SyncTarget will then explicitly notify us.
      
      Also for normal, non-empty resyncs (rs_total > 0), the resync-finished
      condition needs to be tested before the schedule() in wait_for_work, or
      it is likely to be missed.
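      A user-space analogue of that ordering: the "finished" condition is tested
      under the lock before blocking, so a completion that races in just before
      the wait cannot be missed (names below are illustrative, not drbd's):

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
        static bool work_pending;
        static bool resync_done;

        static void handle_resync_finished(void) { puts("resync finished"); }

        static void worker_wait_for_work(void)
        {
                pthread_mutex_lock(&lock);
                for (;;) {
                        if (resync_done) {                 /* tested before sleeping */
                                handle_resync_finished();
                                resync_done = false;
                        }
                        if (work_pending)
                                break;
                        pthread_cond_wait(&cond, &lock);   /* the schedule() analogue */
                }
                work_pending = false;
                pthread_mutex_unlock(&lock);
        }

        int main(void)
        {
                resync_done  = true;    /* completion raced in before the worker waits */
                work_pending = true;
                worker_wait_for_work(); /* handles the finished resync, does not stall */
                return 0;
        }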
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
  2. 10 July 2014 (1 commit)
  3. 01 May 2014 (2 commits)
  4. 17 February 2014 (18 commits)
  5. 28 June 2013 (1 commit)
  6. 29 March 2013 (3 commits)
  7. 22 January 2013 (1 commit)
    • drbd: fix potential protocol error and resulting disconnect/reconnect · 2681f7f6
      Committed by Lars Ellenberg
      When we notice a disk failure on the receiving side,
      we stop sending it new incoming writes.
      
      Depending on exact timing of various events, the same transfer log epoch
      could end up containing both replicated (before we noticed the failure)
      and local-only requests (after we noticed the failure).
      
      The sanity checks in tl_release(), called when receiving a
      P_BARRIER_ACK, check that the ack'ed transfer log epoch matches
      the expected epoch, and the number of contained writes matches
      the number of ack'ed writes.
      
      In this case, they counted both replicated and local-only writes,
      but the peer only acknowledges those it has seen.  We get a mismatch,
      resulting in a protocol error and disconnect/reconnect cycle.
      
      Messages logged are
        "BAD! BarrierAck #%u received with n_writes=%u, expected n_writes=%u!\n"
      
      A similar issue can also be triggered when starting a resync while
      having a healthy replication link, by invalidating one side, forcing a
      full sync, or attaching to a diskless node.
      
      Fix this by closing the current epoch if the state changes in a way
      that would change the replication intent of the next write.
      
      Epochs now contain either only non-replicated,
      or only replicated writes.
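      A minimal sketch of the resulting invariant; the helpers below are
      illustrative stand-ins, not drbd's actual epoch or tl_release() code:

        #include <stdbool.h>
        #include <stdio.h>

        struct epoch { unsigned n_writes; bool replicated; };

        static struct epoch open_epoch = { 0, true };

        static void close_epoch(void)
        {
                if (open_epoch.n_writes && open_epoch.replicated)
                        printf("send P_BARRIER, expect P_BARRIER_ACK for %u writes\n",
                               open_epoch.n_writes);
                open_epoch.n_writes = 0;
        }

        /* Called from the state-change path: if new writes will no longer (or
         * will again) be replicated, close the open epoch first, so each epoch
         * holds only one kind of write and the peer's ack count always matches. */
        static void replication_intent_changed(bool will_replicate)
        {
                if (open_epoch.replicated != will_replicate) {
                        close_epoch();
                        open_epoch.replicated = will_replicate;
                }
        }

        static void queue_write(void) { open_epoch.n_writes++; }

        int main(void)
        {
                queue_write();                        /* replicated write            */
                replication_intent_changed(false);    /* peer disk failed            */
                queue_write();                        /* local-only write, new epoch */
                close_epoch();
                return 0;
        }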
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
  8. 01 December 2012 (1 commit)
    • drbd: fixup after wait_even_lock_irq() addition to generic code · 2cecb730
      Committed by Jens Axboe
      Compiling drbd yields:
      
      drivers/block/drbd/drbd_state.c: In function ‘_conn_request_state’:
      drivers/block/drbd/drbd_state.c:1804:5: error: macro "wait_event_lock_irq" passed 4 arguments, but takes just 3
      drivers/block/drbd/drbd_state.c:1801:3: error: ‘wait_event_lock_irq’ undeclared (first use in this function)
      drivers/block/drbd/drbd_state.c:1801:3: note: each undeclared identifier is reported only once for each function it appears in
      drivers/block/drbd/drbd_state.c: At top level:
      drivers/block/drbd/drbd_state.c:1734:1: warning: ‘_conn_rq_cond’ defined but not used [-Wunused-function]
      
      This is because drbd had copied the old MD definition of
      wait_event_lock_irq() as well. Kill the private copies.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 09 November 2012 (11 commits)