1. 13 Feb 2019, 1 commit
  2. 15 Jan 2019, 1 commit
  3. 10 Jan 2019, 2 commits
  4. 09 Jan 2019, 1 commit
    • zram: idle writeback fixes and cleanup · 1d69a3f8
      Minchan Kim committed
      This patch includes some fixes and cleanup for idle-page writeback.
      
      1. writeback_limit interface
      
      The writeback_limit interface is currently rather confusing.  For example,
      once the writeback limit budget is exhausted, the admin sees 0 in
      /sys/block/zramX/writeback_limit, which is the same value used to disable
      the writeback limit.  IOW, the admin cannot tell whether the zero means
      the writeback limit is disabled or its budget is exhausted.
      
      To make the interface clear, let's separate enabling of the writeback limit
      into another knob - /sys/block/zram0/writeback_limit_enable
      
      * before:
        while true :
          # to re-enable writeback limit once previous one is used up
          echo 0 > /sys/block/zram0/writeback_limit
          echo $((200<<20)) > /sys/block/zram0/writeback_limit
          ..
          .. # used up the writeback limit budget
      
      * new
        # To use the writeback limit, the admin must enable it first.
        echo $((200<<20)) > /sys/block/zram0/writeback_limit
        echo 1 > /sys/block/zram0/writeback_limit_enable
        while true :
          echo $((200<<20)) > /sys/block/zram0/writeback_limit
          ..
          .. # used up the writeback limit budget
      
      It's much more straightforward.
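      
      For example, to check at runtime whether the limit is being enforced and
      how much budget remains (an illustrative sketch only, assuming just the
      two knobs described above and that reading writeback_limit reports the
      remaining budget):
      
        cat /sys/block/zram0/writeback_limit_enable   # 1 while the limit is enforced
        cat /sys/block/zram0/writeback_limit          # remaining writeback budget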
      
      2. fix condition check idle/huge writeback mode check
      
      The mode in writeback_store is not a bit operation any more, so there is
      no need to use bit operations.  Furthermore, the current condition check
      is broken in that it writes back every page regardless of huge/idle.
      
      3. clean up idle_store
      
      No need to use goto.
      
      [minchan@kernel.org: missed spin_lock_init]
        Link: http://lkml.kernel.org/r/20190103001601.GA255139@google.com
      Link: http://lkml.kernel.org/r/20181224033529.19450-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: John Dias <joaodias@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Srinivas Paladugu <srnvs@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d69a3f8
  5. 08 Jan 2019, 1 commit
  6. 07 Jan 2019, 1 commit
  7. 03 Jan 2019, 1 commit
  8. 01 Jan 2019, 4 commits
    • block/swim3: Fix regression on PowerBook G3 · 427c5ce4
      Finn Thain committed
      As of v4.20, the swim3 driver crashes when loaded on a PowerBook G3
      (Wallstreet).
      
      MacIO PCI driver attached to Gatwick chipset
      MacIO PCI driver attached to Heathrow chipset
      swim3 0.00015000:floppy: [fd0] SWIM3 floppy controller in media bay
      0.00013020:ch-a: ttyS0 at MMIO 0xf3013020 (irq = 16, base_baud = 230400) is a Z85c30 ESCC - Serial port
      0.00013000:ch-b: ttyS1 at MMIO 0xf3013000 (irq = 17, base_baud = 230400) is a Z85c30 ESCC - Infrared port
      macio: fixed media-bay irq on gatwick
      macio: fixed left floppy irqs
      swim3 1.00015000:floppy: [fd1] Couldn't request interrupt
      Unable to handle kernel paging request for data at address 0x00000024
      Faulting instruction address: 0xc02652f8
      Oops: Kernel access of bad area, sig: 11 [#1]
      BE SMP NR_CPUS=2 PowerMac
      Modules linked in:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.0 #2
      NIP:  c02652f8 LR: c026915c CTR: c0276d1c
      REGS: df43ba10 TRAP: 0300   Not tainted  (4.20.0)
      MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 28228288  XER: 00000100
      DAR: 00000024 DSISR: 40000000
      GPR00: c026915c df43bac0 df439060 c0731524 df494700 00000000 c06e1c08 00000001
      GPR08: 00000001 00000000 df5ff220 00001032 28228282 00000000 c0004ca4 00000000
      GPR16: 00000000 00000000 00000000 c073144c dfffe064 c0731524 00000120 c0586108
      GPR24: c073132c c073143c c073143c 00000000 c0731524 df67cd70 df494700 00000001
      NIP [c02652f8] blk_mq_free_rqs+0x28/0xf8
      LR [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
      Call Trace:
      [df43bac0] [c0045f50] flush_workqueue_prep_pwqs+0x178/0x1c4 (unreliable)
      [df43bae0] [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
      [df43bb00] [c02697f0] blk_mq_exit_sched+0x9c/0xb8
      [df43bb20] [c0252794] elevator_exit+0x84/0xa4
      [df43bb40] [c0256538] blk_exit_queue+0x30/0x50
      [df43bb50] [c0256640] blk_cleanup_queue+0xe8/0x184
      [df43bb70] [c034732c] swim3_attach+0x330/0x5f0
      [df43bbb0] [c034fb24] macio_device_probe+0x58/0xec
      [df43bbd0] [c032ba88] really_probe+0x1e4/0x2f4
      [df43bc00] [c032bd28] driver_probe_device+0x64/0x204
      [df43bc20] [c0329ac4] bus_for_each_drv+0x60/0xac
      [df43bc50] [c032b824] __device_attach+0xe8/0x160
      [df43bc80] [c032ab38] bus_probe_device+0xa0/0xbc
      [df43bca0] [c0327338] device_add+0x3d8/0x630
      [df43bcf0] [c0350848] macio_add_one_device+0x444/0x48c
      [df43bd50] [c03509f8] macio_pci_add_devices+0x168/0x1bc
      [df43bd90] [c03500ec] macio_pci_probe+0xc0/0x10c
      [df43bda0] [c02ad884] pci_device_probe+0xd4/0x184
      [df43bdd0] [c032ba88] really_probe+0x1e4/0x2f4
      [df43be00] [c032bd28] driver_probe_device+0x64/0x204
      [df43be20] [c032bfcc] __driver_attach+0x104/0x108
      [df43be40] [c0329a00] bus_for_each_dev+0x64/0xb4
      [df43be70] [c032add8] bus_add_driver+0x154/0x238
      [df43be90] [c032ca24] driver_register+0x84/0x148
      [df43bea0] [c0004aa0] do_one_initcall+0x40/0x188
      [df43bf00] [c0690100] kernel_init_freeable+0x138/0x1d4
      [df43bf30] [c0004cbc] kernel_init+0x18/0x10c
      [df43bf40] [c00121e4] ret_from_kernel_thread+0x14/0x1c
      Instruction dump:
      5484d97e 4bfff4f4 9421ffe0 7c0802a6 bf410008 7c9e2378 90010024 8124005c
      2f890000 419e0078 81230004 7c7c1b78 <81290024> 2f890000 419e0064 81440000
      ---[ end trace 12025ab921a9784c ]---
      
      Reverting commit 8ccb8cb1 ("swim3: convert to blk-mq") resolves the
      problem.
      
      That commit added a struct blk_mq_tag_set to struct floppy_state and
      initialized it with a blk_mq_init_sq_queue() call. Unfortunately, there
      is a memset() in swim3_add_device() that subsequently clears the
      floppy_state struct. That means fs->tag_set->ops is a NULL pointer, and
      it gets dereferenced by blk_mq_free_rqs() which gets called in the
      request_irq() error path. Move the memset() to fix this bug.
      
      BTW, the request_irq() failure for the left media bay floppy (fd1) is not
      a regression. I don't know why it happens. The right media bay floppy
      (fd0) works fine, however.
      Reported-and-tested-by: Stan Johnson <userm57@yahoo.com>
      Fixes: 8ccb8cb1 ("swim3: convert to blk-mq")
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      427c5ce4
    • block/swim3: Fix -EBUSY error when re-opening device after unmount · 296dcc40
      Finn Thain committed
      When the block device is opened with FMODE_EXCL, ref_count is set to -1.
      This value doesn't get reset when the device is closed, which means the
      device cannot be opened again. Fix this by checking for ref_count <= 0
      in the release method.
      Reported-and-tested-by: Stan Johnson <userm57@yahoo.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      296dcc40
    • block/swim3: Remove dead return statement · f3010ec5
      Finn Thain committed
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f3010ec5
    • block/amiflop: Don't log error message on invalid ioctl · d4d179c3
      Finn Thain committed
      Cc: linux-m68k@lists.linux-m68k.org
      Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d4d179c3
  9. 29 Dec 2018, 7 commits
    • zram: writeback throttle · bb416d18
      Minchan Kim committed
      If there is a lot of write IO to a flash device, it can cause a wearout
      problem for the storage.  To overcome the problem, the admin needs a way
      to limit writes so that flash health is guaranteed for the entire
      product life.
      
      This patch creates a new knob "writeback_limit" for zram.
      
      writeback_limit's default value is 0, so it does not limit any
      writeback.  If the admin wants to measure the writeback count over a
      certain period, it can be read from the 3rd column of
      /sys/block/zram0/bd_stat.
      
      If the admin wants to limit writeback to 400M per day, it can be done
      like below.
      
      	MB_SHIFT=20
      	PAGE_SHIFT=12
      	echo $((400<<MB_SHIFT>>PAGE_SHIFT)) > \
      		/sys/block/zram0/writeback_limit
      
      If the admin wants to allow further writeback again:
      
      	echo 0 > /sys/block/zram0/writeback_limit
      
      If the admin wants to see the remaining writeback budget:
      
      	cat /sys/block/zram0/writeback_limit
      
      The writeback_limit count is reset whenever zram is reset (e.g., system
      reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how much
      writeback happened before the reset, in order to allocate extra writeback
      budget in the next setting, is the user's job.
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-8-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb416d18
    • zram: add bd_stat statistics · 23eddf39
      Minchan Kim committed
      bd_stat represents things that happened in the backing device.  Currently
      it supports bd_count, bd_reads and bd_writes, which are helpful for
      understanding flash wearout and memory savings.
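      
      As an illustration only (a sketch assuming the three counters are printed
      as columns in the order listed above, which matches the "3rd column" note
      in the writeback throttle commit):
      
      	cat /sys/block/zram0/bd_stat
      	# e.g. "   2412     100     300" -> bd_count, bd_reads, bd_writes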
      
      [minchan@kernel.org: v4]
        Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23eddf39
    • zram: support idle/huge page writeback · a939888e
      Minchan Kim committed
      Add a new feature, "zram idle/huge page writeback".  In the zram-swap use
      case, zram usually has many idle/huge swap pages.  It's pointless to keep
      them in memory (i.e., in zram).
      
      To solve this problem, this feature introduces idle/huge page writeback to
      the backing device; the goal is to save more memory space on embedded
      systems.
      
      The normal sequence for using the idle/huge page writeback feature is as
      follows:
      
      while (1) {
              # mark all allocated zram slots as idle
              echo all > /sys/block/zram0/idle
              # leave the system working for several hours;
              # blocks on zram that see no access in that time
              # remain marked as IDLE pages
      
              echo "idle" > /sys/block/zram0/writeback
              and/or
              echo "huge" > /sys/block/zram0/writeback
              # write the IDLE and/or huge marked slots to the backing device
              # and free the memory
      }
      
      Per the discussion at
      https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
      this patch removes the direct incompressible page writeback feature
      (d2afd25114f4 ("zram: write incompressible pages to backing device")).
      
      Below are the concerns from Sergey:
      == &< ==
      
      "IDLE writeback" is superior to "incompressible writeback".
      
      "incompressible writeback" is completely unpredictable and uncontrollable;
      it depens on data patterns and compression algorithms.  While "IDLE
      writeback" is predictable.
      
      I even suspect, that, *ideally*, we can remove "incompressible writeback".
      "IDLE pages" is a super set which also includes "incompressible" pages.
      So, technically, we still can do "incompressible writeback" from "IDLE
      writeback" path; but a much more reasonable one, based on a page idling
      period.
      
      I understand that you want to keep "direct incompressible writeback"
      around.  ZRAM is especially popular on devices which do suffer from flash
      wearout, so I can see the "incompressible writeback" path becoming dead
      code, long term.
      
      == &< ==
      
      Below are the concerns from Minchan:
      == &< ==
      
      My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this
      implementation, both hugepage and idlepage writeback will turn on.
      However, some users want to enable only idlepage writeback, so we would
      need to introduce an on/off knob for hugepage writeback, or a new
      CONFIG_ZRAM_IDLEPAGE_WRITEBACK for that use case.  I don't want to make it
      complicated *if possible*.
      
      Long term, I imagine we need to make the VM aware of a new swap hierarchy,
      a little bit different from the current one.  For example, the first
      high-priority swap could return -EIO or -ENOCOMP, and swap would try to
      fall back to the next lower-priority swap device.  With that, hugepage
      writeback would work transparently.
      
      So we could regard it as a regression, because incompressible pages don't
      go to backing storage automatically.  Instead, the user should do it
      manually via "echo huge > /sys/block/zramX/writeback".
      
      == &< ==
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a939888e
    • zram: introduce ZRAM_IDLE flag · e82592c4
      Minchan Kim committed
      To support idle page writeback with upcoming patches, this patch
      introduces a new ZRAM_IDLE flag.
      
      Userspace can mark zram slots as "idle" via
      	"echo all > /sys/block/zramX/idle"
      which marks every allocated zram slot as ZRAM_IDLE.
      Users can check it via /sys/kernel/debug/zram/zram0/block_state:
      
                300    75.033841 ...i
                301    63.806904 s..i
                302    63.806919 ..hi
      
      Once there is IO for a slot, the idle mark disappears.
      
                300    75.033841 ...
                301    63.806904 s..i
                302    63.806919 ..hi
      
      So the 300th block, which was an idle zpage, lost its mark after the IO.
      With this feature, users can see how many idle pages zram holds, which
      are a waste of memory.
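      
      As an illustration only (a sketch assuming the block_state format shown
      above, where a trailing 'i' marks an idle slot), the number of idle pages
      can be counted with:
      
      	grep -c 'i$' /sys/kernel/debug/zram/zram0/block_state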
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-5-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e82592c4
    • zram: refactor flags and writeback stuff · 7e529283
      Minchan Kim committed
      Rename some variables and restructure some code for better readability in
      writeback and zs_free_page.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-4-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e529283
    • zram: fix double free backing device · 5547932d
      Minchan Kim committed
      If blkdev_get fails, we shouldn't do blkdev_put.  Otherwise, the kernel
      emits the log below.  This patch fixes it.
      
        WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120
        Modules linked in:
        CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        RIP: 0010:blkdev_put+0x105/0x120
        Call Trace:
          __x64_sys_swapoff+0x46d/0x490
          do_syscall_64+0x5a/0x190
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
        irq event stamp: 4466
        hardirqs last  enabled at (4465):  __free_pages_ok+0x1e3/0x490
        hardirqs last disabled at (4466):  trace_hardirqs_off_thunk+0x1a/0x1c
        softirqs last  enabled at (3420):  __do_softirq+0x333/0x446
        softirqs last disabled at (3407):  irq_exit+0xd1/0xe0
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-3-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5547932d
    • zram: fix lockdep warning of free block handling · 3c9959e0
      Minchan Kim committed
      Patch series "zram idle page writeback", v3.
      
      Inherently, a swap device has many idle pages which are rarely touched
      after they are allocated.  That is never a problem if we use a storage
      device as swap, but it is just a waste for zram-swap.
      
      This patchset supports zram idle page writeback feature.
      
      * Admin can define what an idle page is: "no access since X time ago"
      * Admin can define when zram should writeback them
      * Admin can define when zram should stop writeback to prevent wearout
      
      Details are in each patch's description.
      
      This patch (of 7):
      
        ================================
        WARNING: inconsistent lock state
        4.19.0+ #390 Not tainted
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
        00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
        {SOFTIRQ-ON-W} state was registered at:
          _raw_spin_lock+0x2c/0x40
          zram_make_request+0x755/0xdc9
          generic_make_request+0x373/0x6a0
          submit_bio+0x6c/0x140
          __swap_writepage+0x3a8/0x480
          shrink_page_list+0x1102/0x1a60
          shrink_inactive_list+0x21b/0x3f0
          shrink_node_memcg.constprop.99+0x4f8/0x7e0
          shrink_node+0x7d/0x2f0
          do_try_to_free_pages+0xe0/0x300
          try_to_free_pages+0x116/0x2b0
          __alloc_pages_slowpath+0x3f4/0xf80
          __alloc_pages_nodemask+0x2a2/0x2f0
          __handle_mm_fault+0x42e/0xb50
          handle_mm_fault+0x55/0xb0
          __do_page_fault+0x235/0x4b0
          page_fault+0x1e/0x30
        irq event stamp: 228412
        hardirqs last  enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
        hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
        softirqs last  enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
        softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&(&zram->bitmap_lock)->rlock);
          <Interrupt>
            lock(&(&zram->bitmap_lock)->rlock);
      
         *** DEADLOCK ***
      
        no locks held by zram_verify/2095.
      
        stack backtrace:
        CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x67/0x9b
         print_usage_bug+0x1bd/0x1d3
         mark_lock+0x4aa/0x540
         __lock_acquire+0x51d/0x1300
         lock_acquire+0x90/0x180
         _raw_spin_lock+0x2c/0x40
         put_entry_bdev+0x1e/0x50
         zram_free_page+0xf6/0x110
         zram_slot_free_notify+0x42/0xa0
         end_swap_bio_read+0x5b/0x170
         blk_update_request+0x8f/0x340
         scsi_end_request+0x2c/0x1e0
         scsi_io_completion+0x98/0x650
         blk_done_softirq+0x9e/0xd0
         __do_softirq+0xcc/0x427
         irq_exit+0xd1/0xe0
         do_IRQ+0x93/0x120
         common_interrupt+0xf/0xf
         </IRQ>
      
      With the writeback feature, zram_slot_free_notify could be called in
      softirq context by end_swap_bio_read.  However, bitmap_lock is not aware
      of that, so lockdep yells out:
      
        get_entry_bdev
        spin_lock(bitmap->lock);
        irq
        softirq
        end_swap_bio_read
        zram_slot_free_notify
        zram_slot_lock <-- deadlock prone
        zram_free_page
        put_entry_bdev
        spin_lock(bitmap->lock); <-- deadlock prone
      
      With akpm's suggestion (i.e. the bitmap operation is already atomic), we
      can remove the bitmap lock.  It might fail to find an empty slot if serious
      contention happens.  However, that is not a severe problem, because huge
      page writeback already has the possibility of failing under severe memory
      pressure.  The worst case is just keeping the incompressible page in
      memory, not on storage.
      
      The other problem is zram_slot_lock in zram_slot_free_notify.  To make it
      safe, this patch introduces zram_slot_trylock, which zram_slot_free_notify
      uses.  Although contention is rare, this patch adds a new debug stat,
      "miss_free", to keep monitoring how often it happens.
      
      Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c9959e0
  10. 23 Dec 2018, 2 commits
  11. 21 Dec 2018, 16 commits
    • drbd: Change drbd_request_detach_interruptible's return type to int · 5816a093
      Nathan Chancellor committed
      Clang warns when an implicit conversion is done between enumerated
      types:
      
      drivers/block/drbd/drbd_state.c:708:8: warning: implicit conversion from
      enumeration type 'enum drbd_ret_code' to different enumeration type
      'enum drbd_state_rv' [-Wenum-conversion]
                      rv = ERR_INTR;
                         ~ ^~~~~~~~
      
      drbd_request_detach_interruptible's only call site is in the return
      statement of adm_detach, which returns an int. Change the return type of
      drbd_request_detach_interruptible to match, silencing Clang's warning.
      Reported-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5816a093
    • drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire") · f31e583a
      Lars Ellenberg committed
      And also re-enable partial-zero-out + discard aligned.
      
      With the introduction of REQ_OP_WRITE_ZEROES,
      we started to use that for both WRITE_ZEROES and DISCARDS,
      hoping that WRITE_ZEROES would "do what we want",
      UNMAP if possible, zero-out the rest.
      
      The example scenario is some LVM "thin" backend.
      
      While an un-allocated block on dm-thin reads as zeroes, on a dm-thin
      with "skip_block_zeroing=true", after a partial block write allocated
      that block, that same block may well map "undefined old garbage" from
      the backends on LBAs that have not yet been written to.
      
      If we cannot distinguish between zero-out and discard on the receiving
      side, to avoid "undefined old garbage" to pop up randomly at later times
      on supposedly zero-initialized blocks, we'd need to map all discards to
      zero-out on the receiving side.  But that would potentially do a full
      alloc on thinly provisioned backends, even when the expectation was to
      unmap/trim/discard/de-allocate.
      
      We need to distinguish on the protocol level, whether we need to guarantee
      zeroes (and thus use zero-out, potentially doing the mentioned full-alloc),
      or if we want to put the emphasis on discard, and only do a "best effort
      zeroing" (by "discarding" blocks aligned to discard-granularity, and zeroing
      only potential unaligned head and tail clippings to at least *try* to
      avoid "false positives" in an online-verify later), hoping that someone
      set skip_block_zeroing=false.
      
      For some discussion regarding this on dm-devel, see also
      https://www.mail-archive.com/dm-devel%40redhat.com/msg07965.html
      https://www.redhat.com/archives/dm-devel/2018-January/msg00271.html
      
      For backward compatibility, P_TRIM means zero-out, unless the
      DRBD_FF_WZEROES feature flag is agreed upon during handshake.
      
      To have upper layers even try to submit WRITE ZEROES requests,
      we need to announce "efficient zeroout" independently.
      
      We need to fixup max_write_zeroes_sectors after blk_queue_stack_limits():
      if we can handle "zeroes" efficiently on the protocol,
      we want to do that, even if our backend does not announce
      max_write_zeroes_sectors itself.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f31e583a
    • drbd: skip spurious timeout (ping-timeo) when failing promote · 9848b6dd
      Lars Ellenberg committed
      If you try to promote a Secondary while connected to a Primary
      and allow-two-primaries is NOT set, we will wait for "ping-timeout"
      to give this node a chance to detect a dead primary,
      in case the cluster manager noticed faster than we did.
      
      But if we are then *still* connected to a Primary,
      we fail (after an additional timeout of ping-timeout).
      
      This change skips the spurious second timeout.
      
      Most people won't notice really,
      since "ping-timeout" by default is half a second.
      
      But in some installations, ping-timeout may be 10 or 20 seconds or more,
      and spuriously delaying the error return becomes annoying.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9848b6dd
    • drbd: don't retry connection if peers do not agree on "authentication" settings · 9049ccd4
      Lars Ellenberg committed
      emma: "Unexpected data packet AuthChallenge (0x0010)"
       ava: "expected AuthChallenge packet, received: ReportProtocol (0x000b)"
            "Authentication of peer failed, trying again."
      
      Pattern repeats.
      
      There is no point in retrying the handshake,
      if we expect to receive an AuthChallenge,
      but the peer is not even configured to expect or use a shared secret.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9049ccd4
    • drbd: fix print_st_err()'s prototype to match the definition · 2c38f035
      Luc Van Oostenryck committed
      print_st_err() is defined with its 4th argument taking an
      'enum drbd_state_rv', but its prototype uses an int for it.
      
      Fix this by using 'enum drbd_state_rv' in the prototype too.
      Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2c38f035
    • drbd: avoid spurious self-outdating with concurrent disconnect / down · be80ff88
      Lars Ellenberg committed
      If peers are "simultaneously" told to disconnect from each other,
      either explicitly, or implicitly by taking down the resource,
      with bad timing, one side may see its disconnect "fail" with
      a result of "state change failed by peer", and interpret this as
      "please outdate yourself".
      
      Try to catch this by checking the current connection status,
      and possibly retry as a local-only state change instead.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      be80ff88
    • drbd: do not block when adjusting "disk-options" while IO is frozen · f708bd08
      Lars Ellenberg committed
      "suspending" IO is overloaded.
      It can mean "do not allow new requests" (obviously),
      but it also may mean "must not complete pending IO",
      for example while the fencing handlers do their arbitration.
      
      When adjusting disk options, we suspend io (disallow new requests), then
      wait for the activity-log to become unused (drain all IO completions),
      and possibly replace it with a new activity log of different size.
      
      If the other "suspend IO" aspect is active, pending IO completions won't
      happen, and we would block forever (unkillable drbdsetup process).
      
      Fix this by skipping the activity log adjustment if the "al-extents"
      setting did not change. Also, in case it did change, fail early without
      blocking if it looks like we would block forever.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f708bd08
    • drbd: fix comment typos · a2823ea9
      Lars Ellenberg committed
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a2823ea9
    • drbd: reject attach of unsuitable uuids even if connected · fe43ed97
      Lars Ellenberg committed
      Multiple failure scenario:
      a) all good
         Connected Primary/Secondary UpToDate/UpToDate
      b) lose disk on Primary,
         Connected Primary/Secondary Diskless/UpToDate
      c) continue to write to the device,
         changes only make it to the Secondary storage.
      d) lose disk on Secondary,
         Connected Primary/Secondary Diskless/Diskless
      e) now try to re-attach on Primary
      
      This would have succeeded before, even though that is clearly the
      wrong data set to attach to (missing the modifications from c).
      Because we only compared our "effective" and the "to-be-attached"
      data generation uuid tags if (device->state.conn < C_CONNECTED).
      
      Fix: change that constraint to (device->state.pdsk != D_UP_TO_DATE)
      compare the uuids, and reject the attach.
      
      This patch also tries to improve the reverse scenario:
      first lose Secondary, then Primary disk,
      then try to attach the disk on Secondary.
      
      Before this patch, the attach on the Secondary succeeds, but since commit
      "drbd: disconnect, if the wrong UUIDs are attached on a connected peer",
      the Primary will notice unsuitable data, and drop the connection hard.
      
      Though unfortunately at a point in time during the handshake where
      we cannot easily abort the attach on the peer without more
      refactoring of the handshake.
      
      We now reject any attach to "unsuitable" uuids,
      as long as we can see a Primary role,
      unless we already have access to "good" data.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fe43ed97
    • drbd: attach on connected diskless peer must not shrink a consistent device · ad6e8979
      Lars Ellenberg committed
      If we would reject a new handshake because the peer had attached first
      and then connected, we should also force a disconnect if the peer first
      connects and only then attaches.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ad6e8979
    • drbd: fix confusing error message during attach · 4ef2a4f4
      Lars Ellenberg committed
      If we attach a (consistent) backing device,
      which knows about a last-agreed effective size,
      and that effective size is *larger* than the currently requested size,
      we refused to attach with ERR_DISK_TOO_SMALL
        Failure: (111) Low.dev. smaller than requested DRBD-dev. size.
      which is confusing to say the least.
      
      This patch changes the error code in that case to ERR_IMPLICIT_SHRINK
        Failure: (170) Implicit device shrinking not allowed. See kernel log.
        additional info from kernel:
        To-be-attached device has last effective > current size, and is consistent
        (9999 > 7777 sectors). Refusing to attach.
      
      It also allows attaching with an explicit size.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4ef2a4f4
    • drbd: disconnect, if the wrong UUIDs are attached on a connected peer · b17b5960
      Lars Ellenberg committed
      With "on-no-data-accessible suspend-io", DRBD requires the next attach
      or connect to be to the very same data generation uuid tag it lost last.
      
      If we first lost connection to the peer,
      then later lost connection to our own disk,
      we would usually refuse to re-connect to the peer,
      because it presents the wrong data set.
      
      However, if the peer first connects without a disk,
      and then attached its disk, we accepted that same wrong data set,
      which would be "unexpected" by any user of that DRBD
      and cause "undefined results" (read: very likely data corruption).
      
      The fix is to forcefully disconnect as soon as we notice that the peer
      attached to the "wrong" dataset.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b17b5960
    • drbd: ignore "all zero" peer volume sizes in handshake · 94c43a13
      Lars Ellenberg committed
      During handshake, if we are diskless ourselves, we used to accept any size
      presented by the peer.
      
      Which could be zero if that peer was just brought up and connected
      to us without having a disk attached first, in which case both
      peers would just "flip" their volume sizes.
      
      Now, even a diskless node will ignore "zero" sizes
      presented by a diskless peer.
      
      Also a currently Diskless Primary will refuse to shrink during handshake:
      it may be frozen, and waiting for a "suitable" local disk or peer to
      re-appear (on-no-data-accessible suspend-io). If the peer is smaller
      than what we used to be, it is not suitable.
      
      The logic for a diskless node during handshake is now supposed to be:
      believe the peer, if
       - I don't have a current size myself
       - we agree on the size anyways
       - I do have a current size, am Secondary, and he has the only disk
       - I do have a current size, am Primary, and he has the only disk,
         which is larger than my current size
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      94c43a13
    • drbd: centralize printk reporting of new size into drbd_set_my_capacity() · d5412e8d
      Lars Ellenberg committed
      Previously, some implicit resizes that happened during handshake
      were not reported as prominently as explicit resizes.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d5412e8d
    • drbd: narrow rcu_read_lock in drbd_sync_handshake · d29e89e3
      Roland Kammerer committed
      So far there was the possibility that we called
      genlmsg_new(GFP_NOIO)/mutex_lock() while holding an rcu_read_lock().
      
      This included cases like:
      
      drbd_sync_handshake (acquire the RCU lock)
        drbd_asb_recover_1p
          drbd_khelper
            drbd_bcast_event
              genlmsg_new(GFP_NOIO) --> may sleep
      
      drbd_sync_handshake (acquire the RCU lock)
        drbd_asb_recover_1p
          drbd_khelper
            notify_helper
              genlmsg_new(GFP_NOIO) --> may sleep
      
      drbd_sync_handshake (acquire the RCU lock)
        drbd_asb_recover_1p
          drbd_khelper
            notify_helper
              mutex_lock --> may sleep
      
      While using GFP_ATOMIC would have been possible in the first two cases,
      the real fix is to narrow the rcu_read_lock.
      Reported-by: Jia-Ju Bai <baijiaju1990@163.com>
      Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d29e89e3
  12. 20 Dec 2018, 1 commit
  13. 17 Dec 2018, 2 commits