1. 08 Aug, 2016 (2 commits)
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Committed by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portion. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
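      
      A minimal sketch of the post-rename usage, assuming the 4.8-era
      accessors (bio_set_op_attrs(), bio_op(), op_is_write()); the call
      site itself is illustrative:
      
        struct bio *bio = bio_alloc(GFP_NOIO, 1);
      
        /* pack op and flags into bio->bi_opf via the helper ... */
        bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
        /* ... and read the op portion back the same way */
        if (op_is_write(bio_op(bio)))
                submit_bio(bio);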
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1eff9d32
    • block/mm: make bdev_ops->rw_page() take a bool for read/write · c11f0c0b
      Committed by Jens Axboe
      Commit abf54548 changed it from an 'rw' flags type to the
      newer ops-based interface, but now we're effectively leaking
      some bdev internals to the rest of the kernel. Since we only
      care about whether it's a read or a write at that level, just
      pass in a bool 'is_write' parameter instead.
      
      Then we can also move op_is_write() and friends back under
      CONFIG_BLOCK protection.
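      
      The reworked hook and its two callers, sketched (4.8-era signature):
      
        struct block_device_operations {
                /* ... */
                int (*rw_page)(struct block_device *bdev, sector_t sector,
                               struct page *page, bool is_write);
        };
      
        /* bdev_read_page():  ops->rw_page(bdev, sector, page, false); */
        /* bdev_write_page(): ops->rw_page(bdev, sector, page, true);  */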
      Reviewed-by: Mike Christie <mchristi@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c11f0c0b
  2. 05 Aug, 2016 (5 commits)
    • mm/block: convert rw_page users to bio op use · abf54548
      Committed by Mike Christie
      The rw_page users were not converted to use bio/req ops. As a result
      bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
      be sent down as reads.
      Signed-off-by: Mike Christie <mchristi@redhat.com>
      Fixes: 4e1b2d52 ("block, fs, drivers: remove REQ_OP compat defs and related code")
      
      Modified by me to:
      
      1) Drop op_flags passing into ->rw_page(), as we don't use it.
      2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK (sketched below)
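      
      A hedged sketch of item 2: the helper moves out of the CONFIG_BLOCK-only
      region of the headers so that mm code compiles either way (the body shown
      is the 4.8-era one, from memory):
      
        /* visible even when CONFIG_BLOCK is not set */
        static inline bool op_is_write(unsigned int op)
        {
                return op != REQ_OP_READ;
        }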
      Signed-off-by: Jens Axboe <axboe@fb.com>
      abf54548
    • loop: make do_req_filebacked more robust · c1c87c2b
      Committed by Christoph Hellwig
      Use a switch statement to dispatch on the possible operations and
      error out on an unexpected one.
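      
      The pattern this introduces, sketched from the 4.8-era loop driver
      (helper names are the driver's own; the exact case bodies differ):
      
        switch (req_op(rq)) {
        case REQ_OP_FLUSH:
                return lo_req_flush(lo, rq);
        case REQ_OP_DISCARD:
                return lo_discard(lo, rq, pos);
        case REQ_OP_WRITE:
                return lo_write_simple(lo, rq, pos);
        case REQ_OP_READ:
                return lo_read_simple(lo, rq, pos);
        default:
                WARN_ON_ONCE(1);
                return -EIO;
        }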
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c1c87c2b
    • loop: don't try to use AIO for discards · f0225cac
      Committed by Christoph Hellwig
      Fix a fat-fingered conversion to the req_op accessors, and also
      use a switch statement to make it more obvious what is being checked.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Fixes: c2df40 ("drivers: use req op accessor")
      Reviewed-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f0225cac
    • nbd: fix race in ioctl · 97240963
      Committed by Vegard Nossum
      Quentin ran into this bug:
      
      WARNING: CPU: 64 PID: 10085 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x65/0x80
      sysfs: cannot create duplicate filename '/devices/virtual/block/nbd3/pid'
      Modules linked in: nbd
      CPU: 64 PID: 10085 Comm: qemu-nbd Tainted: G      D         4.6.0+ #7
       0000000000000000 ffff8820330bba68 ffffffff814b8791 ffff8820330bbac8
       0000000000000000 ffff8820330bbab8 ffffffff810d04ab ffff8820330bbaa8
       0000001f00000296 0000000000017681 ffff8810380bf000 ffffffffa0001790
      Call Trace:
       [<ffffffff814b8791>] dump_stack+0x4d/0x6c
       [<ffffffff810d04ab>] __warn+0xdb/0x100
       [<ffffffff810d0574>] warn_slowpath_fmt+0x44/0x50
       [<ffffffff81218c65>] sysfs_warn_dup+0x65/0x80
       [<ffffffff81218a02>] sysfs_add_file_mode_ns+0x172/0x180
       [<ffffffff81218a35>] sysfs_create_file_ns+0x25/0x30
       [<ffffffff81594a76>] device_create_file+0x36/0x90
       [<ffffffffa0000e8d>] __nbd_ioctl+0x32d/0x9b0 [nbd]
       [<ffffffff814cc8e8>] ? find_next_bit+0x18/0x20
       [<ffffffff810f7c29>] ? select_idle_sibling+0xe9/0x120
       [<ffffffff810f6cd7>] ? __enqueue_entity+0x67/0x70
       [<ffffffff810f9bf0>] ? enqueue_task_fair+0x630/0xe20
       [<ffffffff810efa76>] ? resched_curr+0x36/0x70
       [<ffffffff810f0078>] ? check_preempt_curr+0x78/0x90
       [<ffffffff810f00a2>] ? ttwu_do_wakeup+0x12/0x80
       [<ffffffff810f01b1>] ? ttwu_do_activate.constprop.86+0x61/0x70
       [<ffffffff810f0c15>] ? try_to_wake_up+0x185/0x2d0
       [<ffffffff810f0d6d>] ? default_wake_function+0xd/0x10
       [<ffffffff81105471>] ? autoremove_wake_function+0x11/0x40
       [<ffffffffa0001577>] nbd_ioctl+0x67/0x94 [nbd]
       [<ffffffff814ac0fd>] blkdev_ioctl+0x14d/0x940
       [<ffffffff811b0da2>] ? put_pipe_info+0x22/0x60
       [<ffffffff811d96cc>] block_ioctl+0x3c/0x40
       [<ffffffff811ba08d>] do_vfs_ioctl+0x8d/0x5e0
       [<ffffffff811aa329>] ? ____fput+0x9/0x10
       [<ffffffff810e9092>] ? task_work_run+0x72/0x90
       [<ffffffff811ba627>] SyS_ioctl+0x47/0x80
       [<ffffffff8185f5df>] entry_SYSCALL_64_fastpath+0x17/0x93
      ---[ end trace 7899b295e4f850c8 ]---
      
      It seems fairly obvious that device_create_file() is not being protected
      from being run concurrently on the same nbd.
      
      Quentin found the following relevant commits:
      
      1a2ad211 nbd: add locking to nbd_ioctl
      90b8f282 [PATCH] end of methods switch: remove the old ones
      d4430d62 [PATCH] beginning of methods conversion
      08f85851 [PATCH] move block_device_operations to blkdev.h
      
      It would seem that the race was introduced in the process of moving nbd
      from BKL to unlocked ioctls.
      
      By setting nbd->task_recv while the mutex is held, we can prevent other
      processes from running concurrently (since nbd->task_recv is also checked
      while the mutex is held).
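      
      A sketch of that idea, assuming the era's nbd fields (nbd->tx_lock,
      nbd->task_recv): the device is claimed while serialized, so a second
      NBD_DO_IT bails out instead of racing into device_create_file():
      
        /* inside the ioctl, with nbd->tx_lock already held */
        if (nbd->task_recv)
                return -EBUSY;          /* NBD_DO_IT already running */
        nbd->task_recv = current;       /* claim while still serialized */
      
        /* only the winner reaches the sysfs file creation */
        device_create_file(disk_to_dev(nbd->disk), &pid_attr);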
      Reported-and-tested-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Markus Pargmann <mpa@pengutronix.de>
      Cc: Paul Clements <paul.clements@steeleye.com>
      Cc: Pavel Machek <pavel@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      97240963
    • floppy: fix open(O_ACCMODE) for ioctl-only open · ff06db1e
      Committed by Jiri Kosina
      Commit 09954bad ("floppy: refactor open() flags handling"), as a
      side-effect, causes open(/dev/fdX, O_ACCMODE) to fail. It turns out that
      this is being used by setfdprm in userspace for ioctl-only open().
      
      Reintroduce the original behavior with respect to !(FMODE_READ|FMODE_WRITE)
      modes, while still keeping the original O_NDELAY bug fixed.
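      
      A sketch of the restored floppy_open() logic (reconstructed; the flag
      names are the driver's own): media checks run only for opens that
      actually request read or write access, so an ioctl-only O_ACCMODE open
      succeeds again:
      
        if (mode & (FMODE_READ | FMODE_WRITE)) {
                UDRS->last_checked = 0;
                clear_bit(FD_OPEN_SHOULD_FAIL_BIT, &UDRS->flags);
                check_disk_change(bdev);
                if (test_bit(FD_DISK_CHANGED_BIT, &UDRS->flags))
                        goto out;
                if (test_bit(FD_OPEN_SHOULD_FAIL_BIT, &UDRS->flags))
                        goto out;       /* keeps the O_NDELAY fix intact */
        }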
      
      Cc: stable@vger.kernel.org # v4.5+
      Reported-by: Wim Osterholt <wim@djo.tudelft.nl>
      Tested-by: Wim Osterholt <wim@djo.tudelft.nl>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ff06db1e
  3. 03 Aug, 2016 (1 commit)
  4. 28 Jul, 2016 (2 commits)
  5. 27 Jul, 2016 (7 commits)
    • zram: use __GFP_MOVABLE for memory allocation · 9bc482d3
      Committed by Minchan Kim
      Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE from
      now on.
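      
      The allocation-side shape of the change, sketched (zram_drv.c call
      site; the flags other than __GFP_MOVABLE are the driver's
      pre-existing ones):
      
        handle = zs_malloc(meta->mem_pool, clen,
                           __GFP_KSWAPD_RECLAIM | __GFP_NOWARN |
                           __GFP_HIGHMEM | __GFP_MOVABLE);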
      
      I ran a test to see how it helps create higher-order pages.  The test
      scenario is as follows.
      
      KVM guest, 1G memory, ext4-formatted zram block device,
      
        for i in `seq 1 8`;
        do
                dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
        done
      
        wait `pidof dd`
      
        for i in `seq 1 2 8`;
        do
                rm -rf mnt/test$i.txt
        done
        fstrim -v mnt
      
        echo "init"
        cat /proc/buddyinfo
      
        echo "compaction"
        echo 1 > /proc/sys/vm/compact_memory
        cat /proc/buddyinfo
      
      old:
      
        init
        Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
        Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
        compaction
        Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
        Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0
      
      new:
      
        init
        Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
        Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
        compaction
        Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
        Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0
      
      As you can see, compaction made so many high-order pages. Yay!
      
      Link: http://lkml.kernel.org/r/1464736881-24886-13-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9bc482d3
    • zram: drop gfp_t from zcomp_strm_alloc() · 16d37725
      Committed by Sergey Senozhatsky
      We now allocate streams from the CPU_UP hot-plug path; there are no
      context-dependent stream allocations anymore, so we can schedule from
      zcomp_strm_alloc().  Use GFP_KERNEL directly and drop the gfp_t parameter.
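      
      A minimal sketch of the simplified allocator (zcomp internals reduced
      to the allocation itself):
      
        static struct zcomp_strm *zcomp_strm_alloc(struct zcomp *comp)
        {
                /* CPU hotplug context: a sleeping GFP_KERNEL is always fine */
                return kzalloc(sizeof(struct zcomp_strm), GFP_KERNEL);
        }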
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16d37725
    • zram: add more compression algorithms · eb9f56d8
      Committed by Sergey Senozhatsky
      Add "deflate", "lz4hc", "842" algorithms to the list of known
      compression backends.  The real availability of those algorithms,
      however, depends on the corresponding CONFIG_CRYPTO_FOO config options.
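      
      The known-backends list then looks like this (sketch of zcomp.c;
      runtime availability is still gated by crypto_has_comp()):
      
        static const char * const backends[] = {
                "lzo",
                "lz4",
                "deflate",
                "lz4hc",
                "842",
                NULL
        };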
      
      [sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb9f56d8
    • zram: delete custom lzo/lz4 · ce1ed9f9
      Committed by Sergey Senozhatsky
      Remove the lzo/lz4 backends; we use the crypto API now.
      
      [sergey.senozhatsky@gmail.com: zram-delete-custom-lzo-lz4-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-6-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-7-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce1ed9f9
    • zram: use crypto api to check alg availability · 415403be
      Committed by Sergey Senozhatsky
      There is no way to get a string with all the crypto comp algorithms
      supported by the crypto comp engine, so we need to maintain our own
      backends list.  At the same time we additionally need to use
      crypto_has_comp() to make sure that the user has requested a compression
      algorithm that is recognized by the crypto comp engine.  Relying on
      /proc/crypto is not an option here, because it does not show
      not-yet-inserted compression modules.
      
      Example:
      
       modprobe zram
       cat /proc/crypto | grep -i lz4
       modprobe lz4
       cat /proc/crypto | grep -i lz4
      name         : lz4
      driver       : lz4-generic
      module       : lz4
      
      So the user can't tell exactly whether lz4 is really supported from the
      /proc/crypto output, unless someone or something has loaded it.
      
      This patch also adds crypto_has_comp() to zcomp_available_show().  We
      store all the compression algorithm names in zcomp's `backends' array,
      regardless of the CONFIG_CRYPTO_FOO configuration, but show only those
      that are also supported by the crypto engine.  This helps the user know
      the exact list of compression algorithms that can be used.
      
      Example:
        module lz4 is not loaded yet, but is supported by the crypto
        engine. /proc/crypto has no information on this module, while
        zram's `comp_algorithm' lists it:
      
       cat /proc/crypto | grep -i lz4
      
       cat /sys/block/zram0/comp_algorithm
      [lzo] lz4 deflate lz4hc 842
      
      We still use the `backends' array to determine whether the requested
      compression backend is known to the crypto API.  This array, however,
      may not contain some entries; therefore, as the last step, we call the
      crypto_has_comp() function, which attempts to load (insmod) the
      requested compression algorithm to determine whether the crypto API
      supports it.  The advantage of this method is that we now permit the
      use of out-of-tree crypto compression modules (implementing S/W or H/W
      compression).
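      
      A sketch of the resulting filter (crypto_has_comp() is the real crypto
      API call; the loop around it is illustrative):
      
        for (i = 0; backends[i]; i++) {
                if (!crypto_has_comp(backends[i], 0, 0))
                        continue;       /* not known to the crypto engine */
                /* print the name, [bracketing] the currently selected one */
        }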
      
      [sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
        Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
      Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      415403be
    • zram: switch to crypto compress API · ebaf9ab5
      Committed by Sergey Senozhatsky
      We don't have an idle zstreams list anymore and our write path now works
      absolutely differently, preventing preemption during compression.  This
      removes the possibility of read paths preempting writes at the wrong
      places (which could badly affect the performance of both paths) and at
      the same time opens the door for a move from the custom LZO/LZ4
      compression backend implementations to a more generic one, using the
      crypto compress API.
      
      Joonsoo Kim [1] attempted to do this a while ago, but was faced with
      the need to introduce a new crypto API interface.  The root cause was
      the fact that crypto API compression algorithms require a compression
      stream structure (in zram terminology) for both compression and
      decompression ops, while in reality only a few compression algorithms
      really need it.  This resulted in a concept of context-less crypto API
      compression backends [2].  Both write and read paths, though, would
      have been executed with preemption enabled, which in the worst case
      could have decreased performance, e.g.  consider the following case:
      
      	CPU0
      
      	zram_write()
      	  spin_lock()
      	    take the last idle stream
      	  spin_unlock()
      
      	<< preempted >>
      
      		zram_read()
      		  spin_lock()
      		   no idle streams
      			  spin_unlock()
      			  schedule()
      
      	resuming zram_write compression()
      
      but it took me some time to realize that, and it took even longer to
      evolve zram and to make it ready for the crypto API.  The key turned
      out to be to drop the idle streams list entirely.  Without the idle
      streams list we are free to use compression algorithms that require a
      compression stream for decompression (read), because streams are now
      placed in per-cpu data and each write path has to disable preemption
      for the compression op, almost completely eliminating the
      aforementioned case (technically, we still have a small chance, because
      the write path has a fast and a slow path, and the slow path is
      executed with preemption enabled; but the frequency of the failed fast
      path is too low).
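      
      The per-cpu stream model described above, sketched (zcomp's 4.8 shape,
      simplified):
      
        struct zcomp_strm *zstrm = *get_cpu_ptr(comp->stream);  /* preemption off */
        ret = zcomp_compress(zstrm, src, &comp_len);  /* can't be preempted by a read */
        put_cpu_ptr(comp->stream);                    /* preemption back on */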
      
      TEST
      ====
      
      - 4 CPUs, x86_64 system
      - 3G zram, lzo
      - fio tests: read, randread, write, randwrite, rw, randrw
      
      test script [3] command:
       ZRAM_SIZE=3G LOG_SUFFIX=XXXX FIO_LOOPS=5 ./zram-fio-test.sh
      
                         BASE           PATCHED
      jobs1
      READ:           2527.2MB/s	 2482.7MB/s
      READ:           2102.7MB/s	 2045.0MB/s
      WRITE:          1284.3MB/s	 1324.3MB/s
      WRITE:          1080.7MB/s	 1101.9MB/s
      READ:           430125KB/s	 437498KB/s
      WRITE:          430538KB/s	 437919KB/s
      READ:           399593KB/s	 403987KB/s
      WRITE:          399910KB/s	 404308KB/s
      jobs2
      READ:           8133.5MB/s	 7854.8MB/s
      READ:           7086.6MB/s	 6912.8MB/s
      WRITE:          3177.2MB/s	 3298.3MB/s
      WRITE:          2810.2MB/s	 2871.4MB/s
      READ:           1017.6MB/s	 1023.4MB/s
      WRITE:          1018.2MB/s	 1023.1MB/s
      READ:           977836KB/s	 984205KB/s
      WRITE:          979435KB/s	 985814KB/s
      jobs3
      READ:           13557MB/s	 13391MB/s
      READ:           11876MB/s	 11752MB/s
      WRITE:          4641.5MB/s	 4682.1MB/s
      WRITE:          4164.9MB/s	 4179.3MB/s
      READ:           1453.8MB/s	 1455.1MB/s
      WRITE:          1455.1MB/s	 1458.2MB/s
      READ:           1387.7MB/s	 1395.7MB/s
      WRITE:          1386.1MB/s	 1394.9MB/s
      jobs4
      READ:           20271MB/s	 20078MB/s
      READ:           18033MB/s	 17928MB/s
      WRITE:          6176.8MB/s	 6180.5MB/s
      WRITE:          5686.3MB/s	 5705.3MB/s
      READ:           2009.4MB/s	 2006.7MB/s
      WRITE:          2007.5MB/s	 2004.9MB/s
      READ:           1929.7MB/s	 1935.6MB/s
      WRITE:          1926.8MB/s	 1932.6MB/s
      jobs5
      READ:           18823MB/s	 19024MB/s
      READ:           18968MB/s	 19071MB/s
      WRITE:          6191.6MB/s	 6372.1MB/s
      WRITE:          5818.7MB/s	 5787.1MB/s
      READ:           2011.7MB/s	 1981.3MB/s
      WRITE:          2011.4MB/s	 1980.1MB/s
      READ:           1949.3MB/s	 1935.7MB/s
      WRITE:          1940.4MB/s	 1926.1MB/s
      jobs6
      READ:           21870MB/s	 21715MB/s
      READ:           19957MB/s	 19879MB/s
      WRITE:          6528.4MB/s	 6537.6MB/s
      WRITE:          6098.9MB/s	 6073.6MB/s
      READ:           2048.6MB/s	 2049.9MB/s
      WRITE:          2041.7MB/s	 2042.9MB/s
      READ:           2013.4MB/s	 1990.4MB/s
      WRITE:          2009.4MB/s	 1986.5MB/s
      jobs7
      READ:           21359MB/s	 21124MB/s
      READ:           19746MB/s	 19293MB/s
      WRITE:          6660.4MB/s	 6518.8MB/s
      WRITE:          6211.6MB/s	 6193.1MB/s
      READ:           2089.7MB/s	 2080.6MB/s
      WRITE:          2085.8MB/s	 2076.5MB/s
      READ:           2041.2MB/s	 2052.5MB/s
      WRITE:          2037.5MB/s	 2048.8MB/s
      jobs8
      READ:           20477MB/s	 19974MB/s
      READ:           18922MB/s	 18576MB/s
      WRITE:          6851.9MB/s	 6788.3MB/s
      WRITE:          6407.7MB/s	 6347.5MB/s
      READ:           2134.8MB/s	 2136.1MB/s
      WRITE:          2132.8MB/s	 2134.4MB/s
      READ:           2074.2MB/s	 2069.6MB/s
      WRITE:          2087.3MB/s	 2082.4MB/s
      jobs9
      READ:           19797MB/s	 19994MB/s
      READ:           18806MB/s	 18581MB/s
      WRITE:          6878.7MB/s	 6822.7MB/s
      WRITE:          6456.8MB/s	 6447.2MB/s
      READ:           2141.1MB/s	 2154.7MB/s
      WRITE:          2144.4MB/s	 2157.3MB/s
      READ:           2084.1MB/s	 2085.1MB/s
      WRITE:          2091.5MB/s	 2092.5MB/s
      jobs10
      READ:           19794MB/s	 19784MB/s
      READ:           18794MB/s	 18745MB/s
      WRITE:          6984.4MB/s	 6676.3MB/s
      WRITE:          6532.3MB/s	 6342.7MB/s
      READ:           2150.6MB/s	 2155.4MB/s
      WRITE:          2156.8MB/s	 2161.5MB/s
      READ:           2106.4MB/s	 2095.6MB/s
      WRITE:          2109.7MB/s	 2098.4MB/s
      
                                          BASE                       PATCHED
      jobs1                              perfstat
      stalled-cycles-frontend     102,480,595,419 (  41.53%)	  114,508,864,804 (  46.92%)
      stalled-cycles-backend       51,941,417,832 (  21.05%)	   46,836,112,388 (  19.19%)
      instructions                283,612,054,215 (    1.15)	  283,918,134,959 (    1.16)
      branches                     56,372,560,385 ( 724.923)	   56,449,814,753 ( 733.766)
      branch-misses                   374,826,000 (   0.66%)	      326,935,859 (   0.58%)
      jobs2                              perfstat
      stalled-cycles-frontend     155,142,745,777 (  40.99%)	  164,170,979,198 (  43.82%)
      stalled-cycles-backend       70,813,866,387 (  18.71%)	   66,456,858,165 (  17.74%)
      instructions                463,436,648,173 (    1.22)	  464,221,890,191 (    1.24)
      branches                     91,088,733,902 ( 760.088)	   91,278,144,546 ( 769.133)
      branch-misses                   504,460,363 (   0.55%)	      394,033,842 (   0.43%)
      jobs3                              perfstat
      stalled-cycles-frontend     201,300,397,212 (  39.84%)	  223,969,902,257 (  44.44%)
      stalled-cycles-backend       87,712,593,974 (  17.36%)	   81,618,888,712 (  16.19%)
      instructions                642,869,545,023 (    1.27)	  644,677,354,132 (    1.28)
      branches                    125,724,560,594 ( 690.682)	  126,133,159,521 ( 694.542)
      branch-misses                   527,941,798 (   0.42%)	      444,782,220 (   0.35%)
      jobs4                              perfstat
      stalled-cycles-frontend     246,701,197,429 (  38.12%)	  280,076,030,886 (  43.29%)
      stalled-cycles-backend      119,050,341,112 (  18.40%)	  110,955,641,671 (  17.15%)
      instructions                822,716,962,127 (    1.27)	  825,536,969,320 (    1.28)
      branches                    160,590,028,545 ( 688.614)	  161,152,996,915 ( 691.068)
      branch-misses                   650,295,287 (   0.40%)	      550,229,113 (   0.34%)
      jobs5                              perfstat
      stalled-cycles-frontend     298,958,462,516 (  38.30%)	  344,852,200,358 (  44.16%)
      stalled-cycles-backend      137,558,742,122 (  17.62%)	  129,465,067,102 (  16.58%)
      instructions              1,005,714,688,752 (    1.29)	1,007,657,999,432 (    1.29)
      branches                    195,988,773,962 ( 697.730)	  196,446,873,984 ( 700.319)
      branch-misses                   695,818,940 (   0.36%)	      624,823,263 (   0.32%)
      jobs6                              perfstat
      stalled-cycles-frontend     334,497,602,856 (  36.71%)	  387,590,419,779 (  42.38%)
      stalled-cycles-backend      163,539,365,335 (  17.95%)	  152,640,193,639 (  16.69%)
      instructions              1,184,738,177,851 (    1.30)	1,187,396,281,677 (    1.30)
      branches                    230,592,915,640 ( 702.902)	  231,253,802,882 ( 702.356)
      branch-misses                   747,934,786 (   0.32%)	      643,902,424 (   0.28%)
      jobs7                              perfstat
      stalled-cycles-frontend     396,724,684,187 (  37.71%)	  460,705,858,952 (  43.84%)
      stalled-cycles-backend      188,096,616,496 (  17.88%)	  175,785,787,036 (  16.73%)
      instructions              1,364,041,136,608 (    1.30)	1,366,689,075,112 (    1.30)
      branches                    265,253,096,936 ( 700.078)	  265,890,524,883 ( 702.839)
      branch-misses                   784,991,589 (   0.30%)	      729,196,689 (   0.27%)
      jobs8                              perfstat
      stalled-cycles-frontend     440,248,299,870 (  36.92%)	  509,554,793,816 (  42.46%)
      stalled-cycles-backend      222,575,930,616 (  18.67%)	  213,401,248,432 (  17.78%)
      instructions              1,542,262,045,114 (    1.29)	1,545,233,932,257 (    1.29)
      branches                    299,775,178,439 ( 697.666)	  300,528,458,505 ( 694.769)
      branch-misses                   847,496,084 (   0.28%)	      748,794,308 (   0.25%)
      jobs9                              perfstat
      stalled-cycles-frontend     506,269,882,480 (  37.86%)	  592,798,032,820 (  44.43%)
      stalled-cycles-backend      253,192,498,861 (  18.93%)	  233,727,666,185 (  17.52%)
      instructions              1,721,985,080,913 (    1.29)	1,724,666,236,005 (    1.29)
      branches                    334,517,360,255 ( 694.134)	  335,199,758,164 ( 697.131)
      branch-misses                   873,496,730 (   0.26%)	      815,379,236 (   0.24%)
      jobs10                             perfstat
      stalled-cycles-frontend     549,063,363,749 (  37.18%)	  651,302,376,662 (  43.61%)
      stalled-cycles-backend      281,680,986,810 (  19.07%)	  277,005,235,582 (  18.55%)
      instructions              1,901,859,271,180 (    1.29)	1,906,311,064,230 (    1.28)
      branches                    369,398,536,153 ( 694.004)	  370,527,696,358 ( 688.409)
      branch-misses                   967,929,335 (   0.26%)	      890,125,056 (   0.24%)
      
                                  BASE           PATCHED
      seconds elapsed        79.421641008	78.735285546
      seconds elapsed        61.471246133	60.869085949
      seconds elapsed        62.317058173	62.224188495
      seconds elapsed        60.030739363	60.081102518
      seconds elapsed        74.070398362	74.317582865
      seconds elapsed        84.985953007	85.414364176
      seconds elapsed        97.724553255	98.173311344
      seconds elapsed        109.488066758	110.268399318
      seconds elapsed        122.768189405	122.967164498
      seconds elapsed        135.130035105	136.934770801
      
      On my other system (8 x86_64 CPUs, short version of test results):
      
                                  BASE           PATCHED
      seconds elapsed        19.518065994	19.806320662
      seconds elapsed        15.172772749	15.594718291
      seconds elapsed        13.820925970	13.821708564
      seconds elapsed        13.293097816	14.585206405
      seconds elapsed        16.207284118	16.064431606
      seconds elapsed        17.958376158	17.771825767
      seconds elapsed        19.478009164	19.602961508
      seconds elapsed        21.347152811	21.352318709
      seconds elapsed        24.478121126	24.171088735
      seconds elapsed        26.865057442	26.767327618
      
      So performance-wise the numbers are quite similar.
      
      Also update the zcomp interface to be more aligned with the crypto API.
      
      [1] http://marc.info/?l=linux-kernel&m=144480832108927&w=2
      [2] http://marc.info/?l=linux-kernel&m=145379613507518&w=2
      [3] https://github.com/sergey-senozhatsky/zram-perf-test
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-3-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebaf9ab5
    • zram: rename zstrm find-release functions · 2aea8493
      Committed by Sergey Senozhatsky
      This started as 'add zlib support' work, but after some thinking I saw
      no blockers for a bigger change: a switch to the crypto API.
      
      We don't have an idle zstreams list anymore and our write path now works
      absolutely differently, preventing preemption during compression.  This
      removes possibilities of read paths preempting writes at wrong places
      and opens the door for a move from custom LZO/LZ4 compression backends
      implementation to a more generic one, using crypto compress API.
      
      This patch set also eliminates the need for a new context-less crypto
      API interface, which was quite hard to sell, so we can move along faster.
      
      benchmarks:
      
      (x86_64, 4GB, zram-perf script)
      
      perf reported the fio run-times (max jobs=3).  I performed fio tests
      with an increasing number of parallel jobs (up to 3) on a 3G zram
      device, using `static' data and the following crypto comp algorithms:
      
      	842, deflate, lz4, lz4hc, lzo
      
      the output was:
      
       - test running time (which can tell us which algorithms perform faster)
      
      and
      
       - zram mm_stat (which tells the compressed memory size, max used memory, etc).
      
      It's just for information.  For example, LZ4HC has twice the running
      time of LZO, but the compressed memory size is 23592960 vs 34603008
      bytes.
      
        test-fio-zram-842
           197.907655282 seconds time elapsed
           201.623142884 seconds time elapsed
           226.854291345 seconds time elapsed
        test-fio-zram-DEFLATE
           253.259516155 seconds time elapsed
           258.148563401 seconds time elapsed
           290.251909365 seconds time elapsed
        test-fio-zram-LZ4
            27.022598717 seconds time elapsed
            29.580522717 seconds time elapsed
            33.293463430 seconds time elapsed
        test-fio-zram-LZ4HC
            56.393954615 seconds time elapsed
            74.904659747 seconds time elapsed
           101.940998564 seconds time elapsed
        test-fio-zram-LZO
            28.155948075 seconds time elapsed
            30.390036330 seconds time elapsed
            34.455773159 seconds time elapsed
      
      zram mm_stat-s (max fio jobs=3)
      
        test-fio-zram-842
        mm_stat (jobs1): 3221225472 673185792 690266112        0 690266112        0        0
        mm_stat (jobs2): 3221225472 673185792 690266112        0 690266112        0        0
        mm_stat (jobs3): 3221225472 673185792 690266112        0 690266112        0        0
        test-fio-zram-DEFLATE
        mm_stat (jobs1): 3221225472  24379392  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  24379392  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  24379392  37761024        0  37761024        0        0
        test-fio-zram-LZ4
        mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
        test-fio-zram-LZ4HC
        mm_stat (jobs1): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs2): 3221225472  23592960  37761024        0  37761024        0        0
        mm_stat (jobs3): 3221225472  23592960  37761024        0  37761024        0        0
        test-fio-zram-LZO
        mm_stat (jobs1): 3221225472  34603008  50335744        0  50335744        0        0
        mm_stat (jobs2): 3221225472  34603008  50335744        0  50335744        0        0
        mm_stat (jobs3): 3221225472  34603008  50335744        0  50339840        0        0
      
      This patch (of 8):
      
      We don't perform any zstream idle-list lookup anymore, so the
      zcomp_strm_find()/zcomp_strm_release() names are no longer representative.
      
      Rename to zcomp_stream_get()/zcomp_stream_put().
      
      Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2aea8493
  6. 22 Jul, 2016 (4 commits)
  7. 21 Jul, 2016 (4 commits)
  8. 13 Jul, 2016 (2 commits)
  9. 30 Jun, 2016 (1 commit)
    • xen-blkfront: save uncompleted reqs in blkfront_resume() · 7b427a59
      Committed by Bob Liu
      Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover()
      during migration, but that's too late now that multi-queue has been
      introduced.
      
      After a migration to another host (which may not have multiqueue
      support), the number of rings (block hardware queues) may change, and
      the ring and shadow structures will also be reallocated.
      
      blkfront_recover() then can't 'save and resubmit' the real uncompleted
      reqs, because the shadow structures have been reallocated.
      
      This patch fixes the issue by moving the 'save' logic out of
      blkfront_recover() to an earlier point, in blkfront_resume().
      
      The 'resubmit' logic is unchanged and still lives in blkfront_recover().
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: stable@vger.kernel.org
      7b427a59
  10. 29 Jun, 2016 (1 commit)
  11. 28 Jun, 2016 (1 commit)
    • block: convert to device_add_disk() · 0d52c756
      Committed by Dan Williams
      For block drivers that specify a parent device, convert them to use
      device_add_disk().
      
      This conversion was done with the following semantic patch:
      
          @@
          struct gendisk *disk;
          expression E;
          @@
      
          - disk->driverfs_dev = E;
          ...
          - add_disk(disk);
          + device_add_disk(E, disk);
      
          @@
          struct gendisk *disk;
          expression E1, E2;
          @@
      
          - disk->driverfs_dev = E1;
          ...
          E2 = disk;
          ...
          - add_disk(E2);
          + device_add_disk(E1, E2);
      
      ...plus some manual fixups for a few missed conversions.
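      
      The before/after at a typical call site (device and variable names
      illustrative):
      
          /* before */
          disk->driverfs_dev = &pdev->dev;
          add_disk(disk);
      
          /* after */
          device_add_disk(&pdev->dev, disk);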
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      0d52c756
  12. 25 Jun, 2016 (1 commit)
    • tree wide: get rid of __GFP_REPEAT for order-0 allocations part I · 32d6bd90
      Committed by Michal Hocko
      This is the third version of the patchset previously sent [1].  I have
      basically only rebased it on top of the 4.7-rc1 tree and dropped "dm:
      get rid of superfluous gfp flags", which went through the dm tree.  I
      am sending it now because it is tree-wide and the chances of conflicts
      are reduced considerably when we target rc2.  I plan to send the next
      step, renaming the flag and moving to a better semantic, later during
      this release cycle, so we will hopefully have the new semantic ready
      for the 4.8 merge window.
      
      Motivation:
      
      While working on something unrelated I've checked the current usage of
      __GFP_REPEAT in the tree.  It seems that the majority of the usage is
      and always has been bogus, because __GFP_REPEAT has always been about
      costly high-order allocations, while we are very often using it for
      order-0 or very small orders.  It seems that a big pile of them is just
      copy&paste from when code was adapted from one arch to another.
      
      I think it makes some sense to get rid of them because they just make
      the semantics more unclear.  Please note that __GFP_REPEAT is
      documented as:
      
        * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
        * _might_ fail.  This depends upon the particular VM implementation.
      
      while !costly requests have basically nofail semantics.  So one could
      reasonably expect that an order-0 request with __GFP_REPEAT will not
      loop forever.  This is not implemented right now though.
      
      I would like to move on with __GFP_REPEAT and define a better semantic
      for it.
      
        $ git grep __GFP_REPEAT origin/master | wc -l
        111
        $ git grep __GFP_REPEAT | wc -l
        36
      
      So we are down to about a third after this patch series.  The remaining
      places really seem to be relying on __GFP_REPEAT due to large
      allocation requests.  They still need some double-checking, which I
      will do later, after all the simple ones are sorted out.
      
      I am touching a lot of arch-specific code here, and I hope I got it
      right, but as a matter of fact I didn't even compile-test some archs,
      as I do not have cross compilers for them.  The patches should be quite
      trivial to review for stupid compile mistakes, though.  The tricky
      parts are usually hidden by macro definitions, and that's where I would
      appreciate help from the arch maintainers.
      
      [1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
      
      This patch (of 19):
      
      __GFP_REPEAT has rather weak semantics, but since it was introduced
      around 2.6.12 it has been ignored for low-order allocations.  Yet the
      full kernel tree carries its usage for apparently order-0 allocations.
      This is really confusing, because __GFP_REPEAT is explicitly documented
      to allow allocation failures, which is a weaker semantic than what
      order-0 currently has (basically nofail).
      
      Let's simply drop __GFP_REPEAT from those places.  This will allow us
      to identify the places which really need the allocator to retry harder,
      and to formulate a more specific semantic for what the flag is actually
      supposed to do.
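      
      The shape of each conversion (call site illustrative):
      
        /* before: __GFP_REPEAT on an order-0 allocation buys nothing */
        pte = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
      
        /* after: order-0 already has effectively nofail semantics */
        pte = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);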
      
      Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Crispin <blogic@openwrt.org>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32d6bd90
  13. 14 Jun, 2016 (9 commits)
    • drbd: correctly handle failed crypto_alloc_hash · 1b57e663
      Committed by Lars Ellenberg
      crypto_alloc_hash returns an ERR_PTR(), not NULL.
      
      Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
      on an errno in the cleanup path.
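      
      A sketch of the corrected error handling (simplified; peer_integrity_tfm
      is the field named above):
      
        peer_integrity_tfm = crypto_alloc_hash(integrity_alg, 0, CRYPTO_ALG_ASYNC);
        if (IS_ERR(peer_integrity_tfm)) {       /* ERR_PTR(), never NULL */
                peer_integrity_tfm = NULL;      /* so cleanup won't free an errno */
                goto disconnect;
        }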
      Reported-by: Insu Yun <wuninsu@gmail.com>
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1b57e663
    • drbd: al_write_transaction: skip re-scanning of bitmap page pointer array · 27ea1d87
      Committed by Lars Ellenberg
      For larger devices, the array of bitmap page pointers can grow very
      large (8000 pointers per TB of storage).
      
      For each activity log transaction, we need to flush the associated
      bitmap pages to stable storage. Currently, we just "mark" the respective
      pages while setting up the transaction, then tell the bitmap code to
      write out all marked pages, but skip unchanged pages.
      
      But one such transaction affects only a small number of bitmap pages,
      so there is no need to scan the full array of several (tens of)
      thousand page pointers to find the few marked ones.
      
      Instead, remember the index numbers of the few affected pages,
      and later only re-check those to skip duplicates and unchanged ones.
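      
      A sketch of the bookkeeping (array name, size, and the write-out helper
      are illustrative, not drbd's actual identifiers):
      
        unsigned int n_hot = 0;
        unsigned int hot_idx[64];       /* one transaction touches few pages */
      
        /* while setting up the transaction: */
        hot_idx[n_hot++] = bm_page_nr;
      
        /* when flushing, instead of scanning all ~8000 pointers per TB: */
        for (i = 0; i < n_hot; i++)
                write_bm_page_if_changed(device, hot_idx[i]);   /* hypothetical */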
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      27ea1d87
    • drbd: finally report ms, not jiffies, in log message · 13c2088d
      Committed by Lars Ellenberg
      Also skip the message unless bitmap IO took longer than 5 ms.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      13c2088d
    • drbd: get rid of empty statement in is_valid_state · 4e526a00
      Committed by Roland Kammerer
      This should silence a warning about an empty statement.  Thanks to
      Fabian Frederick <fabf@skynet.be>, who sent a patch that I modified to
      be smaller and to avoid an additional indent level.
      Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4e526a00
    • drbd: code cleanups without semantic changes · 7e5fec31
      Committed by Fabian Frederick
      This contains various cosmetic fixes ranging from simple typos to
      const-ifying, and using booleans properly.
      
      Original commit messages from Fabian's patch set:
      drbd: debugfs: constify drbd_version_fops
      drbd: use seq_put instead of seq_print where possible
      drbd: include linux/uaccess.h instead of asm/uaccess.h
      drbd: use const char * const for drbd strings
      drbd: kerneldoc warning fix in w_e_end_data_req()
      drbd: use unsigned for one bit fields
      drbd: use bool for peer is_ states
      drbd: fix typo
      drbd: use | for bitmask combination
      drbd: use true/false for bool
      drbd: fix drbd_bm_init() comments
      drbd: introduce peer state union
      drbd: fix maybe_pull_ahead() locking comments
      drbd: use bool for growing
      drbd: remove redundant declarations
      drbd: replace if/BUG by BUG_ON
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7e5fec31
    • drbd: bump current uuid when resuming IO with diskless peer · 20004e24
      Committed by Lars Ellenberg
      Scenario, starting with normal operation
       Connected Primary/Secondary UpToDate/UpToDate
       NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
       ... more failures happen, secondary loses its disk,
       but eventually is able to re-establish the replication link ...
       Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)
      
      We used to just resume/resend the suspended requests,
      without bumping the UUID.
      
      That will lead to problems later, when we want to re-attach the disk on
      the peer without first disconnecting, or if we experience additional
      failures, because we now have diverging data without being able to
      recognize it.
      
      Make sure we also bump the current data generation UUID,
      if we notice "peer disk unknown" -> "peer disk known bad".
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      20004e24
    • drbd: disallow promotion during resync handshake, avoid deadlock and hard reset · 31d64604
      Committed by Lars Ellenberg
      We already serialize connection state changes, and other,
      non-connection state changes (role changes) while we are establishing
      a connection.
      
      But if we have an established connection and then trigger a resync
      handshake (by primary --force or similar), until now we just had to be
      "lucky".
      
      Consider this sequence (e.g. deployment scenario):
      create-md; up;
        -> Connected Secondary/Secondary Inconsistent/Inconsistent
      then do a racy primary --force on both peers.
      
       block drbd0: drbd_sync_handshake:
       block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
       block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
        *** HERE things go wrong. ***
       block drbd0: role( Secondary -> Primary )
       block drbd0: drbd_sync_handshake:
       block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: Becoming sync target due to disk states.
       block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
       block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
       drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
      
      The problem here is that the local promotion happens before the sync
      handshake triggered by the remote promotion has completed.  Some
      assumptions elsewhere become wrong, and when the expected resync
      handshake is then received and processed, we get stuck in a deadlock,
      which can only be recovered by reboot :-(
      
      Fix: if we know the peer has good data, and our own disk is present but
      NOT good, and there is no resync going on yet, we expect a sync
      handshake to happen "soon", so reject a racy promotion with
      SS_IN_TRANSIENT_STATE (sketched below).
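      
      A sketch of the refusal condition (predicate names illustrative; the
      real check sits in drbd's state validation):
      
        if (peer_disk_state == D_UP_TO_DATE &&
            own_disk_attached && own_disk_state < D_UP_TO_DATE &&
            !resync_running)
                return SS_IN_TRANSIENT_STATE;   /* retried after the handshake */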
      
      Result:
       ... as above ...
       block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
        *** local promotion being postponed until ... ***
       block drbd0: drbd_sync_handshake:
       block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
       block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
        ...
       block drbd0: conn( WFBitMapT -> WFSyncUUID )
       block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
       block drbd0: conn( WFSyncUUID -> SyncTarget )
        *** ... after the resync handshake ***
       block drbd0: role( Secondary -> Primary )
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      31d64604
    • drbd: sync_handshake: handle identical uuids with current (frozen) Primary · f2d3d75b
      Committed by Lars Ellenberg
      If, in a two-primary scenario, we lose our peer, freeze IO, and are
      still frozen (no UUID rotation) when the peer comes back as Secondary
      after a hard crash, we will see identical UUIDs.
      
      The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
      arbitration, but that would cause the still running (but frozen) Primary
      to become SyncTarget (which it typically refuses), and the handshake is
      declined.
      
      Fix: check current roles.
      If we have *one* current primary, the Primary wins.
      (rule_nr = 41)
      
      Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
      to determine if rule_nr = 41 can be applied.
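      
      A sketch of rule 41's arbitration (the return convention follows drbd's
      uuid-compare helpers: 1 = become sync source, -1 = sync target; field
      names illustrative):
      
        *rule_nr = 41;
        if (connection->agreed_features & DRBD_FF_WSAME) {
                if (own_role == R_PRIMARY && peer_role != R_PRIMARY)
                        return 1;       /* the one current Primary wins */
                if (peer_role == R_PRIMARY && own_role != R_PRIMARY)
                        return -1;
        }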
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f2d3d75b
    • drbd: introduce WRITE_SAME support · 9104d31a
      Committed by Lars Ellenberg
      We will support WRITE_SAME if
       * all peers support WRITE_SAME (both in kernel and DRBD version),
       * all peer devices support WRITE_SAME, and
       * logical_block_size is identical on all peers.
      
      We may at some point introduce a fallback on the receiving side
      for devices/kernels that do not support WRITE_SAME,
      by open-coding a submit loop. But not yet.
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9104d31a