1. 11 11月, 2017 24 次提交
    • M
    • J
      ede6507d
    • J
      dm cache policy smq: allocate cache blocks in order · 9768a10d
      Joe Thornber 提交于
      Previously, cache blocks were being allocated in reverse order.  Fix
      this by pulling the block off the head of the free list.
      
      Shouldn't have any impact on performance or latency but it is more
      correct to have the cache blocks allocated/mapped in ascending order.
      This fix will slightly increase the chances of two adjacent oblocks
      being in adjacent cblocks.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      9768a10d
    • J
      dm cache policy smq: change max background work from 10240 to 4096 blocks · 8ee18ede
      Joe Thornber 提交于
      10240 blocks was too much, lowering this reduces the latency of copying
      and consumes less memory.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      8ee18ede
    • J
      dm cache background tracker: limit amount of background work that may be issued at once · 64748b16
      Joe Thornber 提交于
      On large systems the cache policy can be over enthusiastic and queue far
      too much dirty data to be written back.  This consumes memory.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      64748b16
    • J
      dm cache policy smq: take origin idle status into account when queuing writebacks · deb71918
      Joe Thornber 提交于
      If the origin device is idle try and writeback more data.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      deb71918
    • J
      dm cache policy smq: handle races with queuing background_work · 1e72a8e8
      Joe Thornber 提交于
      The background_tracker holds a set of promotions/demotions that the
      cache policy wishes the core target to implement.
      
      When adding a new operation to the tracker it's possible that an
      operation on the same block is already present (but in practise this
      doesn't appear to be happening).  Catch these situations and do the
      appropriate cleanup.
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      1e72a8e8
    • H
      dm raid: fix panic when attempting to force a raid to sync · 23397844
      Heinz Mauelshagen 提交于
      Requesting a sync on an active raid device via a table reload
      (see 'sync' parameter in Documentation/device-mapper/dm-raid.txt)
      skips the super_load() call that defines the superblock size
      (rdev->sb_size) -- resulting in an oops if/when super_sync()->memset()
      is called.
      
      Fix by moving the initialization of the superblock start and size
      out of super_load() to the caller (analyse_superblocks).
      Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      23397844
    • M
      dm integrity: allow unaligned bv_offset · 95b1369a
      Mikulas Patocka 提交于
      When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
      this unaligned memory for its buffers (if an unaligned buffer crosses a
      page, XFS frees it and allocates a full page instead - see the function
      xfs_buf_allocate_memory).
      
      dm-integrity checks if bv_offset is aligned on page size and this check
      fail with slub_debug and XFS.
      
      Fix this bug by removing the bv_offset check, leaving only the check for
      bv_len.
      
      Fixes: 7eada909 ("dm: add integrity target")
      Cc: stable@vger.kernel.org # v4.12+
      Reported-by: NBruno Prémont <bonbons@sysophe.eu>
      Reviewed-by: NMilan Broz <gmazyland@gmail.com>
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      95b1369a
    • M
      dm crypt: allow unaligned bv_offset · 0440d5c0
      Mikulas Patocka 提交于
      When slub_debug is enabled kmalloc returns unaligned memory. XFS uses
      this unaligned memory for its buffers (if an unaligned buffer crosses a
      page, XFS frees it and allocates a full page instead - see the function
      xfs_buf_allocate_memory).
      
      dm-crypt checks if bv_offset is aligned on page size and these checks
      fail with slub_debug and XFS.
      
      Fix this bug by removing the bv_offset checks. Switch to checking if
      bv_len is aligned instead of bv_offset (this check should be sufficient
      to prevent overruns if a bio with too small bv_len is received).
      
      Fixes: 8f0009a2 ("dm crypt: optionally support larger encryption sector size")
      Cc: stable@vger.kernel.org # v4.12+
      Reported-by: NBruno Prémont <bonbons@sysophe.eu>
      Tested-by: NBruno Prémont <bonbons@sysophe.eu>
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NMilan Broz <gmazyland@gmail.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      0440d5c0
    • M
      dm: small cleanup in dm_get_md() · 49de5769
      Mike Snitzer 提交于
      Makes dm_get_md() and dm_get_from_kobject() have similar code.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      49de5769
    • H
      dm: fix race between dm_get_from_kobject() and __dm_destroy() · b9a41d21
      Hou Tao 提交于
      The following BUG_ON was hit when testing repeat creation and removal of
      DM devices:
      
          kernel BUG at drivers/md/dm.c:2919!
          CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
          Call Trace:
           [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
           [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
           [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
           [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
           [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
           [<ffffffff81199118>] seq_read+0x16f/0x325
           [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
           [<ffffffff8117b625>] __vfs_read+0x26/0x9d
           [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
           [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
           [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
           [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
           [<ffffffff8117c686>] SyS_read+0x4b/0x76
           [<ffffffff817b606e>] system_call_fastpath+0x12/0x71
      
      The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
      between the test of DMF_FREEING & DMF_DELETING and dm_get() in
      dm_get_from_kobject().
      
      To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
      dm_get() are done in an atomic way, so _minor_lock is used.
      
      The other callers of dm_get() have also been checked to be OK: some
      callers invoke dm_get() under _minor_lock, some callers invoke it under
      _hash_lock, and dm_start_request() invoke it after increasing
      md->open_count.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      b9a41d21
    • M
      dm: allocate struct mapped_device with kvzalloc · 856eb091
      Mikulas Patocka 提交于
      The structure srcu_struct can be very big, its size is proportional to the
      value CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS 8192, the field
      io_barrier in the struct mapped_device has 84kB in the debugging kernel
      and 50kB in the non-debugging kernel. The large size may result in failure
      of the function kzalloc_node.
      
      In order to avoid the allocation failure, we use the function
      kvzalloc_node, this function falls back to vmalloc if a large contiguous
      chunk of memory is not available. This patch also moves the field
      io_barrier to the last position of struct mapped_device - the reason is
      that on many processor architectures, short memory offsets result in
      smaller code than long memory offsets - on x86-64 it reduces code size by
      320 bytes.
      
      Note to stable kernel maintainers - the kernels 4.11 and older don't have
      the function kvzalloc_node, you can use the function vzalloc_node instead.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      856eb091
    • D
      dm zoned: ignore last smaller runt zone · 114e0259
      Damien Le Moal 提交于
      The SCSI layer allows ZBC drives to have a smaller last runt zone. For
      such a device, specifying the entire capacity for a dm-zoned target
      table entry fails because the specified capacity is not aligned on a
      device zone size indicated in the request queue structure of the
      device.
      
      Fix this problem by ignoring the last runt zone in the entry length
      when seting up the dm-zoned target (ctr method) and when iterating table
      entries of the target (iterate_devices method). This allows dm-zoned
      users to still easily setup a target using the entire device capacity
      (as mandated by dm-zoned) or the aligned capacity excluding the last
      runt zone.
      
      While at it, replace direct references to the device queue chunk_sectors
      limit with calls to the accessor blk_queue_zone_sectors().
      Reported-by: NPeter Desnoyers <pjd@ccs.neu.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      114e0259
    • J
      dm space map metadata: use ARRAY_SIZE · fbc61291
      Jérémy Lefaure 提交于
      Using the ARRAY_SIZE macro improves the readability of the code.
      
      Found with Coccinelle with the following semantic patch:
      @r depends on (org || report)@
      type T;
      T[] E;
      position p;
      @@
      (
       (sizeof(E)@p /sizeof(*E))
      |
       (sizeof(E)@p /sizeof(E[...]))
      |
       (sizeof(E)@p /sizeof(T))
      )
      Signed-off-by: NJérémy Lefaure <jeremy.lefaure@lse.epita.fr>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      fbc61291
    • R
      dm log writes: add support for DAX · 98d82f48
      Ross Zwisler 提交于
      Now that we have the ability log filesystem writes using a flat buffer, add
      support for DAX.
      
      The motivation for this support is the need for an xfstest that can test
      the new MAP_SYNC DAX flag.  By logging the filesystem activity with
      dm-log-writes we can show that the MAP_SYNC page faults are writing out
      their metadata as they happen, instead of requiring an explicit
      msync/fsync.
      
      Unfortunately we can't easily track data that has been written via
      mmap() now that the dax_flush() abstraction was removed by commit
      c3ca015f ("dax: remove the pmem_dax_ops->flush abstraction").
      Otherwise we could just treat each flush as a big write, and store the
      data that is being synced to media.  It may be worthwhile to add the
      dax_flush() entry point back, just as a notifier so we can do this
      logging.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      98d82f48
    • R
      dm log writes: add support for inline data buffers · e5a20660
      Ross Zwisler 提交于
      Currently dm-log-writes supports writing filesystem data via BIOs, and
      writing internal metadata from a flat buffer via write_metadata().
      
      For DAX writes, though, we won't have a BIO, but will instead have an
      iterator that we'll want to use to fill a flat data buffer.
      
      So, create write_inline_data() which allows us to write filesystem data
      using a flat buffer as a source, and wire it up in log_one_block().
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      e5a20660
    • M
      dm cache: simplify get_per_bio_data() by removing data_size argument · 693b960e
      Mike Snitzer 提交于
      There is only one per_bio_data size now that writethrough-specific data
      was removed from the per_bio_data structure.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      693b960e
    • M
      dm cache: remove all obsolete writethrough-specific code · 9958f1d9
      Mike Snitzer 提交于
      Now that the writethrough code is much simpler there is no need to track
      so much state or cascade bio submission (as was done, via
      writethrough_endio(), to issue origin then cache IO in series).
      
      As such the obsolete writethrough list and workqueue is also removed.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      9958f1d9
    • M
      dm cache: submit writethrough writes in parallel to origin and cache · 2df3bae9
      Mike Snitzer 提交于
      Discontinue issuing writethrough write IO in series to the origin and
      then cache.
      
      Use bio_clone_fast() to create a new origin clone bio that will be
      mapped to the origin device and then bio_chain() it to the bio that gets
      remapped to the cache device.  The origin clone bio does _not_ have a
      copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
      be called.
      
      The cache bio (parent bio) will not complete until the origin bio has
      completed -- this fulfills bio_clone_fast()'s requirements as well as
      the requirement to not complete the original IO until the write IO has
      completed to both the origin and cache device.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      2df3bae9
    • M
      dm cache: pass cache structure to mode functions · 8e3c3827
      Mike Snitzer 提交于
      No functional changes, just a bit cleaner than passing cache_features
      structure.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      8e3c3827
    • J
      dm cache: fix race condition in the writeback mode overwrite_bio optimisation · d1260e2a
      Joe Thornber 提交于
      When a DM cache in writeback mode moves data between the slow and fast
      device it can often avoid a copy if the triggering bio either:
      
      i) covers the whole block (no point copying if we're about to overwrite it)
      ii) the migration is a promotion and the origin block is currently discarded
      
      Prior to this fix there was a race with case (ii).  The discard status
      was checked with a shared lock held (rather than exclusive).  This meant
      another bio could run in parallel and write data to the origin, removing
      the discard state.  After the promotion the parallel write would have
      been lost.
      
      With this fix the discard status is re-checked once the exclusive lock
      has been aquired.  If the block is no longer discarded it falls back to
      the slower full copy path.
      
      Fixes: b29d4986 ("dm cache: significant rework to leverage dm-bio-prison-v2")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      d1260e2a
    • Z
      md: free unused memory after bitmap resize · 0868b99c
      Zdenek Kabelac 提交于
      When bitmap is resized, the old kalloced chunks just are not released
      once the resized bitmap starts to use new space.
      
      This fixes in particular kmemleak reports like this one:
      
      unreferenced object 0xffff8f4311e9c000 (size 4096):
        comm "lvm", pid 19333, jiffies 4295263268 (age 528.265s)
        hex dump (first 32 bytes):
          02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
          02 80 02 80 02 80 02 80 02 80 02 80 02 80 02 80  ................
        backtrace:
          [<ffffffffa69471ca>] kmemleak_alloc+0x4a/0xa0
          [<ffffffffa628c10e>] kmem_cache_alloc_trace+0x14e/0x2e0
          [<ffffffffa676cfec>] bitmap_checkpage+0x7c/0x110
          [<ffffffffa676d0c5>] bitmap_get_counter+0x45/0xd0
          [<ffffffffa676d6b3>] bitmap_set_memory_bits+0x43/0xe0
          [<ffffffffa676e41c>] bitmap_init_from_disk+0x23c/0x530
          [<ffffffffa676f1ae>] bitmap_load+0xbe/0x160
          [<ffffffffc04c47d3>] raid_preresume+0x203/0x2f0 [dm_raid]
          [<ffffffffa677762f>] dm_table_resume_targets+0x4f/0xe0
          [<ffffffffa6774b52>] dm_resume+0x122/0x140
          [<ffffffffa6779b9f>] dev_suspend+0x18f/0x290
          [<ffffffffa677a3a7>] ctl_ioctl+0x287/0x560
          [<ffffffffa677a693>] dm_ctl_ioctl+0x13/0x20
          [<ffffffffa62d6b46>] do_vfs_ioctl+0xa6/0x750
          [<ffffffffa62d7269>] SyS_ioctl+0x79/0x90
          [<ffffffffa6956d41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
      Signed-off-by: NZdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0868b99c
    • Z
      md: release allocated bitset sync_set · 0202ce8a
      Zdenek Kabelac 提交于
      Patch fixes kmemleak on md_stop() path used likely only by dm-raid wrapper.
      Code of md is using  mddev_put() where both bitsets are released however this
      freeing is not shared.
      
      Also set NULL to bio_set and sync_set pointers just like mddev_put is
      doing.
      Signed-off-by: NZdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0202ce8a
  2. 09 11月, 2017 2 次提交
    • H
      md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb · 97f0eb9f
      Hou Tao 提交于
      For a RAID1 device using a file-based bitmap, if a bitmap write error
      occurs but the later writes succeed, it's possible both BITMAP_STALE
      and BITMAP_WRITE_ERROR bits will be written to the bitmap super block,
      the BITMAP_STALE bit will be handled properly and be cleared, but the
      BITMAP_WRITE_ERROR bit in sb->flags will make bitmap_create() to fail.
      
      So clear it to protect against the write failure-and-then-recovery case.
      Signed-off-by: NHou Tao <houtao1@huawei.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      97f0eb9f
    • N
      md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3
      NeilBrown 提交于
      The ->recovery_offset shows how much of a non-InSync device is actually
      in sync - how much has been recoveryed.
      
      When performing a recovery, ->curr_resync and ->curr_resync_completed
      follow the device address being recovered and so can be used to update
      ->recovery_offset.
      
      When performing a reshape, ->curr_resync* might follow the device
      addresses (raid5) or might follow array addresses (raid10), so cannot
      in general be used to set ->recovery_offset.  When reshaping backwards,
      ->curre_resync* measures from the *end* of the array-or-device, so is
      particularly unhelpful.
      
      So change the common code in md.c to only use ->curr_resync_complete
      for the simple recovery case, and add code to raid5.c to update
      ->recovery_offset during a forwards reshape.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      db0505d3
  3. 03 11月, 2017 1 次提交
  4. 02 11月, 2017 13 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
    • A
      md: don't check MD_SB_CHANGE_CLEAN in md_allow_write · b90f6ff0
      Artur Paszkiewicz 提交于
      Only MD_SB_CHANGE_PENDING should be used to wait for transition from
      clean to dirty. Checking also MD_SB_CHANGE_CLEAN is unnecessary and can
      race with e.g. md_do_sync(). This sporadically causes a hang when
      changing consistency policy during resync:
      
      INFO: task mdadm:6183 blocked for more than 30 seconds.
            Not tainted 4.14.0-rc3+ #391
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdadm           D12752  6183   6022 0x00000000
      Call Trace:
       __schedule+0x93f/0x990
       schedule+0x6b/0x90
       md_allow_write+0x100/0x130 [md_mod]
       ? do_wait_intr_irq+0x90/0x90
       resize_stripes+0x3a/0x5b0 [raid456]
       ? kernfs_fop_write+0xbe/0x180
       raid5_change_consistency_policy+0xa6/0x200 [raid456]
       consistency_policy_store+0x2e/0x70 [md_mod]
       md_attr_store+0x90/0xc0 [md_mod]
       sysfs_kf_write+0x42/0x50
       kernfs_fop_write+0x119/0x180
       __vfs_write+0x28/0x110
       ? rcu_sync_lockdep_assert+0x12/0x60
       ? __sb_start_write+0x15a/0x1c0
       ? vfs_write+0xa3/0x1a0
       vfs_write+0xb4/0x1a0
       SyS_write+0x49/0xa0
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      Fixes: 2214c260 ("md: don't return -EAGAIN in md_allow_write for external metadata arrays")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b90f6ff0
    • G
      md-cluster: update document for raid10 · f0e230ad
      Guoqing Jiang 提交于
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f0e230ad
    • C
      md: remove redundant variable q · fc33060b
      Colin Ian King 提交于
      The pointer q is assigned but never read; it is redundant and can
      be removed.  Cleans up clang warning:
      
      drivers/md/md-multipath.c:260:4: warning: Value stored to 'q' is
      never read
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      fc33060b
    • G
      raid1: remove obsolete code in raid1_write_request · f81f7302
      Guoqing Jiang 提交于
      There are some lines could be removed due to recent
      change for raid1 such as commit 3956df15d634 ("md:
      move suspend_hi/lo handling into core md code").
      
      Also, seems some comments are put to wrong place,
      move them before wait_barrier.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f81f7302
    • G
      md-cluster: Use a small window for raid10 resync · 8db87912
      Guoqing Jiang 提交于
      Suspending the entire device for resync could take
      too long. Resync in small chunks.
      
      cluster's resync window is maintained in r10conf as
      cluster_sync_low and cluster_sync_high, and processed
      in raid10's sync_request(). If the current resync is
      outside the cluster resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Set cluster_sync_high to cluster_sync_low + stripe
         size.
      3. Send a message to all nodes so they may add it in
         their suspension list.
      
      Note:
      We only support "near" raid10 so far, resync a far or
      offset raid10 array could have trouble. So raid10_run
      checks the layout of clustered raid10, it will refuse
      to run if the layout is not correct.
      
      With the "near" layout we process one stripe at a time
      progressing monotonically through the address space.
      So we can have a sliding window of whole-stripes which
      moves through the array suspending IO on other nodes,
      and both resync which uses array addresses and recovery
      which uses device addresses can stay within this window.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      8db87912
    • G
      md-cluster: Suspend writes in RAID10 if within range · cb8a7a7e
      Guoqing Jiang 提交于
      If there is a resync going on, all nodes must suspend
      writes to the range. This is recorded in suspend_info
      and suspend_list.
      
      If there is an I/O within the ranges of any of the
      suspend_info, area_resyncing will return 1.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cb8a7a7e
    • G
      md-cluster/raid10: set "do_balance = 0" if area is resyncing · d4098c72
      Guoqing Jiang 提交于
      Just like clustered raid1, it is impossible for cluster raid10
      to choose the best device for read balance when the area of
      array is resyncing. Because we cannot trust the data to be the
      same on all devices at that time, so we choose just the first
      one to use, so set do_balance to 0.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d4098c72
    • S
      md: use lockdep_assert_held · efa4b77b
      Shaohua Li 提交于
      lockdep_assert_held is a better way to assert lock held, and it works
      for UP.
      Signed-off-by: NShaohua Li <shli@fb.com>
      efa4b77b
    • N
      raid1: prevent freeze_array/wait_all_barriers deadlock · f6eca2d4
      Nate Dailey 提交于
      If freeze_array is attempted in the middle of close_sync/
      wait_all_barriers, deadlock can occur.
      
      freeze_array will wait for nr_pending and nr_queued to line up.
      wait_all_barriers increments nr_pending for each barrier bucket, one
      at a time, but doesn't actually issue IO that could be counted in
      nr_queued. So freeze_array is blocked until wait_all_barriers
      completes and allow_all_barriers runs. At the same time, when
      _wait_barrier sees array_frozen == 1, it stops and waits for
      freeze_array to complete.
      
      Prevent the deadlock by making close_sync call _wait_barrier and
      _allow_barrier for one bucket at a time, instead of deferring the
      _allow_barrier calls until after all _wait_barriers are complete.
      Signed-off-by: NNate Dailey <nate.dailey@stratus.com>
      Fix: fd76863e(RAID1: a new I/O barrier implementation to remove resync window)
      Reviewed-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org (v4.11)
      Signed-off-by: NShaohua Li <shli@fb.com>
      f6eca2d4
    • M
      md: use TASK_IDLE instead of blocking signals · ae89fd3d
      Mikulas Patocka 提交于
      Hi - I submit this patch for the next merge window:
      
      Some times ago, I made a patch f9c79bc0 that blocks signals around the
      schedule() calls in MD. The MD subsystem needs to do an uninterruptible
      sleep that is not accounted in load average - so we block signals and use
      interruptible sleep.
      
      The kernel has a special TASK_IDLE state for this purpose, so we can use
      it instead of blocking signals. This patch doesn't fix any bug, it just
      makes the code simpler.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ae89fd3d
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
    • N
      md: allow metadata update while suspending. · 35bfc521
      NeilBrown 提交于
      There are various deadlocks that can occur
      when a thread holds reconfig_mutex and calls
      ->quiesce(mddev, 1).
      As some write request block waiting for
      metadata to be updated (e.g. to record device
      failure), and as the md thread updates the metadata
      while the reconfig mutex is held, holding the mutex
      can stop write requests completing, and this prevents
      ->quiesce(mddev, 1) from completing.
      
      ->quiesce() is now usually called from mddev_suspend(),
      and it is always called with reconfig_mutex held.  So
      at this time it is safe for the thread to update metadata
      without explicitly taking the lock.
      
      So add 2 new flags, one which says the unlocked updates is
      allowed, and one which ways it is happening.  Then allow it
      while the quiesce completes, and then wait for it to finish.
      Reported-and-tested-by: NXiao Ni <xni@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      35bfc521