1. 14 2月, 2017 9 次提交
    • S
      md/raid5-cache: stripe reclaim only counts valid stripes · e8fd52ee
      Shaohua Li 提交于
      When log space is tight, we try to reclaim stripes from log head. There
      are stripes which can't be reclaimed right now if some conditions are
      met. We skip such stripes but accidentally count them, which might cause
      no stripes are claimed. Fixing this by only counting valid stripes.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      e8fd52ee
    • S
      MD: add doc for raid5-cache · 5a6265f9
      Shaohua Li 提交于
      I'm starting document of the raid5-cache feature. Please note this is a
      kernel doc instead of a mdadm manual, so I don't add the details about
      how to use the feature in mdadm side.
      
      Cc: NeilBrown <neilb@suse.com>
      Reviewed-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a6265f9
    • S
      Documentation: move MD related doc into a separate dir · 1601c590
      Shaohua Li 提交于
      Signed-off-by: NShaohua Li <shli@fb.com>
      1601c590
    • N
      md: ensure md devices are freed before module is unloaded. · 9356863c
      NeilBrown 提交于
      Commit: cbd19983 ("md: Fix unfortunate interaction with evms")
      change mddev_put() so that it would not destroy an md device while
      ->ctime was non-zero.
      
      Unfortunately, we didn't make sure to clear ->ctime when unloading
      the module, so it is possible for an md device to remain after
      module unload.  An attempt to open such a device will trigger
      an invalid memory reference in:
        get_gendisk -> kobj_lookup -> exact_lock -> get_disk
      
      when tring to access disk->fops, which was in the module that has
      been removed.
      
      So ensure we clear ->ctime in md_exit(), and explain how that is useful,
      as it isn't immediately obvious when looking at the code.
      
      Fixes: cbd19983 ("md: Fix unfortunate interaction with evms")
      Tested-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      9356863c
    • S
      md/r5cache: improve journal device efficiency · 39b99586
      Song Liu 提交于
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need reserve some space on the journal device for
      these flushes. If flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
      for the flush out. This reduces the efficiency of journal space.
      If we exclude these pending writes from flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      39b99586
    • S
      md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu 提交于
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB-page,
      so this chunk contain 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. We count how many
      stripes of this big_stripe are in the write back cache. These
      counters are tracked in a radix tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This look up is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding new flag, we reuses existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags are set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      03b047f4
    • S
      EXPORT_SYMBOL radix_tree_replace_slot · 10257d71
      Song Liu 提交于
      It will be used in drivers/md/raid5-cache.c
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      10257d71
    • S
      raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Shaohua Li 提交于
      We made raid5 stripe handling multi-thread before. It works well for
      SSD. But for harddisk, the multi-threading creates more disk seek, so
      not always improve performance. For several hard disks based raid5,
      multi-threading is required as raid5d becames a bottleneck especially
      for sequential write.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is harddisk based. Other threads can still handle stripes, but
      can't dispatch IO.
      
      Idealy, we should control IO dispatching order according to IO position
      interrnally. Right now we still depend on block layer, which isn't very
      efficient sometimes though.
      
      My setup has 9 harddisks, each disk can do around 180M/s sequential
      write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
      write. The test machine uses an ATOM CPU. I measure sequential write
      with large iodepth bandwidth to raid array:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      iodepth case. The performance gap of small iodepth sequential write
      between software raid and theory value is still very big though, because
      we don't have an efficient pipeline.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      765d704d
    • C
      md linear: fix a race between linear_add() and linear_congested() · 03a9e24e
      colyli@suse.de 提交于
      Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
      disk to a md linear device causes kernel crash at linear_congested(). From
      the crash image analysis, I find in linear_congested(), mddev->raid_disks
      contains value N, but conf->disks[] only has N-1 pointers available. Then
      a NULL pointer deference crashes the kernel.
      
      There is a race between linear_add() and linear_congested(), RCU stuffs
      used in these two functions cannot avoid the race. Since Linuv v4.0
      RCU code is replaced by introducing mddev_suspend().  After checking the
      upstream code, it seems linear_congested() is not called in
      generic_make_request() code patch, so mddev_suspend() cannot provent it
      from being called. The possible race still exists.
      
      Here I explain how the race still exists in current code.  For a machine
      has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
      md linear device; at the same time on other CPU, linear_congested() is
      called to detect whether this md linear device is congested before issuing
      an I/O request onto it.
      
      Now I use a possible code execution time sequence to demo how the possible
      race happens,
      
      seq    linear_add()                linear_congested()
       0                                 conf=mddev->private
       1   oldconf=mddev->private
       2   mddev->raid_disks++
       3                              for (i=0; i<mddev->raid_disks;i++)
       4                                bdev_get_queue(conf->disks[i].rdev->bdev)
       5   mddev->private=newconf
      
      In linear_add() mddev->raid_disks is increased in time seq 2, and on
      another CPU in linear_congested() the for-loop iterates conf->disks[i] by
      the increased mddev->raid_disks in time seq 3,4. But conf with one more
      element (which is a pointer to struct dev_info type) to conf->disks[] is
      not updated yet, accessing its structure member in time seq 4 will cause a
      NULL pointer deference fault.
      
      To fix this race, there are 2 parts of modification in the patch,
       1) Add 'int raid_disks' in struct linear_conf, as a copy of
          mddev->raid_disks. It is initialized in linear_conf(), always being
          consistent with pointers number of 'struct dev_info disks[]'. When
          iterating conf->disks[] in linear_congested(), use conf->raid_disks to
          replace mddev->raid_disks in the for-loop, then NULL pointer deference
          will not happen again.
       2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
          free oldconf memory. Because oldconf may be referenced as mddev->private
          in linear_congested(), kfree_rcu() makes sure that its memory will not
          be released until no one uses it any more.
      Also some code comments are added in this patch, to make this modification
      to be easier understandable.
      
      This patch can be applied for kernels since v4.0 after commit:
      3be260cc ("md/linear: remove rcu protections in favour of
      suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
      people who maintain kernels before Linux v4.0, they need to do some back
      back port to this patch.
      
      Changelog:
       - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
             replace rcu_call() in linear_add().
       - v2: add RCU stuffs by suggestion from Shaohua and Neil.
       - v1: initial effort.
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NShaohua Li <shli@fb.com>
      03a9e24e
  2. 13 2月, 2017 1 次提交
  3. 12 2月, 2017 7 次提交
  4. 11 2月, 2017 13 次提交
    • O
      Btrfs: fix btrfs_decompress_buf2page() · 6e78b3f7
      Omar Sandoval 提交于
      If btrfs_decompress_buf2page() is handed a bio with its page in the
      middle of the working buffer, then we adjust the offset into the working
      buffer. After we copy into the bio, we advance the iterator by the
      number of bytes we copied. Then, we have some logic to handle the case
      of discontiguous pages and adjust the offset into the working buffer
      again. However, if we didn't advance the bio to a new page, we may enter
      this case in error, essentially repeating the adjustment that we already
      made when we entered the function. The end result is bogus data in the
      bio.
      
      Previously, we only checked for this case when we advanced to a new
      page, but the conversion to bio iterators changed that. This restores
      the old, correct behavior.
      
      A case I saw when testing with zlib was:
      
          buf_start = 42769
          total_out = 46865
          working_bytes = total_out - buf_start = 4096
          start_byte = 45056
      
      The condition (total_out > start_byte && buf_start < start_byte) is
      true, so we adjust the offset:
      
          buf_offset = start_byte - buf_start = 2287
          working_bytes -= buf_offset = 1809
          current_buf_start = buf_start = 42769
      
      Then, we copy
      
          bytes = min(bvec.bv_len, PAGE_SIZE - buf_offset, working_bytes) = 1809
          buf_offset += bytes = 4096
          working_bytes -= bytes = 0
          current_buf_start += bytes = 44578
      
      After bio_advance(), we are still in the same page, so start_byte is the
      same. Then, we check (total_out > start_byte && current_buf_start < start_byte),
      which is true! So, we adjust the values again:
      
          buf_offset = start_byte - buf_start = 2287
          working_bytes = total_out - start_byte = 1809
          current_buf_start = buf_start + buf_offset = 45056
      
      But note that working_bytes was already zero before this, so we should
      have stopped copying.
      
      Fixes: 974b1adc ("btrfs: use bio iterators for the decompression handlers")
      Reported-by: NPat Erley <pat-lkml@erley.org>
      Reviewed-by: NChris Mason <clm@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Tested-by: NLiu Bo <bo.li.liu@oracle.com>
      6e78b3f7
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1ee18329
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
      
       1) If the timing is wrong we can indefinitely stop generating new ipv6
          temporary addresses, from Marcus Huewe.
      
       2) Don't double free per-cpu stats in ipv6 SIT tunnel driver, from Cong
          Wang.
      
       3) Put protections in place so that AF_PACKET is not able to submit
          packets which don't even have a link level header to drivers. From
          Willem de Bruijn.
      
       4) Fix memory leaks in ipv4 and ipv6 multicast code, from Hangbin Liu.
      
       5) Don't use udp_ioctl() in l2tp code, UDP version expects a UDP socket
          and that doesn't go over very well when it is passed an L2TP one.
          Fix from Eric Dumazet.
      
       6) Don't crash on NULL pointer in phy_attach_direct(), from Florian
          Fainelli.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        l2tp: do not use udp_ioctl()
        xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()
        NET: mkiss: Fix panic
        net: hns: Fix the device being used for dma mapping during TX
        net: phy: Initialize mdio clock at probe function
        igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()
        xen-netfront: Improve error handling during initialization
        sierra_net: Skip validating irrelevant fields for IDLE LSIs
        sierra_net: Add support for IPv6 and Dual-Stack Link Sense Indications
        kcm: fix 0-length case for kcm_sendmsg()
        xen-netfront: Rework the fix for Rx stall during OOM and network stress
        net: phy: Fix PHY module checks and NULL deref in phy_attach_direct()
        net: thunderx: Fix PHY autoneg for SGMII QLM mode
        net: dsa: Do not destroy invalid network devices
        ping: fix a null pointer dereference
        packet: round up linear to header len
        net: introduce device min_header_len
        sit: fix a double free on error path
        lwtunnel: valid encap attr check should return 0 when lwtunnel is disabled
        ipv6: addrconf: fix generation of new temporary addresses
      1ee18329
    • L
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma · a9dbf5c8
      Linus Torvalds 提交于
      Pull rdma fixes from Doug Ledford:
       "Third round of -rc fixes for 4.10 kernel:
      
         - two security related issues in the rxe driver
      
         - one compile issue in the RDMA uapi header"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
        RDMA: Don't reference kernel private header from UAPI header
        IB/rxe: Fix mem_check_range integer overflow
        IB/rxe: Fix resid update
      a9dbf5c8
    • L
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · aca9fa0c
      Linus Torvalds 提交于
      Pull i2c bugfixes from Wolfram Sang:
       "Two bugfixes (proper IO mapping and use of mutex) for a driver feature
        we introduced in this cycle"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: piix4: Request the SMBUS semaphore inside the mutex
        i2c: piix4: Fix request_region size
      aca9fa0c
    • L
      Merge tag 'mmc-v4.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · fc6f41ba
      Linus Torvalds 提交于
      Pull MMC host fix from Ulf Hansson:
       "mmci: Fix hang while waiting for busy-end interrupt"
      
      * tag 'mmc-v4.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: mmci: avoid clearing ST Micro busy end interrupt mistakenly
      fc6f41ba
    • L
      Merge tag 'sound-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 1f369d16
      Linus Torvalds 提交于
      Pull sound fixes from Takashi Iwai:
       "Here are some last-minute fixes: two fixes for races in ALSA sequencer
        queue spotted by syzkaller, a revert for a regression of LINE6 driver
        (since 4.9), and a trivial new codec ID addition for Nvidia HDMI"
      
      * tag 'sound-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda - adding a new NV HDMI/DP codec ID in the driver
        ALSA: seq: Fix race at creating a queue
        Revert "ALSA: line6: Only determine control port properties if needed"
        ALSA: seq: Don't handle loop timeout at snd_seq_pool_done()
      1f369d16
    • L
      Merge tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux · 7fe654dc
      Linus Torvalds 提交于
      Pull nfsd revert from Bruce Fields:
       "This patch turned out to have a couple problems. The problems are
        fixable, but at least one of the fixes is a little ugly. The original
        bug has always been there, so we can wait another week or two to get
        this right"
      
      * tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux:
        nfsd: Revert "nfsd: special case truncates some more"
      7fe654dc
    • L
      Merge tag 'powerpc-4.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 3ebc7033
      Linus Torvalds 提交于
      Pull powerpc fixes friom Michael Ellerman:
       "Apologies for the late pull request, but Ben has been busy finding bugs.
      
         - Userspace was semi-randomly segfaulting on radix due to us
           incorrectly handling a fault triggered by autonuma, caused by a
           patch we merged earlier in v4.10 to prevent the kernel executing
           userspace.
      
         - We weren't marking host IPIs properly for KVM in the OPAL ICP
           backend.
      
         - The ERAT flushing on radix was missing an isync and was incorrectly
           marked as DD1 only.
      
         - The powernv CPU hotplug code was missing a wakeup type and failing
           to flush the interrupt correctly when using OPAL ICP
      
        Thanks to Benjamin Herrenschmidt"
      
      * tag 'powerpc-4.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Properly set "host-ipi" on IPIs
        powerpc/powernv: Fix CPU hotplug to handle waking on HVI
        powerpc/mm/radix: Update ERAT flushes when invalidating TLB
        powerpc/mm: Fix spurrious segfaults on radix with autonuma
      3ebc7033
    • E
      l2tp: do not use udp_ioctl() · 72fb96e7
      Eric Dumazet 提交于
      udp_ioctl(), as its name suggests, is used by UDP protocols,
      but is also used by L2TP :(
      
      L2TP should use its own handler, because it really does not
      look the same.
      
      SIOCINQ for instance should not assume UDP checksum or headers.
      
      Thanks to Andrey and syzkaller team for providing the report
      and a nice reproducer.
      
      While crashes only happen on recent kernels (after commit
      7c13f97f ("udp: do fwd memory scheduling on dequeue")), this
      probably needs to be backported to older kernels.
      
      Fixes: 7c13f97f ("udp: do fwd memory scheduling on dequeue")
      Fixes: 85584672 ("udp: Fix udp_poll() and ioctl()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      72fb96e7
    • C
      Merge branch 'for-chris' of... · f3c7bfbd
      Chris Mason 提交于
      Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.10
      f3c7bfbd
    • B
      xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend() · 74470954
      Boris Ostrovsky 提交于
      rx_refill_timer should be deleted as soon as we disconnect from the
      backend since otherwise it is possible for the timer to go off before
      we get to xennet_destroy_queues(). If this happens we may dereference
      queue->rx.sring which is set to NULL in xennet_disconnect_backend().
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      CC: stable@vger.kernel.org
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74470954
    • R
      NET: mkiss: Fix panic · 7ba1b689
      Ralf Baechle 提交于
      If a USB-to-serial adapter is unplugged, the driver re-initializes, with
      dev->hard_header_len and dev->addr_len set to zero, instead of the correct
      values.  If then a packet is sent through the half-dead interface, the
      kernel will panic due to running out of headroom in the skb when pushing
      for the AX.25 headers resulting in this panic:
      
      [<c0595468>] (skb_panic) from [<c0401f70>] (skb_push+0x4c/0x50)
      [<c0401f70>] (skb_push) from [<bf0bdad4>] (ax25_hard_header+0x34/0xf4 [ax25])
      [<bf0bdad4>] (ax25_hard_header [ax25]) from [<bf0d05d4>] (ax_header+0x38/0x40 [mkiss])
      [<bf0d05d4>] (ax_header [mkiss]) from [<c041b584>] (neigh_compat_output+0x8c/0xd8)
      [<c041b584>] (neigh_compat_output) from [<c043e7a8>] (ip_finish_output+0x2a0/0x914)
      [<c043e7a8>] (ip_finish_output) from [<c043f948>] (ip_output+0xd8/0xf0)
      [<c043f948>] (ip_output) from [<c043f04c>] (ip_local_out_sk+0x44/0x48)
      
      This patch makes mkiss behave like the 6pack driver. 6pack does not
      panic.  In 6pack.c sp_setup() (same function name here) the values for
      dev->hard_header_len and dev->addr_len are set to the same values as in
      my mkiss patch.
      
      [ralf@linux-mips.org: Massages original submission to conform to the usual
      standards for patch submissions.]
      Signed-off-by: NThomas Osterried <thomas@osterried.de>
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ba1b689
    • K
      net: hns: Fix the device being used for dma mapping during TX · b85ea006
      Kejian Yan 提交于
      This patch fixes the device being used to DMA map skb->data.
      Erroneous device assignment causes the crash when SMMU is enabled.
      This happens during TX since buffer gets DMA mapped with device
      correspondign to net_device and gets unmapped using the device
      related to DSAF.
      Signed-off-by: NKejian Yan <yankejian@huawei.com>
      Reviewed-by: NYisen Zhuang <yisen.zhuang@huawei.com>
      Signed-off-by: NSalil Mehta <salil.mehta@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b85ea006
  5. 10 2月, 2017 10 次提交