1. 19 9月, 2019 23 次提交
    • F
      KVM: x86: work around leak of uninitialized stack contents · 6e60900c
      Fuqian Huang 提交于
      commit 541ab2aeb28251bf7135c7961f3a6080eebcc705 upstream.
      
      Emulation of VMPTRST can incorrectly inject a page fault
      when passed an operand that points to an MMIO address.
      The page fault will use uninitialized kernel stack memory
      as the CR2 and error code.
      
      The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
      exit to userspace; however, it is not an easy fix, so for now just ensure
      that the error code and CR2 are zero.
      Signed-off-by: NFuqian Huang <huangfq.daxian@gmail.com>
      Cc: stable@vger.kernel.org
      [add comment]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6e60900c
    • T
      KVM: s390: Do not leak kernel stack data in the KVM_S390_INTERRUPT ioctl · 09a9f894
      Thomas Huth 提交于
      commit 53936b5bf35e140ae27e4bbf0447a61063f400da upstream.
      
      When the userspace program runs the KVM_S390_INTERRUPT ioctl to inject
      an interrupt, we convert them from the legacy struct kvm_s390_interrupt
      to the new struct kvm_s390_irq via the s390int_to_s390irq() function.
      However, this function does not take care of all types of interrupts
      that we can inject into the guest later (see do_inject_vcpu()). Since we
      do not clear out the s390irq values before calling s390int_to_s390irq(),
      there is a chance that we copy random data from the kernel stack which
      could be leaked to the userspace later.
      
      Specifically, the problem exists with the KVM_S390_INT_PFAULT_INIT
      interrupt: s390int_to_s390irq() does not handle it, and the function
      __inject_pfault_init() later copies irq->u.ext which contains the
      random kernel stack data. This data can then be leaked either to
      the guest memory in __deliver_pfault_init(), or the userspace might
      retrieve it directly with the KVM_S390_GET_IRQ_STATE ioctl.
      
      Fix it by handling that interrupt type in s390int_to_s390irq(), too,
      and by making sure that the s390irq struct is properly pre-initialized.
      And while we're at it, make sure that s390int_to_s390irq() now
      directly returns -EINVAL for unknown interrupt types, so that we
      immediately get a proper error code in case we add more interrupt
      types to do_inject_vcpu() without updating s390int_to_s390irq()
      sometime in the future.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: NJanosch Frank <frankja@linux.ibm.com>
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Link: https://lore.kernel.org/kvm/20190912115438.25761-1-thuth@redhat.comSigned-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      09a9f894
    • I
      KVM: s390: kvm_s390_vm_start_migration: check dirty_bitmap before using it as target for memset() · 9f8a2825
      Igor Mammedov 提交于
      commit 13a17cc0526f08d1df9507f7484176371cd263a0 upstream.
      
      If userspace doesn't set KVM_MEM_LOG_DIRTY_PAGES on memslot before calling
      kvm_s390_vm_start_migration(), kernel will oops with:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0000000000000000 TEID: 0000000000000483
        Fault in home space mode while using kernel ASCE.
        AS:0000000002a2000b R2:00000001bff8c00b R3:00000001bff88007 S:00000001bff91000 P:000000000000003d
        Oops: 0004 ilc:2 [#1] SMP
        ...
        Call Trace:
        ([<001fffff804ec552>] kvm_s390_vm_set_attr+0x347a/0x3828 [kvm])
         [<001fffff804ecfc0>] kvm_arch_vm_ioctl+0x6c0/0x1998 [kvm]
         [<001fffff804b67e4>] kvm_vm_ioctl+0x51c/0x11a8 [kvm]
         [<00000000008ba572>] do_vfs_ioctl+0x1d2/0xe58
         [<00000000008bb284>] ksys_ioctl+0x8c/0xb8
         [<00000000008bb2e2>] sys_ioctl+0x32/0x40
         [<000000000175552c>] system_call+0x2b8/0x2d8
        INFO: lockdep is turned off.
        Last Breaking-Event-Address:
         [<0000000000dbaf60>] __memset+0xc/0xa0
      
      due to ms->dirty_bitmap being NULL, which might crash the host.
      
      Make sure that ms->dirty_bitmap is set before using it or
      return -EINVAL otherwise.
      
      Cc: <stable@vger.kernel.org>
      Fixes: afdad616 ("KVM: s390: Fix storage attributes migration with memory slots")
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Link: https://lore.kernel.org/kvm/20190911075218.29153-1-imammedo@redhat.com/Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: NClaudio Imbrenda <imbrenda@linux.ibm.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Reviewed-by: NJanosch Frank <frankja@linux.ibm.com>
      Signed-off-by: NJanosch Frank <frankja@linux.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f8a2825
    • Y
      genirq: Prevent NULL pointer dereference in resend_irqs() · 991b3458
      Yunfeng Ye 提交于
      commit eddf3e9c7c7e4d0707c68d1bb22cc6ec8aef7d4a upstream.
      
      The following crash was observed:
      
        Unable to handle kernel NULL pointer dereference at 0000000000000158
        Internal error: Oops: 96000004 [#1] SMP
        pc : resend_irqs+0x68/0xb0
        lr : resend_irqs+0x64/0xb0
        ...
        Call trace:
         resend_irqs+0x68/0xb0
         tasklet_action_common.isra.6+0x84/0x138
         tasklet_action+0x2c/0x38
         __do_softirq+0x120/0x324
         run_ksoftirqd+0x44/0x60
         smpboot_thread_fn+0x1ac/0x1e8
         kthread+0x134/0x138
         ret_from_fork+0x10/0x18
      
      The reason for this is that the interrupt resend mechanism happens in soft
      interrupt context, which is a asynchronous mechanism versus other
      operations on interrupts. free_irq() does not take resend handling into
      account. Thus, the irq descriptor might be already freed before the resend
      tasklet is executed. resend_irqs() does not check the return value of the
      interrupt descriptor lookup and derefences the return value
      unconditionally.
      
        1):
        __setup_irq
          irq_startup
            check_irq_resend  // activate softirq to handle resend irq
        2):
        irq_domain_free_irqs
          irq_free_descs
            free_desc
              call_rcu(&desc->rcu, delayed_free_desc)
        3):
        __do_softirq
          tasklet_action
            resend_irqs
              desc = irq_to_desc(irq)
              desc->handle_irq(desc)  // desc is NULL --> Ooops
      
      Fix this by adding a NULL pointer check in resend_irqs() before derefencing
      the irq descriptor.
      
      Fixes: a4633adc ("[PATCH] genirq: add genirq sw IRQ-retrigger")
      Signed-off-by: NYunfeng Ye <yeyunfeng@huawei.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NZhiqiang Liu <liuzhiqiang26@huawei.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1630ae13-5c8e-901e-de09-e740b6a426a7@huawei.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      991b3458
    • A
      ixgbe: Prevent u8 wrapping of ITR value to something less than 10us · 5b5f1460
      Alexander Duyck 提交于
      commit 377228accbbb8b9738f615d791aa803f41c067e0 upstream.
      
      There were a couple cases where the ITR value generated via the adaptive
      ITR scheme could exceed 126. This resulted in the value becoming either 0
      or something less than 10. Switching back and forth between a value less
      than 10 and a value greater than 10 can cause issues as certain hardware
      features such as RSC to not function well when the ITR value has dropped
      that low.
      
      CC: stable@vger.kernel.org
      Fixes: b4ded832 ("ixgbe: Update adaptive ITR algorithm")
      Reported-by: NGregg Leventhal <gleventhal@janestreet.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b5f1460
    • F
      Btrfs: fix assertion failure during fsync and use of stale transaction · 7cbd49cf
      Filipe Manana 提交于
      commit 410f954cb1d1c79ae485dd83a175f21954fd87cd upstream.
      
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fallback to a
         transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction can be unpredictable, we are at
         least having a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value, or we can end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meanwhile was allocated to
         another task for a new transaction, etc, pretty much unpredictable
         what can happen.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode, that step is offloaded to a safe context
      (usually the cleaner kthread).
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which lead to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7cbd49cf
    • K
      gpio: fix line flag validation in linehandle_create · 22ed1d47
      Kent Gibson 提交于
      commit e95fbc130a162ba9ad956311b95aa0da269eea48 upstream.
      
      linehandle_create should not allow both GPIOHANDLE_REQUEST_INPUT
      and GPIOHANDLE_REQUEST_OUTPUT to be set.
      
      Fixes: d7c51b47 ("gpio: userspace ABI for reading/writing GPIO lines")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NKent Gibson <warthog618@gmail.com>
      Signed-off-by: NBartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      22ed1d47
    • H
      gpiolib: acpi: Add gpiolib_acpi_run_edge_events_on_boot option and blacklist · 705df757
      Hans de Goede 提交于
      commit 61f7f7c8f978b1c0d80e43c83b7d110ca0496eb4 upstream.
      
      Another day; another DSDT bug we need to workaround...
      
      Since commit ca876c74 ("gpiolib-acpi: make sure we trigger edge events
      at least once on boot") we call _AEI edge handlers at boot.
      
      In some rare cases this causes problems. One example of this is the Minix
      Neo Z83-4 mini PC, this device has a clear DSDT bug where it has some copy
      and pasted code for dealing with Micro USB-B connector host/device role
      switching, while the mini PC does not even have a micro-USB connector.
      This code, which should not be there, messes with the DDC data pin from
      the HDMI connector (switching it to GPIO mode) breaking HDMI support.
      
      To avoid problems like this, this commit adds a new
      gpiolib_acpi.run_edge_events_on_boot kernel commandline option, which
      allows disabling the running of _AEI edge event handlers at boot.
      
      The default value is -1/auto which uses a DMI based blacklist, the initial
      version of this blacklist contains the Neo Z83-4 fixing the HDMI breakage.
      
      Cc: stable@vger.kernel.org
      Cc: Daniel Drake <drake@endlessm.com>
      Cc: Ian W MORRISON <ianwmorrison@gmail.com>
      Reported-by: NIan W MORRISON <ianwmorrison@gmail.com>
      Suggested-by: NIan W MORRISON <ianwmorrison@gmail.com>
      Fixes: ca876c74 ("gpiolib-acpi: make sure we trigger edge events at least once on boot")
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Link: https://lore.kernel.org/r/20190827202835.213456-1-hdegoede@redhat.comAcked-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Tested-by: NIan W MORRISON <ianwmorrison@gmail.com>
      Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      705df757
    • Y
      tun: fix use-after-free when register netdev failed · 0f4ceb25
      Yang Yingliang 提交于
      [ Upstream commit 77f22f92dff8e7b45c7786a430626d38071d4670 ]
      
      I got a UAF repport in tun driver when doing fuzzy test:
      
      [  466.269490] ==================================================================
      [  466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0
      [  466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699
      [  466.271810]
      [  466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427
      [  466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [  466.271838] Call Trace:
      [  466.271858]  dump_stack+0xca/0x13e
      [  466.271871]  ? tun_chr_read_iter+0x2ca/0x2d0
      [  466.271890]  print_address_description+0x79/0x440
      [  466.271906]  ? vprintk_func+0x5e/0xf0
      [  466.271920]  ? tun_chr_read_iter+0x2ca/0x2d0
      [  466.271935]  __kasan_report+0x15c/0x1df
      [  466.271958]  ? tun_chr_read_iter+0x2ca/0x2d0
      [  466.271976]  kasan_report+0xe/0x20
      [  466.271987]  tun_chr_read_iter+0x2ca/0x2d0
      [  466.272013]  do_iter_readv_writev+0x4b7/0x740
      [  466.272032]  ? default_llseek+0x2d0/0x2d0
      [  466.272072]  do_iter_read+0x1c5/0x5e0
      [  466.272110]  vfs_readv+0x108/0x180
      [  466.299007]  ? compat_rw_copy_check_uvector+0x440/0x440
      [  466.299020]  ? fsnotify+0x888/0xd50
      [  466.299040]  ? __fsnotify_parent+0xd0/0x350
      [  466.299064]  ? fsnotify_first_mark+0x1e0/0x1e0
      [  466.304548]  ? vfs_write+0x264/0x510
      [  466.304569]  ? ksys_write+0x101/0x210
      [  466.304591]  ? do_preadv+0x116/0x1a0
      [  466.304609]  do_preadv+0x116/0x1a0
      [  466.309829]  do_syscall_64+0xc8/0x600
      [  466.309849]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  466.309861] RIP: 0033:0x4560f9
      [  466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      [  466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127
      [  466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9
      [  466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
      [  466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000
      [  466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10
      [  466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000
      [  466.323057]
      [  466.323064] Allocated by task 2605:
      [  466.335165]  save_stack+0x19/0x80
      [  466.336240]  __kasan_kmalloc.constprop.8+0xa0/0xd0
      [  466.337755]  kmem_cache_alloc+0xe8/0x320
      [  466.339050]  getname_flags+0xca/0x560
      [  466.340229]  user_path_at_empty+0x2c/0x50
      [  466.341508]  vfs_statx+0xe6/0x190
      [  466.342619]  __do_sys_newstat+0x81/0x100
      [  466.343908]  do_syscall_64+0xc8/0x600
      [  466.345303]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  466.347034]
      [  466.347517] Freed by task 2605:
      [  466.348471]  save_stack+0x19/0x80
      [  466.349476]  __kasan_slab_free+0x12e/0x180
      [  466.350726]  kmem_cache_free+0xc8/0x430
      [  466.351874]  putname+0xe2/0x120
      [  466.352921]  filename_lookup+0x257/0x3e0
      [  466.354319]  vfs_statx+0xe6/0x190
      [  466.355498]  __do_sys_newstat+0x81/0x100
      [  466.356889]  do_syscall_64+0xc8/0x600
      [  466.358037]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  466.359567]
      [  466.360050] The buggy address belongs to the object at ffff888372139100
      [  466.360050]  which belongs to the cache names_cache of size 4096
      [  466.363735] The buggy address is located 336 bytes inside of
      [  466.363735]  4096-byte region [ffff888372139100, ffff88837213a100)
      [  466.367179] The buggy address belongs to the page:
      [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
      [  466.371582] flags: 0x2fffff80010200(slab|head)
      [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
      [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  466.377778] page dumped because: kasan: bad access detected
      [  466.379730]
      [  466.380288] Memory state around the buggy address:
      [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.388257]                                                  ^
      [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  466.394667] ==================================================================
      
      tun_chr_read_iter() accessed the memory which freed by free_netdev()
      called by tun_set_iff():
      
              CPUA                                           CPUB
        tun_set_iff()
          alloc_netdev_mqs()
          tun_attach()
                                                        tun_chr_read_iter()
                                                          tun_get()
                                                          tun_do_read()
                                                            tun_ring_recv()
          register_netdevice() <-- inject error
          goto err_detach
          tun_detach_all() <-- set RCV_SHUTDOWN
          free_netdev() <-- called from
                           err_free_dev path
            netdev_freemem() <-- free the memory
                              without check refcount
            (In this path, the refcount cannot prevent
             freeing the memory of dev, and the memory
             will be used by dev_put() called by
             tun_chr_read_iter() on CPUB.)
                                                           (Break from tun_ring_recv(),
                                                           because RCV_SHUTDOWN is set)
                                                         tun_put()
                                                           dev_put() <-- use the memory
                                                                         freed by netdev_freemem()
      
      Put the publishing of tfile->tun after register_netdevice(),
      so tun_get() won't get the tun pointer that freed by
      err_detach path if register_netdevice() failed.
      
      Fixes: eb0fb363 ("tuntap: attach queue 0 before registering netdevice")
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Suggested-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0f4ceb25
    • X
      tipc: add NULL pointer check before calling kfree_rcu · 9a459842
      Xin Long 提交于
      [ Upstream commit 42dec1dbe38239cf91cc1f4df7830c66276ced37 ]
      
      Unlike kfree(p), kfree_rcu(p, rcu) won't do NULL pointer check. When
      tipc_nametbl_remove_publ returns NULL, the panic below happens:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
         RIP: 0010:__call_rcu+0x1d/0x290
         Call Trace:
          <IRQ>
          tipc_publ_notify+0xa9/0x170 [tipc]
          tipc_node_write_unlock+0x8d/0x100 [tipc]
          tipc_node_link_down+0xae/0x1d0 [tipc]
          tipc_node_check_dest+0x3ea/0x8f0 [tipc]
          ? tipc_disc_rcv+0x2c7/0x430 [tipc]
          tipc_disc_rcv+0x2c7/0x430 [tipc]
          ? tipc_rcv+0x6bb/0xf20 [tipc]
          tipc_rcv+0x6bb/0xf20 [tipc]
          ? ip_route_input_slow+0x9cf/0xb10
          tipc_udp_recv+0x195/0x1e0 [tipc]
          ? tipc_udp_is_known_peer+0x80/0x80 [tipc]
          udp_queue_rcv_skb+0x180/0x460
          udp_unicast_rcv_skb.isra.56+0x75/0x90
          __udp4_lib_rcv+0x4ce/0xb90
          ip_local_deliver_finish+0x11c/0x210
          ip_local_deliver+0x6b/0xe0
          ? ip_rcv_finish+0xa9/0x410
          ip_rcv+0x273/0x362
      
      Fixes: 97ede29e ("tipc: convert name table read-write lock to RCU")
      Reported-by: NLi Shuang <shuali@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a459842
    • N
      tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · 67fe3b94
      Neal Cardwell 提交于
      [ Upstream commit af38d07ed391b21f7405fa1f936ca9686787d6d2 ]
      
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67fe3b94
    • X
      sctp: use transport pf_retrans in sctp_do_8_2_transport_strike · 7c34a292
      Xin Long 提交于
      [ Upstream commit 10eb56c582c557c629271f1ee31e15e7a9b2558b ]
      
      Transport should use its own pf_retrans to do the error_count
      check, instead of asoc's. Otherwise, it's meaningless to make
      pf_retrans per transport.
      
      Fixes: 5aa93bcf ("sctp: Implement quick failover draft from tsvwg")
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c34a292
    • C
      sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()' · 41b624ff
      Christophe JAILLET 提交于
      [ Upstream commit b456d72412ca8797234449c25815e82f4e1426c0 ]
      
      The '.exit' functions from 'pernet_operations' structure should be marked
      as __net_exit, not __net_init.
      
      Fixes: 8e2d61e0 ("sctp: fix race on protocol/netns initialization")
      Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      41b624ff
    • C
      sch_hhf: ensure quantum and hhf_non_hh_weight are non-zero · a9e91767
      Cong Wang 提交于
      [ Upstream commit d4d6ec6dac07f263f06d847d6f732d6855522845 ]
      
      In case of TCA_HHF_NON_HH_WEIGHT or TCA_HHF_QUANTUM is zero,
      it would make no progress inside the loop in hhf_dequeue() thus
      kernel would get stuck.
      
      Fix this by checking this corner case in hhf_change().
      
      Fixes: 10239edf ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
      Reported-by: syzbot+bc6297c11f19ee807dc2@syzkaller.appspotmail.com
      Reported-by: syzbot+041483004a7f45f1f20a@syzkaller.appspotmail.com
      Reported-by: syzbot+55be5f513bed37fc4367@syzkaller.appspotmail.com
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: Terry Lam <vtlam@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a9e91767
    • E
      net: sched: fix reordering issues · a7f46e18
      Eric Dumazet 提交于
      [ Upstream commit b88dd52c62bb5c5d58f0963287f41fd084352c57 ]
      
      Whenever MQ is not used on a multiqueue device, we experience
      serious reordering problems. Bisection found the cited
      commit.
      
      The issue can be described this way :
      
      - A single qdisc hierarchy is shared by all transmit queues.
        (eg : tc qdisc replace dev eth0 root fq_codel)
      
      - When/if try_bulk_dequeue_skb_slow() dequeues a packet targetting
        a different transmit queue than the one used to build a packet train,
        we stop building the current list and save the 'bad' skb (P1) in a
        special queue. (bad_txq)
      
      - When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this
        skb (P1), it checks if the associated transmit queues is still in frozen
        state. If the queue is still blocked (by BQL or NIC tx ring full),
        we leave the skb in bad_txq and return NULL.
      
      - dequeue_skb() calls q->dequeue() to get another packet (P2)
      
        The other packet can target the problematic queue (that we found
        in frozen state for the bad_txq packet), but another cpu just ran
        TX completion and made room in the txq that is now ready to accept
        new packets.
      
      - Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent
        at next round. In practice P2 is the lead of a big packet train
        (P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/
      
      To solve this problem, we have to block the dequeue process as long
      as the first packet in bad_txq can not be sent. Reordering issues
      disappear and no side effects have been seen.
      
      Fixes: a53851e2 ("net: sched: explicit locking in gso_cpu fallback")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7f46e18
    • S
      net: phylink: Fix flow control resolution · 3600a311
      Stefan Chulski 提交于
      [ Upstream commit 63b2ed4e10b2e6c913e1d8cdd728e7fba4115a3d ]
      
      Regarding to IEEE 802.3-2015 standard section 2
      28B.3 Priority resolution - Table 28-3 - Pause resolution
      
      In case of Local device Pause=1 AsymDir=0, Link partner
      Pause=1 AsymDir=1, Local device resolution should be enable PAUSE
      transmit, disable PAUSE receive.
      And in case of Local device Pause=1 AsymDir=1, Link partner
      Pause=1 AsymDir=0, Local device resolution should be enable PAUSE
      receive, disable PAUSE transmit.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: NStefan Chulski <stefanc@marvell.com>
      Reported-by: NShaul Ben-Mayor <shaulb@marvell.com>
      Acked-by: NRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3600a311
    • S
      net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list · 821302dd
      Shmulik Ladkani 提交于
      [ Upstream commit 3dcbdb134f329842a38f0e6797191b885ab00a00 ]
      
      Historically, support for frag_list packets entering skb_segment() was
      limited to frag_list members terminating on exact same gso_size
      boundaries. This is verified with a BUG_ON since commit 89319d38
      ("net: Add frag_list support to skb_segment"), quote:
      
          As such we require all frag_list members terminate on exact MSS
          boundaries.  This is checked using BUG_ON.
          As there should only be one producer in the kernel of such packets,
          namely GRO, this requirement should not be difficult to maintain.
      
      However, since commit 6578171a ("bpf: add bpf_skb_change_proto helper"),
      the "exact MSS boundaries" assumption no longer holds:
      An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but
      leaves the frag_list members as originally merged by GRO with the
      original 'gso_size'. Example of such programs are bpf-based NAT46 or
      NAT64.
      
      This lead to a kernel BUG_ON for flows involving:
       - GRO generating a frag_list skb
       - bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room()
       - skb_segment() of the skb
      
      See example BUG_ON reports in [0].
      
      In commit 13acc94e ("net: permit skb_segment on head_frag frag_list skb"),
      skb_segment() was modified to support the "gso_size mangling" case of
      a frag_list GRO'ed skb, but *only* for frag_list members having
      head_frag==true (having a page-fragment head).
      
      Alas, GRO packets having frag_list members with a linear kmalloced head
      (head_frag==false) still hit the BUG_ON.
      
      This commit adds support to skb_segment() for a 'head_skb' packet having
      a frag_list whose members are *non* head_frag, with gso_size mangled, by
      disabling SG and thus falling-back to copying the data from the given
      'head_skb' into the generated segmented skbs - as suggested by Willem de
      Bruijn [1].
      
      Since this approach involves the penalty of skb_copy_and_csum_bits()
      when building the segments, care was taken in order to enable this
      solution only when required:
       - untrusted gso_size, by testing SKB_GSO_DODGY is set
         (SKB_GSO_DODGY is set by any gso_size mangling functions in
          net/core/filter.c)
       - the frag_list is non empty, its item is a non head_frag, *and* the
         headlen of the given 'head_skb' does not match the gso_size.
      
      [0]
      https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/
      https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/
      
      [1]
      https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/
      
      Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper")
      Suggested-by: NWillem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Reviewed-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      821302dd
    • S
      net: Fix null de-reference of device refcount · 88a46756
      Subash Abhinov Kasiviswanathan 提交于
      [ Upstream commit 10cc514f451a0f239aa34f91bc9dc954a9397840 ]
      
      In event of failure during register_netdevice, free_netdev is
      invoked immediately. free_netdev assumes that all the netdevice
      refcounts have been dropped prior to it being called and as a
      result frees and clears out the refcount pointer.
      
      However, this is not necessarily true as some of the operations
      in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
      invocation after a grace period. The IPv4 callback in_dev_rcu_put
      tries to access the refcount after free_netdev is called which
      leads to a null de-reference-
      
      44837.761523:   <6> Unable to handle kernel paging request at
                          virtual address 0000004a88287000
      44837.761651:   <2> pc : in_dev_finish_destroy+0x4c/0xc8
      44837.761654:   <2> lr : in_dev_finish_destroy+0x2c/0xc8
      44837.762393:   <2> Call trace:
      44837.762398:   <2>  in_dev_finish_destroy+0x4c/0xc8
      44837.762404:   <2>  in_dev_rcu_put+0x24/0x30
      44837.762412:   <2>  rcu_nocb_kthread+0x43c/0x468
      44837.762418:   <2>  kthread+0x118/0x128
      44837.762424:   <2>  ret_from_fork+0x10/0x1c
      
      Fix this by waiting for the completion of the call_rcu() in
      case of register_netdevice errors.
      
      Fixes: 93ee31f1 ("[NET]: Fix free_netdev on register_netdev failure.")
      Cc: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      88a46756
    • S
      ixgbe: Fix secpath usage for IPsec TX offload. · b26f4892
      Steffen Klassert 提交于
      [ Upstream commit f39b683d35dfa93a58f1b400a8ec0ff81296b37c ]
      
      The ixgbe driver currently does IPsec TX offloading
      based on an existing secpath. However, the secpath
      can also come from the RX side, in this case it is
      misinterpreted for TX offload and the packets are
      dropped with a "bad sa_idx" error. Fix this by using
      the xfrm_offload() function to test for TX offload.
      
      Fixes: 59259470 ("ixgbe: process the Tx ipsec offload")
      Reported-by: NMichael Marley <michael@michaelmarley.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b26f4892
    • E
      isdn/capi: check message length in capi_write() · 2354e925
      Eric Biggers 提交于
      [ Upstream commit fe163e534e5eecdfd7b5920b0dfd24c458ee85d6 ]
      
      syzbot reported:
      
          BUG: KMSAN: uninit-value in capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700
          CPU: 0 PID: 10025 Comm: syz-executor379 Not tainted 4.20.0-rc7+ #2
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
            __dump_stack lib/dump_stack.c:77 [inline]
            dump_stack+0x173/0x1d0 lib/dump_stack.c:113
            kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
            __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
            capi_write+0x791/0xa90 drivers/isdn/capi/capi.c:700
            do_loop_readv_writev fs/read_write.c:703 [inline]
            do_iter_write+0x83e/0xd80 fs/read_write.c:961
            vfs_writev fs/read_write.c:1004 [inline]
            do_writev+0x397/0x840 fs/read_write.c:1039
            __do_sys_writev fs/read_write.c:1112 [inline]
            __se_sys_writev+0x9b/0xb0 fs/read_write.c:1109
            __x64_sys_writev+0x4a/0x70 fs/read_write.c:1109
            do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
            entry_SYSCALL_64_after_hwframe+0x63/0xe7
          [...]
      
      The problem is that capi_write() is reading past the end of the message.
      Fix it by checking the message's length in the needed places.
      
      Reported-and-tested-by: syzbot+0849c524d9c634f5ae66@syzkaller.appspotmail.com
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2354e925
    • C
      ipv6: Fix the link time qualifier of 'ping_v6_proc_exit_net()' · ea6ec671
      Christophe JAILLET 提交于
      [ Upstream commit d23dbc479a8e813db4161a695d67da0e36557846 ]
      
      The '.exit' functions from 'pernet_operations' structure should be marked
      as __net_exit, not __net_init.
      
      Fixes: d862e546 ("net: ipv6: Implement /proc/net/icmp6.")
      Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea6ec671
    • B
      cdc_ether: fix rndis support for Mediatek based smartphones · a20c8e4a
      Bjørn Mork 提交于
      [ Upstream commit 4d7ffcf3bf1be98d876c570cab8fc31d9fa92725 ]
      
      A Mediatek based smartphone owner reports problems with USB
      tethering in Linux.  The verbose USB listing shows a rndis_host
      interface pair (e0/01/03 + 10/00/00), but the driver fails to
      bind with
      
      [  355.960428] usb 1-4: bad CDC descriptors
      
      The problem is a failsafe test intended to filter out ACM serial
      functions using the same 02/02/ff class/subclass/protocol as RNDIS.
      The serial functions are recognized by their non-zero bmCapabilities.
      
      No RNDIS function with non-zero bmCapabilities were known at the time
      this failsafe was added. But it turns out that some Wireless class
      RNDIS functions are using the bmCapabilities field. These functions
      are uniquely identified as RNDIS by their class/subclass/protocol, so
      the failing test can safely be disabled.  The same applies to the two
      types of Misc class RNDIS functions.
      
      Applying the failsafe to Communication class functions only retains
      the original functionality, and fixes the problem for the Mediatek based
      smartphone.
      
      Tow examples of CDC functional descriptors with non-zero bmCapabilities
      from Wireless class RNDIS functions are:
      
      0e8d:000a  Mediatek Crosscall Spider X5 3G Phone
      
            CDC Header:
              bcdCDC               1.10
            CDC ACM:
              bmCapabilities       0x0f
                connection notifications
                sends break
                line coding and serial state
                get/set/clear comm features
            CDC Union:
              bMasterInterface        0
              bSlaveInterface         1
            CDC Call Management:
              bmCapabilities       0x03
                call management
                use DataInterface
              bDataInterface          1
      
      and
      
      19d2:1023  ZTE K4201-z
      
            CDC Header:
              bcdCDC               1.10
            CDC ACM:
              bmCapabilities       0x02
                line coding and serial state
            CDC Call Management:
              bmCapabilities       0x03
                call management
                use DataInterface
              bDataInterface          1
            CDC Union:
              bMasterInterface        0
              bSlaveInterface         1
      
      The Mediatek example is believed to apply to most smartphones with
      Mediatek firmware.  The ZTE example is most likely also part of a larger
      family of devices/firmwares.
      Suggested-by: NLars Melin <larsm17@gmail.com>
      Signed-off-by: NBjørn Mork <bjorn@mork.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a20c8e4a
    • N
      bridge/mdb: remove wrong use of NLM_F_MULTI · f57fd58d
      Nicolas Dichtel 提交于
      [ Upstream commit 94a72b3f024fc7e9ab640897a1e38583a470659d ]
      
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent at the end.
      In fact, NLMSG_DONE is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: 949f1e39 ("bridge: mdb: notify on router port add and del")
      CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f57fd58d
  2. 16 9月, 2019 17 次提交
    • G
      Linux 4.19.73 · db2d0b7c
      Greg Kroah-Hartman 提交于
      db2d0b7c
    • Y
      vhost: make sure log_num < in_num · ba03ee62
      yongduan 提交于
      commit 060423bfdee3f8bc6e2c1bac97de24d5415e2bc4 upstream.
      
      The code assumes log_num < in_num everywhere, and that is true as long as
      in_num is incremented by descriptor iov count, and log_num by 1. However
      this breaks if there's a zero sized descriptor.
      
      As a result, if a malicious guest creates a vring desc with desc.len = 0,
      it may cause the host kernel to crash by overflowing the log array. This
      bug can be triggered during the VM migration.
      
      There's no need to log when desc.len = 0, so just don't increment log_num
      in this case.
      
      Fixes: 3a4d5c94 ("vhost_net: a kernel-level virtio server")
      Cc: stable@vger.kernel.org
      Reviewed-by: NLidong Chen <lidongchen@tencent.com>
      Signed-off-by: Nruippan <ruippan@tencent.com>
      Signed-off-by: Nyongduan <yongduan@tencent.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NTyler Hicks <tyhicks@canonical.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ba03ee62
    • G
      powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts · 569775bd
      Gustavo Romero 提交于
      [ Upstream commit a8318c13e79badb92bc6640704a64cc022a6eb97 ]
      
      When in userspace and MSR FP=0 the hardware FP state is unrelated to
      the current process. This is extended for transactions where if tbegin
      is run with FP=0, the hardware checkpoint FP state will also be
      unrelated to the current process. Due to this, we need to ensure this
      hardware checkpoint is updated with the correct state before we enable
      FP for this process.
      
      Unfortunately we get this wrong when returning to a process from a
      hardware interrupt. A process that starts a transaction with FP=0 can
      take an interrupt. When the kernel returns back to that process, we
      change to FP=1 but with hardware checkpoint FP state not updated. If
      this transaction is then rolled back, the FP registers now contain the
      wrong state.
      
      The process looks like this:
         Userspace:                      Kernel
      
                     Start userspace
                      with MSR FP=0 TM=1
                        < -----
         ...
         tbegin
         bne
                     Hardware interrupt
                         ---- >
                                          <do_IRQ...>
                                          ....
                                          ret_from_except
                                            restore_math()
      				        /* sees FP=0 */
                                              restore_fp()
                                                tm_active_with_fp()
      					    /* sees FP=1 (Incorrect) */
                                                load_fp_state()
                                              FP = 0 -> 1
                        < -----
                     Return to userspace
                       with MSR TM=1 FP=1
                       with junk in the FP TM checkpoint
         TM rollback
         reads FP junk
      
      When returning from the hardware exception, tm_active_with_fp() is
      incorrectly making restore_fp() call load_fp_state() which is setting
      FP=1.
      
      The fix is to remove tm_active_with_fp().
      
      tm_active_with_fp() is attempting to handle the case where FP state
      has been changed inside a transaction. In this case the checkpointed
      and transactional FP state is different and hence we must restore the
      FP state (ie. we can't do lazy FP restore inside a transaction that's
      used FP). It's safe to remove tm_active_with_fp() as this case is
      handled by restore_tm_state(). restore_tm_state() detects if FP has
      been using inside a transaction and will set load_fp and call
      restore_math() to ensure the FP state (checkpoint and transaction) is
      restored.
      
      This is a data integrity problem for the current process as the FP
      registers are corrupted. It's also a security problem as the FP
      registers from one process may be leaked to another.
      
      Similarly for VMX.
      
      A simple testcase to replicate this will be posted to
      tools/testing/selftests/powerpc/tm/tm-poison.c
      
      This fixes CVE-2019-15031.
      
      Fixes: a7771176 ("powerpc: Don't enable FP/Altivec if not checkpointed")
      Cc: stable@vger.kernel.org # 4.15+
      Signed-off-by: NGustavo Romero <gromero@linux.ibm.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190904045529.23002-2-gromero@linux.vnet.ibm.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      569775bd
    • B
      powerpc/tm: Remove msr_tm_active() · 052bc385
      Breno Leitao 提交于
      [ Upstream commit 5c784c8414fba11b62e12439f11e109fb5751f38 ]
      
      Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
      CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
      returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
      
      This function is not necessary, since MSR_TM_ACTIVE() just do the same and
      could be used, removing the dualism and simplifying the code.
      
      This patchset remove every instance of msr_tm_active() and replaced it
      by MSR_TM_ACTIVE().
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      052bc385
    • L
      PCI: Reset both NVIDIA GPU and HDA in ThinkPad P50 workaround · f193e022
      Lyude Paul 提交于
      [ Upstream commit ad54567ad5d8e938ee6cf02e4f3867f18835ae6e ]
      
      quirk_reset_lenovo_thinkpad_50_nvgpu() resets NVIDIA GPUs to work around
      an apparent BIOS defect.  It previously used pci_reset_function(), and
      the available method was a bus reset, which was fine because there was
      only one function on the bus.  After b516ea586d71 ("PCI: Enable NVIDIA
      HDA controllers"), there are now two functions (the HDA controller and
      the GPU itself) on the bus, so the reset fails.
      
      Use pci_reset_bus() explicitly instead of pci_reset_function() since it's
      OK to reset both devices.
      
      [bhelgaas: commit log, add e0547c81bfcf]
      Fixes: b516ea586d71 ("PCI: Enable NVIDIA HDA controllers")
      Fixes: e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary")
      Link: https://lore.kernel.org/r/20190801220117.14952-1-lyude@redhat.comSigned-off-by: NLyude Paul <lyude@redhat.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NBen Skeggs <bskeggs@redhat.com>
      Cc: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Drake <drake@endlessm.com>
      Cc: Aaron Plattner <aplattner@nvidia.com>
      Cc: Peter Wu <peter@lekensteyn.nl>
      Cc: Ilia Mirkin <imirkin@alum.mit.edu>
      Cc: Karol Herbst <kherbst@redhat.com>
      Cc: Maik Freudenberg <hhfeuer@gmx.de>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f193e022
    • C
      ext4: unsigned int compared against zero · ff693225
      Colin Ian King 提交于
      [ Upstream commit fbbbbd2f28aec991f3fbc248df211550fbdfd58c ]
      
      There are two cases where u32 variables n and err are being checked
      for less than zero error values, the checks is always false because
      the variables are not signed. Fix this by making the variables ints.
      
      Addresses-Coverity: ("Unsigned compared against 0")
      Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ff693225
    • T
      ext4: fix block validity checks for journal inodes using indirect blocks · 292666d2
      Theodore Ts'o 提交于
      [ Upstream commit 170417c8c7bb2cbbdd949bf5c443c0c8f24a203b ]
      
      Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using
      block_validity") failed to add an exception for the journal inode in
      ext4_check_blockref(), which is the function used by ext4_get_branch()
      for indirect blocks.  This caused attempts to read from the ext3-style
      journals to fail with:
      
      [  848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
      
      Fix this by adding the missing exception check.
      
      Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
      Reported-by: NArthur Marsh <arthur.marsh@internode.on.net>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      292666d2
    • T
      ext4: don't perform block validity checks on the journal inode · 97fbf573
      Theodore Ts'o 提交于
      [ Upstream commit 0a944e8a6c66ca04c7afbaa17e22bf208a8b37f0 ]
      
      Since the journal inode is already checked when we added it to the
      block validity's system zone, if we check it again, we'll just trigger
      a failure.
      
      This was causing failures like this:
      
      [   53.897001] EXT4-fs error (device sda): ext4_find_extent:909: inode
      #8: comm jbd2/sda-8: pblk 121667583 bad header/extent: invalid extent entries - magic f30a, entries 8, max 340(340), depth 0(0)
      [   53.931430] jbd2_journal_bmap: journal block not found at offset 49 on sda-8
      [   53.938480] Aborting journal on device sda-8.
      
      ... but only if the system was under enough memory pressure that
      logical->physical mapping for the journal inode gets pushed out of the
      extent cache.  (This is why it wasn't noticed earlier.)
      
      Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
      Reported-by: NDan Rue <dan.rue@linaro.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      97fbf573
    • L
      drm/atomic_helper: Allow DPMS On<->Off changes for unregistered connectors · 1e88a1f8
      Lyude Paul 提交于
      [ Upstream commit 34ca26a98ad67edd6e4870fe2d4aa047d41a51dd ]
      
      It appears when testing my previous fix for some of the legacy
      modesetting issues with MST, I misattributed some kernel splats that
      started appearing on my machine after a rebase as being from upstream.
      But it appears they actually came from my patch series:
      
      [    2.980512] [drm:drm_atomic_helper_check_modeset [drm_kms_helper]] Updating routing for [CONNECTOR:65:eDP-1]
      [    2.980516] [drm:drm_atomic_helper_check_modeset [drm_kms_helper]] [CONNECTOR:65:eDP-1] is not registered
      [    2.980516] ------------[ cut here ]------------
      [    2.980519] Could not determine valid watermarks for inherited state
      [    2.980553] WARNING: CPU: 3 PID: 551 at drivers/gpu/drm/i915/intel_display.c:14983 intel_modeset_init+0x14d7/0x19f0 [i915]
      [    2.980556] Modules linked in: i915(O+) i2c_algo_bit drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops drm(O) intel_rapl x86_pkg_temp_thermal iTCO_wdt wmi_bmof coretemp crc32_pclmul psmouse i2c_i801 mei_me mei i2c_core lpc_ich mfd_core tpm_tis tpm_tis_core wmi tpm thinkpad_acpi pcc_cpufreq video ehci_pci crc32c_intel serio_raw ehci_hcd xhci_pci xhci_hcd
      [    2.980577] CPU: 3 PID: 551 Comm: systemd-udevd Tainted: G           O      4.19.0-rc7Lyude-Test+ #1
      [    2.980579] Hardware name: LENOVO 20BWS1KY00/20BWS1KY00, BIOS JBET63WW (1.27 ) 11/10/2016
      [    2.980605] RIP: 0010:intel_modeset_init+0x14d7/0x19f0 [i915]
      [    2.980607] Code: 89 df e8 ec 27 02 00 e9 24 f2 ff ff be 03 00 00 00 48 89 df e8 da 27 02 00 e9 26 f2 ff ff 48 c7 c7 c8 d1 34 a0 e8 23 cf dc e0 <0f> 0b e9 7c fd ff ff f6 c4 04 0f 85 37 f7 ff ff 48 8b 83 60 08 00
      [    2.980611] RSP: 0018:ffffc90000287988 EFLAGS: 00010282
      [    2.980614] RAX: 0000000000000000 RBX: ffff88031b488000 RCX: 0000000000000006
      [    2.980617] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff880321ad54d0
      [    2.980620] RBP: ffffc90000287a10 R08: 000000000000040a R09: 0000000000000065
      [    2.980623] R10: ffff88030ebb8f00 R11: ffffffff81416590 R12: ffff88031b488000
      [    2.980626] R13: ffff88031b4883a0 R14: ffffc900002879a8 R15: ffff880319099800
      [    2.980630] FS:  00007f475620d180(0000) GS:ffff880321ac0000(0000) knlGS:0000000000000000
      [    2.980633] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    2.980636] CR2: 00007f9ef28018a0 CR3: 000000031b72c001 CR4: 00000000003606e0
      [    2.980639] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    2.980642] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    2.980645] Call Trace:
      [    2.980675]  i915_driver_load+0xb0e/0xdc0 [i915]
      [    2.980681]  ? kernfs_add_one+0xe7/0x130
      [    2.980709]  i915_pci_probe+0x46/0x60 [i915]
      [    2.980715]  pci_device_probe+0xd4/0x150
      [    2.980719]  really_probe+0x243/0x3b0
      [    2.980722]  driver_probe_device+0xba/0x100
      [    2.980726]  __driver_attach+0xe4/0x110
      [    2.980729]  ? driver_probe_device+0x100/0x100
      [    2.980733]  bus_for_each_dev+0x74/0xb0
      [    2.980736]  driver_attach+0x1e/0x20
      [    2.980739]  bus_add_driver+0x159/0x230
      [    2.980743]  ? 0xffffffffa0393000
      [    2.980746]  driver_register+0x70/0xc0
      [    2.980749]  ? 0xffffffffa0393000
      [    2.980753]  __pci_register_driver+0x57/0x60
      [    2.980780]  i915_init+0x55/0x58 [i915]
      [    2.980785]  do_one_initcall+0x4a/0x1c4
      [    2.980789]  ? do_init_module+0x27/0x210
      [    2.980793]  ? kmem_cache_alloc_trace+0x131/0x190
      [    2.980797]  do_init_module+0x60/0x210
      [    2.980800]  load_module+0x2063/0x22e0
      [    2.980804]  ? vfs_read+0x116/0x140
      [    2.980807]  ? vfs_read+0x116/0x140
      [    2.980811]  __do_sys_finit_module+0xbd/0x120
      [    2.980814]  ? __do_sys_finit_module+0xbd/0x120
      [    2.980818]  __x64_sys_finit_module+0x1a/0x20
      [    2.980821]  do_syscall_64+0x5a/0x110
      [    2.980824]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [    2.980826] RIP: 0033:0x7f4754e32879
      [    2.980828] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f7 45 2c 00 f7 d8 64 89 01 48
      [    2.980831] RSP: 002b:00007fff43fd97d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
      [    2.980834] RAX: ffffffffffffffda RBX: 0000559a44ca64f0 RCX: 00007f4754e32879
      [    2.980836] RDX: 0000000000000000 RSI: 00007f475599f4cd RDI: 0000000000000018
      [    2.980838] RBP: 00007f475599f4cd R08: 0000000000000000 R09: 0000000000000000
      [    2.980839] R10: 0000000000000018 R11: 0000000000000246 R12: 0000000000000000
      [    2.980841] R13: 0000559a44c92fd0 R14: 0000000000020000 R15: 0000000000000000
      [    2.980881] WARNING: CPU: 3 PID: 551 at drivers/gpu/drm/i915/intel_display.c:14983 intel_modeset_init+0x14d7/0x19f0 [i915]
      [    2.980884] ---[ end trace 5eb47a76277d4731 ]---
      
      The cause of this appears to be due to the fact that if there's
      pre-existing display state that was set by the BIOS when i915 loads, it
      will attempt to perform a modeset before the driver is registered with
      userspace. Since this happens before the driver's registered with
      userspace, it's connectors are also unregistered and thus-states which
      would turn on DPMS on a connector end up getting rejected since the
      connector isn't registered.
      
      These bugs managed to get past Intel's CI partially due to the fact it
      never ran a full test on my patches for some reason, but also because
      all of the tests unload the GPU once before running. Since this bug is
      only really triggered when the drivers tries to perform a modeset before
      it's been fully registered with userspace when coming from whatever
      display configuration the firmware left us with, it likely would never
      have been picked up by CI in the first place.
      
      After some discussion with vsyrjala, we decided the best course of
      action would be to just move the unregistered connector checks out of
      update_connector_routing() and into drm_atomic_set_crtc_for_connector().
      The reason for this being that legacy modesetting isn't going to be
      expecting failures anywhere (at least this is the case with X), so
      ideally we want to ensure that any DPMS changes will still work even on
      unregistered connectors. Instead, we now only reject new modesets which
      would change the current CRTC assigned to an unregistered connector
      unless no new CRTC is being assigned to replace the connector's previous
      one.
      Signed-off-by: NLyude Paul <lyude@redhat.com>
      Reported-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Fixes: 4d80273976bf ("drm/atomic_helper: Disallow new modesets on unregistered connectors")
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20181009204424.21462-1-lyude@redhat.com
      (cherry picked from commit b5d29843d8ef86d4cde4742e095b81b7fd41e688)
      Fixes: e96550956fbc ("drm/atomic_helper: Disallow new modesets on unregistered connectors")
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1e88a1f8
    • H
      virtio/s390: fix race on airq_areas[] · b1dd1d06
      Halil Pasic 提交于
      [ Upstream commit 4f419eb14272e0698e8c55bb5f3f266cc2a21c81 ]
      
      The access to airq_areas was racy ever since the adapter interrupts got
      introduced to virtio-ccw, but since commit 39c7dcb15892 ("virtio/s390:
      make airq summary indicators DMA") this became an issue in practice as
      well. Namely before that commit the airq_info that got overwritten was
      still functional. After that commit however the two infos share a
      summary_indicator, which aggravates the situation. Which means
      auto-online mechanism occasionally hangs the boot with virtio_blk.
      Signed-off-by: NHalil Pasic <pasic@linux.ibm.com>
      Reported-by: NMarc Hartmayer <mhartmay@linux.ibm.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 96b14536 ("virtio-ccw: virtio-ccw adapter interrupt support.")
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b1dd1d06
    • V
      drm/i915: Make sure cdclk is high enough for DP audio on VLV/CHV · 057cdb6f
      Ville Syrjälä 提交于
      [ Upstream commit a8f196a0fa6391a436f63f360a1fb57031fdf26c ]
      
      On VLV/CHV there is some kind of linkage between the cdclk frequency
      and the DP link frequency. The spec says:
      "For DP audio configuration, cdclk frequency shall be set to
       meet the following requirements:
       DP Link Frequency(MHz) | Cdclk frequency(MHz)
       270                    | 320 or higher
       162                    | 200 or higher"
      
      I suspect that would more accurately be expressed as
      "cdclk >= DP link clock", and in any case we can express it like
      that in the code because of the limited set of cdclk (200, 266,
      320, 400 MHz) and link frequencies (162 and 270 MHz) we support.
      
      Without this we can end up in a situation where the cdclk
      is too low and enabling DP audio will kill the pipe. Happens
      eg. with 2560x1440 modes where the 266MHz cdclk is sufficient
      to pump the pixels (241.5 MHz dotclock) but is too low for
      the DP audio due to the link frequency being 270 MHz.
      
      v2: Spell out the cdclk and link frequencies we actually support
      
      Cc: stable@vger.kernel.org
      Tested-by: NStefan Gottwald <gottwald@igel.com>
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111149Signed-off-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190717114536.22937-1-ville.syrjala@linux.intel.comAcked-by: NChris Wilson <chris@chris-wilson.co.uk>
      (cherry picked from commit bffb31f73b29a60ef693842d8744950c2819851d)
      Signed-off-by: NJani Nikula <jani.nikula@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      057cdb6f
    • C
      bcache: fix race in btree_flush_write() · b113f984
      Coly Li 提交于
      [ Upstream commit 50a260e859964002dab162513a10f91ae9d3bcd3 ]
      
      There is a race between mca_reap(), btree_node_free() and journal code
      btree_flush_write(), which results very rare and strange deadlock or
      panic and are very hard to reproduce.
      
      Let me explain how the race happens. In btree_flush_write() one btree
      node with oldest journal pin is selected, then it is flushed to cache
      device, the select-and-flush is a two steps operation. Between these two
      steps, there are something may happen inside the race window,
      - The selected btree node was reaped by mca_reap() and allocated to
        other requesters for other btree node.
      - The slected btree node was selected, flushed and released by mca
        shrink callback bch_mca_scan().
      When btree_flush_write() tries to flush the selected btree node, firstly
      b->write_lock is held by mutex_lock(). If the race happens and the
      memory of selected btree node is allocated to other btree node, if that
      btree node's write_lock is held already, a deadlock very probably
      happens here. A worse case is the memory of the selected btree node is
      released, then all references to this btree node (e.g. b->write_lock)
      will trigger NULL pointer deference panic.
      
      This race was introduced in commit cafe5635 ("bcache: A block layer
      cache"), and enlarged by commit c4dc2497 ("bcache: fix high CPU
      occupancy during journal"), which selected 128 btree nodes and flushed
      them one-by-one in a quite long time period.
      
      Such race is not easy to reproduce before. On a Lenovo SR650 server with
      48 Xeon cores, and configure 1 NVMe SSD as cache device, a MD raid0
      device assembled by 3 NVMe SSDs as backing device, this race can be
      observed around every 10,000 times btree_flush_write() gets called. Both
      deadlock and kernel panic all happened as aftermath of the race.
      
      The idea of the fix is to add a btree flag BTREE_NODE_journal_flush. It
      is set when selecting btree nodes, and cleared after btree nodes
      flushed. Then when mca_reap() selects a btree node with this bit set,
      this btree node will be skipped. Since mca_reap() only reaps btree node
      without BTREE_NODE_journal_flush flag, such race is avoided.
      
      Once corner case should be noticed, that is btree_node_free(). It might
      be called in some error handling code path. For example the following
      code piece from btree_split(),
              2149 err_free2:
              2150         bkey_put(b->c, &n2->key);
              2151         btree_node_free(n2);
              2152         rw_unlock(true, n2);
              2153 err_free1:
              2154         bkey_put(b->c, &n1->key);
              2155         btree_node_free(n1);
              2156         rw_unlock(true, n1);
      At line 2151 and 2155, the btree node n2 and n1 are released without
      mac_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
      If btree_node_free() is called directly in such error handling path,
      and the selected btree node has BTREE_NODE_journal_flush bit set, just
      delay for 1 us and retry again. In this case this btree node won't
      be skipped, just retry until the BTREE_NODE_journal_flush bit cleared,
      and free the btree node memory.
      
      Fixes: cafe5635 ("bcache: A block layer cache")
      Signed-off-by: NColy Li <colyli@suse.de>
      Reported-and-tested-by: Nkbuild test robot <lkp@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b113f984
    • C
      bcache: add comments for mutex_lock(&b->write_lock) · f73c35d9
      Coly Li 提交于
      [ Upstream commit 41508bb7d46b74dba631017e5a702a86caf1db8c ]
      
      When accessing or modifying BTREE_NODE_dirty bit, it is not always
      necessary to acquire b->write_lock. In bch_btree_cache_free() and
      mca_reap() acquiring b->write_lock is necessary, and this patch adds
      comments to explain why mutex_lock(&b->write_lock) is necessary for
      checking or clearing BTREE_NODE_dirty bit there.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f73c35d9
    • C
      bcache: only clear BTREE_NODE_dirty bit when it is set · 7989a502
      Coly Li 提交于
      [ Upstream commit e5ec5f4765ada9c75fb3eee93a6e72f0e50599d5 ]
      
      In bch_btree_cache_free() and btree_node_free(), BTREE_NODE_dirty is
      always set no matter btree node is dirty or not. The code looks like
      this,
      	if (btree_node_dirty(b))
      		btree_complete_write(b, btree_current_write(b));
      	clear_bit(BTREE_NODE_dirty, &b->flags);
      
      Indeed if btree_node_dirty(b) returns false, it means BTREE_NODE_dirty
      bit is cleared, then it is unnecessary to clear the bit again.
      
      This patch only clears BTREE_NODE_dirty when btree_node_dirty(b) is
      true (the bit is set), to save a few CPU cycles.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7989a502
    • T
      NFSv4: Fix delegation state recovery · 652993a5
      Trond Myklebust 提交于
      [ Upstream commit 5eb8d18ca0e001c6055da2b7f30d8f6dca23a44f ]
      
      Once we clear the NFS_DELEGATED_STATE flag, we're telling
      nfs_delegation_claim_opens() that we're done recovering all open state
      for that stateid, so we really need to ensure that we test for all
      open modes that are currently cached and recover them before exiting
      nfs4_open_delegation_recall().
      
      Fixes: 24311f88 ("NFSv4: Recovery of recalled read delegations...")
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.3+
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      652993a5
    • A
      iio: adc: gyroadc: fix uninitialized return code · 5026932a
      Arnd Bergmann 提交于
      [ Upstream commit 90c6260c1905a68fb596844087f2223bd4657fee ]
      
      gcc-9 complains about a blatant uninitialized variable use that
      all earlier compiler versions missed:
      
      drivers/iio/adc/rcar-gyroadc.c:510:5: warning: 'ret' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Return -EINVAL instead here and a few lines above it where
      we accidentally return 0 on failure.
      
      Cc: stable@vger.kernel.org
      Fixes: 059c53b3 ("iio: adc: Add Renesas GyroADC driver")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NWolfram Sang <wsa+renesas@sang-engineering.com>
      Signed-off-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5026932a
    • R
      mm/migrate.c: initialize pud_entry in migrate_vma() · 2e7e7c8f
      Ralph Campbell 提交于
      [ Upstream commit 7b358c6f12dc82364f6d317f8c8f1d794adbc3f5 ]
      
      When CONFIG_MIGRATE_VMA_HELPER is enabled, migrate_vma() calls
      migrate_vma_collect() which initializes a struct mm_walk but didn't
      initialize mm_walk.pud_entry.  (Found by code inspection) Use a C
      structure initialization to make sure it is set to NULL.
      
      Link: http://lkml.kernel.org/r/20190719233225.12243-1-rcampbell@nvidia.com
      Fixes: 8763cb45 ("mm/migrate: new memory migration helper for use with device memory")
      Signed-off-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      2e7e7c8f