1. 08 1月, 2021 1 次提交
    • T
      KVM: SVM: Add support for booting APs in an SEV-ES guest · 647daca2
      Tom Lendacky 提交于
      Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
      where the guest vCPU register state is updated and then the vCPU is VMRUN
      to begin execution of the AP. For an SEV-ES guest, this won't work because
      the guest register state is encrypted.
      
      Following the GHCB specification, the hypervisor must not alter the guest
      register state, so KVM must track an AP/vCPU boot. Should the guest want
      to park the AP, it must use the AP Reset Hold exit event in place of, for
      example, a HLT loop.
      
      First AP boot (first INIT-SIPI-SIPI sequence):
        Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
        support. It is up to the guest to transfer control of the AP to the
        proper location.
      
      Subsequent AP boot:
        KVM will expect to receive an AP Reset Hold exit event indicating that
        the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
        awaken it. When the AP Reset Hold exit event is received, KVM will place
        the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
        sequence, KVM will make the vCPU runnable. It is again up to the guest
        to then transfer control of the AP to the proper location.
      
        To differentiate between an actual HLT and an AP Reset Hold, a new MP
        state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
        placed in upon receiving the AP Reset Hold exit event. Additionally, to
        communicate the AP Reset Hold exit event up to userspace (if needed), a
        new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
      
      A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
      to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
      original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
      a new function that, for non SEV-ES guests, invokes the original SIPI
      delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
      implements the logic above.
      Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      647daca2
  2. 12 12月, 2020 3 次提交
  3. 09 12月, 2020 6 次提交
  4. 08 12月, 2020 1 次提交
    • S
      netfilter: x_tables: Switch synchronization to RCU · cc00bcaa
      Subash Abhinov Kasiviswanathan 提交于
      When running concurrent iptables rules replacement with data, the per CPU
      sequence count is checked after the assignment of the new information.
      The sequence count is used to synchronize with the packet path without the
      use of any explicit locking. If there are any packets in the packet path using
      the table information, the sequence count is incremented to an odd value and
      is incremented to an even after the packet process completion.
      
      The new table value assignment is followed by a write memory barrier so every
      CPU should see the latest value. If the packet path has started with the old
      table information, the sequence counter will be odd and the iptables
      replacement will wait till the sequence count is even prior to freeing the
      old table info.
      
      However, this assumes that the new table information assignment and the memory
      barrier is actually executed prior to the counter check in the replacement
      thread. If CPU decides to execute the assignment later as there is no user of
      the table information prior to the sequence check, the packet path in another
      CPU may use the old table information. The replacement thread would then free
      the table information under it leading to a use after free in the packet
      processing context-
      
      Unable to handle kernel NULL pointer dereference at virtual
      address 000000000000008e
      pc : ip6t_do_table+0x5d0/0x89c
      lr : ip6t_do_table+0x5b8/0x89c
      ip6t_do_table+0x5d0/0x89c
      ip6table_filter_hook+0x24/0x30
      nf_hook_slow+0x84/0x120
      ip6_input+0x74/0xe0
      ip6_rcv_finish+0x7c/0x128
      ipv6_rcv+0xac/0xe4
      __netif_receive_skb+0x84/0x17c
      process_backlog+0x15c/0x1b8
      napi_poll+0x88/0x284
      net_rx_action+0xbc/0x23c
      __do_softirq+0x20c/0x48c
      
      This could be fixed by forcing instruction order after the new table
      information assignment or by switching to RCU for the synchronization.
      
      Fixes: 80055dab ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
      Reported-by: NSean Tranchetti <stranche@codeaurora.org>
      Reported-by: Nkernel test robot <lkp@intel.com>
      Suggested-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      cc00bcaa
  5. 07 12月, 2020 1 次提交
    • M
      mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING · e91d8d78
      Minchan Kim 提交于
      While I was doing zram testing, I found sometimes decompression failed
      since the compression buffer was corrupted.  With investigation, I found
      below commit calls cond_resched unconditionally so it could make a
      problem in atomic context if the task is reschedule.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:108
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
        3 locks held by memhog/946:
         #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
         #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
         #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
        CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
        Call Trace:
          unmap_kernel_range_noflush+0x2eb/0x350
          unmap_kernel_range+0x14/0x30
          zs_unmap_object+0xd5/0xe0
          zram_bvec_rw.isra.0+0x38c/0x8e0
          zram_rw_page+0x90/0x101
          bdev_write_page+0x92/0xe0
          __swap_writepage+0x94/0x4a0
          pageout+0xe3/0x3a0
          shrink_page_list+0xb94/0xd60
          shrink_inactive_list+0x158/0x460
      
      We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
      contains the offending calling code) from zsmalloc.
      
      Even though this option showed some amount improvement(e.g., 30%) in
      some arm32 platforms, it has been headache to maintain since it have
      abused APIs[1](e.g., unmap_kernel_range in atomic context).
      
      Since we are approaching to deprecate 32bit machines and already made
      the config option available for only builtin build since v5.8, lastly it
      has been not default option in zsmalloc, it's time to drop the option
      for better maintenance.
      
      [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
      
      Fixes: e47110e9 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Harish Sriram <harish@linux.ibm.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e91d8d78
  6. 06 12月, 2020 1 次提交
    • V
      net: mscc: ocelot: fix dropping of unknown IPv4 multicast on Seville · edd2410b
      Vladimir Oltean 提交于
      The current assumption is that the felix DSA driver has flooding knobs
      per traffic class, while ocelot switchdev has a single flooding knob.
      This was correct for felix VSC9959 and ocelot VSC7514, but with the
      introduction of seville VSC9953, we see a switch driven by felix.c which
      has a single flooding knob.
      
      So it is clear that we must do what should have been done from the
      beginning, which is not to overwrite the configuration done by ocelot.c
      in felix, but instead to teach the common ocelot library about the
      differences in our switches, and set up the flooding PGIDs centrally.
      
      The effect that the bogus iteration through FELIX_NUM_TC has upon
      seville is quite dramatic. ANA_FLOODING is located at 0x00b548, and
      ANA_FLOODING_IPMC is located at 0x00b54c. So the bogus iteration will
      actually overwrite ANA_FLOODING_IPMC when attempting to write
      ANA_FLOODING[1]. There is no ANA_FLOODING[1] in sevile, just ANA_FLOODING.
      
      And when ANA_FLOODING_IPMC is overwritten with a bogus value, the effect
      is that ANA_FLOODING_IPMC gets the value of 0x0003CF7D:
      	MC6_DATA = 61,
      	MC6_CTRL = 61,
      	MC4_DATA = 60,
      	MC4_CTRL = 0.
      Because MC4_CTRL is zero, this means that IPv4 multicast control packets
      are not flooded, but dropped. An invalid configuration, and this is how
      the issue was actually spotted.
      Reported-by: NEldar Gasanov <eldargasanov2@gmail.com>
      Reported-by: NMaxim Kochetkov <fido_max@inbox.ru>
      Tested-by: NEldar Gasanov <eldargasanov2@gmail.com>
      Fixes: 84705fc1 ("net: dsa: felix: introduce support for Seville VSC9953 switch")
      Fixes: 3c7b51bd ("net: dsa: felix: allow flooding for all traffic classes")
      Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: NAlexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20201204175416.1445937-1-vladimir.oltean@nxp.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      edd2410b
  7. 05 12月, 2020 4 次提交
    • S
      [SECURITY] fix namespaced fscaps when !CONFIG_SECURITY · ed9b25d1
      Serge Hallyn 提交于
      Namespaced file capabilities were introduced in 8db6c34f .
      When userspace reads an xattr for a namespaced capability, a
      virtualized representation of it is returned if the caller is
      in a user namespace owned by the capability's owning rootid.
      The function which performs this virtualization was not hooked
      up if CONFIG_SECURITY=n.  Therefore in that case the original
      xattr was shown instead of the virtualized one.
      
      To test this using libcap-bin (*1),
      
      $ v=$(mktemp)
      $ unshare -Ur setcap cap_sys_admin-eip $v
      $ unshare -Ur setcap -v cap_sys_admin-eip $v
      /tmp/tmp.lSiIFRvt8Y: OK
      
      "setcap -v" verifies the values instead of setting them, and
      will check whether the rootid value is set.  Therefore, with
      this bug un-fixed, and with CONFIG_SECURITY=n, setcap -v will
      fail:
      
      $ v=$(mktemp)
      $ unshare -Ur setcap cap_sys_admin=eip $v
      $ unshare -Ur setcap -v cap_sys_admin=eip $v
      nsowner[got=1000, want=0],/tmp/tmp.HHDiOOl9fY differs in []
      
      Fix this bug by calling cap_inode_getsecurity() in
      security_inode_getsecurity() instead of returning
      -EOPNOTSUPP, when CONFIG_SECURITY=n.
      
      *1 - note, if libcap is too old for getcap to have the '-n'
      option, then use verify-caps instead.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209689
      Cc: Hervé Guillemet <herve@guillemet.org>
      Acked-by: NCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NSerge Hallyn <shallyn@cisco.com>
      Signed-off-by: NAndrew G. Morgan <morgan@kernel.org>
      Signed-off-by: NJames Morris <jamorris@linux.microsoft.com>
      ed9b25d1
    • M
      block: fix incorrect branching in blk_max_size_offset() · 65f33b35
      Mike Snitzer 提交于
      If non-zero 'chunk_sectors' is passed in to blk_max_size_offset() that
      override will be incorrectly ignored.
      
      Old blk_max_size_offset() branching, prior to commit 3ee16db3,
      must be used only if passed 'chunk_sectors' override is zero.
      
      Fixes: 3ee16db3 ("dm: fix IO splitting")
      Cc: stable@vger.kernel.org # 5.9
      Reported-by: NJohn Dorminy <jdorminy@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      65f33b35
    • M
      dm: fix IO splitting · 3ee16db3
      Mike Snitzer 提交于
      Commit 882ec4e6 ("dm table: stack 'chunk_sectors' limit to account
      for target-specific splitting") caused a couple regressions:
      1) Using lcm_not_zero() when stacking chunk_sectors was a bug because
         chunk_sectors must reflect the most limited of all devices in the
         IO stack.
      2) DM targets that set max_io_len but that do _not_ provide an
         .iterate_devices method no longer had there IO split properly.
      
      And commit 5091cdec ("dm: change max_io_len() to use
      blk_max_size_offset()") also caused a regression where DM no longer
      supported varied (per target) IO splitting. The implication being the
      potential for severely reduced performance for IO stacks that use a DM
      target like dm-cache to hide performance limitations of a slower
      device (e.g. one that requires 4K IO splitting).
      
      Coming full circle: Fix all these issues by discontinuing stacking
      chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(),
      add optional chunk_sectors override argument to blk_max_size_offset()
      and update DM's max_io_len() to pass ti->max_io_len to its
      blk_max_size_offset() call.
      
      Passing in an optional chunk_sectors override to blk_max_size_offset()
      allows for code reuse of block's centralized calculation for max IO
      size based on provided offset and split boundary.
      
      Fixes: 882ec4e6 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
      Fixes: 5091cdec ("dm: change max_io_len() to use blk_max_size_offset()")
      Cc: stable@vger.kernel.org
      Reported-by: NJohn Dorminy <jdorminy@redhat.com>
      Reported-by: NBruce Johnston <bjohnsto@redhat.com>
      Reported-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: NJohn Dorminy <jdorminy@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Reviewed-by: NJens Axboe <axboe@kernel.dk>
      3ee16db3
    • J
      tty: Fix ->session locking · c8bcd9c5
      Jann Horn 提交于
      Currently, locking of ->session is very inconsistent; most places
      protect it using the legacy tty mutex, but disassociate_ctty(),
      __do_SAK(), tiocspgrp() and tiocgsid() don't.
      Two of the writers hold the ctrl_lock (because they already need it for
      ->pgrp), but __proc_set_tty() doesn't do that yet.
      
      On a PREEMPT=y system, an unprivileged user can theoretically abuse
      this broken locking to read 4 bytes of freed memory via TIOCGSID if
      tiocgsid() is preempted long enough at the right point. (Other things
      might also go wrong, especially if root-only ioctls are involved; I'm
      not sure about that.)
      
      Change the locking on ->session such that:
      
       - tty_lock() is held by all writers: By making disassociate_ctty()
         hold it. This should be fine because the same lock can already be
         taken through the call to tty_vhangup_session().
         The tricky part is that we need to shorten the area covered by
         siglock to be able to take tty_lock() without ugly retry logic; as
         far as I can tell, this should be fine, since nothing in the
         signal_struct is touched in the `if (tty)` branch.
       - ctrl_lock is held by all writers: By changing __proc_set_tty() to
         hold the lock a little longer.
       - All readers that aren't holding tty_lock() hold ctrl_lock: By
         adding locking to tiocgsid() and __do_SAK(), and expanding the area
         covered by ctrl_lock in tiocspgrp().
      
      Cc: stable@kernel.org
      Signed-off-by: NJann Horn <jannh@google.com>
      Reviewed-by: NJiri Slaby <jirislaby@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c8bcd9c5
  8. 04 12月, 2020 3 次提交
  9. 02 12月, 2020 1 次提交
  10. 30 11月, 2020 3 次提交
  11. 28 11月, 2020 1 次提交
  12. 27 11月, 2020 4 次提交
  13. 26 11月, 2020 1 次提交
  14. 25 11月, 2020 3 次提交
  15. 24 11月, 2020 3 次提交
  16. 23 11月, 2020 3 次提交
  17. 22 11月, 2020 1 次提交
    • J
      bonding: wait for sysfs kobject destruction before freeing struct slave · b9ad3e9f
      Jamie Iles 提交于
      syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, releasing a
      struct slave device could result in the following splat:
      
        kobject: 'bonding_slave' (00000000cecdd4fe): kobject_release, parent 0000000074ceb2b2 (delayed 1000)
        bond0 (unregistering): (slave bond_slave_1): Releasing backup interface
        ------------[ cut here ]------------
        ODEBUG: free active (active state 0) object type: timer_list hint: workqueue_select_cpu_near kernel/workqueue.c:1549 [inline]
        ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98 kernel/workqueue.c:1600
        WARNING: CPU: 1 PID: 842 at lib/debugobjects.c:485 debug_print_object+0x180/0x240 lib/debugobjects.c:485
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 842 Comm: kworker/u4:4 Tainted: G S                5.9.0-rc8+ #96
        Hardware name: linux,dummy-virt (DT)
        Workqueue: netns cleanup_net
        Call trace:
         dump_backtrace+0x0/0x4d8 include/linux/bitmap.h:239
         show_stack+0x34/0x48 arch/arm64/kernel/traps.c:142
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x174/0x1f8 lib/dump_stack.c:118
         panic+0x360/0x7a0 kernel/panic.c:231
         __warn+0x244/0x2ec kernel/panic.c:600
         report_bug+0x240/0x398 lib/bug.c:198
         bug_handler+0x50/0xc0 arch/arm64/kernel/traps.c:974
         call_break_hook+0x160/0x1d8 arch/arm64/kernel/debug-monitors.c:322
         brk_handler+0x30/0xc0 arch/arm64/kernel/debug-monitors.c:329
         do_debug_exception+0x184/0x340 arch/arm64/mm/fault.c:864
         el1_dbg+0x48/0xb0 arch/arm64/kernel/entry-common.c:65
         el1_sync_handler+0x170/0x1c8 arch/arm64/kernel/entry-common.c:93
         el1_sync+0x80/0x100 arch/arm64/kernel/entry.S:594
         debug_print_object+0x180/0x240 lib/debugobjects.c:485
         __debug_check_no_obj_freed lib/debugobjects.c:967 [inline]
         debug_check_no_obj_freed+0x200/0x430 lib/debugobjects.c:998
         slab_free_hook mm/slub.c:1536 [inline]
         slab_free_freelist_hook+0x190/0x210 mm/slub.c:1577
         slab_free mm/slub.c:3138 [inline]
         kfree+0x13c/0x460 mm/slub.c:4119
         bond_free_slave+0x8c/0xf8 drivers/net/bonding/bond_main.c:1492
         __bond_release_one+0xe0c/0xec8 drivers/net/bonding/bond_main.c:2190
         bond_slave_netdev_event drivers/net/bonding/bond_main.c:3309 [inline]
         bond_netdev_event+0x8f0/0xa70 drivers/net/bonding/bond_main.c:3420
         notifier_call_chain+0xf0/0x200 kernel/notifier.c:83
         __raw_notifier_call_chain kernel/notifier.c:361 [inline]
         raw_notifier_call_chain+0x44/0x58 kernel/notifier.c:368
         call_netdevice_notifiers_info+0xbc/0x150 net/core/dev.c:2033
         call_netdevice_notifiers_extack net/core/dev.c:2045 [inline]
         call_netdevice_notifiers net/core/dev.c:2059 [inline]
         rollback_registered_many+0x6a4/0xec0 net/core/dev.c:9347
         unregister_netdevice_many.part.0+0x2c/0x1c0 net/core/dev.c:10509
         unregister_netdevice_many net/core/dev.c:10508 [inline]
         default_device_exit_batch+0x294/0x338 net/core/dev.c:10992
         ops_exit_list.isra.0+0xec/0x150 net/core/net_namespace.c:189
         cleanup_net+0x44c/0x888 net/core/net_namespace.c:603
         process_one_work+0x96c/0x18c0 kernel/workqueue.c:2269
         worker_thread+0x3f0/0xc30 kernel/workqueue.c:2415
         kthread+0x390/0x498 kernel/kthread.c:292
         ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:925
      
      This is a potential use-after-free if the sysfs nodes are being accessed
      whilst removing the struct slave, so wait for the object destruction to
      complete before freeing the struct slave itself.
      
      Fixes: 07699f9a ("bonding: add sysfs /slave dir for bond slave devices.")
      Fixes: a068aab4 ("bonding: Fix reference count leak in bond_sysfs_slave_add.")
      Cc: Qiushi Wu <wu000273@umn.edu>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: NJamie Iles <jamie@nuviainc.com>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Link: https://lore.kernel.org/r/20201120142827.879226-1-jamie@nuviainc.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      b9ad3e9f