1. 22 April 2021 (3 commits)
    • KVM: SVM: Add KVM_SEND_UPDATE_DATA command · d3d1af85
      Brijesh Singh authored
      The command is used for encrypting the guest memory region using the encryption
      context created with KVM_SEV_SEND_START.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <d6a6ea740b0c668b30905ae31eac5ad7da048bb3.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d3d1af85
    • KVM: SVM: Add KVM_SEV SEND_START command · 4cfdd47d
      Brijesh Singh authored
      The command is used to create an outgoing SEV guest encryption context.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <2f1686d0164e0f1b3d6a41d620408393e0a48376.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4cfdd47d
    • KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Nathan Tempelman authored
      Add a capability for userspace to mirror SEV encryption context from
      one vm to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54526d1f
  2. 20 April 2021 (1 commit)
    • KVM: x86: Add capability to grant VM access to privileged SGX attribute · fe7e9488
      Sean Christopherson authored
      Add a capability, KVM_CAP_SGX_ATTRIBUTE, that can be used by userspace
      to grant a VM access to a privileged attribute, with args[0] holding a
      file handle to a valid SGX attribute file.
      
      The SGX subsystem restricts access to a subset of enclave attributes to
      provide additional security for an uncompromised kernel, e.g. to prevent
      malware from using the PROVISIONKEY to ensure its nodes are running
      inside a genuine SGX enclave and/or to obtain a stable fingerprint.
      
      To prevent userspace from circumventing such restrictions by running an
      enclave in a VM, KVM restricts guest access to privileged attributes by
      default.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <0b099d65e933e068e3ea934b0523bab070cb8cea.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe7e9488
  3. 17 April 2021 (1 commit)
  4. 04 March 2021 (1 commit)
    • net: l2tp: reduce log level of messages in receive path, add counter instead · 3e59e885
      Matthias Schiffer authored
      Commit 5ee759cd ("l2tp: use standard API for warning log messages")
      changed a number of warnings about invalid packets in the receive path
      so that they are always shown, instead of only when a special L2TP debug
      flag is set. Even with rate limiting these warnings can easily cause
      significant log spam - potentially triggered by a malicious party
      sending invalid packets on purpose.
      
      In addition these warnings were noticed by projects like Tunneldigger [1],
      which uses L2TP for its data path, but implements its own control
      protocol (which is sufficiently different from L2TP data packets that it
      would always be passed up to userspace even with future extensions of
      L2TP).
      
      Some of the warnings were already redundant, as l2tp_stats has a counter
      for these packets. This commit adds one additional counter for invalid
      packets that are passed up to userspace. Packets with an unknown session are
      not counted as invalid, as there is nothing wrong with the format of
      these packets.
      
      With the additional counter, all of these messages are either redundant
      or benign, so we reduce them to pr_debug_ratelimited().
      
      [1] https://github.com/wlanslovenija/tunneldigger/issues/160
      
      Fixes: 5ee759cd ("l2tp: use standard API for warning log messages")
      Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e59e885
  5. 03 March 2021 (1 commit)
    • KVM: x86/xen: Add support for vCPU runstate information · 30b5c851
      David Woodhouse authored
      This is how Xen guests do steal time accounting. The hypervisor records
      the amount of time spent in each of running/runnable/blocked/offline
      states.
      
      In the Xen accounting, a vCPU is still in state RUNSTATE_running while
      in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules
      does the state become RUNSTATE_blocked. In KVM this means that even when
      the vCPU exits the kvm_run loop, the state remains RUNSTATE_running.
      
      The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given
      amount of time to the blocked state and subtract it from the running
      state.
      
      The state_entry_time corresponds to get_kvmclock_ns() at the time the
      vCPU entered the current state, and the total times of all four states
      should always add up to state_entry_time.
      Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20210301125309.874953-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      30b5c851
  6. 28 February 2021 (1 commit)
  7. 27 February 2021 (1 commit)
  8. 25 February 2021 (2 commits)
    • numa balancing: migrate on fault among multiple bound nodes · bda420b9
      Huang Ying authored
      Now, NUMA balancing can only optimize the page placement among the NUMA
      nodes if the default memory policy is used, because an explicitly
      specified memory policy should take precedence.  But this seems too strict in
      some situations.  For example, on a system with 4 NUMA nodes, if the
      memory of an application is bound to the node 0 and 1, NUMA balancing can
      potentially migrate the pages between the node 0 and 1 to reduce
      cross-node accessing without breaking the explicit memory binding policy.
      
      So in this patch, we add MPOL_F_NUMA_BALANCING mode flag to
      set_mempolicy() when mode is MPOL_BIND.  With the flag specified, NUMA
      balancing will be enabled within the thread to optimize the page placement
      within the constraints of the specified memory binding policy.  With the
      newly added flag, the NUMA balancing control mechanism becomes,
      
       - sysctl knob numa_balancing can enable/disable the NUMA balancing
         globally.
      
       - even if sysctl numa_balancing is enabled, the NUMA balancing will be
         disabled for the memory areas or applications with the explicit
         memory policy by default.
      
       - MPOL_F_NUMA_BALANCING can be used to enable the NUMA balancing for
         the applications when specifying the explicit memory policy
         (MPOL_BIND).
      
      Various page placement optimizations based on NUMA balancing can be
      done with these flags.  As the first step, in this patch, if the memory of
      the application is bound to multiple nodes (MPOL_BIND) and, in the hint
      page fault handler, the accessing node is in the policy nodemask, the
      kernel will try to migrate the page to the accessing node to reduce
      cross-node accessing.
      
      If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
      application on an old kernel version without its support, set_mempolicy()
      will return -1 and errno will be set to EINVAL.  The application can use
      this behavior to run on both old and new kernel versions.
      
      And if the MPOL_F_NUMA_BALANCING flag is specified for the mode other than
      MPOL_BIND, set_mempolicy() will return -1 and errno will be set to EINVAL
      as before.  Because we don't support optimization based on the NUMA
      balancing for these modes.
      
      In the previous version of the patch, we tried to reuse MPOL_MF_LAZY for
      mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it seems not a good
      API/ABI for the purpose of the patch.
      
      Because it's not clear whether it's necessary to enable NUMA balancing
      for a specific memory area inside an application, we only add the flag
      at the thread level (set_mempolicy()) instead of the memory area level
      (mbind()).  We can do that when it becomes necessary.
      
      To test the patch, we run a test case as follows on a 4-node machine with
      192 GB memory (48 GB per node).
      
      1. Change pmbench memory accessing benchmark to call set_mempolicy()
         to bind its memory to node 1 and 3 and enable NUMA balancing.  Some
         related code snippets are as follows,
      
            #include <numa.h>
            #include <numaif.h>
            #include <errno.h>
            #include <stdio.h>
            #include <stdlib.h>

            struct bitmask *bmp;
            int ret;

            bmp = numa_parse_nodestring("1,3");
            ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
                                bmp->maskp, bmp->size + 1);
            /* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
            if (ret < 0 && errno == EINVAL)
                    ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
            if (ret < 0) {
                    perror("Failed to call set_mempolicy");
                    exit(-1);
            }
      
      2. Run a memory eater on node 3 to use 40 GB memory before running pmbench.
      
      3. Run pmbench with 64 processes, the working-set size of each process
         is 640 MB, so the total working-set size is 64 * 640 MB = 40 GB.  The
         CPU and the memory (as in step 1.) of all pmbench processes is bound
         to node 1 and 3. So, after CPU usage is balanced, some pmbench
         processes running on the CPUs of node 3 will access the memory of
         node 1.
      
      4. After the pmbench processes run for 100 seconds, kill the memory
         eater.  Now it's possible for some pmbench processes to migrate
         their pages from node 1 to node 3 to reduce cross-node accessing.
      
      Test results show that, with the patch, the pages can be migrated from
      node 1 to node 3 after killing the memory eater, and the pmbench score
      can increase about 17.5%.
      
      Link: https://lkml.kernel.org/r/20210120061235.148637-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda420b9
    • bpf: Remove blank line in bpf helper description comment · a7c9c25a
      Hangbin Liu authored
      Commit 34b2021c ("bpf: Add BPF-helper for MTU checking") added an extra
      blank line in bpf helper description. This will make bpf_helpers_doc.py stop
      building bpf_helper_defs.h immediately after bpf_check_mtu(), which will
      affect functions added in the future.
      
      Fixes: 34b2021c ("bpf: Add BPF-helper for MTU checking")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210223131457.1378978-1-liuhangbin@gmail.com
      a7c9c25a
  9. 24 February 2021 (1 commit)
    • io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS · 1c0aa1fa
      Jens Axboe authored
      A few reasons to do this:
      
      - The naming of the manager and worker has changed. That's a user-visible
        change, so it makes sense to flag it.
      
      - Opening certain files that use ->signal (like /proc/self or /dev/tty)
        now works, and the flag tells the application upfront that this is the
        case.
      
      - Related to the above, using signalfd will now work as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1c0aa1fa
  10. 23 February 2021 (3 commits)
  11. 17 February 2021 (4 commits)
    • cxl/mem: Add set of informational commands · 57ee605b
      Ben Widawsky authored
      Add initial set of formal commands beyond basic identify and command
      enumeration.
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> (v2)
      Link: https://lore.kernel.org/r/20210217040958.1354670-8-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      57ee605b
    • cxl/mem: Enable commands via CEL · 472b1ce6
      Ben Widawsky authored
      CXL devices identified by the memory-device class code must implement
      the Device Command Interface (described in 8.2.9 of the CXL 2.0 spec).
      While the driver already maintains a list of commands it supports, there
      is still a need to be able to distinguish between commands that the
      driver knows about from commands that are optionally supported by the
      hardware.
      
      The Command Effects Log (CEL) is specified in the CXL 2.0 specification.
      The CEL is one of two types of logs, the other being vendor specific.
      They are distinguished in hardware/spec via UUID. The CEL is useful for
      2 things:
      1. Determine which optional commands are supported by the CXL device.
      2. Enumerate any vendor specific commands
      
      The CEL is used by the driver to determine which commands are available
      in the hardware and therefore which commands userspace is allowed to
      execute. The set of enabled commands might be a subset of commands which
      are advertised in UAPI via CXL_MEM_SEND_COMMAND IOCTL.
      
      With the CEL enabling comes an internal flag to indicate a base set of
      commands that are enabled regardless of CEL. Such commands are required
      for basic interaction with the hardware and thus can be useful in debug
      cases, for example if the CEL is corrupted.
      
      The implementation leaves the statically defined table of commands and
      supplements it with a bitmap to determine commands that are enabled.
      This organization was chosen for the following reasons:
      - Smaller memory footprint. Doesn't need a table per device.
      - Reduce memory allocation complexity.
      - Fixed command IDs to opcode mapping for all devices makes development
        and debugging easier.
      - Certain helpers are easily achievable, like cxl_for_each_cmd().
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> (v3)
      Link: https://lore.kernel.org/r/20210217040958.1354670-7-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      472b1ce6
    • cxl/mem: Add a "RAW" send command · 13237183
      Ben Widawsky authored
      The CXL memory device send interface will have a number of supported
      commands. The raw command is not such a command. Raw commands allow
      userspace to send a specified opcode to the underlying hardware and
      bypass all driver checks on the command. The primary use for this
      command is to [begrudgingly] allow undocumented vendor specific hardware
      commands.
      
      While not the main motivation, it also allows prototyping new hardware
      commands without a driver patch and rebuild.
      
      While this all sounds very powerful it comes with a couple of caveats:
      1. Bug reports using raw commands will not get the same level of
         attention as bug reports using supported commands (via taint).
      2. Supported commands will be rejected by the RAW command.
      
      With this comes a new debugfs knob to allow full access to your toes with
      your weapon of choice.
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Ariel Sibley <Ariel.Sibley@microchip.com>
      Link: https://lore.kernel.org/r/20210217040958.1354670-6-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      13237183
    • cxl/mem: Add basic IOCTL interface · 583fa5e7
      Ben Widawsky authored
      Add a straightforward IOCTL that provides a mechanism for userspace to
      query the supported memory device commands. CXL commands as they appear
      to userspace are described as part of the UAPI kerneldoc. The command
      list returned via this IOCTL will contain the full set of commands that
      the driver supports, however, some of those commands may not be
      available for use by userspace.
      
      Memory device commands first appear in the CXL 2.0 specification. They
      are submitted through a mailbox mechanism specified in the CXL 2.0
      specification.
      
      The send command allows userspace to issue mailbox commands directly to
      the hardware. The list of available commands to send are the output of
      the query command. The driver verifies basic properties of the command
      and possibly inspects the input (or output) payload to determine whether
      or not the command is allowed (or might taint the kernel).
      
      Reported-by: kernel test robot <lkp@intel.com> # bug in earlier revision
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20210217040958.1354670-5-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      583fa5e7
  12. 16 February 2021 (3 commits)
    • mptcp: add local addr info in mptcp_info · 0caf3ada
      Geliang Tang authored
      Add mptcpi_local_addr_used and mptcpi_local_addr_max in struct mptcp_info.
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0caf3ada
    • binfmt_misc: pass binfmt_misc flags to the interpreter · 2347961b
      Laurent Vivier authored
      It can be useful to the interpreter to know which flags are in use.
      
      For instance, knowing whether preserve-argv[0] is in use would
      allow it to skip the pathname argument.
      
      This patch uses an unused auxiliary vector, AT_FLAGS, to add a
      flag informing the interpreter whether preserve-argv[0] is enabled.
      
      Note by Helge Deller:
      The real-world user of this patch is qemu-user, which needs to know
      if it has to preserve the argv[0]. See Debian bug #970460.
      Signed-off-by: Laurent Vivier <laurent@vivier.eu>
      Reviewed-by: YunQiang Su <ysu@wavecomp.com>
      URL: http://bugs.debian.org/970460
      Signed-off-by: Helge Deller <deller@gmx.de>
      2347961b
    • netfilter: nftables: introduce table ownership · 6001a930
      Pablo Neira Ayuso authored
      A userspace daemon like firewalld might need to monitor for netlink
      updates to detect its ruleset removal by the (global) flush ruleset
      command to ensure ruleset persistency. This adds extra complexity from
      userspace and, for a short time, the firewall policy is not in
      place.
      
      This patch adds the NFT_TABLE_F_OWNER flag, which allows a userspace
      program to exclusively own the tables that it creates.
      
      Tables that are owned...
      
      - can only be updated and removed by the owner, non-owners hit EPERM if
        they try to update it or remove it.
      - are destroyed when the owner closes the netlink socket or the process
        is gone (implicit netlink socket closure).
      - are skipped by the global flush ruleset command.
      - are listed in the global ruleset.
      
      The userspace process that sets the NFT_TABLE_F_OWNER flag needs to
      keep the netlink socket open.
      
      A new NFTA_TABLE_OWNER netlink attribute specifies the netlink port ID
      to identify the owner from userspace.
      
      This patch also updates error reporting when an unknown table flag is
      specified to change it from EINVAL to EOPNOTSUPP given that EINVAL is
      usually reserved to report for malformed netlink messages to userspace.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      6001a930
  13. 15 February 2021 (2 commits)
  14. 13 February 2021 (3 commits)
  15. 12 February 2021 (5 commits)
  16. 11 February 2021 (3 commits)
  17. 10 February 2021 (1 commit)
  18. 09 February 2021 (4 commits)