1. 22 Apr, 2021 3 commits
  2. 21 Apr, 2021 1 commit
    • RDMA/mlx5: Expose private query port · 9a89d3ad
      Mark Bloch committed
      Expose a non standard query port via IOCTL that will be used to expose
      port attributes that are specific to mlx5 devices.
      
      The new interface receives a port number to query and returns a structure
      that contains the available attributes for that port.  This will be used
      to fill the gap between pure DEVX use cases and use cases where a kernel
      needs to inform userspace about various kernel driver configurations that
      userspace must use in order to work correctly.
      
      Flags is used to indicate which fields are valid on return.
      
      MLX5_IB_UAPI_QUERY_PORT_VPORT:
      	The vport number of the queried port.
      
      MLX5_IB_UAPI_QUERY_PORT_VPORT_VHCA_ID:
      	The VHCA ID of the vport of the queried port.
      
      MLX5_IB_UAPI_QUERY_PORT_VPORT_STEERING_ICM_RX:
      	The vport's RX ICM address used for sw steering.
      
      MLX5_IB_UAPI_QUERY_PORT_VPORT_STEERING_ICM_TX:
      	The vport's TX ICM address used for sw steering.
      
      MLX5_IB_UAPI_QUERY_PORT_VPORT_REG_C0:
      	The metadata used to tag egress packets of the vport.
      
      MLX5_IB_UAPI_QUERY_PORT_ESW_OWNER_VHCA_ID:
      	The E-Switch owner vhca id of the vport.
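
The flag convention described above can be sketched roughly as follows. This is only an illustration: the enum values, the struct layout, and the helper names are stand-ins invented for this example, not the actual mlx5 uAPI definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the uAPI flags; bit positions are assumptions. */
enum query_port_flags {
	QUERY_PORT_VPORT                 = 1u << 0,
	QUERY_PORT_VPORT_VHCA_ID         = 1u << 1,
	QUERY_PORT_VPORT_STEERING_ICM_RX = 1u << 2,
	QUERY_PORT_VPORT_STEERING_ICM_TX = 1u << 3,
	QUERY_PORT_VPORT_REG_C0          = 1u << 4,
	QUERY_PORT_ESW_OWNER_VHCA_ID     = 1u << 5,
};

/* Hypothetical result structure; the real layout may differ. */
struct query_port_result {
	uint64_t flags;            /* which of the fields below are valid */
	uint16_t vport;
	uint16_t vport_vhca_id;
	uint16_t esw_owner_vhca_id;
	uint64_t steering_icm_rx;
	uint64_t steering_icm_tx;
	uint32_t reg_c0_value;
};

/* Store the vport number and return 1 only if the kernel marked it valid. */
int get_vport(const struct query_port_result *res, uint16_t *vport)
{
	assert(res != NULL && vport != NULL);
	if (!(res->flags & QUERY_PORT_VPORT))
		return 0;
	*vport = res->vport;
	return 1;
}

/* Count how many of the six optional attributes were reported as valid. */
int count_valid_fields(const struct query_port_result *res)
{
	int n = 0;
	unsigned int bit;

	assert(res != NULL);
	for (bit = 0; bit < 6; bit++)
		if (res->flags & (1u << bit))
			n++;
	return n;
}
```

A caller would test the relevant flag before reading each field, since a field whose flag is clear carries no meaningful value.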
      
      Link: https://lore.kernel.org/r/6e2ef13e5a266a6c037eb0105eb1564c7bb52f23.1618743394.git.leonro@nvidia.com
      Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
      Signed-off-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
  3. 14 Apr, 2021 2 commits
  4. 12 Mar, 2021 2 commits
  5. 03 Mar, 2021 1 commit
    • KVM: x86/xen: Add support for vCPU runstate information · 30b5c851
      David Woodhouse committed
      This is how Xen guests do steal time accounting. The hypervisor records
      the amount of time spent in each of running/runnable/blocked/offline
      states.
      
      In the Xen accounting, a vCPU is still in state RUNSTATE_running while
      in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules
      does the state become RUNSTATE_blocked. In KVM this means that even when
      the vCPU exits the kvm_run loop, the state remains RUNSTATE_running.
      
      The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given
      amount of time to the blocked state and subtract it from the running
      state.
      
      The state_entry_time corresponds to get_kvmclock_ns() at the time the
      vCPU entered the current state, and the total times of all four states
      should always add up to state_entry_time.
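
A minimal sketch of this accounting, assuming a clock that starts at zero. The state numbering follows Xen's public interface, but the struct and helper names are hypothetical and are not the KVM implementation.

```c
#include <assert.h>
#include <stdint.h>

/* State numbering as in Xen's public vcpu interface. */
enum { RUNSTATE_running, RUNSTATE_runnable, RUNSTATE_blocked, RUNSTATE_offline };

/* Illustrative bookkeeping structure (hypothetical, not the kernel's). */
struct runstate {
	int state;
	uint64_t state_entry_time;  /* ns, clock value when `state` was entered */
	uint64_t time[4];           /* ns accumulated in each state */
};

/* Enter new_state at time `now`, charging the elapsed time to the old state. */
void runstate_switch(struct runstate *rs, int new_state, uint64_t now)
{
	assert(now >= rs->state_entry_time);
	rs->time[rs->state] += now - rs->state_entry_time;
	rs->state = new_state;
	rs->state_entry_time = now;
}

/* Retrospectively move `delta` ns from running to blocked.
 * Returns -1 if less than `delta` has been accounted as running. */
int runstate_adjust_blocked(struct runstate *rs, uint64_t delta)
{
	if (rs->time[RUNSTATE_running] < delta)
		return -1;
	rs->time[RUNSTATE_running] -= delta;
	rs->time[RUNSTATE_blocked] += delta;
	return 0;
}

/* Sum of all four per-state times; should equal state_entry_time. */
uint64_t runstate_total(const struct runstate *rs)
{
	return rs->time[0] + rs->time[1] + rs->time[2] + rs->time[3];
}
```

Both the state switch and the retrospective adjustment leave the sum of the four times equal to state_entry_time, which is the invariant stated above.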
      Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20210301125309.874953-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 27 Feb, 2021 1 commit
  7. 25 Feb, 2021 1 commit
    • numa balancing: migrate on fault among multiple bound nodes · bda420b9
      Huang Ying committed
      Currently, NUMA balancing can only optimize page placement among the NUMA
      nodes if the default memory policy is used, because an explicitly
      specified memory policy should take precedence.  But this seems too strict
      in some situations.  For example, on a system with 4 NUMA nodes, if the
      memory of an application is bound to nodes 0 and 1, NUMA balancing can
      potentially migrate pages between nodes 0 and 1 to reduce cross-node
      accesses without breaking the explicit memory binding policy.
      
      So in this patch, we add the MPOL_F_NUMA_BALANCING mode flag to
      set_mempolicy() for use when the mode is MPOL_BIND.  With the flag
      specified, NUMA balancing will be enabled within the thread to optimize
      page placement within the constraints of the specified memory binding
      policy.  With the newly added flag, the NUMA balancing control mechanism
      becomes:
      
       - sysctl knob numa_balancing can enable/disable the NUMA balancing
         globally.
      
       - even if sysctl numa_balancing is enabled, the NUMA balancing will be
         disabled for the memory areas or applications with the explicit
         memory policy by default.
      
       - MPOL_F_NUMA_BALANCING can be used to enable the NUMA balancing for
         the applications when specifying the explicit memory policy
         (MPOL_BIND).
      
      Various page placement optimizations based on NUMA balancing can be
      done with these flags.  As the first step, in this patch, if the memory of
      the application is bound to multiple nodes (MPOL_BIND), and in the hint
      page fault handler the accessing node is in the policy nodemask, the
      kernel will try to migrate the page to the accessing node to reduce
      cross-node accesses.
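
The control mechanism and the migrate-on-fault rule can be modeled roughly as below. This is a simplified sketch, not the kernel code: the types, the flag value, and the function name are hypothetical, and the real NUMA balancing logic applies further heuristics that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's policy mode and flag. */
enum policy_mode { POLICY_DEFAULT, POLICY_BIND, POLICY_OTHER };
#define POLICY_F_NUMA_BALANCING 0x1u

struct mem_policy {
	enum policy_mode mode;
	unsigned int flags;
	uint64_t nodemask;   /* bit n set means node n is allowed */
};

/* Decide whether a page hit by a NUMA hint fault is a candidate for
 * migration to the node that accessed it. */
bool migration_candidate(bool sysctl_numa_balancing,
			 const struct mem_policy *pol,
			 int page_node, int accessing_node)
{
	assert(accessing_node >= 0 && accessing_node < 64);

	if (!sysctl_numa_balancing)         /* global knob disables everything */
		return false;
	if (page_node == accessing_node)    /* already local */
		return false;
	if (pol->mode == POLICY_DEFAULT)    /* default policy: balancing applies */
		return true;
	/* Explicit policy: only MPOL_BIND with the new flag opts in. */
	if (pol->mode != POLICY_BIND || !(pol->flags & POLICY_F_NUMA_BALANCING))
		return false;
	/* Stay within the binding: the accessing node must be in the mask. */
	return (pol->nodemask >> accessing_node) & 1;
}
```

With a policy bound to nodes 1 and 3 and the flag set, a fault from node 3 on a page residing on node 1 is a candidate, while a fault from node 0 is not.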
      
      If the newly added MPOL_F_NUMA_BALANCING flag is specified by an
      application on an old kernel version without its support, set_mempolicy()
      will return -1 and errno will be set to EINVAL.  The application can use
      this behavior to run on both old and new kernel versions.
      
      And if the MPOL_F_NUMA_BALANCING flag is specified for a mode other than
      MPOL_BIND, set_mempolicy() will return -1 and errno will be set to EINVAL
      as before, because we don't support optimization based on NUMA
      balancing for these modes.
      
      In the previous version of the patch, we tried to reuse MPOL_MF_LAZY for
      mbind().  But that flag is tied to MPOL_MF_MOVE.*, so it does not seem to
      be a good API/ABI for the purpose of the patch.
      
      And because it's not clear whether it's necessary to enable NUMA balancing
      for a specific memory area inside an application, we only add the flag
      at the thread level (set_mempolicy()) instead of the memory area level
      (mbind()).  We can do that when it becomes necessary.
      
      To test the patch, we run a test case as follows on a 4-node machine with
      192 GB memory (48 GB per node).
      
      1. Change the pmbench memory accessing benchmark to call set_mempolicy()
         to bind its memory to nodes 1 and 3 and enable NUMA balancing.  Some
         related code snippets are as follows:
      
           #include <errno.h>
           #include <stdio.h>
           #include <stdlib.h>
           #include <numaif.h>
           #include <numa.h>
      
      	struct bitmask *bmp;
      	int ret;
      
      	bmp = numa_parse_nodestring("1,3");
      	ret = set_mempolicy(MPOL_BIND | MPOL_F_NUMA_BALANCING,
      			    bmp->maskp, bmp->size + 1);
      	/* If MPOL_F_NUMA_BALANCING isn't supported, fall back to MPOL_BIND */
      	if (ret < 0 && errno == EINVAL)
      		ret = set_mempolicy(MPOL_BIND, bmp->maskp, bmp->size + 1);
      	if (ret < 0) {
      		perror("Failed to call set_mempolicy");
      		exit(-1);
      	}
      
      2. Run a memory eater on node 3 to use 40 GB memory before running pmbench.
      
      3. Run pmbench with 64 processes; the working-set size of each process
         is 640 MB, so the total working-set size is 64 * 640 MB = 40 GB.  The
         CPU and the memory (as in step 1.) of all pmbench processes are bound
         to nodes 1 and 3. So, after CPU usage is balanced, some pmbench
         processes running on the CPUs of node 3 will access the memory of
         node 1.
      
      4. After the pmbench processes run for 100 seconds, kill the memory
         eater.  Now it's possible for some pmbench processes to migrate
         their pages from node 1 to node 3 to reduce cross-node accessing.
      
      Test results show that, with the patch, the pages can be migrated from
      node 1 to node 3 after killing the memory eater, and the pmbench score
      increases by about 17.5%.
      
      Link: https://lkml.kernel.org/r/20210120061235.148637-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 24 Feb, 2021 1 commit
    • io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS · 1c0aa1fa
      Jens Axboe committed
      A few reasons to do this:
      
      - The naming of the manager and worker has changed. That's a user visible
        change, so it makes sense to flag it.
      
      - Opening certain files that use ->signal (like /proc/self or /dev/tty)
        now works, and the flag tells the application upfront that this is the
        case.
      
      - Related to the above, using signalfd will now work as well.
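
An application could test for the feature roughly as follows. This is a sketch: the fallback define assumes the bit value used by linux/io_uring.h and only applies when the installed headers do not provide it, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallback for older headers; the bit value is assumed to match
 * the definition in linux/io_uring.h. */
#ifndef IORING_FEAT_NATIVE_WORKERS
#define IORING_FEAT_NATIVE_WORKERS (1U << 9)
#endif

/* `features` is the bitmask the kernel fills into
 * struct io_uring_params.features during io_uring_setup(). */
bool has_native_workers(uint32_t features)
{
	return (features & IORING_FEAT_NATIVE_WORKERS) != 0;
}
```

When the helper returns false, the application is running on a kernel with the older worker model and should not rely on the behaviors listed above.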
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 23 Feb, 2021 3 commits
  10. 17 Feb, 2021 4 commits
    • cxl/mem: Add set of informational commands · 57ee605b
      Ben Widawsky committed
      Add initial set of formal commands beyond basic identify and command
      enumeration.
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> (v2)
      Link: https://lore.kernel.org/r/20210217040958.1354670-8-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • cxl/mem: Enable commands via CEL · 472b1ce6
      Ben Widawsky committed
      CXL devices identified by the memory-device class code must implement
      the Device Command Interface (described in 8.2.9 of the CXL 2.0 spec).
      While the driver already maintains a list of commands it supports, there
      is still a need to distinguish commands that the driver knows about
      from commands that are optionally supported by the hardware.
      
      The Command Effects Log (CEL) is specified in the CXL 2.0 specification.
      The CEL is one of two types of logs, the other being vendor specific.
      They are distinguished in hardware/spec via UUID. The CEL is useful for
      two things:
      1. Determining which optional commands are supported by the CXL device.
      2. Enumerating any vendor specific commands.
      
      The CEL is used by the driver to determine which commands are available
      in the hardware and therefore which commands userspace is allowed to
      execute. The set of enabled commands might be a subset of commands which
      are advertised in UAPI via CXL_MEM_SEND_COMMAND IOCTL.
      
      With the CEL enabling comes an internal flag to indicate a base set of
      commands that are enabled regardless of CEL. Such commands are required
      for basic interaction with the hardware and thus can be useful in debug
      cases, for example if the CEL is corrupted.
      
      The implementation leaves the statically defined table of commands and
      supplements it with a bitmap to determine commands that are enabled.
      This organization was chosen for the following reasons:
      - Smaller memory footprint. Doesn't need a table per device.
      - Reduce memory allocation complexity.
      - Fixed command IDs to opcode mapping for all devices makes development
        and debugging easier.
      - Certain helpers are easily achievable, like cxl_for_each_cmd().
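
The organization described above (one static table plus a per-device bitmap) might look roughly like this. It is an illustrative sketch only: the command IDs, opcodes, struct names, and helpers are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command IDs; fixed for all devices. */
enum cmd_id { CMD_IDENTIFY, CMD_GET_LOG, CMD_GET_HEALTH, CMD_MAX };

struct cmd {
	uint16_t opcode;        /* illustrative opcode values */
	bool always_enabled;    /* base set, enabled regardless of the CEL */
};

/* One static table shared by every device: command ID -> opcode. */
static const struct cmd cmd_table[CMD_MAX] = {
	[CMD_IDENTIFY]   = { 0x4000, true },
	[CMD_GET_LOG]    = { 0x0401, true },
	[CMD_GET_HEALTH] = { 0x4200, false },
};

/* Per-device state is just a bitmap, one bit per command ID. */
struct device { uint32_t enabled_cmds; };

#define for_each_cmd(id) for ((id) = 0; (id) < CMD_MAX; (id)++)

/* Enable the base set that must work even if the CEL is unusable. */
void device_init(struct device *dev)
{
	int id;

	dev->enabled_cmds = 0;
	for_each_cmd(id)
		if (cmd_table[id].always_enabled)
			dev->enabled_cmds |= 1u << id;
}

/* Enable every known command whose opcode the device lists in its CEL;
 * opcodes the table does not know about are ignored. */
void device_apply_cel(struct device *dev, const uint16_t *cel, size_t n)
{
	size_t i;
	int id;

	for (i = 0; i < n; i++)
		for_each_cmd(id)
			if (cmd_table[id].opcode == cel[i])
				dev->enabled_cmds |= 1u << id;
}

bool cmd_enabled(const struct device *dev, int id)
{
	assert(id >= 0 && id < CMD_MAX);
	return (dev->enabled_cmds >> id) & 1;
}
```

Because the table is shared and constant, each device only costs one bitmap, and iterating over all commands is a simple loop over the fixed ID range.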
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> (v3)
      Link: https://lore.kernel.org/r/20210217040958.1354670-7-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • cxl/mem: Add a "RAW" send command · 13237183
      Ben Widawsky committed
      The CXL memory device send interface will have a number of supported
      commands. The raw command is not such a command. Raw commands allow
      userspace to send a specified opcode to the underlying hardware and
      bypass all driver checks on the command. The primary use for this
      command is to [begrudgingly] allow undocumented vendor specific hardware
      commands.
      
      While not the main motivation, it also allows prototyping new hardware
      commands without a driver patch and rebuild.
      
      While this all sounds very powerful it comes with a couple of caveats:
      1. Bug reports using raw commands will not get the same level of
         attention as bug reports using supported commands (via taint).
      2. Supported commands will be rejected by the RAW command.
      
      With this comes a new debugfs knob to allow full access to your toes with
      your weapon of choice.
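
The second caveat can be illustrated with a small check like the one below. This is a sketch under assumptions: the opcode list, the function name, and the choice of EPERM as the error are hypothetical and may not match what the driver does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes that already have a dedicated, supported command. */
static const uint16_t supported_opcodes[] = { 0x4000, 0x0401 };

/* Return 0 if `opcode` may be sent through the raw path, or -EPERM
 * (an assumed error code) if a supported command already covers it. */
int raw_opcode_check(uint16_t opcode)
{
	size_t i;
	size_t n = sizeof(supported_opcodes) / sizeof(supported_opcodes[0]);

	for (i = 0; i < n; i++)
		if (supported_opcodes[i] == opcode)
			return -EPERM;
	return 0;
}
```

The point of such a check is that the raw path stays reserved for opcodes the driver has no formal command for.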
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Ariel Sibley <Ariel.Sibley@microchip.com>
      Link: https://lore.kernel.org/r/20210217040958.1354670-6-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • cxl/mem: Add basic IOCTL interface · 583fa5e7
      Ben Widawsky committed
      Add a straightforward IOCTL that provides a mechanism for userspace to
      query the supported memory device commands. CXL commands as they appear
      to userspace are described as part of the UAPI kerneldoc. The command
      list returned via this IOCTL will contain the full set of commands that
      the driver supports, however, some of those commands may not be
      available for use by userspace.
      
      Memory device commands first appear in the CXL 2.0 specification. They
      are submitted through a mailbox mechanism specified in the CXL 2.0
      specification.
      
      The send command allows userspace to issue mailbox commands directly to
      the hardware. The list of commands available to send is the output of
      the query command. The driver verifies basic properties of the command
      and possibly inspects the input (or output) payload to determine whether
      or not the command is allowed (or might taint the kernel).
      
      Reported-by: kernel test robot <lkp@intel.com> # bug in earlier revision
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20210217040958.1354670-5-ben.widawsky@intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  11. 16 Feb, 2021 3 commits
    • mptcp: add local addr info in mptcp_info · 0caf3ada
      Geliang Tang committed
      Add mptcpi_local_addr_used and mptcpi_local_addr_max in struct mptcp_info.
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • binfmt_misc: pass binfmt_misc flags to the interpreter · 2347961b
      Laurent Vivier committed
      It can be useful to the interpreter to know which flags are in use.
      
      For instance, knowing whether preserve-argv[0] is in use would
      allow it to skip the pathname argument.
      
      This patch uses an unused auxiliary vector, AT_FLAGS, to add a
      flag that informs the interpreter whether preserve-argv[0] is enabled.
      
      Note by Helge Deller:
      The real-world user of this patch is qemu-user, which needs to know
      if it has to preserve the argv[0]. See Debian bug #970460.
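
An interpreter could read the flag roughly as follows. This is a sketch: the fallback define assumes the flag occupies bit 0 and only applies when the installed headers do not define it, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/auxv.h>

/* Fallback for headers that predate the flag; bit 0 is an assumption. */
#ifndef AT_FLAGS_PRESERVE_ARGV0
#define AT_FLAGS_PRESERVE_ARGV0 (1UL << 0)
#endif

/* Pure helper: does an AT_FLAGS value request preserve-argv[0] handling? */
bool preserve_argv0(unsigned long at_flags)
{
	return (at_flags & AT_FLAGS_PRESERVE_ARGV0) != 0;
}

/* Query the running process's auxiliary vector. */
bool interpreter_preserves_argv0(void)
{
	return preserve_argv0(getauxval(AT_FLAGS));
}
```

On a kernel without this patch AT_FLAGS is expected to read as zero, so the helper reports false and the interpreter keeps its previous argument handling.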
      Signed-off-by: Laurent Vivier <laurent@vivier.eu>
      Reviewed-by: YunQiang Su <ysu@wavecomp.com>
      URL: http://bugs.debian.org/970460
      Signed-off-by: Helge Deller <deller@gmx.de>
    • netfilter: nftables: introduce table ownership · 6001a930
      Pablo Neira Ayuso committed
      A userspace daemon like firewalld might need to monitor for netlink
      updates to detect its ruleset removal by the (global) flush ruleset
      command to ensure ruleset persistency. This adds extra complexity in
      userspace and, for a short time, the firewall policy is not in
      place.
      
      This patch adds the NFT_TABLE_F_OWNER flag, which allows a userspace
      program to exclusively own the table that it creates.
      
      Tables that are owned...
      
      - can only be updated and removed by the owner, non-owners hit EPERM if
        they try to update it or remove it.
      - are destroyed when the owner closes the netlink socket or the process
        is gone (implicit netlink socket closure).
      - are skipped by the global flush ruleset command.
      - are listed in the global ruleset.
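
The ownership rules listed above can be modeled roughly as below. This is an illustrative sketch rather than the nf_tables implementation: the struct, the flag value, and the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_F_OWNER 0x2u   /* hypothetical stand-in for NFT_TABLE_F_OWNER */

struct table {
	uint32_t flags;
	uint32_t owner_portid;   /* netlink port ID of the owner, if owned */
};

static bool table_is_owned(const struct table *t)
{
	assert(t != NULL);
	return (t->flags & TABLE_F_OWNER) != 0;
}

/* Updates and removals: only the owner may touch an owned table. */
int table_modify_check(const struct table *t, uint32_t portid)
{
	if (table_is_owned(t) && t->owner_portid != portid)
		return -EPERM;
	return 0;
}

/* The global flush ruleset command skips owned tables. */
bool table_skipped_by_flush(const struct table *t)
{
	return table_is_owned(t);
}

/* An owned table is destroyed when its owner's netlink socket closes. */
bool table_destroyed_on_close(const struct table *t, uint32_t closing_portid)
{
	return table_is_owned(t) && t->owner_portid == closing_portid;
}
```

Tables without the flag behave as before: any process may modify them, the global flush removes them, and closing a socket does not affect them.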
      
      The userspace process that sets the NFT_TABLE_F_OWNER flag needs to
      keep the netlink socket open.
      
      A new NFTA_TABLE_OWNER netlink attribute specifies the netlink port ID
      to identify the owner from userspace.
      
      This patch also updates error reporting when an unknown table flag is
      specified to change it from EINVAL to EOPNOTSUPP, given that EINVAL is
      usually reserved for reporting malformed netlink messages to userspace.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  12. 15 Feb, 2021 2 commits
  13. 13 Feb, 2021 3 commits
  14. 12 Feb, 2021 5 commits
  15. 11 Feb, 2021 3 commits
  16. 10 Feb, 2021 1 commit
  17. 09 Feb, 2021 4 commits