1. 26 Apr, 2021 2 commits
    •
      dmaengine: idxd: Enable IDXD performance monitor support · 0bde4444
      Tom Zanussi committed
      Add the code needed in the main IDXD driver to interface with the IDXD
      perfmon implementation.
      
      [ Based on work originally by Jing Lin. ]
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Link: https://lore.kernel.org/r/a5564a5583911565d31c2af9234218c5166c4b2c.1619276133.git.zanussi@kernel.org
      Signed-off-by: Vinod Koul <vkoul@kernel.org>
    •
      dmaengine: idxd: Add IDXD performance monitor support · 81dd4d4d
      Tom Zanussi committed
      Implement the IDXD performance monitor capability (named 'perfmon' in
      the DSA (Data Streaming Accelerator) spec [1]), which supports the
      collection of information about key events occurring during DSA and
      IAX (Intel Analytics Accelerator) device execution, to assist in
      performance tuning and debugging.
      
      The idxd perfmon support is implemented as part of the IDXD driver and
      interfaces with the Linux perf framework.  It has several features in
      common with the existing uncore pmu support:
      
        - it does not support sampling
        - it does not support per-thread counting
      
      However, it also has some unique features not present in the core and
      uncore support:
      
        - all general-purpose counters are identical, thus no event constraints
        - operation is always system-wide
      
      While the core perf subsystem assumes that all counters are by default
      per-cpu, the uncore pmus are socket-scoped and use a cpu mask to
      restrict counting to one cpu from each socket.  IDXD counters use a
      similar strategy but expand the scope even further; since IDXD
      counters are system-wide and can be read from any cpu, the IDXD perf
      driver picks a single cpu to do the work (with cpu hotplug notifiers
      to choose a different cpu if the chosen one is taken off-line).
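
The single-cpu strategy described above can be sketched as a small
userspace simulation.  This is not the driver's code: the function name
pick_reader_cpu and the "lowest-numbered online cpu" fallback policy are
assumptions made for illustration only.

```python
def pick_reader_cpu(online_cpus, current=None):
    """Return the cpu that should own the system-wide counters.

    Keep the current choice while it is still online; otherwise fall
    back to the lowest-numbered online cpu (an assumed policy).
    """
    if current is not None and current in online_cpus:
        return current
    if not online_cpus:
        return None  # nothing left to migrate to
    return min(online_cpus)


online = {0, 1, 2, 3}
reader = pick_reader_cpu(online)              # initial choice
online.discard(reader)                        # simulate taking it off-line
new_reader = pick_reader_cpu(online, reader)  # hotplug callback picks anew
print(reader, new_reader)
```

The point of the sketch is only that exactly one cpu is designated at any
time, and that the designation moves when that cpu goes away.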
      
      More specifically, the perf userspace tool by default opens a counter
      for each cpu for an event.  However, if it finds a cpumask file
      associated with the pmu under sysfs, as is the case with the uncore
      pmus, it will open counters only on the cpus specified by the cpumask.
      Since perfmon only needs to open a single counter per event for a
      given IDXD device, the perfmon driver will create a sysfs cpumask file
      for the device and insert the first cpu of the system into it.  When a
      user uses perf to open an event, perf will open a single counter on
      the cpu specified by the cpu mask.  This amounts to the default
      system-wide rather than per-cpu counting mentioned previously for
      perfmon pmu events.  In order to keep the cpu mask up-to-date, the
      driver implements cpu hotplug support for multiple devices, as IDXD
      usually enumerates and registers more than one idxd device.
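
As a rough illustration of what a tool does with such a cpumask file, the
sketch below parses a kernel-style cpu list into cpu numbers.  The helper
name parse_cpulist is hypothetical, and the sysfs location mentioned in
the comment (the usual perf event_source directory) is an assumption
rather than something stated in this commit.

```python
def parse_cpulist(text):
    """Parse a cpu list such as "0", "0,18" or "0-3,8" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


# Assumed location of the file for a device named dsa0:
#   /sys/bus/event_source/devices/dsa0/cpumask
# With only the first cpu of the system in the mask, its content would
# parse to a single entry, so one counter is opened per event.
print(parse_cpulist("0\n"))
```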
      
      The perfmon driver implements basic perfmon hardware capability
      discovery and configuration, and is initialized by the IDXD driver's
      probe function.  During initialization, the driver retrieves the total
      number of supported performance counters, the pmu ID, and the device
      type from the idxd device, and registers itself under the Linux perf
      framework.
      
      The perf userspace tool can be used to monitor single or multiple
      events depending on the given configuration, as well as event groups,
      which are also supported by the perfmon driver.  The user configures
      events using the perf tool command-line interface by specifying the
      event and corresponding event category, along with an optional set of
      filters that can be used to restrict counting to specific work queues,
      traffic classes, page and transfer sizes, and engines (See [1] for
      specifics).
      
      With the configuration specified by the user, the perf tool issues a
      system call passing that information to the kernel, which uses it to
      initialize the specified event(s).  The event(s) are opened and
      started, and following termination of the perf command, they're
      stopped.  At that point, the perfmon driver will read the latest count
      for the event(s), calculate the difference between the latest counter
      values and previously tracked counter values, and display the final
      incremental count as the event count for the cycle.  An overflow
      handler registered on the IDXD irq path is used to account for counter
      overflows, which are signaled by an overflow interrupt.
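
The delta calculation described above can be illustrated with a minimal
sketch.  The 48-bit counter width used in the example is an assumption
chosen for illustration (the real width comes from the hardware
capability registers), and the function name counter_delta is
hypothetical.

```python
def counter_delta(prev, now, width_bits):
    """Return how far a free-running counter advanced from prev to now.

    Masking the difference to the counter width gives the right answer
    even if the counter wrapped once between the two reads.
    """
    mask = (1 << width_bits) - 1
    return (now - prev) & mask


WIDTH = 48  # assumed counter width, for illustration only

# No wrap: the counter simply advanced by 19.
print(counter_delta(10, 29, WIDTH))

# One wrap: the counter was 5 below its maximum and is now at 14.
print(counter_delta((1 << WIDTH) - 5, 14, WIDTH))
```

This only covers a single wrap between reads, which is why an overflow
interrupt is needed to fold counts in before a second wrap can occur.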
      
      Below are a couple of examples of perf usage for monitoring DSA events.
      
      The following monitors all events in the 'engine' category.  Because
      no filters are specified, this captures all engine events for the
      workload, which in this case is 19 iterations of the work generated by
      the kernel dmatest module.
      
      Details describing the events can be found in Appendix D of [1],
      Performance Monitoring Events, but briefly they are:
      
        event 0x1:  total input data processed, in 32-byte units
        event 0x2:  total data written, in 32-byte units
        event 0x4:  number of work descriptors that read the source
        event 0x8:  number of work descriptors that write the destination
        event 0x10: number of work descriptors dispatched from batch descriptors
        event 0x20: number of work descriptors dispatched from work queues
      
       # perf stat -e dsa0/event=0x1,event_category=0x1/,
                      dsa0/event=0x2,event_category=0x1/,
      		dsa0/event=0x4,event_category=0x1/,
      		dsa0/event=0x8,event_category=0x1/,
      		dsa0/event=0x10,event_category=0x1/,
      		dsa0/event=0x20,event_category=0x1/
      		  modprobe dmatest channel=dma0chan0 timeout=2000
      		  iterations=19 run=1 wait=1
      
           Performance counter stats for 'system wide':
      
                       5,332      dsa0/event=0x1,event_category=0x1/
                       5,327      dsa0/event=0x2,event_category=0x1/
                          19      dsa0/event=0x4,event_category=0x1/
                          19      dsa0/event=0x8,event_category=0x1/
                           0      dsa0/event=0x10,event_category=0x1/
                          19      dsa0/event=0x20,event_category=0x1/
      
                21.977436186 seconds time elapsed
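
Since events 0x1 and 0x2 are reported in 32-byte units, the counts above
can be converted to bytes with simple arithmetic.  The helper name
units_to_bytes is hypothetical; the counts are copied from the output
above.

```python
UNIT_BYTES = 32  # events 0x1 and 0x2 count in 32-byte units


def units_to_bytes(units):
    """Convert a count in 32-byte units to bytes."""
    return units * UNIT_BYTES


counts = {"event=0x1": 5332, "event=0x2": 5327}
for name, units in counts.items():
    print(name, units_to_bytes(units), "bytes")
```

For this run that is 170,624 bytes of input processed and 170,464 bytes
written.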
      
      The command below illustrates filter usage with a simple example.  It
      specifies that MEM_MOVE operations should be counted for the DSA
      device dsa0 (event 0x8 corresponds to the EV_MEM_MOVE event - Number
      of Memory Move Descriptors, which is part of event category 0x3 -
      Operations. The detailed category and event IDs are available in
      Appendix D, Performance Monitoring Events, of [1]).  In addition to
      the event and event category, a number of filters are also specified
      (the detailed filter values are available in Chapter 6.4 (Filter
      Support) of [1]), which will restrict counting to only those events
      that meet all of the filter criteria.  In this case, the filters
      specify that only MEM_MOVE operations serviced by work queue wq0,
      engine engine0, and traffic class tc0, with transfer sizes between 0
      and 4k and page sizes between 0 and 1G, result in a counter hit;
      anything else will be filtered out and not appear in the final
      count.  Note that filters are optional - any filter not specified is
      assumed to be all ones and will pass anything.
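
The "all criteria must match, unspecified means all ones" rule can be
modeled as below.  This is a sketch under assumptions, not the hardware
definition: it treats each filter as a bitmask in which bit N selects
index N (wq0, engine0, tc0, or a size bucket), the 32-bit ALL_ONES width
is arbitrary, and the names passes, filters and op are hypothetical.  See
[1] for the actual filter encodings.

```python
ALL_ONES = 0xFFFFFFFF  # assumed default for any filter left unspecified


def passes(filters, op):
    """Return True if op satisfies every filter.

    filters maps a filter name to a bitmask; op maps the same names to
    the index (bit position) that the operation falls under.
    """
    for name, index in op.items():
        mask = filters.get(name, ALL_ONES)
        if ((mask >> index) & 1) == 0:
            return False
    return True


filters = {"wq": 0x1, "tc": 0x1, "sz": 0x7, "eng": 0x1}

op = {"wq": 0, "tc": 0, "sz": 2, "eng": 0, "pgsz": 1}
print(passes(filters, op))                  # every criterion met

print(passes(filters, dict(op, wq=1)))      # serviced by another wq
```

Note that "pgsz" has no entry in filters, so it falls back to all ones and
never rejects an operation.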
      
       # perf stat -e dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7,
                      filter_eng=0x1,event=0x8,event_category=0x3/
      		  modprobe dmatest channel=dma0chan0 timeout=2000
      		  iterations=19 run=1 wait=1
      
           Performance counter stats for 'system wide':
      
             19      dsa0/filter_wq=0x1,filter_tc=0x1,filter_sz=0x7,
                     filter_eng=0x1,event=0x8,event_category=0x3/
      
                21.865914091 seconds time elapsed
      
      The output above reflects that the dmatest workload resulted in
      the counting of 19 MEM_MOVE operation events that met the filter
      criteria.
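
For scripting runs like the two shown in this message, the event strings
passed to perf stat -e can be assembled programmatically.  The helper
below is a hypothetical convenience, not part of perf or the driver; it
only reproduces the term names that appear in the commands above.

```python
def event_spec(pmu, event, category, **filters):
    """Build a perf event string such as 'dsa0/event=0x1,event_category=0x1/'.

    Keyword arguments become filter_<name>=<hex> terms, in the order given.
    """
    terms = [f"filter_{name}={value:#x}" for name, value in filters.items()]
    terms += [f"event={event:#x}", f"event_category={category:#x}"]
    return f"{pmu}/{','.join(terms)}/"


# All six events of category 0x1, as one comma-separated -e argument.
engine_events = ",".join(
    event_spec("dsa0", ev, 0x1) for ev in (0x1, 0x2, 0x4, 0x8, 0x10, 0x20)
)

# The filtered MEM_MOVE example.
mem_move = event_spec("dsa0", 0x8, 0x3, wq=0x1, tc=0x1, sz=0x7, eng=0x1)

print(engine_events)
print(mem_move)
```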
      
      [1]: https://software.intel.com/content/www/us/en/develop/download/intel-data-streaming-accelerator-preliminary-architecture-specification.html
      
      [ Based on work originally by Jing Lin. ]
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
      Link: https://lore.kernel.org/r/0c5080a7d541904c4ad42b848c76a1ce056ddac7.1619276133.git.zanussi@kernel.org
      Signed-off-by: Vinod Koul <vkoul@kernel.org>
  2. 24 Apr, 2021 8 commits
  3. 20 Apr, 2021 15 commits
  4. 13 Apr, 2021 2 commits
  5. 12 Apr, 2021 10 commits
  6. 17 Mar, 2021 3 commits