1. 22 12月, 2018 8 次提交
  2. 21 12月, 2018 32 次提交
    • A
      vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Alexey Kardashevskiy 提交于
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition to that the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special unit on a die called an NPU which is an NVLink2 host bus
      adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services) which is
      a part of the NVLink2 protocol. Such GPUs also share on-board RAM
      (16GB or 32GB) to the system via the same NVLink2 so a CPU has
      cache-coherent access to a GPU RAM.
      
      This exports GPU RAM to the userspace as a new VFIO device region. This
      preregisters the new memory as device memory as it might be used for DMA.
      This inserts pfns from the fault handler as the GPU memory is not onlined
      until the vendor driver is loaded and trained the NVLinks so doing this
      earlier causes low level errors which we fence in the firmware so
      it does not hurt the host system but still better be avoided; for the same
      reason this does not map GPU RAM into the host kernel (usual thing for
      emulated access otherwise).
      
      This exports an ATSD (Address Translation Shootdown) register of NPU which
      allows TLB invalidations inside GPU for an operating system. The register
      conveniently occupies a single 64k page. It is also presented to
      the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
      each of them can be used for TLB invalidation in a GPU linked to this NPU.
      This allocates one ATSD register per an NVLink bridge allowing passing
      up to 6 registers. Due to the host firmware bug (just recently fixed),
      only 1 ATSD register per NPU was actually advertised to the host system
      so this passes that alone register via the first NVLink bridge device in
      the group which is still enough as QEMU collects them all back and
      presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to LPID in the NPU and
      this is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host as
      the GPU vendor only supports configurations with both features enabled
      and other configurations are known not to work. Because of this and
      because of the ways the features are advertised to the host system
      (which is a device tree with very platform specific properties),
      this requires enabled POWERNV platform.
      
      The V100 GPUs do not advertise any of these capabilities via the config
      space and there are more than just one device ID so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7f928917
    • A
      vfio_pci: Allow regions to add own capabilities · c2c0f1cd
      Alexey Kardashevskiy 提交于
      VFIO regions already support region capabilities with a limited set of
      fields. However the subdriver might have to report to the userspace
      additional bits.
      
      This adds an add_capability() hook to vfio_pci_regops.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c2c0f1cd
    • A
      vfio_pci: Allow mapping extra regions · a15b1883
      Alexey Kardashevskiy 提交于
      So far we only allowed mapping of MMIO BARs to the userspace. However
      there are GPUs with on-board coherent RAM accessible via side
      channels which we also want to map to the userspace. The first client
      for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
      NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
      to the system address space, we are going to export these as an extra
      PCI region.
      
      We already support extra PCI regions and this adds support for mapping
      them to the userspace.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a15b1883
    • A
      powerpc/powernv/npu: Fault user page into the hypervisor's pagetable · 58629c0d
      Alexey Kardashevskiy 提交于
      When a page fault happens in a GPU, the GPU signals the OS and the GPU
      driver calls the fault handler which populated a page table; this allows
      the GPU to complete an ATS request.
      
      On the bare metal get_user_pages() is enough as it adds a pte to
      the kernel page table but under KVM the partition scope tree does not get
      updated so ATS will still fail.
      
      This reads a byte from an effective address which causes HV storage
      interrupt and KVM updates the partition scope tree.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      58629c0d
    • A
      powerpc/powernv/npu: Check mmio_atsd array bounds when populating · 135ef954
      Alexey Kardashevskiy 提交于
      A broken device tree might contain more than 8 values and introduce hard
      to debug memory corruption bug. This adds the boundary check.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      135ef954
    • A
      powerpc/powernv/npu: Add release_ownership hook · 1b785611
      Alexey Kardashevskiy 提交于
      In order to make ATS work and translate addresses for arbitrary
      LPID and PID, we need to program an NPU with LPID and allow PID wildcard
      matching with a specific MSR mask.
      
      This implements a helper to assign a GPU to LPAR and program the NPU
      with a wildcard for PID and a helper to do clean-up. The helper takes
      MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for
      ATS checkout requests support.
      
      This exports pnv_npu2_unmap_lpar_dev() as following patches will use it
      from the VFIO driver.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1b785611
    • A
      powerpc/powernv/npu: Add compound IOMMU groups · 0bd97167
      Alexey Kardashevskiy 提交于
      At the moment the powernv platform registers an IOMMU group for each
      PE. There is an exception though: an NVLink bridge which is attached
      to the corresponding GPU's IOMMU group making it a master.
      
      Now we have POWER9 systems with GPUs connected to each other directly
      bypassing PCI. At the moment we do not control state of these links so
      we have to put such interconnected GPUs to one IOMMU group which means
      that the old scheme with one GPU as a master won't work - there will
      be up to 3 GPUs in such group.
      
      This introduces a npu_comp struct which represents a compound IOMMU
      group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for
      NVLink bridges). This converts the existing NVLink1 code to use the
      new scheme. >From now on, each PE must have a valid
      iommu_table_group_ops which will either be called directly (for a
      single PE group) or indirectly from a compound group handlers.
      
      This moves IOMMU group registration for NVLink-connected GPUs to
      npu-dma.c. For POWER8, this stores a new compound group pointer in the
      PE (so a GPU is still a master); for POWER9 the new group pointer is
      stored in an NPU (which is allocated per a PCI host controller).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      [mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0bd97167
    • A
      powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops · 83fb8ccf
      Alexey Kardashevskiy 提交于
      At the moment NPU IOMMU is manipulated directly from the IODA2 PCI
      PE code; PCI PE acts as a master to NPU PE. Soon we will have compound
      IOMMU groups with several PEs from several different PHB (such as
      interconnected GPUs and NPUs) so there will be no single master but
      a one big IOMMU group.
      
      This makes a first step and converts an NPU PE with a set of extern
      function to a table group.
      
      This should cause no behavioral change. Note that
      pnv_npu_release_ownership() has never been implemented.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      83fb8ccf
    • A
      powerpc/powernv/npu: Move single TVE handling to NPU PE · b04149c2
      Alexey Kardashevskiy 提交于
      Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only
      one which points to one of two tables of the corresponding PCI PE.
      
      So whenever a new DMA window is programmed to PEs, the NPU PE needs to
      release old table in order to use the new one.
      
      Commit d41ce7b1 ("powerpc/powernv/npu: Do not try invalidating 32bit
      table when 64bit table is enabled") did just that but in pci-ioda.c
      while it actually belongs to npu-dma.c.
      
      This moves the single TVE handling to npu-dma.c. This does not implement
      restoring though as it is highly unlikely that we can set the table to
      PCI PE and cannot to NPU PE and if that fails, we could only set 32bit
      table to NPU PE and this configuration is not really supported or wanted.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b04149c2
    • A
      powerpc/powernv: Reference iommu_table while it is linked to a group · 847e6563
      Alexey Kardashevskiy 提交于
      The iommu_table pointer stored in iommu_table_group may get stale
      by accident, this adds referencing and removes a redundant comment
      about this.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      847e6563
    • A
      powerpc/iommu_api: Move IOMMU groups setup to a single place · 5eada8a3
      Alexey Kardashevskiy 提交于
      Registering new IOMMU groups and adding devices to them are separated in
      code and the latter is dug in the DMA setup code which it does not
      really belong to.
      
      This moved IOMMU groups setup to a separate helper which registers a group
      and adds devices as before. This does not make a difference as IOMMU
      groups are not used anyway; the only dependency here is that
      iommu_add_device() requires a valid pointer to an iommu_table
      (set by set_iommu_table_base()).
      
      To keep the old behaviour, this does not add new IOMMU groups for PEs
      with no DMA weight and also skips NVLink bridges which do not have
      pci_controller_ops::setup_bridge (the normal way of adding PEs).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5eada8a3
    • A
      powerpc/powernv/pseries: Rework device adding to IOMMU groups · c4e9d3c1
      Alexey Kardashevskiy 提交于
      The powernv platform registers IOMMU groups and adds devices to them
      from the pci_controller_ops::setup_bridge() hook except one case when
      virtual functions (SRIOV VFs) are added from a bus notifier.
      
      The pseries platform registers IOMMU groups from
      the pci_controller_ops::dma_bus_setup() hook and adds devices from
      the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
      used for powernv does not add devices for pseries though as
      __of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
      
      Both platforms use iommu_add_device() which takes a device and expects
      it to have a valid IOMMU table struct with an iommu_table_group pointer
      which in turn points the iommu_group struct (which represents
      an IOMMU group). Although the helper seems easy to use, it relies on
      some pre-existing device configuration and associated data structures
      which it does not really need.
      
      This simplifies iommu_add_device() to take the table_group pointer
      directly. Pseries already has a table_group pointer handy and the bus
      notified is not used anyway. For powernv, this copies the existing bus
      notifier, makes it work for powernv only which means an easy way of
      getting to the table_group pointer. This was tested on VFs but should
      also support physical PCI hotplug.
      
      Since iommu_add_device() receives the table_group pointer directly,
      pseries does not do TCE cache invalidation (the hypervisor does) nor
      allow multiple groups per a VFIO container (in other words sharing
      an IOMMU table between partitionable endpoints), this removes
      iommu_table_group_link from pseries.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c4e9d3c1
    • A
      powerpc/pseries: Remove IOMMU API support for non-LPAR systems · c409c631
      Alexey Kardashevskiy 提交于
      The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
      registered for the pseries platform which does not have FW_FEATURE_LPAR;
      these would be pre-powernv platforms which we never supported PCI pass
      through for anyway so remove it.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c409c631
    • A
      powerpc/pseries/npu: Enable platform support · 3be2df00
      Alexey Kardashevskiy 提交于
      We already changed NPU API for GPUs to not to call OPAL and the remaining
      bit is initializing NPU structures.
      
      This searches for POWER9 NVLinks attached to any device on a PHB and
      initializes an NPU structure if any found.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3be2df00
    • A
      powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation · 68c0449e
      Alexey Kardashevskiy 提交于
      We might have memory@ nodes with "linux,usable-memory" set to zero
      (for example, to replicate powernv's behaviour for GPU coherent memory)
      which means that the memory needs an extra initialization but since
      it can be used afterwards, the pseries platform will try mapping it
      for DMA so the DMA window needs to cover those memory regions too;
      if the window cannot cover new memory regions, the memory onlining fails.
      
      This walks through the memory nodes to find the highest RAM address to
      let a huge DMA window cover that too in case this memory gets onlined
      later.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      68c0449e
    • A
      powerpc/powernv/npu: Move OPAL calls away from context manipulation · 0e759bd7
      Alexey Kardashevskiy 提交于
      When introduced, the NPU context init/destroy helpers called OPAL which
      enabled/disabled PID (a userspace memory context ID) filtering in an NPU
      per a GPU; this was a requirement for P9 DD1.0. However newer chip
      revision added a PID wildcard support so there is no more need to
      call OPAL every time a new context is initialized. Also, since the PID
      wildcard support was added, skiboot does not clear wildcard entries
      in the NPU so these remain in the hardware till the system reboot.
      
      This moves LPID and wildcard programming to the PE setup code which
      executes once during the booting process so NPU2 context init/destroy
      won't need to do additional configuration.
      
      This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
      this is the way to tell if the NPU support is present and configured.
      
      This moves pnv_npu2_init() declaration as pseries should be able to use it.
      This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
      call that. This exports pnv_npu2_map_lpar_dev() as following patches
      will use it from the VFIO driver.
      
      While at it, replace redundant list_for_each_entry_safe() with
      a simpler list_for_each_entry().
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0e759bd7
    • A
      powerpc/powernv: Move npu struct from pnv_phb to pci_controller · 46a1449d
      Alexey Kardashevskiy 提交于
      The powernv PCI code stores NPU data in the pnv_phb struct. The latter
      is referenced by pci_controller::private_data. We are going to have NPU2
      support in the pseries platform as well but it does not store any
      private_data in in the pci_controller struct; and even if it did,
      it would be a different data structure.
      
      This makes npu a pointer and stores it one level higher in
      the pci_controller struct.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      46a1449d
    • A
      powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Alexey Kardashevskiy 提交于
      This new memory does not have page structs as it is not plugged to
      the host so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of API can still be used;
      - mm_iommu_is_devmem() to know if the physical address is one of thise
      new regions which we must avoid unpinning of.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c10c21ef
    • A
      powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region · e0bf78b0
      Alexey Kardashevskiy 提交于
      Normally mm_iommu_get() should add a reference and mm_iommu_put() should
      remove it. However historically mm_iommu_find() does the referencing and
      mm_iommu_get() is doing allocation and referencing.
      
      We are going to add another helper to preregister device memory so
      instead of having mm_iommu_new() (which pre-registers the normal memory
      and references the region), we need separate helpers for pre-registering
      and referencing.
      
      This renames:
      - mm_iommu_get to mm_iommu_new;
      - mm_iommu_find to mm_iommu_get.
      
      This changes mm_iommu_get() to reference the region so the name now
      reflects what it does.
      
      This removes the check for exact match from mm_iommu_new() as we want it
      to fail on existing regions; mm_iommu_get() should be used instead.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e0bf78b0
    • A
      powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2 · ab7032e7
      Alexey Kardashevskiy 提交于
      The skiboot firmware has a hot reset handler which fences the NVIDIA V100
      GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
      https://github.com/open-power/skiboot/commit/fca2b2b839a67
      
      Now we are going to pass V100 via VFIO which most certainly involves
      KVM guests which are often terminated without getting a chance to offline
      GPU RAM so we end up with a running machine with misconfigured memory.
      Accessing this memory produces hardware management interrupts (HMI)
      which bring the host down.
      
      To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
      via pci_disable_device() which switches NPU2 to a safe mode and prevents
      HMIs.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NAlistair Popple <alistair@popple.id.au>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ab7032e7
    • C
      powerpc/mm: Fix reporting of kernel execute faults on the 8xx · ffca395b
      Christophe Leroy 提交于
      On the 8xx, no-execute is set via PPP bits in the PTE. Therefore
      a no-exec fault generates DSISR_PROTFAULT error bits,
      not DSISR_NOEXEC_OR_G.
      
      This patch adds DSISR_PROTFAULT in the test mask.
      
      Fixes: d3ca5874 ("powerpc/mm: Fix reporting of kernel execute faults")
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ffca395b
    • F
      powerpc: generate uapi header and system call table files · ab66dcc7
      Firoz Khan 提交于
      System call table generation script must be run to gener-
      ate unistd_32/64.h and syscall_table_32/64/c32/spu.h files.
      This patch will have changes which will invokes the script.
      
      This patch will generate unistd_32/64.h and syscall_table-
      _32/64/c32/spu.h files by the syscall table generation
      script invoked by parisc/Makefile and the generated files
      against the removed files must be identical.
      
      The generated uapi header file will be included in uapi/-
      asm/unistd.h and generated system call table header file
      will be included by kernel/systbl.S file.
      Signed-off-by: NFiroz Khan <firoz.khan@linaro.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      ab66dcc7
    • F
      powerpc: add system call table generation support · aff85039
      Firoz Khan 提交于
      The system call tables are in different format in all
      architecture and it will be difficult to manually add or
      modify the system calls in the respective files. To make
      it easy by keeping a script and which will generate the
      uapi header and syscall table file. This change will also
      help to unify the implementation across all architectures.
      
      The system call table generation script is added in
      syscalls directory which contain the script to generate
      both uapi header file and system call table files.
      The syscall.tbl file will be the input for the scripts.
      
      syscall.tbl contains the list of available system calls
      along with system call number and corresponding entry point.
      Add a new system call in this architecture will be possible
      by adding new entry in the syscall.tbl file.
      
      Adding a new table entry consisting of:
        	- System call number.
      	- ABI.
      	- System call name.
      	- Entry point name.
      	- Compat entry name, if required.
      
      syscallhdr.sh and syscalltbl.sh will generate uapi header-
      unistd_32/64.h and syscall_table_32/64/c32/spu.h files
      respectively. File syscall_table_32/64/c32/spu.h is incl-
      uded by syscall.S - the real system call table. Both *.sh
      files will parse the content syscall.tbl to generate the
      header and table files.
      
      ARM, s390 and x86 architecuture does have similar support.
      I leverage their implementation to come up with a generic
      solution.
      Signed-off-by: NFiroz Khan <firoz.khan@linaro.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      aff85039
    • F
      powerpc: split compat syscall table out from native table · fbf508da
      Firoz Khan 提交于
      PowerPC uses a syscall table with native and compat calls
      interleaved, which is a slightly simpler way to define two
      matching tables.
      
      As we move to having the tables generated, that advantage
      is no longer important, but the interleaved table gets in
      the way of using the same scripts as on the other archit-
      ectures.
      
      Split out a new compat_sys_call_table symbol that contains
      all the compat calls, and leave the main table for the nat-
      ive calls, to more closely match the method we use every-
      where else.
      Suggested-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NFiroz Khan <firoz.khan@linaro.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      fbf508da
    • F
      powerpc: move macro definition from asm/systbl.h · a11b763d
      Firoz Khan 提交于
      Move the macro definition for compat_sys_sigsuspend from
      asm/systbl.h to the file which it is getting included.
      
      One of the patch in this patch series is generating uapi
      header and syscall table files. In order to come up with
      a common implimentation across all architecture, we need
      to do this change.
      
      This change will simplify the implementation of system
      call table generation script and help to come up a common
      implementation across all architecture.
      Signed-off-by: NFiroz Khan <firoz.khan@linaro.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a11b763d
    • F
      powerpc: add __NR_syscalls along with NR_syscalls · 8a19eeea
      Firoz Khan 提交于
      NR_syscalls macro holds the number of system call exist
      in powerpc architecture. We have to change the value of
      NR_syscalls, if we add or delete a system call.
      
      One of the patch in this patch series has a script which
      will generate a uapi header based on syscall.tbl file.
      The syscall.tbl file contains the number of system call
      information. So we have two option to update NR_syscalls
      value.
      
      1. Update NR_syscalls in asm/unistd.h manually by count-
         ing the no.of system calls. No need to update NR_sys-
         calls until we either add a new system call or delete
         existing system call.
      
      2. We can keep this feature in above mentioned script,
         that will count the number of syscalls and keep it in
         a generated file. In this case we don't need to expli-
         citly update NR_syscalls in asm/unistd.h file.
      
      The 2nd option will be the recommended one. For that, I
      added the __NR_syscalls macro in uapi/asm/unistd.h along
      with NR_syscalls asm/unistd.h. The macro __NR_syscalls
      also added for making the name convention same across all
      architecture. While __NR_syscalls isn't strictly part of
      the uapi, having it as part of the generated header to
      simplifies the implementation. We also need to enclose
      this macro with #ifdef __KERNEL__ to avoid side effects.
      Signed-off-by: NFiroz Khan <firoz.khan@linaro.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8a19eeea
    • R
      powerpc/pkeys: Fix handling of pkey state across fork() · 2cd4bd19
      Ram Pai 提交于
      Protection key tracking information is not copied over to the
      mm_struct of the child during fork(). This can cause the child to
      erroneously allocate keys that were already allocated. Any allocated
      execute-only key is lost aswell.
      
      Add code; called by dup_mmap(), to copy the pkey state from parent to
      child explicitly.
      
      This problem was originally found by Dave Hansen on x86, which turns
      out to be a problem on powerpc aswell.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Reviewed-by: NThiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2cd4bd19
    • G
      ocxl: Fix endiannes bug in read_afu_name() · 2f07229f
      Greg Kurz 提交于
      The AFU Descriptor Template in the PCI config space has a Name Space
      field which is a 24 Byte ASCII character string of descriptive name
      space for the AFU. The OCXL driver read the string four characters at
      a time with pci_read_config_dword().
      
      This optimization is valid on a little-endian system since this is PCI,
      but a big-endian system ends up with each subset of four characters in
      reverse order.
      
      This could be fixed by switching to read characters one by one. Another
      option is to swap the bytes if we're big-endian.
      
      Go for the latter with le32_to_cpu().
      
      Cc: stable@vger.kernel.org      # v4.16
      Signed-off-by: NGreg Kurz <groug@kaod.org>
      Acked-by: NFrederic Barrat <fbarrat@linux.ibm.com>
      Acked-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2f07229f
    • B
      selftests/powerpc: Add checks for transactional sigreturn · 34642d70
      Breno Leitao 提交于
      This is a new test case that creates a signal and starts a suspended
      transaction inside the signal handler.
      
      It returns from the signal handler with the CPU at suspended state, but
      without setting user context MSR Transaction State (TS) field.
      
      The kernel signal handler code should be able to handle this discrepancy
      instead of crashing.
      
      This code could be compiled and used to test 32 and 64-bits signal
      handlers.
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NGustavo Romero <gromero@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      34642d70
    • B
      powerpc/tm: Unset MSR[TS] if not recheckpointing · 6f5b9f01
      Breno Leitao 提交于
      There is a TM Bad Thing bug that can be caused when you return from a
      signal context in a suspended transaction but with ucontext MSR[TS] unset.
      
      This forces regs->msr[TS] to be set at syscall entrance (since the CPU
      state is transactional). It also calls treclaim() to flush the transaction
      state, which is done based on the live (mfmsr) MSR state.
      
      Since user context MSR[TS] is not set, then restore_tm_sigcontexts() is not
      called, thus, not executing recheckpoint, keeping the CPU state as not
      transactional. When calling rfid, SRR1 will have MSR[TS] set, but the CPU
      state is non transactional, causing the TM Bad Thing with the following
      stack:
      
      	[   33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
      	cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
      	    pc: c00000000000c47c: fast_exception_return+0xac/0xb4
      	    lr: 00003fff865f442c
      	    sp: 3fffd9dce3e0
      	   msr: 8000000102a03031
      	  current = 0xc00000041f68b700
      	  paca    = 0xc00000000fb84800   softe: 0        irq_happened: 0x01
      	    pid   = 1721, comm = tm-signal-sigre
      	Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
      	WARNING: exception is not recoverable, can't continue
      
      The same problem happens on 32-bits signal handler, and the fix is very
      similar, if tm_recheckpoint() is not executed, then regs->msr[TS] should be
      zeroed.
      
      This patch also fixes a sparse warning related to lack of indentation when
      CONFIG_PPC_TRANSACTIONAL_MEM is set.
      
      Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context")
      CC: Stable <stable@vger.kernel.org>	# 3.10+
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Tested-by: NMichal Suchánek <msuchanek@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      6f5b9f01
    • B
      powerpc/tm: Print scratch value · 11be3958
      Breno Leitao 提交于
      Usually a TM Bad Thing exception is raised due to three different problems.
      a) touching SPRs in an active transaction; b) using TM instruction with the
      facility disabled and c) setting a wrong MSR/SRR1 at RFID.
      
      The two initial cases are easy to identify by looking at the instructions.
      The latter case is harder, because the MSR is masked after RFID, so, it is
      very useful to look at the previous MSR (SRR1) before RFID as also the
      current and masked MSR.
      
      Since MSR is saved at paca just before RFID, this patch prints it if a TM
      Bad thing happen, helping to understand what is the invalid TM transition
      that is causing the exception.
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      11be3958
    • B
      powerpc/tm: Save MSR to PACA before RFID · 63a0d6b0
      Breno Leitao 提交于
      As other exit points, move SRR1 (MSR) into paca->tm_scratch, so, if
      there is a TM Bad Thing in RFID, it is easy to understand what was the
      SRR1 value being used.
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      63a0d6b0