1. 22 Dec, 2018 3 commits
  2. 21 Dec, 2018 37 commits
    • vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver · 7f928917
      Committed by Alexey Kardashevskiy
      POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
      pluggable PCIe devices but still have PCIe links which are used
      for config space and MMIO. In addition to that the GPUs have 6 NVLinks
      which are connected to other GPUs and the POWER9 CPU. POWER9 chips
      have a special unit on a die called an NPU which is an NVLink2 host bus
      adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
      These systems also support ATS (address translation services) which is
      a part of the NVLink2 protocol. Such GPUs also share on-board RAM
      (16GB or 32GB) to the system via the same NVLink2 so a CPU has
      cache-coherent access to a GPU RAM.
      
      This exports GPU RAM to the userspace as a new VFIO device region and
      preregisters the new memory as device memory as it might be used for DMA.
      This inserts pfns from the fault handler because the GPU memory is not
      onlined until the vendor driver has loaded and trained the NVLinks; doing
      this earlier causes low-level errors. These are fenced in the firmware so
      they do not hurt the host system, but they are still better avoided. For
      the same reason this does not map GPU RAM into the host kernel (which is
      otherwise the usual thing to do for emulated access).
      
      This exports an ATSD (Address Translation Shootdown) register of NPU which
      allows TLB invalidations inside GPU for an operating system. The register
      conveniently occupies a single 64k page. It is also presented to
      the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
      each of them can be used for TLB invalidation in a GPU linked to this NPU.
      This allocates one ATSD register per NVLink bridge, which allows passing
      up to 6 registers. Due to a host firmware bug (just recently fixed),
      only 1 ATSD register per NPU was actually advertised to the host system,
      so this passes that single register via the first NVLink bridge device in
      the group. This is still enough as QEMU collects them all back and
      presents them to the guest via vPHB to mimic the emulated NPU PHB on the
      host.
      
      In order to provide the userspace with the information about GPU-to-NVLink
      connections, this exports an additional capability called "tgt"
      (which is an abbreviated host system bus address). The "tgt" property
      tells the GPU its own system address and allows the guest driver to
      conglomerate the routing information so each GPU knows how to get directly
      to the other GPUs.
      
      For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
      know LPID (a logical partition ID or a KVM guest hardware ID in other
      words) and PID (a memory context ID of a userspace process, not to be
      confused with a linux pid). This assigns a GPU to LPID in the NPU and
      this is why this adds a listener for KVM on an IOMMU group. A PID comes
      via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
      
      This requires coherent memory and ATSD to be available on the host as
      the GPU vendor only supports configurations with both features enabled
      and other configurations are known not to work. Because of this and
      because of the ways the features are advertised to the host system
      (which is a device tree with very platform specific properties),
      this requires the POWERNV platform to be enabled.
      
      The V100 GPUs do not advertise any of these capabilities via the config
      space and there are more than just one device ID so this relies on
      the platform to tell whether these GPUs have special abilities such as
      NVLinks.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio_pci: Allow regions to add own capabilities · c2c0f1cd
      Committed by Alexey Kardashevskiy
      VFIO regions already support region capabilities with a limited set of
      fields. However, a subdriver might have to report additional bits to
      the userspace.
      
      This adds an add_capability() hook to vfio_pci_regops.
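
A minimal sketch of the optional-hook pattern described above, assuming simplified stand-in types. The names `regops_sketch`, `region_sketch`, `caps_sketch` and the byte counts are hypothetical and are not the kernel's `vfio_pci_regops` definitions; the point is only that core code adds the common capability and then calls the subdriver hook if one is set.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's vfio_pci_regops or capability types. */
struct caps_sketch {
	size_t size; /* bytes of capability data accumulated so far */
};

struct region_sketch;

struct regops_sketch {
	/* Optional hook: a subdriver appends its own capability data. */
	int (*add_capability)(struct region_sketch *region,
			      struct caps_sketch *caps);
};

struct region_sketch {
	const struct regops_sketch *ops;
};

/* Core code: add the common capability, then let the subdriver add more. */
static int build_region_caps(struct region_sketch *region,
			     struct caps_sketch *caps)
{
	caps->size += 16; /* placeholder size for the common capability */
	if (region->ops && region->ops->add_capability)
		return region->ops->add_capability(region, caps);
	return 0;
}

/* Example subdriver hook that appends 8 more bytes (placeholder size). */
static int example_add_capability(struct region_sketch *region,
				  struct caps_sketch *caps)
{
	(void)region;
	caps->size += 8;
	return 0;
}
```

Regions without the hook keep working unchanged, which is why the hook is checked for NULL before the call.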
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • vfio_pci: Allow mapping extra regions · a15b1883
      Committed by Alexey Kardashevskiy
      So far we only allowed mapping of MMIO BARs to the userspace. However
      there are GPUs with on-board coherent RAM accessible via side
      channels which we also want to map to the userspace. The first client
      for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
      NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
      to the system address space; we are going to export this RAM as an extra
      PCI region.
      
      We already support extra PCI regions and this adds support for mapping
      them to the userspace.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Fault user page into the hypervisor's pagetable · 58629c0d
      Committed by Alexey Kardashevskiy
      When a page fault happens in a GPU, the GPU signals the OS and the GPU
      driver calls the fault handler which populates a page table; this allows
      the GPU to complete an ATS request.

      On bare metal get_user_pages() is enough as it adds a pte to
      the kernel page table but under KVM the partition scope tree does not get
      updated so ATS will still fail.
      
      This reads a byte from an effective address, which causes an HV storage
      interrupt so that KVM updates the partition scope tree.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Check mmio_atsd array bounds when populating · 135ef954
      Committed by Alexey Kardashevskiy
      A broken device tree might contain more than 8 values and introduce a hard
      to debug memory corruption bug. This adds the boundary check.
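
A minimal sketch of such a boundary check, assuming a fixed-size destination array of 8 entries. The function name `populate_atsd` and the constant `MAX_ATSD` are hypothetical and only illustrate clamping the copy to the array size; they are not the kernel code.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ATSD 8 /* assumed array size, matching the "8 values" above */

/*
 * Copy at most MAX_ATSD values from a (possibly broken) device-tree list.
 * Returns the number of values actually stored.
 */
static size_t populate_atsd(uint64_t dst[MAX_ATSD], const uint64_t *src,
			    size_t nr_src)
{
	size_t i;

	for (i = 0; i < nr_src && i < MAX_ATSD; i++)
		dst[i] = src[i];
	return i;
}
```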
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Add release_ownership hook · 1b785611
      Committed by Alexey Kardashevskiy
      In order to make ATS work and translate addresses for arbitrary
      LPID and PID, we need to program an NPU with LPID and allow PID wildcard
      matching with a specific MSR mask.
      
      This implements a helper to assign a GPU to LPAR and program the NPU
      with a wildcard for PID and a helper to do clean-up. The helper takes
      MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for
      ATS checkout requests support.
      
      This exports pnv_npu2_unmap_lpar_dev() as following patches will use it
      from the VFIO driver.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Add compound IOMMU groups · 0bd97167
      Committed by Alexey Kardashevskiy
      At the moment the powernv platform registers an IOMMU group for each
      PE. There is an exception though: an NVLink bridge which is attached
      to the corresponding GPU's IOMMU group making it a master.
      
      Now we have POWER9 systems with GPUs connected to each other directly
      bypassing PCI. At the moment we do not control state of these links so
      we have to put such interconnected GPUs to one IOMMU group which means
      that the old scheme with one GPU as a master won't work - there will
      be up to 3 GPUs in such group.
      
      This introduces an npu_comp struct which represents a compound IOMMU
      group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for
      NVLink bridges). This converts the existing NVLink1 code to use the
      new scheme. From now on, each PE must have a valid
      iommu_table_group_ops which will either be called directly (for a
      single PE group) or indirectly from a compound group's handlers.
      
      This moves IOMMU group registration for NVLink-connected GPUs to
      npu-dma.c. For POWER8, this stores a new compound group pointer in the
      PE (so a GPU is still a master); for POWER9 the new group pointer is
      stored in an NPU (which is allocated per PCI host controller).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops · 83fb8ccf
      Committed by Alexey Kardashevskiy
      At the moment NPU IOMMU is manipulated directly from the IODA2 PCI
      PE code; PCI PE acts as a master to NPU PE. Soon we will have compound
      IOMMU groups with several PEs from several different PHBs (such as
      interconnected GPUs and NPUs) so there will be no single master but
      one big IOMMU group.

      This makes a first step and converts an NPU PE with a set of extern
      functions to a table group.
      
      This should cause no behavioral change. Note that
      pnv_npu_release_ownership() has never been implemented.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Move single TVE handling to NPU PE · b04149c2
      Committed by Alexey Kardashevskiy
      Normal PCI PEs have 2 TVEs, one per DMA window; however an NPU PE has only
      one which points to one of the two tables of the corresponding PCI PE.

      So whenever a new DMA window is programmed to PEs, the NPU PE needs to
      release the old table in order to use the new one.
      
      Commit d41ce7b1 ("powerpc/powernv/npu: Do not try invalidating 32bit
      table when 64bit table is enabled") did just that but in pci-ioda.c
      while it actually belongs to npu-dma.c.
      
      This moves the single TVE handling to npu-dma.c. This does not implement
      restoring though, as it is highly unlikely that we can set the table on
      the PCI PE but not on the NPU PE; if that fails, we could only set the
      32bit table on the NPU PE, and this configuration is not really supported
      or wanted.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Reference iommu_table while it is linked to a group · 847e6563
      Committed by Alexey Kardashevskiy
      The iommu_table pointer stored in iommu_table_group may get stale
      by accident; this adds referencing and removes a redundant comment
      about this.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/iommu_api: Move IOMMU groups setup to a single place · 5eada8a3
      Committed by Alexey Kardashevskiy
      Registering new IOMMU groups and adding devices to them are separated in
      the code, and the latter is buried in the DMA setup code where it does
      not really belong.

      This moves IOMMU groups setup to a separate helper which registers a group
      and adds devices as before. This does not make a difference as IOMMU
      groups are not used anyway; the only dependency here is that
      iommu_add_device() requires a valid pointer to an iommu_table
      (set by set_iommu_table_base()).
      
      To keep the old behaviour, this does not add new IOMMU groups for PEs
      with no DMA weight and also skips NVLink bridges which do not have
      pci_controller_ops::setup_bridge (the normal way of adding PEs).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/pseries: Rework device adding to IOMMU groups · c4e9d3c1
      Committed by Alexey Kardashevskiy
      The powernv platform registers IOMMU groups and adds devices to them
      from the pci_controller_ops::setup_bridge() hook except one case when
      virtual functions (SRIOV VFs) are added from a bus notifier.
      
      The pseries platform registers IOMMU groups from
      the pci_controller_ops::dma_bus_setup() hook and adds devices from
      the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
      used for powernv does not add devices for pseries though as
      __of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
      
      Both platforms use iommu_add_device() which takes a device and expects
      it to have a valid IOMMU table struct with an iommu_table_group pointer
      which in turn points to the iommu_group struct (which represents
      an IOMMU group). Although the helper seems easy to use, it relies on
      some pre-existing device configuration and associated data structures
      which it does not really need.
      
      This simplifies iommu_add_device() to take the table_group pointer
      directly. Pseries already has a table_group pointer handy and the bus
      notifier is not used anyway. For powernv, this copies the existing bus
      notifier and makes it work for powernv only, which gives an easy way of
      getting to the table_group pointer. This was tested on VFs but should
      also support physical PCI hotplug.
      
      Since iommu_add_device() receives the table_group pointer directly,
      pseries does not do TCE cache invalidation (the hypervisor does) nor
      allow multiple groups per a VFIO container (in other words sharing
      an IOMMU table between partitionable endpoints), this removes
      iommu_table_group_link from pseries.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: Remove IOMMU API support for non-LPAR systems · c409c631
      Committed by Alexey Kardashevskiy
      The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
      registered for the pseries platform which does not have FW_FEATURE_LPAR;
      these would be pre-powernv platforms for which we never supported PCI
      pass-through anyway, so remove it.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries/npu: Enable platform support · 3be2df00
      Committed by Alexey Kardashevskiy
      We already changed the NPU API for GPUs not to call OPAL and the remaining
      bit is initializing NPU structures.

      This searches for POWER9 NVLinks attached to any device on a PHB and
      initializes an NPU structure if any are found.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation · 68c0449e
      Committed by Alexey Kardashevskiy
      We might have memory@ nodes with "linux,usable-memory" set to zero
      (for example, to replicate powernv's behaviour for GPU coherent memory)
      which means that the memory needs an extra initialization but since
      it can be used afterwards, the pseries platform will try mapping it
      for DMA so the DMA window needs to cover those memory regions too;
      if the window cannot cover new memory regions, the memory onlining fails.
      
      This walks through the memory nodes to find the highest RAM address to
      let a huge DMA window cover that too in case this memory gets onlined
      later.
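
The walk described above amounts to taking the maximum end address over all memory ranges. A minimal sketch, assuming the ranges have already been read into an array; `mem_range_sketch` and `max_ram_end` are hypothetical names, and the real code reads the ranges from device tree memory@ nodes instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical representation of one memory@ node: base address and size. */
struct mem_range_sketch {
	uint64_t base;
	uint64_t size;
};

/* Return the highest end address (base + size) over all ranges, 0 if none. */
static uint64_t max_ram_end(const struct mem_range_sketch *ranges, size_t n)
{
	uint64_t max_end = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t end = ranges[i].base + ranges[i].size;

		if (end > max_end)
			max_end = end;
	}
	return max_end;
}
```

Ranges that are not yet usable are deliberately included, since the DMA window has to cover them once they are onlined.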
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv/npu: Move OPAL calls away from context manipulation · 0e759bd7
      Committed by Alexey Kardashevskiy
      When introduced, the NPU context init/destroy helpers called OPAL which
      enabled/disabled PID (a userspace memory context ID) filtering in an NPU
      per GPU; this was a requirement for P9 DD1.0. However, newer chip
      revisions added PID wildcard support so there is no more need to
      call OPAL every time a new context is initialized. Also, since the PID
      wildcard support was added, skiboot does not clear wildcard entries
      in the NPU so these remain in the hardware till the system reboot.
      
      This moves LPID and wildcard programming to the PE setup code which
      executes once during the booting process so NPU2 context init/destroy
      won't need to do additional configuration.
      
      This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
      this is the way to tell if the NPU support is present and configured.
      
      This moves pnv_npu2_init() declaration as pseries should be able to use it.
      This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
      call that. This exports pnv_npu2_map_lpar_dev() as following patches
      will use it from the VFIO driver.
      
      While at it, replace redundant list_for_each_entry_safe() with
      a simpler list_for_each_entry().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Move npu struct from pnv_phb to pci_controller · 46a1449d
      Committed by Alexey Kardashevskiy
      The powernv PCI code stores NPU data in the pnv_phb struct. The latter
      is referenced by pci_controller::private_data. We are going to have NPU2
      support in the pseries platform as well but it does not store any
      private_data in the pci_controller struct; and even if it did,
      it would be a different data structure.
      
      This makes npu a pointer and stores it one level higher in
      the pci_controller struct.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Committed by Alexey Kardashevskiy
      This new memory does not have page structs as it is not plugged to
      the host so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of API can still be used;
      - mm_iommu_is_devmem() to know if the physical address is one of these
      new regions which we must avoid unpinning.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region · e0bf78b0
      Committed by Alexey Kardashevskiy
      Normally mm_iommu_get() should add a reference and mm_iommu_put() should
      remove it. However historically mm_iommu_find() does the referencing and
      mm_iommu_get() is doing allocation and referencing.
      
      We are going to add another helper to preregister device memory so
      instead of having mm_iommu_new() (which pre-registers the normal memory
      and references the region), we need separate helpers for pre-registering
      and referencing.
      
      This renames:
      - mm_iommu_get to mm_iommu_new;
      - mm_iommu_find to mm_iommu_get.
      
      This changes mm_iommu_get() to reference the region so the name now
      reflects what it does.
      
      This removes the check for exact match from mm_iommu_new() as we want it
      to fail on existing regions; mm_iommu_get() should be used instead.
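
The naming convention after the rename can be modelled with a small userspace sketch. This is a simplified illustration, assuming a fixed table of regions and plain integer reference counts; `prereg_new`, `prereg_get` and `prereg_put` are hypothetical names and do not reproduce the kernel helpers or their locking.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PREREG 4

/* Hypothetical preregistered region; refs == 0 marks a free slot. */
struct prereg_sketch {
	uint64_t ua;
	uint64_t size;
	int refs;
};

static struct prereg_sketch prereg_slots[MAX_PREREG];

/* "get": find an exactly matching region and take a reference on it. */
static struct prereg_sketch *prereg_get(uint64_t ua, uint64_t size)
{
	size_t i;

	for (i = 0; i < MAX_PREREG; i++) {
		struct prereg_sketch *r = &prereg_slots[i];

		if (r->refs > 0 && r->ua == ua && r->size == size) {
			r->refs++;
			return r;
		}
	}
	return NULL;
}

/* "new": preregister a region; fails if it overlaps an existing one. */
static struct prereg_sketch *prereg_new(uint64_t ua, uint64_t size)
{
	struct prereg_sketch *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_PREREG; i++) {
		struct prereg_sketch *r = &prereg_slots[i];

		if (r->refs == 0) {
			if (!free_slot)
				free_slot = r;
			continue;
		}
		if (ua < r->ua + r->size && r->ua < ua + size)
			return NULL; /* existing region: caller should use get */
	}
	if (!free_slot)
		return NULL;
	free_slot->ua = ua;
	free_slot->size = size;
	free_slot->refs = 1; /* the creator holds the first reference */
	return free_slot;
}

/* "put": drop one reference; the slot becomes free when it reaches zero. */
static void prereg_put(struct prereg_sketch *r)
{
	if (r->refs > 0)
		r->refs--;
}
```

A caller would try `prereg_get()` first and fall back to `prereg_new()` only when no matching region exists.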
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2 · ab7032e7
      Committed by Alexey Kardashevskiy
      The skiboot firmware has a hot reset handler which fences the NVIDIA V100
      GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
      https://github.com/open-power/skiboot/commit/fca2b2b839a67
      
      Now we are going to pass V100 via VFIO which most certainly involves
      KVM guests which are often terminated without getting a chance to offline
      GPU RAM so we end up with a running machine with misconfigured memory.
      Accessing this memory produces hardware management interrupts (HMI)
      which bring the host down.
      
      To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
      via pci_disable_device() which switches NPU2 to a safe mode and prevents
      HMIs.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alistair Popple <alistair@popple.id.au>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: Fix reporting of kernel execute faults on the 8xx · ffca395b
      Committed by Christophe Leroy
      On the 8xx, no-execute is set via PPP bits in the PTE. Therefore
      a no-exec fault generates DSISR_PROTFAULT error bits,
      not DSISR_NOEXEC_OR_G.
      
      This patch adds DSISR_PROTFAULT in the test mask.
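
The effect of widening the test mask can be shown with a small sketch. The bit values below are assumptions for illustration and should be checked against the kernel headers; the `SK_` prefixed names and both helper functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only. */
#define SK_DSISR_NOEXEC_OR_G 0x10000000u
#define SK_DSISR_PROTFAULT   0x08000000u

/* Mask before the change: a protection fault is not seen as an exec fault. */
static bool exec_fault_old_mask(uint32_t error_code)
{
	return (error_code & SK_DSISR_NOEXEC_OR_G) != 0;
}

/* Mask after the change: protection faults are included in the test. */
static bool exec_fault_new_mask(uint32_t error_code)
{
	return (error_code &
		(SK_DSISR_NOEXEC_OR_G | SK_DSISR_PROTFAULT)) != 0;
}
```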
      
      Fixes: d3ca5874 ("powerpc/mm: Fix reporting of kernel execute faults")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: generate uapi header and system call table files · ab66dcc7
      Committed by Firoz Khan
      The system call table generation script must be run to generate the
      unistd_32/64.h and syscall_table_32/64/c32/spu.h files. This patch
      contains the changes which invoke the script.

      This patch generates the unistd_32/64.h and
      syscall_table_32/64/c32/spu.h files with the syscall table generation
      script invoked by parisc/Makefile; the generated files must be identical
      to the removed files.

      The generated uapi header file is included in uapi/asm/unistd.h and
      the generated system call table header file is included by the
      kernel/systbl.S file.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: add system call table generation support · aff85039
      Committed by Firoz Khan
      The system call tables are in a different format in each
      architecture, which makes it difficult to manually add or
      modify system calls in the respective files. To make this
      easier, keep a script which generates the uapi header and
      the syscall table file. This change will also help to unify
      the implementation across all architectures.
      
      The system call table generation scripts are added in the
      syscalls directory; they generate both the uapi header file
      and the system call table files. The syscall.tbl file is the
      input for the scripts.
      
      syscall.tbl contains the list of available system calls
      along with the system call number and corresponding entry point.
      A new system call can be added to this architecture by adding
      a new entry to the syscall.tbl file.

      A new table entry consists of:
        	- System call number.
      	- ABI.
      	- System call name.
      	- Entry point name.
      	- Compat entry name, if required.
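
The five fields listed above can be read from a whitespace-separated line as in the following sketch. It assumes one entry per line in exactly that field order; the struct and function names are hypothetical, and the generation in this patch is done by shell scripts rather than C.

```c
#include <stdio.h>

/* Hypothetical in-memory form of one syscall.tbl entry. */
struct syscall_entry_sketch {
	int number;
	char abi[16];
	char name[64];
	char entry[64];
	char compat[64]; /* empty string when no compat entry is given */
};

/* Returns 0 on success, -1 if the four mandatory fields are not present. */
static int parse_tbl_line(const char *line, struct syscall_entry_sketch *e)
{
	int n;

	e->compat[0] = '\0';
	n = sscanf(line, "%d %15s %63s %63s %63s",
		   &e->number, e->abi, e->name, e->entry, e->compat);
	return n >= 4 ? 0 : -1;
}
```

For example, a made-up line such as `11 common example sys_example compat_sys_example` would fill all five fields, while a line without the last column leaves `compat` empty.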
      
      syscallhdr.sh and syscalltbl.sh generate the uapi header
      unistd_32/64.h and the syscall_table_32/64/c32/spu.h files
      respectively. The syscall_table_32/64/c32/spu.h file is
      included by syscall.S - the real system call table. Both *.sh
      files parse the content of syscall.tbl to generate the
      header and table files.

      ARM, s390 and x86 architectures have similar support.
      I leveraged their implementation to come up with a generic
      solution.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: split compat syscall table out from native table · fbf508da
      Committed by Firoz Khan
      PowerPC uses a syscall table with native and compat calls
      interleaved, which is a slightly simpler way to define two
      matching tables.
      
      As we move to having the tables generated, that advantage
      is no longer important, but the interleaved table gets in
      the way of using the same scripts as on the other
      architectures.

      Split out a new compat_sys_call_table symbol that contains
      all the compat calls, and leave the main table for the
      native calls, to more closely match the method we use
      everywhere else.
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: move macro definition from asm/systbl.h · a11b763d
      Committed by Firoz Khan
      Move the macro definition for compat_sys_sigsuspend from
      asm/systbl.h to the file in which it is included.

      One of the patches in this series generates the uapi
      header and syscall table files. In order to come up with
      a common implementation across all architectures, we need
      to make this change.

      This change will simplify the implementation of the system
      call table generation script and help to come up with a
      common implementation across all architectures.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: add __NR_syscalls along with NR_syscalls · 8a19eeea
      Committed by Firoz Khan
      The NR_syscalls macro holds the number of system calls that
      exist in the powerpc architecture. We have to change the value
      of NR_syscalls if we add or delete a system call.

      One of the patches in this series has a script which
      generates a uapi header based on the syscall.tbl file.
      The syscall.tbl file contains the system call number
      information, so we have two options for updating the
      NR_syscalls value.

      1. Update NR_syscalls in asm/unistd.h manually by counting
         the number of system calls. There is no need to update
         NR_syscalls until we either add a new system call or
         delete an existing one.

      2. Keep this feature in the above-mentioned script, which
         counts the number of syscalls and keeps it in a generated
         file. In this case we don't need to explicitly update
         NR_syscalls in the asm/unistd.h file.

      The 2nd option is the recommended one. For that, I
      added the __NR_syscalls macro in uapi/asm/unistd.h along
      with NR_syscalls in asm/unistd.h. The macro __NR_syscalls
      is also added to make the naming convention the same across
      all architectures. While __NR_syscalls isn't strictly part of
      the uapi, having it as part of the generated header
      simplifies the implementation. We also need to enclose
      this macro with #ifdef __KERNEL__ to avoid side effects.
      Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pkeys: Fix handling of pkey state across fork() · 2cd4bd19
      Committed by Ram Pai
      Protection key tracking information is not copied over to the
      mm_struct of the child during fork(). This can cause the child to
      erroneously allocate keys that were already allocated. Any allocated
      execute-only key is lost as well.

      Add code, called by dup_mmap(), to copy the pkey state from parent to
      child explicitly.
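
A minimal sketch of such an explicit copy, assuming the tracking state consists of an allocation bitmap and an execute-only key. The struct `pkey_state_sketch` and its field names are hypothetical and only model the idea; they are not the powerpc mm context layout.

```c
#include <stdint.h>

/* Hypothetical model of the per-mm pkey tracking state. */
struct pkey_state_sketch {
	uint32_t allocation_map; /* one bit per allocated key */
	int execute_only_pkey;   /* -1 when no execute-only key is allocated */
};

/* Copy the parent's pkey state into the child at fork time. */
static void dup_pkeys_sketch(const struct pkey_state_sketch *parent,
			     struct pkey_state_sketch *child)
{
	child->allocation_map = parent->allocation_map;
	child->execute_only_pkey = parent->execute_only_pkey;
}
```

Without the copy, a freshly initialised child state would report every key as free, which is the erroneous allocation described above.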
      
      This problem was originally found by Dave Hansen on x86, and it turns
      out to be a problem on powerpc as well.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • ocxl: Fix endiannes bug in read_afu_name() · 2f07229f
      Committed by Greg Kurz
      The AFU Descriptor Template in the PCI config space has a Name Space
      field which is a 24-byte ASCII character string of descriptive name
      space for the AFU. The OCXL driver reads the string four characters at
      a time with pci_read_config_dword().
      
      This optimization is valid on a little-endian system since this is PCI,
      but a big-endian system ends up with each subset of four characters in
      reverse order.
      
      This could be fixed by switching to reading characters one by one. Another
      option is to swap the bytes if we're big-endian.
      
      Go for the latter with le32_to_cpu().
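
The byte-order issue can be reproduced in userspace with a short sketch. Instead of the kernel's le32_to_cpu(), it stores each dword with explicit shifts in little-endian order, which gives the same bytes on any host; `put_le32` and `name_from_dwords` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store a dword value in little-endian byte order, so the bytes land in
 * config-space order regardless of the host CPU's endianness.
 */
static void put_le32(uint8_t *dst, uint32_t val)
{
	dst[0] = (uint8_t)(val & 0xff);
	dst[1] = (uint8_t)((val >> 8) & 0xff);
	dst[2] = (uint8_t)((val >> 16) & 0xff);
	dst[3] = (uint8_t)((val >> 24) & 0xff);
}

/*
 * Assemble a NUL-terminated name from n dwords read four characters at a
 * time. buf must have room for 4 * n + 1 bytes.
 */
static void name_from_dwords(char *buf, const uint32_t *dwords, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		put_le32((uint8_t *)buf + 4 * i, dwords[i]);
	buf[4 * n] = '\0';
}
```

Copying the dword values into the buffer with a plain memcpy() would give the same result on a little-endian host but reversed groups of four on a big-endian one.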
      
      Cc: stable@vger.kernel.org      # v4.16
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
      Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • selftests/powerpc: Add checks for transactional sigreturn · 34642d70
      Committed by Breno Leitao
      This is a new test case that creates a signal and starts a suspended
      transaction inside the signal handler.
      
      It returns from the signal handler with the CPU in suspended state, but
      without setting the user context MSR Transaction State (TS) field.
      
      The kernel signal handler code should be able to handle this discrepancy
      instead of crashing.
      
      This code can be compiled and used to test both 32-bit and 64-bit signal
      handlers.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/tm: Unset MSR[TS] if not recheckpointing · 6f5b9f01
      Committed by Breno Leitao
      A TM Bad Thing can be triggered when returning from a signal context in
      a suspended transaction but with ucontext MSR[TS] unset.

      In this situation regs->msr[TS] is forced to be set at syscall entry
      (since the CPU state is transactional), and treclaim() is called to flush
      the transaction state, based on the live (mfmsr) MSR state.

      Since the user context MSR[TS] is not set, restore_tm_sigcontexts() is
      not called, so no recheckpoint is executed and the CPU state remains
      non-transactional. On rfid, SRR1 has MSR[TS] set but the CPU state is
      non-transactional, causing the TM Bad Thing with the following stack:
      
      	[   33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
      	cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
      	    pc: c00000000000c47c: fast_exception_return+0xac/0xb4
      	    lr: 00003fff865f442c
      	    sp: 3fffd9dce3e0
      	   msr: 8000000102a03031
      	  current = 0xc00000041f68b700
      	  paca    = 0xc00000000fb84800   softe: 0        irq_happened: 0x01
      	    pid   = 1721, comm = tm-signal-sigre
      	Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
      	WARNING: exception is not recoverable, can't continue
      
      The same problem happens in the 32-bit signal handler, and the fix is
      very similar: if tm_recheckpoint() is not executed, regs->msr[TS] must be
      zeroed.
      
      This patch also fixes a sparse warning related to lack of indentation when
      CONFIG_PPC_TRANSACTIONAL_MEM is set.
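
      The decision described above can be modelled in plain C. This is only
      a sketch under stated assumptions: struct model_regs, model_sigreturn()
      and the MODEL_* bit positions are illustrative stand-ins, not the
      kernel's definitions or the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative positions of the two MSR[TS] bits (an assumption). */
#define MODEL_MSR_TS_S    (1ULL << 33)   /* suspended */
#define MODEL_MSR_TS_T    (1ULL << 34)   /* transactional */
#define MODEL_MSR_TS_MASK (MODEL_MSR_TS_S | MODEL_MSR_TS_T)

/* Hypothetical model of the register state saved for a task. */
struct model_regs {
    uint64_t msr;
};

/*
 * Model of the sigreturn decision: recheckpoint only when the user
 * context asks for a transactional state; otherwise clear MSR[TS] so
 * that the saved MSR matches the non-transactional CPU state.
 * Returns true when the recheckpoint path was taken.
 */
static bool model_sigreturn(struct model_regs *regs, uint64_t user_msr)
{
    if (user_msr & MODEL_MSR_TS_MASK) {
        /* Stands in for restore_tm_sigcontexts() and the recheckpoint. */
        regs->msr = (regs->msr & ~MODEL_MSR_TS_MASK) |
                    (user_msr & MODEL_MSR_TS_MASK);
        return true;
    }

    /* No recheckpoint happened: drop the stale TS bits. */
    regs->msr &= ~MODEL_MSR_TS_MASK;
    return false;
}
```

      In this model, a task whose saved MSR still has a TS bit set returns
      with both TS bits clear whenever the user context MSR has none set.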
      
      Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context")
      CC: Stable <stable@vger.kernel.org>	# 3.10+
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Tested-by: Michal Suchánek <msuchanek@suse.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • B
      powerpc/tm: Print scratch value · 11be3958
      Breno Leitao committed
      Usually a TM Bad Thing exception is raised for one of three reasons:
      a) touching SPRs in an active transaction; b) using a TM instruction with
      the facility disabled; and c) setting a wrong MSR/SRR1 at RFID.

      The first two cases are easy to identify by looking at the instructions.
      The last case is harder, because the MSR is masked after RFID, so it is
      very useful to look at the previous MSR (SRR1) before RFID as well as the
      current, masked MSR.

      Since MSR is saved at paca just before RFID, this patch prints it if a TM
      Bad Thing happens, helping to identify the invalid TM transition that
      caused the exception.

      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • B
      powerpc/tm: Save MSR to PACA before RFID · 63a0d6b0
      Breno Leitao committed
      As at other exit points, save SRR1 (MSR) into paca->tm_scratch so that,
      if a TM Bad Thing occurs at RFID, it is easy to see which SRR1 value was
      being used.

      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • B
      powerpc/tm: Set MSR[TS] just prior to recheckpoint · e1c3743e
      Breno Leitao committed
      On a signal handler return, the user could set a context with MSR[TS] bits
      set, and these bits would be copied to task regs->msr.
      
      In restore_tm_sigcontexts(), after the current task's regs->msr[TS] bits
      are set, several __get_user() calls are made and then a recheckpoint is
      executed.

      This is a problem since a page fault (in kernel space) could happen when
      calling __get_user(). If it happens, the process MSR[TS] bits were
      already set, but recheckpoint was not executed, and SPRs are still invalid.

      The page fault can cause the current process to be de-scheduled, with
      MSR[TS] active and without tm_recheckpoint() being called. More
      importantly, the TEXASR[FS] bit is not set either.

      When the process is scheduled back in, it will try to reclaim, which is
      aborted because the CPU is not in the suspended state, and then
      recheckpoint. Since TEXASR[FS] might not have been set, this recheckpoint
      restores thread->texasr, which might be zero, into the TEXASR SPR,
      hitting a BUG_ON().
      
      	kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
      	cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
      	    pc: c000000000054550: restore_gprs+0xb0/0x180
      	    lr: 0000000000000000
      	    sp: c00000041f157950
      	   msr: 8000000100021033
      	  current = 0xc00000041f143000
      	  paca    = 0xc00000000fb86300	 softe: 0	 irq_happened: 0x01
      	    pid   = 1021, comm = kworker/11:1
      	kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
      	Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
      	enter ? for help
      	[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
      	[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
      	[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
      	[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
      	[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
      	[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
      	[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
      
      This patch simply delays setting MSR[TS], so that a page fault in the
      __get_user() section does not see regs->msr[TS] set while the TM
      structures are still invalid. This avoids TM operations for in-kernel
      exceptions and a possible process reschedule.
      
      With this patch, the MSR[TS] will only be set just before recheckpointing
      and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
      invalid state.
      
      In addition, if CONFIG_PREEMPT is set, a preemption might occur just
      after MSR[TS] is set and before tm_recheckpoint(). This block must
      therefore be atomic from a preemption perspective, so it is wrapped in
      preempt_disable()/preempt_enable().
      
      It is not possible to move tm_recheckpoint() earlier, because the
      checkpointed registers must first be fetched from userspace with
      __get_user(). The only way to avoid this undesired behavior is therefore
      to delay setting MSR[TS].
      
      The 32-bit signal handler seems to be safe from this issue, but it
      might be exposed to the preemption issue, so preemption is disabled in
      that chunk of code as well.
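
      The ordering described above can be illustrated with a small userspace
      model. It is a sketch only: struct tm_task_model, tm_model_restore(),
      the TM_MODEL_* bit positions and the preempt_depth counter are
      hypothetical stand-ins for the kernel's state, __get_user(),
      preempt_disable()/preempt_enable() and tm_recheckpoint():

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative positions of the two MSR[TS] bits (an assumption). */
#define TM_MODEL_TS_S    (1ULL << 33)
#define TM_MODEL_TS_T    (1ULL << 34)
#define TM_MODEL_TS_MASK (TM_MODEL_TS_S | TM_MODEL_TS_T)

/* Hypothetical model of the per-task state involved. */
struct tm_task_model {
    uint64_t msr;          /* stands in for regs->msr */
    bool recheckpointed;   /* set once the recheckpoint stand-in has run */
    int preempt_depth;     /* stands in for the preempt count */
};

/*
 * Model of the delayed-set ordering. user_copy_ok == false models a
 * failed or faulting user copy. Returns true when the restore completed.
 */
static bool tm_model_restore(struct tm_task_model *t, uint64_t user_msr,
                             bool user_copy_ok)
{
    /*
     * Step 1: fetch the checkpointed state from userspace. MSR[TS] is
     * deliberately not touched yet, so a fault here never observes a
     * transactional MSR with invalid TM state.
     */
    if (!user_copy_ok)
        return false;

    /* Step 2: set MSR[TS] and recheckpoint with preemption disabled. */
    t->preempt_depth++;                       /* preempt_disable() stand-in */
    t->msr |= user_msr & TM_MODEL_TS_MASK;
    t->recheckpointed = true;                 /* tm_recheckpoint() stand-in */
    t->preempt_depth--;                       /* preempt_enable() stand-in */
    return true;
}
```

      In this model a failed user copy leaves the TS bits untouched, and the
      TS bits are only ever set in the same non-preemptible step that
      performs the recheckpoint.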
      
      Changes from v2:
       * Run the critical section with preempt_disable.
      
      Fixes: 87b4e539 ("powerpc/tm: Fix return of active 64bit signals")
      Cc: stable@vger.kernel.org (v3.9+)
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc/fadump: Do not allow hot-remove memory from fadump reserved area. · 0db6896f
      Mahesh Salgaonkar committed
      For fadump to work successfully there must not be any holes in the
      reserved memory ranges to which the kernel has asked firmware to move the
      content of old kernel memory in the event of a crash. Now that fadump
      uses CMA for the reserved area, this memory is not protected from
      hot-remove operations unless it is CMA-allocated. Hence, the fadump
      service can fail to re-register after a hot-remove operation if the
      hot-removed memory belongs to the fadump reserved region. To avoid this,
      make sure that memory from the fadump reserved area is not hot-removable
      while fadump is registered.
      
      However, if the user still wants to remove that memory, they can do so
      by manually stopping the fadump service before the hot-remove operation.
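
      The rule above amounts to an interval-overlap test that only applies
      while fadump is registered. A minimal sketch, assuming a single
      reserved range; struct fadump_model and hot_remove_allowed() are
      hypothetical names, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the fadump state relevant to hot-remove. */
struct fadump_model {
    bool registered;         /* is the fadump service registered? */
    uint64_t reserve_base;   /* start of the reserved area */
    uint64_t reserve_size;   /* size of the reserved area in bytes */
};

/*
 * Return true if [start, start + size) may be hot-removed: either
 * fadump is not registered, or the range does not touch the reserved
 * area at all.
 */
static bool hot_remove_allowed(const struct fadump_model *fd,
                               uint64_t start, uint64_t size)
{
    uint64_t end = start + size;
    uint64_t res_end = fd->reserve_base + fd->reserve_size;

    if (!fd->registered)
        return true;

    /* Half-open intervals overlap iff each starts before the other ends. */
    return !(start < res_end && fd->reserve_base < end);
}
```

      Stopping the fadump service corresponds to clearing the registered
      flag in this model, after which any range may be removed.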

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc/fadump: Throw proper error message on fadump registration failure · f86593be
      Mahesh Salgaonkar committed
      fadump fails to register when there are holes in the reserved memory
      area. This can happen if the user has hot-removed memory that falls in
      the fadump reserved memory area. Print a meaningful error message to the
      user in such a case.
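
      One way to detect such holes is to walk the present memory ranges and
      check that they cover the reserved area without a gap. This is a
      sketch under the assumption that the ranges are sorted by base address;
      struct mem_range_model and area_is_contiguous() are hypothetical names
      and do not reproduce the kernel's is_reserved_memory_area_contiguous():

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one present memory range. */
struct mem_range_model {
    uint64_t base;
    uint64_t size;
};

/*
 * Return true if the ranges in r[] (sorted by ascending base) cover
 * the whole area [start, end) without any hole.
 */
static bool area_is_contiguous(const struct mem_range_model *r, size_t n,
                               uint64_t start, uint64_t end)
{
    uint64_t covered = start;   /* everything in [start, covered) is present */

    for (size_t i = 0; i < n && covered < end; i++) {
        uint64_t rb = r[i].base;
        uint64_t re = rb + r[i].size;

        if (re <= covered)
            continue;           /* range lies entirely below the frontier */
        if (rb > covered)
            break;              /* gap before this range: a hole */
        covered = re;           /* range extends the covered frontier */
    }
    return covered >= end;
}
```

      A registration path could call such a check and print the error
      message when it returns false.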

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      [mpe: is_reserved_memory_area_contiguous() returns bool, unsplit string]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc/fadump: Reservationless firmware assisted dump · a4e92ce8
      Mahesh Salgaonkar committed
      One of the primary issues with Firmware Assisted Dump (fadump) on Power
      is that it needs a large amount of memory to be reserved. On large
      systems with terabytes of memory, this reservation can be quite
      significant.
      
      In some cases, fadump fails if the memory reserved is insufficient, or
      if the reserved memory was DLPAR hot-removed.
      
      In the normal case, post reboot, the preserved memory is filtered to
      extract only relevant areas of interest using the makedumpfile tool.
      While the tool provides flexibility to determine what needs to be part
      of the dump and what memory to filter out, all supported distributions
      default this to "Capture only kernel data and nothing else".
      
      We take advantage of this default and the Linux kernel's Contiguous
      Memory Allocator (CMA) to fundamentally change the memory reservation
      model for fadump.
      
      Instead of setting aside a significant chunk of memory nobody can use,
      this patch uses CMA to reserve a significant chunk of memory that the
      kernel is prevented from using (due to MIGRATE_CMA), but that
      applications are free to use. With this, fadump will still be able to
      capture all of the kernel memory and most of the user space memory,
      except the user pages that were present in the CMA region.
      
      Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
      [root@zzxx-yy10 ~]# free -m
                    total        used        free      shared  buff/cache   available
      Mem:           7557         193        6822          12         541        6725
      Swap:          4095           0        4095
      
      With this patch:
      [root@zzxx-yy10 ~]# free -m
                    total        used        free      shared  buff/cache   available
      Mem:           8133         194        7464          12         475        7338
      Swap:          4095           0        4095
      
      Changes made here are completely transparent to how fadump has
      traditionally worked.
      
      Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
      CMA and its usage.
      
      TODO:
      - Handle case where CMA reservation spans nodes.

      Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • M
      powerpc/powernv: Move opal_power_control_init() call in opal_init(). · 08fb726d
      Mahesh Salgaonkar committed
      opal_power_control_init() depends on the opal message notifier being
      initialized, which is done in opal_init()->opal_message_init(). But both
      of these initializations are called through machine initcalls, so the
      result depends on the order in which they are called. So far they have
      been called in the correct order (maybe we got lucky) and we never saw
      any issue. But it is clearer to control the initialization order
      explicitly by moving opal_power_control_init() into opal_init().
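
      The difference between implicit and explicit ordering can be shown with
      a small model. All names below (model_message_init(),
      model_power_control_init(), model_opal_init()) are hypothetical
      stand-ins, not the kernel functions:

```c
#include <stdbool.h>

/* Model state: has each subsystem been initialized? */
static bool message_notifier_ready;
static bool power_control_ready;

/* Stand-in for the message notifier initialization. */
static int model_message_init(void)
{
    message_notifier_ready = true;
    return 0;
}

/* Stand-in for the power control init; it needs the notifier first. */
static int model_power_control_init(void)
{
    if (!message_notifier_ready)
        return -1;              /* dependency not initialized yet */
    power_control_ready = true;
    return 0;
}

/*
 * Stand-in for the parent init function: calling the dependent step
 * from here makes the order explicit instead of relying on the order
 * in which independent init hooks happen to run.
 */
static int model_opal_init(void)
{
    int rc = model_message_init();

    if (rc)
        return rc;
    return model_power_control_init();
}
```

      Calling model_power_control_init() on its own before the notifier is
      ready fails in this model, which is the situation the explicit call
      order rules out.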

      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>