1. 29 July 2015, 5 commits
    • powerpc: Change syscall_get_nr() to return int · e9fbe686
      Committed by Michael Ellerman
      The documentation for syscall_get_nr() in asm-generic says:
      
       Note this returns int even on 64-bit machines. Only 32 bits of
       system call number can be meaningful. If the actual arch value
       is 64 bits, this truncates to 32 bits so 0xffffffff means -1.
      
      However our implementation was never updated to reflect this.
      
      Generally it's not important, but there is one case where it matters.
      
      For seccomp filter with SECCOMP_RET_TRACE, the tracer will set
      regs->gpr[0] to -1 to reject the syscall. When the task is a compat
      task, this means we end up with 0xffffffff in r0 because ptrace will
      zero extend the 32-bit value.
      
      If syscall_get_nr() returns an unsigned long, then a 64-bit kernel will
      see a positive value in r0 and will incorrectly allow the syscall
      through seccomp.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      e9fbe686
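
      A quick user-space illustration of the point above (a sketch, not kernel
      code): a 32-bit tracer's -1 arrives as 0xffffffff after zero extension,
      and only an int-sized return lets seccomp see it as negative again.

        #include <stdio.h>

        static int nr_as_int(unsigned long gpr0)   { return (int)gpr0; }
        static long nr_as_long(unsigned long gpr0) { return (long)gpr0; }

        int main(void)
        {
                unsigned long r0 = 0xffffffffUL;  /* -1 written by a compat tracer */

                printf("int:  %d\n", nr_as_int(r0));   /* -1: syscall rejected */
                printf("long: %ld\n", nr_as_long(r0)); /* 4294967295: looks valid */
                return 0;
        }
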
    • powerpc: Use orig_gpr3 in syscall_get_arguments() · 1cb9839b
      Committed by Michael Ellerman
      Currently syscall_get_arguments() is used by syscall tracepoints, and
      collect_syscall() which is used in some debugging as well as
      /proc/pid/syscall.
      
      The current implementation just copies regs->gpr[3 .. 5] out, which is
      fine for all the current use cases.
      
      When we enable seccomp filter, that will also start using
      syscall_get_arguments(). However for seccomp filter we want to use r3
      as the return value of the syscall, and orig_gpr3 as the first
      parameter. This will allow seccomp to modify the return value in r3.
      
      To support this we need to modify syscall_get_arguments() to return
      orig_gpr3 instead of r3. This is safe for all uses because orig_gpr3
      always contains the r3 value that was passed to the syscall. We store it
      in the syscall entry path and never modify it.
      
      Update syscall_set_arguments() while we're here, even though it's never
      used.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      1cb9839b
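
      A minimal sketch of the resulting behaviour (simplified signature, compat
      masking omitted, and the _sketch name is only for illustration):

        #include <asm/ptrace.h>

        static inline void syscall_get_arguments_sketch(struct pt_regs *regs,
                                                        unsigned long *args)
        {
                unsigned int i;

                for (i = 0; i < 6; i++)
                        /* r3 may already hold the return value; orig_gpr3 keeps
                         * the first argument as it was at syscall entry. */
                        args[i] = (i == 0) ? regs->orig_gpr3 : regs->gpr[3 + i];
        }
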
    • powerpc: Rework syscall_get_arguments() so there is only one loop · a7657844
      Committed by Michael Ellerman
      Currently syscall_get_arguments() has two loops, one for compat and one
      for regular tasks. In preparation for the next patch, which changes which
      registers we use, switch it to only have one loop, so we only have one
      place to update.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      a7657844
    • powerpc: Don't negate error in syscall_set_return_value() · 1b1a3702
      Committed by Michael Ellerman
      Currently the only caller of syscall_set_return_value() is seccomp
      filter, which is not enabled on powerpc.
      
      This means we have not noticed that our implementation of
      syscall_set_return_value() negates error, even though the value passed
      in is already negative.
      
      So remove the negation in syscall_set_return_value(), and expect the
      caller to do it like all other implementations do.
      
      Also add a comment about the ccr handling.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      1b1a3702
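
      Roughly the shape this leaves the helper in (a sketch consistent with the
      description above, not a verbatim copy): the caller passes a negative
      errno which is stored unmodified, and the CR0 summary-overflow bit is
      managed explicitly.

        #include <linux/sched.h>
        #include <asm/ptrace.h>

        static inline void syscall_set_return_value(struct task_struct *task,
                                                    struct pt_regs *regs,
                                                    int error, long val)
        {
                if (error) {
                        regs->ccr |= 0x10000000L;    /* set CR0[SO]      */
                        regs->gpr[3] = error;        /* already negative */
                } else {
                        regs->ccr &= ~0x10000000L;   /* clear CR0[SO]    */
                        regs->gpr[3] = val;
                }
        }
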
    • powerpc: Drop unused syscall_get_error() · 2923e6d5
      Committed by Michael Ellerman
      syscall_get_error() is unused, and never has been.
      
      It's also probably wrong, as it negates r3 before returning it, but that
      depends on what the caller is expecting.
      
      It also doesn't deal with compat, and doesn't deal with TIF_NOERROR.
      
      Although we could fix those, until it has a caller and it's clear what
      semantics the caller wants, it's just untested code. So drop it.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      2923e6d5
  2. 23 July 2015, 2 commits
    • powerpc: Use hardware RNG for arch_get_random_seed_* not arch_get_random_* · 01c9348c
      Committed by Paul Mackerras
      The hardware RNG on POWER8 and POWER7+ can be relatively slow, since
      it can only supply one 64-bit value per microsecond.  Currently we
      read it in arch_get_random_long(), but that slows down reading from
      /dev/urandom since the code in random.c calls arch_get_random_long()
      for every longword read from /dev/urandom.
      
      Since the hardware RNG supplies high-quality entropy on every read, it
      matches the semantics of arch_get_random_seed_long() better than those
      of arch_get_random_long().  Therefore this commit makes the code use
      the POWER8/7+ hardware RNG only for arch_get_random_seed_{long,int}
      and not for arch_get_random_{long,int}.
      
      This won't affect any other PowerPC-based platforms because none of
      them currently support a hardware RNG.  To make it clear that the
      ppc_md function pointer is used for arch_get_random_seed_*, we rename
      it from get_random_long to get_random_seed.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      01c9348c
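
      The resulting split looks roughly like this (a sketch, return types
      simplified): only the seed variant is backed by the hardware RNG, via
      the renamed ppc_md.get_random_seed hook.

        #include <asm/machdep.h>

        static inline int arch_get_random_long(unsigned long *v)
        {
                return 0;                     /* no fast per-longword source */
        }

        static inline int arch_get_random_seed_long(unsigned long *v)
        {
                if (ppc_md.get_random_seed)   /* renamed from get_random_long */
                        return ppc_md.get_random_seed(v);

                return 0;
        }
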
    • powerpc/rtas: Introduce rtas_get_sensor_fast() for IRQ handlers · 1c2cb594
      Committed by Thomas Huth
      The EPOW interrupt handler uses rtas_get_sensor(), which in turn
      uses rtas_busy_delay() to wait for RTAS to become ready when necessary.
      But rtas_busy_delay() is annotated with might_sleep() and thus may not
      be used from interrupt handlers like the EPOW handler!
      This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
      enabled:
      
       BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
       Call Trace:
       [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
       [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
       [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
       [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
       [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
       [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
       [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
       [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
       [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
       [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
       [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
       [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
       [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180
      
      Fix this issue by introducing a new rtas_get_sensor_fast() function
      that does not use rtas_busy_delay() - and thus can only be used for
      sensors that do not cause a BUSY condition - known as "fast" sensors.
      
      The EPOW sensor is defined to be "fast" in sPAPR - mpe.
      
      Fixes: 587f83e8 ("powerpc/pseries: Use rtas_get_sensor in RAS code")
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1c2cb594
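
      A sketch of the new helper's idea (not verbatim; the error mapping is
      simplified): the same RTAS call as rtas_get_sensor(), but without the
      rtas_busy_delay() loop, so it never sleeps and a BUSY result from a
      "fast" sensor is treated as a firmware bug.

        #include <linux/kernel.h>
        #include <linux/errno.h>
        #include <asm/rtas.h>

        int rtas_get_sensor_fast(int sensor, int index, int *state)
        {
                int token = rtas_token("get-sensor-state");
                int rc;

                if (token == RTAS_UNKNOWN_SERVICE)
                        return -ENOENT;

                /* One call, no busy-wait loop: safe from interrupt context. */
                rc = rtas_call(token, 2, 2, state, sensor, index);
                WARN_ON(rc == RTAS_BUSY);   /* "fast" sensors must not be busy */

                return rc < 0 ? -EIO : rc;
        }
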
  3. 16 July 2015, 2 commits
  4. 13 July 2015, 3 commits
  5. 26 June 2015, 1 commit
  6. 25 June 2015, 6 commits
  7. 19 June 2015, 3 commits
  8. 11 June 2015, 15 commits
    • powerpc: Fix duplicate const clang warning in user access code · b91c1e3e
      Committed by Anton Blanchard
      We see a large number of duplicate-const warnings in the user access
      code when building with llvm/clang:
      
        include/linux/pagemap.h:576:8: warning: duplicate 'const' declaration specifier
            [-Wduplicate-decl-specifier]
              ret = __get_user(c, uaddr);
      
      The problem is we are doing const __typeof__(*(ptr)), which will hit the
      warning if ptr is marked const.
      
      Removing const does not seem to have any effect on GCC code generation.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b91c1e3e
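
      A user-space illustration of the warning (not the kernel macro): when the
      pointer already points to const data, adding const via __typeof__ yields
      the duplicate specifier clang complains about.

        #include <stdio.h>

        /* When *(ptr) is "const int", this declares a "const const int" and
         * clang emits -Wduplicate-decl-specifier. */
        #define GET_VALUE(ptr) ({                       \
                const __typeof__(*(ptr)) __v = *(ptr);  \
                __v;                                    \
        })

        int main(void)
        {
                const int val = 42;
                const int *p = &val;

                printf("%d\n", GET_VALUE(p));
                return 0;
        }
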
    • vfio: powerpc/spapr: Support Dynamic DMA windows · e633bc86
      Committed by Alexey Kardashevskiy
      This adds create/remove window ioctls to create and remove DMA windows.
      sPAPR defines a Dynamic DMA windows capability which allows
      para-virtualized guests to create additional DMA windows on a PCI bus.
      Existing Linux kernels use this new window to map the entire guest
      memory and switch to direct DMA operations, saving time on map/unmap
      requests which would otherwise happen in large numbers.
      
      This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
      VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
      Up to 2 windows are supported now by the hardware and by this driver.
      
      This changes the VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
      information such as the number of supported windows and the maximum number
      of levels of TCE tables.
      
      DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
      as we still want to support v2 on platforms which cannot do DDW for
      the sake of TCE acceleration in KVM (coming soon).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e633bc86
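
      A hedged user-space sketch of driving the new ioctls (struct and field
      names follow the VFIO sPAPR uAPI as described above; error handling is
      omitted and container_fd is assumed to be an open, enabled container):

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        /* Create an additional huge DMA window, then remove it again. */
        static void ddw_example(int container_fd)
        {
                struct vfio_iommu_spapr_tce_create create;
                struct vfio_iommu_spapr_tce_remove remove;

                memset(&create, 0, sizeof(create));
                create.argsz = sizeof(create);
                create.page_shift = 16;              /* 64K IOMMU pages */
                create.window_size = 1ULL << 33;     /* 8GB window */
                create.levels = 1;
                ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
                /* create.start_addr is the bus offset of the new window */

                memset(&remove, 0, sizeof(remove));
                remove.argsz = sizeof(remove);
                remove.start_addr = create.start_addr;
                ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, &remove);
        }
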
    • vfio: powerpc/spapr: Register memory and define IOMMU v2 · 2157e7b8
      Committed by Alexey Kardashevskiy
      The existing implementation accounts the whole DMA window in
      the locked_vm counter. This is going to be worse with multiple
      containers and huge DMA windows. Also, real-time accounting would require
      additional tracking of accounted pages due to the page size difference -
      IOMMU uses 4K pages and system uses 4K or 64K pages.
      
      Another issue is that actual pages pinning/unpinning happens on every
      DMA map/unmap request. This does not affect performance much now, as
      we spend far more time switching context between
      guest/userspace/host, but it will start to matter when we add in-kernel
      DMA map/unmap acceleration.
      
      This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
      New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
      2 new ioctls to register/unregister DMA memory -
      VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
      which receive user space address and size of a memory region which
      needs to be pinned/unpinned and counted in locked_vm.
      New IOMMU splits physical pages pinning and TCE table update
      into 2 different operations. It requires:
      1) guest pages to be registered first
      2) subsequent map/unmap requests to work only with pre-registered memory.
      For the default single window case this means that the entire guest
      (instead of 2GB) needs to be pinned before using VFIO.
      When a huge DMA window is added, no additional pinning will be
      required, otherwise it would be guest RAM + 2GB.
      
      The new memory registration ioctls are not supported by
      VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
      will require memory to be preregistered in order to work.
      
      The accounting is done per the user process.
      
      This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
      can do with v1 or v2 IOMMUs.
      
      In order to support memory pre-registration, we need a way to track
      the use of every registered memory region and only allow unregistration
      if a region is not in use anymore. So we need a way to tell which
      region a just-cleared TCE came from.
      
      This adds a userspace view of the TCE table into iommu_table struct.
      It contains userspace address, one per TCE entry. The table is only
      allocated when the ownership over an IOMMU group is taken which means
      it is only used from outside of the powernv code (such as VFIO).
      
      As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
      DDW API), this creates a default DMA window for IODA2 for consistency.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2157e7b8
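
      A hedged user-space sketch of the v2 flow described above (struct name
      per the VFIO sPAPR uAPI; assumes buf is a valid mapped buffer and the
      container uses the v2 IOMMU type):

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        /* Pre-register memory once; later DMA maps must fall inside it. */
        static void preregister(int container_fd, void *buf, unsigned long size)
        {
                struct vfio_iommu_spapr_register_memory reg;

                memset(&reg, 0, sizeof(reg));
                reg.argsz = sizeof(reg);
                reg.vaddr = (unsigned long)buf;
                reg.size  = size;       /* pinned and counted in locked_vm */

                ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
                /* ... VFIO_IOMMU_MAP_DMA / VFIO_IOMMU_UNMAP_DMA in this range ... */
                ioctl(container_fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, &reg);
        }
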
    • powerpc/mmu: Add userspace-to-physical addresses translation cache · 15b244a8
      Committed by Alexey Kardashevskiy
      We are adding support for DMA memory pre-registration to be used in
      conjunction with VFIO. The idea is that the userspace which is going to
      run a guest may want to pre-register a user space memory region so
      it all gets pinned once and never goes away. Having this done,
      a hypervisor will not have to pin/unpin pages on every DMA map/unmap
      request. This is going to help with multiple pinning of the same memory.
      
      Another use of it is in-kernel real mode (mmu off) acceleration of
      DMA requests where real time translation of guest physical to host
      physical addresses is non-trivial and may fail as linux ptes may be
      temporarily invalid. Also, having cached host physical addresses
      (compared to just pinning at the start and then walking the page table
      again on every H_PUT_TCE), we can be sure that the addresses which we put
      into TCE table are the ones we already pinned.
      
      This adds a list of memory regions to mm_context_t. Each region consists
      of a header and a list of physical addresses. This adds API to:
      1. register/unregister memory regions;
      2. do final cleanup (which puts all pre-registered pages);
      3. do userspace to physical address translation;
      4. manage usage counters; multiple registration of the same memory
      is allowed (once per container).
      
      This implements 2 counters per registered memory region:
      - @mapped: incremented on every DMA mapping; decremented on unmapping;
      initialized to 1 when a region is just registered; once it becomes zero,
      no more mappings are allowed;
      - @used: incremented on every "register" ioctl; decremented on
      "unregister"; unregistration is allowed for DMA mapped regions unless
      it is the very last reference. For the very last reference this checks
      that the region is still mapped and returns -EBUSY so the userspace
      gets to know that memory is still pinned and unregistration needs to
      be retried; @used remains 1.
      
      Host physical addresses are stored in vmalloc'ed array. In order to
      access these in the real mode (mmu off), there is a real_vmalloc_addr()
      helper. In-kernel acceleration patchset will move it from KVM to MMU code.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      15b244a8
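
      A small user-space illustration of the @used/@mapped rules spelled out
      above (names and return codes mirror the description, not the kernel
      code):

        #include <stdio.h>
        #include <errno.h>

        struct mem_region {
                long used;    /* one per "register" ioctl */
                long mapped;  /* starts at 1 on registration, +1 per DMA mapping */
        };

        static int region_unregister(struct mem_region *r)
        {
                /* The very last reference may only go once all mappings are gone. */
                if (r->used == 1 && r->mapped != 1)
                        return -EBUSY;          /* @used stays at 1 */
                r->used--;
                return 0;
        }

        int main(void)
        {
                struct mem_region r = { .used = 1, .mapped = 1 }; /* just registered */

                r.mapped++;                                   /* one DMA mapping */
                printf("%d\n", region_unregister(&r));        /* -EBUSY */
                r.mapped--;                                   /* unmapped again  */
                printf("%d\n", region_unregister(&r));        /* 0, unregistered */
                return 0;
        }
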
    • powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table · 00547193
      Committed by Alexey Kardashevskiy
      This adds a way for the IOMMU user to know how much a new table will
      use so it can be accounted in the locked_vm limit before allocation
      happens.
      
      This stores the allocated table size in pnv_pci_ioda2_get_table_size()
      so the locked_vm counter can be updated correctly when a table is
      being disposed.
      
      This defines an iommu_table_group_ops callback to let VFIO know
      how much memory will be locked if a table is created.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      00547193
    • vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API · 4793d65d
      Committed by Alexey Kardashevskiy
      This extends iommu_table_group_ops by a set of callbacks to support
      dynamic DMA windows management.
      
      create_table() creates a TCE table with specific parameters.
      It receives an iommu_table_group so it knows the node id and can allocate
      the TCE table memory closer to the PHB. The exact format of the allocated
      multi-level table might also be specific to the PHB model (not
      the case now though).
      This callback calculates the DMA window offset on the PCI bus from @num
      and stores it in the just-created table.
      
      set_window() sets the window at specified TVT index + @num on PHB.
      
      unset_window() unsets the window from specified TVT.
      
      This adds a free() callback to iommu_table_ops to free the memory
      (potentially a tree of tables) allocated for the TCE table.
      
      create_table() and free() are supposed to be called once per
      VFIO container and set_window()/unset_window() are supposed to be
      called for every group in a container.
      
      This adds IOMMU capabilities to iommu_table_group such as default
      32bit window parameters and others. This makes use of new values in
      vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
      advertise pagemasks to the userspace.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4793d65d
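
      A sketch of the shape of the new window management callbacks (the
      _sketch name and the parameter lists are indicative only, condensed
      from the description above):

        #include <linux/types.h>

        struct iommu_table;
        struct iommu_table_group;

        struct iommu_table_group_ops_sketch {
                long (*create_table)(struct iommu_table_group *table_group,
                                     int num,   /* window number -> bus offset */
                                     __u32 page_shift, __u64 window_size,
                                     __u32 levels, struct iommu_table **ptbl);
                long (*set_window)(struct iommu_table_group *table_group,
                                   int num, struct iommu_table *tbl);
                long (*unset_window)(struct iommu_table_group *table_group,
                                     int num);
        };

        /* free() joins iommu_table_ops: it releases the (possibly multi-level)
         * memory backing a TCE table once the VFIO container is done with it. */
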
    • powerpc/powernv: Implement multilevel TCE tables · bbb845c4
      Committed by Alexey Kardashevskiy
      TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
      on huge guests (hundreds of GB of RAM) so the kernel might be unable to
      allocate contiguous chunk of physical memory to store the TCE table.
      
      To address this, the POWER8 CPU (actually, IODA2) supports multi-level
      TCE tables with up to 5 levels, which split the table into a tree of
      smaller subtables.
      
      This adds multi-level TCE tables support to
      pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
      helpers.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bbb845c4
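
      A user-space illustration of the indirection idea (not the IODA2 code):
      a large TCE index is resolved through small per-level tables, so no
      single huge contiguous allocation is ever needed.

        #include <stdio.h>
        #include <stdlib.h>

        #define ENTRIES_PER_LEVEL 512   /* one 4K page of 8-byte entries */

        int main(void)
        {
                unsigned long **level1 = calloc(ENTRIES_PER_LEVEL, sizeof(*level1));
                unsigned long idx = 123456;                  /* TCE index to resolve */
                unsigned long i1 = idx / ENTRIES_PER_LEVEL;  /* which leaf table */
                unsigned long i2 = idx % ENTRIES_PER_LEVEL;  /* entry within it  */

                if (!level1[i1])  /* leaf tables are allocated on demand */
                        level1[i1] = calloc(ENTRIES_PER_LEVEL, sizeof(unsigned long));

                level1[i1][i2] = 0x1000;   /* store a (fake) TCE */
                printf("tce[%lu] = %#lx\n", idx, level1[i1][i2]);

                free(level1[i1]);
                free(level1);
                return 0;
        }
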
    • powerpc/iommu/powernv: Release replaced TCE · 05c6cfb9
      Committed by Alexey Kardashevskiy
      At the moment writing new TCE value to the IOMMU table fails with EBUSY
      if there is a valid entry already. However PAPR specification allows
      the guest to write new TCE value without clearing it first.
      
      Another problem this patch is addressing is the use of pool locks for
      external IOMMU users such as VFIO. The pool locks are to protect
      DMA page allocator rather than entries and since the host kernel does
      not control what pages are in use, there is no point in pool locks and
      exchange()+put_page(oldtce) is sufficient to avoid possible races.
      
      This adds an exchange() callback to iommu_table_ops which does the same
      thing as set() plus it returns replaced TCE and DMA direction so
      the caller can release the pages afterwards. The exchange() receives
      a physical address unlike set() which receives linear mapping address;
      and returns a physical address as the clear() does.
      
      This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
      for a platform to have exchange() implemented in order to support VFIO.
      
      This replaces iommu_tce_build() and iommu_clear_tce() with
      a single iommu_tce_xchg().
      
      This makes sure that TCE permission bits are not set in TCE passed to
      IOMMU API as those are to be calculated by platform code from
      DMA direction.
      
      This moves SetPageDirty() to the IOMMU code to make it work for both
      the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes
      available later).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      05c6cfb9
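
      A user-space illustration of why an exchange is enough (GCC/clang
      builtins; not the powernv code): replacing the entry atomically hands
      the old value back to the caller, which can then release the old page
      without any table-wide lock.

        #include <stdio.h>

        static unsigned long tce_table[16];

        static unsigned long tce_xchg(unsigned long index, unsigned long new_tce)
        {
                /* Returns the previous entry, like exchange() returns the
                 * replaced TCE so the caller can put_page() it. */
                return __atomic_exchange_n(&tce_table[index], new_tce,
                                           __ATOMIC_SEQ_CST);
        }

        int main(void)
        {
                unsigned long old;

                tce_xchg(3, 0x1000);         /* first mapping */
                old = tce_xchg(3, 0x2000);   /* overwrite without clearing first */
                printf("replaced TCE %#lx\n", old);
                return 0;
        }
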
    • vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control · f87a8864
      Committed by Alexey Kardashevskiy
      This adds tce_iommu_take_ownership() and tce_iommu_release_ownership(),
      which call iommu_take_ownership()/iommu_release_ownership() in a loop
      for every table in the group. As there is just one now, no change in
      behaviour is expected.
      
      At the moment the iommu_table struct has a set_bypass() which enables/
      disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
      which calls this callback when external IOMMU users such as VFIO are
      about to take over a PHB.
      
      The set_bypass() callback is not really an iommu_table function but an
      IOMMU/PE function. This introduces an iommu_table_group_ops struct and
      adds take_ownership()/release_ownership() callbacks to it which are
      called when an external user takes/releases control over the IOMMU.
      
      This replaces set_bypass() with ownership callbacks as it is not
      necessarily just bypass enabling, it can be something else or more,
      so let's give it a more generic name.
      
      The callbacks are implemented for IODA2 only. Other platforms (P5IOC2,
      IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
      The following patches will replace iommu_take_ownership/
      iommu_release_ownership calls in IODA2 with full IOMMU table release/
      create.
      
      Since we are touching the bypass control here anyway, this removes
      pnv_pci_ioda2_setup_bypass_pe() as it does not do much
      more than pnv_pci_ioda2_set_bypass(). This moves tce_bypass_base
      initialization to pnv_pci_ioda2_setup_dma_pe().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f87a8864
    • powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group · 0eaf4def
      Committed by Alexey Kardashevskiy
      So far one TCE table could only be used by one IOMMU group. However
      IODA2 hardware allows programming the same TCE table address into
      multiple PEs, allowing tables to be shared.
      
      This replaces the single pointer to a group in an iommu_table struct
      with a linked list of groups, which provides a way of invalidating
      the TCE cache for every PE when the actual TCE table is updated. This adds
      pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
      helpers to manage the list. However without VFIO, it is still going
      to be a single IOMMU group per iommu_table.
      
      This changes iommu_add_device() to add a device to the first group
      in the group list of a table, as it is only called from the platform
      init code or the PCI bus notifier, and at those points there is only
      one group per table.
      
      This does not change TCE invalidation code to loop through all
      attached groups in order to simplify this patch and because
      it is not really needed in most cases. IODA2 is fixed in a later
      patch.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      0eaf4def
    • powerpc/spapr: vfio: Replace iommu_table with iommu_table_group · b348aa65
      Committed by Alexey Kardashevskiy
      Modern IBM POWERPC systems support multiple (currently two) TCE tables
      per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
      for TCE tables. Right now just one table is supported.
      
      This defines iommu_table_group struct which stores pointers to
      iommu_group and iommu_table(s). This replaces iommu_table with
      iommu_table_group where iommu_table was used to identify a group:
      - iommu_register_group();
      - iommudata of generic iommu_group;
      
      This removes @data from iommu_table as it_table_group provides
      same access to pnv_ioda_pe.
      
      For IODA, instead of embedding iommu_table, the new iommu_table_group
      keeps pointers to those. The iommu_table structs are allocated
      dynamically.
      
      For P5IOC2, both iommu_table_group and iommu_table are embedded into
      PE struct. As there is no EEH and SRIOV support for P5IOC2,
      iommu_free_table() should not be called on iommu_table struct pointers
      so we can keep it embedded in pnv_phb::p5ioc2.
      
      For pSeries, this replaces multiple calls of kzalloc_node() with a new
      iommu_pseries_alloc_group() helper and stores the table group struct
      pointer into the pci_dn struct. For release, a iommu_table_free_group()
      helper is added.
      
      This moves iommu_table struct allocation from SR-IOV code to
      the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
      pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
      This change is here because those lines had to be changed anyway.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b348aa65
    • powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table · da004c36
      Committed by Alexey Kardashevskiy
      This adds an iommu_table_ops struct and puts a pointer to it into
      the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
      callbacks from ppc_md to the new struct where they really belong.
      
      This adds the requirement for @it_ops to be initialized before calling
      iommu_init_table() to make sure that we do not leave any IOMMU table
      with iommu_table_ops uninitialized. This is not a parameter of
      iommu_init_table() though as there will be cases when iommu_init_table()
      will not be called on TCE tables, for example - VFIO.
      
      This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
      redundant prefixes.
      
      This removes tce_xxx_rm handlers from ppc_md but does not add
      them to iommu_table_ops as this will be done later if we decide to
      support TCE hypercalls in real mode. This removes _vm callbacks as
      only virtual mode is supported by now so this also removes @rm parameter.
      
      For pSeries, this always uses tce_buildmulti_pSeriesLP/
      tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to
      tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
      present. The reason for this is we still have to support "multitce=off"
      boot parameter in disable_multitce() and we do not want to walk through
      all IOMMU tables in the system and replace "multi" callbacks with single
      ones.
      
      For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
      This makes the callbacks for them public. Later patches will extend
      callbacks for IODA1/2.
      
      No change in behaviour is expected.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      da004c36
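
      The resulting per-table ops look roughly like this (a sketch; the
      parameter lists follow the old ppc_md callbacks and are indicative only):

        #include <linux/dma-direction.h>

        struct iommu_table;

        /* Indicative sketch of the per-table callbacks, formerly ppc_md.tce_* */
        struct iommu_table_ops_sketch {
                int (*set)(struct iommu_table *tbl, long index, long npages,
                           unsigned long uaddr, enum dma_data_direction direction);
                void (*clear)(struct iommu_table *tbl, long index, long npages);
                unsigned long (*get)(struct iommu_table *tbl, long index);
                void (*flush)(struct iommu_table *tbl);
        };
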
    • powerpc/powernv: Do not set "read" flag if direction==DMA_NONE · 10b35b2b
      Committed by Alexey Kardashevskiy
      Normally a bitmap from the iommu_table is used to track what TCE entry
      is in use. Since we are going to use iommu_table without its locks and
      do xchg() instead, it becomes essential not to set bits which are not
      implied by the direction flag, as the old TCE value (more precisely,
      its permission bits) will be used to decide whether to put the page or not.
      
      This adds iommu_direction_to_tce_perm() (its counterpart is there already)
      and uses it for powernv's pnv_tce_build().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      10b35b2b
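
      The mapping from DMA direction to TCE permission bits is roughly (a
      sketch mirroring iommu_direction_to_tce_perm(); TCE_PCI_* are the
      existing powerpc TCE flag macros):

        #include <linux/dma-direction.h>
        #include <asm/tce.h>

        static unsigned long direction_to_tce_perm(enum dma_data_direction dir)
        {
                switch (dir) {
                case DMA_BIDIRECTIONAL:
                        return TCE_PCI_READ | TCE_PCI_WRITE;
                case DMA_TO_DEVICE:
                        return TCE_PCI_READ;    /* device reads from memory */
                case DMA_FROM_DEVICE:
                        return TCE_PCI_WRITE;   /* device writes to memory  */
                default:
                        return 0;               /* DMA_NONE: no bits at all */
                }
        }
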
    • vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver · 9b14a1ff
      Committed by Alexey Kardashevskiy
      This moves the page pinning (get_user_pages_fast()/put_page()) code out of
      the platform IOMMU code and puts it into the VFIO IOMMU driver where it
      belongs, as the platform code does not deal with page pinning.
      
      This makes iommu_take_ownership()/iommu_release_ownership() deal with
      the IOMMU table bitmap only.
      
      This removes page unpinning from iommu_take_ownership() as the actual
      TCE table might contain garbage and doing put_page() on it is undefined
      behaviour.
      
      Besides the last part, the rest of the patch is mechanical.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9b14a1ff
    • powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group · 4617082e
      Committed by Alexey Kardashevskiy
      The set_iommu_table_base_and_group() name suggests that the function
      sets the table base and adds a device to an IOMMU group.
      
      The actual purpose of setting the table base is to put a reference
      into the device so that iommu_add_device() can later get the IOMMU group
      reference and add the device to the group.
      
      At the moment a group cannot be explicitly passed to iommu_add_device()
      as we want it to work from the bus notifier; we can fix that later and
      remove the confusing calls to set_iommu_table_base().
      
      This replaces set_iommu_table_base_and_group() with a couple of
      set_iommu_table_base() + iommu_add_device() which makes reading the code
      easier.
      
      This adds a few comments explaining why set_iommu_table_base() and
      iommu_add_device() are called where they are.
      
      For IODA1/2, this essentially removes the iommu_add_device() call from
      pnv_pci_ioda_dma_dev_setup() as it will always fail at that particular
      place:
      - for a physical PE, the device is already attached by iommu_add_device()
      in pnv_pci_ioda_setup_dma_pe();
      - for a virtual PE, the sysfs entries are not ready to create all symlinks,
      so the actual adding happens in tce_iommu_bus_notifier.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4617082e
  9. 10 June 2015, 1 commit
    • powerpc/mm: Add trace point for tracking hash pte fault · cfcb3d80
      Committed by Aneesh Kumar K.V
      This enables us to understand how many hash faults we are taking
      when running benchmarks.
      
      For ex:
      -bash-4.2# ./perf stat -e  powerpc:hash_fault -e page-faults /tmp/ebizzy.ppc64 -S 30  -P -n 1000
      ...
      
       Performance counter stats for '/tmp/ebizzy.ppc64 -S 30 -P -n 1000':
      
             1,10,04,075      powerpc:hash_fault
             1,10,03,429      page-faults
      
            30.865978991 seconds time elapsed
      
      NOTE:
      The impact of the tracepoint was not noticeable when running the test. It was
      within the run-time variance of the test. For ex:
      
      without-patch:
      --------------
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.089 M/sec
      	  7.236562      task-clock (msec)         #    0.928 CPUs utilized
      	 2,179,213      stalled-cycles-frontend   #    0.00% frontend cycles idle
      	17,174,367      stalled-cycles-backend    #    0.00% backend  cycles idle
      		 0      context-switches          #    0.000 K/sec
      
             0.007794658 seconds time elapsed
      
      And with-patch:
      ---------------
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.089 M/sec
      	  7.233746      task-clock (msec)         #    0.921 CPUs utilized
      		 0      context-switches          #    0.000 K/sec
      
             0.007854876 seconds time elapsed
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.087 M/sec
      	       649      powerpc:hash_fault        #    0.087 M/sec
      	  7.430376      task-clock (msec)         #    0.938 CPUs utilized
      	 2,347,174      stalled-cycles-frontend   #    0.00% frontend cycles idle
      	17,524,282      stalled-cycles-backend    #    0.00% backend  cycles idle
      		 0      context-switches          #    0.000 K/sec
      
             0.007920284 seconds time elapsed
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      cfcb3d80
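
      A sketch of how such a tracepoint is typically declared (field names are
      illustrative and the usual trace-header boilerplate is omitted; not
      necessarily the exact event added by the patch):

        #include <linux/tracepoint.h>

        TRACE_EVENT(hash_fault,
                TP_PROTO(unsigned long addr, unsigned long access, unsigned long trap),
                TP_ARGS(addr, access, trap),
                TP_STRUCT__entry(
                        __field(unsigned long, addr)
                        __field(unsigned long, access)
                        __field(unsigned long, trap)
                ),
                TP_fast_assign(
                        __entry->addr = addr;
                        __entry->access = access;
                        __entry->trap = trap;
                ),
                TP_printk("hash fault at 0x%lx access=0x%lx trap=0x%lx",
                          __entry->addr, __entry->access, __entry->trap)
        );
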
  10. 08 June 2015, 1 commit
  11. 07 June 2015, 1 commit