- 20 8月, 2020 2 次提交
-
-
由 Vasant Hegde 提交于
As per PAPR we have to look for both EPOW sensor value and event modifier to identify the type of event and take appropriate action. In LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes": SYSTEM_SHUTDOWN 3 The system must be shut down. An EPOW-aware OS logs the EPOW error log information, then schedules the system to be shut down to begin after an OS defined delay internal (default is 10 minutes.) Then in section 10.3.2.2.8 there is table 146 "Platform Event Log Format, Version 6, EPOW Section", which includes the "EPOW Event Modifier": For EPOW sensor value = 3 0x01 = Normal system shutdown with no additional delay 0x02 = Loss of utility power, system is running on UPS/Battery 0x03 = Loss of system critical functions, system should be shutdown 0x04 = Ambient temperature too high All other values = reserved We have a user space tool (rtas_errd) on LPAR to monitor for EPOW_SHUTDOWN_ON_UPS. Once it gets an event it initiates shutdown after predefined time. It also starts monitoring for any new EPOW events. If it receives "Power restored" event before predefined time it will cancel the shutdown. Otherwise after predefined time it will shutdown the system. Commit 79872e35 ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of the "on UPS/Battery" case, to immediately shutdown the system. This breaks existing setups that rely on the userspace tool to delay shutdown and let the system run on the UPS. Fixes: 79872e35 ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: NVasant Hegde <hegdevasant@linux.vnet.ibm.com> [mpe: Massage change log and add PAPR references] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200820061844.306460-1-hegdevasant@linux.vnet.ibm.com
-
由 Frederic Barrat 提交于
Fix a typo introduced during recent code cleanup, which could lead to silently not freeing resources or an oops message (on PCI hotplug or CAPI reset). Only impacts ioda2, the code path for ioda1 is correct. Fixes: 01e12629 ("powerpc/powernv/pci: Add explicit tracking of the DMA setup state") Signed-off-by: NFrederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200819130741.16769-1-fbarrat@linux.ibm.com
-
- 18 8月, 2020 1 次提交
-
-
由 Michael Roth 提交于
For a power9 KVM guest with XIVE enabled, running a test loop where we hotplug 384 vcpus and then unplug them, the following traces can be seen (generally within a few loops) either from the unplugged vcpu: cpu 65 (hwid 65) Ready to die... Querying DEAD? cpu 66 (66) shows 2 list_del corruption. next->prev should be c00a000002470208, but was c00a000002470048 ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:56! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: fuse nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 ... CPU: 66 PID: 0 Comm: swapper/66 Kdump: loaded Not tainted 4.18.0-221.el8.ppc64le #1 NIP: c0000000007ab50c LR: c0000000007ab508 CTR: 00000000000003ac REGS: c0000009e5a17840 TRAP: 0700 Not tainted (4.18.0-221.el8.ppc64le) MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28000842 XER: 20040000 ... NIP __list_del_entry_valid+0xac/0x100 LR __list_del_entry_valid+0xa8/0x100 Call Trace: __list_del_entry_valid+0xa8/0x100 (unreliable) free_pcppages_bulk+0x1f8/0x940 free_unref_page+0xd0/0x100 xive_spapr_cleanup_queue+0x148/0x1b0 xive_teardown_cpu+0x1bc/0x240 pseries_mach_cpu_die+0x78/0x2f0 cpu_die+0x48/0x70 arch_cpu_idle_dead+0x20/0x40 do_idle+0x2f4/0x4c0 cpu_startup_entry+0x38/0x40 start_secondary+0x7bc/0x8f0 start_secondary_prolog+0x10/0x14 or on the worker thread handling the unplug: pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a Querying DEAD? cpu 314 (314) shows 2 BUG: Bad page state in process kworker/u768:3 pfn:95de1 cpu 314 (hwid 314) Ready to die... page:c00a000002577840 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 flags: 0x5ffffc00000000() raw: 005ffffc00000000 5deadbeef0000100 5deadbeef0000200 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount Modules linked in: kvm xt_CHECKSUM ipt_MASQUERADE xt_conntrack ... CPU: 0 PID: 548 Comm: kworker/u768:3 Kdump: loaded Not tainted 4.18.0-224.el8.bz1856588.ppc64le #1 Workqueue: pseries hotplug workque pseries_hp_work_fn Call Trace: dump_stack+0xb0/0xf4 (unreliable) bad_page+0x12c/0x1b0 free_pcppages_bulk+0x5bc/0x940 page_alloc_cpu_dead+0x118/0x120 cpuhp_invoke_callback.constprop.5+0xb8/0x760 _cpu_down+0x188/0x340 cpu_down+0x5c/0xa0 cpu_subsys_offline+0x24/0x40 device_offline+0xf0/0x130 dlpar_offline_cpu+0x1c4/0x2a0 dlpar_cpu_remove+0xb8/0x190 dlpar_cpu_remove_by_index+0x12c/0x150 dlpar_cpu+0x94/0x800 pseries_hp_work_fn+0x128/0x1e0 process_one_work+0x304/0x5d0 worker_thread+0xcc/0x7a0 kthread+0x1ac/0x1c0 ret_from_kernel_thread+0x5c/0x80 The latter trace is due to the following sequence: page_alloc_cpu_dead drain_pages drain_pages_zone free_pcppages_bulk where drain_pages() in this case is called under the assumption that the unplugged cpu is no longer executing. To ensure that is the case, and early call is made to __cpu_die()->pseries_cpu_die(), which runs a loop that waits for the cpu to reach a halted state by polling its status via query-cpu-stopped-state RTAS calls. It only polls for 25 iterations before giving up, however, and in the trace above this results in the following being printed only .1 seconds after the hotplug worker thread begins processing the unplug request: pseries-hotplug-cpu: Attempting to remove CPU <NULL>, drc index: 1000013a Querying DEAD? cpu 314 (314) shows 2 At that point the worker thread assumes the unplugged CPU is in some unknown/dead state and procedes with the cleanup, causing the race with the XIVE cleanup code executed by the unplugged CPU. Fix this by waiting indefinitely, but also making an effort to avoid spurious lockup messages by allowing for rescheduling after polling the CPU status and printing a warning if we wait for longer than 120s. Fixes: eac1e731 ("powerpc/xive: guest exploitation of the XIVE interrupt controller") Suggested-by: NMichael Ellerman <mpe@ellerman.id.au> Signed-off-by: NMichael Roth <mdroth@linux.vnet.ibm.com> Tested-by: NGreg Kurz <groug@kaod.org> Reviewed-by: NThiago Jung Bauermann <bauerman@linux.ibm.com> Reviewed-by: NGreg Kurz <groug@kaod.org> [mpe: Trim oopses in change log slightly for readability] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200811161544.10513-1-mdroth@linux.vnet.ibm.com
-
- 08 8月, 2020 1 次提交
-
-
由 Mike Rapoport 提交于
Patch series "mm: cleanup usage of <asm/pgalloc.h>" Most architectures have very similar versions of pXd_alloc_one() and pXd_free_one() for intermediate levels of page table. These patches add generic versions of these functions in <asm-generic/pgalloc.h> and enable use of the generic functions where appropriate. In addition, functions declared and defined in <asm/pgalloc.h> headers are used mostly by core mm and early mm initialization in arch and there is no actual reason to have the <asm/pgalloc.h> included all over the place. The first patch in this series removes unneeded includes of <asm/pgalloc.h> In the end it didn't work out as neatly as I hoped and moving pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require unnecessary changes to arches that have custom page table allocations, so I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local to mm/. This patch (of 8): In most cases <asm/pgalloc.h> header is required only for allocations of page table memory. Most of the .c files that include that header do not use symbols declared in <asm/pgalloc.h> and do not require that header. As for the other header files that used to include <asm/pgalloc.h>, it is possible to move that include into the .c file that actually uses symbols from <asm/pgalloc.h> and drop the include from the header file. The process was somewhat automated using sed -i -E '/[<"]asm\/pgalloc\.h/d' \ $(grep -L -w -f /tmp/xx \ $(git grep -E -l '[<"]asm/pgalloc\.h')) where /tmp/xx contains all the symbols defined in arch/*/include/asm/pgalloc.h. [rppt@linux.ibm.com: fix powerpc warning] Signed-off-by: NMike Rapoport <rppt@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NPekka Enberg <penberg@kernel.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joerg Roedel <joro@8bytes.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Joerg Roedel <jroedel@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 8月, 2020 1 次提交
-
-
由 Oliver O'Halloran 提交于
Initialising the value before using it is generally regarded as a good idea so do that. Fixes: 4c51f3e1 ("powerpc/powernv/sriov: Make single PE mode a per-BAR setting") Reported-by: NNathan Chancellor <natechancellor@gmail.com> Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200803075408.132601-1-oohall@gmail.com
-
- 31 7月, 2020 2 次提交
-
-
由 Vaibhav Jain 提交于
We add support for reporting 'fuel-gauge' NVDIMM metric via PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage life remaining of a papr-scm compatible NVDIMM. PHYP exposes this metric via the H_SCM_PERFORMANCE_STATS. The metric value is returned from the pdsm by extending the return payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new field 'dimm_fuel_gauge' to hold the metric value is introduced at the end of the payload struct and its presence is indicated by by extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID. The patch introduces a new function papr_pdsm_fuel_gauge() that is called from papr_pdsm_health(). If fetching NVDIMM performance stats is supported then 'papr_pdsm_fuel_gauge()' allocated an output buffer large enough to hold the performance stat and passes it to drc_pmem_query_stats() that issues the HCALL to PHYP. The return value of the stat is then populated in the 'struct nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct nd_papr_pdsm_health.extension_flags' Signed-off-by: NVaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200731064153.182203-3-vaibhav@linux.ibm.com
-
由 Vaibhav Jain 提交于
Update papr_scm.c to query dimm performance statistics from PHYP via H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR specific NVDIMM attribute 'perf_stats' in sysfs. The patch also provide a sysfs ABI documentation for the stats being reported and their meanings. During NVDIMM probe time in papr_scm_nvdimm_init() a special variant of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of performance statistics is supported or not. If successful then a PHYP returns a maximum possible buffer length needed to read all performance stats. This returned value is stored in a per-nvdimm attribute 'stat_buffer_len'. The layout of request buffer for reading NVDIMM performance stats from PHYP is defined in 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'. These structs are used in newly introduced drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall. The sysfs access function perf_stats_show() uses value 'stat_buffer_len' to allocate a buffer large enough to hold all possible NVDIMM performance stats and passes it to drc_pmem_query_stats() to populate. Finally statistics reported in the buffer are formatted into the sysfs access function output buffer. Signed-off-by: NVaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200731064153.182203-2-vaibhav@linux.ibm.com
-
- 30 7月, 2020 3 次提交
-
-
由 Nathan Lynch 提交于
In the unlikely event that the device tree lacks a /cpus node, find_dlpar_cpus_to_add() oddly frees the cpu_drcs buffer it has been passed before returning an error. Its only caller also frees the buffer on error. Remove the less conventional kfree() of a caller-supplied buffer from find_dlpar_cpus_to_add(). Fixes: 90edf184 ("powerpc/pseries: Add CPU dlpar add functionality") Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190919231633.1344-1-nathanl@linux.ibm.com
-
由 Nathan Lynch 提交于
When investigating issues with partition migration or resource reassignments it is helpful to have a log of which nodes and properties in the device tree have changed. Use pr_debug() so it's easy to enable these at runtime with the dynamic debug facility. Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190627053044.9238-3-nathanl@linux.ibm.com
-
由 Nathan Lynch 提交于
The pr_err() callsites in mobility.c already manually include a "mobility:" prefix, let's make it official for the benefit of messages to be added later. Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190627053044.9238-2-nathanl@linux.ibm.com
-
- 29 7月, 2020 7 次提交
-
-
由 Wei Yongjun 提交于
Gcc report warning as follows: arch/powerpc/platforms/powernv/pci-sriov.c:602:25: warning: variable 'phb' set but not used [-Wunused-but-set-variable] 602 | struct pnv_phb *phb; | ^~~ This variable is not used, so this commit removing it. Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com> Acked-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200727171112.2781-1-weiyongjun1@huawei.com
-
由 Qinglang Miao 提交于
Use for_each_child_of_node() macro instead of open coding it. Signed-off-by: NQinglang Miao <miaoqinglang@huawei.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200728022807.87815-1-miaoqinglang@huawei.com
-
由 Gustavo A. R. Silva 提交于
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-throughSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200727224201.GA10133@embeddedor
-
由 Aneesh Kumar K.V 提交于
This adds a kernel command line option that can be used to disable GTSE support. Disabling GTSE implies kernel will make hcalls to invalidate TLB entries. This was done so that we can do VM migration between configs that enable/disable GTSE support via hypervisor. To migrate a VM from a system that supports GTSE to a system that doesn't, we can boot the guest with radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB invalidates. The check for hcall availability is done in pSeries_setup_arch so that the panic message appears on the console. This should only happen on a hypervisor that doesn't force the guest to hash translation even though it can't handle the radix GTSE=0 request via CAS. With radix_hcall_invalidate=on if the hypervisor doesn't support hcall_rpt_invalidate hcall it should force the LPAR to hash translation. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Tested-by: NBharata B Rao <bharata@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200727085908.420806-1-aneesh.kumar@linux.ibm.com
-
由 Michael Ellerman 提交于
There's a comment in lite5200_sleep.S that refers to "CONFIG_BDI*". This confuses scripts/checkkconfigsymbols.py, which thinks it should be able to find CONFIG_BDI. Change the comment to refer to CONFIG_BDI_SWITCH which is presumably roughly what it was referring to. AFAICS there never has been a CONFIG_BDI. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200724131728.1643966-3-mpe@ellerman.id.au
-
由 Nicholas Piggin 提交于
KVM guests have certain restrictions and performance quirks when using doorbells. This patch moves the EPAPR KVM guest test so it can be shared with PSERIES, and uses that in doorbell setup code to apply the KVM guest quirks and improves IPI performance for two cases: - PowerVM guests may now use doorbells even if they are secure. - KVM guests no longer use doorbells if XIVE is available. There is a valid complaint that "KVM guest" is not a very reasonable thing to test for, it's preferable for the hypervisor to advertise particular behaviours to the guest so they could change if the hypervisor implementation or configuration changes. However in this case we were already assuming a KVM guest worst case, so this patch is about containing those quirks. If KVM later advertises fast doorbells, we should test for that and override the quirks. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Tested-by: NCédric Le Goater <clg@kaod.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200726035155.1424103-4-npiggin@gmail.com
-
由 Nicholas Piggin 提交于
KVM supports msgsndp in guests by trapping and emulating the instruction, so it was decided to always use XIVE for IPIs if it is available. However on PowerVM systems, msgsndp can be used and gives better performance. On large systems, high XIVE interrupt rates can have sub-linear scaling, and using msgsndp can reduce the load on the interrupt controller. So switch to using core local doorbells even if XIVE is available. This reduces performance for KVM guests with an SMT topology by about 50% for ping-pong context switching between SMT vCPUs. An option vector (or dt-cpu-ftrs) could be defined to disable msgsndp to get KVM performance back. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Tested-by: NCédric Le Goater <clg@kaod.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200726035155.1424103-3-npiggin@gmail.com
-
- 26 7月, 2020 23 次提交
-
-
由 Randy Dunlap 提交于
Drop the repeated word "for". Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200726003809.20454-10-rdunlap@infradead.org
-
由 Wei Yongjun 提交于
The sparse tool complains as follows: arch/powerpc/platforms/pseries/papr_scm.c:97:1: warning: symbol 'papr_nd_regions' was not declared. Should it be static? arch/powerpc/platforms/pseries/papr_scm.c:98:1: warning: symbol 'papr_ndr_lock' was not declared. Should it be static? Those variables are not used outside of papr_scm.c, so this commit marks them static. Fixes: 85343a8d ("powerpc/papr/scm: Add bad memory ranges to nvdimm bad ranges") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200725091949.75234-1-weiyongjun1@huawei.com
-
由 Nicholas Piggin 提交于
This implements the generic paravirt qspinlocks using H_PROD and H_CONFER to kick and wait. This uses an un-directed yield to any CPU rather than the directed yield to a pre-empted lock holder that paravirtualised simple spinlocks use, that requires no kick hcall. This is something that could be investigated and improved in future. Performance results can be found in the commit which added queued spinlocks. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NWaiman Long <longman@redhat.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200724131423.1362108-5-npiggin@gmail.com
-
由 Oliver O'Halloran 提交于
Previously iov->vfs_expanded was used for two purposes. 1) To work out how much we need to multiple the per-VF BAR size to figure out the total space required for the IOV BAR. 2) To indicate that IOV is not usable with this device (vfs_expanded == 0). We don't really need the field for either since the multiple in 1) is always the number PEs supported by the PHB. Similarly, we don't really need it in 2) either since the IOV data field will be NULL if we can't use IOV with the device. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-16-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Using single PE BARs to map an SR-IOV BAR is really a choice about what strategy to use when mapping a BAR. It doesn't make much sense for this to be a global setting since a device might have one large BAR which needs to be mapped with single PE windows and another smaller BAR that can be mapped with a regular segmented window. Make the segmented vs single decision a per-BAR setting and clean up the logic that decides which mode to use. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-15-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Split up the logic so that we have one branch that handles setting up a segmented window and another that handles setting up single PE windows for each VF. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-14-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
I want to refactor the loop this code is currently inside of. Hoist it on out. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-13-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Remove the IODA2 PHB checks. We already assume IODA2 in several places so there's not much point in wrapping most of the setup and teardown process in an if block. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-12-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Currently the iov->pe_num_map[] does one of two things depending on whether single PE mode is being used or not. When it is, this contains an array which maps a vf_index to the corresponding PE number. When single PE mode is not being used this contains a scalar which is the base PE for the set of enabled VFs (for for VFn is base + n). The array was necessary because when calling pnv_ioda_alloc_pe() there is no guarantee that the allocated PEs would be contigious. We can now allocate contigious blocks of PEs so this is no longer an issue. This allows us to drop the if (single_mode) {} .. else {} block scattered through the SR-IOV code which is a nice clean up. This also fixes a bug in pnv_pci_sriov_disable() which is the non-atomic bitmap_clear() to manipulate the PE allocation map. Other users of the map assume it will be accessed with atomic ops. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-11-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Rework the PE allocation logic to allow allocating blocks of PEs rather than individually. We'll use this to allocate contigious blocks of PEs for the SR-IOVs. This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to use the existing, but unused, phb->pe_alloc_mutex. Currently these functions use atomic bit ops to release a currently allocated PE number. However, the pnv_ioda_alloc_pe() wants to have exclusive access to the bit map while scanning for hole large enough to accomodate the allocation size. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-10-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
The sequence required to use the single PE BAR mode is kinda janky and requires a little explanation. The API was designed with P7-IOC style windows where the setup process is something like: 1. Configure the window start / end address 2. Enable the window 3. Map the segments of each window to the PE For Single PE BARs the process is: 1. Set the PE for segment zero on a disabled window 2. Set the range 3. Enable the window Move the OPAL calls into their own helper functions where the quirks can be contained. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-9-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
No need for the multi-dimensional arrays, just use a bitmap. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-8-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
This prevents SR-IOV being used by making the SR-IOV BAR resources unallocatable. Rename it to reflect what it actually does. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-7-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea how anyone is supposed to know how it works except through a lot of stuffering. Write up some docs about the overall story to help out the next sucker^Wperson who needs to tinker with it. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-6-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
pci-ioda.c is getting a bit unwieldly due to the amount of stuff jammed in there. The SR-IOV support can be extracted easily enough and is mostly standalone, so move it into a separate file. This patch also moves the PowerNV SR-IOV specific fields from pci_dn and moves them into a platform specific structure. I'm not sure how they ended up in there in the first place, but leaking platform specifics into common code has proven to be a terrible idea so far so lets stop doing that. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-5-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
We pre-configure the m64 window for IODA1 as a 1-1 segment-PE mapping, similar to PHB3. Currently the actual mapping of segments occurs in pnv_ioda_pick_m64_pe(), but we can move it into pnv_ioda1_init_m64() and drop the IODA1 specific code paths in the PE setup / teardown. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-4-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
There's an optimisation in the PE setup which skips performing DMA setup for a PE if we only have bridges in a PE. The assumption being that only "real" devices will DMA to system memory, which is probably fair. However, if we start off with only bridge devices in a PE then add a non-bridge device the new device won't be able to use DMA because we never configured it. Fix this (admittedly pretty weird) edge case by tracking whether we've done the DMA setup for the PE or not. If a non-bridge device is added to the PE (via rescan or hotplug, or whatever) we can set up DMA on demand. This also means the only remaining user of the old "DMA Weight" code is the IODA1 DMA setup code that it was originally added for, which is good. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-3-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Currently we have these two functions: pnv_pci_ioda2_release_dma_pe(), and pnv_pci_ioda2_release_pe_dma() The first is used when tearing down VF PEs and the other is used for normal devices. There's very little difference between the two though. The latter (non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the two together. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-2-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Add a helper to go from a pci_bus structure to the pnv_phb that hosts that bus. There's a lot of instances of the following pattern: struct pci_controller *hose = pci_bus_to_host(pdev->bus); struct pnv_phb *phb = hose->private_data; Without any other uses of the pci_controller inside the function. This is hard to read since it requires you to memorise the contents of the private data fields and kind of error prone since it involves blindly assigning a void pointer. Add a helper to make it more concise and explicit. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-1-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
The EEH core has a concept of a "PE tree" to support PowerNV. The PE tree follows the PCI bus structures because a reset asserted on an upstream bridge will be propagated to the downstream bridges. On pseries there's a 1-1 correspondence between what the guest sees are a PHB and a PE so the "tree" is really just a single node. Current the EEH core is reponsible for setting up this PE tree which it does by traversing the pci_dn tree. The structure of the pci_dn tree matches the bus tree on PowerNV which leads to the PE tree being "correct" this setup method doesn't make a whole lot of sense and it's actively confusing for the pseries case where it doesn't really do anything. We want to remove the dependence on pci_dn anyway so this patch move choosing where to insert a new PE into the platform code rather than being part of the generic EEH code. For PowerNV this simplifies the tree building logic and removes the use of pci_dn. For pseries we keep the existing logic. I'm not really convinced it does anything due to the 1-1 PE-to-PHB correspondence so every device under that PHB should be in the same PE, but I'd rather not remove it entirely until we've had a chance to look at it more deeply. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200725081231.39076-14-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
The naming of eeh_{add_to|remove_from}_parent_pe() doesn't really reflect what they actually do. If the PE referred to be edev->pe_config_addr already exists under that PHB then the edev is added to that PE. However, if the PE doesn't exist the a new one is created for the edev. The bulk of the implementation of eeh_add_to_parent_pe() covers that second case. Similarly, most of eeh_remove_from_parent_pe() is determining when it's safe to delete a PE. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200725081231.39076-12-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
The edev->class_code field is never referenced anywhere except for the platform specific probe functions. The same information is available in the pci_dev for PowerNV and in the pci_dn on pseries so we can remove the field. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200725081231.39076-11-oohall@gmail.com
-
由 Oliver O'Halloran 提交于
Mechanical conversion of the eeh_ops interfaces to use eeh_dev to reference a specific device rather than pci_dn. No functional changes. Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200725081231.39076-9-oohall@gmail.com
-