- 08 3月, 2013 22 次提交
-
-
由 Daniel Pieczko 提交于
Allocating 2 buffers per page is insanely inefficient when MTU is 1500 and PAGE_SIZE is 64K (as it usually is on POWER). Allocate as many as we can fit, and choose the refill batch size at run-time so that we still always use a whole page at once. [bwh: Fix loop condition to allow for compound pages; rebase] Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
This condition is brittle and we have lots of flags to spare. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Daniel Pieczko 提交于
On POWER systems, DMA mapping/unmapping operations are very expensive. These changes reduce these costs by trying to reuse DMA mapped pages. After all the buffers associated with a page have been processed and passed up, the page is placed into a ring (if there is room). For each page that is required for a refill operation, a page in the ring is examined to determine if its page count has fallen to 1, ie. the kernel has released its reference to these packets. If this is the case, the page can be immediately added back into the RX descriptor ring, without having to re-map it for DMA. If the kernel is still holding a reference to this page, it is removed from the ring and unmapped for DMA. Then a new page, which can immediately be used by RX buffers in the descriptor ring, is allocated and DMA mapped. The time a page needs to spend in the recycle ring before the kernel has released its page references is based on the number of buffers that use this page. As large pages can hold more RX buffers, the RX recycle ring can be shorter. This reduces memory usage on POWER systems, while maintaining the performance gain achieved by recycling pages, following the driver change to pack more than two RX buffers into large pages. When an IOMMU is not present, the recycle ring can be small to reduce memory usage, since DMA mapping operations are inexpensive. With a small recycle ring, attempting to refill the descriptor queue with more buffers than the equivalent size of the recycle ring could ultimately lead to memory leaks if page entries in the recycle ring were overwritten. To prevent this, the check to see if the recycle ring is full is changed to check if the next entry to be written is NULL. [bwh: Combine and rebase several commits so this is complete before the following buffer-packing changes. Remove module parameter.] Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Enable RX DMA scattering iff an RX buffer large enough for the current MTU will not fit into a single page and the NIC supports DMA scattering for kernel-mode RX queues. On Falcon and Siena, the RX_USR_BUF_SIZE field is used as the DMA limit for both all RX queues with scatter enabled. Set it to 1824, matching what Onload uses now. Maintain a statistic for frames truncated due to lack of descriptors (rx_nodesc_trunc). This is distinct from rx_frm_trunc which may be incremented when scattering is disabled and implies an over-length frame. Whenever an MTU change causes scattering to be turned on or off, update filters that point to the PF queues, but leave others unchanged, as VF drivers assume scattering is off. Add n_frags parameters to various functions, and make them iterate: - efx_rx_packet() - efx_recycle_rx_buffers() - efx_rx_mk_skb() - efx_rx_deliver() Make efx_handle_rx_event() responsible for updating efx_rx_queue::removed_count. Change the RX pipeline state to a starting ring index and number of fragments, and make __efx_rx_packet() responsible for clearing it. Based on earlier versions by David Riddoch and Jon Cooper. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Adjust rx_buf->page_offset when we eat the RX hash prefix. Remove efx_rx_buf_offset(), which is now redundant. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Currently we prefetch from the Ethernet header, but we will also read the hash prefix. In practice they should be in the same cache line and this won't hurt, but it is still pointless to add on the hash prefix size. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
efx_rx_buf_va() returns the virtual address of the current start of the buffer. The callers must add the hash prefix size themselves. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
The pipeline mechanism will need to change a bit for scattered packets. Add a wrapper to insulate efx_process_channel() from this. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Replace efx_nic::rx_buffer_len with efx_nic::rx_dma_len, the maximum RX DMA length. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Alexandre Rames 提交于
The Linux side of EEH is triggered by MMIO reads, but this driver's data path does not issue any MMIO reads (except in legacy interrupt mode). Therefore add a monitor function to poll EEH periodically. When preparing to reset the device based on our own error detection, also poll EEH and defer to its recovery mechanism if appropriate. [bwh: Use a separate condition for the initial link poll; fix some style errors] Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
On Siena, VFs share RSS configuration with the PF. We attempted to support configurations where the PF only uses 1 RX queue and VFs use multiple RX queues, by (1) setting up RSS for the number of RX queues per VF (2) disabling RSS in the PF's RX default filters. Unfortunately commit cd2d5b52 ('sfc: Add SR-IOV back-end support for SFC9000 family') only included (1). This is (2). Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
efx_filter_insert_filter() uses the first table entry in the hash chain that either has the same match values or is empty. This means that replacement doesn't always work correctly: 1. Insert filter F1 with match values M1, hashing to H1, at first possible entry E1. 2. Insert filter F2 with match values M2, hashing to H1, at second possible entry E2. 3. Remove filter F1. 4. Insert filter F3 with match values M2, hashing to H1, at first possible entry E1. F3 should have either replaced F2 or been rejected (depending on priority and the replace_equal parameter). Instead, search for both a matching filter that the inserted filter would replace, and an available insertion point, up to the applicable maximum search depths. If we insert at lower depth than a replaced filter, clear the replaced filter. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
efx_filter_search() is only called from efx_filter_insert(), and neither function is very long. The following bug fix requires a more sophisticated search with a third result, which is going to be easier to implement as part of the same function. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
These functions happen to work for default MAC filters: they generate an initial index of 1/0 for unicast/multicast respectively and an increment of 1 for either, so a search succeeds at depth 2. But this is a matter of luck rather than design, and it really won't work well with the bug fix we're about to do. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
The 'for_insert' parameter is redundant since there are no longer any other operations that need to search based on a filter spec. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
The 'replace' flag to efx_filter_insert_filter() controls whether the new filter may replace *any* filter, and is checked even before priority comparison. But lower-priority filters should never block insertion of higher-priority filters. Change the priority checking so that lower-priority filters are replaced regardless of the value of the flag, and rename the flag to 'replace_equal'. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Alexandre Rames 提交于
[bwh: Remove more dead code, and make efx_ptp_rx() pull the data it needs into the header area.] Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Laurence Evans 提交于
Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Laurence Evans 提交于
There is a long-standing problem with the packet-timestamp matching in the driver. When a PTP packet is received by the MC, the FPGA timestamps the packet and the MC sends the timestamp and 6 bytes of the UUID to the driver. The driver then matches the timestamp against received packets using the same 6 bytes of UUID. The problem comes from the choice of which 6 bytes to use. The PTP spec is slightly contradictory and misleading in one of the two places where the UUIDs are discussed. From section 7.2.2.2 of the spec, a PTPD2 UUID can be either a EUI-64 or a EUI-64 constructed from a EUI-48. The typical ethernet based implementation uses a EUI-64 constructed from a EUI-48. This works by taking the first 3 bytes of the MAC address of the NIC being used for PTP (the OUI), then inserting 0xFF, 0xFE, then taking the last 3 bytes of the MAC address giving MAC[0], MAC[1], MAC[2], 0xFF, 0xFE, MAC[3], MAC[4], MAC[5] The current MC firmware and driver discard the first two bytes of this UUID and packets are matched against timestamps using bytes 2 to 7 so there is a small risk that in a deployment of Solarflare PTP NICs used with other vendors NICs, that a PTP packet could be matched against the wrong timestamp. This applies to all other organisations whose third byte of the OUI is 0x53. It's a long list but I notice that it includes Cisco. The necessary modifications to use bytes 0-2 and 5-7 of the UUID to match against are quite small but introduce incompatibility between older version of the firmware and driver. When PTP is enabled via SO_TIMESTAMPING specifying PTP V2, the driver will try to enable PTP in the firmware using the enhanced mode (above). If the firmware returns an error, the driver will enable PTP in the firmware using the old mode. [bwh: Fix some style errors; remove private ioctl bits] Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
Instead of having efx_ptp_rx() call netif_receive_skb() for an invalid PTP packet, make it return false for rejected packets and have efx_rx_deliver() pass them up. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
- 07 3月, 2013 5 次提交
-
-
由 Florian Fainelli 提交于
We are currently busy waiting for MDIO registers to complete their operation but we did not propagate the result back to the caller. Update r6040_phy_{read,write} to report the busy waiting result accordingly. Signed-off-by: NFlorian Fainelli <florian@openwrt.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
As suggested by Eric Dumazet, allow user to select mode which chooses TX port randomly. Functionality should be more of less similar to round-robin mode with even lower overhead. Signed-off-by: NJiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
No need to duplicate code for this. Signed-off-by: NJiri Pirko <jiri@resnulli.us> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ben Hutchings 提交于
RX DMA buffers start at an offset of EFX_PAGE_IP_ALIGN bytes from the start of a cache line. This offset obviously needs to be included in the virtual address, but this was missed in commit b590ace0 ('sfc: Fix efx_rx_buf_offset() in the presence of swiotlb') since EFX_PAGE_IP_ALIGN is equal to 0 on both x86 and powerpc. Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
由 Ben Hutchings 提交于
efx_device_detach_sync() locks all TX queues before marking the device detached and thus disabling further TX scheduling. But it can still be interrupted by TX completions which then result in TX scheduling in soft interrupt context. This will deadlock when it tries to acquire a TX queue lock that efx_device_detach_sync() already acquired. To avoid deadlock, we must use netif_tx_{,un}lock_bh(). Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
-
- 06 3月, 2013 4 次提交
-
-
由 Eric Dumazet 提交于
BQL (Byte Queue Limits) proper operation needs TX completion being serviced in a timely fashion. bnx2x uses a non standard NAPI poll weight, and thats not fair to other napi poll handlers, and even not reasonable. Use the default value instead. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jingoo Han 提交于
This patch uses module_platform_driver_probe() macro which makes the code smaller and simpler. Signed-off-by: NJingoo Han <jg1.han@samsung.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jingoo Han 提交于
This patch uses module_platform_driver_probe() macro which makes the code smaller and simpler. Signed-off-by: NJingoo Han <jg1.han@samsung.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jingoo Han 提交于
This patch uses module_platform_driver_probe() macro which makes the code smaller and simpler. Signed-off-by: NJingoo Han <jg1.han@samsung.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 3月, 2013 4 次提交
-
-
由 Chen Gang 提交于
when strlen pi->location_code is larger than HVCS_CLC_LENGTH + 1, original implementation can not let hvcsd->p_location_code NUL terminated. so need fix it (also can simplify the code) Signed-off-by: NChen Gang <gang.chen@asianux.com> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
由 Rusty Russell 提交于
virtio_rng feeds the randomness buffer handed by the core directly into the scatterlist, since commit bb347d98. However, if CONFIG_HW_RANDOM=m, the static buffer isn't a linear address (at least on most archs). We could fix this in virtio_rng, but it's actually far easier to just do it in the core as virtio_rng would have to allocate a buffer every time (it doesn't know how much the core will want to read). Reported-by: NAurelien Jarno <aurelien@aurel32.net> Tested-by: NAurelien Jarno <aurelien@aurel32.net> Signed-off-by: NRusty Russell <rusty@rustcorp.com.au> Cc: stable@kernel.org
-
由 Frank Li 提交于
build error cause by Commit ff43da86 ("NET: FEC: dynamtic check DMA desc buff type") drivers/net/ethernet/freescale/fec.c: In function ‘fec_enet_get_nextdesc’: drivers/net/ethernet/freescale/fec.c:215:18: error: invalid use of undefined type ‘struct bufdesc_ex’ drivers/net/ethernet/freescale/fec.c: In function ‘fec_enet_get_prevdesc’: drivers/net/ethernet/freescale/fec.c:224:18: error: invalid use of undefined type ‘struct bufdesc_ex’ drivers/net/ethernet/freescale/fec.c: In function ‘fec_enet_start_xmit’: drivers/net/ethernet/freescale/fec.c:286:37: error: arithmetic on pointer to an incomplete type drivers/net/ethernet/freescale/fec.c:287:13: error: arithmetic on pointer to an incomplete type drivers/net/ethernet/freescale/fec.c:324:7: error: dereferencing pointer to incomplete type etc.... Signed-off-by: NFrank Li <Frank.Li@freescale.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Frank Li 提交于
up stack ndo_start_xmit already hold lock. fec_enet_start_xmit needn't spin lock. stat_xmit just update fep->cur_tx fec_enet_tx just update fep->dirty_tx Reserve a empty bdb to check full or empty cur_tx == dirty_tx means full cur_tx == dirty_tx +1 means empty So needn't is_full variable. Fix spin lock deadlock ================================= [ INFO: inconsistent lock state ] 3.8.0-rc5+ #107 Not tainted --------------------------------- inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage. ptp4l/615 [HC1[1]:SC0[0]:HE0:SE1] takes: (&(&list->lock)->rlock#3){?.-...}, at: [<8042c3c4>] skb_queue_tail+0x20/0x50 {HARDIRQ-ON-W} state was registered at: [<80067250>] mark_lock+0x154/0x4e8 [<800676f4>] mark_irqflags+0x110/0x1a4 [<80069208>] __lock_acquire+0x494/0x9c0 [<80069ce8>] lock_acquire+0x90/0xa4 [<80527ad0>] _raw_spin_lock_bh+0x44/0x54 [<804877e0>] first_packet_length+0x38/0x1f0 [<804879e4>] udp_poll+0x4c/0x5c [<804231f8>] sock_poll+0x24/0x28 [<800d27f0>] do_poll.isra.10+0x120/0x254 [<800d36e4>] do_sys_poll+0x15c/0x1e8 [<800d3828>] sys_poll+0x60/0xc8 [<8000e780>] ret_fast_syscall+0x0/0x3c *** DEADLOCK *** 1 lock held by ptp4l/615: #0: (&(&fep->hw_lock)->rlock){-.-...}, at: [<80355f9c>] fec_enet_tx+0x24/0x268 stack backtrace: Backtrace: [<800121e0>] (dump_backtrace+0x0/0x10c) from [<80516210>] (dump_stack+0x18/0x1c) r6:8063b1fc r5:bf38b2f8 r4:bf38b000 r3:bf38b000 [<805161f8>] (dump_stack+0x0/0x1c) from [<805189d0>] (print_usage_bug.part.34+0x164/0x1a4) [<8051886c>] (print_usage_bug.part.34+0x0/0x1a4) from [<80518a88>] (print_usage_bug+0x78/0x88) r8:80065664 r7:bf38b2f8 r6:00000002 r5:00000000 r4:bf38b000 [<80518a10>] (print_usage_bug+0x0/0x88) from [<80518b58>] (mark_lock_irq+0xc0/0x270) r7:bf38b000 r6:00000002 r5:bf38b2f8 r4:00000000 [<80518a98>] (mark_lock_irq+0x0/0x270) from [<80067270>] (mark_lock+0x174/0x4e8) [<800670fc>] (mark_lock+0x0/0x4e8) from [<80067744>] (mark_irqflags+0x160/0x1a4) [<800675e4>] (mark_irqflags+0x0/0x1a4) from [<80069208>] (__lock_acquire+0x494/0x9c0) r5:00000002 r4:bf38b2f8 [<80068d74>] (__lock_acquire+0x0/0x9c0) from [<80069ce8>] (lock_acquire+0x90/0xa4) [<80069c58>] (lock_acquire+0x0/0xa4) from [<805278d8>] (_raw_spin_lock_irqsave+0x4c/0x60) [<8052788c>] (_raw_spin_lock_irqsave+0x0/0x60) from [<8042c3c4>] (skb_queue_tail+0x20/0x50) r6:bfbb2180 r5:bf1d0190 r4:bf1d0184 [<8042c3a4>] (skb_queue_tail+0x0/0x50) from [<8042c4cc>] (sock_queue_err_skb+0xd8/0x188) r6:00000056 r5:bfbb2180 r4:bf1d0000 r3:00000000 [<8042c3f4>] (sock_queue_err_skb+0x0/0x188) from [<8042d15c>] (skb_tstamp_tx+0x70/0xa0) r6:bf0dddb0 r5:bf1d0000 r4:bfbb2180 r3:00000004 [<8042d0ec>] (skb_tstamp_tx+0x0/0xa0) from [<803561d0>] (fec_enet_tx+0x258/0x268) r6:c089d260 r5:00001c00 r4:bfbd0000 [<80355f78>] (fec_enet_tx+0x0/0x268) from [<803562cc>] (fec_enet_interrupt+0xec/0xf8) [<803561e0>] (fec_enet_interrupt+0x0/0xf8) from [<8007d5b0>] (handle_irq_event_percpu+0x54/0x1a0) [<8007d55c>] (handle_irq_event_percpu+0x0/0x1a0) from [<8007d740>] (handle_irq_event+0x44/0x64) [<8007d6fc>] (handle_irq_event+0x0/0x64) from [<80080690>] (handle_fasteoi_irq+0xc4/0x15c) r6:bf0dc000 r5:bf811290 r4:bf811240 r3:00000000 [<800805cc>] (handle_fasteoi_irq+0x0/0x15c) from [<8007ceec>] (generic_handle_irq+0x28/0x38) r5:807130c8 r4:00000096 [<8007cec4>] (generic_handle_irq+0x0/0x38) from [<8000f16c>] (handle_IRQ+0x54/0xb4) r4:8071d280 r3:00000180 [<8000f118>] (handle_IRQ+0x0/0xb4) from [<80008544>] (gic_handle_irq+0x30/0x64) r8:8000e924 r7:f4000100 r6:bf0ddef8 r5:8071c974 r4:f400010c r3:00000000 [<80008514>] (gic_handle_irq+0x0/0x64) from [<8000e2e4>] (__irq_svc+0x44/0x5c) Exception stack(0xbf0ddef8 to 0xbf0ddf40) Signed-off-by: NFrank Li <Frank.Li@freescale.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 3月, 2013 5 次提交
-
-
由 Freddy Xin 提交于
This is a resubmission. Added kfree() in ax88179_get_eeprom to prevent memory leakage. Modified "__le16 rxctl" to "u16 rxctl" in "struct ax88179_data" and removed pointless casts. Removed asix_init and asix_exit functions and added "module_usb_driver(ax88179_178a_driver)". Fixed endianness issue on big endian systems and verified this driver on iBook G4. Removed steps that change net->features in ax88179_set_features function. Added "const" to ethtool_ops structure and fixed the coding style of AX88179_BULKIN_SIZE array. Fixed the issue that the default MTU is not 1500. Added ax88179_change_mtu function and enabled the hardware jumbo frame function to support an MTU higher than 1500. Fixed indentation and empty line coding style errors. The _nopm version usb functions were added to access register in suspend and resume functions. Serveral variables allocted dynamically were removed and replaced by stack variables. ax88179_get_eeprom were modified from asix_get_eeprom in asix_common. This patch adds a driver for ASIX's AX88179 family of USB 3.0/2.0 to gigabit ethernet adapters. It's based on the AX88xxx driver but the usb commands used to access registers for AX88179 are completely different. This driver had been verified on x86 system with AX88179/AX88178A and Sitcomm LN-032 USB dongles. Signed-off-by: NFreddy Xin <freddy@asix.com.tw> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 James Hogan 提交于
Meta core internal interrupts (from HWSTATMETA and friends) are vectored onto the TR1 core trigger for the current thread. This is demultiplexed in irq-metag.c to individual Linux IRQs for each internal interrupt. External SoC interrupts (from HWSTATEXT and friends) are vectored onto the TR2 core trigger for the current thread. This is demultiplexed in irq-metag-ext.c to individual Linux IRQs for each external SoC interrupt. The external irqchip has devicetree bindings for configuring the number of irq banks and the type of masking available. Signed-off-by: NJames Hogan <james.hogan@imgtec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Rob Landley <rob@landley.net> Cc: Dom Cobley <popcornmix@gmail.com> Cc: Simon Arlott <simon@fire.lp0.eu> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Maxime Ripard <maxime.ripard@free-electrons.com> Cc: devicetree-discuss@lists.ozlabs.org Cc: linux-doc@vger.kernel.org
-
由 James Hogan 提交于
Add time keeping code for metag. Meta hardware threads have 2 timers. The background timer (TXTIMER) is used as a free-running time base, and the interrupt timer (TXTIMERI) is used for the timer interrupt. Both counters traditionally count at approximately 1MHz. Signed-off-by: NJames Hogan <james.hogan@imgtec.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de>
-
由 Helge Deller 提交于
additionally comment out unused code (which may be used later) Signed-off-by: NHelge Deller <deller@gmx.de>
-
由 Yinghai Lu 提交于
Tim found: WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80() Hardware name: S2600CP sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. smpboot: Booting Node 1, Processors #1 Modules linked in: Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1 Call Trace: set_cpu_sibling_map+0x279/0x449 start_secondary+0x11d/0x1e5 Don Morris reproduced on a HP z620 workstation, and bisected it to commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is ready") It turns out movable_map has some problems, and it breaks several things 1. numa_init is called several times, NOT just for srat. so those nodes_clear(numa_nodes_parsed) memset(&numa_meminfo, 0, sizeof(numa_meminfo)) can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy. and make fall back path working. 2. simply split acpi_numa_init to early_parse_srat. a. that early_parse_srat is NOT called for ia64, so you break ia64. b. for (i = 0; i < MAX_LOCAL_APIC; i++) set_apicid_to_node(i, NUMA_NO_NODE) still left in numa_init. So it will just clear result from early_parse_srat. it should be moved before that.... c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved early before override from INITRD is settled. 3. that patch TITLE is total misleading, there is NO x86 in the title, but it changes critical x86 code. It caused x86 guys did not pay attention to find the problem early. Those patches really should be routed via tip/x86/mm. 4. after that commit, following range can not use movable ram: a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed? b. initrd... it will be freed after booting, so it could be on movable... c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G anymore. d. init_mem_mapping: can not put page table high anymore. e. initmem_init: vmemmap can not be high local node anymore. That is not good. If node is hotplugable, the mem related range like page table and vmemmap could be on the that node without problem and should be on that node. We have workaround patch that could fix some problems, but some can not be fixed. So just remove that offending commit and related ones including: f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().") 01a178a9 ("acpi, memory-hotplug: support getting hotplug info from SRAT") 27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to the end of node") e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is ready") fb06bc8e ("page_alloc: bootmem limit with movablecore_map") 42f47e27 ("page_alloc: make movablemem_map have higher priority") 6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes") 34b71f1e ("page_alloc: add movable_memmap kernel parameter") 4d59a751 ("x86: get pg_data_t's memory from other node") Later we should have patches that will make sure kernel put page table and vmemmap on local node ram instead of push them down to node0. Also need to find way to put other kernel used ram to local node ram. Reported-by: NTim Gardner <tim.gardner@canonical.com> Reported-by: NDon Morris <don.morris@hp.com> Bisected-by: NDon Morris <don.morris@hp.com> Tested-by: NDon Morris <don.morris@hp.com> Signed-off-by: NYinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-