- 17 12月, 2014 3 次提交
-
-
由 Arik Nemtsov 提交于
If a device has self-managed regulatory, insist on returning the wiphy specific regdomain if a wiphy-idx is specified. The global regdomain is meaningless for such devices. Also add an attribute for self-managed devices, so usermode can distinguish them as such. Signed-off-by: NArik Nemtsov <arikx.nemtsov@intel.com> Reviewed-by: NLuis R. Rodriguez <mcgrof@suse.com> Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Jonathan Doron 提交于
Add a new regulatory flag that allows a driver to manage regdomain changes/updates for its own wiphy. A self-managed wiphys only employs regulatory information obtained from the FW and driver and does not use other cfg80211 sources like beacon-hints, country-code IEs and hints from other devices on the same system. Conversely, a self-managed wiphy does not share its regulatory hints with other devices in the system. If a system contains several devices, one or more of which are self-managed, there might be contradictory regulatory settings between them. Usage of flag is generally discouraged. Only use it if the FW/driver is incompatible with non-locally originated hints. A new API lets the driver send a complete regdomain, to be applied on its wiphy only. After a wiphy-specific regdomain change takes place, usermode will get a new type of change notification. The regulatory core also takes care enforce regulatory restrictions, in case some interfaces are on forbidden channels. Signed-off-by: NJonathan Doron <jonathanx.doron@intel.com> Signed-off-by: NArik Nemtsov <arikx.nemtsov@intel.com> Reviewed-by: NLuis R. Rodriguez <mcgrof@suse.com> Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Arik Nemtsov 提交于
If a wiphy-idx is specified, the kernel will return the wiphy specific regdomain, if such exists. Otherwise return the global regdom. When no wiphy-idx is specified, return the global regdomain as well as all wiphy-specific regulatory domains in the system, via a new nested list of attributes. Add a new attribute for each wiphy-specific regdomain, for usermode to identify it as such. Signed-off-by: NArik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
- 15 12月, 2014 1 次提交
-
-
由 Johannes Berg 提交于
In order to let drivers have more dynamic U-APSD support, move the enablement flag to the virtual interface driver flags. This lets drivers not only set it up differently for different interfaces, but also enable/disable on the fly if needed. Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
- 12 12月, 2014 6 次提交
-
-
由 Sujith Manoharan 提交于
Since multicast frames are marked as no-ack, using IEEE80211_TX_STAT_ACK to check if they have been successfully transmitted by the driver is incorrect since a driver can choose to ignore transmission status for no-ack frames. This results in incorrect accounting for such frames. To fix this issue, this patch introduces a new flag that can be used by drivers to indicate error-free transmission of no-ack frames. Signed-off-by: NSujith Manoharan <c_manoha@qca.qualcomm.com> [add a note about not setting the flag for non-no-ack frames] Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Sujith Manoharan 提交于
Move IEEE80211_TX_CTL_PS_RESPONSE to info->control.flags since this is used only in the TX path (by ath9k). This frees up a bit which can be used for other purposes. Signed-off-by: NSujith Manoharan <c_manoha@qca.qualcomm.com> Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Matan Barak 提交于
Add the required firmware commands for A0 steering and a way to enable that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT, QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used to configure and query the device. The different A0 DMFS (steering) modes are: Static - optimized performance, but flow steering rules are limited. This mode should be choosed explicitly by the user in order to be used. Dynamic - this mode should be explicitly choosed by the user. In this mode, the FW works in optimized steering mode as long as it can and afterwards automatically drops to classic (full) DMFS. Disable - this mode should be explicitly choosed by the user. The user instructs the system not to use optimized steering, even if the FW supports Dynamic A0 DMFS (and thus will be able to use optimized steering in Default A0 DMFS mode). Default - this mode is implicitly choosed. In this mode, if the FW supports Dynamic A0 DMFS, it'll work in this mode. Otherwise, it'll work at Disable A0 DMFS mode. Under SRIOV configuration, when the A0 steering mode is enabled, older guest VF drivers who aren't using the RX QP allocation flag (MLX4_RESERVE_A0_QP) will get a QP from the general range and fail when attempting to register a steering rule. To avoid that, the PF context behaviour is changed once on A0 static mode, to require support for the allocation flag in VF drivers too. In order to enable A0 steering, we use log_num_mgm_entry_size param. If the value of the parameter is not positive, we treat the absolute value of log_num_mgm_entry_size as a bit field. Setting bit 2 of this bit field enables static A0 steering. Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Matan Barak 提交于
A0 hybrid steering is a form of high performance flow steering. By using this mode, mlx4 cards use a fast limited table based steering, in order to enable fast steering of unicast packets to a QP. In order to implement A0 hybrid steering we allocate resources from different zones: (1) General range (2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region. When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP, we try hard to allocate the QP from range (2). Otherwise, we try hard not to allocate from this range. However, when the system is pushed to its limits and one needs every resource, the allocator uses every region it can. Meaning, when we run out of raw-eth qps, the allocator allocates from the general range (and the special-A0 area is no longer active). If we run out of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that is also exhausted, the allocator will allocate from the general range (and the A0 region is no longer active). Note that if a raw-eth qp is allocated from the general range, it attempts to allocate the range such that bits 6 and 7 (blueflame bits) in the QP number are not set. When the feature is used in SRIOV, the VF has to notify the PF what kind of QP attributes it needs. In order to do that, along with the "Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According to the combination of these bits, the PF tries to allocate a suitable QP. In order to maintain backward compatibility (with older PFs), the PF notifies which QP attributes it supports via QUERY_FUNC_CAP command. Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eugenia Emantayev 提交于
When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset. The current Ethernet driver code reserves a Tx QP range with 256b alignment. This is wrong because if there are more than 64 Tx QPs in use, QPNs >= base + 65 will have bits 6/7 set. This problem is not specific for the Ethernet driver, any entity that tries to reserve more than 64 BF-enabled QPs should fail. Also, using ranges is not necessary here and is wasteful. The new mechanism introduced here will support reservation for "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs (when hypervisors support WC in VMs). The flow we use is: 1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation, and request "BF enabled QPs" if BF is supported for the function 2. In the ALLOC_RES FW command, change param1 to: a. param1[23:0] - number of QPs b. param1[31-24] - flags controlling QPs reservation Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits 6 and 7 unset in order to be used in Ethernet. Bits 24-30 of the flags are currently reserved. When a function tries to allocate a QP, it states the required attributes for this QP. Those attributes are considered "best-effort". If an attribute, such as Ethernet BF enabled QP, is a must-have attribute, the function has to check that attribute is supported before trying to do the allocation. In a lower layer of the code, mlx4_qp_reserve_range masks out the bits which are unsupported. If SRIOV is used, the PF validates those attributes and masks out unsupported attributes as well. In order to notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's mailbox is filled by the PF, which notifies which QP allocation attributes it supports. Signed-off-by: NEugenia Emantayev <eugenia@mellanox.co.il> Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Matan Barak 提交于
Previously, we've fired all our completion callbacks straight from our ISR. Some of those callbacks were lightweight (for example, mlx4_en's and IPoIB napi callbacks), but some of them did more work (for example, the user-space RDMA stack uverbs' completion handler). Besides that, doing more than the minimal work in ISR is generally considered wrong, it could even lead to a hard lockup of the system. Since when a lot of completion events are generated by the hardware, the loop over those events could be so long, that we'll get into a hard lockup by the system watchdog. In order to avoid that, add a new way of invoking completion events callbacks. In the interrupt itself, we add the CQs which receive completion event to a per-EQ list and schedule a tasklet. In the tasklet context we loop over all the CQs in the list and invoke the user callback. Signed-off-by: NMatan Barak <matanb@mellanox.com> Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 12月, 2014 30 次提交
-
-
由 Gu Zheng 提交于
Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating cmsghdr from msghdr, just cleanup. Signed-off-by: NGu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
Use #defines instead of magic values. Signed-off-by: NJoe Perches <joe@perches.com> Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joe Perches 提交于
Eliminate the unlikely possibility of message interleaving for early_printk/early_vprintk use. early_vprintk can be done via the %pV extension so remove this unnecessary function and change early_printk to have the equivalent vprintk code. All uses of early_printk already end with a newline so also remove the unnecessary newline from the early_printk function. Signed-off-by: NJoe Perches <joe@perches.com> Acked-by: NChris Metcalf <cmetcalf@tilera.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Prarit Bhargava 提交于
There have been several times where I have had to rebuild a kernel to cause a panic when hitting a WARN() in the code in order to get a crash dump from a system. Sometimes this is easy to do, other times (such as in the case of a remote admin) it is not trivial to send new images to the user. A much easier method would be a switch to change the WARN() over to a panic. This makes debugging easier in that I can now test the actual image the WARN() was seen on and I do not have to engage in remote debugging. This patch adds a panic_on_warn kernel parameter and /proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common() path. The function will still print out the location of the warning. An example of the panic_on_warn output: The first line below is from the WARN_ON() to output the WARN_ON()'s location. After that the panic() output is displayed. WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]() Kernel panic - not syncing: panic_on_warn set ... CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013 0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190 0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df Call Trace: [<ffffffff81665190>] dump_stack+0x46/0x58 [<ffffffff8165e2ec>] panic+0xd0/0x204 [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module] [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0 [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module] [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module] [<ffffffff81002144>] do_one_initcall+0xd4/0x210 [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110 [<ffffffff810f8889>] load_module+0x16a9/0x1b30 [<ffffffff810f3d30>] ? store_uevent+0x70/0x70 [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180 [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0 [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17 Successfully tested by me. hpa said: There is another very valid use for this: many operators would rather a machine shuts down than being potentially compromised either functionally or security-wise. Signed-off-by: NPrarit Bhargava <prarit@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yann Droneaud 提交于
Macro get_unused_fd() is used to allocate a file descriptor with default flags. Those default flags (0) don't enable close-on-exec. This can be seen as an unsafe default: in most case close-on-exec should be enabled to not leak file descriptor across exec(). It would be better to have a "safer" default set of flags, eg. O_CLOEXEC must be used to enable close-on-exec. Instead this patch removes get_unused_fd() so that out of tree modules won't be affect by a runtime behavor change which might introduce other kind of bugs: it's better to catch the change at build time, making it easier to fix. Removing the macro will also promote use of get_unused_fd_flags() (or anon_inode_getfd()) with flags provided by userspace. Or, if flags cannot be given by userspace, with flags set to O_CLOEXEC by default. Signed-off-by: NYann Droneaud <ydroneaud@opteya.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks, we can simply pass "dead_children" list to exit_ptrace() and remove another release_task() loop. Plus this way we do not need to drop and reacquire tasklist_lock. Also shift the list_empty(ptraced) check, if we want this optimization it makes sense to eliminate the function call altogether. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Now that the external page_cgroup data structure and its lookup is gone, let the generic bad_page() check for page->mem_cgroup sanity. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Now that the external page_cgroup data structure and its lookup is gone, the only code remaining in there is swap slot accounting. Rename it and move the conditional compilation into mm/Makefile. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Memory cgroups used to have 5 per-page pointers. To allow users to disable that amount of overhead during runtime, those pointers were allocated in a separate array, with a translation layer between them and struct page. There is now only one page pointer remaining: the memcg pointer, that indicates which cgroup the page is associated with when charged. The complexity of runtime allocation and the runtime translation overhead is no longer justified to save that *potential* 0.19% of memory. With CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding after the page->private member and doesn't even increase struct page, and then this patch actually saves space. Remaining users that care can still compile their kernels without CONFIG_MEMCG. text data bss dec hex filename 8828345 1725264 983040 11536649 b00909 vmlinux.old 8827425 1725264 966656 11519345 afc571 vmlinux.new [mhocko@suse.cz: update Documentation/cgroups/memory.txt] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Since commit d7365e78 ("mm: memcontrol: fix missed end-writeback page accounting") mem_cgroup_end_page_stat consumes locked and flags variables directly rather than via pointers which might trigger C undefined behavior as those variables are initialized only in the slow path of mem_cgroup_begin_page_stat. Although mem_cgroup_end_page_stat handles parameters correctly and touches them only when they hold a sensible value it is caller which loads a potentially uninitialized value which then might allow compiler to do crazy things. I haven't seen any warning from gcc and it seems that the current version (4.9) doesn't exploit this type undefined behavior but Sasha has reported the following: UBSan: Undefined behaviour in mm/rmap.c:1084:2 load of value 255 is not a valid value for type '_Bool' CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427 Call Trace: dump_stack (lib/dump_stack.c:52) ubsan_epilogue (lib/ubsan.c:159) __ubsan_handle_load_invalid_value (lib/ubsan.c:482) page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096) unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303) unmap_single_vma (mm/memory.c:1348) unmap_vmas (mm/memory.c:1377 (discriminator 3)) exit_mmap (mm/mmap.c:2837) mmput (kernel/fork.c:659) do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747) do_group_exit (include/linux/sched.h:775 kernel/exit.c:873) SyS_exit_group (kernel/exit.c:901) tracesys_phase2 (arch/x86/kernel/entry_64.S:529) Fix this by using pointer parameters for both locked and flags and be more robust for future compiler changes even though the current code is implemented correctly. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reported-by: NSasha Levin <sasha.levin@oracle.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
None of the mem_cgroup_same_or_subtree() callers actually require it to take the RCU lock, either because they hold it themselves or they have css references. Remove it. To make the API change clear, rename the leftover helper to mem_cgroup_is_descendant() to match cgroup_is_descendant(). Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner. It makes a lot more sense to check where it's looked up, rather than check for it in __mem_cgroup_same_or_subtree() where it's unexpected. No other callsite passes NULL to __mem_cgroup_same_or_subtree(). Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Let's use generic slab_start/next/stop for showing memcg caches info. In contrast to the current implementation, this will work even if all memcg caches' info doesn't fit into a seq buffer (a page), plus it simply looks neater. Actually, the main reason I do this isn't mere cleanup. I'm going to zap the memcg_slab_caches list, because I find it useless provided we have the slab_caches list, and this patch is a step in this direction. It should be noted that before this patch an attempt to read memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted in -EIO, while after this patch it will silently show nothing except the header, but I don't think it will frustrate anyone. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
hstate_sizelog() would shift left an int rather than long, triggering undefined behaviour and passing an incorrect value when the requested page size was more than 4GB, thus breaking >4GB pages. Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
pc->mem_cgroup had to be left intact after uncharge for the final LRU removal, and !PCG_USED indicated whether the page was uncharged. But since commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") pages are uncharged after the final LRU removal. Uncharge can simply clear the pointer and the PCG_USED/PageCgroupUsed sites can test that instead. Because this is the last page_cgroup flag, this patch reduces the memcg per-page overhead to a single pointer. [akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
PCG_MEM is a remnant from an earlier version of 0a31bc97 ("mm: memcontrol: rewrite uncharge API"), used to tell whether migration cleared a charge while leaving pc->mem_cgroup valid and PCG_USED set. But in the final version, mem_cgroup_migrate() directly uncharges the source page, rendering this distinction unnecessary. Remove it. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Now that mem_cgroup_swapout() fully uncharges the page, every page that is still in use when reaching mem_cgroup_uncharge() is known to carry both the memory and the memory+swap charge. Simplify the uncharge path and remove the PCG_MEMSW page flag accordingly. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Since commit 53853e2d ("mm, compaction: defer each zone individually instead of preferred zone"), compaction is deferred for each zone where sync direct compaction fails, and reset where it succeeds. However, it was observed that for DMA zone compaction often appeared to succeed while subsequent allocation attempt would not, due to different outcome of watermark check. In order to properly defer compaction in this zone, the candidate zone has to be passed back to __alloc_pages_direct_compact() and compaction deferred in the zone after the allocation attempt fails. The large source of mismatch between watermark check in compaction and allocation was the lack of alloc_flags and classzone_idx values in compaction, which has been fixed in the previous patch. So with this problem fixed, we can simplify the code by removing the candidate_zone parameter and deferring in __alloc_pages_direct_compact(). After this patch, the compaction activity during stress-highalloc benchmark is still somewhat increased, but it's negligible compared to the increase that occurred without the better watermark checking. This suggests that it is still possible to apparently succeed in compaction but fail to allocate, possibly due to parallel allocation activity. [akpm@linux-foundation.org: fix build] Suggested-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Compaction relies on zone watermark checks for decisions such as if it's worth to start compacting in compaction_suitable() or whether compaction should stop in compact_finished(). The watermark checks take classzone_idx and alloc_flags parameters, which are related to the memory allocation request. But from the context of compaction they are currently passed as 0, including the direct compaction which is invoked to satisfy the allocation request, and could therefore know the proper values. The lack of proper values can lead to mismatch between decisions taken during compaction and decisions related to the allocation request. Lack of proper classzone_idx value means that lowmem_reserve is not taken into account. This has manifested (during recent changes to deferred compaction) when DMA zone was used as fallback for preferred Normal zone. compaction_suitable() without proper classzone_idx would think that the watermarks are already satisfied, but watermark check in get_page_from_freelist() would fail. Because of this problem, deferring compaction has extra complexity that can be removed in the following patch. The issue (not confirmed in practice) with missing alloc_flags is opposite in nature. For allocations that include ALLOC_HIGH, ALLOC_HIGHER or ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on CMA-enabled systems) the watermark checking in compaction with 0 passed will be stricter than in get_page_from_freelist(). In these cases compaction might be running for a longer time than is really needed. Another issue compaction_suitable() is that the check for "does the zone need compaction at all?" comes only after the check "does the zone have enough free free pages to succeed compaction". The latter considers extra pages for migration and can therefore in some situations fail and return COMPACT_SKIPPED, although the high-order allocation would succeed and we should return COMPACT_PARTIAL. This patch fixes these problems by adding alloc_flags and classzone_idx to struct compact_control and related functions involved in direct compaction and watermark checking. Where possible, all other callers of compaction_suitable() pass proper values where those are known. This is currently limited to classzone_idx, which is sometimes known in kswapd context. However, the direct reclaim callers should_continue_reclaim() and compaction_ready() do not currently know the proper values, so the coordination between reclaim and compaction may still not be as accurate as it could. This can be fixed later, if it's shown to be an issue. Additionaly the checks in compact_suitable() are reordered to address the second issue described above. The effect of this patch should be slightly better high-order allocation success rates and/or less compaction overhead, depending on the type of allocations and presence of CMA. It allows simplifying deferred compaction code in a followup patch. When testing with stress-highalloc, there was some slight improvement (which might be just due to variance) in success rates of non-THP-like allocations. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The functions for draining per-cpu pages back to buddy allocators currently always operate on all zones. There are however several cases where the drain is only needed in the context of a single zone, and spilling other pcplists is a waste of time both due to the extra spilling and later refilling. This patch introduces new zone pointer parameter to drain_all_pages() and changes the dummy parameter of drain_local_pages() to be also a zone pointer. When NULL is passed, the functions operate on all zones as usual. Passing a specific zone pointer reduces the work to the single zone. All callers are updated to pass the NULL pointer in this patch. Conversion to single zone (where appropriate) is done in further patches. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
As charges now pin the css explicitely, there is no more need for kmemcg to acquire a proxy reference for outstanding pages during offlining, or maintain state to identify such "dead" groups. This was the last user of the uncharge functions' return values, so remove them as well. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Charges currently pin the css indirectly by playing tricks during css_offline(): user pages stall the offlining process until all of them have been reparented, whereas kmemcg acquires a keep-alive reference if outstanding kernel pages are detected at that point. In preparation for removing all this complexity, make the pinning explicit and acquire a css references for every charged page. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
All memory accounting and limiting has been switched over to the lockless page counters. Bye, res_counter! [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt] [mhocko@suse.cz: ditch the last remainings of res_counter] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Abandon the spinlock-protected byte counters in favor of the unlocked page counters in the hugetlb controller as well. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Memory is internally accounted in bytes, using spinlock-protected 64-bit counters, even though the smallest accounting delta is a page. The counter interface is also convoluted and does too many things. Introduce a new lockless word-sized page counter API, then change all memory accounting over to it. The translation from and to bytes then only happens when interfacing with userspace. The removed locking overhead is noticable when scaling beyond the per-cpu charge caches - on a 4-socket machine with 144-threads, the following test shows the performance differences of 288 memcgs concurrently running a page fault benchmark: vanilla: 18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% ) 1,380,638 context-switches # 0.074 K/sec ( +- 0.75% ) 24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% ) 1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% ) 50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% ) 1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% ) 1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% ) 132.474343877 seconds time elapsed ( +- 0.21% ) lockless: 12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% ) 832,850 context-switches # 0.068 K/sec ( +- 0.54% ) 15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% ) 1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% ) 32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% ) 2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% ) 1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% ) 91.369330729 seconds time elapsed ( +- 0.45% ) On top of improved scalability, this also gets rid of the icky long long types in the very heart of memcg, which is great for 32 bit and also makes the code a lot more readable. Notable differences between the old and new API: - res_counter_charge() and res_counter_charge_nofail() become page_counter_try_charge() and page_counter_charge() resp. to match the more common kernel naming scheme of try_do()/do() - res_counter_uncharge_until() is only ever used to cancel a local counter and never to uncharge bigger segments of a hierarchy, so it's replaced by the simpler page_counter_cancel() - res_counter_set_limit() is replaced by page_counter_limit(), which expects its callers to serialize against themselves - res_counter_memparse_write_strategy() is replaced by page_counter_limit(), which rounds down to the nearest page size - rather than up. This is more reasonable for explicitely requested hard upper limits. - to keep charging light-weight, page_counter_try_charge() charges speculatively, only to roll back if the result exceeds the limit. Because of this, a failing bigger charge can temporarily lock out smaller charges that would otherwise succeed. The error is bounded to the difference between the smallest and the biggest possible charge size, so for memcg, this means that a failing THP charge can send base page charges into reclaim upto 2MB (4MB) before the limit would have been reached. This should be acceptable. [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse] [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joe Perches 提交于
Making things const is a good thing. (x86-64 defconfig with all irda) $ size net/irda/built-in.o* text data bss dec hex filename 109276 1868 244 111388 1b31c net/irda/built-in.o.new 108828 2316 244 111388 1b31c net/irda/built-in.o.old Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
It's better when function pointer arrays aren't modifiable. Net change: $ size net/llc/built-in.o.* text data bss dec hex filename 61193 12758 1344 75295 1261f net/llc/built-in.o.new 47113 27030 1344 75487 126df net/llc/built-in.o.old Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
It's better when function pointer arrays aren't modifiable. Net change from original: $ size net/llc/built-in.o.* text data bss dec hex filename 61065 12886 1344 75295 1261f net/llc/built-in.o.new 47113 27030 1344 75487 126df net/llc/built-in.o.old Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
It's better when function pointer arrays aren't modifiable. Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
As there are now no remaining users of arch_fast_hash(), lets kill it entirely. This basically reverts commit 71ae8aac ("lib: introduce arch optimized hash library") and follow-up work, that is f.e., commit 23721754 ("lib: hash: follow-up fixups for arch hash"), commit e3fec2f7 ("lib: Add missing arch generic-y entries for asm-generic/hash.h") and last but not least commit 6a02652d ("perf tools: Fix include for non x86 architectures"). Cc: Francesco Fusco <fusco@ntop.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-