- 10 10月, 2014 29 次提交
-
-
由 Johannes Weiner 提交于
In a memcg with even just moderate cache pressure, success rates for transparent huge page allocations drop to zero, wasting a lot of effort that the allocator puts into assembling these pages. The reason for this is that the memcg reclaim code was never designed for higher-order charges. It reclaims in small batches until there is room for at least one page. Huge page charges only succeed when these batches add up over a series of huge faults, which is unlikely under any significant load involving order-0 allocations in the group. Remove that loop on the memcg side in favor of passing the actual reclaim goal to direct reclaim, which is already set up and optimized to meet higher-order goals efficiently. This brings memcg's THP policy in line with the system policy: if the allocator painstakingly assembles a hugepage, memcg will at least make an honest effort to charge it. As a result, transparent hugepage allocation rates amid cache activity are drastically improved: vanilla patched pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%) pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%) pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%) thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%) thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%) [ Note: this may in turn increase memory consumption from internal fragmentation, which is an inherent risk of transparent hugepages. Some setups may have to adjust the memcg limits accordingly to accomodate this - or, if the machine is already packed to capacity, disable the transparent huge page feature. ] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Hansen <dave@sr71.net> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
`While growing per memcg caches arrays, we jump between memcontrol.c and slab_common.c in a weird way: memcg_alloc_cache_id - memcontrol.c memcg_update_all_caches - slab_common.c memcg_update_cache_size - memcontrol.c There's absolutely no reason why memcg_update_cache_size can't live on the slab's side though. So let's move it there and settle it comfortably amid per-memcg cache allocation functions. Besides, this patch cleans this function up a bit, removing all the useless comments from it, and renames it to memcg_update_cache_params to conform to memcg_alloc/free_cache_params, which we already have in slab_common.c. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
The only reason why they live in memcontrol.c is that we get/put css reference to the owner memory cgroup in them. However, we can do that in memcg_{un,}register_cache. OTOH, there are several reasons to move them to slab_common.c. First, I think that the less public interface functions we have in memcontrol.h the better. Since the functions I move don't depend on memcontrol, I think it's worth making them private to slab, especially taking into account that the arrays are defined on the slab's side too. Second, the way how per-memcg arrays are updated looks rather awkward: it proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c (memcg_update_all_caches) and back to memcontrol.c again (memcg_update_array_size). In the following patches I move the function relocating the arrays (memcg_update_array_size) to slab_common.c and therefore get rid this circular call path. I think we should have the cache allocation stuff in the same place where we have relocation, because it's easier to follow the code then. So I move arrays alloc/free functions to slab_common.c too. The third point isn't obvious. I'm going to make the list_lru structure per-memcg to allow targeted kmem reclaim. That means we will have per-memcg arrays in list_lrus too. It turns out that it's much easier to update these arrays in list_lru.c rather than in memcontrol.c, because all the stuff we need is defined there. This patch makes memcg caches arrays allocation path conform that of the upcoming list_lru. So let's move these functions to slab_common.c and make them static. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
Very similar to VM_BUG_ON_PAGE and VM_BUG_ON_VMA, dump struct_mm when the bug is hit. [akpm@linux-foundation.org: coding-style fixes] [mhocko@suse.cz: fix build] [mhocko@suse.cz: fix build some more] [akpm@linux-foundation.org: do strange things to avoid doing strange things for the comma separators] Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Junxiao Bi 提交于
commit 21caf2fc ("mm: teach mm by current context info to not do I/O during memory allocation") introduces PF_MEMALLOC_NOIO flag to avoid doing I/O inside memory allocation, __GFP_IO is cleared when this flag is set, but __GFP_FS implies __GFP_IO, it should also be cleared. Or it may still run into I/O, like in superblock shrinker. And this will make the kernel run into the deadlock case described in that commit. See Dave Chinner's comment about io in superblock shrinker: Filesystem shrinkers do indeed perform IO from the superblock shrinker and have for years. Even clean inodes can require IO before they can be freed - e.g. on an orphan list, need truncation of post-eof blocks, need to wait for ordered operations to complete before it can be freed, etc. IOWs, Ext4, btrfs and XFS all can issue and/or block on arbitrary amounts of IO in the superblock shrinker context. XFS, in particular, has been doing transactions and IO from the VFS inode cache shrinker since it was first introduced.... Fix this by clearing __GFP_FS in memalloc_noio_flags(), this function has masked all the gfp_mask that will be passed into fs for the processes setting PF_MEMALLOC_NOIO in the direct reclaim path. v1 thread at: https://lkml.org/lkml/2014/9/3/32Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: joyce.xue <xuejiufei@huawei.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Page reclaim tests zone_is_reclaim_dirty(), but the site that actually sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the reader through layers indirection just to track down a simple bit. Remove all zone flag wrappers and just use bitops against zone->flags directly. It's just as readable and the lines are barely any longer. Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and remove the zone_flags_t typedef. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract more information when they trigger. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
Very similar to VM_BUG_ON_PAGE but dumps VMA information instead. Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
Introduce a helper to dump information about a VMA, this also makes dump_page_flags more generic and re-uses that so the output looks very similar to dump_page: [ 61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000 [ 61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000 [ 61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops (null) [ 61.903437] pgoff 7ffffffdd file (null) private_data (null) [ 61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account) [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM] [swarren@nvidia.com: fix dump_vma() compilation] Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NStephen Warren <swarren@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
This is designed to avoid a few ifdefs in .c files but it's obnoxious because it can cause unsuspecting "migrate_page" symbols to get turned into "NULL". Just nuke it and use the ifdefs. Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
- get_vma_policy(task) is not safe if task != current, remove this argument. - get_vma_policy() no longer has callers outside of mempolicy.c, make it static. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Extract the code which looks for vma's policy from get_vma_policy() into the new helper, __get_vma_policy(). Export get_task_policy(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
1. vma_policy_mof(task) is simply not safe unless task == current, it can race with do_exit()->mpol_put(). Remove this arg and update its single caller. 2. vma can not be NULL, remove this check and simplify the code. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The deprecation warnings for the scan_unevictable interface triggers by scripts doing `sysctl -a | grep something else'. This is annoying and not helpful. The interface has been defunct since 264e56d8 ("mm: disable user interface to manually rescue unevictable pages"), which was in 2011, and there haven't been any reports of usecases for it, only reports that the deprecation warnings are annying. It's unlikely that anybody is using this interface specifically at this point, so remove it. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
During development of c/r we've noticed that in case if we need to support user namespaces we face a problem with capabilities in prctl(PR_SET_MM, ...) call, in particular once new user namespace is created capable(CAP_SYS_RESOURCE) no longer passes. A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make more intensive test for sanity of values and same time allow us to support checkpoint/restore of user namespaces. Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of prctl_mm_map structure which carries all the members to be updated. prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size) struct prctl_mm_map { __u64 start_code; __u64 end_code; __u64 start_data; __u64 end_data; __u64 start_brk; __u64 brk; __u64 start_stack; __u64 arg_start; __u64 arg_end; __u64 env_start; __u64 env_end; __u64 *auxv; __u32 auxv_size; __u32 exe_fd; }; All members except @exe_fd correspond ones of struct mm_struct. To figure out which available values these members may take here are meanings of the members. - start_code, end_code: represent bounds of executable code area - start_data, end_data: represent bounds of data area - start_brk, brk: used to calculate bounds for brk() syscall - start_stack: used when accounting space needed for command line arguments, environment and shmat() syscall - arg_start, arg_end, env_start, env_end: represent memory area supplied for command line arguments and environment variables - auxv, auxv_size: carries auxiliary vector, Elf format specifics - exe_fd: file descriptor number for executable link (/proc/self/exe) Thus we apply the following requirements to the values 1) Any member except @auxv, @auxv_size, @exe_fd is rather an address in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr) interval. 2) While @[start|end]_code and @[start|end]_data may point to an nonexisting VMAs (say a program maps own new .text and .data segments during execution) the rest of members should belong to VMA which must exist. 3) Addresses must be ordered, ie @start_ member must not be greater or equal to appropriate @end_ member. 4) As in regular Elf loading procedure we require that @start_brk and @brk be greater than @end_data. 5) If RLIMIT_DATA rlimit is set to non-infinity new values should not exceed existing limit. Same applies to RLIMIT_STACK. 6) Auxiliary vector size must not exceed existing one (which is predefined as AT_VECTOR_SIZE and depends on architecture). 7) File descriptor passed in @exe_file should be pointing to executable file (because we use existing prctl_set_mm_exe_file_locked helper it ensures that the file we are going to use as exe link has all required permission granted). Now about where these members are involved inside kernel code: - @start_code and @end_code are used in /proc/$pid/[stat|statm] output; - @start_data and @end_data are used in /proc/$pid/[stat|statm] output, also they are considered if there enough space for brk() syscall result if RLIMIT_DATA is set; - @start_brk shown in /proc/$pid/stat output and accounted in brk() syscall if RLIMIT_DATA is set; also this member is tested to find a symbolic name of mmap event for perf system (we choose if event is generated for "heap" area); one more aplication is selinux -- we test if a process has PROCESS__EXECHEAP permission if trying to make heap area being executable with mprotect() syscall; - @brk is a current value for brk() syscall which lays inside heap area, it's shown in /proc/$pid/stat. When syscall brk() succesfully provides new memory area to a user space upon brk() completion the mm::brk is updated to carry new value; Both @start_brk and @brk are actively used in /proc/$pid/maps and /proc/$pid/smaps output to find a symbolic name "heap" for VMA being scanned; - @start_stack is printed out in /proc/$pid/stat and used to find a symbolic name "stack" for task and threads in /proc/$pid/maps and /proc/$pid/smaps output, and as the same as with @start_brk -- perf system uses it for event naming. Also kernel treat this member as a start address of where to map vDSO pages and to check if there is enough space for shmat() syscall; - @arg_start, @arg_end, @env_start and @env_end are printed out in /proc/$pid/stat. Another access to the data these members represent is to read /proc/$pid/environ or /proc/$pid/cmdline. Any attempt to read these areas kernel tests with access_process_vm helper so a user must have enough rights for this action; - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly speaking kernel doesn't care much about which exactly data is sitting there because it is solely for userspace; - @exe_fd is referred from /proc/$pid/exe and when generating coredump. We uses prctl_set_mm_exe_file_locked helper to update this member, so exe-file link modification remains one-shot action. Still note that updating exe-file link now doesn't require sys-resource capability anymore, after all there is no much profit in preventing setup own file link (there are a number of ways to execute own code -- ptrace, ld-preload, so that the only reliable way to find which exactly code is executed is to inspect running program memory). Still we require the caller to be at least user-namespace root user. I believe the old interface should be deprecated and ripped off in a couple of kernel releases if no one against. To test if new interface is implemented in the kernel one can pass PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently supported struct prctl_mm_map. [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions] Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Acked-by: NAndrew Vagin <avagin@openvz.org> Tested-by: NAndrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
To eliminate code duplication lets introduce check_data_rlimit helper which we will use in brk() and prctl() syscalls. Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Julien Tinnes <jln@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like ALLOC_CPUSET) that have separate semantics. The function allocflags_to_migratetype() actually takes gfp flags, not alloc flags, and returns a migratetype. Rename it to gfpflags_to_migratetype(). Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Async compaction aborts when it detects zone lock contention or need_resched() is true. David Rientjes has reported that in practice, most direct async compactions for THP allocation abort due to need_resched(). This means that a second direct compaction is never attempted, which might be OK for a page fault, but khugepaged is intended to attempt a sync compaction in such case and in these cases it won't. This patch replaces "bool contended" in compact_control with an int that distinguishes between aborting due to need_resched() and aborting due to lock contention. This allows propagating the abort through all compaction functions as before, but passing the abort reason up to __alloc_pages_slowpath() which decides when to continue with direct reclaim and another compaction attempt. Another problem is that try_to_compact_pages() did not act upon the reported contention (both need_resched() or lock contention) immediately and would proceed with another zone from the zonelist. When need_resched() is true, that means initializing another zone compaction, only to check again need_resched() in isolate_migratepages() and aborting. For zone lock contention, the unintended consequence is that the lock contended status reported back to the allocator is detrmined from the last zone where compaction was attempted, which is rather arbitrary. This patch fixes the problem in the following way: - async compaction of a zone aborting due to need_resched() or fatal signal pending means that further zones should not be tried. We report COMPACT_CONTENDED_SCHED to the allocator. - aborting zone compaction due to lock contention means we can still try another zone, since it has different set of locks. We report back COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted, it was aborted due to lock contention. As a result of these fixes, khugepaged will proceed with second sync compaction as intended, when the preceding async compaction aborted due to need_resched(). Page fault compactions aborting due to need_resched() will spare some cycles previously wasted by initializing another zone compaction only to abort again. Lock contention will be reported only when compaction in all zones aborted due to lock contention, and therefore it's not a good idea to try again after reclaim. In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this has improved number of THP collapse allocations by 10%, which shows positive effect on khugepaged. The benchmark's success rates are unchanged as it is not recognized as khugepaged. Numbers of compact_stall and compact_fail events have however decreased by 20%, with compact_success still a bit improved, which is good. With benchmark configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP collapse allocations, and only slight improvement in stalls and failures. [akpm@linux-foundation.org: fix warnings] Reported-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
When direct sync compaction is often unsuccessful, it may become deferred for some time to avoid further useless attempts, both sync and async. Successful high-order allocations un-defer compaction, while further unsuccessful compaction attempts prolong the compaction deferred period. Currently the checking and setting deferred status is performed only on the preferred zone of the allocation that invoked direct compaction. But compaction itself is attempted on all eligible zones in the zonelist, so the behavior is suboptimal and may lead both to scenarios where 1) compaction is attempted uselessly, or 2) where it's not attempted despite good chances of succeeding, as shown on the examples below: 1) A direct compaction with Normal preferred zone failed and set deferred compaction for the Normal zone. Another unrelated direct compaction with DMA32 as preferred zone will attempt to compact DMA32 zone even though the first compaction attempt also included DMA32 zone. In another scenario, compaction with Normal preferred zone failed to compact Normal zone, but succeeded in the DMA32 zone, so it will not defer compaction. In the next attempt, it will try Normal zone which will fail again, instead of skipping Normal zone and trying DMA32 directly. 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks looking good. A direct compaction with preferred Normal zone will skip compaction of all zones including DMA32 because Normal was still deferred. The allocation might have succeeded in DMA32, but won't. This patch makes compaction deferring work on individual zone basis instead of preferred zone. For each zone, it checks compaction_deferred() to decide if the zone should be skipped. If watermarks fail after compacting the zone, defer_compaction() is called. The zone where watermarks passed can still be deferred when the allocation attempt is unsuccessful. When allocation is successful, compaction_defer_reset() is called for the zone containing the allocated page. This approach should approximate calling defer_compaction() only on zones where compaction was attempted and did not yield allocated page. There might be corner cases but that is inevitable as long as the decision to stop compacting dues not guarantee that a page will be allocated. Due to a new COMPACT_DEFERRED return value, some functions relying implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made more accurate. The did_some_progress output parameter of __alloc_pages_direct_compact() is removed completely, as the caller actually does not use it after compaction sets it - it is only considered when direct reclaim sets it. During testing on a two-node machine with a single very small Normal zone on node 1, this patch has improved success rates in stress-highalloc mmtests benchmark. The success here were previously made worse by commit 3a025760 ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was no longer resetting often enough the deferred compaction for the Normal zone, and DMA32 zones on both nodes were thus not considered for compaction. On different machine, success rates were improved with __GFP_NO_KSWAPD allocations. [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build] Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Laura Abbott 提交于
For architectures without coherent DMA, memory for DMA may need to be remapped with coherent attributes. Factor out the the remapping code from arm and put it in a common location to reduce code duplication. As part of this, the arm APIs are now migrated away from ioremap_page_range to the common APIs which use map_vm_area for remapping. This should be an equivalent change and using map_vm_area is more correct as ioremap_page_range is intended to bring in io addresses into the cpu space and not regular kernel managed memory. Signed-off-by: NLaura Abbott <lauraa@codeaurora.org> Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Riley <davidriley@chromium.org> Cc: Olof Johansson <olof@lixom.net> Cc: Ritesh Harjain <ritesh.harjani@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Laura Abbott 提交于
After allocating an address from a particular genpool, there is no good way to verify if that address actually belongs to a genpool. Introduce addr_in_gen_pool which will return if an address plus size falls completely within the genpool range. Signed-off-by: NLaura Abbott <lauraa@codeaurora.org> Acked-by: NWill Deacon <will.deacon@arm.com> Reviewed-by: NOlof Johansson <olof@lixom.net> Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Riley <davidriley@chromium.org> Cc: Ritesh Harjain <ritesh.harjani@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Laura Abbott 提交于
One of the more common algorithms used for allocation is to align the start address of the allocation to the order of size requested. Add this as an algorithm option for genalloc. Signed-off-by: NLaura Abbott <lauraa@codeaurora.org> Acked-by: NWill Deacon <will.deacon@arm.com> Acked-by: NOlof Johansson <olof@lixom.net> Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Riley <davidriley@chromium.org> Cc: Ritesh Harjain <ritesh.harjani@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
ARCH_USES_NUMA_PROT_NONE was defined for architectures that implemented _PAGE_NUMA using _PROT_NONE. This saved using an additional PTE bit and relied on the fact that PROT_NONE vmas were skipped by the NUMA hinting fault scanner. This was found to be conceptually confusing with a lot of implicit assumptions and it was asked that an alternative be found. Commit c46a7c81 "x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels" redefined _PAGE_NUMA on x86 to be one of the swap PTE bits and shrunk the maximum possible swap size but it did not go far enough. There are no architectures that reuse _PROT_NONE as _PROT_NUMA but the relics still exist. This patch removes ARCH_USES_NUMA_PROT_NONE and removes some unnecessary duplication in powerpc vs the generic implementation by defining the types the core NUMA helpers expected to exist from x86 with their ppc64 equivalent. This necessitated that a PTE bit mask be created that identified the bits that distinguish present from NUMA pte entries but it is expected this will only differ between arches based on _PAGE_PROTNONE. The naming for the generic helpers was taken from x86 originally but ppc64 has types that are equivalent for the purposes of the helper so they are mapped instead of duplicating code. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhang Zhen 提交于
Currently memory-hotplug has two limits: 1. If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE. 2. If the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL. With this patch, we can easy to know a memory block can be onlined to which zone, and don't need to know the above two limits. Updated the related Documentation. [akpm@linux-foundation.org: use conventional comment layout] [akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n] [akpm@linux-foundation.org: remove unused local zone_prev] Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wang Nan <wangnan0@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Because of chicken and egg problem, initialization of SLAB is really complicated. We need to allocate cpu cache through SLAB to make the kmem_cache work, but before initialization of kmem_cache, allocation through SLAB is impossible. On the other hand, SLUB does initialization in a more simple way. It uses percpu allocator to allocate cpu cache so there is no chicken and egg problem. So, this patch try to use percpu allocator in SLAB. This simplifies the initialization step in SLAB so that we could maintain SLAB code more easily. In my testing there is no performance difference. This implementation relies on percpu allocator. Because percpu allocator uses vmalloc address space, vmalloc address space could be exhausted by this change on many cpu system with *32 bit* kernel. This implementation can cover 1024 cpus in worst case by following calculation. Worst: 1024 cpus * 4 bytes for pointer * 300 kmem_caches * 120 objects per cpu_cache = 140 MB Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches(slab merge) * 80 objects per cpu_cache = 46 MB Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that on ppc LPARs with memoryless nodes, a large amount of memory was consumed by slabs and was marked unreclaimable. He tracked it down to slab deactivations in the SLUB core when we allocate remotely, leading to poor efficiency always when memoryless nodes are present. After much discussion, Joonsoo provided a few patches that help significantly. They don't resolve the problem altogether: - memory hotplug still needs testing, that is when a memoryless node becomes memory-ful, we want to dtrt - there are other reasons for going off-node than memoryless nodes, e.g., fully exhausted local nodes Neither case is resolved with this series, but I don't think that should block their acceptance, as they can be explored/resolved with follow-on patches. The series consists of: [1/3] topology: add support for node_to_mem_node() to determine the fallback node [2/3] slub: fallback to node_to_mem_node() node if allocating on memoryless node - Joonsoo's patches to cache the nearest node with memory for each NUMA node [3/3] Partial revert of 81c98869 (""kthread: ensure locality of task_struct allocations") - At Tejun's request, keep the knowledge of memoryless node fallback to the allocator core. This patch (of 3): We need to determine the fallback node in slub allocator if the allocation target node is memoryless node. Without it, the SLUB wrongly select the node which has no memory and can't use a partial slab, because of node mismatch. Introduced function, node_to_mem_node(X), will return a node Y with memory that has the nearest distance. If X is memoryless node, it will return nearest distance node, but, if X is normal node, it will return itself. We will use this function in following patch to determine the fallback node. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Han Pingtian <hanpt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Christoph Lameter <cl@linux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Now, we track caller if tracing or slab debugging is enabled. If they are disabled, we could save one argument passing overhead by calling __kmalloc(_node)(). But, I think that it would be marginal. Furthermore, default slab allocator, SLUB, doesn't use this technique so I think that it's okay to change this situation. After this change, we can turn on/off CONFIG_DEBUG_SLAB without full kernel build and remove some complicated '#if' defintion. It looks more benefitial to me. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
We don't need to keep kmem_cache definition in include/linux/slab.h if we don't need to inline kmem_cache_size(). According to my code inspection, this function is only called at lc_create() in lib/lru_cache.c which may be called at initialization phase of something, so we don't need to inline it. Therfore, move it to slab_common.c and move kmem_cache definition to internal header. After this change, we can change kmem_cache definition easily without full kernel build. For instance, we can turn on/off CONFIG_SLUB_STATS without full kernel build. [akpm@linux-foundation.org: export kmem_cache_size() to modules] [rdunlap@infradead.org: add header files to fix kmemcheck.c build errors] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
- Rename vm_is_stack() to task_of_stack() and change it to return "struct task_struct *" rather than the global (and thus wrong in general) pid_t. - Add the new pid_of_stack() helper which calls task_of_stack() and uses the right namespace to report the correct pid_t. Unfortunately we need to define this helper twice, in task_mmu.c and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c and move at least pid_of_stack/task_of_stack there to avoid the code duplication. - Change show_map_vma() and show_numa_map() to use the new helper. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 10月, 2014 3 次提交
-
-
由 Eric Dumazet 提交于
Add two helpers so that drivers do not have to care of BQL being available or not. Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NJim Davis <jim.epost@gmail.com> Fixes: 29d40c90 ("net/mlx4_en: Use prefetch in tx path") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Masanari Iida 提交于
In commit 7bced397, dma_cookie was removed from struct skbuff. But the description of dma_cookie still exist. So the "make xmldocs" output following warning. Warning(.//include/linux/skbuff.h:609): Excess struct/union /enum/typedef member 'dma_cookie' description in 'sk_buff' Remove description of dma_cookie fix the symptom. Signed-off-by: NMasanari Iida <standby24x7@gmail.com> Acked-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 LEROY Christophe 提交于
In the probe function, use_napi is inconditionnaly set to 1. This patch removes all the code which is conditional to !use_napi, and removes use_napi which has then become useless. Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 10月, 2014 2 次提交
-
-
由 Eric Dumazet 提交于
Testing xmit_more support with netperf and connected UDP sockets, I found strange dst refcount false sharing. Current handling of IFF_XMIT_DST_RELEASE is not optimal. Dropping dst in validate_xmit_skb() is certainly too late in case packet was queued by cpu X but dequeued by cpu Y The logical point to take care of drop/force is in __dev_queue_xmit() before even taking qdisc lock. As Julian Anastasov pointed out, need for skb_dst() might come from some packet schedulers or classifiers. This patch adds new helper to cleanly express needs of various drivers or qdiscs/classifiers. Drivers that need skb_dst() in their ndo_start_xmit() should call following helper in their setup instead of the prior : dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; -> netif_keep_dst(dev); Instead of using a single bit, we use two bits, one being eventually rebuilt in bonding/team drivers. The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being rebuilt in bonding/team. Eventually, we could add something smarter later. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Petri Gynther 提交于
Adjust fixed_phy_register() to return struct phy_device *, so that it becomes easy to use fixed PHYs without device tree support: phydev = fixed_phy_register(PHY_POLL, &fixed_phy_status, NULL); fixed_phy_set_link_update(phydev, fixed_phy_link_update); phy_connect_direct(netdev, phydev, handler_fn, phy_interface); This change is a prerequisite for modifying bcmgenet driver to work without a device tree on Broadcom's MIPS-based 7xxx platforms. Signed-off-by: NPetri Gynther <pgynther@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 10月, 2014 6 次提交
-
-
由 Andy Zhou 提交于
Fix a openvswitch compilation error when CONFIG_INET is not set: ===================================================== In file included from include/net/geneve.h:4:0, from net/openvswitch/flow_netlink.c:45: include/net/udp_tunnel.h: In function 'udp_tunnel_handle_offloads': >> include/net/udp_tunnel.h:100:2: error: implicit declaration of function 'iptunnel_handle_offloads' [-Werror=implicit-function-declaration] >> return iptunnel_handle_offloads(skb, udp_csum, type); >> ^ >> >> include/net/udp_tunnel.h:100:2: warning: return makes pointer from integer without a cast >> >> cc1: some warnings being treated as errors ===================================================== Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NAndy Zhou <azhou@nicira.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hannes Frederic Sowa 提交于
Try to reduce number of possible fn_sernum mutation by constraining them to their namespace. Also remove rt_genid which I forgot to remove in 705f1c86 ("ipv6: remove rt6i_genid"). Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hannes Frederic Sowa 提交于
Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hannes Frederic Sowa 提交于
Also renamed struct fib6_walker_t to fib6_walker and enum fib_walk_state_t to fib6_walk_state as recommended by Cong Wang. Cc: Cong Wang <cwang@twopensource.com> Cc: YOSHIFUJI Hideaki <hideaki@yoshifuji.org> Cc: Martin Lau <kafai@fb.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 John Fastabend 提交于
This removes the tcf_proto argument from the ematch code paths that only need it to reference the net namespace. This allows simplifying qdisc code paths especially when we need to tear down the ematch from an RCU callback. In this case we can not guarentee that the tcf_proto structure is still valid. Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com> Acked-by: NCong Wang <cwang@twopensource.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Some TSO engines might have a too heavy setup cost, that impacts performance on hosts sending small bursts (2 MSS per packet). This patch adds a device gso_min_segs, allowing drivers to set a minimum segment size for TSO packets, according to the NIC performance. Tested on a mlx4 NIC, this allows to get a ~110% increase of throughput when sending 2 MSS per packet. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-