- 13 5月, 2017 3 次提交
-
-
由 Ross Zwisler 提交于
Patch series "mm,dax: Fix data corruption due to mmap inconsistency", v4. This series fixes data corruption that can happen for DAX mounts when page faults race with write(2) and as a result page tables get out of sync with block mappings in the filesystem and thus data seen through mmap is different from data seen through read(2). The series passes testing with t_mmap_stale test program from Ross and also other mmap related tests on DAX filesystem. This patch (of 4): dax_invalidate_mapping_entry() currently removes DAX exceptional entries only if they are clean and unlocked. This is done via: invalidate_mapping_pages() invalidate_exceptional_entry() dax_invalidate_mapping_entry() However, for page cache pages removed in invalidate_mapping_pages() there is an additional criteria which is that the page must not be mapped. This is noted in the comments above invalidate_mapping_pages() and is checked in invalidate_inode_page(). For DAX entries this means that we can can end up in a situation where a DAX exceptional entry, either a huge zero page or a regular DAX entry, could end up mapped but without an associated radix tree entry. This is inconsistent with the rest of the DAX code and with what happens in the page cache case. We aren't able to unmap the DAX exceptional entry because according to its comments invalidate_mapping_pages() isn't allowed to block, and unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem. Since we essentially never have unmapped DAX entries to evict from the radix tree, just remove dax_invalidate_mapping_entry(). Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate") Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Reported-by: NJan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> [4.10+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Commit 1f5307b1 ("mm, vmalloc: properly track vmalloc users") has pulled asm/pgtable.h include dependency to linux/vmalloc.h and that turned out to be a bad idea for some architectures. E.g. m68k fails with In file included from arch/m68k/include/asm/pgtable_mm.h:145:0, from arch/m68k/include/asm/pgtable.h:4, from include/linux/vmalloc.h:9, from arch/m68k/kernel/module.c:9: arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page': >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function) #define pgd_offset_k(address) pgd_offset(&init_mm, address) as spotted by kernel build bot. nios2 fails for other reason In file included from include/asm-generic/io.h:767:0, from arch/nios2/include/asm/io.h:61, from include/linux/io.h:25, from arch/nios2/include/asm/pgtable.h:18, from include/linux/mm.h:70, from include/linux/pid_namespace.h:6, from include/linux/ptrace.h:9, from arch/nios2/include/uapi/asm/elf.h:23, from arch/nios2/include/asm/elf.h:22, from include/linux/elf.h:4, from include/linux/module.h:15, from init/main.c:16: include/linux/vmalloc.h: In function '__vmalloc_node_flags': include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'? which is due to the newly added #include <asm/pgtable.h>, which on nios2 includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which again includes <linux/vmalloc.h>. Tweaking that around just turns out a bigger headache than necessary. This patch reverts 1f5307b1 and reimplements the original fix in a different way. __vmalloc_node_flags can stay static inline which will cover vmalloc* functions. We only have one external user (kvmalloc_node) and we can export __vmalloc_node_flags_caller and provide the caller directly. This is much simpler and it doesn't really need any games with header files. [akpm@linux-foundation.org: coding-style fixes] [mhocko@kernel.org: revert old comment] Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz Fixes: 1f5307b1 ("mm, vmalloc: properly track vmalloc users") Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Tobias Klauser <tklauser@distanz.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Deepa Dinamani 提交于
All uses of the current_fs_time() function have been replaced by other time interfaces. And, its use cases can be fulfilled by current_time() or ktime_get_* variants. Link: http://lkml.kernel.org/r/1491613030-11599-13-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: NArnd Bergmann <arnd@arndb.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 5月, 2017 1 次提交
-
-
由 Vishal Verma 提交于
nsio_rw_bytes can clear media errors, but this cannot be done while we are in an atomic context due to locking within ACPI. From the BTT, ->rw_bytes may be called either from atomic or process context depending on whether the calls happen during initialization or during IO. During init, we want to ensure error clearing happens, and the flag marking process context allows nsio_rw_bytes to do that. When called during IO, we're in atomic context, and error clearing can be skipped. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NVishal Verma <vishal.l.verma@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 09 5月, 2017 26 次提交
-
-
由 Michael S. Tsirkin 提交于
A known weakness in ptr_ring design is that it does not handle well the situation when ring is almost full: as entries are consumed they are immediately used again by the producer, so consumer and producer are writing to a shared cache line. To fix this, add batching to consume calls: as entries are consumed do not write NULL into the ring until we get a multiple (in current implementation 2x) of cache lines away from the producer. At that point, write them all out. We do the write out in the reverse order to keep producer from sharing cache with consumer for as long as possible. Writeout also triggers when ring wraps around - there's no special reason to do this but it helps keep the code a bit simpler. What should we do if getting away from producer by 2 cache lines would mean we are keeping the ring moe than half empty? Maybe we should reduce the batching in this case, current patch simply reduces the batching. Notes: - it is no longer true that a call to consume guarantees that the following call to produce will succeed. No users seem to assume that. - batching can also in theory reduce the signalling rate: users that would previously send interrups to the producer to wake it up after consuming each entry would now only need to do this once in a batch. Doing this would be easy by returning a flag to the caller. No users seem to do signalling on consume yet so this was not implemented yet. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Reviewed-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NJason Wang <jasowang@redhat.com>
-
由 Cornelia Huck 提交于
Add comments for the virtio_driver members that were not documented. Signed-off-by: NCornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Nicholas Piggin 提交于
Introduce primitives for FDT parsing. These will be used for powerpc cpufeatures node scanning, which has quite complex structure but should be processed early. Cc: devicetree@vger.kernel.org Acked-by: NRob Herring <robh@kernel.org> Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Christoffer Dall 提交于
There are occasional needs to use the index of vcpu in the kvm->vcpus array to map something related to a VCPU. For example, unlike the vcpu->vcpu_id, the vcpu index is guaranteed to not be sparse across all vcpus which is useful when allocating a memory area for each vcpu. Signed-off-by: NChristoffer Dall <cdall@linaro.org> Reviewed-by: NEric Auger <eric.auger@redhat.com>
-
由 Vlastimil Babka 提交于
The previous patch ("mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC") has shown that simply setting and clearing PF_MEMALLOC in current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim. Let's introduce helpers that support proper nesting by saving the previous stat of the flag, similar to the existing memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing of PF_MEMALLOC within mm to the new helpers. There are no known issues with the converted code, but the change makes it more robust. Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Boris Brezillon <boris.brezillon@free-electrons.com> Cc: Chris Leech <cleech@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Lee Duncan <lduncan@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Deepa Dinamani 提交于
All uses of CURRENT_TIME_SEC and CURRENT_TIME macros have been replaced by other time functions. These macros are also not y2038 safe. And, all their use cases can be fulfilled by y2038 safe ktime_get_* variants. Link: http://lkml.kernel.org/r/1491613030-11599-12-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NJohn Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tetsuo Handa 提交于
Commit afddba49 ("fs: introduce write_begin, write_end, and perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was checked in pagecache_write_begin(), but that check was removed by 4e02ed4b ("fs: remove prepare_write/commit_write"). Between these two commits, commit d9414774 ("cifs: Convert cifs to new aops.") added a check in cifs_write_begin(), but that check was soon removed by commit a98ee8c1 ("[CIFS] fix regression in cifs_write_begin/cifs_write_end"). Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's remove this flag. This patch has no functionality changes. Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: NJeff Layton <jlayton@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Nick Piggin <npiggin@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
pagefault_disabled_dec is frequently used inline, and it has a WARN_ON for underflow that expands to about 6.5k of extra code. The warning doesn't seem to be that useful and worth so much code so remove it. If it was needed could make it depending on some debug kernel option. Saves ~6.5k in my kernel text data bss dec hex filename 9039417 5367568 11116544 25523529 1857549 vmlinux-before-pf 9032805 5367568 11116544 25516917 1855b75 vmlinux-pf Link: http://lkml.kernel.org/r/20170315021431.13107-8-andi@firstfloor.orgSigned-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
The kref functions check for NULL release functions. This WARN_ON seems rather pointless. We will eventually release and then just crash nicely. It is also somewhat expensive because these functions are inlined in a lot of places. Removing the WARN_ONs saves around 2.3k in this kernel (likely more in others with more drivers) text data bss dec hex filename 9083992 5367600 11116544 25568136 1862388 vmlinux-before-load-avg 9070166 5367600 11116544 25554310 185ed86 vmlinux-load-avg Link: http://lkml.kernel.org/r/20170315021431.13107-5-andi@firstfloor.orgSigned-off-by: NAndi Kleen <ak@linux.intel.com> Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Laura Abbott 提交于
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-11-git-send-email-labbott@redhat.comSigned-off-by: NLaura Abbott <labbott@redhat.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joe Perches 提交于
Add these misspellings to scripts/spelling.txt too Link: http://lkml.kernel.org/r/962aace119675e5fe87be2a88ddac1a5486f8e60.1490931810.git.joe@perches.comSigned-off-by: NJoe Perches <joe@perches.com> Acked-by: NMauro Carvalho Chehab <mchehab@s-opensource.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Stephen Boyd 提交于
This typo is quite common. Fix it and add it to the spelling file so that checkpatch catches it earlier. Link: http://lkml.kernel.org/r/20170317011131.6881-2-sboyd@codeaurora.orgSigned-off-by: NStephen Boyd <sboyd@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: NKees Cook <keescook@chromium.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__vmalloc_node_flags used to be static inline but this has changed by "mm: introduce kv[mz]alloc helpers" because kvmalloc_node needs to use it as well and the code is outside of the vmalloc proper. I haven't realized that changing this will lead to a subtle bug though. The function is responsible to track the caller as well. This caller is then printed by /proc/vmallocinfo. If __vmalloc_node_flags is not inline then we would get only direct users of __vmalloc_node_flags as callers (e.g. v[mz]alloc) which reduces usefulness of this debugging feature considerably. It simply doesn't help to see that the given range belongs to vmalloc as a caller: 0xffffc90002c79000-0xffffc90002c7d000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N0=3 0xffffc90002c81000-0xffffc90002c85000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3 0xffffc90002c8d000-0xffffc90002c91000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3 0xffffc90002c95000-0xffffc90002c99000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3 We really want to catch the _caller_ of the vmalloc function. Fix this issue by making __vmalloc_node_flags static inline again. Link: http://lkml.kernel.org/r/20170502134657.12381-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Patch series "kvmalloc", v5. There are many open coded kmalloc with vmalloc fallback instances in the tree. Most of them are not careful enough or simply do not care about the underlying semantic of the kmalloc/page allocator which means that a) some vmalloc fallbacks are basically unreachable because the kmalloc part will keep retrying until it succeeds b) the page allocator can invoke a really disruptive steps like the OOM killer to move forward which doesn't sound appropriate when we consider that the vmalloc fallback is available. As it can be seen implementing kvmalloc requires quite an intimate knowledge if the page allocator and the memory reclaim internals which strongly suggests that a helper should be implemented in the memory subsystem proper. Most callers, I could find, have been converted to use the helper instead. This is patch 6. There are some more relying on __GFP_REPEAT in the networking stack which I have converted as well and Eric Dumazet was not opposed [2] to convert them as well. [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com This patch (of 9): Using kmalloc with the vmalloc fallback for larger allocations is a common pattern in the kernel code. Yet we do not have any common helper for that and so users have invented their own helpers. Some of them are really creative when doing so. Let's just add kv[mz]alloc and make sure it is implemented properly. This implementation makes sure to not make a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also to not warn about allocation failures. This also rules out the OOM killer as the vmalloc is a more approapriate fallback than a disruptive user visible action. This patch also changes some existing users and removes helpers which are specific for them. In some cases this is not possible (e.g. ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and require GFP_NO{FS,IO} context which is not vmalloc compatible in general (note that the page table allocation is GFP_KERNEL). Those need to be fixed separately. While we are at it, document that __vmalloc{_node} about unsupported gfp mask because there seems to be a lot of confusion out there. kvmalloc_node will warn about GFP_KERNEL incompatible (which are not superset) flags to catch new abusers. Existing ones would have to die slowly. [sfr@canb.auug.org.au: f2fs fixup] Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part] Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
Assign 'struct kern_ipc_perm' its own cacheline to avoid false sharing with sysv ipc calls. While the structure itself is rather read-mostly throughout the lifespan of ipc, the spinlock causes most of the invalidations. One example is commit 31a7c474 ("ipc/sem.c: cacheline align the ipc spinlock for semaphores"). Therefore, extend this to all ipc. The effect of cacheline alignment on sems can be seen in sembench, which deals mostly with semtimedop wait/wakes is seen to improve raw throughput (worker loops) between 8 to 12% on a 24-core x86 with over 4 threads. Link: http://lkml.kernel.org/r/1486673582-6979-4-git-send-email-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill Tkhai 提交于
pid_ns_for_children set by a task is known only to the task itself, and it's impossible to identify it from outside. It's a big problem for checkpoint/restore software like CRIU, because it can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of their work. This patch solves the problem, and it exposes pid_ns_for_children to ns directory in standard way with the name "pid_for_children": ~# ls /proc/5531/ns -l | grep pid lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836] lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286] Link: http://lkml.kernel.org/r/149201123914.6007.2187327078064239572.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Cc: Andrei Vagin <avagin@virtuozzo.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill Tkhai 提交于
Patch series "Expose task pid_ns_for_children to userspace". pid_ns_for_children set by a task is known only to the task itself, and it's impossible to identify it from outside. It's a big problem for checkpoint/restore software like CRIU, because it can't correctly handle tasks, that do setns(CLONE_NEWPID) in proccess of their work. If they have a custom pid_ns_for_children before dump, they must have the same ns after restore. Otherwise, restored task bumped into enviroment it does not expect. This patchset solves the problem. It exposes pid_ns_for_children to ns directory in standard way with the name "pid_for_children": ~# ls /proc/5531/ns -l | grep pid lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid -> pid:[4026531836] lrwxrwxrwx 1 root root 0 Jan 14 16:38 pid_for_children -> pid:[4026532286] This patch (of 2): Make possible to have link content prefix yyy different from the link name xxx: $ readlink /proc/[pid]/ns/xxx yyy:[4026531838] This will be used in next patch. Link: http://lkml.kernel.org/r/149201120318.6007.7362655181033883000.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NAndrei Vagin <avagin@virtuozzo.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hari Bathini 提交于
Get rid of multiple definitions of append_elf_note() & final_note() functions. Reuse these functions compiled under CONFIG_CRASH_CORE Also, define Elf_Word and use it instead of generic u32 or the more specific Elf64_Word. Link: http://lkml.kernel.org/r/149035342324.6881.11667840929850361402.stgit@hbathini.in.ibm.comSigned-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: NDave Young <dyoung@redhat.com> Acked-by: NTony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hari Bathini 提交于
Patch series "kexec/fadump: remove dependency with CONFIG_KEXEC and reuse crashkernel parameter for fadump", v4. Traditionally, kdump is used to save vmcore in case of a crash. Some architectures like powerpc can save vmcore using architecture specific support instead of kexec/kdump mechanism. Such architecture specific support also needs to reserve memory, to be used by dump capture kernel. crashkernel parameter can be a reused, for memory reservation, by such architecture specific infrastructure. This patchset removes dependency with CONFIG_KEXEC for crashkernel parameter and vmcoreinfo related code as it can be reused without kexec support. Also, crashkernel parameter is reused instead of fadump_reserve_mem to reserve memory for fadump. The first patch moves crashkernel parameter parsing and vmcoreinfo related code under CONFIG_CRASH_CORE instead of CONFIG_KEXEC_CORE. The second patch reuses the definitions of append_elf_note() & final_note() functions under CONFIG_CRASH_CORE in IA64 arch code. The third patch removes dependency on CONFIG_KEXEC for firmware-assisted dump (fadump) in powerpc. The next patch reuses crashkernel parameter for reserving memory for fadump, instead of the fadump_reserve_mem parameter. This has the advantage of using all syntaxes crashkernel parameter supports, for fadump as well. The last patch updates fadump kernel documentation about use of crashkernel parameter. This patch (of 5): Traditionally, kdump is used to save vmcore in case of a crash. Some architectures like powerpc can save vmcore using architecture specific support instead of kexec/kdump mechanism. Such architecture specific support also needs to reserve memory, to be used by dump capture kernel. crashkernel parameter can be a reused, for memory reservation, by such architecture specific infrastructure. But currently, code related to vmcoreinfo and parsing of crashkernel parameter is built under CONFIG_KEXEC_CORE. This patch introduces CONFIG_CRASH_CORE and moves the above mentioned code under this config, allowing code reuse without dependency on CONFIG_KEXEC. There is no functional change with this patch. Link: http://lkml.kernel.org/r/149035338104.6881.4550894432615189948.stgit@hbathini.in.ibm.comSigned-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: NDave Young <dyoung@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Bit searching functions accept "unsigned long" indices but "nr_cpumask_bits" is "int" which is signed, so inevitable sign extensions occur on x86_64. Those MOVSX are #1 MOVSX bloat by number of uses across whole kernel. Change "nr_cpumask_bits" to unsigned, this number can't be negative after all. It allows to do implicit zero-extension on x86_64 without MOVSX. Change signed comparisons into unsigned comparisons where necessary. Other uses looks fine because it is either argument passed to a function or comparison is already unsigned. Net win on allyesconfig type of kernel: ~2.8 KB (!) add/remove: 0/0 grow/shrink: 8/725 up/down: 93/-2926 (-2833) function old new delta xen_exit_mmap 691 735 +44 qstat_read 426 440 +14 __cpufreq_cooling_register 1678 1687 +9 trace_rb_cpu_prepare 447 455 +8 vermagic 54 60 +6 nfp_driver_version 54 60 +6 rcu_torture_stats_print 1147 1151 +4 find_next_push_cpu 267 269 +2 xen_irq_resume 961 960 -1 ... init_vp_index 946 906 -40 od_set_powersave_bias 328 281 -47 power_cpu_exit 193 139 -54 arch_show_interrupts 3538 3484 -54 select_idle_sibling 1558 1471 -87 Total: Before=158358910, After=158356077, chg -0.00% Same arguments apply to "nr_cpu_ids" but I haven't yet found enough courage to delve into this issue (and proper fix may require new type "cpu_t" which is whole separate story). Link: http://lkml.kernel.org/r/20170309205322.GA1728@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthias Kaehlcke 提交于
jiffies_64 is defined in kernel/time/timer.c with ____cacheline_aligned_in_smp, however this macro is not part of the declaration of jiffies and jiffies_64 in jiffies.h. As a result clang generates the following warning: kernel/time/timer.c:57:26: error: section does not match previous declaration [-Werror,-Wsection] __visible u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES; ^ include/linux/cache.h:39:36: note: expanded from macro '__cacheline_aligned_in_smp' ^ include/linux/cache.h:34:4: note: expanded from macro '__cacheline_aligned' __section__(".data..cacheline_aligned"))) ^ include/linux/jiffies.h:77:12: note: previous attribute is here extern u64 __jiffy_data jiffies_64; ^ include/linux/jiffies.h:70:38: note: expanded from macro '__jiffy_data' Link: http://lkml.kernel.org/r/20170403190200.70273-1-mka@chromium.orgSigned-off-by: NMatthias Kaehlcke <mka@chromium.org> Cc: "Jason A . Donenfeld" <Jason@zx2c4.com> Cc: Grant Grundler <grundler@chromium.org> Cc: Michael Davidson <md@google.com> Cc: Greg Hackmann <ghackmann@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Preparation for making the decisions more complex and depending on compact_control flags. No functional change. Link: http://lkml.kernel.org/r/20170307131545.28577-6-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
When stealing pages from pageblock of a different migratetype, we count how many free pages were stolen, and change the pageblock's migratetype if more than half of the pageblock was free. This might be too conservative, as there might be other pages that are not free, but were allocated with the same migratetype as our allocation requested. While we cannot determine the migratetype of allocated pages precisely (at least without the page_owner functionality enabled), we can count pages that compaction would try to isolate for migration - those are either on LRU or __PageMovable(). The rest can be assumed to be MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily distinguish. This counting can be done as part of free page stealing with little additional overhead. The page stealing code is changed so that it considers free pages plus pages of the "good" migratetype for the decision whether to change pageblock's migratetype. The result should be more accurate migratetype of pageblocks wrt the actual pages in the pageblocks, when stealing from semi-occupied pageblocks. This should help the efficiency of page grouping by mobility. In testing based on 4.9 kernel with stress-highalloc from mmtests configured for order-4 GFP_KERNEL allocations, this patch has reduced the number of unmovable allocations falling back to movable pageblocks by 47%. The number of movable allocations falling back to other pageblocks are increased by 55%, but these events don't cause permanent fragmentation, so the tradeoff should be positive. Later patches also offset the movable fallback increase to some extent. [akpm@linux-foundation.org: merge fix] Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Olga Kornievskaia 提交于
Instead of messing with the commit path which has been causing issues, add a COMMIT op after the COPY and ask for stable copies in the first space. It saves a round trip, since after the COPY, the client sends a COMMIT anyway. Signed-off-by: NOlga Kornievskaia <kolga@netapp.com> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Dan Williams 提交于
For configurations that do not enable DAX filesystems or drivers, do not require the DAX core to be built. Given that the 'direct_access' method has been removed from 'block_device_operations', we can also go ahead and remove the block-related dax helper functions from fs/block_dev.c to drivers/dax/super.c. This keeps dax details out of the block layer and lets the DAX core be built as a module in the FS_DAX=n case. Filesystems need to include dax.h to call bdev_dax_supported(). Cc: linux-xfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.com> Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 08 5月, 2017 4 次提交
-
-
由 Eric Auger 提交于
this new helper synchronizes the irq pending_latch with the LPI pending bit status found in rdist pending table. As the status is consumed, we reset the bit in pending table. As we need the PENDBASER_ADDRESS() in vgic-v3, let's move its definition in the irqchip header. We restore the full length of the field, ie [51:16]. Same for PROPBASER_ADDRESS with full field length of [51:12]. Signed-off-by: NEric Auger <eric.auger@redhat.com> Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com> Reviewed-by: NChristoffer Dall <cdall@linaro.org>
-
由 Eric Auger 提交于
Up to now the MAPD's ITT size field has been ignored. It encodes the number of eventid bit minus 1. It should be used to check the eventid when a MAPTI command is issued on a device. Let's store the number of eventid bits in the its_device and do the check on MAPTI. Also make sure the ITT size field does not exceed the GITS_TYPER IDBITS field. Signed-off-by: NEric Auger <eric.auger@redhat.com> Reviewed-by: NChristoffer Dall <cdall@linaro.org> Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
-
由 Eric Auger 提交于
The GITS_IIDR revision field is used to encode the migration ABI revision. So we need to restore it to check the table layout is readable by the destination. By writing the IIDR, userspace thus forces the ABI revision to be used and this must be less than or equal to the max revision KVM supports. Signed-off-by: NEric Auger <eric.auger@redhat.com> Reviewed-by: NChristoffer Dall <cdall@linaro.org>
-
由 Eric Auger 提交于
We plan to support different migration ABIs, ie. characterizing the ITS table layout format in guest RAM. For example, a new ABI will be needed if vLPIs get supported for nested use case. So let's introduce an array of supported ABIs (at the moment a single ABI is supported though). The following characteristics are foreseen to vary with the ABI: size of table entries, save/restore operation, the way abi settings are applied. By default the MAX_ABI_REV is applied on its creation. In subsequent patches we will introduce a way for the userspace to change the ABI in use. The entry sizes now are set according to the ABI version and not hardcoded anymore. Signed-off-by: NEric Auger <eric.auger@redhat.com> Reviewed-by: NChristoffer Dall <cdall@linaro.org>
-
- 06 5月, 2017 2 次提交
-
-
由 Rafael J. Wysocki 提交于
The ACPI SCI (System Control Interrupt) is set up as a wakeup IRQ during suspend-to-idle transitions and, consequently, any events signaled through it wake up the system from that state. However, on some systems some of the events signaled via the ACPI SCI while suspended to idle should not cause the system to wake up. In fact, quite often they should just be discarded. Arguably, systems should not resume entirely on such events, but in order to decide which events really should cause the system to resume and which are spurious, it is necessary to resume up to the point when ACPI SCIs are actually handled and processed, which is after executing dpm_resume_noirq() in the system resume path. For this reasons, add a loop around freeze_enter() in which the platforms can process events signaled via multiplexed IRQ lines like the ACPI SCI and add suspend-to-idle hooks that can be used for this purpose to struct platform_freeze_ops. In the ACPI case, the ->wake hook is used for checking if the SCI has triggered while suspended and deferring the interrupt-induced system wakeup until the events signaled through it are actually processed sufficiently to decide whether or not the system should resume. In turn, the ->sync hook allows all of the relevant event queues to be flushed so as to prevent events from being missed due to race conditions. In addition to that, some ACPI code processing wakeup events needs to be modified to use the "hard" version of wakeup triggers, so that it will cause a system resume to happen on device-induced wakeup events even if the "soft" mechanism to prevent the system from suspending is not enabled (that also helps to catch device-induced wakeup events occurring during suspend transitions in progress). Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
由 Rafael J. Wysocki 提交于
The system wakeup framework is not very consistent with respect to the way it handles suspend-to-idle and generally wakeup events occurring during transitions to system low-power states. First off, system transitions in progress are aborted by the event reporting helpers like pm_wakeup_event() only if the wakeup_count sysfs attribute is in use (as documented), but there are cases in which system-wide transitions should be aborted even if that is not the case. For example, a wakeup signal from a designated wakeup device during system-wide PM transition, it should cause the transition to be aborted right away. Moreover, there is a freeze_wake() call in wakeup_source_activate(), but that really is only effective after suspend_freeze_state has been set to FREEZE_STATE_ENTER by freeze_enter(). However, it is very unlikely that wakeup_source_activate() will ever be called at that time, as it could only be triggered by a IRQF_NO_SUSPEND interrupt handler, so wakeups from suspend-to-idle don't really occur in wakeup_source_activate(). At the same time there is a way to abort a system suspend in progress (or wake up the system from suspend-to-idle), which is by calling pm_system_wakeup(), but in turn that doesn't cause any wakeup source objects to be activated, so it will not be covered by wakeup source statistics and will not prevent the system from suspending again immediately (in case autosleep is used, for example). Consequently, if anyone wants to abort system transitions in progress and allow the wakeup_count mechanism to work, they need to use both pm_system_wakeup() and pm_wakeup_event(), say, at the same time which is awkward. For the above reasons, make it possible to trigger pm_system_wakeup() from within wakeup_source_activate() and provide a new pm_wakeup_hard_event() helper to do so within the wakeup framework. Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 05 5月, 2017 4 次提交
-
-
由 Lukasz Luba 提交于
This patch introduces a new interface for device drivers connected to devfreq_cooling in the thermal framework: get_real_power(). Some devices have more sophisticated methods (like power counters) to approximate the actual power that they use. In the previous implementation we had a pre-calculated power table which was then scaled by 'utilization' ('busy_time' and 'total_time' taken from devfreq 'last_status'). With this new interface the driver can provide more precise data regarding actual power to the thermal governor every time the power budget is calculated. We then use this value and calculate the real resource utilization scaling factor. Reviewed-by: NChris Diamand <chris.diamand@arm.com> Acked-by: NJavi Merino <javi.merino@kernel.org> Signed-off-by: NLukasz Luba <lukasz.luba@arm.com>
-
由 David Howells 提交于
stat/lstat/fstatat need to pass AT_NO_AUTOMOUNT to vfs_statx() as the pre-statx code didn't set LOOKUP_AUTOMOUNT, even though fstatat() accepted the AT_NO_AUTOMOUNT flag. Fixes: a528d35e ("statx: Add a system call to make enhanced file info available") Reported-by: NIan Kent <raven@themaw.net> Signed-off-by: NDavid Howells <dhowells@redhat.com> Tested-by: NIan Kent <raven@themaw.net> cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Dan Williams 提交于
This is a preparation patch for handling locked nvdimm label regions, a new concept as introduced by the latest DSM document on pmem.io [1]. A future patch will leverage nvdimm_set_locked() at DIMM probe time to flag regions that can not be enabled. There should be no functional difference resulting from this change. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdfSigned-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Erez Shitrit 提交于
Enable mlx5 IPoIB acceleration by declaring mlx5_ib_{alloc,free}_rdma_netdev and assigning the mlx5 IPoIB rdma_netdev callbacks. In addition, this patch brings in sync mlx5's IPoIB parts for net and IB trees. As a precaution, we disabled IPoIB acceleration by default (in the mlx5_core Kconfig file). Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com> Signed-off-by: NErez Shitrit <erezsh@mellanox.com> Signed-off-by: NLeon Romanovsky <leon@kernel.org> Signed-off-by: NDoug Ledford <dledford@redhat.com>
-