- 07 September 2017, 14 commits
-
-
Submitted by Jan Kara:
All users of pagevec_lookup() and pagevec_lookup_range() now pass PAGEVEC_SIZE as the desired number of pages. Just drop the argument.

Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Jan Kara:
Implement a variant of find_get_pages() that stops iterating at a given index. This can be a substantial performance gain if the mapping is sparse; see the following commit for details. Furthermore, lots of users of this function (through pagevec_lookup()) actually want a range lookup, and all of them are currently open-coding it. Also create a corresponding pagevec_lookup_range() function.

Link: http://lkml.kernel.org/r/20170726114704.7626-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
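The ranged lookup lends itself to a simple iteration pattern. Below is a minimal sketch of a caller, assuming the post-series form of the API where the start index is passed by reference and advanced past the last page returned; it is illustrative, not the kernel's exact code.

#include <linux/pagevec.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

static void walk_range_sketch(struct address_space *mapping,
			      pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;
	pgoff_t index = start;
	unsigned i;

	pagevec_init(&pvec, 0);
	/* Stops at 'end', so a sparse mapping is not walked to its last page. */
	while (pagevec_lookup_range(&pvec, mapping, &index, end)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];
			/* ... operate on 'page'; it is referenced, not locked ... */
		}
		pagevec_release(&pvec);
		cond_resched();
	}
}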
-
Submitted by Jan Kara:
Make pagevec_lookup() (and the underlying find_get_pages()) update the index to the next page at which iteration should continue. Most callers want this, and pagevec_lookup_tag() already does it.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Jan Kara:
Patch series "Ranged pagevec lookup", v2.

In this series I make pagevec_lookup() update the index (to be consistent with pagevec_lookup_tag() and also as a preparation for ranged lookups), provide a ranged variant of pagevec_lookup() and use it in places where it makes sense. This not only removes some common code but is also a measurable performance win for some use cases (see patch 4/10) where the radix tree is sparse and searching for & grabbing a page after the end of the range has measurable overhead.

This patch (of 10): The callback doesn't ever get called. Remove it.

Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko:
zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix potential race while building zonelist for new populated zone") to protect zonelist building from races. This is no longer needed, though, because both memory online and offline are fully serialized. New users have grown since then, however. Notably, setup_per_zone_wmarks wants to prevent races between memory hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl (see cfd3da1e ("mm: Serialize access to min_free_kbytes")). Let's add a private lock for that purpose. This will not prevent us from seeing a memory hotplug operation halfway through, but that shouldn't be a big deal because memory hotplug updates watermarks explicitly, so we will eventually get the full picture. The lock just makes sure we won't race when updating watermarks, which could lead to weird results.

Also, __build_all_zonelists manipulates global data, so add a private lock for it as well. This doesn't seem to be necessary today, but it is more robust to have a lock there.

While we are at it, make sure we document that memory online/offline depends on full serialization, either via mem_hotplug_begin() or device_lock.

Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Haicheng Li <haicheng.li@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
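The replacement pattern on the watermark side is simply a file-scope lock around the recomputation. A minimal sketch under assumed names (the lock name and the function split are illustrative, not necessarily the patch's identifiers):

#include <linux/spinlock.h>

/* Hypothetical private lock replacing zonelists_mutex for watermark updates. */
static DEFINE_SPINLOCK(wmark_update_lock);

static void __recompute_wmarks_sketch(void)
{
	/* ... recompute per-zone watermarks from min_free_kbytes ... */
}

void setup_per_zone_wmarks_sketch(void)
{
	/* Serializes memory hotplug, khugepaged setup and sysctl updates. */
	spin_lock(&wmark_update_lock);
	__recompute_wmarks_sketch();
	spin_unlock(&wmark_update_lock);
}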
-
Submitted by Michal Hocko:
build_all_zonelists gets a zone parameter to initialize the zone's pagesets. There is only a single user which passes a non-NULL zone parameter, and that one doesn't really need the rest of build_all_zonelists (see commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before onlining pages")). Therefore remove setup_zone_pageset from build_all_zonelists and call it from its only user directly. This also removes a pointless zonelists rebuilding, which is always good.

Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko:
Patch series "cleanup zonelists initialization", v1.

This is aimed at cleaning up the zonelists initialization code we have, but the primary motivation was bug report [2]; it got resolved, but the usage of stop_machine is just too ugly to live. Most patches are straightforward, but 3 of them need special consideration.

Patch 1 removes zone-ordered zonelists completely. I am CCing linux-api because this is a user-visible change. As I argue in the patch description, I do not think we have a strong use case for it these days. I have kept the sysctl in place and warn into the log if somebody tries to configure zone-ordered zonelists. If somebody has a real use case for it we can revert this patch, but I do not expect anybody will actually notice runtime differences. This patch is not strictly needed for the rest, but it made patch 6 easier to implement.

Patch 7 removes stop_machine from build_all_zonelists without adding any special synchronization between iterators and the updater, which I _believe_ is acceptable as explained in the changelog. I hope I am not missing anything.

Patch 8 then removes zonelists_mutex, which is kind of ugly as well and not really needed AFAICS, but care should be taken when double-checking my thinking.

This patch (of 9): Supporting zone-ordered zonelists costs us just a lot of code while the usefulness is arguable, if existent at all. Mel has already made node ordering the default on 64b systems. 32b systems are still using ZONELIST_ORDER_ZONE because it is considered better to fall back to a different NUMA node rather than consume precious lowmem zones. This argument is, however, weakened by the fact that memory reclaim has been reworked to be node- rather than zone-oriented. This means that lowmem requests have to skip over all highmem pages on LRUs already, and so zone ordering doesn't save much reclaim time. So the only advantage of zone ordering is under light memory pressure, when highmem requests do not ever hit lowmem zones and the lowmem pressure doesn't need to reclaim. Considering that 32b NUMA systems are rather suboptimal already, and that it is generally advisable to use a 64b kernel on such HW, I believe we should rather care about code maintainability and just get rid of ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if somebody tries to set zone ordering either from the kernel command line or via the sysctl.

[mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko:
Prior to commit f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") we used to allow changing the valid zone types of a memory block if it was adjacent to a different zone type. This fact was reflected in memoryNN/valid_zones by the ordering of printed zones. The first one was the default (echo online > memoryNN/state) and the other one could be onlined explicitly by online_{movable,kernel}. This behavior was removed by the said patch, and as such the ordering was not all that important anymore. In most cases a kernel zone would be the default anyway. The only exception is movable_node, handled by "mm, memory_hotplug: support movable_node for hotpluggable nodes".

Let's reintroduce this behavior, because a later patch will remove the zone overlap restriction, and so users will be allowed to online a kernel (resp. movable) block regardless of its placement. The original behavior will then become significant again, because it would be non-trivial for users to see which zone is the default to online into.

The implementation is really simple. Pull zone selection out of move_pfn_range into a zone_for_pfn_range helper and use it in show_valid_zones to display the zone for default onlining, and then both kernel and movable if they are allowed. The default online zone is not duplicated.

Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kani Toshimitsu <toshi.kani@hpe.com>
Cc: <slaoub@gmail.com>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Chris Wilson:
Some shrinkers may only be able to free a bunch of objects at a time, and so free more than the requested nr_to_scan in one pass, whilst other shrinkers may find themselves unable to scan as many objects as they counted, and so underreport. Account for the extra freed/scanned objects against the total number of objects we intend to scan, otherwise we may end up penalising the slab far more than intended. Similarly, we want to add the underperforming scan to the deferred pass so that we try harder and harder in future passes.

Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
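A sketch of the accounting idea follows; it is a simplification under assumed names, not the exact mm/vmscan.c code, but it shows the key change: the amount the shrinker reports as actually scanned (which may overshoot or undershoot the requested batch) is what gets charged against the scan budget and the deferred count.

#include <linux/shrinker.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static unsigned long scan_slab_sketch(struct shrinker *shrinker,
				      struct shrink_control *sc,
				      unsigned long total_scan,
				      unsigned long freeable,
				      unsigned long batch_size,
				      unsigned long *deferred)
{
	unsigned long freed = 0, scanned = 0;

	while (total_scan >= batch_size || total_scan >= freeable) {
		unsigned long nr_to_scan = min(batch_size, total_scan);
		unsigned long ret;

		sc->nr_to_scan = nr_to_scan;
		sc->nr_scanned = nr_to_scan;	/* the shrinker may rewrite this */
		ret = shrinker->scan_objects(shrinker, sc);
		if (ret == SHRINK_STOP)
			break;
		freed += ret;			/* may exceed nr_to_scan */

		total_scan -= sc->nr_scanned;	/* charge what was really scanned */
		scanned += sc->nr_scanned;
		cond_resched();
	}

	/* An underperforming scan is deferred, so later passes try harder. */
	*deferred = (*deferred >= scanned) ? *deferred - scanned : 0;
	return freed;
}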
-
Submitted by Kees Cook:
This SLUB free list pointer obfuscation code is modified from Brad Spengler/PaX Team's code in the last public patch of grsecurity/PaX, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code.

This adds a per-cache random value to SLUB caches that is XORed with their freelist pointer address and value. This adds nearly zero overhead and frustrates the very common heap-overflow exploitation method of overwriting freelist pointers. A recent example of the attack is written up here: http://cyseclabs.com/blog/cve-2016-6187-heap-off-by-one-exploit and there is a section dedicated to the technique in the book "A Guide to Kernel Exploitation: Attacking the Core". This is based on patches by Daniel Micay, and refactored to minimize the use of #ifdef.

With 200-count cycles of "hackbench -g 20 -l 1000" I saw the following run times:

before: mean 10.11882499999999999995, variance .03320378329145728642, stdev .18221905304181911048
after: mean 10.12654000000000000014, variance .04700556623115577889, stdev .21680767106160192064

The difference gets lost in the noise, but if the above is to be taken literally, using CONFIG_FREELIST_HARDENED is 0.07% slower.

Link: http://lkml.kernel.org/r/20170802180609.GA66807@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Daniel Micay <danielmicay@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tycho Andersen <tycho@docker.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
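The obfuscation itself boils down to a cheap XOR applied whenever a freelist pointer is stored or loaded. The sketch below illustrates the idea; helper names and the config symbol spelling are my approximation of the SLUB change, not a verbatim copy.

#include <linux/slub_def.h>

/*
 * Each cache keeps a random value (assumed here to be s->random) that is
 * XORed with both the freelist pointer being stored and the address it is
 * stored at.  A plain heap overwrite of the pointer then decodes to
 * garbage, frustrating the classic freelist-hijack exploit.
 */
static inline void *freelist_ptr_sketch(const struct kmem_cache *s, void *ptr,
					unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
	/* XOR is an involution, so encoding and decoding are the same operation. */
	return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
#else
	return ptr;
#endif
}

static inline void *get_freepointer_sketch(struct kmem_cache *s, void *object)
{
	unsigned long addr = (unsigned long)object + s->offset;

	return freelist_ptr_sketch(s, *(void **)addr, addr);
}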
-
Submitted by Ross Zwisler:
Now that we no longer insert struct page pointers in DAX radix trees, the page cache code no longer needs to know anything about DAX exceptional entries. Move all the DAX exceptional entry definitions from dax.h to fs/dax.c.

Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Ross Zwisler:
Now that we no longer insert struct page pointers in DAX radix trees, we can remove the special casing for DAX in page_cache_tree_insert(). This also allows us to make dax_wake_mapping_entry_waiter() local to fs/dax.c, removing it from dax.h.

Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Ross Zwisler:
When servicing mmap() reads from file holes, the current DAX code allocates a page cache page of all zeroes and places the struct page pointer in the mapping->page_tree radix tree. This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via a DAX mmap() over a hole, we allocate a new page cache page. This means that if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory. This is easily visible by looking at the overall memory consumption of the system or by looking at /proc/[pid]/smaps:

7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
Size:            1048576 kB
Rss:             1048576 kB
Pss:             1048576 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:   1048576 kB
Private_Dirty:         0 kB
Referenced:      1048576 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

2) It is slower than using a common zero page because each page fault has more work to do. Instead of just inserting a common zero page we have to allocate a page cache page, zero it, and then insert it. Here are the average latencies of dax_load_hole() as measured by ftrace on a random test box:

Old method, using zeroed page cache pages: 3.4 us
New method, using the common 4k zero page: 0.8 us

This was the average latency over 1 GiB of sequential reads done by this simple fio script:

[global]
size=1G
filename=/root/dax/data
fallocate=none
[io]
rw=read
ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and for page cache pages in the radix tree made the DAX code more complex.

Solve these issues by following the lead of the DAX PMD code and using a common 4k zero page instead. As with the PMD code we will now insert a DAX exceptional entry into the radix tree instead of a struct page pointer, which allows us to remove all the special casing in the DAX code.

Note that we do still pretty aggressively check for regular pages in the DAX radix tree, especially where we take action based on the bits set in the page. If we ever find a regular page in our radix tree now, that most likely means that someone besides DAX is inserting pages (which has happened lots of times in the past), and we want to find that out early and fail loudly.

This solution also removes the extra memory consumption. Here is that same /proc/[pid]/smaps after 1GiB of reading from a hole with the new code:

7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
Size:            1048576 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault flow, and instead rely on the page fault itself to make the PTE dirty and writeable. The following description from the patch adding the vm_insert_mixed_mkwrite() call explains this a little more:

"To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path, where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page():

case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage

This distinction is made via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set, insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path."

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Ross Zwisler:
When servicing mmap() reads from file holes, the current DAX code allocates a page cache page of all zeroes and places the struct page pointer in the mapping->page_tree radix tree. This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via a DAX mmap() over a hole, we allocate a new page cache page. This means that if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory.

2) It is slower than using a common zero page because each page fault has more work to do. Instead of just inserting a common zero page we have to allocate a page cache page, zero it, and then insert it.

3) The fact that we had to check for both DAX exceptional entries and for page cache pages in the radix tree made the DAX code more complex.

This series solves these issues by following the lead of the DAX PMD code and using a common 4k zero page instead. This reduces memory usage and decreases latencies for some workloads, and it simplifies the DAX code, removing over 100 lines in total.

This patch (of 5): To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path, where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page():

case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage

This distinction is made via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set, insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path.

Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
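The 'mkwrite' behaviour described above can be pictured with a simplified sketch of the PTE insertion path; it is an illustration of the idea under assumed helper usage, not the exact mm/memory.c code.

#include <linux/mm.h>
#include <linux/pfn_t.h>

static int insert_pfn_sketch(struct vm_area_struct *vma, unsigned long addr,
			     pfn_t pfn, pgprot_t prot, bool mkwrite)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;
	pte_t *pte, entry;

	pte = get_locked_pte(mm, addr, &ptl);
	if (!pte)
		return -ENOMEM;

	if (!pte_none(*pte)) {
		/* Only an already-present mapping of the same pfn may be upgraded. */
		if (!mkwrite || pte_pfn(*pte) != pfn_t_to_pfn(pfn)) {
			pte_unmap_unlock(pte, ptl);
			return -EBUSY;
		}
		entry = *pte;
	} else {
		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
	}

	if (mkwrite)	/* do what wp_page_reuse() used to do for DAX */
		entry = maybe_mkwrite(pte_mkdirty(pte_mkyoung(entry)), vma);

	set_pte_at(mm, addr, pte, entry);
	update_mmu_cache(vma, addr, pte);
	pte_unmap_unlock(pte, ptl);
	return 0;
}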
-
- 02 September 2017, 1 commit
-
-
Submitted by Ido Schimmel:
The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the device in case a port is enslaved to a master netdev such as a bridge or a bond. Since the driver ignores events unrelated to its ports and their uppers, it's possible to engineer situations in which the device's data path differs from the kernel's.

One example of such a situation is when a port is enslaved to a bond that is already enslaved to a bridge. When the bond was enslaved, the driver ignored the event - as the bond wasn't one of its uppers - and therefore a bridge port instance isn't created in the device.

Until such configurations are supported, forbid them by checking that the upper device doesn't have uppers of its own.

Fixes: 0d65fc13 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nogah Frankel <nogahf@mellanox.com>
Tested-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
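The guard is conceptually a one-liner in the driver's PRECHANGEUPPER handling. A hedged sketch (simplified; not the full mlxsw_spectrum logic):

#include <linux/netdevice.h>

static int port_prechangeupper_sketch(struct net_device *port_dev,
				      struct netdev_notifier_changeupper_info *info)
{
	struct net_device *upper_dev = info->upper_dev;

	/*
	 * Refuse enslavement if the prospective upper (e.g. a bond) already
	 * has uppers of its own (e.g. a bridge); otherwise the device's data
	 * path would silently diverge from the kernel's.
	 */
	if (netdev_has_any_upper_dev(upper_dev))
		return -EINVAL;

	return 0;
}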
-
- 01 September 2017, 9 commits
-
-
Submitted by Colin Cross:
The BINDER_GET_NODE_DEBUG_INFO ioctl will return debug info on a node. Each successive call reusing the previous return value will return the next node. The data will be used by libmemunreachable to mark the pointers with kernel references as reachable.

Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
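From userspace the iteration contract looks roughly like the sketch below. The struct layout and the zero-ptr termination condition are my assumptions about the UAPI, hedged accordingly; only the feed-the-previous-value-back-in behaviour comes from the text above.

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/android/binder.h>

static void dump_binder_nodes(int binder_fd)
{
	struct binder_node_debug_info info;

	memset(&info, 0, sizeof(info));		/* start before the first node */
	for (;;) {
		if (ioctl(binder_fd, BINDER_GET_NODE_DEBUG_INFO, &info) < 0)
			break;
		if (!info.ptr)			/* assumed end-of-list marker */
			break;
		printf("node ptr=%llx cookie=%llx strong=%u weak=%u\n",
		       (unsigned long long)info.ptr,
		       (unsigned long long)info.cookie,
		       info.has_strong_ref, info.has_weak_ref);
		/* info.ptr is fed back in, so the next call returns the next node. */
	}
}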
-
Submitted by Joe Stringer:
Commit c7acec71 ("kernel.h: handle pointers to arrays better in container_of()") made use of __compiletime_assert() from container_of(), thus increasing the usage of this macro and allowing developers to notice type conflicts in usage of container_of() at compile time. However, the implementation of __compiletime_assert relies on compiler optimizations to report an error. This means that if a developer uses "-O0" with any code that performs container_of(), the compiler will always report an error regardless of whether there is an actual problem in the code.

This patch disables compile_time_assert when optimizations are disabled to allow such code to compile with CFLAGS="-O0".

Example compilation failure:

./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of()
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                      ^
./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert'
    prefix ## suffix(); \
    ^~~~~~
./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^~~~~~~~~~~~~~~~~~~
./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^~~~~~~~~~~~~~~~~~
./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
  ^~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: use do{}while(0), per Michal]
Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org
Fixes: c7acec71 ("kernel.h: handle pointers to arrays better in container_of()")
Signed-off-by: Joe Stringer <joe@ovn.org>
Cc: Ian Abbott <abbotti@mev.co.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
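The shape of the fix is an optimization-level guard around the assertion macro, roughly as sketched here (simplified from include/linux/compiler.h; treat it as an approximation rather than the exact hunk):

/*
 * Only rely on the compiler-error trick when the optimizer can eliminate the
 * call below for a true condition; at -O0 it never would be, so the assertion
 * degrades to a no-op instead of a guaranteed build failure.
 */
#ifdef __OPTIMIZE__
# define __compiletime_assert(condition, msg, prefix, suffix)		\
	do {								\
		bool __cond = !(condition);				\
		extern void prefix ## suffix(void) __compiletime_error(msg); \
		if (__cond)						\
			prefix ## suffix();				\
	} while (0)
#else
# define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0)
#endif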
-
Submitted by Jérôme Glisse:
The invalidate_page callback suffered from two pitfalls. First, it used to happen after the page table lock was released, and thus a new page might have been set up before the call to invalidate_page() happened. This is fixed, in a weird way, by commit c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()"), which moved the callback under the page table lock; but this also broke several existing users of the mmu_notifier API that assumed they could sleep inside this callback.

The second pitfall was that invalidate_page() was the only callback not taking an address range for the invalidation; it was given a single address and a page. Lots of the callback implementers assumed this could never be THP and thus failed to invalidate the appropriate range for THP.

By killing this callback we unify the mmu_notifier callback API to always take a virtual address range as input. Finally, this also simplifies life for end users, as there are now two clear choices:
- the invalidate_range_start()/end() callbacks (which allow you to sleep)
- invalidate_range(), where you cannot sleep, but which happens right after the page table update, under the page table lock

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Jérôme Glisse:
Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range() and make sure they are bracketed by calls to *_invalidate_range_start()/end(). Note that because we cannot presume the pmd or pte value, we have to assume the worst and unconditionally report an invalidation as happening.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
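The conversion pattern at a typical call site looks roughly like this sketch (illustrative only; the signatures are the mm/start/end form this kernel generation used):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void unmap_one_page_sketch(struct mm_struct *mm, unsigned long address)
{
	unsigned long start = address & PAGE_MASK;
	unsigned long end = start + PAGE_SIZE;

	/* Before: a single mmu_notifier_invalidate_page(mm, address) call. */
	mmu_notifier_invalidate_range_start(mm, start, end);

	/* ... clear or replace the PTE under the page table lock ... */

	/*
	 * Report the invalidation unconditionally, since the old pte/pmd
	 * value cannot be presumed here.
	 */
	mmu_notifier_invalidate_range_end(mm, start, end);
}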
-
Submitted by Bart Van Assche:
copy_in_user() copies data from user-space address @from to user-space address @to. Hence declare both @from and @to as user-space pointers.

Fixes: commit d597580d ("generic ...copy_..._user primitives")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Submitted by Christoph Hellwig:
[AV: added missing annotations in syscalls.h/compat.h]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Submitted by Al Viro:
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Submitted by Jaghathiswari Rankappagounder Natarajan:
This patch changes the generic w1.h/w1.c files to add (optional) hwmon support structures.

Signed-off-by: Jaghathiswari Rankappagounder Natarajan <jaghu@google.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Submitted by Baolin Wang:
Move the USB phy NULL check before issuing usb_phy_set_charger_current() to avoid an unchecked-dereference warning.

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 31 August 2017, 16 commits
-
-
Submitted by Jonathan Corbet:
The kerneldoc comment for the genpool_algo_t typedef was incomplete and incorrectly formatted, leading to a raft of warnings during the docs build. Fix it appropriately.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-
Submitted by Marc Zyngier:
As KVM needs to know about the availability of GICv4 to enable direct injection of interrupts, let's advertise the feature in the gic_kvm_info structure.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
Get the show on the road...

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
Add the required interfaces to map, unmap and update a VLPI.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
Add the required interfaces to schedule a VPE and perform a VINVALL command.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
When creating a VM, it is very convenient to have an irq domain containing all the doorbell interrupts associated with that VM (each interrupt representing a VPE).

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
A long time ago, GITS_CTLR[1] used to be called GITS_CTLR.EnableVLPI. It has subsequently been deprecated and is now an "Implementation Defined" bit that may or may not be set for GICv4. Brilliant. And the current crop of the FastModel requires that bit for VLPIs to be enabled. Oh well... Let's set it and find out what breaks.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
When we don't have the DirectLPI feature, we must work around the architecture's shortcomings to be able to perform the required maintenance (interrupt masking, clearing and injection).

For this, we create a fake device whose sole purpose is to provide a way to issue commands as if we were dealing with LPIs coming from that device (while they actually originate from the ITS). This fake device doesn't have LPIs allocated to it, but instead uses the VPE LPIs.

Of course, this could be a real bottleneck, and a naive implementation would require 6 commands to issue an invalidation. Instead, let's allocate at least one event per physical CPU (rounded up to the next power of 2), and opportunistically map the VPE doorbell to an event. This doorbell will stay mapped until we roll over and need to reallocate this slot. This ensures that most of the time we only need 2 commands to issue an INV, INT or CLEAR, making the performance a lot better, given that we always issue a CLEAR on entry and an INV on each side of a trapped WFI.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
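The event sizing mentioned above maps to a small helper along these lines (an illustrative sketch, not the driver's actual allocation code; num_possible_cpus() stands in for "physical CPUs"):

#include <linux/cpumask.h>
#include <linux/log2.h>

/* At least one proxy event per CPU, rounded up to the next power of two. */
static unsigned int vpe_proxy_nr_events_sketch(void)
{
	return roundup_pow_of_two(num_possible_cpus());
}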
-
Submitted by Marc Zyngier:
When a VPE is scheduled to run, the corresponding redistributor must be told so, by setting VPROPBASER to the VM's property table and VPENDBASER to the vcpu's pending table. When scheduled out, we preserve the IDAI and PendingLast bits. The latter is especially important, as it tells the hypervisor that there are pending interrupts for this vcpu.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
V{PEND,PROP}BASER being 64-bit registers, they need some ad-hoc accessors on 32-bit, especially given that VPENDBASER contains a Valid bit, making the access a bit convoluted.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
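On a 32-bit build such a register has to be accessed as two halves, and the half carrying the Valid bit should be written last when setting it. The sketch below illustrates the problem being solved; the bit position and helper names are assumptions for illustration, not the driver's actual accessors.

#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/types.h>

#define SKETCH_VPENDBASER_VALID	(1ULL << 63)	/* assumed Valid bit position */

static void sketch_write_vpendbaser(u64 val, void __iomem *addr)
{
	u32 lo = lower_32_bits(val);
	u32 hi = upper_32_bits(val);

	/* Lower half first, so Valid only appears once all fields are in place. */
	writel_relaxed(lo, addr);
	writel_relaxed(hi, addr + 4);
}

static u64 sketch_read_vpendbaser(void __iomem *addr)
{
	u64 lo = readl_relaxed(addr);
	u64 hi = readl_relaxed(addr + 4);

	/* Note: the two halves are not read atomically on 32-bit. */
	return lo | (hi << 32);
}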
-
Submitted by Marc Zyngier:
Add the new GICv4 ITS command definitions, most of them being defined in terms of their physical counterparts.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Marc Zyngier:
Add a bunch of GICv4-specific data structures that will get used in subsequent patches.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
Submitted by Matan Barak:
In order to use the parsing tree, we need to assign the root to all drivers. Currently, we just assign the default parsing tree via ib_uverbs_add_one. A driver could override this by assigning a parsing tree prior to registering the device.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Submitted by Matan Barak:
Adding CQ ioctl actions:
1. create_cq
2. destroy_cq

This requires adding the following:

1. A specification describing the method
   a. Handler
   b. Attributes specification
   Each attribute is one of the following:
   a. PTR_IN - input data (note: this could be encoded inline for data < 64 bits)
   b. PTR_OUT - response data
   c. IDR - idr-based object
   d. FD - fd-based object
   Blob attributes (clauses a and b) contain their type, while object specifications (clauses c and d) contain the expected object type (for example, the given id should be UVERBS_TYPE_PD) and the required access (READ, WRITE, NEW or DESTROY). If NEW is required, the new object's id will be assigned to this attribute. All attributes can get a UA_FLAGS attribute. Currently we support stating that an attribute is mandatory or that the specification size corresponds to a lower bound (and that this attribute could be extended). We currently add both default attributes and the two generic UHW_IN and UHW_OUT driver-specific attributes.

2. Handler
   A handler gets a uverbs_attr_bundle. The handler developer uses uverbs_attr_get to fetch an attribute of a given id. Each of these attribute groups corresponds to the specification group defined in the action (clauses 1.b and 1.c respectively). The indices of these arrays correspond to the attribute ids declared in the specifications (clause 2). The handler is quite simple: it assumes the infrastructure fetched all objects and locked, created or destroyed them as required by the specification. Pointer (or blob) attributes were validated to match their required sizes. After the handler finishes, the infrastructure commits or rolls back the objects.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Submitted by Matan Barak:
In this phase, we don't want to change all the drivers to use flexible driver-specific attributes. Therefore, we add two default attributes: UHW_IN and UHW_OUT. These attributes are optional in some methods and they encode the driver-specific command data. We add a function that extracts this data and creates the legacy udata over it.

Driver data should start from UVERBS_UDATA_DRIVER_DATA_FLAG. This turns on the first bit of the namespace, indicating that this attribute belongs to the driver's namespace.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
Submitted by Matan Barak:
Add a new ib_user_ioctl_verbs.h which exports all required ABI enums and structs to user space. Export the default types to user space through this file.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-