- 06 Oct 2017, 2 commits
-
By Jens Axboe
After commit b35bd0d9, pdflush_proc_obsolete() is no longer used. Kill the function and declaration. Reported-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
We already have a queue_is_rq_based helper to check if a request_queue is request based, so we can remove the flag for it. Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Oct 2017, 1 commit
-
By Jens Axboe
Handle start-all writeback like we do periodic or kupdate style writeback - by marking the bdi_writeback as needing a full flush, and simply waking the thread. This eliminates the need to allocate and queue a specific work item just for this purpose. After this change, we truly only ever have one of them running at any point in time. We mark the need to start all flushes, and the writeback thread will clear it once it has processed the request. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 03 Oct 2017, 6 commits
-
By Jens Axboe
When someone calls wakeup_flusher_threads() or wakeup_flusher_threads_bdi(), they schedule writeback of all dirty pages in the system (or on that bdi). If we are tight on memory, we can get tons of these queued from kswapd/vmscan. This causes (at least) two problems: 1) We consume a ton of memory just allocating writeback work items. We've seen as much as 600 million of these writeback work items pending. That's a lot of memory to pointlessly hold hostage, while the box is under memory pressure. 2) We spend so much time processing these work items, that we introduce a softlockup in writeback processing. This is because each of the writeback work items ends up doing no work (it's hard when you have millions of identical ones coming in to the flush machinery), so we just sit in a tight loop pulling work items and deleting/freeing them. Fix this by adding a 'start_all' bit to the writeback structure, and set that when someone attempts to flush all dirty pages. The bit is cleared when we start writeback on that work item. If the bit is already set when we attempt to queue !nr_pages writeback, then we simply ignore it. This provides us one full flush in flight, with one pending as well, and makes for more efficient handling of this type of writeback. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
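A minimal self-contained sketch of the 'start_all' idea described above (names and layout are illustrative, not the exact mainline symbols; the kernel uses atomic bitops on wb->state):

  #include <stdbool.h>

  struct wb_sketch {
          unsigned long state;            /* bit 0: full flush requested */
  };
  #define WB_START_ALL (1UL << 0)

  /* Request a full flush; a second request while one is pending is a no-op. */
  static bool request_full_flush(struct wb_sketch *wb)
  {
          if (wb->state & WB_START_ALL)
                  return false;           /* already pending: ignore */
          wb->state |= WB_START_ALL;      /* mainline: test_and_set_bit() */
          return true;                    /* caller wakes the flusher thread */
  }

  /* Flusher thread: clear the bit once writeback actually starts. */
  static void flusher_iteration(struct wb_sketch *wb)
  {
          if (wb->state & WB_START_ALL) {
                  wb->state &= ~WB_START_ALL;
                  /* ... write back all dirty pages ... */
          }
  }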
-
By Jens Axboe
We don't have any callers outside of fs-writeback.c anymore, make it private. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
Similar to wakeup_flusher_threads(), except that we only wake up the flusher threads on the specified backing device. No functional changes in this patch. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
All the callers pass in 'true' for range_cyclic, so kill the argument. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
Everybody is passing in 0 now, let's get rid of the argument. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
Instead of adding weird retry logic in that function, utilize __GFP_NOFAIL to ensure that the VM takes care of handling any potential retries appropriately. This means we don't have to call free_more_memory() from here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
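For illustration, the shape of the change (a hedged sketch, not the actual buffer-head code):

  /* before: open-coded retry loop, nudging reclaim by hand */
  static void *alloc_old(size_t size)
  {
          void *p;

          while (!(p = kmalloc(size, GFP_NOFS)))
                  ;       /* free_more_memory(), then retry */
          return p;
  }

  /* after: __GFP_NOFAIL lets the VM handle retries; the call may
   * block, but it will not return NULL */
  static void *alloc_new(size_t size)
  {
          return kmalloc(size, GFP_NOFS | __GFP_NOFAIL);
  }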
-
- 27 Sep 2017, 1 commit
-
By Shaohua Li
The code is only for blkcg, not for all cgroups. Fixes: d4478e92 ("block/loop: make loop cgroup aware") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 Sep 2017, 4 commits
-
By Shaohua Li
bio_blkcg is currently the only API to get cgroup info for a bio. If bio_blkcg finds that the current task is a kthread with an original blkcg associated, it will use that css instead of associating the bio with the current task. This makes it possible for a kthread to dispatch bios on behalf of other threads. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
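The resulting lookup order, as a hedged kernel-context sketch (simplified and renamed; kthread_blkcg() refers to the kthread cgroup mechanism added below):

  static inline struct cgroup_subsys_state *bio_css_sketch(struct bio *bio)
  {
          struct cgroup_subsys_state *css;

          if (bio && bio->bi_css)
                  return bio->bi_css;     /* bio already associated */

          css = kthread_blkcg();          /* kthread acting for another task? */
          if (css)
                  return css;

          return task_css(current, io_cgrp_id);  /* fall back to current */
  }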
-
By Shaohua Li
Nobody uses the APIs right now. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Shaohua Li
A kthread usually runs jobs on behalf of other threads. The jobs should be charged to the cgroup of the original threads. But the jobs run in a kthread, where we lose the cgroup context of the original threads. This patch adds a mechanism to record the cgroup info of the original threads in the kthread context. Later we can retrieve the cgroup info and attach it to the jobs. Since this mechanism is only required by kthreads, we store the cgroup info in kthread data instead of the generic task_struct. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By James Smart
Comments were incorrect: defer_rcv was in the host port template; it has been moved to the target port template. Also added "Mandatory" statements for the target port template items. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 Sep 2017, 3 commits
-
By James Smart
Add transport SGL definitions from NVMe TP 4008, required for the final NVMe-FC standard. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By James Smart
The NVM Express group rescinded the reserved range for the transport. Remove the FC-centric values that had been defined. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Waiman Long
The lockdep code had reported the following unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#228); lock(&bdev->bd_mutex/1); lock(s_active#228); lock(&bdev->bd_mutex); *** DEADLOCK *** The deadlock may happen when one task (CPU1) is trying to delete a partition in a block device and another task (CPU0) is accessing a tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that partition. The s_active isn't an actual lock. It is a reference count (kn->count) on the sysfs (kernfs) file. Removal of a sysfs file, however, requires a wait until all the references are gone. The reference count is treated like a rwsem using lockdep instrumentation code. The fact that a thread is in the sysfs callback method or in the ioctl call means there is a reference to the opened sysfs or device file. That should prevent the underlying block structure from being removed. Instead of using bd_mutex in the block_device structure, a new blk_trace_mutex is now added to the request_queue structure to protect access to the blk_trace structure. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fix typo in patch subject line, and prune a comment detailing how the code used to work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
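Sketch of the shape of the fix (simplified; only the locking change is shown, all other fields elided):

  struct request_queue {
          /* ... fields elided ... */
          struct mutex    blk_trace_mutex;        /* new: protects q->blk_trace */
  };

  int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
  {
          struct request_queue *q = bdev_get_queue(bdev);

          mutex_lock(&q->blk_trace_mutex);        /* was: &bdev->bd_mutex */
          /* ... setup / start / stop / teardown of q->blk_trace ... */
          mutex_unlock(&q->blk_trace_mutex);
          return 0;
  }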
-
- 22 Sep 2017, 3 commits
-
By Eric Dumazet
In linux-4.13, Wei worked hard to convert dst to a traditional refcounted model, removing GC. We now want to make sure a dst refcount can not transition from 0 back to 1. The problem here is that the input path attached a non-refcounted dst to an skb. Then later, because the packet is forwarded and hits skb_dst_force() before exiting the RCU section, we might try to take a refcount on a dst that is about to be freed, if another cpu saw the 1 -> 0 transition in dst_release() and queued the dst for freeing after one RCU grace period. Let's unify skb_dst_force() and skb_dst_force_safe(), since we should always perform the complete check against the dst refcount, and not assume it is not zero. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197005 [ 989.919496] skb_dst_force+0x32/0x34 [ 989.919498] __dev_queue_xmit+0x1ad/0x482 [ 989.919501] ? eth_header+0x28/0xc6 [ 989.919502] dev_queue_xmit+0xb/0xd [ 989.919504] neigh_connected_output+0x9b/0xb4 [ 989.919507] ip_finish_output2+0x234/0x294 [ 989.919509] ? ipt_do_table+0x369/0x388 [ 989.919510] ip_finish_output+0x12c/0x13f [ 989.919512] ip_output+0x53/0x87 [ 989.919513] ip_forward_finish+0x53/0x5a [ 989.919515] ip_forward+0x2cb/0x3e6 [ 989.919516] ? pskb_trim_rcsum.part.9+0x4b/0x4b [ 989.919518] ip_rcv_finish+0x2e2/0x321 [ 989.919519] ip_rcv+0x26f/0x2eb [ 989.919522] ? vlan_do_receive+0x4f/0x289 [ 989.919523] __netif_receive_skb_core+0x467/0x50b [ 989.919526] ? tcp_gro_receive+0x239/0x239 [ 989.919529] ? inet_gro_receive+0x226/0x238 [ 989.919530] __netif_receive_skb+0x4d/0x5f [ 989.919532] netif_receive_skb_internal+0x5c/0xaf [ 989.919533] napi_gro_receive+0x45/0x81 [ 989.919536] ixgbe_poll+0xc8a/0xf09 [ 989.919539] ? kmem_cache_free_bulk+0x1b6/0x1f7 [ 989.919540] net_rx_action+0xf4/0x266 [ 989.919543] __do_softirq+0xa8/0x19d [ 989.919545] irq_exit+0x5d/0x6b [ 989.919546] do_IRQ+0x9c/0xb5 [ 989.919548] common_interrupt+0x93/0x93 [ 989.919548] </IRQ> Similarly, dst_clone() can use the dst_hold() helper to get additional debugging, as a follow up to commit 44ebe791 ("net: add debug atomic_inc_not_zero() in dst_hold()"). In net-next we will convert dst atomic_t to refcount_t for peace of mind. Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Reported-by: Paweł Staszewski <pstaszewski@itcare.pl> Bisected-by: Paweł Staszewski <pstaszewski@itcare.pl> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
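The unified helper then always performs the full refcount check; roughly (a sketch close to the resulting mainline inline, shown for illustration):

  static inline void skb_dst_force(struct sk_buff *skb)
  {
          if (skb_dst_is_noref(skb)) {
                  struct dst_entry *dst = skb_dst(skb);

                  WARN_ON(!rcu_read_lock_held());
                  if (!dst_hold_safe(dst))
                          dst = NULL;     /* refused a 0 -> 1 transition */
                  skb->_skb_refdst = (unsigned long)dst;
          }
  }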
-
By Dmitry Torokhov
Normally, when an input device supporting force feedback effects is being destroyed, we try to "flush" currently playing effects, so that the physical device does not continue vibrating (or executing other effects). Unfortunately this does not work well for uinput, as flushing of the effects deadlocks with the destroy action: - if the device is being destroyed because the file descriptor is being closed, then there is no one to even service FF requests; - if the device is being destroyed because userspace sent UI_DEV_DESTROY, then while it could theoretically be possible to service FF requests, userspace is unlikely to do so (they'd need to make sure FF handling happens on a separate thread) even if the kernel solves the issue of FF ioctls deadlocking with the UI_DEV_DESTROY ioctl on udev->mutex. To avoid lockups like the one below, let's install a custom input device flush handler and avoid trying to flush force feedback effects when destroying the device, instead relying on uinput to shut off the device properly. NMI watchdog: Watchdog detected hard LOCKUP on cpu 3 ... <<EOE>> [<ffffffff817a0307>] _raw_spin_lock_irqsave+0x37/0x40 [<ffffffff810e633d>] complete+0x1d/0x50 [<ffffffffa00ba08c>] uinput_request_done+0x3c/0x40 [uinput] [<ffffffffa00ba587>] uinput_request_submit.part.7+0x47/0xb0 [uinput] [<ffffffffa00bb62b>] uinput_dev_erase_effect+0x5b/0x76 [uinput] [<ffffffff815d91ad>] erase_effect+0xad/0xf0 [<ffffffff815d929d>] flush_effects+0x4d/0x90 [<ffffffff815d4cc0>] input_flush_device+0x40/0x60 [<ffffffff815daf1c>] evdev_cleanup+0xac/0xc0 [<ffffffff815daf5b>] evdev_disconnect+0x2b/0x60 [<ffffffff815d74ac>] __input_unregister_device+0xac/0x150 [<ffffffff815d75f7>] input_unregister_device+0x47/0x70 [<ffffffffa00bac45>] uinput_destroy_device+0xb5/0xc0 [uinput] [<ffffffffa00bb2de>] uinput_ioctl_handler.isra.9+0x65e/0x740 [uinput] [<ffffffff811231ab>] ? do_futex+0x12b/0xad0 [<ffffffffa00bb3f8>] uinput_ioctl+0x18/0x20 [uinput] [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480 [<ffffffff81337553>] ? security_file_ioctl+0x43/0x60 [<ffffffff812414a9>] SyS_ioctl+0x79/0x90 [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71 Reported-by: Rodrigo Rivas Costa <rodrigorivascosta@gmail.com> Reported-by: Clément VUCHENER <clement.vuchener@gmail.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=193741 Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
-
By Florian Fainelli
Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API") deprecated the ethtool_cmd::transceiver field, which was fine in premise, except that the PHY library was actually using it to report the type of transceiver: internal or external. Use the first word of the reserved field to put this __u8 transceiver field back in. It is made read-only, and we don't expect the ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this is mostly for the legacy path where we do: ethtool_get_settings() -> dev->ethtool_ops->get_link_ksettings() -> convert_link_ksettings_to_legacy_settings() to have no information loss compared to the legacy get_settings API. Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
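Sketched layout of the change (field placement per the description above; exact neighboring fields and sizes are assumptions from memory, not a verbatim copy of the uapi header):

  struct ethtool_link_settings {
          /* ... existing fields elided ... */
          __u8    transceiver;    /* read-only: XCVR_INTERNAL etc. */
          __u8    reserved1[3];   /* remainder of the formerly reserved word */
          __u32   reserved[7];    /* one word smaller than before */
          __u32   link_mode_masks[0];
  };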
-
- 21 Sep 2017, 2 commits
-
By Thomas Gleixner
This reverts commit 74def747. The change to the helper function is only correct for the /proc/irq/ readout usage, but breaks the existing x86 usage of that function. Reported-by: Yanko Kaneti <yaneti@declera.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com>
-
By Yonghong Song
This patch fixes a bug exhibited by the following scenario: 1. fd1 = perf_event_open with attr.config = ID1 2. attach bpf program prog1 to fd1 3. fd2 = perf_event_open with attr.config = ID1 <this will be successful> 4. user program closes fd2 and prog1 is detached from the tracepoint. 5. user program with fd1 does not work properly as the tracepoint no longer produces output. The issue happens at step 4. Multiple perf_event_open calls can succeed, but there is only one bpf prog pointer in the tp_event. In the current logic, any fd release for the same tp_event will free the tp_event->prog. The fix is to free tp_event->prog only when the closing fd corresponds to the one which registered the program. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 20 Sep 2017, 2 commits
-
By John Hubbard
Due to commit db3e50f3 (device property: Get rid of struct fwnode_handle type field), ACPI_HANDLE() inadvertently became a GPL-only call. The call path that led to that was: ACPI_HANDLE() ACPI_COMPANION() to_acpi_device_node() is_acpi_device_node() acpi_device_fwnode_ops DECLARE_ACPI_FWNODE_OPS(acpi_device_fwnode_ops); ...and the new DECLARE_ACPI_FWNODE_OPS() includes EXPORT_SYMBOL_GPL, whereas previously it was a static struct. In order to avoid changing any of that, let's instead provide ever so slightly better encapsulation of those struct fwnode_operations instances. Those do not really need to be directly used in inline function calls in header files. Simply moving two small functions (is_acpi_device_node and is_acpi_data_node) out of acpi_bus.h and into a .c file does that. That leaves the internals of struct fwnode_operations as GPL-only (which I think was the intent all along), but un-breaks any driver code out there that relies on the ACPI subsystem being (historically) an EXPORT_SYMBOL-usable system. By that I mean that ACPI_HANDLE() and other basic ACPI calls were non-GPL-protected. Also, while I'm there, remove a tiny bit of redundancy that was missed in the earlier commit, by having is_acpi_node() use the other two routines, instead of checking fwnode directly. Fixes: db3e50f3 (device property: Get rid of struct fwnode_handle type field) Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
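The redundancy fix mentioned at the end reduces to composing the two helpers (a sketch; whether the parameter is const-qualified in this kernel version is an assumption):

  bool is_acpi_node(const struct fwnode_handle *fwnode)
  {
          return is_acpi_device_node(fwnode) || is_acpi_data_node(fwnode);
  }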
-
By Arnd Bergmann
The ipmmu-vmsa driver fails compile-testing on non-OF platforms: drivers/iommu/ipmmu-vmsa.o: In function `ipmmu_of_xlate': ipmmu-vmsa.c:(.text+0x740): undefined reference to `of_find_device_by_node' It would be reasonable to assume that this interface works but returns failure on non-OF builds, like it does on machines that have been booted another way, so this adds another inline function helper. Fixes: 7b2d5961 ("iommu/ipmmu-vmsa: Replace local utlb code with fwspec ids") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Rob Herring <robh@kernel.org>
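The usual header pattern for such a stub, sketched (this mirrors the described change in simplified form):

  #ifdef CONFIG_OF
  extern struct platform_device *of_find_device_by_node(struct device_node *np);
  #else
  static inline struct platform_device *
  of_find_device_by_node(struct device_node *np)
  {
          return NULL;    /* no OF support: the lookup always fails */
  }
  #endif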
-
- 19 Sep 2017, 2 commits
-
By Tycho Andersen
This is unused, and conflicts with the definition that we'll add for XPFO. Signed-off-by: Tycho Andersen <tycho@docker.com> Reviewed-by: Julien Grall <julien.grall@arm.com> CC: Boris Ostrovsky <boris.ostrovsky@oracle.com> CC: Juergen Gross <jgross@suse.com> CC: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
-
By Yuchung Cheng
Remove tcp_may_send_now and tcp_snd_test, which are no longer used. Fixes: 840a3cbe ("tcp: remove forward retransmit feature") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 18 Sep 2017, 2 commits
-
By Geert Uytterhoeven
Correct the location as of commit 2728b2d2 (PM / core / docs: Convert sleep states API document to reST). Fixes: 2728b2d2 (PM / core / docs: Convert sleep states API document to reST) Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
By Thomas Garnier
Use CHECK_DATA_CORRUPTION instead of BUG_ON to provide more flexibility on address limit failures. By default, send a SIGKILL signal to kill the current process, preventing exploitation of a bad address limit. Make the TIF_FSCHECK flag optional so ARM can use this function. Signed-off-by: Thomas Garnier <thgarnie@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Pratyush Anand <panand@redhat.com> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Will Drewry <wad@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Andy Lutomirski <luto@amacapital.net> Cc: David Howells <dhowells@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: Yonghong Song <yhs@fb.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1504798247-48833-2-git-send-email-keescook@chromium.org
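Roughly, the helper this series adds looks like the following (a sketch from memory of the 4.14-era code; treat the details, including the two-argument force_sig() of that era, as assumptions):

  static inline void addr_limit_user_check(void)
  {
  #ifdef TIF_FSCHECK
          if (!test_thread_flag(TIF_FSCHECK))
                  return;         /* flag is optional, e.g. for ARM */
  #endif
          if (CHECK_DATA_CORRUPTION(!segment_eq(get_fs(), USER_DS),
                                    "Invalid address limit on user-mode return"))
                  force_sig(SIGKILL, current);    /* kill instead of BUG() */
  #ifdef TIF_FSCHECK
          clear_thread_flag(TIF_FSCHECK);
  #endif
  }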
-
- 16 Sep 2017, 1 commit
-
By Xin Long
Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump") tried to fix a use-after-free issue by checking !sctp_sk(sk)->ep while holding the sock and the sock lock. But Paolo noticed that the endpoint could be destroyed in sctp_rcv without sock lock protection. It means the use-after-free issue could still be triggered when sctp_rcv puts and destroys the ep after sctp_sock_dump checks !ep, although it's pretty hard to reproduce. I could reproduce it with an mdelay in sctp_rcv and a long msleep in sctp_close and sctp_sock_dump. This patch adds another param, cb_done, to sctp_for_each_transport, and dumps ep->assocs while holding the tsp after jumping out of the transport traversal, to avoid this issue. It also makes the sctp diag dump faster, as there is no need to save sk into cb->args[5] and keep calling sctp_for_each_transport. This patch also uses int * instead of int for the pos argument in sctp_for_each_transport, so the position is incremented only in sctp_for_each_transport and cb->args[2] no longer needs to be updated in sctp_sock_filter and sctp_sock_dump. Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 15 Sep 2017, 5 commits
-
By Davidlohr Bueso
This is the equivalent of what we have in regular waitqueues. I'm not crazy about the name, but this also helps us get both APIs closer, which iirc comes originally from the -net folks. We also duplicate the comments for the lockless swait_active(), from wait.h. Future users will make use of this interface. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Mimi Zohar
This patch constifies the path argument to kernel_read_file_from_path(). Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Tim Chen
Now that we have added breaks in the wait queue scan and allow bookmarks on the scan position, we put this logic in the wake_up_page_bit function. We can have very long page wait lists in large systems where multiple pages share the same wait list. We break the wake up walk here to allow other CPUs a chance to access the list, and to avoid disabling interrupts while traversing the list for too long. This reduces the interrupt and rescheduling latency, and excessive page wait queue lock hold time. [ v2: Remove bookmark_wake_function ] Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Tim Chen
We encountered workloads that have very long wake up lists on large systems. A waker takes a long time to traverse the entire wake list and execute all the wake functions. We saw page wait lists that were up to 3700+ entries long in tests of large 4- and 8-socket systems. It took 0.8 sec to traverse such a list during wake up. Any other CPU that contends for the list spin lock will spin for a long time. It is a result of the numa balancing migration of hot pages that are shared by many threads. Multiple CPUs waking are queued up behind the lock, and the last one queued has to wait until all CPUs did all the wakeups. The page wait list is traversed with interrupts disabled, which caused various problems. This was the original cause that triggered the NMI watchdog timer in: https://patchwork.kernel.org/patch/9800303/ . Only extending the NMI watchdog timer there helped. This patch bookmarks the waker's scan position in the wake list and breaks the wake up walk, to allow access to the list before the waker resumes its walk down the rest of the wait list. It lowers the interrupt and rescheduling latency. This patch also provides a performance boost when combined with the next patch to break up the page wakeup list walk. We saw a 22% improvement in the will-it-scale file pread2 test on a Xeon Phi system running 256 threads. [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and simplify access to flags. ] Reported-by: Kan Liang <kan.liang@intel.com> Tested-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
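In outline, the bookmarked walk has this shape (a simplified sketch; walk_and_wake_batch() is a hypothetical stand-in for the extended __wake_up_common(), and the mainline batch size is 64):

  #define WAITQUEUE_WALK_BREAK_CNT 64

  static void wake_up_bookmarked(wait_queue_head_t *head, void *key)
  {
          struct wait_queue_entry bookmark = { .flags = 0 };
          unsigned long flags;

          do {
                  spin_lock_irqsave(&head->lock, flags);
                  /* wake at most WAITQUEUE_WALK_BREAK_CNT entries, then
                   * park the bookmark entry at the resume position and
                   * set WQ_FLAG_BOOKMARK in bookmark.flags */
                  walk_and_wake_batch(head, key, &bookmark);
                  spin_unlock_irqrestore(&head->lock, flags);
                  /* lock dropped: other CPUs can now grab the list */
          } while (bookmark.flags & WQ_FLAG_BOOKMARK);
  }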
-
By Ladi Prosek
Adding entries for exit reasons 23 - 27: KVM_EXIT_EPR KVM_EXIT_SYSTEM_EVENT KVM_EXIT_S390_STSI KVM_EXIT_IOAPIC_EOI KVM_EXIT_HYPERV Signed-off-by: Ladi Prosek <lprosek@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
- 14 Sep 2017, 2 commits
-
By Michal Hocko
GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its primary motivation was to allow users to tell that an allocation is short lived and so the allocator can try to place such allocations close together and prevent long term fragmentation. As much as this sounds like a reasonable semantic, it becomes much less clear when to use the highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the context holding that memory sleep? Can it take locks? It seems there is no good answer for those questions. The current implementation of GFP_TEMPORARY is basically GFP_KERNEL | __GFP_RECLAIMABLE, which in itself is tricky because basically none of the existing callers provide a way to reclaim the allocated memory. So this is rather misleading and hard to evaluate for any benefits. I have checked some random users and none of them has added the flag with a specific justification. I suspect most of them just copied from other existing users and others just thought it might be a good idea to use without any measuring. This suggests that GFP_TEMPORARY just motivates cargo cult usage without any reasoning. I believe that our gfp flags are quite complex already, and especially those with highlevel semantics should be clearly defined to prevent confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and replacing all existing users with plain GFP_KERNEL. Please note that SLAB users with shrinkers will still get the __GFP_RECLAIMABLE heuristic and so they will be placed properly for memory fragmentation prevention. I can see reasons we might want some gfp flag to reflect short-term allocations, but I propose starting from a clear semantic definition and only then adding users with proper justification. This was brought up before LSF this year by Matthew [1], and it turned out that GFP_TEMPORARY really doesn't have a clear semantic. It seems to be a heuristic without any measured advantage for most (if not all) its current users. The follow up discussion has revealed that opinions on what might be a temporary allocation differ a lot between developers. So rather than trying to tweak existing users into a semantic which they haven't expected, I propose to simply remove the flag and start from scratch if we really need a semantic for short term allocations. [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org [akpm@linux-foundation.org: fix typo] [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: drm/i915: fix up] Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
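The conversion itself is mechanical; for a typical call site (illustrative, not a specific file):

  /* before */
  buf = kmalloc(len, GFP_TEMPORARY);

  /* after: identical placement for non-SLAB users, clearer semantics */
  buf = kmalloc(len, GFP_KERNEL);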
-
By Dan Carpenter
This code causes a static checker warning because Smatch doesn't trust anything that comes from skb->data. I've reviewed this code and I do think skb->data can be controlled by the user here. The sctp_event_subscribe struct has 13 __u8 fields and we want to see if ours is non-zero. sn_type can be any value in the 0-USHRT_MAX range. We're subtracting SCTP_SN_TYPE_BASE, which is 1 << 15, so we could read either before the start of the struct or after the end. This is a very old bug, and it's surprising that it would go undetected for so long; my theory is that it just doesn't have a big impact, so it would be hard to notice. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
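The corresponding bounds check, sketched below; names other than SCTP_SN_TYPE_BASE and struct sctp_event_subscribe are our own for illustration:

  static int sn_type_enabled(const struct sctp_event_subscribe *subscribe,
                             __u16 sn_type)
  {
          /* the struct is 13 consecutive __u8 flags; reject anything
           * that would index before or past it */
          if (sn_type < SCTP_SN_TYPE_BASE ||
              sn_type - SCTP_SN_TYPE_BASE >= sizeof(*subscribe))
                  return 0;
          return ((const __u8 *)subscribe)[sn_type - SCTP_SN_TYPE_BASE];
  }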
-
- 13 Sep 2017, 1 commit
-
By Cong Wang
The gen estimator was rewritten in commit 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators"); callers no longer need to wait for a grace period. So this patch gets rid of it. This also completely closes a race condition between the action free path and the filter chain add/remove path for the following patch. Because otherwise the nested RCU callback can't be caught by rcu_barrier(). Please see also the comments in the code. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 12 Sep 2017, 3 commits
-
By Jesper Dangaard Brouer
Using bpf_redirect_map is allowed for generic XDP programs, but the appropriate map lookup was never performed in xdp_do_generic_redirect(). Instead the map index was directly used as the ifindex. For the xdp_redirect_map sample in SKB-mode ('-S'), this resulted in trying to send on ifindex 0, which isn't valid, and SKB packets being dropped. Thus, the reported performance numbers are wrong in commit 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps") for the 'xdp_redirect_map -S' case. Before commit 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs") it could crash the kernel. Like that commit, also check that the map_owner is correct before dereferencing the map pointer. But make sure that this API misuse can be caught by a tracepoint, thus allowing userspace to detect misbehaving bpf_progs via tracepoints. Fixes: 6103aa96 ("net: implement XDP_REDIRECT for xdp generic") Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Yonghong Song
clang does not support variable length arrays for structure members. It produces the following error during compilation: kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant size: 'variable length array in structure' extension will never be supported unsigned long args[sys_data->nb_args]; ^ The fix is to use a fixed array length instead. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
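In outline, the fix bounds the array by the ABI maximum instead of the runtime count (a sketch of the affected lines; Linux syscalls pass at most six arguments, and the five-argument syscall_get_arguments() form shown here is the one from this kernel era):

  /* before (rejected by clang): a VLA-sized field */
  /*         unsigned long args[sys_data->nb_args]; */

  /* after: a fixed upper bound */
  unsigned long args[6];

  syscall_get_arguments(current, regs, 0, sys_data->nb_args, args);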
-
By Martin Wilck
The way I'd implemented the new helper memcpy_and_pad with __FORTIFY_INLINE caused compiler warnings for certain kernel configurations. This helper is only used in a single place at this time, and thus doesn't benefit much from fortification. So simplify the code by dropping fortification support for now. Fixes: 01f33c33 ("string.h: add memcpy_and_pad()") Signed-off-by: Martin Wilck <mwilck@suse.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
-