Commits · 06eb61844d841d0032a9950ce7f8e783ee49c0d0 · openanolis / cloud-kernel

29 Sep, 2017 5 commits

sched/debug: Add explicit TASK_IDLE printing · 06eb6184

Authored by Peter Zijlstra on Sep 22, 2017

Markus reported that kthreads that idle using TASK_IDLE instead of
TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE, and tools
like htop mark those red.

This is undesirable, so add an explicit state for TASK_IDLE.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

06eb6184

sched/tracing: Use common task-state helpers · 5f6ad26e

Authored by Peter Zijlstra on Sep 22, 2017

Remove yet another task-state char instance.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

5f6ad26e

sched/tracing: Fix trace_sched_switch task-state printing · efb40f58

Authored by Peter Zijlstra on Sep 22, 2017

Convert trace_sched_switch to use the common task-state helpers and
fix the "X" and "Z" order; they possibly ended up in the wrong order
because TASK_REPORT has them in the wrong order too.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

efb40f58

sched/debug: Convert TASK_state to hex · 92c4bc9f

Authored by Peter Zijlstra on Sep 22, 2017

Bit patterns are easier in hex.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

92c4bc9f

sched/debug: Implement consistent task-state printing · 1593baab

Authored by Peter Zijlstra on Sep 22, 2017

Currently get_task_state() and task_state_to_char() report different
states. Create a number of common helpers and unify the reported state
space.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

1593baab

26 Sep, 2017 1 commit

nvmet-fc: sync header templates with comments · 6b71f9e1

Authored by James Smart on Sep 20, 2017

Comments were incorrect:
- defer_rcv was in host port template. Moved to target port template
- Added Mandatory statements for target port template items
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6b71f9e1

25 Sep, 2017 5 commits

IB: Correct MR length field to be 64-bit · edd31551

Authored by Parav Pandit on Sep 24, 2017

The ib_mr->length represents the length of the MR in bytes as per
the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION).

Currently the ib_mr->length field is defined as only a 32-bit field.
This might result in truncation and failed WRs for consumers who
register memory regions larger than 4GB and whose WRs access
such MRs.

This patch makes the length 64-bit to avoid such truncation.

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Fixes: 4c67e2bf ("IB/core: Introduce new fast registration API")
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>

edd31551

IB/core: Fix typo in the name of the tag-matching cap struct · 78b1beb0

Authored by Leon Romanovsky on Sep 24, 2017

The tag matching functionality is implemented by mlx5 driver
by extending XRQ, however this internal kernel information was
exposed to user space applications with *xrq* name instead of *tm*.

This patch renames *xrq* to *tm* to handle that.

Fixes: 8d50505a ("IB/uverbs: Expose XRQ capabilities")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

78b1beb0

nvme: add transport SGL definitions · d85cf207

Authored by James Smart on Sep 07, 2017

Add transport SGL definitions from NVMe TP 4008, required for
the final NVMe-FC standard.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d85cf207

nvme.h: remove FC transport-specific error values · c98cb3bd

Authored by James Smart on Sep 07, 2017

The NVM Express group rescinded the reserved range for the transport.
Remove the FC-centric values that had been defined.
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c98cb3bd

blktrace: Fix potential deadlock between delete & sysfs ops · 5acb3cc2

Authored by Waiman Long on Sep 20, 2017

The lockdep code had reported the following unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(s_active#228);
                               lock(&bdev->bd_mutex/1);
                               lock(s_active#228);
  lock(&bdev->bd_mutex);

 *** DEADLOCK ***

The deadlock may happen when one task (CPU1) is trying to delete a
partition in a block device and another task (CPU0) is accessing
tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
partition.

The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
a wait until all the references are gone. The reference count is
treated like a rwsem using lockdep instrumentation code.

The fact that a thread is in the sysfs callback method or in the
ioctl call means there is a reference to the opened sysfs or device
file. That should prevent the underlying block structure from being
removed.

Instead of using bd_mutex in the block_device structure, a new
blk_trace_mutex is now added to the request_queue structure to protect
access to the blk_trace structure.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

Fix typo in patch subject line, and prune a comment detailing how
the code used to work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5acb3cc2

22 Sep, 2017 3 commits

net: prevent dst uses after free · 222d7dbd

Authored by Eric Dumazet on Sep 21, 2017

In linux-4.13, Wei worked hard to convert dst to a traditional
refcounted model, removing GC.

We now want to make sure a dst refcount can not transition from 0 back
to 1.

The problem here is that the input path attached a non-refcounted dst to an
skb. Then later, because the packet is forwarded and hits skb_dst_force()
before exiting the RCU section, we might try to take a refcount on a dst
that is about to be freed, if another cpu saw the 1 -> 0 transition in
dst_release() and queued the dst for freeing after one RCU grace period.

Let's unify skb_dst_force() and skb_dst_force_safe(), since we should
always perform the complete check against the dst refcount, and not assume
it is not zero.

Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005

[  989.919496]  skb_dst_force+0x32/0x34
[  989.919498]  __dev_queue_xmit+0x1ad/0x482
[  989.919501]  ? eth_header+0x28/0xc6
[  989.919502]  dev_queue_xmit+0xb/0xd
[  989.919504]  neigh_connected_output+0x9b/0xb4
[  989.919507]  ip_finish_output2+0x234/0x294
[  989.919509]  ? ipt_do_table+0x369/0x388
[  989.919510]  ip_finish_output+0x12c/0x13f
[  989.919512]  ip_output+0x53/0x87
[  989.919513]  ip_forward_finish+0x53/0x5a
[  989.919515]  ip_forward+0x2cb/0x3e6
[  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
[  989.919518]  ip_rcv_finish+0x2e2/0x321
[  989.919519]  ip_rcv+0x26f/0x2eb
[  989.919522]  ? vlan_do_receive+0x4f/0x289
[  989.919523]  __netif_receive_skb_core+0x467/0x50b
[  989.919526]  ? tcp_gro_receive+0x239/0x239
[  989.919529]  ? inet_gro_receive+0x226/0x238
[  989.919530]  __netif_receive_skb+0x4d/0x5f
[  989.919532]  netif_receive_skb_internal+0x5c/0xaf
[  989.919533]  napi_gro_receive+0x45/0x81
[  989.919536]  ixgbe_poll+0xc8a/0xf09
[  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
[  989.919540]  net_rx_action+0xf4/0x266
[  989.919543]  __do_softirq+0xa8/0x19d
[  989.919545]  irq_exit+0x5d/0x6b
[  989.919546]  do_IRQ+0x9c/0xb5
[  989.919548]  common_interrupt+0x93/0x93
[  989.919548]  </IRQ>

Similarly dst_clone() can use dst_hold() helper to have additional
debugging, as a follow up to commit 44ebe791 ("net: add debug
atomic_inc_not_zero() in dst_hold()")

In net-next we will convert dst atomic_t to refcount_t for peace of
mind.

Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Reported-by: Paweł Staszewski <pstaszewski@itcare.pl>
Bisected-by: Paweł Staszewski <pstaszewski@itcare.pl>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

222d7dbd

Input: uinput - avoid FF flush when destroying device · e8b95728

Authored by Dmitry Torokhov on Sep 01, 2017

Normally, when input device supporting force feedback effects is being
destroyed, we try to "flush" currently playing effects, so that the
physical device does not continue vibrating (or executing other effects).
Unfortunately this does not work well for uinput as flushing of the effects
deadlocks with the destroy action:

- if device is being destroyed because the file descriptor is being closed,
  then there is no one to even service FF requests;

- if device is being destroyed because userspace sent UI_DEV_DESTROY,
  while theoretically it could be possible to service FF requests,
  userspace is unlikely to do so (they'd need to make sure FF handling
  happens on a separate thread) even if kernel solves the issue with FF
  ioctls deadlocking with UI_DEV_DESTROY ioctl on udev->mutex.

To avoid lockups like the one below, let's install a custom input device
flush handler, and avoid trying to flush force feedback effects when we
are destroying the device, and instead rely on uinput to shut off the device
properly.

NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
...
 <<EOE>>  [<ffffffff817a0307>] _raw_spin_lock_irqsave+0x37/0x40
 [<ffffffff810e633d>] complete+0x1d/0x50
 [<ffffffffa00ba08c>] uinput_request_done+0x3c/0x40 [uinput]
 [<ffffffffa00ba587>] uinput_request_submit.part.7+0x47/0xb0 [uinput]
 [<ffffffffa00bb62b>] uinput_dev_erase_effect+0x5b/0x76 [uinput]
 [<ffffffff815d91ad>] erase_effect+0xad/0xf0
 [<ffffffff815d929d>] flush_effects+0x4d/0x90
 [<ffffffff815d4cc0>] input_flush_device+0x40/0x60
 [<ffffffff815daf1c>] evdev_cleanup+0xac/0xc0
 [<ffffffff815daf5b>] evdev_disconnect+0x2b/0x60
 [<ffffffff815d74ac>] __input_unregister_device+0xac/0x150
 [<ffffffff815d75f7>] input_unregister_device+0x47/0x70
 [<ffffffffa00bac45>] uinput_destroy_device+0xb5/0xc0 [uinput]
 [<ffffffffa00bb2de>] uinput_ioctl_handler.isra.9+0x65e/0x740 [uinput]
 [<ffffffff811231ab>] ? do_futex+0x12b/0xad0
 [<ffffffffa00bb3f8>] uinput_ioctl+0x18/0x20 [uinput]
 [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
 [<ffffffff81337553>] ? security_file_ioctl+0x43/0x60
 [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
 [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
Reported-by: Rodrigo Rivas Costa <rodrigorivascosta@gmail.com>
Reported-by: Clément VUCHENER <clement.vuchener@gmail.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=193741
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

e8b95728

net: ethtool: Add back transceiver type · 19cab887

Authored by Florian Fainelli on Sep 20, 2017

Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
deprecated the ethtool_cmd::transceiver field, which was fine in
premise, except that the PHY library was actually using it to report the
type of transceiver: internal or external.

Use the first word of the reserved field to put this __u8 transceiver
field back in. It is made read-only, and we don't expect the
ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
is mostly for the legacy path where we do:

ethtool_get_settings()
-> dev->ethtool_ops->get_link_ksettings()
   -> convert_link_ksettings_to_legacy_settings()

to have no information loss compared to the legacy get_settings API.

Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19cab887

21 Sep, 2017 2 commits

Revert "genirq: Restrict effective affinity to interrupts actually using it" · 0551968a

Authored by Thomas Gleixner on Sep 21, 2017

This reverts commit 74def747.

The change to the helper function is only correct for the /proc/irq/
readout usage, but breaks the existing x86 usage of that function.
Reported-by: Yanko Kaneti <yaneti@declera.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>

0551968a

bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352

Authored by Yonghong Song on Sep 18, 2017

This patch fixes a bug exhibited by the following scenario:
  1. fd1 = perf_event_open with attr.config = ID1
  2. attach bpf program prog1 to fd1
  3. fd2 = perf_event_open with attr.config = ID1
     <this will be successful>
  4. user program closes fd2 and prog1 is detached from the tracepoint.
  5. user program with fd1 does not work properly as the tracepoint
     produces no output any more.

The issue happens at step 4. Multiple perf_event_open calls can
succeed, but there is only one bpf prog pointer in the tp_event. In the
current logic, any fd release for the same tp_event will free
the tp_event->prog.

The fix is to free tp_event->prog only when the closing fd
corresponds to the one which registered the program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec9dd352

20 Sep, 2017 2 commits

ACPI / bus: Make ACPI_HANDLE() work for non-GPL code again · 9e987b70

Authored by John Hubbard on Sep 15, 2017

Due to commit db3e50f3 (device property: Get rid of struct
fwnode_handle type field), ACPI_HANDLE() inadvertently became
a GPL-only call. The call path that led to that was:

ACPI_HANDLE()
    ACPI_COMPANION()
        to_acpi_device_node()
            is_acpi_device_node()
                acpi_device_fwnode_ops
                    DECLARE_ACPI_FWNODE_OPS(acpi_device_fwnode_ops);

...and the new DECLARE_ACPI_FWNODE_OPS() includes
EXPORT_SYMBOL_GPL, whereas previously it was a static struct.

In order to avoid changing any of that, let's instead provide ever
so slightly better encapsulation of those struct fwnode_operations
instances. Those do not really need to be directly used in
inline function calls in header files. Simply moving two small
functions (is_acpi_device_node and is_acpi_data_node) out of
acpi_bus.h, and into a .c file, does that.

That leaves the internals of struct fwnode_operations as GPL-only
(which I think was the intent all along), but un-breaks any driver
code out there that relies on the ACPI subsystem's being (historically)
an EXPORT_SYMBOL-usable system. By that, I mean, ACPI_HANDLE() and
other basic ACPI calls were non-GPL-protected.

Also, while I'm there, remove a tiny bit of redundancy that was missed
in the earlier commit, by having is_acpi_node() use the other two
routines, instead of checking fwnode directly.

Fixes: db3e50f3 (device property: Get rid of struct fwnode_handle type field)
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

9e987b70

of: provide inline helper for of_find_device_by_node · aa767cfb

Authored by Arnd Bergmann on Sep 11, 2017

The ipmmu-vmsa driver fails in compile-testing on non-OF platforms:

drivers/iommu/ipmmu-vmsa.o: In function `ipmmu_of_xlate':
ipmmu-vmsa.c:(.text+0x740): undefined reference to `of_find_device_by_node'

It would be reasonable to assume that this interface works but
returns failure on non-OF builds, like it does on machines that
have been booted in another way, so this adds another inline
function helper.

Fixes: 7b2d5961 ("iommu/ipmmu-vmsa: Replace local utlb code with fwspec ids")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rob Herring <robh@kernel.org>

aa767cfb

19 Sep, 2017 2 commits

xen, arm64: drop dummy lookup_address() · 0555ac43

Authored by Tycho Andersen on Sep 18, 2017

This is unused, and conflicts with the definition that we'll add for XPFO.
Signed-off-by: Tycho Andersen <tycho@docker.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
CC: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: Juergen Gross <jgross@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

0555ac43

tcp: remove two unused functions · 4c712441

Authored by Yuchung Cheng on Sep 18, 2017

Remove tcp_may_send_now and tcp_snd_test, which are no longer used.

Fixes: 840a3cbe ("tcp: remove forward retransmit feature")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c712441

18 Sep, 2017 2 commits

driver core: Fix link to device power management documentation · 74378c5c

Authored by Geert Uytterhoeven on Sep 05, 2017

Correct location as of commit 2728b2d2 (PM / core / docs:
Convert sleep states API document to reST).

Fixes: 2728b2d2 (PM / core / docs: Convert sleep states API document to reST)
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

74378c5c

syscalls: Use CHECK_DATA_CORRUPTION for addr_limit_user_check · bf29ed15

Authored by Thomas Garnier on Sep 07, 2017

Use CHECK_DATA_CORRUPTION instead of BUG_ON to provide more flexibility
on address limit failures. By default, send a SIGKILL signal to kill the
current process preventing exploitation of a bad address limit.

Make the TIF_FSCHECK flag optional so ARM can use this function.
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Will Drewry <wad@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Yonghong Song <yhs@fb.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1504798247-48833-2-git-send-email-keescook@chromium.org

bf29ed15

16 Sep, 2017 1 commit

sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb

Authored by Xin Long on Sep 15, 2017

Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
dump") tried to fix a use-after-free issue by checking !sctp_sk(sk)->ep
while holding sock and sock lock.

But Paolo noticed that the endpoint could be destroyed in sctp_rcv without
sock lock protection. It means the use-after-free issue could still be
triggered when sctp_rcv puts and destroys ep after sctp_sock_dump checks
!ep, although it's pretty hard to reproduce.

I could reproduce it with an mdelay in sctp_rcv and a long msleep in
sctp_close and sctp_sock_dump.

This patch adds another param cb_done to sctp_for_each_transport
and dumps ep->assocs while holding tsp after jumping out of the transport
traversal in it to avoid this issue.

It can also improve sctp diag dump to make it run faster, as there is no
need to save sk into cb->args[5] and keep calling sctp_for_each_transport
any more.

This patch also uses int * instead of int for the pos argument
in sctp_for_each_transport, so that the position is incremented only
in sctp_for_each_transport and there is no need to keep changing cb->args[2]
in sctp_sock_filter and sctp_sock_dump any more.

Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d25adbeb

15 Sep, 2017 5 commits

sched/wait: Add swq_has_sleeper() · 8cd641e3

Authored by Davidlohr Bueso on Sep 13, 2017

Which is the equivalent of what we have in regular waitqueues.
I'm not crazy about the name, but this also helps us get both
apis closer -- which iirc comes originally from the -net folks.

We also duplicate the comments for the lockless swait_active(),
from wait.h. Future users will make use of this interface.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

8cd641e3

vfs: constify path argument to kernel_read_file_from_path · 711aab1d

Authored by Mimi Zohar on Sep 12, 2017

This patch constifies the path argument to kernel_read_file_from_path().
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

711aab1d

sched/wait: Introduce wakeup boomark in wake_up_page_bit · 11a19c7b

Authored by Tim Chen on Aug 25, 2017

Now that we have added breaks in the wait queue scan and allow bookmark
on scan position, we put this logic in the wake_up_page_bit function.

We can have very long page wait lists in large systems where multiple
pages share the same wait list. We break the wake up walk here to allow
other cpus a chance to access the list, and not to disable the interrupts
when traversing the list for too long.  This reduces the interrupt and
rescheduling latency, and excessive page wait queue lock hold time.

[ v2: Remove bookmark_wake_function ]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

11a19c7b

sched/wait: Break up long wake list walk · 2554db91

Authored by Tim Chen on Aug 25, 2017

We encountered workloads that have very long wake up lists on large
systems. A waker takes a long time to traverse the entire wake list and
execute all the wake functions.

We saw page wait lists that are up to 3700+ entries long in tests of
large 4 and 8 socket systems. It took 0.8 sec to traverse such a list
during wake up. Any other CPU that contends for the list spin lock will
spin for a long time. It is a result of the numa balancing migration of
hot pages that are shared by many threads.

Multiple CPUs waking are queued up behind the lock, and the last one
queued has to wait until all CPUs did all the wakeups.

The page wait list is traversed with interrupts disabled, which caused
various problems. This was the original cause that triggered the NMI
watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
extending the NMI watch dog timer there helped.

This patch bookmarks the waker's scan position in the wake list and breaks
the wake up walk, to allow access to the list before the waker resumes
its walk down the rest of the wait list. It lowers the interrupt and
rescheduling latency.

This patch also provides a performance boost when combined with the next
patch to break up page wakeup list walk. We saw 22% improvement in the
will-it-scale file pread2 test on a Xeon Phi system running 256 threads.

[ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
  simply access to flags. ]
Reported-by: Kan Liang <kan.liang@intel.com>
Tested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2554db91

KVM: trace events: update list of exit reasons · 488e32f1

Authored by Ladi Prosek on Sep 14, 2017

Adding entries for exit reasons 23 - 27:

  KVM_EXIT_EPR
  KVM_EXIT_SYSTEM_EVENT
  KVM_EXIT_S390_STSI
  KVM_EXIT_IOAPIC_EOI
  KVM_EXIT_HYPERV
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

488e32f1

14 Sep, 2017 2 commits

mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4

Authored by Michal Hocko on Sep 13, 2017

GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  Its
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation.  As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.

The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing callers provide a way to reclaim the allocated memory.  So
this is rather misleading and hard to evaluate for any benefits.

I have checked some random users and none of them has added the flag
with a specific justification.  I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring.  This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.

I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL.  Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.

I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.

This has been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
seems to be a heuristic without any measured advantage for most (if not
all) its current users.  The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers.  So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.

[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
  Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ee931c4

sctp: potential read out of bounds in sctp_ulpevent_type_enabled() · fa5f7b51

Authored by Dan Carpenter on Sep 14, 2017

This code causes a static checker warning because Smatch doesn't trust
anything that comes from skb->data.  I've reviewed this code and I do
think skb->data can be controlled by the user here.

The sctp_event_subscribe struct has 13 __u8 fields and we want to see
if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
either before the start of the struct or after the end.

This is a very old bug and it's surprising that it would go undetected
for so long but my theory is that it just doesn't have a big impact so
it would be hard to notice.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa5f7b51

13 Sep, 2017 1 commit

net_sched: get rid of tcfa_rcu · d7fb60b9

Authored by Cong Wang on Sep 11, 2017

The gen estimator has been rewritten in commit 1c0d32fd
("net_sched: gen_estimator: complete rewrite of rate estimators"),
so the caller no longer needs to wait for a grace period.
This patch therefore gets rid of it.

This also completely closes a race condition between action free
path and filter chain add/remove path for the following patch.
Because otherwise the nested RCU callback can't be caught by
rcu_barrier().

Please see also the comments in code.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7fb60b9

12 Sep, 2017 4 commits

xdp: implement xdp_redirect_map for generic XDP · 96c5508e

Authored by Jesper Dangaard Brouer on Sep 10, 2017

Using bpf_redirect_map is allowed for generic XDP programs, but the
appropriate map lookup was never performed in xdp_do_generic_redirect().

Instead the map-index is directly used as the ifindex. For the
xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
to send on ifindex 0, which isn't valid, so SKB
packets were dropped. Thus, the reported performance numbers are wrong in
commit 24251c26 ("samples/bpf: add option for native and skb mode
for redirect apps") for the 'xdp_redirect_map -S' case.

Before commit 109980b8 ("bpf: don't select potentially stale
ri->map from buggy xdp progs") it could crash the kernel. Like that
commit, also check that the map_owner is correct before
dereferencing the map pointer. But make sure that this API misuse
can be caught by a tracepoint, thus allowing userspace to detect
misbehaving bpf_progs via tracepoints.

Fixes: 6103aa96 ("net: implement XDP_REDIRECT for xdp generic")
Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

96c5508e

perf/bpf: fix a clang compilation issue · 609320c8

Authored by Yonghong Song on Sep 07, 2017

clang does not support variable length array for structure member.
It has the following error during compilation:

kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant size:
'variable length array in structure' extension will never be supported
                unsigned long args[sys_data->nb_args];
                              ^

The fix is to use a fixed array length instead.
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

609320c8

string.h: un-fortify memcpy_and_pad · 1359798f

Authored by Martin Wilck on Sep 06, 2017

The way I'd implemented the new helper memcpy_and_pad with
__FORTIFY_INLINE caused compiler warnings for certain kernel
configurations.

This helper is only used in a single place at this time, and thus
doesn't benefit much from fortification. So simplify the code
by dropping fortification support for now.
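
As a rough illustration of what an un-fortified copy-and-pad helper looks like, here is a user-space sketch written as an ordinary function with no compile-time size checks. The name `copy_and_pad` is hypothetical, and its parameter order is an assumption chosen for this example rather than a statement about the kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy count bytes from src into a dest_len-byte buffer. If the source
 * is shorter than the destination, fill the remainder with pad; if it
 * is longer, copy only what fits.
 */
static void copy_and_pad(void *dest, size_t dest_len,
			 const void *src, size_t count, int pad)
{
	if (dest_len > count) {
		memcpy(dest, src, count);
		memset((char *)dest + count, pad, dest_len - count);
	} else {
		/* Source is at least as long as the destination: truncate. */
		memcpy(dest, src, dest_len);
	}
}
```

For example, copying the three bytes of "abc" into an eight-byte field with a space as the pad byte leaves "abc" followed by five spaces.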

Fixes: 01f33c33 ("string.h: add memcpy_and_pad()")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

1359798f

nvme-pci: implement the HMB entry number and size limitations · 044a9df1

Committed by Christoph Hellwig on Sep 11, 2017

Adds support for the new Host Memory Buffer Minimum Descriptor Entry Size
and Host Memory Maximum Descriptors Entries fields that were added in
TP 4002 HMB Enhancements.  These allow the controller to advertise
limits for the usable number of segments in the host memory buffer, as
well as a minimum usable per-segment size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>

044a9df1
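
The two limits described in the commit above can be pictured with a small user-space sketch. Everything here is an assumption made for illustration: the names `struct hmb_plan` and `plan_hmb` are hypothetical, the minimum entry size is taken to be expressed in 4 KiB units, and a value of zero for either limit is taken to mean "no limit". It is not the driver's allocation logic.

```c
#include <assert.h>
#include <stdint.h>

struct hmb_plan {
	uint64_t chunk_size;	/* bytes per segment */
	uint32_t nr_descs;	/* number of segments (descriptor entries) */
};

/*
 * Decide how many segments of chunk_size bytes to use for a buffer of
 * `want` bytes. Returns -1 if chunk_size is below the advertised minimum.
 */
static int plan_hmb(uint64_t want, uint64_t chunk_size,
		    uint32_t min_entry_4k, uint32_t max_entries,
		    struct hmb_plan *out)
{
	uint64_t min_chunk = (uint64_t)min_entry_4k * 4096;
	uint64_t n;

	if (chunk_size == 0 || chunk_size < min_chunk)
		return -1;

	n = (want + chunk_size - 1) / chunk_size;	/* round up */
	if (max_entries && n > max_entries)
		n = max_entries;	/* buffer ends up smaller than requested */

	out->chunk_size = chunk_size;
	out->nr_descs = (uint32_t)n;
	return 0;
}
```

With a 32 MiB request, 2 MiB segments and a limit of eight entries, the sketch settles for eight segments (16 MiB) instead of the sixteen that the request alone would need.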

11 Sep, 2017 2 commits

block: tolerate tracing of NULL bio · f8e9ec16

Committed by Greg Thelen on Sep 07, 2017

__get_request() can call trace_block_getrq() with bio=NULL which causes
block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.

Syzkaller fuzzer panics with
linux-next (1d53d908b79d7870d89063062584eead4cf83448):
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP KASAN
  Modules linked in:
  CPU: 0 PID: 2983 Comm: syzkaller401111 Not tainted 4.13.0-rc7-next-20170901+ #13
  task: ffff8801cf1da000 task.stack: ffff8801ce440000
  RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
  RSP: 0018:ffff8801ce4473f0 EFLAGS: 00010246
  RAX: ffff8801cf1da000 RBX: 1ffff10039c88e84 RCX: 1ffffd1ffff84d27
  RDX: dffffc0000000001 RSI: 1ffff1003b643e7a RDI: ffffe8ffffc26938
  RBP: ffff8801ce447530 R08: 1ffff1003b643e6c R09: ffffe8ffffc26964
  R10: 0000000000000002 R11: fffff91ffff84d2d R12: ffffe8ffffc1f890
  R13: ffffe8ffffc26930 R14: ffffffff85cad9e0 R15: 0000000000000000
  FS:  0000000002641880(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000000043e670 CR3: 00000001d1d7a000 CR4: 00000000001406f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    trace_block_getrq include/trace/events/block.h:423 [inline]
    __get_request block/blk-core.c:1283 [inline]
    get_request+0x1518/0x23b0 block/blk-core.c:1355
    blk_old_get_request block/blk-core.c:1402 [inline]
    blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
    sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
    sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
    entry_SYSCALL_64_fastpath+0x1f/0xbe

block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
Only one of them is NULL tolerant.  Favor the NULL tolerant one.
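
The NULL-tolerant pattern referred to above can be sketched in user space as follows. The types `struct fake_bio` and `struct fake_entry` and the function `fill_entry` are hypothetical stand-ins invented for this example; they are not the kernel's structures or the tracepoint macro itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the bio and the trace record. */
struct fake_bio {
	unsigned int dev;
	unsigned long sector;
	unsigned int nr_sector;
};

struct fake_entry {
	unsigned int dev;
	unsigned long sector;
	unsigned int nr_sector;
};

/* Fill a trace record, recording zeros when no bio is supplied. */
static void fill_entry(struct fake_entry *e, const struct fake_bio *bio)
{
	e->dev = bio ? bio->dev : 0;
	e->sector = bio ? bio->sector : 0;
	e->nr_sector = bio ? bio->nr_sector : 0;
}
```

Each field is assigned exactly once, and every assignment checks the pointer first, so a stray unconditional dereference cannot slip in next to a guarded one.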

Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f8e9ec16

dax: remove the pmem_dax_ops->flush abstraction · c3ca015f

Committed by Mikulas Patocka on Aug 31, 2017

Commit abebfbe2 ("dm: add ->flush() dax operation support") is
buggy. A DM device may be composed of multiple underlying devices and
all of them need to be flushed. That commit just routes the flush
request to the first device and ignores the other devices.

It could be fixed by adding more complex logic to the device mapper. But
there is only one implementation of the method pmem_dax_ops->flush - that
is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
don't need the pmem_dax_ops->flush abstraction at all; we can call
arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
can't ever reach anything different from arch_wb_cache_pmem().

It should also be pointed out that some uses of persistent memory need
to flush only a very small amount of data (such as one cache line), and
going through the device mapper machinery for a single flushed cache
line would be overkill.

Fix this by removing the pmem_dax_ops->flush abstraction and calling
arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
mapper code that forwards the flushes.

Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>

c3ca015f

09 Sep, 2017 3 commits

bpf: make error reporting in bpf_warn_invalid_xdp_action more clear · 9beb8bed

Committed by Daniel Borkmann on Sep 09, 2017

Distinguish between an illegal XDP action code and one that is merely
unsupported by the driver, to provide better feedback when we throw
a one-time warning here. The reason is that with 814abfab
("xdp: add bpf_redirect helper function") not all drivers
support the new XDP return code yet, so they fall into
their 'default' case when checking for return codes after
the program returns. That then triggers a
bpf_warn_invalid_xdp_action() stating that the return
code is illegal, although from an XDP perspective it is not.
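
The distinction drawn above can be sketched in user space like this. The `SK_`-prefixed enum and the function `classify_xdp_action` are hypothetical names local to this example; the numeric values are assumed to mirror the XDP action codes that existed when this change was written, with the redirect action as the highest one.

```c
#include <assert.h>
#include <string.h>

/* Local stand-ins for the XDP action codes (assumed values). */
enum xdp_action_sketch {
	SK_XDP_ABORTED = 0,
	SK_XDP_DROP,
	SK_XDP_PASS,
	SK_XDP_TX,
	SK_XDP_REDIRECT,
};

/*
 * Classify an action code that a driver did not handle: anything above
 * the highest known code is illegal; anything else is a valid code
 * that this particular driver does not support.
 */
static const char *classify_xdp_action(unsigned int act)
{
	const unsigned int act_max = SK_XDP_REDIRECT;

	return act > act_max ? "Illegal" : "Driver unsupported";
}
```

Keeping the upper bound as a local constant, rather than exporting it, matches the reasoning given in the commit message for not adding a maximum-value define to uapi.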

I decided not to place something like an XDP_ACT_MAX define
into uapi because i) we don't have this for any other
program type either, ii) future action codes could have further
encoding there, which would render such a define unsuitable
and we wouldn't be able to rip it out again, and iii) we
rarely add new action codes.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

9beb8bed

bpf: add support for sockmap detach programs · 5a67da2a

Committed by John Fastabend on Sep 08, 2017

The bpf map sockmap supports adding programs via attach commands. This
patch adds the detach command to keep the API symmetric and allow
users to remove previously added programs. Otherwise the user would
have to delete the map and re-add it to get into this state.

This also adds a series of additional tests to cover the detach
operation as well as attaching/detaching invalid prog types.

API note: socks will run (or not run) programs depending on the state
of the map at the time the sock is added. We do not, for example, walk
the map and remove programs from previously attached socks.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a67da2a

ipc: optimize semget/shmget/msgget for lots of keys · 0cfb6aee

Committed by Guillaume Knispel on Sep 08, 2017

ipc_findkey() used to scan all objects to look for the wanted key.  This
is slow when using a high number of keys.  This change adds an rhashtable
of kern_ipc_perm objects in ipc_ids, so that a lookup ceases to be O(n).
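
To show why a keyed table avoids the linear scan, here is a user-space sketch of lookup by hashed key. It deliberately swaps in a plain fixed-size chained hash table instead of the kernel's resizable rhashtable, and all names (`struct perm`, `struct ids`, `ids_insert`, `ids_find`, `NBUCKETS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 1024	/* fixed for this sketch; rhashtable resizes itself */

struct perm {
	int key;
	int id;
	struct perm *next;	/* chain of entries sharing a bucket */
};

struct ids {
	struct perm *buckets[NBUCKETS];
};

/* Multiplicative hash of the key, reduced to a bucket index. */
static unsigned int bucket_of(int key)
{
	return ((unsigned int)key * 2654435761u) % NBUCKETS;
}

static void ids_insert(struct ids *ids, struct perm *p)
{
	unsigned int b = bucket_of(p->key);

	p->next = ids->buckets[b];
	ids->buckets[b] = p;
}

/* Only the entries in one bucket are visited, not every object. */
static struct perm *ids_find(const struct ids *ids, int key)
{
	struct perm *p;

	for (p = ids->buckets[bucket_of(key)]; p; p = p->next)
		if (p->key == key)
			return p;
	return NULL;
}
```

A lookup walks a single bucket chain, so its cost depends on the chain length and not on the total number of keys, as long as the keys spread across buckets.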

This change gives an 865% improvement in the reaim.jobs_per_min benchmark
on a 56-thread Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]

Other (more micro) benchmark results, by the author: On an i5 laptop, the
following loop executed right after a reboot took, without and with this
change:

    for (int i = 0, k=0x424242; i < KEYS; ++i)
        semget(k++, 1, IPC_CREAT | 0600);

                 total       total          max single  max single
   KEYS        without        with        call without   call with

      1            3.5         4.9   µs            3.5         4.9
     10            7.6         8.6   µs            3.7         4.7
     32           16.2        15.9   µs            4.3         5.3
    100           72.9        41.8   µs            3.7         4.7
   1000        5,630.0       502.0   µs             *           *
  10000    1,340,000.0     7,240.0   µs             *           *
  31900   17,600,000.0    22,200.0   µs             *           *

 *: unreliable measure: high variance

The duration for a lookup-only usage was obtained by the same loop once
the keys are present:

                 total       total          max single  max single
   KEYS        without        with        call without   call with

      1            2.1         2.5   µs            2.1         2.5
     10            4.5         4.8   µs            2.2         2.3
     32           13.0        10.8   µs            2.3         2.8
    100           82.9        25.1   µs             *          2.3
   1000        5,780.0       217.0   µs             *           *
  10000    1,470,000.0     2,520.0   µs             *           *
  31900   17,400,000.0     7,810.0   µs             *           *

Finally, executing each semget() in a new process gave, when still
summing only the durations of these syscalls:

creation:
                 total       total
   KEYS        without        with

      1            3.7         5.0   µs
     10           32.9        36.7   µs
     32          125.0       109.0   µs
    100          523.0       353.0   µs
   1000       20,300.0     3,280.0   µs
  10000    2,470,000.0    46,700.0   µs
  31900   27,800,000.0   219,000.0   µs

lookup-only:
                 total       total
   KEYS        without        with

      1            2.5         2.7   µs
     10           25.4        24.4   µs
     32          106.0        72.6   µs
    100          591.0       352.0   µs
   1000       22,400.0     2,250.0   µs
  10000    2,510,000.0    25,700.0   µs
  31900   28,200,000.0   115,000.0   µs

[1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop

Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix
Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0cfb6aee
