1. 01 5月, 2019 2 次提交
    • K
      KVM: Introduce a new guest mapping API · e45adf66
      KarimAllah Ahmed 提交于
      In KVM, specially for nested guests, there is a dominant pattern of:
      
      	=> map guest memory -> do_something -> unmap guest memory
      
      In addition to all this unnecessarily noise in the code due to boiler plate
      code, most of the time the mapping function does not properly handle memory
      that is not backed by "struct page". This new guest mapping API encapsulate
      most of this boiler plate code and also handles guest memory that is not
      backed by "struct page".
      
      The current implementation of this API is using memremap for memory that is
      not backed by a "struct page" which would lead to a huge slow-down if it
      was used for high-frequency mapping operations. The API does not have any
      effect on current setups where guest memory is backed by a "struct page".
      Further patches are going to also introduce a pfn-cache which would
      significantly improve the performance of the memremap case.
      Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e45adf66
    • L
      KVM: x86: Inject PMI for KVM guest · 8479e04e
      Luwei Kang 提交于
      Inject a PMI for KVM guest when Intel PT working
      in Host-Guest mode and Guest ToPA entry memory buffer
      was completely filled.
      Signed-off-by: NLuwei Kang <luwei.kang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8479e04e
  2. 26 4月, 2019 1 次提交
  3. 16 4月, 2019 1 次提交
  4. 09 4月, 2019 2 次提交
  5. 07 4月, 2019 2 次提交
    • D
      nfc: nci: Potential off by one in ->pipes[] array · 6491d698
      Dan Carpenter 提交于
      This is similar to commit e285d5bf ("NFC: Fix the number of pipes")
      where we changed NFC_HCI_MAX_PIPES from 127 to 128.
      
      As the comment next to the define explains, the pipe identifier is 7
      bits long.  The highest possible pipe is 127, but the number of possible
      pipes is 128.  As the code is now, then there is potential for an
      out of bounds array access:
      
          net/nfc/nci/hci.c:297 nci_hci_cmd_received() warn: array off by one?
          'ndev->hci_dev->pipes[pipe]' '0-127 == 127'
      
      Fixes: 11f54f22 ("NFC: nci: Add HCI over NCI protocol support")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6491d698
    • K
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov 提交于
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: NKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
  6. 06 4月, 2019 4 次提交
    • G
      mm: writeback: use exact memcg dirty counts · 0b3d6e6f
      Greg Thelen 提交于
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount the space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.comSigned-off-by: NGreg Thelen <gthelen@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b3d6e6f
    • J
      mm: fix vm_fault_t cast in VM_FAULT_GET_HINDEX() · fcae96ff
      Jann Horn 提交于
      Symmetrically to VM_FAULT_SET_HINDEX(), we need a force-cast in
      VM_FAULT_GET_HINDEX() to tell sparse that this is intentional.
      
      Sparse complains about the current code when building a kernel with
      CONFIG_MEMORY_FAILURE:
      
        arch/x86/mm/fault.c:1058:53: warning: restricted vm_fault_t degrades to integer
      
      Link: http://lkml.kernel.org/r/20190327204117.35215-1-jannh@google.com
      Fixes: 3d353901 ("mm: create the new vm_fault_t type")
      Signed-off-by: NJann Horn <jannh@google.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcae96ff
    • A
      include/linux/bitrev.h: fix constant bitrev · 6147e136
      Arnd Bergmann 提交于
      clang points out with hundreds of warnings that the bitrev macros have a
      problem with constant input:
      
        drivers/hwmon/sht15.c:187:11: error: variable '__x' is uninitialized when used within its own initialization
              [-Werror,-Wuninitialized]
                u8 crc = bitrev8(data->val_status & 0x0F);
                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        include/linux/bitrev.h:102:21: note: expanded from macro 'bitrev8'
                __constant_bitrev8(__x) :                       \
                ~~~~~~~~~~~~~~~~~~~^~~~
        include/linux/bitrev.h:67:11: note: expanded from macro '__constant_bitrev8'
                u8 __x = x;                     \
                   ~~~   ^
      
      Both the bitrev and the __constant_bitrev macros use an internal
      variable named __x, which goes horribly wrong when passing one to the
      other.
      
      The obvious fix is to rename one of the variables, so this adds an extra
      '_'.
      
      It seems we got away with this because
      
       - there are only a few drivers using bitrev macros
      
       - usually there are no constant arguments to those
      
       - when they are constant, they tend to be either 0 or (unsigned)-1
         (drivers/isdn/i4l/isdnhdlc.o, drivers/iio/amplifiers/ad8366.c) and
         give the correct result by pure chance.
      
      In fact, the only driver that I could find that gets different results
      with this is drivers/net/wan/slic_ds26522.c, which in turn is a driver
      for fairly rare hardware (adding the maintainer to Cc for testing).
      
      Link: http://lkml.kernel.org/r/20190322140503.123580-1-arnd@arndb.de
      Fixes: 556d2f05 ("ARM: 8187/1: add CONFIG_HAVE_ARCH_BITREVERSE to support rbit instruction")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Cc: Zhao Qiang <qiang.zhao@nxp.com>
      Cc: Yalin Wang <yalin.wang@sonymobile.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6147e136
    • N
      lib/string.c: implement a basic bcmp · 5f074f3e
      Nick Desaulniers 提交于
      A recent optimization in Clang (r355672) lowers comparisons of the
      return value of memcmp against zero to comparisons of the return value
      of bcmp against zero.  This helps some platforms that implement bcmp
      more efficiently than memcmp.  glibc simply aliases bcmp to memcmp, but
      an optimized implementation is in the works.
      
      This results in linkage failures for all targets with Clang due to the
      undefined symbol.  For now, just implement bcmp as a tailcail to memcmp
      to unbreak the build.  This routine can be further optimized in the
      future.
      
      Other ideas discussed:
      
       * A weak alias was discussed, but breaks for architectures that define
         their own implementations of memcmp since aliases to declarations are
         not permitted (only definitions). Arch-specific memcmp
         implementations typically declare memcmp in C headers, but implement
         them in assembly.
      
       * -ffreestanding also is used sporadically throughout the kernel.
      
       * -fno-builtin-bcmp doesn't work when doing LTO.
      
      Link: https://bugs.llvm.org/show_bug.cgi?id=41035
      Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
      Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
      Link: https://github.com/ClangBuiltLinux/linux/issues/416
      Link: http://lkml.kernel.org/r/20190313211335.165605-1-ndesaulniers@google.comSigned-off-by: NNick Desaulniers <ndesaulniers@google.com>
      Reported-by: NNathan Chancellor <natechancellor@gmail.com>
      Reported-by: NAdhemerval Zanella <adhemerval.zanella@linaro.org>
      Suggested-by: NArnd Bergmann <arnd@arndb.de>
      Suggested-by: NJames Y Knight <jyknight@google.com>
      Suggested-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Suggested-by: NNathan Chancellor <natechancellor@gmail.com>
      Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
      Tested-by: NNathan Chancellor <natechancellor@gmail.com>
      Reviewed-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f074f3e
  7. 05 4月, 2019 2 次提交
    • S
      syscalls: Remove start and number from syscall_set_arguments() args · 32d92586
      Steven Rostedt (VMware) 提交于
      After removing the start and count arguments of syscall_get_arguments() it
      seems reasonable to remove them from syscall_set_arguments(). Note, as of
      today, there are no users of syscall_set_arguments(). But we are told that
      there will be soon. But for now, at least make it consistent with
      syscall_get_arguments().
      
      Link: http://lkml.kernel.org/r/20190327222014.GA32540@altlinux.org
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: NDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      32d92586
    • S
      syscalls: Remove start and number from syscall_get_arguments() args · b35f549d
      Steven Rostedt (Red Hat) 提交于
      At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
      function call syscall_get_arguments() implemented in x86 was horribly
      written and not optimized for the standard case of passing in 0 and 6 for
      the starting index and the number of system calls to get. When looking at
      all the users of this function, I discovered that all instances pass in only
      0 and 6 for these arguments. Instead of having this function handle
      different cases that are never used, simply rewrite it to return the first 6
      arguments of a system call.
      
      This should help out the performance of tracing system calls by ptrace,
      ftrace and perf.
      
      Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: NDmitry V. Levin <ldv@altlinux.org>
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      b35f549d
  8. 04 4月, 2019 1 次提交
    • S
      ptrace: Remove maxargs from task_current_syscall() · 631b7aba
      Steven Rostedt (Red Hat) 提交于
      task_current_syscall() has a single user that passes in 6 for maxargs, which
      is the maximum arguments that can be used to get system calls from
      syscall_get_arguments(). Instead of passing in a number of arguments to
      grab, just get 6 arguments. The args argument even specifies that it's an
      array of 6 items.
      
      This will also allow changing syscall_get_arguments() to not get a variable
      number of arguments, but always grab 6.
      
      Linus also suggested not passing in a bunch of arguments to
      task_current_syscall() but to instead pass in a pointer to a structure, and
      just fill the structure. struct seccomp_data has almost all the parameters
      that is needed except for the stack pointer (sp). As seccomp_data is part of
      uapi, and I'm afraid to change it, a new structure was created
      "syscall_info", which includes seccomp_data and adds the "sp" field.
      
      Link: http://lkml.kernel.org/r/20161107213233.466776454@goodmis.org
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      631b7aba
  9. 02 4月, 2019 3 次提交
  10. 30 3月, 2019 6 次提交
  11. 29 3月, 2019 2 次提交
    • E
      netns: provide pure entropy for net_hash_mix() · 355b9855
      Eric Dumazet 提交于
      net_hash_mix() currently uses kernel address of a struct net,
      and is used in many places that could be used to reveal this
      address to a patient attacker, thus defeating KASLR, for
      the typical case (initial net namespace, &init_net is
      not dynamically allocated)
      
      I believe the original implementation tried to avoid spending
      too many cycles in this function, but security comes first.
      
      Also provide entropy regardless of CONFIG_NET_NS.
      
      Fixes: 0b441916 ("netns: introduce the net_hash_mix "salt" for hashes")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAmit Klein <aksecurity@gmail.com>
      Reported-by: NBenny Pinkas <benny@pinkas.net>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      355b9855
    • M
      KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported · 3d9683cf
      Masahiro Yamada 提交于
      I do not see any consistency about headers_install of <linux/kvm_para.h>
      and <asm/kvm_para.h>.
      
      According to my analysis of Linux 5.1-rc1, there are 3 groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          alpha, arm, hexagon, mips, powerpc, s390, sparc, x86
      
       [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not
      
          arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc,
          parisc, sh, unicore32, xtensa
      
       [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          csky, nds32, riscv
      
      This does not match to the actual KVM support. At least, [2] is
      half-baked.
      
      Nor do arch maintainers look like they care about this. For example,
      commit 0add5371 ("microblaze: Add missing kvm_para.h to Kbuild")
      exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
      build error.
      
      We have two ways to make this consistent:
      
       [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
           architectures, irrespective of the KVM support
      
       [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
           to the KVM support
      
      My first attempt was [A] because the code looks cleaner, but Paolo
      suggested [B].
      
      So, this commit goes with [B].
      
      For most architectures, <asm/kvm_para.h> was moved to the kernel-space.
      I changed include/uapi/linux/Kbuild so that it checks generated
      asm/kvm_para.h as well as check-in ones.
      
      After this commit, there will be two groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          arm, arm64, mips, powerpc, s390, x86
      
       [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze,
          nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: NCornelia Huck <cohuck@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3d9683cf
  12. 28 3月, 2019 2 次提交
  13. 27 3月, 2019 1 次提交
  14. 26 3月, 2019 2 次提交
    • B
      proc/kcore: Remove unused kclist_add_remap() · db779ef6
      Bhupesh Sharma 提交于
      Commit
      
        bf904d27 ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
      
      removed the sole usage of kclist_add_remap() but it did not remove the
      left-over definition from the include file.
      
      Fix the same.
      Signed-off-by: NBhupesh Sharma <bhsharma@redhat.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Anderson <anderson@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kairui Song <kasong@redhat.com>
      Cc: kexec@lists.infradead.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/1553583028-17804-1-git-send-email-bhsharma@redhat.com
      db779ef6
    • L
      Revert "parport: daisy: use new parport device model" · a3ac7917
      Linus Torvalds 提交于
      This reverts commit 1aec4211.
      
      Steven Rostedt reports that it causes a hang at bootup and bisected it
      to this commit.
      
      The troigger is apparently a module alias for "parport_lowlevel" that
      points to "parport_pc", which causes a hang with
      
          modprobe -q -- parport_lowlevel
      
      blocking forever with a backtrace like this:
      
          wait_for_completion_killable+0x1c/0x28
          call_usermodehelper_exec+0xa7/0x108
          __request_module+0x351/0x3d8
          get_lowlevel_driver+0x28/0x41 [parport]
          __parport_register_driver+0x39/0x1f4 [parport]
          daisy_drv_init+0x31/0x4f [parport]
          parport_bus_init+0x5d/0x7b [parport]
          parport_default_proc_register+0x26/0x1000 [parport]
          do_one_initcall+0xc2/0x1e0
          do_init_module+0x50/0x1d4
          load_module+0x1c2e/0x21b3
          sys_init_module+0xef/0x117
      
      Supid says:
       "Due to the new device model daisy driver will now try to find the
        parallel ports while trying to register its driver so that it can bind
        with them. Now, since daisy driver is loaded while parport bus is
        initialising the list of parport is still empty and it tries to load
        the lowlevel driver, which has an alias set to parport_pc, now causes
        a deadlock"
      
      But I don't think the daisy driver should be loaded by the parport
      initialization in the first place, so let's revert the whole change.
      
      If the daisy driver can just initialize separately on its own (like a
      driver should), instead of hooking into the parport init sequence
      directly, this issue probably would go away.
      Reported-and-bisected-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3ac7917
  15. 25 3月, 2019 2 次提交
  16. 23 3月, 2019 2 次提交
  17. 22 3月, 2019 4 次提交
    • T
      gpio: amd-fch: Fix bogus SPDX identifier · b45a02e1
      Thomas Gleixner 提交于
      spdxcheck.py complains:
      
       include/linux/platform_data/gpio/gpio-amd-fch.h: 1:28 Invalid License ID: GPL+
      
      which is correct because GPL+ is not a valid identifier. Of course this
      could have been caught by checkpatch.pl _before_ submitting or merging the
      patch.
      
       WARNING: 'SPDX-License-Identifier: GPL+ */' is not supported in LICENSES/...
       #271: FILE: include/linux/platform_data/gpio/gpio-amd-fch.h:1:
       +/* SPDX-License-Identifier: GPL+ */
      
      Fix it under the assumption that the author meant GPL-2.0+, which makes
      sense as the corresponding C file is using that identifier.
      
      Fixes: e09d168f ("gpio: AMD G-Series PCH gpio driver")
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NBartosz Golaszewski <bgolaszewski@baylibre.com>
      b45a02e1
    • D
      net/sched: let actions use RCU to access 'goto_chain' · ee3bbfe8
      Davide Caratti 提交于
      use RCU when accessing the action chain, to avoid use after free in the
      traffic path when 'goto chain' is replaced on existing TC actions (see
      script below). Since the control action is read in the traffic path
      without holding the action spinlock, we need to explicitly ensure that
      a->goto_chain is not NULL before dereferencing (i.e it's not sufficient
      to rely on the value of TC_ACT_GOTO_CHAIN bits). Not doing so caused NULL
      dereferences in tcf_action_goto_chain_exec() when the following script:
      
       # tc chain add dev dd0 chain 42 ingress protocol ip flower \
       > ip_proto udp action pass index 4
       # tc filter add dev dd0 ingress protocol ip flower \
       > ip_proto udp action csum udp goto chain 42 index 66
       # tc chain del dev dd0 chain 42 ingress
       (start UDP traffic towards dd0)
       # tc action replace action csum udp pass index 66
      
      was run repeatedly for several hours.
      Suggested-by: NCong Wang <xiyou.wangcong@gmail.com>
      Suggested-by: NVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee3bbfe8
    • D
      net/sched: don't dereference a->goto_chain to read the chain index · fe384e2f
      Davide Caratti 提交于
      callers of tcf_gact_goto_chain_index() can potentially read an old value
      of the chain index, or even dereference a NULL 'goto_chain' pointer,
      because 'goto_chain' and 'tcfa_action' are read in the traffic path
      without caring of concurrent write in the control path. The most recent
      value of chain index can be read also from a->tcfa_action (it's encoded
      there together with TC_ACT_GOTO_CHAIN bits), so we don't really need to
      dereference 'goto_chain': just read the chain id from the control action.
      
      Fixes: e457d86a ("net: sched: add couple of goto_chain helpers")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe384e2f
    • D
      net/sched: prepare TC actions to properly validate the control action · 85d0966f
      Davide Caratti 提交于
      - pass a pointer to struct tcf_proto in each actions's init() handler,
        to allow validating the control action, checking whether the chain
        exists and (eventually) refcounting it.
      - remove code that validates the control action after a successful call
        to the action's init() handler, and replace it with a test that forbids
        addition of actions having 'goto_chain' and NULL goto_chain pointer at
        the same time.
      - add tcf_action_check_ctrlact(), that will validate the control action
        and eventually allocate the action 'goto_chain' within the init()
        handler.
      - add tcf_action_set_ctrlact(), that will assign the control action and
        swap the current 'goto_chain' pointer with the new given one.
      
      This disallows 'goto_chain' on actions that don't initialize it properly
      in their init() handler, i.e. calling tcf_action_check_ctrlact() after
      successful IDR reservation and then calling tcf_action_set_ctrlact()
      to assign 'goto_chain' and 'tcf_action' consistently.
      
      By doing this, the kernel does not leak anymore refcounts when a valid
      'goto chain' handle is replaced in TC actions, causing kmemleak splats
      like the following one:
      
       # tc chain add dev dd0 chain 42 ingress protocol ip flower \
       > ip_proto tcp action drop
       # tc chain add dev dd0 chain 43 ingress protocol ip flower \
       > ip_proto udp action drop
       # tc filter add dev dd0 ingress matchall \
       > action gact goto chain 42 index 66
       # tc filter replace dev dd0 ingress matchall \
       > action gact goto chain 43 index 66
       # echo scan >/sys/kernel/debug/kmemleak
       <...>
       unreferenced object 0xffff93c0ee09f000 (size 1024):
       comm "tc", pid 2565, jiffies 4295339808 (age 65.426s)
       hex dump (first 32 bytes):
         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
         00 00 00 00 08 00 06 00 00 00 00 00 00 00 00 00  ................
       backtrace:
         [<000000009b63f92d>] tc_ctl_chain+0x3d2/0x4c0
         [<00000000683a8d72>] rtnetlink_rcv_msg+0x263/0x2d0
         [<00000000ddd88f8e>] netlink_rcv_skb+0x4a/0x110
         [<000000006126a348>] netlink_unicast+0x1a0/0x250
         [<00000000b3340877>] netlink_sendmsg+0x2c1/0x3c0
         [<00000000a25a2171>] sock_sendmsg+0x36/0x40
         [<00000000f19ee1ec>] ___sys_sendmsg+0x280/0x2f0
         [<00000000d0422042>] __sys_sendmsg+0x5e/0xa0
         [<000000007a6c61f9>] do_syscall_64+0x5b/0x180
         [<00000000ccd07542>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
         [<0000000013eaa334>] 0xffffffffffffffff
      
      Fixes: db50514f ("net: sched: add termination action to allow goto chain")
      Fixes: 97763dc0 ("net_sched: reject unknown tcfa_action values")
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85d0966f
  18. 21 3月, 2019 1 次提交