1. 04 Jun, 2019 3 commits
  2. 31 May, 2019 11 commits
    •
      overflow: Fix -Wtype-limits compilation warnings · 3c2b1ae4
      Leon Romanovsky committed
      [ Upstream commit dc7fe518b0493faa0af0568d6d8c2a33c00f58d0 ]
      
      Attempt to use check_shl_overflow() with inputs of unsigned type
      produces the following compilation warnings.
      
      drivers/infiniband/hw/mlx5/qp.c: In function 'set_user_rq_size':
      ./include/linux/overflow.h:230:6: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
         _s >= 0 && _s < 8 * sizeof(*d) ? _s : 0;  \
            ^~
      drivers/infiniband/hw/mlx5/qp.c:5820:6: note: in expansion of macro 'check_shl_overflow'
        if (check_shl_overflow(rwq->wqe_count, rwq->wqe_shift, &rwq->buf_size))
            ^~~~~~~~~~~~~~~~~~
      ./include/linux/overflow.h:232:26: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        (_to_shift != _s || *_d < 0 || _a < 0 ||   \
                                ^
      drivers/infiniband/hw/mlx5/qp.c:5820:6: note: in expansion of macro 'check_shl_overflow'
        if (check_shl_overflow(rwq->wqe_count, rwq->wqe_shift, &rwq->buf_size))
            ^~~~~~~~~~~~~~~~~~
      ./include/linux/overflow.h:232:36: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        (_to_shift != _s || *_d < 0 || _a < 0 ||   \
                                          ^
      drivers/infiniband/hw/mlx5/qp.c:5820:6: note: in expansion of macro 'check_shl_overflow'
        if (check_shl_overflow(rwq->wqe_count, rwq->wqe_shift, &rwq->buf_size))
            ^~~~~~~~~~~~~~~~~~
      
      Fixes: 0c668477 ("overflow.h: Add arithmetic shift helper")
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3c2b1ae4
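The warnings above come from comparing an unsigned value against zero. A minimal sketch of one way to write a sign test that compiles cleanly for both signed and unsigned operands follows. The macro names and the two `sane_shift_*` helpers are illustrative assumptions for this page, not a copy of what `include/linux/overflow.h` contains.

```c
#include <assert.h>

/*
 * Sign tests usable with signed and unsigned operands alike.
 * A direct "(a) >= 0" is always true for unsigned types and trips
 * -Wtype-limits; splitting it into two comparisons avoids that form.
 */
#define is_non_negative(a) ((a) > 0 || (a) == 0)
#define is_negative(a)     (!(is_non_negative(a)))

/* Hypothetical helper: keep the shift count if it fits a 32-bit
 * destination, otherwise fall back to 0. */
static unsigned int sane_shift_unsigned(unsigned int s)
{
	return (is_non_negative(s) && s < 32) ? s : 0;
}

/* Same check for a signed shift count; negative counts are rejected. */
static int sane_shift_signed(int s)
{
	return (is_non_negative(s) && s < 32) ? s : 0;
}

/* Small self-check showing the intended behavior. */
void sign_helpers_self_check(void)
{
	assert(sane_shift_unsigned(5) == 5);
	assert(sane_shift_unsigned(40) == 0);
	assert(sane_shift_signed(-1) == 0);
	assert(is_negative(-3));
	assert(!is_negative(3u));
}
```

The same macro body serves both helpers, which is the point: the caller does not need to know whether the operand type is signed.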
    •
      timekeeping: Force upper bound for setting CLOCK_REALTIME · dc0f37b7
      Thomas Gleixner committed
      [ Upstream commit 7a8e61f8478639072d402a26789055a4a4de8f77 ]
      
      Several people reported testing failures after setting CLOCK_REALTIME close
      to the limits of the kernel internal representation in nanoseconds,
      i.e. year 2262.
      
      The failures are exposed in subsequent operations, i.e. when arming timers
      or when the advancing CLOCK_MONOTONIC makes the calculation of
      CLOCK_REALTIME overflow into negative space.
      
      Now people start to paper over the underlying problem by clamping
      calculations to the valid range, but that's just wrong because such
      workarounds will prevent detection of real issues as well.
      
      It is reasonable to force an upper bound for the various methods of setting
      CLOCK_REALTIME. Year 2262 is the absolute upper bound. Assume a maximum
      uptime of 30 years which is plenty enough even for esoteric embedded
      systems. That results in an upper bound of year 2232 for setting the time.
      
      Once that limit is reached in reality this limit is only a small part of
      the problem space. But until then this stops people from trying to paper
      over the problem at the wrong places.
      Reported-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reported-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Stephen Boyd <sboyd@kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903231125480.2157@nanos.tec.linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dc0f37b7
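The year figures in the message above can be reproduced with a little arithmetic. The sketch below assumes a signed 64-bit nanosecond counter starting at 1970 and a 30-year uptime allowance of 365-day years. All `SKETCH_*` names and both helper functions are made up for illustration, and the year conversion uses an average Gregorian year, so it is only approximate.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC   1000000000LL
/* Largest whole number of seconds that fits in signed 64-bit nanoseconds. */
#define SKETCH_KTIME_SEC_MAX  (INT64_MAX / SKETCH_NSEC_PER_SEC)
/* Assumed maximum uptime: 30 years of 365 days. */
#define SKETCH_UPTIME_SEC_MAX (30LL * 365 * 24 * 3600)
/* Upper bound for setting the clock, leaving room for the uptime. */
#define SKETCH_SETTOD_SEC_MAX (SKETCH_KTIME_SEC_MAX - SKETCH_UPTIME_SEC_MAX)

/* Hypothetical validity check for a requested wall-clock time. */
static int sketch_settod_is_valid(int64_t tv_sec)
{
	return tv_sec >= 0 && tv_sec <= SKETCH_SETTOD_SEC_MAX;
}

/* Approximate calendar year, using an average year of 365.2425 days. */
static int sketch_approx_year(int64_t sec)
{
	return 1970 + (int)(sec / 31556952LL);
}

/* Self-check: the bounds land in the years quoted in the message. */
void sketch_time_bounds_self_check(void)
{
	assert(sketch_approx_year(SKETCH_KTIME_SEC_MAX) == 2262);
	assert(sketch_approx_year(SKETCH_SETTOD_SEC_MAX) == 2232);
	assert(sketch_settod_is_valid(0));
	assert(!sketch_settod_is_valid(SKETCH_SETTOD_SEC_MAX + 1));
}
```

Rejecting the value at the point where the clock is set keeps later timer arithmetic free of clamping, which is the design choice the message argues for.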
    •
      HID: core: move Usage Page concatenation to Main item · 69f67200
      Nicolas Saenz Julienne committed
      [ Upstream commit 58e75155009cc800005629955d3482f36a1e0eec ]
      
      As seen on some USB wireless keyboards manufactured by Primax, the HID
      parser was using some assumptions that are not always true. In this case
      it's the fact that, inside the scope of a main item, a Usage Page
      will always precede a Usage.
      
      The spec is not very clear, as 6.2.2.7 states "Any usage that follows
      is interpreted as a Usage ID and concatenated with the Usage Page",
      while 6.2.2.8 states "When the parser encounters a main item it
      concatenates the last declared Usage Page with a Usage to form a
      complete usage value." Since the two are somewhat contradictory, it was
      decided to match Windows' implementation, which follows 6.2.2.8.
      
      In summary, the patch moves the Usage Page concatenation from the local
      item parsing function to the main item parsing function.
      Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
      Reviewed-by: Terry Junge <terry.junge@poly.com>
      Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      69f67200
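To make the ordering issue concrete, here is a minimal sketch of deferring the Usage Page concatenation until the main item is parsed. The `sketch_*` types and functions are hypothetical and far simpler than the real HID parser. The sketch assumes that usages of up to 2 bytes need the page in the upper 16 bits, while 4-byte usages already carry their own page.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_USAGES 8

struct sketch_parser {
	uint16_t usage_page;                    /* last declared Usage Page */
	uint32_t usage[SKETCH_MAX_USAGES];      /* Usage values as declared */
	uint8_t  usage_size[SKETCH_MAX_USAGES]; /* item size in bytes */
	unsigned int count;
};

/* Global item: remember the page, leave pending usages untouched. */
static void sketch_set_usage_page(struct sketch_parser *p, uint16_t page)
{
	p->usage_page = page;
}

/* Local item: store the usage without concatenating anything yet. */
static int sketch_add_usage(struct sketch_parser *p, uint32_t value,
			    uint8_t size)
{
	if (p->count >= SKETCH_MAX_USAGES)
		return -1;
	p->usage[p->count] = value;
	p->usage_size[p->count] = size;
	p->count++;
	return 0;
}

/* Main item: apply the last declared Usage Page to short usages. */
static void sketch_main_item(struct sketch_parser *p)
{
	for (unsigned int i = 0; i < p->count; i++) {
		if (p->usage_size[i] <= 2)
			p->usage[i] = ((uint32_t)p->usage_page << 16) |
				      (p->usage[i] & 0xffff);
	}
}

/* Self-check: a Usage declared before its Usage Page still resolves. */
void sketch_hid_self_check(void)
{
	struct sketch_parser p = {0};

	assert(sketch_add_usage(&p, 0x06, 1) == 0); /* Usage comes first */
	sketch_set_usage_page(&p, 0x01);            /* page declared later */
	sketch_main_item(&p);
	assert(p.usage[0] == 0x00010006u);
}
```

Had the concatenation happened in `sketch_add_usage()`, the first usage would have been combined with whatever page was set at that moment, which is the assumption the message describes as not always true.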
    •
      iio: ad_sigma_delta: Properly handle SPI bus locking vs CS assertion · d7c77341
      Lars-Peter Clausen committed
      [ Upstream commit df1d80aee963480c5c2938c64ec0ac3e4a0df2e0 ]
      
      For devices from the SigmaDelta family we need to keep CS low when doing a
      conversion, since the device will use the MISO line as an interrupt to
      indicate that the conversion is complete.
      
      This is why the driver locks the SPI bus and, while the bus is locked,
      keeps CS asserted for as long as a conversion is going on. The current
      implementation gets one small detail wrong though. CS is only de-asserted
      after the SPI bus is unlocked. This means it is possible for a different
      SPI device on the same bus to send a message which would wrongfully be
      addressed to the SigmaDelta device as well. Make sure that the last SPI
      transfer that is done while holding the SPI bus lock de-asserts the CS
      signal.
      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Signed-off-by: Alexandru Ardelean <Alexandru.Ardelean@analog.com>
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d7c77341
    •
      cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4e4d5cea
      Roman Gushchin committed
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So it's safe to read these counters with either cgroup_mutex or
      css_set_lock held, while changing them requires both locks to be acquired.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4e4d5cea
    •
      block: fix use-after-free on gendisk · ad393793
      Yufen Yu committed
      [ Upstream commit 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd ]
      
      commit 2da78092 "block: Fix dev_t minor allocation lifetime"
      specifically moved blk_free_devt(dev->devt) call to part_release()
      to avoid reallocating device number before the device is fully
      shutdown.
      
      However, it can cause a use-after-free on gendisk in get_gendisk().
      We use an md device as an example to show the race scenario:
      
      Process1                Worker                      Process2
      md_free
                                                          blkdev_open
      del_gendisk
        add delete_partition_work_fn() to wq
                                                          __blkdev_get
                                                          get_gendisk
      put_disk
        disk_release
          kfree(disk)
                                                          find part from ext_devt_idr
                                                          get_disk_and_module(disk)
                                                          cause use after free

                              delete_partition_work_fn
                              put_device(part)
                                part_release
                                  remove part from ext_devt_idr
      
      Before <devt, hd_struct pointer> is removed from ext_devt_idr by
      delete_partition_work_fn(), we can find the devt and then access
      gendisk by hd_struct pointer. But if we access the gendisk after
      it has been freed, it causes a use-after-free on gendisk in
      get_gendisk().
      
      We fix this by adding a new helper blk_invalidate_devt() in
      delete_partition() and del_gendisk(). It replaces hd_struct
      pointer in idr with value 'NULL', and deletes the entry from
      idr in part_release() as we do now.
      
      Thanks to Jan Kara for providing the solution and clearer comments
      for the code.
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ad393793
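The fix described above separates two steps: making a device number stop resolving to an object, and making the number available for reuse. A minimal userspace sketch of that idea follows, with a small fixed table standing in for the idr. Every `sketch_*` name is hypothetical and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SLOTS 4

struct sketch_part { int partno; };

struct sketch_devt_table {
	bool in_use[SKETCH_SLOTS];              /* number is reserved */
	struct sketch_part *part[SKETCH_SLOTS]; /* object, or NULL if invalidated */
};

/* Reserve the lowest free number and bind it to a partition. */
static int sketch_devt_alloc(struct sketch_devt_table *t, struct sketch_part *p)
{
	for (int i = 0; i < SKETCH_SLOTS; i++) {
		if (!t->in_use[i]) {
			t->in_use[i] = true;
			t->part[i] = p;
			return i;
		}
	}
	return -1;
}

/* Early step: lookups stop finding the object, the number stays reserved. */
static void sketch_devt_invalidate(struct sketch_devt_table *t, int devt)
{
	if (devt >= 0 && devt < SKETCH_SLOTS)
		t->part[devt] = NULL;
}

/* Late step (at release time): the number may be handed out again. */
static void sketch_devt_free(struct sketch_devt_table *t, int devt)
{
	if (devt >= 0 && devt < SKETCH_SLOTS) {
		t->part[devt] = NULL;
		t->in_use[devt] = false;
	}
}

/* Lookup returns NULL for unknown or invalidated numbers. */
static struct sketch_part *sketch_devt_lookup(struct sketch_devt_table *t,
					      int devt)
{
	if (devt < 0 || devt >= SKETCH_SLOTS || !t->in_use[devt])
		return NULL;
	return t->part[devt];
}

/* Self-check: an invalidated number resolves to nothing but is not reused. */
void sketch_devt_self_check(void)
{
	struct sketch_devt_table t = {0};
	struct sketch_part a = {1}, b = {2};
	int da = sketch_devt_alloc(&t, &a);

	sketch_devt_invalidate(&t, da);
	assert(sketch_devt_lookup(&t, da) == NULL);
	assert(sketch_devt_alloc(&t, &b) != da);
}
```

In this sketch a caller that races with teardown gets NULL from the lookup instead of a pointer into freed memory, while the number itself stays reserved until the final release.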
    •
      smpboot: Place the __percpu annotation correctly · 3dc1e338
      Sebastian Andrzej Siewior committed
      [ Upstream commit d4645d30b50d1691c26ff0f8fa4e718b08f8d3bb ]
      
      The test robot reported a wrong assignment of a per-CPU variable which
      it detected by using sparse and sent a report. The assignment itself is
      correct. The annotation for sparse was wrong and hence the report.
      The first pointer is a "normal" pointer and points to the per-CPU memory
      area. That means that the __percpu annotation has to be moved.
      
      Move the __percpu annotation to the pointer which points to the per-CPU
      area. This change affects only the sparse tool (and is ignored by the
      compiler).
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: f97f8f06 ("smpboot: Provide infrastructure for percpu hotplug threads")
      Link: http://lkml.kernel.org/r/20190424085253.12178-1-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3dc1e338
    •
      x86/modules: Avoid breaking W^X while loading modules · 8715ce03
      Nadav Amit committed
      [ Upstream commit f2c65fb3221adc6b73b0549fc7ba892022db9797 ]
      
      When modules and BPF filters are loaded, there is a time window in
      which some memory is both writable and executable. An attacker that has
      already found another vulnerability (e.g., a dangling pointer) might be
      able to exploit this behavior to overwrite kernel code. Prevent having
      writable executable PTEs in this stage.
      
      In addition, avoiding having W+X mappings can also slightly simplify the
      patching of modules code on initialization (e.g., by alternatives and
      static-key), as would be done in the next patch. This was actually the
      main motivation for this patch.
      
      To avoid having W+X mappings, set them initially as RW (NX) and after
      they are set as RO set them as X as well. Setting them as executable is
      done as a separate step to avoid one core in which the old PTE is cached
      (hence writable), and another which sees the updated PTE (executable),
      which would break the W^X protection.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8715ce03
    •
      acct_on(): don't mess with freeze protection · 7c2bcb3c
      Al Viro committed
      commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream.
      
      What happens there is that we are replacing file->path.mnt of
      a file we'd just opened with a clone and we need the write
      count contribution to be transferred from original mount to
      new one.  That's it.  We do *NOT* want any kind of freeze
      protection for the duration of switchover.
      
      IOW, we should just use __mnt_{want,drop}_write() for that
      switchover; no need to bother with mnt_{want,drop}_write()
      there.
      Tested-by: Amir Goldstein <amir73il@gmail.com>
      Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c2bcb3c
    •
      bpf: add bpf_jit_limit knob to restrict unpriv allocations · 43caa29c
      Daniel Borkmann committed
      commit ede95a63b5e84ddeea6b0c473b36ab8bfd8c6ce3 upstream.
      
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If the JIT was enabled but failed to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays, when
      CONFIG_BPF_JIT_ALWAYS_ON is set, the load will abort
      with a failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users such that in case
      of BPF interpreter compiled out we fail once the limit has been reached
      or we fall back to BPF interpreter earlier w/o using module mem if latter
      was compiled in. In a next step, fair share among unprivileged users can
      be resolved in particular for the case where we would fail hard once limit
      is reached.
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43caa29c
    •
      bio: fix improper use of smp_mb__before_atomic() · b78255d6
      Andrea Parri committed
      commit f381c6a4bd0ae0fde2d6340f1b9bb0f58d915de6 upstream.
      
      This barrier only applies to the read-modify-write operations; in
      particular, it does not apply to the atomic_set() primitive.
      
      Replace the barrier with an smp_mb().
      
      Fixes: dac56212 ("bio: skip atomic inc/dec of ->bi_cnt for most use cases")
      Cc: stable@vger.kernel.org
      Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b78255d6
  3. 26 May, 2019 6 commits
    •
      bpf: add map_lookup_elem_sys_only for lookups from syscall side · 2bb3c547
      Daniel Borkmann committed
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations
      could use over map_lookup_elem() from system call side in case the
      map implementation needs to handle the latter differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, it will
      be the preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for LRU map, but once development window
      opens, we can convert other map types from map_lookup_elem() (here,
      the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
      the callback to simplify and clean up the latter.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      2bb3c547
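The "preferred pick" behavior described above is an optional callback with a fallback. The sketch below shows that pattern in plain C. The `sketch_*` structures and the `touched` counter (standing in for a lookup side effect, such as marking an LRU entry as recently used) are assumptions made for illustration and are not the kernel's `struct bpf_map_ops`.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_KEYS 4

struct sketch_map;

struct sketch_map_ops {
	/* Lookup used on the data path; may have side effects. */
	int *(*lookup_elem)(struct sketch_map *map, int key);
	/* Optional lookup for the system call side; may be NULL. */
	int *(*lookup_elem_sys_only)(struct sketch_map *map, int key);
};

struct sketch_map {
	const struct sketch_map_ops *ops;
	int values[SKETCH_KEYS];
	int touched[SKETCH_KEYS]; /* counts side-effecting lookups */
};

static int *sketch_lookup(struct sketch_map *map, int key)
{
	if (key < 0 || key >= SKETCH_KEYS)
		return NULL;
	map->touched[key]++; /* side effect of a data-path lookup */
	return &map->values[key];
}

static int *sketch_lookup_sys_only(struct sketch_map *map, int key)
{
	if (key < 0 || key >= SKETCH_KEYS)
		return NULL;
	return &map->values[key]; /* no side effect */
}

/* Syscall-side entry point: prefer the sys_only hook when it is set. */
static int *sketch_syscall_lookup(struct sketch_map *map, int key)
{
	if (map->ops->lookup_elem_sys_only)
		return map->ops->lookup_elem_sys_only(map, key);
	return map->ops->lookup_elem(map, key);
}

static const struct sketch_map_ops sketch_ops_with_hook = {
	.lookup_elem = sketch_lookup,
	.lookup_elem_sys_only = sketch_lookup_sys_only,
};

static const struct sketch_map_ops sketch_ops_plain = {
	.lookup_elem = sketch_lookup,
	.lookup_elem_sys_only = NULL,
};

/* Self-check: the hook avoids the side effect, the fallback keeps it. */
void sketch_map_self_check(void)
{
	struct sketch_map a = { &sketch_ops_with_hook, {1, 2, 3, 4}, {0} };
	struct sketch_map b = { &sketch_ops_plain, {1, 2, 3, 4}, {0} };

	assert(*sketch_syscall_lookup(&a, 2) == 3 && a.touched[2] == 0);
	assert(*sketch_syscall_lookup(&b, 2) == 3 && b.touched[2] == 1);
}
```

Map types that do not set the hook keep their existing behavior, so the callback can be adopted one map type at a time.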
    •
      bpf: Fix preempt_enable_no_resched() abuse · c1528193
      Peter Zijlstra committed
      [ Upstream commit 0edd6b64d1939e9e9168ff27947995bb7751db5d ]
      
      Unless the very next line is schedule(), or implies it, one must not use
      preempt_enable_no_resched(). It can cause a preemption to go missing and
      thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
      
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c1528193
    •
      PCI: Work around Pericom PCIe-to-PCI bridge Retrain Link erratum · d5c35230
      Stefan Mätje committed
      commit 4ec73791a64bab25cabf16a6067ee478692e506d upstream.
      
      Due to an erratum in some Pericom PCIe-to-PCI bridges in reverse mode
      (conventional PCI on primary side, PCIe on downstream side), the Retrain
      Link bit needs to be cleared manually to allow the link training to
      complete successfully.
      
      If it is not cleared manually, the link training is continuously restarted
      and no devices below the PCI-to-PCIe bridge can be accessed.  That means
      drivers for devices below the bridge will be loaded but won't work and may
      even crash because the driver is only reading 0xffff.
      
      See the Pericom Errata Sheet PI7C9X111SLB_errata_rev1.2_102711.pdf for
      details.  Devices known as affected so far are: PI7C9X110, PI7C9X111SL,
      PI7C9X130.
      
      Add a new flag, clear_retrain_link, in struct pci_dev.  Quirks for affected
      devices set this bit.
      
      Note that pcie_retrain_link() lives in aspm.c because that's currently the
      only place we use it, but this erratum is not specific to ASPM, and we may
      retrain links for other reasons in the future.
      Signed-off-by: Stefan Mätje <stefan.maetje@esd.eu>
      [bhelgaas: apply regardless of CONFIG_PCIEASPM]
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d5c35230
    •
      of: fix clang -Wunsequenced for be32_to_cpu() · a29b8829
      Phong Tran committed
      commit 440868661f36071886ed360d91de83bd67c73b4f upstream.
      
      Now, make the loop explicit to avoid clang warning.
      
      ./include/linux/of.h:238:37: warning: multiple unsequenced modifications
      to 'cell' [-Wunsequenced]
                      r = (r << 32) | be32_to_cpu(*(cell++));
                                                        ^~
      ./include/linux/byteorder/generic.h:95:21: note: expanded from macro
      'be32_to_cpu'
                          ^
      ./include/uapi/linux/byteorder/little_endian.h:40:59: note: expanded
      from macro '__be32_to_cpu'
                                                                ^
      ./include/uapi/linux/swab.h:118:21: note: expanded from macro '__swab32'
              ___constant_swab32(x) :                 \
                                 ^
      ./include/uapi/linux/swab.h:18:12: note: expanded from macro
      '___constant_swab32'
              (((__u32)(x) & (__u32)0x000000ffUL) << 24) |            \
                        ^
      Signed-off-by: Phong Tran <tranmanphong@gmail.com>
      Reported-by: Nick Desaulniers <ndesaulniers@google.com>
      Link: https://github.com/ClangBuiltLinux/linux/issues/460
      Suggested-by: David Laight <David.Laight@ACULAB.COM>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: stable@vger.kernel.org
      [robh: fix up whitespace]
      Signed-off-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a29b8829
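The warning quoted above arises because `cell++` sits inside a macro argument that the byte-swap macros expand more than once. A minimal sketch of the "explicit loop" idea follows: the pointer advances in the loop header, so the argument has no side effect. The function names are hypothetical, and the sketch decodes big-endian cells from a byte buffer so that it does not depend on host endianness or on the kernel's `be32_to_cpu()`.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 32-bit big-endian cell from four bytes. */
static uint32_t sketch_be32_load(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Combine "size" consecutive big-endian cells into one number.
 * The pointer is advanced in the for-loop header, not inside the
 * expression that reads the cell.
 */
static uint64_t sketch_read_number(const uint8_t *cells, int size)
{
	uint64_t r = 0;

	for (; size--; cells += 4)
		r = (r << 32) | sketch_be32_load(cells);
	return r;
}

/* Self-check with two cells: 0x00000001 followed by 0x00000002. */
void sketch_read_number_self_check(void)
{
	const uint8_t cells[8] = { 0, 0, 0, 1, 0, 0, 0, 2 };

	assert(sketch_read_number(cells, 2) == 0x0000000100000002ULL);
	assert(sketch_read_number(cells, 0) == 0);
}
```

Keeping side effects out of macro arguments is a general precaution, since a caller cannot tell how many times a macro evaluates its argument.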
    •
      dcache: sort the freeing-without-RCU-delay mess for good. · c939121b
      Al Viro committed
      commit 5467a68cbf6884c9a9d91e2a89140afb1839c835 upstream.
      
      For lockless accesses to dentries we don't have pinned we rely
      (among other things) upon having an RCU delay between dropping
      the last reference and actually freeing the memory.
      
      On the other hand, for things like pipes and sockets we neither
      do that kind of lockless access, nor want to deal with the
      overhead of an RCU delay every time a socket gets closed.
      
      So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags
      made sure it would happen.  We tried to avoid setting it unless
      we knew we need it.  Unfortunately, that has led to a recurring
      class of bugs, in which we missed the need to set it.
      
      We only really need it for dentries that are created by
      d_alloc_pseudo(), so let's not bother with trying to be smart -
      just make having an RCU delay the default.  The ones that do
      *not* get it set the replacement flag (DCACHE_NORCU) and we'd
      better use that sparingly.  d_alloc_pseudo() is the only
      such user right now.
      
      FWIW, the race that finally prompted that switch had been
      between __lock_parent() of immediate subdirectory of what's
      currently the root of a disconnected tree (e.g. from
      open-by-handle in progress) racing with d_splice_alias()
      elsewhere picking another alias for the same inode, either
      on outright corrupted fs image, or (in case of open-by-handle
      on NFS) that subdirectory having been just moved on server.
      It's not easy to hit, so the sky is not falling, but that's
      not the first race on similar missed cases and the logic
      for setting DCACHE_RCUACCESS has gotten ridiculously
      convoluted.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c939121b
    •
      net: test nouarg before dereferencing zerocopy pointers · 3620e546
      Willem de Bruijn committed
      [ Upstream commit 185ce5c38ea76f29b6bd9c7c8c7a5e5408834920 ]
      
      Zerocopy skbs without completion notification were added for packet
      sockets with PACKET_TX_RING user buffers. Those signal completion
      through the TP_STATUS_USER bit in the ring. Zerocopy annotation was
      added only to avoid premature notification after clone or orphan, by
      triggering a copy on these paths for these packets.
      
      The mechanism had to define a special "no-uarg" mode because packet
      sockets already use skb_uarg(skb) == skb_shinfo(skb)->destructor_arg
      for a different pointer.
      
      Before dereferencing skb_uarg(skb), verify that it is a real pointer.
      
      Fixes: 5cd8d46ea1562 ("packet: copy user buffers before orphan or clone")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3620e546
  4. 22 May, 2019 5 commits
  5. 17 May, 2019 3 commits
  6. 15 May, 2019 2 commits
    •
      cpu/speculation: Add 'mitigations=' cmdline option · 8cb932ac
      Josh Poimboeuf committed
      commit 98af8452945c55652de68536afdde3b520fec429 upstream
      
      Keeping track of the number of mitigations for all the CPU speculation
      bugs has become overwhelming for many users.  It's getting more and more
      complicated to decide which mitigations are needed for a given
      architecture.  Complicating matters is the fact that each arch tends to
      have its own custom way to mitigate the same vulnerability.
      
      Most users fall into a few basic categories:
      
      a) they want all mitigations off;
      
      b) they want all reasonable mitigations on, with SMT enabled even if
         it's vulnerable; or
      
      c) they want all reasonable mitigations on, with SMT disabled if
         vulnerable.
      
      Define a set of curated, arch-independent options, each of which is an
      aggregation of existing options:
      
      - mitigations=off: Disable all mitigations.
      
      - mitigations=auto: [default] Enable all the default mitigations, but
        leave SMT enabled, even if it's vulnerable.
      
      - mitigations=auto,nosmt: Enable all the default mitigations, disabling
        SMT if needed by a mitigation.
      
      Currently, these options are placeholders which don't actually do
      anything.  They will be fleshed out in upcoming patches.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86)
      Reviewed-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux-s390@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-arch@vger.kernel.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Phil Auld <pauld@redhat.com>
      Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8cb932ac
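The three option values listed above map naturally onto a small enum. The following userspace sketch shows one way such a value could be parsed and queried. The enum, the function names, and the choice to treat unrecognized values as the default are all assumptions made for illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum sketch_mitigations {
	SKETCH_MITIGATIONS_OFF,
	SKETCH_MITIGATIONS_AUTO,       /* default */
	SKETCH_MITIGATIONS_AUTO_NOSMT,
};

/* Parse the value part of "mitigations=<value>". */
static enum sketch_mitigations sketch_parse_mitigations(const char *arg)
{
	if (!strcmp(arg, "off"))
		return SKETCH_MITIGATIONS_OFF;
	if (!strcmp(arg, "auto,nosmt"))
		return SKETCH_MITIGATIONS_AUTO_NOSMT;
	/* "auto" and, in this sketch, anything unrecognized. */
	return SKETCH_MITIGATIONS_AUTO;
}

/* Are mitigations enabled at all? */
static bool sketch_mitigations_enabled(enum sketch_mitigations m)
{
	return m != SKETCH_MITIGATIONS_OFF;
}

/* May a mitigation disable SMT if it needs to? */
static bool sketch_may_disable_smt(enum sketch_mitigations m)
{
	return m == SKETCH_MITIGATIONS_AUTO_NOSMT;
}

/* Self-check covering the three documented values. */
void sketch_mitigations_self_check(void)
{
	assert(!sketch_mitigations_enabled(sketch_parse_mitigations("off")));
	assert(sketch_mitigations_enabled(sketch_parse_mitigations("auto")));
	assert(!sketch_may_disable_smt(sketch_parse_mitigations("auto")));
	assert(sketch_may_disable_smt(sketch_parse_mitigations("auto,nosmt")));
}
```

Architecture code can then ask two simple questions (are mitigations on, and may SMT be disabled) instead of interpreting per-vulnerability options.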
    •
      x86/speculation/mds: Add sysfs reporting for MDS · 8230c202
      Thomas Gleixner committed
      commit 8a4b06d391b0a42a373808979b5028f5c84d9c6a upstream
      
      Add the sysfs reporting file for MDS. It exposes the vulnerability and
      mitigation state similar to the existing files for the other speculative
      hardware vulnerabilities.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Jon Masters <jcm@redhat.com>
      Tested-by: Jon Masters <jcm@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8230c202
  7. 10 May, 2019 1 commit
    •
      linux/kernel.h: Use parentheses around argument in u64_to_user_ptr() · 95587274
      Jann Horn committed
      [ Upstream commit a0fe2c6479aab5723239b315ef1b552673f434a3 ]
      
      Use parentheses around uses of the argument in u64_to_user_ptr() to
      ensure that the cast doesn't apply to part of the argument.
      
      There are existing uses of the macro of the form
      
        u64_to_user_ptr(A + B)
      
      which expands to
      
        (void __user *)(uintptr_t)A + B
      
      (the cast applies to the first operand of the addition, the addition
      is a pointer addition). This happens to still work as intended, the
      semantic difference doesn't cause a difference in behavior.
      
      But I want to use u64_to_user_ptr() with a ternary operator in the
      argument, like so:
      
        u64_to_user_ptr(A ? B : C)
      
      This currently doesn't work as intended.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qiaowei Ren <qiaowei.ren@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190329214652.258477-1-jannh@google.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      95587274
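The precedence problem described above is easy to reproduce outside the kernel. The sketch below uses a hypothetical pair of narrowing macros, not `u64_to_user_ptr()` itself, so that the effect is visible in plain integer results. The only difference between the two macros is the parentheses around the argument.

```c
#include <assert.h>
#include <stdint.h>

/* Cast binds to the first token of the argument only. */
#define SKETCH_NARROW_UNSAFE(x) ((uint8_t)x)
/* Cast applies to the whole argument expression. */
#define SKETCH_NARROW_SAFE(x)   ((uint8_t)(x))

/* Expands to ((uint8_t)a + b): only "a" is narrowed. */
static int sketch_unsafe_sum(int a, int b)
{
	return SKETCH_NARROW_UNSAFE(a + b);
}

static int sketch_safe_sum(int a, int b)
{
	return SKETCH_NARROW_SAFE(a + b);
}

/* Expands to ((uint8_t)flag ? a : b): the cast hits the condition. */
static int sketch_unsafe_pick(int flag, int a, int b)
{
	return SKETCH_NARROW_UNSAFE(flag ? a : b);
}

static int sketch_safe_pick(int flag, int a, int b)
{
	return SKETCH_NARROW_SAFE(flag ? a : b);
}

/* Self-check: 300 does not fit in 8 bits, 300 mod 256 is 44. */
void sketch_macro_self_check(void)
{
	assert(sketch_unsafe_sum(200, 100) == 300); /* not narrowed */
	assert(sketch_safe_sum(200, 100) == 44);
	assert(sketch_unsafe_pick(1, 300, 5) == 300); /* result not narrowed */
	assert(sketch_safe_pick(1, 300, 5) == 44);
}
```

With a flag of 256 the unparenthesized macro even selects the other branch, because the cast truncates the condition to zero before the ternary is evaluated.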
  8. 08 May, 2019 4 commits
    •
      clk: x86: Add system specific quirk to mark clocks as critical · d572a3a0
      David Müller committed
      commit 7c2e07130090ae001a97a6b65597830d6815e93e upstream.
      
      Since commit 648e9218 ("clk: x86: Stop marking clocks as
      CLK_IS_CRITICAL"), the pmc_plt_clocks of the Bay Trail SoC are
      unconditionally gated off. Unfortunately this will break systems where these
      clocks are used for external purposes beyond the kernel's knowledge. Fix it
      by implementing a system specific quirk to mark the necessary pmc_plt_clks as
      critical.
      
      Fixes: 648e9218 ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
      Signed-off-by: David Müller <dave.mueller@gmx.ch>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • K
      fs: stream_open - opener for stream-like files so that read and write can run... · 04b4d5f7
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      [ Upstream commit 10dce8af34226d90fa56746a934f8da5dcdba3df ]
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused a regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such a regression for the particular
      case of /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However, even though the 2006 version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      does so regardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on the passed ppos at all. And
      for those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write cannot proceed until the read is
      done, and the read is waiting for some, potentially external, event for
      a potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with a semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above, another regression caused by f_pos
      locking is that FUSE filesystems that implement open with the
      FOPEN_NONSEEKABLE flag can no longer implement bidirectional
      stream-like files, for the same reason as above: read can deadlock
      write on the file.f_pos lock in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem, where there is a /head/watch file for which open
      creates a separate bidirectional socket-like stream between the filesystem
      and its user, with both read and write later being performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. Using a semantic patch, search for and convert to stream_open all
         in-kernel nonseekable_open users for which read and write actually do
         not depend on ppos and where there are no other methods in
         file_operations which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         stream_open if that bit is present in the filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS, which actually uses offset in its read and
         write handlers:
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we made such a change it would break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26 first appeared).
      
         This will allow patching OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
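Steps 2 and 4 of the plan can be sketched as a small user-space model. Everything here is an assumption made for illustration: the bit values are arbitrary stand-ins for the kernel's f_mode and FUSE open-reply flags, and the `model_*` names are hypothetical. The only point is that stream_open additionally clears FMODE_ATOMIC_POS, so read and write no longer serialize on the f_pos lock.

```c
#include <assert.h>

/* Illustrative stand-ins for f_mode bits; the numeric values are
 * assumptions chosen for this sketch, not the kernel's values. */
#define FMODE_LSEEK       0x01u
#define FMODE_PREAD       0x02u
#define FMODE_PWRITE      0x04u
#define FMODE_ATOMIC_POS  0x08u

/* Stand-ins for FUSE open-reply flags; values are assumptions too. */
#define FOPEN_NONSEEKABLE 0x1u
#define FOPEN_STREAM      0x2u

struct model_file { unsigned int f_mode; };

/* Not seekable, but f_pos is still locked around read and write. */
static void model_nonseekable_open(struct model_file *f)
{
	f->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
}

/* Stream-like: not seekable and no f_pos locking either. */
static void model_stream_open(struct model_file *f)
{
	f->f_mode &= ~(FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE |
		       FMODE_ATOMIC_POS);
}

/* Step 4: prefer the stream opener when the filesystem asks for it. */
static void model_fuse_finish_open(struct model_file *f, unsigned int flags)
{
	if (flags & FOPEN_STREAM)
		model_stream_open(f);
	else if (flags & FOPEN_NONSEEKABLE)
		model_nonseekable_open(f);
}

/* Returns 1 if read/write on a file opened with `flags` would still
 * take the f_pos lock in this model, 0 otherwise. */
static int locks_f_pos_after_open(unsigned int flags)
{
	struct model_file f = {
		FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE | FMODE_ATOMIC_POS
	};

	model_fuse_finish_open(&f, flags);
	return (f.f_mode & FMODE_ATOMIC_POS) != 0;
}
```

In this model a reply of FOPEN_NONSEEKABLE alone keeps the f_pos lock, while FOPEN_STREAM | FOPEN_NONSEEKABLE drops it, which is why GVFS-style users that rely on the offset are left unaffected.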
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      an unknown reason despite the file being opened with nonseekable_open
      (e.g. drivers/input/mousedev.c).
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • A
      USB: core: Fix bug caused by duplicate interface PM usage counter · 83c6688d
      Alan Stern authored
      commit c2b71462d294cf517a0bc6e4fd6424d7cee5596f upstream.
      
      The syzkaller fuzzer reported a bug in the USB hub driver which turned
      out to be caused by a negative runtime-PM usage counter.  This allowed
      a hub to be runtime suspended at a time when the driver did not expect
      it.  The symptom is a WARNING issued because the hub's status URB is
      submitted while it is already active:
      
      	URB 0000000031fb463e submitted while active
      	WARNING: CPU: 0 PID: 2917 at drivers/usb/core/urb.c:363
      
      The negative runtime-PM usage count was caused by an unfortunate
      design decision made when runtime PM was first implemented for USB.
      At that time, USB class drivers were allowed to unbind from their
      interfaces without balancing the usage counter (i.e., leaving it with
      a positive count).  The core code would take care of setting the
      counter back to 0 before allowing another driver to bind to the
      interface.
      
      Later on when runtime PM was implemented for the entire kernel, the
      opposite decision was made: Drivers were required to balance their
      runtime-PM get and put calls.  In order to maintain backward
      compatibility, however, the USB subsystem adapted to the new
      implementation by keeping an independent usage counter for each
      interface and using it to automatically adjust the normal usage
      counter back to 0 whenever a driver was unbound.
      
      This approach involves duplicating information, but what is worse, it
      doesn't work properly in cases where a USB class driver delays
      decrementing the usage counter until after the driver's disconnect()
      routine has returned and the counter has been adjusted back to 0.
      Doing so would cause the usage counter to become negative.  There's
      even a warning about this in the USB power management documentation!
      
      As it happens, this is exactly what the hub driver does.  The
      kick_hub_wq() routine increments the runtime-PM usage counter, and the
      corresponding decrement is carried out by hub_event() in the context
      of the hub_wq work-queue thread.  This work routine may sometimes run
      after the driver has been unbound from its interface, and when it does
      it causes the usage counter to go negative.
      
      It is not possible for hub_disconnect() to wait for a pending
      hub_event() call to finish, because hub_disconnect() is called with
      the device lock held and hub_event() acquires that lock.  The only
      feasible fix is to reverse the original design decision: remove the
      duplicate interface-specific usage counter and require USB drivers to
      balance their runtime PM gets and puts.  As far as I know, all
      existing drivers currently do this.
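The failure mode described above can be followed with a toy counter model. This is a simplified sketch, not the USB core code: the struct and the `model_*` helpers are hypothetical, with `usage` standing in for the normal runtime-PM usage counter and `pm_usage_cnt` for the duplicate per-interface counter.

```c
#include <assert.h>

struct model_intf {
	int usage;        /* stands in for the normal runtime-PM counter */
	int pm_usage_cnt; /* stands in for the duplicate interface counter */
};

static void model_get(struct model_intf *i) { i->usage++; i->pm_usage_cnt++; }
static void model_put(struct model_intf *i) { i->usage--; i->pm_usage_cnt--; }

/* Old scheme: on unbind, the duplicate counter is used to force the
 * normal counter back to 0 on the driver's behalf. */
static void model_unbind_old(struct model_intf *i)
{
	if (i->pm_usage_cnt > 0)
		i->usage -= i->pm_usage_cnt;
	i->pm_usage_cnt = 0;
}

/* A driver whose put runs only after unbind, under the old scheme. */
static int late_put_old_scheme(void)
{
	struct model_intf intf = { 0, 0 };

	model_get(&intf);        /* like kick_hub_wq() */
	model_unbind_old(&intf); /* counter forced back to 0 */
	model_put(&intf);        /* like hub_event() running late */
	return intf.usage;       /* the counter has gone negative */
}

/* Same sequence when unbind leaves the counter alone and the driver is
 * simply required to balance its own get and put. */
static int late_put_balanced_scheme(void)
{
	int usage = 0;

	usage++; /* get */
	/* unbind: no adjustment */
	usage--; /* late put */
	return usage;
}
```

The old scheme ends at -1 and the balanced scheme ends at 0, matching the reasoning for removing the duplicate counter.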
      
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Reported-and-tested-by: syzbot+7634edaea4d0b341c625@syzkaller.appspotmail.com
      CC: <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • J
      i2c: Allow recovery of the initial IRQ by an I2C client device. · 04e07919
      Jim Broadus authored
      commit 93b6604c5a669d84e45fe5129294875bf82eb1ff upstream.
      
      A previous change allowed I2C client devices to discover new IRQs upon
      reprobe by clearing the IRQ in i2c_device_remove. However, if an IRQ was
      assigned in i2c_new_device, that information is lost.
      
      For example, the touchscreen and trackpad devices on a Dell Inspiron laptop
      are I2C devices whose IRQs are defined by ACPI extended IRQ types. The
      client device structures are initialized during an ACPI walk. After
      removing the i2c_hid device, modprobe fails.
      
      This change caches the initial IRQ value in i2c_new_device and then resets
      the client device IRQ to the initial value in i2c_device_remove.
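A minimal model of the caching idea follows. It is a sketch under stated assumptions: the struct and the `model_*` functions are hypothetical, the IRQ number 42 is arbitrary, and probe is assumed to have no way of rediscovering the IRQ on its own (as in the ACPI case above, where the value is only supplied when the device is created).

```c
#include <assert.h>

struct model_client {
	int irq;      /* IRQ currently assigned; 0 means none */
	int init_irq; /* IRQ supplied when the device was created */
};

static void model_new_device(struct model_client *c, int irq)
{
	c->irq = irq;
	c->init_irq = irq; /* cache the initial value */
}

/* Assumption: probe cannot rediscover the IRQ, so it fails without one. */
static int model_probe(const struct model_client *c)
{
	return c->irq > 0 ? 0 : -1;
}

/* Behavior before the fix: the IRQ is simply cleared on remove. */
static void model_remove_clearing(struct model_client *c)
{
	c->irq = 0;
}

/* Behavior after the fix: the IRQ is reset to the cached initial value. */
static void model_remove_restoring(struct model_client *c)
{
	c->irq = c->init_irq;
}

/* Create, probe, remove, then probe again; returns the second result. */
static int reprobe_result(int restore)
{
	struct model_client c;

	model_new_device(&c, 42);
	if (model_probe(&c) != 0)
		return -2;
	if (restore)
		model_remove_restoring(&c);
	else
		model_remove_clearing(&c);
	return model_probe(&c);
}
```

In this model the second probe fails when remove clears the IRQ and succeeds when remove restores the cached value.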
      
      Fixes: 6f108dd70d30 ("i2c: Clear client->irq in i2c_device_remove")
      Signed-off-by: Jim Broadus <jbroadus@gmail.com>
      Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Reviewed-by: Charles Keepax <ckeepax@opensource.cirrus.com>
      [wsa: this is an easy to backport fix for the regression. We will
      refactor the code to handle irq assignments better in general.]
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 04 May, 2019 4 commits
  10. 02 May, 2019 1 commit
    • L
      aio: simplify - and fix - fget/fput for io_submit() · d6b2615f
      Linus Torvalds authored
      commit 84c4e1f89fefe70554da0ab33be72c9be7994379 upstream.
      
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method:
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling by io_submit() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
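The race and the streamlined ownership can be traced with a single-threaded reference-count model. This is only a sketch: the structs and helpers are hypothetical, the interleaving is fixed by hand rather than produced by real threads, and tying the file reference to the lifetime of the iocb is an assumption about one way "keeping the file pointer around until the aio has completed" could look.

```c
#include <assert.h>

struct model_file { int refs; };
struct model_iocb { int refs; struct model_file *file; };

static void model_fput(struct model_file *f) { f->refs--; }

/* In this model the file reference is dropped only together with the
 * last reference to the iocb. */
static void model_iocb_put(struct model_iocb *io)
{
	if (--io->refs == 0)
		model_fput(io->file);
}

/* Old flow: the wakeup path drops the file reference directly. Returns
 * the file's reference count while ->poll() would still be running. */
static int file_refs_during_poll_old(void)
{
	struct model_file f = { 1 }; /* reference from the descriptor table */

	f.refs++;       /* fget() in aio_poll() */
	f.refs--;       /* another thread closes the descriptor */
	model_fput(&f); /* wakeup drops req->file before ->poll() returns */
	return f.refs;  /* nothing keeps the file alive any more */
}

/* Streamlined flow: the wakeup path only drops its iocb reference. */
static int file_refs_during_poll_new(void)
{
	struct model_file f = { 1 };
	struct model_iocb io = { 2, &f }; /* submission + wakeup path */

	f.refs++;            /* fget() at submission time */
	f.refs--;            /* another thread closes the descriptor */
	model_iocb_put(&io); /* wakeup: the iocb and the file both survive */
	return f.refs;
}
```

In the old flow the count reaches 0 while the method is still in progress. In the streamlined flow it stays at 1 until the submission path drops the last iocb reference.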
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>