1. 25 Jul 2020 (3 commits)
  2. 24 Jul 2020 (2 commits)
    • tcp: allow at most one TLP probe per flight · 76be93fc
      Committed by Yuchung Cheng
      Previously TLP could send multiple probes of new data in one
      flight. This happens when the sender is cwnd-limited. After the
      initial TLP containing new data is sent, the sender receives another
      ACK that acks part of the inflight. It may then re-arm another TLP
      timer to send more, if no further ACK returns before the next TLP
      timeout (PTO) expires. In theory the sender could keep sending TLPs
      until the send queue is depleted. This only happens if the sender
      sees such an irregular, uncommon ACK pattern, but the behavior is
      generally undesirable, especially during congestion.
      
      The original TLP design restricts it to one probe per inflight, as
      published in "Reducing Web Latency: the Virtue of Gentle Aggression",
      SIGCOMM 2013. This patch changes TLP to send at most one probe
      per inflight.
      
      Note that if the sender is app-limited, TLP retransmits old data
      and does not have this issue.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76be93fc
    • dm integrity: fix integrity recalculation that is improperly skipped · 5df96f2b
      Committed by Mikulas Patocka
      Commit adc0daad ("dm: report suspended device during destroy")
      broke integrity recalculation.
      
      The problem is dm_suspended() returns true not only during suspend,
      but also during resume. So this race condition could occur:
      1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
      2. integrity_recalc (&ic->recalc_work) preempts the current thread
      3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
      4. integrity_recalc exits and no recalculating is done.
      
      To fix this race condition, add a function dm_post_suspending that is
      only true during the postsuspend phase and use it instead of
      dm_suspended().
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: adc0daad ("dm: report suspended device during destroy")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      5df96f2b
  3. 22 Jul 2020 (2 commits)
  4. 18 Jul 2020 (1 commit)
  5. 17 Jul 2020 (1 commit)
    • asm-generic/mmiowb: Allow mmiowb_set_pending() when preemptible() · bd024e82
      Committed by Will Deacon
      Although mmiowb() is concerned only with serialising MMIO writes occurring
      in contexts where a spinlock is held, the call to mmiowb_set_pending()
      from the MMIO write accessors can occur in preemptible contexts, such
      as during driver probe() functions, where ordering between CPUs is not
      usually a concern, assuming that the task migration path provides the
      necessary ordering guarantees.
      
      Unfortunately, the default implementation of mmiowb_set_pending() is not
      preempt-safe, as it makes use of a per-cpu variable to track its
      internal state. This has been reported to generate the following splat
      on riscv:
      
       | BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
       | caller is regmap_mmio_write32le+0x1c/0x46
       | CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.8.0-rc3-hfu+ #1
       | Call Trace:
       |  walk_stackframe+0x0/0x7a
       |  dump_stack+0x6e/0x88
       |  regmap_mmio_write32le+0x18/0x46
       |  check_preemption_disabled+0xa4/0xaa
       |  regmap_mmio_write32le+0x18/0x46
       |  regmap_mmio_write+0x26/0x44
       |  regmap_write+0x28/0x48
       |  sifive_gpio_probe+0xc0/0x1da
      
      Although it's possible to fix the driver in this case, other splats have
      been seen from other drivers, including the infamous 8250 UART, and so
      it's better to address this problem in the mmiowb core itself.
      
      Fix mmiowb_set_pending() by using the raw_cpu_ptr() to get at the mmiowb
      state and then only updating the 'mmiowb_pending' field if we are not
      preemptible (i.e. we have a non-zero nesting count).
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Reported-by: Palmer Dabbelt <palmer@dabbelt.com>
      Reported-by: Emil Renner Berthing <kernel@esmil.dk>
      Tested-by: Emil Renner Berthing <kernel@esmil.dk>
      Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
      Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
      Link: https://lore.kernel.org/r/20200716112816.7356-1-will@kernel.org
      Signed-off-by: Will Deacon <will@kernel.org>
      bd024e82
  6. 14 Jul 2020 (2 commits)
    • dma-direct: provide function to check physical memory area validity · 567f6a6e
      Committed by Nicolas Saenz Julienne
      dma_coherent_ok() checks if a physical memory area fits a device's DMA
      constraints.
      Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      567f6a6e
    • fuse: reject options on reconfigure via fsconfig(2) · b330966f
      Committed by Miklos Szeredi
      The previous patch changed the handling of remount/reconfigure to ignore
      all options, including those that are unknown to the fuse kernel fs.
      This was done for backward compatibility, but this likely only affects
      the old mount(2) API.
      
      The new fsconfig(2) based reconfiguration could possibly be improved.
      This would make the new API less of a drop-in replacement for the old
      one; OTOH this is a good chance to get rid of some weirdnesses in the
      old API.
      
      Several other behaviors might make sense:
      
       1) unknown options are rejected, known options are ignored
      
       2) unknown options are rejected, known options are rejected if the value
       is changed, allowed otherwise
      
       3) all options are rejected
      
      Prior to the backward compatibility fix to ignore all options, all known
      options were accepted (1), even if they changed the value of a mount
      parameter; fuse_reconfigure() does not look at the config values set by
      fuse_parse_param().
      
      To fix that we'd need to verify that the value provided is the same as set
      in the initial configuration (2).  The major drawback is that this is much
      more complex than just rejecting all attempts at changing options (3);
      i.e. all options signify initial configuration values and don't make sense
      on reconfigure.
      
      This patch opts for (3) with the rationale that no mount options are
      reconfigurable in fuse.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      b330966f
  7. 10 Jul 2020 (9 commits)
  8. 09 Jul 2020 (6 commits)
    • efi/efivars: Expose RT service availability via efivars abstraction · f88814cc
      Committed by Ard Biesheuvel
      Commit
      
        bf67fad1 ("efi: Use more granular check for availability for variable services")
      
      introduced a check into the efivarfs, efi-pstore and other drivers that
      aborts loading of the module if not all three variable runtime services
      (GetVariable, SetVariable and GetNextVariable) are supported. However, this
      results in efivarfs being unavailable entirely if only SetVariable support
      is missing, which is only needed if you want to make any modifications.
      Also, efi-pstore and the sysfs EFI variable interface could be backed by
      another implementation of the 'efivars' abstraction, in which case it is
      completely irrelevant which services are supported by the EFI firmware.
      
      So make the generic 'efivars' abstraction dependent on the availability of
      the GetVariable and GetNextVariable EFI runtime services, and add a helper
      'efivar_supports_writes()' to find out whether the currently active efivars
      abstraction supports writes (and wire it up to the availability of
      SetVariable for the generic one).
      
      Then, use the efivar_supports_writes() helper to decide whether to permit
      efivarfs to be mounted read-write, and whether to enable efi-pstore or the
      sysfs EFI variable interface altogether.
      
      Fixes: bf67fad1 ("efi: Use more granular check for availability for variable services")
      Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      f88814cc
    • Input: elan_i2c - add more hardware ID for Lenovo laptops · a50ca295
      Committed by Dave Wang
      This adds more hardware IDs for Elan touchpads found in various Lenovo
      laptops.
      Signed-off-by: Dave Wang <dave.wang@emc.com.tw>
      Link: https://lore.kernel.org/r/000201d5a8bd$9fead3f0$dfc07bd0$@emc.com.tw
      Cc: stable@vger.kernel.org
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      a50ca295
    • io_uring: export cq overflow status to userspace · 6d5f9049
      Committed by Xiaoguang Wang
      Applications that are not willing to use io_uring_enter()
      to reap and handle cqes may rely entirely on liburing's
      io_uring_peek_cqe(). But if the cq ring has overflowed,
      io_uring_peek_cqe() is currently not aware of this overflow and
      won't enter the kernel to flush cqes. The test program below
      reveals this bug:
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(ret));
                                      break;
                              }
                      printf("left requests: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                printf("left requests: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export the cq overflow status to userspace by adding
      a new IORING_SQ_CQ_OVERFLOW flag; helper functions in liburing, such as
      io_uring_peek_cqe, can then be aware of the cq overflow and flush
      accordingly.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6d5f9049
    • bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() · 63960260
      Committed by Kees Cook
      When evaluating access control over kallsyms visibility, the credentials
      at open() time need to be used, not the "current" creds (though in BPF's
      case, this has likely always been the same). Plumb access to the
      associated file->f_cred down through bpf_dump_raw_ok() and its callers
      now that kallsyms_show_value() has been refactored to take struct cred.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: bpf@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 7105e828 ("bpf: allow for correlation of maps and helpers in dump")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      63960260
    • kallsyms: Refactor kallsyms_show_value() to take cred · 16025184
      Committed by Kees Cook
      In order to perform future tests against the cred saved during open(),
      switch kallsyms_show_value() to operate on a cred, and have all current
      callers pass current_cred(). This makes it very obvious where callers
      are checking the wrong credential in their "read" contexts. These will
      be fixed in the coming patches.
      
      Additionally, switch the return value to bool, since it is always used
      as a direct permission check, not a 0-on-success, negative-on-error
      style function return.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      16025184
    • Raise gcc version requirement to 4.9 · 6ec4476a
      Committed by Linus Torvalds
      I realize that we fairly recently raised it to 4.8, but the fact is, 4.9
      is a much better minimum version to target.
      
      We have a number of workarounds for actual bugs in pre-4.9 gcc versions
      (including things like internal compiler errors on ARM), but we also
      have some syntactic workarounds for lacking features.
      
      In particular, raising the minimum to 4.9 means that we can now just
      assume _Generic() exists, which is likely the much better replacement
      for a lot of very convoluted build-time magic with conditionals on
      sizeof and/or __builtin_choose_expr() with same_type() etc.
      
      Using _Generic also means that you will need to have a very recent
      version of 'sparse', but that's easy to build yourself, and much less of
      a hassle than some old gcc version can be.
      
      The latest (in a long string) of reasons for minimum compiler version
      upgrades was commit 5435f73d ("efi/x86: Fix build with gcc 4").
      
      Ard points out that RHEL 7 uses gcc-4.8, but the people who stay back on
      old RHEL versions presumably also don't build their own kernels anyway.
      And maybe they should cross-build or just have a little side affair with
      a newer compiler?
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ec4476a
  9. 08 Jul 2020 (8 commits)
    • ASoC: soc-dai: set dai_link dpcm_ flags with a helper · 25612477
      Committed by Pierre-Louis Bossart
      Add a helper to walk through all the DAIs and set dpcm_playback and
      dpcm_capture flags based on the DAIs capabilities, and use this helper
      to avoid setting these flags arbitrarily in generic cards.
      
      The commit referenced in the Fixes tag did not introduce the
      configuration issue but will prevent the card from probing when
      detecting invalid configurations.
      
      Fixes: b73287f0 ("ASoC: soc-pcm: dpcm: fix playback/capture checks")
      Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
      Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
      Reviewed-by: Guennadi Liakhovetski <guennadi.liakhovetski@linux.intel.com>
      Link: https://lore.kernel.org/r/20200707210439.115300-2-pierre-louis.bossart@linux.intel.com
      Signed-off-by: Mark Brown <broonie@kernel.org>
      25612477
    • sched: Fix loadavg accounting race · dbfb089d
      Committed by Peter Zijlstra
      The recent commit:
      
        c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
      
      moved these lines in ttwu():
      
      	p->sched_contributes_to_load = !!task_contributes_to_load(p);
      	p->state = TASK_WAKING;
      
      up before:
      
      	smp_cond_load_acquire(&p->on_cpu, !VAL);
      
      into the 'p->on_rq == 0' block, with the thinking that once we hit
      schedule() the current task cannot change its ->state anymore. And
      while this is true, it is both incorrect and flawed.
      
      It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
      to avoid weak hardware from re-ordering things for us. This can fairly
      easily be achieved by relying on the control-dependency already in
      place.
      
      The second problem, which is the flaw in the original argument, is
      that while schedule() will not change prev->state, it will read it a
      number of times (arguably too many times since it's marked volatile).
      The previous condition 'p->on_cpu == 0' was sufficient because that
      indicates schedule() has completed, and will no longer read
      prev->state. So now the trick is to make the same hold for the (much)
      earlier 'prev->on_rq == 0' case.
      
      Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
      assignment needs to be a RELEASE, but adding additional ordering to
      schedule() is an unwelcome proposition at the best of times, doubly so
      for mere accounting.
      
      Luckily we can push the prev->state load up before rq->lock, with the
      only caveat that we then have to re-read the state after. However, we
      know that if it changed, we no longer have to worry about the blocking
      path. This gives us the required ordering: if we block, we did the
      prev->state load before an (effective) smp_mb() and the p->on_rq store
      need not change.
      
      With this we end up with the effective ordering:
      
      	LOAD p->state           LOAD-ACQUIRE p->on_rq == 0
      	MB
      	STORE p->on_rq, 0       STORE p->state, TASK_WAKING
      
      which ensures the TASK_WAKING store happens after the prev->state
      load, and all is well again.
      
      Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu")
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Dave Jones <davej@codemonkey.org.uk>
      Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net
      dbfb089d
    • fs: remove __vfs_read · 775802c0
      Committed by Christoph Hellwig
      Fold it into the two callers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      775802c0
    • fs: add a __kernel_read helper · 61a707c5
      Committed by Christoph Hellwig
      This is the counterpart to __kernel_write, and skips the rw_verify_area
      call compared to kernel_read.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      61a707c5
    • vlan: consolidate VLAN parsing code and limit max parsing depth · 469acedd
      Committed by Toke Høiland-Jørgensen
      Toshiaki pointed out that we now have two very similar functions to extract
      the L3 protocol number in the presence of VLAN tags. And Daniel pointed out
      that the unbounded parsing loop makes it possible for maliciously crafted
      packets to loop through potentially hundreds of tags.
      
      Fix both of these issues by consolidating the two parsing functions and
      limiting the VLAN tag parsing to a max depth of 8 tags. As part of this,
      switch over __vlan_get_protocol() to use skb_header_pointer() instead of
      pskb_may_pull(), to avoid the possible side effects of the latter and keep
      the skb pointer 'const' through all the parsing functions.
      
      v2:
      - Use limit of 8 tags instead of 32 (matching XMIT_RECURSION_LIMIT)
      Reported-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Reported-by: Daniel Borkmann <daniel@iogearbox.net>
      Fixes: d7bf2ebe ("sched: consistently handle layer3 header accesses in the presence of VLANs")
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      469acedd
    • net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb · 394de110
      Committed by Martin Varghese
      The packets from tunnel devices (e.g. bareudp) may have only
      metadata in the dst pointer of the skb. Hence a NULL check of the
      dst->ops->neigh_lookup pointer is needed in dst_neigh_lookup_skb.
      
      The kernel crashes when packets from a bareudp device are processed
      in the kernel neighbour subsystem.
      
      [  133.384484] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  133.385240] #PF: supervisor instruction fetch in kernel mode
      [  133.385828] #PF: error_code(0x0010) - not-present page
      [  133.386603] PGD 0 P4D 0
      [  133.386875] Oops: 0010 [#1] SMP PTI
      [  133.387275] CPU: 0 PID: 5045 Comm: ping Tainted: G        W         5.8.0-rc2+ #15
      [  133.388052] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  133.391076] RIP: 0010:0x0
      [  133.392401] Code: Bad RIP value.
      [  133.394029] RSP: 0018:ffffb79980003d50 EFLAGS: 00010246
      [  133.396656] RAX: 0000000080000102 RBX: ffff9de2fe0d6600 RCX: ffff9de2fe5e9d00
      [  133.399018] RDX: 0000000000000000 RSI: ffff9de2fe5e9d00 RDI: ffff9de2fc21b400
      [  133.399685] RBP: ffff9de2fe5e9d00 R08: 0000000000000000 R09: 0000000000000000
      [  133.400350] R10: ffff9de2fbc6be22 R11: ffff9de2fe0d6600 R12: ffff9de2fc21b400
      [  133.401010] R13: ffff9de2fe0d6628 R14: 0000000000000001 R15: 0000000000000003
      [  133.401667] FS:  00007fe014918740(0000) GS:ffff9de2fec00000(0000) knlGS:0000000000000000
      [  133.402412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  133.402948] CR2: ffffffffffffffd6 CR3: 000000003bb72000 CR4: 00000000000006f0
      [  133.403611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  133.404270] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  133.404933] Call Trace:
      [  133.405169]  <IRQ>
      [  133.405367]  __neigh_update+0x5a4/0x8f0
      [  133.405734]  arp_process+0x294/0x820
      [  133.406076]  ? __netif_receive_skb_core+0x866/0xe70
      [  133.406557]  arp_rcv+0x129/0x1c0
      [  133.406882]  __netif_receive_skb_one_core+0x95/0xb0
      [  133.407340]  process_backlog+0xa7/0x150
      [  133.407705]  net_rx_action+0x2af/0x420
      [  133.408457]  __do_softirq+0xda/0x2a8
      [  133.408813]  asm_call_on_stack+0x12/0x20
      [  133.409290]  </IRQ>
      [  133.409519]  do_softirq_own_stack+0x39/0x50
      [  133.410036]  do_softirq+0x50/0x60
      [  133.410401]  __local_bh_enable_ip+0x50/0x60
      [  133.410871]  ip_finish_output2+0x195/0x530
      [  133.411288]  ip_output+0x72/0xf0
      [  133.411673]  ? __ip_finish_output+0x1f0/0x1f0
      [  133.412122]  ip_send_skb+0x15/0x40
      [  133.412471]  raw_sendmsg+0x853/0xab0
      [  133.412855]  ? insert_pfn+0xfe/0x270
      [  133.413827]  ? vvar_fault+0xec/0x190
      [  133.414772]  sock_sendmsg+0x57/0x80
      [  133.415685]  __sys_sendto+0xdc/0x160
      [  133.416605]  ? syscall_trace_enter+0x1d4/0x2b0
      [  133.417679]  ? __audit_syscall_exit+0x1d9/0x280
      [  133.418753]  ? __prepare_exit_to_usermode+0x5d/0x1a0
      [  133.419819]  __x64_sys_sendto+0x24/0x30
      [  133.420848]  do_syscall_64+0x4d/0x90
      [  133.421768]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  133.422833] RIP: 0033:0x7fe013689c03
      [  133.423749] Code: Bad RIP value.
      [  133.424624] RSP: 002b:00007ffc7288f418 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [  133.425940] RAX: ffffffffffffffda RBX: 000056151fc63720 RCX: 00007fe013689c03
      [  133.427225] RDX: 0000000000000040 RSI: 000056151fc63720 RDI: 0000000000000003
      [  133.428481] RBP: 00007ffc72890b30 R08: 000056151fc60500 R09: 0000000000000010
      [  133.429757] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040
      [  133.431041] R13: 000056151fc636e0 R14: 000056151fc616bc R15: 0000000000000080
      [  133.432481] Modules linked in: mpls_iptunnel act_mirred act_tunnel_key cls_flower sch_ingress veth mpls_router ip_tunnel bareudp ip6_udp_tunnel udp_tunnel macsec udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc ebtable_filter ebtables overlay ip6table_filter ip6_tables iptable_filter sunrpc ext4 mbcache jbd2 pcspkr i2c_piix4 virtio_balloon joydev ip_tables xfs libcrc32c ata_generic qxl pata_acpi drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ata_piix libata virtio_net net_failover virtio_console failover virtio_blk i2c_core virtio_pci virtio_ring serio_raw floppy virtio dm_mirror dm_region_hash dm_log dm_mod
      [  133.444045] CR2: 0000000000000000
      [  133.445082] ---[ end trace f4aeee1958fd1638 ]---
      [  133.446236] RIP: 0010:0x0
      [  133.447180] Code: Bad RIP value.
      [  133.448152] RSP: 0018:ffffb79980003d50 EFLAGS: 00010246
      [  133.449363] RAX: 0000000080000102 RBX: ffff9de2fe0d6600 RCX: ffff9de2fe5e9d00
      [  133.450835] RDX: 0000000000000000 RSI: ffff9de2fe5e9d00 RDI: ffff9de2fc21b400
      [  133.452237] RBP: ffff9de2fe5e9d00 R08: 0000000000000000 R09: 0000000000000000
      [  133.453722] R10: ffff9de2fbc6be22 R11: ffff9de2fe0d6600 R12: ffff9de2fc21b400
      [  133.455149] R13: ffff9de2fe0d6628 R14: 0000000000000001 R15: 0000000000000003
      [  133.456520] FS:  00007fe014918740(0000) GS:ffff9de2fec00000(0000) knlGS:0000000000000000
      [  133.458046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  133.459342] CR2: ffffffffffffffd6 CR3: 000000003bb72000 CR4: 00000000000006f0
      [  133.460782] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  133.462240] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  133.463697] Kernel panic - not syncing: Fatal exception in interrupt
      [  133.465226] Kernel Offset: 0xfa00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  133.467025] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fixes: aaa0c23c ("Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug")
      Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      394de110
    • fs: Add IOCB_NOIO flag for generic_file_read_iter · 41da51bc
      Committed by Andreas Gruenbacher
      Add an IOCB_NOIO flag that indicates to generic_file_read_iter that it
      shouldn't trigger any filesystem I/O for the actual request or for
      readahead.  This allows doing tentative reads out of the page cache, as
      some filesystems allow, and taking the appropriate locks and retrying
      the reads only if the requested pages are not cached.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      41da51bc
    • cgroup: fix cgroup_sk_alloc() for sk_clone_lock() · ad0f75e5
      Committed by Cong Wang
      When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
      copied, so the cgroup refcnt must be taken too. And, unlike the
      sk_alloc() path, sock_update_netprioidx() is not called here.
      Therefore, it is safe and necessary to grab the cgroup refcnt
      even when cgroup_sk_alloc is disabled.
      
      sk_clone_lock() is in BH context anyway; the in_interrupt() check
      would terminate this function if it were called there. And for
      sk_alloc() skcd->val is always zero. So it's safe to factor out the
      code to make it more readable.
      
      The global variable 'cgroup_sk_alloc_disabled' is used to determine
      whether to take these reference counts. It is impossible to make
      the reference counting correct unless we save this bit of information
      in skcd->val. So, add a new bit there to record whether the socket
      has already taken the reference counts. This obviously relies on
      kmalloc() aligning cgroup pointers to at least 4 bytes;
      ARCH_KMALLOC_MINALIGN is certainly larger than that.
      
      This bug seems to have been present since the beginning; commit
      d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      tried to fix it but not completely. It seems not easy to trigger until
      the recent commit 090e28b2
      ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
      Reported-by: Peter Geis <pgwipeout@gmail.com>
      Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reported-by: Daniël Sonck <dsonck92@gmail.com>
      Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
      Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
      Tested-by: Peter Geis <pgwipeout@gmail.com>
      Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad0f75e5
  10. 07 Jul 2020 (2 commits)
  11. 06 Jul 2020 (1 commit)
  12. 04 Jul 2020 (1 commit)
    • sched: consistently handle layer3 header accesses in the presence of VLANs · d7bf2ebe
      Committed by Toke Høiland-Jørgensen
      There are a couple of places in net/sched/ that check skb->protocol and act
      on the value there. However, in the presence of VLAN tags, the value stored
      in skb->protocol can be inconsistent based on whether VLAN acceleration is
      enabled. The commit quoted in the Fixes tag below fixed the users of
      skb->protocol to use a helper that will always see the VLAN ethertype.
      
      However, most of the callers don't actually handle the VLAN ethertype,
      but expect to find the IP header type in the protocol field. This means
      that things like changing the ECN field, or parsing diffserv values,
      stop working if there's a VLAN tag, or if there are multiple nested
      VLAN tags (QinQ).
      
      To fix this, change the helper to take an argument that indicates whether
      the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
      make sure to skip all of them, so behaviour is consistent even in QinQ
      mode.
      
      To make the helper usable from the ECN code, move it to if_vlan.h instead
      of pkt_sched.h.
      
      v3:
      - Remove empty lines
      - Move vlan variable definitions inside loop in skb_protocol()
      - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
        bpf_skb_ecn_set_ce()
      
      v2:
      - Use eth_type_vlan() helper in skb_protocol()
      - Also fix code that reads skb->protocol directly
      - Change a couple of 'if/else if' statements to switch constructs to avoid
        calling the helper twice
      Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
      Fixes: d8b9605d ("net: sched: fix skb->protocol use in case of accelerated vlan path")
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7bf2ebe
  13. 02 Jul 2020 (1 commit)
    • genetlink: remove genl_bind · 1e82a62f
      Committed by Sean Tranchetti
      A potential deadlock can occur during registering or unregistering a
      new generic netlink family between the main nl_table_lock and the
      cb_lock where each thread wants the lock held by the other, as
      demonstrated below.
      
      1) Thread 1 is performing a netlink_bind() operation on a socket. As part
         of this call, it will call netlink_lock_table(), incrementing the
         nl_table_users count to 1.
      2) Thread 2 is registering (or unregistering) a genl_family via the
         genl_(un)register_family() API. The cb_lock semaphore will be taken for
         writing.
      3) Thread 1 will call genl_bind() as part of the bind operation to handle
         subscribing to GENL multicast groups at the request of the user. It will
         attempt to take the cb_lock semaphore for reading, but it will fail and
         be scheduled away, waiting for Thread 2 to finish the write.
      4) Thread 2 will call netlink_table_grab() during the (un)registration
         call. However, as Thread 1 has incremented nl_table_users, it will not
         be able to proceed, and both threads will be stuck waiting for the
         other.
      
      genl_bind() is a noop unless a genl_family implements the mcast_bind()
      function to handle setting up family-specific multicast operations.
      Since, as Cong pointed out, no one in-tree uses this functionality,
      simply removing the genl_bind() function will remove the possibility of
      deadlock, as there is no attempt by Thread 1 above to take the cb_lock
      semaphore.
      
      Fixes: c380d9a7 ("genetlink: pass multicast bind/unbind to families")
      Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Johannes Berg <johannes.berg@intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e82a62f
  14. 01 Jul 2020 (1 commit)