1. 08 12月, 2019 2 次提交
    • L
      pipe: remove 'waiting_writers' merging logic · a28c8b9d
      Linus Torvalds 提交于
      This code is ancient, and goes back to when we only had a single page
      for the pipe buffers.  The exact history is hidden in the mists of time
      (ie "before git", and in fact predates the BK repository too).
      
      At that long-ago point in time, it actually helped to try to merge big
      back-and-forth pipe reads and writes, and not limit pipe reads to the
      single pipe buffer in length just because that was all we had at a time.
      
      However, since then we've expanded the pipe buffers to multiple pages,
      and this logic really doesn't seem to make sense.  And a lot of it is
      somewhat questionable (ie "hmm, the user asked for a non-blocking read,
      but we see that there's a writer pending, so let's wait anyway to get
      the extra data that the writer will have").
      
      But more importantly, it makes the "go to sleep" logic much less
      obvious, and considering the wakeup issues we've had, I want to make for
      less of those kinds of things.
      
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a28c8b9d
    • E
      inet: protect against too small mtu values. · 501a90c9
      Eric Dumazet 提交于
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      501a90c9
  2. 07 12月, 2019 4 次提交
    • G
      tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() · 721c8daf
      Guillaume Nault 提交于
      Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
      timestamp of the last synflood. Protect them with READ_ONCE() and
      WRITE_ONCE() since reads and writes aren't serialised.
      
      Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
      introduced by a0f82f64 ("syncookies: remove last_synq_overflow from
      struct tcp_sock"). But unprotected accesses were already there when
      timestamp was stored in .last_synq_overflow.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      721c8daf
    • G
      tcp: tighten acceptance of ACKs not matching a child socket · cb44a08f
      Guillaume Nault 提交于
      When no synflood occurs, the synflood timestamp isn't updated.
      Therefore it can be so old that time_after32() can consider it to be
      in the future.
      
      That's a problem for tcp_synq_no_recent_overflow() as it may report
      that a recent overflow occurred while, in fact, it's just that jiffies
      has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.
      
      Spurious detection of recent overflows lead to extra syncookie
      verification in cookie_v[46]_check(). At that point, the verification
      should fail and the packet dropped. But we should have dropped the
      packet earlier as we didn't even send a syncookie.
      
      Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
      only if jiffies is within the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
      way, no spurious recent overflow is reported when jiffies wraps and
      'last_overflow' becomes in the future from the point of view of
      time_after32().
      
      However, if jiffies wraps and enters the
      [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
      'last_overflow' being a stale synflood timestamp), then
      tcp_synq_no_recent_overflow() still erroneously reports an
      overflow. In such cases, we have to rely on syncookie verification
      to drop the packet. We unfortunately have no way to differentiate
      between a fresh and a stale syncookie timestamp.
      
      In practice, using last_overflow as lower bound is problematic.
      If the synflood timestamp is concurrently updated between the time
      we read jiffies and the moment we store the timestamp in
      'last_overflow', then 'now' becomes smaller than 'last_overflow' and
      tcp_synq_no_recent_overflow() returns true, potentially dropping a
      valid syncookie.
      
      Reading jiffies after loading the timestamp could fix the problem,
      but that'd require a memory barrier. Let's just accommodate for
      potential timestamp growth instead and extend the interval using
      'last_overflow - HZ' as lower bound.
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb44a08f
    • G
      tcp: fix rejected syncookies due to stale timestamps · 04d26e7b
      Guillaume Nault 提交于
      If no synflood happens for a long enough period of time, then the
      synflood timestamp isn't refreshed and jiffies can advance so much
      that time_after32() can't accurately compare them any more.
      
      Therefore, we can end up in a situation where time_after32(now,
      last_overflow + HZ) returns false, just because these two values are
      too far apart. In that case, the synflood timestamp isn't updated as
      it should be, which can trick tcp_synq_no_recent_overflow() into
      rejecting valid syncookies.
      
      For example, let's consider the following scenario on a system
      with HZ=1000:
      
        * The synflood timestamp is 0, either because that's the timestamp
          of the last synflood or, more commonly, because we're working with
          a freshly created socket.
      
        * We receive a new SYN, which triggers synflood protection. Let's say
          that this happens when jiffies == 2147484649 (that is,
          'synflood timestamp' + HZ + 2^31 + 1).
      
        * Then tcp_synq_overflow() doesn't update the synflood timestamp,
          because time_after32(2147484649, 1000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 1000: the value of 'last_overflow' + HZ.
      
        * A bit later, we receive the ACK completing the 3WHS. But
          cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
          says that we're not under synflood. That's because
          time_after32(2147484649, 120000) returns false.
          With:
            - 2147484649: the value of jiffies, aka. 'now'.
            - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.
      
          Of course, in reality jiffies would have increased a bit, but this
          condition will last for the next 119 seconds, which is far enough
          to accommodate for jiffie's growth.
      
      Fix this by updating the overflow timestamp whenever jiffies isn't
      within the [last_overflow, last_overflow + HZ] range. That shouldn't
      have any performance impact since the update still happens at most once
      per second.
      
      Now we're guaranteed to have fresh timestamps while under synflood, so
      tcp_synq_no_recent_overflow() can safely use it with time_after32() in
      such situations.
      
      Stale timestamps can still make tcp_synq_no_recent_overflow() return
      the wrong verdict when not under synflood. This will be handled in the
      next patch.
      
      For 64 bits architectures, the problem was introduced with the
      conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
      cca9bab1 ("tcp: use monotonic timestamps for PAWS").
      The problem has always been there on 32 bits architectures.
      
      Fixes: cca9bab1 ("tcp: use monotonic timestamps for PAWS")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04d26e7b
    • J
      net: core: rename indirect block ingress cb function · dbad3408
      John Hurley 提交于
      With indirect blocks, a driver can register for callbacks from a device
      that is does not 'own', for example, a tunnel device. When registering to
      or unregistering from a new device, a callback is triggered to generate
      a bind/unbind event. This, in turn, allows the driver to receive any
      existing rules or to properly clean up installed rules.
      
      When first added, it was assumed that all indirect block registrations
      would be for ingress offloads. However, the NFP driver can, in some
      instances, support clsact qdisc binds for egress offload.
      
      Change the name of the indirect block callback command in flow_offload to
      remove the 'ingress' identifier from it. While this does not change
      functionality, a follow up patch will implement a more more generic
      callback than just those currently just supporting ingress offload.
      
      Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports")
      Signed-off-by: NJohn Hurley <john.hurley@netronome.com>
      Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbad3408
  3. 05 12月, 2019 23 次提交
  4. 04 12月, 2019 2 次提交
    • C
      agp: move AGPGART_MINOR to include/linux/miscdevice.h · 0f109f0e
      Corentin Labbe 提交于
      This patch move the define for AGPGART_MINOR to include/linux/miscdevice.h.
      It is better that all minor number definitions are in the same place.
      Signed-off-by: NCorentin Labbe <clabbe@baylibre.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDave Airlie <airlied@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/1574324085-4338-3-git-send-email-clabbe@baylibre.com
      0f109f0e
    • Y
      cls_flower: Fix the behavior using port ranges with hw-offload · 8ffb055b
      Yoshiki Komachi 提交于
      The recent commit 5c72299f ("net: sched: cls_flower: Classify
      packets using port ranges") had added filtering based on port ranges
      to tc flower. However the commit missed necessary changes in hw-offload
      code, so the feature gave rise to generating incorrect offloaded flow
      keys in NIC.
      
      One more detailed example is below:
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 100-200 action drop
      
      With the setup above, an exact match filter with dst_port == 0 will be
      installed in NIC by hw-offload. IOW, the NIC will have a rule which is
      equivalent to the following one.
      
      $ tc qdisc add dev eth0 ingress
      $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
        dst_port 0 action drop
      
      The behavior was caused by the flow dissector which extracts packet
      data into the flow key in the tc flower. More specifically, regardless
      of exact match or specified port ranges, fl_init_dissector() set the
      FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
      numbers from skb in skb_flow_dissect() called by fl_classify(). Note
      that device drivers received the same struct flow_dissector object as
      used in skb_flow_dissect(). Thus, offloaded drivers could not identify
      which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
      set to struct flow_dissector in either case.
      
      This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
      tp_range field in struct fl_flow_key to recognize which filters are applied
      to offloaded drivers. At this point, when filters based on port ranges
      passed to drivers, drivers return the EOPNOTSUPP error because they do
      not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE
      flag).
      
      Fixes: 5c72299f ("net: sched: cls_flower: Classify packets using port ranges")
      Signed-off-by: NYoshiki Komachi <komachi.yoshiki@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ffb055b
  5. 03 12月, 2019 9 次提交