1. 28 1月, 2019 1 次提交
  2. 27 1月, 2019 1 次提交
  3. 26 1月, 2019 4 次提交
  4. 25 1月, 2019 2 次提交
    • P
      tcp_bbr: adapt cwnd based on ack aggregation estimation · 78dc70eb
      Priyaranjan Jha 提交于
      Aggregation effects are extremely common with wifi, cellular, and cable
      modem link technologies, ACK decimation in middleboxes, and LRO and GRO
      in receiving hosts. The aggregation can happen in either direction,
      data or ACKs, but in either case the aggregation effect is visible
      to the sender in the ACK stream.
      
      Previously BBR's sending was often limited by cwnd under severe ACK
      aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets
      were acked in bursts after long delays (e.g. one ACK acking 5*BDP after
      5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving
      the bottleneck idle for potentially long periods. Note that loss-based
      congestion control does not have this issue because when facing
      aggregation it continues increasing cwnd after bursts of ACKs, growing
      cwnd until the buffer is full.
      
      To achieve good throughput in the presence of aggregation effects, this
      algorithm allows the BBR sender to put extra data in flight to keep the
      bottleneck utilized during silences in the ACK stream that it has evidence
      to suggest were caused by aggregation.
      
      A summary of the algorithm: when a burst of packets are acked by a
      stretched ACK or a burst of ACKs or both, BBR first estimates the expected
      amount of data that should have been acked, based on its estimated
      bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max
      filter to estimate the recent level of observed ACK aggregation. Then cwnd
      is increased by the ACK aggregation estimate. The larger cwnd avoids BBR
      being cwnd-limited in the face of ACK silences that recent history suggests
      were caused by aggregation. As a sanity check, the ACK aggregation degree
      is upper-bounded by the cwnd (at the time of measurement) and a global max
      of BW * 100ms. The algorithm is further described by the following
      presentation:
      https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00
      
      In our internal testing, we observed a significant increase in BBR
      throughput (measured using netperf), in a basic wifi setup.
      - Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi)
      - 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps
      - 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps
      
      Also, this code is running globally on YouTube TCP connections and produced
      significant bandwidth increases for YouTube traffic.
      
      This is based on Ian Swett's max_ack_height_ algorithm from the
      QUIC BBR implementation.
      Signed-off-by: NPriyaranjan Jha <priyarjha@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      78dc70eb
    • N
      bonding: count master 3ad stats separately · 949e7cea
      Nikolay Aleksandrov 提交于
      I made a dumb mistake when I summed up the slave stats, obviously slaves
      can come and go which would make the master stats unreliable.
      Count and export the master stats separately.
      
      Fixes: a258aeac ("bonding: add support for xstats and export 3ad stats")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      949e7cea
  5. 24 1月, 2019 1 次提交
    • E
      ax25: fix possible use-after-free · 63530aba
      Eric Dumazet 提交于
      syzbot found that ax25 routes where not properly protected
      against concurrent use [1].
      
      In this particular report the bug happened while
      copying ax25->digipeat.
      
      Fix this problem by making sure we call ax25_get_route()
      while ax25_route_lock is held, so that no modification
      could happen while using the route.
      
      The current two ax25_get_route() callers do not sleep,
      so this change should be fine.
      
      Once we do that, ax25_get_route() no longer needs to
      grab a reference on the found route.
      
      [1]
      ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
      BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline]
      BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113
      Read of size 66 at addr ffff888066641a80 by task syz-executor2/531
      
      ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
      CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x24/0x50 mm/kasan/common.c:130
       memcpy include/linux/string.h:352 [inline]
       kmemdup+0x42/0x60 mm/util.c:113
       kmemdup include/linux/string.h:425 [inline]
       ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424
       ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224
       __sys_connect+0x357/0x490 net/socket.c:1664
       __do_sys_connect net/socket.c:1675 [inline]
       __se_sys_connect net/socket.c:1672 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1672
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458099
      Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099
      RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4
      R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff
      
      Allocated by task 526:
       save_stack+0x45/0xd0 mm/kasan/common.c:73
       set_track mm/kasan/common.c:85 [inline]
       __kasan_kmalloc mm/kasan/common.c:496 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504
      ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
       kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609
       kmalloc include/linux/slab.h:545 [inline]
       ax25_rt_add net/ax25/ax25_route.c:95 [inline]
       ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233
       ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
       sock_do_ioctl+0xe2/0x400 net/socket.c:950
       sock_ioctl+0x32f/0x6c0 net/socket.c:1074
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
      Freed by task 550:
       save_stack+0x45/0xd0 mm/kasan/common.c:73
       set_track mm/kasan/common.c:85 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
       __cache_free mm/slab.c:3487 [inline]
       kfree+0xcf/0x230 mm/slab.c:3806
       ax25_rt_add net/ax25/ax25_route.c:92 [inline]
       ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233
       ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
       sock_do_ioctl+0xe2/0x400 net/socket.c:950
       sock_ioctl+0x32f/0x6c0 net/socket.c:1074
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff888066641a80
       which belongs to the cache kmalloc-96 of size 96
      The buggy address is located 0 bytes inside of
       96-byte region [ffff888066641a80, ffff888066641ae0)
      The buggy address belongs to the page:
      page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0
      flags: 0x1fffc0000000200(slab)
      ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
      raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0
      raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
       ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
      >ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                         ^
       ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
       ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63530aba
  6. 23 1月, 2019 4 次提交
    • L
      bridge: Snoop Multicast Router Advertisements · 4b3087c7
      Linus Lüssing 提交于
      When multiple multicast routers are present in a broadcast domain then
      only one of them will be detectable via IGMP/MLD query snooping. The
      multicast router with the lowest IP address will become the selected and
      active querier while all other multicast routers will then refrain from
      sending queries.
      
      To detect such rather silent multicast routers, too, RFC4286
      ("Multicast Router Discovery") provides a standardized protocol to
      detect multicast routers for multicast snooping switches.
      
      This patch implements the necessary MRD Advertisement message parsing
      and after successful processing adds such routers to the internal
      multicast router list.
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4b3087c7
    • L
      bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls · ba5ea614
      Linus Lüssing 提交于
      This patch refactors ip_mc_check_igmp(), ipv6_mc_check_mld() and
      their callers (more precisely, the Linux bridge) to not rely on
      the skb_trimmed parameter anymore.
      
      An skb with its tail trimmed to the IP packet length was initially
      introduced for the following three reasons:
      
      1) To be able to verify the ICMPv6 checksum.
      2) To be able to distinguish the version of an IGMP or MLD query.
         They are distinguishable only by their size.
      3) To avoid parsing data for an IGMPv3 or MLDv2 report that is
         beyond the IP packet but still within the skb.
      
      The first case still uses a cloned and potentially trimmed skb to
      verfiy. However, there is no need to propagate it to the caller.
      For the second and third case explicit IP packet length checks were
      added.
      
      This hopefully makes ip_mc_check_igmp() and ipv6_mc_check_mld() easier
      to read and verfiy, as well as easier to use.
      Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba5ea614
    • N
      bonding: add support for xstats and export 3ad stats · a258aeac
      Nikolay Aleksandrov 提交于
      This patch adds support for extended statistics (xstats) call to the
      bonding. The first user would be the 3ad code which counts the following
      events:
       - LACPDU Rx/Tx
       - LACPDU unknown type Rx
       - LACPDU illegal Rx
       - Marker Rx/Tx
       - Marker response Rx/Tx
       - Marker unknown type Rx
      
      All of these are exported via netlink as separate attributes to be
      easily extensible as we plan to add more in the future.
      Similar to how the bridge and other xstats exports, the structure
      inside is:
       [ IFLA_STATS_LINK_XSTATS ]
         -> [ LINK_XSTATS_TYPE_BOND ]
              -> [ BOND_XSTATS_3AD ]
                   -> [ 3ad stats attributes ]
      
      With this structure it's easy to add more stat types later.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a258aeac
    • N
      bonding: add 3ad stats · 267c095a
      Nikolay Aleksandrov 提交于
      Count the following types of 3ad packets per slave:
       - rx/tx lacpdu
       - rx/tx marker
       - rx/tx marker response
       - rx illegal lacpdus (right now counted on wrong length)
       - rx unknown lacpdu type
       - rx unknown marker type
      
      The counters are using atomic64 since this is not fast path.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      267c095a
  7. 20 1月, 2019 1 次提交
  8. 19 1月, 2019 4 次提交
    • E
      devlink: Add health report functionality · c7af343b
      Eran Ben Elisha 提交于
      Upon error discover, every driver can report it to the devlink health
      mechanism via devlink_health_report function, using the appropriate
      reporter registered to it. Driver can pass error specific context which
      will be delivered to it as part of the dump / recovery callbacks.
      
      Once an error is reported, devlink health will do the following actions:
      * A log is being send to the kernel trace events buffer
      * Health status and statistics are being updated for the reporter instance
      * Object dump is being taken and stored at the reporter instance (as long
        as there is no other dump which is already stored)
      * Auto recovery attempt is being done. depends on:
        - Auto Recovery configuration
        - Grace period vs. time since last recover
      Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7af343b
    • E
      devlink: Add health reporter create/destroy functionality · 880ee82f
      Eran Ben Elisha 提交于
      Devlink health reporter is an instance for reporting, diagnosing and
      recovering from run time errors discovered by the reporters.
      Define it's data structure and supported operations.
      In addition, expose devlink API to create and destroy a reporter.
      Each devlink instance will hold it's own reporters list.
      
      As part of the allocation, driver shall provide a set of callbacks which
      will be used the devlink in order to handle health reports and user
      commands related to this reporter. In addition, driver is entitled to
      provide some priv pointer, which can be fetched from the reporter by
      devlink_health_reporter_priv function.
      
      For each reporter, devlink will hold a metadata of statistics,
      buffers and status.
      Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      880ee82f
    • E
      devlink: Add health buffer support · cb5ccfbe
      Eran Ben Elisha 提交于
      Devlink health buffer is a mechanism to pass descriptors between drivers
      and devlink. The API allows the driver to add objects, object pair,
      value array (nested attributes), value and name.
      
      Driver can use this API to fill the buffers in a format which can be
      translated by the devlink to the netlink message.
      
      In order to fulfill it, an internal buffer descriptor is defined. This
      will hold the data and metadata per each attribute and by used to pass
      actual commands to the netlink.
      
      This mechanism will be later used in devlink health for dump and diagnose
      data store by the drivers.
      Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb5ccfbe
    • Y
      tcp: declare tcp_mmap() only when CONFIG_MMU is set · 340a6f3d
      Yafang Shao 提交于
      Since tcp_mmap() is defined when CONFIG_MMU is set.
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      340a6f3d
  9. 18 1月, 2019 4 次提交
    • P
      switchdev: Add extack argument to call_switchdev_notifiers() · 6685987c
      Petr Machata 提交于
      A follow-up patch will enable vetoing of FDB entries. Make it possible
      to communicate details of why an FDB entry is not acceptable back to the
      user.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6685987c
    • P
      vxlan: Add extack to switchdev operations · 4c59b7d1
      Petr Machata 提交于
      There are four sources of VXLAN switchdev notifier calls:
      
      - the changelink() link operation, which already supports extack,
      - ndo_fdb_add() which got extack support in a previous patch,
      - FDB updates due to packet forwarding,
      - and vxlan_fdb_replay().
      
      Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
      switchdev message that it sends, and propagate the argument upwards to
      the callers. For the first two cases, pass in the extack gotten through
      the operation. For case #3, pass in NULL.
      
      To cover the last case, extend vxlan_fdb_replay() to take extack
      argument, which might come from whatever operation necessitated the FDB
      replay.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c59b7d1
    • V
      tls: Fix recvmsg() to be able to peek across multiple records · 692d7b5d
      Vakul Garg 提交于
      This fixes recvmsg() to be able to peek across multiple tls records.
      Without this patch, the tls's selftests test case
      'recv_peek_large_buf_mult_recs' fails. Each tls receive context now
      maintains a 'rx_list' to retain incoming skb carrying tls records. If a
      tls record needs to be retained e.g. for peek case or for the case when
      the buffer passed to recvmsg() has a length smaller than decrypted
      record length, then it is added to 'rx_list'. Additionally, records are
      added in 'rx_list' if the crypto operation runs in async mode. The
      records are dequeued from 'rx_list' after the decrypted data is consumed
      by copying into the buffer passed to recvmsg(). In case, the MSG_PEEK
      flag is used in recvmsg(), then records are not consumed or removed
      from the 'rx_list'.
      Signed-off-by: NVakul Garg <vakul.garg@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      692d7b5d
    • F
      net: dsa: Split platform data to header file · ecfc9372
      Florian Fainelli 提交于
      Instead of having net/dsa.h contain both the internal switch tree/driver
      structures, split the relevant platform_data parts into
      include/linux/platform_data/dsa.h and make that header be included by
      net/dsa.h in order not to break any setup. A subsequent set of patches
      will update code including net/dsa.h to include only the platform_data
      header.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ecfc9372
  10. 17 1月, 2019 1 次提交
  11. 16 1月, 2019 2 次提交
    • D
      Revert "rxrpc: Allow failed client calls to be retried" · e122d845
      David Howells 提交于
      The changes introduced to allow rxrpc calls to be retried creates an issue
      when it comes to refcounting afs_call structs.  The problem is that when
      rxrpc_send_data() queues the last packet for an asynchronous call, the
      following sequence can occur:
      
       (1) The notify_end_tx callback is invoked which causes the state in the
           afs_call to be changed from AFS_CALL_CL_REQUESTING or
           AFS_CALL_SV_REPLYING.
      
       (2) afs_deliver_to_call() can then process event notifications from rxrpc
           on the async_work queue.
      
       (3) Delivery of events, such as an abort from the server, can cause the
           afs_call state to be changed to AFS_CALL_COMPLETE on async_work.
      
       (4) For an asynchronous call, afs_process_async_call() notes that the call
           is complete and tried to clean up all the refs on async_work.
      
       (5) rxrpc_send_data() might return the amount of data transferred
           (success) or an error - which could in turn reflect a local error or a
           received error.
      
      Synchronising the clean up after rxrpc_kernel_send_data() returns an error
      with the asynchronous cleanup is then tricky to get right.
      
      Mostly revert commit c038a58c.  The two API
      functions the original commit added aren't currently used.  This makes
      rxrpc_kernel_send_data() always return successfully if it queued the data
      it was given.
      
      Note that this doesn't affect synchronous calls since their Rx notification
      function merely pokes a wait queue and does not refcounting.  The
      asynchronous call notification function *has* to do refcounting and pass a
      ref over the work item to avoid the need to sync the workqueue in call
      cleanup.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e122d845
    • I
      net: ipv4: Fix memory leak in network namespace dismantle · f97f4dd8
      Ido Schimmel 提交于
      IPv4 routing tables are flushed in two cases:
      
      1. In response to events in the netdev and inetaddr notification chains
      2. When a network namespace is being dismantled
      
      In both cases only routes associated with a dead nexthop group are
      flushed. However, a nexthop group will only be marked as dead in case it
      is populated with actual nexthops using a nexthop device. This is not
      the case when the route in question is an error route (e.g.,
      'blackhole', 'unreachable').
      
      Therefore, when a network namespace is being dismantled such routes are
      not flushed and leaked [1].
      
      To reproduce:
      # ip netns add blue
      # ip -n blue route add unreachable 192.0.2.0/24
      # ip netns del blue
      
      Fix this by not skipping error routes that are not marked with
      RTNH_F_DEAD when flushing the routing tables.
      
      To prevent the flushing of such routes in case #1, add a parameter to
      fib_table_flush() that indicates if the table is flushed as part of
      namespace dismantle or not.
      
      Note that this problem does not exist in IPv6 since error routes are
      associated with the loopback device.
      
      [1]
      unreferenced object 0xffff888066650338 (size 56):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
          e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
        backtrace:
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      unreferenced object 0xffff888061621c88 (size 48):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
          6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
        backtrace:
          [<00000000733609e3>] fib_table_insert+0x978/0x1500
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      
      Fixes: 8cced9ef ("[NETNS]: Enable routing configuration in non-initial namespace.")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f97f4dd8
  12. 11 1月, 2019 1 次提交
    • W
      netfilter: nft_flow_offload: fix interaction with vrf slave device · 10f4e765
      wenxu 提交于
      In the forward chain, the iif is changed from slave device to master vrf
      device. Thus, flow offload does not find a match on the lower slave
      device.
      
      This patch uses the cached route, ie. dst->dev, to update the iif and
      oif fields in the flow entry.
      
      After this patch, the following example works fine:
      
       # ip addr add dev eth0 1.1.1.1/24
       # ip addr add dev eth1 10.0.0.1/24
       # ip link add user1 type vrf table 1
       # ip l set user1 up
       # ip l set dev eth0 master user1
       # ip l set dev eth1 master user1
      
       # nft add table firewall
       # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
       # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
       # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
       # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1
      Signed-off-by: Nwenxu <wenxu@ucloud.cn>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      10f4e765
  13. 04 1月, 2019 1 次提交
    • L
      Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds 提交于
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96d4f267
  14. 02 1月, 2019 2 次提交
    • W
      ip: validate header length on virtual device xmit · cb9f1b78
      Willem de Bruijn 提交于
      KMSAN detected read beyond end of buffer in vti and sit devices when
      passing truncated packets with PF_PACKET. The issue affects additional
      ip tunnel devices.
      
      Extend commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the
      inner header") and commit ccfec9e5 ("ip_tunnel: be careful when
      accessing the inner header").
      
      Move the check to a separate helper and call at the start of each
      ndo_start_xmit function in net/ipv4 and net/ipv6.
      
      Minor changes:
      - convert dev_kfree_skb to kfree_skb on error path,
        as dev_kfree_skb calls consume_skb which is not for error paths.
      - use pskb_network_may_pull even though that is pedantic here,
        as the same as pskb_may_pull for devices without llheaders.
      - do not cache ipv6 hdrs if used only once
        (unsafe across pskb_may_pull, was more relevant to earlier patch)
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb9f1b78
    • D
      sock: Make sock->sk_stamp thread-safe · 3a0ed3e9
      Deepa Dinamani 提交于
      Al Viro mentioned (Message-ID
      <20170626041334.GZ10672@ZenIV.linux.org.uk>)
      that there is probably a race condition
      lurking in accesses of sk_stamp on 32-bit machines.
      
      sock->sk_stamp is of type ktime_t which is always an s64.
      On a 32 bit architecture, we might run into situations of
      unsafe access as the access to the field becomes non atomic.
      
      Use seqlocks for synchronization.
      This allows us to avoid using spinlocks for readers as
      readers do not need mutual exclusion.
      
      Another approach to solve this is to require sk_lock for all
      modifications of the timestamps. The current approach allows
      for timestamps to have their own lock: sk_stamp_lock.
      This allows for the patch to not compete with already
      existing critical sections, and side effects are limited
      to the paths in the patch.
      
      The addition of the new field maintains the data locality
      optimizations from
      commit 9115e8cd ("net: reorganize struct sock for better data
      locality")
      
      Note that all the instances of the sk_stamp accesses
      are either through the ioctl or the syscall recvmsg.
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a0ed3e9
  15. 29 12月, 2018 2 次提交
  16. 25 12月, 2018 1 次提交
  17. 21 12月, 2018 6 次提交
    • P
      net: seg6.h: remove an unused #include · a6ae520d
      Peter Oskolkov 提交于
      A minor code cleanup.
      Signed-off-by: NPeter Oskolkov <posk@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6ae520d
    • F
      netfilter: netns: shrink netns_ct struct · 8527f9df
      Florian Westphal 提交于
      remove the obsolete sysctl anchors and move auto_assign_helper_warned
      to avoid/cover a hole.  Reduces size by 40 bytes on 64 bit.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8527f9df
    • F
      netfilter: conntrack: remove empty pernet fini stubs · fc3893fd
      Florian Westphal 提交于
      after moving sysctl handling into single place, the init functions
      can't fail anymore and some of the fini functions are empty.
      
      Remove them and change return type to void.
      This also simplifies error unwinding in conntrack module init path.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      fc3893fd
    • F
      netfilter: conntrack: un-export seq_print_acct · 4b216e21
      Florian Westphal 提交于
      Only one caller, just place it where its needed.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      4b216e21
    • F
      netfilter: conntrack: udp: only extend timeout to stream mode after 2s · d535c8a6
      Florian Westphal 提交于
      Currently DNS resolvers that send both A and AAAA queries from same source port
      can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
      for three minutes, even though DNS requests are done in a few milliseconds.
      
      Add a two second grace period where we continue to use the ordinary
      30-second default timeout.  Its enough for DNS request/response traffic,
      even if two request/reply packets are involved.
      
      ASSURED is still set, else conntrack (and thus a possible
      NAT mapping ...) gets zapped too in case conntrack table runs full.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d535c8a6
    • J
      bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c
      John Fastabend 提交于
      A sockmap program that redirects through a kTLS ULP enabled socket
      will not work correctly because the ULP layer is skipped. This
      fixes the behavior to call through the ULP layer on redirect to
      ensure any operations required on the data stream at the ULP layer
      continue to be applied.
      
      To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
      calling the BPF layer on a redirected message. This is
      required to avoid calling the BPF layer multiple times (possibly
      recursively) which is not the current/expected behavior without
      ULPs. In the future we may add a redirect flag if users _do_
      want the policy applied again but this would need to work for both
      ULP and non-ULP sockets and be opt-in to avoid breaking existing
      programs.
      
      Also to avoid polluting the flag space with an internal flag we
      reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
      MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
      SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
      to verify is user space API is masked correctly to ensure the flag
      can not be set by user. (Note this needs to be true regardless
      because we have internal flags already in-use that user space
      should not be able to set). But for completeness we have two UAPI
      paths into sendpage, sendfile and splice.
      
      In the sendfile case the function do_sendfile() zero's flags,
      
      ./fs/read_write.c:
       static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
      		   	    size_t count, loff_t max)
       {
         ...
         fl = 0;
      #if 0
         /*
          * We need to debate whether we can enable this or not. The
          * man page documents EAGAIN return for the output at least,
          * and the application is arguably buggy if it doesn't expect
          * EAGAIN on a non-blocking file descriptor.
          */
          if (in.file->f_flags & O_NONBLOCK)
      	fl = SPLICE_F_NONBLOCK;
      #endif
          file_start_write(out.file);
          retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
       }
      
      In the splice case the pipe_to_sendpage "actor" is used which
      masks flags with SPLICE_F_MORE.
      
      ./fs/splice.c:
       static int pipe_to_sendpage(struct pipe_inode_info *pipe,
      			    struct pipe_buffer *buf, struct splice_desc *sd)
       {
         ...
         more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
         ...
       }
      
      Confirming what we expect that internal flags  are in fact internal
      to socket side.
      
      Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      0608c69c
  18. 20 12月, 2018 2 次提交
反馈
建议
客服 返回
顶部