1. 18 May 2017, 2 commits
    • net: fix compile error in skb_orphan_partial() · 9142e900
      Committed by Eric Dumazet
      If CONFIG_INET is not set, net/core/sock.c cannot compile:
      
      net/core/sock.c: In function ‘skb_orphan_partial’:
      net/core/sock.c:1810:2: error: implicit declaration of function
      ‘skb_is_tcp_pure_ack’ [-Werror=implicit-function-declaration]
        if (skb_is_tcp_pure_ack(skb))
        ^
      
      Fix this by always including <net/tcp.h>
      
      Fixes: f6ba8d33 ("netem: fix skb_orphan_partial()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9142e900
    • ipv6: Prevent overrun when parsing v6 header options · 2423496a
      Committed by Craig Gallek
      The KASAN warning reported below was discovered with a syzkaller
      program.  The reproducer is basically:
        int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
        send(s, &one_byte_of_data, 1, MSG_MORE);
        send(s, &more_than_mtu_bytes_data, 2000, 0);
      
      The socket() call sets the nexthdr field of the v6 header to
      NEXTHDR_HOP, the first send call primes the payload with a non zero
      byte of data, and the second send call triggers the fragmentation path.
      
      The fragmentation code tries to parse the header options in order
      to figure out where to insert the fragment option.  Since nexthdr points
      to an invalid option, the calculation of the size of the network header
      can be made much larger than the linear section of the skb, and data
      is read outside of it.
      
      This fix makes ip6_find_1stfragopt return an error if it detects that
      it would run out of bounds.
      
      [   42.361487] ==================================================================
      [   42.364412] BUG: KASAN: slab-out-of-bounds in ip6_fragment+0x11c8/0x3730
      [   42.365471] Read of size 840 at addr ffff88000969e798 by task ip6_fragment-oo/3789
      [   42.366469]
      [   42.366696] CPU: 1 PID: 3789 Comm: ip6_fragment-oo Not tainted 4.11.0+ #41
      [   42.367628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
      [   42.368824] Call Trace:
      [   42.369183]  dump_stack+0xb3/0x10b
      [   42.369664]  print_address_description+0x73/0x290
      [   42.370325]  kasan_report+0x252/0x370
      [   42.370839]  ? ip6_fragment+0x11c8/0x3730
      [   42.371396]  check_memory_region+0x13c/0x1a0
      [   42.371978]  memcpy+0x23/0x50
      [   42.372395]  ip6_fragment+0x11c8/0x3730
      [   42.372920]  ? nf_ct_expect_unregister_notifier+0x110/0x110
      [   42.373681]  ? ip6_copy_metadata+0x7f0/0x7f0
      [   42.374263]  ? ip6_forward+0x2e30/0x2e30
      [   42.374803]  ip6_finish_output+0x584/0x990
      [   42.375350]  ip6_output+0x1b7/0x690
      [   42.375836]  ? ip6_finish_output+0x990/0x990
      [   42.376411]  ? ip6_fragment+0x3730/0x3730
      [   42.376968]  ip6_local_out+0x95/0x160
      [   42.377471]  ip6_send_skb+0xa1/0x330
      [   42.377969]  ip6_push_pending_frames+0xb3/0xe0
      [   42.378589]  rawv6_sendmsg+0x2051/0x2db0
      [   42.379129]  ? rawv6_bind+0x8b0/0x8b0
      [   42.379633]  ? _copy_from_user+0x84/0xe0
      [   42.380193]  ? debug_check_no_locks_freed+0x290/0x290
      [   42.380878]  ? ___sys_sendmsg+0x162/0x930
      [   42.381427]  ? rcu_read_lock_sched_held+0xa3/0x120
      [   42.382074]  ? sock_has_perm+0x1f6/0x290
      [   42.382614]  ? ___sys_sendmsg+0x167/0x930
      [   42.383173]  ? lock_downgrade+0x660/0x660
      [   42.383727]  inet_sendmsg+0x123/0x500
      [   42.384226]  ? inet_sendmsg+0x123/0x500
      [   42.384748]  ? inet_recvmsg+0x540/0x540
      [   42.385263]  sock_sendmsg+0xca/0x110
      [   42.385758]  SYSC_sendto+0x217/0x380
      [   42.386249]  ? SYSC_connect+0x310/0x310
      [   42.386783]  ? __might_fault+0x110/0x1d0
      [   42.387324]  ? lock_downgrade+0x660/0x660
      [   42.387880]  ? __fget_light+0xa1/0x1f0
      [   42.388403]  ? __fdget+0x18/0x20
      [   42.388851]  ? sock_common_setsockopt+0x95/0xd0
      [   42.389472]  ? SyS_setsockopt+0x17f/0x260
      [   42.390021]  ? entry_SYSCALL_64_fastpath+0x5/0xbe
      [   42.390650]  SyS_sendto+0x40/0x50
      [   42.391103]  entry_SYSCALL_64_fastpath+0x1f/0xbe
      [   42.391731] RIP: 0033:0x7fbbb711e383
      [   42.392217] RSP: 002b:00007ffff4d34f28 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [   42.393235] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbbb711e383
      [   42.394195] RDX: 0000000000001000 RSI: 00007ffff4d34f60 RDI: 0000000000000003
      [   42.395145] RBP: 0000000000000046 R08: 00007ffff4d34f40 R09: 0000000000000018
      [   42.396056] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000400aad
      [   42.396598] R13: 0000000000000066 R14: 00007ffff4d34ee0 R15: 00007fbbb717af00
      [   42.397257]
      [   42.397411] Allocated by task 3789:
      [   42.397702]  save_stack_trace+0x16/0x20
      [   42.398005]  save_stack+0x46/0xd0
      [   42.398267]  kasan_kmalloc+0xad/0xe0
      [   42.398548]  kasan_slab_alloc+0x12/0x20
      [   42.398848]  __kmalloc_node_track_caller+0xcb/0x380
      [   42.399224]  __kmalloc_reserve.isra.32+0x41/0xe0
      [   42.399654]  __alloc_skb+0xf8/0x580
      [   42.400003]  sock_wmalloc+0xab/0xf0
      [   42.400346]  __ip6_append_data.isra.41+0x2472/0x33d0
      [   42.400813]  ip6_append_data+0x1a8/0x2f0
      [   42.401122]  rawv6_sendmsg+0x11ee/0x2db0
      [   42.401505]  inet_sendmsg+0x123/0x500
      [   42.401860]  sock_sendmsg+0xca/0x110
      [   42.402209]  ___sys_sendmsg+0x7cb/0x930
      [   42.402582]  __sys_sendmsg+0xd9/0x190
      [   42.402941]  SyS_sendmsg+0x2d/0x50
      [   42.403273]  entry_SYSCALL_64_fastpath+0x1f/0xbe
      [   42.403718]
      [   42.403871] Freed by task 1794:
      [   42.404146]  save_stack_trace+0x16/0x20
      [   42.404515]  save_stack+0x46/0xd0
      [   42.404827]  kasan_slab_free+0x72/0xc0
      [   42.405167]  kfree+0xe8/0x2b0
      [   42.405462]  skb_free_head+0x74/0xb0
      [   42.405806]  skb_release_data+0x30e/0x3a0
      [   42.406198]  skb_release_all+0x4a/0x60
      [   42.406563]  consume_skb+0x113/0x2e0
      [   42.406910]  skb_free_datagram+0x1a/0xe0
      [   42.407288]  netlink_recvmsg+0x60d/0xe40
      [   42.407667]  sock_recvmsg+0xd7/0x110
      [   42.408022]  ___sys_recvmsg+0x25c/0x580
      [   42.408395]  __sys_recvmsg+0xd6/0x190
      [   42.408753]  SyS_recvmsg+0x2d/0x50
      [   42.409086]  entry_SYSCALL_64_fastpath+0x1f/0xbe
      [   42.409513]
      [   42.409665] The buggy address belongs to the object at ffff88000969e780
      [   42.409665]  which belongs to the cache kmalloc-512 of size 512
      [   42.410846] The buggy address is located 24 bytes inside of
      [   42.410846]  512-byte region [ffff88000969e780, ffff88000969e980)
      [   42.411941] The buggy address belongs to the page:
      [   42.412405] page:ffffea000025a780 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
      [   42.413298] flags: 0x100000000008100(slab|head)
      [   42.413729] raw: 0100000000008100 0000000000000000 0000000000000000 00000001800c000c
      [   42.414387] raw: ffffea00002a9500 0000000900000007 ffff88000c401280 0000000000000000
      [   42.415074] page dumped because: kasan: bad access detected
      [   42.415604]
      [   42.415757] Memory state around the buggy address:
      [   42.416222]  ffff88000969e880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [   42.416904]  ffff88000969e900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      [   42.417591] >ffff88000969e980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   42.418273]                    ^
      [   42.418588]  ffff88000969ea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   42.419273]  ffff88000969ea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [   42.419882] ==================================================================
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2423496a
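      A self-contained version of the reproducer sketched above might look as
      follows (hypothetical illustration only: the ::1 destination and the lack
      of error handling are assumptions, not taken from the syzkaller program;
      it needs CAP_NET_RAW and a pre-fix kernel with KASAN to show the report):
      
        #include <sys/socket.h>
        #include <netinet/in.h>
        
        #define NEXTHDR_HOP 0                   /* IPPROTO_HOPOPTS */
        
        int main(void)
        {
                char one = 1, big[2000] = { 0 };
                struct sockaddr_in6 dst = { .sin6_family = AF_INET6 };
                int s = socket(AF_INET6, SOCK_RAW, NEXTHDR_HOP);
        
                dst.sin6_addr.s6_addr[15] = 1;  /* ::1, illustrative target */
                connect(s, (struct sockaddr *)&dst, sizeof(dst));
        
                send(s, &one, 1, MSG_MORE);     /* prime payload, keep corked */
                send(s, big, sizeof(big), 0);   /* exceed MTU -> fragmentation */
                return 0;
        }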
  2. 17 May 2017, 7 commits
    • neighbour: update neigh timestamps iff update is effective · 77d71233
      Committed by Ihar Hrachyshka
      It's a common practice to send gratuitous ARPs after moving an
      IP address to another device to speed up healing of a service. To
      fulfill service availability constraints, the timing of network peers
      updating their caches to point to a new location of an IP address can be
      particularly important.
      
      Sometimes neigh_update calls won't touch either lladdr or state, for
      example if an update arrives in locktime interval. The neigh->updated
      value is tested by the protocol specific neigh code, which in turn
      will influence whether NEIGH_UPDATE_F_OVERRIDE gets set in the
      call to neigh_update() or not. As a result, we may effectively ignore
      the update request, bailing out of touching the neigh entry, except that
      we still bump its timestamps inside neigh_update.
      
      This may be a problem for updates arriving in quick succession. For
      example, consider the following scenario:
      
      A service is moved to another device with its IP address. The new device
      sends three gratuitous ARP requests into the network with ~1 seconds
      interval between them. Just before the first request arrives to one of
      network peer nodes, its neigh entry for the IP address transitions from
      STALE to DELAY.  This transition, among other things, updates
      neigh->updated. Once the kernel receives the first gratuitous ARP, it
      ignores it because its arrival time is inside the locktime interval. The
      kernel still bumps neigh->updated. Then the second gratuitous ARP
      request arrives, and it's also ignored because it's still in the (new)
      locktime interval. Same happens for the third request. The node
      eventually heals itself (after delay_first_probe_time seconds since the
      initial transition to DELAY state), but it just wasted some time and
      required a new ARP request/reply round trip. This unfortunate behaviour
      both puts more load on the network, as well as reduces service
      availability.
      
      This patch changes neigh_update so that it bumps neigh->updated (as well
      as neigh->confirmed) only once we are sure that either lladdr or entry
      state will change. In the scenario described above, it means that the
      second gratuitous ARP request will actually update the entry lladdr.
      
      Ideally, we would update the neigh entry on the very first gratuitous
      ARP request. The locktime mechanism is designed to ignore ARP updates in
      a short timeframe after a previous ARP update was honoured by the kernel
      layer. This would require tracking timestamps for state transitions
      separately from timestamps when actual updates are received. This would
      probably involve changes in neighbour struct. Therefore, the patch
      doesn't tackle the issue of the first gratuitous ARP being ignored, leaving
      it for a follow-up.
      Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77d71233
    • arp: honour gratuitous ARP _replies_ · 23d268eb
      Committed by Ihar Hrachyshka
      When arp_accept is 1, gratuitous ARPs are supposed to override matching
      entries irrespective of whether they arrive during locktime. This was
      implemented in commit 56022a8f ("ipv4: arp: update neighbour address
      when a gratuitous arp is received and arp_accept is set")
      
      There is a glitch in the patch though. RFC 2002, section 4.6, "ARP,
      Proxy ARP, and Gratuitous ARP", defines gratuitous ARPs so that they can
      be either of Request or Reply type. Those Reply gratuitous ARPs can be
      triggered with standard tooling, for example, arping -A option does just
      that.
      
      This patch fixes the glitch, making both Request and Reply flavours of
      gratuitous ARPs behave identically.
      
      As per RFC, if gratuitous ARPs are of Reply type, their Target Hardware
      Address field should also be set to the link-layer address to which this
      cache entry should be updated. The field is present in ARP over Ethernet
      but not in IEEE 1394. In this patch, I don't consider any broadcasted
      ARP replies as gratuitous if the field is not present, to conform to the
      standard. It's not clear whether there is such a thing for IEEE 1394 as
      a gratuitous ARP reply; until it's cleared up, we will ignore such
      broadcasts. Note that they will still update existing ARP cache entries,
      assuming they arrive outside the locktime interval.
      Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23d268eb
    • net: Improve handling of failures on link and route dumps · f6c5775f
      Committed by David Ahern
      In general, rtnetlink dumps do not anticipate failure to dump a single
      object (e.g., link or route) on a single pass. As both route and link
      objects have grown via more attributes, that is no longer a given.
      
      netlink dumps can handle a failure if the dump function returns an
      error; specifically, netlink_dump adds the return code to the response
      if it is <= 0 so userspace is notified of the failure. The missing
      piece is the rtnetlink dump functions returning the error.
      
      Fix route and link dump functions to return the error if no object has
      been added to the skb (i.e., skb->len is still 0). IPv6 route dumps
      (rt6_dump_route) already return the error; this patch updates IPv4 and
      link dumps. Other dump functions may need to be adjusted as well.
      Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6c5775f
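      A minimal sketch of the error-propagation pattern described above (the
      dump_one_object name and labels are hypothetical; the real dump callbacks
      differ per protocol):
      
        err = dump_one_object(skb, cb);
        if (err < 0) {
                if (likely(skb->len))
                        goto out;       /* partial dump: return what we have */
                goto out_err;           /* nothing added: propagate the error
                                         * so netlink_dump() reports it      */
        }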
    • net/smc: Add warning about remote memory exposure · 19a0f7e3
      Committed by Christoph Hellwig
      The driver explicitly bypasses APIs to register all memory once a
      connection is made, and thus allows remote access to memory.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Acked-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19a0f7e3
    • smc: switch to usage of IB_PD_UNSAFE_GLOBAL_RKEY · 263eec9b
      Committed by Ursula Braun
      Currently, SMC enables remote access to physical memory when a user
      has successfully configured and established an SMC-connection until ten
      minutes after the last SMC connection is closed. Because this is considered
      a security risk, drivers are supposed to use IB_PD_UNSAFE_GLOBAL_RKEY in
      such a case.
      
      This patch changes the current SMC code to use IB_PD_UNSAFE_GLOBAL_RKEY.
      This improves user awareness, but does not remove the security risk itself.
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      263eec9b
    • ipmr: vrf: Find VIFs using the actual device · bcfc7d33
      Committed by Thomas Winter
      The skb->dev that is passed into ip_mr_input is
      the loX device for VRFs. When we lookup a vif
      for this dev, none is found as we do not create
      vifs for loopbacks. Instead, look up a vif for the
      actual device that the packet was received on,
      e.g. the vlan.
      Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
      cc: David Ahern <dsa@cumulusnetworks.com>
      cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      cc: roopa <roopa@cumulusnetworks.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bcfc7d33
    • tcp: eliminate negative reordering in tcp_clean_rtx_queue · bafbb9c7
      Committed by Soheil Hassas Yeganeh
      tcp_ack() can call tcp_fragment() which may decrease the
      value of tp->fackets_out when MSS changes. When prior_fackets
      is larger than tp->fackets_out, tcp_clean_rtx_queue() can
      invoke tcp_update_reordering() with negative values. This
      results in absurd tp->reordering values higher than
      sysctl_tcp_max_reordering.
      
      Note that tcp_update_reordering indeed sets tp->reordering
      to min(sysctl_tcp_max_reordering, metric), but because
      the comparison is signed, a negative metric always wins.
      
      Fixes: c7caf8d3 ("[TCP]: Fix reord detection due to snd_una covered holes")
      Reported-by: Rebecca Isaacs <risaacs@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bafbb9c7
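      The signed-comparison pitfall is easy to demonstrate in user space (a
      hedged illustration; it assumes tp->reordering is stored in an unsigned
      field, as in struct tcp_sock):
      
        #include <stdio.h>
        
        #define min(a, b) ((a) < (b) ? (a) : (b))
        
        int main(void)
        {
                int max_reordering = 300;       /* sysctl_tcp_max_reordering */
                int metric = -2;                /* prior_fackets > fackets_out */
                unsigned int reordering;        /* like tp->reordering */
        
                /* signed min(): the negative metric always wins ... */
                reordering = min(max_reordering, metric);
                /* ... and stored unsigned it becomes a huge value */
                printf("%u\n", reordering);     /* prints 4294967294 */
                return 0;
        }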
  3. 16 May 2017, 2 commits
  4. 12 May 2017, 8 commits
    • sctp: fix src address selection if using secondary addresses for ipv6 · dbc2b5e9
      Committed by Xin Long
      Commit 0ca50d12 ("sctp: fix src address selection if using secondary
      addresses") has fixed a src address selection issue when using secondary
      addresses for ipv4.
      
      Now sctp ipv6 has a similar issue. When using a secondary address,
      sctp_v6_get_dst tries to choose the saddr that shares the most leading
      bits with the daddr, using sctp_v6_addr_match_len. This can make some
      cases not work as expected.
      
      hostA:
        [1] fd21:356b:459a:cf10::11 (eth1)
        [2] fd21:356b:459a:cf20::11 (eth2)
      
      hostB:
        [a] fd21:356b:459a:cf30::2  (eth1)
        [b] fd21:356b:459a:cf40::2  (eth2)
      
      route from hostA to hostB:
        fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
      
      The expected path should be:
        fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
      But addr[2] matches addr[a] more bits than addr[1] does, according to
      sctp_v6_addr_match_len. It causes the path to be:
        fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
      
      This patch fixes it in the same way as Marcelo's fix for sctp ipv4.
      As there is no ip_dev_find for ipv6, this patch uses ipv6_chk_addr to
      check whether the saddr belongs to a dev instead.
      
      Note that for backwards compatibility, it will still do the addr_match_len
      check here when no optimal saddr is found.
      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dbc2b5e9
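      A small user-space illustration of the prefix-length metric described
      above (a hypothetical helper, not the kernel's sctp_v6_addr_match_len,
      but computing the same "number of leading bits in common"):
      
        #include <stdio.h>
        #include <arpa/inet.h>
        
        static int match_len(const struct in6_addr *a, const struct in6_addr *b)
        {
                int i, bits = 0;
        
                for (i = 0; i < 16; i++) {
                        unsigned char x = a->s6_addr[i] ^ b->s6_addr[i];
        
                        if (!x) {
                                bits += 8;
                                continue;
                        }
                        while (!(x & 0x80)) {   /* leading equal bits of this byte */
                                bits++;
                                x <<= 1;
                        }
                        break;
                }
                return bits;
        }
        
        int main(void)
        {
                struct in6_addr a1, a2, b;
        
                inet_pton(AF_INET6, "fd21:356b:459a:cf10::11", &a1);
                inet_pton(AF_INET6, "fd21:356b:459a:cf20::11", &a2);
                inet_pton(AF_INET6, "fd21:356b:459a:cf30::2", &b);
                printf("[1] vs [a]: %d bits\n", match_len(&a1, &b));  /* 58 */
                printf("[2] vs [a]: %d bits\n", match_len(&a2, &b));  /* 59 */
                return 0;
        }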
    • tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Committed by Jon Paul Maloy
      The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
      to fulfil its task. The latter, in turn, is evaluating the stated
      condition outside the socket lock context. This is problematic if
      the condition is accessing non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links()
      linked list, used by socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
      
      Fixes: commit 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      844cf763
    • net: sched: optimize class dumps · cb395b20
      Committed by Eric Dumazet
      In commit 59cc1f61 ("net: sched: convert qdisc linked list to
      hashtable") we missed the opportunity to considerably speed up
      tc_dump_tclass_root() if a qdisc handle is provided by user.
      
      Instead of iterating all the qdiscs, use qdisc_match_from_root()
      to directly get the one we look for.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cb395b20
    • tcp: avoid fragmenting peculiar skbs in SACK · b451e5d2
      Committed by Yuchung Cheng
      This patch fixes a bug in splitting an SKB during SACK
      processing. Specifically if an skb contains multiple
      packets and is only partially sacked in the higher sequences,
      tcp_match_skb_to_sack() splits the skb and marks the second fragment
      as SACKed.
      
      The current code further attempts rounding up the first fragment
      to MSS boundaries. But it misses a boundary condition when the
      rounded-up fragment size (pkt_len) is exactly the skb size.  Splitting
      such an skb is pointless and causes a kernel warning and aborts
      the SACK processing. This patch universally checks for such an over-split
      before calling tcp_fragment to prevent these unnecessary warnings.
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b451e5d2
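      A hedged sketch of the kind of guard described above (illustrative only;
      the exact condition and its placement in the patch may differ):
      
        /* pkt_len was rounded up to an MSS boundary; if it already covers
         * the whole skb there is nothing to split off, so skip the call
         * that would otherwise trigger the warning in tcp_fragment(). */
        if (pkt_len >= skb->len)
                return 0;
        err = tcp_fragment(sk, skb, pkt_len, mss, GFP_ATOMIC);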
    • netem: fix skb_orphan_partial() · f6ba8d33
      Committed by Eric Dumazet
      I should have known that lowering skb->truesize was dangerous :/
      
      In case packets are not leaving the host via a standard Ethernet device,
      but looped back to local sockets, bad things can happen, as reported
      by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
      
      So instead of tweaking skb->truesize, let's change skb->destructor
      and keep a reference on the owner socket via its sk_refcnt.
      
      Fixes: f2f872f9 ("netem: Introduce skb_orphan_partial() helper")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Michael Madsen <mkm@nabto.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6ba8d33
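      A hedged sketch of the approach described above (illustrative; the actual
      diff differs in details such as the tcp_wfree case and CONFIG_INET guards):
      
        /* Instead of lowering skb->truesize, pin the owner socket with a
         * reference and switch to a destructor that only drops it. */
        if (skb->destructor == sock_wfree) {
                struct sock *sk = skb->sk;
        
                if (atomic_inc_not_zero(&sk->sk_refcnt)) {
                        atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
                        skb->destructor = sock_efree;   /* sock_put(sk) */
                }
        } else {
                skb_orphan(skb);
        }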
    • xdp: refine xdp api with regards to generic xdp · d67b9cd2
      Committed by Daniel Borkmann
      While working on the iproute2 generic XDP frontend, I noticed that
      as of right now it's possible to have native *and* generic XDP
      programs loaded both at the same time for the case when a driver
      supports native XDP.
      
      The intended model for generic XDP from b5cdae32 ("net: Generic
      XDP") is, however, that only one out of the two can be present at
      once which is also indicated as such in the XDP netlink dump part.
      The main rationale for generic XDP is to ease accessibility (in
      case a driver does not yet have XDP support) and to generically
      provide a semantical model as an example for driver developers
      wanting to add XDP support. The generic XDP option for an XDP
      aware driver can still be useful for comparing and testing both
      implementations.
      
      However, it is not intended to have a second XDP processing stage
      or layer with exactly the same functionality of the first native
      stage. Only reason could be to have a partial fallback for future
      XDP features that are not supported yet in the native implementation
      and we probably also shouldn't strive for such fallback and instead
      encourage native feature support in the first place. Given there's
      currently no such fallback issue or use case, let's not go there yet
      if we don't need to.
      
      Therefore, change semantics for loading XDP and bail out if the
      user tries to load a generic XDP program when a native one is
      present and vice versa. Another alternative to bailing out would
      be to handle the transition from one flavor to another gracefully,
      but that would require to bring the device down, exchange both
      types of programs, and bring it up again in order to avoid a tiny
      window where a packet could hit both hooks. Given this complicates
      the logic for just a debugging feature in the native case, I went
      with the simpler variant.
      
      For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae32
      and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
      or just a subset of flags that were used for loading the XDP prog
      is suboptimal in the long run since not all flags are useful for
      dumping and if we start to reuse the same flag definitions for
      load and dump, then we'll waste bit space. What we really just
      want is to dump the mode for now.
      
      Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
      a program is running at the native driver layer (1). Thus, add a
      mode that says that a program is running at generic XDP layer (2).
      Applications will handle this fine in that older binaries will
      just indicate that something is attached at XDP layer, effectively
      this is similar to IFLA_XDP_FLAGS attr that we would have had
      modulo the redundancy.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d67b9cd2
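      A sketch of the dump semantics described above (the constant names are
      assumptions chosen for illustration and may not match the uapi header
      exactly; the 0/1/2 values come from the commit text):
      
        enum {
                XDP_ATTACHED_NONE = 0,  /* no XDP program installed       */
                XDP_ATTACHED_DRV  = 1,  /* program at native driver layer */
                XDP_ATTACHED_SKB  = 2,  /* program at generic XDP layer   */
        };
        /* IFLA_XDP_ATTACHED now carries one of these modes instead of the
         * load-time flags. */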
    • xdp: add flag to enforce driver mode · 0489df9a
      Committed by Daniel Borkmann
      After commit b5cdae32 ("net: Generic XDP") we automatically fall
      back to a generic XDP variant if the driver does not support native
      XDP. Allow for an option where the user can specify that the native
      XDP variant should always be selected and, in case it's not supported
      by a driver, just bail out.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0489df9a
    • ipv6/dccp: do not inherit ipv6_mc_list from parent · 83eaddab
      Committed by WANG Cong
      Like commit 657831ff ("dccp/tcp: do not inherit mc_list from parent")
      we should clear ipv6_mc_list etc. for IPv6 sockets too.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83eaddab
  5. 10 May 2017, 1 commit
  6. 09 May 2017, 14 commits
    • DECnet: Use container_of() for embedded struct · f92ceb01
      Committed by Kees Cook
      Instead of a direct cross-type cast, use container_of() to locate
      the embedded structure, even in the face of future struct layout
      randomization.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f92ceb01
    • Revert "ipv4: restore rt->fi for reference counting" · 32f1bc0f
      Committed by David S. Miller
      This reverts commit 82486aa6.
      
      As implemented, this causes dangling netdevice refs.
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32f1bc0f
    • treewide: convert PF_MEMALLOC manipulations to new helpers · f1083048
      Committed by Vlastimil Babka
      We now have memalloc_noreclaim_{save,restore} helpers for robust setting
      and clearing of PF_MEMALLOC.  Let's convert the code which was using the
      generic tsk_restore_flags().  No functional change.
      
      [vbabka@suse.cz: in net/core/sock.c the hunk is missing]
      Link: http://lkml.kernel.org/r/20170405074700.29871-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Wouter Verhelst <w@uter.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1083048
    • fs: ceph: CURRENT_TIME with ktime_get_real_ts() · 1134e091
      Committed by Deepa Dinamani
      CURRENT_TIME is not y2038 safe.  The macro will be deleted and all the
      references to it will be replaced by ktime_get_* apis.
      
      struct timespec is also not y2038 safe.  Retain timespec for timestamp
      representation here as ceph uses it internally everywhere.  These
      references will be changed to use struct timespec64 in a separate patch.
      
      The current_fs_time() api is being changed to use vfs struct inode* as
      an argument instead of struct super_block*.
      
      Set the new mds client request r_stamp field using ktime_get_real_ts()
      instead of using current_fs_time().
      
      Also, since r_stamp is used as mtime on the server, use timespec_trunc()
      to truncate the timestamp, using the right granularity from the
      superblock.
      
      This api will be transitioned to be y2038 safe along with vfs.
      
      Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      M:	Ilya Dryomov <idryomov@gmail.com>
      M:	"Yan, Zheng" <zyan@redhat.com>
      M:	Sage Weil <sage@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1134e091
    • format-security: move static strings to const · 06324664
      Committed by Kees Cook
      While examining output from trial builds with -Wformat-security enabled,
      many strings were found that should be defined as "const", or as a char
      array instead of char pointer.  This makes some static analysis easier,
      by producing fewer false positives.
      
      As these are all trivial changes, it seemed best to put them all in a
      single patch rather than chopping them up per maintainer.
      
      Link: http://lkml.kernel.org/r/20170405214711.GA5711@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Jes Sorensen <jes@trained-monkey.org>	[runner.c]
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Patrice Chotard <patrice.chotard@st.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Mugunthan V N <mugunthanvnm@ti.com>
      Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Antonio Quartulli <a@unstable.cc>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Kejian Yan <yankejian@huawei.com>
      Cc: Daode Huang <huangdaode@hisilicon.com>
      Cc: Qianqian Xie <xieqianqian@huawei.com>
      Cc: Philippe Reynes <tremyfr@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Christian Gromm <christian.gromm@microchip.com>
      Cc: Andrey Shvetsov <andrey.shvetsov@k2l.de>
      Cc: Jason Litzinger <jlitzingerdev@gmail.com>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06324664
    • mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Committed by Michal Hocko
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really want
      to give __GFP_HIGHMEM along with other flags because there is really no
      reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
      which are mapped to the kernel vmalloc space.  About half of users don't
      use this flag, though.  This signals that we have made the API
      unnecessarily complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Cristopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19809c2d
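      A before/after sketch of what this means for callers (a hedged
      illustration using the three-argument __vmalloc() of that era):
      
        /* before: callers had to remember the highmem flag themselves */
        buf = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
        
        /* after: __GFP_HIGHMEM is applied implicitly when pages are
         * allocated for the vmalloc area, so this is equivalent */
        buf = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);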
    • net: use kvmalloc with __GFP_REPEAT rather than open coded variant · da6bc57a
      Committed by Michal Hocko
      fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
      vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
      __GFP_REPEAT flag based on explanation from Eric:
      
       "At the time, tests on the hardware I had in my labs showed that
        vmalloc() could deliver pages spread all over the memory and that was
        a small penalty (once memory is fragmented enough, not at boot time)"
      
      The way the code is constructed means, however, that we prefer to go
      and hit the OOM killer before we fall back to the vmalloc for requests
      <=32kB (with 4kB pages) in the current code.  This is rather disruptive
      for something that can be achieved with the fallback.  On the other hand
      __GFP_REPEAT doesn't have any useful semantics for these requests.  So
      the effect of this patch is that requests which fit into 32kB will fall
      back to vmalloc more easily now.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da6bc57a
    • treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Committed by Michal Hocko
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      752ade68
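      A sketch of the conversion this series performs (a hedged, generic
      illustration of the open-coded pattern and its replacement, not any one
      specific call site):
      
        /* before: open-coded kmalloc with vmalloc fallback */
        table = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
        if (!table)
                table = vzalloc(size);
        ...
        kvfree(table);
        
        /* after: let the helper pick the strategy */
        table = kvzalloc(size, GFP_KERNEL);
        ...
        kvfree(table);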
    • net/ipv6/ila/ila_xlat.c: simplify a strange allocation pattern · 847f716f
      Committed by Michal Hocko
      alloc_ila_locks seemed to c&p from alloc_bucket_locks allocation pattern
      which is quite unusual.  The default allocation size is 320 *
      sizeof(spinlock_t) which is sub page unless lockdep is enabled when the
      performance benefit is really questionable and not worth the subtle code
      IMHO.  Also note that the context when we call ila_init_net (modprobe or
      a task creating a net namespace) has to be properly configured.
      
      Let's just simplify the code and use kvmalloc helper which is a
      transparent way to use kmalloc with vmalloc fallback.
      
      Link: http://lkml.kernel.org/r/20170306103032.2540-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      847f716f
    • ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf · 242d3a49
      Committed by WANG Cong
      For each netns (except init_net), we initialize its null entry
      in 3 places:
      
      1) The template itself, as we use kmemdup()
      2) Code around dst_init_metrics() in ip6_route_net_init()
      3) ip6_route_dev_notify(), which is supposed to initialize it after
         loopback registers
      
      Unfortunately the last one still happens in a wrong order because
      we expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
      net->loopback_dev's idev, thus we have to do that after we add
      idev to loopback. However, this notifier has priority == 0 same as
      ipv6_dev_notf, and ipv6_dev_notf is registered after
      ip6_route_dev_notifier, so it is actually called after
      ip6_route_dev_notifier. This is similar to commit 2f460933
      ("ipv6: initialize route null entry in addrconf_init()") which
      fixes init_net.
      
      Fix it by picking a smaller priority for ip6_route_dev_notifier.
      Also, we have to release the refcnt accordingly when unregistering
      loopback_dev because device exit functions are called before subsys
      exit functions.
      Acked-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      242d3a49
    • vti: check nla_put_* return value · 8ed508fd
      Committed by Hangbin Liu
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ed508fd
    • vlan: Keep NETIF_F_HW_CSUM similar to other software devices · 8403debe
      Committed by Vlad Yasevich
      Vlan devices, like all other software devices, enable the
      NETIF_F_HW_CSUM feature.  However, unlike all the other
      software devices, vlans will switch to using IP|IPV6_CSUM
      features if the underlying device uses them.  In these situations,
      checksum offload features on the vlan device can't be controlled
      via ethtool.
      
      This patch makes vlans keep HW_CSUM feature if the underlying
      device supports checksum offloading.  This makes vlan devices
      behave like other software devices, and restores control to the
      user.
      
      A side-effect is that some offload settings (typically UFO)
      may be enabled on the vlan device while being disabled on the HW.
      However, the GSO code will correctly process the packets. This
      actually results in slightly better raw throughput.
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8403debe
    • tcp: make congestion control optionally skip slow start after idle · 1b1fc3fd
      Committed by Wei Wang
      Congestion control modules that want full control over congestion
      control behavior do not want the cwnd modifications controlled by
      the sysctl_tcp_slow_start_after_idle code path.
      So skip those code paths for CC modules that use the cong_control()
      API.
      As an example, those cwnd effects are not desired for the BBR congestion
      control algorithm.
      
      Fixes: c0402760 ("tcp: new CC hook to set sending rate with rate_sample in any CA state")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b1fc3fd
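      A hedged sketch of the check described above (simplified and
      illustrative; the exact helper and its placement in the patch differ):
      
        const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
        
        /* Skip the after-idle cwnd restart entirely when the CC module
         * implements cong_control() and therefore owns cwnd itself. */
        if (!sysctl_tcp_slow_start_after_idle || ca_ops->cong_control)
                return;
        /* ... otherwise fall through to the usual cwnd restart logic ... */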
    • ipv4: restore rt->fi for reference counting · 82486aa6
      Committed by WANG Cong
      IPv4 dst could use fi->fib_metrics to store metrics but fib_info
      itself is refcnt'ed, so without taking a refcnt fi and
      fi->fib_metrics could be freed while the dst metrics still point to
      it. This triggers a use-after-free, as reported by Andrey twice.
      
      This patch reverts commit 2860583f ("ipv4: Kill rt->fi") to
      restore this reference counting. It is a quick fix for -net and
      -stable, for -net-next, as Eric suggested, we can consider doing
      reference counting for metrics itself instead of relying on fib_info.
      
      IPv6 is very different, it copies or steals the metrics from mx6_config
      in fib6_commit_metrics() so probably doesn't need a refcnt.
      
      Decnet has already done the refcnt'ing, see dn_fib_semantic_match().
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82486aa6
  7. 08 May 2017, 3 commits
  8. 06 May 2017, 1 commit
    • tcp: randomize timestamps on syncookies · 84b114b9
      Committed by Eric Dumazet
      The whole point of randomization was to hide server uptime, but an attacker
      can simply start a syn flood and TCP generates 'old style' timestamps,
      directly revealing the server's jiffies value.
      
      Also, the TSval sent by the server to a particular remote address varies
      depending on whether syncookies are being sent or not, potentially
      triggering PAWS drops for innocent clients.
      
      Let's implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian's feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Tested-by: Florian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      84b114b9
  9. 05 May 2017, 2 commits