1. 16 10月, 2015 6 次提交
    • J
      tipc: send out RESET immediately when link goes down · 282b3a05
      Jon Paul Maloy 提交于
      When a link is taken down because of a node local event, such as
      disabling of a bearer or an interface, we currently leave it to the
      peer node to discover the broken communication. The default time for
      such failure discovery is 1.5-2 seconds.
      
      If we instead allow the terminating link endpoint to send out a RESET
      message at the moment it is reset, we can achieve the impression that
      both endpoints are going down instantly. Since this is a very common
      scenario, we find it worthwhile to make this small modification.
      
      Apart from letting the link produce the said message, we also have to
      ensure that the interface is able to transmit it before TIPC is
      detached. We do this by performing the disabling of a bearer in three
      steps:
      
      1) Disable reception of TIPC packets from the interface in question.
      2) Take down the links, while allowing them so send out a RESET message.
      3) Disable transmission of TIPC packets on the interface.
      
      Apart from this, we now have to react on the NETDEV_GOING_DOWN event,
      instead of as currently the NEDEV_DOWN event, to ensure that such
      transmission is possible during the teardown phase.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      282b3a05
    • J
      tipc: delay ESTABLISH state event when link is established · 73f646ce
      Jon Paul Maloy 提交于
      Link establishing, just like link teardown, is a non-atomic action, in
      the sense that discovering that conditions are right to establish a link,
      and the actual adding of the link to one of the node's send slots is done
      in two different lock contexts. The link FSM is designed to help bridging
      the gap between the two contexts in a safe manner.
      
      We have now discovered a weakness in the implementaton of this FSM.
      Because we directly let the link go from state LINK_ESTABLISHING to
      state LINK_ESTABLISHED already in the first lock context, we are unable
      to distinguish between a fully established link, i.e., a link that has
      been added to its slot, and a link that has not yet reached the second
      lock context. It may hence happen that a manual intervention, e.g., when
      disabling an interface, causes the function tipc_node_link_down() to try
      removing the link from the node slots, decrementing its active link
      counter etc, although the link was never added there in the first place.
      
      We solve this by delaying the actual state change until we reach the
      second lock context, inside the function tipc_node_link_up(). This
      makes it possible for potentail callers of __tipc_node_link_down() to
      know if they should proceed or not, and the problem is solved.
      
      Unforunately, the situation described above also has a second problem.
      Since there by necessity is a tipc_node_link_up() call pending once
      the node lock has been released, we must defuse that call by setting
      the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
      us to make a slight modification to the link FSM, which will now look
      as follows.
      
       +------------------------------------+
       |RESET_EVT                           |
       |                                    |
       |                             +--------------+
       |           +-----------------|   SYNCHING   |-----------------+
       |           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
       |           |                  A            |                  |
       |           |                  |            |                  |
       |           |                  |            |                  |
       |           |                  |SYNCH_      |SYNCH_            |
       |           |                  |BEGIN_EVT   |END_EVT           |
       |           |                  |            |                  |
       |           V                  |            V                  V
       |    +-------------+          +--------------+          +------------+
       |    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
       |    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
       |           |        EVT        |    A         RESET_EVT       |
       |           |                   |    |                         |
       |           |  +----------------+    |                         |
       |  RESET_EVT|  |RESET_EVT            |                         |
       |           |  |                     |                         |
       |           |  |                     |ESTABLISH_EVT            |
       |           |  |  +-------------+    |                         |
       |           |  |  | RESET_EVT   |    |                         |
       |           |  |  |             |    |                         |
       |           V  V  V             |    |                         |
       |    +-------------+          +--------------+        RESET_EVT|
       +--->|    RESET    |--------->| ESTABLISHING |<----------------+
            +-------------+ PEER_    +--------------+
             |           A  RESET_EVT       |
             |           |                  |
             |           |                  |
             |FAILOVER_  |FAILOVER_         |FAILOVER_
             |BEGIN_EVT  |END_EVT           |BEGIN_EVT
             |           |                  |
             V           |                  |
            +-------------+                 |
            | FAILINGOVER |<----------------+
            +-------------+
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f646ce
    • J
      tipc: disallow packet duplicates in link deferred queue · 8306f99a
      Jon Paul Maloy 提交于
      After the previous commits, we are guaranteed that no packets
      of type LINK_PROTOCOL or with illegal sequence numbers will be
      attempted added to the link deferred queue. This makes it possible to
      make some simplifications to the sorting algorithm in the function
      tipc_skb_queue_sorted().
      
      We also alter the function so that it will drop packets if one with
      the same seqeunce number is already present in the queue. This is
      necessary because we have identified weird packet sequences, involving
      duplicate packets, where a legitimate in-sequence packet may advance to
      the head of the queue without being detected and de-queued.
      
      Finally, we make this function outline, since it will now be called only
      in exceptional cases.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8306f99a
    • J
      tipc: improve sequence number checking · 81204c49
      Jon Paul Maloy 提交于
      The sequence number of an incoming packet is currently only checked
      for less than, equality to, or bigger than the next expected number,
      meaning that the receive window in practice becomes one half sequence
      number cycle, or U16_MAX/2. This does not make sense, and may not even
      be safe if there are extreme delays in the network. Any packet sent by
      the peer during the ongoing cycle must belong inside his current send
      window, or should otherwise be dropped if possible.
      
      Since a link endpoint cannot know its peer's current send window, it
      has to base this sanity check on a worst-case assumption, i.e., that
      the peer is using a maximum sized window of 8191 packets. Using this
      assumption, we now add a check that the sequence number is not bigger
      than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
      done, so that the receive window test is performed before the gap test.
      This way, we are guaranteed that no packet with illegal sequence numbers
      are ever added to the deferred queue.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81204c49
    • J
      tipc: simplify tipc_link_rcv() reception loop · f9aa358a
      Jon Paul Maloy 提交于
      Currently, all packets received in tipc_link_rcv() are unconditionally
      added to the packet deferred queue, whereafter that queue is walked and
      all its buffers evaluated for delivery. This is both non-optimal and
      and makes the queue sorting function unnecessary complex.
      
      This commit changes the loop so that an arrived packet is evaluated
      first, and added to the deferred queue only when a sequence number gap
      is discovered. A non-empty deferred queue is walked until it is empty
      or until its head's sequence number doesn't fit.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9aa358a
    • J
      tipc: limit usage of temporary skb list during packet reception · 9945e804
      Jon Paul Maloy 提交于
      During packet reception, the function tipc_link_rcv() adds its accepted
      packets to a temporary buffer queue, before finally splicing this queue
      into the lock protected input queue that will be delivered up to the
      socket layer. The purpose is to reduce potential contention on the input
      queue lock. However, since the vast majority of packets arrive in
      sequence, they will anyway be added one by one to the input queue, and
      the use of the temporary queue becomes a sub-optimization.
      
      The only case where this queue makes sense is when unpacking buffers
      from a bundle packet; here we want to avoid dozens of small buffers
      to be added individually to the lock-protected input queue in a tight
      loop.
      
      In this commit, we remove the general usage of the temporary queue,
      and keep it only for the packet unbundling case.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9945e804
  2. 15 10月, 2015 9 次提交
  3. 14 10月, 2015 2 次提交
    • P
      Revert "ipv4/icmp: redirect messages can use the ingress daddr as source" · 02a6d613
      Paolo Abeni 提交于
      Revert the commit e2ca690b ("ipv4/icmp: redirect messages
      can use the ingress daddr as source"), which tried to introduce a more
      suitable behaviour for ICMP redirect messages generated by VRRP routers.
      However RFC 5798 section 8.1.1 states:
      
          The IPv4 source address of an ICMP redirect should be the address
          that the end-host used when making its next-hop routing decision.
      
      while said commit used the generating packet destination
      address, which do not match the above and in most cases leads to
      no redirect packets to be generated.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02a6d613
    • E
      tcp/dccp: fix behavior of stale SYN_RECV request sockets · 4bdc3d66
      Eric Dumazet 提交于
      When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
      become stale, meaning 3WHS can not complete.
      
      But current behavior is wrong :
      incoming packets finding such stale sockets are dropped.
      
      We need instead to cleanup the request socket and perform another
      lookup :
      - Incoming ACK will give a RST answer,
      - SYN rtx might find another listener if available.
      - We expedite cleanup of request sockets and old listener socket.
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bdc3d66
  4. 13 10月, 2015 23 次提交
    • A
      can: avoid using timeval for uapi · ba61a8d9
      Arnd Bergmann 提交于
      The can subsystem communicates with user space using a bcm_msg_head
      header, which contains two timestamps. This is problematic for
      multiple reasons:
      
      a) The structure layout is currently incompatible between 64-bit
         user space and 32-bit user space, and cannot work in compat
         mode (other than x32).
      
      b) The timeval structure layout will change in 32-bit user
         space when we fix the y2038 overflow problem by redefining
         time_t to 64-bit, making new 32-bit user space incompatible
         with the current kernel interface.
         Cars last a long time and often use old kernels, so the actual
         users of this code are the most likely ones to migrate to y2038
         safe user space.
      
      This tries to work around part of the problem by changing the
      publicly visible user interface in the header, but not the binary
      interface. Fortunately, the values passed around in the structure
      are relative times and do not actually suffer from the y2038
      overflow, so 32-bit is enough here.
      
      We replace the use of 'struct timeval' with a newly defined
      'struct bcm_timeval' that uses the exact same binary layout
      as before and that still suffers from problem a) but not problem
      b).
      
      The downside of this approach is that any user space program
      that currently assigns a timeval structure to these members
      rather than writing the tv_sec/tv_usec portions individually
      will suffer a compile-time error when built with an updated
      kernel header. Fixing this error makes it work fine with old
      and new headers though.
      
      We could address problem a) by using '__u32' or 'int' members
      rather than 'long', but that would have a more significant
      downside in also breaking support for all existing 64-bit user
      binaries that might be using this interface, which is likely
      not acceptable.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Cc: linux-can@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      ba61a8d9
    • N
      bridge: vlan: move back vlan_flush · f409d0ed
      Nikolay Aleksandrov 提交于
      Ido Schimmel reported a problem with switchdev devices because of the
      order change of del_nbp operations, more specifically the move of
      nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
      rx_handler has been unregistered. So in order to fix this move
      vlan_flush back where it was and make it destroy the rhtable after
      NULLing vlgrp and waiting a grace period to make sure noone can see it.
      Reported-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f409d0ed
    • N
      bridge: vlan: drop unnecessary flush code · b8d02c3c
      Nikolay Aleksandrov 提交于
      As Ido Schimmel pointed out the vlan_vid_del() code in nbp_vlan_flush is
      unnecessary (and is actually a remnant of the old vlan code) so we can
      remove it.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8d02c3c
    • N
      bridge: vlan: use rcu for vlan_list traversal in br_fill_ifinfo · e9c953ef
      Nikolay Aleksandrov 提交于
      br_fill_ifinfo is called by br_ifinfo_notify which can be called from
      many contexts with different locks held, sometimes it relies upon
      bridge's spinlock only which is a problem for the vlan code, so use
      explicitly rcu for that to avoid problems.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e9c953ef
    • N
      bridge: vlan: use proper rcu for the vlgrp member · 907b1e6e
      Nikolay Aleksandrov 提交于
      The bridge and port's vlgrp member is already used in RCU way, currently
      we rely on the fact that it cannot disappear while the port exists but
      that is error-prone and we might miss places with improper locking
      (either RCU or RTNL must be held to walk the vlan_list). So make it
      official and use RCU for vlgrp to catch offenders. Introduce proper vlgrp
      accessors and use them consistently throughout the code.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      907b1e6e
    • D
      net: Add VRF support to IPv6 stack · ca254490
      David Ahern 提交于
      As with IPv4 support for VRFs added to IPv6 stack by replacing hardcoded
      table ids with possibly device specific ones and manipulating the oif in
      the flowi6. The flow flags are used to skip oif compare in nexthop lookups
      if the device is enslaved to a VRF via the L3 master device.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca254490
    • D
      net: Export fib6_get_table and nd_tbl · c4850687
      David Ahern 提交于
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4850687
    • N
      bridge: fix gc_timer mod/del race condition · af379392
      Nikolay Aleksandrov 提交于
      commit c62987bb ("bridge: push bridge setting ageing_time down to
      switchdev") introduced a timer race condition because the gc_timer can
      get rearmed after it's supposedly stopped and flushed in br_dev_delete()
      leading to a use of freed memory. So take rtnl to sync with bridge
      destruction when setting ageing_timer.
      Here's the trace reproduced with these two commands running in parallel:
      while :; do echo 10000 > /sys/class/net/br0/bridge/ageing_timer; done;
      while :; do brctl addbr br0; ip l set br0 up; ip l set br0 down;
      brctl delbr br0; done;
      
      [  300.000029] BUG: unable to handle kernel paging request at
      ffffffff811c59d3
      [  300.000263] IP: [<ffffffff810f168e>] __internal_add_timer+0x2e/0xd0
      [  300.000422] PGD 1a0f067 PUD 1a10063 PMD 10001e1
      [  300.000639] Oops: 0003 [#1] SMP
      [  300.000793] Modules linked in: bridge stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev aesni_intel
      aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd
      snd_hda_codec_generic qxl drm_kms_helper psmouse pcspkr ttm
      snd_hda_intel 9pnet_virtio evdev serio_raw joydev snd_hda_codec 9pnet
      virtio_balloon drm snd_hwdep virtio_console snd_hda_core pvpanic snd_pcm
      i2c_piix4 snd_timer acpi_cpufreq parport_pc snd parport soundcore button
      processor i2c_core ipv6 autofs4 hid_generic usbhid hid ext4 crc16
      mbcache jbd2 sg sr_mod cdrom ata_generic virtio_blk virtio_net e1000
      ehci_pci uhci_hcd ehci_hcd usbcore usb_common floppy ata_piix libata
      virtio_pci virtio_ring virtio scsi_mod
      [  300.004008] CPU: 1 PID: 1169 Comm: bash Not tainted 4.3.0-rc3+ #46
      [  300.004008] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  300.004008] task: ffff880035be2200 ti: ffff88003795c000 task.ti:
      ffff88003795c000
      [  300.004008] RIP: 0010:[<ffffffff810f168e>]  [<ffffffff810f168e>]
      __internal_add_timer+0x2e/0xd0
      [  300.004008] RSP: 0018:ffff88003fd03e78  EFLAGS: 00010046
      [  300.004008] RAX: ffff88003fd0ef60 RBX: 840fc78949c08548 RCX:
      00000001ffffffff
      [  300.004008] RDX: 0000000000000000 RSI: ffffffff811c59d3 RDI:
      ffff88003fd0df00
      [  300.004008] RBP: ffff88003fd03e78 R08: 00000000ffffffff R09:
      0000000000000000
      [  300.004008] R10: 0000000000000000 R11: 0000000000000000 R12:
      ffff88003fd0df00
      [  300.004008] R13: 0000000000000000 R14: 0000000000000001 R15:
      ffffffff816032e0
      [  300.004008] FS:  00007fcbdd609700(0000) GS:ffff88003fd00000(0000)
      knlGS:0000000000000000
      [  300.004008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.004008] CR2: ffffffff811c59d3 CR3: 0000000037879000 CR4:
      00000000000406e0
      [  300.004008] Stack:
      [  300.004008]  ffff88003fd03ea8 ffffffff810f1775 ffff88003c8cb958
      ffff88003fd0df00
      [  300.004008]  0000000000000000 0000000000000001 ffff88003fd03f18
      ffffffff810f28c4
      [  300.004008]  ffff88003fd0eb68 ffff88003fd0e968 ffff88003fd0e768
      ffff88003fd0df68
      [  300.004008] Call Trace:
      [  300.004008]  <IRQ>
      [  300.004008]  [<ffffffff810f1775>] cascade+0x45/0x70
      [  300.004008]  [<ffffffff810f28c4>] run_timer_softirq+0x2f4/0x340
      [  300.004008]  [<ffffffff8107e380>] __do_softirq+0xd0/0x440
      [  300.004008]  [<ffffffff8107e8a3>] irq_exit+0xb3/0xc0
      [  300.004008]  [<ffffffff815c2032>] smp_apic_timer_interrupt+0x42/0x50
      [  300.004008]  [<ffffffff815bfe37>] apic_timer_interrupt+0x87/0x90
      [  300.004008]  <EOI>
      [  300.004008]  [<ffffffff811fb80c>] ? create_object+0x13c/0x2e0
      [  300.004008]  [<ffffffff8109b23e>] ? __kernel_text_address+0x4e/0x70
      [  300.004008]  [<ffffffff8109b23e>] ? __kernel_text_address+0x4e/0x70
      [  300.004008]  [<ffffffff8101e17f>] print_context_stack+0x7f/0xf0
      [  300.004008]  [<ffffffff8101d55b>] dump_trace+0x11b/0x300
      [  300.004008]  [<ffffffff8102970b>] save_stack_trace+0x2b/0x50
      [  300.004008]  [<ffffffff811fb80c>] create_object+0x13c/0x2e0
      [  300.004008]  [<ffffffff815b2e8e>] kmemleak_alloc+0x4e/0xb0
      [  300.004008]  [<ffffffff811e475d>] kmem_cache_alloc_trace+0x18d/0x2f0
      [  300.004008]  [<ffffffff8128b139>] kernfs_fop_open+0xc9/0x380
      [  300.004008]  [<ffffffff8120214f>] do_dentry_open+0x1ff/0x2f0
      [  300.004008]  [<ffffffff8128b070>] ? kernfs_fop_release+0x70/0x70
      [  300.004008]  [<ffffffff812034f9>] vfs_open+0x59/0x60
      [  300.004008]  [<ffffffff812130de>] path_openat+0x1ce/0x1260
      [  300.004008]  [<ffffffff812154ae>] do_filp_open+0x7e/0xe0
      [  300.004008]  [<ffffffff812251ff>] ? __alloc_fd+0xaf/0x180
      [  300.004008]  [<ffffffff8120387b>] do_sys_open+0x12b/0x210
      [  300.004008]  [<ffffffff8120397e>] SyS_open+0x1e/0x20
      [  300.004008]  [<ffffffff815bf0b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [  300.004008] Code: 66 90 48 8b 46 10 48 8b 4f 40 55 48 89 c2 48 89 e5
      48 29 ca 48 81 fa ff 00 00 00 77 20 0f b6 c0 48 8d 44 c7 68 48 8b 10 48
      85 d2 <48> 89 16 74 04 48 89 72 08 48 89 30 48 89 46 08 5d c3 48 81 fa
      [  300.004008] RIP  [<ffffffff810f168e>] __internal_add_timer+0x2e/0xd0
      [  300.004008]  RSP <ffff88003fd03e78>
      [  300.004008] CR2: ffffffff811c59d3
      
      Fixes: c62987bb ("bridge: push bridge setting ageing_time down to switchdev")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af379392
    • N
      switchdev: enforce no pvid flag in vlan ranges · cc02aa8e
      Nikolay Aleksandrov 提交于
      We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: NElad Raz <eladr@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc02aa8e
    • V
      net: dsa: do not warn unsupported bridge ops · efd29b3d
      Vivien Didelot 提交于
      A DSA driver may not provide the port_join_bridge and port_leave_bridge
      functions, so don't warn in such case.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efd29b3d
    • S
      RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one() · 241b2719
      Sowmini Varadhan 提交于
      Consider the following "duelling syn" sequence between two peers A and B:
              	A		B
              	SYN1     -->
              	    	<--	SYN2
              	SYN2ACK  -->
      
      Note that the SYN/ACK has already been sent out by TCP before
      rds_tcp_accept_one() gets invoked as part of callbacks.
      
      If the inet_addr(A) is numerically less than inet_addr(B),
      the arbitration scheme in rds_tcp_accept_one() will prefer the
      TCP connection triggered by SYN1, and will send a CLOSE for the
      SYN2 (just after the SYN2ACK was sent).
      
      Since B also follows the same arbitration scheme, it will send the SYN-ACK
      for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
      B will also get a  CLOSE for SYN2, which should result in the cleanup
      of the TCP state machine for SYN2, but it should not trigger any
      stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
      that would disrupt the progress of the SYN2 based RDS-TCP  connection.
      
      Thus the arbitration scheme in rds_tcp_accept_one() should restore
      rds_tcp callbacks for the winner before setting them up for the
      new accept socket, and also make sure that conn->c_outgoing
      is set to 0 so that we do not trigger any reconnect attempts on the
      passive side of the tcp socket in the future, in conformance with
      commit c82ac7e6 ("net/rds: RDS-TCP: only initiate reconnect attempt
      on outgoing TCP socket.")
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      241b2719
    • S
      RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports. · 48679800
      Sowmini Varadhan 提交于
      The IP address passed to rds_bind() should be vetted by the
      transport's ->laddr_check() for a previously bound transport.
      This needs to be done to avoid cases where, for example,
      the application has asked for an IB transport,
      but the IP address passed to bind is only usable on
      ethernet interfaces.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48679800
    • N
      bridge: vlan: enforce no pvid flag in vlan ranges · 6623c60d
      Nikolay Aleksandrov 提交于
      Currently it's possible for someone to send a vlan range to the kernel
      with the pvid flag set which will result in the pvid bouncing from a
      vlan to vlan and isn't correct, it also introduces problems for hardware
      where it doesn't make sense having more than 1 pvid. iproute2 already
      enforces this, so let's enforce it on kernel-side as well.
      Reported-by: NElad Raz <eladr@mellanox.com>
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6623c60d
    • R
      ipv6 route: use err pointers instead of returning pointer by reference · 8c5b83f0
      Roopa Prabhu 提交于
      This patch makes ip6_route_info_create return err pointer instead of
      returning the rt pointer by reference as suggested  by Dave
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c5b83f0
    • E
      ipv6: Pass struct net into nf_ct_frag6_gather · b7277597
      Eric W. Biederman 提交于
      The function nf_ct_frag6_gather is called on both the input and the
      output paths of the networking stack.  In particular ipv6_defrag which
      calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain
      on input and the LOCAL_OUT chain on output.
      
      The addition of a net parameter makes it explicit which network
      namespace the packets are being reassembled in, and removes the need
      for nf_ct_frag6_gather to guess.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7277597
    • E
      ipv4: Pass struct net into ip_defrag and ip_check_defrag · 19bcf9f2
      Eric W. Biederman 提交于
      The function ip_defrag is called on both the input and the output
      paths of the networking stack.  In particular conntrack when it is
      tracking outbound packets from the local machine calls ip_defrag.
      
      So add a struct net parameter and stop making ip_defrag guess which
      network namespace it needs to defragment packets in.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19bcf9f2
    • E
      ipv4: Only compute net once in ip_call_ra_chain · 37fcbab6
      Eric W. Biederman 提交于
      ip_call_ra_chain is called early in the forwarding chain from
      ip_forward and ip_mr_input, which makes skb->dev the correct
      expression to get the input network device and dev_net(skb->dev) a
      correct expression for the network namespace the packet is being
      processed in.
      
      Compute the network namespace and store it in a variable to make the
      code clearer.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37fcbab6
    • E
      packet: fix match_fanout_group() · 161642e2
      Eric Dumazet 提交于
      Recent TCP listener patches exposed a prior af_packet bug :
      match_fanout_group() blindly assumes it is always safe
      to cast sk to a packet socket to compare fanout with af_packet_priv
      
      But SYNACK packets can be sent while attached to request_sock, which
      are smaller than a "struct sock".
      
      We can read non existent memory and crash.
      
      Fixes: c0de08d0 ("af_packet: don't emit packet on orig fanout group")
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Leblond <eric@regit.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      161642e2
    • P
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni 提交于
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      redirect.
      
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-op.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2ca690b
    • J
      bridge: try switchdev op first in __vlan_vid_add/del · 0944d6b5
      Jiri Pirko 提交于
      Some drivers need to implement both switchdev vlan ops and
      vid_add/kill ndos. For that to work in bridge code, we need to try
      switchdev op first when adding/deleting vlan id.
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0944d6b5
    • E
      net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet 提交于
      One 32bit hole is following skc_refcnt, use it.
      skc_incoming_cpu can also be an union for request_sock rcv_wnd.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed53d0ab
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
    • E
      packet: support per-packet fwmark for af_packet sendmsg · c7d39e32
      Edward Jee 提交于
      Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d39e32