1. 13 Oct, 2015 40 commits
    • bridge: vlan: use rcu for vlan_list traversal in br_fill_ifinfo · e9c953ef
      Nikolay Aleksandrov committed
      br_fill_ifinfo is called by br_ifinfo_notify, which can be called from
      many contexts with different locks held. Sometimes it relies only on the
      bridge's spinlock, which is a problem for the vlan code, so use RCU
      explicitly to avoid problems.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: vlan: use proper rcu for the vlgrp member · 907b1e6e
      Nikolay Aleksandrov committed
      The bridge's and port's vlgrp member is already used in an RCU way.
      Currently we rely on the fact that it cannot disappear while the port
      exists, but that is error-prone and we might miss places with improper
      locking (either RCU or RTNL must be held to walk the vlan_list). So make
      it official and use RCU for vlgrp to catch offenders. Introduce proper
      vlgrp accessors and use them consistently throughout the code.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vrf-ipv6' · 4b918163
      David S. Miller committed
      David Ahern says:
      
      ====================
      net: VRF support in IPv6 stack
      
      Initial support for VRF in IPv6 stack. Makes IPv6 functionality on par
      with IPv4 -- ping, tcp client/server and udp client/server all work fine.
      tcpdump on vrf device and external tap (e.g., host side tap device) shows
      all packets with proper addresses. IPv6 does not need the source address
      operation like IPv4. Verified vti6 works properly in my setup as does use
      of an IPv6 address on the VRF device.
      
      v3
      - re-based to top of net-next (updates per net namespace changes by Eric)
      - fixed dst_entry typecasts as requested by Dave
      - added flags to inet6_rtm_getroute (IPv6 version of deaa0a6a)
      
      v2
      - fixed CONFIG_IPV6 dependency as questioned by Cong
        - if IPV6 is a module, kbuild ensures VRF is a module
        - if IPV6 is disabled IPV6 functionality is compiled out of VRF module
      - addressed comments from Nik over IRC
        - removed duplicate call to netif_is_l3_master in l3mdev_rt6_dst_by_oif
        - changed allocation flag from GFP_ATOMIC to GFP_KERNEL since it is init time
        - added free of rt6i_pcpu
        - check_ipv6_frame returns false only if packet is NDISC type
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add VRF support to IPv6 stack · ca254490
      David Ahern committed
      As with IPv4, support for VRFs is added to the IPv6 stack by replacing
      hardcoded table ids with possibly device-specific ones and manipulating
      the oif in the flowi6. The flow flags are used to skip the oif comparison
      in nexthop lookups if the device is enslaved to a VRF via the L3 master
      device.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add IPv6 support to VRF device · 35402e31
      David Ahern committed
      Add support for IPv6 to the VRF device driver. The implementation
      parallels what has been done for IPv4.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Export fib6_get_table and nd_tbl · c4850687
      David Ahern committed
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add IPv6 support to l3mdev · ccf3c8c3
      David Ahern committed
      Add operations to retrieve the cached IPv6 dst entry from an l3mdev
      device and to look up the IPv6 source address.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: fix gc_timer mod/del race condition · af379392
      Nikolay Aleksandrov committed
      commit c62987bb ("bridge: push bridge setting ageing_time down to
      switchdev") introduced a timer race condition because the gc_timer can
      get rearmed after it's supposedly stopped and flushed in br_dev_delete()
      leading to a use of freed memory. So take rtnl to sync with bridge
      destruction when setting ageing_timer.
      Here's the trace reproduced with these two commands running in parallel:
      while :; do echo 10000 > /sys/class/net/br0/bridge/ageing_timer; done;
      while :; do brctl addbr br0; ip l set br0 up; ip l set br0 down;
      brctl delbr br0; done;
      
      [  300.000029] BUG: unable to handle kernel paging request at
      ffffffff811c59d3
      [  300.000263] IP: [<ffffffff810f168e>] __internal_add_timer+0x2e/0xd0
      [  300.000422] PGD 1a0f067 PUD 1a10063 PMD 10001e1
      [  300.000639] Oops: 0003 [#1] SMP
      [  300.000793] Modules linked in: bridge stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev aesni_intel
      aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd
      snd_hda_codec_generic qxl drm_kms_helper psmouse pcspkr ttm
      snd_hda_intel 9pnet_virtio evdev serio_raw joydev snd_hda_codec 9pnet
      virtio_balloon drm snd_hwdep virtio_console snd_hda_core pvpanic snd_pcm
      i2c_piix4 snd_timer acpi_cpufreq parport_pc snd parport soundcore button
      processor i2c_core ipv6 autofs4 hid_generic usbhid hid ext4 crc16
      mbcache jbd2 sg sr_mod cdrom ata_generic virtio_blk virtio_net e1000
      ehci_pci uhci_hcd ehci_hcd usbcore usb_common floppy ata_piix libata
      virtio_pci virtio_ring virtio scsi_mod
      [  300.004008] CPU: 1 PID: 1169 Comm: bash Not tainted 4.3.0-rc3+ #46
      [  300.004008] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  300.004008] task: ffff880035be2200 ti: ffff88003795c000 task.ti:
      ffff88003795c000
      [  300.004008] RIP: 0010:[<ffffffff810f168e>]  [<ffffffff810f168e>]
      __internal_add_timer+0x2e/0xd0
      [  300.004008] RSP: 0018:ffff88003fd03e78  EFLAGS: 00010046
      [  300.004008] RAX: ffff88003fd0ef60 RBX: 840fc78949c08548 RCX:
      00000001ffffffff
      [  300.004008] RDX: 0000000000000000 RSI: ffffffff811c59d3 RDI:
      ffff88003fd0df00
      [  300.004008] RBP: ffff88003fd03e78 R08: 00000000ffffffff R09:
      0000000000000000
      [  300.004008] R10: 0000000000000000 R11: 0000000000000000 R12:
      ffff88003fd0df00
      [  300.004008] R13: 0000000000000000 R14: 0000000000000001 R15:
      ffffffff816032e0
      [  300.004008] FS:  00007fcbdd609700(0000) GS:ffff88003fd00000(0000)
      knlGS:0000000000000000
      [  300.004008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.004008] CR2: ffffffff811c59d3 CR3: 0000000037879000 CR4:
      00000000000406e0
      [  300.004008] Stack:
      [  300.004008]  ffff88003fd03ea8 ffffffff810f1775 ffff88003c8cb958
      ffff88003fd0df00
      [  300.004008]  0000000000000000 0000000000000001 ffff88003fd03f18
      ffffffff810f28c4
      [  300.004008]  ffff88003fd0eb68 ffff88003fd0e968 ffff88003fd0e768
      ffff88003fd0df68
      [  300.004008] Call Trace:
      [  300.004008]  <IRQ>
      [  300.004008]  [<ffffffff810f1775>] cascade+0x45/0x70
      [  300.004008]  [<ffffffff810f28c4>] run_timer_softirq+0x2f4/0x340
      [  300.004008]  [<ffffffff8107e380>] __do_softirq+0xd0/0x440
      [  300.004008]  [<ffffffff8107e8a3>] irq_exit+0xb3/0xc0
      [  300.004008]  [<ffffffff815c2032>] smp_apic_timer_interrupt+0x42/0x50
      [  300.004008]  [<ffffffff815bfe37>] apic_timer_interrupt+0x87/0x90
      [  300.004008]  <EOI>
      [  300.004008]  [<ffffffff811fb80c>] ? create_object+0x13c/0x2e0
      [  300.004008]  [<ffffffff8109b23e>] ? __kernel_text_address+0x4e/0x70
      [  300.004008]  [<ffffffff8109b23e>] ? __kernel_text_address+0x4e/0x70
      [  300.004008]  [<ffffffff8101e17f>] print_context_stack+0x7f/0xf0
      [  300.004008]  [<ffffffff8101d55b>] dump_trace+0x11b/0x300
      [  300.004008]  [<ffffffff8102970b>] save_stack_trace+0x2b/0x50
      [  300.004008]  [<ffffffff811fb80c>] create_object+0x13c/0x2e0
      [  300.004008]  [<ffffffff815b2e8e>] kmemleak_alloc+0x4e/0xb0
      [  300.004008]  [<ffffffff811e475d>] kmem_cache_alloc_trace+0x18d/0x2f0
      [  300.004008]  [<ffffffff8128b139>] kernfs_fop_open+0xc9/0x380
      [  300.004008]  [<ffffffff8120214f>] do_dentry_open+0x1ff/0x2f0
      [  300.004008]  [<ffffffff8128b070>] ? kernfs_fop_release+0x70/0x70
      [  300.004008]  [<ffffffff812034f9>] vfs_open+0x59/0x60
      [  300.004008]  [<ffffffff812130de>] path_openat+0x1ce/0x1260
      [  300.004008]  [<ffffffff812154ae>] do_filp_open+0x7e/0xe0
      [  300.004008]  [<ffffffff812251ff>] ? __alloc_fd+0xaf/0x180
      [  300.004008]  [<ffffffff8120387b>] do_sys_open+0x12b/0x210
      [  300.004008]  [<ffffffff8120397e>] SyS_open+0x1e/0x20
      [  300.004008]  [<ffffffff815bf0b6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [  300.004008] Code: 66 90 48 8b 46 10 48 8b 4f 40 55 48 89 c2 48 89 e5
      48 29 ca 48 81 fa ff 00 00 00 77 20 0f b6 c0 48 8d 44 c7 68 48 8b 10 48
      85 d2 <48> 89 16 74 04 48 89 72 08 48 89 30 48 89 46 08 5d c3 48 81 fa
      [  300.004008] RIP  [<ffffffff810f168e>] __internal_add_timer+0x2e/0xd0
      [  300.004008]  RSP <ffff88003fd03e78>
      [  300.004008] CR2: ffffffff811c59d3
      
      Fixes: c62987bb ("bridge: push bridge setting ageing_time down to switchdev")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: enforce no pvid flag in vlan ranges · cc02aa8e
      Nikolay Aleksandrov committed
      We shouldn't allow the BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Elad Raz <eladr@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-mv88e6xxx-fix-hardware-bridging' · f83665d0
      David S. Miller committed
      Vivien Didelot says:
      
      ====================
      net: dsa: mv88e6xxx: fix hardware bridging
      
      DSA and its drivers currently hook the NETDEV_CHANGEUPPER net_device event in
      order to configure the VLAN map of every port.
      
      This VLAN map is a feature of these switch chips to hardcode and restrict which
      output ports a given input port can egress frames to.
      
      A Linux bridge is a simple untagged VLAN propagated by the bridge code itself.
      With a proper 802.1Q support, a driver does not need this hook anymore, and
      will simply program the related VLAN object.
      
      This patchset improves the hardware bridging code in the mv88e6xxx driver with
      a strict 802.1Q mode.
      
      Ideally, the equivalent must be done for Broadcom Starfighter 2 and Rocker,
      before completely getting rid of this hook.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: fix hardware bridging · 5fe7f680
      Vivien Didelot committed
      Playing with the VLAN map of every port to implement "hardware bridging"
      in the 88E6352 driver was a hack until full 802.1Q was supported.
      
      Indeed with 802.1Q port mode "Disabled" or "Fallback", this feature is
      used to restrict which output ports an input port can egress frames to.
      
      A Linux bridge is an untagged VLAN. With full 802.1Q support, we don't
      need this hack anymore and can use the "Secure" strict 802.1Q port mode.
      
      With this mode, the port-based VLAN map still needs to be configured,
      but all the logic is VTU-centric. This means that the switch only cares
      about rules described in its hardware VLAN table, which is exactly what
      Linux bridge expects and what we want.
      
      Note also that the hardware bridging was broken with the previous
      flexible "Fallback" 802.1Q port mode. Here's an example:
      
      Port0 and Port1 belong to the same bridge. If Port0 sends crafted tagged
      frames with VID 200 to Port1, Port1 receives it. Even if Port1 is in
      hardware VLAN 200, but not Port0, Port1 will still receive it, because
      Fallback mode doesn't care about invalid VID or non-member source port.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
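The Port0/Port1 example in the commit above can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: the function name `egress_allowed`, the dictionary layout used for the VTU and the port-based VLAN map, and the reduced rule set are all hypothetical and are not taken from the mv88e6xxx driver or the switch datasheet.

```python
def egress_allowed(mode, vtu, port_map, src, dst, vid):
    """Simplified model of the forwarding decision for one tagged frame.

    mode     -- "fallback" or "secure" (802.1Q port mode of the ingress port)
    vtu      -- dict mapping VID -> set of member ports (hardware VLAN table)
    port_map -- dict mapping port -> set of ports it may egress to
    """
    members = vtu.get(vid)
    if members is None:
        # VID unknown to the VTU: Secure drops, Fallback uses the port map.
        if mode == "secure":
            return False
        return dst in port_map[src]
    # Secure mode also rejects frames whose ingress port is not a member.
    if mode == "secure" and src not in members:
        return False
    return dst in members and dst in port_map[src]


# Port0 and Port1 are in the same bridge; only Port1 is in VLAN 200.
vtu = {200: {1}}
port_map = {0: {1}, 1: {0}}
print(egress_allowed("fallback", vtu, port_map, 0, 1, 200))
print(egress_allowed("secure", vtu, port_map, 0, 1, 200))
```

In this model the crafted VID 200 frame from Port0 reaches Port1 in Fallback mode but is dropped in Secure mode, which is the behaviour the commit message describes.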
    • net: dsa: do not warn unsupported bridge ops · efd29b3d
      Vivien Didelot committed
      A DSA driver may not provide the port_join_bridge and port_leave_bridge
      functions, so don't warn in that case.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: do not support per-port FID · f02bdffc
      Vivien Didelot committed
      Since we configure a switch chip through a Linux bridge, and a bridge is
      implemented as a VLAN, there is no need for a per-port FID anymore.

      This patch gets rid of it and simplifies the driver code, since we can
      now directly map all 4095 available FIDs to all VLANs.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: bridges do not need an FID · ede8098d
      Vivien Didelot committed
      With 88E6352 and similar switch chips, each port has a map to restrict
      which output port this input port can egress frames to.
      
      The current driver code implements hardware bridging using this feature,
      and assigns to a bridge group the FID of its first member.
      
      Now that 802.1Q is fully implemented in this driver, a Linux bridge
      which is a simple untagged VLAN, already gets its own FID.
      
      This patch gets rid of the per-bridge FID and makes the usage of the
      port-based VLAN map feature explicit.
      Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one() · 241b2719
      Sowmini Varadhan committed
      Consider the following "duelling syn" sequence between two peers A and B:
              	A		B
              	SYN1     -->
              	    	<--	SYN2
              	SYN2ACK  -->
      
      Note that the SYN/ACK has already been sent out by TCP before
      rds_tcp_accept_one() gets invoked as part of callbacks.
      
      If the inet_addr(A) is numerically less than inet_addr(B),
      the arbitration scheme in rds_tcp_accept_one() will prefer the
      TCP connection triggered by SYN1, and will send a CLOSE for the
      SYN2 (just after the SYN2ACK was sent).
      
      Since B also follows the same arbitration scheme, it will send the SYN-ACK
      for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
      B will also get a CLOSE for SYN2, which should result in the cleanup
      of the TCP state machine for SYN2, but it should not trigger any
      stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
      that would disrupt the progress of the SYN2 based RDS-TCP connection.
      
      Thus the arbitration scheme in rds_tcp_accept_one() should restore
      rds_tcp callbacks for the winner before setting them up for the
      new accept socket, and also make sure that conn->c_outgoing
      is set to 0 so that we do not trigger any reconnect attempts on the
      passive side of the tcp socket in the future, in conformance with
      commit c82ac7e6 ("net/rds: RDS-TCP: only initiate reconnect attempt
      on outgoing TCP socket.")
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
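The address-based arbitration described in the commit above can be sketched in a few lines of Python. This is an illustration only: `winning_initiator` is a hypothetical helper, and comparing the addresses as host-order integers is an assumption, since the commit message only says that inet_addr(A) is "numerically less" than inet_addr(B).

```python
import ipaddress


def winning_initiator(addr_a, addr_b):
    """Return the peer whose outgoing connection is kept.

    Both peers evaluate the same rule, so they agree on the winner:
    the connection initiated by the numerically lower address survives.
    """
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    if a == b:
        raise ValueError("peers must have distinct addresses")
    return addr_a if a < b else addr_b


# A = 10.0.0.1 and B = 10.0.0.2: A's SYN1 wins regardless of who asks.
print(winning_initiator("10.0.0.1", "10.0.0.2"))
print(winning_initiator("10.0.0.2", "10.0.0.1"))
```

Because the rule is symmetric, neither side needs extra negotiation to decide which of the two duelling connections to close.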
    • RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports. · 48679800
      Sowmini Varadhan committed
      The IP address passed to rds_bind() should be vetted by the
      transport's ->laddr_check() for a previously bound transport.
      This needs to be done to avoid cases where, for example,
      the application has asked for an IB transport,
      but the IP address passed to bind is only usable on
      ethernet interfaces.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qlcnic: constify qlcnic_mbx_ops structure · 571f2c11
      Julia Lawall committed
      The only instance of a qlcnic_mbx_ops structure is never modified.  Thus
      the declaration of the structure and all references to the structure type
      can be made const.
      
      In the definition of the qlcnic_mailbox structure, the ops field is no
      longer lined up with the other fields.  This was left as is, to avoid a lot
      of trivial changes on the other lines.
      
      Done with the help of Coccinelle.
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Acked-by: Sony Chacko <sony.chacko@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: vlan: enforce no pvid flag in vlan ranges · 6623c60d
      Nikolay Aleksandrov committed
      Currently it's possible for someone to send a vlan range to the kernel
      with the pvid flag set, which results in the pvid bouncing from vlan to
      vlan and isn't correct. It also introduces problems for hardware, where
      it doesn't make sense to have more than one pvid. iproute2 already
      enforces this, so let's enforce it on the kernel side as well.
      Reported-by: Elad Raz <eladr@mellanox.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
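The rule enforced by the commit above can be sketched as a flag check in Python. The flag values mirror the `BRIDGE_VLAN_INFO_*` bits from the kernel's uapi `linux/if_bridge.h` as far as I know them; the function name `vlan_flags_valid` is hypothetical and the real kernel check is written in C and may be structured differently.

```python
# Flag bits as defined in include/uapi/linux/if_bridge.h (assumed values).
BRIDGE_VLAN_INFO_PVID = 1 << 1
BRIDGE_VLAN_INFO_RANGE_BEGIN = 1 << 3
BRIDGE_VLAN_INFO_RANGE_END = 1 << 4


def vlan_flags_valid(flags):
    """Reject a vlan entry that is part of a range and also carries PVID."""
    in_range = flags & (BRIDGE_VLAN_INFO_RANGE_BEGIN | BRIDGE_VLAN_INFO_RANGE_END)
    has_pvid = flags & BRIDGE_VLAN_INFO_PVID
    return not (in_range and has_pvid)


# A single vlan may be the pvid; a range boundary may not.
print(vlan_flags_valid(BRIDGE_VLAN_INFO_PVID))
print(vlan_flags_valid(BRIDGE_VLAN_INFO_RANGE_BEGIN | BRIDGE_VLAN_INFO_PVID))
```

Rejecting the combination up front avoids the "bouncing pvid" effect, where each vlan in the range would briefly become the port's pvid in turn.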
    • atm: iphase: fix misleading indention · cbb41b91
      Tillmann Heidsieck committed
      Fix a smatch warning:
      drivers/atm/iphase.c:1178 rx_pkt() warn: curly braces intended?

      The code is correct; the indention is misleading. In case the allocation
      of skb fails, we want to skip to the end.
      Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • atm: iphase: return -ENOMEM instead of -1 in case of failed kmalloc() · 21e26ff9
      Tillmann Heidsieck committed
      Smatch complains about returning hard-coded error codes; silence this
      warning.

      drivers/atm/iphase.c:115 ia_enque_rtn_q() warn: returning -1 instead of -ENOMEM is sloppy
      Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6 route: use err pointers instead of returning pointer by reference · 8c5b83f0
      Roopa Prabhu committed
      This patch makes ip6_route_info_create return an err pointer instead of
      returning the rt pointer by reference, as suggested by Dave.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns: fix the unknown phy_nterface_t type error · 99dcc7df
      huangdaode committed
      This patch fixes the build error reported by Jiri Pirko <jiri@resnulli.us>:
      
      drivers/net/ethernet/hisilicon/hns/hnae.h:465:2: error: unknown type
      name 'phy_interface_t'
              phy_interface_t phy_if;
      	^
      The full build log is on https://lists.01.org/pipermail/kbuild-all.
      Signed-off-by: huangdaode <huangdaode@hisilicon.com>
      Signed-off-by: yankejian <yankejian@huawei.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: use sk_fullsock() before reading sk->sk_tsflags · 5fcd2d8b
      Eric Dumazet committed
      timewait or request sockets are small and do not contain sk->sk_tsflags
      
      Without this fix, we might read garbage, and crash later in
      
      __skb_complete_tx_timestamp()
       -> sock_queue_err_skb()
      
      (These pseudo sockets do not have an error queue either)
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'netns-defrag' · b7a46095
      David S. Miller committed
      Eric W. Biederman says:
      
      ====================
      net: Pass net into defragmentation
      
      This is the next installment of my work to pass struct net through the
      output path so the code does not need to guess how to figure out which
      network namespace it is in, and ultimately routes can have output
      devices in another network namespace.
      
      In netfilter and af_packet we defragment packets in the output path,
      and there is the usual amount of confusion about how to compute which
      net we are processing the packets in.  This patchset clears that
      confusion up by explicitly passing in struct net in ip_defrag,
      ip_check_defrag, and nf_ct_frag6_gather.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Pass struct net into nf_ct_frag6_gather · b7277597
      Eric W. Biederman committed
      The function nf_ct_frag6_gather is called on both the input and the
      output paths of the networking stack.  In particular ipv6_defrag which
      calls nf_ct_frag6_gather is called from both the PRE_ROUTING chain
      on input and the LOCAL_OUT chain on output.
      
      The addition of a net parameter makes it explicit which network
      namespace the packets are being reassembled in, and removes the need
      for nf_ct_frag6_gather to guess.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Pass struct net into ip_defrag and ip_check_defrag · 19bcf9f2
      Eric W. Biederman committed
      The function ip_defrag is called on both the input and the output
      paths of the networking stack.  In particular conntrack when it is
      tracking outbound packets from the local machine calls ip_defrag.
      
      So add a struct net parameter and stop making ip_defrag guess which
      network namespace it needs to defragment packets in.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Only compute net once in ip_call_ra_chain · 37fcbab6
      Eric W. Biederman committed
      ip_call_ra_chain is called early in the forwarding chain from
      ip_forward and ip_mr_input, which makes skb->dev the correct
      expression to get the input network device and dev_net(skb->dev) a
      correct expression for the network namespace the packet is being
      processed in.
      
      Compute the network namespace and store it in a variable to make the
      code clearer.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: fix match_fanout_group() · 161642e2
      Eric Dumazet committed
      Recent TCP listener patches exposed a prior af_packet bug:
      match_fanout_group() blindly assumes it is always safe
      to cast sk to a packet socket to compare fanout with af_packet_priv.
      
      But SYNACK packets can be sent while attached to request_sock, which
      are smaller than a "struct sock".
      
      We can read non existent memory and crash.
      
      Fixes: c0de08d0 ("af_packet: don't emit packet on orig fanout group")
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eric Leblond <eric@regit.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of... · 99165967
      David S. Miller committed
      Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Kalle Valo says:
      
      ====================
      Major changes:
      
      iwlwifi
      
      * some debugfs improvements
      * fix signedness in beacon statistics
      * deinline some functions to reduce size when device tracing is enabled
      * filter beacons out in AP mode when no stations are associated
      * deprecate firmwares version -12
      * fix a runtime PM vs. legacy suspend race
      * one-liner fix for a ToF bug
      * clean-ups in the rx code
      * small debugging improvement
      * fix WoWLAN with new firmware versions
      * more clean-ups towards multiple RX queues
      * some rate scaling fixes and improvements
      * some time-of-flight fixes
      * other generic improvements and clean-ups
      
      brcmfmac
      
      * rework code dealing with multiple interfaces
      * allow logging firmware console using debug level
      * support for BCM4350, BCM4365, and BCM4366 PCIE devices
      * fixes for legacy P2P and P2P device handling
      * correct set and get tx-power
      
      ath9k
      
      * add support for Outside Context of a BSS (OCB) mode
      
      mwifiex
      
      * add USB multichannel feature
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni committed
      This patch allows configuring how the source address of ICMP
      redirect messages is selected. By default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr forces the
      use of the destination address of the packet that caused the
      redirect.

      The new behaviour closely fits RFC 5798 section 8.1.1 and fixes the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-hop.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
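The source-address choice described in the commit above can be modelled with a short Python sketch. The helper name `redirect_source` and the documentation addresses (192.0.2.0/24 standing in for x.x.x.0/24) are assumptions for illustration; only the sysctl name `icmp_redirects_use_orig_daddr` comes from the commit message, and the kernel's actual code path is not reproduced here.

```python
import ipaddress


def redirect_source(primary_addr, orig_daddr, use_orig_daddr=False):
    """Pick the source address for an ICMP redirect.

    primary_addr   -- primary IP of the interface the packet arrived on
    orig_daddr     -- destination address of the packet that caused the redirect
    use_orig_daddr -- models the icmp_redirects_use_orig_daddr setting
    """
    primary = ipaddress.ip_address(primary_addr)
    daddr = ipaddress.ip_address(orig_daddr)
    return daddr if use_orig_daddr else primary


# Router owns 192.0.2.1; hosts use the VRRP address 192.0.2.254 as gateway.
print(redirect_source("192.0.2.1", "192.0.2.254"))
print(redirect_source("192.0.2.1", "192.0.2.254", use_orig_daddr=True))
```

With the option enabled, the redirect appears to come from the gateway address the host actually uses, so the host no longer has a reason to discard it.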
    • bridge: try switchdev op first in __vlan_vid_add/del · 0944d6b5
      Jiri Pirko committed
      Some drivers need to implement both switchdev vlan ops and
      vid_add/kill ndos. For that to work in bridge code, we need to try
      switchdev op first when adding/deleting vlan id.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • BNX2: free temp_stats_blk on error path · 3703ebe4
      wangweidong committed
      bnx2_init_board does not free temp_stats_blk on the error path when
      some operations fail. Add the missing kfree().
      Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'setsockopt_incoming_cpu' · 76973dd7
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: better smp listener behavior
      
      As promised in last patch series, we implement a better SO_REUSEPORT
      strategy, based on cpu hints if given by the application.
      
      We also moved sk_refcnt out of the cache line containing the lookup
      keys, as it was considerably slowing down smp operations because
      of false sharing. This was simpler than converting listen sockets
      to conventional RCU (to avoid sk_refcnt dirtying)
      
      Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: shrink tcp_timewait_sock by 8 bytes · d475f090
      Eric Dumazet committed
      Reducing tcp_timewait_sock from 280 bytes to 272 bytes
      allows SLAB to pack 15 objects per page instead of 14 (on x86).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
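The packing claim in the commit above follows from simple integer division. The sketch below assumes a 4096-byte page (the usual x86 page size) and ignores any per-slab management overhead or alignment padding, so it is a back-of-the-envelope check rather than a model of the SLAB allocator.

```python
PAGE_SIZE = 4096  # assumed x86 page size in bytes


def objects_per_page(object_size, page_size=PAGE_SIZE):
    """Number of whole objects of the given size that fit in one page."""
    return page_size // object_size


print(objects_per_page(280))  # size before the patch
print(objects_per_page(272))  # size after the patch
```

At 280 bytes a fifteenth object would need 4200 bytes, while at 272 bytes fifteen objects take 4080 bytes and fit in the page.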
    • net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet committed
      A 32-bit hole follows skc_refcnt; use it.
      skc_incoming_cpu can also share a union with the request_sock rcv_wnd.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Eric Dumazet committed
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
      are mostly read. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted, we use them as a place holder
      for various fields, depending on the socket type.
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e5eb54d
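The layout idea in this commit can be illustrated with a toy structure. This is only a sketch: `struct toy_sock`, its field names and the 120-byte key area are hypothetical and do not reproduce the real `struct sock`. It shows read-mostly lookup keys plus an 8-byte per-type slot filling the first 128 bytes, so the frequently written reference count starts at offset 128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a socket; NOT the kernel's struct sock. */
struct toy_sock {
    unsigned char lookup_keys[120];   /* read-mostly: addresses, ports, hash */
    union {                           /* 8-byte slot reused per socket type */
        uint64_t raw;
        uint32_t per_type_field[2];
    } slot;
    int refcnt;                       /* written for every incoming packet */
};

/* Compile-time check that the hot counter sits past the first 128 bytes. */
static_assert(offsetof(struct toy_sock, refcnt) == 128,
              "refcnt must start at byte 128");

/* Returns the byte offset of the reference count inside toy_sock. */
size_t toy_refcnt_offset(void)
{
    return offsetof(struct toy_sock, refcnt);
}
```

With this layout, cpus that only compare lookup keys never touch the bytes that hold the counter, provided the structure itself is allocated on a 128-byte boundary.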
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet committed
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch the incoming cpu handling a particular TCP flow after accept().
      
      This commit adds setsockopt() support and extends the SO_REUSEPORT selection
      logic: if a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if the CPU handling the packet matches the
      specified one.
      
      This makes it possible to build very efficient TCP servers, using one
      listener per RX queue, as the associated TCP listener should only accept
      flows handled in softirq by the same cpu.
      This provides optimal NUMA behavior and keeps cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. The following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read-mostly cache lines.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      70da268b
    • E
      packet: support per-packet fwmark for af_packet sendmsg · c7d39e32
      Edward Jee committed
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d39e32
    • E
      sock: support per-packet fwmark · f28ea365
      Edward Jee committed
      It's useful to allow users to set the fwmark for an individual packet
      without changing the socket state. The function this patch adds to the
      sock layer can be used by protocols that need such a feature.
      Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f28ea365
    • D
      Merge branch 'bpf-unprivileged' · c1bf5fe0
      David S. Miller committed
      Alexei Starovoitov says:
      
      ====================
      bpf: unprivileged
      
      v1-v2:
      - this set logically depends on the cb patch
        "bpf: fix cb access in socket filter programs":
        http://patchwork.ozlabs.org/patch/527391/
        which is a must-have to allow unprivileged programs.
        Thanks to Daniel for finding that issue.
      - refactored sysctl to be similar to 'modules_disabled'
      - dropped bpf_trace_printk
      - split tests into separate patch and added more tests
        based on discussion
      
      v1 cover letter:
      I think it is time to liberate eBPF from CAP_SYS_ADMIN.
      As was discussed when eBPF was first introduced two years ago
      the only piece missing in eBPF verifier is 'pointer leak detection'
      to make it available to non-root users.
      Patch 1 adds this pointer analysis.
      The eBPF programs, obviously, need to see and operate on kernel addresses,
      but with these extra checks they won't be able to pass these addresses
      to user space.
      Patch 2 adds accounting of kernel memory used by programs and maps.
      It changes behavior for existing root users, but I think it needs
      to be done consistently for both root and non-root, since today
      programs and maps are only limited by number of open FDs (RLIMIT_NOFILE).
      Patch 2 accounts program's and map's kernel memory as RLIMIT_MEMLOCK.
      
      Unprivileged eBPF is only meaningful for 'socket filter'-like programs.
      eBPF programs for tracing and TC classifiers/actions will stay root only.
      
      In parallel the bpf fuzzing effort is ongoing and so far
      we've found only one verifier bug and that was already fixed.
      The 'constant blinding' pass is also being worked on.
      It will obfuscate constant-like values that are part of eBPF ISA
      to make jit spraying attacks even harder.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1bf5fe0
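Since patch 2 charges program and map memory against RLIMIT_MEMLOCK, a process may want to check that limit before creating maps. A minimal sketch using the standard getrlimit() call follows. The helper name `memlock_budget` is hypothetical, and how much of the limit a given map consumes is not covered here.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Read the calling process's RLIMIT_MEMLOCK soft and hard limits, in bytes.
 * Either value may be RLIM_INFINITY.
 * Returns 0 on success, -1 with errno set on failure.
 */
int memlock_budget(rlim_t *soft, rlim_t *hard)
{
    struct rlimit lim;

    assert(soft && hard);
    if (getrlimit(RLIMIT_MEMLOCK, &lim) < 0)
        return -1;
    *soft = lim.rlim_cur;
    *hard = lim.rlim_max;
    return 0;
}
```

A low soft limit is a plausible reason for map or program creation to fail once this accounting is in place, so it is worth checking when that happens.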