1. 06 12月, 2016 5 次提交
  2. 04 12月, 2016 11 次提交
    • E
      ipv6 addrconf: Implemented enhanced DAD (RFC7527) · adc176c5
      Erik Nordmark 提交于
      Implemented RFC7527 Enhanced DAD.
      IPv6 duplicate address detection can fail if there is some temporary
      loopback of Ethernet frames. RFC7527 solves this by including a random
      nonce in the NS messages used for DAD, and if an NS is received with the
      same nonce it is assumed to be a looped back DAD probe and is ignored.
      RFC7527 is enabled by default. Can be disabled by setting both of
      conf/{all,interface}/enhanced_dad to zero.
      Signed-off-by: NErik Nordmark <nordmark@arista.com>
      Signed-off-by: NBob Gilligan <gilligan@arista.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      adc176c5
    • I
      ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Ido Schimmel 提交于
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (f.e., switchdev
      drivers) about addition and deletion of routes.
      
      However, upon registration to the chain the FIB tables can already be
      populated, which means potential listeners will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends a ENTRY_ADD
      notification following ENTRY_DEL generated by another process holding
      RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      In case current limit proves to be problematic in the future, it can be
      easily converted to be configurable using a sysctl.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3852ef7
    • I
      ipv4: fib: Allow for consistent FIB dumping · cacaad11
      Ido Schimmel 提交于
      The next patch will enable listeners of the FIB notification chain to
      request a dump of the FIB tables. However, since RTNL isn't taken during
      the dump, it's possible for the FIB tables to change mid-dump, which
      will result in inconsistency between the listener's table and the
      kernel's.
      
      Allow listeners to know about changes that occurred mid-dump, by adding
      a change sequence counter to each net namespace. The counter is
      incremented just before a notification is sent in the FIB chain.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cacaad11
    • I
      ipv4: fib: Convert FIB notification chain to be atomic · d3f706f6
      Ido Schimmel 提交于
      In order not to hold RTNL for long periods of time we're going to dump
      the FIB tables using RCU.
      
      Convert the FIB notification chain to be atomic, as we can't block in
      RCU critical sections.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3f706f6
    • I
      ipv4: fib: Export free_fib_info() · b423cb10
      Ido Schimmel 提交于
      The FIB notification chain is going to be converted to an atomic chain,
      which means switchdev drivers will have to offload FIB entries in
      deferred work, as hardware operations entail sleeping.
      
      However, while the work is queued fib info might be freed, so a
      reference must be taken. To release the reference (and potentially free
      the fib info) fib_info_put() will be called, which in turn calls
      free_fib_info().
      
      Export free_fib_info() so that modules will be able to invoke
      fib_info_put().
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b423cb10
    • W
      act_mirred: fix a typo in get_dev · 548ed722
      WANG Cong 提交于
      Fixes: 255cb304 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
      Cc: Hadar Hen Zion <hadarh@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      548ed722
    • P
      udp: be less conservative with sock rmem accounting · 363dc73a
      Paolo Abeni 提交于
      Before commit 850cbadd ("udp: use it's own memory accounting
      schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
      the rcvbuf by the whole current packet's truesize. After said commit
      we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
      is empty. As reported by Jesper this cause a performance regression
      for some (small) values of rcvbuf.
      
      This commit is intended to fix the regression restoring the old
      handling of the rcvbuf limit.
      Reported-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Fixes: 850cbadd ("udp: use it's own memory accounting schema")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      363dc73a
    • E
      net_sched: gen_estimator: account for timer drifts · 12efa1fa
      Eric Dumazet 提交于
      Under heavy stress, timer used in estimators tend to slowly be delayed
      by a few jiffies, leading to inaccuracies.
      
      Lets remember what was the last scheduled jiffies so that we get more
      precise estimations, without having to add a multiply/divide in the loop
      to account for the drifts.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12efa1fa
    • A
      netns: fix net_generic() "id - 1" bloat · 6af2d5ff
      Alexey Dobriyan 提交于
      net_generic() function is both a) inline and b) used ~600 times.
      
      It has the following code inside
      
      		...
      	ptr = ng->ptr[id - 1];
      		...
      
      "id" is never compile time constant so compiler is forced to subtract 1.
      And those decrements or LEA [r32 - 1] instructions add up.
      
      We also start id'ing from 1 to catch bugs where pernet sybsystem id
      is not initialized and 0. This is quite pointless idea (nothing will
      work or immediate interference with first registered subsystem) in
      general but it hints what needs to be done for code size reduction.
      
      Namely, overlaying allocation of pointer array and fixed part of
      structure in the beginning and using usual base-0 addressing.
      
      Ids are just cookies, their exact values do not matter, so lets start
      with 3 on x86_64.
      
      Code size savings (oh boy): -4.2 KB
      
      As usual, ignore the initial compiler stupidity part of the table.
      
      	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
      	function                                     old     new   delta
      	tipc_nametbl_insert_publ                    1250    1270     +20
      	nlmclnt_lookup_host                          686     703     +17
      	nfsd4_encode_fattr                          5930    5941     +11
      	nfs_get_client                              1050    1061     +11
      	register_pernet_operations                   333     342      +9
      	tcf_mirred_init                              843     849      +6
      	tcf_bpf_init                                1143    1149      +6
      	gss_setup_upcall                             990     994      +4
      	idmap_name_to_id                             432     434      +2
      	ops_init                                     274     275      +1
      	nfsd_inject_forget_client                    259     260      +1
      	nfs4_alloc_client                            612     613      +1
      	tunnel_key_walker                            164     163      -1
      
      		...
      
      	tipc_bcbase_select_primary                   392     360     -32
      	mac80211_hwsim_new_radio                    2808    2767     -41
      	ipip6_tunnel_ioctl                          2228    2186     -42
      	tipc_bcast_rcv                               715     672     -43
      	tipc_link_build_proto_msg                   1140    1089     -51
      	nfsd4_lock                                  3851    3796     -55
      	tipc_mon_rcv                                1012     956     -56
      	Total: Before=156643951, After=156639743, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6af2d5ff
    • A
      netns: add dummy struct inside "struct net_generic" · 9bfc7b99
      Alexey Dobriyan 提交于
      This is precursor to fixing "[id - 1]" bloat inside net_generic().
      
      Name "s" is chosen to complement name "u" often used for dummy unions.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9bfc7b99
    • A
      netns: publish net_generic correctly · 1a9a0592
      Alexey Dobriyan 提交于
      Publishing net_generic pointer is done with silly mistake: new array is
      published BEFORE setting freshly acquired pernet subsystem pointer.
      
      	memcpy
      	rcu_assign_pointer
      	kfree_rcu
      	ng->ptr[id - 1] = data;
      
      This bug was introduced with commit dec827d1
      ("[NETNS]: The generic per-net pointers.") in the glorious days of
      chopping networking stack into containers proper 8.5 years ago (whee...)
      
      How it didn't trigger for so long?
      Well, you need quite specific set of conditions:
      
      *) race window opens once per pernet subsystem addition
         (read: modprobe or boot)
      
      *) not every pernet subsystem is eligible (need ->id and ->size)
      
      *) not every pernet subsystem is vulnerable (need incorrect or absense
         of ordering of register_pernet_sybsys() and actually using net_generic())
      
      *) to hide the bug even more, default is to preallocate 13 pointers which
         is actually quite a lot. You need IPv6, netfilter, bridging etc together
         loaded to trigger reallocation in the first place. Trimmed down
         config are OK.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a9a0592
  3. 03 12月, 2016 16 次提交
    • E
      net: avoid signed overflows for SO_{SND|RCV}BUFFORCE · b98b0bc8
      Eric Dumazet 提交于
      CAP_NET_ADMIN users should not be allowed to set negative
      sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
      corruptions, crashes, OOM...
      
      Note that before commit 82981930 ("net: cleanups in
      sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
      and SO_RCVBUF were vulnerable.
      
      This needs to be backported to all known linux kernels.
      
      Again, many thanks to syzkaller team for discovering this gem.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b98b0bc8
    • M
      tipc: check minimum bearer MTU · 3de81b75
      Michal Kubeček 提交于
      Qian Zhang (张谦) reported a potential socket buffer overflow in
      tipc_msg_build() which is also known as CVE-2016-8632: due to
      insufficient checks, a buffer overflow can occur if MTU is too short for
      even tipc headers. As anyone can set device MTU in a user/net namespace,
      this issue can be abused by a regular user.
      
      As agreed in the discussion on Ben Hutchings' original patch, we should
      check the MTU at the moment a bearer is attached rather than for each
      processed packet. We also need to repeat the check when bearer MTU is
      adjusted to new device MTU. UDP case also needs a check to avoid
      overflow when calculating bearer MTU.
      
      Fixes: b97bf3fd ("[TIPC] Initial merge")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Reported-by: NQian Zhang (张谦) <zhangqian-c@360.cn>
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3de81b75
    • D
      bpf: Add support for reading socket family, type, protocol · aa4c1037
      David Ahern 提交于
      Add socket family, type and protocol to bpf_sock allowing bpf programs
      read-only access.
      
      Add __sk_flags_offset[0] to struct sock before the bitfield to
      programmtically determine the offset of the unsigned int containing
      protocol and type.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa4c1037
    • D
      bpf: Add new cgroup attach type to enable sock modifications · 61023658
      David Ahern 提交于
      Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
      BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
      any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
      Currently only sk_bound_dev_if is exported to userspace for modification
      by a bpf program.
      
      This allows a cgroup to be configured such that AF_INET{6} sockets opened
      by processes are automatically bound to a specific device. In turn, this
      enables the running of programs that do not support SO_BINDTODEVICE in a
      specific VRF context / L3 domain.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61023658
    • A
      ip6_offload: check segs for NULL in ipv6_gso_segment. · 6b6ebb6b
      Artem Savkov 提交于
      segs needs to be checked for being NULL in ipv6_gso_segment() before calling
      skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference:
      
      [   97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [   97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   97.825214] PGD 0 [   97.827047]
      [   97.828540] Oops: 0000 [#1] SMP
      [   97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5
      nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
      iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
      ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
      bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
      snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device
      snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport
      sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon
      broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core
      i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod
      [   97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1
      [   97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010
      [   97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000
      [   97.947720] RIP: 0010:[<ffffffff816e52f9>]  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   97.956251] RSP: 0018:ffff88012fc43a10  EFLAGS: 00010207
      [   97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594
      [   97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000
      [   97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000
      [   97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3
      [   97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000
      [   97.997198] FS:  0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000
      [   98.005280] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0
      [   98.018149] Stack:
      [   98.020157]  00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e
      [   98.027584]  0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3
      [   98.035010]  ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98
      [   98.042437] Call Trace:
      [   98.044879]  <IRQ> [   98.046803]  [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3]
      [   98.053156]  [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130
      [   98.059158]  [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110
      [   98.064985]  [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0
      [   98.070899]  [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70
      [   98.077073]  [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0
      [   98.082726]  [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690
      [   98.088554]  [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50
      [   98.094380]  [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20
      [   98.099863]  [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge]
      [   98.106907]  [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge]
      [   98.113430]  [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60
      [   98.118735]  [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0
      [   98.124216]  [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge]
      [   98.130480]  [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge]
      [   98.137785]  [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge]
      [   98.143701]  [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge]
      [   98.150834]  [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge]
      [   98.157355]  [<ffffffff8102fb89>] ? sched_clock+0x9/0x10
      [   98.162662]  [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0
      [   98.168403]  [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20
      [   98.174926]  [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0
      [   98.180580]  [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60
      [   98.186494]  [<ffffffff815ee625>] process_backlog+0x95/0x140
      [   98.192145]  [<ffffffff815edccd>] net_rx_action+0x16d/0x380
      [   98.197713]  [<ffffffff8170cff1>] __do_softirq+0xd1/0x283
      [   98.203106]  [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30
      [   98.209107]  <EOI> [   98.211029]  [<ffffffff8108a5c0>] do_softirq+0x50/0x60
      [   98.216166]  [<ffffffff815ec853>] netif_rx_ni+0x33/0x80
      [   98.221386]  [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun]
      [   98.227388]  [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun]
      [   98.233129]  [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net]
      [   98.239392]  [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net]
      [   98.245916]  [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost]
      [   98.251919]  [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost]
      [   98.258440]  [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180
      [   98.264094]  [<ffffffff810a44d9>] kthread+0xd9/0xf0
      [   98.268965]  [<ffffffff810a4400>] ? kthread_park+0x60/0x60
      [   98.274444]  [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30
      [   98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66
      [   98.299425] RIP  [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0
      [   98.305612]  RSP <ffff88012fc43a10>
      [   98.309094] CR2: 00000000000000cc
      [   98.312406] ---[ end trace 726a2c7a2d2d78d0 ]---
      Signed-off-by: NArtem Savkov <asavkov@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b6ebb6b
    • S
      RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net · 721c7443
      Sowmini Varadhan 提交于
      If some error is encountered in rds_tcp_init_net, make sure to
      unregister_netdevice_notifier(), else we could trigger a panic
      later on, when the modprobe from a netns fails.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      721c7443
    • H
      net/sched: cls_flower: Add offload support using egress Hardware device · 7091d8c7
      Hadar Hen Zion 提交于
      In order to support hardware offloading when the device given by the tc
      rule is different from the Hardware underline device, extract the mirred
      (egress) device from the tc action when a filter is added, using the new
      tc_action_ops, get_dev().
      
      Flower caches the information about the mirred device and use it for
      calling ndo_setup_tc in filter change, update stats and delete.
      
      Calling ndo_setup_tc of the mirred (egress) device instead of the
      ingress device will allow a resolution between the software ingress
      device and the underline hardware device.
      
      The resolution will take place inside the offloading driver using
      'egress_device' flag added to tc_to_netdev struct which is provided to
      the offloading driver.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7091d8c7
    • H
      net/sched: act_mirred: Add new tc_action_ops get_dev() · 255cb304
      Hadar Hen Zion 提交于
      Adding support to a new tc_action_ops.
      get_dev is a general option which allows to get the underline
      device when trying to offload a tc rule.
      
      In case of mirred action the returned device is the mirred (egress)
      device.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      255cb304
    • H
      net/sched: cls_flower: Provide a filter to replace/destroy hardware filter functions · 3036dab6
      Hadar Hen Zion 提交于
      Instead of providing many arguments to fl_hw_{replace/destroy}_filter
      functions, just provide cls_fl_filter struct that includes all the relevant
      args.
      
      This patches doesn't add any new functionality.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3036dab6
    • H
      net/sched: cls_flower: Try to offload only if skip_hw flag isn't set · 79685219
      Hadar Hen Zion 提交于
      Check skip_hw flag isn't set before calling
      fl_hw_{replace/destroy}_filter and fl_hw_update_stats functions.
      
      Replace the call to tc_should_offload with tc_can_offload.
      tc_can_offload only checks if the device supports offloading, the check for
      skip_hw flag is done earlier in the flow.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79685219
    • F
      tcp: allow to turn tcp timestamp randomization off · 25429d7b
      Florian Westphal 提交于
      Eric says: "By looking at tcpdump, and TS val of xmit packets of multiple
      flows, we can deduct the relative qdisc delays (think of fq pacing).
      This should work even if we have one flow per remote peer."
      
      Having random per flow (or host) offsets doesn't allow that anymore so add
      a way to turn this off.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25429d7b
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
    • E
      Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()" · 80d1106a
      Eli Cooper 提交于
      This reverts commit ae148b08
      ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()").
      
      skb->protocol is now set in __ip_local_out() and __ip6_local_out() before
      dst_output() is called. It is no longer necessary to do it for each tunnel.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEli Cooper <elicooper@gmx.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80d1106a
    • E
      ipv6: Set skb->protocol properly for local output · b4e479a9
      Eli Cooper 提交于
      When xfrm is applied to TSO/GSO packets, it follows this path:
      
          xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
      
      where skb_gso_segment() relies on skb->protocol to function properly.
      
      This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called,
      fixing a bug where GSO packets sent through an ipip6 tunnel are dropped
      when xfrm is involved.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEli Cooper <elicooper@gmx.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4e479a9
    • E
      ipv4: Set skb->protocol properly for local output · f4180439
      Eli Cooper 提交于
      When xfrm is applied to TSO/GSO packets, it follows this path:
      
          xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
      
      where skb_gso_segment() relies on skb->protocol to function properly.
      
      This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
      fixing a bug where GSO packets sent through a sit tunnel are dropped
      when xfrm is involved.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEli Cooper <elicooper@gmx.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4180439
    • P
      packet: fix race condition in packet_set_ring · 84ac7260
      Philip Pettersson 提交于
      When packet_set_ring creates a ring buffer it will initialize a
      struct timer_list if the packet version is TPACKET_V3. This value
      can then be raced by a different thread calling setsockopt to
      set the version to TPACKET_V1 before packet_set_ring has finished.
      
      This leads to a use-after-free on a function pointer in the
      struct timer_list when the socket is closed as the previously
      initialized timer will not be deleted.
      
      The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
      changing the packet version while also taking the lock at the start
      of packet_set_ring.
      
      Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.")
      Signed-off-by: NPhilip Pettersson <philip.pettersson@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84ac7260
  4. 02 12月, 2016 6 次提交
  5. 01 12月, 2016 2 次提交