1. 10 May, 2015 12 commits
    • pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliant · f1f00d8f
      Jesper Dangaard Brouer committed
      Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
      with a negation of the flag like !NO_TIMESTAMP.
      
      Also document the option flag NO_TIMESTAMP.
      
      Fixes: afb84b62 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
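The negation convention described in this commit message can be sketched in user space as follows. This is only an illustration: `apply_flag()` and the bit value `F_NO_TIMESTAMP` are hypothetical names invented here, not pktgen's actual parser or flag layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical bit value; the kernel's actual flag bit is not given above. */
#define F_NO_TIMESTAMP (1u << 0)

/* Apply one flag token: "NAME" sets the bit, "!NAME" clears it.
 * Returns false for tokens this sketch does not know. */
static bool apply_flag(unsigned int *flags, const char *token)
{
	bool negate = (token[0] == '!');
	const char *name = negate ? token + 1 : token;

	assert(flags != NULL);
	if (strcmp(name, "NO_TIMESTAMP") != 0)
		return false;
	if (negate)
		*flags &= ~F_NO_TIMESTAMP;	/* timestamping on again */
	else
		*flags |= F_NO_TIMESTAMP;	/* timestamping off */
	return true;
}
```

With this shape, `NO_TIMESTAMP` followed by `!NO_TIMESTAMP` returns the flag word to its initial state, which is the behaviour the commit asks for.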
    • netlink: allow to listen "all" netns · 59324cf3
      Nicolas Dichtel committed
      More accurately, listen all netns that have a nsid assigned into the netns
      where the netlink socket is opened.
      For this purpose, a netlink socket option is added:
      NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
      socket will receive netlink notifications from all netns that have a nsid
      assigned into the netns where the socket has been opened. The nsid is sent
      to userland via ancillary data.
      
      With this patch, a daemon needs only one socket to listen many netns. This
      is useful when the number of netns is high.
      
      Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
      the field nsid is valid or not. skb->cb is initialized to 0 on skb
      allocation, thus we are sure that we will never send a nsid 0 by error to
      the userland.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
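The role of nsid_is_set can be illustrated with a small user-space model. `struct nsid_cb` and its helpers are hypothetical stand-ins for the per-skb control block, not the kernel's definitions; the point is only that a zero-filled block must read as "no nsid" even though 0 is itself a valid nsid.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the per-skb control block described above. */
struct nsid_cb {
	bool nsid_is_set;
	int  nsid;
};

/* Model of skb allocation: the control block starts out zeroed. */
static void cb_init(struct nsid_cb *cb)
{
	memset(cb, 0, sizeof(*cb));
}

/* Record an nsid explicitly; 0 is a legal value. */
static void cb_set_nsid(struct nsid_cb *cb, int nsid)
{
	assert(nsid >= 0);
	cb->nsid_is_set = true;
	cb->nsid = nsid;
}

/* Returns true and stores the id only when one was explicitly recorded. */
static bool cb_get_nsid(const struct nsid_cb *cb, int *out)
{
	if (!cb->nsid_is_set)
		return false;
	*out = cb->nsid;
	return true;
}
```

A freshly initialised block reports nothing, while a block set to nsid 0 reports 0, so the two cases cannot be confused.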
    • netns: use a spin_lock to protect nsid management · 95f38411
      Nicolas Dichtel committed
      Before this patch, nsid were protected by the rtnl lock. The goal of this
      patch is to be able to find a nsid without needing to hold the rtnl lock.
      
      The next patch will introduce a netlink socket option to listen to all
      netns that have a nsid assigned into the netns where the socket is opened.
      Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
      avoid a recursive lock (nsids are notified via rtnl). This was the main
      reason for the previous patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
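Why the notification has to happen outside the lock can be shown with a toy, non-recursive lock. Everything here is a hypothetical model (`toy_lock`, `notify_new_id()`, `alloc_id_and_notify()`); it assumes, as the message suggests, that the notification path itself needs the lock to look up an id.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: acquiring it twice trips the assertion. */
struct toy_lock { bool held; };

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a second acquire would be a recursive lock */
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

static struct toy_lock nsid_lock;
static int next_id;
static int notified = -1;

/* Stand-in for the notification path: it takes the lock for its own lookup. */
static void notify_new_id(int id)
{
	toy_lock_acquire(&nsid_lock);
	notified = id;
	toy_lock_release(&nsid_lock);
}

/* Allocate under the lock, then notify only after the lock is dropped. */
static int alloc_id_and_notify(void)
{
	int id;

	toy_lock_acquire(&nsid_lock);
	id = next_id++;
	toy_lock_release(&nsid_lock);

	notify_new_id(id);	/* outside the lock, so no recursion */
	return id;
}
```

Moving the `notify_new_id()` call above `toy_lock_release()` would trip the assertion in `toy_lock_acquire()`, which is the recursive-lock situation the commit avoids.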
    • netns: notify new nsid outside __peernet2id() · 3138dbf8
      Nicolas Dichtel committed
      There is no functional change with this patch. It will ease the refactoring
      of the locking system that protects nsids and the support of the netlink
      socket option NETLINK_LISTEN_ALL_NSID.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: rename peernet2id() to peernet2id_alloc() · 7a0877d4
      Nicolas Dichtel committed
      In a following commit, a new function will be introduced to only look up
      a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
      existing function is renamed.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
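The distinction between a lookup-only function and a lookup-or-allocate function might look like this in a user-space model. The table, the function names `peer_lookup_id()` / `peer_lookup_id_alloc()` and the sentinel `NSID_NOT_ASSIGNED` are all hypothetical; the kernel keeps nsids in an idr, which this sketch does not attempt to reproduce.

```c
#include <assert.h>

#define MAX_PEERS 8
#define NSID_NOT_ASSIGNED (-1)	/* hypothetical sentinel for this sketch */

/* Stores id + 1 per peer so that zero-initialised memory means "no id yet". */
static int id_plus_one[MAX_PEERS];
static int next_id;

/* Lookup only: never allocates. */
static int peer_lookup_id(int peer)
{
	assert(peer >= 0 && peer < MAX_PEERS);
	return id_plus_one[peer] ? id_plus_one[peer] - 1 : NSID_NOT_ASSIGNED;
}

/* Lookup, and allocate a fresh id when none exists yet. */
static int peer_lookup_id_alloc(int peer)
{
	int id = peer_lookup_id(peer);

	if (id == NSID_NOT_ASSIGNED) {
		id = next_id++;
		id_plus_one[peer] = id + 1;
	}
	return id;
}
```

Giving the allocating variant an explicit `_alloc` suffix makes the side effect visible at every call site, which is the motivation the rename gives.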
    • netns: always provide the id to rtnl_net_fill() · cab3c8ec
      Nicolas Dichtel committed
      The goal of this commit is to prepare the rework of the locking of nsid
      protection.
      After this patch, rtnl_net_notifyid() will no longer call __peernet2id(),
      i.e. no idr_* operation in this function.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: returns always an id in __peernet2id() · 109582af
      Nicolas Dichtel committed
      All callers of this function expect a nsid, not an error.
      Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers
      don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: set SOCK_NOSPACE under memory pressure · 790ba456
      Jason Baron committed
      Under tcp memory pressure, calling epoll_wait() in edge triggered
      mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
      even when there is sufficient memory available to continue making
      progress. The problem is that when __sk_mem_schedule() returns 0
      under memory pressure, we do not set the SOCK_NOSPACE flag in the
      tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
      SOCK_NOSPACE is used to trigger wakeups when incoming acks create
      sufficient new space in the write queue, all outstanding packets
      are acked, but we never wake up with the EPOLLOUT that we are
      expecting from epoll_wait().
      
      This issue is currently limited to epoll() when used in edge trigger
      mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
      This is sufficient for poll()/select() and epoll() in level trigger
      mode. However, in edge trigger mode, epoll() is relying on the write
      path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
      can only call epoll_wait() after read/write return -EAGAIN. Thus, in
      the case of the socket write, we are relying on the fact that
      tcp_sendmsg()/network write paths are going to issue a wakeup for
      us at some point in the future when we get -EAGAIN.
      
      Normally, epoll() edge trigger works fine when we've exceeded the
      sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
      we return -EAGAIN from the write path b/c we are over the tcp memory
      limits and not b/c we are over the sndbuf, we are never going to get
      another wakeup.
      
      I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
      will return 0, or failure more readily with SO_SNDBUF:
      
      1) create socket and set SO_SNDBUF to N
      2) add socket as edge trigger
      3) write to socket and block in epoll on -EAGAIN
      4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem
      
      The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
      when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
      to waking up outstanding waiters is also used to expand the size of
      the sk->sndbuf. However, we will not expand it by setting it in this
      case because tcp_should_expand_sndbuf(), ensures that no expansion
      occurs when we are under tcp memory pressure.
      
      Note that we could still hang if sk->sk_wmem_queue is 0, when we get
      the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
      are waiting for an event that will never happen. I believe
      that this case is harder to hit (and did not hit in my testing),
      in that over the tcp 'soft' memory limits, we continue to guarantee a
      minimum write buffer size. Perhaps, we could return -ENOSPC in this
      case, or maybe we simply issue a wakeup in this case, such that we
      keep retrying the write. Note that this case is not specific to
      epoll() ET, but rather would affect blocking sockets as well. So I
      view this patch as bringing epoll() edge-trigger into sync with the
      current poll()/select()/epoll() level trigger and blocking sockets
      behavior.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
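The user-space pattern this message relies on (write until -EAGAIN, then wait for an edge-triggered EPOLLOUT) can be sketched as below. This is a minimal sketch assuming Linux; it uses an AF_UNIX socketpair as a stand-in for a TCP connection, so it exercises the wakeup contract only and does not reproduce the tcp_mem pressure case. The function names are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a non-blocking socket until write() fails with EAGAIN.
 * Returns 0 once EAGAIN is seen, -1 on any other error. */
static int write_until_eagain(int fd)
{
	char buf[4096];

	assert(fd >= 0);
	memset(buf, 'x', sizeof(buf));
	for (;;) {
		if (write(fd, buf, sizeof(buf)) < 0)
			return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
	}
}

/* Read and discard everything currently queued on a non-blocking socket. */
static void drain(int fd)
{
	char buf[4096];

	while (read(fd, buf, sizeof(buf)) > 0)
		;
}

/* Returns 1 if EPOLLOUT arrives after the peer drains, 0 on timeout,
 * -1 on setup error. */
static int et_wakeup_demo(void)
{
	struct epoll_event ev = { .events = EPOLLOUT | EPOLLET };
	struct epoll_event out;
	int sv[2], ep, n, ok = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv) < 0)
		return -1;
	ep = epoll_create1(0);
	if (ep < 0)
		goto close_pair;
	ev.data.fd = sv[0];
	if (epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev) < 0)
		goto close_all;
	if (write_until_eagain(sv[0]) < 0)
		goto close_all;

	epoll_wait(ep, &out, 1, 0);	/* discard any stale "writable" edge */
	drain(sv[1]);			/* peer consumes data, freeing space */
	n = epoll_wait(ep, &out, 1, 1000);
	ok = (n == 1 && (out.events & EPOLLOUT)) ? 1 : 0;
close_all:
	close(ep);
close_pair:
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

In edge-triggered mode the caller has no other way to learn that writing is possible again, which is why a write path that returns -EAGAIN without arranging a later wakeup leaves the application blocked in epoll_wait().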
    • seccomp, filter: add and use bpf_prog_create_from_user from seccomp · ac67eb2c
      Daniel Borkmann committed
      Seccomp has always been a special candidate when it comes to preparation
      of its filters in seccomp_prepare_filter(). Due to the extra checks and
      filter rewrite it partially duplicates code and has BPF internals exposed.
      
      This patch adds a generic API inside the BPF code that seccomp can use
      and thus keep its filter preparation code minimal and better maintainable.
      The other side-effect is that now classic JITs can add seccomp support as
      well by only providing a BPF_LDX | BPF_W | BPF_ABS translation.
      
      Tested with seccomp and BPF test suites.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Nicolas Schichan <nschichan@freebox.fr>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: add __GFP_NOWARN flag for larger kmem allocs · 658da937
      Daniel Borkmann committed
      When seccomp BPF was added, it was discussed to add __GFP_NOWARN
      flag for their configuration path as f.e. up to 32K allocations are
      more prone to fail under stress. As we're going to reuse BPF API,
      add __GFP_NOWARN flags where larger kmalloc() and friends allocations
      could fail.
      
      It doesn't make much sense to pass around __GFP_NOWARN everywhere as
      an extra argument only for seccomp while we just as well could run
      into similar issues for socket filters, where it's not desired to
      have a user application throw a WARN() due to allocation failure.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Nicolas Schichan <nschichan@freebox.fr>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • seccomp: simplify seccomp_prepare_filter and reuse bpf_prepare_filter · d9e12f42
      Nicolas Schichan committed
      Remove the calls to bpf_check_classic(), bpf_convert_filter() and
      bpf_migrate_runtime() and let bpf_prepare_filter() take care of that
      instead.
      
      seccomp_check_filter() is passed to bpf_prepare_filter() so that it
      gets called from there, after bpf_check_classic().
      
      We can now remove exposure of two internal classic BPF functions
      previously used by seccomp. The export of bpf_check_classic() symbol,
      previously known as sk_chk_filter(), was there since pre git times,
      and no in-tree module was using it, therefore remove it.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: add a callback to allow classic post-verifier transformations · 4ae92bc7
      Nicolas Schichan committed
      This is in preparation for use by the seccomp code, the rationale is
      not to duplicate additional code within the seccomp layer, but instead,
      have it abstracted and hidden within the classic BPF API.
      
      As an interim step, this now also makes bpf_prepare_filter() visible
      (not as exported symbol though), so that seccomp can reuse that code
      path instead of reimplementing it.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
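The callback design described in this message (a shared preparation path that runs a caller-supplied hook after the generic checks) can be modelled in a few lines. This is a hypothetical user-space sketch: `struct insn`, `prepare_filter()`, `strict_hook()`, `MAX_INSNS` and `OP_FORBIDDEN` are invented for illustration and are not the kernel's types or signatures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSNS    4096	/* assumed program length limit for this sketch */
#define OP_FORBIDDEN 0xff	/* made-up opcode that a strict caller rejects */

struct insn { unsigned short code; };

/* Caller-specific hook, run only after the generic checks have passed. */
typedef int (*post_check_fn)(const struct insn *prog, size_t len);

/* Checks that every user of the shared path gets for free. */
static int generic_check(const struct insn *prog, size_t len)
{
	if (prog == NULL || len == 0 || len > MAX_INSNS)
		return -1;
	return 0;
}

/* Shared preparation path: generic check first, then the optional hook. */
static int prepare_filter(const struct insn *prog, size_t len,
			  post_check_fn hook)
{
	int err = generic_check(prog, len);

	if (err)
		return err;
	if (hook)
		err = hook(prog, len);
	return err;
}

/* Example hook for a stricter user. */
static int strict_hook(const struct insn *prog, size_t len)
{
	assert(prog != NULL);
	for (size_t i = 0; i < len; i++)
		if (prog[i].code == OP_FORBIDDEN)
			return -1;
	return 0;
}
```

A caller with no extra requirements passes `NULL`, while a stricter caller passes its own hook and reuses the rest of the path unchanged, so the extra checks do not have to be duplicated outside the shared code.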
  2. 05 May, 2015 3 commits
  3. 04 May, 2015 3 commits
  4. 30 April, 2015 1 commit
    • bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel committed
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
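Why a stray NLM_F_MULTI makes a reader wait forever can be seen from the usual receive-loop logic. The sketch below assumes Linux user-space headers (`<linux/netlink.h>`); `reply_complete()` is a hypothetical helper, not libnl's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/netlink.h>

/* Decide whether a reader may stop after seeing these message headers.
 * A message flagged NLM_F_MULTI promises that NLMSG_DONE will follow. */
static bool reply_complete(const struct nlmsghdr *hdrs, size_t n)
{
	bool in_multipart = false;

	assert(hdrs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (hdrs[i].nlmsg_type == NLMSG_DONE)
			return true;		/* explicit end of a dump */
		if (hdrs[i].nlmsg_flags & NLM_F_MULTI)
			in_multipart = true;	/* now expecting NLMSG_DONE */
	}
	return !in_multipart;
}
```

A single reply that carries NLM_F_MULTI but is never followed by NLMSG_DONE leaves `reply_complete()` false indefinitely, which matches the hang reported for libraries such as libnl.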
  5. 27 April, 2015 1 commit
    • net: rfs: fix crash in get_rps_cpus() · a31196b0
      Eric Dumazet committed
      Commit 567e4b79 ("net: rfs: add hash collision detection") had one
      mistake :
      
      RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu()
      and get_rps_cpu(), as @next_cpu was the result of an AND with
      rps_cpu_mask
      
      This bug showed up on a host with 72 cpus :
      next_cpu was 0x7f, and the code was trying to access percpu data of a
      non-existent cpu.
      
      In a follow up patch, we might get rid of compares against nr_cpu_ids,
      if we init the tables with 0. This is silly to test for a very unlikely
      condition that exists only shortly after table initialization, as
      we got rid of rps_reset_sock_flow() and similar functions that were
      writing this RPS_NO_CPU magic value at flow dismantle : When table is
      old enough, it never contains this value anymore.
      
      Fixes: 567e4b79 ("net: rfs: add hash collision detection")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
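The arithmetic behind the 72-cpu case can be worked through in a small sketch. It assumes that the mask is the next power of two minus one (which gives 0x7f for 72 cpus, as reported above) and that RPS_NO_CPU is 0xffff; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RPS_NO_CPU 0xffffu	/* assumed value of the "no cpu" marker */

/* Smallest power of two >= n, for n >= 1. */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Assumed mask derivation: 72 cpus -> 128 - 1 = 0x7f. */
static uint32_t cpu_mask_for(uint32_t nr_cpu_ids)
{
	return roundup_pow2(nr_cpu_ids) - 1;
}

/* Marker comparison: useless once the value has been masked. */
static bool looks_valid_old(uint32_t next_cpu)
{
	return next_cpu != RPS_NO_CPU;
}

/* Bounds check against the number of cpus. */
static bool looks_valid_new(uint32_t next_cpu, uint32_t nr_cpu_ids)
{
	return next_cpu < nr_cpu_ids;
}
```

With 72 cpus, `RPS_NO_CPU & 0x7f` is 0x7f (127): it no longer equals the marker, so the old test accepts it, yet it is not below 72, so only the bounds check rejects it.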
  6. 26 April, 2015 1 commit
    • net: fix crash in build_skb() · 2ea2f62c
      Eric Dumazet committed
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 23 April, 2015 1 commit
    • net: do not deplete pfmemalloc reserve · 79930f58
      Eric Dumazet committed
      build_skb() should look at the page pfmemalloc status.
      If set, this means page allocator allocated this page in the
      expectation it would help to free other pages. Networking
      stack can do that only if skb->pfmemalloc is also set.
      
      Also, we must refrain from using high order pages from the pfmemalloc
      reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
      them. Under memory pressure, using order-0 pages is probably the best
      strategy.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 18 April, 2015 1 commit
  9. 17 April, 2015 4 commits
  10. 14 April, 2015 1 commit
    • net: use jump label patching for ingress qdisc in __netif_receive_skb_core · 4577139b
      Daniel Borkmann committed
      Even if we make use of classifier and actions from the egress
      path, we're going into handle_ing() executing additional code
      on a per-packet cost for ingress qdisc, just to realize that
      nothing is attached on ingress.
      
      Instead, this can just be blinded out as a no-op entirely with
      the use of a static key. On input fast-path, we already make
      use of static keys in various places, e.g. skb time stamping,
      in RPS, etc. It makes sense to not waste time when we're assured
      that no ingress qdisc is attached anywhere.
      
      Enabling/disabling of that code path is being done via two
      helpers, namely net_{inc,dec}_ingress_queue(), that are being
      invoked under RTNL mutex when an ingress qdisc is being either
      initialized or destructed.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
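The enable/disable logic can be approximated with a plain reference counter. This is a user-space model only: a real static key patches the branch in the instruction stream, whereas this sketch tests an ordinary variable, and all names (`model_inc_ingress_queue()`, `receive_packet()`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

static int ingress_users;	/* stand-in for the static key's count */
static int ingress_calls;	/* how often the ingress handler ran */

/* Called when an ingress qdisc is initialized. */
static void model_inc_ingress_queue(void)
{
	ingress_users++;
}

/* Called when an ingress qdisc is destroyed. */
static void model_dec_ingress_queue(void)
{
	assert(ingress_users > 0);
	ingress_users--;
}

static bool ingress_needed(void)
{
	return ingress_users > 0;
}

/* Stand-in for the per-packet ingress handling. */
static void handle_ingress(void)
{
	ingress_calls++;
}

/* Receive fast path: skip ingress handling when nothing is attached. */
static void receive_packet(void)
{
	if (ingress_needed())
		handle_ingress();
}
```

Packets received while no ingress qdisc exists never reach `handle_ingress()`, which is the per-packet cost the commit removes.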
  11. 13 April, 2015 1 commit
  12. 12 April, 2015 1 commit
  13. 11 April, 2015 2 commits
  14. 10 April, 2015 1 commit
  15. 08 April, 2015 5 commits
  16. 07 April, 2015 2 commits
    • tc: bpf: add checksum helpers · 91bc4822
      Alexei Starovoitov committed
      Commit 608cd71a ("tc: bpf: generalize pedit action") has added the
      possibility to mangle packet data to BPF programs in the tc pipeline.
      This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
      for fixing up the protocol checksums after the packet mangling.
      
      It also adds 'flags' argument to bpf_skb_store_bytes() helper to avoid
      unnecessary checksum recomputations when BPF programs adjust l3/l4
      checksums, and documents all three helpers in uapi header.
      
      Moreover, a sample program is added to show how BPF programs can make use
      of the mangle and csum helpers.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
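The idea of fixing up a checksum after mangling a field, rather than recomputing it over the whole header, can be sketched with the standard incremental update from RFC 1624. This is plain user-space C with hypothetical function names; it is not the signature or implementation of bpf_l3_csum_replace() or bpf_l4_csum_replace().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit sum into 16 bits with end-around carry. */
static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full Internet checksum over 16-bit words (checksum field must be 0). */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;

	assert(words != NULL);
	for (size_t i = 0; i < n; i++)
		sum += words[i];
	return (uint16_t)~fold(sum);
}

/* Incremental update, RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_replace16(uint16_t check, uint16_t from, uint16_t to)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~from;
	sum += to;
	return (uint16_t)~fold(sum);
}
```

Replacing one 16-bit word and calling `csum_replace16()` with the old and new values yields the same checksum as running `csum_full()` again over the modified header, at constant cost regardless of header length.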
    • ipv6: protect skb->sk accesses from recursive dereference inside the stack · f60e5990
      hannes@stressinduktion.org committed
      We should not consult skb->sk for output decisions in xmit recursion
      levels > 0 in the stack. Otherwise local socket settings could influence
      the result of e.g. tunnel encapsulation process.
      
      ipv6 does not conform with this in three places:
      
      1) ip6_fragment: we do consult ipv6_npinfo for frag_size
      
      2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
         loop the packet back to the local socket
      
      3) ip6_skb_dst_mtu could query the settings from the user socket and
         force a wrong MTU
      
      Furthermore:
      In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
      PF_PACKET socket on top of an IPv6-backed vxlan device.
      
      Reuse xmit_recursion as we are currently only interested in protecting
      tunnel devices.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>