1. 21 10月, 2015 3 次提交
  2. 19 10月, 2015 2 次提交
  3. 17 10月, 2015 4 次提交
    • E
      net: add pfmemalloc check in sk_add_backlog() · c7c49b8f
      Eric Dumazet 提交于
      Greg reported crashes hitting the following check in __sk_backlog_rcv()
      
      	BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
      
      The pfmemalloc bit is currently checked in sk_filter().
      
      This works correctly for TCP, because sk_filter() is ran in
      tcp_v[46]_rcv() before hitting the prequeue or backlog checks.
      
      For UDP or other protocols, this does not work, because the sk_filter()
      is ran from sock_queue_rcv_skb(), which might be called _after_ backlog
      queuing if socket is owned by user by the time packet is processed by
      softirq handler.
      
      Fixes: b4b9e355 ("netvm: set PF_MEMALLOC as appropriate during SKB processing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7c49b8f
    • A
      netfilter: turn NF_HOOK into an inline function · 008027c3
      Arnd Bergmann 提交于
      A recent change to the dst_output handling caused a new warning
      when the call to NF_HOOK() is the only used of a local variable
      passed as 'dev', and CONFIG_NETFILTER is disabled:
      
      net/ipv6/ip6_output.c: In function 'ip6_output':
      net/ipv6/ip6_output.c:135:21: warning: unused variable 'dev' [-Wunused-variable]
      
      The reason for this is that the NF_HOOK macro in this case does
      not reference the variable at all, and the call to dev_net(dev)
      got removed from the ip6_output function. To avoid that warning now
      and in the future, this changes the macro into an equivalent
      inline function, which tells the compiler that the variable is
      passed correctly but still unused.
      
      The dn_forward function apparently had the same problem in
      the past and added a local workaround that no longer works
      with the inline function. In order to avoid a regression, we
      have to also remove the #ifdef from decnet in the same patch.
      
      Fixes: ede2059d ("dst: Pass net into dst->output")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      008027c3
    • F
      netfilter: make nf_queue_entry_get_refs return void · ed78d09d
      Florian Westphal 提交于
      We don't care if module is being unloaded anymore since hook unregister
      handling will destroy queue entries using that hook.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      ed78d09d
    • F
      netfilter: remove hook owner refcounting · 2ffbceb2
      Florian Westphal 提交于
      since commit 8405a8ff ("netfilter: nf_qeueue: Drop queue entries on
      nf_unregister_hook") all pending queued entries are discarded.
      
      So we can simply remove all of the owner handling -- when module is
      removed it also needs to unregister all its hooks.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      2ffbceb2
  4. 16 10月, 2015 3 次提交
  5. 15 10月, 2015 11 次提交
  6. 14 10月, 2015 1 次提交
  7. 13 10月, 2015 15 次提交
    • A
      can: at91: remove at91_can_data · 42160a04
      Alexandre Belloni 提交于
      struct at91_can_data was used to pass a callback to the driver, allowing it
      to switch the transceiver on and off. As all at91 boards are now using DT,
      this is not used anymore, remove that structure.
      Signed-off-by: NAlexandre Belloni <alexandre.belloni@free-electrons.com>
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      42160a04
    • A
      can: avoid using timeval for uapi · ba61a8d9
      Arnd Bergmann 提交于
      The can subsystem communicates with user space using a bcm_msg_head
      header, which contains two timestamps. This is problematic for
      multiple reasons:
      
      a) The structure layout is currently incompatible between 64-bit
         user space and 32-bit user space, and cannot work in compat
         mode (other than x32).
      
      b) The timeval structure layout will change in 32-bit user
         space when we fix the y2038 overflow problem by redefining
         time_t to 64-bit, making new 32-bit user space incompatible
         with the current kernel interface.
         Cars last a long time and often use old kernels, so the actual
         users of this code are the most likely ones to migrate to y2038
         safe user space.
      
      This tries to work around part of the problem by changing the
      publicly visible user interface in the header, but not the binary
      interface. Fortunately, the values passed around in the structure
      are relative times and do not actually suffer from the y2038
      overflow, so 32-bit is enough here.
      
      We replace the use of 'struct timeval' with a newly defined
      'struct bcm_timeval' that uses the exact same binary layout
      as before and that still suffers from problem a) but not problem
      b).
      
      The downside of this approach is that any user space program
      that currently assigns a timeval structure to these members
      rather than writing the tv_sec/tv_usec portions individually
      will suffer a compile-time error when built with an updated
      kernel header. Fixing this error makes it work fine with old
      and new headers though.
      
      We could address problem a) by using '__u32' or 'int' members
      rather than 'long', but that would have a more significant
      downside in also breaking support for all existing 64-bit user
      binaries that might be using this interface, which is likely
      not acceptable.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Cc: linux-can@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
      ba61a8d9
    • D
      net: Add IPv6 support to l3mdev · ccf3c8c3
      David Ahern 提交于
      Add operations to retrieve cached IPv6 dst entry from l3mdev device
      and lookup IPv6 source address.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ccf3c8c3
    • E
      ipv6: Pass struct net into nf_ct_frag6_gather · b7277597
      Eric W. Biederman 提交于
      The function nf_ct_frag6_gather is called on both the input and the
      output paths of the networking stack.  In particular ipv6_defrag which
      calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain
      on input and the LOCAL_OUT chain on output.
      
      The addition of a net parameter makes it explicit which network
      namespace the packets are being reassembled in, and removes the need
      for nf_ct_frag6_gather to guess.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7277597
    • E
      ipv4: Pass struct net into ip_defrag and ip_check_defrag · 19bcf9f2
      Eric W. Biederman 提交于
      The function ip_defrag is called on both the input and the output
      paths of the networking stack.  In particular conntrack when it is
      tracking outbound packets from the local machine calls ip_defrag.
      
      So add a struct net parameter and stop making ip_defrag guess which
      network namespace it needs to defragment packets in.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19bcf9f2
    • A
      rtnetlink: fix gcc -Wconversion warning · e8444637
      Arad, Ronen 提交于
      RTA_ALIGNTO is currently define as 4. It has to be 4U to prevent warning
      for RTA_ALIGN and RTA_DATA expansions when -Wconversion gcc option is
      enabled.
      This follows NLMSG_ALIGNTO definition in <include/uapi/linux/netlink.h>.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8444637
    • P
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni 提交于
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      redirect.
      
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-op.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2ca690b
    • E
      tcp: shrink tcp_timewait_sock by 8 bytes · d475f090
      Eric Dumazet 提交于
      Reducing tcp_timewait_sock from 280 bytes to 272 bytes
      allows SLAB to pack 15 objects per page instead of 14 (on x86)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d475f090
    • E
      net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet 提交于
      One 32bit hole is following skc_refcnt, use it.
      skc_incoming_cpu can also be an union for request_sock rcv_wnd.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed53d0ab
    • E
      net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Eric Dumazet 提交于
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
      are mostly read. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted, we use them as a place holder
      for various fields, depending on the socket type.
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e5eb54d
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
    • E
      sock: support per-packet fwmark · f28ea365
      Edward Jee 提交于
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
      Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f28ea365
    • A
      bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Alexei Starovoitov 提交于
      since eBPF programs and maps use kernel memory consider it 'locked' memory
      from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
      This limit is typically set to 64Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps,
      but kernel charges maximum map size upfront.
      For example the hash map of 1024 elements will be charged as 64Kbyte.
      It's inconvenient for current users and changes current behavior for root,
      but probably worth doing to be consistent root vs non-root.
      
      Similar accounting logic is done by mmap of perf_event.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aaac3ba9
    • A
      bpf: enable non-root eBPF programs · 1be7f75d
      Alexei Starovoitov 提交于
      In order to let unprivileged users load and execute eBPF programs
      teach verifier to prevent pointer leaks.
      Verifier will prevent
      - any arithmetic on pointers
        (except R10+Imm which is used to compute stack addresses)
      - comparison of pointers
        (except if (map_value_ptr == 0) ... )
      - passing pointers to helper functions
      - indirectly passing pointers in stack to helper functions
      - returning pointer from bpf program
      - storing pointers into ctx or maps
      
      Spill/fill of pointers into stack is allowed, but mangling
      of pointers stored in the stack or reading them byte by byte is not.
      
      Within bpf programs the pointers do exist, since programs need to
      be able to access maps, pass skb pointer to LD_ABS insns, etc
      but programs cannot pass such pointer values to the outside
      or obfuscate them.
      
      Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
      so that socket filters (tcpdump), af_packet (quic acceleration)
      and future kcm can use it.
      tracing and tc cls/act program types still require root permissions,
      since tracing actually needs to be able to see all kernel pointers
      and tc is for root only.
      
      For example, the following unprivileged socket filter program is allowed:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += skb->len;
        return 0;
      }
      
      but the following program is not:
      int bpf_prog1(struct __sk_buff *skb)
      {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);
      
        if (value)
      	*value += (u64) skb;
        return 0;
      }
      since it would leak the kernel address into the map.
      
      Unprivileged socket filter bpf programs have access to the
      following helper functions:
      - map lookup/update/delete (but they cannot store kernel pointers into them)
      - get_random (it's already exposed to unprivileged user space)
      - get_smp_processor_id
      - tail_call into another socket filter program
      - ktime_get_ns
      
      The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
      This toggle defaults to off (0), but can be set true (1).  Once true,
      bpf programs and maps cannot be accessed from unprivileged process,
      and the toggle cannot be set back to false.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1be7f75d
    • M
      arm64: Fix MINSIGSTKSZ and SIGSTKSZ · c9692657
      Manjeet Pawar 提交于
      MINSIGSTKSZ and SIGSTKSZ for ARM64 are not correctly set in latest kernel.
      This patch fixes this issue.
      
      This issue is reported in LTP (testcase: sigaltstack02.c).
      Testcase failed when sigaltstack() called with stack size "MINSIGSTKSZ - 1"
      Since in Glibc-2.22, MINSIGSTKSZ is set to 5120 but in kernel
      it is set to 2048 so testcase gets failed.
      
      Testcase Output:
      sigaltstack02 1  TPASS  :  stgaltstack() fails, Invalid Flag value,errno:22
      sigaltstack02 2  TFAIL  :  sigaltstack() returned 0, expected -1,errno:12
      
      Reported Issue in Glibc Bugzilla:
      Bugfix in Glibc-2.22: [Bug 16850]
      https://sourceware.org/bugzilla/show_bug.cgi?id=16850Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAkhilesh Kumar <akhilesh.k@samsung.com>
      Signed-off-by: NManjeet Pawar <manjeet.p@samsung.com>
      Signed-off-by: NRohit Thapliyal <r.thapliyal@samsung.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      c9692657
  8. 12 10月, 2015 1 次提交
    • P
      netfilter: conntrack: fix crash on timeout object removal · ae2d708e
      Pablo Neira Ayuso 提交于
      The object and module refcounts are updated for each conntrack template,
      however, if we delete the iptables rules and we flush the timeout
      database, we may end up with invalid references to timeout object that
      are just gone.
      
      Resolve this problem by setting the timeout reference to NULL when the
      custom timeout entry is removed from our base. This patch requires some
      RCU trickery to ensure safe pointer handling.
      
      This handling is similar to what we already do with conntrack helpers,
      the idea is to avoid bumping the timeout object reference counter from
      the packet path to avoid the cost of atomic ops.
      Reported-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      ae2d708e