1. 06 Sep 2014 (4 commits)
    • net-timestamp: Merge shared code between phy and regular timestamping · 37846ef0
      Authored by Alexander Duyck
      This change merges the shared bits that exist between skb_tx_tstamp and
      skb_complete_tx_timestamp.  By doing this we can avoid the two diverging,
      as there were already changes pushed into skb_tx_tstamp that hadn't made
      it into the other function.
      
      In addition, this resolves the issue that skb_complete_tx_timestamp was
      declared in linux/skbuff.h even though it was only compiled in if phy
      timestamping was enabled.
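
      As a hedged sketch, the shared tail both callers can use looks roughly
      like this (the body is reconstructed from the description above, not
      copied from the diff):

      /* Common tail: attach a SO_EE_ORIGIN_TIMESTAMPING error record and
       * loop the skb back onto the socket error queue, so skb_tx_tstamp()
       * and skb_complete_tx_timestamp() cannot diverge again.
       */
      static void __skb_complete_tx_timestamp(struct sk_buff *skb,
                                              struct sock *sk, int tstype)
      {
              struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);

              memset(serr, 0, sizeof(*serr));
              serr->ee.ee_errno = ENOMSG;
              serr->ee.ee_origin = SO_EE_ORIGIN_TIMESTAMPING;
              serr->ee.ee_info = tstype;

              if (sock_queue_err_skb(sk, skb))
                      kfree_skb(skb);
      }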
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: Add generic options for tunables · f0db9b07
      Authored by Govindarajulu Varadarajan
      This patch adds new ethtool commands, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE,
      for getting and setting tunable values in the driver.

      Add get_tunable and set_tunable to ethtool_ops. Drivers implement these
      functions to get/set tunable values.
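
      A sketch of the driver side (mydrv_* and the rx_copybreak field are
      hypothetical; ETHTOOL_RX_COPYBREAK stands in for whatever tunable a
      driver chooses to expose):

      static int mydrv_get_tunable(struct net_device *dev,
                                   const struct ethtool_tunable *tuna,
                                   void *data)
      {
              struct mydrv_priv *priv = netdev_priv(dev); /* hypothetical */

              switch (tuna->id) {
              case ETHTOOL_RX_COPYBREAK:
                      *(u32 *)data = priv->rx_copybreak;
                      return 0;
              default:
                      return -EINVAL;
              }
      }

      static const struct ethtool_ops mydrv_ethtool_ops = {
              .get_tunable = mydrv_get_tunable,
              /* .set_tunable would mirror this with a const void *data */
      };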
      Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dev_ioctl: remove dev_load() CAP_SYS_MODULE message · e020836d
      Authored by Daniel Borkmann
      Marcel reported to see the following message when autoloading
      is being triggered when adding nlmon device:
      
        Loading kernel module for a network device with
        CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias
        netdev-nlmon instead.
      
      This false positive happens despite correct capabilities being
      set, e.g. through issuing `ip link del dev nlmon` more than once
      on a valid device with name nlmon, but Marcel has also seen it
      at creation time, when no nlmon module was previously compiled-in
      or loaded as a module and the device name equals a link type name
      (e.g. nlmon, vxlan, team).
      
      Stephen says:
      
        The netdev module alias is a holdover from the past. For
        normal devices, people used to create an alias eth0 and
        point it to the type of network device used; that was back
        in the bad old ISA days before real discovery.
      
        Also, the tunnels create a module alias for the control device,
        and ip used to use this to autoload the tunnel device.
      
        The message is bogus and should just be removed; I also see
        it in a couple of other cases where tap devices are renamed
        for other uses.
      
      As mentioned in 8909c9ad ("net: don't allow CAP_NET_ADMIN
      to load non-netdev kernel modules"), we nevertheless still
      might want to leave the old autoloading behaviour in place
      as it could break old scripts, so for now, let's just remove
      the log message as Stephen suggests.
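
      After the change, dev_load() has roughly this shape (a sketch, not
      the exact diff); the CAP_SYS_MODULE fallback stays, it just no
      longer logs:

      void dev_load(struct net *net, const char *name)
      {
              struct net_device *dev;
              int no_module;

              rcu_read_lock();
              dev = dev_get_by_name_rcu(net, name);
              rcu_read_unlock();

              no_module = !dev;
              if (no_module && capable(CAP_NET_ADMIN))
                      no_module = request_module("netdev-%s", name);
              if (no_module && capable(CAP_SYS_MODULE))
                      request_module("%s", name);   /* old behaviour, silent */
      }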
      
      Reference: http://thread.gmane.org/gmane.linux.kernel/1105168
      Reported-by: Marcel Holtmann <marcel@holtmann.org>
      Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bpf: make eBPF interpreter images read-only · 60a3b225
      Authored by Daniel Borkmann
      With eBPF getting more extended and its exposure to user space on the
      way, hardening the memory range the interpreter uses to steer its
      command flow seems appropriate.  This patch moves the to-be-interpreted
      bytecode to read-only pages.
      
      In case we execute a corrupted BPF interpreter image for some reason,
      e.g. caused by an attacker who got past the verifier stage, it would not
      only provide arbitrary read/write memory access but arbitrary function
      calls as well. After setting up the BPF interpreter image, its contents
      do not change until destruction time, thus we can set up the image on
      pages made immutable in order to mitigate modifications to that code.
      The idea is derived from commit 314beb9b ("x86: bpf_jit_comp: secure
      bpf jit against spraying attacks").
      
      This is possible because bpf_prog is not part of sk_filter anymore.
      After setup, bpf_prog cannot be altered during its lifetime. This
      prevents any modifications to the entire bpf_prog structure (incl.
      function/JIT image pointer).
      
      Every eBPF program (including migrated classic BPF programs) has to call
      bpf_prog_select_runtime() to select either the interpreter or a JIT image
      as a last setup step, and they are all freed via bpf_prog_free(),
      including non-JIT. Therefore, we can easily integrate this into the
      eBPF lifetime, plus since we directly allocate a bpf_prog, we have no
      performance penalty.
      
      Tested with seccomp and the test_bpf testsuite in JIT/non-JIT mode, and
      with manual inspection of kernel_page_tables.  Brad Spengler proposed
      the same idea via Twitter during development of this patch.
      
      Joint work with Hannes Frederic Sowa.
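
      The page protection flip itself can be as small as the following
      sketch (assuming the architecture provides set_memory_ro() and
      set_memory_rw(), and that fp->pages holds the allocation size in
      pages):

      /* Lock the finalized image; undone only right before freeing. */
      static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
      {
              set_memory_ro((unsigned long)fp, fp->pages);
      }

      static inline void bpf_prog_unlock_ro(struct bpf_prog *fp)
      {
              set_memory_rw((unsigned long)fp, fp->pages);
      }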
      Suggested-by: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 04 Sep 2014 (1 commit)
    • qdisc: validate frames going through the direct_xmit path · 1f59533f
      Authored by Jesper Dangaard Brouer
      In commit 50cbe9ab ("net: Validate xmit SKBs right when we
      pull them out of the qdisc") the validation code was moved out of
      dev_hard_start_xmit and into dequeue_skb.
      
      However, this overlooked the fact that we do not always enqueue
      the skb onto a qdisc. The first case is a qdisc that has the
      TCQ_F_CAN_BYPASS flag set and is empty.  The second case is a
      device with no qdisc at all, which is common for software devices.
      
      Originally spotted, and an initial patch written, by Alexander Duyck,
      who was seeing issues trying to connect to a vhost_net interface
      after commit 50cbe9ab was applied.
      
      Added a call to validate_xmit_skb() in __dev_xmit_skb(), in the
      code path for qdiscs with the TCQ_F_CAN_BYPASS flag, and in
      __dev_queue_xmit() when there is no qdisc.
      
      Also handle the error case where dev_hard_start_xmit() returns an
      skb list and the return code is not dev_xmit_complete(rc), so the
      code falls through to kfree_skb(); in that situation it should
      call kfree_skb_list() instead.
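
      The bypass-path part of the fix then looks roughly like this
      (simplified sketch of __dev_xmit_skb(); locking, statistics and the
      NULL-skb handling are elided):

      if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
          qdisc_run_begin(q)) {
              /* validate here, since this skb bypasses the qdisc and
               * therefore never reaches dequeue_skb() */
              skb = validate_xmit_skb(skb, dev);
              if (skb && sch_direct_xmit(skb, q, dev, txq, root_lock))
                      __qdisc_run(q);
              else
                      qdisc_run_end(q);
              rc = NET_XMIT_SUCCESS;
      }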
      
      Fixes: 50cbe9ab ("net: Validate xmit SKBs right when we pull them out of the qdisc")
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Sep 2014 (4 commits)
  4. 02 Sep 2014 (12 commits)
  5. 30 Aug 2014 (2 commits)
  6. 26 Aug 2014 (2 commits)
  7. 25 Aug 2014 (2 commits)
  8. 24 Aug 2014 (2 commits)
    • net: use reciprocal_scale() helper · 8fc54f68
      Authored by Daniel Borkmann
      Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().
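
      For reference, the helper is a one-line wrapper around exactly that
      expression, scaling val into the right-open interval [0, ep_ro):

      static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
      {
              return (u32)(((u64) val * ep_ro) >> 32);
      }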
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Allow raw buffers to be passed into the flow dissector. · 690e36e7
      Authored by David S. Miller
      Drivers, and perhaps other entities we have not yet considered,
      sometimes want to know how deep the protocol headers go before
      deciding how large of an SKB to allocate and how much of the packet to
      place into the linear SKB area.
      
      For example, consider a driver which has a device which DMAs into
      pools of pages and then tells the driver where the data went in the
      DMA descriptor(s).  The driver can then build an SKB and reference
      most of the data via SKB fragments (which are page/offset/length
      triplets).
      
      However at least some of the front of the packet should be placed into
      the linear SKB area, which comes before the fragments, so that packet
      processing can get at the headers efficiently.  The first thing each
      protocol layer is going to do is a "pskb_may_pull()" so we might as
      well aggregate as much of this as possible while we're building the
      SKB in the driver.
      
      Part of supporting this is that we don't have an SKB yet, so we want
      to be able to let the flow dissector operate on a raw buffer in order
      to compute the offset of the end of the headers.
      
      So now we have a __skb_flow_dissect() which takes an explicit data
      pointer and length.
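
      The old skb-based entry point then becomes a thin wrapper; a sketch
      of the resulting pair (parameter order is illustrative):

      bool __skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow,
                              void *data, __be16 proto, int nhoff, int hlen);

      static inline bool skb_flow_dissect(const struct sk_buff *skb,
                                          struct flow_keys *flow)
      {
              /* NULL data: fall back to reading headers from the skb itself */
              return __skb_flow_dissect(skb, flow, NULL, 0, 0, 0);
      }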
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 23 Aug 2014 (2 commits)
  10. 12 Aug 2014 (1 commit)
    • net: Always untag vlan-tagged traffic on input. · 0d5501c1
      Authored by Vlad Yasevich
      Currently the functionality to untag traffic on input resides
      as part of the vlan module and is built only when VLAN support
      is enabled in the kernel.  When VLAN is disabled, the function
      vlan_untag() turns into a stub and doesn't really untag the
      packets.  This seems to create an interesting interaction
      between VMs supporting checksum offloading and some network drivers.
      
      There are some drivers that do not allow the user to change
      tx-vlan-offload feature of the driver.  These drivers also seem
      to assume that any VLAN-tagged traffic they transmit will
      have the vlan information in the vlan_tci and not in the vlan
      header already in the skb.  When transmitting skbs that already
      have tagged data with the partial checksum set, the checksum
      doesn't appear to be updated correctly by the card, thus resulting
      in a failure to establish TCP connections.
      
      The following is a packet trace taken on the receiver, where the
      sender is a VM with a VLAN configured.  The host the VM is running
      on does not have VLAN support, and the outgoing interface on the
      host is tg3:
      10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294837885 ecr 0,nop,wscale 7], length 0
      10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294838888 ecr 0,nop,wscale 7], length 0
      
      This connection finally times out.
      
      I only have access to TG3 hardware in this configuration, and have
      thus only tested this with the TG3 driver.  There are a lot of other
      drivers that do not permit user changes to vlan acceleration features,
      and I don't know whether they all suffer from a similar issue.
      
      This patch attempts to fix this another way.  It moves the vlan header
      stripping code out of the vlan module and always builds it into the
      kernel network core.  This way, even if vlan is not supported on
      a virtualization host, the virtual machines running on top of such
      a host will still work with VLANs enabled.
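
      In the core receive path the effect is roughly the following hunk,
      now compiled unconditionally (a sketch; skb_vlan_untag is the moved
      helper's name in the core):

      if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
          skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
              skb = skb_vlan_untag(skb);   /* tag moves into skb->vlan_tci */
              if (unlikely(!skb))
                      goto out;
      }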
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: Nithin Nayak Sujir <nsujir@broadcom.com>
      CC: Michael Chan <mchan@broadcom.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 09 Aug 2014 (1 commit)
  12. 06 Aug 2014 (5 commits)
    • net-timestamp: TCP timestamping · 4ed2d765
      Authored by Willem de Bruijn
      TCP timestamping extends SO_TIMESTAMPING to bytestreams.
      
      Bytestreams do not have a 1:1 relationship between send() buffers and
      network packets. The feature interprets a send call on a bytestream as
      a request for a timestamp for the last byte in that send() buffer.
      
      The choice corresponds to a request for a timestamp when all bytes in
      the buffer have been sent. That assumption depends on in-order kernel
      transmission. This is the common case. That said, it is possible to
      construct a traffic shaping tree that would result in reordering.
      The guarantee is strong, then, but not ironclad.
      
      This implementation supports send and sendpages (splice). GSO replaces
      one large packet with multiple smaller packets. This patch also copies
      the option into the correct smaller packet.
      
      This patch does not yet support timestamping on data in an initial TCP
      Fast Open SYN, because that takes a very different data path.
      
      If ID generation in ee_data is enabled, bytestream timestamps return a
      byte offset, instead of the packet counter for datagrams.
      
      The implementation supports a single timestamp per packet. It silently
      replaces requests for previous timestamps. To avoid missing tstamps,
      flush the tcp queue by disabling Nagle, cork and autocork. Missing
      tstamps can be detected by offset when the ee_data ID is enabled.
      
      Implementation details:
      
      - On GSO, the timestamping code can be included in the main loop. I
      moved it into its own loop to reduce the impact on the common case
      to a single branch.
      
      - To avoid leaking the absolute seqno to userspace, the offset
      returned in ee_data must always be relative. It is an offset between
      an skb and sk field. The first is always set (also for GSO & ACK).
      The second must also never be uninitialized. Only allow the ID
      option on sockets in the ESTABLISHED state, for which the seqno
      is available. Never reset it to zero (instead, move it to the
      current seqno when reenabling the option).
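
      A userspace sketch of enabling this on a connected TCP socket (error
      handling elided; the exact include set may vary by libc):

      #include <sys/socket.h>
      #include <linux/net_tstamp.h>

      static void enable_tcp_tx_tstamps(int fd)
      {
              /* software tx timestamps; with OPT_ID set, reports carry a
               * byte offset in ee_data (socket must be ESTABLISHED) */
              int val = SOF_TIMESTAMPING_TX_SOFTWARE |
                        SOF_TIMESTAMPING_SOFTWARE |
                        SOF_TIMESTAMPING_OPT_ID;

              setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
      }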
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-timestamp: SCHED timestamp on entering packet scheduler · e7fd2885
      Authored by Willem de Bruijn
      Kernel transmit latency is often incurred in the packet scheduler.
      Introduce a new timestamp on transmission just before entering the
      scheduler. When data travels through multiple devices (bonding,
      tunneling, ...) each device will export an individual timestamp.
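
      The hook amounts to a few lines at the top of the transmit path;
      roughly (a sketch of the dev_queue_xmit() side):

      /* just before the skb is handed to the qdisc / device */
      if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
              __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);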
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-timestamp: add key to disambiguate concurrent datagrams · 09c2d251
      Authored by Willem de Bruijn
      Datagrams timestamped on transmission can coexist in the kernel stack
      and be reordered in packet scheduling. When reading looped datagrams
      from the socket error queue, it is not always possible to uniquely
      correlate looped data with the original send() call (for application
      level retransmits). Even if possible, it may be expensive and complex,
      requiring packet inspection.
      
      Introduce a data-independent ID mechanism to associate timestamps with
      send calls. Pass an ID alongside the timestamp in field ee_data of
      sock_extended_err.
      
      The ID is a simple 32 bit unsigned int that is associated with the
      socket and incremented on each send() call for which software tx
      timestamp generation is enabled.
      
      The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
      avoid changing ee_data for existing applications that expect it to
      be 0. The counter is reset each time the flag is reenabled. Reenabling
      does not change the ID of already submitted data. It is possible
      to receive out-of-order IDs if the timestamp stream is not quiesced
      first.
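
      On the read side the ID arrives in ee_data of the extended error; a
      userspace sketch for an IPv4 socket (error handling elided):

      #include <errno.h>
      #include <stdio.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      #include <linux/errqueue.h>

      static void read_tstamp_id(int fd)
      {
              char data[256], ctrl[512];
              struct iovec iov = { data, sizeof(data) };
              struct msghdr msg = { 0 };
              struct cmsghdr *cm;

              msg.msg_iov = &iov;
              msg.msg_iovlen = 1;
              msg.msg_control = ctrl;
              msg.msg_controllen = sizeof(ctrl);

              if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                      return;

              for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                      struct sock_extended_err *serr;

                      if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR)
                              continue;
                      serr = (struct sock_extended_err *)CMSG_DATA(cm);
                      if (serr->ee_errno == ENOMSG &&
                          serr->ee_origin == SO_EE_ORIGIN_TIMESTAMPING)
                              printf("send #%u timestamped\n", serr->ee_data);
              }
      }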
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-timestamp: move timestamp flags out of sk_flags · b9f40e21
      Authored by Willem de Bruijn
      sk_flags is reaching its limit. New timestamping options will not fit.
      Move all of them into a new field sk->sk_tsflags.
      
      An added benefit is that this removes the boilerplate code for
      converting between SOF_TIMESTAMPING_* and SOCK_TIMESTAMPING_* in
      getsockopt/setsockopt.

      SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
      timestamp logic (netstamp_needed). That can be simplified and this
      last key removed, but I will leave that for a separate patch.
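
      The net effect on the setsockopt path, sketched with one example bit
      (before: a sock_set_flag()/sock_reset_flag() pair per option; after:
      the user value is stored as-is):

      /* before */
      if (val & SOF_TIMESTAMPING_TX_SOFTWARE)
              sock_set_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE);
      else
              sock_reset_flag(sk, SOCK_TIMESTAMPING_TX_SOFTWARE);

      /* after: no SOF_* <-> SOCK_* translation needed */
      sk->sk_tsflags = val;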
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
      though that scatters tstamp fields throughout the struct.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-timestamp: extend SCM_TIMESTAMPING ancillary data struct · f24b9be5
      Authored by Willem de Bruijn
      Applications that request kernel tx timestamps with SO_TIMESTAMPING
      read timestamps as recvmsg() ancillary data. The response is defined
      implicitly as timespec[3].
      
      1) define struct scm_timestamping explicitly and
      
      2) add support for new tstamp types. On tx, scm_timestamping always
         accompanies a sock_extended_err. Define previously unused field
         ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
         define the existing behavior.
      
      The reception path is not modified. On rx, no struct similar to
      sock_extended_err is passed along with SCM_TIMESTAMPING.
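
      Per the description, the explicit uapi additions amount to roughly:

      struct scm_timestamping {
              struct timespec ts[3];
      };

      enum {
              SCM_TSTAMP_SND,         /* driver passed the skb to the NIC */
      };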
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 03 Aug 2014 (2 commits)
    • net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1
      Authored by Alexei Starovoitov
      clean up names related to socket filtering and bpf in the following way:
      - everything that deals with sockets keeps 'sk_*' prefix
      - everything that is pure BPF is changed to 'bpf_*' prefix
      
      split 'struct sk_filter' into
      struct sk_filter {
      	atomic_t        refcnt;
      	struct rcu_head rcu;
      	struct bpf_prog *prog;
      };
      and
      struct bpf_prog {
              u32                     jited:1,
                                      len:31;
              struct sock_fprog_kern  *orig_prog;
              unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                                  const struct bpf_insn *filter);
              union {
                      struct sock_filter      insns[0];
                      struct bpf_insn         insnsi[0];
                      struct work_struct      work;
              };
      };
      so that 'struct bpf_prog' can be used independently of sockets, which
      cleans up the 'unattached' bpf use cases
      
      split SK_RUN_FILTER macro into:
          SK_RUN_FILTER to be used with 'struct sk_filter *' and
          BPF_PROG_RUN to be used with 'struct bpf_prog *'
      
      __sk_filter_release(struct sk_filter *) gains
      __bpf_prog_release(struct bpf_prog *) helper function
      
      also perform related renames for the functions that work
      with 'struct bpf_prog *', since they're along the same lines:
      
      sk_filter_size -> bpf_prog_size
      sk_filter_select_runtime -> bpf_prog_select_runtime
      sk_filter_free -> bpf_prog_free
      sk_unattached_filter_create -> bpf_prog_create
      sk_unattached_filter_destroy -> bpf_prog_destroy
      sk_store_orig_filter -> bpf_prog_store_orig_filter
      sk_release_orig_filter -> bpf_release_orig_filter
      __sk_migrate_filter -> bpf_migrate_filter
      __sk_prepare_filter -> bpf_prepare_filter
      
      API for attaching classic BPF to a socket stays the same:
      sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
      and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
      which is used by sockets, tun, af_packet
      
      API for 'unattached' BPF programs becomes:
      bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
      and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
      which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
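
      A sketch of the renamed 'unattached' API in use (the always-accept
      program and the surrounding context are illustrative only):

      struct sock_filter insns[] = {
              BPF_STMT(BPF_RET | BPF_K, 0xffff),      /* accept packet */
      };
      struct sock_fprog_kern fprog = {
              .len    = ARRAY_SIZE(insns),
              .filter = insns,
      };
      struct bpf_prog *prog;

      if (bpf_prog_create(&prog, &fprog) == 0) {
              u32 res = BPF_PROG_RUN(prog, skb);      /* was SK_RUN_FILTER */
              /* res: number of bytes of the packet to keep */
              bpf_prog_destroy(prog);
      }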
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: rename sk_convert_filter() -> bpf_convert_filter() · 8fb575ca
      Authored by Alexei Starovoitov
      to indicate that this function converts classic BPF into eBPF and
      is not related to sockets.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>