1. 27 8月, 2014 1 次提交
  2. 12 8月, 2014 1 次提交
    • V
      net: Always untag vlan-tagged traffic on input. · 0d5501c1
      Vlad Yasevich 提交于
      Currently the functionality to untag traffic on input resides
      as part of the vlan module and is build only when VLAN support
      is enabled in the kernel.  When VLAN is disabled, the function
      vlan_untag() turns into a stub and doesn't really untag the
      packets.  This seems to create an interesting interaction
      between VMs supporting checksum offloading and some network drivers.
      
      There are some drivers that do not allow the user to change
      tx-vlan-offload feature of the driver.  These drivers also seem
      to assume that any VLAN-tagged traffic they transmit will
      have the vlan information in the vlan_tci and not in the vlan
      header already in the skb.  When transmitting skbs that already
      have tagged data with partial checksum set, the checksum doesn't
      appear to be updated correctly by the card thus resulting in a
      failure to establish TCP connections.
      
      The following is a packet trace taken on the receiver where a
      sender is a VM with a VLAN configued.  The host VM is running on
      doest not have VLAN support and the outging interface on the
      host is tg3:
      10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294837885 ecr 0,nop,wscale 7], length 0
      10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
      (0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
      offset 0, flags [DF], proto TCP (6), length 60)
          10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
      -> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
      4294838888 ecr 0,nop,wscale 7], length 0
      
      This connection finally times out.
      
      I've only access to the TG3 hardware in this configuration thus have
      only tested this with TG3 driver.  There are a lot of other drivers
      that do not permit user changes to vlan acceleration features, and
      I don't know if they all suffere from a similar issue.
      
      The patch attempt to fix this another way.  It moves the vlan header
      stipping code out of the vlan module and always builds it into the
      kernel network core.  This way, even if vlan is not supported on
      a virtualizatoin host, the virtual machines running on top of such
      host will still work with VLANs enabled.
      
      CC: Patrick McHardy <kaber@trash.net>
      CC: Nithin Nayak Sujir <nsujir@broadcom.com>
      CC: Michael Chan <mchan@broadcom.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Acked-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d5501c1
  3. 09 8月, 2014 1 次提交
  4. 06 8月, 2014 5 次提交
    • W
      net-timestamp: TCP timestamping · 4ed2d765
      Willem de Bruijn 提交于
      TCP timestamping extends SO_TIMESTAMPING to bytestreams.
      
      Bytestreams do not have a 1:1 relationship between send() buffers and
      network packets. The feature interprets a send call on a bytestream as
      a request for a timestamp for the last byte in that send() buffer.
      
      The choice corresponds to a request for a timestamp when all bytes in
      the buffer have been sent. That assumption depends on in-order kernel
      transmission. This is the common case. That said, it is possible to
      construct a traffic shaping tree that would result in reordering.
      The guarantee is strong, then, but not ironclad.
      
      This implementation supports send and sendpages (splice). GSO replaces
      one large packet with multiple smaller packets. This patch also copies
      the option into the correct smaller packet.
      
      This patch does not yet support timestamping on data in an initial TCP
      Fast Open SYN, because that takes a very different data path.
      
      If ID generation in ee_data is enabled, bytestream timestamps return a
      byte offset, instead of the packet counter for datagrams.
      
      The implementation supports a single timestamp per packet. It silenty
      replaces requests for previous timestamps. To avoid missing tstamps,
      flush the tcp queue by disabling Nagle, cork and autocork. Missing
      tstamps can be detected by offset when the ee_data ID is enabled.
      
      Implementation details:
      
      - On GSO, the timestamping code can be included in the main loop. I
      moved it into its own loop to reduce the impact on the common case
      to a single branch.
      
      - To avoid leaking the absolute seqno to userspace, the offset
      returned in ee_data must always be relative. It is an offset between
      an skb and sk field. The first is always set (also for GSO & ACK).
      The second must also never be uninitialized. Only allow the ID
      option on sockets in the ESTABLISHED state, for which the seqno
      is available. Never reset it to zero (instead, move it to the
      current seqno when reenabling the option).
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ed2d765
    • W
      net-timestamp: SCHED timestamp on entering packet scheduler · e7fd2885
      Willem de Bruijn 提交于
      Kernel transmit latency is often incurred in the packet scheduler.
      Introduce a new timestamp on transmission just before entering the
      scheduler. When data travels through multiple devices (bonding,
      tunneling, ...) each device will export an individual timestamp.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7fd2885
    • W
      net-timestamp: add key to disambiguate concurrent datagrams · 09c2d251
      Willem de Bruijn 提交于
      Datagrams timestamped on transmission can coexist in the kernel stack
      and be reordered in packet scheduling. When reading looped datagrams
      from the socket error queue it is not always possible to unique
      correlate looped data with original send() call (for application
      level retransmits). Even if possible, it may be expensive and complex,
      requiring packet inspection.
      
      Introduce a data-independent ID mechanism to associate timestamps with
      send calls. Pass an ID alongside the timestamp in field ee_data of
      sock_extended_err.
      
      The ID is a simple 32 bit unsigned int that is associated with the
      socket and incremented on each send() call for which software tx
      timestamp generation is enabled.
      
      The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
      avoid changing ee_data for existing applications that expect it 0.
      The counter is reset each time the flag is reenabled. Reenabling
      does not change the ID of already submitted data. It is possible
      to receive out of order IDs if the timestamp stream is not quiesced
      first.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      09c2d251
    • W
      net-timestamp: move timestamp flags out of sk_flags · b9f40e21
      Willem de Bruijn 提交于
      sk_flags is reaching its limit. New timestamping options will not fit.
      Move all of them into a new field sk->sk_tsflags.
      
      Added benefit is that this removes boilerplate code to convert between
      SOF_TIMESTAMPING_.. and SOCK_TIMESTAMPING_.. in getsockopt/setsockopt.
      
      SOCK_TIMESTAMPING_RX_SOFTWARE is also used to toggle the receive
      timestamp logic (netstamp_needed). That can be simplified and this
      last key removed, but will leave that for a separate patch.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      The u16 in sock can be moved into a 16-bit hole below sk_gso_max_segs,
      though that scatters tstamp fields throughout the struct.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9f40e21
    • W
      net-timestamp: extend SCM_TIMESTAMPING ancillary data struct · f24b9be5
      Willem de Bruijn 提交于
      Applications that request kernel tx timestamps with SO_TIMESTAMPING
      read timestamps as recvmsg() ancillary data. The response is defined
      implicitly as timespec[3].
      
      1) define struct scm_timestamping explicitly and
      
      2) add support for new tstamp types. On tx, scm_timestamping always
         accompanies a sock_extended_err. Define previously unused field
         ee_info to signal the type of ts[0]. Introduce SCM_TSTAMP_SND to
         define the existing behavior.
      
      The reception path is not modified. On rx, no struct similar to
      sock_extended_err is passed along with SCM_TIMESTAMPING.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f24b9be5
  5. 03 8月, 2014 5 次提交
  6. 01 8月, 2014 1 次提交
    • V
      net: Correctly set segment mac_len in skb_segment(). · fcdfe3a7
      Vlad Yasevich 提交于
      When performing segmentation, the mac_len value is copied right
      out of the original skb.  However, this value is not always set correctly
      (like when the packet is VLAN-tagged) and we'll end up copying a bad
      value.
      
      One way to demonstrate this is to configure a VM which tags
      packets internally and turn off VLAN acceleration on the forwarding
      bridge port.  The packets show up corrupt like this:
      16:18:24.985548 52:54:00:ab:be:25 > 52:54:00:26:ce:a3, ethertype 802.1Q
      (0x8100), length 1518: vlan 100, p 0, ethertype 0x05e0,
              0x0000:  8cdb 1c7c 8cdb 0064 4006 b59d 0a00 6402 ...|...d@.....d.
              0x0010:  0a00 6401 9e0d b441 0a5e 64ec 0330 14fa ..d....A.^d..0..
              0x0020:  29e3 01c9 f871 0000 0101 080a 000a e833)....q.........3
              0x0030:  000f 8c75 6e65 7470 6572 6600 6e65 7470 ...unetperf.netp
              0x0040:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0050:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              0x0060:  6572 6600 6e65 7470 6572 6600 6e65 7470 erf.netperf.netp
              ...
      
      This also leads to awful throughput as GSO packets are dropped and
      cause retransmissions.
      
      The solution is to set the mac_len using the values already available
      in then new skb.  We've already adjusted all of the header offset, so we
      might as well correctly figure out the mac_len using skb_reset_mac_len().
      After this change, packets are segmented correctly and performance
      is restored.
      
      CC: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fcdfe3a7
  7. 31 7月, 2014 2 次提交
    • P
      net: filter: don't release unattached filter through call_rcu() · 34c5bd66
      Pablo Neira 提交于
      sk_unattached_filter_destroy() does not always need to release the
      filter object via rcu. Since this filter is never attached to the
      socket, the caller should be responsible for releasing the filter
      in a safe way, which may not necessarily imply rcu.
      
      This is a short summary of clients of this function:
      
      1) xt_bpf.c and cls_bpf.c use the bpf matchers from rules, these rules
         are removed from the packet path before the filter is released. Thus,
         the framework makes sure the filter is safely removed.
      
      2) In the ppp driver, the ppp_lock ensures serialization between the
         xmit and filter attachment/detachment path. This doesn't use rcu
         so deferred release via rcu makes no sense.
      
      3) In the isdn/ppp driver, it is called from isdn_ppp_release()
         the isdn_ppp_ioctl(). This driver uses mutex and spinlocks, no rcu.
         Thus, deferred rcu makes no sense to me either, the deferred releases
         may be just masking the effects of wrong locking strategy, which
         should be fixed in the driver itself.
      
      4) In the team driver, this is the only place where the rcu
         synchronization with unattached filter is used. Therefore, this
         patch introduces synchronize_rcu() which is called from the
         genetlink path to make sure the filter doesn't go away while packets
         are still walking over it. I think we can revisit this once struct
         bpf_prog (that only wraps specific bpf code bits) is in place, then
         add some specific struct rcu_head in the scope of the team driver if
         Jiri thinks this is needed.
      
      Deferred rcu release for unattached filters was originally introduced
      in 302d6637 ("filter: Allow to create sk-unattached filters").
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34c5bd66
    • T
      net: Remove unlikely() for WARN_ON() conditions · 80019d31
      Thomas Graf 提交于
      No need for the unlikely(), WARN_ON() and BUG_ON() internally use
      unlikely() on the condition.
      Signed-off-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80019d31
  8. 30 7月, 2014 3 次提交
    • E
      namespaces: Use task_lock and not rcu to protect nsproxy · 728dba3a
      Eric W. Biederman 提交于
      The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
      a sufficiently expensive system call that people have complained.
      
      Upon inspect nsproxy no longer needs rcu protection for remote reads.
      remote reads are rare.  So optimize for same process reads and write
      by switching using rask_lock instead.
      
      This yields a simpler to understand lock, and a faster setns system call.
      
      In particular this fixes a performance regression observed
      by Rafael David Tinoco <rafael.tinoco@canonical.com>.
      
      This is effectively a revert of Pavel Emelyanov's commit
      cf7b708c Make access to task's nsproxy lighter
      from 2007.  The race this originialy fixed no longer exists as
      do_notify_parent uses task_active_pid_ns(parent) instead of
      parent->nsproxy.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      728dba3a
    • A
      net: sendmsg: fix NULL pointer dereference · 40eea803
      Andrey Ryabinin 提交于
      Sasha's report:
      	> While fuzzing with trinity inside a KVM tools guest running the latest -next
      	> kernel with the KASAN patchset, I've stumbled on the following spew:
      	>
      	> [ 4448.949424] ==================================================================
      	> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
      	> [ 4448.952988] Read of size 2 by thread T19638:
      	> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
      	> [ 4448.956823]  ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
      	> [ 4448.958233]  ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
      	> [ 4448.959552]  0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
      	> [ 4448.961266] Call Trace:
      	> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
      	> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
      	> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
      	> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
      	> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
      	> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
      	> [ 4448.970103] sock_sendmsg (net/socket.c:654)
      	> [ 4448.971584] ? might_fault (mm/memory.c:3741)
      	> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
      	> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
      	> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
      	> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
      	> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
      	> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
      	> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
      	> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
      	> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
      	> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
      	> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
      	> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
      	> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
      	> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
      	> [ 4448.988929] ==================================================================
      
      This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.
      
      After this report there was no usual "Unable to handle kernel NULL pointer dereference"
      and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.
      
      This bug was introduced in f3d33426
      (net: rework recvmsg handler msg_name and msg_namelen logic).
      Commit message states that:
      	"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      	 non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      	 affect sendto as it would bail out earlier while trying to copy-in the
      	 address."
      But in fact this affects sendto when address 0 is mapped and contains
      socket address structure in it. In such case copy-in address will succeed,
      verify_iovec() function will successfully exit with msg->msg_namelen > 0
      and msg->msg_name == NULL.
      
      This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40eea803
    • W
      net: remove deprecated syststamp timestamp · 4d276eb6
      Willem de Bruijn 提交于
      The SO_TIMESTAMPING API defines three types of timestamps: software,
      hardware in raw format (hwtstamp) and hardware converted to system
      format (syststamp). The last has been deprecated in favor of combining
      hwtstamp with a PTP clock driver. There are no active users in the
      kernel.
      
      The option was device driver dependent. If set, but without hardware
      support, the correct behavior is to return zero in the relevant field
      in the SCM_TIMESTAMPING ancillary message. Without device drivers
      implementing the option, this field is effectively always zero.
      
      Remove the internal plumbing to dissuage new drivers from implementing
      the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
      avoid breaking existing applications that request the timestamp.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d276eb6
  9. 29 7月, 2014 1 次提交
  10. 25 7月, 2014 2 次提交
  11. 24 7月, 2014 2 次提交
  12. 21 7月, 2014 2 次提交
    • V
      net: print a notification on device rename · 6fe82a39
      Veaceslav Falico 提交于
      Currently it's done silently (from the kernel part), and thus it might be
      hard to track the renames from logs.
      
      Add a simple netdev_info() to notify the rename, but only in case the
      previous name was valid.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      CC: stephen hemminger <stephen@networkplumber.org>
      CC: Jerry Chu <hkchu@google.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: NVeaceslav Falico <vfalico@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6fe82a39
    • V
      net: print net_device reg_state in netdev_* unless it's registered · ccc7f496
      Veaceslav Falico 提交于
      This way we'll always know in what status the device is, unless it's
      running normally (i.e. NETDEV_REGISTERED).
      
      Also, emit a warning once in case of a bad reg_state.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Jason Baron <jbaron@akamai.com>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      CC: stephen hemminger <stephen@networkplumber.org>
      CC: Jerry Chu <hkchu@google.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Joe Perches <joe@perches.com>
      Signed-off-by: NVeaceslav Falico <vfalico@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ccc7f496
  13. 17 7月, 2014 3 次提交
  14. 16 7月, 2014 6 次提交
    • F
      4d3520cb
    • F
      aee944dd
    • T
      net: rtnetlink - make create_link take name_assign_type · 5517750f
      Tom Gundersen 提交于
      This passes down NET_NAME_USER (or NET_NAME_ENUM) to alloc_netdev(),
      for any device created over rtnetlink.
      
      v9: restore reverse-christmas-tree order of local variables
      Signed-off-by: NTom Gundersen <teg@jklm.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5517750f
    • T
      net: set name_assign_type in alloc_netdev() · c835a677
      Tom Gundersen 提交于
      Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
      all users to pass NET_NAME_UNKNOWN.
      
      Coccinelle patch:
      
      @@
      expression sizeof_priv, name, setup, txqs, rxqs, count;
      @@
      
      (
      -alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
      +alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
      |
      -alloc_netdev_mq(sizeof_priv, name, setup, count)
      +alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
      |
      -alloc_netdev(sizeof_priv, name, setup)
      +alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
      )
      
      v9: move comments here from the wrong commit
      Signed-off-by: NTom Gundersen <teg@jklm.no>
      Reviewed-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c835a677
    • T
      net: set name assign type for renamed devices · 238fa362
      Tom Gundersen 提交于
      Based on a patch from David Herrmann.
      
      This is the only place devices can be renamed.
      
      v9: restore revers-christmas-tree order of local variables
      Signed-off-by: NTom Gundersen <teg@jklm.no>
      Reviewed-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      238fa362
    • T
      net: add name_assign_type netdev attribute · 685343fc
      Tom Gundersen 提交于
      Based on a patch by David Herrmann.
      
      The name_assign_type attribute gives hints where the interface name of a
      given net-device comes from. These values are currently defined:
        NET_NAME_ENUM:
          The ifname is provided by the kernel with an enumerated
          suffix, typically based on order of discovery. Names may
          be reused and unpredictable.
        NET_NAME_PREDICTABLE:
          The ifname has been assigned by the kernel in a predictable way
          that is guaranteed to avoid reuse and always be the same for a
          given device. Examples include statically created devices like
          the loopback device and names deduced from hardware properties
          (including being given explicitly by the firmware). Names
          depending on the order of discovery, or in any other way on the
          existence of other devices, must not be marked as PREDICTABLE.
        NET_NAME_USER:
          The ifname was provided by user-space during net-device setup.
        NET_NAME_RENAMED:
          The net-device has been renamed from userspace. Once this type is set,
          it cannot change again.
        NET_NAME_UNKNOWN:
          This is an internal placeholder to indicate that we yet haven't yet
          categorized the name. It will not be exposed to userspace, rather
          -EINVAL is returned.
      
      The aim of these patches is to improve user-space renaming of interfaces. As
      a general rule, userspace must rename interfaces to guarantee that names stay
      the same every time a given piece of hardware appears (at boot, or when
      attaching it). However, there are several situations where userspace should
      not perform the renaming, and that depends on both the policy of the local
      admin, but crucially also on the nature of the current interface name.
      
      If an interface was created in repsonse to a userspace request, and userspace
      already provided a name, we most probably want to leave that name alone. The
      main instance of this is wifi-P2P devices created over nl80211, which currently
      have a long-standing bug where they are getting renamed by udev. We label such
      names NET_NAME_USER.
      
      If an interface, unbeknown to us, has already been renamed from userspace, we
      most probably want to leave also that alone. This will typically happen when
      third-party plugins (for instance to udev, but the interface is generic so could
      be from anywhere) renames the interface without informing udev about it. A
      typical situation is when you switch root from an installer or an initrd to the
      real system and the new instance of udev does not know what happened before
      the switch. These types of problems have caused repeated issues in the past. To
      solve this, once an interface has been renamed, its name is labelled
      NET_NAME_RENAMED.
      
      In many cases, the kernel is actually able to name interfaces in such a
      way that there is no need for userspace to rename them. This is the case when
      the enumeration order of devices, or in fact any other (non-parent) device on
      the system, can not influence the name of the interface. Examples include
      statically created devices, or any naming schemes based on hardware properties
      of the interface. In this case the admin may prefer to use the kernel-provided
      names, and to make that possible we label such names NET_NAME_PREDICTABLE.
      We want the kernel to have tho possibilty of performing predictable interface
      naming itself (and exposing to userspace that it has), as the information
      necessary for a proper naming scheme for a certain class of devices may not
      be exposed to userspace.
      
      The case where renaming is almost certainly desired, is when the kernel has
      given the interface a name using global device enumeration based on order of
      discovery (ethX, wlanY, etc). These naming schemes are labelled NET_NAME_ENUM.
      
      Lastly, a fallback is left as NET_NAME_UNKNOWN, to indicate that a driver has
      not yet been ported. This is mostly useful as a transitionary measure, allowing
      us to label the various naming schemes bit by bit.
      
      v8: minor documentation fixes
      v9: move comment to the right commit
      Signed-off-by: NTom Gundersen <teg@jklm.no>
      Reviewed-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Reviewed-by: NKay Sievers <kay@vrfy.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      685343fc
  15. 15 7月, 2014 2 次提交
    • T
      cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      5577964e
    • M
      neigh: sysctl - simplify address calculation of gc_* variables · 9ecf07a1
      Mathias Krause 提交于
      The code in neigh_sysctl_register() relies on a specific layout of
      struct neigh_table, namely that the 'gc_*' variables are directly
      following the 'parms' member in a specific order. The code, though,
      expresses this in the most ugly way.
      
      Get rid of the ugly casts and use the 'tbl' pointer to get a handle to
      the table. This way we can refer to the 'gc_*' variables directly.
      
      Similarly seen in the grsecurity patch, written by Brad Spengler.
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ecf07a1
  16. 14 7月, 2014 1 次提交
  17. 11 7月, 2014 2 次提交
    • J
      bridge: netlink dump interface at par with brctl · 5e6d2435
      Jamal Hadi Salim 提交于
      Actually better than brctl showmacs because we can filter by bridge
      port in the kernel.
      The current bridge netlink interface doesnt scale when you have many
      bridges each with large fdbs or even bridges with many bridge ports
      
      And now for the science non-fiction novel you have all been
      waiting for..
      
      //lets see what bridge ports we have
      root@moja-1:/configs/may30-iprt/bridge# ./bridge link show
      8: eth1 state DOWN : <BROADCAST,MULTICAST> mtu 1500 master br0 state
      disabled priority 32 cost 19
      17: sw1-p1 state DOWN : <BROADCAST,NOARP> mtu 1500 master br0 state
      disabled priority 32 cost 100
      
      // show all..
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
      33:33:00:00:00:01 dev bond0 self permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      33:33:00:00:00:01 dev ifb0 self permanent
      33:33:00:00:00:01 dev ifb1 self permanent
      33:33:00:00:00:01 dev eth0 self permanent
      01:00:5e:00:00:01 dev eth0 self permanent
      33:33:ff:22:01:01 dev eth0 self permanent
      02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:07 dev eth1 self permanent
      33:33:00:00:00:01 dev eth1 self permanent
      33:33:00:00:00:01 dev gretap0 self permanent
      da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
      33:33:00:00:00:01 dev sw1-p1 self permanent
      
      //filter by bridge
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
      02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:07 dev eth1 self permanent
      33:33:00:00:00:01 dev eth1 self permanent
      da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
      33:33:00:00:00:01 dev sw1-p1 self permanent
      
      // bridge sw1 has no ports attached..
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br sw1
      
      //filter by port
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show brport eth1
      02:00:00:12:01:02 vlan 0 master br0 permanent
      00:17:42:8a:b4:05 vlan 0 master br0 permanent
      00:17:42:8a:b4:07 self permanent
      33:33:00:00:00:01 self permanent
      
      // filter by port + bridge
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0 brport
      sw1-p1
      da:ac:46:27:d9:53 vlan 0 master br0 permanent
      33:33:00:00:00:01 self permanent
      
      // for shits and giggles (as they say in New Brunswick), lets
      // change the mac that br0 uses
      // Note: a magical fdb entry with no brport is added ...
      root@moja-1:/configs/may30-iprt/bridge# ip link set dev br0 address
      02:00:00:12:01:04
      
      // lets see if we can see the unicorn ..
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show
      33:33:00:00:00:01 dev bond0 self permanent
      33:33:00:00:00:01 dev dummy0 self permanent
      33:33:00:00:00:01 dev ifb0 self permanent
      33:33:00:00:00:01 dev ifb1 self permanent
      33:33:00:00:00:01 dev eth0 self permanent
      01:00:5e:00:00:01 dev eth0 self permanent
      33:33:ff:22:01:01 dev eth0 self permanent
      02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:07 dev eth1 self permanent
      33:33:00:00:00:01 dev eth1 self permanent
      33:33:00:00:00:01 dev gretap0 self permanent
      02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
      da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
      33:33:00:00:00:01 dev sw1-p1 self permanent
      
      //can we see it if we filter by bridge?
      root@moja-1:/configs/may30-iprt/bridge# ./bridge fdb show br br0
      02:00:00:12:01:02 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:05 dev eth1 vlan 0 master br0 permanent
      00:17:42:8a:b4:07 dev eth1 self permanent
      33:33:00:00:00:01 dev eth1 self permanent
      02:00:00:12:01:04 dev br0 vlan 0 master br0 permanent <=== there it is
      da:ac:46:27:d9:53 dev sw1-p1 vlan 0 master br0 permanent
      33:33:00:00:00:01 dev sw1-p1 self permanent
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e6d2435
    • J
      bridge: fdb dumping takes a filter device · 5d5eacb3
      Jamal Hadi Salim 提交于
      Dumping a bridge fdb dumps every fdb entry
      held. With this change we are going to filter
      on selected bridge port.
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d5eacb3