1. 22 October 2012, 1 commit
  2. 13 October 2012, 2 commits
  3. 11 October 2012, 5 commits
  4. 09 October 2012, 2 commits
    • vlan: don't deliver frames for unknown vlans to protocols · 48cc32d3
      Authored by Florian Zumbiehl
      Commit 6a32e4f9 made the vlan code skip marking vlan-tagged frames for
      locally unconfigured vlans as PACKET_OTHERHOST when there was an
      rx_handler, as the rx_handler could cause the frame to be received on a
      different (virtual) vlan-capable interface where that vlan might be
      configured.
      
      As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
      frames for unknown vlans to be delivered to the protocol stack as if they had
      been received untagged.
      
      For example, if an ipv6 router advertisement tagged for a locally
      unconfigured vlan is received on an interface with macvlan interfaces
      attached, macvlan's rx_handler returns RX_HANDLER_PASS after delivering
      the frame to the macvlan interfaces, which caused it to be passed to
      the protocol stack, leading to ipv6 addresses being configured for the
      announced prefix even though they are completely unusable on the
      underlying interface.
      
      The fix moves the PACKET_OTHERHOST marking after the rx_handler, so the
      rx_handler, if there is one, sees the frame unchanged; afterwards,
      before the frame is delivered to the protocol stack, the frame gets
      marked whether there is an rx_handler or not.
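
      A rough sketch of the reordering, modeled on __netif_receive_skb()
      (the exact hunk in the patch may differ):

          rx_handler = rcu_dereference(skb->dev->rx_handler);
          if (rx_handler) {
                  switch (rx_handler(&skb)) {
                  case RX_HANDLER_ANOTHER:
                          goto another_round;     /* re-run on the new dev */
                  case RX_HANDLER_PASS:
                          break;          /* fall through to the protocols */
                  }
          }

          /* After the rx_handler: a frame still carrying a vlan tag was not
           * claimed by any vlan device, so mark it before the protocol stack
           * sees it - whether an rx_handler was present or not. */
          if (unlikely(vlan_tx_tag_present(skb))) {
                  if (vlan_tx_tag_get_id(skb))
                          skb->pkt_type = PACKET_OTHERHOST;
                  skb->vlan_tci = 0;
          }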
      Signed-off-by: Florian Zumbiehl <florz@florz.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48cc32d3
    • net: gro: selective flush of packets · 2e71a6f8
      Authored by Eric Dumazet
      Currently, GRO can hold packets in gro_list almost indefinitely if the
      napi->poll() handler consumes its budget over and over.
      
      In this case, napi_complete()/napi_gro_flush() are not called.
      
      Another problem is that gro_list is flushed in an unfriendly way: we
      scan the list and complete packets in reverse order (youngest packets
      first, oldest packets last), which defeats whatever priorities the
      sender could have cooked up.
      
      Since GRO currently stores only TCP packets, we don't really notice the
      bug thanks to retransmits, but this behavior can add unexpected
      latencies, particularly on mice flows clamped by elephant flows.
      
      This patch makes sure no packet can stay more than 1 ms in the queue,
      and this limit is hit only in stress situations.
      
      It also completes packets in the right order to minimize latencies.
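
      A minimal sketch of the mechanism, assuming the age field and the
      flush_old flag shown here (the helper oldest_first() is hypothetical):

          /* dev_gro_receive(): timestamp every packet parked in gro_list */
          NAPI_GRO_CB(skb)->age = jiffies;

          /* napi_gro_flush(napi, flush_old): with flush_old set, keep
           * packets queued during the current jiffy (~1 ms at HZ=1000)
           * and complete only the older ones, oldest first. */
          for (skb = oldest_first(napi->gro_list); skb; skb = next) {
                  next = skb->next;
                  if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
                          continue;       /* still fresh, keep coalescing */
                  napi_gro_complete(skb);
          }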
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2e71a6f8
  5. 08 October 2012, 2 commits
  6. 07 October 2012, 1 commit
    • net: remove skb recycling · acb600de
      Authored by Eric Dumazet
      Over time, the skb recycling infrastructure got little interest and
      many bugs. Generic rx path skb allocation now uses page fragments for
      efficient GRO / TCP coalescing, and recycling a tx skb for the rx path
      is not worth the pain.
      
      The last identified bug is that fat skbs can be recycled and end up
      using high-order pages after a few iterations.
      
      With help from Maxime Bizon, who pointed out that commit
      87151b86 (net: allow pskb_expand_head() to get maximum tailroom)
      introduced this regression for recycled skbs.
      
      Instead of fixing this bug, let's remove skb recycling.
      
      Drivers wanting really hot skbs should use build_skb() anyway, to
      allocate and populate the sk_buff right before netif_receive_skb().
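
      A hedged sketch of the build_skb() pattern in a driver rx path
      (rx_buf, pkt_len and netdev are hypothetical driver state):

          /* wrap the DMA-filled buffer in a freshly built skb instead of
           * recycling a tx skb */
          skb = build_skb(rx_buf->data, rx_buf->frag_size);
          if (unlikely(!skb))
                  goto drop;
          skb_reserve(skb, NET_SKB_PAD);  /* headroom */
          skb_put(skb, pkt_len);          /* payload is already in place */
          skb->protocol = eth_type_trans(skb, netdev);
          netif_receive_skb(skb);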
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      acb600de
  7. 02 October 2012, 3 commits
  8. 28 September 2012, 2 commits
    • net: use bigger pages in __netdev_alloc_frag · 69b08f62
      Authored by Eric Dumazet
      We currently use percpu order-0 pages in __netdev_alloc_frag to deliver
      the fragments used by __netdev_alloc_skb().
      
      Depending on the NIC driver and on whether the arch is 32- or 64-bit,
      this allows a page to be split into several fragments (between 1 and
      8), assuming PAGE_SIZE=4096.
      
      Switching to bigger pages (32768 bytes for the PAGE_SIZE=4096 case)
      allows:

      - Better filling of space (the ending hole overhead is less of an
        issue)

      - Fewer calls to the page allocator, and fewer accesses to
        page->_count

      - Future changes to struct skb_shared_info without major performance
        impact
      
      This patch implements a transparent fallback to smaller
      pages in case of memory pressure.
      
      It also uses a standard "struct page_frag" instead of a custom one.
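
      A sketch of the allocation with fallback (gfp details assumed;
      NETDEV_FRAG_PAGE_MAX_ORDER names the order-3 choice):

          gfp_t gfp = gfp_mask | __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
          struct page *page;

          /* try one 32KB compound page first... */
          page = alloc_pages_node(NUMA_NO_NODE, gfp,
                                  NETDEV_FRAG_PAGE_MAX_ORDER);
          /* ...and fall back transparently to an order-0 page when memory
           * pressure makes the high-order allocation fail */
          if (unlikely(!page))
                  page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);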
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69b08f62
    • net: remove sk_init() helper · e2bcabec
      Authored by Eric Dumazet
      It seems sk_init() has no value today and even does strange things:
      
      # grep . /proc/sys/net/core/?mem_*
      /proc/sys/net/core/rmem_default:212992
      /proc/sys/net/core/rmem_max:131071
      /proc/sys/net/core/wmem_default:212992
      /proc/sys/net/core/wmem_max:131071
      
      We can remove it completely.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Shan Wei <davidshan@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e2bcabec
  9. 27 September 2012, 2 commits
    • make get_file() return its argument · cb0942b8
      Authored by Al Viro
      simplifies a bunch of callers...
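
      For instance, in an illustrative caller:

          /* before */
          get_file(file);
          return file;

          /* after */
          return get_file(file);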
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cb0942b8
    • new helper: iterate_fd() · c3c073f8
      Authored by Al Viro
      Iterates through the opened files in the given descriptor table,
      calling a supplied function; we stop once non-zero is returned. The
      callback gets the struct file *, the descriptor number and the
      const void * argument passed to the iterator. It is called with
      files->file_lock held, so it is not allowed to block.
      
      tty_io, netprio_cgroup and selinux's flush_unauthorized_files() are
      converted to use it.
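
      A sketch of the netprio_cgroup-style usage (details trimmed; prioidx
      is the value being propagated to every open socket of task p):

          static int update_netprio(const void *v, struct file *file,
                                    unsigned n)
          {
                  int err;
                  struct socket *sock = sock_from_file(file, &err);

                  if (sock)   /* non-sockets are skipped; must not block */
                          sock->sk->sk_cgrp_prioidx = (u32)(unsigned long)v;
                  return 0;   /* 0 == keep iterating */
          }

          task_lock(p);
          iterate_fd(p->files, 0, update_netprio,
                     (void *)(unsigned long)prioidx);
          task_unlock(p);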
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c3c073f8
  10. 25 September 2012, 3 commits
    • net: guard tcp_set_keepalive() to tcp sockets · 3e10986d
      Authored by Eric Dumazet
      It's possible to use RAW sockets to get a crash in
      tcp_set_keepalive() / sk_reset_timer().

      The fix is to make sure the socket is a SOCK_STREAM one.
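
      A sketch of the guard in the sock_setsockopt() SO_KEEPALIVE case
      (assumed to match the patch):

          case SO_KEEPALIVE:
          #ifdef CONFIG_INET
                  if (sk->sk_protocol == IPPROTO_TCP &&
                      sk->sk_type == SOCK_STREAM) /* RAW sockets bail out */
                          tcp_set_keepalive(sk, valbool);
          #endif
                  sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
                  break;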
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e10986d
    • filter: add XOR instruction for use with X/K · 9e49e889
      Authored by Daniel Borkmann
      SKF_AD_ALU_XOR_X was added a while ago, but as an 'ancillary' operation
      that is invoked through a negative offset in K within BPF load
      operations. Since BPF_MOD has recently been added, BPF_XOR should also
      be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X might
      not be an option, since it is exposed to user space.
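
      A small userspace sketch of a classic BPF program using the new opcode
      (the filter's semantics are illustrative only):

          #include <linux/filter.h>

          struct sock_filter prog[] = {
                  /* A = first word of the packet */
                  BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 0),
                  /* A ^= K, via the new ALU opcode rather than the
                   * SKF_AD_ALU_XOR_X back door */
                  BPF_STMT(BPF_ALU | BPF_XOR | BPF_K, 0xdeadbeef),
                  BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
                  BPF_STMT(BPF_RET | BPF_K, 0xffff),  /* match: accept */
                  BPF_STMT(BPF_RET | BPF_K, 0),       /* no match: drop */
          };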
      Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e49e889
    • net: use a per task frag allocator · 5640f768
      Authored by Eric Dumazet
      We currently use a per-socket order-0 page cache for tcp_sendmsg()
      operations.

      This page is used to build fragments for skbs.

      It's done to increase the probability of coalescing small write()s
      into single segments in skbs still in the write queue (not yet sent).
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
      
      It's also quite inefficient to build 64KB TSO packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
      page allocator more than we'd like.
      
      This patch adds a per-task frag allocator and uses bigger pages, if
      available (up to 32768 bytes per frag, that's order-3 pages on x86).
      An automatic fallback is done in case of memory pressure.
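
      In tcp_sendmsg() terms, the shape is roughly as follows (the helper
      names are assumed to be the ones introduced by the patch):

          struct page_frag *pfrag = sk_page_frag(sk); /* usually the
                                                         task's frag */

          if (!sk_page_frag_refill(sk, pfrag))    /* tries order-3 pages,
                                                     falls back to order-0 */
                  goto wait_for_memory;

          err = skb_copy_to_page_nocache(sk, from, skb, pfrag->page,
                                         pfrag->offset, copy);
          if (!err)
                  pfrag->offset += copy;  /* next chunk reuses the page */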
      
      This increases TCP stream performance by 20% on the loopback device,
      but also benefits other network devices, since 8x fewer frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      It's possible some SG-enabled hardware can't cope with bigger
      fragments, but its ndo_start_xmit() should already handle this by
      splitting a fragment into sub-fragments, since some arches have
      PAGE_SIZE=65536.
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5640f768
  11. 21 September 2012, 1 commit
    • net: do not disable sg for packets requiring no checksum · c0d680e5
      Authored by Ed Cashin
      A change in a series of VLAN-related changes appears to have
      inadvertently disabled the use of the scatter gather feature of
      network cards for the transmission of non-IP ethernet protocols like
      ATA over Ethernet (AoE). Below is a reference to the commit that
      introduces a "harmonize_features" function that turns off scatter
      gather when the NIC does not support hardware checksumming for the
      ethernet protocol of an sk_buff.
      
        commit f01a5236
        Author: Jesse Gross <jesse@nicira.com>
        Date:   Sun Jan 9 06:23:31 2011 +0000
      
            net offloading: Generalize netif_get_vlan_features().
      
      The can_checksum_protocol function is not equipped to consider a
      protocol that does not require checksumming.  Calling it for a
      protocol that requires no checksum is inappropriate.
      
      The patch below makes harmonize_features call can_checksum_protocol
      only when the protocol needs a checksum, so that the network layer is
      not forced to perform unnecessary skb linearization for the
      transmission of AoE packets. Unnecessary linearization results in
      decreased performance and increased memory pressure, as reported here:
      
        http://www.spinics.net/lists/linux-mm/msg15184.html
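
      Reconstructed from the description above (the skb->ip_summed test is
      the new guard), harmonize_features would look roughly like:

          static netdev_features_t harmonize_features(struct sk_buff *skb,
                  __be16 protocol, netdev_features_t features)
          {
                  if (skb->ip_summed != CHECKSUM_NONE &&  /* new guard */
                      !can_checksum_protocol(features, protocol)) {
                          features &= ~NETIF_F_ALL_CSUM;
                          features &= ~NETIF_F_SG;
                  } else if (illegal_highdma(skb->dev, skb)) {
                          features &= ~NETIF_F_SG;
                  }
                  return features;
          }

      AoE frames carry ip_summed == CHECKSUM_NONE, so they keep NETIF_F_SG
      and avoid the linearization described above.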
      
      The problem has probably not been widely experienced yet, because
      only recently has the kernel.org-distributed aoe driver acquired the
      ability to use payloads of over a page in size, with the patchset
      recently included in the mm tree:
      
        https://lkml.org/lkml/2012/8/28/140
      
      The coraid.com-distributed aoe driver already could use payloads of
      greater than a page in size, but its users generally do not use the
      newest kernels.
      Signed-off-by: Ed Cashin <ecashin@coraid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0d680e5
  12. 20 September 2012, 6 commits
  13. 19 September 2012, 1 commit
  14. 18 September 2012, 1 commit
    • userns: Convert the audit loginuid to be a kuid · e1760bd5
      Authored by Eric W. Biederman
      Always store audit loginuids in type kuid_t.
      
      Print loginuids by converting them into uids in the appropriate user
      namespace, and then printing the resulting uid.
      
      Modify audit_get_loginuid to return a kuid_t.
      
      Modify audit_set_loginuid to take a kuid_t.
      
      Modify /proc/<pid>/loginuid on read to convert the loginuid into the
      user namespace of the opener of the file.

      Modify /proc/<pid>/loginuid on write to convert the loginuid
      from the user namespace of the opener of the file.
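
      A sketch of both directions (proc handler shapes assumed):

          /* read: map the stored kuid_t into the opener's user namespace */
          uid_t uid = from_kuid(file->f_cred->user_ns,
                                audit_get_loginuid(task));
          length = scnprintf(tmpbuf, TMPBUFLEN, "%u", uid);

          /* write: map the other way before storing */
          kuid_t kloginuid = make_kuid(file->f_cred->user_ns, loginuid);
          if (!uid_valid(kloginuid))
                  return -EINVAL;
          rv = audit_set_loginuid(task, kloginuid);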
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      e1760bd5
  15. 17 September 2012, 3 commits
  16. 15 September 2012, 4 commits
    • cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Authored by Tejun Heo
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of the cgroup hierarchy make using
      cgroup confusing and make it impossible to co-mount controllers into
      the same hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy. Users using separate hierarchies and
      expecting completely different behaviors depending on the mounted
      subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
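
      A minimal sketch of what an affected controller declares (other
      fields elided):

          struct cgroup_subsys freezer_subsys = {
                  .name                 = "freezer",
                  /* ... ops elided ... */
                  .broken_hierarchy     = true,   /* warn when cgroups are
                                                     nested under it */
          };

      The cgroup core can then emit a one-time warning when a child cgroup
      is created under such a subsystem.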
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
    • cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Authored by Daniel Wagner
      WARNING: With this change it is impossible to load externally built
      controllers anymore.
      
      In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
      are set, the corresponding subsys_ids should also be constants. Up to
      now, net_prio_subsys_id and net_cls_subsys_id were of type int and
      their values were assigned during runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_ids get constant values. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id at runtime.
      
      A close look is necessary at the RCU part, which was introduced by
      the following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        This code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      There was no explicit explanation of why the RCU part is needed. (The
      rcu_dereference() was changed to rcu_dereference_index_check() in a
      later commit, but that is a minor detail.)
      
      So here is my pondering on why it was introduced and why it is safe
      to remove it now. Note that this code was copied over to net_prio, so
      the reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      just blindly accesses the subsys array and returns the pointer.
      Obviously, passing -1 as the id into task_subsys_state() returns an
      invalid value (below the array's lower bound).
      
      So this code makes sure that the id is assigned only after the module
      is loaded and the subsystem registered.
      
      Before unregistering the module, all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing
      a synchronize_rcu(). Any new readers won't call task_subsys_state()
      anymore, and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer returned by task_subsys_state() (remember, the id is constant
      and therefore we always have a valid index into the subsys array).
      
      No precautions need to be taken during module loading. Eventually,
      all CPUs will get a valid pointer back from task_subsys_state(),
      because rebind_subsystems(), which is called after the module's
      init() function, will assign subsys[net_cls_subsys_id] the newly
      loaded module's subsystem pointer.
      
      When the subsystem is about to be removed, rebind_subsystems() will
      be called before the module's exit() function. In this case,
      rebind_subsystems() will assign subsys[net_cls_subsys_id] a NULL
      pointer and then call synchronize_rcu(). All old readers will have
      left the critical section by then, and any new reader won't access
      the subsystem anymore. At this point we are safe to unregister the
      subsystem; no additional synchronize_rcu() call is needed.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8a8e04df
    • cgroup: net_prio: Do not define task_netprioidx() when not selected · 51e4e7fa
      Authored by Daniel Wagner
      task_netprioidx() should not be defined when the configuration is
      CONFIG_NETPRIO_CGROUP=n. The reason is that in a following patch the
      net_prio_subsys_id will only be defined if CONFIG_NETPRIO_CGROUP!=n.
      When net_prio is not built at all, any caller should only get an
      empty task_netprioidx() without any references to net_prio_subsys_id.
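
      A sketch of the resulting header shape (simplified; net_cls gets the
      same treatment in the next patch):

          #if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
          static inline u32 task_netprioidx(struct task_struct *p)
          {
                  struct cgroup_netprio_state *state;
                  u32 idx;

                  rcu_read_lock();
                  state = container_of(task_subsys_state(p,
                                          net_prio_subsys_id),
                                       struct cgroup_netprio_state, css);
                  idx = state->prioidx;
                  rcu_read_unlock();
                  return idx;
          }
          #else
          static inline u32 task_netprioidx(struct task_struct *p)
          {
                  return 0;       /* no reference to net_prio_subsys_id */
          }
          #endif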
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      51e4e7fa
    • cgroup: net_cls: Do not define task_cls_classid() when not selected · 8fb974c9
      Authored by Daniel Wagner
      task_cls_classid() should not be defined when the configuration is
      CONFIG_NET_CLS_CGROUP=n. The reason is that in a following patch the
      net_cls_subsys_id will only be defined if CONFIG_NET_CLS_CGROUP!=n.
      When net_cls is not built at all, a caller should only get an empty
      task_cls_classid() without any references to net_cls_subsys_id.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8fb974c9
  17. 14 September 2012, 1 commit
    • pktgen: fix crash with vlan and packet size less than 46 · 6af773e7
      Authored by Nishank Trivedi
      If the vlan option is specified in pktgen and the requested packet
      size is less than 46 bytes, the request is illogical, but pktgen
      should still not crash the kernel.
      
      BUG: unable to handle kernel paging request at ffff88021fb82000
      Process kpktgend_0 (pid: 1184, threadinfo ffff880215f1a000, task ffff880218544530)
      Call Trace:
      [<ffffffffa0637cd2>] ? pktgen_finalize_skb+0x222/0x300 [pktgen]
      [<ffffffff814f0084>] ? build_skb+0x34/0x1c0
      [<ffffffffa0639b11>] pktgen_thread_worker+0x5d1/0x1790 [pktgen]
      [<ffffffffa03ffb10>] ? igb_xmit_frame_ring+0xa30/0xa30 [igb]
      [<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
      [<ffffffff8107ba20>] ? wake_up_bit+0x40/0x40
      [<ffffffffa0639540>] ? spin+0x240/0x240 [pktgen]
      [<ffffffff8107b4e3>] kthread+0x93/0xa0
      [<ffffffff81615de4>] kernel_thread_helper+0x4/0x10
      [<ffffffff8107b450>] ? flush_kthread_worker+0x80/0x80
      [<ffffffff81615de0>] ? gs_change+0x13/0x13
      
      The root cause of why pktgen cannot handle this case is a comparison
      between signed (datalen) and unsigned (sizeof) values: the negative
      datalen is promoted to a huge unsigned number, which is eventually
      passed to skb_put().
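
      An illustrative reduction of the bug (variable names hypothetical):

          int datalen = pkt_size - 14 - 20 - 8; /* goes negative for tiny
                                                   vlan packets */

          /* BUG: datalen is promoted to unsigned for the comparison, so a
           * negative value compares as huge and the clamp never fires */
          if (datalen < sizeof(struct pktgen_hdr))
                  datalen = sizeof(struct pktgen_hdr);

          skb_put(skb, datalen);  /* huge length -> paging request oops */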
      Signed-off-by: Nishank Trivedi <nistrive@cisco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6af773e7