1. 08 Oct, 2013 1 commit
    • net: fix unsafe set_memory_rw from softirq · d45ed4a4
      Alexei Starovoitov committed
      on x86 system with net.core.bpf_jit_enable = 1
      
      sudo tcpdump -i eth1 'tcp port 22'
      
      causes the warning:
      [   56.766097]  Possible unsafe locking scenario:
      [   56.766097]
      [   56.780146]        CPU0
      [   56.786807]        ----
      [   56.793188]   lock(&(&vb->lock)->rlock);
      [   56.799593]   <Interrupt>
      [   56.805889]     lock(&(&vb->lock)->rlock);
      [   56.812266]
      [   56.812266]  *** DEADLOCK ***
      [   56.812266]
      [   56.830670] 1 lock held by ksoftirqd/1/13:
      [   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
      [   56.849757]
      [   56.849757] stack backtrace:
      [   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
      [   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
      [   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
      [   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
      [   56.923006] Call Trace:
      [   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
      [   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
      [   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
      [   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
      [   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
      [   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
      [   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
      [   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
      [   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
      [   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
      [   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
      [   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
      [   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
      [   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
      [   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
      [   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
      [   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
      [   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
      [   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
      [   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
      [   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
      [   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
      
      cannot reuse jited filter memory, since it's readonly,
      so use original bpf insns memory to hold work_struct
      
      defer kfree of sk_filter until jit completed freeing
      
      tested on x86_64 and i386
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 Jun, 2013 1 commit
  3. 20 May, 2013 1 commit
  4. 02 May, 2013 1 commit
    • filter: fix va_list build error · 20074f35
      Xi Wang committed
      This patch fixes the following build error.
      
      In file included from include/linux/filter.h:52:0,
                       from arch/arm/net/bpf_jit_32.c:14:
      include/linux/printk.h:54:2: error: unknown type name ‘va_list’
      include/linux/printk.h:105:21: error: unknown type name ‘va_list’
      include/linux/printk.h:108:30: error: unknown type name ‘va_list’
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 30 Mar, 2013 1 commit
  6. 22 Mar, 2013 1 commit
    • filter: bpf_jit_comp: refactor and unify BPF JIT image dump output · 79617801
      Daniel Borkmann committed
      If bpf_jit_enable > 1, then we dump the emitted JIT compiled image
      after creation. Currently, only SPARC and PowerPC have similar output
      as in the reference implementation on x86_64. Make a small helper
      function in order to reduce duplicated code and make the dump output
      uniform across architectures x86_64, SPARC, PPC, ARM (e.g. on ARM
      flen, pass and proglen are currently not shown, but would be
      interesting to know as well), also for future BPF JIT implementations
      on other archs.
      
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Matt Evans <matt@ozlabs.org>
      Cc: Eric Dumazet <eric.dumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 21 Mar, 2013 1 commit
  8. 01 Nov, 2012 2 commits
    • sk-filter: Add ability to get socket filter program (v2) · a8fc9277
      Pavel Emelyanov committed
      The SO_ATTACH_FILTER option is set-only. I propose to add the get
      ability by using SO_ATTACH_FILTER in getsockopt. To be less
      irritating to eyes the SO_GET_FILTER alias to it is declared. This
      ability is required by checkpoint-restore project to be able to
      save full state of a socket.
      
      There are two issues with getting filter back.
      
      First, kernel modifies the sock_filter->code on filter load, thus in
      order to return the filter element back to user we have to decode it
      into user-visible constants. Fortunately the modification in question
      is interconvertible.
      
      Second, the BPF_S_ALU_DIV_K code modifies the command argument k to
      speed up the run-time division by doing kernel_k = reciprocal(user_k).
      Bad news is that different user_k may result in same kernel_k, so we
      can't get the original user_k back. Good news is that we don't have
      to do it. What we need to do is calculate a user2_k such that
      
        reciprocal(user2_k) == reciprocal(user_k) == kernel_k
      
      i.e. if it is loaded again, the recompiled value will be exactly
      the same as it was. That said, the user2_k can be calculated like this
      
        user2_k = reciprocal(kernel_k)
      
      with an exception, that if kernel_k == 0, then user2_k == 1.
      
      The optlen argument is treated like this -- when zero, kernel returns
      the amount of sock_fprog elements in filter, otherwise it should be
      large enough for the sock_fprog array.
      
      changes since v1:
      * Declared SO_GET_FILTER in all arch headers
      * Added decode of vlan-tag codes
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
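The round trip described in the commit above (a8fc9277) can be sketched in user space. This is a minimal illustration, assuming the reciprocal helper of that era computed the ceiling of 2^32 / k truncated to 32 bits; the names `reciprocal` and `decode_div_k` are chosen for this example and are not taken from the patch.

```c
#include <stdint.h>

/*
 * Ceiling of 2^32 / k, truncated to 32 bits. Assumed to mirror the
 * kernel's reciprocal helper at the time; k must be non-zero.
 */
static uint32_t reciprocal(uint32_t k)
{
	uint64_t val = (1ULL << 32) + (k - 1);

	return (uint32_t)(val / k);
}

/*
 * Recover a divisor user2_k that compiles back to the same kernel_k.
 * It need not equal the divisor the user originally supplied.
 */
static uint32_t decode_div_k(uint32_t kernel_k)
{
	if (kernel_k == 0)	/* reciprocal(1) truncates to 0 */
		return 1;
	return reciprocal(kernel_k);
}
```

With these definitions, `reciprocal(0x80000001)` and `reciprocal(0x80000000)` both yield 2, and `decode_div_k(2)` returns `0x80000000`: a different divisor from the first one, but one that recompiles to the same kernel value.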
    • net: filter: add vlan tag access · f3335031
      Eric Dumazet committed
      BPF filters lack ability to access skb->vlan_tci
      
      This patch adds two new ancillary accessors :
      
      SKF_AD_VLAN_TAG         (44) mapped to vlan_tx_tag_get(skb)
      
      SKF_AD_VLAN_TAG_PRESENT (48) mapped to vlan_tx_tag_present(skb)
      
      This allows libpcap/tcpdump to use a kernel filter instead of
      having to fallback to accept all packets, then filter them in
      user space.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Ani Sinha <ani@aristanetworks.com>
      Suggested-by: Daniel Borkmann <danborkmann@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
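As an illustration of the accessors added in the commit above (f3335031), a classic BPF program that keeps only VLAN-tagged packets might look like the sketch below. It assumes a Linux user-space build where `<linux/filter.h>` provides `SKF_AD_VLAN_TAG_PRESENT`; the names `vlan_only` and `vlan_only_prog` are made up for this example.

```c
#include <stddef.h>
#include <linux/filter.h>

/* Accept only packets for which the kernel reports a VLAN tag. */
static const struct sock_filter vlan_only[] = {
	/* A = "VLAN tag present" ancillary value for this packet */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
	/* if (A == 0) skip one instruction and land on the drop */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* accept */
	BPF_STMT(BPF_RET | BPF_K, 0),		/* drop */
};

/* Program descriptor in the form expected by SO_ATTACH_FILTER. */
static const struct sock_fprog vlan_only_prog = {
	.len = sizeof(vlan_only) / sizeof(vlan_only[0]),
	.filter = (struct sock_filter *)vlan_only,
};
```

A descriptor like this would typically be passed to `setsockopt()` with `SO_ATTACH_FILTER`; that call is left out here so the sketch stays free of socket setup.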
  9. 13 Oct, 2012 1 commit
  10. 25 Sep, 2012 1 commit
  11. 11 Sep, 2012 1 commit
  12. 14 Apr, 2012 2 commits
  13. 04 Apr, 2012 2 commits
  14. 20 Oct, 2011 1 commit
  15. 27 Jul, 2011 1 commit
  16. 23 May, 2011 1 commit
  17. 28 Apr, 2011 1 commit
    • net: filter: Just In Time compiler for x86-64 · 0a14842f
      Eric Dumazet committed
      In order to speed up packet filtering, here is an implementation of a
      JIT compiler for x86_64
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range since we call helper functions from the generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of generated code, use :
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code :
      
      # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      BPF program is 144 bytes long, so native program is almost same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
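The size comparison in the commit above (0a14842f) follows from the instruction layout: each classic BPF instruction is one `struct sock_filter`, so the 18-instruction program occupies 144 bytes, against proglen=147 for the native image. A small sketch of that arithmetic, assuming a Linux user-space build (`bpf_prog_bytes` is a name invented here):

```c
#include <stddef.h>
#include <linux/filter.h>

/* Size in bytes of a classic BPF program with flen instructions. */
static size_t bpf_prog_bytes(unsigned int flen)
{
	return flen * sizeof(struct sock_filter);
}
```

Here `bpf_prog_bytes(18)` evaluates to 144, the figure quoted in the commit message.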
  18. 09 Dec, 2010 1 commit
  19. 07 Dec, 2010 1 commit
  20. 20 Nov, 2010 1 commit
    • filter: optimize sk_run_filter · 93aaae2e
      Eric Dumazet committed
      Remove pc variable to avoid arithmetic to compute fentry at each filter
      instruction. Jumps directly manipulate fentry pointer.
      
      As the last instruction of filter[] is guaranteed to be a RETURN, and
      all jumps are before the last instruction, we don't need to check filter
      bounds (number of instructions in filter array) at each iteration, so we
      remove it from sk_run_filter() params.
      
      On x86_32 remove f_k var introduced in commit 57fe93b3
      (filter: make sure filters dont read uninitialized memory)
      
      Note : We could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS in order to
      avoid too many ifdefs in this code.
      
      This helps compiler to use cpu registers to hold fentry and A
      accumulator.
      
      On x86_32, this saves 401 bytes and, more importantly, sk_run_filter()
      runs much faster because of lower register pressure (one less
      conditional branch per BPF instruction)
      
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         2948       0       0    2948     b84 net/core/filter.o
         3349       0       0    3349     d15 net/core/filter_pre.o
      
      on x86_64 :
      # size net/core/filter.o net/core/filter_pre.o
         text    data     bss     dec     hex filename
         5173       0       0    5173    1435 net/core/filter.o
         5224       0       0    5224    1468 net/core/filter_pre.o
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
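The loop shape described in the commit above (93aaae2e) can be sketched with a toy interpreter. The opcodes, `struct insn`, and the sample programs below are hypothetical and much smaller than the kernel's instruction set; the point is only that the instruction pointer is advanced directly and that no length parameter is needed once a checker guarantees the program ends in a return.

```c
#include <stdint.h>

/* Hypothetical toy opcodes, not the kernel's BPF_S_* values. */
enum { OP_LD_IMM, OP_ADD_K, OP_JEQ_K, OP_RET_A, OP_RET_K };

struct insn {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

/* Runs a program that is assumed to be validated already. */
static uint32_t run_filter(const struct insn *fentry)
{
	uint32_t A = 0;

	/* No pc index and no bounds check: the last insn is a return. */
	for (;; fentry++) {
		switch (fentry->code) {
		case OP_LD_IMM:
			A = fentry->k;
			continue;
		case OP_ADD_K:
			A += fentry->k;
			continue;
		case OP_JEQ_K:
			/* A jump moves the instruction pointer itself. */
			fentry += (A == fentry->k) ? fentry->jt : fentry->jf;
			continue;
		case OP_RET_A:
			return A;
		case OP_RET_K:
			return fentry->k;
		default:
			return 0;
		}
	}
}

/* 5 + 2 == 7, so the jump falls through to "return A". */
static const struct insn demo_match[] = {
	{ OP_LD_IMM, 0, 0, 5 }, { OP_ADD_K, 0, 0, 2 },
	{ OP_JEQ_K, 0, 1, 7 }, { OP_RET_A, 0, 0, 0 }, { OP_RET_K, 0, 0, 0 },
};

/* 5 + 2 != 8, so the jump skips to "return 0". */
static const struct insn demo_miss[] = {
	{ OP_LD_IMM, 0, 0, 5 }, { OP_ADD_K, 0, 0, 2 },
	{ OP_JEQ_K, 0, 1, 8 }, { OP_RET_A, 0, 0, 0 }, { OP_RET_K, 0, 0, 0 },
};
```

Running `run_filter(demo_match)` returns 7 and `run_filter(demo_miss)` returns 0. Safety rests entirely on the load-time check, which is the trade-off the commit message describes.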
  21. 19 Nov, 2010 1 commit
  22. 26 Jun, 2010 1 commit
    • net: optimize Berkeley Packet Filter (BPF) processing · 01f2f3f6
      Hagen Paul Pfeifer committed
      GCC is currently unable to optimize the switch statement in
      sk_run_filter() because the OR'd case labels are sparse. This patch
      replaces the OR'd labels with sequentially ordered case labels. The
      sk_chk_filter() function is modified to replace the original opcodes
      with an ordered but equivalent form. GCC can now transform the switch
      statement in sk_run_filter() into a jump table of complexity O(1).
      
      Before this patch, GCC generated a sequence of conditional branches
      (O(n)), 567 bytes of .text (arch x86_64):
      
      7ff: 8b 06                 mov    (%rsi),%eax
      801: 66 83 f8 35           cmp    $0x35,%ax
      805: 0f 84 d0 02 00 00     je     adb <sk_run_filter+0x31d>
      80b: 0f 87 07 01 00 00     ja     918 <sk_run_filter+0x15a>
      811: 66 83 f8 15           cmp    $0x15,%ax
      815: 0f 84 c5 02 00 00     je     ae0 <sk_run_filter+0x322>
      81b: 77 73                 ja     890 <sk_run_filter+0xd2>
      81d: 66 83 f8 04           cmp    $0x4,%ax
      821: 0f 84 17 02 00 00     je     a3e <sk_run_filter+0x280>
      827: 77 29                 ja     852 <sk_run_filter+0x94>
      829: 66 83 f8 01           cmp    $0x1,%ax
      [...]
      
      With the modification the compiler translates the switch statement into
      the following jump table fragment:
      
      7ff: 66 83 3e 2c           cmpw   $0x2c,(%rsi)
      803: 0f 87 1f 02 00 00     ja     a28 <sk_run_filter+0x26a>
      809: 0f b7 06              movzwl (%rsi),%eax
      80c: ff 24 c5 00 00 00 00  jmpq   *0x0(,%rax,8)
      813: 44 89 e3              mov    %r12d,%ebx
      816: e9 43 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
      81b: 41 89 dc              mov    %ebx,%r12d
      81e: e9 3b 03 00 00        jmpq   b5e <sk_run_filter+0x3a0>
      
      Furthermore, I reordered the instructions to reduce cache line misses by
      placing the most common instructions first.
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
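The load-time translation described in the commit above (01f2f3f6) can be sketched as a mapping from the sparse, OR'd user opcodes to small consecutive internal codes. The `S_*` names and `to_dense` are hypothetical and cover only a handful of opcodes; the sketch assumes a Linux user-space build with `<linux/filter.h>`.

```c
#include <stdint.h>
#include <linux/filter.h>

/* Hypothetical dense internal codes; 0 is reserved for "invalid". */
enum {
	S_INVALID,
	S_RET_K,
	S_RET_A,
	S_ALU_ADD_K,
	S_LD_W_ABS,
	S_JMP_JEQ_K,
};

/*
 * Translate a user-visible opcode once, when the filter is checked.
 * The sparse switch here runs at load time only; the per-packet
 * interpreter can then switch on the dense result.
 */
static uint16_t to_dense(uint16_t code)
{
	switch (code) {
	case BPF_RET | BPF_K:
		return S_RET_K;
	case BPF_RET | BPF_A:
		return S_RET_A;
	case BPF_ALU | BPF_ADD | BPF_K:
		return S_ALU_ADD_K;
	case BPF_LD | BPF_W | BPF_ABS:
		return S_LD_W_ABS;
	case BPF_JMP | BPF_JEQ | BPF_K:
		return S_JMP_JEQ_K;
	default:
		return S_INVALID;
	}
}
```

For example, the user opcode `0x15` (the `cmp $0x15` in the first listing of the commit message) is `BPF_JMP | BPF_JEQ | BPF_K`, which this sketch maps to `S_JMP_JEQ_K`.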
  23. 23 Apr, 2010 1 commit
  24. 05 Nov, 2009 1 commit
  25. 20 Oct, 2009 2 commits
  26. 20 Nov, 2008 1 commit
    • filter: add SKF_AD_NLATTR_NEST to look for nested attributes · d214c753
      Pablo Neira Ayuso committed
      SKF_AD_NLATTR allows us to find the first matching attribute in a
      stream of netlink attributes from one offset to the end of the
      netlink message. This is not suitable to look for a specific
      matching inside a set of nested attributes.
      
      For example, in ctnetlink messages, if we look for the CTA_V6_SRC
      attribute in a message that talks about an IPv4 connection,
      SKF_AD_NLATTR returns the offset of CTA_STATUS which has the same
      value of CTA_V6_SRC but outside the nest. To differentiate
      CTA_STATUS and CTA_V6_SRC, we would have to make assumptions on the
      size of the attribute and the usual offset, resulting in horrible
      BPF code.
      
      This patch adds SKF_AD_NLATTR_NEST, which is a variant of
      SKF_AD_NLATTR that looks for an attribute inside the limits of
      a nested attribute, but not further.
      
      This patch validates that we have enough room to look for the
      nested attributes - based on a suggestion from Patrick McHardy.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
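A filter fragment using the accessor from the commit above (d214c753) might look like the sketch below. It is a tail fragment only: it assumes earlier instructions (not shown) have left the offset of the nest attribute in A, as the SKF_AD_NLATTR examples elsewhere in this log do. `EXAMPLE_INNER_TYPE` and `find_in_nest` are hypothetical names, and the sketch assumes a Linux user-space build with `<linux/filter.h>`.

```c
#include <stddef.h>
#include <linux/filter.h>

/* Hypothetical attribute type, used only for this illustration. */
#define EXAMPLE_INNER_TYPE 3

/* Assumes A already holds the offset of the nest attribute. */
static const struct sock_filter find_in_nest[] = {
	/* X = attribute type to look for inside the nest */
	BPF_STMT(BPF_LDX | BPF_IMM, EXAMPLE_INNER_TYPE),
	/* A = offset of the matching nested attribute, or 0 */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 SKF_AD_OFF + SKF_AD_NLATTR_NEST),
	/* A == 0 means "not found": fall through to the drop */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0),		/* drop */
	BPF_STMT(BPF_RET | BPF_K, 0xffff),	/* accept */
};
```

Because the search is confined to the nest, an attribute outside it that happens to share the same type value is not matched.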
  27. 10 Apr, 2008 3 commits
    • [SKFILTER]: Add SKF_ADF_NLATTR instruction · 4738c1db
      Patrick McHardy committed
      SKF_AD_NLATTR searches for a netlink attribute, which avoids manually
      parsing and walking attributes. It takes the offset at which to start
      searching in the 'A' register and the attribute type in the 'X' register
      and returns the offset in the 'A' register. When the attribute is not
      found it returns zero.
      
      A top-level attribute can be located using a filter like this
      (example for nfnetlink, using struct nfgenmsg):
      
      	...
      	{
      		/* A = offset of first attribute */
      		.code	= BPF_LD | BPF_IMM,
      		.k	= sizeof(struct nlmsghdr) + sizeof(struct nfgenmsg)
      	},
      	{
      		/* X = CTA_PROTOINFO */
      		.code	= BPF_LDX | BPF_IMM,
      		.k	= CTA_PROTOINFO,
      	},
      	{
      		/* A = netlink attribute offset */
      		.code	= BPF_LD | BPF_B | BPF_ABS,
      		.k	= SKF_AD_OFF + SKF_AD_NLATTR
      	},
      	{
      		/* Exit if not found */
      		.code   = BPF_JMP | BPF_JEQ | BPF_K,
      		.k	= 0,
      		.jt	= <error>
      	},
      	...
      
      A nested attribute below the CTA_PROTOINFO attribute would then
      be parsed like this:
      
      	...
      	{
      		/* A += sizeof(struct nlattr) */
      		.code	= BPF_ALU | BPF_ADD | BPF_K,
      		.k	= sizeof(struct nlattr),
      	},
      	{
      		/* X = CTA_PROTOINFO_TCP */
      		.code	= BPF_LDX | BPF_IMM,
      		.k	= CTA_PROTOINFO_TCP,
      	},
      	{
      		/* A = netlink attribute offset */
      		.code	= BPF_LD | BPF_B | BPF_ABS,
      		.k	= SKF_AD_OFF + SKF_AD_NLATTR
      	},
      	...
      
      The data of an attribute can be loaded into 'A' like this:
      
      	...
      	{
      		/* X = A (attribute offset) */
      		.code	= BPF_MISC | BPF_TAX,
      	},
      	{
      		/* A = skb->data[X + k] */
      		.code 	= BPF_LD | BPF_B | BPF_IND,
      		.k	= sizeof(struct nlattr),
      	},
      	...
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • socket: sk_filter deinline · 43db6d65
      Stephen Hemminger committed
      The sk_filter function is too big to be inlined. This saves 2296 bytes
      of text on allyesconfig.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • socket: sk_filter minor cleanups · b715631f
      Stephen Hemminger committed
      Some minor style cleanups:
        * Move __KERNEL__ definitions to one place in filter.h
        * Use const for sk_filter_len
        * Line wrapping
        * Put EXPORT_SYMBOL next to function definition
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 18 Oct, 2007 1 commit
  29. 23 Sep, 2006 1 commit
  30. 07 Jan, 2006 1 commit
  31. 17 Apr, 2005 1 commit
    • Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds committed
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!