1. 25 Feb, 2016 4 commits
  2. 24 Feb, 2016 1 commit
  3. 22 Feb, 2016 8 commits
    • qed: Introduce DMA_REGPAIR_LE · 94494598
      Yuval Mintz committed
      FW hsi contains regpairs, mostly for 64-bit address representations.
      Since same paradigm is applied each time a regpair is filled, this
      introduces a new utility macro for setting such regpairs.
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
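The regpair idea described in the commit above can be sketched in plain userspace C. This is only an illustration: `struct regpair`, `SET_REGPAIR` and `make_regpair` are hypothetical stand-ins, and the little-endian conversion that a macro named DMA_REGPAIR_LE presumably performs is deliberately left out to keep the sketch portable.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for a firmware register pair. */
struct regpair {
    uint32_t lo; /* low 32 bits of the 64-bit value */
    uint32_t hi; /* high 32 bits of the 64-bit value */
};

/* Split a 64-bit value into the two halves of a regpair.
 * Assumption: no byte-order conversion is done here. */
#define SET_REGPAIR(x, val)                                 \
    do {                                                    \
        (x).hi = (uint32_t)((uint64_t)(val) >> 32);         \
        (x).lo = (uint32_t)((uint64_t)(val) & 0xffffffffu); \
    } while (0)

/* Convenience wrapper so the macro can be exercised as a function. */
static struct regpair make_regpair(uint64_t addr)
{
    struct regpair rp;

    SET_REGPAIR(rp, addr);
    return rp;
}
```

Having one macro for this keeps every call site from open-coding the same shift-and-mask pair.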
    • bpf: fix csum update in bpf_l4_csum_replace helper for udp · 2f72959a
      Daniel Borkmann committed
      When using this helper for updating UDP checksums, we need to extend
      this in order to write CSUM_MANGLED_0 for csum computations that result
      into 0 as sum. Reason we need this is because packets with a checksum
      could otherwise become incorrectly marked as a packet without a checksum.
      Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
      not turn packets without a checksum into ones with a checksum.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
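The two rules in the commit above (never store 0 as a computed UDP checksum, never give a checksum to a packet that had none) can be sketched as follows. The function name `store_udp_csum` and its boolean parameter are hypothetical; only the constant name CSUM_MANGLED_0 comes from the commit text, and its value here is the usual all-ones representation of zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* All-ones value used on the wire in place of a computed sum of 0. */
#define CSUM_MANGLED_0 0xffffu

/* Store a freshly computed checksum into a UDP checksum field.
 * mark_mangled_0 models the BPF_F_MARK_MANGLED_0 request. */
static void store_udp_csum(uint16_t *field, uint16_t new_csum,
                           bool mark_mangled_0)
{
    /* A field of 0 means "no checksum": leave such packets alone. */
    if (mark_mangled_0 && *field == 0)
        return;

    *field = new_csum;

    /* A computed 0 would read as "no checksum", so mangle it. */
    if (mark_mangled_0 && *field == 0)
        *field = CSUM_MANGLED_0;
}
```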
    • bpf: try harder on clones when writing into skb · 3697649f
      Daniel Borkmann committed
      When we're dealing with clones and the area is not writeable, try
      harder and get a copy via pskb_expand_head(). Replace also other
      occurrences in tc actions with the new skb_try_make_writable().
      Reported-by: Ashhad Sheikh <ashhadsheikh394@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add generic bpf_csum_diff helper · 7d672345
      Daniel Borkmann committed
      For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's
      currently limited to handle 2 and 4 byte changes in a header and feeds the
      from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
      working with IPv6, for example, this makes it rather cumbersome to deal
      with, similarly when editing larger parts of a header.
      
      Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
      add a case for header field mask of 0 to change the checksum at a given
      offset through inet_proto_csum_replace_by_diff(), and provide a helper
      bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
      amounts of data.
      
      This can be used in multiple ways: for the bpf_l4_csum_replace() only
      part, this even provides us with the option to insert precalculated diffs
      from user space f.e. from a map, or from bpf_csum_diff() during runtime.
      
      bpf_csum_diff() has an optional from/to stack buffer input, so we can
      calculate a diff by using a scratchbuffer for scenarios where we're
      inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
      don't need to be of equal size) data. Also, bpf_csum_diff() allows to
      feed a previous csum into csum_partial(), so the function can also be
      cascaded.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
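The from/to diff described above rests on ones' complement arithmetic: adding the complement of the old words and then the new words yields a value that can be folded into an existing checksum. The sketch below is a userspace illustration, not the kernel helper; `csum_add32`, `csum_words` and `csum_diff` are hypothetical names, and operating on whole 32-bit words is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement addition of two 32-bit partial sums. */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a + b;

    /* Fold the carry back in (end-around carry). */
    return (uint32_t)((s & 0xffffffffu) + (s >> 32));
}

/* Partial checksum over n words, continuing from seed. */
static uint32_t csum_words(const uint32_t *w, size_t n, uint32_t seed)
{
    uint32_t sum = seed;

    for (size_t i = 0; i < n; i++)
        sum = csum_add32(sum, w[i]);
    return sum;
}

/* Diff between "from" and "to": subtract the old words (by adding
 * their complement), then add the new ones. Either side may be
 * empty (NULL with a count of 0) to model insertion or removal. */
static uint32_t csum_diff(const uint32_t *from, size_t from_words,
                          const uint32_t *to, size_t to_words,
                          uint32_t seed)
{
    uint32_t sum = seed;

    for (size_t i = 0; i < from_words; i++)
        sum = csum_add32(sum, ~from[i]);
    for (size_t i = 0; i < to_words; i++)
        sum = csum_add32(sum, to[i]);
    return sum;
}
```

Because the result is itself a partial sum, it can be passed back in as `seed`, which is what makes the cascading mentioned in the commit message possible.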
    • bpf: add new arg_type that allows for 0 sized stack buffer · 8e2fe1d9
      Daniel Borkmann committed
      Currently, when we pass a buffer from the eBPF stack into a helper
      function, the function proto indicates argument types as ARG_PTR_TO_STACK
      and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
      must be of the latter type. Then, verifier checks whether the buffer
      points into eBPF stack, is initialized, etc. The verifier also guarantees
      that the constant value passed in R<X+1> is greater than 0, so helper
      functions don't need to test for it and can always assume a non-NULL
      initialized buffer as well as non-0 buffer size.
      
      This patch adds a new argument type ARG_CONST_STACK_SIZE_OR_ZERO that
      allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function.
      Such helper functions, of course, need to be able to handle these cases
      internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0
      or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any
      other combinations are not possible to load.
      
      I went through various options of extending the verifier, and introducing
      the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes
      needed to the verifier.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
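The invariant that the verifier is said to guarantee for the (buffer, size) argument pair can be written down as a small predicate. This is a sketch of the rule only, not verifier code; the enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two size argument flavours. */
enum size_arg_type {
    SIZE_NONZERO, /* models ARG_CONST_STACK_SIZE */
    SIZE_OR_ZERO  /* models ARG_CONST_STACK_SIZE_OR_ZERO */
};

/* Return true if the (buf, size) pair would be accepted. */
static bool pair_is_acceptable(const void *buf, size_t size,
                               enum size_arg_type type)
{
    /* A real buffer with a non-zero size is always fine. */
    if (buf != NULL && size != 0)
        return true;

    /* NULL with size 0 is only allowed by the _OR_ZERO flavour. */
    if (buf == NULL && size == 0)
        return type == SIZE_OR_ZERO;

    /* Mixed combinations (NULL with a size, buffer with 0) never load. */
    return false;
}
```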
    • VXLAN: Support outer IPv4 Tx checksums by default · 6ceb31ca
      Alexander Duyck committed
      This change makes it so that if UDP CSUM is not specified we will default
      to enabling it.  The main motivation behind this is the fact that with the
      use of outer checksum we can greatly improve the performance for VXLAN
      tunnels on devices that don't know how to parse tunnel headers.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • soc: ti: knav_dma: rename pad in struct knav_dma_desc to sw_data · b1cb86ae
      Karicheri, Muralidharan committed
      Rename the pad to sw_data as per description of this field in the hardware
      spec (refer to sprugr9 from www.ti.com). Latest version of the document is
      at http://www.ti.com/lit/ug/sprugr9h/sprugr9h.pdf and section 3.1
      Host Packet Descriptor describes this field.
      
      Define and use a constant for the size of sw_data field similar to
      other fields in the struct for desc and document the sw_data field
      in the header. As the sw_data is not touched by hw, its type can be
      changed to u32.
      
      Rename the helpers to match with the updated dma desc field sw_data.
      
      Cc: Wingman Kwok <w-kwok2@ti.com>
      Cc: Mugunthan V N <mugunthanvnm@ti.com>
      CC: Arnd Bergmann <arnd@arndb.de>
      CC: Grygorii Strashko <grygorii.strashko@ti.com>
      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lwtunnel: autoload of lwt modules · 745041e2
      Robert Shearman committed
      The lwt implementations using net devices can autoload using the
      existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
      that lwt modules not using net devices can use.
      
      Therefore, add the ability to autoload modules registering lwt
      operations for lwt implementations not using a net device so that
      users don't have to manually load the modules.
      
      Only users with the CAP_NET_ADMIN capability can cause modules to be
      loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
      messages for users without this capability, and by
      lwtunnel_build_state not being called in response to RTM_GETxxx
      messages.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 20 Feb, 2016 8 commits
    • bpf: introduce BPF_MAP_TYPE_STACK_TRACE · d5a3b1f6
      Alexei Starovoitov committed
      add new map type to store stack traces and corresponding helper
      bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error
      
      stackid is a 32-bit integer handle that can be further combined with
      other data (including other stackid) and used as a key into maps.
      
      Userspace will access stackmap using standard lookup/delete syscall commands to
      retrieve full stack trace for given stackid.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
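The @flags layout listed above can be decoded with a few masks. The sketch below only mirrors the bit positions from the commit message; the macro names, the `stackid_opts` struct and `decode_stackid_flags` are illustrative assumptions, and rejecting reserved bits is a choice of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout taken from the commit message; names are illustrative. */
#define SKIP_FIELD_MASK  0xffULL      /* bits 0-7: frames to skip */
#define F_USER_STACK     (1ULL << 8)  /* collect user stack */
#define F_FAST_STACK_CMP (1ULL << 9)  /* compare stacks by hash only */
#define F_REUSE_STACKID  (1ULL << 10) /* on collision, discard old */
#define F_KNOWN_BITS     (SKIP_FIELD_MASK | F_USER_STACK | \
                          F_FAST_STACK_CMP | F_REUSE_STACKID)

struct stackid_opts {
    unsigned int skip;
    bool user_stack;
    bool hash_only;
    bool reuse;
};

/* Decode flags into opts; returns -1 if a reserved bit is set. */
static int decode_stackid_flags(uint64_t flags, struct stackid_opts *opts)
{
    if (flags & ~F_KNOWN_BITS)
        return -1;

    opts->skip = (unsigned int)(flags & SKIP_FIELD_MASK);
    opts->user_stack = (flags & F_USER_STACK) != 0;
    opts->hash_only = (flags & F_FAST_STACK_CMP) != 0;
    opts->reuse = (flags & F_REUSE_STACKID) != 0;
    return 0;
}
```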
    • perf: generalize perf_callchain · 568b329a
      Alexei Starovoitov committed
      . avoid walking the stack when there is no room left in the buffer
      . generalize get_perf_callchain() to be called from bpf helper
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ethtool: support set coalesce per queue · f38d138a
      Kan Liang committed
      This patch implements sub command ETHTOOL_SCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to
      set coalesce of each masked queue to device driver. The wanted coalesce
      information is stored in "data" for each masked queue, and can be
      copied from userspace.
      If setting coalesce in the device driver fails, an attempt is made to
      roll back the values already set on specific queues.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
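The set-then-roll-back behaviour described above can be sketched with a toy device. Everything here is hypothetical (`toy_dev`, `toy_set`, `set_per_queue`, the fixed queue count), and indexing the wanted values by queue number is a simplification made for this sketch, not a statement about the ioctl's data layout.

```c
#include <stdint.h>

#define NUM_QUEUES 8

/* Toy device: one coalesce value per queue, plus a queue index on
 * which the setter is made to fail (-1 means never fail). */
struct toy_dev {
    int usecs[NUM_QUEUES];
    int fail_queue;
};

/* Hypothetical per-queue setter; returns 0 on success. */
static int toy_set(struct toy_dev *dev, unsigned int q, int value)
{
    if ((int)q == dev->fail_queue)
        return -1;
    dev->usecs[q] = value;
    return 0;
}

/* Apply wanted[q] to every queue selected by mask. If one queue
 * fails, restore the queues that were already changed. */
static int set_per_queue(struct toy_dev *dev, uint32_t mask,
                         const int *wanted)
{
    int backup[NUM_QUEUES];
    unsigned int q;

    for (q = 0; q < NUM_QUEUES; q++) {
        if (!(mask & (1u << q)))
            continue;
        backup[q] = dev->usecs[q]; /* remember the old value */
        if (toy_set(dev, q, wanted[q]) != 0)
            goto roll_back;
    }
    return 0;

roll_back:
    /* Walk back over the queues below the failing one. */
    while (q-- > 0) {
        if (mask & (1u << q))
            toy_set(dev, q, backup[q]);
    }
    return -1;
}
```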
    • net/ethtool: support get coalesce per queue · 421797b1
      Kan Liang committed
      This patch implements sub command ETHTOOL_GCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to
      get coalesce of each masked queue from device driver. Then the interrupt
      coalescing parameters will be copied back to user space one by one.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ethtool: introduce a new ioctl for per queue setting · ac2c7ad0
      Kan Liang committed
      Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting.
      The following patches will enable some SUB_COMMANDs for per queue
      setting.
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib/bitmap.c: conversion routines to/from u32 array · e52bc7c2
      David Decotigny committed
      Aimed at transferring bitmaps to/from user-space in a 32/64-bit agnostic
      way.
      
      Tested:
        unit tests (next patch) on qemu i386, x86_64, ppc, ppc64 BE and LE,
        ARM.
      Signed-off-by: David Decotigny <decot@googlers.com>
      Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
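A word-size-agnostic conversion between an `unsigned long` bitmap and a `uint32_t` array can be sketched bit by bit, as below. This is a deliberately simple userspace illustration with hypothetical function names; it says nothing about how the routines in lib/bitmap.c are actually implemented.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Copy the first nbits of an unsigned long bitmap into a u32 array.
 * Bit i always lands in dst[i / 32], whatever the size of long. */
static void bitmap_to_u32(uint32_t *dst, size_t dst_words,
                          const unsigned long *src, size_t nbits)
{
    memset(dst, 0, dst_words * sizeof(*dst));
    for (size_t i = 0; i < nbits && i / 32 < dst_words; i++) {
        if (src[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            dst[i / 32] |= UINT32_C(1) << (i % 32);
    }
}

/* Inverse direction: fill an unsigned long bitmap of nbits bits
 * from a u32 array. */
static void bitmap_from_u32(unsigned long *dst, size_t nbits,
                            const uint32_t *src, size_t src_words)
{
    size_t longs = (nbits + BITS_PER_LONG - 1) / BITS_PER_LONG;

    memset(dst, 0, longs * sizeof(*dst));
    for (size_t i = 0; i < nbits && i / 32 < src_words; i++) {
        if (src[i / 32] & (UINT32_C(1) << (i % 32)))
            dst[i / BITS_PER_LONG] |= 1UL << (i % BITS_PER_LONG);
    }
}
```

Fixing the external layout to 32-bit words means user space sees the same array regardless of whether the kernel's `unsigned long` is 32 or 64 bits wide.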
    • net: make netdev_for_each_lower_dev safe for device removal · cfdd28be
      Nikolay Aleksandrov committed
      When I used netdev_for_each_lower_dev in commit bad53162 ("vrf:
      remove slave queue and private slave struct") I thought that it acts
      like netdev_for_each_lower_private and can be used to remove the current
      device from the list while walking, but unfortunately it acts more like
      netdev_for_each_lower_private_rcu and doesn't allow it. The difference
      is where the "iter" points to, right now it points to the current element
      and that makes it impossible to remove it. Change the logic to be
      similar to netdev_for_each_lower_private and make it point to the "next"
      element so we can safely delete the current one. VRF is the only such
      user right now, there's no change for the read-only users.
      
      Here's what can happen now:
      [98423.249858] general protection fault: 0000 [#1] SMP
      [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
      sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
      gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
      parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
      9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
      virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
      ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
      virtio_ring virtio scsi_mod [last unloaded: bridge]
      [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G           O
      4.5.0-rc2+ #81
      [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.8.1-20150318_183358- 04/01/2014
      [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
      ffff88003428c000
      [98423.256123] RIP: 0010:[<ffffffff81514f3e>]  [<ffffffff81514f3e>]
      netdev_lower_get_next+0x1e/0x30
      [98423.256534] RSP: 0018:ffff88003428f940  EFLAGS: 00010207
      [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
      0000000000000000
      [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
      ffff880054ff90c0
      [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
      0000000000000000
      [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
      ffff88003428f9e0
      [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
      0000000000000001
      [98423.258055] FS:  00007f3d76881700(0000) GS:ffff88005d000000(0000)
      knlGS:0000000000000000
      [98423.258418] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
      00000000000406e0
      [98423.258902] Stack:
      [98423.259075]  ffff88003428f960 ffffffffa0442636 0002000100000004
      ffff880054ff9000
      [98423.259647]  ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
      ffff88003428f978
      [98423.260208]  ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
      ffff880035b35f00
      [98423.260739] Call Trace:
      [98423.260920]  [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
      [98423.261156]  [<ffffffff81518205>]
      rollback_registered_many+0x205/0x390
      [98423.261401]  [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
      [98423.261641]  [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
      [98423.271557]  [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
      [98423.271800]  [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
      [98423.272049]  [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
      [98423.272279]  [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
      [98423.272513]  [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
      [98423.272755]  [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
      [98423.272983]  [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
      [98423.273209]  [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
      [98423.273476]  [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
      [98423.273710]  [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
      [98423.273947]  [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
      [98423.274175]  [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
      [98423.274416]  [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
      [98423.274658]  [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
      [98423.274894]  [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
      [98423.275130]  [<ffffffff81269611>] ? __fget_light+0x91/0xb0
      [98423.275365]  [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
      [98423.275595]  [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
      [98423.275827]  [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
      90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
      89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
      [98423.279639] RIP  [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
      [98423.279920]  RSP <ffff88003428f940>
      
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      Fixes: bad53162 ("vrf: remove slave queue and private slave struct")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
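The difference the commit describes, an iterator that points at the current element versus one that already points at the next, can be shown on a toy singly linked list. The kernel code uses its own list types; `struct node`, `get_next` and `unlink_all` below are hypothetical and only illustrate why advancing the iterator first makes removal of the current element safe.

```c
#include <stddef.h>

struct node {
    int id;
    struct node *next;
};

/* Return the current element and advance *iter to the one after it,
 * so the caller may unlink the returned element without breaking
 * the walk. */
static struct node *get_next(struct node **iter)
{
    struct node *cur = *iter;

    if (cur != NULL)
        *iter = cur->next;
    return cur;
}

/* Detach every element while walking; returns how many were seen. */
static int unlink_all(struct node **head)
{
    struct node *iter = *head;
    struct node *cur;
    int count = 0;

    while ((cur = get_next(&iter)) != NULL) {
        cur->next = NULL; /* "remove" cur; iter already moved on */
        count++;
    }
    *head = NULL;
    return count;
}
```

Had `iter` still pointed at `cur` when `cur->next` was cleared, the next step of the walk would have followed a stale or reset pointer, which is the class of failure shown in the trace above.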
    • bridge: mdb: add support for more attributes and export timer · 21257156
      Nikolay Aleksandrov committed
      Currently mdb entries are exported directly as a structure inside
      MDBA_MDB_ENTRY_INFO attribute, we can't really extend it without
      breaking user-space. In order to export new mdb fields, I've converted
      the MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
      with struct br_mdb_entry (without header, as it's casted directly in
      iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
      keep compatibility with older users and can export new data.
      I've tested this with iproute2, both with and without support for the
      added attribute and it works fine.
      So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
      inside but it may contain also some additional MDBA_MDB_EATTR_ attributes
      such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space.
      
      So the new structure is:
      [MDBA_MDB] = {
           [MDBA_MDB_ENTRY] = {
               [MDBA_MDB_ENTRY_INFO]
               [MDBA_MDB_ENTRY_INFO] { <- Nested attribute
                   struct br_mdb_entry <- nla_put_nohdr()
                   [MDBA_MDB_ENTRY attributes] <- normal netlink attributes
               }
           }
      }
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 19 Feb, 2016 13 commits
    • drm/atomic: Allow for holes in connector state, v2. · 5fff80bb
      Maarten Lankhorst committed
      Because we record connector_mask using 1 << drm_connector_index now
      the connector_mask should stay the same even when other connectors
      are removed. This was not the case with MST, in that case when removing
      a connector all other connectors may change their index.
      
      This is fixed by waiting until the first get_connector_state to allocate
      connector_state, and force reallocation when state is too small.
      
      As a side effect connector arrays no longer have to be preallocated,
      and can be allocated on first use, which means fewer allocations in
      the page flip only path.
      
      Changes since v1:
      - Whitespace. (Ville)
      - Call ida_remove when destroying the connector. (Ville)
      - u32 alloc -> int. (Ville)
      
      Fixes: 14de6c44 ("drm/atomic: Remove drm_atomic_connectors_for_crtc.")
      Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Reviewed-by: Lyude <cpaul@redhat.com>
      Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" · 13d34ac6
      Jeff Layton committed
      This reverts commit c510eff6 ("fsnotify: destroy marks with
      call_srcu instead of dedicated thread").
      
      Eryu reported that he was seeing some OOM kills kick in when running a
      testcase that adds and removes inotify marks on a file in a tight loop.
      
      The above commit changed the code to use call_srcu to clean up the
      marks.  While that does (in principle) work, the srcu callback job is
      limited to cleaning up entries in small batches and only once per jiffy.
      It's easily possible to overwhelm that machinery with too many call_srcu
      callbacks, and Eryu's reproducer did just that.
      
      There's also another potential problem with using call_srcu here.  While
      you can obviously sleep while holding the srcu_read_lock, the callbacks
      run under local_bh_disable, so you can't sleep there.
      
      It's possible when putting the last reference to the fsnotify_mark that
      we'll end up putting a chain of references including the fsnotify_group,
      uid, and associated keys.  While I don't see any obvious ways that that
      could occur, it's probably still best to avoid using call_srcu here
      after all.
      
      This patch reverts the above patch.  A later patch will take a different
      approach to eliminate the dedicated thread here.
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Reported-by: Eryu Guan <guaneryu@gmail.com>
      Tested-by: Eryu Guan <guaneryu@gmail.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • qed: Lay infrastructure for vlan filtering offload · 3f9b4a69
      Yuval Mintz committed
      Today, interfaces work in vlan-promisc mode. But once vlan filtering
      offload is supported, we'll need a method to control it directly
      [e.g., when setting device to PROMISC, or when running out of vlan
      credits].
      
      This adds the necessary API for L2 client to manually choose whether to
      accept all vlans or only those for which filters were configured.
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Optimize local checksum offload · 9e74a6da
      Alexander Duyck committed
      This patch takes advantage of several assumptions we can make about the
      headers of the frame in order to reduce overall processing overhead for
      computing the outer header checksum.
      
      First we can assume the entire header is in the region pointed to by
      skb->head as this is what csum_start is based on.
      
      Second, as a result of our first assumption, we can just call csum_partial
      instead of making a call to skb_checksum which would end up having to
      configure things so that we could walk through the frags list.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Annotate change of locking mechanism for np->opt · e550785c
      Benjamin Poirier committed
      follows up commit 45f6fad8 ("ipv6: add complete rcu protection around
      np->opt") which added mixed rcu/refcount protection to np->opt.
      
      Given the current implementation of rcu_pointer_handoff(), this has no
      effect at runtime.
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • iptunnel: scrub packet in iptunnel_pull_header · 7f290c94
      Jiri Benc committed
      Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
      skb_scrub_packet directly instead.
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vxlan: tun_id is 64bit, not 32bit · 07dabf20
      Jiri Benc committed
      The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
      convert the vni to tun_id correctly.
      
      Fixes: 54bfd872 ("vxlan: keep flags and vni in network byte order")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfnetlink: Revert "nfnetlink: add support for memory mapped netlink" · c5b0db32
      Florian Westphal committed
      Reverts commit 3ab1f683 ("nfnetlink: add support for memory mapped
      netlink").
      
      Like previous commits in the series, remove wrappers that are not needed
      after mmapped netlink removal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfnetlink: remove nfnetlink_alloc_skb · 905f0a73
      Florian Westphal committed
      Following mmapped netlink removal this code can be simplified by
      removing the alloc wrapper.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "genl: Add genlmsg_new_unicast() for unicast message allocation" · 263ea090
      Florian Westphal committed
      This reverts commit bb9b18fb ("genl: Add genlmsg_new_unicast() for
      unicast message allocation").

      Nothing wrong with it; it's no longer needed since this was only for
      mmapped netlink support.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: remove mmapped netlink support · d1b4c689
      Florian Westphal committed
      mmapped netlink has a number of unresolved issues:
      
      - TX zerocopy support had to be disabled more than a year ago via
        commit 4682a035 ("netlink: Always copy on mmap TX.")
        because the content of the mmapped area can change after netlink
        attribute validation but before message processing.
      
      - RX support was implemented mainly to speed up nfqueue dumping packet
        payload to userspace.  However, since commit ae08ce00
        ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
        with the socket-based interface too (via the skb_zerocopy helper).
      
      The other problem is that skbs attached to an mmaped netlink socket
      behave differently from normal skbs:
      
      - they don't have a shinfo area, so all functions that use skb_shinfo()
      (e.g. skb_clone) cannot be used.
      
      - reserving headroom prevents userspace from seeing the content as
      it expects message to start at skb->head.
      See for instance
      commit aa3a0220 ("netlink: not trim skb for mmaped socket when dump").
      
      - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
      crash because it needs the sk to check if a tx ring is attached.
      
      Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf35
      ("netfilter: nfnetlink: use original skbuff when acking batches").
      
      mmaped netlink also didn't play nicely with the skb_zerocopy helper
      used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
      commit 6bb0fef4 ("netlink, mmap: fix edge-case leakages in nf queue
      zero-copy") but at the cost of also needing to provide remaining
      length to the allocation function.
      
      nfqueue also has problems when used with mmaped rx netlink:
      - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
        Problem is that in the mmap case, the allocation time also determines
        the ordering in which the frame will be seen by userspace (A
        allocating before B means that A is located in earlier ring slot,
        but this also means that B might get a lower sequence number than A
        since seqno is decided later).  To fix this we would need to extend the
        spinlocked region to also cover the allocation and message setup which
        isn't desirable.
      - nfqueue can now be configured to queue large (GSO) skbs to userspace.
        Queuing GSO packets is faster than having to force a software segmentation
        in the kernel, so this is a desirable option.  However, with a mmap based
        ring one has to use 64kb per ring slot element, else mmap has to fall back
        to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
      
      To use the mmap interface, userspace not only has to probe for mmap netlink
      support, it also has to implement a recv/socket receive path in order to
      handle messages that exceed the size of an rx ring element.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/dccp: fix another race at listener dismantle · 7716682c
      Eric Dumazet committed
      Ilya reported following lockdep splat:
      
      kernel: =========================
      kernel: [ BUG: held lock freed! ]
      kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
      kernel: -------------------------
      kernel: swapper/5/0 is freeing memory
      ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
      kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      kernel: 4 locks held by swapper/5/0:
      kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
      netif_receive_skb_internal+0x4b/0x1f0
      kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
      ip_local_deliver_finish+0x3f/0x380
      kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
      sk_clone_lock+0x19b/0x440
      kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      
      To properly fix this issue, inet_csk_reqsk_queue_add() needs
      to return to its callers if the child has been queued
      into accept queue.
      
      We also need to make sure listener is still there before
      calling sk->sk_data_ready(), by holding a reference on it,
      since the reference carried by the child can disappear as
      soon as the child is put on accept queue.
      Reported-by: Ilya Dryomov <idryomov@gmail.com>
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • route: check and remove route cache when we get route · deed49df
      Xin Long committed
      Since the gc of ipv4 routes was removed, a cached route has no chance
      of being removed, and even after it has timed out it can still be
      used, because no code checks its expiry.

      Fix this issue by checking and removing the route cache when we get route.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 18 Feb, 2016 6 commits