1. 23 2月, 2016 4 次提交
  2. 22 2月, 2016 10 次提交
    • D
      bpf: don't emit mov A,A on return · 6205b9cf
      Daniel Borkmann 提交于
      While debugging with bpf_jit_disasm I noticed emissions of 'mov %eax,%eax',
      and found that this comes from BPF_RET | BPF_A translations from classic
      BPF. Emitting this is unnecessary as BPF_REG_A is mapped into BPF_REG_0
      already, therefore only emit a mov when immediates are used as return value.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6205b9cf
    • D
      bpf: fix csum update in bpf_l4_csum_replace helper for udp · 2f72959a
      Daniel Borkmann 提交于
      When using this helper for updating UDP checksums, we need to extend
      this in order to write CSUM_MANGLED_0 for csum computations that result
      into 0 as sum. Reason we need this is because packets with a checksum
      could otherwise become incorrectly marked as a packet without a checksum.
      Likewise, if the user indicates BPF_F_MARK_MANGLED_0, then we should
      not turn packets without a checksum into ones with a checksum.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f72959a
    • D
      bpf: try harder on clones when writing into skb · 3697649f
      Daniel Borkmann 提交于
      When we're dealing with clones and the area is not writeable, try
      harder and get a copy via pskb_expand_head(). Replace also other
      occurences in tc actions with the new skb_try_make_writable().
      Reported-by: NAshhad Sheikh <ashhadsheikh394@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3697649f
    • D
      bpf: remove artificial bpf_skb_{load, store}_bytes buffer limitation · 21cafc1d
      Daniel Borkmann 提交于
      We currently limit bpf_skb_store_bytes() and bpf_skb_load_bytes()
      helpers to only store or load a maximum buffer of 16 bytes. Thus,
      loading, rewriting and storing headers require several bpf_skb_load_bytes()
      and bpf_skb_store_bytes() calls.
      
      Also here we can use a per-cpu scratch buffer instead in order to not
      pressure stack space any further. I do suspect that this limit was mainly
      set in place for this particular reason. So, ease program development
      by removing this limitation and make the scratchpad generic, so it can
      be reused.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21cafc1d
    • D
      bpf: add generic bpf_csum_diff helper · 7d672345
      Daniel Borkmann 提交于
      For L4 checksums, we currently have bpf_l4_csum_replace() helper. It's
      currently limited to handle 2 and 4 byte changes in a header and feeds the
      from/to into inet_proto_csum_replace{2,4}() helpers of the kernel. When
      working with IPv6, for example, this makes it rather cumbersome to deal
      with, similarly when editing larger parts of a header.
      
      Instead, extend the API in a more generic way: For bpf_l4_csum_replace(),
      add a case for header field mask of 0 to change the checksum at a given
      offset through inet_proto_csum_replace_by_diff(), and provide a helper
      bpf_csum_diff() that can generically calculate a from/to diff for arbitrary
      amounts of data.
      
      This can be used in multiple ways: for the bpf_l4_csum_replace() only
      part, this even provides us with the option to insert precalculated diffs
      from user space f.e. from a map, or from bpf_csum_diff() during runtime.
      
      bpf_csum_diff() has a optional from/to stack buffer input, so we can
      calculate a diff by using a scratchbuffer for scenarios where we're
      inserting (from is NULL), removing (to is NULL) or diffing (from/to buffers
      don't need to be of equal size) data. Also, bpf_csum_diff() allows to
      feed a previous csum into csum_partial(), so the function can also be
      cascaded.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d672345
    • R
      ila: autoload module · 84a8cbe4
      Robert Shearman 提交于
      Avoid users having to manually load the module by adding a module
      alias allowing it to be autoloaded by the lwt infra.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84a8cbe4
    • R
      mpls: autoload lwt module · b2b04edc
      Robert Shearman 提交于
      Avoid users having to manually load the module by adding a module
      alias allowing it to be autoloaded by the lwt infra.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2b04edc
    • R
      lwtunnel: autoload of lwt modules · 745041e2
      Robert Shearman 提交于
      The lwt implementations using net devices can autoload using the
      existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
      that lwt modules not using net devices can use.
      
      Therefore, add the ability to autoload modules registering lwt
      operations for lwt implementations not using a net device so that
      users don't have to manually load the modules.
      
      Only users with the CAP_NET_ADMIN capability can cause modules to be
      loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
      messages for users without this capability, and by
      lwtunnel_build_state not being called in response to RTM_GETxxx
      messages.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      745041e2
    • Z
      vlan: turn on unicast filtering on vlan device · e817af27
      Zhang Shengju 提交于
      Currently vlan device inherits unicast filtering flag from underlying
      device. If underlying device doesn't support unicast filter, this will
      put vlan device into promiscuous mode when it's stacked.
      
      Tun on IFF_UNICAST_FLT on the vlan device in any case so that it does
      not go into promiscuous mode needlessly. If underlying device does not
      support unicast filtering, that device will enter promiscuous mode.
      Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e817af27
    • N
      sctp: Fix port hash table size computation · d9749fb5
      Neil Horman 提交于
      Dmitry Vyukov noted recently that the sctp_port_hashtable had an error in
      its size computation, observing that the current method never guaranteed
      that the hashsize (measured in number of entries) would be a power of two,
      which the input hash function for that table requires.  The root cause of
      the problem is that two values need to be computed (one, the allocation
      order of the storage requries, as passed to __get_free_pages, and two the
      number of entries for the hash table).  Both need to be ^2, but for
      different reasons, and the existing code is simply computing one order
      value, and using it as the basis for both, which is wrong (i.e. it assumes
      that ((1<<order)*PAGE_SIZE)/sizeof(bucket) is still ^2 when its not).
      
      To fix this, we change the logic slightly.  We start by computing a goal
      allocation order (which is limited by the maximum size hash table we want
      to support.  Then we attempt to allocate that size table, decreasing the
      order until a successful allocation is made.  Then, with the resultant
      successful order we compute the number of buckets that hash table supports,
      which we then round down to the nearest power of two, giving us the number
      of entries the table actually supports.
      
      I've tested this locally here, using non-debug and spinlock-debug kernels,
      and the number of entries in the hashtable consistently work out to be
      powers of two in all cases.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      CC: Dmitry Vyukov <dvyukov@google.com>
      CC: Vladislav Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d9749fb5
  3. 20 2月, 2016 14 次提交
    • D
      Bluetooth: hci_core: Avoid mixing up req_complete and req_complete_skb · 3bd7594e
      Douglas Anderson 提交于
      In commit 44d27137 ("Bluetooth: Compress the size of struct
      hci_ctrl") we squashed down the size of the structure by using a union
      with the assumption that all users would use the flag to determine
      whether we had a req_complete or a req_complete_skb.
      
      Unfortunately we had a case in hci_req_cmd_complete() where we weren't
      looking at the flag.  This can result in a situation where we might be
      storing a hci_req_complete_skb_t in a hci_req_complete_t variable, or
      vice versa.
      
      During some testing I found at least one case where the function
      hci_req_sync_complete() was called improperly because the kernel thought
      that it didn't require an SKB.  Looking through the stack in kgdb I
      found that it was called by hci_event_packet() and that
      hci_event_packet() had both of its locals "req_complete" and
      "req_complete_skb" pointing to the same place: both to
      hci_req_sync_complete().
      
      Let's make sure we always check the flag.
      
      For more details on debugging done, see <http://crbug.com/588288>.
      
      Fixes: 44d27137 ("Bluetooth: Compress the size of struct hci_ctrl")
      Signed-off-by: NDouglas Anderson <dianders@chromium.org>
      Acked-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      3bd7594e
    • R
      af_unix: Don't use continue to re-execute unix_stream_read_generic loop · 18eceb81
      Rainer Weikusat 提交于
      The unix_stream_read_generic function tries to use a continue statement
      to restart the receive loop after waiting for a message. This may not
      work as intended as the caller might use a recvmsg call to peek at
      control messages without specifying a message buffer. If this was the
      case, the continue will cause the function to return without an error
      and without the credential information if the function had to wait for a
      message while it had returned with the credentials otherwise. Change to
      using goto to restart the loop without checking the condition first in
      this case so that credentials are returned either way.
      Signed-off-by: NRainer Weikusat <rweikusat@mobileactivedefense.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18eceb81
    • D
      unix_diag: fix incorrect sign extension in unix_lookup_by_ino · b5f05492
      Dmitry V. Levin 提交于
      The value passed by unix_diag_get_exact to unix_lookup_by_ino has type
      __u32, but unix_lookup_by_ino's argument ino has type int, which is not
      a problem yet.
      However, when ino is compared with sock_i_ino return value of type
      unsigned long, ino is sign extended to signed long, and this results
      to incorrect comparison on 64-bit architectures for inode numbers
      greater than INT_MAX.
      
      This bug was found by strace test suite.
      
      Fixes: 5d3cae8b ("unix_diag: Dumping exact socket core")
      Signed-off-by: NDmitry V. Levin <ldv@altlinux.org>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5f05492
    • D
      net: use skb_postpush_rcsum instead of own implementations · 6b83d28a
      Daniel Borkmann 提交于
      Replace individual implementations with the recently introduced
      skb_postpush_rcsum() helper.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NTom Herbert <tom@herbertland.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6b83d28a
    • K
      net/ethtool: support set coalesce per queue · f38d138a
      Kan Liang 提交于
      This patch implements sub command ETHTOOL_SCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to
      set coalesce of each masked queue to device driver. The wanted coalesce
      information are stored in "data" for each masked queue, which can copy
      from userspace.
      If it fails to set coalesce to device driver, the value which already
      set to specific queue will be tried to rollback.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Reviewed-by: NBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f38d138a
    • K
      net/ethtool: support get coalesce per queue · 421797b1
      Kan Liang 提交于
      This patch implements sub command ETHTOOL_GCOALESCE for ioctl
      ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to
      get coalesce of each masked queue from device driver. Then the interrupt
      coalescing parameters will be copied back to user space one by one.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Reviewed-by: NBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      421797b1
    • K
      net/ethtool: introduce a new ioctl for per queue setting · ac2c7ad0
      Kan Liang 提交于
      Introduce a new ioctl ETHTOOL_PERQUEUE for per queue parameters setting.
      The following patches will enable some SUB_COMMANDs for per queue
      setting.
      Signed-off-by: NKan Liang <kan.liang@intel.com>
      Reviewed-by: NBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac2c7ad0
    • W
      ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6 · e0d8c1b7
      Wei Wang 提交于
      In ipv4,  when  the machine receives a ICMP_FRAG_NEEDED message,  the
      connected UDP socket will get EMSGSIZE message on its next read from the
      socket.
      However, this is not the case for ipv6.
      This fix modifies the udp err handler in Ipv6 for ICMP6_PKT_TOOBIG to
      make it similar to ipv4 behavior. That is when the machine gets an
      ICMP6_PKT_TOOBIG message, the connected UDP socket will get EMSGSIZE
      message on its next read from the socket.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0d8c1b7
    • P
      lwt: fix rx checksum setting for lwt devices tunneling over ipv6 · c868ee70
      Paolo Abeni 提交于
      the commit 35e2d115 ("tunnels: Allow IPv6 UDP checksums to be
      correctly controlled.") changed the default xmit checksum setting
      for lwt vxlan/geneve ipv6 tunnels, so that now the checksum is not
      set into external UDP header.
      This commit changes the rx checksum setting for both lwt vxlan/geneve
      devices created by openvswitch accordingly, so that lwt over ipv6
      tunnel pairs are again able to communicate with default values.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NJesse Gross <jesse@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c868ee70
    • I
      tipc: unlock in error path · b53ce3e7
      Insu Yun 提交于
      tipc_bcast_unlock need to be unlocked in error path.
      Signed-off-by: NInsu Yun <wuninsu@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b53ce3e7
    • A
      rtnl: RTM_GETNETCONF: fix wrong return value · a97eb33f
      Anton Protopopov 提交于
      An error response from a RTM_GETNETCONF request can return the positive
      error value EINVAL in the struct nlmsgerr that can mislead userspace.
      Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a97eb33f
    • N
      net: make netdev_for_each_lower_dev safe for device removal · cfdd28be
      Nikolay Aleksandrov 提交于
      When I used netdev_for_each_lower_dev in commit bad53162 ("vrf:
      remove slave queue and private slave struct") I thought that it acts
      like netdev_for_each_lower_private and can be used to remove the current
      device from the list while walking, but unfortunately it acts more like
      netdev_for_each_lower_private_rcu and doesn't allow it. The difference
      is where the "iter" points to, right now it points to the current element
      and that makes it impossible to remove it. Change the logic to be
      similar to netdev_for_each_lower_private and make it point to the "next"
      element so we can safely delete the current one. VRF is the only such
      user right now, there's no change for the read-only users.
      
      Here's what can happen now:
      [98423.249858] general protection fault: 0000 [#1] SMP
      [98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
      oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
      crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
      sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
      gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
      parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
      9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
      virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
      ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
      virtio_ring virtio scsi_mod [last unloaded: bridge]
      [98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G           O
      4.5.0-rc2+ #81
      [98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS 1.8.1-20150318_183358- 04/01/2014
      [98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
      ffff88003428c000
      [98423.256123] RIP: 0010:[<ffffffff81514f3e>]  [<ffffffff81514f3e>]
      netdev_lower_get_next+0x1e/0x30
      [98423.256534] RSP: 0018:ffff88003428f940  EFLAGS: 00010207
      [98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
      0000000000000000
      [98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
      ffff880054ff90c0
      [98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
      0000000000000000
      [98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
      ffff88003428f9e0
      [98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
      0000000000000001
      [98423.258055] FS:  00007f3d76881700(0000) GS:ffff88005d000000(0000)
      knlGS:0000000000000000
      [98423.258418] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
      00000000000406e0
      [98423.258902] Stack:
      [98423.259075]  ffff88003428f960 ffffffffa0442636 0002000100000004
      ffff880054ff9000
      [98423.259647]  ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
      ffff88003428f978
      [98423.260208]  ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
      ffff880035b35f00
      [98423.260739] Call Trace:
      [98423.260920]  [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
      [98423.261156]  [<ffffffff81518205>]
      rollback_registered_many+0x205/0x390
      [98423.261401]  [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
      [98423.261641]  [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
      [98423.271557]  [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
      [98423.271800]  [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
      [98423.272049]  [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
      [98423.272279]  [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
      [98423.272513]  [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
      [98423.272755]  [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
      [98423.272983]  [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
      [98423.273209]  [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
      [98423.273476]  [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
      [98423.273710]  [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
      [98423.273947]  [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
      [98423.274175]  [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
      [98423.274416]  [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
      [98423.274658]  [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
      [98423.274894]  [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
      [98423.275130]  [<ffffffff81269611>] ? __fget_light+0x91/0xb0
      [98423.275365]  [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
      [98423.275595]  [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
      [98423.275827]  [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
      [98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
      90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
      89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
      [98423.279639] RIP  [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
      [98423.279920]  RSP <ffff88003428f940>
      
      CC: David Ahern <dsa@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Vlad Yasevich <vyasevic@redhat.com>
      Fixes: bad53162 ("vrf: remove slave queue and private slave struct")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfdd28be
    • N
      bridge: mdb: add support for more attributes and export timer · 21257156
      Nikolay Aleksandrov 提交于
      Currently mdb entries are exported directly as a structure inside
      MDBA_MDB_ENTRY_INFO attribute, we can't really extend it without
      breaking user-space. In order to export new mdb fields, I've converted
      the MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
      with struct br_mdb_entry (without header, as it's casted directly in
      iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
      keep compatibility with older users and can export new data.
      I've tested this with iproute2, both with and without support for the
      added attribute and it works fine.
      So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
      inside but it may contain also some additional MDBA_MDB_EATTR_ attributes
      such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space.
      
      So the new structure is:
      [MDBA_MDB] = {
           [MDBA_MDB_ENTRY] = {
               [MDBA_MDB_ENTRY_INFO]
               [MDBA_MDB_ENTRY_INFO] { <- Nested attribute
                   struct br_mdb_entry <- nla_put_nohdr()
                   [MDBA_MDB_ENTRY attributes] <- normal netlink attributes
               }
           }
      }
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21257156
    • N
      bridge: mdb: reduce the indentation level in br_mdb_fill_info · 76cc173d
      Nikolay Aleksandrov 提交于
      Switch the port check and skip if it's null, this allows us to reduce one
      indentation level.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76cc173d
  4. 19 2月, 2016 12 次提交
    • A
      net: caif: fix erroneous return value · 449f14f0
      Anton Protopopov 提交于
      The cfrfml_receive() function might return positive value EPROTO
      Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      449f14f0
    • A
      appletalk: fix erroneous return value · 48bb230e
      Anton Protopopov 提交于
      The atalk_sendmsg() function might return wrong value ENETUNREACH
      instead of -ENETUNREACH.
      Signed-off-by: NAnton Protopopov <a.s.protopopov@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48bb230e
    • P
      IFF_NO_QUEUE: Fix for drivers not calling ether_setup() · a813104d
      Phil Sutter 提交于
      My implementation around IFF_NO_QUEUE driver flag assumed that leaving
      tx_queue_len untouched (specifically: not setting it to zero) by drivers
      would make it possible to assign a regular qdisc to them without having
      to worry about setting tx_queue_len to a useful value. This was only
      partially true: I overlooked that some drivers don't call ether_setup()
      and therefore not initialize tx_queue_len to the default value of 1000.
      Consequently, removing the workarounds in place for that case in qdisc
      implementations which cared about it (namely, pfifo, bfifo, gred, htb,
      plug and sfb) leads to problems with these specific interface types and
      qdiscs.
      
      Luckily, there's already a sanitization point for drivers setting
      tx_queue_len to zero, which can be reused to assign the fallback value
      most qdisc implementations used, which is 1.
      
      Fixes: 348e3435 ("net: sched: drop all special handling of tx_queue_len == 0")
      Tested-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a813104d
    • J
      gre: clear IFF_TX_SKB_SHARING · d13b161c
      Jiri Benc 提交于
      ether_setup sets IFF_TX_SKB_SHARING but this is not supported by gre
      as it modifies the skb on xmit.
      
      Also, clean up whitespace in ipgre_tap_setup when we're already touching it.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d13b161c
    • J
      iptunnel: scrub packet in iptunnel_pull_header · 7f290c94
      Jiri Benc 提交于
      Part of skb_scrub_packet was open coded in iptunnel_pull_header. Let it call
      skb_scrub_packet directly instead.
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f290c94
    • V
      net: bridge: log port STP state on change · 7c25b16d
      Vivien Didelot 提交于
      Remove the shared br_log_state function and print the info directly in
      br_set_state, where the net_bridge_port state is actually changed.
      Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Acked-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c25b16d
    • F
      nfnetlink: Revert "nfnetlink: add support for memory mapped netlink" · c5b0db32
      Florian Westphal 提交于
      reverts commit 3ab1f683 ("nfnetlink: add support for memory mapped
      netlink")'
      
      Like previous commits in the series, remove wrappers that are not needed
      after mmapped netlink removal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5b0db32
    • F
      nfnetlink: remove nfnetlink_alloc_skb · 905f0a73
      Florian Westphal 提交于
      Following mmapped netlink removal this code can be simplified by
      removing the alloc wrapper.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      905f0a73
    • F
      Revert "genl: Add genlmsg_new_unicast() for unicast message allocation" · 263ea090
      Florian Westphal 提交于
      This reverts commit bb9b18fb ("genl: Add genlmsg_new_unicast() for
      unicast message allocation")'.
      
      Nothing wrong with it; its no longer needed since this was only for
      mmapped netlink support.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      263ea090
    • F
      openvswitch: Revert: "Enable memory mapped Netlink i/o" · 551ddc05
      Florian Westphal 提交于
      revert commit 795449d8 ("openvswitch: Enable memory mapped Netlink i/o").
      Following the mmaped netlink removal this code can be removed.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      551ddc05
    • F
      netlink: remove mmapped netlink support · d1b4c689
      Florian Westphal 提交于
      mmapped netlink has a number of unresolved issues:
      
      - TX zerocopy support had to be disabled more than a year ago via
        commit 4682a035 ("netlink: Always copy on mmap TX.")
        because the content of the mmapped area can change after netlink
        attribute validation but before message processing.
      
      - RX support was implemented mainly to speed up nfqueue dumping packet
        payload to userspace.  However, since commit ae08ce00
        ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
        with the socket-based interface too (via the skb_zerocopy helper).
      
      The other problem is that skbs attached to mmaped netlink socket
      behave different from normal skbs:
      
      - they don't have a shinfo area, so all functions that use skb_shinfo()
      (e.g. skb_clone) cannot be used.
      
      - reserving headroom prevents userspace from seeing the content as
      it expects message to start at skb->head.
      See for instance
      commit aa3a0220 ("netlink: not trim skb for mmaped socket when dump").
      
      - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
      crash because it needs the sk to check if a tx ring is attached.
      
      Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf35
      ("netfilter: nfnetlink: use original skbuff when acking batches").
      
      mmaped netlink also didn't play nicely with the skb_zerocopy helper
      used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
      commit 6bb0fef4 ("netlink, mmap: fix edge-case leakages in nf queue
      zero-copy")' but at the cost of also needing to provide remaining
      length to the allocation function.
      
      nfqueue also has problems when used with mmaped rx netlink:
      - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
        Problem is that in the mmap case, the allocation time also determines
        the ordering in which the frame will be seen by userspace (A
        allocating before B means that A is located in earlier ring slot,
        but this also means that B might get a lower sequence number then A
        since seqno is decided later.  To fix this we would need to extend the
        spinlocked region to also cover the allocation and message setup which
        isn't desirable.
      - nfqueue can now be configured to queue large (GSO) skbs to userspace.
        Queing GSO packets is faster than having to force a software segmentation
        in the kernel, so this is a desirable option.  However, with a mmap based
        ring one has to use 64kb per ring slot element, else mmap has to fall back
        to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
      
      To use the mmap interface, userspace not only has to probe for mmap netlink
      support, it also has to implement a recv/socket receive path in order to
      handle messages that exceed the size of an rx ring element.
      
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1b4c689
    • E
      tcp/dccp: fix another race at listener dismantle · 7716682c
      Eric Dumazet 提交于
      Ilya reported following lockdep splat:
      
      kernel: =========================
      kernel: [ BUG: held lock freed! ]
      kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
      kernel: -------------------------
      kernel: swapper/5/0 is freeing memory
      ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
      kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      kernel: 4 locks held by swapper/5/0:
      kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
      netif_receive_skb_internal+0x4b/0x1f0
      kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
      ip_local_deliver_finish+0x3f/0x380
      kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
      sk_clone_lock+0x19b/0x440
      kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      
      To properly fix this issue, inet_csk_reqsk_queue_add() needs
      to return to its callers if the child as been queued
      into accept queue.
      
      We also need to make sure listener is still there before
      calling sk->sk_data_ready(), by holding a reference on it,
      since the reference carried by the child can disappear as
      soon as the child is put on accept queue.
      Reported-by: NIlya Dryomov <idryomov@gmail.com>
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7716682c