1. 01 Jan, 2014 1 commit
  2. 14 Dec, 2013 3 commits
    • J
      libceph: resend all writes after the osdmap loses the full flag · 9a1ea2db
      Josh Durgin committed
      With the current full handling, there is a race between osds and
      clients getting the first map marked full. If the osd wins, it will
      return -ENOSPC to any writes, but the client may already have writes
      in flight. This results in the client getting the error and
      propagating it up the stack. For rbd, the block layer turns this into
      EIO, which can cause corruption in filesystems above it.
      
      To avoid this race, osds are being changed to drop writes that came
      from clients with an osdmap older than the last osdmap marked full.
      In order for this to work, clients must resend all writes after they
      encounter a full -> not full transition in the osdmap. osds will wait
      for an updated map instead of processing a request from a client with
      a newer map, so resent writes will not be dropped by the osd unless
      there is another not full -> full transition.
      
      This approach requires both osds and clients to be fixed to avoid the
      race. Old clients talking to osds with this fix may hang instead of
      returning EIO and potentially corrupting an fs. New clients talking to
      old osds have the same behavior as before if they encounter this race.
      
      Fixes: http://tracker.ceph.com/issues/6938
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
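The full -> not full check that triggers the resend can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `SKETCH_OSDMAP_FULL` and `sketch_need_resend_writes()` are hypothetical names, not the libceph identifiers, and the real client tracks more state than a single flag word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit for illustration; not the kernel's definition. */
#define SKETCH_OSDMAP_FULL (1u << 0)

/*
 * Returns true when the map went from full to not full, which is the
 * moment every outstanding write should be resent.
 */
static bool sketch_need_resend_writes(uint32_t old_flags, uint32_t new_flags)
{
	bool was_full = (old_flags & SKETCH_OSDMAP_FULL) != 0;
	bool is_full = (new_flags & SKETCH_OSDMAP_FULL) != 0;

	return was_full && !is_full;
}
```

A caller would compare the flags of the previous map with those of each newly received map and resend writes only when the function returns true.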
    • J
      libceph: block I/O when PAUSE or FULL osd map flags are set · d29adb34
      Josh Durgin committed
      The PAUSEWR and PAUSERD flags are meant to stop the cluster from
      processing writes and reads, respectively. The FULL flag is set when
      the cluster determines that it is out of space, and will no longer
      process writes.  PAUSEWR and PAUSERD are purely client-side settings
      already implemented in userspace clients. The osd does nothing special
      with these flags.
      
      When the FULL flag is set, however, the osd responds to all writes
      with -ENOSPC. For cephfs, this makes sense, but for rbd the block
      layer translates this into EIO.  If a cluster goes from full to
      non-full quickly, a filesystem on top of rbd will not behave well,
      since some writes succeed while others get EIO.
      
      Fix this by blocking any writes when the FULL flag is set in the osd
      client. This is the same strategy used by userspace, so apply it by
      default.  A follow-on patch makes this configurable.
      
      __map_request() is called to re-target osd requests in case the
      available osds changed.  Add a paused field to a ceph_osd_request, and
      set it whenever an appropriate osd map flag is set.  Avoid queueing
      paused requests in __map_request(), but force them to be resent if
      they become unpaused.
      
      Also subscribe to the next osd map from the monitor if any of these
      flags are set, so paused requests can be unblocked as soon as
      possible.
      
      Fixes: http://tracker.ceph.com/issues/6079
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
    • L
      ceph: Add necessary clean up if invalid reply received in handle_reply() · 37c89bde
      Li Wang committed
      Wake up possible waiters, invoke the call back if any, unregister the request
      Signed-off-by: Li Wang <liwang@ubuntukylin.com>
      Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
      Signed-off-by: Sage Weil <sage@inktank.com>
  3. 02 Dec, 2013 1 commit
    • F
      {pktgen, xfrm} Update IPv4 header total len and checksum after transformation · 3868204d
      fan.du committed
      commit a553e4a6 ("[PKTGEN]: IPSEC support")
      tried to support IPsec ESP transport transformation for pktgen, but actually
      this doesn't work at all, for two reasons (the original transformed packet has
      a bad IPv4 checksum value as well as a wrong auth value, as reported by Wireshark):
      
      - After transformation, the IPv4 header total length needs to be updated,
        because the encrypted payload's length is NOT the same as that of the plain text.
      
      - After transformation, the IPv4 checksum needs to be recalculated because the
        payload has been changed.
      
      With this patch and pktgen armed with the configuration below, Wireshark is able to
      decrypt ESP packets generated by pktgen without any IPv4 checksum error or
      auth value error.
      
      pgset "flag IPSEC"
      pgset "flows 1"
      Signed-off-by: Fan Du <fan.du@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
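The two header fixups named in this commit (new total length, then a fresh checksum) can be illustrated with a small userspace sketch. It uses the standard one's-complement Internet checksum over a raw byte buffer; `sketch_ipv4_checksum()` and `sketch_ipv4_fixup()` are hypothetical helpers and do not reproduce the pktgen code, which works on kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over the header, read as 16-bit big-endian words. */
static uint16_t sketch_ipv4_checksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Stores a new total length, then recomputes the header checksum. */
static void sketch_ipv4_fixup(uint8_t *hdr, uint16_t total_len)
{
	size_t ihl = (size_t)(hdr[0] & 0x0f) * 4;	/* header length in bytes */
	uint16_t csum;

	assert(ihl >= 20);
	hdr[2] = (uint8_t)(total_len >> 8);
	hdr[3] = (uint8_t)(total_len & 0xff);
	hdr[10] = 0;	/* checksum field must be zero while summing */
	hdr[11] = 0;
	csum = sketch_ipv4_checksum(hdr, ihl);
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xff);
}
```

A quick sanity check for any such routine is that running the checksum over a header that already contains a correct checksum yields zero.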
  4. 01 Dec, 2013 6 commits
  5. 30 Nov, 2013 4 commits
  6. 29 Nov, 2013 5 commits
  7. 27 Nov, 2013 1 commit
  8. 24 Nov, 2013 8 commits
  9. 22 Nov, 2013 5 commits
  10. 21 Nov, 2013 5 commits
    • H
      net: add BUG_ON if kernel advertises msg_namelen > sizeof(struct sockaddr_storage) · 68c6beb3
      Hannes Frederic Sowa committed
      In that case it is probable that kernel code overwrote part of the
      stack. So we should bail out loudly here.
      
      The BUG_ON may be removed in future if we are sure all protocols are
      conformant.
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • H
      net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426
      Hannes Frederic Sowa committed
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
      
      Optimize for the case recvfrom is called with NULL as address. We don't
      need to copy the address at all, so set it to NULL before invoking the
      recvmsg handler. We can do so, because all the recvmsg handlers must
      cope with the case a plain read() is called on them. read() also sets
      msg_name to NULL.
      
      Also document these changes in include/linux/net.h as suggested by David
      Miller.
      
      Changes since RFC:
      
      Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      affect sendto as it would bail out earlier while trying to copy-in the
      address. It also more naturally reflects the logic by the callers of
      verify_iovec.
      
      With this change in place I could remove "
      if (!uaddr || msg_sys->msg_namelen == 0)
      	msg->msg_name = NULL
      ".
      
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
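The contract described in this commit (length starts at 0, the handler opts in, a NULL name skips the copy) can be modelled in plain C roughly as below. Everything here is a simplified assumption made for illustration: `struct sketch_msghdr`, `SKETCH_ADDR_MAX`, both handlers, and `sketch_recvmsg()` are hypothetical and are not the kernel's socket code. Treating an oversized length as "no address" is a choice of this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for sizeof(struct sockaddr_storage); the value is an assumption. */
#define SKETCH_ADDR_MAX 128

/* Hypothetical, simplified message header. */
struct sketch_msghdr {
	void *msg_name;
	size_t msg_namelen;
};

typedef void (*sketch_recv_fn)(struct sketch_msghdr *msg);

/* Handler that reports a source address when the caller wants one. */
static void sketch_handler_addr(struct sketch_msghdr *msg)
{
	static const char peer[] = "peer";

	if (msg->msg_name) {
		memcpy(msg->msg_name, peer, sizeof(peer));
		msg->msg_namelen = sizeof(peer);
	}
}

/* Handler that never sets an address. */
static void sketch_handler_silent(struct sketch_msghdr *msg)
{
	(void)msg;
}

/* Returns how many address bytes may be copied back to the caller. */
static size_t sketch_recvmsg(struct sketch_msghdr *msg, int want_addr,
			     void *kbuf, sketch_recv_fn handler)
{
	assert(msg != NULL && handler != NULL);

	msg->msg_name = want_addr ? kbuf : NULL;
	msg->msg_namelen = 0;	/* handler must opt in by setting a length */
	handler(msg);

	if (!msg->msg_name || msg->msg_namelen > SKETCH_ADDR_MAX)
		return 0;
	return msg->msg_namelen;
}
```

Because the length is reset before the handler runs, a handler that forgets to fill in an address results in nothing being copied back, instead of stale bytes.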
    • D
      bridge: flush br's address entry in fdb when remove the bridge dev · f8730420
      Ding Tianhong committed
      
      When the following commands are executed:
      
      brctl addbr br0
      ifconfig br0 hw ether <addr>
      rmmod bridge
      
      The calltrace will occur:
      
      [  563.312114] device eth1 left promiscuous mode
      [  563.312188] br0: port 1(eth1) entered disabled state
      [  563.468190] kmem_cache_destroy bridge_fdb_cache: Slab cache still has objects
      [  563.468197] CPU: 6 PID: 6982 Comm: rmmod Tainted: G           O 3.12.0-0.7-default+ #9
      [  563.468199] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [  563.468200]  0000000000000880 ffff88010f111e98 ffffffff814d1c92 ffff88010f111eb8
      [  563.468204]  ffffffff81148efd ffff88010f111eb8 0000000000000000 ffff88010f111ec8
      [  563.468206]  ffffffffa062a270 ffff88010f111ed8 ffffffffa063ac76 ffff88010f111f78
      [  563.468209] Call Trace:
      [  563.468218]  [<ffffffff814d1c92>] dump_stack+0x6a/0x78
      [  563.468234]  [<ffffffff81148efd>] kmem_cache_destroy+0xfd/0x100
      [  563.468242]  [<ffffffffa062a270>] br_fdb_fini+0x10/0x20 [bridge]
      [  563.468247]  [<ffffffffa063ac76>] br_deinit+0x4e/0x50 [bridge]
      [  563.468254]  [<ffffffff810c7dc9>] SyS_delete_module+0x199/0x2b0
      [  563.468259]  [<ffffffff814e0922>] system_call_fastpath+0x16/0x1b
      [  570.377958] Bridge firewalling registered
      
      --------------------------- cut here -------------------------------
      
      The reason is that when the bridge dev's address is changed,
      br_fdb_change_mac_address() adds the new address to the fdb, but when
      the bridge is removed, the address entry in the fdb is not freed, so
      bridge_fdb_cache still has objects when the cache is destroyed. Fix
      this by flushing the bridge address entry when removing the bridge.
      
      v2: following Toshiaki Makita's and Vlad's suggestion: deleting only
          the vlan0 entry still leaks if the vlan id is some other number,
          so call fdb_delete_by_port(br, NULL, 1) to flush all entries
          whose dst is NULL for the bridge.
      Suggested-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Suggested-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • V
      net: core: Always propagate flag changes to interfaces · d2615bf4
      Vlad Yasevich committed
      The following commit:
          b6c40d68
          net: only invoke dev->change_rx_flags when device is UP
      
      tried to fix a problem with VLAN devices and promiscuous flag setting.
      The issue was that VLAN device was setting a flag on an interface that
      was down, thus resulting in bad promiscuity count.
      This commit blocked flag propagation to any device that is currently
      down.
      
      A later commit:
          deede2fa
          vlan: Don't propagate flag changes on down interfaces
      
      fixed VLAN code to only propagate flags when the VLAN interface is up,
      thus fixing the same issue as above, only localized to VLAN.
      
      The problem we have now is that if we create a complex stack
      involving multiple software devices like bridges, bonds, and vlans,
      then it is possible that the flags would not propagate properly to
      the physical devices.  A simple example of the scenario is the
      following:
      
        eth0----> bond0 ----> bridge0 ---> vlan50
      
      If bond0 or eth0 happen to be down at the time bond0 is added to
      the bridge, then eth0 will never have promisc mode set which is
      currently required for operation as part of the bridge.  As a
      result, packets with vlan50 will be dropped by the interface.
      
      The only 2 devices that implement the special flag handling are
      VLAN and DSA and they both have required code to prevent incorrect
      flag propagation.  As a result we can remove the generic solution
      introduced in b6c40d68 and leave
      it to the individual devices to decide whether they will block
      flag propagation or not.
      Reported-by: Stefan Priebe <s.priebe@profihost.ag>
      Suggested-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      ipv4: fix race in concurrent ip_route_input_slow() · dcdfdf56
      Alexei Starovoitov committed
      CPUs can ask for a local route via ip_route_input_noref() concurrently.
      If nh_rth_input is not cached yet, CPUs will proceed to allocate
      equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
      via rt_cache_route().
      Most of the time they succeed, but on occasion the following two lines:
      	orig = *p;
      	prev = cmpxchg(p, orig, rt);
      in rt_cache_route() do race and one of the CPUs fails to complete cmpxchg.
      But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
      so dst is leaking. dst_destroy() is never called and 'lo' device
      refcnt doesn't go to zero, which can be seen in the logs as:
      	unregister_netdevice: waiting for lo to become free. Usage count = 1
      Adding mdelay() between above two lines makes it easily reproducible.
      Fix it similar to nh_pcpu_rth_output case.
      
      Fixes: d2d68ba9 ("ipv4: Cache input routes in fib_info nexthops.")
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
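The leak pattern in this commit, a compare-and-exchange whose failure is ignored, can be reproduced in miniature with C11 atomics. The sketch below is an assumption-laden illustration: `struct sketch_route`, the allocation helpers, and `sketch_install()` are hypothetical, and releasing the object immediately on a lost race is simply the easiest way to show the idea; the kernel's actual handling of the losing dst may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical cached object. */
struct sketch_route {
	int id;
};

/* Number of routes currently allocated, used to make leaks visible. */
static int sketch_live;

static struct sketch_route *sketch_route_alloc(int id)
{
	struct sketch_route *rt = malloc(sizeof(*rt));

	if (rt) {
		rt->id = id;
		sketch_live++;
	}
	return rt;
}

static void sketch_route_free(struct sketch_route *rt)
{
	free(rt);
	sketch_live--;
}

/*
 * Tries to publish rt in *slot, assuming the slot still holds snapshot.
 * If another thread changed the slot in the meantime, the exchange
 * fails; the caller's object is then released instead of being leaked.
 */
static bool sketch_install(_Atomic(struct sketch_route *) *slot,
			   struct sketch_route *snapshot,
			   struct sketch_route *rt)
{
	assert(rt != NULL);
	if (atomic_compare_exchange_strong(slot, &snapshot, rt))
		return true;
	sketch_route_free(rt);	/* lost the race */
	return false;
}
```

Passing the snapshot in as a parameter makes it possible to simulate the race in a single thread by handing in a stale value.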
  11. 20 Nov, 2013 1 commit
    • J
      genetlink: make multicast groups const, prevent abuse · 2a94fe48
      Johannes Berg committed
      Register generic netlink multicast groups as an array with
      the family and give them contiguous group IDs. Then instead
      of passing the global group ID to the various functions that
      send messages, pass the ID relative to the family - for most
      families that's just 0 because they only have one group.
      
      This avoids the list_head and ID in each group, adding a new
      field for the mcast group ID offset to the family.
      
      At the same time, this allows us to prevent abusing groups
      again like the quota and dropmon code did, since we can now
      check that a family only uses a group it owns.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
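The family-relative group numbering described in this commit amounts to an offset plus a bounds check. A minimal sketch, assuming a hypothetical `struct sketch_family` and helper `sketch_global_group()` (neither is the genetlink API), might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical family record: its groups occupy a contiguous ID range. */
struct sketch_family {
	const char *name;
	unsigned int mcgrp_offset;	/* global ID of the family's first group */
	unsigned int n_mcgrps;		/* number of groups the family owns */
};

/*
 * Translates a family-relative group ID into a global one.
 * Returns 0 on success, -1 if the family does not own that group.
 */
static int sketch_global_group(const struct sketch_family *fam,
			       unsigned int rel, unsigned int *global)
{
	assert(fam != NULL && global != NULL);
	if (rel >= fam->n_mcgrps)
		return -1;
	*global = fam->mcgrp_offset + rel;
	return 0;
}
```

Because callers only ever pass a relative ID, a family cannot address a group outside its own range, which is the abuse the commit message refers to.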