1. 20 Nov 2008 (5 commits)
  2. 18 Nov 2008 (1 commit)
  3. 17 Nov 2008 (11 commits)
    •
      ematch: simpler tcf_em_unregister() · 4d24b52a
      Committed by Alexey Dobriyan
      Simply delete ops from list and let list debugging do the job.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4d24b52a
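      A minimal sketch of what this change boils down to, assuming the
      2008-era ematch registration API (ematch_mod_lock and the ops->link
      member are taken from context, not verified against the tree):

        void tcf_em_unregister(struct tcf_ematch_ops *ops)
        {
                write_lock(&ematch_mod_lock);
                /* no error path: list debugging catches stale/double removal */
                list_del(&ops->link);
                write_unlock(&ematch_mod_lock);
        }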
    •
      dccp: Deprecate Ack Ratio sysctl · dd9c0e36
      Committed by Gerrit Renker
      This patch deprecates the Ack Ratio sysctl, since
       * Ack Ratio is entirely ignored by CCID-3 and CCID-4;
       * Ack Ratio currently doesn't work in CCID-2 (i.e. it is always set to 1);
       * even if it did work in CCID-2, there would be no point for a user to change it:
         - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
         - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
           (since it waits for Acks that will never arrive within this window),
         - cwnd is not a user-configurable value.
      
      The only reasonable remaining use for Ack Ratio is to print it for debugging.
      It is planned to do this later on, e.g. as part of dccp_probe.
      
      With this patch Ack Ratio is now under full control of feature negotiation:
       * Ack Ratio is resolved as a dependency of the selected CCID;
       * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
         the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
         Ratio 2 for both endpoints";
       * what happens then is part of another patch set, since it concerns the
         dynamic update of Ack Ratio while the connection is in full flight.
      
      Thanks to Tomasz Grobelny for discussion leading up to this patch.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd9c0e36
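      A hedged illustration of the dependency resolution described above;
      the struct layout and table are sketches, not the exact net/dccp/feat.c
      code:

        struct ccid_dependency {
                u8   dependent_feat;   /* feature pulled in by the CCID choice */
                bool is_local;
                bool is_mandatory;
                u8   val;              /* default value to negotiate */
        };

        /* CCID-2 depends on Ack Ratio, starting at 2 (RFC 4340, 11.3);
         * CCID-3/CCID-4 have no such entry, so Ack Ratio stays unused. */
        static const struct ccid_dependency ccid2_dependencies[] = {
                { .dependent_feat = DCCPF_ACK_RATIO, .is_local = true, .val = 2 },
                { 0 }   /* sentinel */
        };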
    •
      dccp: Feature negotiation for minimum-checksum-coverage · 29450559
      Committed by Gerrit Renker
      This provides feature negotiation for the server's minimum checksum
      coverage, which had so far been missing.
      
      Since sender/receiver coverage values range only from 0...15, their
      type has also been reduced in size, from u16 to a 4-bit value (u4).
      
      Feature-negotiation options are now generated for both sender and receiver
      coverage, i.e. if the peer has `forgotten' to enable partial coverage,
      feature negotiation will automatically enable (negotiate) the
      partial-coverage value for this connection.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29450559
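      Since the coverage values fit in 0..15, a 4-bit bitfield suffices; a
      sketch of what the "u16 to u4" reduction amounts to (field names are
      illustrative, not the tree's):

        struct dccp_coverage {
                __u8 pcslen:4,   /* sender   partial checksum coverage, 0..15 */
                     pcrlen:4;   /* receiver partial checksum coverage, 0..15 */
        };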
    •
      dccp: Deprecate old setsockopt framework · 49aebc66
      Committed by Gerrit Renker
      The previous setsockopt interface, which passed socket options via struct
      dccp_so_feat, is complicated and difficult to use. Continuing to support it
      leads to ugly code, since the old approach did not distinguish between NN
      and SP values.
      
      This patch removes the old setsockopt interface and replaces it with two new
      functions to register NN/SP values for feature negotiation. 
      These are essentially wrappers around the internal __feat_register functions,
      with checking added to avoid
      
       * wrong usage (type);
       * changing values while the connection is in progress.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49aebc66
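      A sketch of one of the two new wrappers, following the commit's
      description (details such as state names and error codes are assumptions):

        int dccp_feat_register_sp(struct sock *sk, u8 feat, u8 is_local,
                                  u8 const *list, u8 len)
        {
                /* reject changes while the connection is in progress */
                if (sk->sk_state != DCCP_CLOSED)
                        return -EISCONN;
                /* reject wrong usage: this wrapper handles SP values only */
                if (dccp_feat_type(feat) != FEAT_SP)
                        return -EINVAL;
                return __feat_register_sp(&dccp_sk(sk)->dccps_featneg,
                                          feat, is_local, 0, list, len);
        }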
    •
      virtio_net: VIRTIO_NET_F_MRG_RXBUF (improve rcv buffer allocation) · 3f2c31d9
      Committed by Mark McLoughlin
      If segmentation offload is enabled by the host, we currently allocate
      maximum sized packet buffers and pass them to the host. Each such buffer
      uses up 20 ring entries, allowing us to supply only 12 packet buffers to
      the host with a 256 entry ring. This is a huge overhead when receiving
      small packets, and is most keenly felt when receiving MTU sized
      packets from off-host.
      
      The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
      using receive buffers which are smaller than the maximum packet size.
      In order to transfer large packets to the guest, the host merges
      together multiple receive buffers to form a larger logical buffer.
      The number of merged buffers is returned to the guest via a field in
      the virtio_net_hdr.
      
      Make use of this support by supplying single page receive buffers to
      the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
      the payload to the skb's linear data buffer and adjust the fragment
      offset to point to the remaining data. This ensures proper alignment
      and allows us to not use any paged data for small packets. If the
      payload occupies multiple pages, we simply append those pages as
      fragments and free the associated skbs.
      
      This scheme allows us to be efficient in our use of ring entries
      while still supporting large packets. Benchmarking using netperf from
      an external machine to a guest over a 10Gb/s network shows a 100%
      improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
      with GSO disabled on the host side, throughput was seen to increase
      from 700Mb/s to 1.7Gb/s.
      
      Based on a patch from Herbert Xu.
      Signed-off-by: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f2c31d9
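      The merge count travels in an extended header; a sketch of the layout
      the commit describes (following the virtio spec of the time):

        struct virtio_net_hdr_mrg_rxbuf {
                struct virtio_net_hdr hdr;
                __u16 num_buffers;   /* ring buffers merged for this packet */
        };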
    •
      net: make sure struct dst_entry refcount is aligned on 64 bytes · 5635c10d
      Committed by Eric Dumazet
      As found in the past (commit f1dd9c37,
      "[NET]: Fix tbench regression in 2.6.25-rc1"), it is really
      important that the struct dst_entry refcount is aligned on a cache line.
      
      We cannot use __attribute__((aligned)), so manually pad the structure
      for 32 and 64 bit arches.

      For 32-bit: offsetof(struct dst_entry, __refcnt) is 0x80
      For 64-bit: offsetof(struct dst_entry, __refcnt) is 0xc0

      As it is not possible to guess the cache line size at compile time, we
      use a generic value of 64 bytes, which satisfies many current arches.
      (Using 128-byte alignment on 64-bit arches would waste 64 bytes.)

      Add a BUILD_BUG_ON to catch future updates to "struct dst_entry" that
      would break this alignment.
      
      "tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
      (2350 MB/s instead of 2250 MB/s)
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5635c10d
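      A minimal sketch of the compile-time guard (placed inside a function
      such as dst_alloc(); BUILD_BUG_ON fails the build when its condition
      is true):

        /* breaks the build unless __refcnt sits on a 64-byte boundary */
        BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);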
    •
      net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Committed by Eric Dumazet
      RCU was added to UDP lookups, using a fast infrastructure:
      - socket kmem_caches use SLAB_DESTROY_BY_RCU and don't pay the
        price of call_rcu() at freeing time.
      - hlist_nulls lets us use fewer memory barriers.
      
      This patch uses the same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, there is no slowdown for applications
      using short-lived TCP connections. A follow-up patch, converting
      rwlocks to spinlocks, will speed up this case even further.

      __inet_lookup_established() is pretty fast now that we don't have to
      dirty a contended cache line (read_lock/read_unlock).
      
      Only the established and timewait hash tables are converted to RCU
      (the bind and listen tables still use traditional locking).
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ab5aee7
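      The key enabler is the slab flag; a hedged sketch of creating a socket
      cache this way (an object may then be freed and reused without a grace
      period, so readers must re-validate whatever they looked up):

        prot->slab = kmem_cache_create(prot->name, prot->obj_size, 0,
                                       SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU,
                                       NULL);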
    •
      udp: Use hlist_nulls in UDP RCU code · 88ab1932
      Committed by Eric Dumazet
      This is a straightforward patch, using the hlist_nulls infrastructure.

      The RCUification of UDP was already done two weeks ago.

      Using hlist_nulls lets us avoid some memory barriers, both
      at lookup time and at delete time.
      
      The patch is large because it adds new macros to include/net/sock.h.
      These macros will be used by TCP & DCCP in the next patch.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88ab1932
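      A representative sketch of the kind of wrapper macro added to
      include/net/sock.h (the exact definition may differ; the underlying
      iterator comes from rculist_nulls.h):

        #define sk_nulls_for_each_rcu(__sk, node, list) \
                hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)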
    •
      rcu: Introduce hlist_nulls variant of hlist · bbaffaca
      Committed by Eric Dumazet
      hlist uses a NULL value to terminate a chain.

      The hlist_nulls variant instead uses pointers with the low-order bit set
      to 1 as end-of-list markers.

      This makes it possible to store many different end markers, so that some
      lockless RCU algorithms (used in the TCP/UDP stack, for example) can save
      some memory barriers in fast paths.
      
      Two new files are added:

      include/linux/list_nulls.h
        - mirrors the hlist part of include/linux/list.h, adapted to the
          hlist_nulls variant

      include/linux/rculist_nulls.h
        - mirrors the hlist part of include/linux/rculist.h, adapted to the
          hlist_nulls variant

         Only four helpers are declared for the moment:

           hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
           hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
      
      prefetch() calls were removed, since the end of a list is no longer a
      NULL value; prefetching an end marker could trigger useless (and possibly
      dangerous) memory transactions.
      
      Example of use (extracted from __udp4_lib_lookup())
      
        struct sock *sk, *result;
        struct hlist_nulls_node *node;
        unsigned short hnum = ntohs(dport);
        unsigned int hash = udp_hashfn(net, hnum);
        struct udp_hslot *hslot = &udptable->hash[hash];
        int score, badness;

        rcu_read_lock();
begin:
        result = NULL;
        badness = -1;
        sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                score = compute_score(sk, net, saddr, hnum, sport,
                                      daddr, dport, dif);
                if (score > badness) {
                        result = sk;
                        badness = score;
                }
        }
        /*
         * if the nulls value we got at the end of this lookup is
         * not the expected one, we must restart lookup.
         * We probably met an item that was moved to another chain.
         */
        if (get_nulls_value(node) != hash)
                goto begin;

        if (result) {
                if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
                        result = NULL;
                else if (unlikely(compute_score(result, net, saddr, hnum, sport,
                                                daddr, dport, dif) < badness)) {
                        sock_put(result);
                        goto begin;
                }
        }
        rcu_read_unlock();
        return result;
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bbaffaca
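      The low-order-bit encoding is easiest to see in two small helpers, a
      sketch of what include/linux/list_nulls.h provides:

        /* a chain ends when the "pointer" has its low bit set */
        static inline int is_a_nulls(const struct hlist_nulls_node *ptr)
        {
                return ((unsigned long)ptr & 1);
        }

        /* recover the marker value stored when the chain was set up */
        static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
        {
                return ((unsigned long)ptr) >> 1;
        }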
    •
      TPROXY: implemented IP_RECVORIGDSTADDR socket option · e8b2dfe9
      Committed by Balazs Scheidler
      When UDP traffic is redirected to a local UDP socket,
      the original destination address/port
      cannot be recovered with the in-kernel tproxy.
      
      This patch adds an IP_RECVORIGDSTADDR sockopt that enables
      an IP_ORIGDSTADDR ancillary message in recvmsg(). This
      ancillary message contains the original destination address/port
      of the packet being received.
      Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8b2dfe9
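      A hedged userspace sketch of consuming the new option (the constants
      live in linux/in.h; error handling is omitted):

        #include <string.h>
        #include <sys/socket.h>
        #include <linux/in.h>

        /* enable delivery of the original destination */
        int on = 1;
        setsockopt(fd, SOL_IP, IP_RECVORIGDSTADDR, &on, sizeof(on));

        /* after recvmsg(fd, &msg, 0): walk the ancillary data */
        struct cmsghdr *cmsg;
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                if (cmsg->cmsg_level == SOL_IP &&
                    cmsg->cmsg_type == IP_ORIGDSTADDR) {
                        struct sockaddr_in orig;
                        memcpy(&orig, CMSG_DATA(cmsg), sizeof(orig));
                        /* orig.sin_addr / orig.sin_port hold the
                         * destination before the tproxy redirect */
                }
        }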
    •
      phylib: make mdio-gpio work without OF (v4) · f004f3ea
      Committed by Paulius Zaleckas
      Make mdio-gpio work with non-OpenFirmware GPIO implementations.
      
      Additional changes to mdio-gpio:
      - use gpio_request() and gpio_free()
      - place the irq[] array in struct mdio_gpio_info
      - add module description, author and license
      - add a note about compiling this driver as a module
      - rename the mdc and mdio functions (they had ugly names)
      - change MII to MDIO in the bus name
      - add __init/__exit to the module (un)loading functions
      - probe fails if no PHYs were added to the bus
      - kzalloc bitbang with sizeof(*bitbang)
      
      Changes since v3:
      - keep bus naming "%x" to be compatible with existing drivers.
      
      Changes since v2:
      - further #ifdef reduction
      - platform driver will be registered on OF platforms also
      - unified platform and OF bus_id to phy%i
      
      Changes since v1:
      - removed NO_IRQ
      - reduced #ifdefs
      
      Laurent, please test this driver under OF.
      Signed-off-by: Paulius Zaleckas <paulius.zaleckas@teltonika.lt>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f004f3ea
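      A sketch of non-OF usage under the platform-data interface this patch
      implies (struct and field names are assumptions; GPIO numbers are
      board-specific examples):

        static struct mdio_gpio_platform_data mdio_pdata = {
                .mdc  = 10,
                .mdio = 11,
        };

        static struct platform_device mdio_bus_dev = {
                .name              = "mdio-gpio",
                .id                = 0,
                .dev.platform_data = &mdio_pdata,
        };

        /* board init code would then call:
         * platform_device_register(&mdio_bus_dev); */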
  4. 16 Nov 2008 (2 commits)
    •
      Fix inotify watch removal/umount races · 8f7b0ba1
      Committed by Al Viro
      Inotify watch removals suck violently.
      
      To kick the watch out we need (in this order) inode->inotify_mutex and
      ih->mutex.  That's fine if we have a hold on inode; however, for all
      other cases we need to make damn sure we don't race with umount.  We can
      *NOT* just grab a reference to a watch - inotify_unmount_inodes() will
      happily sail past it and we'll end up with a reference to an inode
      potentially outliving its superblock.
      
      Ideally we just want to grab an active reference to superblock if we
      can; that will make sure we won't go into inotify_umount_inodes() until
      we are done.  Cleanup is just deactivate_super().
      
      However, that leaves a messy case - what if we *are* racing with
      umount() and active references to superblock can't be acquired anymore?
      We can bump ->s_count, grab ->s_umount, which will almost certainly wait
      until the superblock is shut down and the watch in question is pining
      for fjords.  That's fine, but there is a problem - we might have hit the
      window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e.
      the moment when superblock is past the point of no return and is heading
      for shutdown) and the moment when deactivate_super() acquires
      ->s_umount.
      
      We could just do drop_super() yield() and retry, but that's rather
      antisocial and this stuff is luser-triggerable.  OTOH, having grabbed
      ->s_umount and having found that we'd got there first (i.e.  that
      ->s_root is non-NULL) we know that we won't race with
      inotify_umount_inodes().
      
      So we could grab a reference to watch and do the rest as above, just
      with drop_super() instead of deactivate_super(), right? Wrong.  We had
      to drop ih->mutex before we could grab ->s_umount.  So the watch
      could've been gone already.
      
      That still can be dealt with - we need to save watch->wd, do idr_find()
      and compare its result with our pointer.  If they match, we either have
      the damn thing still alive or we'd lost not one but two races at once,
      the watch had been killed and a new one got created with the same ->wd
      at the same address.  That couldn't have happened in inotify_destroy(),
      but inotify_rm_wd() could run into that.  Still, "new one got created"
      is not a problem - we have every right to kill it or leave it alone,
      whatever's more convenient.
      
      So we can use idr_find(...) == watch && watch->inode->i_sb == sb as
      "grab it and kill it" check.  If it's been our original watch, we are
      fine, if it's a newcomer - nevermind, just pretend that we'd won the
      race and kill the fscker anyway; we are safe since we know that its
      superblock won't be going away.
      
      And yes, this is far beyond mere "not very pretty"; so's the entire
      concept of inotify to start with.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Greg KH <greg@kroah.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f7b0ba1
    •
      Add 'pr_fmt()' format modifier to pr_xyz macros. · d091c2f5
      Committed by Martin Schwidefsky
      A common reason for device drivers to implement their own printk macros
      is the lack of a printk prefix with the standard pr_xyz macros.
      Introduce a pr_fmt() macro that is applied to the format string of every
      pr_xyz macro.

      The most common use of the pr_fmt macro would be to add the name of the
      device driver to all pr_xyz messages in a source file.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d091c2f5
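      A minimal usage sketch ("mydrv" is an example name): define pr_fmt()
      before the includes, and every pr_xyz() call in the file picks up the
      prefix.

        #define pr_fmt(fmt) "mydrv: " fmt   /* must precede the includes */

        #include <linux/kernel.h>

        /* pr_info("link up\n") now expands to
         * printk(KERN_INFO "mydrv: link up\n") */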
  5. 14 Nov 2008 (5 commits)
  6. 13 Nov 2008 (6 commits)
  7. 12 Nov 2008 (10 commits)