1. 25 Nov 2008 (2 commits)
    • 111cc8b9
    • tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Committed by Ilpo Järvinen
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      We then run into trouble when cleanup work for all of them has to
      be done once a large cumulative ACK arrives. Instead, try to return
      to the pre-split state already while more and more SACK info is
      being discovered, by combining newly SACKed areas with the previous
      skb if that one is SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT
      2) The write queue has fewer skbs to process (this affects everything
         that has to walk the queue past the SACKed areas)
      3) The write queue is consistent the whole time, so no other part
         of TCP has to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) clean_rtx_queue can release most of the pages with a single
         put_page() instead of the previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too, which allows construction of skbs
      even larger than what TSO originally split them into; it also handles
      the hole-on-every-nth-segment patterns that often occur during
      slow-start overshoot quite nicely. For this to be really useful,
      though, a retransmission would also have to get lost, since
      cumulative ACKs advance one hole at a time in the most typical case.
      
      TODO: handle upwards-only merging. That should be rather easy
      when the segment is fully SACKed, but I'm leaving it as a future
      work item (it won't make a very large difference anyway, since
      the current approach already covers quite a lot of the normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking the
      timestamps of the first and the last segment, but later realized
      that storing the timestamp of the last segment isn't really
      necessary. The cases that can occur are basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous due to reordering => keeping the timestamp
           of the last segment would skew things further off rather
           than do any good, since the ACK got triggered by one of
           the holes (besides some subtle issues that would make
           determining the right hole/skb an even harder problem).
           In any case, it has nothing to do with this change.
      
      I chose to route some abnormal-looking cases to a goto noop;
      some could be handled differently (e.g., by stopping the
      walk at that skb). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as a whole) could cause some macroscale
      (global) regression because of cache misses taken over the
      round-trip time, but it very likely comes out ahead because of far
      fewer (local) cache misses for the other write queue walkers and
      for the big recovery-clearing cumulative ACK.
      
      Worth noting is that these benefits would be very easy to get also
      without TSO/GSO being on, as long as the data is in pages so that
      we can merge them. Currently I won't let that happen, because
      DSACK splitting at a fragment would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragmenting is
      avoided, some of these conditions can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that a considerable number of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest is left as future work instead, since I got repeatable
      EFAULTs from tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing the previous SACK block, fewer allocs for new skbs
         are done; basically a new alloc is needed only when a new hole
         is detected and when the previous skb runs out of frags space
      
      ...which now only happens if reclaim is fast enough to dispose of
      the clone before the SACK block comes in (the window is one RTT
      long); otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      832d11c5
  2. 24 Nov 2008 (2 commits)
    • eth: Declare an optimized compare_ether_addr_64bits() function · 1f87e235
      Committed by Eric Dumazet
      Linus mentioned we could try to perform long word operations, even
      on potentially unaligned addresses, on x86 at least. David mentioned
      the HAVE_EFFICIENT_UNALIGNED_ACCESS test to handle this on all
      arches that have efficient unaligned accesses.
      
      I tried this idea and got nice assembly on 32 bits:
      
      158:   33 82 38 01 00 00       xor    0x138(%edx),%eax
      15e:   33 8a 34 01 00 00       xor    0x134(%edx),%ecx
      164:   c1 e0 10                shl    $0x10,%eax
      167:   09 c1                   or     %eax,%ecx
      169:   74 0b                   je     176 <eth_type_trans+0x87>
      
      And very nice assembly on 64 bits of course (one xor, one shl)
      
      Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 %,
      expected since we remove 8 instructions on a fast path.
      
      This patch implements a compare_ether_addr_64bits() function that
      uses the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS ifdef to efficiently
      perform the 6-byte comparison on all capable arches.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f87e235
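
      For illustration, a minimal userspace sketch of the same trick (not the
      kernel code): load 8 bytes from each address, XOR, and mask off the 2
      bytes beyond the 6-byte Ethernet address. It assumes, as the kernel
      variant does, that the caller guarantees 2 readable padding bytes after
      each address so a full 8-byte load is safe.

      #include <stdint.h>
      #include <string.h>

      /* Sketch only: 'a' and 'b' must each have 2 readable padding bytes. */
      static inline int mac_equal_64bits(const uint8_t a[8], const uint8_t b[8])
      {
          uint64_t x, y;

          memcpy(&x, a, sizeof(x));   /* compiles to one (possibly unaligned) load */
          memcpy(&y, b, sizeof(y));

      #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
          return ((x ^ y) & ~0xffffULL) == 0;             /* keep the high 6 bytes */
      #else
          return ((x ^ y) & 0x0000ffffffffffffULL) == 0;  /* keep the low 6 bytes  */
      #endif
      }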
    • dccp: Set per-connection CCIDs via socket options · b20a9c24
      Committed by Gerrit Renker
      With this patch, TX/RX CCIDs can now be changed on a per-connection
      basis, which overrides the defaults set by the global sysctl variables
      for TX/RX CCIDs.
      
      To make full use of this facility, the remaining patches of this patch
      set are needed, which track dependencies and activate negotiated
      feature values.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b20a9c24
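
      A hedged userspace sketch of requesting a per-connection CCID with the
      new socket options (option names from linux/dccp.h; the function name
      and the choice of CCID-2 here are only illustrative):

      #include <stdint.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <linux/dccp.h>        /* DCCP_SOCKOPT_CCID & friends */

      #ifndef SOL_DCCP
      #define SOL_DCCP 269
      #endif

      /* Sketch: ask for CCID-2 in both directions on this socket, overriding
       * the global sysctl defaults.  Must be done before the connection is
       * established, i.e. before feature negotiation has completed. */
      static int dccp_socket_with_ccid2(void)
      {
          int fd = socket(AF_INET, SOCK_DCCP, IPPROTO_DCCP);
          uint8_t ccid = 2;                      /* CCID-2: TCP-like congestion control */

          if (fd < 0)
              return -1;
          /* DCCP_SOCKOPT_CCID sets both TX and RX preferences at once;
           * DCCP_SOCKOPT_TX_CCID / DCCP_SOCKOPT_RX_CCID set one side only. */
          if (setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_CCID, &ccid, sizeof(ccid)) < 0)
              return -1;
          return fd;
      }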
  3. 22 Nov 2008 (1 commit)
  4. 21 Nov 2008 (8 commits)
  5. 20 Nov 2008 (8 commits)
    • pkt_sched: add DRR scheduler · 13d2a1d2
      Committed by Patrick McHardy
      Add classful DRR scheduler as a more flexible replacement for SFQ.
      
      The main difference from the algorithm described in "Efficient Fair Queueing
      using Deficit Round Robin" is that this implementation doesn't drop packets
      from the longest queue on overrun, because it is classful and limits are
      handled by each individual child qdisc.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13d2a1d2
    • netlink: avoid memset of 0 bytes sparse warning · 0c19b0ad
      Committed by Patrick McHardy
      A netlink attribute padding of zero triggers this sparse warning:
      
      include/linux/netlink.h:245:8: warning: memset with byte count of 0
      
      Avoid the memset when the size parameter is constant and requires no padding.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c19b0ad
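
      The pattern boils down to guarding the memset so that a compile-time
      constant zero length never reaches it; a generic sketch of the idea
      (not the netlink.h code itself, and zero_pad is a made-up name):

      #include <string.h>

      /* Only emit the memset when the padding length is non-zero or not
       * known at compile time; the dead memset(ptr, 0, 0) call can then be
       * eliminated, so the sparse warning disappears. */
      #define zero_pad(ptr, len)                                   \
          do {                                                     \
              if (!__builtin_constant_p(len) || (len) > 0)         \
                  memset((ptr), 0, (len));                         \
          } while (0)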
    • filter: add SKF_AD_NLATTR_NEST to look for nested attributes · d214c753
      Committed by Pablo Neira Ayuso
      SKF_AD_NLATTR allows us to find the first matching attribute in a
      stream of netlink attributes, from one offset to the end of the
      netlink message. This is not suitable for looking up a specific
      attribute inside a set of nested attributes.

      For example, in ctnetlink messages, if we look for the CTA_V6_SRC
      attribute in a message that talks about an IPv4 connection,
      SKF_AD_NLATTR returns the offset of CTA_STATUS, which has the same
      value as CTA_V6_SRC but sits outside the nest. To differentiate
      CTA_STATUS and CTA_V6_SRC, we would have to make assumptions about the
      size of the attribute and the usual offset, resulting in horrible
      BPF code.
      
      This patch adds SKF_AD_NLATTR_NEST, a variant of
      SKF_AD_NLATTR that looks for an attribute within the limits of
      a nested attribute, but not beyond it.
      
      This patch validates that we have enough room to look for the
      nested attributes - based on a suggestion from Patrick McHardy.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d214c753
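
      A heavily hedged sketch of how the new ancillary load might be used from
      a classic socket-filter program. The register convention is assumed from
      the existing SKF_AD_NLATTR (X holds the attribute type, A the starting
      offset, and the matching attribute's offset, or 0, comes back in A); the
      attribute types and the 20-byte start offset below are placeholders for
      a ctnetlink-style message.

      #include <linux/filter.h>

      #define ATTR_START 20      /* assumed: nlmsghdr + nfgenmsg for ctnetlink */
      #define OUTER_TYPE 1       /* placeholder nest attribute type            */
      #define INNER_TYPE 3       /* placeholder nested attribute type          */

      /* Sketch: accept only messages carrying INNER_TYPE nested inside OUTER_TYPE. */
      static struct sock_filter nested_attr_filter[] = {
          BPF_STMT(BPF_LD  | BPF_IMM, ATTR_START),                 /* A = start offset  */
          BPF_STMT(BPF_LDX | BPF_W | BPF_IMM, OUTER_TYPE),         /* X = outer type    */
          BPF_STMT(BPF_LD  | BPF_W | BPF_ABS,
                   SKF_AD_OFF + SKF_AD_NLATTR),                    /* A = offset or 0   */
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 4, 0),            /* not found: reject */
          BPF_STMT(BPF_LDX | BPF_W | BPF_IMM, INNER_TYPE),         /* X = nested type   */
          BPF_STMT(BPF_LD  | BPF_W | BPF_ABS,
                   SKF_AD_OFF + SKF_AD_NLATTR_NEST),               /* search the nest   */
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 1, 0),            /* not found: reject */
          BPF_STMT(BPF_RET | BPF_K, 0xffff),                       /* accept            */
          BPF_STMT(BPF_RET | BPF_K, 0),                            /* reject            */
      };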
    • netdev: expose ethernet address primitives · ccad637b
      Committed by Stephen Hemminger
      When Ethernet devices are converted, the function pointer setup done
      by eth_setup() needs to happen during initialization.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ccad637b
    • netdev: introduce dev_get_stats() · eeda3fd6
      Committed by Stephen Hemminger
      In order for the network device ops get_stats call to be immutable, the handling
      of the default internal network device stats block has to be changed. Add a new
      helper function which replaces the old use of internal_get_stats.
      
      Note: the return type is changed to make it clear that the caller
      should not go changing the returned statistics.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eeda3fd6
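
      A hedged sketch of the intended call pattern (the prototype as assumed
      from this change; later kernels altered it), with the const return type
      signalling read-only access; show_rx_packets is a made-up caller:

      #include <linux/netdevice.h>

      /* Sketch: read stats through the helper and treat them as read-only. */
      static void show_rx_packets(struct net_device *dev)
      {
          const struct net_device_stats *stats = dev_get_stats(dev);

          printk(KERN_INFO "%s: rx_packets=%lu\n", dev->name, stats->rx_packets);
      }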
    • netdev: network device operations infrastructure · d314774c
      Committed by Stephen Hemminger
      This patch changes the network device internal API to move administrative
      operations out of the network device structure and into a separate structure.

      This patch involves some hackery to maintain compatibility between the
      new and old model, so all 300+ drivers don't have to be changed at once.
      For drivers that aren't converted yet, the netdevice_ops virtual function list
      still resides in the net_device structure. For old protocols, the new
      net_device_ops are copied out to the old net_device pointers.

      After the transition is completed, the nag message can be changed to
      a WARN_ON, and the compatibility code can be made configurable.
      
      Some function pointers aren't moved:
      * destructor can't be in net_device_ops because
        it may need to be referenced after the module is unloaded.
      * neighbor setup is manipulated in a couple of places that need special
        consideration
      * hard_start_xmit is in the fast path for transmit.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d314774c
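
      A hedged sketch of the new model from a converted driver's point of view
      (the mydrv_* callbacks are placeholders; only fields listed in this ops
      structure are assumed):

      #include <linux/netdevice.h>

      static int mydrv_open(struct net_device *dev);
      static int mydrv_stop(struct net_device *dev);
      static struct net_device_stats *mydrv_get_stats(struct net_device *dev);
      static int mydrv_change_mtu(struct net_device *dev, int new_mtu);

      /* Sketch: a converted driver gathers its administrative callbacks in one
       * const ops table and points the net_device at it, instead of filling
       * individual function pointers inside struct net_device. */
      static const struct net_device_ops mydrv_netdev_ops = {
          .ndo_open       = mydrv_open,
          .ndo_stop       = mydrv_stop,
          .ndo_get_stats  = mydrv_get_stats,
          .ndo_change_mtu = mydrv_change_mtu,
      };

      static void mydrv_setup(struct net_device *dev)
      {
          dev->netdev_ops = &mydrv_netdev_ops;
      }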
    • cpuset: update top cpuset's mems after adding a node · f481891f
      Committed by Miao Xie
      After adding a node into the machine, top cpuset's mems isn't updated.
      
      By reviewing the code, we found that the update function
      
        cpuset_track_online_nodes()
      
      was invoked after node_states[N_ONLINE] changed.  This is wrong because
      N_ONLINE just means the node has a pgdat; when a node has (or gains) memory,
      we use N_HIGH_MEMORY.  So we should invoke the update function after
      node_states[N_HIGH_MEMORY] changes, just as its commit message says.
      
      This patch fixes it.  And we use notifier of memory hotplug instead of
      direct calling of cpuset_track_online_nodes().
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f481891f
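
      A hedged sketch of the notifier pattern referred to above (the callback
      body is elided and the function names are placeholders; event names and
      the hotplug_memory_notifier() helper are from linux/memory.h):

      #include <linux/init.h>
      #include <linux/memory.h>
      #include <linux/notifier.h>

      /* Sketch: react to memory hot-add/remove through the memory-hotplug
       * notifier chain instead of hooking node_states[N_ONLINE] changes. */
      static int top_cpuset_mems_notify(struct notifier_block *self,
                                        unsigned long action, void *arg)
      {
          switch (action) {
          case MEM_ONLINE:
          case MEM_OFFLINE:
              /* refresh the top cpuset's mems from node_states[N_HIGH_MEMORY] here */
              break;
          }
          return NOTIFY_OK;
      }

      static int __init top_cpuset_mems_init(void)
      {
          hotplug_memory_notifier(top_cpuset_mems_notify, 10);
          return 0;
      }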
    • reintroduce accept4 · de11defe
      Committed by Ulrich Drepper
      Introduce a new accept4() system call.  The addition of this system call
      matches analogous changes in 2.6.27 (dup3(), eventfd2(), signalfd4(),
      inotify_init1(), epoll_create1(), pipe2()) which added new system calls
      that differed from analogous traditional system calls in adding a flags
      argument that can be used to access additional functionality.
      
      The accept4() system call is exactly the same as accept(), except that
      it adds a flags bit-mask argument.  Two flags are initially implemented.
      (Most of the new system calls in 2.6.27 also had both of these flags.)
      
      SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
      for the new file descriptor returned by accept4().  This is a useful
      security feature to avoid leaking information in a multithreaded
      program where one thread is doing an accept() at the same time as
      another thread is doing a fork() plus exec().  More details here:
      "Secure File Descriptor Handling" (Ulrich Drepper),
      http://udrepper.livejournal.com/20407.html
      
      The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
      to be enabled on the new open file description created by accept4().
      (This flag is merely a convenience, saving the use of additional
      fcntl(F_GETFL) and fcntl(F_SETFL) calls to achieve the same result.)
      
      Here's a test program.  Works on x86-32.  Should work on x86-64, but
      I (mtk) don't have a system to hand to test with.
      
      It tests accept4() with each of the four possible combinations of
      SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
      that the appropriate flags are set on the file descriptor/open file
      description returned by accept4().
      
      I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
      and it passes according to my test program.
      
      /* test_accept4.c
      
        Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk
             <mtk.manpages@gmail.com>
      
        Licensed under the GNU GPLv2 or later.
      */
      #define _GNU_SOURCE
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      
      #define PORT_NUM 33333
      
      #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
      
      /**********************************************************************/
      
      /* The following is what we need until glibc gets a wrapper for
        accept4() */
      
      /* Flags for socket(), socketpair(), accept4() */
      #ifndef SOCK_CLOEXEC
      #define SOCK_CLOEXEC    O_CLOEXEC
      #endif
      #ifndef SOCK_NONBLOCK
      #define SOCK_NONBLOCK   O_NONBLOCK
      #endif
      
      #ifdef __x86_64__
      #define SYS_accept4 288
      #elif __i386__
      #define USE_SOCKETCALL 1
      #define SYS_ACCEPT4 18
      #else
      #error "Sorry -- don't know the syscall # on this architecture"
      #endif
      
      static int
      accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
      {
         printf("Calling accept4(): flags = %x", flags);
         if (flags != 0) {
             printf(" (");
             if (flags & SOCK_CLOEXEC)
                 printf("SOCK_CLOEXEC");
             if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
                 printf(" ");
             if (flags & SOCK_NONBLOCK)
                 printf("SOCK_NONBLOCK");
             printf(")");
         }
         printf("\n");
      
      #if USE_SOCKETCALL
         long args[6];
      
         args[0] = fd;
         args[1] = (long) sockaddr;
         args[2] = (long) addrlen;
         args[3] = flags;
      
         return syscall(SYS_socketcall, SYS_ACCEPT4, args);
      #else
         return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
      #endif
      }
      
      /**********************************************************************/
      
      static int
      do_test(int lfd, struct sockaddr_in *conn_addr,
             int closeonexec_flag, int nonblock_flag)
      {
         int connfd, acceptfd;
         int fdf, flf, fdf_pass, flf_pass;
         struct sockaddr_in claddr;
         socklen_t addrlen;
      
         printf("=======================================\n");
      
         connfd = socket(AF_INET, SOCK_STREAM, 0);
         if (connfd == -1)
             die("socket");
         if (connect(connfd, (struct sockaddr *) conn_addr,
                     sizeof(struct sockaddr_in)) == -1)
             die("connect");
      
         addrlen = sizeof(struct sockaddr_in);
         acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
                            closeonexec_flag | nonblock_flag);
         if (acceptfd == -1) {
             perror("accept4()");
             close(connfd);
             return 0;
         }
      
         fdf = fcntl(acceptfd, F_GETFD);
         if (fdf == -1)
             die("fcntl:F_GETFD");
         fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
                    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
         printf("Close-on-exec flag is %sset (%s); ",
                 (fdf & FD_CLOEXEC) ? "" : "not ",
                 fdf_pass ? "OK" : "failed");
      
         flf = fcntl(acceptfd, F_GETFL);
         if (flf == -1)
             die("fcntl:F_GETFD");
         flf_pass = ((flf & O_NONBLOCK) != 0) ==
                    ((nonblock_flag & SOCK_NONBLOCK) !=0);
         printf("nonblock flag is %sset (%s)\n",
                 (flf & O_NONBLOCK) ? "" : "not ",
                 flf_pass ? "OK" : "failed");
      
         close(acceptfd);
         close(connfd);
      
         printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
         return fdf_pass && flf_pass;
      }
      
      static int
      create_listening_socket(int port_num)
      {
         struct sockaddr_in svaddr;
         int lfd;
         int optval;
      
         memset(&svaddr, 0, sizeof(struct sockaddr_in));
         svaddr.sin_family = AF_INET;
         svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
         svaddr.sin_port = htons(port_num);
      
         lfd = socket(AF_INET, SOCK_STREAM, 0);
         if (lfd == -1)
             die("socket");
      
         optval = 1;
         if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
                        sizeof(optval)) == -1)
             die("setsockopt");
      
         if (bind(lfd, (struct sockaddr *) &svaddr,
                  sizeof(struct sockaddr_in)) == -1)
             die("bind");
      
         if (listen(lfd, 5) == -1)
             die("listen");
      
         return lfd;
      }
      
      int
      main(int argc, char *argv[])
      {
         struct sockaddr_in conn_addr;
         int lfd;
         int port_num;
         int passed;
      
         passed = 1;
      
         port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;
      
         memset(&conn_addr, 0, sizeof(struct sockaddr_in));
         conn_addr.sin_family = AF_INET;
         conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
         conn_addr.sin_port = htons(port_num);
      
         lfd = create_listening_socket(port_num);
      
         if (!do_test(lfd, &conn_addr, 0, 0))
             passed = 0;
         if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
             passed = 0;
         if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
             passed = 0;
         if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
             passed = 0;
      
         close(lfd);
      
         exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
      }
      
      [mtk.manpages@gmail.com: rewrote changelog, updated test program]
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de11defe
  6. 18 Nov 2008 (1 commit)
  7. 17 Nov 2008 (8 commits)
    • dccp: Deprecate Ack Ratio sysctl · dd9c0e36
      Committed by Gerrit Renker
      This patch deprecates the Ack Ratio sysctl, since
       * Ack Ratio is entirely ignored by CCID-3 and CCID-4,
       * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
       * even if it would work in CCID-2, there is no point for a user to change it:
         - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
         - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
           (since it ends up waiting for Acks that will never arrive within this window),
         - cwnd is not a user-configurable value.
      
      The only reasonable place for Ack Ratio is to print it for debugging. It is
      planned to do this later on, as part of e.g. dccp_probe.
      
      With this patch Ack Ratio is now under full control of feature negotiation:
       * Ack Ratio is resolved as a dependency of the selected CCID;
       * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
         the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
         Ratio 2 for both endpoints";
       * what happens then is part of another patch set, since it concerns the
         dynamic update of Ack Ratio while the connection is in full flight.
      
      Thanks to Tomasz Grobelny for discussion leading up to this patch.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd9c0e36
    • dccp: Feature negotiation for minimum-checksum-coverage · 29450559
      Committed by Gerrit Renker
      This provides feature negotiation for server minimum checksum coverage
      which so far has been missing.
      
      Since sender/receiver coverage values range only from 0...15, their
      type has also been reduced in size from u16 to u4.
      
      Feature-negotiation options are now generated for both sender and receiver
      coverage, i.e. when the peer has `forgotten' to enable partial coverage
      then feature negotiation will automatically enable (negotiate) the partial
      coverage value for this connection.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29450559
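
      A hedged userspace sketch of requesting partial checksum coverage on a
      DCCP socket (option names from linux/dccp.h; the helper name and the
      value 4 are only illustrative; semantics per RFC 4340, 9.2):

      #include <sys/socket.h>
      #include <linux/dccp.h>       /* DCCP_SOCKOPT_SEND_CSCOV / _RECV_CSCOV */

      #ifndef SOL_DCCP
      #define SOL_DCCP 269
      #endif

      /* Sketch: restrict the checksum to the header plus an initial part of
       * the payload on send, and set the acceptable coverage on receive.
       * 0 means full coverage; 1..15 progressively restrict coverage. */
      static int set_partial_cscov(int fd, int cscov)
      {
          if (setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SEND_CSCOV,
                         &cscov, sizeof(cscov)) < 0)
              return -1;
          return setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_RECV_CSCOV,
                            &cscov, sizeof(cscov));
      }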
    • dccp: Deprecate old setsockopt framework · 49aebc66
      Committed by Gerrit Renker
      The previous setsockopt interface, which passed socket options via struct
      dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to
      ugly code since the old approach did not distinguish between NN and SP values.
      
      This patch removes the old setsockopt interface and replaces it with two new
      functions to register NN/SP values for feature negotiation. 
      These are essentially wrappers around the internal __feat_register functions,
      with checking added to avoid
      
       * wrong usage (type);
       * changing values while the connection is in progress.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49aebc66
    • virtio_net: VIRTIO_NET_F_MRG_RXBUF (improve rcv buffer allocation) · 3f2c31d9
      Committed by Mark McLoughlin
      If segmentation offload is enabled by the host, we currently allocate
      maximum sized packet buffers and pass them to the host. Each such buffer
      uses up 20 ring entries, allowing us to supply only 12 packet buffers to
      the host with a 256 entry ring. This is a huge overhead when receiving
      small packets, and is most keenly felt when receiving MTU sized
      packets from off-host.
      
      The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
      using receive buffers which are smaller than the maximum packet size.
      In order to transfer large packets to the guest, the host merges
      together multiple receive buffers to form a larger logical buffer.
      The number of merged buffers is returned to the guest via a field in
      the virtio_net_hdr.
      
      Make use of this support by supplying single page receive buffers to
      the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
      the payload to the skb's linear data buffer and adjust the fragment
      offset to point to the remaining data. This ensures proper alignment
      and allows us to not use any paged data for small packets. If the
      payload occupies multiple pages, we simply append those pages as
      fragments and free the associated skbs.
      
      This scheme allows us to be efficient in our use of ring entries
      while still supporting large packets. Benchmarking using netperf from
      an external machine to a guest over a 10Gb/s network shows a 100%
      improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
      with GSO disabled on the host side, throughput was seen to increase
      from 700Mb/s to 1.7Gb/s.
      
      Based on a patch from Herbert Xu.
      Signed-off-by: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f2c31d9
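
      For reference, a sketch mirroring the receive header layout described
      above (the real definition lives in linux/virtio_net.h as struct
      virtio_net_hdr_mrg_rxbuf; the struct name used here is changed to avoid
      clashing with that header):

      #include <linux/virtio_net.h>     /* struct virtio_net_hdr */

      /* Sketch: when VIRTIO_NET_F_MRG_RXBUF is negotiated, each received
       * packet is preceded by this header, and num_buffers tells the guest
       * how many ring buffers the host merged to hold the packet. */
      struct mrg_rxbuf_hdr_sketch {
          struct virtio_net_hdr hdr;    /* the usual per-packet header */
          __u16 num_buffers;            /* number of merged rx buffers */
      };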
    • udp: Use hlist_nulls in UDP RCU code · 88ab1932
      Committed by Eric Dumazet
      This is a straightforward patch, using the hlist_nulls infrastructure.

      RCUification was already done on UDP two weeks ago.

      Using hlist_nulls permits us to avoid some memory barriers, both
      at lookup time and at delete time.

      The patch is large because it adds new macros to include/net/sock.h.
      These macros will be used by TCP & DCCP in the next patch.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88ab1932
    • rcu: Introduce hlist_nulls variant of hlist · bbaffaca
      Committed by Eric Dumazet
      hlist uses a NULL value to terminate a chain.

      The hlist_nulls variant uses the low-order bit set to 1 to signal an end-of-list marker.

      This allows storing many different end markers, so that some lockless RCU
      algorithms (used in the TCP/UDP stack, for example) can save some memory barriers in
      fast paths.

      Two new files are added:

      include/linux/list_nulls.h
        - mimics the hlist part of include/linux/list.h, adapted to the hlist_nulls variant

      include/linux/rculist_nulls.h
        - mimics the hlist part of include/linux/rculist.h, adapted to the hlist_nulls variant

      Only four helpers are declared for the moment:

           hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
           hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()

      prefetch()es were removed, since the end of a list is no longer a NULL value;
      they could trigger useless (and possibly dangerous) memory transactions.

      Example of use (extracted from __udp4_lib_lookup()):
      
      	struct sock *sk, *result;
              struct hlist_nulls_node *node;
              unsigned short hnum = ntohs(dport);
              unsigned int hash = udp_hashfn(net, hnum);
              struct udp_hslot *hslot = &udptable->hash[hash];
              int score, badness;
      
              rcu_read_lock();
      begin:
              result = NULL;
              badness = -1;
              sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                      score = compute_score(sk, net, saddr, hnum, sport,
                                            daddr, dport, dif);
                      if (score > badness) {
                              result = sk;
                              badness = score;
                      }
              }
              /*
               * if the nulls value we got at the end of this lookup is
               * not the expected one, we must restart lookup.
               * We probably met an item that was moved to another chain.
               */
              if (get_nulls_value(node) != hash)
                      goto begin;
      
              if (result) {
                      if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
                              result = NULL;
                      else if (unlikely(compute_score(result, net, saddr, hnum, sport,
                                        daddr, dport, dif) < badness)) {
                              sock_put(result);
                              goto begin;
                      }
              }
              rcu_read_unlock();
              return result;
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bbaffaca
    • TPROXY: implemented IP_RECVORIGDSTADDR socket option · e8b2dfe9
      Committed by Balazs Scheidler
      In case UDP traffic is redirected to a local UDP socket,
      the originally addressed destination address/port
      cannot be recovered with the in-kernel tproxy.
      
      This patch adds an IP_RECVORIGDSTADDR sockopt that enables an
      IP_ORIGDSTADDR ancillary message in recvmsg(). This
      ancillary message contains the original destination address/port
      of the packet being received.
      Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8b2dfe9
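
      A hedged userspace sketch of consuming the new ancillary data (the
      fallback option values and the function name are assumptions for
      illustration; error handling is trimmed):

      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <arpa/inet.h>

      #ifndef IP_ORIGDSTADDR               /* assumed values, cf. linux/in.h */
      #define IP_ORIGDSTADDR     20
      #define IP_RECVORIGDSTADDR IP_ORIGDSTADDR
      #endif

      /* Sketch: enable IP_RECVORIGDSTADDR on a tproxy-redirected UDP socket
       * and pull the original destination out of the ancillary data. */
      static void recv_with_orig_dst(int fd)
      {
          char data[2048], cbuf[CMSG_SPACE(sizeof(struct sockaddr_in))];
          struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
          struct msghdr msg = {
              .msg_iov = &iov, .msg_iovlen = 1,
              .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
          };
          struct cmsghdr *cmsg;
          int on = 1;

          setsockopt(fd, IPPROTO_IP, IP_RECVORIGDSTADDR, &on, sizeof(on));

          if (recvmsg(fd, &msg, 0) < 0)
              return;

          for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
              if (cmsg->cmsg_level == IPPROTO_IP && cmsg->cmsg_type == IP_ORIGDSTADDR) {
                  struct sockaddr_in orig;

                  memcpy(&orig, CMSG_DATA(cmsg), sizeof(orig));
                  printf("original dst %s:%u\n",
                         inet_ntoa(orig.sin_addr), ntohs(orig.sin_port));
              }
          }
      }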
    • phylib: make mdio-gpio work without OF (v4) · f004f3ea
      Committed by Paulius Zaleckas
      Make mdio-gpio work with non-OpenFirmware gpio implementations.

      Additional changes to mdio-gpio:
      - use gpio_request() and gpio_free()
      - place irq[] array in struct mdio_gpio_info
      - add module description, author and license
      - add note about compiling this driver as module
      - rename the mdc and mdio functions (they had ugly names)
      - change MII to MDIO in bus name
      - add __init __exit to module (un)loading functions
      - probe fails if no phys are added to the bus
      - kzalloc bitbang with sizeof(*bitbang)
      
      Changes since v3:
      - keep bus naming "%x" to be compatible with existing drivers.
      
      Changes since v2:
      - further #ifdef reduction
      - platform driver will be registered on OF platforms also
      - unified platform and OF bus_id to phy%i

      Changes since v1:
      - removed NO_IRQ
      - reduced #ifdefs
      
      Laurent, please test this driver under OF.
      Signed-off-by: Paulius Zaleckas <paulius.zaleckas@teltonika.lt>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f004f3ea
  8. 16 Nov 2008 (2 commits)
    • Fix inotify watch removal/umount races · 8f7b0ba1
      Committed by Al Viro
      Inotify watch removals suck violently.
      
      To kick the watch out we need (in this order) inode->inotify_mutex and
      ih->mutex.  That's fine if we have a hold on inode; however, for all
      other cases we need to make damn sure we don't race with umount.  We can
      *NOT* just grab a reference to a watch - inotify_unmount_inodes() will
      happily sail past it and we'll end with reference to inode potentially
      outliving its superblock.
      
      Ideally we just want to grab an active reference to superblock if we
      can; that will make sure we won't go into inotify_umount_inodes() until
      we are done.  Cleanup is just deactivate_super().
      
      However, that leaves a messy case - what if we *are* racing with
      umount() and active references to superblock can't be acquired anymore?
      We can bump ->s_count, grab ->s_umount, which will almost certainly wait
      until the superblock is shut down and the watch in question is pining
      for fjords.  That's fine, but there is a problem - we might have hit the
      window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e.
      the moment when superblock is past the point of no return and is heading
      for shutdown) and the moment when deactivate_super() acquires
      ->s_umount.
      
      We could just do drop_super() yield() and retry, but that's rather
      antisocial and this stuff is luser-triggerable.  OTOH, having grabbed
      ->s_umount and having found that we'd got there first (i.e.  that
      ->s_root is non-NULL) we know that we won't race with
      inotify_umount_inodes().
      
      So we could grab a reference to watch and do the rest as above, just
      with drop_super() instead of deactivate_super(), right? Wrong.  We had
      to drop ih->mutex before we could grab ->s_umount.  So the watch
      could've been gone already.
      
      That still can be dealt with - we need to save watch->wd, do idr_find()
      and compare its result with our pointer.  If they match, we either have
      the damn thing still alive or we'd lost not one but two races at once,
      the watch had been killed and a new one got created with the same ->wd
      at the same address.  That couldn't have happened in inotify_destroy(),
      but inotify_rm_wd() could run into that.  Still, "new one got created"
      is not a problem - we have every right to kill it or leave it alone,
      whatever's more convenient.
      
      So we can use idr_find(...) == watch && watch->inode->i_sb == sb as
      "grab it and kill it" check.  If it's been our original watch, we are
      fine, if it's a newcomer - nevermind, just pretend that we'd won the
      race and kill the fscker anyway; we are safe since we know that its
      superblock won't be going away.
      
      And yes, this is far beyond mere "not very pretty"; so's the entire
      concept of inotify to start with.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Greg KH <greg@kroah.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f7b0ba1
    • Add 'pr_fmt()' format modifier to pr_xyz macros. · d091c2f5
      Committed by Martin Schwidefsky
      A common reason for device drivers to implement their own printk macros
      is the lack of a printk prefix with the standard pr_xyz macros.
      Introduce a pr_fmt() macro that is applied for every pr_xyz macro to the
      format string.
      
      The most common use of the pr_fmt macro would be to add the name of the
      device driver to all pr_xyz messages in a source file.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d091c2f5
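
      A hedged sketch of typical use in a driver source file ("mydriver" and
      the init/exit functions are placeholders); the define has to precede
      the first kernel header include so every pr_*() call in the file picks
      up the prefix:

      /* Must come before the first include so the pr_xyz macros see it. */
      #define pr_fmt(fmt) "mydriver: " fmt

      #include <linux/kernel.h>
      #include <linux/init.h>
      #include <linux/module.h>

      static int __init mydriver_init(void)
      {
          pr_info("loaded\n");          /* prints "mydriver: loaded" */
          return 0;
      }

      static void __exit mydriver_exit(void)
      {
          pr_info("unloaded\n");        /* prints "mydriver: unloaded" */
      }

      module_init(mydriver_init);
      module_exit(mydriver_exit);
      MODULE_LICENSE("GPL");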
  9. 14 Nov 2008 (3 commits)
  10. 13 Nov 2008 (4 commits)
  11. 12 Nov 2008 (1 commit)