1. 13 10月, 2009 2 次提交
    • A
      net: Introduce recvmmsg socket syscall · a2e27255
      Arnaldo Carvalho de Melo 提交于
      Meaning receive multiple messages, reducing the number of syscalls and
      net stack entry/exit operations.
      
      Next patches will introduce mechanisms where protocols that want to
      optimize this operation will provide an unlocked_recvmsg operation.
      
      This takes into account comments made by:
      
      . Paul Moore: sock_recvmsg is called only for the first datagram,
        sock_recvmsg_nosec is used for the rest.
      
      . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
        works in the same fashion as the ppoll one.
      
        If the underlying protocol returns a datagram with MSG_OOB set, this
        will make recvmmsg return right away with as many datagrams (+ the OOB
        one) it has received so far.
      
      . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
        datagrams and then recvmsg returns an error, recvmmsg will return
        the successfully received datagrams, store the error and return it
        in the next call.
      
      This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
      where we will be able to acquire the lock only at batch start and end, not at
      every underlying recvmsg call.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2e27255
    • N
      net: Generalize socket rx gap / receive queue overflow cmsg · 3b885787
      Neil Horman 提交于
      Create a new socket level option to report number of queue overflows
      
      Recently I augmented the AF_PACKET protocol to report the number of frames lost
      on the socket receive queue between any two enqueued frames.  This value was
      exported via a SOL_PACKET level cmsg.  AFter I completed that work it was
      requested that this feature be generalized so that any datagram oriented socket
      could make use of this option.  As such I've created this patch, It creates a
      new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
      SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
      overflowed between any two given frames.  It also augments the AF_PACKET
      protocol to take advantage of this new feature (as it previously did not touch
      sk->sk_drops, which this patch uses to record the overflow count).  Tested
      successfully by me.
      
      Notes:
      
      1) Unlike my previous patch, this patch simply records the sk_drops value, which
      is not a number of drops between packets, but rather a total number of drops.
      Deltas must be computed in user space.
      
      2) While this patch currently works with datagram oriented protocols, it will
      also be accepted by non-datagram oriented protocols. I'm not sure if thats
      agreeable to everyone, but my argument in favor of doing so is that, for those
      protocols which aren't applicable to this option, sk_drops will always be zero,
      and reporting no drops on a receive queue that isn't used for those
      non-participating protocols seems reasonable to me.  This also saves us having
      to code in a per-protocol opt in mechanism.
      
      3) This applies cleanly to net-next assuming that commit
      97775007 (my af packet cmsg patch) is reverted
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b885787
  2. 12 10月, 2009 1 次提交
  3. 09 10月, 2009 1 次提交
    • J
      ipv6: Fix the size overflow of addrconf_sysctl array · e0e6f55d
      Jin Dongming 提交于
      (This patch fixes bug of commit f7734fdf
       title "make TLLAO option for NA packets configurable")
      
      When the IPV6 conf is used, the function sysctl_set_parent is called and the
      array addrconf_sysctl is used as a parameter of the function.
      
      The above patch added new conf "force_tllao" into the array addrconf_sysctl,
      but the size of the array was not modified, the static allocated size is
      DEVCONF_MAX + 1 but the real size is DEVCONF_MAX + 2, so the problem is
      that the function sysctl_set_parent accessed wrong address.
      
      I got the following information.
      Call Trace:
          [<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
          [<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
          [<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
          [<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
          [<ffffffff8106085d>] sysctl_set_parent+0x29/0x3e
          [<ffffffff810622d5>] __register_sysctl_paths+0xde/0x272
          [<ffffffff8110892d>] ? __kmalloc_track_caller+0x16e/0x180
          [<ffffffffa00cfac3>] ? __addrconf_sysctl_register+0xc5/0x144 [ipv6]
          [<ffffffff8141f2c9>] register_net_sysctl_table+0x48/0x4b
          [<ffffffffa00cfaf5>] __addrconf_sysctl_register+0xf7/0x144 [ipv6]
          [<ffffffffa00cfc16>] addrconf_init_net+0xd4/0x104 [ipv6]
          [<ffffffff8139195f>] setup_net+0x35/0x82
          [<ffffffff81391f6c>] copy_net_ns+0x76/0xe0
          [<ffffffff8107ad60>] create_new_namespaces+0xf0/0x16e
          [<ffffffff8107afee>] copy_namespaces+0x65/0x9f
          [<ffffffff81056dff>] copy_process+0xb2c/0x12c3
          [<ffffffff810576e1>] do_fork+0x14b/0x2d2
          [<ffffffff8107ac4e>] ? up_read+0xe/0x10
          [<ffffffff81438e73>] ? do_page_fault+0x27a/0x2aa
          [<ffffffff8101044b>] sys_clone+0x28/0x2a
          [<ffffffff81011fb3>] stub_clone+0x13/0x20
          [<ffffffff81011c72>] ? system_call_fastpath+0x16/0x1b
      
      And the information of IPV6 in .config is as following.
      IPV6 in .config:
          CONFIG_IPV6=m
          CONFIG_IPV6_PRIVACY=y
          CONFIG_IPV6_ROUTER_PREF=y
          CONFIG_IPV6_ROUTE_INFO=y
          CONFIG_IPV6_OPTIMISTIC_DAD=y
          CONFIG_IPV6_MIP6=m
          CONFIG_IPV6_SIT=m
          # CONFIG_IPV6_SIT_6RD is not set
          CONFIG_IPV6_NDISC_NODETYPE=y
          CONFIG_IPV6_TUNNEL=m
          CONFIG_IPV6_MULTIPLE_TABLES=y
          CONFIG_IPV6_SUBTREES=y
          CONFIG_IPV6_MROUTE=y
          CONFIG_IPV6_PIMSM_V2=y
          # CONFIG_IP_VS_IPV6 is not set
          CONFIG_NF_CONNTRACK_IPV6=m
          CONFIG_IP6_NF_MATCH_IPV6HEADER=m
      
      I confirmed this patch fixes this problem.
      Signed-off-by: NJin Dongming <jin.dongming@np.css.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e0e6f55d
  4. 08 10月, 2009 7 次提交
  5. 07 10月, 2009 8 次提交
    • S
      dcb: data center bridging ops should be r/o · 32953543
      Stephen Hemminger 提交于
      The data center bridging ops structure can be const
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Acked-by: NPeter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32953543
    • S
      net: mark net_proto_ops as const · ec1b4cf7
      Stephen Hemminger 提交于
      All usages of structure net_proto_ops should be declared const.
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec1b4cf7
    • O
      make TLLAO option for NA packets configurable · f7734fdf
      Octavian Purdila 提交于
      On Friday 02 October 2009 20:53:51 you wrote:
      
      > This is good although I would have shortened the name.
      
      Ah, I knew I forgot something :) Here is v4.
      
      tavi
      
      >From 24d96d825b9fa832b22878cc6c990d5711968734 Mon Sep 17 00:00:00 2001
      From: Octavian Purdila <opurdila@ixiacom.com>
      Date: Fri, 2 Oct 2009 00:51:15 +0300
      Subject: [PATCH] ipv6: new sysctl for sending TLLAO with unicast NAs
      
      Neighbor advertisements responding to unicast neighbor solicitations
      did not include the target link-layer address option. This patch adds
      a new sysctl option (disabled by default) which controls whether this
      option should be sent even with unicast NAs.
      
      The need for this arose because certain routers expect the TLLAO in
      some situations even as a response to unicast NS packets.
      
      Moreover, RFC 2461 recommends sending this to avoid a race condition
      (section 4.4, Target link-layer address)
      Signed-off-by: NCosmin Ratiu <cratiu@ixiacom.com>
      Signed-off-by: NOctavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7734fdf
    • B
      ethtool: Add reset operation · d73d3a8c
      Ben Hutchings 提交于
      After updating firmware stored in flash, users may wish to reset the
      relevant hardware and start the new firmware immediately.  This should
      not be completely automatic as it may be disruptive.
      
      A selective reset may also be useful for debugging or diagnostics.
      
      This adds a separate reset operation which takes flags indicating the
      components to be reset.  Drivers are allowed to reset only a subset of
      those requested, and must indicate the actual subset.  This allows the
      use of generic component masks and some future expansion.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d73d3a8c
    • E
      pkt_sched: gen_estimator: Dont report fake rate estimators · d250a5f9
      Eric Dumazet 提交于
      Jarek Poplawski a écrit :
      >
      >
      > Hmm... So you made me to do some "real" work here, and guess what?:
      > there is one serious checkpatch warning! ;-) Plus, this new parameter
      > should be added to the function description. Otherwise:
      > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      >
      > Thanks,
      > Jarek P.
      >
      > PS: I guess full "Don't" would show we really mean it...
      
      Okay :) Here is the last round, before the night !
      
      Thanks again
      
      [RFC] pkt_sched: gen_estimator: Don't report fake rate estimators
      
      We currently send TCA_STATS_RATE_EST elements to netlink users, even if no estimator
      is running.
      
      # tc -s -d qdisc
      qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
       Sent 112833764978 bytes 1495081739 pkt (dropped 0, overlimits 0 requeues 0)
       rate 0bit 0pps backlog 0b 0p requeues 0
      
      User has no way to tell if the "rate 0bit 0pps" is a real estimation, or a fake
      one (because no estimator is active)
      
      After this patch, tc command output is :
      $ tc -s -d qdisc
      qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
       Sent 561075 bytes 1196 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
      
      We add a parameter to gnet_stats_copy_rate_est() function so that
      it can use gen_estimator_active(bstats, r), as suggested by Jarek.
      
      This parameter can be NULL if check is not necessary, (htb for
      example has a mandatory rate estimator)
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d250a5f9
    • Y
      ipv6 sit: 6rd (IPv6 Rapid Deployment) Support. · fa857afc
      YOSHIFUJI Hideaki / 吉藤英明 提交于
      IPv6 Rapid Deployment (6rd; draft-ietf-softwire-ipv6-6rd) builds upon
      mechanisms of 6to4 (RFC3056) to enable a service provider to rapidly
      deploy IPv6 unicast service to IPv4 sites to which it provides
      customer premise equipment.  Like 6to4, it utilizes stateless IPv6 in
      IPv4 encapsulation in order to transit IPv4-only network
      infrastructure.  Unlike 6to4, a 6rd service provider uses an IPv6
      prefix of its own in place of the fixed 6to4 prefix.
      
      With this option enabled, the SIT driver offers 6rd functionality by
      providing additional ioctl API to configure the IPv6 Prefix for in
      stead of static 2002::/16 for 6to4.
      
      Original patch was done by Alexandre Cassen <acassen@freebox.fr>
      based on old Internet-Draft.
      Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa857afc
    • I
      add vif using local interface index instead of IP · ee5e81f0
      Ilia K 提交于
      When routing daemon wants to enable forwarding of multicast traffic it
      performs something like:
      
             struct vifctl vc = {
                     .vifc_vifi  = 1,
                     .vifc_flags = 0,
                     .vifc_threshold = 1,
                     .vifc_rate_limit = 0,
                     .vifc_lcl_addr = ip, /* <--- ip address of physical
      interface, e.g. eth0 */
                     .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
               };
             setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
      
      This leads (in the kernel) to calling  vif_add() function call which
      search the (physical) device using assigned IP address:
             dev = ip_dev_find(net, vifc->vifc_lcl_addr.s_addr);
      
      The current API (struct vifctl) does not allow to specify an
      interface other way than using it's IP, and if there are more than a
      single interface with specified IP only the first one will be found.
      
      The attached patch (against 2.6.30.4) allows to specify an interface
      by its index, instead of IP address:
      
             struct vifctl vc = {
                     .vifc_vifi  = 1,
                     .vifc_flags = VIFF_USE_IFINDEX,   /* NEW */
                     .vifc_threshold = 1,
                     .vifc_rate_limit = 0,
                     .vifc_lcl_ifindex = if_nametoindex("eth0"),   /* NEW */
                     .vifc_rmt_addr.s_addr = htonl(INADDR_ANY),
               };
             setsockopt(fd, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc));
      Signed-off-by: NIlia K. <mail4ilia@gmail.com>
      
      === modified file 'include/linux/mroute.h'
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee5e81f0
    • E
      net: speedup sk_wake_async() · bcdce719
      Eric Dumazet 提交于
      An incoming datagram must bring into cpu cache *lot* of cache lines,
      in particular : (other parts omitted (hash chains, ip route cache...))
      
      On 32bit arches :
      
      offsetof(struct sock, sk_rcvbuf)       =0x30    (read)
      offsetof(struct sock, sk_lock)         =0x34   (rw)
      
      offsetof(struct sock, sk_sleep)        =0x50 (read)
      offsetof(struct sock, sk_rmem_alloc)   =0x64   (rw)
      offsetof(struct sock, sk_receive_queue)=0x74   (rw)
      
      offsetof(struct sock, sk_forward_alloc)=0x98   (rw)
      
      offsetof(struct sock, sk_callback_lock)=0xcc    (rw)
      offsetof(struct sock, sk_drops)        =0xd8 (read if we add dropcount support, rw if frame dropped)
      offsetof(struct sock, sk_filter)       =0xf8    (read)
      
      offsetof(struct sock, sk_socket)       =0x138 (read)
      
      offsetof(struct sock, sk_data_ready)   =0x15c   (read)
      
      
      We can avoid sk->sk_socket and socket->fasync_list referencing on sockets
      with no fasync() structures. (socket->fasync_list ptr is probably already in cache
      because it shares a cache line with socket->wait, ie location pointed by sk->sk_sleep)
      
      This avoids one cache line load per incoming packet for common cases (no fasync())
      
      We can leave (or even move in a future patch) sk->sk_socket in a cold location
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bcdce719
  6. 05 10月, 2009 9 次提交
    • J
      net: introduce NETDEV_POST_INIT notifier · 7ffbe3fd
      Johannes Berg 提交于
      For various purposes including a wireless extensions
      bugfix, we need to hook into the netdev creation before
      before netdev_register_kobject(). This will also ease
      doing the dev type assignment that Marcel was working
      on for cfg80211 drivers w/o touching them all.
      Signed-off-by: NJohannes Berg <johannes@sipsolutions.net>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ffbe3fd
    • M
      usbnet: Use wwan%d interface name for mobile broadband devices · e1e499ee
      Marcel Holtmann 提交于
      Add support for usbnet based devices like CDC-Ether to indicate that they
      are actually mobile broadband devices. In that case use wwan%d as default
      interface name.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1e499ee
    • B
      net: Support inclusion of <linux/socket.h> before <sys/socket.h> · 9c501935
      Ben Hutchings 提交于
      The following user-space program fails to compile:
      
          #include <linux/socket.h>
          #include <sys/socket.h>
          int main() { return 0; }
      
      The reason is that <linux/socket.h> tests __GLIBC__ to decide whether it
      should define various structures and macros that are now defined for
      user-space by <sys/socket.h>, but __GLIBC__ is not defined if no libc
      headers have yet been included.
      
      It seems safe to drop support for libc 5 now.
      Signed-off-by: NBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: NBastian Blank <waldi@debian.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9c501935
    • E
      tunnels: Optimize tx path · 0bfbedb1
      Eric Dumazet 提交于
      We currently dirty a cache line to update tunnel device stats
      (tx_packets/tx_bytes). We better use the txq->tx_bytes/tx_packets
      counters that already are present in cpu cache, in the cache
      line shared with txq->_xmit_lock
      
      This patch extends IPTUNNEL_XMIT() macro to use txq pointer
      provided by the caller.
      
      Also &tunnel->dev->stats can be replaced by &dev->stats
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0bfbedb1
    • S
      ipv4: fib table algorithm performance improvement · 16c6cf8b
      Stephen Hemminger 提交于
      The FIB algorithim for IPV4 is set at compile time, but kernel goes through
      the overhead of function call indirection at runtime. Save some
      cycles by turning the indirect calls to direct calls to either
      hash or trie code.
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16c6cf8b
    • N
      af_packet: add interframe drop cmsg (v6) · 97775007
      Neil Horman 提交于
      Add Ancilliary data to better represent loss information
      
      I've had a few requests recently to provide more detail regarding frame loss
      during an AF_PACKET packet capture session.  Specifically the requestors want to
      see where in a packet sequence frames were lost, i.e. they want to see that 40
      frames were lost between frames 302 and 303 in a packet capture file.  In order
      to do this we need:
      
      1) The kernel to export this data to user space
      2) The applications to make use of it
      
      This patch addresses item (1).  It does this by doing the following:
      
      A) Anytime we drop a frame for which we would increment po->stats.tp_drops, we
      also no increment a stats called po->stats.tp_gap.
      
      B) Every time we successfully enqueue a frame to sk_receive_queue, we record the
      value of po->stats.tp_gap in skb->mark.  skb->cb would nominally be the place to
      record this, but since all the space there is used up, we're overloading
      skb->mark.  Its safe to do since any enqueued packet is guaranteed to be
      unshared at this point, and skb->mark isn't used for anything else in the rx
      path to the application.  After we record tp_gap in the skb, we zero
      po->stats.tp_gap.  This allows us to keep a counter of the number of frames lost
      between any two enqueued packets
      
      C) When the application goes to dequeue a frame from the packet socket, we look
      at skb->mark for that frame.  If it is non-zero, we add a cmsg chunk to the
      msghdr of level SOL_PACKET and type PACKET_GAPDATA.  Its a 32 bit integer that
      represents the number of frames lost between this packet and the last previous
      frame received.
      
      Note there is a chance that if there is frame loss after a receive, and then the
      socket is closed, some gap data might be lost.  This is covered by the use of
      the PACKET_AUXDATA socket option, which gives total loss data.  With a bit of
      math, the final gap can be determined that way.
      
      I've tested this patch myself, and it works well.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      
       include/linux/if_packet.h |    2 ++
       net/packet/af_packet.c    |   33 +++++++++++++++++++++++++++++++++
       2 files changed, 35 insertions(+)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97775007
    • B
      ethtool: Remove support for obsolete string query operations · a9828ec6
      Ben Hutchings 提交于
      The in-tree implementations have all been converted to
      get_sset_count().
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9828ec6
    • A
      a99bbaf5
    • J
      Revert "Seperate read and write statistics of in_flight requests" · 0f78ab98
      Jens Axboe 提交于
      This reverts commit a9327cac.
      
      Corrado Zoccolo <czoccolo@gmail.com> reports:
      
      "with 2.6.32-rc1 I started getting the following strange output from
      "iostat -kx 2":
      Linux 2.6.31bisect (et2) 	04/10/2009 	_i686_	(2 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                10,70    0,00    3,16   15,75    0,00   70,38
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda              18,22     0,00    0,67    0,01    14,77     0,02
      43,94     0,01   10,53 39043915,03 2629219,87
      sdb              60,89     9,68   50,79    3,04  1724,43    50,52
      65,95     0,70   13,06 488437,47 2629219,87
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 2,72    0,00    0,74    0,00    0,00   96,53
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 6,68    0,00    0,99    0,00    0,00   92,33
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 4,40    0,00    0,73    1,47    0,00   93,40
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
      avgrq-sz avgqu-sz   await  svctm  %util
      sda               0,00     0,00    0,00    0,00     0,00     0,00
      0,00     0,00    0,00   0,00 100,00
      sdb               0,00     4,00    0,00    3,00     0,00    28,00
      18,67     0,06   19,50 333,33 100,00
      
      Global values for service time and utilization are garbage. For
      interval values, utilization is always 100%, and service time is
      higher than normal.
      
      I bisected it down to:
      [a9327cac] Seperate read and write
      statistics of in_flight requests
      and verified that reverting just that commit indeed solves the issue
      on 2.6.32-rc1."
      
      So until this is debugged, revert the bad commit.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      0f78ab98
  7. 04 10月, 2009 1 次提交
  8. 03 10月, 2009 4 次提交
  9. 02 10月, 2009 7 次提交
    • K
      memcg: some modification to softlimit under hierarchical memory reclaim. · 4e649152
      KAMEZAWA Hiroyuki 提交于
      This patch clean up/fixes for memcg's uncharge soft limit path.
      
      Problems:
        Now, res_counter_charge()/uncharge() handles softlimit information at
        charge/uncharge and softlimit-check is done when event counter per memcg
        goes over limit. Now, event counter per memcg is updated only when
        memory usage is over soft limit. Here, considering hierarchical memcg
        management, ancesotors should be taken care of.
      
        Now, ancerstors(hierarchy) are handled in charge() but not in uncharge().
        This is not good.
      
        Prolems:
        1. memcg's event counter incremented only when softlimit hits. That's bad.
           It makes event counter hard to be reused for other purpose.
      
        2. At uncharge, only the lowest level rescounter is handled. This is bug.
           Because ancesotor's event counter is not incremented, children should
           take care of them.
      
        3. res_counter_uncharge()'s 3rd argument is NULL in most case.
           ops under res_counter->lock should be small. No "if" sentense is better.
      
      Fixes:
        * Removed soft_limit_xx poitner and checks in charge and uncharge.
          Do-check-only-when-necessary scheme works enough well without them.
      
        * make event-counter of memcg incremented at every charge/uncharge.
          (per-cpu area will be accessed soon anyway)
      
        * All ancestors are checked at soft-limit-check. This is necessary because
          ancesotor's event counter may never be modified. Then, they should be
          checked at the same time.
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e649152
    • M
      asm-generic/gpio.h: pull in linux/kernel.h for might_sleep() · b3db4a8a
      Mike Frysinger 提交于
      The asm-generic/gpio.h header uses the might_sleep() macro but doesn't
      include the header for it, so any source code that might include
      linux/gpio.h before linux/kernel.h can easily lead to a build failure.
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3db4a8a
    • A
      const: constify remaining file_operations · 828c0950
      Alexey Dobriyan 提交于
      [akpm@linux-foundation.org: fix KVM]
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Acked-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      828c0950
    • J
      Add a tracepoint for block request remapping · b0da3f0d
      Jun'ichi Nomura 提交于
      Since 2.6.31 now has request-based device-mapper, it's useful to have
      a tracepoint for request-remapping as well as bio-remapping.
      This patch adds a tracepoint for request-remapping, trace_block_rq_remap().
      Signed-off-by: NKiyoshi Ueda <k-ueda@ct.jp.nec.com>
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b0da3f0d
    • C
      block: allow large discard requests · 67efc925
      Christoph Hellwig 提交于
      Currently we set the bio size to the byte equivalent of the blocks to
      be trimmed when submitting the initial DISCARD ioctl.  That means it
      is subject to the max_hw_sectors limitation of the HBA which is
      much lower than the size of a DISCARD request we can support.
      Add a separate max_discard_sectors tunable to limit the size for discard
      requests.
      
      We limit the max discard request size in bytes to 32bit as that is the
      limit for bio->bi_size.  This could be much larger if we had a way to pass
      that information through the block layer.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      67efc925
    • C
      block: use normal I/O path for discard requests · c15227de
      Christoph Hellwig 提交于
      prepare_discard_fn() was being called in a place where memory allocation
      was effectively impossible.  This makes it inappropriate for all but
      the most trivial translations of Linux's DISCARD operation to the block
      command set.  Additionally adding a payload there makes the ownership
      of the bio backing unclear as it's now allocated by the device driver
      and not the submitter as usual.
      
      It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether
      the queue supports discard operations or not.  blkdev_issue_discard now
      allocates a one-page, sector-length payload which is the right thing
      for the common ATA and SCSI implementations.
      
      The mtd implementation of prepare_discard_fn() is replaced with simply
      checking for the request being a discard.
      
      Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx>
      which did the prepare_discard_fn but not the different payload allocation
      yet.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      c15227de
    • Z
      Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs · 48c0d4d4
      Zdenek Kabelac 提交于
      Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs
      introduced in commit 1d54ad6d.
      Release kobject also in case the request_fn is NULL.
      
      Problem was noticed via kmemleak backtrace when some sysfs entries were
      note properly destroyed during  device removal:
      
      unreferenced object 0xffff88001aa76640 (size 80):
        comm "lvcreate", pid 2120, jiffies 4294885144
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff  .........e......
          90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff  .f........S.....
        backtrace:
          [<ffffffff813f9cc6>] kmemleak_alloc+0x26/0x60
          [<ffffffff8111d693>] kmem_cache_alloc+0x133/0x1c0
          [<ffffffff81195891>] sysfs_new_dirent+0x41/0x120
          [<ffffffff81194b0c>] sysfs_add_file_mode+0x3c/0xb0
          [<ffffffff81197c81>] internal_create_group+0xc1/0x1a0
          [<ffffffff81197d93>] sysfs_create_group+0x13/0x20
          [<ffffffff810d8004>] blk_trace_init_sysfs+0x14/0x20
          [<ffffffff8123f45c>] blk_register_queue+0x3c/0xf0
          [<ffffffff812447e4>] add_disk+0x94/0x160
          [<ffffffffa00d8b08>] dm_create+0x598/0x6e0 [dm_mod]
          [<ffffffffa00de951>] dev_create+0x51/0x350 [dm_mod]
          [<ffffffffa00de823>] ctl_ioctl+0x1a3/0x240 [dm_mod]
          [<ffffffffa00de8f2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod]
          [<ffffffff81177bfd>] compat_sys_ioctl+0xcd/0x4f0
          [<ffffffff81036ed8>] sysenter_dispatch+0x7/0x2c
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: NZdenek Kabelac <zkabelac@redhat.com>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      48c0d4d4