1. 08 12月, 2008 4 次提交
    • G
      dccp ccid-2: Phase out the use of boolean Ack Vector sysctl · 6fdd34d4
      Gerrit Renker 提交于
      This removes the use of the sysctl and the minisock variable for the Send Ack
      Vector feature, as it now is handled fully dynamically via feature negotiation
      (i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
       RFC 4341, 4.).
      
      Using a sysctl in parallel to this implementation would open the door to
      crashes, since much of the code relies on tests of the boolean minisock /
      sysctl variable. Thus, this patch replaces all tests of type
      
      	if (dccp_msk(sk)->dccpms_send_ack_vector)
      		/* ... */
      with
      	if (dp->dccps_hc_rx_ackvec != NULL)
      		/* ... */
      
      The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
      negotiation concluded that Ack Vectors are to be used on the half-connection.
      Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
      so that the test is a valid one.
      
      The activation handler for Ack Vectors is called as soon as the feature
      negotiation has concluded at the
       * server when the Ack marking the transition RESPOND => OPEN arrives;
       * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.
      
      Adding the sequence number of the Response packet to the Ack Vector has been
      removed, since
       (a) connection establishment implies that the Response has been received;
       (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
           this entry will always be ignored;
       (c) it can not be used for anything useful - to detect loss for instance, only
           packets received after the loss can serve as pseudo-dupacks.
      
      There was a FIXME to change the error code when dccp_ackvec_add() fails.
      I removed this after finding out that:
       * the check whether ackno < ISN is already made earlier,
       * this Response is likely the 1st packet with an Ackno that the client gets,
       * so when dccp_ackvec_add() fails, the reason is likely not a packet error.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6fdd34d4
    • G
      dccp: Remove manual influence on NDP Count feature · 4098dce5
      Gerrit Renker 提交于
      Updating the NDP count feature is handled automatically now:
       * for CCID-2 it is disabled, since the code does not use NDP counts;
       * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.
      
      Allowing the user to change NDP values leads to unpredictable and failing
      behaviour, since it is then possible to disable NDP counts even when they
      are needed (e.g. in CCID-3).
      
      This means that only those user settings are sensible that agree with the
      values for Send NDP Count implied by the choice of CCID. But those settings
      are already activated by the feature negotiation (CCID dependency tracking),
      hence this form of support is redundant.
      
      At startup the initialisation of the NDP count feature uses the default
      value of 0, which is done implicitly by the zeroing-out of the socket when
      it is allocated. If the choice of CCID or feature negotiation enables NDP
      count, this will then be updated via the NDP activation handler.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4098dce5
    • G
      dccp: Remove obsolete parts of the old CCID interface · 0049bab5
      Gerrit Renker 提交于
      The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector
      case, their value equals initially that of the sysctl, but at the end of
      feature negotiation may be something different.
      
      The old interface removed by this patch thus has been replaced by the newer
      interface to dynamically query the currently loaded CCIDs.
      
      Also removed are the constructors for the TX CCID and the RX CCID, since the
      switch "rx <-> non-rx" is done by the handler in minisocks.c (and the handler
      is the only place in the code where CCIDs are loaded).
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: NIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0049bab5
    • W
      netdevice: Kill netdev->priv · b74ca3a8
      Wang Chen 提交于
      This is the last shoot of this series.
      After I removing all directly reference of netdev->priv, I am killing
      "priv" of "struct net_device" and fixing relative comments/docs.
      
      Anyone will not be allowed to reference netdev->priv directly.
      If you want to reference the memory of private data, use netdev_priv()
      instead.
      If the private data is not allocted when alloc_netdev(), use
      netdev->ml_priv to point that memory after you creating that private
      data.
      Signed-off-by: NWang Chen <wangchen@cn.fujitsu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b74ca3a8
  2. 05 12月, 2008 2 次提交
  3. 04 12月, 2008 2 次提交
    • D
      atm: 32-bit ioctl compatibility · 8865c418
      David Woodhouse 提交于
      We lack compat ioctl support through most of the ATM code. This patch
      deals with most of it, and I can now at least use BR2684 and PPPoATM
      with 32-bit userspace.
      
      I haven't added a .compat_ioctl method to struct atm_ioctl, because
      AFAICT none of the current users need any conversion -- so we can just
      call the ->ioctl() method in every case. I looked at br2684, clip, lec,
      mpc, pppoatm and atmtcp.
      
      In svc_compat_ioctl() the only mangling which is needed is to change
      COMPAT_ATM_ADDPARTY to ATM_ADDPARTY. Although it's defined as
      	_IOW('a', ATMIOC_SPECIAL+4,struct atm_iobuf)
      it doesn't actually _take_ a struct atm_iobuf as an argument -- it takes
      a struct sockaddr_atmsvc, which _is_ the same between 32-bit and 64-bit
      code, so doesn't need conversion.
      
      Almost all of vcc_ioctl() would have been identical, so I converted that
      into a core do_vcc_ioctl() function with an 'int compat' argument.
      
      I've done the same with atm_dev_ioctl(), where there _are_ a few
      differences, but still it's relatively contained and there would
      otherwise have been a lot of duplication.
      
      I haven't done any of the actual device-specific ioctls, although I've
      added a compat_ioctl method to struct atmdev_ops.
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8865c418
    • O
      can: Fix CAN_(EFF|RTR)_FLAG handling in can_filter · d253eee2
      Oliver Hartkopp 提交于
      Due to a wrong safety check in af_can.c it was not possible to filter
      for SFF frames with a specific CAN identifier without getting the
      same selected CAN identifier from a received EFF frame also.
      
      This fix has a minimum (but user visible) impact on the CAN filter
      API and therefore the CAN version is set to a new date.
      
      Indeed the 'old' API is still working as-is. But when now setting
      CAN_(EFF|RTR)_FLAG in can_filter.can_mask you might get less traffic
      than before - but still the stuff that you expected to get for your
      defined filter ...
      
      Thanks to Kurt Van Dijck for pointing at this issue and for the review.
      Signed-off-by: NOliver Hartkopp <oliver@hartkopp.net>
      Acked-by: NKurt Van Dijck <kurt.van.dijck@eia.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d253eee2
  4. 03 12月, 2008 2 次提交
  5. 02 12月, 2008 3 次提交
    • M
      lib/idr.c: fix rcu related race with idr_find · 6ff2d39b
      Manfred Spraul 提交于
      2nd part of the fixes needed for
      http://bugzilla.kernel.org/show_bug.cgi?id=11796.
      
      When the idr tree is either grown or shrunk, then the update to the number
      of layers and the top pointer were not atomic.  This race caused crashes.
      
      The attached patch fixes that by replicating the layers counter in each
      layer, thus idr_find doesn't need idp->layers anymore.
      Signed-off-by: NManfred Spraul <manfred@colorfullife.com>
      Cc: Clement Calmels <cboulte@gmail.com>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Cc: Pierre Peiffer <peifferp@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ff2d39b
    • D
      epoll: introduce resource usage limits · 7ef9964e
      Davide Libenzi 提交于
      It has been thought that the per-user file descriptors limit would also
      limit the resources that a normal user can request via the epoll
      interface.  Vegard Nossum reported a very simple program (a modified
      version attached) that can make a normal user to request a pretty large
      amount of kernel memory, well within the its maximum number of fds.  To
      solve such problem, default limits are now imposed, and /proc based
      configuration has been introduced.  A new directory has been created,
      named /proc/sys/fs/epoll/ and inside there, there are two configuration
      points:
      
        max_user_instances = Maximum number of devices - per user
      
        max_user_watches   = Maximum number of "watched" fds - per user
      
      The current default for "max_user_watches" limits the memory used by epoll
      to store "watches", to 1/32 of the amount of the low RAM.  As example, a
      256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
      That should be enough to not break existing heavy epoll users.  The
      default value for "max_user_instances" is set to 128, that should be
      enough too.
      
      This also changes the userspace, because a new error code can now come out
      from EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from epoll_create() was already
      listed, so that should be ok.
      
      [akpm@linux-foundation.org: use get_current_user()]
      Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <stable@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reported-by: NVegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ef9964e
    • T
      libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQ · ac70a964
      Tejun Heo 提交于
      Some recent Seagate harddrives have firmware bug which causes FLUSH
      CACHE to timeout under certain circumstances if NCQ is being used.
      This can be worked around by disabling NCQ and fixed by updating the
      firmware.  Implement ATA_HORKAGE_FIRMWARE_UPDATE and blacklist these
      devices.
      
      The wiki page has been updated to contain information on this issue.
      
        http://ata.wiki.kernel.org/index.php/Known_issuesSigned-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJeff Garzik <jgarzik@redhat.com>
      ac70a964
  6. 01 12月, 2008 3 次提交
  7. 29 11月, 2008 2 次提交
    • J
      mlx4_core: Save/restore default port IB capability mask · 9a5aa622
      Jack Morgenstein 提交于
      Commit 7ff93f8b ("mlx4_core: Multiple port type support") introduced
      support for different port types.  As part of that support, SET_PORT
      is invoked to set the port type during driver startup.  However, as a
      side-effect, for IB ports the invocation of this command also sets the
      port's capability mask to zero (losing the default value set by FW).
      
      To fix this, get the default ib port capabilities (via a MAD_IFC Port
      Info query) during driver startup, and save them for use in the
      mlx4_SET_PORT command when setting the port-type to Infiniband.
      
      This patch fixes problems with subnet manager (SM) failover such as
      <https://bugs.openfabrics.org/show_bug.cgi?id=1183>, which occurred
      because the IsTrapSupported bit in the capability mask was zeroed.
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NRoland Dreier <rolandd@cisco.com>
      9a5aa622
    • G
      phy: power management support · 0f0ca340
      Giuseppe Cavallaro 提交于
      This patch adds the power management support into the physical
      abstraction layer.
      
      Suspend and resume functions respectively turns on/off the bit 11
      into the PHY Basic mode control register.
      Generic PHY device starts supporting PM.
      
      In order to support the wake-on LAN and avoid to put in power down
      the PHY device, the MDIO is aware of what the Ethernet device wants to do.
      
      Voluntary, no CONFIG_PM defines were added into the sources.
      Also generic suspend/resume functions are exported to allow
      other drivers use them (such as genphy_config_aneg etc.).
      
      Within the phy_driver_register function, we need to remove the
      memset. It overrides the device driver owner and it is not good.
      Signed-off-by: NGiuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f0ca340
  8. 28 11月, 2008 1 次提交
    • R
      Allow architectures to override copy_user_highpage() · 487ff320
      Russell King 提交于
      With aliasing VIPT cache support, the ARM implementation of
      clear_user_page() and copy_user_page() sets up a temporary kernel space
      mapping such that we have the same cache colour as the userspace page.
      This avoids having to consider any userspace aliases from this operation.
      
      However, when highmem is enabled, kmap_atomic() have to setup mappings.
      The copy_user_highpage() and clear_user_highpage() call these functions
      before delegating the copies to copy_user_page() and clear_user_page().
      
      The effect of this is that each of the *_user_highpage() functions setup
      their own kmap mapping, followed by the *_user_page() functions setting
      up another mapping.  This is rather wasteful.
      
      Thankfully, copy_user_highpage() can be overriden by architectures by
      defining __HAVE_ARCH_COPY_USER_HIGHPAGE.  However, replacement of
      clear_user_highpage() is more difficult because its inline definition
      is not conditional.  It seems that you're expected to define
      __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE and provide a replacement
      __alloc_zeroed_user_highpage() implementation instead.
      
      The allocation itself is fine, so we don't want to override that.  What
      we really want to do is to override clear_user_highpage() with our own
      version which doesn't kmap_atomic() unnecessarily.
      
      Other VIPT architectures (PARISC and SH) would also like to override
      this function as well.
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NJames Bottomley <James.Bottomley@HansenPartnership.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      487ff320
  9. 26 11月, 2008 4 次提交
  10. 25 11月, 2008 5 次提交
    • J
      DCB: fix kconfig option · 7a6b6f51
      Jeff Kirsher 提交于
      Since the netlink option for DCB is necessary to actually be useful,
      simplified the Kconfig option.  In addition, added useful help text for the
      Kconfig option.
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a6b6f51
    • S
      netdev: add HAVE_NET_DEVICE_OPS · 47fd5b83
      Stephen Hemminger 提交于
      As a concession to vendors who have to deal with one source for different
      kernel versions, add a HAVE_NET_DEVICE_OPS so they don't end up hard
      coding ifdef against kernel version.
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47fd5b83
    • I
      111cc8b9
    • I
      tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Ilpo Järvinen 提交于
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      Then we're in problems when cleanup work for them has to be done
      when a large cumulative ACK comes. Try to return back to pre-split
      state already while more and more SACK info gets discovered by
      combining newly discovered SACK areas with the previous skb if
      that's SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT
      2) Write queue has less skbs to process (affect everything
         which has to walk in the queue past the sacked areas)
      3) Write queue is consistent whole the time, so no other parts
         of TCP has to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) Clean_rtx_queue can release most of the pages using single
         put_page instead of previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too which allows construction of skbs
      that are even larger than what tso split them to and it handles
      hole per on every nth patterns that often occur during slow start
      overshoot pretty nicely. Though this to be really useful also
      a retransmission would have to get lost since cumulative ACKs
      advance one hole at a time in the most typical case.
      
      TODO: handle upwards only merging. That should be rather easy
      when segment is fully sacked but I'm leaving that as future
      work item (it won't make very large difference anyway since
      this current approach already covers quite a lot of normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be that necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous is due to reordering => having the timestamp
           of the last segment there is just skewing things more off
           than does some good since the ack got triggered by one of
           the holes (besides some substle issues that would make
           determining right hole/skb even harder problem). Anyway,
           it has nothing to do with this change then.
      
      I choose to route some abnormal looking cases with goto noop,
      some could be handled differently (eg., by stopping the
      walking at that skb but again). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as whole) could cause some macroscale
      regression (global) because of cache misses that are taken over
      the round-trip time but it gets very likely better because of much
      less (local) cache misses per other write queue walkers and the
      big recovery clearing cumulative ack.
      
      Worth to note that these benefits would be very easy to get also
      without TSO/GSO being on as long as the data is in pages so that
      we can merge them. Currently I won't let that happen because
      DSACK splitting at fragment that would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
      avoided, we have some conditions that can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that considerable amount of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest is considering future work instead since I got
      repeatably EFAULT to tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing previous SACK block, less allocs for new skbs
         are done, basically a new alloc is needed only when new hole
         is detected and when the previous skb runs out of frags space
      
      ...which now only happens of if reclaim is fast enough to dispose
      the clone before the SACK block comes in (the window is RTT long),
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      832d11c5
    • J
      netfilter: xtables: add missing const qualifier to xt_tgchk_param · f79fca55
      Jan Engelhardt 提交于
      When entryinfo was a standalone parameter to functions, it used to be
      "const void *". Put the const back in.
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f79fca55
  11. 24 11月, 2008 2 次提交
    • E
      eth: Declare an optimized compare_ether_addr_64bits() function · 1f87e235
      Eric Dumazet 提交于
      Linus mentioned we could try to perform long word operations, even
      on potentially unaligned addresses, on x86 at least. David mentioned
      the HAVE_EFFICIENT_UNALIGNED_ACCESS test to handle this on all
      arches that have efficient unailgned accesses.
      
      I tried this idea and got nice assembly on 32 bits:
      
      158:   33 82 38 01 00 00       xor    0x138(%edx),%eax
      15e:   33 8a 34 01 00 00       xor    0x134(%edx),%ecx
      164:   c1 e0 10                shl    $0x10,%eax
      167:   09 c1                   or     %eax,%ecx
      169:   74 0b                   je     176 <eth_type_trans+0x87>
      
      And very nice assembly on 64 bits of course (one xor, one shl)
      
      Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 %,
      expected since we remove 8 instructions on a fast path.
      
      This patch implements a compare_ether_addr_64bits() function, that
      uses the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS ifdef to efficiently
      perform the 6 bytes comparison on all capable arches.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f87e235
    • G
      dccp: Set per-connection CCIDs via socket options · b20a9c24
      Gerrit Renker 提交于
      With this patch, TX/RX CCIDs can now be changed on a per-connection
      basis, which overrides the defaults set by the global sysctl variables
      for TX/RX CCIDs.
      
      To make full use of this facility, the remaining patches of this patch
      set are needed, which track dependencies and activate negotiated
      feature values.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b20a9c24
  12. 23 11月, 2008 1 次提交
  13. 22 11月, 2008 1 次提交
  14. 21 11月, 2008 8 次提交