1. 18 11月, 2010 2 次提交
    • T
      rtnetlink: Link address family API · f8ff182c
      Thomas Graf 提交于
      Each net_device contains address family specific data such as
      per device settings and statistics. We already expose this data
      via procfs/sysfs and partially netlink.
      
      The netlink method requires the requester to send one RTM_GETLINK
      request for each address family it wishes to receive data of
      and then merge this data itself.
      
      This patch implements a new API which combines all address family
      specific link data in a new netlink attribute IFLA_AF_SPEC.
      IFLA_AF_SPEC contains a sequence of nested attributes, one for each
      address family which in turn defines the structure of its own
      attribute. Example:
      
         [IFLA_AF_SPEC] = {
             [AF_INET] = {
                 [IFLA_INET_CONF] = ...,
             },
             [AF_INET6] = {
                 [IFLA_INET6_FLAGS] = ...,
                 [IFLA_INET6_CONF] = ...,
             }
         }
      
      The API also allows for address families to implement a function
      which parses the IFLA_AF_SPEC attribute sent by userspace to
      implement address family specific link options.
      Signed-off-by: NThomas Graf <tgraf@infradead.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8ff182c
    • E
      netfilter: allow hooks to pass error code back up the stack · da683650
      Eric Paris 提交于
      SELinux would like to pass certain fatal errors back up the stack.  This patch
      implements the generic netfilter support for this functionality.
      Based-on-patch-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da683650
  2. 17 11月, 2010 4 次提交
    • E
      net: reorder struct sock fields · b178bb3d
      Eric Dumazet 提交于
      Right now, fields in struct sock are not optimally ordered, because each
      path (RX softirq, TX completion, RX user,  TX user) has to touch fields
      that are contained in many different cache lines.
      
      The really critical thing is to shrink number of cache lines that are
      used at RX softirq time : CPU handling softirqs for a device can receive
      many frames per second for many sockets. If load is too big, we can drop
      frames at NIC level. RPS or multiqueue cards can help, but better reduce
      latency if possible.
      
      This patch starts with UDP protocol, then additional patches will try to
      reduce latencies of other ones as well.
      
      At RX softirq time, fields of interest for UDP protocol are :
      (not counting ones in inet struct for the lookup)
      
      Read/Written:
      sk_refcnt   (atomic increment/decrement)
      sk_rmem_alloc & sk_backlog.len (to check if there is room in queues)
      sk_receive_queue
      sk_backlog (if socket locked by user program)
      sk_rxhash
      sk_forward_alloc
      sk_drops
      
      Read only:
      sk_rcvbuf (sk_rcvqueues_full())
      sk_filter
      sk_wq
      sk_policy[0]
      sk_flags
      
      Additional notes :
      
      - sk_backlog has one hole on 64bit arches. We can fill it to save 8
      bytes.
      - sk_backlog is used only if RX sofirq handler finds the socket while
      locked by user.
      - sk_rxhash is written only once per flow.
      - sk_drops is written only if queues are full
      
      Final layout :
      
      [1] One section grouping all read/write fields, but placing rxhash and
      sk_backlog at the end of this section.
      
      [2] One section grouping all read fields in RX handler
         (sk_filter, sk_rcv_buf, sk_wq)
      
      [3] Section used by other paths
      
      I'll post a patch on its own to put sk_refcnt at the end of struct
      sock_common so that it shares same cache line than section [1]
      
      New offsets on 64bit arch :
      
      sizeof(struct sock)=0x268
      offsetof(struct sock, sk_refcnt)  =0x10
      offsetof(struct sock, sk_lock)    =0x48
      offsetof(struct sock, sk_receive_queue)=0x68
      offsetof(struct sock, sk_backlog)=0x80
      offsetof(struct sock, sk_rmem_alloc)=0x80
      offsetof(struct sock, sk_forward_alloc)=0x98
      offsetof(struct sock, sk_rxhash)=0x9c
      offsetof(struct sock, sk_rcvbuf)=0xa4
      offsetof(struct sock, sk_drops) =0xa0
      offsetof(struct sock, sk_filter)=0xa8
      offsetof(struct sock, sk_wq)=0xb0
      offsetof(struct sock, sk_policy)=0xd0
      offsetof(struct sock, sk_flags) =0xe0
      
      Instead of :
      
      sizeof(struct sock)=0x270
      offsetof(struct sock, sk_refcnt)  =0x10
      offsetof(struct sock, sk_lock)    =0x50
      offsetof(struct sock, sk_receive_queue)=0xc0
      offsetof(struct sock, sk_backlog)=0x70
      offsetof(struct sock, sk_rmem_alloc)=0xac
      offsetof(struct sock, sk_forward_alloc)=0x10c
      offsetof(struct sock, sk_rxhash)=0x128
      offsetof(struct sock, sk_rcvbuf)=0x4c
      offsetof(struct sock, sk_drops) =0x16c
      offsetof(struct sock, sk_filter)=0x198
      offsetof(struct sock, sk_wq)=0x88
      offsetof(struct sock, sk_policy)=0x98
      offsetof(struct sock, sk_flags) =0x130
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b178bb3d
    • E
      udp: use atomic_inc_not_zero_hint · c31504dc
      Eric Dumazet 提交于
      UDP sockets refcount is usually 2, unless an incoming frame is going to
      be queued in receive or backlog queue.
      
      Using atomic_inc_not_zero_hint() permits to reduce latency, because
      processor issues less memory transactions.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c31504dc
    • E
      macvlan: lockless tx path · 8ffab51b
      Eric Dumazet 提交于
      macvlan is a stacked device, like tunnels. We should use the lockless
      mechanism we are using in tunnels and loopback.
      
      This patch completely removes locking in TX path.
      
      tx stat counters are added into existing percpu stat structure, renamed
      from rx_stats to pcpu_stats.
      
      Note : this reverts commit 2c114553 (macvlan: add multiqueue
      capability)
      
      Note : rx_errors converted to a 32bit counter, like tx_dropped, since
      they dont need 64bit range.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Acked-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ffab51b
    • J
      netlink: let nlmsg and nla functions take pointer-to-const args · 3654654f
      Jan Engelhardt 提交于
      The changed functions do not modify the NL messages and/or attributes
      at all. They should use const (similar to strchr), so that callers
      which have a const nlmsg/nlattr around can make use of them without
      casting.
      
      While at it, constify a data array.
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3654654f
  3. 16 11月, 2010 12 次提交
  4. 15 11月, 2010 1 次提交
    • G
      dccp ccid-2: Schedule Sync as out-of-band mechanism · d83447f0
      Gerrit Renker 提交于
      The problem with Ack Vectors is that
        i) their length is variable and can in principle grow quite large,
       ii) it is hard to predict exactly how large they will be.
      
      Due to the second point it seems not a good idea to reduce the MPS; in
      particular when on average there is enough room for the Ack Vector and an
      increase in length is momentarily due to some burst loss, after which the
      Ack Vector returns to its normal/average length.
      
      The solution taken by this patch is to subtract a minimum-expected Ack Vector
      length from the MPS, and to defer any larger Ack Vectors onto a separate
      Sync - but only if indeed there is no space left on the skb.
      
      This patch provides the infrastructure to schedule Sync-packets for transporting
      (urgent) out-of-band data. Its signalling is quicker than scheduling an Ack, since
      it does not need to wait for new application data.
      Signed-off-by: NGerrit Renker <gerrit@erg.abdn.ac.uk>
      d83447f0
  5. 13 11月, 2010 2 次提交
  6. 12 11月, 2010 13 次提交
    • A
      backlight: add low threshold to pwm backlight · fef7764f
      Arun Murthy 提交于
      The intensity of the backlight can be varied from a range of
      max_brightness to zero.  Though most, if not all the pwm based backlight
      devices start flickering at lower brightness value.  And also for each
      device there exists a brightness value below which the backlight appears
      to be turned off though the value is not equal to zero.
      
      If the range of brightness for a device is from zero to max_brightness.  A
      graph is plotted for brightness Vs intensity for the pwm based backlight
      device has to be a linear graph.
      
      intensity
      	  |   /
      	  |  /
      	  | /
      	  |/
      	  ---------
      	 0	max_brightness
      
      But pratically on measuring the above we note that the intensity of
      backlight goes to zero(OFF) when the value in not zero almost nearing to
      zero(some x%).  so the graph looks like
      
      intensity
      	  |    /
      	  |   /
      	  |  /
      	  |  |
      	  ------------
      	 0   x	 max_brightness
      
      In order to overcome this drawback knowing this x% i.e nothing but the low
      threshold beyond which the backlight is off and will have no effect, the
      brightness value is being offset by the low threshold value(retaining the
      linearity of the graph).  Now the graph becomes
      
      intensity
      	  |     /
      	  |    /
      	  |   /
      	  |  /
      	  -------------
      	   0	  max_brightness
      
      With this for each and every digit increment in the brightness from zero
      there is a change in the intensity of backlight.  Devices having this
      behaviour can set the low threshold brightness(lth_brightness) and pass
      the same as platform data else can have it as zero.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NArun Murthy <arun.murthy@stericsson.com>
      Acked-by: NLinus Walleij <linus.walleij@stericsson.com>
      Acked-by: NRichard Purdie <rpurdie@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fef7764f
    • S
      leds: driver for National Semiconductors LP5523 chip · 0efba16c
      Samu Onkalo 提交于
      LP5523 chip is nine channel led driver with programmable engines.  Driver
      provides support for that chip for direct access via led class or via
      programmable engines.
      Signed-off-by: NSamu Onkalo <samu.p.onkalo@nokia.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Jean Delvare <khali@linux-fr.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0efba16c
    • S
      leds: driver for National Semiconductor LP5521 chip · 500fe141
      Samu Onkalo 提交于
      This patchset provides support for LP5521 and LP5523 LED driver chips from
      National Semicondutor.  Both drivers supports programmable engines and
      naturally LED class features.
      
      Documentation is provided as a part of the patchset.  I created "leds"
      subdirectory under Documentation.  Perhaps the rest of the leds*
      documentation should be moved there.
      
      Datasheets are freely available at National Semiconductor www pages.
      
      This patch:
      
      LP5521 chip is three channel led driver with programmable engines.  Driver
      provides support for that chip for direct access via led class or via
      programmable engines.
      Signed-off-by: NSamu Onkalo <samu.p.onkalo@nokia.com>
      Cc: Richard Purdie <rpurdie@rpsys.net>
      Cc: Jean Delvare <khali@linux-fr.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      500fe141
    • J
      led-class: always implement blinking · 5ada28bf
      Johannes Berg 提交于
      Currently, blinking LEDs can be awkward because it is not guaranteed that
      all LEDs implement blinking.  The trigger that wants it to blink then
      needs to implement its own timer solution.
      
      Rather than require that, add led_blink_set() API that triggers can use.
      This function will attempt to use hw blinking, but if that fails
      implements a timer for it.  To stop blinking again, brightness_set() also
      needs to be wrapped into API that will stop the software blink.
      
      As a result of this, the timer trigger becomes a very trivial one, and
      hopefully we can finally see triggers using blinking as well because it's
      always easy to use.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Acked-by: NRichard Purdie <rpurdie@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ada28bf
    • N
      radix-tree: fix RCU bug · 27d20fdd
      Nick Piggin 提交于
      Salman Qazi describes the following radix-tree bug:
      
      In the following case, we get can get a deadlock:
      
      0.  The radix tree contains two items, one has the index 0.
      1.  The reader (in this case find_get_pages) takes the rcu_read_lock.
      2.  The reader acquires slot(s) for item(s) including the index 0 item.
      3.  The non-zero index item is deleted, and as a consequence the other item is
          moved to the root of the tree. The place where it used to be is queued for
          deletion after the readers finish.
      3b. The zero item is deleted, removing it from the direct slot, it remains in
          the rcu-delayed indirect node.
      4.  The reader looks at the index 0 slot, and finds that the page has 0 ref
          count
      5.  The reader looks at it again, hoping that the item will either be freed or
          the ref count will increase. This never happens, as the slot it is looking
          at will never be updated. Also, this slot can never be reclaimed because
          the reader is holding rcu_read_lock and is in an infinite loop.
      
      The fix is to re-use the same "indirect" pointer case that requires a slot
      lookup retry into a general "retry the lookup" bit.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Reported-by: NSalman Qazi <sqazi@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d20fdd
    • D
      Restrict unprivileged access to kernel syslog · eaf06b24
      Dan Rosenberg 提交于
      The kernel syslog contains debugging information that is often useful
      during exploitation of other vulnerabilities, such as kernel heap
      addresses.  Rather than futilely attempt to sanitize hundreds (or
      thousands) of printk statements and simultaneously cripple useful
      debugging functionality, it is far simpler to create an option that
      prevents unprivileged users from reading the syslog.
      
      This patch, loosely based on grsecurity's GRKERNSEC_DMESG, creates the
      dmesg_restrict sysctl.  When set to "0", the default, no restrictions are
      enforced.  When set to "1", only users with CAP_SYS_ADMIN can read the
      kernel syslog via dmesg(8) or other mechanisms.
      
      [akpm@linux-foundation.org: explain the config option in kernel.txt]
      Signed-off-by: NDan Rosenberg <drosenberg@vsecurity.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NEugene Teo <eugeneteo@kernel.org>
      Acked-by: NKees Cook <kees.cook@canonical.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eaf06b24
    • C
      include/linux/highmem.h needs hardirq.h · 43b3a0c7
      Catalin Marinas 提交于
      Commit 3e4d3af5 ("mm: stack based kmap_atomic()") introduced the
      kmap_atomic_idx_push() function which warns on in_irq() with
      CONFIG_DEBUG_HIGHMEM enabled.  This patch includes linux/hardirq.h for
      the in_irq definition.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43b3a0c7
    • E
      atomic: add atomic_inc_not_zero_hint() · 3f9d35b9
      Eric Dumazet 提交于
      Followup of perf tools session in Netfilter WorkShop 2010
      
      In the network stack we make high usage of atomic_inc_not_zero() in
      contexts we know the probable value of atomic before increment (2 for udp
      sockets for example)
      
      Using a special version of atomic_inc_not_zero() giving this hint can help
      processor to use less bus transactions.
      
      On x86 (MESI protocol) for example, this avoids entering Shared state,
      because "lock cmpxchg" issues an RFO (Read For Ownership)
      
      akpm: Adds a new include/linux/atomic.h.  This means that new code should
      henceforth include linux/atomic.h and not asm/atomic.h.  The presence of
      include/linux/atomic.h will in fact cause checkpatch.pl to warn about use
      of asm/atomic.h.  The new include/linux/atomic.h becomes the place where
      arch-neutral atomic_t code should be placed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Reviewed-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f9d35b9
    • J
      include/linux/resource.h needs types.h · 8705a1ba
      Jean Delvare 提交于
      Fix the following warning:
      usr/include/linux/resource.h:49: found __[us]{8,16,32,64} type without #include <linux/types.h>
      Signed-off-by: NJean Delvare <khali@linux-fr.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8705a1ba
    • E
      netfilter: NF_HOOK_COND has wrong conditional · ac5aa2e3
      Eric Paris 提交于
      The NF_HOOK_COND returns 0 when it shouldn't due to what I believe to be an
      error in the code as the order of operations is not what was intended.  C will
      evalutate == before =.  Which means ret is getting set to the bool result,
      rather than the return value of the function call.  The code says
      
      if (ret = function() == 1)
      when it meant to say:
      if ((ret = function()) == 1)
      
      Normally the compiler would warn, but it doesn't notice it because its
      a actually complex conditional and so the wrong code is wrapped in an explict
      set of () [exactly what the compiler wants you to do if this was intentional].
      Fixing this means that errors when netfilter denies a packet get propagated
      back up the stack rather than lost.
      
      Problem introduced by commit 2249065f (netfilter: get rid of the grossness
      in netfilter.h).
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      ac5aa2e3
    • D
      ipv4: Make rt->fl.iif tests lest obscure. · c7537967
      David S. Miller 提交于
      When we test rt->fl.iif against zero, we're seeing if it's
      an output or an input route.
      
      Make that explicit with some helper functions.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7537967
    • E
      net: get rid of rtable->idev · 72cdd1d9
      Eric Dumazet 提交于
      It seems idev field in struct rtable has no special purpose, but adding
      extra atomic ops.
      
      We hold refcounts on the device itself (using percpu data, so pretty
      cheap in current kernel).
      
      infiniband case is solved using dst.dev instead of idev->dev
      
      Removal of this field means routing without route cache is now using
      shared data, percpu data, and only potential contention is a pair of
      atomic ops on struct neighbour per forwarded packet.
      
      About 5% speedup on routing test.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Roland Dreier <rolandd@cisco.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      72cdd1d9
    • E
      neigh: reorder struct neighbour · 46b13fc5
      Eric Dumazet 提交于
      It is important to move nud_state outside of the often modified cache
      line (because of refcnt), to reduce false sharing in neigh_event_send()
      
      This is a followup of commit 0ed8ddf4 (neigh: Protect neigh->ha[]
      with a seqlock)
      
      This gives a 7% speedup on routing test with IP route cache disabled.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46b13fc5
  7. 11 11月, 2010 3 次提交
    • J
      block: remove unused copy_io_context() · cedb4a7d
      Jens Axboe 提交于
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      cedb4a7d
    • S
      perf_events: Fix time tracking in samples · eed01528
      Stephane Eranian 提交于
      This patch corrects time tracking in samples. Without this patch
      both time_enabled and time_running are bogus when user asks for
      PERF_SAMPLE_READ.
      
      One uses PERF_SAMPLE_READ to sample the values of other counters
      in each sample. Because of multiplexing, it is necessary to know
      both time_enabled, time_running to be able to scale counts correctly.
      
      In this second version of the patch, we maintain a shadow
      copy of ctx->time which allows us to compute ctx->time without
      calling update_context_time() from NMI context. We avoid the
      issue that update_context_time() must always be called with
      ctx->lock held.
      
      We do not keep shadow copies of the other event timings
      because if the lead event is overflowing then it is active
      and thus it's been scheduled in via event_sched_in() in
      which case neither tstamp_stopped, tstamp_running can be modified.
      
      This timing logic only applies to samples when PERF_SAMPLE_READ
      is used.
      
      Note that this patch does not address timing issues related
      to sampling inheritance between tasks. This will be addressed
      in a future patch.
      
      With this patch, the libpfm4 example task_smpl now reports
      correct counts (shown on 2.4GHz Core 2):
      
      $ task_smpl -p 2400000000 -e unhalted_core_cycles:u,instructions_retired:u,baclears  noploop 5
      noploop for 5 seconds
      IIP:0x000000004006d6 PID:5596 TID:5596 TIME:466,210,211,430 STREAM_ID:33 PERIOD:2,400,000,000 ENA=1,010,157,814 RUN=1,010,157,814 NR=3
      	2,400,000,254 unhalted_core_cycles:u (33)
      	2,399,273,744 instructions_retired:u (34)
      	53,340 baclears (35)
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4cc6e14b.1e07e30a.256e.5190@mx.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      eed01528
    • E
      net: avoid limits overflow · 8d987e5c
      Eric Dumazet 提交于
      Robin Holt tried to boot a 16TB machine and found some limits were
      reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]
      
      We can switch infrastructure to use long "instead" of "int", now
      atomic_long_t primitives are available for free.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Reported-by: NRobin Holt <holt@sgi.com>
      Reviewed-by: NRobin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d987e5c
  8. 10 11月, 2010 3 次提交