提交 · fe1811438a2a229e93beaefa69481d72652795e5 · openeuler / raspberrypi-kernel

08 10月, 2013 1 次提交

cfg80211: fix nl80211.h documentation for DFS enum states · fe181143

由 Luis R. Rodriguez 提交于 10月 07, 2013

The names are prefixed incorrectly on the documentation.
Signed-off-by: NLuis R. Rodriguez <mcgrof@do-not-panic.com>
[also remove spurious blank line]
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

fe181143

11 9月, 2013 1 次提交

fs: bump inode and dentry counters to long · 3942c07c

由 Glauber Costa 提交于 8月 28, 2013

This series reworks our current object cache shrinking infrastructure in
two main ways:

 * Noticing that a lot of users copy and paste their own version of LRU
   lists for objects, we put some effort in providing a generic version.
   It is modeled after the filesystem users: dentries, inodes, and xfs
   (for various tasks), but we expect that other users could benefit in
   the near future with little or no modification.  Let us know if you
   have any issues.

 * The underlying list_lru being proposed automatically and
   transparently keeps the elements in per-node lists, and is able to
   manipulate the node lists individually.  Given this infrastructure, we
   are able to modify the up-to-now hammer called shrink_slab to proceed
   with node-reclaim instead of always searching memory from all over like
   it has been doing.

Per-node lru lists are also expected to lead to less contention in the lru
locks on multi-node scans, since we are now no longer fighting for a
global lock.  The locks usually disappear from the profilers with this
change.

Although we have no official benchmarks for this version - be our guest to
independently evaluate this - earlier versions of this series were
performance tested (details at
http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
visible performance regressions while yielding a better qualitative
behavior in NUMA machines.

With this infrastructure in place, we can use the list_lru entry point to
provide memcg isolation and per-memcg targeted reclaim.  Historically,
those two pieces of work have been posted together.  This version presents
only the infrastructure work, deferring the memcg work for a later time,
so we can focus on getting this part tested.  You can see more about the
history of such work at http://lwn.net/Articles/552769/

Dave Chinner (18):
  dcache: convert dentry_stat.nr_unused to per-cpu counters
  dentry: move to per-sb LRU locks
  dcache: remove dentries from LRU before putting on dispose list
  mm: new shrinker API
  shrinker: convert superblock shrinkers to new API
  list: add a new LRU list type
  inode: convert inode lru list to generic lru list code.
  dcache: convert to use new lru list infrastructure
  list_lru: per-node list infrastructure
  shrinker: add node awareness
  fs: convert inode and dentry shrinking to be node aware
  xfs: convert buftarg LRU to generic code
  xfs: rework buffer dispose list tracking
  xfs: convert dquot cache lru to list_lru
  fs: convert fs shrinkers to new scan/count API
  drivers: convert shrinkers to new count/scan API
  shrinker: convert remaining shrinkers to count/scan API
  shrinker: Kill old ->shrink API.

Glauber Costa (7):
  fs: bump inode and dentry counters to long
  super: fix calculation of shrinkable objects for small numbers
  list_lru: per-node API
  vmscan: per-node deferred work
  i915: bail out earlier when shrinker cannot acquire mutex
  hugepage: convert huge zero page shrinker to new shrinker API
  list_lru: dynamically adjust node arrays

This patch:

There are situations in very large machines in which we can have a large
quantity of dirty inodes, unused dentries, etc.  This is particularly true
when umounting a filesystem, where eventually since every live object will
eventually be discarded.

Dave Chinner reported a problem with this while experimenting with the
shrinker revamp patchset.  So we believe it is time for a change.  This
patch just moves int to longs.  Machines where it matters should have a
big long anyway.
Signed-off-by: NGlauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3942c07c

09 9月, 2013 3 次提交

cifs: Move and expand MAX_SERVER_SIZE definition · cdf1246f

由 Scott Lovenberg 提交于 8月 09, 2013

MAX_SERVER_SIZE has been moved to cifs_mount.h and renamed
CIFS_NI_MAXHOST for clarity.  It has been expanded to 1024 as the
previous value of 16 was very short.
Signed-off-by: NScott Lovenberg <scott.lovenberg@gmail.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

cdf1246f

cifs: Expand max share name length to 256 · 54fcf270

由 Scott Lovenberg 提交于 8月 09, 2013

The old max share name length limit was 80 due to Windows NET SHARE
command not allowing more than that.  However, share names can be much
longer.  This is a more reasonable maximum share name length.
Signed-off-by: NScott Lovenberg <scott.lovenberg@gmail.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

54fcf270

cifs: Move string length definitions to uapi · 8c3a2b4c

由 Scott Lovenberg 提交于 8月 09, 2013

The max string length definitions for user name, domain name, password,
and share name have been moved into their own header file in uapi so the
mount helper can use autoconf to define them instead of keeping the
kernel side and userland side definitions in sync manually. The names
have also been standardized with a "CIFS" prefix and "LEN" suffix.
Signed-off-by: NScott Lovenberg <scott.lovenberg@gmail.com>
Reviewed-by: NChen Gang <gang.chen@asianux.com>
Signed-off-by: NSteve French <smfrench@gmail.com>

8c3a2b4c

08 9月, 2013 2 次提交

Input: evdev - add EVIOCREVOKE ioctl · c7dc6573

由 David Herrmann 提交于 9月 07, 2013

If we have multiple sessions on a system, we normally don't want
background sessions to read input events. Otherwise, it could capture
passwords and more entered by the user on the foreground session. This is
a real world problem as the recent XMir development showed:
http://mjg59.dreamwidth.org/27327.html

We currently rely on sessions to release input devices when being
deactivated. This relies on trust across sessions. But that's not given on
usual systems. We therefore need a way to control which processes have
access to input devices.

With VTs the kernel simply routed them through the active /dev/ttyX. This
is not possible with evdev devices, though. Moreover, we want to avoid
routing input-devices through some dispatcher-daemon in userspace (which
would add some latency).

This patch introduces EVIOCREVOKE. If called on an evdev fd, this revokes
device-access irrecoverably for that *single* open-file. Hence, once you
call EVIOCREVOKE on any dup()ed fd, all fds for that open-file will be
rather useless now (but still valid compared to close()!). This allows us
to pass fds directly to session-processes from a trusted source. The
source keeps a dup()ed fd and revokes access once the session-process is
no longer active.
Compared to the EVIOCMUTE proposal, we can avoid the CAP_SYS_ADMIN
restriction now as there is no way to revive the fd again. Hence, a user
is free to call EVIOCREVOKE themself to kill the fd.

Additionally, this ioctl allows multi-layer access-control (again compared
to EVIOCMUTE which was limited to one layer via CAP_SYS_ADMIN). A middle
layer can simply request a new open-file from the layer above and pass it
to the layer below. Now each layer can call EVIOCREVOKE on the fds to
revoke access for all layers below, at the expense of one fd per layer.

There's already ongoing experimental user-space work which demonstrates
how it can be used:
http://lists.freedesktop.org/archives/systemd-devel/2013-August/012897.htmlSigned-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>

c7dc6573

Revert "Input: introduce BTN/ABS bits for drums and guitars" · b04c99e3

由 Linus Torvalds 提交于 9月 07, 2013

This reverts commits 61e00655, 73f8645d and 8e22ecb6:
  "Input: introduce BTN/ABS bits for drums and guitars"
  "HID: wiimote: add support for Guitar-Hero drums"
  "HID: wiimote: add support for Guitar-Hero guitars"

The extra new ABS_xx values resulted in ABS_MAX no longer being a
power-of-two, which broke the comparison logic.  It also caused the
ioctl numbers to overflow into the next byte, causing problems for that.

We'll try again for 3.13.
Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: NDavid Herrmann <dh.herrmann@gmail.com>
Acked-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Benjamin Tissoires <benjamin.tissoires@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b04c99e3

06 9月, 2013 1 次提交

dm: add statistics support · fd2ed4d2

由 Mikulas Patocka 提交于 8月 16, 2013

Support the collection of I/O statistics on user-defined regions of
a DM device.  If no regions are defined no statistics are collected so
there isn't any performance impact.  Only bio-based DM devices are
currently supported.

Each user-defined region specifies a starting sector, length and step.
Individual statistics will be collected for each step-sized area within
the range specified.

The I/O statistics counters for each step-sized area of a region are
in the same format as /sys/block/*/stat or /proc/diskstats but extra
counters (12 and 13) are provided: total time spent reading and
writing in milliseconds.  All these counters may be accessed by sending
the @stats_print message to the appropriate DM device via dmsetup.

The creation of DM statistics will allocate memory via kmalloc or
fallback to using vmalloc space.  At most, 1/4 of the overall system
memory may be allocated by DM statistics.  The admin can see how much
memory is used by reading
/sys/module/dm_mod/parameters/stats_current_allocated_bytes

See Documentation/device-mapper/statistics.txt for more details.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NAlasdair G Kergon <agk@redhat.com>

fd2ed4d2

05 9月, 2013 2 次提交

vfio-pci: PCI hot reset interface · 8b27ee60

由 Alex Williamson 提交于 9月 04, 2013

The current VFIO_DEVICE_RESET interface only maps to PCI use cases
where we can isolate the reset to the individual PCI function.  This
means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
transition, device specific reset, or be a singleton device on a bus
for a secondary bus reset.  FLR does not have widespread support,
PM reset is not very reliable, and bus topology is dictated by the
system and device design.  We need to provide a means for a user to
induce a bus reset in cases where the existing mechanisms are not
available or not reliable.

This device specific extension to VFIO provides the user with this
ability.  Two new ioctls are introduced:
 - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
 - VFIO_DEVICE_PCI_HOT_RESET

The first provides the user with information about the extent of
devices affected by a hot reset.  This is essentially a list of
devices and the IOMMU groups they belong to.  The user may then
initiate a hot reset by calling the second ioctl.  We must be
careful that the user has ownership of all the affected devices
found via the first ioctl, so the second ioctl takes a list of file
descriptors for the VFIO groups affected by the reset.  Each group
must have IOMMU protection established for the ioctl to succeed.
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

8b27ee60

net: sync some IP headers with glibc · cfd280c9

由 Carlos O'Donell 提交于 8月 15, 2013

Solution:
=========

- Synchronize linux's `include/uapi/linux/in6.h'
  with glibc's `inet/netinet/in.h'.
- Synchronize glibc's `inet/netinet/in.h with linux's
  `include/uapi/linux/in6.h'.
- Allow including the headers in either other.
- First header included defines the structures and macros.

Details:
========

The kernel promises not to break the UAPI ABI so I don't
see why we can't just have the two userspace headers
coordinate?

If you include the kernel headers first you get those,
and if you include the glibc headers first you get those,
and the following patch arranges a coordination and
synchronization between the two.

Let's handle `include/uapi/linux/in6.h' from linux,
and `inet/netinet/in.h' from glibc and ensure they compile
in any order and preserve the required ABI.

These two patches pass the following compile tests:

cat >> test1.c <<EOF
int main (void) {
  return 0;
}
EOF
gcc -c test1.c

cat >> test2.c <<EOF
int main (void) {
  return 0;
}
EOF
gcc -c test2.c

One wrinkle is that the kernel has a different name for one of
the members in ipv6_mreq. In the kernel patch we create a macro
to cover the uses of the old name, and while that's not entirely
clean it's one of the best solutions (aside from an anonymous
union which has other issues).

I've reviewed the code and it looks to me like the ABI is
assured and everything matches on both sides.

Notes:
- You want netinet/in.h to include bits/in.h as early as possible,
  but it needs in_addr so define in_addr early.
- You want bits/in.h included as early as possible so you can use
  the linux specific code to define __USE_KERNEL_DEFS based on
  the _UAPI_* macro definition and use those to cull in.h.
- glibc was missing IPPROTO_MH, added here.

Compile tested and inspected.
Reported-by: NThomas Backlund <tmb@mageia.org>
Cc: Thomas Backlund <tmb@mageia.org>
Cc: libc-alpha@sourceware.org
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Tested-by: NCong Wang <amwang@redhat.com>
Signed-off-by: NCarlos O'Donell <carlos@redhat.com>
Signed-off-by: NCong Wang <amwang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cfd280c9

04 9月, 2013 5 次提交

Input: introduce BTN/ABS bits for drums and guitars · 61e00655

由 David Herrmann 提交于 8月 26, 2013

There are a bunch of guitar and drums devices out there that all report
similar data. To avoid reporting this as BTN_MISC or ABS_MISC, we
allocate some proper namespace for them. Note that most of these devices
are toys and we cannot report any sophisticated physics via this API.

I did some google-images research and tried to provide definitions that
work with all common devices. That's why I went with 4 toms, 4 cymbals,
one bass, one hi-hat. I haven't seen other drums and I doubt that we need
any additions to that. Anyway, the naming-scheme is intentionally done in
an extensible way.

For guitars, we support 5 frets (normally aligned vertically, compared to
the real horizontal layouts), a single strum-bar with up/down directions,
an optional fret-board and a whammy-bar.

Most of the devices provide pressure values so I went with ABS_* bits. If
we ever support devices which only provide digital input, we have to
decide whether to emulate pressure data or add additional BTN_* bits.

If someone is not familiar with these devices, here are two pictures which
provide almost all introduced interfaces (or try the given keywords
with a google-image search):
Guitar: ("guitar hero world tour guitar")
http://images1.wikia.nocookie.net/__cb20120911023442/applezone/es/images/f/f9/Wii_Guitar.jpg
Drums: ("guitar hero drums")
http://oyster.ignimgs.com/franchises/images/03/55/35526_band-hero-drum-set-hands-on-20090929040735768.jpgSigned-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
Acked-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

61e00655

ICMPv6: treat dest unreachable codes 5 and 6 as EACCES, not EPROTO · 61e76b17

由 Jiri Bohac 提交于 8月 30, 2013

RFC 4443 has defined two additional codes for ICMPv6 type 1 (destination
unreachable) messages:
        5 - Source address failed ingress/egress policy
	6 - Reject route to destination

Now they are treated as protocol error and icmpv6_err_convert() converts them
to EPROTO.

RFC 4443 says:
	"Codes 5 and 6 are more informative subsets of code 1."

Treat codes 5 and 6 as code 1 (EACCES)

Btw, connect() returning -EPROTO confuses firefox, so that fallback to
other/IPv4 addresses does not work:
https://bugzilla.mozilla.org/show_bug.cgi?id=910773Signed-off-by: NJiri Bohac <jbohac@suse.cz>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61e76b17

NVMe: Update nvme_id_power_state with latest spec · 685585c2

由 Keith Busch 提交于 6月 25, 2013

Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>

685585c2

NVMe: Split header file into user-visible and kernel-visible pieces · 42c77683

由 Matthew Wilcox 提交于 6月 25, 2013

To build user programs that call the NVMe ioctls, we need to have a
user header file.  Catch up to the new way of doing that by splitting
the header file into kernel and uapi portions.
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>

42c77683

tilegx: Add tty serial support for TILE-Gx on-chip UART · b5c6c1a7

由 Chris Metcalf 提交于 8月 12, 2013

Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

b5c6c1a7

03 9月, 2013 1 次提交

perf: Add a dummy software event to keep tracking · fa0097ee

由 Adrian Hunter 提交于 8月 31, 2013

When an event is disabled the "tracking" events selected by the 'mmap',
'comm' and 'task' bits of struct perf_event_attr, are also disabled.
However, the information those events provide is necessary to resolve
symbols for when the main event is re-enabled.

The "tracking" events can be kept enabled by putting them on another
event, but that requires an event that otherwise does nothing.  A new
software event PERF_COUNT_SW_DUMMY is added for that purpose.
Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1377975053-3811-2-git-send-email-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

fa0097ee

02 9月, 2013 3 次提交

D
Move the EM_ARM and EM_AARCH64 definitions to uapi/linux/elf-em.h · 909e3ee4
由 Dan Aloni 提交于 8月 28, 2013
```
Signed-off-by: NDan Aloni <alonid@stratoscale.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
```
909e3ee4

perf: Export struct perf_branch_entry to userspace · 274481de

由 Vince Weaver 提交于 8月 23, 2013

If PERF_SAMPLE_BRANCH_STACK is enabled then samples are returned
with the format { u64 from, to, flags } but the flags layout
is not specified.

This field has the type struct perf_branch_entry; move this
definition into include/uapi/linux/perf_event.h so users can
access these fields.

This is similar to the existing inclusion of perf_mem_data_src in
the include/uapi/linux/perf_event.h file.
Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
Acked-by: NStephane Eranian <eranian@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1308231544420.1889@vincent-weaver-1.um.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>

274481de

perf: Add attr->mmap2 attribute to an event · 13d7a241

由 Stephane Eranian 提交于 8月 21, 2013

Adds a new PERF_RECORD_MMAP2 record type which is essence
an expanded version of PERF_RECORD_MMAP.

Used to request mmap records with more information about
the mapping, including device major, minor and the inode
number and generation for mappings associated with files
or shared memory segments. Works for code and data
(with attr->mmap_data set).

Existing PERF_RECORD_MMAP record is unmodified by this patch.
Signed-off-by: NStephane Eranian <eranian@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com
[ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

13d7a241

01 9月, 2013 3 次提交

Btrfs: use __u64 in exported user headers · 68b823ef

由 Mike Frysinger 提交于 8月 16, 2013

Signed-off-by: NMike Frysinger <vapier@gentoo.org>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

68b823ef

btrfs: offline dedupe · 416161db

由 Mark Fasheh 提交于 8月 06, 2013

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.
Signed-off-by: NMark Fasheh <mfasheh@suse.de>
Reviewed-by: NZach Brown <zab@redhat.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

416161db

vxlan: add ipv6 support · e4c7ed41

由 Cong Wang 提交于 8月 31, 2013

This patch adds IPv6 support to vxlan device, as the new version
RFC already mentions it:

   http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-03

Cc: David Stevens <dlstevens@us.ibm.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: NCong Wang <amwang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e4c7ed41

30 8月, 2013 6 次提交

pkt_sched: fq: Fair Queue packet scheduler · afe4fd06

由 Eric Dumazet 提交于 8月 29, 2013

- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
  unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
  to add per socket limitation.

Attempts have been made to add TCP pacing in TCP stack, but this
seems to add complex code to an already complex stack.

TCP pacing is welcomed for flows having idle times, as the cwnd
permits TCP stack to queue a possibly large number of packets.

This removes the 'slow start after idle' choice, hitting badly
large BDP flows, and applications delivering chunks of data
as video streams.

Nicely spaced packets :
Here interface is 10Gbit, but flow bottleneck is ~20Mbit

cwin is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)

15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>

TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right
in time to avoid a big burst.

In slow start or steady state, very few packets are throttled [1]

FQ gets a bunch of tunables as :

  limit : max number of packets on whole Qdisc (default 10000)

  flow_limit : max number of packets per flow (default 100)

  quantum : the credit per RR round (default is 2 MTU)

  initial_quantum : initial credit for new flows (default is 10 MTU)

  maxrate : max per flow rate (default : unlimited)

  buckets : number of RB trees (default : 1024) in hash table.
               (consumes 8 bytes per bucket)

  [no]pacing : disable/enable pacing (default is enable)

All of them can be changed on a live qdisc.

$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
              [ quantum BYTES ] [ initial_quantum BYTES ]
              [ maxrate RATE  ] [ buckets NUMBER ]
              [ [no]pacing ]

$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
 backlog 0b 0p requeues 14
  511 flows, 511 inactive, 0 throttled
  110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

[1] Except if initial srtt is overestimated, as if using
cached srtt in tcp metrics. We'll provide a fix for this issue.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

afe4fd06

can: gw: add a per rule limitation of frame hops · 391ac128

由 Oliver Hartkopp 提交于 8月 26, 2013

Usually the received CAN frames can be processed/routed as much as 'max_hops'
times (which is given at module load time of the can-gw module).
Introduce a new configuration option to reduce the number of possible hops
for a specific gateway rule to a value smaller then max_hops.
Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

391ac128

net: packet: add randomized fanout scheduler · 5df0ddfb

由 Daniel Borkmann 提交于 8月 28, 2013

We currently allow for different fanout scheduling policies in pf_packet
such as scheduling by skb's rxhash, round-robin, by cpu, and rollover.
Also allow for a random, equidistributed selection of the socket from the
fanout process group.
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5df0ddfb

ipv6: drop fragmented ndisc packets by default (RFC 6980) · b800c3b9

由 Hannes Frederic Sowa 提交于 8月 27, 2013

This patch implements RFC6980: Drop fragmented ndisc packets by
default. If a fragmented ndisc packet is received the user is informed
that it is possible to disable the check.

Cc: Fernando Gont <fernando@gont.com.ar>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b800c3b9

perf: make events stream always parsable · ff3d527c

由 Adrian Hunter 提交于 8月 27, 2013

The event stream is not always parsable because the format of a sample
is dependent on the sample_type of the selected event.  When there is
more than one selected event and the sample_types are not the same then
parsing becomes problematic.  A sample can be matched to its selected
event using the ID that is allocated when the event is opened.
Unfortunately, to get the ID from the sample means first parsing it.

This patch adds a new sample format bit PERF_SAMPLE_IDENTIFER that puts
the ID at a fixed position so that the ID can be retrieved without
parsing the sample.  For sample events, that is the first position
immediately after the header.  For non-sample events, that is the last
position.

In this respect parsing samples requires that the sample_type and ID
values are recorded.  For example, perf tools records struct
perf_event_attr and the IDs within the perf.data file.  Those must be
read first before it is possible to parse samples found later in the
perf.data file.
Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com>
Tested-by: NStephane Eranian <eranian@google.com>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1377591794-30553-6-git-send-email-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>

ff3d527c

Input: add SYN_MAX and SYN_CNT constants · 52764fed

由 David Herrmann 提交于 8月 29, 2013

SYN_* events are special and not enabled via set_bit() for devices. Hence,
they haven't been really needed, yet. However, user-space can still make
great use of that for int->string debugging helpers or alike.

Also, I haven't seen any reason not to define these, so here they are.
Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
Acked-by: NPeter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>

52764fed

29 8月, 2013 5 次提交

Omnikey Cardman 4000: pull in ioctl.h in user header · aaaafb7f

由 Mike Frysinger 提交于 8月 28, 2013

This file uses the ioctl helpers (_IOR/_IOW/etc...), so include ioctl.h
for the definitions.
Signed-off-by: NMike Frysinger <vapier@gentoo.org>
Cc: Harald Welte <laforge@gnumonks.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aaaafb7f

PCI: Add offsets of PCIe capability registers · bd6fb762

由 Bjorn Helgaas 提交于 8月 27, 2013

These offsets are not used, and in some cases are completely reserved
even in the spec, but I'm adding them for completeness just to match
the diagrams in the spec, e.g., PCIe spec r3.0, sec 7.8.
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

bd6fb762

PCI: Tidy bitmasks and spacing of PCIe capability definitions · c0b4b381

由 Bjorn Helgaas 提交于 8月 27, 2013

The convention of showing bits in a mask of the full register width, e.g.,
"0x00000007" instead of "0x07" for a field in a 32-bit register, is common
but not universal in this file. This patch makes it consistently used at
least for the PCIe capability.

Whitespace and zero-extension changes only; no functional change.
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

c0b4b381

PCI: Remove obsolete comment reference to pci_pcie_cap2() · 1b121c24

由 Bjorn Helgaas 提交于 8月 27, 2013

pci_pcie_cap2() was replaced by pcie_capability_read_word() and similar
functions, so update the comment.
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

1b121c24

PCI: Clarify PCI_EXP_TYPE_PCI_BRIDGE comment · fbf501c3

由 Bjorn Helgaas 提交于 8月 27, 2013

The PCI_EXP_TYPE_PCI_BRIDGE is a *PCIe* function that is a bridge to
PCI/PCI-X.  See PCIe spec r3.0, sec 7.8.2.
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

fbf501c3

28 8月, 2013 4 次提交

Revert "OMAP: UART: Keep the TX fifo full when possible" · 355fe568

由 Greg Kroah-Hartman 提交于 8月 27, 2013

This reverts commit c4415084.

Kevin writes:
	Hmm, another OMAP serial patch that wasn't Cc'd to linux-omap
	where OMAP users might have seen it. :(

	I just bisected a strange problem in linux-next on OMAP3 down to
	this patch.  Reverting it fixes the problem.

	On OMAP3530 Beagle and Overo, after boot, doing a 'cat
	/proc/cpuinfo' was not returning to a prompt, suggesting
	something strange with the FIFO.  Hitting return gets me back to
	a prompt.

	Greg, this one should also be dropped from tty-next until it can
	be further investgated and the problem solved.
Reported-by: NKevin Hilman <khilman@linaro.org>
Cc: Dmitry Fink <finik@ti.com>
Cc: Alexander Savchenko <oleksandr.savchenko@ti.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

355fe568

netfilter: add SYNPROXY core/target · 48b1de4c

由 Patrick McHardy 提交于 8月 27, 2013

Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.

The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.

It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.

Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to now break PAWS, the timestamps are translated in
the direction server->client.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Tested-by: NMartin Topholm <mph@one.com>
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

48b1de4c

netfilter: nf_conntrack: make sequence number adjustments usuable without NAT · 41d73ec0

由 Patrick McHardy 提交于 8月 27, 2013

Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a seperate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.

As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Tested-by: NMartin Topholm <mph@one.com>
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

41d73ec0

PCI: Rename PCIe capability definitions to follow convention · d2ab1fa6

由 Bjorn Helgaas 提交于 8月 27, 2013

All other PCIe capability register fields include "PCI_EXP" + <reg-name> +
<field-name>.  This renames PCI_EXP_OBFF_MASK, PCI_EXP_IDO_REQ_EN,
PCI_EXP_LTR_EN, and related fields using the same convention.
No functional change.
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>	# for MFD driver

d2ab1fa6

27 8月, 2013 1 次提交

openvswitch: Add SCTP support · a175a723

由 Joe Stringer 提交于 8月 22, 2013

This patch adds support for rewriting SCTP src,dst ports similar to the
functionality already available for TCP/UDP.

Rewriting SCTP ports is expensive due to double-recalculation of the
SCTP checksums; this is performed to ensure that packets traversing OVS
with invalid checksums will continue to the destination with any
checksum corruption intact.
Reviewed-by: NSimon Horman <horms@verge.net.au>
Signed-off-by: NJoe Stringer <joe@wand.net.nz>
Signed-off-by: NBen Pfaff <blp@nicira.com>
Signed-off-by: NJesse Gross <jesse@nicira.com>

a175a723

26 8月, 2013 2 次提交

KVM: PPC: reserve a capability number for multitce support · 0bd50dc9

由 Alexey Kardashevskiy 提交于 8月 01, 2013

This is to reserve a capablity number for upcoming support
of H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
which support mulptiple DMA map/unmap operations per one call.
Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: NGleb Natapov <gleb@redhat.com>

0bd50dc9

kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi · 4b0a8670

由 Raghavendra K T 提交于 8月 26, 2013

this is needed by both guest and host.

Originally-from: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: NGleb Natapov <gleb@redhat.com>
Acked-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NGleb Natapov <gleb@redhat.com>

4b0a8670