提交 · 9c0863b1cc485f2bacac0675c68b73e5341cfd26 · openanolis / cloud-kernel

11 3月, 2014 3 次提交

[media] vb2: call buf_finish from __queue_cancel · 9c0863b1

由 Hans Verkuil 提交于 3月 04, 2014

If a queue was canceled, then the buf_finish op was never called for the
pending buffers. So add this call to queue_cancel. Before calling buf_finish
set the buffer state to PREPARED, which is the correct state. That way the
states DONE and ERROR will only be seen in buf_finish if streaming is in
progress.

Since buf_finish can now be called from non-streaming state we need to
adapt the handful of drivers that actually need to know this.
Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
Acked-by: NSakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

9c0863b1

[media] vb2: change result code of buf_finish to void · 06470642

由 Hans Verkuil 提交于 3月 04, 2014

The buf_finish op should always work, so change the return type to void.
Update the few drivers that use it.
Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
Acked-by: NPawel Osciak <pawel@osciak.com>
Reviewed-by: NPawel Osciak <pawel@osciak.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

06470642

[media] vb2: add debugging code to check for unbalanced ops · b5b4541e

由 Hans Verkuil 提交于 1月 29, 2014

When a vb2_queue is freed check if all the mem_ops and queue ops were balanced.
So the number of calls to e.g. buf_finish has to match the number of calls to
buf_prepare, etc.

This code is only enabled if CONFIG_VIDEO_ADV_DEBUG is set.
Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
Acked-by: NPawel Osciak <pawel@osciak.com>
Acked-by: NLaurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

b5b4541e

06 3月, 2014 10 次提交

[media] v4l: Add timestamp source flags, mask and document them · 872484ce

由 Sakari Ailus 提交于 8月 25, 2013

Some devices do not produce timestamps that correspond to the end of the
frame. The user space should be informed on the matter. This patch achieves
that by adding buffer flags (and a mask) for timestamp sources since more
possible timestamping points are expected than just two.

A three-bit mask is defined (V4L2_BUF_FLAG_TSTAMP_SRC_MASK) and two of the
eight possible values is are defined V4L2_BUF_FLAG_TSTAMP_SRC_EOF for end of
frame (value zero) V4L2_BUF_FLAG_TSTAMP_SRC_SOE for start of exposure (next
value).
Signed-off-by: NSakari Ailus <sakari.ailus@iki.fi>
Acked-by: NKamil Debski <k.debski@samsung.com>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

872484ce

[media] v4l: Rename vb2_queue.timestamp_type as timestamp_flags · ade48681

由 Sakari Ailus 提交于 2月 25, 2014

The timestamp_type field used to contain only the timestamp type. Soon it
will be used for timestamp source flags as well. Rename the field
accordingly.

[m.chehab@samsung.com: do the change also to drivers/staging/media and at s2255]
Signed-off-by: NSakari Ailus <sakari.ailus@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

ade48681

[media] v4l: Use full 32 bits for buffer flags · 939f1377

由 Sakari Ailus 提交于 8月 25, 2013

The buffer flags field is 32 bits but the defined only used 16. This is
fine, but as more than 16 bits will be used in the very near future, define
them as 32-bit numbers for consistency.
Signed-off-by: NSakari Ailus <sakari.ailus@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

939f1377

[media] v4l: add RF tuner gain controls · 80807fad

由 Antti Palosaari 提交于 1月 24, 2014

Modern silicon RF tuners used nowadays has many controllable gain
stages on signal path. Usually, but not always, there is at least
3 gain stages. Also on some cases there could be multiple gain
stages within the ones specified here. However, I think that having
these three controllable gain stages offers enough fine-tuning for
real use cases.

1) LNA gain. That is first gain just after antenna input.
2) Mixer gain. It is located quite middle of the signal path, where
RF signal is down-converted to IF/BB.
3) IF gain. That is last gain in order to adjust output signal level
to optimal level for receiving party (usually demodulator ADC).

Each gain stage could be set rather often both manual or automatic
(AGC) mode. Due to that add separate controls for controlling
operation mode.
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

80807fad

[media] v4l: add device capability flag for SDR receiver · c9c54f72

由 Antti Palosaari 提交于 12月 17, 2013

VIDIOC_QUERYCAP IOCTL is used to query device capabilities. Add new
capability flag to inform given device supports SDR capture.
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Acked-by: NHans Verkuil <hverkuil@xs4all.nl>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

c9c54f72

[media] v4l: define own IOCTL ops for SDR FMT · 855df1dc

由 Antti Palosaari 提交于 12月 14, 2013

Use own format ops for SDR data:
vidioc_enum_fmt_sdr_cap
vidioc_g_fmt_sdr_cap
vidioc_s_fmt_sdr_cap
vidioc_try_fmt_sdr_cap
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

855df1dc

[media] v4l: add stream format for SDR receiver · 6f3073b8

由 Antti Palosaari 提交于 12月 12, 2013

Add new V4L2 stream format definition, V4L2_BUF_TYPE_SDR_CAPTURE,
for SDR receiver.
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

6f3073b8

[media] v4l: 1 Hz resolution flag for tuners · 67f9a117

由 Antti Palosaari 提交于 12月 11, 2013

Add V4L2_TUNER_CAP_1HZ for 1 Hz resolution.
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

67f9a117

[media] v4l: add new tuner types for SDR · 84099a28

由 Antti Palosaari 提交于 12月 11, 2013

Define tuner types V4L2_TUNER_ADC and V4L2_TUNER_RF for SDR usage.

ADC is used for setting sampling rate (sampling frequency) to SDR
device.

Another tuner type, named as V4L2_TUNER_RF, is possible RF tuner.
Is is used to down-convert RF frequency to range ADC could sample.
Having RF tuner is optional, whilst in practice it is almost always
there.

Also add checks to VIDIOC_G_FREQUENCY, VIDIOC_S_FREQUENCY and
VIDIOC_ENUM_FREQ_BANDS only allow these two tuner types when device
type is SDR (VFL_TYPE_SDR). For VIDIOC_G_FREQUENCY we do not check
tuner type, instead override type with V4L2_TUNER_ADC in every
case (requested by Hans in order to keep functionality in line with
existing tuners and existing API does not specify it).

Prohibit VIDIOC_S_HW_FREQ_SEEK explicitly when device type is SDR,
as device cannot do hardware seek without a hardware demodulator.
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

84099a28

[media] v4l: add device type for Software Defined Radio · d42626bd

由 Antti Palosaari 提交于 12月 11, 2013

Add new V4L device type VFL_TYPE_SDR for Software Defined Radio.
It is registered as /dev/swradio0 (/dev/sdr0 was already reserved).
Signed-off-by: NAntti Palosaari <crope@iki.fi>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

d42626bd

01 3月, 2014 1 次提交

[media] v4l2: Add settings for Horizontal and Vertical MV Search Range · bf0bedd3

由 Amit Grover 提交于 2月 04, 2014

Adding V4L2 controls for horizontal and vertical search range in pixels
for motion estimation module in video encoder.
Signed-off-by: NSwami Nathan <swaminath.p@samsung.com>
Signed-off-by: NAmit Grover <amit.grover@samsung.com>
Acked-by: NHans Verkuil <hans.verkuil@cisco.com>
Acked-by: NLad, Prabhakar <prabhakar.csengg@gmail.com>
Signed-off-by: NKamil Debski <k.debski@samsung.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

bf0bedd3

26 2月, 2014 1 次提交

ipc,mqueue: remove limits for the amount of system-wide queues · f3713fd9

由 Davidlohr Bueso 提交于 2月 25, 2014

Commit 93e6f119 ("ipc/mqueue: cleanup definition names and
locations") added global hardcoded limits to the amount of message
queues that can be created.  While these limits are per-namespace,
reality is that it ends up breaking userspace applications.
Historically users have, at least in theory, been able to create up to
INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
for some workloads and use cases.  For instance, Madars reports:

 "This update imposes bad limits on our multi-process application.  As
  our app uses approaches that each process opens its own set of queues
  (usually something about 3-5 queues per process).  In some scenarios
  we might run up to 3000 processes or more (which of-course for linux
  is not a problem).  Thus we might need up to 9000 queues or more.  All
  processes run under one user."

Other affected users can be found in launchpad bug #1155695:
  https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

Instead of increasing this limit, revert it entirely and fallback to the
original way of dealing queue limits -- where once a user's resource
limit is reached, and all memory is used, new queues cannot be created.
Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
Reported-by: NMadars Vitolins <m@silodev.com>
Acked-by: NDoug Ledford <dledford@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>	[3.5+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f3713fd9

25 2月, 2014 2 次提交

sysfs: fix namespace refcnt leak · fed95bab

由 Li Zefan 提交于 2月 25, 2014

As mount() and kill_sb() is not a one-to-one match, we shoudn't get
ns refcnt unconditionally in sysfs_mount(), and instead we should
get the refcnt only when kernfs_mount() allocated a new superblock.

v2:
- Changed the name of the new argument, suggested by Tejun.
- Made the argument optional, suggested by Tejun.

v3:
- Make the new argument as second-to-last arg, suggested by Tejun.
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NTejun Heo <tj@kernel.org>
 ---
 fs/kernfs/mount.c      | 8 +++++++-
 fs/sysfs/mount.c       | 5 +++--
 include/linux/kernfs.h | 9 +++++----
 3 files changed, 15 insertions(+), 7 deletions(-)
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

fed95bab

fsnotify: Allocate overflow events with proper type · ff57cd58

由 Jan Kara 提交于 2月 21, 2014

Commit 7053aee2 "fsnotify: do not share events between notification
groups" used overflow event statically allocated in a group with the
size of the generic notification event. This causes problems because
some code looks at type specific parts of event structure and gets
confused by a random data it sees there and causes crashes.

Fix the problem by allocating overflow event with type corresponding to
the group type so code cannot get confused.
Signed-off-by: NJan Kara <jack@suse.cz>

ff57cd58

24 2月, 2014 2 次提交

[media] v4l2-subdev: Allow 32-bit compat ioctls · ab58a301

由 Hans Verkuil 提交于 2月 10, 2014

Add support for 32-bit ioctls with v4l-subdev device nodes.

Rather than keep adding new ioctls to the list in v4l2-compat-ioctl32.c, just check
if the ioctl is a non-private V4L2 ioctl and if so, call the conversion code.

We keep forgetting to add new ioctls, so this is a more robust solution.

In addition extend the subdev API with support for a compat32 function to
convert custom v4l-subdev ioctls.
Signed-off-by: NHans Verkuil <hans.verkuil@cisco.com>
Tested-by: NSakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>

ab58a301

asm-generic: add sched_setattr/sched_getattr syscalls · e6cfc029

由 James Hogan 提交于 2月 03, 2014

Add the sched_setattr and sched_getattr syscalls to the generic syscall
list, which is used by the following architectures: arc, arm64, c6x,
hexagon, metag, openrisc, score, tile, unicore32.
Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: linux-c6x-dev@linux-c6x.org
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-hexagon@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux@lists.openrisc.net
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>

e6cfc029

22 2月, 2014 2 次提交

Revert "writeback: do not sync data dirtied after sync start" · 0dc83bd3

由 Jan Kara 提交于 2月 21, 2014

This reverts commit c4a391b5. Dave
Chinner <david@fromorbit.com> has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: NJan Kara <jack@suse.cz>

0dc83bd3

sched: Add 'flags' argument to sched_{set,get}attr() syscalls · 6d35ab48

由 Peter Zijlstra 提交于 2月 14, 2014

Because of a recent syscall design debate; its deemed appropriate for
each syscall to have a flags argument for future extension; without
immediately requiring new syscalls.

Cc: juri.lelli@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Suggested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

6d35ab48

20 2月, 2014 1 次提交

ASoC: dapm: Add locking to snd_soc_dapm_xxxx_pin functions · 11391100

由 Charles Keepax 提交于 2月 18, 2014

The snd_soc_dapm_xxxx_pin all require the dapm_mutex to be held when
they are called as they edit the dirty list, however very few of the
callers do so.

This patch adds unlocked versions of all the functions replacing the
existing implementations with one that holds the lock internally. We
also fix up the places where the lock was actually held on the caller
side.
Signed-off-by: NCharles Keepax <ckeepax@opensource.wolfsonmicro.com>
Signed-off-by: NMark Brown <broonie@linaro.org>
Cc: stable@vger.kernel.org

11391100

19 2月, 2014 4 次提交

mfd: tps65217: Naturalise cross-architecture discrepancies · 5c6fbd56

由 Lee Jones 提交于 2月 03, 2014

If we compile the TPS65217 for a 64bit architecture we receive the following
warnings:

drivers/mfd/tps65217.c: In function ‘tps65217_probe’:
drivers/mfd/tps65217.c:173:13:
  warning: cast from pointer to integer of different size
   chip_id = (unsigned int)match->data;
             ^
Signed-off-by: NLee Jones <lee.jones@linaro.org>

5c6fbd56

mfd: max8998: Naturalise cross-architecture discrepancies · 8bace2d5

由 Lee Jones 提交于 2月 03, 2014

If we compile the MAX8998 for a 64bit architecture we receive the following
warnings:

  drivers/mfd/max8998.c: In function ‘max8998_i2c_get_driver_data’:
  drivers/mfd/max8998.c:178:10:
    warning: cast from pointer to integer of different size
     return (int)match->data;
            ^
Signed-off-by: NLee Jones <lee.jones@linaro.org>

8bace2d5

mfd: max8997: Naturalise cross-architecture discrepancies · 05fb7a56

由 Lee Jones 提交于 1月 23, 2014

If we compile the MAX8997 for a 64bit architecture we receive the following
warnings:

  drivers/mfd/max8997.c: In function ‘max8997_i2c_get_driver_data’:
  drivers/mfd/max8997.c:173:10:
    warning: cast from pointer to integer of different size
     return (int)match->data;
            ^
Signed-off-by: NLee Jones <lee.jones@linaro.org>

05fb7a56

drm: add DRM_CAPs for cursor size · 8716ed4e

由 Alex Deucher 提交于 2月 12, 2014

Some hardware may not support standard 64x64 cursors.  Add
a drm cap to query the cursor size from the kernel.  Some examples
include radeon CIK parts (128x128 cursors) and armada (32x64 or 64x32).
This allows things like device specific ddxes to remove asics specific
logic and also allows xf86-video-modesetting to work properly with hw
cursors on this hardware. Default to 64 if the driver doesn't specify
a size.
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
Reviewed-by: NRob Clark <robdclark@gmail.com>

8716ed4e

18 2月, 2014 3 次提交

drm/ttm: declare 'struct device' in ttm_page_alloc.h · 728a0cdf

由 Alexandre Courbot 提交于 2月 09, 2014

Declare 'struct device' explicitly in ttm_page_alloc.h as this file
does not include any file declaring it. This removes the following
warning:

	warning: 'struct device' declared inside parameter list
Signed-off-by: NAlexandre Courbot <acourbot@nvidia.com>
Reviewed-by: NThierry Reding <treding@nvidia.com>

728a0cdf

inotify: Fix reporting of cookies for inotify events · 45a22f4c

由 Jan Kara 提交于 2月 17, 2014

My rework of handling of notification events (namely commit 7053aee2
"fsnotify: do not share events between notification groups") broke
sending of cookies with inotify events. We didn't propagate the value
passed to fsnotify() properly and passed 4 uninitialized bytes to
userspace instead (so it is also an information leak). Sadly I didn't
notice this during my testing because inotify cookies aren't used very
much and LTP inotify tests ignore them.

Fix the problem by passing the cookie value properly.

Fixes: 7053aee2Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: NJan Kara <jack@suse.cz>

45a22f4c

ceph: remove xattr when null value is given to setxattr() · bcdfeb2e

由 Yan, Zheng 提交于 2月 11, 2014

For the setxattr request, introduce a new flag CEPH_XATTR_REMOVE
to distinguish null value case from the zero-length value case.
Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>

bcdfeb2e

17 2月, 2014 4 次提交

netdevice: move netdev_cap_txqueue for shared usage to header · b9507bda

由 Daniel Borkmann 提交于 2月 16, 2014

In order to allow users to invoke netdev_cap_txqueue, it needs to
be moved into netdevice.h header file. While at it, also add kernel
doc header to document the API.
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b9507bda

netdevice: add queue selection fallback handler for ndo_select_queue · 99932d4f

由 Daniel Borkmann 提交于 2月 16, 2014

Add a new argument for ndo_select_queue() callback that passes a
fallback handler. This gets invoked through netdev_pick_tx();
fallback handler is currently __netdev_pick_tx() as most drivers
invoke this function within their customized implementation in
case for skbs that don't need any special handling. This fallback
handler can then be replaced on other call-sites with different
queue selection methods (e.g. in packet sockets, pktgen etc).

This also has the nice side-effect that __netdev_pick_tx() is
then only invoked from netdev_pick_tx() and export of that
function to modules can be undone.
Suggested-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

99932d4f

net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer · ef2820a7

由 Matija Glavinic Pecotic 提交于 2月 14, 2014

Implementation of (a)rwnd calculation might lead to severe performance issues
and associations completely stalling. These problems are described and solution
is proposed which improves lksctp's robustness in congestion state.

1) Sudden drop of a_rwnd and incomplete window recovery afterwards

Data accounted in sctp_assoc_rwnd_decrease takes only payload size (sctp data),
but size of sk_buff, which is blamed against receiver buffer, is not accounted
in rwnd. Theoretically, this should not be the problem as actual size of buffer
is double the amount requested on the socket (SO_RECVBUF). Problem here is
that this will have bad scaling for data which is less then sizeof sk_buff.
E.g. in 4G (LTE) networks, link interfacing radio side will have a large portion
of traffic of this size (less then 100B).

An example of sudden drop and incomplete window recovery is given below. Node B
exhibits problematic behavior. Node A initiates association and B is configured
to advertise rwnd of 10000. A sends messages of size 43B (size of typical sctp
message in 4G (LTE) network). On B data is left in buffer by not reading socket
in userspace.

Lets examine when we will hit pressure state and declare rwnd to be 0 for
scenario with above stated parameters (rwnd == 10000, chunk size == 43, each
chunk is sent in separate sctp packet)

Logic is implemented in sctp_assoc_rwnd_decrease:

socket_buffer (see below) is maximum size which can be held in socket buffer
(sk_rcvbuf). current_alloced is amount of data currently allocated (rx_count)

A simple expression is given for which it will be examined after how many
packets for above stated parameters we enter pressure state:

We start by condition which has to be met in order to enter pressure state:

	socket_buffer < currently_alloced;

currently_alloced is represented as size of sctp packets received so far and not
yet delivered to userspace. x is the number of chunks/packets (since there is no
bundling, and each chunk is delivered in separate packet, we can observe each
chunk also as sctp packet, and what is important here, having its own sk_buff):

	socket_buffer < x*each_sctp_packet;

each_sctp_packet is sctp chunk size + sizeof(struct sk_buff). socket_buffer is
twice the amount of initially requested size of socket buffer, which is in case
of sctp, twice the a_rwnd requested:

	2*rwnd < x*(payload+sizeof(struc sk_buff));

sizeof(struct sk_buff) is 190 (3.13.0-rc4+). Above is stated that rwnd is 10000
and each payload size is 43

	20000 < x(43+190);

	x > 20000/233;

	x ~> 84;

After ~84 messages, pressure state is entered and 0 rwnd is advertised while
received 84*43B ~= 3612B sctp data. This is why external observer notices sudden
drop from 6474 to 0, as it will be now shown in example:

IP A.34340 > B.12345: sctp (1) [INIT] [init tag: 1875509148] [rwnd: 81920] [OS: 10] [MIS: 65535] [init TSN: 1096057017]
IP B.12345 > A.34340: sctp (1) [INIT ACK] [init tag: 3198966556] [rwnd: 10000] [OS: 10] [MIS: 10] [init TSN: 902132839]
IP A.34340 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.34340: sctp (1) [COOKIE ACK]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057017] [SID: 0] [SSEQ 0] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057017] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057018] [SID: 0] [SSEQ 1] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057018] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057019] [SID: 0] [SSEQ 2] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057019] [a_rwnd 9914] [#gap acks 0] [#dup tsns 0]
<...>
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057098] [SID: 0] [SSEQ 81] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057098] [a_rwnd 6517] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057099] [SID: 0] [SSEQ 82] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057099] [a_rwnd 6474] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057100] [SID: 0] [SSEQ 83] [PPID 0x18]

--> Sudden drop

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

At this point, rwnd_press stores current rwnd value so it can be later restored
in sctp_assoc_rwnd_increase. This however doesn't happen as condition to start
slowly increasing rwnd until rwnd_press is returned to rwnd is never met. This
condition is not met since rwnd, after it hit 0, must first reach rwnd_press by
adding amount which is read from userspace. Let us observe values in above
example. Initial a_rwnd is 10000, pressure was hit when rwnd was ~6500 and the
amount of actual sctp data currently waiting to be delivered to userspace
is ~3500. When userspace starts to read, sctp_assoc_rwnd_increase will be blamed
only for sctp data, which is ~3500. Condition is never met, and when userspace
reads all data, rwnd stays on 3569.

IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 1505] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057100] [a_rwnd 3010] [#gap acks 0] [#dup tsns 0]
IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057101] [SID: 0] [SSEQ 84] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057101] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

--> At this point userspace read everything, rwnd recovered only to 3569

IP A.34340 > B.12345: sctp (1) [DATA] (B)(E) [TSN: 1096057102] [SID: 0] [SSEQ 85] [PPID 0x18]
IP B.12345 > A.34340: sctp (1) [SACK] [cum ack 1096057102] [a_rwnd 3569] [#gap acks 0] [#dup tsns 0]

Reproduction is straight forward, it is enough for sender to send packets of
size less then sizeof(struct sk_buff) and receiver keeping them in its buffers.

2) Minute size window for associations sharing the same socket buffer

In case multiple associations share the same socket, and same socket buffer
(sctp.rcvbuf_policy == 0), different scenarios exist in which congestion on one
of the associations can permanently drop rwnd of other association(s).

Situation will be typically observed as one association suddenly having rwnd
dropped to size of last packet received and never recovering beyond that point.
Different scenarios will lead to it, but all have in common that one of the
associations (let it be association from 1)) nearly depleted socket buffer, and
the other association blames socket buffer just for the amount enough to start
the pressure. This association will enter pressure state, set rwnd_press and
announce 0 rwnd.
When data is read by userspace, similar situation as in 1) will occur, rwnd will
increase just for the size read by userspace but rwnd_press will be high enough
so that association doesn't have enough credit to reach rwnd_press and restore
to previous state. This case is special case of 1), being worse as there is, in
the worst case, only one packet in buffer for which size rwnd will be increased.
Consequence is association which has very low maximum rwnd ('minute size', in
our case down to 43B - size of packet which caused pressure) and as such
unusable.

Scenario happened in the field and labs frequently after congestion state (link
breaks, different probabilities of packet drop, packet reordering) and with
scenario 1) preceding. Here is given a deterministic scenario for reproduction:

>From node A establish two associations on the same socket, with rcvbuf_policy
being set to share one common buffer (sctp.rcvbuf_policy == 0). On association 1
repeat scenario from 1), that is, bring it down to 0 and restore up. Observe
scenario 1). Use small payload size (here we use 43). Once rwnd is 'recovered',
bring it down close to 0, as in just one more packet would close it. This has as
a consequence that association number 2 is able to receive (at least) one more
packet which will bring it in pressure state. E.g. if association 2 had rwnd of
10000, packet received was 43, and we enter at this point into pressure,
rwnd_press will have 9957. Once payload is delivered to userspace, rwnd will
increase for 43, but conditions to restore rwnd to original state, just as in
1), will never be satisfied.

--> Association 1, between A.y and B.12345

IP A.55915 > B.12345: sctp (1) [INIT] [init tag: 836880897] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 4032536569]
IP B.12345 > A.55915: sctp (1) [INIT ACK] [init tag: 2873310749] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3799315613]
IP A.55915 > B.12345: sctp (1) [COOKIE ECHO]
IP B.12345 > A.55915: sctp (1) [COOKIE ACK]

--> Association 2, between A.z and B.12346

IP A.55915 > B.12346: sctp (1) [INIT] [init tag: 534798321] [rwnd: 10000] [OS: 10] [MIS: 65535] [init TSN: 2099285173]
IP B.12346 > A.55915: sctp (1) [INIT ACK] [init tag: 516668823] [rwnd: 81920] [OS: 10] [MIS: 10] [init TSN: 3676403240]
IP A.55915 > B.12346: sctp (1) [COOKIE ECHO]
IP B.12346 > A.55915: sctp (1) [COOKIE ACK]

--> Deplete socket buffer by sending messages of size 43B over association 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315613] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315613] [a_rwnd 9957] [#gap acks 0] [#dup tsns 0]

<...>

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315696] [a_rwnd 6388] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315697] [SID: 0] [SSEQ 84] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315697] [a_rwnd 6345] [#gap acks 0] [#dup tsns 0]

--> Sudden drop on 1

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315698] [SID: 0] [SSEQ 85] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315698] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Here userspace read, rwnd 'recovered' to 3698, now deplete again using
    association 1 so there is place in buffer for only one more packet

IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315799] [SID: 0] [SSEQ 186] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315799] [a_rwnd 86] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315800] [SID: 0] [SSEQ 187] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

--> Socket buffer is almost depleted, but there is space for one more packet,
    send them over association 2, size 43B

IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403240] [SID: 0] [SSEQ 0] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403240] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Immediate drop

IP A.60995 > B.12346: sctp (1) [SACK] [cum ack 387491510] [a_rwnd 0] [#gap acks 0] [#dup tsns 0]

--> Read everything from the socket, both association recover up to maximum rwnd
    they are capable of reaching, note that association 1 recovered up to 3698,
    and association 2 recovered only to 43

IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 1548] [#gap acks 0] [#dup tsns 0]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315800] [a_rwnd 3053] [#gap acks 0] [#dup tsns 0]
IP B.12345 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3799315801] [SID: 0] [SSEQ 188] [PPID 0x18]
IP A.55915 > B.12345: sctp (1) [SACK] [cum ack 3799315801] [a_rwnd 3698] [#gap acks 0] [#dup tsns 0]
IP B.12346 > A.55915: sctp (1) [DATA] (B)(E) [TSN: 3676403241] [SID: 0] [SSEQ 1] [PPID 0x18]
IP A.55915 > B.12346: sctp (1) [SACK] [cum ack 3676403241] [a_rwnd 43] [#gap acks 0] [#dup tsns 0]

A careful reader might wonder why it is necessary to reproduce 1) prior
reproduction of 2). It is simply easier to observe when to send packet over
association 2 which will push association into the pressure state.

Proposed solution:

Both problems share the same root cause, and that is improper scaling of socket
buffer with rwnd. Solution in which sizeof(sk_buff) is taken into concern while
calculating rwnd is not possible due to fact that there is no linear
relationship between amount of data blamed in increase/decrease with IP packet
in which payload arrived. Even in case such solution would be followed,
complexity of the code would increase. Due to nature of current rwnd handling,
slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure state is
entered is rationale, but it gives false representation to the sender of current
buffer space. Furthermore, it implements additional congestion control mechanism
which is defined on implementation, and not on standard basis.

Proposed solution simplifies whole algorithm having on mind definition from rfc:

o  Receiver Window (rwnd): This gives the sender an indication of the space
   available in the receiver's inbound buffer.

Core of the proposed solution is given with these lines:

sctp_assoc_rwnd_update:
	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
	else
		asoc->rwnd = 0;

We advertise to sender (half of) actual space we have. Half is in the braces
depending whether you would like to observe size of socket buffer as SO_RECVBUF
or twice the amount, i.e. size is the one visible from userspace, that is,
from kernelspace.
In this way sender is given with good approximation of our buffer space,
regardless of the buffer policy - we always advertise what we have. Proposed
solution fixes described problems and removes necessity for rwnd restoration
algorithm. Finally, as proposed solution is simplification, some lines of code,
along with some bytes in struct sctp_association are saved.

Version 2 of the patch addressed comments from Vlad. Name of the function is set
to be more descriptive, and two parts of code are changed, in one removing the
superfluous call to sctp_assoc_rwnd_update since call would not result in update
of rwnd, and the other being reordering of the code in a way that call to
sctp_assoc_rwnd_update updates rwnd. Version 3 corrected change introduced in v2
in a way that existing function is not reordered/copied in line, but it is
correctly called. Thanks Vlad for suggesting.
Signed-off-by: NMatija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
Reviewed-by: NAlexander Sverdlin <alexander.sverdlin@nsn.com>
Acked-by: NVlad Yasevich <vyasevich@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ef2820a7

mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit · 56eecdb9

由 Aneesh Kumar K.V 提交于 2月 12, 2014

Archs like ppc64 doesn't do tlb flush in set_pte/pmd functions when using
a hash table MMU for various reasons (the flush is handled as part of
the PTE modification when necessary).

ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.

Additionally ppc64 require the tlb flushing to be batched within ptl locks.

The reason to do that is to ensure that the hash page table is in sync with
linux page table.

We track the hpte index in linux pte and if we clear them without flushing
hash and drop the ptl lock, we can have another cpu update the pte and can
end up with duplicate entry in the hash table, which is fatal.

We also want to keep set_pte_at simpler by not requiring them to do hash
flush for performance reason. We do that by assuming that set_pte_at() is
never *ever* called on a PTE that is already valid.

This was the case until the NUMA code went in which broke that assumption.

Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
using set_pte_at() and a powerpc specific one using the appropriate
mechanism needed to keep the hash table in sync.
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

56eecdb9

15 2月, 2014 1 次提交

Revert "btrfs: add ioctl to export size of global metadata reservation" · 11bcac89

由 Chris Mason 提交于 2月 14, 2014

This reverts commit 01e219e8.

David Sterba found a different way to provide these features without adding a new
ioctl.  We haven't released any progs with this ioctl yet, so I'm taking this out
for now until we finalize things.
Signed-off-by: NChris Mason <clm@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
CC: Jeff Mahoney <jeffm@suse.com>

11bcac89

14 2月, 2014 6 次提交

workqueue: add args to workqueue lockdep name · fada94ee

由 Li Zhong 提交于 2月 14, 2014

Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in
process_one_work().

Maybe #fmt plus #args could be used as the lock_name to give some more
information for some fmt string like the above.

__builtin_constant_p() check is removed (as there seems no good way to
check all the variables in args list). However, by removing the check,
it only adds two additional "s for those constants.

Some lockdep name examples printed out after the change:

lockdep name                    wq->name

"events_long"                   events_long
"%s"("khelper")                 khelper
"xfs-data/%s"mp->m_fsname       xfs-data/dm-3
Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

fada94ee

mlx5: Add include of <linux/slab.h> because of kzalloc()/kfree() use · 6ecde51d

由 Roland Dreier 提交于 2月 13, 2014

On some architectures (for example, arm), we don't end up indirectly
pulling in the declaration of kzalloc() and kfree(), and so building
anything that includes <linux/mlx5/driver.h> breaks.  Fix this by adding
an explicit include to get the declaration.
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NRoland Dreier <roland@purestorage.com>

6ecde51d

IB: Report using RoCE IP based gids in port caps · b4a26a27

由 Moni Shoua 提交于 2月 09, 2014

For userspace RoCE UD QPs we need to know the GID format that the
kernel uses, e.g when working over older kernels. For that end, add a
new port capability IB_PORT_IP_BASED_GIDS and report it when query
port is issued.
Signed-off-by: NMoni Shoua <monis@mellanox.co.il>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NRoland Dreier <roland@purestorage.com>

b4a26a27

net: ip, ipv6: handle gso skbs in forwarding path · fe6cc55f

由 Florian Westphal 提交于 2月 13, 2014

Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.

Given:
Host <mtu1500> R1 <mtu1200> R2

Host sends tcp stream which is routed via R1 and R2.  R1 performs GRO.

In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.

When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.

This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.

For ipv6, we send out pkt too big error for gso if the individual
segments are too big.

For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.

Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.

However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there.  I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.

Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.

This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small.  Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway.  Also its not like this could not be improved later
once the dust settles.
Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
Reported-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fe6cc55f

net: core: introduce netif_skb_dev_features · d2069403

由 Florian Westphal 提交于 2月 13, 2014

Will be used by upcoming ipv4 forward path change that needs to
determine feature mask using skb->dst->dev instead of skb->dev.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d2069403

PCI/MSI: Add pci_enable_msi_exact() and pci_enable_msix_exact() · 3ce4e860

由 Alexander Gordeev 提交于 2月 13, 2014

The new functions are special cases for pci_enable_msi_range() and
pci_enable_msix_range() when a particular number of MSI or MSI-X
is needed.

By contrast with pci_enable_msi_range() and pci_enable_msix_range()
functions, pci_enable_msi_exact() and pci_enable_msix_exact()
return zero in case of success, which indicates MSI or MSI-X
interrupts have been successfully allocated.
Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

3ce4e860

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功