1. 28 Jan 2014, 8 commits
  2. 26 Jan 2014, 1 commit
    • libceph: add ceph_kv{malloc,free}() and switch to them · eeb0bed5
      Committed by Ilya Dryomov
      Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
      two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.
      
      ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
      vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
      behaviour:
      
      - for buffers (ceph_buffer_new()), from trying to kmalloc() everything
        and using vmalloc() just as a fallback
      
      - for messages (ceph_msg_new()), from going to vmalloc() for anything
        bigger than a page
      
      - for messages (ceph_msg_new()), from disallowing vmalloc() to use high
        memory
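
      A minimal sketch of the two helpers, assuming the 8-page cutoff is
      expressed as PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER; the exact bodies in
      net/ceph/ceph_common.c may differ:

          void *ceph_kvmalloc(size_t size, gfp_t flags)
          {
                  if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                          void *ptr = kmalloc(size, flags | __GFP_NOWARN);

                          if (ptr)
                                  return ptr;
                  }
                  /* anything bigger goes to vmalloc, highmem allowed */
                  return __vmalloc(size, flags | __GFP_HIGHMEM, PAGE_KERNEL);
          }

          void ceph_kvfree(const void *ptr)
          {
                  if (is_vmalloc_addr(ptr))
                          vfree(ptr);
                  else
                          kfree(ptr);
          }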
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  3. 21 Jan 2014, 3 commits
  4. 14 Jan 2014, 1 commit
  5. 01 Jan 2014, 13 commits
  6. 14 Dec 2013, 1 commit
    • libceph: block I/O when PAUSE or FULL osd map flags are set · d29adb34
      Committed by Josh Durgin
      The PAUSEWR and PAUSERD flags are meant to stop the cluster from
      processing writes and reads, respectively. The FULL flag is set when
      the cluster determines that it is out of space, and will no longer
      process writes.  PAUSEWR and PAUSERD are purely client-side settings
      already implemented in userspace clients. The osd does nothing special
      with these flags.
      
      When the FULL flag is set, however, the osd responds to all writes
      with -ENOSPC. For cephfs, this makes sense, but for rbd the block
      layer translates this into EIO.  If a cluster goes from full to
      non-full quickly, a filesystem on top of rbd will not behave well,
      since some writes succeed while others get EIO.
      
      Fix this by blocking any writes when the FULL flag is set in the osd
      client. This is the same strategy used by userspace, so apply it by
      default.  A follow-on patch makes this configurable.
      
      __map_request() is called to re-target osd requests in case the
      available osds changed.  Add a paused field to a ceph_osd_request, and
      set it whenever an appropriate osd map flag is set.  Avoid queueing
      paused requests in __map_request(), but force them to be resent if
      they become unpaused.
      
      Also subscribe to the next osd map from the monitor if any of these
      flags are set, so paused requests can be unblocked as soon as
      possible.
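
      A hedged sketch of the pause test this describes; the helper name and
      the exact shape of the code in net/ceph/osd_client.c are assumptions:

          static bool __req_should_be_paused(struct ceph_osd_client *osdc,
                                             struct ceph_osd_request *req)
          {
                  bool pauserd = osdc->osdmap->flags & CEPH_OSDMAP_PAUSERD;
                  bool pausewr = osdc->osdmap->flags &
                                 (CEPH_OSDMAP_PAUSEWR | CEPH_OSDMAP_FULL);

                  /* reads pause on PAUSERD; writes pause on PAUSEWR or FULL */
                  return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
                         (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
          }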
      
      Fixes: http://tracker.ceph.com/issues/6079
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
  7. 03 Dec 2013, 1 commit
  8. 29 Nov 2013, 1 commit
    • efivars, efi-pstore: Hold off deletion of sysfs entry until the scan is completed · e0d59733
      Committed by Seiji Aguchi
      Currently, when mounting the pstore file system, the read callback of
      the efi_pstore driver runs multiple times, as follows.
      
      - In the first read callback, scan efivar_sysfs_list from the head and
        pass the kmsg buffer of an entry to the upper pstore layer.
      - In the second read callback, rescan efivar_sysfs_list from that entry
        and pass another kmsg buffer up.
      - Repeat the scan-and-pass cycle until the end of efivar_sysfs_list.
      
      In this process, an entry is read across multiple read calls. To avoid
      a race between reading and erasing, the whole process above is
      protected by a spinlock, held in open() and released in close().
      
      During this window, kmemdup() is called to pass the buffer to the
      pstore filesystem, which triggers the lockdep warning shown below.
      
      To make the dynamic memory allocation safe without taking the
      spinlock, hold off the deletion of a sysfs entry if it happens while
      the entry is being scanned via efi_pstore, and delete it after the
      scan is completed.
      
      To implement this, the patch introduces two flags, scanning and
      deleting, in struct efivar_entry.
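
      An illustrative placement of the two flags; the surrounding fields are
      simplified from the struct efivar_entry definition of that era:

          struct efivar_entry {
                  struct efi_variable var;
                  struct list_head list;
                  struct kobject kobj;
                  bool scanning;  /* set while efi-pstore iterates the entry */
                  bool deleting;  /* deletion requested mid-scan, deferred */
          };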
      
      On a first reading of the code, the scanning and deleting logic may
      seem unnecessary, because __efivars->lock is not dropped while reading
      from the EFI variable store.
      
      But the logic is still needed, because efi-pstore and the pstore
      filesystem interact as follows.
      
      When an entry (A) is found, its pointer is saved to psi->data, and
      efi_pstore_read() passes entry (A) to the pstore filesystem by
      releasing __efivars->lock.
      
      The pstore filesystem then calls efi_pstore_read() again, and the same
      entry (A), saved in psi->data, is used to resume scanning the sysfs
      list.
      
      So the logic is needed to protect entry (A) across these calls.
      
      [    1.143710] ------------[ cut here ]------------
      [    1.144058] WARNING: CPU: 1 PID: 1 at kernel/lockdep.c:2740 lockdep_trace_alloc+0x104/0x110()
      [    1.144058] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
      [    1.144058] Modules linked in:
      [    1.144058] CPU: 1 PID: 1 Comm: systemd Not tainted 3.11.0-rc5 #2
      [    1.144058]  0000000000000009 ffff8800797e9ae0 ffffffff816614a5 ffff8800797e9b28
      [    1.144058]  ffff8800797e9b18 ffffffff8105510d 0000000000000080 0000000000000046
      [    1.144058]  00000000000000d0 00000000000003af ffffffff81ccd0c0 ffff8800797e9b78
      [    1.144058] Call Trace:
      [    1.144058]  [<ffffffff816614a5>] dump_stack+0x54/0x74
      [    1.144058]  [<ffffffff8105510d>] warn_slowpath_common+0x7d/0xa0
      [    1.144058]  [<ffffffff8105517c>] warn_slowpath_fmt+0x4c/0x50
      [    1.144058]  [<ffffffff8131290f>] ? vsscanf+0x57f/0x7b0
      [    1.144058]  [<ffffffff810bbd74>] lockdep_trace_alloc+0x104/0x110
      [    1.144058]  [<ffffffff81192da0>] __kmalloc_track_caller+0x50/0x280
      [    1.144058]  [<ffffffff815147bb>] ? efi_pstore_read_func.part.1+0x12b/0x170
      [    1.144058]  [<ffffffff8115b260>] kmemdup+0x20/0x50
      [    1.144058]  [<ffffffff815147bb>] efi_pstore_read_func.part.1+0x12b/0x170
      [    1.144058]  [<ffffffff81514800>] ? efi_pstore_read_func.part.1+0x170/0x170
      [    1.144058]  [<ffffffff815148b4>] efi_pstore_read_func+0xb4/0xe0
      [    1.144058]  [<ffffffff81512b7b>] __efivar_entry_iter+0xfb/0x120
      [    1.144058]  [<ffffffff8151428f>] efi_pstore_read+0x3f/0x50
      [    1.144058]  [<ffffffff8128d7ba>] pstore_get_records+0x9a/0x150
      [    1.158207]  [<ffffffff812af25c>] ? selinux_d_instantiate+0x1c/0x20
      [    1.158207]  [<ffffffff8128ce30>] ? parse_options+0x80/0x80
      [    1.158207]  [<ffffffff8128ced5>] pstore_fill_super+0xa5/0xc0
      [    1.158207]  [<ffffffff811ae7d2>] mount_single+0xa2/0xd0
      [    1.158207]  [<ffffffff8128ccf8>] pstore_mount+0x18/0x20
      [    1.158207]  [<ffffffff811ae8b9>] mount_fs+0x39/0x1b0
      [    1.158207]  [<ffffffff81160550>] ? __alloc_percpu+0x10/0x20
      [    1.158207]  [<ffffffff811c9493>] vfs_kern_mount+0x63/0xf0
      [    1.158207]  [<ffffffff811cbb0e>] do_mount+0x23e/0xa20
      [    1.158207]  [<ffffffff8115b51b>] ? strndup_user+0x4b/0xf0
      [    1.158207]  [<ffffffff811cc373>] SyS_mount+0x83/0xc0
      [    1.158207]  [<ffffffff81673cc2>] system_call_fastpath+0x16/0x1b
      [    1.158207] ---[ end trace 61981bc62de9f6f4 ]---
      Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
      Tested-by: Madper Xie <cxie@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
  9. 28 Nov 2013, 1 commit
  10. 26 Nov 2013, 1 commit
    • ARM: tegra: Provide dummy powergate implementation · 9886e1fd
      Committed by Thierry Reding
      In order to support increased build test coverage for drivers, implement
      dummy stubs for the powergate API. This will allow the drivers to be
      built without requiring support for Tegra to be selected.
      
      This patch solves the following build errors, which can be triggered in
      v3.13-rc1 by selecting DRM_TEGRA without ARCH_TEGRA:
      
      drivers/built-in.o: In function `gr3d_remove':
      drivers/gpu/drm/tegra/gr3d.c:321: undefined reference to `tegra_powergate_power_off'
      drivers/gpu/drm/tegra/gr3d.c:325: undefined reference to `tegra_powergate_power_off'
      drivers/built-in.o: In function `gr3d_probe':
      drivers/gpu/drm/tegra/gr3d.c:266: undefined reference to `tegra_powergate_sequence_power_up'
      drivers/gpu/drm/tegra/gr3d.c:273: undefined reference to `tegra_powergate_sequence_power_up'
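
      A hedged sketch of such dummy stubs, assuming they sit behind
      CONFIG_ARCH_TEGRA in the tegra-powergate header and return -ENOSYS:

          struct clk;

          #ifdef CONFIG_ARCH_TEGRA
          int tegra_powergate_power_off(int id);
          int tegra_powergate_sequence_power_up(int id, struct clk *clk);
          #else
          static inline int tegra_powergate_power_off(int id)
          {
                  return -ENOSYS;
          }

          static inline int tegra_powergate_sequence_power_up(int id,
                                                              struct clk *clk)
          {
                  return -ENOSYS;
          }
          #endif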
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      [swarren, updated commit description]
      Signed-off-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Olof Johansson <olof@lixom.net>
  11. 25 Nov 2013, 2 commits
  12. 22 Nov 2013, 3 commits
    • mm: place page->pmd_huge_pte to right union · 7aa555bf
      Committed by Kirill A. Shutemov
      I don't know what went wrong, a mis-merge or something, but
      ->pmd_huge_pte was placed in the wrong union within struct page.
      
      In the original patch[1] it is placed in the union with ->lru and
      ->slab, but in commit e009bb30 ("mm: implement split page table lock
      for PMD level") it ended up in the union with ->index and ->freelist.
      
      That union also seems unused for pages holding page tables, and safe
      to re-use, but it's not what I've tested.
      
      Let's move it back to its original place.  It fixes the indentation,
      at least.  :)
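
      A simplified sketch of the intended placement; the real union in
      include/linux/mm_types.h has more members and config guards, so treat
      this as illustrative only:

          struct page {
                  /* ... */
                  union {
                          struct list_head lru;   /* pageout list / slab lists */
          #if USE_SPLIT_PMD_PTLOCKS
                          pgtable_t pmd_huge_pte; /* protected by page->ptl */
          #endif
                  };
                  /* ... */
          };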
      
      [1] https://lkml.org/lkml/2013/10/7/288
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Committed by Andrea Arcangeli
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause a dangling-pointer dereference if
      split_huge_page runs during PageHuge() while there are updates to the
      tail_page->private field.
      
      It also repeats compound_head twice for hugetlbfs, and runs
      compound_head+compound_trans_head for THP, when a single call is
      needed in both cases.
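
      A hedged sketch of a PageHuge() test along the lines of the fix, keyed
      on the compound destructor instead of tail_page->private; the actual
      code in mm/hugetlb.c may differ in detail:

          int PageHuge(struct page *page)
          {
                  if (!PageCompound(page))
                          return 0;

                  page = compound_head(page);
                  /* hugetlbfs installs free_huge_page as the compound dtor */
                  return get_compound_page_dtor(page) == free_huge_page;
          }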
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Committed by Dave Hansen
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
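
      A hedged sketch of the reworked copier in mm/migrate.c; the
      gigantic-page helper is an assumed name:

          static void copy_huge_page(struct page *dst, struct page *src)
          {
                  int i;
                  int nr_pages;

                  if (PageHuge(src)) {
                          /* hugetlbfs page: size comes from the hstate */
                          struct hstate *h = page_hstate(src);

                          nr_pages = pages_per_huge_page(h);
                          if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                                  copy_gigantic_page(dst, src, nr_pages);
                                  return;
                          }
                  } else {
                          /* thp page: no hstate to consult */
                          VM_BUG_ON(!PageTransHuge(src));
                          nr_pages = hpage_nr_pages(src);
                  }

                  for (i = 0; i < nr_pages; i++) {
                          cond_resched();
                          copy_highpage(dst + i, src + i);
                  }
          }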
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 21 Nov 2013, 3 commits
    • net/phy: Add the autocross feature for forced links on VSC82x4 · 3fb69bca
      Committed by Madalin Bucur
      Add auto-MDI/MDI-X capability for forced (autonegotiation disabled)
      10/100 Mbps speeds on Vitesse VSC82x4 PHYs. Export the previously
      static function genphy_setup_forced(), which is required by the new
      config_aneg handler in the Vitesse PHY module.
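
      A hedged sketch of what the new handler can look like; the function
      and helper names are illustrative rather than the exact ones in
      drivers/net/phy/vitesse.c:

          static int vsc82x4_config_aneg(struct phy_device *phydev)
          {
                  int err;

                  /* auto-MDI/MDI-X is programmed for forced 10/100 links */
                  if (phydev->autoneg != AUTONEG_ENABLE &&
                      phydev->speed <= SPEED_100) {
                          err = vsc82x4_config_autocross_enable(phydev);
                          if (err < 0)
                                  return err;

                          return genphy_setup_forced(phydev);
                  }

                  return genphy_config_aneg(phydev);
          }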
      Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
      Signed-off-by: Shruti Kanetkar <Shruti@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426
      Committed by Hannes Frederic Sowa
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
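
      A hedged sketch of the contract this imposes on a handler; the
      protocol name and the use of sockaddr_in are illustrative:

          static int demo_recvmsg(struct kiocb *iocb, struct socket *sock,
                                  struct msghdr *msg, size_t len, int flags)
          {
                  struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
                  int err = 0;

                  /* ... receive the payload ... */

                  if (sin) {
                          /* initialize every byte of the address ... */
                          sin->sin_family = AF_INET;
                          sin->sin_port = 0;
                          sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
                          memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
                          /* ... then report its exact size */
                          msg->msg_namelen = sizeof(*sin);
                  }
                  /* msg_namelen arrives as 0; leave it 0 if no name is set */

                  return err;
          }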
      
      Optimize for the case where recvfrom is called with a NULL address. We
      don't need to copy the address at all, so set it to NULL before
      invoking the recvmsg handler. We can do so because all the recvmsg
      handlers must already cope with a plain read() being called on them,
      and read() also sets msg_name to NULL.
      
      Also document these changes in include/linux/net.h as suggested by David
      Miller.
      
      Changes since RFC:
      
      Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      affect sendto as it would bail out earlier while trying to copy-in the
      address. It also more naturally reflects the logic by the callers of
      verify_iovec.
      
      With this change in place I could remove:

          if (!uaddr || msg_sys->msg_namelen == 0)
                  msg->msg_name = NULL;
      
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "mm: create a separate slab for page->ptl allocation" · 8b2e9b71
      Committed by Linus Torvalds
      This reverts commit ea1e7ed3.
      
      Al points out that while the commit *does* actually create a separate
      slab for the page->ptl allocation, that slab is never actually used, and
      the code continues to use kmalloc/kfree.
      
      Damien Wyart points out that the original patch did have the conversion
      to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.
      
      Revert the half-arsed attempt that didn't do anything.  If we really do
      want the special slab (remember: this is all relevant just for debug
      builds, so it's not necessarily all that critical) we might as well redo
      the patch fully.
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 20 Nov 2013, 1 commit
    • genetlink: make multicast groups const, prevent abuse · 2a94fe48
      Committed by Johannes Berg
      Register generic netlink multicast groups as an array with
      the family and give them contiguous group IDs. Then instead
      of passing the global group ID to the various functions that
      send messages, pass the ID relative to the family - for most families
      that's just 0, because they only have one group.
      
      This avoids the list_head and ID in each group, adding a new
      field for the mcast group ID offset to the family.
      
      At the same time, this allows us to prevent abusing groups
      again like the quota and dropmon code did, since we can now
      check that a family only uses a group it owns.
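
      A hedged sketch of the resulting registration pattern; the demo
      family, group name, and empty ops array are illustrative:

          static const struct genl_multicast_group demo_mcgrps[] = {
                  { .name = "events" },
          };

          static const struct genl_ops demo_ops[] = {
                  /* command handlers would go here */
          };

          static struct genl_family demo_family = {
                  .id      = GENL_ID_GENERATE,
                  .name    = "demo",
                  .version = 1,
          };

          static int __init demo_init(void)
          {
                  /* register ops and the const group array in one call */
                  return genl_register_family_with_ops_groups(&demo_family,
                                                              demo_ops,
                                                              demo_mcgrps);
          }

          /* group IDs are now family-relative: 0 means demo_mcgrps[0] */
          /* genlmsg_multicast(&demo_family, skb, portid, 0, GFP_KERNEL); */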
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>