提交 · ddf1af6fa00e772fdb67a7d22cb83fac2b8968a8 · OpenHarmony / kernel_linux

08 2月, 2016 1 次提交

由 Yuchung Cheng 提交于 2月 02, 2016

This patch changes the accounting of how many packets are
newly acked or sacked when the sender receives an ACK.

The current approach basically computes

   newly_acked_sacked = (prior_packets - prior_sacked) -
                        (tp->packets_out - tp->sacked_out)

   where prior_packets and prior_sacked out are snapshot
   at the beginning of the ACK processing.

The new approach tracks the delivery information via a new
TCP state variable "delivered" which monotically increases
as new packets are delivered in order or out-of-order.

The reason for this change is that the current approach is
brittle that produces negative or inaccurate estimate.

   1) For non-SACK connections, an ACK that advances the SND.UNA
   could reset the DUPACK counters (tp->sacked_out) in
   tcp_process_loss() or tcp_fastretrans_alert(). This inflates
   the inflight suddenly and causes under-estimate or even
   negative estimate. Here is a real example:

                   before   after (processing ACK)
   packets_out     75       73
   sacked_out      23        0
   ca state        Loss     Open

   The old approach computes (75-23) - (73 - 0) = -21 delivered
   while the new approach computes 1 delivered since it
   considers the 2nd-24th packets are delivered OOO.

   2) MSS change would re-count packets_out and sacked_out so
   the estimate is in-accurate and can even become negative.
   E.g., the inflight is doubled when MSS is halved.

   3) Spurious retransmission signaled by DSACK is not accounted

The new approach is simpler and more robust. For SACK connections,
tp->delivered increments as packets are being acked or sacked in
SACK and ACK processing.

For non-sack connections, it's done in tcp_remove_reno_sacks() and
tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered
is incremented by the number of packets ACKed (less the current
number of DUPACKs received plus one packet hole).  Upon receiving
a DUPACK, tp->delivered is incremented assuming one out-of-order
packet is delivered.

Upon receiving a DSACK, tp->delivered is incremtened assuming one
retransmission is delivered in tcp_sacktag_write_queue().
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ddf1af6f

06 2月, 2016 5 次提交

bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33

由 Alexei Starovoitov 提交于 2月 01, 2016

The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.

Example of single counter aggregation in user space:
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i < nr_cpus; i++)
    value += values[i];

The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

15a07b33

bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8

由 Alexei Starovoitov 提交于 2月 01, 2016

Primary use case is a histogram array of latency
where bpf program computes the latency of block requests or other
events and stores histogram of latency into array of 64 elements.
All cpus are constantly running, so normal increment is not accurate,
bpf_xadd causes cache ping-pong and this per-cpu approach allows
fastest collision-free counters.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a10423b8

ethtool: Declare netdev_rss_key as __read_mostly. · ba905f5e

由 Kim Jones 提交于 2月 02, 2016

netdev_rss_key is written to once and thereafter is read by
drivers when they are initialising. The fact that it is mostly
read and not written to makes it a candidate for a __read_mostly
declaration.
Signed-off-by: NKim Jones <kim-marie.jones@intel.com>
Signed-off-by: NAlan Carey <alan.carey@intel.com>
Acked-by: NRami Rosen <rami.rosen@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ba905f5e

team: track sum of rx_nohandler for all slaves · bb63daf9

由 Jarod Wilson 提交于 2月 01, 2016

CC: Jiri Pirko <jiri@resnulli.us>
CC: netdev@vger.kernel.org
Signed-off-by: NJarod Wilson <jarod@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bb63daf9

net: add rx_nohandler stat counter · 6e7333d3

由 Jarod Wilson 提交于 2月 01, 2016

This adds an rx_nohandler stat counter, along with a sysfs statistics
node, and copies the counter out via netlink as well.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@mellanox.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Tom Herbert <tom@herbertland.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: NJarod Wilson <jarod@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6e7333d3

01 2月, 2016 1 次提交

phys_to_pfn_t: use phys_addr_t · 76e9f0ee

由 Dan Williams 提交于 1月 22, 2016

A dma_addr_t is potentially smaller than a phys_addr_t on some archs.
Don't truncate the address when doing the pfn conversion.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: NMatthew Wilcox <willy@linux.intel.com>
[willy: fix pfn_t_to_phys as well]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

76e9f0ee

31 1月, 2016 3 次提交

block: use DAX for partition table reads · d1a5f2b4

由 Dan Williams 提交于 1月 28, 2016

Avoid populating pagecache when the block device is in DAX mode.
Otherwise these page cache entries collide with the fsync/msync
implementation and break data durability guarantees.

Cc: Jan Kara <jack@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NMatthew Wilcox <willy@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

d1a5f2b4

block: revert runtime dax control of the raw block device · 9f4736fe

由 Dan Williams 提交于 1月 28, 2016

Dynamically enabling DAX requires that the page cache first be flushed
and invalidated.  This must occur atomically with the change of DAX mode
otherwise we confuse the fsync/msync tracking and violate data
durability guarantees.  Eliminate the possibilty of DAX-disabled to
DAX-enabled transitions for now and revisit this for the next cycle.

Cc: Jan Kara <jack@suse.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

9f4736fe

fs, block: force direct-I/O for dax-enabled block devices · 65f87ee7

由 Dan Williams 提交于 1月 25, 2016

Similar to the file I/O path, re-direct all I/O to the DAX path for I/O
to a block-device special file.  Both regular files and device special
files can use the common filp->f_mapping->host lookup to determing is
DAX is enabled.

Otherwise, we confuse the DAX code that does not expect to find live
data in the page cache:

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 7676 at mm/filemap.c:217
    __delete_from_page_cache+0x9f6/0xb60()
    Modules linked in:
    CPU: 0 PID: 7676 Comm: a.out Not tainted 4.4.0+ #276
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
     00000000ffffffff ffff88006d3f7738 ffffffff82999e2d 0000000000000000
     ffff8800620a0000 ffffffff86473d20 ffff88006d3f7778 ffffffff81352089
     ffffffff81658d36 ffffffff86473d20 00000000000000d9 ffffea0000009d60
    Call Trace:
     [<     inline     >] __dump_stack lib/dump_stack.c:15
     [<ffffffff82999e2d>] dump_stack+0x6f/0xa2 lib/dump_stack.c:50
     [<ffffffff81352089>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:482
     [<ffffffff813522b9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:515
     [<ffffffff81658d36>] __delete_from_page_cache+0x9f6/0xb60 mm/filemap.c:217
     [<ffffffff81658fb2>] delete_from_page_cache+0x112/0x200 mm/filemap.c:244
     [<ffffffff818af369>] __dax_fault+0x859/0x1800 fs/dax.c:487
     [<ffffffff8186f4f6>] blkdev_dax_fault+0x26/0x30 fs/block_dev.c:1730
     [<     inline     >] wp_pfn_shared mm/memory.c:2208
     [<ffffffff816e9145>] do_wp_page+0xc85/0x14f0 mm/memory.c:2307
     [<     inline     >] handle_pte_fault mm/memory.c:3323
     [<     inline     >] __handle_mm_fault mm/memory.c:3417
     [<ffffffff816ecec3>] handle_mm_fault+0x2483/0x4640 mm/memory.c:3446
     [<ffffffff8127eff6>] __do_page_fault+0x376/0x960 arch/x86/mm/fault.c:1238
     [<ffffffff8127f738>] trace_do_page_fault+0xe8/0x420 arch/x86/mm/fault.c:1331
     [<ffffffff812705c4>] do_async_page_fault+0x14/0xd0 arch/x86/kernel/kvm.c:264
     [<ffffffff86338f78>] async_page_fault+0x28/0x30 arch/x86/entry/entry_64.S:986
     [<ffffffff86336c36>] entry_SYSCALL_64_fastpath+0x16/0x7a
    arch/x86/entry/entry_64.S:185
    ---[ end trace dae21e0f85f1f98c ]---

Fixes: 5a023cdb ("block: enable dax for raw block devices")
Reported-by: NDmitry Vyukov <dvyukov@google.com>
Reported-by: NKirill A. Shutemov <kirill@shutemov.name>
Suggested-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJan Kara <jack@suse.cz>
Suggested-by: NMatthew Wilcox <willy@linux.intel.com>
Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

65f87ee7

29 1月, 2016 3 次提交

iommu: Update struct iommu_ops comments · 0d9bacb6

由 Magnus Damm 提交于 1月 19, 2016

Update the comments around struct iommu_ops to match
current state and fix a few typos while at it.
Signed-off-by: NMagnus Damm <damm+renesas@opensource.se>
Signed-off-by: NJoerg Roedel <jroedel@suse.de>

0d9bacb6

perf: Synchronously clean up child events · c6e5b732

由 Peter Zijlstra 提交于 1月 15, 2016

The orphan cleanup workqueue doesn't always catch orphans, for example,
if they never schedule after they are orphaned. IOW, the event leak is
still very real. It also wouldn't work for kernel counters.

Doing it synchonously is a little hairy due to lock inversion issues,
but is made to work.

Patch based on work by Alexander Shishkin.
Suggested-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Signed-off-by: NIngo Molnar <mingo@kernel.org>

c6e5b732

perf/bpf: Convert perf_event_array to use struct file · e03e7ee3

由 Alexei Starovoitov 提交于 1月 25, 2016

Robustify refcounting.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e03e7ee3

27 1月, 2016 3 次提交

include/linux/cleancache.h: Clean up code · a39bb9a0

由 Chen Gang 提交于 1月 07, 2016

Let cleancache_fs_enabled() call cleancache_fs_enabled_mapping()
directly.

Remove redundant variable ret in cleancache_get_page().
Signed-off-by: NChen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

a39bb9a0

cleancache: constify cleancache_ops structure · b3c6de49

由 Julia Lawall 提交于 1月 21, 2016

The cleancache_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.
Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

b3c6de49

tty: Wait interruptibly for tty lock on reopen · 0bfd464d

由 Peter Hurley 提交于 1月 09, 2016

Allow a signal to interrupt the wait for a tty reopen; eg., if
the tty has starting final close and is waiting for the device to
drain.
Signed-off-by: NPeter Hurley <peter@hurleysoftware.com>
Cc: stable <stable@vger.kernel.org> # 4.4
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

0bfd464d

26 1月, 2016 1 次提交

irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token · 530cbe10

由 Marc Zyngier 提交于 1月 26, 2016

Let's take the (outlandish) example of an interrupt controller
capable of handling both wired interrupts and PCI MSIs.

With the current code, the PCI MSI domain is going to be tagged
with DOMAIN_BUS_PCI_MSI, and the wired domain with DOMAIN_BUS_ANY.

Things get hairy when we start looking up the domain for a wired
interrupt (typically when creating it based on some firmware
information - DT or ACPI).

In irq_create_fwspec_mapping(), we perform the lookup using
DOMAIN_BUS_ANY, which is actually used as a wildcard. This gives
us one chance out of two to end up with the wrong domain, and
we try to configure a wired interrupt with the MSI domain.
Everything grinds to a halt pretty quickly.

What we really need to do is to start looking for a domain that
would uniquely identify a wired interrupt domain, and only use
DOMAIN_BUS_ANY as a fallback.

In order to solve this, let's introduce a new DOMAIN_BUS_WIRED
token, which is going to be used exactly as described above.
Of course, this depends on the irqchip to setup the domain
bus_token, and nobody had to implement this so far.

Only so far.
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1453816347-32720-2-git-send-email-marc.zyngier@arm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

530cbe10

25 1月, 2016 1 次提交

net: simplify napi_synchronize() to avoid warnings · facc432f

由 Arnd Bergmann 提交于 1月 22, 2016

The napi_synchronize() function is defined twice: The definition
for SMP builds waits for other CPUs to be done, while the uniprocessor
variant just contains a barrier and ignores its argument.

In the mvneta driver, this leads to a warning about an unused variable
when we lookup the NAPI struct of another CPU and then don't use it:

ethernet/marvell/mvneta.c: In function 'mvneta_percpu_notifier':
ethernet/marvell/mvneta.c:2910:30: error: unused variable 'other_port' [-Werror=unused-variable]

There are no other CPUs on a UP build, so that code never runs, but
gcc does not know this.

The nicest solution seems to be to turn the napi_synchronize() helper
into an inline function for the UP case as well, as that leads gcc to
not complain about the argument being unused. Once we do that, we can
also combine the two cases into a single function definition and use
if(IS_ENABLED()) rather than #ifdef to make it look a bit nicer.

The warning first came up in linux-4.4, but I failed to catch it
earlier.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Fixes: f8642885 ("net: mvneta: Statically assign queues to CPUs")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

facc432f

24 1月, 2016 5 次提交

MIPS: bcm963xx: Update bcm_tag field image_sequence · 696569f7

由 Simon Arlott 提交于 12月 13, 2015

The "dual_image" and "inactive_flag" fields should be merged into a single
"image_sequence" field.
Signed-off-by: NSimon Arlott <simon@fire.lp0.eu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Jonas Gorski <jogo@openwrt.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: MIPS Mailing List <linux-mips@linux-mips.org>
Cc: MTD Maling List <linux-mtd@lists.infradead.org>
Patchwork: https://patchwork.linux-mips.org/patch/11834/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>

696569f7

MIPS: bcm963xx: Move extended flash address to bcm_tag header file · 1f29cb19

由 Simon Arlott 提交于 12月 13, 2015

The extended flash address needs to be subtracted from bcm_tag flash
image offsets. Move this value to the bcm_tag header file.

Renamed define name to consistently use bcm963xx for flash layout
which should be considered a property of the board and not the SoC
(i.e. bcm63xx could theoretically be used on a board without CFE
or any flash).
Signed-off-by: NSimon Arlott <simon@fire.lp0.eu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Jonas Gorski <jogo@openwrt.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: MIPS Mailing List <linux-mips@linux-mips.org>
Cc: MTD Maling List <linux-mtd@lists.infradead.org>
Patchwork: https://patchwork.linux-mips.org/patch/11833/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>

1f29cb19

MIPS: bcm963xx: Move Broadcom BCM963xx image tag data structure · 8fce60b8

由 Simon Arlott 提交于 12月 13, 2015

Move Broadcom BCM963xx image tag data structure to include/linux/
so that drivers outside of mach-bcm63xx can use it.
Signed-off-by: NSimon Arlott <simon@fire.lp0.eu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Jonas Gorski <jogo@openwrt.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: MIPS Mailing List <linux-mips@linux-mips.org>
Cc: MTD Maling List <linux-mtd@lists.infradead.org>
Patchwork: https://patchwork.linux-mips.org/patch/11832/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>

8fce60b8

MIPS: bcm963xx: Add Broadcom BCM963xx board nvram data structure · 3271e610

由 Simon Arlott 提交于 12月 13, 2015

Broadcom BCM963xx boards have multiple nvram variants across different
SoCs with additional checksum fields added whenever the size of the
nvram was extended.

Add this structure as a header file so that multiple drivers can use it.
Signed-off-by: NSimon Arlott <simon@fire.lp0.eu>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Jonas Gorski <jogo@openwrt.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: MIPS Mailing List <linux-mips@linux-mips.org>
Cc: MTD Maling List <linux-mtd@lists.infradead.org>
Patchwork: https://patchwork.linux-mips.org/patch/11830/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>

3271e610

MIPS: Add support for PIC32MZDA platform · 2572f00d

由 Joshua Henderson 提交于 1月 13, 2016

This adds support for the Microchip PIC32 MIPS microcontroller with the
specific variant PIC32MZDA. PIC32MZDA is based on the MIPS m14KEc core
and boots using device tree.

This includes an early pin setup and early clock setup needed prior to
device tree being initialized. In additon, an interface is provided to
synchronize access to registers shared across several peripherals.
Signed-off-by: NJoshua Henderson <joshua.henderson@microchip.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/12097/Signed-off-by: NRalf Baechle <ralf@linux-mips.org>

2572f00d

23 1月, 2016 6 次提交

dax: add support for fsync/sync · 9973c98e

由 Ross Zwisler 提交于 1月 22, 2016

To properly handle fsync/msync in an efficient way DAX needs to track
dirty pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file,
and it already has support for exceptional (non struct page*) entries.
We build upon these features to add exceptional entries to the radix
tree for DAX dirty PMD or PTE pages at fault time.

[dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9973c98e

mm: add find_get_entries_tag() · 7e7f7749

由 Ross Zwisler 提交于 1月 22, 2016

Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this
function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7e7f7749

dax: support dirty DAX entries in radix tree · f9fe48be

由 Ross Zwisler 提交于 1月 22, 2016

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new
exceptional entries into the radix tree that represent dirty DAX PTE or
PMD pages.  These exceptional entries will also contain the writeback
addresses for the PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely
on the fact that only one type of exceptional entry can be found in a
given radix tree based on its usage.  This happens for free with DAX vs
shmem but we explicitly prevent shadow entries from being added to radix
trees for DAX mappings.

The only shadow entries that would be generated for DAX radix trees
would be to track zero page mappings that were created for holes.  These
pages would receive minimal benefit from having shadow entries, and the
choice to have only one type of exceptional entry in a given radix tree
makes the logic simpler both in clear_exceptional_entry() and in the
rest of DAX.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f9fe48be

pmem: add wb_cache_pmem() to the PMEM API · 3f4a2670

由 Ross Zwisler 提交于 1月 22, 2016

__arch_wb_cache_pmem() was already an internal implementation detail of
the x86 PMEM API, but this functionality needs to be exported as part of
the general PMEM API to handle the fsync/msync case for DAX mmaps.

One thing worth noting is that we really do want this to be part of the
PMEM API as opposed to a stand-alone function like clflush_cache_range()
because of ordering restrictions.  By having wb_cache_pmem() as part of
the PMEM API we can leave it unordered, call it multiple times to write
back large amounts of memory, and then order the multiple calls with a
single wmb_pmem().
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3f4a2670

A
make sure that freeing shmem fast symlinks is RCU-delayed · 3ed47db3
由 Al Viro 提交于 1月 22, 2016
```
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
3ed47db3

wrappers for ->i_mutex access · 5955102c

由 Al Viro 提交于 1月 22, 2016

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5955102c

22 1月, 2016 10 次提交

thp: change pmd_trans_huge_lock() interface to return ptl · b6ec57f4

由 Kirill A. Shutemov 提交于 1月 21, 2016

After THP refcounting rework we have only two possible return values
from pmd_trans_huge_lock(): success and failure.  Return-by-pointer for
ptl doesn't make much sense in this case.

Let's convert pmd_trans_huge_lock() to return ptl on success and NULL on
failure.
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b6ec57f4

libceph: fix ceph_msg_revoke() · 67645d76

由 Ilya Dryomov 提交于 12月 28, 2015

There are a number of problems with revoking a "was sending" message:

(1) We never make any attempt to revoke data - only kvecs contibute to
con->out_skip.  However, once the header (envelope) is written to the
socket, our peer learns data_len and sets itself to expect at least
data_len bytes to follow front or front+middle.  If ceph_msg_revoke()
is called while the messenger is sending message's data portion,
anything we send after that call is counted by the OSD towards the now
revoked message's data portion.  The effects vary, the most common one
is the eventual hang - higher layers get stuck waiting for the reply to
the message that was sent out after ceph_msg_revoke() returned and
treated by the OSD as a bunch of data bytes.  This is what Matt ran
into.

(2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs
is wrong.  If ceph_msg_revoke() is called before the tag is sent out or
while the messenger is sending the header, we will get a connection
reset, either due to a bad tag (0 is not a valid tag) or a bad header
CRC, which kind of defeats the purpose of revoke.  Currently the kernel
client refuses to work with header CRCs disabled, but that will likely
change in the future, making this even worse.

(3) con->out_skip is not reset on connection reset, leading to one or
more spurious connection resets if we happen to get a real one between
con->out_skip is set in ceph_msg_revoke() and before it's cleared in
write_partial_skip().

Fixing (1) and (3) is trivial.  The idea behind fixing (2) is to never
zero the tag or the header, i.e. send out tag+header regardless of when
ceph_msg_revoke() is called.  That way the header is always correct, no
unnecessary resets are induced and revoke stands ready for disabled
CRCs.  Since ceph_msg_revoke() rips out con->out_msg, introduce a new
"message out temp" and copy the header into it before sending.

Cc: stable@vger.kernel.org # 4.0+
Reported-by: NMatt Conner <matt.conner@keepertech.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Tested-by: NMatt Conner <matt.conner@keepertech.com>
Reviewed-by: NSage Weil <sage@redhat.com>

67645d76

ceph: ceph_frag_contains_value can be boolean · 79a3ed2e

由 Yaowei Bai 提交于 11月 17, 2015

This patch makes ceph_frag_contains_value return bool to improve
readability due to this particular function only using either one or
zero as its return value.

No functional change.
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: NYan, Zheng <zyan@redhat.com>

79a3ed2e

ceph: remove unused functions in ceph_frag.h · eade1fe7

由 Yaowei Bai 提交于 11月 17, 2015

These functions were introduced in commit 3d14c5d2 ("ceph: factor
out libceph from Ceph file system"). Howover, there's no user of
these functions since then, so remove them for simplicity.
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: NYan, Zheng <zyan@redhat.com>

eade1fe7

perf: Collapse and fix event_function_call() users · fae3fde6

由 Peter Zijlstra 提交于 1月 11, 2016

There is one common bug left in all the event_function_call() users,
between loading ctx->task and getting to the remote_function(),
ctx->task can already have been changed.

Therefore we need to double check and retry if ctx->task != current.

Insert another trampoline specific to event_function_call() that
checks for this and further validates state. This also allows getting
rid of the active/inactive functions.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

fae3fde6

{IB, net}/mlx5: Move the modify QP operation table to mlx5_ib · 427c1e7b

由 majd@mellanox.com 提交于 1月 14, 2016

When modifying a QP, the desired operation was determined in
the mlx5_core using a transition table that takes the current
state, the final state, and returns the desired operation.

Since this logic will be used for Raw Packet QP, move the
operation table to the mlx5_ib.
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

427c1e7b

IB/mlx5: Support setting Ethernet priority for Raw Packet QPs · 75850d0b

由 majd@mellanox.com 提交于 1月 14, 2016

When the user changes the Address Vector(AV) in the modify QP, he
provides an SL. This SL should be translated to Ethernet Priority
by taking the 3 LSB bits, and modify the QP's TIS according to this
Ethernet priority.
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

75850d0b

IB/mlx5: Add Raw Packet QP query functionality · 6d2f89df

由 majd@mellanox.com 提交于 1月 14, 2016

Since Raw Packet QP is composed of RQ and SQ, the IB QP's
state is derived from the sub-objects. Therefore we need
to query each one of the sub-objects, and decide on the
IB QP's state.
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

6d2f89df

net/mlx5_core: Add RQ and SQ event handling · e2013b21

由 majd@mellanox.com 提交于 1月 14, 2016

RQ/SQ will be used to implement IB verbs QPs, so the IB QP affiliated
events are affiliated also with SQs and RQs.

Since SQ, RQ and QP resource numbers do not share the same name
space, a queue type field was added to the event data to specify
the SW object that the event is affiliated with.
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

e2013b21

net/mlx5_core: Export transport objects · 8d7f9ecb

由 majd@mellanox.com 提交于 1月 14, 2016

To be used by mlx5_ib in the following patches for implementing
RAW PACKET QP.

Add mlx5_core_ prefix to alloc and delloc transport_domain since
they are exposed now.
Signed-off-by: NMajd Dibbiny <majd@mellanox.com>
Reviewed-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

8d7f9ecb

21 1月, 2016 1 次提交

mm: memcontrol: add "sock" to cgroup2 memory.stat · b2807f07

由 Johannes Weiner 提交于 1月 20, 2016

Provide statistics on how much of a cgroup's memory footprint is made up
of socket buffers from network connections owned by the group.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b2807f07

OpenHarmony / kernel_linux 上一次同步 4 年多

OpenHarmony / kernel_linux
上一次同步 4 年多