Commits · fd26830423e5f7442001f090cd4a53f4b6c3d9fa · openeuler / Kernel

24 Feb, 2020 18 commits

efi/x86: Drop 'systab' member from struct efi · fd268304

Authored by Ard Biesheuvel on Jan 21, 2020

The systab member in struct efi has outlived its usefulness, now that
we have better ways to access the only piece of information we are
interested in after init, which is the EFI runtime services table
address. So instead of instantiating a doctored copy at early boot
with lots of mangled values, and switching the pointer when switching
into virtual mode, let's grab the values we need directly, and get
rid of the systab pointer entirely.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

fd268304

efi: Add 'runtime' pointer to struct efi · 59f2a619

Authored by Ard Biesheuvel on Jan 21, 2020

Instead of going through the EFI system table each time, just copy the
runtime services table pointer into struct efi directly. This is the
last use of the system table pointer in struct efi, allowing us to
drop it in a future patch, along with a fair amount of quirky handling
of the translated address.

Note that usually, the runtime services pointer changes value during
the call to SetVirtualAddressMap(), so grab the updated value as soon
as that call returns. (Mixed mode uses a 1:1 mapping, and kexec boot
enters with the updated address in the system table, so in those cases,
we don't need to do anything here)

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

59f2a619

efi/x86: Make fw_vendor, config_table and runtime sysfs nodes x86 specific · 9cd437ac

Authored by Ard Biesheuvel on Jan 20, 2020

There is some code that exposes physical addresses of certain parts of
the EFI firmware implementation via sysfs nodes. These nodes are only
used on x86, and are of dubious value to begin with, so let's move
their handling into the x86 arch code.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

9cd437ac

efi: Clean up config_parse_tables() · 06c0bd93

Authored by Ard Biesheuvel on Jan 22, 2020

config_parse_tables() is a jumble of pointer arithmetic, due to the
fact that on x86, we may be dealing with firmware whose native word
size differs from the kernel's.

This is not a concern on other architectures, and doesn't quite
justify the state of the code, so let's clean it up by adding a
non-x86 code path, constifying statically allocated tables and
replacing preprocessor conditionals with IS_ENABLED() checks.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

06c0bd93

efi: Make efi_config_init() x86 only · 3a0701dc

Authored by Ard Biesheuvel on Jan 20, 2020

The efi_config_init() routine is no longer shared with ia64 so let's
move it into the x86 arch code before making further x86 specific
changes to it.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

3a0701dc

efi: Merge EFI system table revision and vendor checks · 14fb4209

Authored by Ard Biesheuvel on Jan 20, 2020

We have three different versions of the code that checks the EFI system
table revision and copies the firmware vendor string, and they are
mostly equivalent, with the exception of the use of early_memremap_ro
vs. __va() and the lowest major revision to warn about. Let's move this
into common code and factor out the commonalities.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

14fb4209

efi: Make memreserve table handling local to efi.c · b7846e6b

Authored by Ard Biesheuvel on Jan 22, 2020

There is no need for struct efi to carry the address of the memreserve
table and share it with the world. So move it out and make it
__initdata as well.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

b7846e6b

efi: Move mem_attr_table out of struct efi · a17e809e

Authored by Ard Biesheuvel on Jan 22, 2020

The memory attributes table is only used at init time by the core EFI
code, so there is no need to carry its address in struct efi that is
shared with the world. So move it out, and make it __ro_after_init as
well, considering that the value is set during early boot.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

a17e809e

efi: Make rng_seed table handling local to efi.c · 5d288dbd

Authored by Ard Biesheuvel on Jan 22, 2020

Move the rng_seed table address from struct efi into a static global
variable in efi.c, which is the only place we ever refer to it anyway.
This reduces the footprint of struct efi, which is a r/w data structure
that is shared with the world.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

5d288dbd

efi: Move UGA and PROP table handling to x86 code · fd506e0c

Authored by Ard Biesheuvel on Jan 19, 2020

The UGA table is x86 specific (its handling was introduced when the
EFI support code was modified to accommodate IA32), so there is no
need to handle it in generic code.

The EFI properties table is not strictly x86 specific, but it was
deprecated almost immediately after having been introduced, due to
implementation difficulties. Only x86 takes it into account today,
and this is not going to change, so make this table x86 only as well.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

fd506e0c

efi/ia64: Move HCDP and MPS table handling into IA64 arch code · 120540f2

Authored by Ard Biesheuvel on Jan 19, 2020

The HCDP and MPS tables are Itanium specific EFI config tables, so
move their handling to ia64 arch code.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

120540f2

efi: Drop handling of 'boot_info' configuration table · 50d53c58

Authored by Ard Biesheuvel on Jan 19, 2020

Some plumbing exists to handle a UEFI configuration table of type
BOOT_INFO but since we never match it to a GUID anywhere, we never
actually register such a table, or access it, for that matter. So
simply drop all mentions of it.

Tested-by: Tony Luck <tony.luck@intel.com> # arch/ia64
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

50d53c58

efi/libstub: Add support for loading the initrd from a device path · ec93fc37

Authored by Ard Biesheuvel on Feb 03, 2020

There are currently two ways to specify the initrd to be passed to the
Linux kernel when booting via the EFI stub:
- it can be passed as an initrd= command line option when doing a pure PE
  boot (as opposed to the EFI handover protocol that exists for x86)
- otherwise, the bootloader or firmware can load the initrd into memory,
  and pass the address and size via the bootparams struct (x86) or
  device tree (ARM)

In the first case, we are limited to loading from the same file system
that the kernel was loaded from, and it is also problematic in a trusted
boot context, given that we cannot easily protect the command line from
tampering without either adding complicated white/blacklisting of boot
arguments or locking down the command line altogether.

In the second case, we force the bootloader to duplicate knowledge about
the boot protocol which is already encoded in the stub, and which may be
subject to change over time, e.g., bootparams struct definitions, memory
allocation/alignment requirements for the placement of the initrd etc etc.
In the ARM case, it also requires the bootloader to modify the hardware
description provided by the firmware, as it is passed in the same file.
On systems where the initrd is measured after loading, it creates a time
window where the initrd contents might be manipulated in memory before
handing over to the kernel.

Address these concerns by adding support for loading the initrd into
memory by invoking the EFI LoadFile2 protocol installed on a vendor
GUIDed device path that specifically designates a Linux initrd.
This addresses the above concerns, by putting the EFI stub in charge of
placement in memory and of passing the base and size to the kernel proper
(via whatever means it desires) while still leaving it up to the firmware
or bootloader to obtain the file contents, potentially from other file
systems than the one the kernel itself was loaded from. On platforms that
implement measured boot, it permits the firmware to take the measurement
right before the kernel actually consumes the contents.

Acked-by: Laszlo Ersek <lersek@redhat.com>
Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

ec93fc37

efi/dev-path-parser: Add struct definition for vendor type device path nodes · db8952e7

Authored by Ard Biesheuvel on Feb 10, 2020

In preparation of adding support for loading the initrd via a special
device path, add the struct definition of a vendor GUIDed device path
node to efi.h.

Since we will be producing these data structures rather than just
consuming the ones instantiated by the firmware, refactor the various
device path node definitions so we can take the size of each node using
sizeof() rather than having to resort to opaque arithmetic in the static
initializers.

While at it, drop the #if IS_ENABLED() check for the declaration of
efi_get_device_by_path(), which is unnecessary, and constify its first
argument as well.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

db8952e7

efi/libstub: Make the LoadFile EFI protocol accessible · 2931d526

Authored by Ard Biesheuvel on Feb 10, 2020

Add the protocol definitions, GUIDs and mixed mode glue so that
the EFI loadfile protocol can be used from the stub. This will
be used in a future patch to load the initrd.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

2931d526

efi/libstub: Move stub specific declarations into efistub.h · 8166ec09

Authored by Ard Biesheuvel on Feb 10, 2020

Move all the declarations that are only used in stub code from
linux/efi.h to efistub.h which is only included locally.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

8166ec09

efi/libstub: Use consistent type names for file I/O protocols · a46a290a

Authored by Ard Biesheuvel on Feb 10, 2020

Align the naming of efi_file_io_interface_t and efi_file_handle_t with
the UEFI spec, and call them efi_simple_file_system_protocol_t and
efi_file_protocol_t, respectively, using the same convention we use
for all other type definitions that originate in the UEFI spec.

While at it, move the definitions to efistub.h, so they are only seen
by code that needs them.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

a46a290a

efi/libstub: Simplify efi_high_alloc() and rename to efi_allocate_pages() · a7495c28

Authored by Ard Biesheuvel on Feb 10, 2020

The implementation of efi_high_alloc() uses a complicated way of
traversing the memory map to find an available region that is located
as close as possible to the provided upper limit, and calls AllocatePages
subsequently to create the allocation at that exact address.

This is precisely what the EFI_ALLOCATE_MAX_ADDRESS allocation type
argument to AllocatePages() does, and considering that EFI_ALLOC_ALIGN
only exceeds EFI_PAGE_SIZE on arm64, let's use AllocatePages() directly
and implement the alignment using code that the compiler can remove if
it does not exceed EFI_PAGE_SIZE.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

a7495c28
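
The alignment handling mentioned in the commit above (a7495c28) can be illustrated with a small user-space sketch. This is not the stub's code: it assumes the common approach of requesting a few extra pages of slack and rounding the returned address up, and the names `SKETCH_PAGE_SIZE`, `sketch_align_up()` and `sketch_slack_pages()` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical page size for the sketch (4 KiB). */
#define SKETCH_PAGE_SIZE 4096u

/* Round addr up to the next multiple of align; align must be a power of two. */
uint64_t sketch_align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Extra pages to request so that an aligned block of the wanted size
 * always fits inside the allocation. This is 0 when the alignment
 * equals the page size, so the whole adjustment becomes a no-op.
 */
uint64_t sketch_slack_pages(uint64_t align)
{
    return align / SKETCH_PAGE_SIZE - 1;
}
```

When the alignment equals the page size, both helpers reduce to constants that a compiler can fold away, which matches the intent stated in the commit message.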

09 Feb, 2020 1 commit

pipe: use exclusive waits when reading or writing · 0ddad21d

Authored by Linus Torvalds on Dec 09, 2019

This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).

While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.

In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.

This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
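
The token pattern described here can be sketched as two small helpers. This is only an illustration of the idea, not GNU make's code; the function names are hypothetical, and each token is assumed to be a single byte.

```c
#include <unistd.h>

/*
 * Take one token from the pipe. Blocks until a byte is available.
 * Returns 0 on success and -1 on error or end of file.
 */
int sketch_acquire_token(int read_fd, char *token)
{
    return read(read_fd, token, 1) == 1 ? 0 : -1;
}

/* Give a token back by writing the byte to the pipe. */
int sketch_release_token(int write_fd, char token)
{
    return write(write_fd, &token, 1) == 1 ? 0 : -1;
}
```

With many jobs blocked in `sketch_acquire_token()`, a single release only needs to wake one of them, which is the case exclusive waits are meant to serve.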

A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd[2], counters[2];

        pipe(fd);
        counters[0] = 0;
        counters[1] = -1;
        write(fd[1], counters, sizeof(counters));

        /* 64 processes */
        fork(); fork(); fork(); fork(); fork(); fork();

        do {
                int i;
                read(fd[0], &i, sizeof(i));
                if (i < 0)
                        continue;
                counters[0] = i+1;
                write(fd[1], counters, (1+(i & 1)) *sizeof(int));
        } while (counters[0] < 1000000);
        return 0;
    }

and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.

But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows

        231,852.37 msec task-clock                #   15.857 CPUs utilized
        11,250,961      context-switches          #    0.049 M/sec
           616,304      cpu-migrations            #    0.003 M/sec
             1,648      page-faults               #    0.007 K/sec
 1,097,903,998,514      cycles                    #    4.735 GHz
   120,781,778,352      instructions              #    0.11  insn per cycle
    27,997,056,043      branches                  #  120.754 M/sec
       283,581,233      branch-misses             #    1.01% of all branches

      14.621273891 seconds time elapsed

       0.018243000 seconds user
       3.611468000 seconds sys

before this commit.

After this commit, I get

          5,229.55 msec task-clock                #    3.072 CPUs utilized
         1,212,233      context-switches          #    0.232 M/sec
           103,951      cpu-migrations            #    0.020 M/sec
             1,328      page-faults               #    0.254 K/sec
    21,307,456,166      cycles                    #    4.074 GHz
    12,947,819,999      instructions              #    0.61  insn per cycle
     2,881,985,678      branches                  #  551.096 M/sec
        64,267,015      branch-misses             #    2.23% of all branches

       1.702148350 seconds time elapsed

       0.004868000 seconds user
       0.110786000 seconds sys

instead. Much better.

[ Note! This kernel improvement seems to be very good at triggering a
  race condition in the make jobserver (in GNU make 4.2.1) for me. It's
  a long known bug that was fixed back in June 2017 by GNU make commit
  b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
  avoid hangs.").

  But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
  so a number of distributions may still have the buggy version. Some
  have backported the fix to their 4.2.1 release, though, and even
  without the fix it's quite timing-dependent whether the bug actually
  is hit. ]

Josh Triplett says:
 "I've been hammering on your pipe fix patch (switching to exclusive
  wait queues) for a month or so, on several different systems, and I've
  run into no issues with it. The patch *substantially* improves
  parallel build times on large (~100 CPU) systems, both with parallel
  make and with other things that use make's pipe-based jobserver.

  All current distributions (including stable and long-term stable
  distributions) have versions of GNU make that no longer have the
  jobserver bug"

Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ddad21d

08 Feb, 2020 13 commits

irqchip/gic-v4.1: Set vpe_l1_base for all redistributors · 8b718d40

Authored by Zenghui Yu on Feb 06, 2020

Currently, we will not set vpe_l1_page for the current RD if we can
inherit the vPE configuration table from another RD (or ITS), which
results in an inconsistency between RDs within the same CommonLPIAff
group.

Let's rename it to vpe_l1_base to indicate the base address of the
vPE configuration table of this RD, and set it properly for *all*
v4.1 redistributors.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200206075711.1275-3-yuzenghui@huawei.com

8b718d40

prefix-handling analogues of errorf() and friends · a3ff937b

Authored by Al Viro on Dec 21, 2019

called errorfc/infofc/warnfc/invalfc

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a3ff937b

turn fs_param_is_... into functions · 328de528

Authored by Al Viro on Dec 18, 2019

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

328de528

fs_parse: handle optional arguments sanely · 48ce73b1

Authored by Al Viro on Dec 17, 2019

Don't bother with "mixed" options that would allow both the
form with and without argument (i.e. both -o foo and -o foo=bar).
Rather than trying to shove both into a single fs_parameter_spec,
allow having with-argument and no-argument specs with the same
name and teach fs_parse to handle that.

There are very few options of that sort, and they are actually
easier to handle that way - callers end up with less postprocessing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

48ce73b1

fs_parse: fold fs_parameter_desc/fs_parameter_spec · d7167b14

Authored by Al Viro on Sep 07, 2019

The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d7167b14

fs_parser: remove fs_parameter_description name field · 96cafb9c

Authored by Eric Sandeen on Dec 06, 2019

Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

96cafb9c

add prefix to fs_context->log · cc3c0b53

Authored by Al Viro on Dec 21, 2019

... turning it into struct p_log embedded into fs_context.  Initialize
the prefix with fs_type->name, turning fs_parse() into a trivial
inline wrapper for __fs_parse().

This makes fs_parameter_description->name completely unused.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cc3c0b53

ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log · c80c98f0

Authored by Al Viro on Dec 21, 2019

... and now errorf() et al. are never called with NULL fs_context,
so we can get rid of the conditional in those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c80c98f0

new primitive: __fs_parse() · 7f5d3814

Authored by Al Viro on Dec 20, 2019

fs_parse() analogue taking p_log instead of fs_context.
fs_parse() turned into a wrapper, callers in ceph_common and rbd
switched to __fs_parse().

As a result, fs_parse() never gets a NULL fs_context and neither
do the fs_context-based logging primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7f5d3814

struct p_log, variants of warnf() et.al. taking that one instead · 3fbb8d55

Authored by Al Viro on Dec 20, 2019

primitives for prefixed logging

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3fbb8d55

teach logfc() to handle prefices, give it saner calling conventions · 9f09f649

Authored by Al Viro on Dec 20, 2019

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9f09f649

get rid of fs_value_is_filename_empty · aa1918f9

Authored by Al Viro on Dec 17, 2019

Its behaviour is identical to that of fs_value_is_filename.
It makes no sense, anyway - LOOKUP_EMPTY affects nothing
whatsoever once the pathname has been imported from userland.
And both fs_value_is_filename and fs_value_is_filename_empty
carry an already imported pathname.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

aa1918f9

don't bother with explicit length argument for __lookup_constant() · 34264ae3

Authored by Al Viro on Dec 16, 2019

Have the arrays of constant_table self-terminated (by NULL ->name
in the final entry).  Simplifies lookup_constant() and allows
reusing the search for enum params as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

34264ae3
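
The self-terminated table idea from the commit above (34264ae3) can be sketched in plain C. The struct and function names below are hypothetical stand-ins, not the kernel's definitions; the point is only that a NULL name in the last entry removes the need for a length argument.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name/value pair; a NULL name marks the end of the table. */
struct sketch_constant {
    const char *name;
    int value;
};

/* Walk the table until the terminating entry; return the match or NULL. */
const struct sketch_constant *
sketch_find(const struct sketch_constant *tbl, const char *name)
{
    for (; tbl->name; tbl++)
        if (strcmp(name, tbl->name) == 0)
            return tbl;
    return NULL;
}

/* Convenience wrapper that falls back to not_found when nothing matches. */
int sketch_lookup(const struct sketch_constant *tbl, const char *name,
                  int not_found)
{
    const struct sketch_constant *p = sketch_find(tbl, name);

    return p ? p->value : not_found;
}
```

Because the search returns the matching entry rather than just a value, the same walk can serve other callers that need the entry itself.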

07 Feb, 2020 5 commits

fs: New zonefs file system · 8dcc1a9d

Authored by Damien Le Moal on Dec 25, 2019

zonefs is a very simple file system exposing each zone of a zoned block
device as a file. Unlike a regular file system with zoned block device
support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing
sequential write zones of the device must be written sequentially
starting from the end of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access
interface than to a full featured POSIX file system. The goal of zonefs
is to simplify the implementation of zoned block device support in
applications by replacing raw block device file accesses with a richer
file API, avoiding relying on direct block device file ioctls which may
be more obscure to developers. One example of this approach is the
implementation of LSM (log-structured merge) tree structures (such as
used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
to be stored in a zone file similarly to a regular file system rather
than as a range of sectors of a zoned device. The introduction of the
higher level construct "one file is one zone" can help reduce the
amount of changes needed in the application as well as introducing
support for different application programming languages.

Zonefs on-disk metadata is reduced to an immutable super block to
persistently store a magic number and optional feature flags and
values. On mount, zonefs uses blkdev_report_zones() to obtain the device
zone configuration and populates the mount point with a static file tree
solely based on this information. E.g. file sizes come from the device
zone type and write pointer offset managed by the device itself.

The zone files created on mount have the following characteristics.
1) Files representing zones of the same type are grouped together
   under a common sub-directory:
     * For conventional zones, the sub-directory "cnv" is used.
     * For sequential write zones, the sub-directory "seq" is used.
  These two directories are the only directories that exist in zonefs.
  Users cannot create other directories and cannot rename nor delete
  the "cnv" and "seq" sub-directories.
2) The name of zone files is the number of the file within the zone
   type sub-directory, in order of increasing zone start sector.
3) The size of conventional zone files is fixed to the device zone size.
   Conventional zone files cannot be truncated.
4) The size of sequential zone files represents the file's zone write
   pointer position relative to the zone start sector. Truncating these
   files is allowed only down to 0, in which case, the zone is reset to
   rewind the zone write pointer position to the start of the zone, or
   up to the zone size, in which case the file's zone is transitioned
   to the FULL state (finish zone operation).
5) All read and write operations to files are not allowed beyond the
   file zone size. Any access exceeding the zone size is failed with
   the -EFBIG error.
6) Creating, deleting, renaming or modifying any attribute of files and
   sub-directories is not allowed.
7) There are no restrictions on the type of read and write operations
   that can be issued to conventional zone files. Buffered, direct and
   mmap read & write operations are accepted. For sequential zone files,
   there are no restrictions on read operations, but all write
   operations must be direct IO append writes. mmap write of sequential
   files is not allowed.
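
Rules 4 and 5 above can be expressed as two small checks. The following is a user-space sketch under stated assumptions, not zonefs code: the enum, the function names and the choice to return a decision value are all hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical outcome of a truncate request on a sequential zone file. */
enum sketch_trunc_action {
    SKETCH_TRUNC_REJECT,      /* any size other than 0 or the zone size */
    SKETCH_TRUNC_RESET_ZONE,  /* size 0: rewind the write pointer */
    SKETCH_TRUNC_FINISH_ZONE  /* zone size: transition the zone to FULL */
};

/* Rule 4: only truncation to 0 or to the zone size is allowed. */
enum sketch_trunc_action
sketch_seq_truncate(uint64_t new_size, uint64_t zone_size)
{
    if (new_size == 0)
        return SKETCH_TRUNC_RESET_ZONE;
    if (new_size == zone_size)
        return SKETCH_TRUNC_FINISH_ZONE;
    return SKETCH_TRUNC_REJECT;
}

/* Rule 5: an access reaching beyond the zone size fails with -EFBIG. */
int sketch_check_range(uint64_t offset, uint64_t len, uint64_t zone_size)
{
    if (offset > zone_size || len > zone_size - offset)
        return -EFBIG;
    return 0;
}
```

The range check compares `len` against the remaining space instead of computing `offset + len`, which avoids overflow for very large arguments.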

Several optional features of zonefs can be enabled at format time.
* Conventional zone aggregation: ranges of contiguous conventional
  zones can be aggregated into a single larger file instead of the
  default one file per zone.
* File ownership: The owner UID and GID of zone files is by default 0
  (root) but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be
  changed.

The mkzonefs tool is used to format zoned block devices for use with
zonefs. This tool is available on Github at:

git@github.com:damien-lemoal/zonefs-tools.git.

zonefs-tools also includes a test suite which can be run against any
zoned block device, including null_blk block device created with zoned
mode.

Example: the following formats a 15TB host-managed SMR HDD with 256 MB
zones with the conventional zones aggregation feature enabled.

$ sudo mkzonefs -o aggr_cnv /dev/sdX
$ sudo mount -t zonefs /dev/sdX /mnt
$ ls -l /mnt/
total 0
dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for that zone type. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a
single file).

$ ls -l /mnt/cnv
total 137101312
-rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file.

$ sudo mkfs.ext4 /mnt/cnv/0
$ sudo mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has
in this example 55356 zones.

$ ls -lv /mnt/seq
total 14511243264
-rw-r----- 1 root root 0 Nov 25 13:23 0
-rw-r----- 1 root root 0 Nov 25 13:23 1
-rw-r----- 1 root root 0 Nov 25 13:23 2
...
-rw-r----- 1 root root 0 Nov 25 13:23 55354
-rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is
appended at the end of the file, similarly to any regular file system.

$ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s

$ ls -l /mnt/seq/0
-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any
further write operation.

$ truncate -s 268435456 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and
restarting append-writes to the file.

$ truncate -s 0 /mnt/seq/0
$ ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of
blocks of a file as reported by stat() and fstat() indicates the size
of the file zone.

$ stat /mnt/seq/0
  File: /mnt/seq/0
  Size: 0       Blocks: 524288     IO Block: 4096   regular empty file
Device: 870h/2160d      Inode: 50431       Links: 1
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/  root)
Access: 2019-11-25 13:23:57.048971997 +0900
Modify: 2019-11-25 13:52:25.553805765 +0900
Change: 2019-11-25 13:52:25.553805765 +0900
 Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks
gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone size in this example. Of note is that the "IO block"
field always indicates the minimum IO size for writes and corresponds
to the device physical sector size.
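
The arithmetic above can be checked with a few lines of C. This is a sketch only; the helper names are hypothetical, and it relies on the 512-byte unit for the block count that the text describes.

```c
#include <stdint.h>
#include <sys/stat.h>

/* Convert a block count in 512-byte units into bytes. */
uint64_t sketch_blocks_to_bytes(uint64_t blocks)
{
    return blocks * 512u;
}

/*
 * Read the block count of a zone file with stat() and store the
 * corresponding maximum file size. Returns 0 on success, -1 on failure.
 */
int sketch_zone_capacity(const char *path, uint64_t *bytes)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *bytes = sketch_blocks_to_bytes((uint64_t)st.st_blocks);
    return 0;
}
```

For the 524288 blocks reported in the example, `sketch_blocks_to_bytes()` gives 268435456 bytes, which is the 256 MB zone size.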

This code contains contributions from:
* Johannes Thumshirn <jthumshirn@suse.de>,
* Darrick J. Wong <darrick.wong@oracle.com>,
* Christoph Hellwig <hch@lst.de>,
* Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> and
* Ting Yao <tingyao@hust.edu.cn>.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

8dcc1a9d

fold struct fs_parameter_enum into struct constant_table · 5eede625

Authored by Al Viro on Dec 16, 2019

no real difference now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5eede625

fs_parse: get rid of ->enums · 2710c957

Authored by Al Viro on Sep 06, 2019

Don't do a single array; attach them to fsparam_enum() entry
instead.  And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2710c957

Pass consistent param->type to fs_parse() · 0f89589a

Authored by Al Viro on Dec 17, 2019

As it is, vfs_parse_fs_string() makes "foo" and "foo=" indistinguishable;
both get fs_value_is_string for ->type and NULL for ->string.  To make
it even more unpleasant, that combination is impossible to produce with
fsconfig().

Much saner rules would be
        "foo"           => fs_value_is_flag, NULL
        "foo="          => fs_value_is_string, ""
        "foo=bar"       => fs_value_is_string, "bar"
All cases are distinguishable, all results are expressible by fsconfig(),
->has_value checks are much simpler that way (to the point of the field
being useless) and quite a few regressions go away (gfs2 has no business
accepting -o nodebug=, for example).
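
The three rules can be sketched as a small classifier. This is an illustration only and not the VFS code: the enum, the struct and `sketch_classify()` are hypothetical names, and the value pointer simply refers into the input string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counterparts of the flag and string value types. */
enum sketch_value_type {
    SKETCH_VALUE_IS_FLAG,
    SKETCH_VALUE_IS_STRING
};

struct sketch_param {
    enum sketch_value_type type;
    const char *string;  /* NULL for a flag, otherwise the text after '=' */
    size_t key_len;      /* length of the option name */
};

/* Split "key", "key=" or "key=value" according to the rules listed above. */
struct sketch_param sketch_classify(const char *opt)
{
    struct sketch_param p;
    const char *eq = strchr(opt, '=');

    if (!eq) {
        p.type = SKETCH_VALUE_IS_FLAG;
        p.string = NULL;
        p.key_len = strlen(opt);
    } else {
        p.type = SKETCH_VALUE_IS_STRING;
        p.string = eq + 1;   /* empty string for "key=" */
        p.key_len = (size_t)(eq - opt);
    }
    return p;
}
```

With this split, "foo" and "foo=" no longer collapse into the same result, so a parser can reject an empty argument for an option that takes none.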

Partially based upon patches from Miklos.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0f89589a

net/mlx5: Deprecate usage of generic TLS HW capability bit · 61c00cca

Authored by Tariq Toukan on Jan 27, 2020

Deprecate the generic TLS cap bit, use the new TX-specific
TLS cap bit instead.

Fixes: a12ff35e ("net/mlx5: Introduce TLS TX offload hardware bits and structures")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

61c00cca

06 Feb, 2020 2 commits

skbuff: fix a data race in skb_queue_len() · 86b18aaa

Authored by Qian Cai on Feb 04, 2020

sk_buff.qlen can be accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

 read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
  unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
				 net/unix/af_unix.c:1761
  ____sys_sendmsg+0x33e/0x370
  ___sys_sendmsg+0xa6/0xf0
  __sys_sendmsg+0x69/0xf0
  __x64_sys_sendmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
  __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
  __skb_try_recv_datagram+0xbe/0x220
  unix_dgram_recvmsg+0xee/0x850
  ____sys_recvmsg+0x1fb/0x210
  ___sys_recvmsg+0xa2/0xf0
  __sys_recvmsg+0x66/0xf0
  __x64_sys_recvmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is operating as lockless, it could introduce a logic
bug in unix_recvq_full() due to the load tearing. Fix it by adding
a lockless variant of skb_queue_len() and unix_recvq_full() where
READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to
the commit d7d16a89 ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David S. Miller <davem@davemloft.net>

86b18aaa
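
The pairing of a marked read with a marked write described in the commit above (86b18aaa) can be approximated in user space. This is only a sketch: the volatile accesses stand in for the kernel's READ_ONCE() and WRITE_ONCE(), and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical queue head holding only the length counter. */
struct sketch_queue {
    uint32_t qlen;
};

/*
 * Lockless read: the volatile access asks the compiler to perform the
 * load exactly once and as a single access of the field's width.
 */
uint32_t sketch_queue_len_lockless(const struct sketch_queue *q)
{
    return *(const volatile uint32_t *)&q->qlen;
}

/*
 * Update done by the side that holds the queue lock: the store is
 * marked the same way so that it pairs with the lockless read.
 */
void sketch_queue_set_len(struct sketch_queue *q, uint32_t n)
{
    *(volatile uint32_t *)&q->qlen = n;
}
```

The marked accesses do not add ordering or mutual exclusion. They only keep the compiler from tearing, fusing or repeating the access, which is what the lockless reader needs here.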

of: clk: Make <linux/of_clk.h> self-contained · 5df86714

Authored by Geert Uytterhoeven on Feb 05, 2020

Depending on include order:

    include/linux/of_clk.h:11:45: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
     unsigned int of_clk_get_parent_count(struct device_node *np);
						 ^~~~~~~~~~~
    include/linux/of_clk.h:12:43: warning: ‘struct device_node’ declared inside parameter list will not be visible outside of this definition or declaration
     const char *of_clk_get_parent_name(struct device_node *np, int index);
					       ^~~~~~~~~~~
    include/linux/of_clk.h:13:31: warning: ‘struct of_device_id’ declared inside parameter list will not be visible outside of this definition or declaration
     void of_clk_init(const struct of_device_id *matches);
				   ^~~~~~~~~~~~

Fix this by adding forward declarations for struct device_node and
struct of_device_id.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lkml.kernel.org/r/20200205194649.31309-1-geert+renesas@glider.be
Signed-off-by: Stephen Boyd <sboyd@kernel.org>

5df86714
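
The effect of a forward declaration, as used by the commit above (5df86714), can be shown with a stand-alone sketch. The names are hypothetical; the first two declarations play the role of the header, and the definitions that follow play the role of code elsewhere.

```c
/* "Header" part: a forward declaration is enough to declare pointers. */
struct sketch_node;

unsigned int sketch_parent_count(const struct sketch_node *np);

/* "Implementation" part: the full type is only needed where it is used. */
struct sketch_node {
    unsigned int nparents;
};

unsigned int sketch_parent_count(const struct sketch_node *np)
{
    return np->nparents;
}
```

Without the forward declaration, the struct named in the prototype would be scoped to the parameter list alone, which is exactly what the compiler warnings quoted in the commit message complain about.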

05 Feb, 2020 1 commit

bonding/alb: properly access headers in bond_alb_xmit() · 38f88c45

Authored by Eric Dumazet on Feb 04, 2020

syzbot managed to send an IPX packet through bond_alb_xmit()
and af_packet and triggered a use-after-free.

First, bond_alb_xmit() was using ipx_hdr() helper to reach
the IPX header, but ipx_hdr() was using the transport offset
instead of the network offset. In the particular syzbot
report the transport offset was 0xFFFF.

This patch removes ipx_hdr() since it was only (mis)used from bonding.

Then we need to make sure IPv4/IPv6/IPX headers are pulled
in skb->head before dereferencing anything.
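
The principle of checking that a header is fully available before reading it can be sketched outside the kernel. The code below is a plain user-space analogue, not the bonding fix itself; the packet struct and `sketch_read_network_header()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat packet buffer with a recorded network header offset. */
struct sketch_pkt {
    const uint8_t *data;
    size_t len;
    size_t network_offset;
};

/*
 * Copy hdr_len bytes of the network header into hdr, but only if the
 * whole header lies inside the buffer. Returns 0 on success, -1 otherwise.
 */
int sketch_read_network_header(const struct sketch_pkt *p, void *hdr,
                               size_t hdr_len)
{
    if (p->network_offset > p->len ||
        hdr_len > p->len - p->network_offset)
        return -1;
    memcpy(hdr, p->data + p->network_offset, hdr_len);
    return 0;
}
```

A bogus offset such as 0xFFFF on a short packet is rejected by the first comparison, so nothing outside the buffer is ever read.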

BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
 (if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 [<ffffffff8441fc42>] __dump_stack lib/dump_stack.c:17 [inline]
 [<ffffffff8441fc42>] dump_stack+0x14d/0x20b lib/dump_stack.c:53
 [<ffffffff81a7dec4>] print_address_description+0x6f/0x20b mm/kasan/report.c:282
 [<ffffffff81a7e0ec>] kasan_report_error mm/kasan/report.c:380 [inline]
 [<ffffffff81a7e0ec>] kasan_report mm/kasan/report.c:438 [inline]
 [<ffffffff81a7e0ec>] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
 [<ffffffff81a7dc4f>] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
 [<ffffffff82c8c00a>] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
 [<ffffffff82c60c74>] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
 [<ffffffff82c60c74>] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
 [<ffffffff83baa558>] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
 [<ffffffff83baa558>] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
 [<ffffffff83baa558>] xmit_one net/core/dev.c:3611 [inline]
 [<ffffffff83baa558>] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
 [<ffffffff83bacf35>] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
 [<ffffffff83bae3a8>] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
 [<ffffffff84339189>] packet_snd net/packet/af_packet.c:3226 [inline]
 [<ffffffff84339189>] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
 [<ffffffff83b1ac0c>] sock_sendmsg_nosec net/socket.c:673 [inline]
 [<ffffffff83b1ac0c>] sock_sendmsg+0x12c/0x160 net/socket.c:684
 [<ffffffff83b1f5a2>] __sys_sendto+0x262/0x380 net/socket.c:1996
 [<ffffffff83b1f700>] SYSC_sendto net/socket.c:2008 [inline]
 [<ffffffff83b1f700>] SyS_sendto+0x40/0x60 net/socket.c:2004

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

38f88c45

openeuler / Kernel synced successfully more than 1 year ago