Commit df2071bd authored by Herbert Xu

Too many changes to display.

To preserve performance, only 1,000 of the 1,000+ files are displayed.
...@@ -34,13 +34,18 @@ modules.builtin
#
# Top-level generic files
#
tags
TAGS
vmlinux
vmlinuz
System.map
Module.markers
Module.symvers
/tags
/TAGS
/linux
/vmlinux
/vmlinuz
/System.map
/Module.markers
/Module.symvers
#
# git files that we don't want to ignore even it they are dot-files
#
!.gitignore
!.mailmap
......
What: /sys/devices/system/node/nodeX
Date: October 2002
Contact: Linux Memory Management list <linux-mm@kvack.org>
Description:
When CONFIG_NUMA is enabled, this is a directory containing
information on node X such as what CPUs are local to the
node.
...@@ -160,7 +160,7 @@ Description:
		match the driver to the device.  For example:
		# echo "046d c315" > /sys/bus/usb/drivers/foo/remove_id

What:		/sys/bus/usb/device/.../avoid_reset
What:		/sys/bus/usb/device/.../avoid_reset_quirk
Date:		December 2009
Contact:	Oliver Neukum <oliver@neukum.org>
Description:
......
This diff is collapsed.
...@@ -4,20 +4,18 @@
James E.J. Bottomley <James.Bottomley@HansenPartnership.com>

This document describes the DMA API.  For a more gentle introduction
phrased in terms of the pci_ equivalents (and actual examples) see
Documentation/PCI/PCI-DMA-mapping.txt.

This document describes the DMA API.  For a more gentle introduction
of the API (and actual examples) see
Documentation/DMA-API-HOWTO.txt.

This API is split into two pieces. Part I describes the API and the
corresponding pci_ API.  Part II describes the extensions to the API
for supporting non-consistent memory machines.  Unless you know that
your driver absolutely has to support non-consistent platforms (this
is usually only legacy platforms) you should only use the API
described in part I.

This API is split into two pieces.  Part I describes the API.  Part II
describes the extensions to the API for supporting non-consistent
memory machines.  Unless you know that your driver absolutely has to
support non-consistent platforms (this is usually only legacy
platforms) you should only use the API described in part I.

Part I - pci_ and dma_ Equivalent API
Part I - dma_ API
-------------------------------------

To get the pci_ API, you must #include <linux/pci.h>
To get the dma_ API, you must #include <linux/dma-mapping.h>
...@@ -27,9 +25,6 @@ Part Ia - Using large dma-coherent buffers

void *
dma_alloc_coherent(struct device *dev, size_t size,
			dma_addr_t *dma_handle, gfp_t flag)
void *
pci_alloc_consistent(struct pci_dev *dev, size_t size,
dma_addr_t *dma_handle)
Consistent memory is memory for which a write by either the device or
the processor can immediately be read by the processor or device

...@@ -53,15 +48,11 @@ The simplest way to do that is to use the dma_pool calls (see below).

The flag parameter (dma_alloc_coherent only) allows the caller to
specify the GFP_ flags (see kmalloc) for the allocation (the
implementation may choose to ignore flags that affect the location of
the returned memory, like GFP_DMA).  For pci_alloc_consistent, you
must assume GFP_ATOMIC behaviour.
the returned memory, like GFP_DMA).

void
dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
			dma_addr_t dma_handle)
void
pci_free_consistent(struct pci_dev *dev, size_t size, void *cpu_addr,
dma_addr_t dma_handle)
Free the region of consistent memory you previously allocated.  dev,
size and dma_handle must all be the same as those passed into the
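As an illustration only (not part of this patch), a driver might pair these two calls roughly as follows; the function names, buffer size and error handling are invented for the sketch:

	/* Minimal sketch, assuming a sleepable context and a real struct device. */
	#include <linux/dma-mapping.h>

	#define FOO_BUF_SIZE	4096		/* hypothetical buffer size */

	static int foo_setup_buffer(struct device *dev, void **cpu_addr,
				    dma_addr_t *dma_handle)
	{
		*cpu_addr = dma_alloc_coherent(dev, FOO_BUF_SIZE, dma_handle,
					       GFP_KERNEL);
		if (!*cpu_addr)
			return -ENOMEM;
		return 0;
	}

	static void foo_release_buffer(struct device *dev, void *cpu_addr,
				       dma_addr_t dma_handle)
	{
		/* dev, size and dma_handle must match the allocation. */
		dma_free_coherent(dev, FOO_BUF_SIZE, cpu_addr, dma_handle);
	}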
...@@ -89,10 +80,6 @@ for alignment, like queue heads needing to be aligned on N-byte boundaries.

	dma_pool_create(const char *name, struct device *dev,
			size_t size, size_t align, size_t alloc);
struct pci_pool *
pci_pool_create(const char *name, struct pci_device *dev,
size_t size, size_t align, size_t alloc);
The pool create() routines initialize a pool of dma-coherent buffers
for use with a given device.  It must be called in a context which
can sleep.

...@@ -108,9 +95,6 @@ from this pool must not cross 4KByte boundaries.

	void *dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags,
			dma_addr_t *dma_handle);
void *pci_pool_alloc(struct pci_pool *pool, gfp_t gfp_flags,
dma_addr_t *dma_handle);
This allocates memory from the pool; the returned memory will meet the size
and alignment requirements specified at creation time.  Pass GFP_ATOMIC to
prevent blocking, or if it's permitted (not in_interrupt, not holding SMP locks),

...@@ -122,9 +106,6 @@ pool's device.

	void dma_pool_free(struct dma_pool *pool, void *vaddr,
			dma_addr_t addr);
void pci_pool_free(struct pci_pool *pool, void *vaddr,
dma_addr_t addr);
This puts memory back into the pool.  The pool is what was passed to
the pool allocation routine; the cpu (vaddr) and dma addresses are what
were returned when that routine allocated the memory being freed.

...@@ -132,8 +113,6 @@ were returned when that routine allocated the memory being freed.

	void dma_pool_destroy(struct dma_pool *pool);
void pci_pool_destroy(struct pci_pool *pool);
The pool destroy() routines free the resources of the pool.  They must be
called in a context which can sleep.  Make sure you've freed all allocated
memory back to the pool before you destroy it.
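Again purely as an illustration (not from the patch), the create/alloc/free/destroy calls might be combined along these lines; the pool name, object size and alignment are made up:

	/* Minimal sketch of a dma_pool round trip in a sleepable context. */
	#include <linux/dmapool.h>

	static int foo_pool_demo(struct device *dev)
	{
		struct dma_pool *pool;
		void *desc;
		dma_addr_t desc_dma;

		/* 64-byte objects, 64-byte aligned, no boundary restriction. */
		pool = dma_pool_create("foo_desc", dev, 64, 64, 0);
		if (!pool)
			return -ENOMEM;

		desc = dma_pool_alloc(pool, GFP_KERNEL, &desc_dma);
		if (!desc) {
			dma_pool_destroy(pool);
			return -ENOMEM;
		}

		/* ... hand desc_dma to the device, touch desc from the CPU ... */

		dma_pool_free(pool, desc, desc_dma);
		dma_pool_destroy(pool);
		return 0;
	}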
...@@ -144,8 +123,6 @@ Part Ic - DMA addressing limitations

int
dma_supported(struct device *dev, u64 mask)
int
pci_dma_supported(struct pci_dev *hwdev, u64 mask)
Checks to see if the device can support DMA to the memory described by
mask.

...@@ -159,8 +136,14 @@ driver writers.

int
dma_set_mask(struct device *dev, u64 mask)
Checks to see if the mask is possible and updates the device
parameters if it is.
Returns: 0 if successful and a negative error if not.
int
pci_set_dma_mask(struct pci_device *dev, u64 mask)
dma_set_coherent_mask(struct device *dev, u64 mask)

Checks to see if the mask is possible and updates the device
parameters if it is.
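For illustration only, a driver limited to 32-bit DMA might set both masks as sketched below; DMA_BIT_MASK() and the error handling are assumptions of the sketch, not text from the patch:

	/* Minimal sketch: declare 32-bit streaming and coherent DMA ability. */
	#include <linux/dma-mapping.h>

	static int foo_set_dma_masks(struct device *dev)
	{
		if (dma_set_mask(dev, DMA_BIT_MASK(32)) ||
		    dma_set_coherent_mask(dev, DMA_BIT_MASK(32))) {
			dev_warn(dev, "no suitable DMA addressing available\n");
			return -EIO;
		}
		return 0;
	}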
...@@ -187,9 +170,6 @@ Part Id - Streaming DMA mappings

dma_addr_t
dma_map_single(struct device *dev, void *cpu_addr, size_t size,
		      enum dma_data_direction direction)
dma_addr_t
pci_map_single(struct pci_dev *hwdev, void *cpu_addr, size_t size,
int direction)
Maps a piece of processor virtual memory so it can be accessed by the
device and returns the physical handle of the memory.

...@@ -198,14 +178,10 @@ The direction for both api's may be converted freely by casting.

However the dma_ API uses a strongly typed enumerator for its
direction:

DMA_NONE		= PCI_DMA_NONE		no direction (used for debugging)
DMA_TO_DEVICE		= PCI_DMA_TODEVICE	data is going from the memory to the device
DMA_FROM_DEVICE		= PCI_DMA_FROMDEVICE	data is coming from the device to the memory
DMA_BIDIRECTIONAL	= PCI_DMA_BIDIRECTIONAL	direction isn't known

DMA_NONE		no direction (used for debugging)
DMA_TO_DEVICE		data is going from the memory to the device
DMA_FROM_DEVICE		data is coming from the device to the memory
DMA_BIDIRECTIONAL	direction isn't known
Notes:  Not all memory regions in a machine can be mapped by this
API.  Further, regions that appear to be physically contiguous in

...@@ -268,9 +244,6 @@ cache lines are updated with data that the device may have changed).

void
dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
		 enum dma_data_direction direction)
void
pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr,
size_t size, int direction)
Unmaps the region previously mapped.  All the parameters passed in
must be identical to those passed in (and returned) by the mapping
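A rough, non-authoritative sketch of a map/unmap pair (the buffer, length and direction are assumed; checking the result with dma_mapping_error() is sketched a little further below):

	/* Minimal sketch: stream a driver buffer out to the device. */
	static int foo_send(struct device *dev, void *buf, size_t len)
	{
		dma_addr_t handle;

		handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

		/* ... point the hardware at 'handle' for 'len' bytes ... */

		/* dev, handle, size and direction must match the map call. */
		dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
		return 0;
	}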
...@@ -280,15 +253,9 @@ dma_addr_t

dma_map_page(struct device *dev, struct page *page,
		    unsigned long offset, size_t size,
		    enum dma_data_direction direction)
dma_addr_t
pci_map_page(struct pci_dev *hwdev, struct page *page,
unsigned long offset, size_t size, int direction)
void
dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
	       enum dma_data_direction direction)
void
pci_unmap_page(struct pci_dev *hwdev, dma_addr_t dma_address,
size_t size, int direction)
API for mapping and unmapping for pages.  All the notes and warnings
for the other mapping APIs apply here.  Also, although the <offset>

...@@ -299,9 +266,6 @@ cache width is.

int
dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
int
pci_dma_mapping_error(struct pci_dev *hwdev, dma_addr_t dma_addr)
In some circumstances dma_map_single and dma_map_page will fail to create
a mapping.  A driver can check for these errors by testing the returned
dma address with dma_mapping_error().  A non-zero return value means the mapping
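Purely as an illustrative sketch (the page, offset and recovery policy are assumptions), the error check might be used like this for a page mapping:

	/* Minimal sketch: map one page and back off cleanly on failure. */
	static int foo_map_one_page(struct device *dev, struct page *page,
				    dma_addr_t *addr)
	{
		*addr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, *addr)) {
			/* e.g. reduce outstanding mappings, or retry later */
			return -ENOMEM;
		}
		return 0;
	}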
...@@ -311,9 +275,6 @@ reduce current DMA mapping usage or delay and try again later).

int
dma_map_sg(struct device *dev, struct scatterlist *sg,
	int nents, enum dma_data_direction direction)
int
pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg,
int nents, int direction)
Returns: the number of physical segments mapped (this may be shorter
than <nents> passed in if some elements of the scatter/gather list are

...@@ -353,9 +314,6 @@ accessed sg->address and sg->length as shown above.

void
dma_unmap_sg(struct device *dev, struct scatterlist *sg,
	int nhwentries, enum dma_data_direction direction)
void
pci_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg,
int nents, int direction)
Unmap the previously mapped scatter/gather list.  All the parameters
must be the same as those and passed in to the scatter/gather mapping

...@@ -365,21 +323,23 @@ Note: <nents> must be the number you passed in, *not* the number of
physical entries returned.
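An illustrative sketch of mapping and walking a scatterlist; for_each_sg() is assumed to be available, and hw_fill_desc() is a made-up driver helper, not a name from this document:

	/* Minimal sketch: map an sg list, program the device, then unmap. */
	#include <linux/dma-mapping.h>
	#include <linux/scatterlist.h>

	/* Hypothetical helper that programs one hardware descriptor. */
	extern void hw_fill_desc(int idx, dma_addr_t addr, unsigned int len);

	static int foo_map_and_program(struct device *dev,
				       struct scatterlist *sglist, int nents)
	{
		struct scatterlist *sg;
		int i, count;

		count = dma_map_sg(dev, sglist, nents, DMA_TO_DEVICE);
		if (count == 0)
			return -ENOMEM;

		for_each_sg(sglist, sg, count, i)
			hw_fill_desc(i, sg_dma_address(sg), sg_dma_len(sg));

		/* ... let the hardware run ... */

		/* Unmap with the original nents, not the count returned by map. */
		dma_unmap_sg(dev, sglist, nents, DMA_TO_DEVICE);
		return 0;
	}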
void
dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
		enum dma_data_direction direction)

void
pci_dma_sync_single(struct pci_dev *hwdev, dma_addr_t dma_handle,
		    size_t size, int direction)

void
dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems,
		enum dma_data_direction direction)

void
pci_dma_sync_sg(struct pci_dev *hwdev, struct scatterlist *sg,
		int nelems, int direction)

Synchronise a single contiguous or scatter/gather mapping.  All the
parameters must be the same as those passed into the single mapping
API.

void
dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size,
		enum dma_data_direction direction)

void
dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size,
		enum dma_data_direction direction)

void
dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nelems,
		enum dma_data_direction direction)

void
dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nelems,
		enum dma_data_direction direction)

Synchronise a single contiguous or scatter/gather mapping for the cpu
and device.  With the sync_sg API, all the parameters must be the same
as those passed into the single mapping API.  With the sync_single API,
you can use dma_handle and size parameters that aren't identical to
those passed into the single mapping API to do a partial sync.
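For illustration only, the usual ping-pong on a receive buffer might look like the sketch below (handle, length and direction are assumed):

	/* Minimal sketch: let the CPU peek at a DMA_FROM_DEVICE buffer,
	 * then hand it back to the device for more DMA.
	 */
	static void foo_peek_then_rearm(struct device *dev, dma_addr_t handle,
					size_t len)
	{
		dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

		/* ... the CPU may now read the freshly DMA'd data ... */

		dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE);

		/* ... the device may now DMA into the buffer again ... */
	}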
Notes:  You must do this:
...@@ -461,9 +421,9 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,

Part II - Advanced dma_ usage
-----------------------------

Warning: These pieces of the DMA API have no PCI equivalent.  They
should also not be used in the majority of cases, since they cater for
unlikely corner cases that don't belong in usual drivers.

Warning: These pieces of the DMA API should not be used in the
majority of cases, since they cater for unlikely corner cases that
don't belong in usual drivers.
If you don't understand how cache line coherency works between a
processor and an I/O device, you should not be using this part of the

...@@ -513,16 +473,6 @@ line, but it will guarantee that one or more cache lines fit exactly
into the width returned by this call.  It will also always be a power
of two for easy alignment.
void
dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
unsigned long offset, size_t size,
enum dma_data_direction direction)
Does a partial sync, starting at offset and continuing for size. You
must be careful to observe the cache alignment and width when doing
anything like this. You must also be extra careful about accessing
memory you intend to sync partially.
void
dma_cache_sync(struct device *dev, void *vaddr, size_t size,
	       enum dma_data_direction direction)
......
...@@ -488,7 +488,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip)
The ECC bytes must be placed immidiately after the data
bytes in order to make the syndrome generator work. This
is contrary to the usual layout used by software ECC. The
seperation of data and out of band area is not longer
separation of data and out of band area is not longer
possible. The nand driver code handles this layout and
the remaining free bytes in the oob area are managed by
the autoplacement code. Provide a matching oob-layout
...@@ -560,7 +560,7 @@ static void board_select_chip (struct mtd_info *mtd, int chip)
bad blocks. They have factory marked good blocks. The marker pattern
is erased when the block is erased to be reused. So in case of
powerloss before writing the pattern back to the chip this block
would be lost and added to the bad blocks. Therefor we scan the
would be lost and added to the bad blocks. Therefore we scan the
chip(s) when we detect them the first time for good blocks and
store this information in a bad block table before erasing any
of the blocks.
...@@ -1094,7 +1094,7 @@ in this page</entry>
manufacturers specifications. This applies similar to the spare area.
</para>
<para>
Therefor NAND aware filesystems must either write in page size chunks
Therefore NAND aware filesystems must either write in page size chunks
or hold a writebuffer to collect smaller writes until they sum up to
pagesize. Available NAND aware filesystems: JFFS2, YAFFS.
</para>
......
...@@ -16,6 +16,15 @@
</address>
</affiliation>
</author>
<author>
<firstname>William</firstname>
<surname>Cohen</surname>
<affiliation>
<address>
<email>wcohen@redhat.com</email>
</address>
</affiliation>
</author>
</authorgroup>
<legalnotice>

...@@ -91,4 +100,8 @@
!Iinclude/trace/events/signal.h
</chapter>
<chapter id="block">
<title>Block IO</title>
!Iinclude/trace/events/block.h
</chapter>
</book>
...@@ -1170,7 +1170,7 @@ frames per second. If less than this number of frames is to be
captured or output, applications can request frame skipping or
duplicating on the driver side. This is especially useful when using
the &func-read; or &func-write;, which are not augmented by timestamps
or sequence counters, and to avoid unneccessary data copying.</para>
or sequence counters, and to avoid unnecessary data copying.</para>
<para>Finally these ioctls can be used to determine the number of
buffers used internally by a driver in read/write mode. For
......
...@@ -55,7 +55,7 @@ captured or output, applications can request frame skipping or
duplicating on the driver side. This is especially useful when using
the <function>read()</function> or <function>write()</function>, which
are not augmented by timestamps or sequence counters, and to avoid
unneccessary data copying.</para>
unnecessary data copying.</para>
<para>Further these ioctls can be used to determine the number of
buffers used internally by a driver in read/write mode. For
......
...@@ -221,8 +221,8 @@ branches. These different branches are:
- main 2.6.x kernel tree
- 2.6.x.y -stable kernel tree
- 2.6.x -git kernel patches
- 2.6.x -mm kernel patches
- subsystem specific kernel trees and patches
- the 2.6.x -next kernel tree for integration tests
2.6.x kernel tree
-----------------

...@@ -232,9 +232,9 @@ process is as follows:
- As soon as a new kernel is released a two weeks window is open,
during this period of time maintainers can submit big diffs to
Linus, usually the patches that have already been included in the
-mm kernel for a few weeks.  The preferred way to submit big changes
-next kernel for a few weeks.  The preferred way to submit big changes
is using git (the kernel's source management tool, more information
can be found at http://git.or.cz/) but plain patches are also just
can be found at http://git-scm.com/) but plain patches are also just
fine.
- After two weeks a -rc1 kernel is released it is now possible to push
only patches that do not include new features that could affect the

...@@ -293,84 +293,43 @@ daily and represent the current state of Linus' tree. They are more
experimental than -rc kernels since they are generated automatically
without even a cursory glance to see if they are sane.
2.6.x -mm kernel patches
------------------------
These are experimental kernel patches released by Andrew Morton. Andrew
takes all of the different subsystem kernel trees and patches and mushes
them together, along with a lot of patches that have been plucked from
the linux-kernel mailing list. This tree serves as a proving ground for
new features and patches. Once a patch has proved its worth in -mm for
a while Andrew or the subsystem maintainer pushes it on to Linus for
inclusion in mainline.
It is heavily encouraged that all new patches get tested in the -mm tree
before they are sent to Linus for inclusion in the main kernel tree. Code
which does not make an appearance in -mm before the opening of the merge
window will prove hard to merge into the mainline.
These kernels are not appropriate for use on systems that are supposed
to be stable and they are more risky to run than any of the other
branches.
If you wish to help out with the kernel development process, please test
and use these kernel releases and provide feedback to the linux-kernel
mailing list if you have any problems, and if everything works properly.
In addition to all the other experimental patches, these kernels usually
also contain any changes in the mainline -git kernels available at the
time of release.
The -mm kernels are not released on a fixed schedule, but usually a few
-mm kernels are released in between each -rc kernel (1 to 3 is common).
Subsystem Specific kernel trees and patches
-------------------------------------------

A number of the different kernel subsystem developers expose their
development trees so that others can see what is happening in the
different areas of the kernel.  These trees are pulled into the -mm
kernel releases as described above.

Here is a list of some of the different kernel trees available:
  git trees:
    - Kbuild development tree, Sam Ravnborg <sam@ravnborg.org>
	git.kernel.org:/pub/scm/linux/kernel/git/sam/kbuild.git
    - ACPI development tree, Len Brown <len.brown@intel.com>
	git.kernel.org:/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6.git
    - Block development tree, Jens Axboe <jens.axboe@oracle.com>
	git.kernel.org:/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git
    - DRM development tree, Dave Airlie <airlied@linux.ie>
	git.kernel.org:/pub/scm/linux/kernel/git/airlied/drm-2.6.git
    - ia64 development tree, Tony Luck <tony.luck@intel.com>
	git.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6.git
    - infiniband, Roland Dreier <rolandd@cisco.com>
	git.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git
    - libata, Jeff Garzik <jgarzik@pobox.com>
	git.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
    - network drivers, Jeff Garzik <jgarzik@pobox.com>
	git.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git
    - pcmcia, Dominik Brodowski <linux@dominikbrodowski.net>
	git.kernel.org:/pub/scm/linux/kernel/git/brodo/pcmcia-2.6.git
    - SCSI, James Bottomley <James.Bottomley@hansenpartnership.com>
	git.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
    - x86, Ingo Molnar <mingo@elte.hu>
	git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
  quilt trees:
    - USB, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de>
	kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/

Other kernel trees can be found listed at http://git.kernel.org/ and in
the MAINTAINERS file.

The maintainers of the various kernel subsystems --- and also many
kernel subsystem developers --- expose their current state of
development in source repositories.  That way, others can see what is
happening in the different areas of the kernel.  In areas where
development is rapid, a developer may be asked to base his submissions
onto such a subsystem kernel tree so that conflicts between the
submission and other already ongoing work are avoided.

Most of these repositories are git trees, but there are also other SCMs
in use, or patch queues being published as quilt series.  Addresses of
these subsystem repositories are listed in the MAINTAINERS file.  Many
of them can be browsed at http://git.kernel.org/.

Before a proposed patch is committed to such a subsystem tree, it is
subject to review which primarily happens on mailing lists (see the
respective section below).  For several kernel subsystems, this review
process is tracked with the tool patchwork.  Patchwork offers a web
interface which shows patch postings, any comments on a patch or
revisions to it, and maintainers can mark patches as under review,
accepted, or rejected.  Most of these patchwork sites are listed at
http://patchwork.kernel.org/ or http://patchwork.ozlabs.org/.

2.6.x -next kernel tree for integration tests
---------------------------------------------
Before updates from subsystem trees are merged into the mainline 2.6.x
tree, they need to be integration-tested.  For this purpose, a special
testing repository exists into which virtually all subsystem trees are
pulled on an almost daily basis:
	http://git.kernel.org/?p=linux/kernel/git/sfr/linux-next.git
	http://linux.f-seidel.de/linux-next/pmwiki/

This way, the -next kernel gives a summary outlook onto what will be
expected to go into the mainline kernel at the next merge period.
Adventurous testers are very welcome to runtime-test the -next kernel.
Bug Reporting
-------------
......
...@@ -365,6 +365,7 @@ You can change this at module load time (for a module) with:
regshifts=<shift1>,<shift2>,...
slave_addrs=<addr1>,<addr2>,...
force_kipmid=<enable1>,<enable2>,...
kipmid_max_busy_us=<ustime1>,<ustime2>,...
unload_when_empty=[0|1]

Each of these except si_trydefaults is a list, the first item for the

...@@ -433,6 +434,7 @@ kernel command line as:
ipmi_si.regshifts=<shift1>,<shift2>,...
ipmi_si.slave_addrs=<addr1>,<addr2>,...
ipmi_si.force_kipmid=<enable1>,<enable2>,...
ipmi_si.kipmid_max_busy_us=<ustime1>,<ustime2>,...
It works the same as the module parameters of the same names.

...@@ -450,6 +452,16 @@ force this thread on or off.  If you force it off and don't have
interrupts, the driver will run VERY slowly.  Don't blame me,
these interfaces suck.
Unfortunately, this thread can use a lot of CPU depending on the
interface's performance. This can waste a lot of CPU and cause
various issues with detecting idle CPU and using extra power. To
avoid this, the kipmid_max_busy_us sets the maximum amount of time, in
microseconds, that kipmid will spin before sleeping for a tick. This
value sets a balance between performance and CPU waste and needs to be
tuned to your needs. Maybe, someday, auto-tuning will be added, but
that's not a simple thing and even the auto-tuning would need to be
tuned to the user's desired performance.
The driver supports a hot add and remove of interfaces.  This way,
interfaces can be added or removed after the kernel is up and running.
This is done using /sys/modules/ipmi_si/parameters/hotmod, which is a
......
obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
	filesystems/configfs/ ia64/ networking/ \
	pcmcia/ spi/ video4linux/ vm/ watchdog/src/

obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
	filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
	pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
This diff is collapsed.
...@@ -34,7 +34,7 @@ NMI handler.

	cpu = smp_processor_id();
	++nmi_count(cpu);

	if (!rcu_dereference(nmi_callback)(regs, cpu))
	if (!rcu_dereference_sched(nmi_callback)(regs, cpu))
		default_do_nmi(regs);

	nmi_exit();

...@@ -47,12 +47,13 @@ function pointer.  If this handler returns zero, do_nmi() invokes the
default_do_nmi() function to handle a machine-specific NMI.  Finally,
preemption is restored.

Strictly speaking, rcu_dereference() is not needed, since this code runs
only on i386, which does not need rcu_dereference() anyway.  However,
it is a good documentation aid, particularly for anyone attempting to
do something similar on Alpha.

In theory, rcu_dereference_sched() is not needed, since this code runs
only on i386, which in theory does not need rcu_dereference_sched()
anyway.  However, in practice it is a good documentation aid, particularly
for anyone attempting to do something similar on Alpha or on systems
with aggressive optimizing compilers.

Quick Quiz:  Why might the rcu_dereference() be necessary on Alpha,
Quick Quiz:  Why might the rcu_dereference_sched() be necessary on Alpha,
	given that the code referenced by the pointer is read-only?

...@@ -99,17 +100,21 @@ invoke irq_enter() and irq_exit() on NMI entry and exit, respectively.

Answer to Quick Quiz

	Why might the rcu_dereference() be necessary on Alpha, given
	Why might the rcu_dereference_sched() be necessary on Alpha, given
	that the code referenced by the pointer is read-only?

Answer:	The caller to set_nmi_callback() might well have
	initialized some data that is to be used by the
	new NMI handler.  In this case, the rcu_dereference()
	would be needed, because otherwise a CPU that received
	an NMI just after the new handler was set might see
	the pointer to the new NMI handler, but the old
	pre-initialized version of the handler's data.

	More important, the rcu_dereference() makes it clear
	to someone reading the code that the pointer is being
	protected by RCU.

Answer:	The caller to set_nmi_callback() might well have
	initialized some data that is to be used by the new NMI
	handler.  In this case, the rcu_dereference_sched() would
	be needed, because otherwise a CPU that received an NMI
	just after the new handler was set might see the pointer
	to the new NMI handler, but the old pre-initialized
	version of the handler's data.

	This same sad story can happen on other CPUs when using
	a compiler with aggressive pointer-value speculation
	optimizations.

	More important, the rcu_dereference_sched() makes it
	clear to someone reading the code that the pointer is
	being protected by RCU-sched.
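As a hedged aside (not part of the patch), the update side that this answer alludes to might look like the sketch below; nmi_callback and nmi_callback_t are assumed to be the pointer and type used by the NMI-handler snippet earlier in this file:

	/* Minimal sketch: publish a new handler so that its data is
	 * guaranteed to be initialized before the pointer becomes visible.
	 */
	void set_nmi_callback(nmi_callback_t callback)
	{
		rcu_assign_pointer(nmi_callback, callback);
	}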
...@@ -260,7 +260,8 @@ over a rather long period of time, but improvements are always welcome!

	The reason that it is permissible to use RCU list-traversal
	primitives when the update-side lock is held is that doing so
	can be quite helpful in reducing code bloat when common code is
	shared between readers and updaters.  Additional primitives
	are provided for this case, as discussed in lockdep.txt.

10.	Conversely, if you are in an RCU read-side critical section,
	and you don't hold the appropriate update-side lock, you -must-

...@@ -344,8 +345,8 @@ over a rather long period of time, but improvements are always welcome!

	requiring SRCU's read-side deadlock immunity or low read-side
	realtime latency.

	Note that, rcu_assign_pointer() and rcu_dereference() relate to
	SRCU just as they do to other forms of RCU.

	Note that, rcu_assign_pointer() relates to SRCU just as they do
	to other forms of RCU.

15.	The whole point of call_rcu(), synchronize_rcu(), and friends
	is to wait until all pre-existing readers have finished before
......
...@@ -32,9 +32,20 @@ checking of rcu_dereference() primitives:

	srcu_dereference(p, sp):
		Check for SRCU read-side critical section.
	rcu_dereference_check(p, c):
		Use explicit check expression "c".  This is useful in
		code that is invoked by both readers and updaters.
	rcu_dereference_raw(p)
		Don't check.  (Use sparingly, if at all.)
rcu_dereference_protected(p, c):
Use explicit check expression "c", and omit all barriers
and compiler constraints. This is useful when the data
structure cannot change, for example, in code that is
invoked only by updaters.
rcu_access_pointer(p):
Return the value of the pointer and omit all barriers,
but retain the compiler constraints that prevent duplicating
or coalescsing. This is useful when when testing the
value of the pointer itself, for example, against NULL.
The rcu_dereference_check() check expression can be any boolean
expression, but would normally include one of the rcu_read_lock_held()

...@@ -59,7 +70,20 @@ In case (1), the pointer is picked up in an RCU-safe manner for vanilla

RCU read-side critical sections, in case (2) the ->file_lock prevents
any change from taking place, and finally, in case (3) the current task
is the only task accessing the file_struct, again preventing any change
from taking place.  If the above statement was invoked only from updater
code, it could instead be written as follows:
file = rcu_dereference_protected(fdt->fd[fd],
lockdep_is_held(&files->file_lock) ||
atomic_read(&files->count) == 1);
This would verify cases #2 and #3 above, and furthermore lockdep would
complain if this was used in an RCU read-side critical section unless one
of these two cases held. Because rcu_dereference_protected() omits all
barriers and compiler constraints, it generates better code than do the
other flavors of rcu_dereference(). On the other hand, it is illegal
to use rcu_dereference_protected() if either the RCU-protected pointer
or the RCU-protected data that it points to can change concurrently.
There are currently only "universal" versions of the rcu_assign_pointer()
and RCU list-/tree-traversal primitives, which do not (yet) check for
......
...@@ -840,6 +840,12 @@ SRCU: Initialization/cleanup
	init_srcu_struct
	cleanup_srcu_struct
All: lockdep-checked RCU-protected pointer access
rcu_dereference_check
rcu_dereference_protected
rcu_access_pointer
See the comment headers in the source code (or the docbook generated
from them) for more information.
......
...@@ -9,10 +9,14 @@ Documentation/SubmittingPatches and elsewhere regarding submitting Linux
kernel patches.

1: Builds cleanly with applicable or modified CONFIG options =y, =m, and
   =n.  No gcc warnings/errors, no linker warnings/errors.

2: Passes allnoconfig, allmodconfig

1: If you use a facility then #include the file that defines/declares
   that facility.  Don't depend on other header files pulling in ones
   that you use.

2: Builds cleanly with applicable or modified CONFIG options =y, =m, and
   =n.  No gcc warnings/errors, no linker warnings/errors.

2b: Passes allnoconfig, allmodconfig

3: Builds on multiple CPU architectures by using local cross-compile tools
   or some other build farm.
......
...@@ -14,8 +14,8 @@ Introduction
how the clocks are arranged.  The first implementation used as single
PLL to feed the ARM, memory and peripherals via a series of dividers
and muxes and this is the implementation that is documented here.  A
newer version where there is a seperate PLL and clock divider for the
ARM core is available as a seperate driver.
newer version where there is a separate PLL and clock divider for the
ARM core is available as a separate driver.

Layout
......
Samsung ARM Linux Overview
==========================
Introduction
------------
The Samsung range of ARM SoCs spans many similar devices, from the initial
ARM9 through to the newest ARM cores. This document shows an overview of
the current kernel support, how to use it and where to find the code
that supports this.
The currently supported SoCs are:
- S3C24XX: See Documentation/arm/Samsung-S3C24XX/Overview.txt for full list
- S3C64XX: S3C6400 and S3C6410
- S5PC6440
S5PC100 and S5PC110 support is currently being merged
S3C24XX Systems
---------------
There is still documentation in Documentation/arm/Samsung-S3C24XX/ which
deals with the architecture and drivers specific to these devices.
See Documentation/arm/Samsung-S3C24XX/Overview.txt for more information
on the implementation details and specific support.
Configuration
-------------
A number of configurations are supplied, as there is no current way of
unifying all the SoCs into one kernel.
s5p6440_defconfig - S5P6440 specific default configuration
s5pc100_defconfig - S5PC100 specific default configuration
Layout
------
The directory layout is currently being restructured, and consists of
several platform directories and then the machine specific directories
of the CPUs being built for.
plat-samsung provides the base for all the implementations, and is the
last in the line of include directories that are processed for the build
specific information. It contains the base clock, GPIO and device definitions
to get the system running.
plat-s3c is the s3c24xx/s3c64xx platform directory, although it is currently
involved in other builds this will be phased out once the relevant code is
moved elsewhere.
plat-s3c24xx is for s3c24xx specific builds, see the S3C24XX docs.
plat-s3c64xx is for the s3c64xx specific bits, see the S3C24XX docs.
plat-s5p is for s5p specific builds, more to be added.
[ to finish ]
Port Contributors
-----------------
Ben Dooks (BJD)
Vincent Sanders
Herbert Potzl
Arnaud Patard (RTP)
Roc Wu
Klaus Fetscher
Dimitry Andric
Shannon Holland
Guillaume Gourat (NexVision)
Christer Weinigel (wingel) (Acer N30)
Lucas Correia Villa Real (S3C2400 port)
Document Author
---------------
Copyright 2009-2010 Ben Dooks <ben-linux@fluff.org>
#!/usr/bin/awk -f
#
# Copyright 2010 Ben Dooks <ben-linux@fluff.org>
#
# Released under GPLv2
# example usage
# ./clksrc-change-registers.awk arch/arm/plat-s5pc1xx/include/plat/regs-clock.h < src > dst
function extract_value(s)
{
eqat = index(s, "=")
comat = index(s, ",")
return substr(s, eqat+2, (comat-eqat)-2)
}
function remove_brackets(b)
{
return substr(b, 2, length(b)-2)
}
function splitdefine(l, p)
{
r = split(l, tp)
p[0] = tp[2]
p[1] = remove_brackets(tp[3])
}
function find_length(f)
{
if (0)
printf "find_length " f "\n" > "/dev/stderr"
if (f ~ /0x1/)
return 1
else if (f ~ /0x3/)
return 2
else if (f ~ /0x7/)
return 3
else if (f ~ /0xf/)
return 4
printf "unknown legnth " f "\n" > "/dev/stderr"
exit
}
function find_shift(s)
{
id = index(s, "<")
if (id <= 0) {
printf "cannot find shift " s "\n" > "/dev/stderr"
exit
}
return substr(s, id+2)
}
BEGIN {
if (ARGC < 2) {
print "too few arguments" > "/dev/stderr"
exit
}
# read the header file and find the mask values that we will need
# to replace and create an associative array of values
while (getline line < ARGV[1] > 0) {
if (line ~ /\#define.*_MASK/ &&
!(line ~ /S5PC100_EPLL_MASK/) &&
!(line ~ /USB_SIG_MASK/)) {
splitdefine(line, fields)
name = fields[0]
if (0)
printf "MASK " line "\n" > "/dev/stderr"
dmask[name,0] = find_length(fields[1])
dmask[name,1] = find_shift(fields[1])
if (0)
printf "=> '" name "' LENGTH=" dmask[name,0] " SHIFT=" dmask[name,1] "\n" > "/dev/stderr"
} else {
}
}
delete ARGV[1]
}
/clksrc_clk.*=.*{/ {
shift=""
mask=""
divshift=""
reg_div=""
reg_src=""
indent=1
print $0
for(; indent >= 1;) {
if ((getline line) <= 0) {
printf "unexpected end of file" > "/dev/stderr"
exit 1;
}
if (line ~ /\.shift/) {
shift = extract_value(line)
} else if (line ~ /\.mask/) {
mask = extract_value(line)
} else if (line ~ /\.reg_divider/) {
reg_div = extract_value(line)
} else if (line ~ /\.reg_source/) {
reg_src = extract_value(line)
} else if (line ~ /\.divider_shift/) {
divshift = extract_value(line)
} else if (line ~ /{/) {
indent++
print line
} else if (line ~ /}/) {
indent--
if (indent == 0) {
if (0) {
printf "shift '" shift "' ='" dmask[shift,0] "'\n" > "/dev/stderr"
printf "mask '" mask "'\n" > "/dev/stderr"
printf "dshft '" divshift "'\n" > "/dev/stderr"
printf "rdiv '" reg_div "'\n" > "/dev/stderr"
printf "rsrc '" reg_src "'\n" > "/dev/stderr"
}
generated = mask
sub(reg_src, reg_div, generated)
if (0) {
printf "/* rsrc " reg_src " */\n"
printf "/* rdiv " reg_div " */\n"
printf "/* shift " shift " */\n"
printf "/* mask " mask " */\n"
printf "/* generated " generated " */\n"
}
if (reg_div != "") {
printf "\t.reg_div = { "
printf ".reg = " reg_div ", "
printf ".shift = " dmask[generated,1] ", "
printf ".size = " dmask[generated,0] ", "
printf "},\n"
}
printf "\t.reg_src = { "
printf ".reg = " reg_src ", "
printf ".shift = " dmask[mask,1] ", "
printf ".size = " dmask[mask,0] ", "
printf "},\n"
}
print line
} else {
print line
}
if (0)
printf indent ":" line "\n" > "/dev/stderr"
}
}
// && ! /clksrc_clk.*=.*{/ { print $0 }
...@@ -1162,8 +1162,8 @@ where a driver received a request ala this before:
As mentioned, there is no virtual mapping of a bio. For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (pci_map_page for a single segment or
use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For
Instead it needs a bus mapping (dma_map_page for a single segment or
use dma_map_sg for scatter gather) to be able to ship it to the driver. For
PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer a virtual mapping is needed. If the driver supports highmem I/O,
......
/*
* cgroup_event_listener.c - Simple listener of cgroup events
*
* Copyright (C) Kirill A. Shutemov <kirill@shutemov.name>
*/
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#define USAGE_STR "Usage: cgroup_event_listener <path-to-control-file> <args>\n"
int main(int argc, char **argv)
{
int efd = -1;
int cfd = -1;
int event_control = -1;
char event_control_path[PATH_MAX];
char line[LINE_MAX];
int ret;
if (argc != 3) {
fputs(USAGE_STR, stderr);
return 1;
}
cfd = open(argv[1], O_RDONLY);
if (cfd == -1) {
fprintf(stderr, "Cannot open %s: %s\n", argv[1],
strerror(errno));
goto out;
}
ret = snprintf(event_control_path, PATH_MAX, "%s/cgroup.event_control",
dirname(argv[1]));
if (ret >= PATH_MAX) {
fputs("Path to cgroup.event_control is too long\n", stderr);
goto out;
}
event_control = open(event_control_path, O_WRONLY);
if (event_control == -1) {
fprintf(stderr, "Cannot open %s: %s\n", event_control_path,
strerror(errno));
goto out;
}
efd = eventfd(0, 0);
if (efd == -1) {
perror("eventfd() failed");
goto out;
}
ret = snprintf(line, LINE_MAX, "%d %d %s", efd, cfd, argv[2]);
if (ret >= LINE_MAX) {
fputs("Arguments string is too long\n", stderr);
goto out;
}
ret = write(event_control, line, strlen(line) + 1);
if (ret == -1) {
perror("Cannot write to cgroup.event_control");
goto out;
}
while (1) {
uint64_t result;
ret = read(efd, &result, sizeof(result));
if (ret == -1) {
if (errno == EINTR)
continue;
perror("Cannot read from eventfd");
break;
}
assert(ret == sizeof(result));
ret = access(event_control_path, W_OK);
if ((ret == -1) && (errno == ENOENT)) {
puts("The cgroup seems to have removed.");
ret = 0;
break;
}
if (ret == -1) {
perror("cgroup.event_control "
"is not accessable any more");
break;
}
printf("%s %s: crossed\n", argv[1], argv[2]);
}
out:
if (efd >= 0)
close(efd);
if (event_control >= 0)
close(event_control);
if (cfd >= 0)
close(cfd);
return (ret != 0);
}
...@@ -22,6 +22,8 @@ CONTENTS:
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Attaching processes
2.3 Mounting hierarchies by name
2.4 Notification API
3. Kernel API
3.1 Overview
3.2 Synchronization

...@@ -233,8 +235,7 @@ containing the following files describing that cgroup:

 - cgroup.procs: list of tgids in the cgroup.  This list is not
   guaranteed to be sorted or free of duplicate tgids, and userspace
   should sort/uniquify the list if this property is required.
   Writing a tgid into this file moves all threads with that tgid into
   this cgroup.
   This is a read-only file, for now.
 - notify_on_release flag: run the release agent on exit?
 - release_agent: the path to use for release notifications (this file
   exists in the top cgroup only)

...@@ -434,6 +435,25 @@ you give a subsystem a name.
The name of the subsystem appears as part of the hierarchy description
in /proc/mounts and /proc/<pid>/cgroups.
2.4 Notification API
--------------------
There is mechanism which allows to get notifications about changing
status of a cgroup.
To register new notification handler you need:
- create a file descriptor for event notification using eventfd(2);
- open a control file to be monitored (e.g. memory.usage_in_bytes);
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
Interpretation of args is defined by control file implementation;
eventfd will be woken up by control file implementation or when the
cgroup is removed.
To unregister notification handler just close eventfd.
NOTE: Support of notifications should be implemented for the control
file. See documentation for the subsystem.
3. Kernel API
=============

...@@ -488,6 +508,11 @@ Each subsystem should:
- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_subsys
If a subsystem can be compiled as a module, it should also have in its
module initcall a call to cgroup_load_subsys(), and in its exitcall a
call to cgroup_unload_subsys(). It should also set its_subsys.module =
THIS_MODULE in its .c file.
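A minimal, non-authoritative sketch of that wiring for a hypothetical modular subsystem (the name "foo" and the omitted callbacks are placeholders; an entry in cgroup_subsys.h is still required as described above):

	#include <linux/module.h>
	#include <linux/cgroup.h>

	struct cgroup_subsys foo_subsys = {
		.name = "foo",
		.module = THIS_MODULE,
		/* .create, .destroy and the other methods omitted in this sketch */
	};

	static int __init foo_cgroup_init(void)
	{
		return cgroup_load_subsys(&foo_subsys);
	}

	static void __exit foo_cgroup_exit(void)
	{
		cgroup_unload_subsys(&foo_subsys);
	}

	module_init(foo_cgroup_init);
	module_exit(foo_cgroup_exit);
	MODULE_LICENSE("GPL");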
Each subsystem may export the following methods.  The only mandatory
methods are create/destroy.  Any others that are null are presumed to
be successful no-ops.

...@@ -536,10 +561,21 @@ returns an error, this will abort the attach operation.  If a NULL
task is passed, then a successful result indicates that *any*
unspecified task can be moved into the cgroup.  Note that this isn't
called on a fork.  If this method returns 0 (success) then this should
remain valid while the caller holds cgroup_mutex.  If threadgroup is
remain valid while the caller holds cgroup_mutex and it is ensured that either
attach() or cancel_attach() will be called in future.  If threadgroup is
true, then a successful result indicates that all threads in the given
thread's threadgroup can be moved together.
void cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
struct task_struct *task, bool threadgroup)
(cgroup_mutex held by caller)
Called when a task attach operation has failed after can_attach() has succeeded.
A subsystem whose can_attach() has some side-effects should provide this
function, so that the subsytem can implement a rollback. If not, not necessary.
This will be called only about subsystems whose can_attach() operation have
succeeded.
void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
	    struct cgroup *old_cgrp, struct task_struct *task,
	    bool threadgroup)
......
...@@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system ...@@ -168,20 +168,20 @@ Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following containing (on top of the standard cgroup files) the following
files describing that cpuset: files describing that cpuset:
- cpus: list of CPUs in that cpuset - cpuset.cpus: list of CPUs in that cpuset
- mems: list of Memory Nodes in that cpuset - cpuset.mems: list of Memory Nodes in that cpuset
- memory_migrate flag: if set, move pages to cpusets nodes - cpuset.memory_migrate flag: if set, move pages to cpusets nodes
- cpu_exclusive flag: is cpu placement exclusive? - cpuset.cpu_exclusive flag: is cpu placement exclusive?
- mem_exclusive flag: is memory placement exclusive? - cpuset.mem_exclusive flag: is memory placement exclusive?
- mem_hardwall flag: is memory allocation hardwalled - cpuset.mem_hardwall flag: is memory allocation hardwalled
- memory_pressure: measure of how much paging pressure in cpuset - cpuset.memory_pressure: measure of how much paging pressure in cpuset
- memory_spread_page flag: if set, spread page cache evenly on allowed nodes - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- sched_load_balance flag: if set, load balance within CPUs on that cpuset - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- sched_relax_domain_level: the searching range when migrating tasks - cpuset.sched_relax_domain_level: the searching range when migrating tasks
In addition, the root cpuset only has the following file: In addition, the root cpuset only has the following file:
- memory_pressure_enabled flag: compute memory_pressure? - cpuset.memory_pressure_enabled flag: compute memory_pressure?
New cpusets are created using the mkdir system call or shell New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed command. The properties of a cpuset, such as its flags, allowed
...@@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than ...@@ -229,7 +229,7 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes. Memory Nodes.
A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users. All cpusets, commonly shared by the kernel across multiple users. All cpusets,
whether hardwalled or not, restrict allocations of memory for user whether hardwalled or not, restrict allocations of memory for user
...@@ -304,15 +304,15 @@ times 1000. ...@@ -304,15 +304,15 @@ times 1000.
--------------------------- ---------------------------
There are two boolean flag files per cpuset that control where the There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in kernel allocates pages for the file system buffers and related in
kernel data structures. They are called 'memory_spread_page' and kernel data structures. They are called 'cpuset.memory_spread_page' and
'memory_spread_slab'. 'cpuset.memory_spread_slab'.
If the per-cpuset boolean flag file 'memory_spread_page' is set, then If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running. of preferring to put those pages on the node where the task is running.
If the per-cpuset boolean flag file 'memory_spread_slab' is set, If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches, then the kernel will spread some file system related slab caches,
such as for inodes and dentries evenly over all the nodes that the such as for inodes and dentries evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those faulting task is allowed to use, instead of preferring to put those
...@@ -337,21 +337,21 @@ their containing tasks memory spread settings. If memory spreading ...@@ -337,21 +337,21 @@ their containing tasks memory spread settings. If memory spreading
is turned off, then the currently specified NUMA mempolicy once again is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations. applies to memory page allocations.
Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files. By default they contain "0", meaning that the feature is off files. By default they contain "0", meaning that the feature is off
for that cpuset. If a "1" is written to that file, then that turns for that cpuset. If a "1" is written to that file, then that turns
the named feature on. the named feature on.
The implementation is simple. The implementation is simple.
Setting the flag 'memory_spread_page' turns on a per-process flag Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PF_SPREAD_PAGE for each task that is in that cpuset or subsequently PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset. The page allocation calls for the page cache joins that cpuset. The page allocation calls for the page cache
is modified to perform an inline check for this PF_SPREAD_PAGE task is modified to perform an inline check for this PF_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node() flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation. returns the node to prefer for the allocation.
Similarly, setting 'memory_spread_slab' turns on the flag Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PF_SPREAD_SLAB, and appropriately marked slab caches will allocate PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node(). pages from the node returned by cpuset_mem_spread_node().
...@@ -404,24 +404,24 @@ the following two situations: ...@@ -404,24 +404,24 @@ the following two situations:
system overhead on those CPUs, including avoiding task load system overhead on those CPUs, including avoiding task load
balancing if that is not needed. balancing if that is not needed.
When the per-cpuset flag "sched_load_balance" is enabled (the default When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpusets allowed 'cpus' setting), it requests that all the CPUs in that cpusets allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity) can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other. from any CPU in that cpuset to any other.
When the per-cpuset flag "sched_load_balance" is disabled, then the When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset, scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset --except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled. has "sched_load_balance" enabled.
So, for example, if the top cpuset has the flag "sched_load_balance" So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "sched_load_balance" flag in any other CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing. cpusets won't matter, as we're already fully load balancing.
Therefore in the above two situations, the top cpuset flag Therefore in the above two situations, the top cpuset flag
"sched_load_balance" should be disabled, and only some of the smaller, "cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets have this flag enabled. child cpusets have this flag enabled.
When doing this, you don't usually want to leave any unpinned tasks in When doing this, you don't usually want to leave any unpinned tasks in
...@@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that ...@@ -433,7 +433,7 @@ scheduler might not consider the possibility of load balancing that
task to that underused CPU. task to that underused CPU.
Of course, tasks pinned to a particular CPU can be left in a cpuset Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "sched_load_balance" as those tasks aren't going anywhere that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway. else anyway.
There is an impedance mismatch here, between cpusets and sched domains. There is an impedance mismatch here, between cpusets and sched domains.
...@@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain. ...@@ -443,19 +443,19 @@ overlap and each CPU is in at most one sched domain.
It is necessary for sched domains to be flat because load balancing It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding. So if each of two partially that would be beyond our understanding. So if each of two partially
overlapping cpusets enables the flag 'sched_load_balance', then we overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both. We won't move form a single sched domain that is a superset of both. We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility. code might waste some compute cycles considering that possibility.
This mismatch is why there is not a simple one-to-one relation This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "sched_load_balance" enabled, between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration. If a cpuset enables the flag, it and the sched domain configuration. If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag, will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping it will only be assured of no load balancing if no other overlapping
cpuset enables the flag. cpuset enables the flag.
If two cpusets have partially overlapping 'cpus' allowed, and only If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs. tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few This is just the general case of the top_cpuset example given a few
...@@ -468,23 +468,23 @@ load balancing to the other CPUs. ...@@ -468,23 +468,23 @@ load balancing to the other CPUs.
1.7.1 sched_load_balance implementation details. 1.7.1 sched_load_balance implementation details.
------------------------------------------------ ------------------------------------------------
The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.) When enabled for a cpuset, the kernel will to most cpuset flags.) When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are (makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.) in the same sched domain.)
If two overlapping cpusets both have 'sched_load_balance' enabled, If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain. then they will be (must be) both in the same sched domain.
If, as is the default, the top cpuset has 'sched_load_balance' enabled, If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings. the whole system, regardless of any other cpuset settings.
The kernel commits to user space that it will avoid load balancing The kernel commits to user space that it will avoid load balancing
where it can. It will pick as fine a granularity partition of sched where it can. It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'sched_load_balance' enabled. of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.
The internal kernel cpuset to scheduler interface passes from the The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced cpuset code to the scheduler code a partition of the load balanced
...@@ -495,9 +495,9 @@ all the CPUs that must be load balanced. ...@@ -495,9 +495,9 @@ all the CPUs that must be load balanced.
The cpuset code builds a new such partition and passes it to the The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever: as necessary, whenever:
- the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes, - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
- or CPUs come or go from a cpuset with this flag enabled, - or CPUs come or go from a cpuset with this flag enabled,
- or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
and with this flag enabled changes, and with this flag enabled changes,
- or a cpuset with non-empty CPUs and with this flag enabled is removed, - or a cpuset with non-empty CPUs and with this flag enabled is removed,
- or a cpu is offlined/onlined. - or a cpu is offlined/onlined.
...@@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance ...@@ -542,7 +542,7 @@ As the result, task B on CPU X need to wait task A or wait load balance
on the next tick. For some applications in special situation, waiting on the next tick. For some applications in special situation, waiting
1 tick may be too long. 1 tick may be too long.
The 'sched_relax_domain_level' file allows you to request changing The 'cpuset.sched_relax_domain_level' file allows you to request changing
this searching range as you like. This file takes int value which this searching range as you like. This file takes int value which
indicates size of searching range in levels ideally as follows, indicates size of searching range in levels ideally as follows,
otherwise initial value -1 that indicates the cpuset has no request. otherwise initial value -1 that indicates the cpuset has no request.
...@@ -559,8 +559,8 @@ The system default is architecture dependent. The system default ...@@ -559,8 +559,8 @@ The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter. can be changed using the relax_domain_level= boot parameter.
This file is per-cpuset and affect the sched domain where the cpuset This file is per-cpuset and affect the sched domain where the cpuset
belongs to. Therefore if the flag 'sched_load_balance' of a cpuset belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'sched_relax_domain_level' have no effect since is disabled, then 'cpuset.sched_relax_domain_level' have no effect since
there is no sched domain belonging the cpuset. there is no sched domain belonging the cpuset.
If multiple cpusets are overlapping and hence they form a single sched If multiple cpusets are overlapping and hence they form a single sched
...@@ -607,9 +607,9 @@ from one cpuset to another, then the kernel will adjust the tasks ...@@ -607,9 +607,9 @@ from one cpuset to another, then the kernel will adjust the tasks
memory placement, as above, the next time that the kernel attempts memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task. to allocate a page of memory for that task.
If a cpuset has its 'cpus' modified, then each task in that cpuset If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately. Similarly, will have its allowed CPU placement changed immediately. Similarly,
if a tasks pid is written to another cpusets 'tasks' file, then its if a tasks pid is written to another cpusets 'cpuset.tasks' file, then its
allowed CPU placement is changed immediately. If such a task had been allowed CPU placement is changed immediately. If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call, bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset, the task will be allowed to run on any CPU allowed in its new cpuset,
...@@ -622,8 +622,8 @@ and the processor placement is updated immediately. ...@@ -622,8 +622,8 @@ and the processor placement is updated immediately.
Normally, once a page is allocated (given a physical page Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the was allocated, so long as it remains allocated, even if the
cpusets memory placement policy 'mems' subsequently changes. cpusets memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'memory_migrate' is set true, then when If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated allocated to it on nodes in its previous cpuset are migrated
to the tasks new cpuset. The relative placement of the page within to the tasks new cpuset. The relative placement of the page within
...@@ -631,12 +631,12 @@ the cpuset is preserved during these migration operations if possible. ...@@ -631,12 +631,12 @@ the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset. then the page will be placed on the second valid node of the new cpuset.
Also if 'memory_migrate' is set true, then if that cpusets Also if 'cpuset.memory_migrate' is set true, then if that cpusets
'mems' file is modified, pages allocated to tasks in that 'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'mems', cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'mems.' will be moved to nodes in the new setting of 'mems.'
Pages that were not in the tasks prior cpuset, or in the cpusets Pages that were not in the tasks prior cpuset, or in the cpusets
prior 'mems' setting, will not be moved. prior 'cpuset.mems' setting, will not be moved.
There is an exception to the above. If hotplug functionality is used There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset, to remove all the CPUs that are currently assigned to a cpuset,
...@@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset: ...@@ -678,8 +678,8 @@ and then start a subshell 'sh' in that cpuset:
cd /dev/cpuset cd /dev/cpuset
mkdir Charlie mkdir Charlie
cd Charlie cd Charlie
/bin/echo 2-3 > cpus /bin/echo 2-3 > cpuset.cpus
/bin/echo 1 > mems /bin/echo 1 > cpuset.mems
/bin/echo $$ > tasks /bin/echo $$ > tasks
sh sh
# The subshell 'sh' is now running in cpuset Charlie # The subshell 'sh' is now running in cpuset Charlie
...@@ -725,10 +725,13 @@ Now you want to do something with this cpuset. ...@@ -725,10 +725,13 @@ Now you want to do something with this cpuset.
In this directory you can find several files: In this directory you can find several files:
# ls # ls
cpu_exclusive memory_migrate mems tasks cpuset.cpu_exclusive cpuset.memory_spread_slab
cpus memory_pressure notify_on_release cpuset.cpus cpuset.mems
mem_exclusive memory_spread_page sched_load_balance cpuset.mem_exclusive cpuset.sched_load_balance
mem_hardwall memory_spread_slab sched_relax_domain_level cpuset.mem_hardwall cpuset.sched_relax_domain_level
cpuset.memory_migrate notify_on_release
cpuset.memory_pressure tasks
cpuset.memory_spread_page
Reading them will give you information about the state of this cpuset: Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using the CPUs and Memory Nodes it can use, the processes that are using
...@@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate ...@@ -736,13 +739,13 @@ it, its properties. By writing to these files you can manipulate
the cpuset. the cpuset.
Set some flags: Set some flags:
# /bin/echo 1 > cpu_exclusive # /bin/echo 1 > cpuset.cpu_exclusive
Add some cpus: Add some cpus:
# /bin/echo 0-7 > cpus # /bin/echo 0-7 > cpuset.cpus
Add some mems: Add some mems:
# /bin/echo 0-7 > mems # /bin/echo 0-7 > cpuset.mems
Now attach your shell to this cpuset: Now attach your shell to this cpuset:
# /bin/echo $$ > tasks # /bin/echo $$ > tasks
...@@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent ...@@ -774,28 +777,28 @@ echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent
This is the syntax to use when writing in the cpus or mems files This is the syntax to use when writing in the cpus or mems files
in cpuset directories: in cpuset directories:
# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
To add a CPU to a cpuset, write the new list of CPUs including the To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset: CPU to be added. To add 6 to the above cpuset:
# /bin/echo 1-4,6 > cpus -> set cpus list to cpus 1,2,3,4,6 # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
Similarly to remove a CPU from a cpuset, write the new list of CPUs Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed. without the CPU to be removed.
To remove all the CPUs: To remove all the CPUs:
# /bin/echo "" > cpus -> clear cpus list # /bin/echo "" > cpuset.cpus -> clear cpus list
2.3 Setting flags 2.3 Setting flags
----------------- -----------------
The syntax is very simple: The syntax is very simple:
# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'
2.4 Attaching processes 2.4 Attaching processes
----------------------- -----------------------
......
Memory Resource Controller(Memcg) Implementation Memo. Memory Resource Controller(Memcg) Implementation Memo.
Last Updated: 2009/1/20 Last Updated: 2010/2
Base Kernel Version: based on 2.6.29-rc2. Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
Because VM is getting complex (one of reasons is memcg...), memcg's behavior Because VM is getting complex (one of reasons is memcg...), memcg's behavior
is complex. This is a document for memcg's internal behavior. is complex. This is a document for memcg's internal behavior.
...@@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -337,7 +337,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
race and lock dependency with other cgroup subsystems. race and lock dependency with other cgroup subsystems.
example) example)
# mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
and do task move, mkdir, rmdir etc...under this. and do task move, mkdir, rmdir etc...under this.
...@@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -348,7 +348,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
For example, test like following is good. For example, test like following is good.
(Shell-A) (Shell-A)
# mount -t cgroup none /cgroup -t memory # mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/test # mkdir /cgroup/test
# echo 40M > /cgroup/test/memory.limit_in_bytes # echo 40M > /cgroup/test/memory.limit_in_bytes
# echo 0 > /cgroup/test/tasks # echo 0 > /cgroup/test/tasks
...@@ -378,3 +378,42 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. ...@@ -378,3 +378,42 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
#echo 50M > memory.limit_in_bytes #echo 50M > memory.limit_in_bytes
#echo 50M > memory.memsw.limit_in_bytes #echo 50M > memory.memsw.limit_in_bytes
run 51M of malloc run 51M of malloc
9.9 Move charges at task migration
Charges associated with a task can be moved along with task migration.
(Shell-A)
#mkdir /cgroup/A
#echo $$ >/cgroup/A/tasks
run some programs which use some amount of memory in /cgroup/A.
(Shell-B)
#mkdir /cgroup/B
#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
#echo "pid of the program running in group A" >/cgroup/B/tasks
You can see charges have been moved by reading *.usage_in_bytes or
memory.stat of both A and B.
See 8.2 of Documentation/cgroups/memory.txt to see what value should be
written to move_charge_at_immigrate.
9.10 Memory thresholds
The memory controller implements memory thresholds using the cgroups
notification API. You can use Documentation/cgroups/cgroup_event_listener.c
to test it.
(Shell-A) Create cgroup and run event listener
# mkdir /cgroup/A
# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
(Shell-B) Add task to cgroup and try to allocate and free memory
# echo $$ >/cgroup/A/tasks
# a="$(dd if=/dev/zero bs=1M count=10)"
# a=
You will see a message from cgroup_event_listener every time you cross
a threshold.
Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
It's a good idea to test the root cgroup as well.
...@@ -182,6 +182,8 @@ list. ...@@ -182,6 +182,8 @@ list.
NOTE: Reclaim does not work for the root cgroup, since we cannot set any NOTE: Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup. limits on the root cgroup.
Note2: When panic_on_oom is set to "2", the whole system will panic.
2. Locking 2. Locking
The memory controller uses the following hierarchy The memory controller uses the following hierarchy
...@@ -262,10 +264,12 @@ some of the pages cached in the cgroup (page cache pages). ...@@ -262,10 +264,12 @@ some of the pages cached in the cgroup (page cache pages).
4.2 Task migration 4.2 Task migration
When a task migrates from one cgroup to another, it's charge is not When a task migrates from one cgroup to another, it's charge is not
carried forward. The pages allocated from the original cgroup still carried forward by default. The pages allocated from the original cgroup still
remain charged to it, the charge is dropped when the page is freed or remain charged to it, the charge is dropped when the page is freed or
reclaimed. reclaimed.
Note: You can move charges of a task along with task migration. See 8.
4.3 Removing a cgroup 4.3 Removing a cgroup
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
...@@ -336,7 +340,7 @@ Note: ...@@ -336,7 +340,7 @@ Note:
5.3 swappiness 5.3 swappiness
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
Following cgroups' swapiness can't be changed. Following cgroups' swappiness can't be changed.
- root cgroup (uses /proc/sys/vm/swappiness). - root cgroup (uses /proc/sys/vm/swappiness).
- a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and it has child cgroup.
- a cgroup which uses hierarchy and not the root of hierarchy. - a cgroup which uses hierarchy and not the root of hierarchy.
...@@ -377,7 +381,8 @@ The feature can be disabled by ...@@ -377,7 +381,8 @@ The feature can be disabled by
NOTE1: Enabling/disabling will fail if the cgroup already has other NOTE1: Enabling/disabling will fail if the cgroup already has other
cgroups created below it. cgroups created below it.
NOTE2: This feature can be enabled/disabled per subtree. NOTE2: When panic_on_oom is set to "2", the whole system will panic in
case of an oom event in any cgroup.
7. Soft limits 7. Soft limits
...@@ -414,7 +419,76 @@ NOTE1: Soft limits take effect over a long period of time, since they involve ...@@ -414,7 +419,76 @@ NOTE1: Soft limits take effect over a long period of time, since they involve
NOTE2: It is recommended to set the soft limit always below the hard limit, NOTE2: It is recommended to set the soft limit always below the hard limit,
otherwise the hard limit will take precedence. otherwise the hard limit will take precedence.
8. TODO 8. Move charges at task migration
Users can move charges associated with a task along with task migration, that
is, uncharge the task's pages from the old cgroup and charge them to the new cgroup.
This feature is not supported in !CONFIG_MMU environments because of the lack
of page tables.
8.1 Interface
This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.
If you want to enable it:
# echo (some positive value) > memory.move_charge_at_immigrate
Note: Each bit of move_charge_at_immigrate has its own meaning about what type
of charges should be moved. See 8.2 for details.
Note: Charges are moved only when you move mm->owner, in other words, a leader
of a thread group.
Note: If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space.
Note: It can take several seconds if you move charges of the order of gigabytes.
And if you want to disable it again:
# echo 0 > memory.move_charge_at_immigrate
8.2 Types of charges which can be moved
Each bit of move_charge_at_immigrate has its own meaning about what type of
charges should be moved.
bit | what type of charges would be moved ?
-----+------------------------------------------------------------------------
  0  | A charge of an anonymous page (or swap of it) used by the target task.
     | Those pages and swaps must be used only by the target task. You must
     | enable Swap Extension (see 2.4) to enable the move of swap charges.
Note: Those pages and swaps must be charged to the old cgroup.
Note: More types of pages (e.g. file cache, shmem) will be supported by other
      bits in the future.
8.3 TODO
- Add support for other types of pages(e.g. file cache, shmem, etc.).
- Implement madvise(2) to let users decide the vma to be moved or not to be
moved.
- All charge-moving operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex for too long, so we may need some trick.
9. Memory thresholds
The memory controller implements memory thresholds using the cgroups
notification API (see cgroups.txt). It allows you to register multiple memory
and memsw thresholds and get notifications when a threshold is crossed.
To register a threshold, an application needs to:
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.
The application will be notified through the eventfd when memory usage crosses
the threshold in either direction.
It's applicable to both root and non-root cgroups.
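As a sketch of how an application might consume several such notifications at
once (assuming the eventfds have already been registered as described above,
one per threshold), something like the following could be used:

    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Block until any of the registered threshold eventfds fires, then report
     * how many times each one was woken.  Sketch only; error handling omitted.
     */
    static void wait_for_thresholds(const int *efds, int nr)
    {
            struct pollfd pfd[8];
            uint64_t hits;
            int i, n = nr < 8 ? nr : 8;

            for (i = 0; i < n; i++) {
                    pfd[i].fd = efds[i];
                    pfd[i].events = POLLIN;
            }

            if (poll(pfd, n, -1) <= 0)      /* wait for any threshold crossing */
                    return;

            for (i = 0; i < n; i++) {
                    if ((pfd[i].revents & POLLIN) &&
                        read(pfd[i].fd, &hits, sizeof(hits)) == sizeof(hits))
                            printf("threshold %d crossed %llu time(s)\n",
                                   i, (unsigned long long)hits);
            }
    }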
10. TODO
1. Add support for accounting huge pages (as a separate controller) 1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first 2. Make per-cgroup scanner reclaim not-shared pages first
......
================
CIRCULAR BUFFERS
================
By: David Howells <dhowells@redhat.com>
Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Linux provides a number of features that can be used to implement circular
buffering. There are two sets of such features:
(1) Convenience functions for determining information about power-of-2 sized
buffers.
(2) Memory barriers for when the producer and the consumer of objects in the
buffer don't want to share a lock.
To use these facilities, as discussed below, there needs to be just one
producer and just one consumer. It is possible to handle multiple producers by
serialising them, and to handle multiple consumers by serialising them.
Contents:
(*) What is a circular buffer?
(*) Measuring power-of-2 buffers.
(*) Using memory barriers with circular buffers.
- The producer.
- The consumer.
==========================
WHAT IS A CIRCULAR BUFFER?
==========================
First of all, what is a circular buffer? A circular buffer is a buffer of
fixed, finite size into which there are two indices:
(1) A 'head' index - the point at which the producer inserts items into the
buffer.
(2) A 'tail' index - the point at which the consumer finds the next item in
the buffer.
Typically when the tail pointer is equal to the head pointer, the buffer is
empty; and the buffer is full when the head pointer is one less than the tail
pointer.
The head index is incremented when items are added, and the tail index when
items are removed. The tail index should never jump the head index, and both
indices should be wrapped to 0 when they reach the end of the buffer, thus
allowing an infinite amount of data to flow through the buffer.
Typically, items will all be of the same unit size, but this isn't strictly
required to use the techniques below. The indices can be increased by more
than 1 if multiple items or variable-sized items are to be included in the
buffer, provided that neither index overtakes the other. The implementer must
be careful, however, as a region more than one unit in size may wrap the end of
the buffer and be broken into two segments.
============================
MEASURING POWER-OF-2 BUFFERS
============================
Calculation of the occupancy or the remaining capacity of an arbitrarily sized
circular buffer would normally be a slow operation, requiring the use of a
modulus (divide) instruction. However, if the buffer is of a power-of-2 size,
then a much quicker bitwise-AND instruction can be used instead.
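As a rough illustration of the trick (this mirrors the documented semantics of
the CIRC_CNT() and CIRC_SPACE() macros introduced below, not necessarily their
exact definitions):

    #define BUF_SIZE 16     /* must be a power of 2 */

    /* Items currently in the buffer: a subtraction and a mask, no modulus. */
    static inline unsigned long buf_cnt(unsigned long head, unsigned long tail)
    {
            return (head - tail) & (BUF_SIZE - 1);
    }

    /* Free slots: one slot is always left empty so that full != empty. */
    static inline unsigned long buf_space(unsigned long head, unsigned long tail)
    {
            return buf_cnt(tail, head + 1);
    }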
Linux provides a set of macros for handling power-of-2 circular buffers. These
can be made use of by:
#include <linux/circ_buf.h>
The macros are:
(*) Measure the remaining capacity of a buffer:
CIRC_SPACE(head_index, tail_index, buffer_size);
This returns the amount of space left in the buffer[1] into which items
can be inserted.
(*) Measure the maximum consecutive immediate space in a buffer:
CIRC_SPACE_TO_END(head_index, tail_index, buffer_size);
This returns the amount of consecutive space left in the buffer[1] into
which items can be immediately inserted without having to wrap back to the
beginning of the buffer.
(*) Measure the occupancy of a buffer:
CIRC_CNT(head_index, tail_index, buffer_size);
This returns the number of items currently occupying a buffer[2].
(*) Measure the non-wrapping occupancy of a buffer:
CIRC_CNT_TO_END(head_index, tail_index, buffer_size);
This returns the number of consecutive items[2] that can be extracted from
the buffer without having to wrap back to the beginning of the buffer.
Each of these macros will nominally return a value between 0 and buffer_size-1,
however:
[1] CIRC_SPACE*() are intended to be used in the producer. To the producer
they will return a lower bound as the producer controls the head index,
but the consumer may still be depleting the buffer on another CPU and
moving the tail index.
To the consumer it will show an upper bound as the producer may be busy
depleting the space.
[2] CIRC_CNT*() are intended to be used in the consumer. To the consumer they
will return a lower bound as the consumer controls the tail index, but the
producer may still be filling the buffer on another CPU and moving the
head index.
To the producer it will show an upper bound as the consumer may be busy
emptying the buffer.
[3] To a third party, the order in which the writes to the indices by the
producer and consumer become visible cannot be guaranteed as they are
independent and may be made on different CPUs - so the result in such a
situation will merely be a guess, and may even be negative.
===========================================
USING MEMORY BARRIERS WITH CIRCULAR BUFFERS
===========================================
By using memory barriers in conjunction with circular buffers, you can avoid
the need to:
(1) use a single lock to govern access to both ends of the buffer, thus
allowing the buffer to be filled and emptied at the same time; and
(2) use atomic counter operations.
There are two sides to this: the producer that fills the buffer, and the
consumer that empties it. Only one thing should be filling a buffer at any one
time, and only one thing should be emptying a buffer at any one time, but the
two sides can operate simultaneously.
THE PRODUCER
------------
The producer will look something like this:
spin_lock(&producer_lock);
unsigned long head = buffer->head;
unsigned long tail = ACCESS_ONCE(buffer->tail);
if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
/* insert one item into the buffer */
struct item *item = buffer[head];
produce_item(item);
smp_wmb(); /* commit the item before incrementing the head */
buffer->head = (head + 1) & (buffer->size - 1);
/* wake_up() will make sure that the head is committed before
* waking anyone up */
wake_up(consumer);
}
spin_unlock(&producer_lock);
This will instruct the CPU that the contents of the new item must be written
before the head index makes it available to the consumer and then instructs the
CPU that the revised head index must be written before the consumer is woken.
Note that wake_up() doesn't have to be the exact mechanism used, but whatever
is used must guarantee a (write) memory barrier between the update of the head
index and the change of state of the consumer, if a change of state occurs.
THE CONSUMER
------------
The consumer will look something like this:
spin_lock(&consumer_lock);
unsigned long head = ACCESS_ONCE(buffer->head);
unsigned long tail = buffer->tail;
if (CIRC_CNT(head, tail, buffer->size) >= 1) {
/* read index before reading contents at that index */
smp_read_barrier_depends();
/* extract one item from the buffer */
struct item *item = buffer[tail];
consume_item(item);
smp_mb(); /* finish reading descriptor before incrementing tail */
buffer->tail = (tail + 1) & (buffer->size - 1);
}
spin_unlock(&consumer_lock);
This will instruct the CPU to make sure the index is up to date before reading
the new item, and then it shall make sure the CPU has finished reading the item
before it writes the new tail pointer, which will erase the item.
Note the use of ACCESS_ONCE() in both algorithms to read the opposition index.
This prevents the compiler from discarding and reloading its cached value -
which some compilers will do across smp_read_barrier_depends(). This isn't
strictly needed if you can be sure that the opposition index will _only_ be
used the once.
===============
FURTHER READING
===============
See also Documentation/memory-barriers.txt for a description of Linux's memory
barrier facilities.
...@@ -25,6 +25,7 @@ ...@@ -25,6 +25,7 @@
#include <linux/module.h> #include <linux/module.h>
#include <linux/moduleparam.h> #include <linux/moduleparam.h>
#include <linux/skbuff.h> #include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/timer.h> #include <linux/timer.h>
#include <linux/connector.h> #include <linux/connector.h>
......
...@@ -74,7 +74,7 @@ driver takes over the consoles vacated by the driver. Binding, on the other ...@@ -74,7 +74,7 @@ driver takes over the consoles vacated by the driver. Binding, on the other
hand, will bind the driver to the consoles that are currently occupied by a hand, will bind the driver to the consoles that are currently occupied by a
system driver. system driver.
NOTE1: Binding and binding must be selected in Kconfig. It's under: NOTE1: Binding and unbinding must be selected in Kconfig. It's under:
Device Drivers -> Character devices -> Support for binding and unbinding Device Drivers -> Character devices -> Support for binding and unbinding
console drivers console drivers
......
/*
* pcc-cpufreq.txt - PCC interface documentation
*
* Copyright (C) 2009 Red Hat, Matthew Garrett <mjg@redhat.com>
* Copyright (C) 2009 Hewlett-Packard Development Company, L.P.
* Nagananda Chumbalkar <nagananda.chumbalkar@hp.com>
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; version 2 of the License.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or NON
* INFRINGEMENT. See the GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 675 Mass Ave, Cambridge, MA 02139, USA.
*
* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
Processor Clocking Control Driver
---------------------------------
Contents:
---------
1. Introduction
1.1 PCC interface
1.1.1 Get Average Frequency
1.1.2 Set Desired Frequency
1.2 Platforms affected
2. Driver and /sys details
2.1 scaling_available_frequencies
2.2 cpuinfo_transition_latency
2.3 cpuinfo_cur_freq
2.4 related_cpus
3. Caveats
1. Introduction:
----------------
Processor Clocking Control (PCC) is an interface between the platform
firmware and OSPM. It is a mechanism for coordinating processor
performance (ie: frequency) between the platform firmware and the OS.
The PCC driver (pcc-cpufreq) allows OSPM to take advantage of the PCC
interface.
OS utilizes the PCC interface to inform platform firmware what frequency the
OS wants for a logical processor. The platform firmware attempts to achieve
the requested frequency. If the request for the target frequency could not be
satisfied by platform firmware, then it usually means that power budget
conditions are in place, and "power capping" is taking place.
1.1 PCC interface:
------------------
The complete PCC specification is available here:
http://www.acpica.org/download/Processor-Clocking-Control-v1p0.pdf
PCC relies on a shared memory region that provides a channel for communication
between the OS and platform firmware. PCC also implements a "doorbell" that
is used by the OS to inform the platform firmware that a command has been
sent.
The ACPI PCCH() method is used to discover the location of the PCC shared
memory region. The shared memory region header contains the "command" and
"status" interface. PCCH() also contains details on how to access the platform
doorbell.
The following commands are supported by the PCC interface:
* Get Average Frequency
* Set Desired Frequency
The ACPI PCCP() method is implemented for each logical processor and is
used to discover the offsets for the input and output buffers in the shared
memory region.
When PCC mode is enabled, the platform will not expose processor performance
or throttle states (_PSS, _TSS and related ACPI objects) to OSPM. Therefore,
the native P-state driver (such as acpi-cpufreq for Intel, powernow-k8 for
AMD) will not load.
However, OSPM remains in control of policy. The governor (eg: "ondemand")
computes the required performance for each processor based on server workload.
The PCC driver fills in the command interface, and the input buffer and
communicates the request to the platform firmware. The platform firmware is
responsible for delivering the requested performance.
Each PCC command is "global" in scope and can affect all the logical CPUs in
the system. Therefore, PCC is capable of performing "group" updates. With PCC
the OS is capable of getting/setting the frequency of all the logical CPUs in
the system with a single call to the BIOS.
1.1.1 Get Average Frequency:
----------------------------
This command is used by the OSPM to query the running frequency of the
processor since the last time this command was completed. The output buffer
indicates the average unhalted frequency of the logical processor expressed as
a percentage of the nominal (ie: maximum) CPU frequency. The output buffer
also signifies if the CPU frequency is limited by a power budget condition.
1.1.2 Set Desired Frequency:
----------------------------
This command is used by the OSPM to communicate to the platform firmware the
desired frequency for a logical processor. The output buffer is currently
ignored by OSPM. The next invocation of "Get Average Frequency" will inform
OSPM if the desired frequency was achieved or not.
1.2 Platforms affected:
-----------------------
The PCC driver will load on any system where the platform firmware:
* supports the PCC interface, and the associated PCCH() and PCCP() methods
* assumes responsibility for managing the hardware clocking controls in order
to deliver the requested processor performance
Currently, certain HP ProLiant platforms implement the PCC interface. On those
platforms PCC is the "default" choice.
However, it is possible to disable this interface via a BIOS setting. In
such an instance, as is also the case on platforms where the PCC interface
is not implemented, the PCC driver will silently fail to load.
2. Driver and /sys details:
---------------------------
When the driver loads, it merely prints the lowest and the highest CPU
frequencies supported by the platform firmware.
The PCC driver loads with a message such as:
pcc-cpufreq: (v1.00.00) driver loaded with frequency limits: 1600 MHz, 2933
MHz
This means that OSPM can request the CPU to run at any frequency in
between the limits (1600 MHz and 2933 MHz) specified in the message.
Internally, there is no need for the driver to convert the "target" frequency
to a corresponding P-state.
The VERSION number for the driver will be of the format v.xy.ab.
eg: 1.00.02
    ----- --
      |    |
      |    -- this will increase with bug fixes/enhancements to the driver
      |-- this is the version of the PCC specification the driver adheres to
The following is a brief discussion on some of the fields exported via the
/sys filesystem and how their values are affected by the PCC driver:
2.1 scaling_available_frequencies:
----------------------------------
scaling_available_frequencies is not created in /sys. No intermediate
frequencies need to be listed because the BIOS will try to achieve any
frequency, within limits, requested by the governor. A frequency does not have
to be strictly associated with a P-state.
2.2 cpuinfo_transition_latency:
-------------------------------
The cpuinfo_transition_latency field is 0. The PCC specification does
not include a field to expose this value currently.
2.3 cpuinfo_cur_freq:
---------------------
A) Often cpuinfo_cur_freq will show a value different from what is declared
in scaling_available_frequencies, scaling_cur_freq, or scaling_max_freq.
This is due to "turbo boost" available on recent Intel processors. If certain
conditions are met the BIOS can achieve a slightly higher speed than requested
by OSPM. An example:
scaling_cur_freq : 2933000
cpuinfo_cur_freq : 3196000
B) There is a round-off error associated with the cpuinfo_cur_freq value.
Since the driver obtains the current frequency as a "percentage" (%) of the
nominal frequency from the BIOS, sometimes, the values displayed by
scaling_cur_freq and cpuinfo_cur_freq may not match. An example:
scaling_cur_freq : 1600000
cpuinfo_cur_freq : 1583000
In this example, the nominal frequency is 2933 MHz. The driver obtains the
current frequency, cpuinfo_cur_freq, as 54% of the nominal frequency:
54% of 2933 MHz = 1583 MHz
Nominal frequency is the maximum frequency of the processor, and it usually
corresponds to the frequency of the P0 P-state.
2.4 related_cpus:
-----------------
The related_cpus field is identical to affected_cpus.
affected_cpus : 4
related_cpus : 4
Currently, the PCC driver does not evaluate _PSD. The platforms that support
PCC do not implement SW_ALL. So OSPM doesn't need to perform any coordination
to ensure that the same frequency is requested of all dependent CPUs.
3. Caveats:
-----------
The "cpufreq_stats" module in its present form cannot be loaded and
expected to work with the PCC driver. Since the "cpufreq_stats" module
provides information wrt each P-state, it is not applicable to the PCC driver.
...@@ -122,3 +122,47 @@ volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16 ...@@ -122,3 +122,47 @@ volumeGroup-base: 0 2097152 snapshot-merge 254:11 254:12 P 16
brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real
brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow brw------- 1 root root 254, 12 29 ago 18:16 /dev/mapper/volumeGroup-base-cow
brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base brw------- 1 root root 254, 10 29 ago 18:16 /dev/mapper/volumeGroup-base
How to determine when a merging is complete
===========================================
The snapshot-merge and snapshot status lines end with:
<sectors_allocated>/<total_sectors> <metadata_sectors>
Both <sectors_allocated> and <total_sectors> include both data and metadata.
During merging, the number of sectors allocated gets smaller and
smaller. Merging has finished when the number of sectors holding data
is zero, in other words <sectors_allocated> == <metadata_sectors>.
Here is a practical example (using a hybrid of lvm and dmsetup commands):
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup owi-a- 4.00g
snap volumeGroup swi-a- 1.00g base 18.97
# dmsetup status volumeGroup-snap
0 8388608 snapshot 397896/2097152 1560
^^^^ metadata sectors
# lvconvert --merge -b volumeGroup/snap
Merging of volume snap started.
# lvs volumeGroup/snap
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup Owi-a- 4.00g 17.23
# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 281688/2097152 1104
# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 180480/2097152 712
# dmsetup status volumeGroup-base
0 8388608 snapshot-merge 16/2097152 16
Merging has finished.
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
base volumeGroup owi-a- 4.00g
...@@ -192,7 +192,7 @@ command line. This will execute all matching early_param() callbacks. ...@@ -192,7 +192,7 @@ command line. This will execute all matching early_param() callbacks.
User specified early platform devices will be registered at this point. User specified early platform devices will be registered at this point.
For the early serial console case the user can specify port on the For the early serial console case the user can specify port on the
kernel command line as "earlyprintk=serial.0" where "earlyprintk" is kernel command line as "earlyprintk=serial.0" where "earlyprintk" is
the class string, "serial" is the name of the platfrom driver and the class string, "serial" is the name of the platform driver and
0 is the platform device id. If the id is -1 then the dot and the 0 is the platform device id. If the id is -1 then the dot and the
id can be omitted. id can be omitted.
......
...@@ -171,7 +171,7 @@ device. ...@@ -171,7 +171,7 @@ device.
virtual_root.force_probe : virtual_root.force_probe :
Force the probing code to probe EISA slots even when it cannot find an Force the probing code to probe EISA slots even when it cannot find an
EISA compliant mainboard (nothing appears on slot 0). Defaultd to 0 EISA compliant mainboard (nothing appears on slot 0). Defaults to 0
(don't force), and set to 1 (force probing) when either (don't force), and set to 1 (force probing) when either
CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set. CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set.
......
...@@ -216,26 +216,14 @@ Works. Use "Insert file..." or external editor. ...@@ -216,26 +216,14 @@ Works. Use "Insert file..." or external editor.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gmail (Web GUI) Gmail (Web GUI)
If you just have to use Gmail to send patches, it CAN be made to work. It Does not work for sending patches.
requires a bit of external help, though.
Gmail web client converts tabs to spaces automatically.
The first problem is that Gmail converts tabs to spaces. This will
totally break your patches. To prevent this, you have to use a different At the same time it wraps lines every 78 chars with CRLF style line breaks
editor. There is a firefox extension called "ViewSourceWith" although tab2space problem can be solved with external editor.
(https://addons.mozilla.org/en-US/firefox/addon/394) which allows you to
edit any text box in the editor of your choice. Configure it to launch Another problem is that Gmail will base64-encode any message that has a
your favorite editor. When you want to send a patch, use this technique. non-ASCII character. That includes things like European names.
Once you have crafted your messsage + patch, save and exit the editor,
which should reload the Gmail edit box. GMAIL WILL PRESERVE THE TABS.
Hoorah. Apparently you can cut-n-paste literal tabs, but Gmail will
convert those to spaces upon sending!
The second problem is that Gmail converts tabs to spaces on replies. If
you reply to a patch, don't expect to be able to apply it as a patch.
The last problem is that Gmail will base64-encode any message that has a
non-ASCII character. That includes things like European names. Be aware.
Gmail is not convenient for lkml patches, but CAN be made to work.
### ###
The lkdtm module provides an interface to crash or injure the kernel at
predefined crashpoints to evaluate the reliability of crash dumps obtained
using different dumping solutions. The module uses KPROBEs to instrument
crashing points, but can also crash the kernel directly without KPROBE
support.
You can select the crash point and action either through module arguments when inserting
the module, or through a debugfs interface.
Usage: insmod lkdtm.ko [recur_count={>0}] cpoint_name=<> cpoint_type=<>
[cpoint_count={>0}]
recur_count : Recursion level for the stack overflow test. Default is 10.
cpoint_name : Crash point where the kernel is to be crashed. It can be
one of INT_HARDWARE_ENTRY, INT_HW_IRQ_EN, INT_TASKLET_ENTRY,
FS_DEVRW, MEM_SWAPOUT, TIMERADD, SCSI_DISPATCH_CMD,
IDE_CORE_CP, DIRECT
cpoint_type : Indicates the action to be taken on hitting the crash point.
It can be one of PANIC, BUG, EXCEPTION, LOOP, OVERFLOW,
CORRUPT_STACK, UNALIGNED_LOAD_STORE_WRITE, OVERWRITE_ALLOCATION,
WRITE_AFTER_FREE,
cpoint_count : Indicates the number of times the crash point is to be hit
to trigger an action. The default is 10.
You can also induce failures by mounting debugfs and writing the type to
<mountpoint>/provoke-crash/<crashpoint>. E.g.,
mount -t debugfs debugfs /mnt
echo EXCEPTION > /mnt/provoke-crash/INT_HARDWARE_ENTRY
A special file is `DIRECT' which will induce the crash directly without
KPROBE instrumentation. This mode is the only one available when the module
is built on a kernel without KPROBEs support.
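As a purely illustrative invocation (assuming the module has been built for the
running kernel), the following inserts lkdtm so that the fifth call through the
SCSI dispatch path triggers a BUG:
	insmod lkdtm.ko cpoint_name=SCSI_DISPATCH_CMD cpoint_type=BUG cpoint_count=5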
What is efifb?
===============
This is a generic EFI platform driver for Intel based Apple computers.
efifb is only for EFI booted Intel Macs.
Supported Hardware
==================
iMac 17"/20"
Macbook
Macbook Pro 15"/17"
MacMini
How to use it?
==============
efifb does not have any kind of autodetection of your machine.
You have to add the following kernel parameters in your elilo.conf:
Macbook :
video=efifb:macbook
MacMini :
video=efifb:mini
Macbook Pro 15", iMac 17" :
video=efifb:i17
Macbook Pro 17", iMac 20" :
video=efifb:i20
--
Edgar Hucek <gimli@dark-green.com>
What is imacfb?
===============
This is a generic EFI platform driver for Intel based Apple computers.
Imacfb is only for EFI booted Intel Macs.
Supported Hardware
==================
iMac 17"/20"
Macbook
Macbook Pro 15"/17"
MacMini
How to use it?
==============
Imacfb does not have any kind of autodetection of your machine.
You have to add the following kernel parameters in your elilo.conf:
Macbook :
video=imacfb:macbook
MacMini :
video=imacfb:mini
Macbook Pro 15", iMac 17" :
video=imacfb:i17
Macbook Pro 17", iMac 20" :
video=imacfb:i20
--
Edgar Hucek <gimli@dark-green.com>
...@@ -449,12 +449,6 @@ Who: Alok N Kataria <akataria@vmware.com> ...@@ -449,12 +449,6 @@ Who: Alok N Kataria <akataria@vmware.com>
---------------------------- ----------------------------
What: adt7473 hardware monitoring driver
When: February 2010
Why: Obsoleted by the adt7475 driver.
Who: Jean Delvare <khali@linux-fr.org>
---------------------------
What: Support for lcd_switch and display_get in asus-laptop driver What: Support for lcd_switch and display_get in asus-laptop driver
When: March 2010 When: March 2010
Why: These two features use non-standard interfaces. There are the Why: These two features use non-standard interfaces. There are the
...@@ -556,3 +550,42 @@ Why: udev fully replaces this special file system that only contains CAPI ...@@ -556,3 +550,42 @@ Why: udev fully replaces this special file system that only contains CAPI
NCCI TTY device nodes. User space (pppdcapiplugin) works without NCCI TTY device nodes. User space (pppdcapiplugin) works without
noticing the difference. noticing the difference.
Who: Jan Kiszka <jan.kiszka@web.de> Who: Jan Kiszka <jan.kiszka@web.de>
----------------------------
What: KVM memory aliases support
When: July 2010
Why: Memory aliasing support is used for speeding up guest vga access
through the vga windows.
Modern userspace no longer uses this feature, so it's just bitrotted
code and can be removed with no impact.
Who: Avi Kivity <avi@redhat.com>
----------------------------
What: KVM kernel-allocated memory slots
When: July 2010
Why: Since 2.6.25, kvm supports user-allocated memory slots, which are
much more flexible than kernel-allocated slots. All current userspace
supports the newer interface and this code can be removed with no
impact.
Who: Avi Kivity <avi@redhat.com>
----------------------------
What: KVM paravirt mmu host support
When: January 2011
Why: The paravirt mmu host support is slower than non-paravirt mmu, both
on newer and older hardware. It is already not exposed to the guest,
and kept only for live migration purposes.
Who: Avi Kivity <avi@redhat.com>
----------------------------
What: "acpi=ht" boot option
When: 2.6.35
Why: Useful in 2003, implementation is a hack.
Generally invoked by accident today.
Seen as doing more harm than good.
Who: Len Brown <len.brown@intel.com>
...@@ -16,6 +16,8 @@ befs.txt ...@@ -16,6 +16,8 @@ befs.txt
- information about the BeOS filesystem for Linux. - information about the BeOS filesystem for Linux.
bfs.txt bfs.txt
- info for the SCO UnixWare Boot Filesystem (BFS). - info for the SCO UnixWare Boot Filesystem (BFS).
ceph.txt
- info for the Ceph Distributed File System
cifs.txt cifs.txt
- description of the CIFS filesystem. - description of the CIFS filesystem.
coda.txt coda.txt
...@@ -32,6 +34,8 @@ dlmfs.txt ...@@ -32,6 +34,8 @@ dlmfs.txt
- info on the userspace interface to the OCFS2 DLM. - info on the userspace interface to the OCFS2 DLM.
dnotify.txt dnotify.txt
- info about directory notification in Linux. - info about directory notification in Linux.
dnotify_test.c
- example program for dnotify
ecryptfs.txt ecryptfs.txt
- docs on eCryptfs: stacked cryptographic filesystem for Linux. - docs on eCryptfs: stacked cryptographic filesystem for Linux.
exofs.txt exofs.txt
...@@ -62,6 +66,8 @@ jfs.txt ...@@ -62,6 +66,8 @@ jfs.txt
- info and mount options for the JFS filesystem. - info and mount options for the JFS filesystem.
locks.txt locks.txt
- info on file locking implementations, flock() vs. fcntl(), etc. - info on file locking implementations, flock() vs. fcntl(), etc.
logfs.txt
- info on the LogFS flash filesystem.
mandatory-locking.txt mandatory-locking.txt
- info on the Linux implementation of Sys V mandatory file locking. - info on the Linux implementation of Sys V mandatory file locking.
ncpfs.txt ncpfs.txt
......
...@@ -37,6 +37,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9) ...@@ -37,6 +37,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9)
mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
For server running on QEMU host with virtio transport:
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
where mount_tag is the tag associated by the server to each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
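As an illustration (the virtio device index and the tag "share0" are made up;
your server assigns the actual tag), discovering the tag and mounting might
look like:
	cat /sys/bus/virtio/drivers/9pnet_virtio/virtio0/mount_tag
	share0
	mount -t 9p -o trans=virtio share0 /mnt/9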
OPTIONS OPTIONS
======= =======
...@@ -47,7 +56,7 @@ OPTIONS ...@@ -47,7 +56,7 @@ OPTIONS
fd - used passed file descriptors for connection fd - used passed file descriptors for connection
(see rfdno and wfdno) (see rfdno and wfdno)
virtio - connect to the next virtio channel available virtio - connect to the next virtio channel available
(from lguest or KVM with trans_virtio module) (from QEMU with trans_virtio module)
rdma - connect to a specified RDMA channel rdma - connect to a specified RDMA channel
uname=name user name to attempt mount as on the remote server. The uname=name user name to attempt mount as on the remote server. The
...@@ -85,7 +94,12 @@ OPTIONS ...@@ -85,7 +94,12 @@ OPTIONS
port=n port to connect to on the remote server port=n port to connect to on the remote server
noextend force legacy mode (no 9p2000.u semantics) noextend force legacy mode (no 9p2000.u or 9p2000.L semantics)
version=name Select 9P protocol version. Valid options are:
9p2000 - Legacy mode (same as noextend)
9p2000.u - Use 9P2000.u protocol
9p2000.L - Use 9P2000.L protocol
dfltuid attempt to mount as a particular uid dfltuid attempt to mount as a particular uid
......
...@@ -460,13 +460,6 @@ in sys_read() and friends. ...@@ -460,13 +460,6 @@ in sys_read() and friends.
--------------------------- dquot_operations ------------------------------- --------------------------- dquot_operations -------------------------------
prototypes: prototypes:
int (*initialize) (struct inode *, int);
int (*drop) (struct inode *);
int (*alloc_space) (struct inode *, qsize_t, int);
int (*alloc_inode) (const struct inode *, unsigned long);
int (*free_space) (struct inode *, qsize_t);
int (*free_inode) (const struct inode *, unsigned long);
int (*transfer) (struct inode *, struct iattr *);
int (*write_dquot) (struct dquot *); int (*write_dquot) (struct dquot *);
int (*acquire_dquot) (struct dquot *); int (*acquire_dquot) (struct dquot *);
int (*release_dquot) (struct dquot *); int (*release_dquot) (struct dquot *);
...@@ -479,13 +472,6 @@ a proper locking wrt the filesystem and call the generic quota operations. ...@@ -479,13 +472,6 @@ a proper locking wrt the filesystem and call the generic quota operations.
What filesystem should expect from the generic quota functions: What filesystem should expect from the generic quota functions:
FS recursion Held locks when called FS recursion Held locks when called
initialize: yes maybe dqonoff_sem
drop: yes -
alloc_space: ->mark_dirty() -
alloc_inode: ->mark_dirty() -
free_space: ->mark_dirty() -
free_inode: ->mark_dirty() -
transfer: yes -
write_dquot: yes dqonoff_sem or dqptr_sem write_dquot: yes dqonoff_sem or dqptr_sem
acquire_dquot: yes dqonoff_sem or dqptr_sem acquire_dquot: yes dqonoff_sem or dqptr_sem
release_dquot: yes dqonoff_sem or dqptr_sem release_dquot: yes dqonoff_sem or dqptr_sem
...@@ -495,10 +481,6 @@ write_info: yes dqonoff_sem ...@@ -495,10 +481,6 @@ write_info: yes dqonoff_sem
FS recursion means calling ->quota_read() and ->quota_write() from superblock FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations. operations.
->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called
only directly by the filesystem and do not call any fs functions only
the ->mark_dirty() operation.
More details about quota locking can be found in fs/dquot.c. More details about quota locking can be found in fs/dquot.c.
--------------------------- vm_operations_struct ----------------------------- --------------------------- vm_operations_struct -----------------------------
......
# kbuild trick to avoid linker error. Can be omitted if a module is built.
obj- := dummy.o
# List of programs to build
hostprogs-y := dnotify_test
# Tell kbuild to always build the programs
always := $(hostprogs-y)
Ceph Distributed File System
============================
Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.
Basic features include:
* POSIX semantics
* Seamless scaling from 1 to many thousands of nodes
* High availability and reliability. No single point of failure.
* N-way replication of data across storage nodes
* Fast recovery from node failures
* Automatic rebalancing of data on node addition/removal
* Easy deployment: most FS components are userspace daemons
Also,
* Flexible snapshots (on any directory)
* Recursive accounting (nested files, directories, bytes)
In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons. Storage nodes
utilize btrfs to store data objects, leveraging its advanced features
(checksumming, metadata replication, etc.). File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.
Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.
The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."
Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
Ceph also provides some recursive accounting on directories for nested
files and bytes. That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.
Mount Syntax
============
The basic mount syntax is:
# mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4,
# mount -t ceph 1.2.3.4:/ /mnt/ceph
is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address.
Mount Options
=============
ip=A.B.C.D[:N]
Specify the IP and/or port the client should bind to locally.
There is normally not much reason to do this. If the IP is not
specified, the client's IP address is determined by looking at the
address its connection to the monitor originates from.
wsize=X
Specify the maximum write size in bytes. By default there is no
maximum. Ceph will normally size writes based on the file stripe
size.
rsize=X
Specify the maximum readahead.
mount_timeout=X
Specify the timeout value for mount (in seconds), in the case
of a non-responsive Ceph file system. The default is 30
seconds.
rbytes
When stat() is called on a directory, set st_size to 'rbytes',
the summation of file sizes over all files nested beneath that
directory. This is the default.
norbytes
When stat() is called on a directory, set st_size to the
number of entries in that directory.
nocrc
Disable CRC32C calculation for data writes. If set, the storage node
must rely on TCP's error correction to detect data corruption
in the data payload.
noasyncreaddir
Disable the client's use of its local cache to satisfy readdir
requests. (This does not change correctness; the client uses
cached metadata only when a lease or capability ensures it is
valid.)
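Combining the mount syntax above with a few of these options (the monitor
address and the option values are only examples):
	# mount -t ceph 1.2.3.4:/ /mnt/ceph -o rsize=131072,norbytes,nocrc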
More Information
================
For more information on Ceph, see the home page at
http://ceph.newdream.net/
The Linux kernel client source tree is available at
git://ceph.newdream.net/git/ceph-client.git
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at
git://ceph.newdream.net/git/ceph.git
...@@ -62,38 +62,9 @@ disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. ...@@ -62,38 +62,9 @@ disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL.
Example Example
------- -------
See Documentation/filesystems/dnotify_test.c for an example.
#define _GNU_SOURCE /* needed to get the defines */ NOTE
#include <fcntl.h> /* in glibc 2.2 this has the needed ----
values defined */ Beginning with Linux 2.6.13, dnotify has been replaced by inotify.
#include <signal.h> See Documentation/filesystems/inotify.txt for more information on it.
#include <stdio.h>
#include <unistd.h>
static volatile int event_fd;
static void handler(int sig, siginfo_t *si, void *data)
{
event_fd = si->si_fd;
}
int main(void)
{
struct sigaction act;
int fd;
act.sa_sigaction = handler;
sigemptyset(&act.sa_mask);
act.sa_flags = SA_SIGINFO;
sigaction(SIGRTMIN + 1, &act, NULL);
fd = open(".", O_RDONLY);
fcntl(fd, F_SETSIG, SIGRTMIN + 1);
fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT);
/* we will now be notified if any of the files
in "." is modified or new files are created */
while (1) {
pause();
printf("Got event on fd=%d\n", event_fd);
}
}
The LogFS Flash Filesystem
==========================
Specification
=============
Superblocks
-----------
Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.
Superblock locations may differ for MTD and block devices. On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096 Byte-block contains the second
superblock.
For the most part, the superblocks can be considered read-only. They
are written only to correct errors detected within the superblocks,
move the journal and change the filesystem parameters through tunefs.
As a result, the superblock does not contain any fields that require
constant updates, like the amount of free space, etc.
Segments
--------
The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS. Within each segment,
writes happen from front (low addresses) to back (high addresses). If
only a partial segment has been written, the segment number, the
current position within and optionally a write buffer are stored in
the journal.
Segments are erased as a whole. Therefore Garbage Collection may be
required to completely free a segment before doing so.
Journal
--------
The journal contains all global information about the filesystem that
is subject to frequent change. At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.
Object Store
------------
All space except for the superblocks and journal is part of the object
store. Each segment contains a segment header and a number of
objects, each consisting of the object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.
Levels
------
Garbage collection (GC) may fail if all data is written
indiscriminately. One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
respectively. Inode file data is on level 6 for the inodes and 7-11
for indirect blocks.
Each segment contains objects of a single level only. As a result,
each level requires its own separate segment to be open for writing.
Inode File
----------
All inodes are stored in a special file, the inode file. The single
exception is the inode file's inode (master inode), which for obvious
reasons is stored in the journal instead. Instead of data blocks, the
leaf nodes of the inode files are inodes.
Aliases
-------
Writes in LogFS are done by means of a wandering tree. A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed. Such an implementation would not be very efficient.
In LogFS, the block pointer changes are cached in the journal by means
of alias entries. Each alias consists of its logical address - inode
number, block index, level and child number (index into block) - and
the changed data. Any 8-byte word can be changed in this manner.
Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.
Segment Aliases
---------------
Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple. GC will later empty this
segment and the alias can be removed again. This is used on MTD only.
Vim
---
By cleverly predicting the lifetime of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later. Each type of distinct life expectancy (vim) can
have a separate segment open for writing. Each (level, vim) tuple can
be open just once. If an open segment with unknown vim is encountered
at mount time, it is closed and ignored henceforth.
Indirect Tree
-------------
Inodes in LogFS are similar to FFS-style filesystems with direct and
indirect block pointers. One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.
Another difference is the addressing of indirect blocks. In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode. In ext2 (maybe
others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 is using arithmetic to better utilize space, LogFS keeps
arithmetic simple and uses compression to save space.
Compression
-----------
Both file data and metadata can be compressed. Compression for file
data can be enabled with chattr +c and disabled with chattr -c. Doing
so has no effect on existing data, but new data will be stored
accordingly. New inodes will inherit the compression flag of the
parent directory.
Metadata is always compressed. However, the space accounting ignores
this and charges for the uncompressed size. Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously. Even on a 100% full medium, GC may
not consume any extra space, so the compression gains are lost space
to the user.
However, they are not lost space to the filesystem internals. By
cheating the user for those bytes, the filesystem gained some slack
space and GC will run less often and faster.
Garbage Collection and Wear Leveling
------------------------------------
Garbage collection is invoked whenever the number of free segments
falls below a threshold. The best (known) candidate is picked based
on the least amount of valid data contained in the segment. All
remaining valid data is copied elsewhere, thereby invalidating it.
The GC code also checks for aliases and writes then back if their
number gets too large.
Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection. If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked. Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.
Values for "occasionally", "significantly lower" are compile time
constants.
Hashed directories
------------------
To satisfy efficient lookup(), directory entries are hashed and
located based on the hash. In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used. For each table, the hash value
modulo the table size gives the table index.
Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively. So the first
table contains 16 entries, the second 512-16, etc.
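As a purely illustrative example of the lookup rule above, assume a directory
entry whose name hashes to 0x12345 (74565 decimal). It is looked up at index
74565 % 16 = 5 in the first table and, if not found there, at index
74565 % 496 = 165 in the second table (496 = 512 - 16), and so on through the
larger tables.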
The last table is special in several ways. First its size depends on
the effective 32bit limit on telldir/seekdir cookies. Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31. Secondly the table contains hash buckets with 16
entries each.
Using single-entry buckets would result in birthday "attacks". At
just 2^16 used entries, hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries. So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.
With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1). So
there may be a security concern if a malicious user has write access
to a directory.
Open For Discussion
===================
Device Address Space
--------------------
A device address space is used for caching. Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.
Meta Inodes
-----------
Inodes are stored in the inode file, which is just a regular file for
most purposes. At umount time, however, the inode file needs to
remain open until all dirty inodes are written. So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either. Same
goes for mapping inode of the device address space.
Currently logfs uses a hack that essentially copies part of fs/inode.c
code over. A general solution would be preferred.
Indirect block mapping
----------------------
With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks. Some other place is required. Currently
logfs uses the top half of each inode's address space. The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.
One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB. But files >16TB would cause problems anyway, so
only the limit has changed.
...@@ -17,8 +17,7 @@ kernels must turn 4.1 on or off *before* turning support for version 4 ...@@ -17,8 +17,7 @@ kernels must turn 4.1 on or off *before* turning support for version 4
on or off; rpc.nfsd does this correctly.) on or off; rpc.nfsd does this correctly.)
The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based
on the latest NFSv4.1 Internet Draft: on RFC 5661.
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29
From the many new features in NFSv4.1 the current implementation From the many new features in NFSv4.1 the current implementation
focuses on the mandatory-to-implement NFSv4.1 Sessions, providing focuses on the mandatory-to-implement NFSv4.1 Sessions, providing
...@@ -44,7 +43,7 @@ interoperability problems with future clients. Known issues: ...@@ -44,7 +43,7 @@ interoperability problems with future clients. Known issues:
trunking, but this is a mandatory feature, and its use is trunking, but this is a mandatory feature, and its use is
recommended to clients in a number of places. (E.g. to ensure recommended to clients in a number of places. (E.g. to ensure
timely renewal in case an existing connection's retry timeouts timely renewal in case an existing connection's retry timeouts
have gotten too long; see section 8.3 of the draft.) have gotten too long; see section 8.3 of the RFC.)
Therefore, lack of this feature may cause future clients to Therefore, lack of this feature may cause future clients to
fail. fail.
- Incomplete backchannel support: incomplete backchannel gss - Incomplete backchannel support: incomplete backchannel gss
......
...@@ -164,6 +164,7 @@ read the file /proc/PID/status: ...@@ -164,6 +164,7 @@ read the file /proc/PID/status:
VmExe: 68 kB VmExe: 68 kB
VmLib: 1412 kB VmLib: 1412 kB
VmPTE: 20 kb VmPTE: 20 kb
VmSwap: 0 kB
Threads: 1 Threads: 1
SigQ: 0/28578 SigQ: 0/28578
SigPnd: 0000000000000000 SigPnd: 0000000000000000
...@@ -188,7 +189,13 @@ memory usage. Its seven fields are explained in Table 1-3. The stat file ...@@ -188,7 +189,13 @@ memory usage. Its seven fields are explained in Table 1-3. The stat file
contains details information about the process itself. Its fields are contains details information about the process itself. Its fields are
explained in Table 1-4. explained in Table 1-4.
Table 1-2: Contents of the statm files (as of 2.6.30-rc7) (for SMP CONFIG users)
To make accounting scalable, RSS-related information is handled in an
asynchronous manner and the value may not be very precise. To get a precise
snapshot of a given moment, you can look at the /proc/<pid>/smaps file and
scan the page table. It is slow but very precise.
Table 1-2: Contents of the status files (as of 2.6.30-rc7)
.............................................................................. ..............................................................................
Field Content Field Content
Name filename of the executable Name filename of the executable
...@@ -213,6 +220,7 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7) ...@@ -213,6 +220,7 @@ Table 1-2: Contents of the statm files (as of 2.6.30-rc7)
VmExe size of text segment VmExe size of text segment
VmLib size of shared library code VmLib size of shared library code
VmPTE size of page table entries VmPTE size of page table entries
VmSwap size of swap usage (the number of referred swapents)
Threads number of threads Threads number of threads
SigQ number of signals queued/max. number for queue SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread SigPnd bitmap of pending signals for the thread
...@@ -430,6 +438,7 @@ Table 1-5: Kernel info in /proc ...@@ -430,6 +438,7 @@ Table 1-5: Kernel info in /proc
modules List of loaded modules modules List of loaded modules
mounts Mounted filesystems mounts Mounted filesystems
net Networking info (see text) net Networking info (see text)
pagetypeinfo Additional page allocator information (see text) (2.5)
partitions Table of partitions known to the system partitions Table of partitions known to the system
pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
decoupled by lspci (2.4) decoupled by lspci (2.4)
...@@ -584,7 +593,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ... ...@@ -584,7 +593,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ...
Node 0, zone Normal 1 0 0 1 101 8 ... Node 0, zone Normal 1 0 0 1 101 8 ...
Node 0, zone HighMem 2 0 0 1 1 0 ... Node 0, zone HighMem 2 0 0 1 1 0 ...
Memory fragmentation is a problem under some workloads, and buddyinfo is a External fragmentation is a problem under some workloads, and buddyinfo is a
useful tool for helping diagnose these problems. Buddyinfo will give you a useful tool for helping diagnose these problems. Buddyinfo will give you a
clue as to how big an area you can safely allocate, or why a previous clue as to how big an area you can safely allocate, or why a previous
allocation failed. allocation failed.
...@@ -594,6 +603,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in ...@@ -594,6 +603,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc... available in ZONE_NORMAL, etc...
More information relevant to external fragmentation can be found in
pagetypeinfo.
> cat /proc/pagetypeinfo
Page block order: 9
Pages per block: 512
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
Node 0, zone DMA 2 0 5 1 0
Node 0, zone DMA32 41 6 967 2 0
Fragmentation avoidance in the kernel works by grouping pages of different
migrate types into the same contiguous regions of memory called page blocks.
A page block is typically the size of the default hugepage, e.g. 2MB on
X86-64. By keeping pages grouped based on their ability to move, the kernel
can reclaim pages within a page block to satisfy a high-order allocation.
The pagetypeinfo begins with information on the size of a page block. It
then gives the same type of information as buddyinfo except broken down
by migrate-type and finishes with details on how many page blocks of each
type exist.
If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
make an estimate of the likely number of huge pages that can be allocated
at a given point in time. All the "Movable" blocks should be allocatable
unless memory has been mlock()'d. Some of the Reclaimable blocks should
also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
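As a rough worked example using the output above: Node 0's DMA32 zone has 967
page blocks of type Movable, and with a page block order of 9 (512 pages, i.e.
2MB on x86-64) that corresponds to roughly 967 huge pages, or about 1.9GB,
that should be allocatable as long as nothing in that zone is mlock()'d.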
.............................................................................. ..............................................................................
meminfo: meminfo:
......
...@@ -82,11 +82,13 @@ tmpfs has a mount option to set the NUMA memory allocation policy for ...@@ -82,11 +82,13 @@ tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be all files in that instance (if CONFIG_NUMA is enabled) - which can be
adjusted on the fly via 'mount -o remount ...' adjusted on the fly via 'mount -o remount ...'
mpol=default prefers to allocate memory from the local node mpol=default use the process allocation policy
(see set_mempolicy(2))
mpol=prefer:Node prefers to allocate memory from the given Node mpol=prefer:Node prefers to allocate memory from the given Node
mpol=bind:NodeList allocates memory only from nodes in NodeList mpol=bind:NodeList allocates memory only from nodes in NodeList
mpol=interleave prefers to allocate from each node in turn mpol=interleave prefers to allocate from each node in turn
mpol=interleave:NodeList allocates from each node of NodeList in turn mpol=interleave:NodeList allocates from each node of NodeList in turn
mpol=local prefers to allocate memory from the local node
NodeList format is a comma-separated list of decimal numbers and ranges, NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and a range being two hyphen-separated decimal numbers, the smallest and
...@@ -134,3 +136,5 @@ Author: ...@@ -134,3 +136,5 @@ Author:
Christoph Rohland <cr@sap.com>, 1.12.01 Christoph Rohland <cr@sap.com>, 1.12.01
Updated: Updated:
Hugh Dickins, 4 June 2007 Hugh Dickins, 4 June 2007
Updated:
KOSAKI Motohiro, 16 Mar 2010
...@@ -253,6 +253,70 @@ pin setup (e.g. controlling which pin the GPIO uses, pullup/pulldown). ...@@ -253,6 +253,70 @@ pin setup (e.g. controlling which pin the GPIO uses, pullup/pulldown).
Also note that it's your responsibility to have stopped using a GPIO Also note that it's your responsibility to have stopped using a GPIO
before you free it. before you free it.
Considering in most cases GPIOs are actually configured right after they
are claimed, three additional calls are defined:
/* request a single GPIO, with initial configuration specified by
* 'flags', identical to gpio_request() wrt other arguments and
* return value
*/
int gpio_request_one(unsigned gpio, unsigned long flags, const char *label);
/* request multiple GPIOs in a single call
*/
int gpio_request_array(struct gpio *array, size_t num);
/* release multiple GPIOs in a single call
*/
void gpio_free_array(struct gpio *array, size_t num);
where 'flags' is currently defined to specify the following properties:
* GPIOF_DIR_IN - to configure direction as input
* GPIOF_DIR_OUT - to configure direction as output
* GPIOF_INIT_LOW - as output, set initial level to LOW
* GPIOF_INIT_HIGH - as output, set initial level to HIGH
Since GPIOF_INIT_* are only valid when configured as output, the valid
combinations are grouped as:
* GPIOF_IN - configure as input
* GPIOF_OUT_INIT_LOW - configured as output, initial level LOW
* GPIOF_OUT_INIT_HIGH - configured as output, initial level HIGH
In the future, these flags can be extended to support more properties such
as open-drain status.
Furthermore, to ease the claim/release of multiple GPIOs, 'struct gpio' is
introduced to encapsulate all three fields as:
struct gpio {
unsigned gpio;
unsigned long flags;
const char *label;
};
A typical example of usage:
static struct gpio leds_gpios[] = {
{ 32, GPIOF_OUT_INIT_HIGH, "Power LED" }, /* default to ON */
{ 33, GPIOF_OUT_INIT_LOW, "Green LED" }, /* default to OFF */
{ 34, GPIOF_OUT_INIT_LOW, "Red LED" }, /* default to OFF */
{ 35, GPIOF_OUT_INIT_LOW, "Blue LED" }, /* default to OFF */
{ ... },
};
err = gpio_request_one(31, GPIOF_IN, "Reset Button");
if (err)
...
err = gpio_request_array(leds_gpios, ARRAY_SIZE(leds_gpios));
if (err)
...
gpio_free_array(leds_gpios, ARRAY_SIZE(leds_gpios));
GPIOs mapped to IRQs GPIOs mapped to IRQs
-------------------- --------------------
......
...@@ -30,7 +30,7 @@ Supported chips: ...@@ -30,7 +30,7 @@ Supported chips:
bank1_types=1,1,0,0,0,0,0,2,0,0,0,0,2,0,0,1 bank1_types=1,1,0,0,0,0,0,2,0,0,0,0,2,0,0,1
You may also need to specify the fan_sensors option for these boards You may also need to specify the fan_sensors option for these boards
fan_sensors=5 fan_sensors=5
2) There is a seperate abituguru3 driver for these motherboards, 2) There is a separate abituguru3 driver for these motherboards,
the abituguru (without the 3 !) driver will not work on these the abituguru (without the 3 !) driver will not work on these
motherboards (and visa versa)! motherboards (and visa versa)!
......
Kernel driver adt7411
=====================
Supported chips:
* Analog Devices ADT7411
Prefix: 'adt7411'
Addresses scanned: 0x48, 0x4a, 0x4b
Datasheet: Publicly available at the Analog Devices website
Author: Wolfram Sang (based on adt7470 by Darrick J. Wong)
Description
-----------
This driver implements support for the Analog Devices ADT7411 chip. There may
be other chips that implement this interface.
The ADT7411 can use an I2C/SMBus compatible 2-wire interface or an
SPI-compatible 4-wire interface. It provides a 10-bit analog to digital
converter which measures 1 temperature, vdd and 8 input voltages. It has an
internal temperature sensor, but an external one can also be connected (one
loses 2 inputs then). There are high- and low-limit registers for all inputs.
Check the datasheet for details.
sysfs-Interface
---------------
in0_input - vdd voltage input
in[1-8]_input - analog 1-8 input
temp1_input - temperature input
Besides standard interfaces, this driver adds (0 = off, 1 = on):
adc_ref_vdd - Use vdd as reference instead of 2.25 V
fast_sampling - Sample at 22.5 kHz instead of 1.4 kHz, but drop filters
no_average - Turn off averaging over 16 samples
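A minimal usage sketch, assuming the chip shows up as hwmon0 (the index, and
whether the attributes sit directly below hwmonN or below hwmonN/device,
depend on your system and kernel version):
	cat /sys/class/hwmon/hwmon0/device/in0_input
	cat /sys/class/hwmon/hwmon0/device/temp1_input
	echo 1 > /sys/class/hwmon/hwmon0/device/fast_sampling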
Notes
-----
SPI, external temperature sensor and limit registers are not supported yet.
Kernel driver adt7473
======================
Supported chips:
* Analog Devices ADT7473
Prefix: 'adt7473'
Addresses scanned: I2C 0x2C, 0x2D, 0x2E
Datasheet: Publicly available at the Analog Devices website
Author: Darrick J. Wong
This driver is deprecated; please use the adt7475 driver instead.
Description
-----------
This driver implements support for the Analog Devices ADT7473 chip family.
The ADT7473 uses the 2-wire interface compatible with the SMBUS 2.0
specification. Using an analog to digital converter it measures three (3)
temperatures and two (2) voltages. It has four (4) 16-bit counters for
measuring fan speed. There are three (3) PWM outputs that can be used
to control fan speed.
A sophisticated control system for the PWM outputs is designed into the
ADT7473 that allows fan speed to be adjusted automatically based on any of the
three temperature sensors. Each PWM output is individually adjustable and
programmable. Once configured, the ADT7473 will adjust the PWM outputs in
response to the measured temperatures without further host intervention.
This feature can also be disabled for manual control of the PWM's.
Each of the measured inputs (voltage, temperature, fan speed) has
corresponding high/low limit values. The ADT7473 will signal an ALARM if
any measured value exceeds either limit.
The ADT7473 samples all inputs continuously. The driver will not read
the registers more often than once every other second. Further,
configuration data is only read once per minute.
Special Features
----------------
The ADT7473 has a 10-bit ADC and can therefore measure temperatures
with 0.25 degC resolution. Temperature readings can be configured either
for twos complement format or "Offset 64" format, wherein 63 is subtracted
from the raw value to get the temperature value.
The Analog Devices datasheet is very detailed and describes a procedure for
determining an optimal configuration for the automatic PWM control.
Configuration Notes
-------------------
Besides standard interfaces driver adds the following:
* PWM Control
* pwm#_auto_point1_pwm and temp#_auto_point1_temp and
* pwm#_auto_point2_pwm and temp#_auto_point2_temp -
point1: Set the pwm speed at a lower temperature bound.
point2: Set the pwm speed at a higher temperature bound.
The ADT7473 will scale the pwm between the lower and higher pwm speed when
the temperature is between the two temperature boundaries. PWM values range
from 0 (off) to 255 (full speed). Fan speed will be set to maximum when the
temperature sensor associated with the PWM control exceeds temp#_max.
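As an illustrative configuration (run from the chip's hwmon sysfs directory,
assuming pwm1 is bound to temperature channel 1; the temperatures are in
millidegrees Celsius and all values are arbitrary examples), ramping pwm1 from
roughly 30% at 40 degrees C to full speed at 60 degrees C could look like:
	echo 77 > pwm1_auto_point1_pwm
	echo 40000 > temp1_auto_point1_temp
	echo 255 > pwm1_auto_point2_pwm
	echo 60000 > temp1_auto_point2_temp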
Notes
-----
The NVIDIA binary driver presents an ADT7473 chip via an on-card i2c bus.
Unfortunately, they fail to set the i2c adapter class, so this driver may
fail to find the chip until the nvidia driver is patched.
...@@ -5,31 +5,23 @@ Supported chips: ...@@ -5,31 +5,23 @@ Supported chips:
* IT8705F * IT8705F
Prefix: 'it87' Prefix: 'it87'
Addresses scanned: from Super I/O config space (8 I/O ports) Addresses scanned: from Super I/O config space (8 I/O ports)
Datasheet: Publicly available at the ITE website Datasheet: Once publicly available at the ITE website, but no longer
http://www.ite.com.tw/product_info/file/pc/IT8705F_V.0.4.1.pdf
* IT8712F * IT8712F
Prefix: 'it8712' Prefix: 'it8712'
Addresses scanned: from Super I/O config space (8 I/O ports) Addresses scanned: from Super I/O config space (8 I/O ports)
Datasheet: Publicly available at the ITE website Datasheet: Once publicly available at the ITE website, but no longer
http://www.ite.com.tw/product_info/file/pc/IT8712F_V0.9.1.pdf
http://www.ite.com.tw/product_info/file/pc/Errata%20V0.1%20for%20IT8712F%20V0.9.1.pdf
http://www.ite.com.tw/product_info/file/pc/IT8712F_V0.9.3.pdf
* IT8716F/IT8726F * IT8716F/IT8726F
Prefix: 'it8716' Prefix: 'it8716'
Addresses scanned: from Super I/O config space (8 I/O ports) Addresses scanned: from Super I/O config space (8 I/O ports)
Datasheet: Publicly available at the ITE website Datasheet: Once publicly available at the ITE website, but no longer
http://www.ite.com.tw/product_info/file/pc/IT8716F_V0.3.ZIP
http://www.ite.com.tw/product_info/file/pc/IT8726F_V0.3.pdf
* IT8718F * IT8718F
Prefix: 'it8718' Prefix: 'it8718'
Addresses scanned: from Super I/O config space (8 I/O ports) Addresses scanned: from Super I/O config space (8 I/O ports)
Datasheet: Publicly available at the ITE website Datasheet: Once publicly available at the ITE website, but no longer
http://www.ite.com.tw/product_info/file/pc/IT8718F_V0.2.zip
http://www.ite.com.tw/product_info/file/pc/IT8718F_V0%203_(for%20C%20version).zip
* IT8720F * IT8720F
Prefix: 'it8720' Prefix: 'it8720'
Addresses scanned: from Super I/O config space (8 I/O ports) Addresses scanned: from Super I/O config space (8 I/O ports)
Datasheet: Not yet publicly available. Datasheet: Not publicly available
* SiS950 [clone of IT8705F] * SiS950 [clone of IT8705F]
Prefix: 'it87' Prefix: 'it87'
Addresses scanned: from Super I/O config space (8 I/O ports) Addresses scanned: from Super I/O config space (8 I/O ports)
...@@ -136,6 +128,10 @@ registers are read whenever any data is read (unless it is less than 1.5 ...@@ -136,6 +128,10 @@ registers are read whenever any data is read (unless it is less than 1.5
seconds since the last update). This means that you can easily miss seconds since the last update). This means that you can easily miss
once-only alarms. once-only alarms.
Out-of-limit readings can also result in beeping, if the chip is properly
wired and configured. Beeping can be enabled or disabled per sensor type
(temperatures, voltages and fans.)
The IT87xx only updates its values each 1.5 seconds; reading it more often The IT87xx only updates its values each 1.5 seconds; reading it more often
will do no harm, but will return 'old' values. will do no harm, but will return 'old' values.
...@@ -150,11 +146,38 @@ Fan speed control ...@@ -150,11 +146,38 @@ Fan speed control
----------------- -----------------
The fan speed control features are limited to manual PWM mode. Automatic The fan speed control features are limited to manual PWM mode. Automatic
"Smart Guardian" mode control handling is not implemented. However "Smart Guardian" mode control handling is only implemented for older chips
if you want to go for "manual mode" just write 1 to pwmN_enable. (see below.) However if you want to go for "manual mode" just write 1 to
pwmN_enable.
If you are only able to control the fan speed with very small PWM values, If you are only able to control the fan speed with very small PWM values,
try lowering the PWM base frequency (pwm1_freq). Depending on the fan, try lowering the PWM base frequency (pwm1_freq). Depending on the fan,
it may give you a somewhat greater control range. The same frequency is it may give you a somewhat greater control range. The same frequency is
used to drive all fan outputs, which is why pwm2_freq and pwm3_freq are used to drive all fan outputs, which is why pwm2_freq and pwm3_freq are
read-only. read-only.
Automatic fan speed control (old interface)
-------------------------------------------
The driver supports the old interface to automatic fan speed control
which is implemented by IT8705F chips up to revision F and IT8712F
chips up to revision G.
This interface implements 4 temperature vs. PWM output trip points.
The PWM output of trip point 4 is always the maximum value (fan running
at full speed) while the PWM output of the other 3 trip points can be
freely chosen. The temperature of all 4 trip points can be freely chosen.
Additionally, trip point 1 has a hysteresis temperature attached, to
prevent fast switching between fan on and off.
The chip automatically computes the PWM output value based on the input
temperature, based on this simple rule: if the temperature value is
between trip point N and trip point N+1 then the PWM output value is
the one of trip point N. The automatic control mode is less flexible
than the manual control mode, but it reacts faster, is more robust and
doesn't use CPU cycles.
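To make the rule concrete, suppose (purely as an illustration) the four trip
points are set to 40, 50, 60 and 75 degrees C, with PWM values of 64, 128 and
192 for trip points 1 to 3 (trip point 4 always drives the fan at full speed).
A temperature reading of 55 degrees falls between trip points 2 and 3, so the
output is the trip point 2 value, 128; any reading above 75 degrees runs the
fan at full speed.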
Trip points must be set properly before switching to automatic fan speed
control mode. The driver will perform basic integrity checks before
actually switching to automatic control mode.
...@@ -84,6 +84,10 @@ Supported chips: ...@@ -84,6 +84,10 @@ Supported chips:
Addresses scanned: I2C 0x4c Addresses scanned: I2C 0x4c
Datasheet: Publicly available at the Maxim website Datasheet: Publicly available at the Maxim website
http://www.maxim-ic.com/quick_view2.cfm/qv_pk/3500 http://www.maxim-ic.com/quick_view2.cfm/qv_pk/3500
* Winbond/Nuvoton W83L771AWG/ASG
Prefix: 'w83l771'
Addresses scanned: I2C 0x4c
Datasheet: Not publicly available, can be requested from Nuvoton
Author: Jean Delvare <khali@linux-fr.org> Author: Jean Delvare <khali@linux-fr.org>
...@@ -147,6 +151,12 @@ MAX6680 and MAX6681: ...@@ -147,6 +151,12 @@ MAX6680 and MAX6681:
* Selectable address * Selectable address
* Remote sensor type selection * Remote sensor type selection
W83L771AWG/ASG
* The AWG and ASG variants only differ in package format.
* Filter and alert configuration register at 0xBF
* Diode ideality factor configuration (remote sensor) at 0xE3
* Moving average (depending on conversion rate)
All temperature values are given in degrees Celsius. Resolution All temperature values are given in degrees Celsius. Resolution
is 1.0 degree for the local temperature, 0.125 degree for the remote is 1.0 degree for the local temperature, 0.125 degree for the remote
temperature, except for the MAX6657, MAX6658 and MAX6659 which have a temperature, except for the MAX6657, MAX6658 and MAX6659 which have a
...@@ -163,6 +173,18 @@ The lm90 driver will not update its values more frequently than every ...@@ -163,6 +173,18 @@ The lm90 driver will not update its values more frequently than every
other second; reading them more often will do no harm, but will return other second; reading them more often will do no harm, but will return
'old' values. 'old' values.
SMBus Alert Support
-------------------
This driver has basic support for SMBus alert. When an alert is received,
the status register is read and the faulty temperature channel is logged.
The Analog Devices chips (ADM1032 and ADT7461) do not implement the SMBus
alert protocol properly so additional care is needed: the ALERT output is
disabled when an alert is received, and is re-enabled only when the alarm
is gone. Otherwise the chip would block alerts from other chips on the bus
as long as the alarm is active.
PEC Support PEC Support
----------- -----------
......
Explaining the dreaded "No init found." boot hang message
=========================================================
OK, so you've got this pretty unintuitive message (currently located
in init/main.c) and are wondering what the H*** went wrong.
Some high-level reasons for failure to load the init binary (listed roughly
in order of execution) are:
A) Unable to mount root FS
B) init binary doesn't exist on rootfs
C) broken console device
D) binary exists but dependencies not available
E) binary cannot be loaded
Detailed explanations:
0) Set "debug" kernel parameter (in bootloader config file or CONFIG_CMDLINE)
to get more detailed kernel messages.
A) make sure you have the correct root FS type
(and root= kernel parameter points to the correct partition),
required drivers such as storage hardware (such as SCSI or USB!)
and filesystem (ext3, jffs2 etc.) are builtin (alternatively as modules,
to be pre-loaded by an initrd)
C) Possibly a conflict in console= setup --> initial console unavailable.
E.g. some serial consoles are unreliable due to serial IRQ issues (e.g.
missing interrupt-based configuration).
Try using a different console= device or e.g. netconsole= .
D) e.g. required library dependencies of the init binary such as
/lib/ld-linux.so.2 missing or broken. Use readelf -d <INIT>|grep NEEDED
to find out which libraries are required.
E) make sure the binary's architecture matches your hardware.
E.g. i386 vs. x86_64 mismatch, or trying to load x86 on ARM hardware.
In case you tried loading a non-binary file here (shell script?),
you should make sure that the script specifies an interpreter in its shebang
header line (#!/...) that is fully working (including its library
dependencies). And before tackling scripts, better first test a simple
non-script binary such as /bin/sh and confirm its successful execution.
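One quick, illustrative way to check the architecture (adjust the path to
wherever your root filesystem is mounted) is to inspect the binary with
file(1); the reported machine type (e.g. x86-64 vs. ARM) must match the
target hardware:
	file /mnt/root/sbin/init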
To find out more, add code to init/main.c to display kernel_execve()'s
return values.
Please extend this explanation whenever you find new failure causes
(after all loading the init binary is a CRITICAL and hard transition step
which needs to be made as painless as possible), then submit patch to LKML.
Further TODOs:
- Implement the various run_init_process() invocations via a struct array
which can then store the kernel_execve() result value and on failure
log it all by iterating over _all_ results (very important usability fix).
- try to make the implementation itself more helpful in general,
e.g. by providing additional error messages at affected places.
Andreas Mohr <andi at lisas period de>
...@@ -68,6 +68,22 @@ like: ...@@ -68,6 +68,22 @@ like:
SYN_MT_REPORT SYN_MT_REPORT
SYN_REPORT SYN_REPORT
Here is the sequence after lifting one of the fingers:
ABS_MT_POSITION_X
ABS_MT_POSITION_Y
SYN_MT_REPORT
SYN_REPORT
And here is the sequence after lifting the remaining finger:
SYN_MT_REPORT
SYN_REPORT
If the driver reports one of BTN_TOUCH or ABS_PRESSURE in addition to the
ABS_MT events, the last SYN_MT_REPORT event may be omitted. Otherwise, the
last SYN_REPORT will be dropped by the input core, resulting in no
zero-finger event reaching userland.
Event Semantics Event Semantics
--------------- ---------------
...@@ -217,11 +233,6 @@ where examples can be found. ...@@ -217,11 +233,6 @@ where examples can be found.
difference between the contact position and the approaching tool position difference between the contact position and the approaching tool position
could be used to derive tilt. could be used to derive tilt.
[2] The list can of course be extended. [2] The list can of course be extended.
[3] The multi-touch X driver is currently in the prototyping stage. At the [3] Multitouch X driver project: http://bitmath.org/code/multitouch/.
time of writing (April 2009), the MT protocol is not yet merged, and the
prototype implements finger matching, basic mouse support and two-finger
scrolling. The project aims at improving the quality of current multi-touch
functionality available in the Synaptics X driver, and in addition
implement more advanced gestures.
[4] See the section on event computation. [4] See the section on event computation.
[5] See the section on finger tracking. [5] See the section on finger tracking.
...@@ -75,7 +75,7 @@ and the number of steps or will clamp at the maximum and zero depending on ...@@ -75,7 +75,7 @@ and the number of steps or will clamp at the maximum and zero depending on
the configuration. the configuration.
Because GPIO to IRQ mapping is platform specific, this information must Because GPIO to IRQ mapping is platform specific, this information must
be given in seperately to the driver. See the example below. be given in separately to the driver. See the example below.
---------<snip>--------- ---------<snip>---------
......
...@@ -291,6 +291,7 @@ Code Seq#(hex) Include File Comments ...@@ -291,6 +291,7 @@ Code Seq#(hex) Include File Comments
0x92 00-0F drivers/usb/mon/mon_bin.c 0x92 00-0F drivers/usb/mon/mon_bin.c
0x93 60-7F linux/auto_fs.h 0x93 60-7F linux/auto_fs.h
0x94 all fs/btrfs/ioctl.h 0x94 all fs/btrfs/ioctl.h
0x97 00-7F fs/ceph/ioctl.h Ceph file system
0x99 00-0F 537-Addinboard driver 0x99 00-0F 537-Addinboard driver
<mailto:buk@buks.ipn.de> <mailto:buk@buks.ipn.de>
0xA0 all linux/sdp/sdp.h Industrial Device Project 0xA0 all linux/sdp/sdp.h Industrial Device Project
......
...@@ -2,6 +2,12 @@ ...@@ -2,6 +2,12 @@
- This file - This file
acer-wmi.txt acer-wmi.txt
- information on the Acer Laptop WMI Extras driver. - information on the Acer Laptop WMI Extras driver.
asus-laptop.txt
- information on the Asus Laptop Extras driver.
disk-shock-protection.txt
- information on hard disk shock protection.
dslm.c
- Simple Disk Sleep Monitor program
laptop-mode.txt laptop-mode.txt
- how to conserve battery power using laptop-mode. - how to conserve battery power using laptop-mode.
sony-laptop.txt sony-laptop.txt
......
# kbuild trick to avoid linker error. Can be omitted if a module is built.
obj- := dummy.o
# List of programs to build
hostprogs-y := dslm
# Tell kbuild to always build the programs
always := $(hostprogs-y)
...@@ -6,3 +6,5 @@ hostprogs-y := ifenslave ...@@ -6,3 +6,5 @@ hostprogs-y := ifenslave
# Tell kbuild to always build the programs # Tell kbuild to always build the programs
always := $(hostprogs-y) always := $(hostprogs-y)
obj-m := timestamping/
...@@ -68,7 +68,7 @@ Compaq adapters (not tested): ...@@ -68,7 +68,7 @@ Compaq adapters (not tested):
======================= =======================
From v2.01 on, the driver is integrated in the linux kernel sources. From v2.01 on, the driver is integrated in the linux kernel sources.
Therefor, the installation is the same as for any other adapter Therefore, the installation is the same as for any other adapter
supported by the kernel. supported by the kernel.
Refer to the manual of your distribution about the installation Refer to the manual of your distribution about the installation
of network adapters. of network adapters.
......