#
11fb9989 |
| 28-Jul-2016 |
Mel Gorman <mgorman@techsingularity.net> |
mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accou
mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages being accounted for on the node while the total number of file pages are accounted on the zone. This can be coped with to some extent but it's confusing so this patch moves the relevant file-based accounted. Due to throttling logic in the page allocator for reliable OOM detection, it is still necessary to track dirty and writeback pages on a per-zone basis.
[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting] Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v4.4.16 |
|
#
9a46b04f |
| 26-Jul-2016 |
Brian Foster <bfoster@redhat.com> |
fs/fs-writeback.c: inode writeback list tracking tracepoints
The per-sb inode writeback list tracks inodes currently under writeback to facilitate efficient sync processing. In particular, it ensur
fs/fs-writeback.c: inode writeback list tracking tracepoints
The per-sb inode writeback list tracks inodes currently under writeback to facilitate efficient sync processing. In particular, it ensures that sync only needs to walk through a list of inodes that were cleaned by the sync.
Add a couple tracepoints to help identify when inodes are added/removed to and from the writeback lists. Piggyback off of the writeback lazytime tracepoint template as it already tracks the relevant inode information.
Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> cc: Josef Bacik <jbacik@fb.com> Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v4.7, openbmc-4.4-20160722-1, openbmc-20160722-1, openbmc-20160713-1, v4.4.15, v4.6.4, v4.6.3, v4.4.14, v4.6.2, v4.4.13, openbmc-20160606-1, v4.6.1, v4.4.12, openbmc-20160521-1, v4.4.11, openbmc-20160518-1, v4.6, v4.4.10, openbmc-20160511-1, openbmc-20160505-1, v4.4.9, v4.4.8, v4.4.7, openbmc-20160329-2, openbmc-20160329-1, openbmc-20160321-1, v4.4.6, v4.5, v4.4.5, v4.4.4 |
|
#
a664edb3 |
| 03-Mar-2016 |
Yang Shi <yang.shi@linaro.org> |
tracing, writeback: Replace cgroup path to cgroup ino
commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints print ou
tracing, writeback: Replace cgroup path to cgroup ino
commit 5634cc2aa9aebc77bc862992e7805469dcf83dac ("writeback: update writeback tracepoints to report cgroup") made writeback tracepoints print out cgroup path when CGROUP_WRITEBACK is enabled, but it may trigger the below bug on -rt kernel since kernfs_path and kernfs_path_len are called by tracepoints, which acquire spin lock that is sleepable on -rt kernel.
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:930 in_atomic(): 1, irqs_disabled(): 0, pid: 625, name: kworker/u16:3 INFO: lockdep is turned off. Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830
CPU: 7 PID: 625 Comm: kworker/u16:3 Not tainted 4.4.1-rt5 #20 Hardware name: Freescale Layerscape 2085a RDB Board (DT) Workqueue: writeback wb_workfn (flush-7:0) Call trace: [<ffffffc00008d708>] dump_backtrace+0x0/0x200 [<ffffffc00008d92c>] show_stack+0x24/0x30 [<ffffffc0007b0f40>] dump_stack+0x88/0xa8 [<ffffffc000127d74>] ___might_sleep+0x2ec/0x300 [<ffffffc000d5d550>] rt_spin_lock+0x38/0xb8 [<ffffffc0003e0548>] kernfs_path_len+0x30/0x90 [<ffffffc00036b360>] trace_event_raw_event_writeback_work_class+0xe8/0x2e8 [<ffffffc000374f90>] wb_writeback+0x620/0x830 [<ffffffc000376224>] wb_workfn+0x61c/0x950 [<ffffffc000110adc>] process_one_work+0x3ac/0xb30 [<ffffffc0001112fc>] worker_thread+0x9c/0x7a8 [<ffffffc00011a9e8>] kthread+0x190/0x1b0 [<ffffffc000086ca0>] ret_from_fork+0x10/0x30
With unlocked kernfs_* functions, synchronize_sched() has to be called in kernfs_rename which could be called in syscall path, but it is problematic. So, print out cgroup ino instead of path name, which could be converted to path name by userland.
Withouth CGROUP_WRITEBACK enabled, it just prints out root dir. But, root dir ino vary from different filesystems, so printing out -1U to indicate an invalid cgroup ino.
Link: http://lkml.kernel.org/r/1456996137-8354-1-git-send-email-yang.shi@linaro.org
Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
show more ...
|
Revision tags: v4.4.3, openbmc-20160222-1, v4.4.2, openbmc-20160212-1, openbmc-20160210-1, openbmc-20160202-2, openbmc-20160202-1, v4.4.1, openbmc-20160127-1, openbmc-20160120-1, v4.4, openbmc-20151217-1, openbmc-20151210-1, openbmc-20151202-1, openbmc-20151123-1, openbmc-20151118-1, openbmc-20151104-1, v4.3, openbmc-20151102-1, openbmc-20151028-1, v4.3-rc1, v4.2, v4.2-rc8 |
|
#
5634cc2a |
| 18-Aug-2015 |
Tejun Heo <tj@kernel.org> |
writeback: update writeback tracepoints to report cgroup
The following tracepoints are updated to report the cgroup used during cgroup writeback.
* writeback_write_inode[_start] * writeback_queue *
writeback: update writeback tracepoints to report cgroup
The following tracepoints are updated to report the cgroup used during cgroup writeback.
* writeback_write_inode[_start] * writeback_queue * writeback_exec * writeback_start * writeback_written * writeback_wait * writeback_nowork * writeback_wake_background * wbc_writepage * writeback_queue_io * bdi_dirty_ratelimit * balance_dirty_pages * writeback_sb_inodes_requeue * writeback_single_inode[_start]
Note that writeback_bdi_register is separated out from writeback_class as reporting cgroup doesn't make sense to it. Tracepoints which take bdi are updated to take bdi_writeback instead.
Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
Revision tags: v4.2-rc7, v4.2-rc6, v4.2-rc5, v4.2-rc4, v4.2-rc3, v4.2-rc2, v4.2-rc1, v4.1, v4.1-rc8, v4.1-rc7, v4.1-rc6, v4.1-rc5 |
|
#
dcc25ae7 |
| 22-May-2015 |
Tejun Heo <tj@kernel.org> |
writeback: move global_dirty_limit into wb_domain
This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each
writeback: move global_dirty_limit into wb_domain
This patch is a part of the series to define wb_domain which represents a domain that wb's (bdi_writeback's) belong to and are measured against each other in. This will enable IO backpressure propagation for cgroup writeback.
global_dirty_limit exists to regulate the global dirty threshold which is a property of the wb_domain. This patch moves hard_dirty_limit, dirty_lock, and update_time into wb_domain.
This is pure reorganization and doesn't introduce any behavioral changes.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
#
a88a341a |
| 22-May-2015 |
Tejun Heo <tj@kernel.org> |
writeback: move bandwidth related fields from backing_dev_info into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For
writeback: move bandwidth related fields from backing_dev_info into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently.
This patch moves bandwidth related fields from backing_dev_info into bdi_writeback.
* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp, write_bandwidth, avg_write_bandwidth, dirty_ratelimit, balanced_dirty_ratelimit, completions and dirty_exceeded.
* writeback_chunk_size() and over_bground_thresh() now take @wb instead of @bdi.
* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...) bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...) bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...) bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...) [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...) bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...) bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)
* Init/exits of the relocated fields are moved to bdi_wb_init/exit() respectively. Note that explicit zeroing is dropped in the process as wb's are cleared in entirety anyway.
* As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes.
v2: Typo in description fixed as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
#
aad653a0 |
| 19-May-2015 |
NeilBrown <neilb@suse.de> |
block: discard bdi_unregister() in favour of bdi_destroy()
bdi_unregister() now contains very little functionality.
It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no real conseque
block: discard bdi_unregister() in favour of bdi_destroy()
bdi_unregister() now contains very little functionality.
It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no real consequence as bdi->dev isn't needed by anything else in the function, and it triggers if blk_cleanup_queue() -> bdi_destroy() is called before bdi_unregister, which happens since Commit: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
So this isn't wanted.
It also calls bdi_set_min_ratio(). This needs to be called after writes through the bdi have all been flushed, and before the bdi is destroyed. Calling it early is better than calling it late as it frees up a global resource.
Calling it immediately after bdi_wb_shutdown() in bdi_destroy() perfectly fits these requirements.
So bdi_unregister() can be discarded with the important content moved to bdi_destroy(), as can the writeback_bdi_unregister event which is already not used.
Reported-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org (v4.0) Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info") Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
Revision tags: v4.1-rc4, v4.1-rc3, v4.1-rc2, v4.1-rc1, v4.0, v4.0-rc7, v4.0-rc6 |
|
#
91df6089 |
| 27-Mar-2015 |
Steven Rostedt (Red Hat) <rostedt@goodmis.org> |
writeback: Export enums used by tracepoint to user space
The enums used in tracepoints for __print_symbolic() do not have their values shown in the tracepoint format files and this makes it difficul
writeback: Export enums used by tracepoint to user space
The enums used in tracepoints for __print_symbolic() do not have their values shown in the tracepoint format files and this makes it difficult for user space tools to convert the binary values to the strings they are to represent.
Add TRACE_DEFINE_ENUM(x) macros to export the enum names to their values to make the tracing output from user space tools more robust.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Dave Chinner <dchinner@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
show more ...
|
Revision tags: v4.0-rc5, v4.0-rc4, v4.0-rc3, v4.0-rc2, v4.0-rc1, v3.19 |
|
#
0ae45f63 |
| 01-Feb-2015 |
Theodore Ts'o <tytso@mit.edu> |
vfs: add support for a lazytime mount option
Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of
vfs: add support for a lazytime mount option
Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory.
This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call.
For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example).
Google-Bug-Id: 18297052
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v3.19-rc7, v3.19-rc6, v3.19-rc5 |
|
#
df0ce26c |
| 14-Jan-2015 |
Christoph Hellwig <hch@lst.de> |
fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we can git rid of it easily:
- instead of using it's name for tracing unregistered bdi we j
fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we can git rid of it easily:
- instead of using it's name for tracing unregistered bdi we just use "unknown" - btrfs and ceph can just assign the default read ahead window themselves like several other filesystems already do. - we can assign noop_backing_dev_info as the default one in alloc_super. All filesystems already either assigned their own or noop_backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
#
de1414a6 |
| 14-Jan-2015 |
Christoph Hellwig <hch@lst.de> |
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a fi
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping->backing_dev_info with it to prepare for the removal of mapping->backing_dev_info.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
Revision tags: v3.19-rc4, v3.19-rc3, v3.19-rc2, v3.19-rc1, v3.18, v3.18-rc7, v3.18-rc6, v3.18-rc5, v3.18-rc4, v3.18-rc3, v3.18-rc2, v3.18-rc1, v3.17, v3.17-rc7, v3.17-rc6, v3.17-rc5, v3.17-rc4, v3.17-rc3, v3.17-rc2, v3.17-rc1, v3.16, v3.16-rc7, v3.16-rc6, v3.16-rc5, v3.16-rc4, v3.16-rc3, v3.16-rc2, v3.16-rc1, v3.15, v3.15-rc8, v3.15-rc7, v3.15-rc6, v3.15-rc5, v3.15-rc4, v3.15-rc3, v3.15-rc2, v3.15-rc1, v3.14, v3.14-rc8, v3.14-rc7, v3.14-rc6, v3.14-rc5 |
|
#
f479447a |
| 26-Feb-2014 |
Steven Rostedt (Red Hat) <rostedt@goodmis.org> |
tracing: Fix event header writeback.h to include tracepoint.h
The trace event headers are required to include tracepoint.h. The only reason they worked now is because module.h included tracepoint.h,
tracing: Fix event header writeback.h to include tracepoint.h
The trace event headers are required to include tracepoint.h. The only reason they worked now is because module.h included tracepoint.h, and that will soon change.
Link: http://lkml.kernel.org/r/20140226190644.442886305@goodmis.org
Fixes: 455b2864686d "writeback: Initial tracing support" Cc: Dave Chinner <dchinner@redhat.com> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
show more ...
|
Revision tags: v3.14-rc4 |
|
#
0dc83bd3 |
| 21-Feb-2014 |
Jan Kara <jack@suse.cz> |
Revert "writeback: do not sync data dirtied after sync start"
This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave Chinner <david@fromorbit.com> has reported the commit may cause some
Revert "writeback: do not sync data dirtied after sync start"
This reverts commit c4a391b53a72d2df4ee97f96f78c1d5971b47489. Dave Chinner <david@fromorbit.com> has reported the commit may cause some inodes to be left out from sync(2). This is because we can call redirty_tail() for some inode (which sets i_dirtied_when to current time) after sync(2) has started or similarly requeue_inode() can set i_dirtied_when to current time if writeback had to skip some pages. The real problem is in the functions clobbering i_dirtied_when but fixing that isn't trivial so revert is a safer choice for now.
CC: stable@vger.kernel.org # >= 3.13 Signed-off-by: Jan Kara <jack@suse.cz>
show more ...
|
Revision tags: v3.14-rc3, v3.14-rc2 |
|
#
774016b2 |
| 06-Feb-2014 |
Steven Whitehouse <swhiteho@redhat.com> |
GFS2: journal data writepages update
GFS2 has carried what is more or less a copy of the write_cache_pages() for some time. It seems that this copy has slipped behind the core code over time. This p
GFS2: journal data writepages update
GFS2 has carried what is more or less a copy of the write_cache_pages() for some time. It seems that this copy has slipped behind the core code over time. This patch brings it back uptodate, and in addition adds the tracepoint which would otherwise be missing.
We could go further, and eliminate some or all of the code duplication here. The issue is that if we do that, then the function we need to split out from the existing write_cache_pages(), which will look a lot like gfs2_jdata_write_pagevec(), would land up putting quite a lot of extra variables on the stack. I know that has been a problem in the past in the writeback code path, which is why I've hesitated to do it here.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
show more ...
|
Revision tags: v3.14-rc1, v3.13, v3.13-rc8, v3.13-rc7, v3.13-rc6, v3.13-rc5, v3.13-rc4, v3.13-rc3, v3.13-rc2, v3.13-rc1 |
|
#
c4a391b5 |
| 12-Nov-2013 |
Jan Kara <jack@suse.cz> |
writeback: do not sync data dirtied after sync start
When there are processes heavily creating small files while sync(2) is running, it can easily happen that quite some new files are created betwee
writeback: do not sync data dirtied after sync start
When there are processes heavily creating small files while sync(2) is running, it can easily happen that quite some new files are created between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen especially if there are several busy filesystems (remember that sync traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one fs before starting it on another fs). Because WB_SYNC_ALL pass is slow (e.g. causes a transaction commit and cache flush for each inode in ext3), resulting sync(2) times are rather large.
The following script reproduces the problem:
function run_writers { for (( i = 0; i < 10; i++ )); do mkdir $1/dir$i for (( j = 0; j < 40000; j++ )); do dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null done & done }
for dir in "$@"; do run_writers $dir done
sleep 40 time sync
Fix the problem by disregarding inodes dirtied after sync(2) was called in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes a time stamp when sync has started which is used for setting up work for flusher threads.
To give some numbers, when above script is run on two ext4 filesystems on simple SATA drive, the average sync time from 10 runs is 267.549 seconds with standard deviation 104.799426. With the patched kernel, the average sync time from 10 runs is 2.995 seconds with standard deviation 0.096.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v3.12, v3.12-rc7, v3.12-rc6, v3.12-rc5, v3.12-rc4, v3.12-rc3, v3.12-rc2, v3.12-rc1, v3.11, v3.11-rc7, v3.11-rc6, v3.11-rc5, v3.11-rc4, v3.11-rc3, v3.11-rc2, v3.11-rc1, v3.10, v3.10-rc7, v3.10-rc6, v3.10-rc5, v3.10-rc4, v3.10-rc3, v3.10-rc2, v3.10-rc1, v3.9, v3.9-rc8, v3.9-rc7, v3.9-rc6 |
|
#
839a8e86 |
| 01-Apr-2013 |
Tejun Heo <tj@kernel.org> |
writeback: replace custom worker pool implementation with unbound workqueue
Writeback implements its own worker pool - each bdi can be associated with a worker thread which is created and destroyed
writeback: replace custom worker pool implementation with unbound workqueue
Writeback implements its own worker pool - each bdi can be associated with a worker thread which is created and destroyed dynamically. The worker thread for the default bdi is always present and serves as the "forker" thread which forks off worker threads for other bdis.
there's no reason for writeback to implement its own worker pool when using unbound workqueue instead is much simpler and more efficient. This patch replaces custom worker pool implementation in writeback with an unbound workqueue.
The conversion isn't too complicated but the followings are worth mentioning.
* bdi_writeback->last_active, task and wakeup_timer are removed. delayed_work ->dwork is added instead. Explicit timer handling is no longer necessary. Everything works by either queueing / modding / flushing / canceling the delayed_work item.
* bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off bdi_writeback->dwork. On each execution, it processes bdi->work_list and reschedules itself if there are more things to do.
The function also handles low-mem condition, which used to be handled by the forker thread. If the function is running off a rescuer thread, it only writes out limited number of pages so that the rescuer can serve other bdis too. This preserves the flusher creation failure behavior of the forker thread.
* INIT_LIST_HEAD(&bdi->bdi_list) is used to tell bdi_writeback_workfn() about on-going bdi unregistration so that it always drains work_list even if it's running off the rescuer. Note that the original code was broken in this regard. Under memory pressure, a bdi could finish unregistration with non-empty work_list.
* The default bdi is no longer special. It now is treated the same as any other bdi and bdi_cap_flush_forker() is removed.
* BDI_pending is no longer used. Removed.
* Some tracepoints become non-applicable. The following TPs are removed - writeback_nothread, writeback_wake_thread, writeback_wake_forker_thread, writeback_thread_start, writeback_thread_stop.
Everything, including devices coming and going away and rescuer operation under simulated memory pressure, seems to work fine in my test setup.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com>
show more ...
|
Revision tags: v3.9-rc5, v3.9-rc4, v3.9-rc3, v3.9-rc2, v3.9-rc1, v3.8, v3.8-rc7, v3.8-rc6, v3.8-rc5, v3.8-rc4 |
|
#
9fb0a7da |
| 11-Jan-2013 |
Tejun Heo <tj@kernel.org> |
writeback: add more tracepoints
Add tracepoints for page dirtying, writeback_single_inode start, inode dirtying and writeback. For the latter two inode events, a pair of events are defined to denot
writeback: add more tracepoints
Add tracepoints for page dirtying, writeback_single_inode start, inode dirtying and writeback. For the latter two inode events, a pair of events are defined to denote start and end of the operations (the starting one has _start suffix and the one w/o suffix happens after the operation is complete). These inode ops are FS specific and can be non-trivial and having enclosing tracepoints is useful for external tracers.
This is part of tracepoint additions to improve visiblity into dirtying / writeback operations for io tracer and userland.
v2: writeback_dirty_inode[_start] TPs may be called for files on pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL before dereferencing.
v3: buffer dirtying moved to a block TP.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
Revision tags: v3.8-rc3, v3.8-rc2, v3.8-rc1, v3.7, v3.7-rc8, v3.7-rc7, v3.7-rc6, v3.7-rc5, v3.7-rc4, v3.7-rc3, v3.7-rc2, v3.7-rc1, v3.6, v3.6-rc7, v3.6-rc6, v3.6-rc5, v3.6-rc4, v3.6-rc3, v3.6-rc2, v3.6-rc1, v3.5, v3.5-rc7, v3.5-rc6, v3.5-rc5, v3.5-rc4, v3.5-rc3, v3.5-rc2, v3.5-rc1, v3.4, v3.4-rc7, v3.4-rc6 |
|
#
cc1676d9 |
| 03-May-2012 |
Jan Kara <jack@suse.cz> |
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
When writeback_single_inode() is called on inode which has I_SYNC already set while doing WB_SYNC_NONE, inode is moved to b_more_i
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
When writeback_single_inode() is called on inode which has I_SYNC already set while doing WB_SYNC_NONE, inode is moved to b_more_io list. However this makes sense only if the caller is flusher thread. For other callers of writeback_single_inode() it doesn't really make sense and may be even wrong - flusher thread may be doing WB_SYNC_ALL writeback in parallel.
So we move requeueing from writeback_single_inode() to writeback_sb_inodes().
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
show more ...
|
Revision tags: v3.4-rc5, v3.4-rc4, v3.4-rc3, v3.4-rc2, v3.4-rc1, v3.3, v3.3-rc7, v3.3-rc6, v3.3-rc5, v3.3-rc4, v3.3-rc3, v3.3-rc2 |
|
#
313162d0 |
| 30-Jan-2012 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
device.h: audit and cleanup users in main include dir
The <linux/device.h> header includes a lot of stuff, and it in turn gets a lot of use just for the basic "struct device" which appears so often.
device.h: audit and cleanup users in main include dir
The <linux/device.h> header includes a lot of stuff, and it in turn gets a lot of use just for the basic "struct device" which appears so often.
Clean up the users as follows:
1) For those headers only needing "struct device" as a pointer in fcn args, replace the include with exactly that.
2) For headers not really using anything from device.h, simply delete the include altogether.
3) For headers relying on getting device.h implicitly before being included themselves, now explicitly include device.h
4) For files in which doing #1 or #2 uncovers an implicit dependency on some other header, fix by explicitly adding the required header(s).
Any C files that were implicitly relying on device.h to be present have already been dealt with in advance.
Total removals from #1 and #2: 51. Total additions coming from #3: 9. Total other implicit dependencies from #4: 7.
As of 3.3-rc1, there were 110, so a net removal of 42 gives about a 38% reduction in device.h presence in include/*
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
show more ...
|
#
977b7e3a |
| 04-Feb-2012 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
When a SD card is hot removed without umount, del_gendisk() will call bdi_unregister() without destroying/freeing it. This leaves
writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
When a SD card is hot removed without umount, del_gendisk() will call bdi_unregister() without destroying/freeing it. This leaves the bdi in the bdi->dev = NULL, bdi->wb.task = NULL, bdi->bdi_list removed state.
When sync(2) gets the bdi before bdi_unregister() and calls bdi_queue_work() after the unregister, trace_writeback_queue will be dereferencing the NULL bdi->dev. Fix it with a simple test for NULL.
LKML-reference: http://lkml.org/lkml/2012/1/18/346 Cc: stable@kernel.org Reported-by: Rabin Vincent <rabin@rab.in> Tested-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
show more ...
|
Revision tags: v3.3-rc1 |
|
#
15eb77a0 |
| 17-Jan-2012 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: fix NULL bdi->dev in trace writeback_single_inode
bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when the tearing down the original bdi. Fix trace_writeback_single_inode to u
writeback: fix NULL bdi->dev in trace writeback_single_inode
bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when the tearing down the original bdi. Fix trace_writeback_single_inode to use sb->s_bdi=default_backing_dev_info rather than bdi->dev=NULL for a teared down bdi.
Cc: <stable@kernel.org> Reported-by: Rabin Vincent <rabin@rab.in> Tested-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
show more ...
|
Revision tags: v3.2, v3.2-rc7, v3.2-rc6, v3.2-rc5, v3.2-rc4, v3.2-rc3, v3.2-rc2, v3.2-rc1, v3.1, v3.1-rc10, v3.1-rc9, v3.1-rc8, v3.1-rc7, v3.1-rc6, v3.1-rc5, v3.1-rc4, v3.1-rc3, v3.1-rc2, v3.1-rc1, v3.0, v3.0-rc7, v3.0-rc6, v3.0-rc5, v3.0-rc4, v3.0-rc3 |
|
#
83712358 |
| 11-Jun-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: dirty ratelimit - think time compensation
Compensate the task's think time when computing the final pause time, so that ->dirty_ratelimit can be executed accurately.
think time :
writeback: dirty ratelimit - think time compensation
Compensate the task's think time when computing the final pause time, so that ->dirty_ratelimit can be executed accurately.
think time := time spend outside of balance_dirty_pages()
In the rare case that the task slept longer than the 200ms period time (result in negative pause time), the sleep time will be compensated in the following periods, too, if it's less than 1 second.
Accumulated errors are carefully avoided as long as the max pause area is not hitted.
Pseudo code:
period = pages_dirtied / task_ratelimit; think = jiffies - dirty_paused_when; pause = period - think;
1) normal case: period > think
pause = period - think dirty_paused_when = jiffies + pause nr_dirtied = 0
period time |===============================>| think time pause time |===============>|==============>| ------|----------------|---------------|------------------------ dirty_paused_when jiffies
2) no pause case: period <= think
don't pause; reduce future pause time by: dirty_paused_when += period nr_dirtied = 0
period time |===============================>| think time |===================================================>| ------|--------------------------------+-------------------|---- dirty_paused_when jiffies
Acked-by: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
show more ...
|
#
b3bba872 |
| 08-Dec-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: show writeback reason with __print_symbolic
This makes the binary trace understandable by trace-cmd.
CC: Dave Chinner <david@fromorbit.com> CC: Curt Wohlgemuth <curtw@google.com> CC: Ste
writeback: show writeback reason with __print_symbolic
This makes the binary trace understandable by trace-cmd.
CC: Dave Chinner <david@fromorbit.com> CC: Curt Wohlgemuth <curtw@google.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
show more ...
|
#
0e175a18 |
| 07-Oct-2011 |
Curt Wohlgemuth <curtw@google.com> |
writeback: Add a 'reason' to wb_writeback_work
This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enu
writeback: Add a 'reason' to wb_writeback_work
This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons.
The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events.
And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started.
Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
show more ...
|
#
ad4e38dd |
| 07-Oct-2011 |
Curt Wohlgemuth <curtw@google.com> |
writeback: send work item to queue_io, move_expired_inodes
Instead of sending ->older_than_this to queue_io() and move_expired_inodes(), send the entire wb_writeback_work structure. There are other
writeback: send work item to queue_io, move_expired_inodes
Instead of sending ->older_than_this to queue_io() and move_expired_inodes(), send the entire wb_writeback_work structure. There are other fields of a work item that are useful in these routines and in tracepoints.
Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
show more ...
|