Revision tags: v3.0-rc2, v3.0-rc1, v2.6.39, v2.6.39-rc7, v2.6.39-rc6, v2.6.39-rc5, v2.6.39-rc4, v2.6.39-rc3, v2.6.39-rc2, v2.6.39-rc1, v2.6.38, v2.6.38-rc8, v2.6.38-rc7, v2.6.38-rc6, v2.6.38-rc5, v2.6.38-rc4, v2.6.38-rc3, v2.6.38-rc2, v2.6.38-rc1, v2.6.37, v2.6.37-rc8, v2.6.37-rc7, v2.6.37-rc6, v2.6.37-rc5, v2.6.37-rc4, v2.6.37-rc3, v2.6.37-rc2, v2.6.37-rc1, v2.6.36, v2.6.36-rc8, v2.6.36-rc7, v2.6.36-rc6, v2.6.36-rc5, v2.6.36-rc4 |
|
#
ece13ac3 |
| 30-Aug-2010 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: trace event balance_dirty_pages
Useful for analyzing the dynamics of the throttling algorithms and for debugging user-reported problems.
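For orientation, a hedged sketch of the TRACE_EVENT() shape such an event takes (the field set here is illustrative only, not the actual event definition, which records many more throttling variables):

	/*
	 * Illustrative sketch -- not the real balance_dirty_pages event.
	 * Writeback trace events in include/trace/events/writeback.h
	 * follow this TRACE_EVENT() pattern.
	 */
	TRACE_EVENT(balance_dirty_pages,

		TP_PROTO(struct backing_dev_info *bdi, long pause),

		TP_ARGS(bdi, pause),

		TP_STRUCT__entry(
			__array(char,	bdi_name, 32)
			__field(long,	pause)
		),

		TP_fast_assign(
			strlcpy(__entry->bdi_name, dev_name(bdi->dev), 32);
			__entry->pause = pause;
		),

		TP_printk("bdi %s: pause=%ld",
			  __entry->bdi_name,
			  __entry->pause)
	);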
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
b48c104d |
| 02-Mar-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: trace event bdi_dirty_ratelimit
It helps understand how various throttle bandwidths are updated.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
Revision tags: v2.6.36-rc3 |
|
#
143dfe86 |
| 27-Aug-2010 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: IO-less balance_dirty_pages()
As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher thread to do background writeback IO.
RATIONALE
=========
- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled starts foreground writeback, we get N IO submitters from at least N different inodes at the same time, and end up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency; hence we seek the disk all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, resulting in large related chunks of sequential IOs being issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less.
- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
* "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the global page states)
* the user-space "App overhead" is reduced by 20%, by avoiding the cacheline pollution caused by the complex writeback code path
* "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23.
* On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s.
- IO size too small for fast arrays and too large for slow USB sticks
The write_chunk used by the current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency, because that could lead to user-perceivable stalls of more than 1 second. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on itself couples the IO size to the wait time, which makes it hard to choose a suitable IO size while keeping the wait time under control.
Now it's possible to increase the writeback chunk size proportionally to the disk bandwidth. In a simple test of 50 dd's on XFS, 1 HDD, 3GB RAM, the larger writeback size dramatically reduces the seek count to 1/10 (far beyond my expectation) and improves the write throughput by 24%.
- long block time in balance_dirty_pages() hurts desktop responsiveness
Many of us may have had the experience: it often takes a couple of seconds, or even longer, to stop a heavily writing dd/cp/tar command with Ctrl-C or "kill -9".
- IO pipeline broken by bumpy write() progress
There is a broad class of "loop {read(buf); write(buf);}" applications whose read() pipeline will be under-utilized or even come to a stop if the write()s have long latencies _or_ don't progress at a constant rate. The current threshold-based throttling inherently transfers the large low-level IO completion fluctuations to bumpy application write()s, and further deteriorates with an increasing number of dirtiers and/or bdi's.
For example, when doing 50 dd's + 1 remote rsync to an XFS partition, the rsync progresses very bumpily in the legacy kernel, and throughput is improved by 67% by this patchset (with the larger write chunk size on top, the speedup reaches 93%).
The new rate based throttling can support 1000+ dd's with excellent smoothness, low latency and low overheads.
For the above reasons, it's much better to do IO-less and low latency pauses in balance_dirty_pages().
Jan Kara, Dave Chinner and I explored a scheme that lets balance_dirty_pages() wait for enough writeback IO completions to safeguard the dirty limit. However, it was found to have two problems:
- in large NUMA systems, the per-cpu counters may have big accounting errors, leading to big throttle wait time and jitters.
- NFS may kill a large amount of unstable pages with one single COMMIT. Because the NFS server serves COMMIT with expensive fsync() IOs, it is desirable to delay and reduce the number of COMMITs. So it's not likely that such bursty IO completions can be optimized away, nor the resulting large (and tiny) stall times in IO-completion-based throttling.
So here is a pause-time-oriented approach, which tries to control the pause time in each balance_dirty_pages() invocation by controlling the number of pages dirtied before calling balance_dirty_pages(), for smooth and efficient dirty throttling:
- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times
It can control pause times at will. The default policy (in a followup patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case.
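As a rough sketch of the pause-time idea (function and variable names here are illustrative; the actual policy arrives in followup patches):

	/*
	 * Illustrative sketch: derive a bounded pause from the number of
	 * pages this task dirtied and its allowed dirtying rate
	 * (pages/sec). The 4ms and 200ms bounds mirror the goals above.
	 */
	static unsigned long compute_pause_ms(unsigned long pages_dirtied,
					      unsigned long task_ratelimit)
	{
		unsigned long pause_ms;

		if (!task_ratelimit)
			return 200;		/* clamp to the upper bound */

		pause_ms = pages_dirtied * 1000 / task_ratelimit;

		if (pause_ms < 4)		/* too short: burns CPU */
			return 0;		/* skip the pause entirely */
		if (pause_ms > 200)		/* too long: hurts latency */
			pause_ms = 200;

		return pause_ms;
	}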
BEHAVIOR CHANGE
===============
(1) dirty threshold
Users will notice that the applications get throttled once they cross the global (background + dirty)/2 = 15% threshold, and are then balanced around 17.5%. Before this patch, the behavior was to just throttle at the 20% dirtyable-memory threshold in the 1-dd case.
Since the task will be soft throttled earlier than before, it may be perceived by end users as a performance "slow down" if their application happens to dirty more than 15% of dirtyable memory.
(2) smoothness/responsiveness
Users will notice a more responsive system during heavy writeback. "killall dd" will take effect instantly.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
c8ad6206 |
| 29-Aug-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: show raw dirtied_when in trace writeback_single_inode
Save inode->dirtied_when in the raw trace output for reliable scripting, and also show the relative age in seconds in the formatted output for easy human reading.
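Sketched fragments of the event definition under that scheme (assuming the usual jiffies bookkeeping; not the verbatim patch):

	/* raw jiffies for scripts ... */
	TP_fast_assign(
		__entry->dirtied_when = inode->dirtied_when;
	),
	/* ... relative age in seconds for humans */
	TP_printk("... dirtied_when=%lu age=%lu",
		  __entry->dirtied_when,
		  (jiffies - __entry->dirtied_when) / HZ)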
CC: Jan Kara <jack@suse.cz> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
e1cbe236 |
| 06-Dec-2010 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: trace global_dirty_state
Add trace event balance_dirty_state for showing the global dirty page counts and thresholds at each global_dirty_limits() invocation. This will cover the callers throttle_vm_writeout(), over_bground_thresh() and each balance_dirty_pages() loop.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
d46db3d5 |
| 04-May-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: make writeback_control.nr_to_write straight
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a single file, but we keep abusing it for writing multiple files in writeback_sb_inodes() and its callers.
This immediately cleans things up; e.g. suddenly wbc.nr_to_write vs. work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes() it can always start with a clean zero value.
It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever quota remained in wbc->nr_to_write.
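A sketch of the resulting shape (field names follow the commit text; writeback_chunk_size() stands for the per-work chunk helper this change implies):

	/*
	 * Sketch: writeback_sb_inodes() builds a fresh writeback_control
	 * from the work item, so wbc.nr_to_write is always a full chunk
	 * (e.g. 4MB) rather than whatever quota a caller left behind.
	 */
	struct writeback_control wbc = {
		.sync_mode	= work->sync_mode,
		.for_kupdate	= work->for_kupdate,
		.for_background	= work->for_background,
		.range_cyclic	= work->range_cyclic,
		.nr_to_write	= writeback_chunk_size(work),
	};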
Acked-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
e84d0a4f |
| 23-Apr-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: trace event writeback_queue_io
Note that it adds a little overhead to account for the inodes moved/enqueued from b_dirty to b_io. The "moved" accounting may later be used to limit the number of inodes that can be moved in one shot, in order to keep spinlock hold time under control.
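A sketch of where the accounting lives (signatures approximate, not the verbatim patch):

	/*
	 * Sketch: queue_io() counts how many inodes move_expired_inodes()
	 * shifted from b_dirty to b_io, and feeds that into the trace event.
	 */
	static void queue_io(struct bdi_writeback *wb,
			     unsigned long *older_than_this)
	{
		int moved;

		list_splice_init(&wb->b_more_io, &wb->b_io);
		moved = move_expired_inodes(&wb->b_dirty, &wb->b_io,
					    older_than_this);
		trace_writeback_queue_io(wb, older_than_this, moved);
	}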
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
251d6a47 |
| 01-Dec-2010 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: trace event writeback_single_inode
It is valuable to know how the dirty inodes are iterated and their IO size.
"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"
- "state" reflects inode->i_state at the end of writeback_single_inode() - "index" reflects mapping->writeback_index after the ->writepages() call - "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode() - "wrote" is the number of pages actually written
v2: add trace event writeback_single_inode_requeue as proposed by Dave.
CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
Revision tags: v2.6.36-rc2, v2.6.36-rc1, v2.6.35, v2.6.35-rc6 |
|
#
b7a2441f |
| 21-Jul-2010 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: remove writeback_control.more_io
When wbc.more_io was first introduced, it indicated whether there was at least one superblock whose s_more_io contained more IO work. Now, with per-bdi writeback, it can be replaced with a simple b_more_io test.
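The replacement test, sketched per the commit text:

	/*
	 * Sketch: with per-bdi writeback, "is there more IO?" is simply
	 * whether anything sits on the bdi's b_more_io list.
	 */
	if (list_empty(&wb->b_more_io))
		break;		/* nothing was requeued; we are done */
	/* otherwise, loop again: some inodes still have dirty pages */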
Acked-by: Jan Kara <jack@suse.cz> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
|
#
71927e84 |
| 13-Jan-2011 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: trace wakeup event for background writeback
This tracks when balance_dirty_pages() tries to wake up the flusher thread for background writeback (if it was not started already).
Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
0e093d99 |
| 26-Oct-2010 |
Mel Gorman <mel@csn.ul.ie> |
writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
If congestion_wait() is called with no BDI congested, the caller will sleep for the full timeout, and this may be an unnecessary sleep. This patch adds a wait_iff_congested() that checks congestion and only sleeps if a BDI is congested; otherwise it calls cond_resched() to ensure the caller is not hogging the CPU longer than its quota, but does not sleep.
This is aimed at reducing some of the major desktop stalls reported during IO. For example, while kswapd is operating, it calls congestion_wait() but it could just have been reclaiming clean page cache pages with no congestion. Without this patch, it would sleep for a full timeout but after this patch, it'll just call schedule() if it has been on the CPU too long. Similar logic applies to direct reclaimers that are not making enough progress.
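A simplified sketch of the new helper (the congested-BDI counter is real infrastructure of that era; the per-zone congestion test described above is abbreviated away here):

	/*
	 * Simplified sketch of wait_iff_congested(): sleep only when a BDI
	 * is actually congested; otherwise just be polite to the scheduler.
	 */
	long wait_iff_congested(struct zone *zone, int sync, long timeout)
	{
		/* No congested BDIs at all? */
		if (atomic_read(&nr_bdi_congested[sync]) == 0) {
			cond_resched();	/* yield if our CPU quota is used up */
			return 0;	/* report zero time slept */
		}

		/* Somebody is congested: sleep as congestion_wait() would. */
		return congestion_wait(sync, timeout);
	}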
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
52bb9198 |
| 26-Oct-2010 |
Mel Gorman <mel@csn.ul.ie> |
writeback: account for time spent congestion_waited
There is strong evidence to indicate a lot of time is being spent in congestion_wait(), some of it unnecessarily. This patch adds a tracepoint for congestion_wait to record when congestion_wait() was called, how long the timeout was for and how long it actually slept.
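A sketch of the instrumented sleep (close to what such a tracepoint needs; names per this era's backing-dev code):

	/* Sketch: record the requested timeout and the time actually slept. */
	long congestion_wait(int sync, long timeout)
	{
		unsigned long start = jiffies;
		long ret;
		DEFINE_WAIT(wait);
		wait_queue_head_t *wqh = &congestion_wqh[sync];

		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
		ret = io_schedule_timeout(timeout);
		finish_wait(wqh, &wait);

		trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
						jiffies_to_usecs(jiffies - start));
		return ret;
	}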
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1b430bee |
| 26-Oct-2010 |
Wu Fengguang <fengguang.wu@intel.com> |
writeback: remove nonblocking/encountered_congestion references
This removes more dead code that was somehow missed by commit 0d99519efef (writeback: remove unused nonblocking and congestion checks). There is no behavior change except for the removal of two entries from one of the ext4 tracing interfaces.
The nonblocking checks in ->writepages are no longer used because the flusher now prefers to block on get_request_wait() rather than skip inodes on IO congestion. The latter would lead to more seeky IO.
The nonblocking checks in ->writepage are no longer used because it's redundant with the WB_SYNC_NONE check.
We no longer set ->nonblocking in VM page out and page migration, because a) it's effectively redundant with WB_SYNC_NONE in current code, and b) its old semantics of "don't get stuck on request queues" are mis-behavior: that would skip some dirty inodes on congestion and page out others, which is unfair in terms of LRU age.
Inspired by Christoph Hellwig. Thanks!
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: David Howells <dhowells@redhat.com> Cc: Sage Weil <sage@newdream.net> Cc: Steve French <sfrench@samba.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
60332023 |
| 25-Jul-2010 |
Artem Bityutskiy <Artem.Bityutskiy@nokia.com> |
writeback: add new tracepoints
Add 2 new trace points to the periodic write-back wake up case, just like we do in the 'bdi_queue_work()' function. Namely, introduce:
1. trace_writeback_wake_thread(bdi)
2. trace_writeback_wake_forker_thread(bdi)
The first event is triggered every time we wake up a bdi thread to start periodic background write-out. The second event is triggered only when the bdi thread does not exist and should be created by the forker thread.
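A sketch of the wake-up path the two events annotate (structure approximate for that era's forker-thread design):

	/* Sketch: periodic wakeup, annotated with the two new trace points. */
	if (bdi->wb.task) {
		trace_writeback_wake_thread(bdi);
		wake_up_process(bdi->wb.task);
	} else {
		/* No bdi thread yet: the forker thread must create one. */
		trace_writeback_wake_forker_thread(bdi);
		wake_up_process(default_backing_dev_info.wb.task);
	}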
This patch was suggested by Dave Chinner and Christoph Hellwig.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
96dccab1 |
| 19-Jul-2010 |
Randy Dunlap <randy.dunlap@oracle.com> |
writeback.h: needs linux/device.h
include/trace/events/writeback.h uses dev_name(), so it needs to include linux/device.h.
include/trace/events/writeback.h:12: error: implicit declaration of function 'dev_name'
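The fix is a one-line include near the top of the header:

	/* include/trace/events/writeback.h */
	#include <linux/device.h>	/* for dev_name() */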
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
Revision tags: v2.6.35-rc5 |
|
#
9e094383 |
| 06-Jul-2010 |
Dave Chinner <dchinner@redhat.com> |
writeback: Add tracing to write_cache_pages
Add a trace event to the ->writepage loop in write_cache_pages to give visibility into how the ->writepage call is changing variables within the writeback control structure. Of most interest is how wbc->nr_to_write changes from call to call, especially with filesystems that write multiple pages in ->writepage.
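The hook sits right before each ->writepage call inside the loop; roughly:

	/* Sketch: inside the write_cache_pages() loop, per the commit text. */
	trace_wbc_writepage(wbc, mapping->backing_dev_info);
	ret = (*writepage)(page, wbc, data);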
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
028c2dd1 |
| 06-Jul-2010 |
Dave Chinner <dchinner@redhat.com> |
writeback: Add tracing to balance_dirty_pages
Tracing high-level background writeback events is good, but it doesn't give the entire picture. Add visibility into write throttling to catch IO dispatched by foreground throttling of processes dirtying lots of pages.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
455b2864 |
| 06-Jul-2010 |
Dave Chinner <david@fromorbit.com> |
writeback: Initial tracing support
Trace queue/sched/exec parts of the writeback loop. This provides insight into when and why flusher threads are scheduled to run. E.g. a sync invocation leaves traces like:
sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
This also lays the foundation for adding more writeback tracing to provide deeper insight into the whole writeback path.
The original tracing code is from Jens Axboe, though this version is a rewrite as a result of the code being traced changing significantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
#
fdeb17c7 |
| 24-Nov-2020 |
Hui Su <sh_def@163.com> |
trace: fix potential dangerous pointer
The bdi_dev_name() returns a char [64], and the __entry->name is a char [32]. It may be dangerous to TP_printk("%s", __entry->name) after the strncpy().
CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201124165205.GA23937@rlk Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hui Su <sh_def@163.com> Signed-off-by: Jan Kara <jack@suse.cz>
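The safe pattern, sketched (the actual fix may use a different copy helper, e.g. a padded variant):

	/* Sketch: size the copy by the destination buffer, NUL-terminated. */
	strscpy(__entry->name, bdi_dev_name(bdi), sizeof(__entry->name));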
|
Revision tags: v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7 |
|
#
5fcd5750 |
| 29-May-2020 |
Jan Kara <jack@suse.cz> |
writeback: Drop I_DIRTY_TIME_EXPIRE
The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that an inode got there because the flush worker decided it's time to write back the dirty inode timestamps (either because we are syncing or because of age). However, we can detect this directly in __writeback_single_inode(), and there's no need for the strange propagation with the I_DIRTY_TIME_EXPIRE flag.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
|
#
f9cae926 |
| 29-May-2020 |
Jan Kara <jack@suse.cz> |
writeback: Fix sync livelock due to b_dirty_time processing
When we are processing writeback for sync(2), move_expired_inodes() didn't set any inode expiry value (older_than_this). This can result in writeback never completing if there's a steady stream of inodes added to the b_dirty_time list, as writeback rechecks the dirty lists after each writeback round for more work. Fix the problem by using the sync(2) start time as the inode expiry value when processing the b_dirty_time list, similarly to ordinarily dirtied inodes. This requires some refactoring of older_than_this handling, which simplifies the code noticeably as a bonus.
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
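A sketch of the expiry choice (field names hypothetical, for illustration only):

	/*
	 * Sketch: expire b_dirty_time entries against the sync(2) start
	 * time, so a steady stream of newly added inodes can no longer
	 * keep a sync round alive forever.
	 */
	unsigned long dirtied_before;
	int moved;

	if (work->for_sync)
		dirtied_before = work->sync_start;	/* snapshot at sync entry */
	else
		dirtied_before = jiffies - dirtytime_expire_interval * HZ;

	moved = move_expired_inodes(&wb->b_dirty_time, &wb->b_io, dirtied_before);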
|
#
8d92890b |
| 01-Jun-2020 |
NeilBrown <neilb@suse.de> |
mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead
After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However, now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")), it makes more sense to treat them as writeback pages.
So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK.
A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often, as the 'unstable' pages are no longer considered 'dirty' (there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages); this can result in small writes going to the server (tens of kilobytes rather than a megabyte), which hurts throughput. With this patch, there are fewer writes, which are each larger on average.
Where the NR_UNSTABLE_NFS count was included in statistics virtual files, the entry is retained, but the value is hard-coded as zero. Static trace points and warning printks which mentioned this counter no longer report it.
[akpm@linux-foundation.org: re-layout comment] [akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> [mm] Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
Revision tags: v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32 |
|
#
c4b4c2a7 |
| 13-Apr-2020 |
Zhiqiang Liu <liuzhiqiang26@huawei.com> |
buffer: remove useless comment and WB_REASON_FREE_MORE_MEM reason
The free_more_memory() function was completely removed in commit bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()"), so the comment and the WB_REASON_FREE_MORE_MEM reason referring to free_more_memory are no longer needed.
Fixes: bc48f001de12 ("buffer: eliminate the need to call free_more_memory() in __getblk_slow()") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
Revision tags: v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17 |
|
#
68f23b89 |
| 31-Jan-2020 |
Theodore Ts'o <tytso@mit.edu> |
memcg: fix a crash in wb_workfn when a device disappears
Without memcg, there is a one-to-one mapping between the bdi and bdi_writeback structures. In this world, things are fairly straightforward; the first thing bdi_unregister() does is to shut down the bdi_writeback structure (or wb), and part of that writeback ensures that no other work is queued against the wb, and that the wb is fully drained.
With memcg, however, there is a one-to-many relationship between the bdi and bdi_writeback structures; that is, there are multiple wb objects which can all point to a single bdi. There is a refcount which prevents the bdi object from being released (and hence, unregistered). So in theory, bdi_unregister() *should* only get called once its refcount goes to zero (bdi_put will drop the refcount, and when it is zero, release_bdi gets called, which calls bdi_unregister).
Unfortunately, del_gendisk() in block/genhd.c never got the memo about the Brave New memcg World, and calls bdi_unregister directly. It does this without informing the file system, or the memcg code, or anything else. This causes the root wb associated with the bdi to be unregistered, but none of the memcg-specific wb's are shut down. So when one of these wb's is woken up to do delayed work, it tries to dereference its wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now NULL, thanks to the bdi_unregister() called by del_gendisk(). As a result, *boom*.
Fortunately, it looks like the rest of the writeback path is perfectly happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to create a bdi_dev_name() function which can handle bdi->dev being NULL. This also allows us to bulletproof the writeback tracepoints to prevent them from dereferencing a NULL pointer and crashing the kernel if one is tracing with memcg's enabled, and an iSCSI device dies or a USB storage stick is pulled.
The most common way of triggering this will be hot-removal of a device while writeback with memcg enabled is going on. It was triggering several times a day in a heavily loaded production environment.
Google Bug Id: 145475544
Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
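The helper, sketched close to the shape the message describes (return string is a placeholder of my choosing):

	/* Sketch: NULL-safe device-name lookup for writeback tracepoints. */
	static inline const char *bdi_dev_name(struct backing_dev_info *bdi)
	{
		if (!bdi || !bdi->dev)
			return "(unknown)";	/* placeholder instead of a crash */
		return dev_name(bdi->dev);
	}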
|
Revision tags: v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4, v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12 |
|
#
40363cf1 |
| 14-Nov-2019 |
Qian Cai <cai@lca.pw> |
writeback: fix -Wformat compilation warnings
The commit f05499a06fb4 ("writeback: use ino_t for inodes in tracepoints") introduced a lot of GCC compilation warnings on s390:
In file included from ./include/trace/define_trace.h:102,
                 from ./include/trace/events/writeback.h:904,
                 from fs/fs-writeback.c:82:
./include/trace/events/writeback.h: In function 'trace_raw_output_writeback_page_template':
./include/trace/events/writeback.h:76:12: warning: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'ino_t' {aka 'unsigned int'} [-Wformat=]
  TP_printk("bdi %s: ino=%lu index=%lu",
./include/trace/trace_events.h:360:22: note: in definition of macro 'DECLARE_EVENT_CLASS'
  trace_seq_printf(s, print); \
./include/trace/events/writeback.h:76:2: note: in expansion of macro 'TP_printk'
  TP_printk("bdi %s: ino=%lu index=%lu",
Fix them by adding necessary casts where ino_t could be either "unsigned int" or "unsigned long".
Fixes: f05499a06fb4 ("writeback: use ino_t for inodes in tracepoints") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Tejun Heo <tj@kernel.org>
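The shape of the fix, per the commit text:

	/* Sketch: cast ino_t explicitly so '%lu' is correct on every arch. */
	TP_printk("bdi %s: ino=%lu index=%lu",
		  __entry->name,
		  (unsigned long)__entry->ino,
		  (unsigned long)__entry->index)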
|