.. SPDX-License-Identifier: GPL-2.0

=================================
NETWORK FILESYSTEM HELPER LIBRARY
=================================

.. Contents:

 - Overview.
 - Buffered read helpers.
   - Read helper functions.
   - Read helper structures.
   - Read helper operations.
   - Read helper procedure.
   - Read helper cache API.


Overview
========

The network filesystem helper library is a set of functions designed to aid a
network filesystem in implementing VM/VFS operations.  For the moment, that
just includes turning various VM buffered read operations into requests to read
from the server.  The helper library, however, can also interpose other
services, such as local caching or local data encryption.

Note that the library module doesn't link against local caching directly, so
access must be provided by the netfs.


Buffered Read Helpers
=====================

The library provides a set of read helpers that handle the ->readpage(),
->readahead() and much of the ->write_begin() VM operations and translate them
into a common call framework.

The following services are provided:

 * Handles transparent huge pages (THPs).

 * Insulates the netfs from VM interface changes.

 * Allows the netfs to arbitrarily split reads up into pieces, even ones that
   don't match page sizes or page alignments and that may cross pages.

 * Allows the netfs to expand a readahead request in both directions to meet
   its needs.

 * Allows the netfs to partially fulfil a read, which will then be resubmitted.

 * Handles local caching, allowing cached data and server-read data to be
   interleaved for a single request.

 * Handles clearing of bufferage that isn't on the server.

 * Handles retrying of reads that failed, switching reads from the cache to the
   server as necessary.

 * In the future, this is a place where other services can be performed, such
   as local encryption of data to be stored remotely or in the cache.

From the network filesystem, the helpers require a table of operations.  This
includes a mandatory method to issue a read operation along with a number of
optional methods.


Read Helper Functions
---------------------

Three read helpers are provided::

	void netfs_readahead(struct readahead_control *ractl,
			     const struct netfs_read_request_ops *ops,
			     void *netfs_priv);

	int netfs_readpage(struct file *file,
			   struct page *page,
			   const struct netfs_read_request_ops *ops,
			   void *netfs_priv);

	int netfs_write_begin(struct file *file,
			      struct address_space *mapping,
			      loff_t pos,
			      unsigned int len,
			      unsigned int flags,
			      struct page **_page,
			      void **_fsdata,
			      const struct netfs_read_request_ops *ops,
			      void *netfs_priv);

Each corresponds to a VM operation, with the addition of a couple of parameters
for the use of the read helpers:

 * ``ops``

   A table of operations through which the helpers can talk to the filesystem.

 * ``netfs_priv``

   Filesystem private data (can be NULL).

Both of these values will be stored into the read request structure.

For ->readahead() and ->readpage(), the network filesystem should just jump
into the corresponding read helper; whereas for ->write_begin(), it may be a
little more complicated as the network filesystem might want to flush
conflicting writes or track dirty data and needs to put the acquired page if an
error occurs after calling the helper.

The helpers manage the read request, calling back into the network filesystem
through the supplied table of operations.  Waits will be performed as
necessary before returning for helpers that are meant to be synchronous.

If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to
deal with it.  If some parts of the request are in progress when an error
occurs, the request will get partially completed if sufficient data is read.

Additionally, there is::

	void netfs_subreq_terminated(struct netfs_read_subrequest *subreq,
				     ssize_t transferred_or_error,
				     bool was_async);

which should be called to complete a read subrequest.  This is given the number
of bytes transferred or a negative error code, plus a flag indicating whether
the operation was asynchronous (i.e. whether the follow-on processing can be
done in the current context, given this may involve sleeping).

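The ``transferred_or_error`` convention can be modelled in userspace as a
minimal sketch.  Everything named ``demo_*`` below is hypothetical and is not
part of the library; the ``was_async`` flag is left out because it only affects
the context in which follow-on processing runs.  The sketch only shows how one
signed value carries either a byte count or a negative error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the state tracked per subrequest. */
struct demo_subreq {
	size_t len;		/* Length of the slice */
	size_t transferred;	/* Bytes read so far */
	int error;		/* 0 or a negative error code */
	bool complete;		/* True once the whole slice has been read */
};

/* Account for one termination: a byte count (>= 0) or an error (< 0). */
static void demo_subreq_terminated(struct demo_subreq *subreq,
				   ssize_t transferred_or_error)
{
	if (transferred_or_error < 0) {
		/* Negative values are error codes; no data is accounted. */
		subreq->error = (int)transferred_or_error;
		return;
	}

	subreq->transferred += (size_t)transferred_or_error;
	assert(subreq->transferred <= subreq->len);
	subreq->complete = (subreq->transferred == subreq->len);
}
```

In this model a short read leaves ``complete`` false, which corresponds to the
case where the real helpers would reissue the subrequest.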

Read Helper Structures
----------------------

The read helpers make use of a couple of structures to maintain the state of
the read.  The first is a structure that manages a read request as a whole::

	struct netfs_read_request {
		struct inode		*inode;
		struct address_space	*mapping;
		struct netfs_cache_resources cache_resources;
		void			*netfs_priv;
		loff_t			start;
		size_t			len;
		loff_t			i_size;
		const struct netfs_read_request_ops *netfs_ops;
		unsigned int		debug_id;
		...
	};

The above fields are the ones the netfs can use.  They are:

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from.  The mapping
   may or may not point to inode->i_data.

 * ``cache_resources``

   Resources for the local cache to use, if present.

 * ``netfs_priv``

   The network filesystem's private data.  The value for this can be passed in
   to the helper functions or set during the request.  The ->cleanup() op will
   be called if this is non-NULL at the end.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length.  These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``netfs_ops``

   A pointer to the operation table.  The value for this is passed into the
   helper functions.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.


The second structure is used to manage individual slices of the overall read
request::

	struct netfs_read_subrequest {
		struct netfs_read_request *rreq;
		loff_t			start;
		size_t			len;
		size_t			transferred;
		unsigned long		flags;
		unsigned short		debug_index;
		...
	};

Each subrequest is expected to access a single source, though the helpers will
handle falling back from one source type to another.  The members are:

 * ``rreq``

   A pointer to the read request.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far of the length of this slice.  The
   network filesystem or cache should start the operation this far into the
   slice.  If a short read occurs, the helpers will call again, having updated
   this to reflect the amount read so far.

 * ``flags``

   Flags pertaining to the read.  There are two of interest to the filesystem
   or cache:

   * ``NETFS_SREQ_CLEAR_TAIL``

     This can be set to indicate that the remainder of the slice, from
     transferred to len, should be cleared.

   * ``NETFS_SREQ_SEEK_DATA_READ``

     This is a hint to the cache that it might want to try skipping ahead to
     the next data (i.e. using SEEK_DATA).

 * ``debug_index``

   A number allocated to this slice that can be displayed in trace lines for
   reference.

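As a rough illustration of how ``start``, ``len`` and ``transferred`` relate,
the following userspace sketch computes where a reissued read would begin, how
much is still outstanding and how much would be cleared if the clear-tail flag
were set.  The ``demo_*`` names and the boolean standing in for
``NETFS_SREQ_CLEAR_TAIL`` are assumptions made for the example, not library
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the documented subrequest fields. */
struct demo_subreq {
	long long start;	/* File position of the slice (loff_t in the kernel) */
	size_t len;		/* Length of the slice */
	size_t transferred;	/* Bytes already read into the slice */
	bool clear_tail;	/* Models NETFS_SREQ_CLEAR_TAIL */
};

/* File position at which the next read attempt should begin. */
static long long demo_resume_pos(const struct demo_subreq *subreq)
{
	assert(subreq->transferred <= subreq->len);
	return subreq->start + (long long)subreq->transferred;
}

/* Number of bytes still outstanding in the slice. */
static size_t demo_remaining(const struct demo_subreq *subreq)
{
	assert(subreq->transferred <= subreq->len);
	return subreq->len - subreq->transferred;
}

/* Bytes to zero-fill instead of rereading: transferred..len if flagged. */
static size_t demo_tail_to_clear(const struct demo_subreq *subreq)
{
	return subreq->clear_tail ? demo_remaining(subreq) : 0;
}
```

For a slice at position 8192 of length 4096 with 1000 bytes transferred, the
next attempt would start at 9192 with 3096 bytes outstanding.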

Read Helper Operations
----------------------

The network filesystem must provide the read helpers with a table of operations
through which it can issue requests and negotiate::

	struct netfs_read_request_ops {
		void (*init_rreq)(struct netfs_read_request *rreq, struct file *file);
		bool (*is_cache_enabled)(struct inode *inode);
		int (*begin_cache_operation)(struct netfs_read_request *rreq);
		void (*expand_readahead)(struct netfs_read_request *rreq);
		bool (*clamp_length)(struct netfs_read_subrequest *subreq);
		void (*issue_op)(struct netfs_read_subrequest *subreq);
		bool (*is_still_valid)(struct netfs_read_request *rreq);
		int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
					 struct page *page, void **_fsdata);
		void (*done)(struct netfs_read_request *rreq);
		void (*cleanup)(struct address_space *mapping, void *netfs_priv);
	};

The operations are as follows:

 * ``init_rreq()``

   [Optional] This is called to initialise the request structure.  It is given
   the file for reference and can modify the ->netfs_priv value.

 * ``is_cache_enabled()``

   [Required] This is called by netfs_write_begin() to ask if the file is being
   cached.  It should return true if it is being cached and false otherwise.

 * ``begin_cache_operation()``

   [Optional] This is called to ask the network filesystem to call into the
   cache (if present) to initialise the caching state for this read.  The netfs
   library module cannot access the cache directly, so the network filesystem
   should call something like fscache_begin_read_operation() to do this.

   The cache gets to store its state in ->cache_resources and must set a table
   of operations of its own there (though of a different type).

   This should return 0 on success and an error code otherwise.  If an error is
   reported, the operation may proceed anyway, just without local caching (only
   out of memory and interruption errors cause failure here).

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead read request.  The filesystem gets to expand the request in both
   directions, though it's not permitted to reduce it as the numbers may
   represent an allocation already made.  If local caching is enabled, it gets
   to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure.  Note that if any change is made, ->len must be increased by at
   least as much as ->start is reduced.

 * ``clamp_length()``

   [Optional] This is called to allow the filesystem to reduce the size of a
   subrequest.  The filesystem can use this, for example, to chop up a request
   that has to be split across multiple servers or to put multiple reads in
   flight.

   This should return true on success and false on error.

 * ``issue_op()``

   [Required] The helpers use this to dispatch a subrequest to the server for
   reading.  In the subrequest, ->start, ->len and ->transferred indicate what
   data should be read from the server.

   There is no return value; the netfs_subreq_terminated() function should be
   called to indicate whether or not the operation succeeded and how much data
   it transferred.  The filesystem also should not deal with setting pages
   uptodate, unlocking them or dropping their refs - the helpers need to deal
   with this as they have to coordinate with copying to the local cache.

   Note that the helpers have the pages locked, but not pinned.  It is possible
   to use the ITER_XARRAY iov iterator to refer to the range of the inode that
   is being operated upon without the need to allocate large bvec tables.

 * ``is_still_valid()``

   [Optional] This is called to find out if the data just read from the local
   cache is still valid.  It should return true if it is still valid and false
   if not.  If it's not still valid, it will be reread from the server.

 * ``check_write_begin()``

   [Optional] This is called from the netfs_write_begin() helper once it has
   allocated/grabbed the page to be modified to allow the filesystem to flush
   conflicting state before allowing it to be modified.

   It should return 0 if everything is now fine, -EAGAIN if the page should be
   regrabbed and any other error code to abort the operation.

 * ``done()``

   [Optional] This is called after the pages in the request have all been
   unlocked (and marked uptodate if applicable).

 * ``cleanup()``

   [Optional] This is called as the request is being deallocated so that the
   filesystem can clean up ->netfs_priv.

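The constraint on ->expand_readahead() (that ->len must grow by at least as
much as ->start shrinks) can be illustrated with a userspace sketch that rounds
a request outwards to a fixed granule size.  The ``demo_*`` names and the
granule parameter are assumptions made for the example; a real filesystem
would apply whatever alignment its protocol or cache needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the two request fields being adjusted. */
struct demo_rreq {
	long long start;	/* loff_t in the kernel */
	size_t len;
};

/* Expand [start, start + len) outwards to multiples of the granule size. */
static void demo_expand_readahead(struct demo_rreq *rreq, size_t granule)
{
	long long g = (long long)granule;
	long long old_start = rreq->start;
	long long old_end = rreq->start + (long long)rreq->len;
	long long new_start, new_end;

	assert(granule > 0 && old_start >= 0);

	new_start = old_start - (old_start % g);	/* Round down */
	new_end = ((old_end + g - 1) / g) * g;		/* Round up */

	rreq->start = new_start;
	rreq->len = (size_t)(new_end - new_start);

	/* The request only grows: len covers the old range plus the start shift. */
	assert(rreq->len >= (size_t)(old_end - new_start));
}
```

Because the end is rounded up from the original end, the new length always
covers the original range plus the amount by which the start was pulled back,
so the request is never reduced.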


Read Helper Procedure
---------------------

The read helpers work by the following general procedure:

 * Set up the request.

 * For readahead, allow the local cache and then the network filesystem to
   propose expansions to the read request.  This is then proposed to the VM.
   If the VM cannot fully perform the expansion, a partially expanded read will
   be performed, though this may not get written to the cache in its entirety.

 * Loop around slicing chunks off of the request to form subrequests:

   * If a local cache is present, it gets to do the slicing, otherwise the
     helpers just try to generate maximal slices.

   * The network filesystem gets to clamp the size of each slice if it is to be
     the source.  This allows rsize and chunking to be implemented.

   * The helpers issue a read from the cache or a read from the server or just
     clear the slice as appropriate.

   * The next slice begins at the end of the last one.

   * As slices finish being read, they terminate.

 * When all the subrequests have terminated, the subrequests are assessed and
   any that are short or have failed are reissued:

   * Failed cache requests are issued against the server instead.

   * Failed server requests just fail.

   * Short reads against either source will be reissued against that source
     provided they have transferred some more data:

     * The cache may need to skip holes that it can't do DIO from.

     * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
       end of the slice instead of reissuing.

 * Once the data is read, the pages that have been fully read/cleared:

   * Will be marked uptodate.

   * If a cache is present, will be marked with PG_fscache.

   * Will be unlocked.

 * Any pages that need writing to the cache will then have DIO writes issued.

 * Synchronous operations will wait for reading to be complete.

 * Writes to the cache will proceed asynchronously and the pages will have the
   PG_fscache mark removed when that completes.

 * The request structures will be cleaned up when everything has completed.

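The slicing step of the procedure can be sketched in userspace as a loop that
cuts a request into consecutive pieces, each clamped to a maximum read size.
This is only a model of the no-cache case: the ``demo_*`` names and the fixed
``rsize`` limit are assumptions for the example, and the real helpers also
consult the cache and issue the I/O.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SLICES 16

/* Hypothetical record of one slice of the request. */
struct demo_slice {
	long long start;	/* File position of the slice */
	size_t len;		/* Length of the slice */
};

/* Limit a slice to rsize bytes, as a clamp_length() method might. */
static size_t demo_clamp(size_t len, size_t rsize)
{
	return len < rsize ? len : rsize;
}

/*
 * Cut [start, start + len) into consecutive slices of at most rsize bytes.
 * Returns the number of slices, or -1 if the output array is too small.
 */
static int demo_slice_request(long long start, size_t len, size_t rsize,
			      struct demo_slice *out, int max)
{
	int n = 0;

	assert(rsize > 0);
	while (len > 0) {
		size_t part = demo_clamp(len, rsize);

		if (n == max)
			return -1;
		out[n].start = start;
		out[n].len = part;
		n++;

		start += (long long)part;	/* Next slice begins where this one ends */
		len -= part;
	}
	return n;
}
```

With an ``rsize`` of 4096, a 10000-byte request at position 0 yields slices of
4096, 4096 and 1808 bytes.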

Read Helper Cache API
---------------------

When implementing a local cache to be used by the read helpers, two things are
required: some way for the network filesystem to initialise the caching for a
read request and a table of operations for the helpers to call.

The network filesystem's ->begin_cache_operation() method is called to set up a
cache and this must call into the cache to do the work.  If using fscache, for
example, the network filesystem would call::

	int fscache_begin_read_operation(struct netfs_read_request *rreq,
					 struct fscache_cookie *cookie);

passing in the request pointer and the cookie corresponding to the file.

The netfs_read_request object contains a place for the cache to hang its
state::

	struct netfs_cache_resources {
		const struct netfs_cache_ops	*ops;
		void				*cache_priv;
		void				*cache_priv2;
	};

This contains an operations table pointer and two private pointers.  The
operation table looks like the following::

	struct netfs_cache_ops {
		void (*end_operation)(struct netfs_cache_resources *cres);

		void (*expand_readahead)(struct netfs_cache_resources *cres,
					 loff_t *_start, size_t *_len, loff_t i_size);

		enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq,
						       loff_t i_size);

		int (*read)(struct netfs_cache_resources *cres,
			    loff_t start_pos,
			    struct iov_iter *iter,
			    bool seek_data,
			    netfs_io_terminated_t term_func,
			    void *term_func_priv);

		int (*write)(struct netfs_cache_resources *cres,
			     loff_t start_pos,
			     struct iov_iter *iter,
			     netfs_io_terminated_t term_func,
			     void *term_func_priv);
	};

With a termination handler function pointer::

	typedef void (*netfs_io_terminated_t)(void *priv,
					      ssize_t transferred_or_error,
					      bool was_async);

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a netfs_readahead() operation to allow
   the cache to expand a request in either direction.  This allows the cache to
   size the request appropriately for the cache granularity.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request.  ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.
   It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should be
   downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache.  The start file offset is given
   along with an iterator to read to, which gives the length also.  It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``write()``

   [Required] Called to write to the cache.  The start file offset is given
   along with an iterator to write from, which gives the length also.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function.  The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

Note that these methods are passed a pointer to the cache resource structure,
not the read request structure, as they could also be used in other situations
where there isn't a read request structure, such as writing dirty data to the
cache.

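To make the role of ->prepare_read() more concrete, here is a userspace sketch
of one way a cache might pick a source for the next slice and trim the slice to
its granularity.  The cache model (one presence flag per fixed-size granule),
the ``demo_*`` types and the ``DEMO_*`` enumerators are all assumptions for the
example; they only mirror the four source values named above and say nothing
about how any real cache tracks its contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the four documented source values; numbering is arbitrary. */
enum demo_read_source {
	DEMO_FILL_WITH_ZEROES,
	DEMO_DOWNLOAD_FROM_SERVER,
	DEMO_READ_FROM_CACHE,
	DEMO_INVALID_READ,
};

/* Hypothetical mirror of the subrequest fields the method looks at. */
struct demo_subreq {
	long long start;	/* loff_t in the kernel */
	size_t len;
};

/* Hypothetical cache model: one presence flag per fixed-size granule. */
struct demo_cache {
	size_t granule;
	size_t nr_granules;
	const bool *present;
};

/* Choose a source for the slice and trim it to the end of its granule. */
static enum demo_read_source demo_prepare_read(const struct demo_cache *cache,
					       struct demo_subreq *subreq,
					       long long i_size)
{
	size_t g = cache->granule;
	size_t index, max_len;

	if (g == 0 || subreq->len == 0 || subreq->start < 0)
		return DEMO_INVALID_READ;

	/* Nothing exists beyond the end of the file, so just clear the slice. */
	if (subreq->start >= i_size)
		return DEMO_FILL_WITH_ZEROES;

	/* Reduce the slice so that it does not cross a granule boundary. */
	index = (size_t)subreq->start / g;
	max_len = g - ((size_t)subreq->start % g);
	if (subreq->len > max_len)
		subreq->len = max_len;

	if (index < cache->nr_granules && cache->present[index])
		return DEMO_READ_FROM_CACHE;
	return DEMO_DOWNLOAD_FROM_SERVER;
}
```

Trimming each slice to a single granule keeps every subrequest on one source,
which matches the expectation that a subrequest accesses a single source.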