.. SPDX-License-Identifier: GPL-2.0

================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================

Introduction
============

zonefs is a very simple file system exposing each zone of a zoned block device
as a file. Unlike a regular POSIX-compliant file system with native zoned block
device support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing sequential
write zones of the device must be written sequentially starting from the end
of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access interface
than to a full-featured POSIX file system. The goal of zonefs is to simplify
the implementation of zoned block device support in applications by replacing
raw block device file accesses with a richer file API, without relying on
direct block device file ioctls which may be more obscure to developers. One
example of this approach is the implementation of LSM (log-structured merge)
tree structures (such as used in RocksDB and LevelDB) on zoned block devices
by allowing SSTables to be stored in a zone file similarly to a regular file
system rather than as a range of sectors of the entire disk. The introduction
of the higher level construct "one file is one zone" can help reduce the
amount of changes needed in the application as well as introducing support for
different application programming languages.

Zoned block devices
-------------------

Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.

* Conventional zones: there are no access constraints to LBAs belonging to
  conventional zones. Any read or write access can be executed, similarly to a
  regular block device.
* Sequential zones: these zones accept random reads but must be written
  sequentially. Each sequential zone has a write pointer maintained by the
  device that keeps track of the mandatory start LBA position of the next write
  to the device. As a result of this write constraint, LBAs in a sequential zone
  cannot be overwritten. Sequential zones must first be erased using a special
  command (zone reset) before rewriting.

Zoned storage devices can be implemented using various recording and media
technologies. The most common form of zoned storage today uses the SCSI Zoned
Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
Magnetic Recording (SMR) HDDs.

Solid State Disk (SSD) storage devices can also implement a zoned interface
to, for instance, reduce internal write amplification due to garbage collection.
The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
committee aiming at adding a zoned storage interface to the NVMe protocol.

Zonefs Overview
===============

Zonefs exposes the zones of a zoned block device as files. The files
representing zones are grouped by zone type, which are themselves represented
by sub-directories. This file structure is built entirely using zone information
provided by the device and so does not require any complex on-disk metadata
structure.

On-disk metadata
----------------

zonefs on-disk metadata is reduced to an immutable super block which
persistently stores a magic number and optional feature flags and values. On
mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
and populates the mount point with a static file tree solely based on this
information. File sizes come from the device zone type and write pointer
position managed by the device itself.

The super block is always written on disk at sector 0. The first zone of the
device storing the super block is never exposed as a zone file by zonefs. If
the zone containing the super block is a sequential zone, the mkzonefs format
tool always "finishes" the zone, that is, it transitions the zone to a full
state to make it read-only, preventing any data write.

Zone type sub-directories
-------------------------

Files representing zones of the same type are grouped together under the same
sub-directory automatically created on mount.

For conventional zones, the sub-directory "cnv" is used. This directory is
however created if and only if the device has usable conventional zones. If
the device only has a single conventional zone at sector 0, the zone will not
be exposed as a file as it will be used to store the zonefs super block. For
such devices, the "cnv" sub-directory will not be created.

For sequential write zones, the sub-directory "seq" is used.

These two directories are the only directories that exist in zonefs. Users
cannot create other directories and cannot rename nor delete the "cnv" and
"seq" sub-directories.

The size of the directories indicated by the st_size field of struct stat,
obtained with the stat() or fstat() system calls, indicates the number of files
existing under the directory.

Zone files
----------

Zone files are named using the number of the zone they represent within the set
of zones of a particular type. That is, both the "cnv" and "seq" directories
contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.
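
Since zone file names are plain decimal numbers, a lexical sort would place
"10" before "2". The following is a minimal sketch of listing the files of one
zone type in increasing zone start sector order. The function name
list_zone_files is hypothetical, and a POSIX shell with the standard ls and
sort utilities is assumed::

    # List the zone files of one zone type directory in numeric order.
    # $1: path to the "cnv" or "seq" directory of a mounted zonefs volume
    list_zone_files() {
        ls -1 "$1" | sort -n
    }

For a volume mounted on /mnt (an assumed mount point), "list_zone_files
/mnt/seq" prints one file name per line, starting with "0".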

Read and write operations to zone files are not allowed beyond the file
maximum size, that is, beyond the zone capacity. Any access exceeding the zone
capacity is failed with the -EFBIG error.

Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.

The number of blocks of a file as reported by stat() and fstat() indicates the
capacity of the zone file, or in other words, the maximum file size.
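
As an illustration of the block count described above, the sketch below
derives the capacity of a zone file from its block count and compares it with
the current file size. The function names are hypothetical, and the GNU
coreutils stat command with its %s (size in bytes), %b (number of blocks) and
%B (bytes per reported block) format sequences is assumed::

    # Print "<file size in bytes> <capacity in bytes>" for a zone file.
    # $1: path to a zone file
    zone_file_usage() {
        out=$(stat -c '%s %b %B' "$1") || return 1
        # Split the three numbers into the positional parameters.
        set -- $out
        echo "$1 $(( $2 * $3 ))"
    }

    # Print the integer percentage of the zone capacity already used.
    # $1: file size in bytes, $2: capacity in bytes (must not be 0)
    zone_fill_percent() {
        echo $(( $1 * 100 / $2 ))
    }

For a sequential zone file, the first number follows the zone write pointer,
while the second stays constant for the life of the mount.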

Conventional zone files
-----------------------

The size of conventional zone files is fixed to the size of the zone they
represent. Conventional zone files cannot be truncated.

These files can be randomly read and written using any type of I/O operation:
buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There is no I/O
constraint for these files beyond the file size limit mentioned above.

Sequential zone files
---------------------

The size of sequential zone files grouped in the "seq" sub-directory represents
the file's zone write pointer position relative to the zone start sector.

Sequential zone files can only be written sequentially, starting from the file
end, that is, write operations can only be append writes. Zonefs makes no
attempt at accepting random writes and will fail any write request that has a
start offset not corresponding to the end of the file, or to the end of the last
write issued and still in-flight (for asynchronous I/O operations).
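
The append-only rule means that a writer must start at the current file size.
A minimal sketch of computing that start position as a dd block offset is
shown below. The function name append_seek_blocks is hypothetical, and the GNU
coreutils stat command is assumed::

    # Print the dd "seek" value (in blocks) that starts a write exactly at
    # the current end of a sequential zone file.
    # $1: path to a zone file, $2: block size in bytes
    append_seek_blocks() {
        size=$(stat -c '%s' "$1") || return 1
        if [ $(( size % $2 )) -ne 0 ]; then
            echo "file size $size is not a multiple of $2" >&2
            return 1
        fi
        echo $(( size / $2 ))
    }

The result can be passed to dd, for example "dd if=data.bin of=/mnt/seq/0
bs=4096 seek=N conv=notrunc oflag=direct", where N is the printed value and
data.bin and /mnt are assumed names.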

Since dirty page writeback by the page cache does not guarantee a sequential
write pattern, zonefs prevents buffered writes and writeable shared mappings
on sequential files. Only direct I/O writes are accepted for these files.
zonefs relies on the sequential delivery of write I/O requests to the device
implemented by the block layer elevator. An elevator implementing the sequential
write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature)
must be used. This type of elevator (e.g. mq-deadline) is set by default
for zoned block devices on device initialization.
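
To check which elevator a device currently uses, the block layer sysfs
attribute /sys/block/<dev>/queue/scheduler can be read; the active entry is
the one enclosed in square brackets. The following sketch extracts it. The
function name active_scheduler is hypothetical::

    # Print the active I/O scheduler from a sysfs "scheduler" attribute,
    # where the active entry is enclosed in square brackets.
    # $1: path such as /sys/block/<dev>/queue/scheduler
    active_scheduler() {
        sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
    }

For a device using the default elevator mentioned above, the expected result
is "mq-deadline".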

There are no restrictions on the type of I/O used for read operations in
sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
all accepted.

Truncating sequential zone files is allowed only down to 0, in which case, the
zone is reset to rewind the file zone write pointer position to the start of
the zone, or up to the zone capacity, in which case the file's zone is
transitioned to the FULL state (finish zone operation).
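
The two permitted truncate operations can be sketched with the coreutils
truncate command as follows. The function names zone_reset and zone_finish are
hypothetical, and the zone capacity is taken from the block count reported by
GNU stat, as described in the "Zone files" section::

    # Reset a sequential zone file: the file size goes back to 0.
    # $1: path to a sequential zone file
    zone_reset() {
        truncate -s 0 "$1"
    }

    # Finish a sequential zone file: the file size is set to the zone
    # capacity (number of blocks times bytes per block).
    # $1: path to a sequential zone file
    zone_finish() {
        out=$(stat -c '%b %B' "$1") || return 1
        file=$1
        set -- $out
        truncate -s $(( $1 * $2 )) "$file"
    }

Any other target size is expected to be rejected by zonefs for sequential zone
files.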

Format options
--------------

Several optional features of zonefs can be enabled at format time.

* Conventional zone aggregation: ranges of contiguous conventional zones can be
  aggregated into a single larger file instead of the default one file per zone.
* File ownership: The owner UID and GID of zone files are by default 0 (root)
  but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

IO error handling
-----------------

Zoned block devices may fail I/O requests for reasons similar to regular block
devices, e.g. due to bad sectors. However, in addition to such known I/O
failure patterns, the standards governing zoned block devices behavior define
additional conditions that result in I/O errors.

* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
  While the data already written in the zone is still readable, the zone can
  no longer be written. No user action on the zone (zone management command or
  read/write access) can change the zone condition back to a normal read/write
  state. While the reasons for the device to transition a zone to read-only
  state are not defined by the standards, a typical cause for such transition
  would be a defective write head on an HDD (all zones under this head are
  changed to read-only).

* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
  An offline zone cannot be read nor written. No user action can transition an
  offline zone back to an operational good state. Similarly to zone read-only
  transitions, the reasons for a drive to transition a zone to the offline
  condition are undefined. A typical cause would be a defective read-write head
  on an HDD causing all zones on the platter under the broken head to be
  inaccessible.

* Unaligned write errors: These errors result from the host issuing write
  requests with a start sector that does not correspond to a zone write pointer
  position when the write request is executed by the device. Even though zonefs
  enforces sequential file write for sequential zones, unaligned write errors
  may still happen in the case of a partial failure of a very large direct I/O
  operation split into multiple BIOs/requests or asynchronous I/O operations.
  If one of the write requests within the set of sequential write requests
  issued to the device fails, all write requests queued after it will
  become unaligned and fail.

* Delayed write errors: similarly to regular block devices, if the device side
  write cache is enabled, write errors may occur in ranges of previously
  completed writes when the device write cache is flushed, e.g. on fsync().
  Similarly to the previous immediate unaligned write error case, delayed write
  errors can propagate through a stream of cached sequential data for a zone
  causing all data to be dropped after the sector that caused the error.

All I/O errors detected by zonefs are notified to the user with an error code
return for the system call that triggered or detected the error. The recovery
actions taken by zonefs in response to I/O errors depend on the I/O type (read
vs write) and on the reason for the error (bad sector, unaligned writes or zone
condition change).

* For read I/O errors, zonefs does not execute any particular recovery action,
  but only if the file zone is still in a good condition and there is no
  inconsistency between the file inode size and its zone write pointer position.
  If a problem is detected, I/O error recovery is executed (see the table
  below).

* For write I/O errors, zonefs I/O error recovery is always executed.

* A zone condition change to read-only or offline also always triggers zonefs
  I/O error recovery.

Zonefs minimal I/O error recovery may change a file size and file access
permissions.

* File size changes:
  Immediate or delayed write errors in a sequential zone file may cause the file
  inode size to be inconsistent with the amount of data successfully written in
  the file zone. For instance, the partial failure of a multi-BIO large write
  operation will cause the zone write pointer to advance partially, even though
  the entire write operation will be reported as failed to the user. In such a
  case, the file inode size must be advanced to reflect the zone write pointer
  change and eventually allow the user to restart writing at the end of the
  file.
  A file size may also be reduced to reflect a delayed write error detected on
  fsync(): in this case, the amount of data effectively written in the zone may
  be less than originally indicated by the file inode size. After such an I/O
  error, zonefs always fixes the file inode size to reflect the amount of data
  persistently stored in the file zone.

* Access permission changes:
  A zone condition change to read-only is indicated with a change in the file
  access permissions to render the file read-only. This disables changes to the
  file attributes and data modification. For offline zones, all permissions
  (read and write) to the file are disabled.

Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions::

    +--------------+-----------+-----------------------------------------+
    |              |           |            Post error state             |
    | "errors=xxx" |  device   |                 access permissions      |
    |    mount     |   zone    | file         file          device zone  |
    |    option    | condition | size     read    write    read    write |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | remount-ro   | read-only | as is    yes     no       yes     no    |
    | (default)    | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | zone-ro      | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      |   0      no      no       yes     yes   |
    | zone-offline | read-only |   0      no      no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     yes      yes     yes   |
    | repair       | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+

Further notes:

* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
  error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
  permissions to read-only applies to all files. The file system is remounted
  read-only.
* Access permission and file size changes due to the device transitioning zones
  to the offline condition are permanent. Remounting or reformatting the device
  with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
  state.
* File access permission changes to read-only due to the device transitioning
  zones to the read-only condition are permanent. Remounting or reformatting
  the device will not re-enable file write access.
* File access permission changes implied by the remount-ro, zone-ro and
  zone-offline mount options are temporary for zones in a good condition.
  Unmounting and remounting the file system will restore the previous default
  (format time values) access rights to the files affected.
* The repair mount option triggers only the minimal set of I/O error recovery
  actions, that is, file size fixes for zones in a good condition. Zones
  indicated as being read-only or offline by the device still imply changes to
  the zone file access permissions as noted in the table above.
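
The file-level columns of the table above (file size, file read and file write
access) can be restated as a small lookup function. This is only an
illustrative sketch of the table, not code used by zonefs; the function name
post_error_state is hypothetical and "as-is" stands for the "as is" entry::

    # Print "<file size> <file read> <file write>" after I/O error recovery.
    # $1: errors= behavior, $2: device zone condition
    post_error_state() {
        case "$1" in
            remount-ro|zone-ro|zone-offline|repair) ;;
            *) return 1 ;;
        esac
        case "$2" in
            good|read-only|offline) ;;
            *) return 1 ;;
        esac
        if [ "$1" = zone-offline ] || [ "$2" = offline ]; then
            echo "0 no no"
        elif [ "$2" = read-only ]; then
            echo "as-is yes no"
        elif [ "$1" = repair ]; then
            echo "fixed yes yes"
        else
            echo "fixed yes no"
        fi
    }

Only the repair behavior leaves a file writable after an error, and only when
the zone itself is still in a good condition.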

Mount options
-------------

zonefs defines several mount options:

* errors=<behavior>
* explicit-open

"errors=<behavior>" option
~~~~~~~~~~~~~~~~~~~~~~~~~~

The "errors=<behavior>" mount option allows the user to specify zonefs
behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:

* remount-ro (default)
* zone-ro
* zone-offline
* repair

The run-time I/O error actions defined for each behavior are detailed in the
previous section. Mount time I/O errors will cause the mount operation to fail.
The handling of read-only zones also differs between mount-time and run-time.
If a read-only zone is found at mount time, the zone is always treated in the
same manner as offline zones, that is, all accesses are disabled and the zone
file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section,
the size of the zone file is left unchanged from its last updated value.

"explicit-open" option
~~~~~~~~~~~~~~~~~~~~~~

A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
the number of zones that can be active, that is, zones that are in the
implicit open, explicit open or closed conditions.  This potential limitation
translates into a risk for applications to see write IO errors due to this
limit being exceeded if the zone of a file is not already active when a write
request is issued by the user.

To avoid these potential errors, the "explicit-open" mount option forces zones
to be made active using an open zone command when a file is opened for writing
for the first time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. Conversely, the
"explicit-open" mount option will result in a zone close command being issued
to the device on the last close() of a zone file if the zone is not full nor
empty.
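
The two mount options described above can be combined in a single option
string. The sketch below builds such a string and rejects behaviors that are
not listed in this document. The function name zonefs_mount_opts is
hypothetical::

    # Print a zonefs mount option string.
    # $1: errors= behavior, $2: "yes" to also request explicit-open
    zonefs_mount_opts() {
        case "$1" in
            remount-ro|zone-ro|zone-offline|repair)
                opts="errors=$1" ;;
            *)
                echo "unknown errors= behavior: $1" >&2
                return 1 ;;
        esac
        if [ "$2" = yes ]; then
            opts="$opts,explicit-open"
        fi
        echo "$opts"
    }

The output is meant to be used as the argument of the mount -o option, for
example with "mount -t zonefs -o errors=zone-ro,explicit-open /dev/sdX /mnt",
where /mnt is an assumed mount point.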

Runtime sysfs attributes
------------------------

zonefs defines several sysfs attributes for mounted devices.  All attributes
are user readable and can be found in the directory /sys/fs/zonefs/<dev>/,
where <dev> is the name of the mounted zoned block device.

The attributes defined are as follows.

* **max_wro_seq_files**:  This attribute reports the maximum number of
  sequential zone files that can be open for writing.  This number corresponds
  to the maximum number of explicitly or implicitly open zones that the device
  supports.  A value of 0 means that the device has no limit and that any zone
  (any file) can be open for writing and written at any time, regardless of the
  state of other zones.  When the *explicit-open* mount option is used, zonefs
  will fail any open() system call requesting to open a sequential zone file for
  writing when the number of sequential zone files already open for writing has
  reached the *max_wro_seq_files* limit.
* **nr_wro_seq_files**:  This attribute reports the current number of sequential
  zone files open for writing.  When the *explicit-open* mount option is used,
  this number can never exceed *max_wro_seq_files*.  If the *explicit-open*
  mount option is not used, the reported number can be greater than
  *max_wro_seq_files*.  In such a case, it is the responsibility of the
  application to not write simultaneously more than *max_wro_seq_files*
  sequential zone files.  Failure to do so can result in write errors.
* **max_active_seq_files**:  This attribute reports the maximum number of
  sequential zone files that are in an active state, that is, sequential zone
  files that are partially written (not empty nor full) or that have a zone that
  is explicitly open (which happens only if the *explicit-open* mount option is
  used).  This number is always equal to the maximum number of active zones that
  the device supports.  A value of 0 means that the mounted device has no limit
  on the number of sequential zone files that can be active.
* **nr_active_seq_files**:  This attribute reports the current number of
  sequential zone files that are active. If *max_active_seq_files* is not 0,
  then the value of *nr_active_seq_files* can never exceed the value of
  *max_active_seq_files*, regardless of the use of the *explicit-open* mount
  option.
39131a644b3SDamien Le Moal
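An application can consult these limits before opening another sequential zone
file for writing.  The following is a minimal Python sketch, not part of zonefs
or zonefs-tools.  It assumes that the four attribute files are found in a
per-device directory such as /sys/fs/zonefs/<dev>, which the caller passes in.
The helper names read_zonefs_attrs(), can_open_for_write() and can_activate()
are hypothetical.

.. code-block:: python

    from pathlib import Path

    # Attribute names as listed above.
    ZONEFS_ATTRS = (
        "max_wro_seq_files",
        "nr_wro_seq_files",
        "max_active_seq_files",
        "nr_active_seq_files",
    )


    def read_zonefs_attrs(sysfs_dir):
        """Read the four attributes from sysfs_dir and return them as integers."""
        base = Path(sysfs_dir)
        return {name: int((base / name).read_text().strip())
                for name in ZONEFS_ATTRS}


    def can_open_for_write(attrs):
        """Return True if one more sequential zone file can be open for writing
        without reaching max_wro_seq_files (0 means no limit)."""
        limit = attrs["max_wro_seq_files"]
        return limit == 0 or attrs["nr_wro_seq_files"] < limit


    def can_activate(attrs):
        """Return True if one more sequential zone file can become active
        without reaching max_active_seq_files (0 means no limit)."""
        limit = attrs["max_active_seq_files"]
        return limit == 0 or attrs["nr_active_seq_files"] < limit

When the *explicit-open* mount option is not used, the result of
can_open_for_write() is only advisory: zonefs does not enforce the limit on
open(), so the application has to respect it on its own.
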
Zonefs User Space Tools
=======================

The mkzonefs tool is used to format zoned block devices for use with zonefs.
This tool is available on GitHub at:

https://github.com/damien-lemoal/zonefs-tools

zonefs-tools also includes a test suite which can be run against any zoned
block device, including a null_blk block device created in zoned mode.

Examples
--------

The following formats a 15 TB host-managed SMR HDD with 256 MB zones
with the conventional zones aggregation feature enabled::

    # mkzonefs -o aggr_cnv /dev/sdX
    # mount -t zonefs /dev/sdX /mnt
    # ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for the corresponding zone type. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file)::

    # ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file::

    # mkfs.ext4 /mnt/cnv/0
    # mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory, which groups the files for sequential write zones,
contains 55356 zone files in this example::

    # ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system::

    # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s

    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

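The append-only constraint amounts to a simple check on each write.  The
following minimal Python sketch is illustrative only: check_append() is a
hypothetical helper, the 4096 B default block size is an assumption taken from
the dd example above, and writes are assumed to be direct I/O as with
oflag=direct.

.. code-block:: python

    def check_append(offset, length, file_size, max_size, block_size=4096):
        """Return the file size after the write, or raise ValueError if
        writing length bytes at offset is not a valid append."""
        # A sequential zone file only accepts writes at its current end.
        if offset != file_size:
            raise ValueError("write must start at the current end of the file")
        # Assumption: direct I/O writes are multiples of the block size.
        if length <= 0 or length % block_size:
            raise ValueError("length must be a positive multiple of block_size")
        # The file cannot grow past its maximum size.
        if offset + length > max_size:
            raise ValueError("write would exceed the maximum file size")
        return offset + length

For the dd command above, check_append(0, 4096, 0, 268435456) returns 4096,
which matches the file size reported by ls.
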
The written file can be truncated to the zone size, preventing any further
write operation::

    # truncate -s 268435456 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and restarting
append-writes to the file::

    # truncate -s 0 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

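The two truncate commands above can be wrapped in a small guard.  This is a
minimal Python sketch under the assumption that a sequential zone file accepts
only these two sizes (0 and its maximum size).  The helper names and the
"reset" and "finish" labels are hypothetical and purely descriptive.

.. code-block:: python

    import os


    def truncate_action(new_size, max_size):
        """Classify a truncate request on a sequential zone file."""
        if new_size == 0:
            return "reset"    # frees the zone space, appends restart at 0
        if new_size == max_size:
            return "finish"   # file becomes full, no further writes
        # Assumption: any other size is not a valid truncation target.
        raise ValueError("size must be 0 or the maximum file size")


    def truncate_zone_file(path, new_size, max_size):
        """Validate new_size, then truncate path with os.truncate()."""
        action = truncate_action(new_size, max_size)
        os.truncate(path, new_size)
        return action

With the 256 MB zones of this example, the call
truncate_zone_file("/mnt/seq/0", 0, 268435456) corresponds to the second
truncate command above.
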
Since files are statically mapped to zones on the disk, the number of blocks
of a file as reported by stat() and fstat() indicates the capacity of the file
zone::

    # stat /mnt/seq/0
    File: /mnt/seq/0
    Size: 0         	Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d	Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
    Birth: -

The number of blocks of the file ("Blocks"), in units of 512 B blocks, gives
the maximum file size of 524288 * 512 B = 256 MB, corresponding to the device
zone capacity in this example. Note that the "IO Block" field always
indicates the minimum I/O size for writes and corresponds to the device
physical sector size.

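The same arithmetic can be done programmatically from the stat() fields.  The
following minimal Python sketch uses hypothetical helper names.  It relies on
st_blocks being counted in 512 B units and on st_blksize being the value that
the stat command prints as "IO Block".

.. code-block:: python

    import os

    SECTOR_SIZE = 512  # st_blocks is counted in 512 B units


    def zone_file_capacity(st_blocks):
        """Maximum file size in bytes for the given block count."""
        return st_blocks * SECTOR_SIZE


    def zone_file_geometry(path):
        """Return the size, capacity, remaining space and minimum write size
        of a zone file, all in bytes."""
        st = os.stat(path)
        capacity = zone_file_capacity(st.st_blocks)
        return {
            "size": st.st_size,
            "capacity": capacity,
            "remaining": capacity - st.st_size,
            "min_write": st.st_blksize,
        }

For the file shown above, zone_file_capacity(524288) gives 268435456 bytes,
that is, the 256 MB zone capacity of this example.
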