.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability. No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.
File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.
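The chunk-based striping described above can be illustrated with a toy
sketch. This is not Ceph's real placement logic (Ceph uses the CRUSH
algorithm); the chunk size, object naming, and hash function here are
purely hypothetical, chosen to show how fixed-size chunks of a file can
be spread deterministically over a set of storage nodes:

```python
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MB stripe unit

def place_chunks(file_size, nodes, chunk_size=CHUNK_SIZE):
    """Map each fixed-size chunk of a file to a storage node by
    hashing the chunk index, spreading the workload across nodes.
    (Ceph actually computes placement with CRUSH, not a plain hash.)"""
    n_chunks = (file_size + chunk_size - 1) // chunk_size
    return {idx: nodes[zlib.crc32(b"obj.%d" % idx) % len(nodes)]
            for idx in range(n_chunks)}

# A 10 MB file becomes three 4 MB-aligned chunks spread over the nodes.
layout = place_chunks(10 * 1024 * 1024, ["osd0", "osd1", "osd2"])
```

Because placement is a pure function of the chunk index, any client can
compute where a chunk lives without asking a central server, which is the
property that lets re-replication proceed without heavyweight coordination.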
The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

Snapshot names have two limitations:

* They can not start with an underscore ('_'), as these names are reserved
  for internal usage by the MDS.
* They can not exceed 240 characters in size. This is because the MDS makes
  use of long snapshot names internally, which follow the format:
  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`. Since filenames in general can't have
  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
  long snapshot names can take as much as 255 - 1 - 1 - 13 = 240.

Ceph also provides some recursive accounting on directories for nested
files and bytes.
That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using
the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes',
e.g.::

  setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
  getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached. A modified or adversarial client cannot be prevented
from writing as much data as it needs.
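The cooperative nature of quota enforcement can be sketched as follows:
the *client* checks current usage against the quota before accepting a
write, and nothing server-side prevents a misbehaving client from
skipping the check. The function name and structure below are
hypothetical, not the kernel client's actual code:

```python
def check_quota_before_write(nested_bytes, max_bytes, write_len):
    """Return True if a write of write_len bytes is allowed under a
    ceph.quota.max_bytes-style limit, where 0 means 'no limit'.
    This is a hypothetical sketch of cooperative, client-side
    enforcement, not the kernel client's implementation."""
    if max_bytes == 0:
        return True  # no quota configured on this subtree
    return nested_bytes + write_len <= max_bytes

# A well-behaved client stops once the limit would be exceeded...
assert check_quota_before_write(90_000_000, 100_000_000, 5_000_000)
assert not check_quota_before_write(99_000_000, 100_000_000, 5_000_000)
# ...but an adversarial client can simply skip this check entirely.
```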
Mount Syntax
============

The basic mount syntax is::

  # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4::

  # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with a slash (`/`)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the
ceph configuration file if available.
Note that the cluster FSID (passed as part
of the device string) is validated by checking it against the FSID reported
by the monitor.

Mount Options
=============

  mon_addr=ip_address[:port][/ip_address[:port]]
    Monitor address of the cluster. This is used to bootstrap the
    connection to the cluster. Once the connection is established, the
    monitor addresses in the monitor map are followed.

  fsid=cluster-id
    FSID of the cluster (from the `ceph fsid` command).

  ip=A.B.C.D[:N]
    Specify the IP and/or port the client should bind to locally.
    There is normally not much reason to do this. If the IP is not
    specified, the client's IP address is determined by looking at the
    address its connection to the monitor originates from.

  wsize=X
    Specify the maximum write size in bytes. Default: 64 MB.

  rsize=X
    Specify the maximum read size in bytes. Default: 64 MB.

  rasize=X
    Specify the maximum readahead size in bytes. Default: 8 MB.
  mount_timeout=X
    Specify the timeout value for mount (in seconds), in the case
    of a non-responsive Ceph file system. The default is 60
    seconds.

  caps_max=X
    Specify the maximum number of caps to hold. Unused caps are released
    when the number of caps exceeds the limit. The default is 0 (no limit).

  rbytes
    When stat() is called on a directory, set st_size to 'rbytes',
    the summation of file sizes over all files nested beneath that
    directory. This is the default.

  norbytes
    When stat() is called on a directory, set st_size to the
    number of entries in that directory.

  nocrc
    Disable CRC32C calculation for data writes. If set, the storage node
    must rely on TCP's error correction to detect data corruption
    in the data payload.

  dcache
    Use the dcache contents to perform negative lookups and
    readdir when the client has the entire directory contents in
    its cache. (This does not change correctness; the client uses
    cached metadata only when a lease or capability ensures it is
    valid.)
  nodcache
    Do not use the dcache as above. This avoids a significant amount of
    complex code, sacrificing performance without affecting correctness,
    and is useful for tracking down bugs.

  noasyncreaddir
    Do not use the dcache as above for readdir.

  noquotadf
    Report overall filesystem usage in statfs instead of using the root
    directory quota.

  nocopyfrom
    Don't use the RADOS 'copy-from' operation to perform remote object
    copies. Currently, it's only used in copy_file_range, which will revert
    to the default VFS implementation if this option is used.

  recover_session=<no|clean>
    Set the auto reconnect mode for the case where the client is
    blocklisted. The available modes are "no" and "clean". The default
    is "no".

    * no: never attempt to reconnect when the client detects that it has
      been blocklisted. Operations will generally fail after being
      blocklisted.

    * clean: the client reconnects to the ceph cluster automatically when
      it detects that it has been blocklisted. During reconnect, the
      client drops dirty data/metadata, and invalidates page caches and
      writable file handles.
      After reconnect, file locks become stale because the MDS loses
      track of them. If an inode contains any stale file locks,
      read/write on the inode is not allowed until applications release
      all stale file locks.

More Information
================

For more information on Ceph, see the home page at
  https://ceph.com/

The Linux kernel client source tree is available at
  - https://github.com/ceph/ceph-client.git

and the source for the full system is at
  https://github.com/ceph/ceph.git