.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability. No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre. Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.
File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs. When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures. The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads. In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation. The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.
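The chunk-based striping described above can be illustrated with a toy
sketch. This is not Ceph's real placement logic (Ceph uses the CRUSH
algorithm); the chunk size, object naming, and hash function here are
purely hypothetical, chosen to show how fixed-size chunks of a file can
be spread deterministically over a set of storage nodes:

```python
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative 4 MB stripe unit

def place_chunks(file_size, nodes, chunk_size=CHUNK_SIZE):
    """Map each fixed-size chunk of a file to a storage node by
    hashing the chunk index, spreading the workload across nodes.
    (Ceph actually computes placement with CRUSH, not a plain hash.)"""
    n_chunks = (file_size + chunk_size - 1) // chunk_size
    return {idx: nodes[zlib.crc32(b"obj.%d" % idx) % len(nodes)]
            for idx in range(n_chunks)}

# A 10 MB file becomes three 4 MB-aligned chunks spread over the nodes.
layout = place_chunks(10 * 1024 * 1024, ["osd0", "osd1", "osd2"])
```

Because placement is a pure function of the chunk index, any client can
compute where a chunk lives without asking a central server, which is the
property that lets re-replication proceed without heavyweight coordination.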
The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.

Snapshot names have two limitations:

* They can not start with an underscore ('_'), as these names are reserved
  for internal usage by the MDS.
* They can not exceed 240 characters in size. This is because the MDS makes
  use of long snapshot names internally, which follow the format:
  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`. Since filenames in general can't have
  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
  long snapshot names can take as much as 255 - 1 - 1 - 13 = 240.

Ceph also provides some recursive accounting on directories for nested
files and bytes.
That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes. This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy. Quotas can be set using
the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes',
e.g.::

  setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
  getfattr -n ceph.quota.max_bytes /some/dir

A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached. A modified or adversarial client cannot be prevented
from writing as much data as it needs.
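The cooperative nature of quota enforcement can be sketched as follows:
the *client* checks current usage against the quota before accepting a
write, and nothing server-side prevents a misbehaving client from
skipping the check. The function name and structure below are
hypothetical, not the kernel client's actual code:

```python
def check_quota_before_write(nested_bytes, max_bytes, write_len):
    """Return True if a write of write_len bytes is allowed under a
    ceph.quota.max_bytes-style limit, where 0 means 'no limit'.
    This is a hypothetical sketch of cooperative, client-side
    enforcement, not the kernel client's implementation."""
    if max_bytes == 0:
        return True  # no quota configured on this subtree
    return nested_bytes + write_len <= max_bytes

# A well-behaved client stops once the limit would be exceeded...
assert check_quota_before_write(90_000_000, 100_000_000, 5_000_000)
assert not check_quota_before_write(99_000_000, 100_000_000, 5_000_000)
# ...but an adversarial client can simply skip this check entirely.
```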
Mount Syntax
============

The basic mount syntax is::

  # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4::

  # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient. If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with a slash (`/`)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the
ceph configuration file if available.
Note that the cluster FSID (passed as part
of the device string) is validated by checking it against the FSID reported
by the monitor.

Mount Options
=============

  mon_addr=ip_address[:port][/ip_address[:port]]
    Monitor address of the cluster. This is used to bootstrap the
    connection to the cluster. Once the connection is established, the
    monitor addresses in the monitor map are followed.

  fsid=cluster-id
    FSID of the cluster (from the `ceph fsid` command).

  ip=A.B.C.D[:N]
    Specify the IP and/or port the client should bind to locally.
    There is normally not much reason to do this. If the IP is not
    specified, the client's IP address is determined by looking at the
    address its connection to the monitor originates from.

  wsize=X
    Specify the maximum write size in bytes. Default: 64 MB.

  rsize=X
    Specify the maximum read size in bytes. Default: 64 MB.

  rasize=X
    Specify the maximum readahead size in bytes. Default: 8 MB.
  mount_timeout=X
    Specify the timeout value for mount (in seconds), in the case
    of a non-responsive Ceph file system. The default is 60
    seconds.

  caps_max=X
    Specify the maximum number of caps to hold. Unused caps are released
    when the number of caps exceeds the limit. The default is 0 (no limit).

  rbytes
    When stat() is called on a directory, set st_size to 'rbytes',
    the summation of file sizes over all files nested beneath that
    directory. This is the default.

  norbytes
    When stat() is called on a directory, set st_size to the
    number of entries in that directory.

  nocrc
    Disable CRC32C calculation for data writes. If set, the storage node
    must rely on TCP's error correction to detect data corruption
    in the data payload.

  dcache
    Use the dcache contents to perform negative lookups and
    readdir when the client has the entire directory contents in
    its cache. (This does not change correctness; the client uses
    cached metadata only when a lease or capability ensures it is
    valid.)
  nodcache
    Do not use the dcache as above. This avoids a significant amount of
    complex code, sacrificing performance without affecting correctness,
    and is useful for tracking down bugs.

  noasyncreaddir
    Do not use the dcache as above for readdir.

  noquotadf
    Report overall filesystem usage in statfs instead of using the root
    directory quota.

  nocopyfrom
    Don't use the RADOS 'copy-from' operation to perform remote object
    copies. Currently, it's only used in copy_file_range, which will revert
    to the default VFS implementation if this option is used.

  recover_session=<no|clean>
    Set the auto reconnect mode for the case where the client is
    blocklisted. The available modes are "no" and "clean". The default
    is "no".

    * no: never attempt to reconnect when the client detects that it has
      been blocklisted. Operations will generally fail after being
      blocklisted.

    * clean: the client reconnects to the ceph cluster automatically when
      it detects that it has been blocklisted. During reconnect, the
      client drops dirty data/metadata, and invalidates page caches and
      writable file handles.
      After reconnect, file locks become stale because the MDS loses
      track of them. If an inode contains any stale file locks,
      read/write on the inode is not allowed until applications release
      all stale file locks.

More Information
================

For more information on Ceph, see the home page at
  https://ceph.com/

The Linux kernel client source tree is available at
  - https://github.com/ceph/ceph-client.git

and the source for the full system is at
  https://github.com/ceph/ceph.git