.. SPDX-License-Identifier: GPL-2.0

============================
Ceph Distributed File System
============================

Ceph is a distributed network file system designed to provide good
performance, reliability, and scalability.

Basic features include:

 * POSIX semantics
 * Seamless scaling from 1 to many thousands of nodes
 * High availability and reliability.  No single point of failure.
 * N-way replication of data across storage nodes
 * Fast recovery from node failures
 * Automatic rebalancing of data on node addition/removal
 * Easy deployment: most FS components are userspace daemons

Also,

 * Flexible snapshots (on any directory)
 * Recursive accounting (nested files, directories, bytes)

In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely
on symmetric access by all clients to shared block devices, Ceph
separates data and metadata management into independent server
clusters, similar to Lustre.  Unlike Lustre, however, metadata and
storage nodes run entirely as user space daemons.  File data is striped
across storage nodes in large chunks to distribute workload and
facilitate high throughputs.  When storage nodes fail, data is
re-replicated in a distributed fashion by the storage nodes themselves
(with some minimal coordination from a cluster monitor), making the
system extremely efficient and scalable.

Metadata servers effectively form a large, consistent, distributed
in-memory cache above the file namespace that is extremely scalable,
dynamically redistributes metadata in response to workload changes,
and can tolerate arbitrary (well, non-Byzantine) node failures.  The
metadata server takes a somewhat unconventional approach to metadata
storage to significantly improve performance for common workloads.  In
particular, inodes with only a single link are embedded in
directories, allowing entire directories of dentries and inodes to be
loaded into its cache with a single I/O operation.  The contents of
extremely large directories can be fragmented and managed by
independent metadata servers, allowing scalable concurrent access.

The system offers automatic data rebalancing/migration when scaling
from a small cluster of just a few nodes to many hundreds, without
requiring an administrator to carve the data set into static volumes or
go through the tedious process of migrating data between servers.
When the file system approaches full, new nodes can be easily added
and things will "just work."

Ceph includes a flexible snapshot mechanism that allows a user to create
a snapshot on any subdirectory (and its nested contents) in the
system.  Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
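
For example, assuming the file system is mounted at /mnt/ceph (the
paths below are illustrative), a snapshot of a directory can be
created, listed, and removed with::

  # mkdir /mnt/ceph/some/dir/.snap/my_snapshot
  # ls /mnt/ceph/some/dir/.snap
  # rmdir /mnt/ceph/some/dir/.snap/my_snapshot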

Snapshot names have two limitations:

* They can not start with an underscore ('_'), as these names are reserved
  for internal usage by the MDS.
* They can not exceed 240 characters in size.  This is because the MDS makes
  use of long snapshot names internally, which follow the format:
  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`.  Since filenames in general can't have
  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
  snapshot name itself can take at most 255 - 1 - 1 - 13 = 240 characters.

Ceph also provides some recursive accounting on directories for nested
files and bytes.  That is, a 'getfattr -d foo' on any directory in the
system will reveal the total number of nested regular files and
subdirectories, and a summation of all nested file sizes.  This makes
the identification of large disk space consumers relatively quick, as
no 'du' or similar recursive scan of the file system is required.
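
For example, the recursive statistics are exposed as virtual extended
attributes and can be queried with getfattr (the path below is
illustrative, and the exact set of ceph.dir.* attributes may vary with
the Ceph version)::

  # getfattr -n ceph.dir.rfiles /mnt/ceph/some/dir
  # getfattr -n ceph.dir.rsubdirs /mnt/ceph/some/dir
  # getfattr -n ceph.dir.rbytes /mnt/ceph/some/dir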

Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
beneath that point in the directory hierarchy.  Quotas can be set using
the extended attributes 'ceph.quota.max_files' and 'ceph.quota.max_bytes',
e.g.::

 setfattr -n ceph.quota.max_bytes -v 100000000 /some/dir
 getfattr -n ceph.quota.max_bytes /some/dir

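The file-count quota is set in the same way, and setting an attribute
back to 0 removes the corresponding limit (a brief illustration, reusing
the example directory above)::

 setfattr -n ceph.quota.max_files -v 10000 /some/dir
 setfattr -n ceph.quota.max_bytes -v 0 /some/dir
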
A limitation of the current quotas implementation is that it relies on the
cooperation of the client mounting the file system to stop writers when a
limit is reached.  A modified or adversarial client cannot be prevented
from writing as much data as it needs.

Mount Syntax
============

The basic mount syntax is::

 # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]

You only need to specify a single monitor, as the client will get the
full list when it connects.  (However, if the monitor you specify
happens to be down, the mount won't succeed.)  The port can be left
off if the monitor is using the default.  So if the monitor is at
1.2.3.4::

 # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4

is sufficient.  If /sbin/mount.ceph is installed, a hostname can be
used instead of an IP address and the cluster FSID can be left out
(as the mount helper will fill it in by reading the ceph configuration
file)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr

Multiple monitor addresses can be passed by separating each address with a slash (`/`)::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101

When using the mount helper, the monitor address can be read from the
ceph configuration file if available.  Note that the cluster FSID
(passed as part of the device string) is validated by comparing it with
the FSID reported by the monitor.
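
The same device string and options can also be used in /etc/fstab for a
persistent mount.  A minimal sketch, assuming /sbin/mount.ceph is
installed (so the FSID can be omitted) and using the generic _netdev
option to defer the mount until the network is up::

  cephuser@cephfs=/  /mnt/ceph  ceph  mon_addr=1.2.3.4,_netdev  0  0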

Mount Options
=============

  mon_addr=ip_address[:port][/ip_address[:port]]
        Monitor address of the cluster.  This is used to bootstrap the
        connection to the cluster.  Once the connection is established, the
        monitor addresses in the monitor map are followed.

  fsid=cluster-id
        FSID of the cluster (from `ceph fsid` command).

  ip=A.B.C.D[:N]
        Specify the IP and/or port the client should bind to locally.
        There is normally not much reason to do this.  If the IP is not
        specified, the client's IP address is determined by looking at the
        address its connection to the monitor originates from.

  wsize=X
        Specify the maximum write size in bytes.  Default: 64 MB.

  rsize=X
        Specify the maximum read size in bytes.  Default: 64 MB.

  rasize=X
        Specify the maximum readahead size in bytes.  Default: 8 MB.

  mount_timeout=X
        Specify the timeout value for mount (in seconds), in the case
        of a non-responsive Ceph file system.  The default is 60
        seconds.

  caps_max=X
        Specify the maximum number of caps to hold.  Unused caps are released
        when the number of caps exceeds the limit.  The default is 0 (no limit).

  rbytes
        When stat() is called on a directory, set st_size to 'rbytes',
        the summation of file sizes over all files nested beneath that
        directory.  This is the default.

  norbytes
        When stat() is called on a directory, set st_size to the
        number of entries in that directory.

  nocrc
        Disable CRC32C calculation for data writes.  If set, the storage node
        must rely on TCP's error correction to detect data corruption
        in the data payload.

  dcache
        Use the dcache contents to perform negative lookups and
        readdir when the client has the entire directory contents in
        its cache.  (This does not change correctness; the client uses
        cached metadata only when a lease or capability ensures it is
        valid.)

  nodcache
        Do not use the dcache as above.  This avoids a significant amount of
        complex code, sacrificing performance without affecting correctness,
        and is useful for tracking down bugs.

  noasyncreaddir
        Do not use the dcache as above for readdir.

  noquotadf
        Report overall filesystem usage in statfs instead of using the root
        directory quota.

  nocopyfrom
        Don't use the RADOS 'copy-from' operation to perform remote object
        copies.  Currently, it's only used in copy_file_range, which will revert
        to the default VFS implementation if this option is used.

  recover_session=<no|clean>
        Set auto reconnect mode in the case where the client is blocklisted. The
        available modes are "no" and "clean". The default is "no".

        * no: never attempt to reconnect when client detects that it has been
          blocklisted. Operations will generally fail after being blocklisted.

        * clean: client reconnects to the ceph cluster automatically when it
          detects that it has been blocklisted. During reconnect, client drops
          dirty data/metadata, invalidates page caches and writable file handles.
          After reconnect, file locks become stale because the MDS loses track
          of them. If an inode contains any stale file locks, read/write on the
          inode is not allowed until applications release all stale file locks.
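
For example, several of the options above can be combined on a single
mount command line; the values below are illustrative only::

  # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4,rsize=16777216,wsize=16777216,recover_session=clean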

More Information
================

For more information on Ceph, see the home page at
        https://ceph.com/

The Linux kernel client source tree is available at
        https://github.com/ceph/ceph-client.git

and the source for the full system is at
        https://github.com/ceph/ceph.git