# eMMC Storage Design

Author: Adriana Kobylak < anoo! >

Other contributors: Joel Stanley < shenki! >, Milton Miller

Created: 2019-06-20

## Problem Description

Proposal to define an initial storage design for an eMMC device. This includes
the filesystem type, partitioning, volume management, boot options and
initialization, etc.

## Background and References

OpenBMC currently supports raw flash, such as the SPI NOR found in systems
based on the AST2400 and AST2500, but there is no design for managed NAND.

## Requirements

- Security: Ability to enforce read-only access and to verify official/signed
  images for production.

- Updatable: Ensure that the filesystem design allows for an effective and
  simple update mechanism to be implemented.

- Simplicity: Make the system easy to understand, so that it is easy to develop,
  test, use, and recover.

- Code reuse: Try to use something that already exists instead of re-inventing
  the wheel.

## Proposed Design

- The eMMC image layout and characteristics are specified in a meta layer. This
  allows OpenBMC to support different layouts and configurations. The tarball to
  perform a code update is still built by image_types_phosphor, so a separate
  IMAGE_TYPES would need to be created to support a different filesystem type.

- Code update: Support two versions on flash. This allows a known good image to
  be retained and a new image to be validated.

- GPT partitioning for the eMMC User Data Area: This is chosen over dynamic
  partitioning due to the lack of offline tools to build an LVM image (see
  Logical Volumes in the Alternatives section below).

- Initramfs: An initramfs is needed to run sgdisk on first boot to move the
  secondary GPT to the end of the device, where it belongs. This is required
  because the Yocto wic tool does not currently support building an image of a
  specified size, so the generated image may not be exactly the size of the
  device that it is flashed into.

- Read-only and read-write filesystem: ext4. This is a stable and widely used
  filesystem for eMMC.

- Filesystem layout: The root filesystem is hosted in a read-only volume. The
  /var directory is mounted from a read-write volume that persists through code
  updates. The /home directory needs to be writable to store user data such as
  ssh keys, so it is a bind mount to a directory in the read-write volume. A
  bind mount is more reliable than an overlay, and has been around longer.
  Since the image delivers no content in the /home directory, a bind mount is
  sufficient. The /etc directory, on the other hand, does have content
  delivered by the image, so it is an overlayfs, which makes it possible to
  restore its original configuration content on a factory reset.

        +------------------+ +-----------------------------+
        | Read-only volume | | Read-write volume           |
        |------------------| |-----------------------------|
        |                  | |                             |
        | / (rootfs)       | | /var                        |
        |                  | |                             |
        | /etc  +------------->/var/etc-work/  (overlayfs) |
        |                  | |                             |
        | /home +------------->/var/home-work/ (bind mount)|
        |                  | |                             |
        |                  | |                             |
        +------------------+ +-----------------------------+

- Provisioning: OpenBMC will produce a flashable eMMC image as a build
  artifact, as it currently does for NOR chips.
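
The first-boot GPT fixup described in the Initramfs item can be illustrated with a short check. The Python sketch below is not the initramfs code. It parses the primary GPT header (LBA 1) and compares its AlternateLBA field, which records where the backup header lives, with the last sector of the device the image was flashed into. Field offsets follow the UEFI specification. The 512-byte sector size, the helper names, and the omission of CRC validation are simplifying assumptions. When the check reports a mismatch, the initramfs could run sgdisk -e (--move-second-header) on the eMMC block device to perform the actual move.

```python
import struct
from typing import NamedTuple

SECTOR_SIZE = 512            # assumed logical sector size of the eMMC
GPT_SIGNATURE = b"EFI PART"


class GptHeader(NamedTuple):
    my_lba: int          # LBA of this header copy
    alternate_lba: int   # LBA of the other (backup) header copy
    last_usable_lba: int
    num_entries: int     # number of partition entries in the array
    entry_size: int      # size of one partition entry in bytes


def parse_gpt_header(sector: bytes) -> GptHeader:
    """Extract the fields used below from a GPT header (UEFI layout).

    CRC32 validation is deliberately omitted to keep the sketch short."""
    if sector[:8] != GPT_SIGNATURE:
        raise ValueError("no GPT signature")
    # MyLBA, AlternateLBA, FirstUsableLBA, LastUsableLBA start at offset 24.
    my_lba, alternate_lba, _first_usable, last_usable = struct.unpack_from(
        "<4Q", sector, 24)
    # NumberOfPartitionEntries and SizeOfPartitionEntry start at offset 80.
    num_entries, entry_size = struct.unpack_from("<II", sector, 80)
    return GptHeader(my_lba, alternate_lba, last_usable, num_entries, entry_size)


def expected_backup_layout(device_sectors: int, num_entries: int = 128,
                           entry_size: int = 128) -> tuple[int, int, int]:
    """Return (backup header LBA, backup entry array LBA, last usable LBA)
    for a backup GPT that sits at the very end of the device."""
    entry_sectors = (num_entries * entry_size + SECTOR_SIZE - 1) // SECTOR_SIZE
    backup_header = device_sectors - 1
    backup_entries = backup_header - entry_sectors
    return backup_header, backup_entries, backup_entries - 1


def needs_relocation(primary: GptHeader, device_sectors: int) -> bool:
    """True when the backup GPT is not on the device's last sector, e.g.
    because the wic image was smaller than the eMMC it was flashed into."""
    backup_header, _, _ = expected_backup_layout(
        device_sectors, primary.num_entries, primary.entry_size)
    return primary.alternate_lba != backup_header
```

For example, a wic image built as 2097152 sectors records its backup header at LBA 2097151. Once flashed into an eMMC of 8388608 sectors, needs_relocation() returns True, and expected_backup_layout(8388608) gives (8388607, 8388575, 8388574) for the default 128 entries of 128 bytes.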
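
The Filesystem layout item and its diagram can be expressed as an ordered list of mount(8) invocations. This is a minimal Python sketch. It assumes the read-write volume is reachable at a hypothetical /dev/disk/by-partlabel/rwfs path, and that the /etc upper layer lives in a hypothetical /var/etc directory alongside the /var/etc-work and /var/home-work directories from the diagram. The actual boot scripts may use different paths and mount helpers. The sketch only builds argument vectors, so it can be inspected without root privileges.

```python
from dataclasses import dataclass

# Hypothetical device path for the read-write volume (illustration only).
RW_DEVICE = "/dev/disk/by-partlabel/rwfs"


@dataclass(frozen=True)
class Mount:
    fstype: str        # "ext4", "overlay", or "bind"
    source: str
    target: str
    options: str = ""

    def argv(self) -> list[str]:
        """Return an argument vector for mount(8)."""
        if self.fstype == "bind":
            return ["mount", "--bind", self.source, self.target]
        cmd = ["mount", "-t", self.fstype]
        if self.options:
            cmd += ["-o", self.options]
        return cmd + [self.source, self.target]


def rw_layout() -> list[Mount]:
    """Mounts in dependency order: the read-write volume must be on /var
    before anything can be layered on top of it."""
    return [
        Mount("ext4", RW_DEVICE, "/var"),
        # /etc ships content in the image, so the image's /etc is the
        # read-only lower layer and local changes land in the upper dir.
        # /var/etc (upper) is a hypothetical path; upperdir and workdir
        # must be on the same filesystem.
        Mount("overlay", "overlay", "/etc",
              "lowerdir=/etc,upperdir=/var/etc,workdir=/var/etc-work"),
        # /home has no image content, so a plain bind mount is enough.
        Mount("bind", "/var/home-work", "/home"),
    ]
```

Ordering matters. /var has to be mounted before the overlay and the bind mount can reference directories on it, and overlayfs requires the upper and work directories to be on the same filesystem. With this layout, a factory reset of /etc amounts to emptying the upper directory while the overlay is not mounted, so that the content delivered by the image shows through again on the next boot.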

## Alternatives Considered

- Store U-Boot and the Linux kernel in a separate SPI NOR flash device, since
  SoCs such as the AST2500 do not support executing U-Boot from an eMMC. In
  addition, having the Linux kernel on the NOR avoids requiring U-Boot support
  for the eMMC. U-Boot and the kernel are less than 10MB in size, so a fairly
  small chip such as a 32MB one would suffice, even with the kernel for each of
  the two supported firmware versions stored in the NOR. A second NOR device
  could be added as redundancy in case U-Boot or the kernel failed to run.

  Format the NOR as it is currently done for a system that supports UBI: a fixed
  MTD partition for U-Boot, one for its environment, and a UBI volume spanning
  the remainder of the flash. Store the dual kernel volumes in the UBI
  partition. This approach allows the re-use of the existing code update
  interfaces, since the static approach does not currently support storing two
  kernel images. Selection of the desired kernel image would be done with the
  existing U-Boot environment approach.

  Static MTD partitions could be created to store the kernel images, but
  additional work would be required to introduce a new method to select the
  desired kernel image, because the static layout does not currently have dual
  image support.

  The AST2600 supports executing U-Boot from the eMMC, which provides the
  flexibility of having only the eMMC chip on a system, or of still having
  U-Boot in a separate chip for recovery in cases where the eMMC goes bad.

- Filesystem: f2fs (Flash-Friendly File System). f2fs is an up-and-coming
  filesystem, and therefore it may be seen as less mature and stable than ext4,
  although it is unknown how either of the two would perform in an OpenBMC
  environment.

  A suitable alternative would be btrfs, which has checksums for both metadata
  and data in the filesystem, and therefore provides stronger guarantees on
  data integrity.

- All code update artifacts combined into a single image.

  This provides simple code maintenance: an image is either intact or not, and
  either works or not, with no additional fragments lying around. U-Boot has
  one choice to make (which image to load) and one piece of information to
  forward to the kernel.

  To reduce boot time by not reading unneeded sectors into memory, a small
  filesystem is placed at the beginning of the partition to contain any
  artifacts that must be accessed by U-Boot.

  This filesystem will be selected from ext2, FAT12, and cramfs, as these are
  all supported in both the Linux kernel and U-Boot. If we want the U-Boot
  environment to be per-side, then choose either ext2 or FAT12, since cramfs is
  read-only. (Squashfs support in U-Boot has not been merged; it was last
  updated in 2018, two years ago.)

- No initramfs: It may be possible to boot the rootfs by passing the UUID of the
  logical volume to the kernel, although a [pre-init script][] will likely still
  be needed. Therefore, having an initramfs would offer a more standard
  implementation for initialization.

- MBR partitioning: The MBR (DOS) partition table is a simple and well
  understood format. There is space for 4 independent primary partitions.
  Alternatively, one slot can be chained into extended partitions, but each
  partition in the chain depends on the prior partition. Four partitions may be
  sufficient to meet the initial demand for a shared (single) boot filesystem
  design (boot, rofs-a, rofs-b, and read-write). Additional partitions would be
  needed for a dual boot volume design.

  If common space is needed for the U-Boot environment, it is stored
  redundantly as a file in partition 1. The U-Boot SPL will be located here. If
  this is not needed, partition 1 can remain unallocated.

  The two code sides are created in slots 2 and 3.

  The read-write filesystem occupies partition 4.

  If in the future there is demand for additional partitions, a partition can
  be moved into an extended partition in a future code update.

- Device Mapper: The eMMC is divided using the device-mapper linear target,
  which allows devices to be expanded if necessary without physically
  repartitioning, since the device-mapper devices expose logical blocks. This
  is achieved by changing the device-mapper configuration table entries
  provided to the kernel to append unused physical blocks.

- Logical Volumes:

  - Volume management: LVM. This allows for dynamic partition creation and
    removal, similar to the current UBI implementation. LVM support increases
    the size of the kernel by ~100kB, but the increase in size is worth the
    ability to resize partitions if needed. In addition, UBI volume management
    works in a similar way, so it would not be complex to implement LVM
    management in the code update application.

  - Partitioning: If the eMMC is used to store the boot loader, an ext4 (or
    vfat) partition would hold the FIT image containing the kernel, initrd and
    device tree. This volume would be mounted as /boot. This allows U-Boot to
    load the kernel, since U-Boot doesn't have support for LVM. After the boot
    partition, assign the remaining eMMC flash as a single physical volume
    containing logical volumes, instead of fixed-size partitions. This provides
    flexibility for cases where the contents of a partition outgrow a fixed
    size. This also means that other firmware images, such as BIOS and PSU, can
    be stored in volumes in the single eMMC device.

  - Initramfs: Use an initramfs, which is the default in OpenBMC, to boot the
    rootfs from a logical volume. An initramfs allows for flexibility if
    additional boot actions are needed, such as mounting overlays. It also
    provides a point of departure (environment) to provision and format the eMMC
    volume(s). To boot the rootfs, the initramfs would search for the desired
    rootfs volume to be mounted, instead of using the U-Boot environments.

  - Mount points: For firmware images such as BIOS that currently reside in
    separate SPI NOR modules, the logical volume in the eMMC would be mounted at
    the same paths, so as to avoid changes to the applications that rely on the
    location of that data.

  - Provisioning: Since the LVM userspace tools don't offer an offline mode,
    it's not straightforward to assemble an LVM disk image from a bitbake task.
    Therefore, have the initramfs create the LVM volume and fetch the rootfs
    file into tmpfs from an external source to flash the volume. The rootfs file
    can be fetched using DHCP, UART, USB key, etc. An alternative option is to
    build the image in QEMU, which would require booting QEMU as part of the
    build process to set up the LVM volume and create the image file.
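
The four-slot layout in the MBR partitioning alternative can be made concrete by encoding the partition table directly. This Python sketch follows the classic MBR format: four 16-byte entries starting at byte offset 446 and the 0x55AA signature at offset 510. It writes the conventional 0xFE 0xFF 0xFF placeholder into the legacy CHS fields and relies only on the 32-bit LBA fields. It does not check for overlapping slots. The sector numbers in the example, the 0x83 (Linux) partition type, and the helper names are illustrative assumptions, not part of the proposal.

```python
import struct
from typing import NamedTuple, Optional

SECTOR_SIZE = 512
TABLE_OFFSET = 446                 # partition table follows the boot code area
ENTRY_FORMAT = "<B3sB3sII"         # status, CHS first, type, CHS last, LBA first, sector count
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 16 bytes
CHS_PLACEHOLDER = b"\xfe\xff\xff"  # conventional "use the LBA fields" value
LINUX_TYPE = 0x83


class Slot(NamedTuple):
    first_lba: int
    sectors: int
    part_type: int = LINUX_TYPE


def encode_mbr(slots: list[Optional[Slot]]) -> bytes:
    """Build a 512-byte MBR sector from up to four primary slots.

    A None entry leaves that slot unallocated (all zero bytes)."""
    if len(slots) > 4:
        raise ValueError("an MBR has only four primary slots")
    mbr = bytearray(SECTOR_SIZE)
    for index, slot in enumerate(slots):
        if slot is None:
            continue
        if slot.first_lba < 1 or slot.sectors < 1:
            raise ValueError(f"invalid slot {index + 1}: {slot}")
        if slot.first_lba + slot.sectors > 2**32:
            raise ValueError("MBR LBA fields are limited to 32 bits")
        struct.pack_into(ENTRY_FORMAT, mbr, TABLE_OFFSET + ENTRY_SIZE * index,
                         0x00, CHS_PLACEHOLDER, slot.part_type,
                         CHS_PLACEHOLDER, slot.first_lba, slot.sectors)
    mbr[510:512] = b"\x55\xaa"  # boot signature
    return bytes(mbr)


def decode_mbr(mbr: bytes) -> list[Optional[Slot]]:
    """Parse the four primary entries; type 0 means the slot is unused."""
    if mbr[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    slots: list[Optional[Slot]] = []
    for index in range(4):
        _, _, part_type, _, first_lba, sectors = struct.unpack_from(
            ENTRY_FORMAT, mbr, TABLE_OFFSET + ENTRY_SIZE * index)
        slots.append(Slot(first_lba, sectors, part_type) if part_type else None)
    return slots
```

A layout such as [None, Slot(2048, 1048576), Slot(1050624, 1048576), Slot(2099200, 4194304)] leaves slot 1 unallocated, as the text allows when no common U-Boot space is needed. It puts the two code sides in slots 2 and 3 and the read-write filesystem in slot 4. The 32-bit LBA fields also cap MBR at 2 TiB with 512-byte sectors, far beyond typical eMMC sizes.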
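
The Device Mapper alternative can be sketched by generating the linear-target table text. Each table line has the form "<logical start> <length> linear <device> <offset>", in 512-byte sectors. Growing a mapped device means appending another line that maps unused physical sectors after the current logical end. This Python sketch assumes a hypothetical /dev/mmcblk0 device and made-up sector ranges. The generated text is what would be handed to dmsetup, for example with dmsetup reload followed by dmsetup resume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    device: str  # backing block device (hypothetical path in the example)
    offset: int  # first physical sector of the extent on that device
    length: int  # extent length in 512-byte sectors


def check_extents(extents: list[Extent]) -> None:
    """Reject empty/negative extents and physical overlaps on one device."""
    by_device: dict[str, list[Extent]] = {}
    for ext in extents:
        if ext.length <= 0 or ext.offset < 0:
            raise ValueError(f"invalid extent: {ext}")
        by_device.setdefault(ext.device, []).append(ext)
    for group in by_device.values():
        group.sort(key=lambda e: e.offset)
        for prev, cur in zip(group, group[1:]):
            if prev.offset + prev.length > cur.offset:
                raise ValueError(f"{prev} overlaps {cur}")


def linear_table(extents: list[Extent]) -> str:
    """Render a dm-linear table that concatenates the extents in order.

    The logical start of each line is the sum of the lengths before it, so
    the mapped device exposes one contiguous range of logical sectors even
    when the physical extents are scattered."""
    check_extents(extents)
    lines = []
    logical_start = 0
    for ext in extents:
        lines.append(f"{logical_start} {ext.length} linear {ext.device} {ext.offset}")
        logical_start += ext.length
    return "\n".join(lines)


def grow(extents: list[Extent], device: str, offset: int,
         length: int) -> list[Extent]:
    """Return a new extent list with unused physical sectors appended."""
    return [*extents, Extent(device, offset, length)]
```

Extending the mapping only grows the block device. The filesystem on it would still need to be grown separately, for example with resize2fs for ext4.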

## Impacts

This design would impact the OpenBMC build process and code update internal
implementations but should not affect the external interfaces.

- openbmc/linux: Kernel changes to support the eMMC chip and its filesystem.
- openbmc/openbmc: Changes to create an eMMC image.
- openbmc/openpower-pnor-code-mgmt: Changes to support updating the new
  filesystem.
- openbmc/phosphor-bmc-code-mgmt: Changes to support updating the new
  filesystem.

## Testing

Verify OpenBMC functionality in a system containing an eMMC. This system could
be added to the CI pool.

[pre-init script]:
  https://github.com/openbmc/openbmc/blob/master/meta-phosphor/recipes-phosphor/preinit-mounts/preinit-mounts/init
224