.. SPDX-License-Identifier: GPL-2.0

#########
UML HowTo
#########

.. contents:: :local:

************
Introduction
************

Welcome to User Mode Linux

User Mode Linux is the first Open Source virtualization platform (first
release date 1991) and second virtualization platform for an x86 PC.

How is UML Different from a VM using Virtualization package X?
==============================================================

We have come to assume that virtualization also means some level of
hardware emulation. In fact, it does not. As long as a virtualization
package provides the OS with devices which the OS can recognize and
has a driver for, the devices do not need to emulate real hardware.
Most OSes today have built-in support for a number of "fake"
devices used only under virtualization.
User Mode Linux takes this concept to the ultimate extreme - there
is not a single real device in sight. It is 100% artificial or, if
we use the correct term, 100% paravirtual. All UML devices are abstract
concepts which map onto something provided by the host - files, sockets,
pipes, etc.

The other major difference between UML and various virtualization
packages is that there is a distinct difference between the way the UML
kernel and the UML programs operate.
The UML kernel is just a process running on Linux - same as any other
program. It can be run by an unprivileged user and it does not require
anything in terms of special CPU features.
The UML userspace, however, is a bit different. The Linux kernel on the
host machine assists UML in intercepting everything the program running
on a UML instance is trying to do and making the UML kernel handle all
of its requests.
This is different from other virtualization packages, which do not
distinguish between the guest kernel and guest programs. This difference
results in a number of advantages and disadvantages of UML over, say,
QEMU, which we will cover later in this document.

Why Would I Want User Mode Linux?
=================================

* If the User Mode Linux kernel crashes, your host kernel is still fine. It
  is not accelerated in any way (vhost, kvm, etc) and it is not trying to
  access any devices directly.  It is, in fact, a process like any other.

* You can run a usermode kernel as a non-root user (you may need to
  arrange appropriate permissions for some devices).

* You can run a very small VM with a minimal footprint for a specific
  task (for example 32M or less).

* You can get extremely high performance for anything which is a "kernel
  specific task" such as forwarding, firewalling, etc, while still being
  isolated from the host kernel.

* You can play with kernel concepts without breaking things.

* You are not bound by "emulating" hardware, so you can try weird and
  wonderful concepts which are very difficult to support when emulating
  real hardware, such as time travel and making your system clock
  dependent on what UML does (very useful for things like tests).

* It's fun.

Why not to run UML
==================

* The syscall interception technique used by UML makes it inherently
  slower for any userspace applications. While it can do kernel tasks
  on par with most other virtualization packages, its userspace is
  **slow**. The root cause is that UML has a very high cost of creating
  new processes and threads (something most Unix/Linux applications
  take for granted).

* UML is strictly uniprocessor at present. If you want to run an
  application which needs many CPUs to function, it is clearly the
  wrong choice.

***********************
Building a UML instance
***********************

There is no UML installer in any distribution. While you can use off
the shelf install media to install into a blank VM using a virtualization
package, there is no UML equivalent. You have to use appropriate tools on
your host to build a viable filesystem image.

This is extremely easy on Debian - you can do it using debootstrap. It is
also easy on OpenWRT - the build process can build UML images. All other
distros - YMMV.

Creating an image
=================

Create a sparse raw disk image::

   # dd if=/dev/zero of=disk_image_name bs=1 count=1 seek=16G

This will create a 16G disk image. The OS will initially allocate only one
block and will allocate more as UML writes to them. As of kernel
version 4.19 UML fully supports TRIM (as usually used by flash drives).
Using TRIM inside the UML image by specifying discard as a mount option
or by running ``tune2fs -o discard /dev/ubdXX`` will request UML to
return any unused blocks to the OS.
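
The sparse allocation described above can be observed from the host. The
following is a minimal sketch, assuming GNU coreutils (``truncate`` and
``stat -c``); the temporary file and the 64M size are arbitrary choices
made to keep the example quick, not part of the procedure::

   #!/bin/sh
   # Create a small sparse file and compare its apparent size with the
   # space that is actually allocated for it.
   img=$(mktemp)
   truncate -s 64M "$img"

   apparent=$(stat -c %s "$img")     # apparent size in bytes
   blocks=$(stat -c %b "$img")       # number of allocated blocks
   blksize=$(stat -c %B "$img")      # size in bytes of each reported block
   allocated=$((blocks * blksize))

   echo "apparent=$apparent allocated=$allocated"
   rm -f "$img"

The same comparison on a real image shows how much of it UML has actually
written so far.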

Create a filesystem on the disk image and mount it::

   # mkfs.ext4 ./disk_image_name && mount ./disk_image_name /mnt

This example uses ext4; any other filesystem such as ext3, btrfs, xfs,
jfs, etc will work too.

Create a minimal OS installation on the mounted filesystem::

   # debootstrap buster /mnt http://deb.debian.org/debian

debootstrap does not set up the root password, fstab, hostname or
anything related to networking. It is up to the user to do that.

Set the root password - the easiest way to do that is to chroot into the
mounted image::

   # chroot /mnt
   # passwd
   # exit

Edit key system files
=====================

UML block devices are called ubds. The fstab created by debootstrap
will be empty and it needs an entry for the root file system::

   /dev/ubd0   /   ext4    discard,errors=remount-ro  0       1

The image hostname will be set to the same as the host on which you
are creating the image. It is a good idea to change that to avoid
"Oh, bummer, I rebooted the wrong machine".
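
Both edits can also be scripted from the host. The following is a sketch
under assumptions: the mounted image directory is passed as the first
argument (a scratch directory is used when none is given, so the script is
harmless to try), and ``uml-guest`` is a hypothetical hostname::

   #!/bin/sh
   # Write a root filesystem entry and a distinct hostname into the image.
   root=${1:-$(mktemp -d)}
   mkdir -p "$root/etc"

   # Fields: device, mount point, type, options, dump, pass
   printf '%s\n' '/dev/ubd0  /  ext4  discard,errors=remount-ro  0  1' \
       > "$root/etc/fstab"

   # "uml-guest" is a placeholder; any name that differs from the host works.
   echo uml-guest > "$root/etc/hostname"

For the image built in this document it would be run with ``/mnt`` as its
argument.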

UML supports two classes of network devices. The older uml_net ones,
which are scheduled for obsoletion, are called ethX. The newer vector
IO devices are significantly faster and have support for some standard
virtual network encapsulations like Ethernet over GRE and Ethernet over
L2TPv3. These are called vecX.

Depending on which one is in use, ``/etc/network/interfaces`` will
need entries like::

   # legacy UML network devices
   auto eth0
   iface eth0 inet dhcp

   # vector UML network devices
   auto vec0
   iface vec0 inet dhcp

We now have a UML image which is nearly ready to run. All we need is a
UML kernel and modules for it.

Most distributions have a UML package. Even if you intend to use your own
kernel, testing the image with a stock one is always a good start. These
packages come with a set of modules which should be copied to the target
filesystem. The location is distribution dependent. For Debian these
reside under /usr/lib/uml/modules. Copy recursively the content of this
directory to the mounted UML filesystem::

   # cp -rax /usr/lib/uml/modules /mnt/lib/modules

If you have compiled your own kernel, you need to use the usual "install
modules to a location" procedure by running::

   # make ARCH=um modules_install INSTALL_MOD_PATH=/mnt

At this point the image is ready to be brought up.

*************************
Setting Up UML Networking
*************************

UML networking is designed to emulate an Ethernet connection. This
connection may be either point-to-point (similar to a connection
between machines using a back-to-back cable) or a connection to a
switch. UML supports a wide variety of means to build these
connections to all of: local machine, remote machine(s), local and
remote UML and other VM instances.

+-----------+--------+------------------------------------+------------+
| Transport |  Type  |        Capabilities                | Throughput |
+===========+========+====================================+============+
| tap       | vector | checksum, tso                      | > 8Gbit    |
+-----------+--------+------------------------------------+------------+
| hybrid    | vector | checksum, tso, multipacket rx      | > 6Gbit    |
+-----------+--------+------------------------------------+------------+
| raw       | vector | checksum, tso, multipacket rx, tx  | > 6Gbit    |
+-----------+--------+------------------------------------+------------+
| EoGRE     | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| Eol2tpv3  | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| bess      | vector | multipacket rx, tx                 | > 3Gbit    |
+-----------+--------+------------------------------------+------------+
| fd        | vector | dependent on fd type               | varies     |
+-----------+--------+------------------------------------+------------+
| tuntap    | legacy | none                               | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
| daemon    | legacy | none                               | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| socket    | legacy | none                               | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| pcap      | legacy | rx only                            | ~ 450Mbit  |
+-----------+--------+------------------------------------+------------+
| ethertap  | legacy | obsolete                           | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+
| vde       | legacy | obsolete                           | ~ 500Mbit  |
+-----------+--------+------------------------------------+------------+

* All transports which have tso and checksum offloads can deliver speeds
  approaching 10G on TCP streams.

* All transports which have multi-packet rx and/or tx can deliver packet
  rates of up to 1Mpps or more.

* All legacy transports are generally limited to ~600-700Mbit and 0.05Mpps.

* GRE and L2TPv3 allow connections to all of: local machine, remote
  machines, remote network devices and remote UML instances.

* Socket allows connections only between UML instances.

* Daemon and bess require running a local switch. This switch may be
  connected to the host as well.

Network configuration privileges
================================

The majority of the supported networking modes need ``root`` privileges.
For example, in the legacy tuntap networking mode, users were required
to be part of the group associated with the tunnel device.

For newer network drivers like the vector transports, ``root`` privilege
is required to fire an ioctl to set up the tun interface and/or use
raw sockets where needed.

This can be achieved by granting the user a particular capability instead
of running UML as root.  In case of vector transport, a user can add the
capability ``CAP_NET_ADMIN`` or ``CAP_NET_RAW`` to the uml binary.
Thenceforth, UML can be run with normal user privileges, along with
full networking.

For example::

   # sudo setcap cap_net_raw,cap_net_admin+ep linux

Configuring vector transports
=============================

All vector transports support a similar syntax:

If X is the interface number as in vec0, vec1, vec2, etc, the general
syntax for options is::

   vecX:transport="Transport Name",option=value,option=value,...,option=value

Common options
--------------

These options are common for all transports:

* ``depth=int`` - sets the queue depth for vector IO. This is the
  number of packets UML will attempt to read or write in a single
  system call. The default number is 64 and is generally sufficient
  for most applications that need throughput in the 2-4 Gbit range.
  Higher speeds may require larger values.

* ``mac=XX:XX:XX:XX:XX:XX`` - sets the interface MAC address value.

* ``gro=[0,1]`` - sets GRO on or off. Enables receive/transmit offloads.
  The effect of this option depends on the host side support in the transport
  which is being configured. In most cases it will enable TCP segmentation and
  RX/TX checksumming offloads. The setting must be identical on the host side
  and the UML side. The UML kernel will produce warnings if it is not.
  For example, GRO is enabled by default on local machine interfaces
  (e.g. veth pairs, bridge, etc), so it should be enabled in UML in the
  corresponding UML transports (raw, tap, hybrid) in order for networking to
  operate correctly.

* ``mtu=int`` - sets the interface MTU.

* ``headroom=int`` - adjusts the default headroom (32 bytes) reserved
  if a packet will need to be re-encapsulated into, for instance, VXLAN.

* ``vec=0`` - disables multipacket IO and falls back to packet at a
  time mode.
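
Since every vector transport shares the ``vecX:transport=...`` syntax, the
option string is easy to assemble in a script. The helper below is a
minimal sketch; ``vec_arg`` is a hypothetical name and the function does
no validation of the option names it is given::

   #!/bin/sh
   # vec_arg INDEX TRANSPORT [option=value ...]
   # Prints a vecX option string for the UML command line.
   vec_arg() {
       idx=$1
       transport=$2
       shift 2
       arg="vec${idx}:transport=${transport}"
       # Append each remaining argument as a comma-separated option
       for opt in "$@"; do
           arg="${arg},${opt}"
       done
       printf '%s\n' "$arg"
   }

   vec_arg 0 tap ifname=tap0 depth=128 gro=1

The call shown prints ``vec0:transport=tap,ifname=tap0,depth=128,gro=1``.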

Shared Options
--------------

* ``ifname=str`` - transports which bind to a local network interface
  have a shared option - the name of the interface to bind to.

* ``src, dst, src_port, dst_port`` - all transports which use sockets
  which have the notion of source and destination and/or source port
  and destination port use these to specify them.

* ``v6=[0,1]`` - specifies if a v6 connection is desired for all
  transports which operate over IP. Additionally, for transports that
  have some differences in the way they operate over v4 and v6 (for example
  EoL2TPv3), sets the correct mode of operation. In the absence of this
  option, the socket type is determined based on what the src and dst
  arguments resolve/parse to.

tap transport
-------------

Example::

   vecX:transport=tap,ifname=tap0,depth=128,gro=1

This will connect vecX to tap0 on the host. tap0 must already exist (for
example created using tunctl) and be UP.

tap0 can be configured as a point-to-point interface and given an IP
address so that UML can talk to the host. Alternatively, it is possible
to connect UML to a tap interface which is connected to a bridge.
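
A host-side sketch of the point-to-point case follows. It uses
``ip tuntap`` from iproute2 in place of ``tunctl``; the user name
``uml-user`` and the 192.168.5.1/30 address are assumptions made for
illustration. The commands are only printed unless ``DRY_RUN=0`` is set,
because executing them needs ``root``::

   #!/bin/sh
   # Print the commands by default; execute them only when DRY_RUN=0.
   DRY_RUN=${DRY_RUN:-1}
   run() {
       if [ "$DRY_RUN" = 1 ]; then
           echo "$*"
       else
           "$@"
       fi
   }

   # setup_tap NAME OWNER ADDRESS/PREFIX
   setup_tap() {
       tap=$1
       user=$2
       addr=$3
       run ip tuntap add dev "$tap" mode tap user "$user"  # persistent tap
       run ip addr add "$addr" dev "$tap"                  # host side address
       run ip link set "$tap" up                           # must be UP for UML
   }

   setup_tap tap0 uml-user 192.168.5.1/30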

While tap relies on the vector infrastructure, it is not a true vector
transport at this point, because Linux does not support multi-packet
IO on tap file descriptors for normal userspace apps like UML. This
is a privilege which is offered only to something which can hook up
to it at kernel level via specialized interfaces like vhost-net. A
vhost-net like helper for UML is planned at some point in the future.

Privileges required: tap transport requires either:

* tap interface to exist and be created persistent and owned by the
  UML user using tunctl. Example ``tunctl -u uml-user -t tap0``

* binary to have ``CAP_NET_ADMIN`` privilege

hybrid transport
----------------

Example::

   vecX:transport=hybrid,ifname=tap0,depth=128,gro=1

This is an experimental/demo transport which couples tap for transmit
and a raw socket for receive. The raw socket allows multi-packet
receive, resulting in significantly higher packet rates than normal tap.

Privileges required: hybrid requires ``CAP_NET_RAW`` capability by
the UML user as well as the requirements for the tap transport.

raw socket transport
--------------------

Example::

   vecX:transport=raw,ifname=p-veth0,depth=128,gro=1

This transport uses vector IO on raw sockets. While you can bind to any
interface including a physical one, the most common use is to bind to
the "peer" side of a veth pair with the other side configured on the
host.

Example host configuration for Debian:

**/etc/network/interfaces**::

   auto veth0
   iface veth0 inet static
       address 192.168.4.1
       netmask 255.255.255.252
       broadcast 192.168.4.3
       pre-up ip link add veth0 type veth peer name p-veth0 && \
              ifconfig p-veth0 up

UML can now bind to p-veth0 like this::

   vec0:transport=raw,ifname=p-veth0,depth=128,gro=1

If the UML guest is configured with 192.168.4.2 and netmask 255.255.255.0
it can talk to the host on 192.168.4.1.
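
Inside the guest, the matching static configuration might look like the
following fragment of ``/etc/network/interfaces``. This is only a sketch
which restates the guest address and netmask given above and assumes the
interface is ``vec0``::

   auto vec0
   iface vec0 inet static
       address 192.168.4.2
       netmask 255.255.255.0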

The raw transport also provides some support for offloading some of the
filtering to the host. The two options to control it are:

* ``bpffile=str`` - filename of raw bpf code to be loaded as a socket filter

* ``bpfflash=int`` - 0/1 allow loading of bpf from inside User Mode Linux.
  This option allows the use of the ethtool load firmware command to
  load bpf code.

In either case the bpf code is loaded into the host kernel. While this is
presently limited to legacy bpf syntax (not ebpf), it is still a security
risk. It is not recommended to allow this unless the User Mode Linux
instance is considered trusted.

Privileges required: raw socket transport requires ``CAP_NET_RAW``
capability.

GRE socket transport
--------------------

Example::

   vecX:transport=gre,src=$src_host,dst=$dst_host

This will configure an Ethernet over ``GRE`` (aka ``GRETAP`` or
``GREIRB``) tunnel which will connect the UML instance to a ``GRE``
endpoint at host dst_host. ``GRE`` supports the following additional
options:

* ``rx_key=int`` - GRE 32 bit integer key for rx packets, if set,
  ``tx_key`` must be set too

* ``tx_key=int`` - GRE 32 bit integer key for tx packets, if set,
  ``rx_key`` must be set too

* ``sequence=[0,1]`` - enable GRE sequence

* ``pin_sequence=[0,1]`` - pretend that the sequence is always reset
  on each packet (needed to interoperate with some really broken
  implementations)

* ``v6=[0,1]`` - force IPv4 or IPv6 sockets respectively

* GRE checksum is not presently supported

GRE has a number of caveats:

* You can use only one GRE connection per IP address. There is no way to
  multiplex connections as each GRE tunnel is terminated directly on
  the UML instance.

* The key is not really a security feature. While it was intended as such
  its "security" is laughable. It is, however, a useful feature to
  ensure that the tunnel is not misconfigured.

An example configuration for a Linux host with a local address of
192.168.128.1 to connect to a UML instance at 192.168.129.1:

**/etc/network/interfaces**::

   auto gt0
   iface gt0 inet static
    address 10.0.0.1
    netmask 255.255.255.0
    broadcast 10.0.0.255
    mtu 1500
    pre-up ip link add gt0 type gretap local 192.168.128.1 \
           remote 192.168.129.1 || true
    down ip link del gt0 || true
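
For the host configuration above, the UML side of the tunnel would
presumably be started with an option along these lines. The two addresses
are taken from the example; the choice of ``vec0`` and the absence of any
key or sequence options are assumptions::

   vec0:transport=gre,src=192.168.129.1,dst=192.168.128.1

Inside UML, ``vec0`` can then be given an address in the same subnet as
``gt0``, for example 10.0.0.2.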

Additionally, GRE has been tested versus a variety of network equipment.

Privileges required: GRE requires ``CAP_NET_RAW``

l2tpv3 socket transport
-----------------------

**Warning**: L2TPv3 has a "bug". It is the "bug" known as "has more
options than GNU ls". While it has some advantages, there are usually
easier (and less verbose) ways to connect a UML instance to something.
For example, most devices which support L2TPv3 also support GRE.

Example::

    vec0:transport=l2tpv3,udp=1,src=$src_host,dst=$dst_host,srcport=$src_port,dstport=$dst_port,depth=128,rx_session=0xffffffff,tx_session=0xffff

This will configure an Ethernet over L2TPv3 fixed tunnel which will
connect the UML instance to an L2TPv3 endpoint at host $dst_host using
the L2TPv3 UDP flavour and UDP destination port $dst_port.

L2TPv3 always requires the following additional options:

* ``rx_session=int`` - l2tpv3 32 bit integer session for rx packets

* ``tx_session=int`` - l2tpv3 32 bit integer session for tx packets

As the tunnel is fixed these are not negotiated and they are
preconfigured on both ends.

Additionally, L2TPv3 supports the following optional parameters:

* ``rx_cookie=int`` - l2tpv3 32 bit integer cookie for rx packets - same
  functionality as GRE key, more to prevent misconfiguration than provide
  actual security

* ``tx_cookie=int`` - l2tpv3 32 bit integer cookie for tx packets

* ``cookie64=[0,1]`` - use 64 bit cookies instead of 32 bit.

* ``counter=[0,1]`` - enable l2tpv3 counter

* ``pin_counter=[0,1]`` - pretend that the counter is always reset on
  each packet (needed to interoperate with some really broken
  implementations)

* ``v6=[0,1]`` - force v6 sockets

* ``udp=[0,1]`` - use raw sockets (0) or UDP (1) version of the protocol

L2TPv3 has a number of caveats:

* You can use only one connection per IP address in raw mode. There is
  no way to multiplex connections as each L2TPv3 tunnel is terminated
  directly on the UML instance. UDP mode can use different ports for
  this purpose.

Here is an example of how to configure a Linux host to connect to UML
via L2TPv3:

**/etc/network/interfaces**::

   auto l2tp1
   iface l2tp1 inet static
    address 192.168.126.1
    netmask 255.255.255.0
    broadcast 192.168.126.255
    mtu 1500
    pre-up ip l2tp add tunnel remote 127.0.0.1 \
           local 127.0.0.1 encap udp tunnel_id 2 \
           peer_tunnel_id 2 udp_sport 1706 udp_dport 1707 && \
           ip l2tp add session name l2tp1 tunnel_id 2 \
           session_id 0xffffffff peer_session_id 0xffffffff
    down ip l2tp del session tunnel_id 2 session_id 0xffffffff && \
           ip l2tp del tunnel tunnel_id 2

Privileges required: L2TPv3 requires ``CAP_NET_RAW`` for raw IP mode and
no special privileges for the UDP mode.

BESS socket transport
---------------------

BESS is a high performance modular network switch.

https://github.com/NetSys/bess

It has support for a simple sequential packet socket mode which in the
more recent versions is using vector IO for high performance.

Example::

   vecX:transport=bess,src=$unix_src,dst=$unix_dst

This will configure a BESS transport using the unix_src Unix domain
socket address as source and unix_dst socket address as destination.

For BESS configuration and how to allocate a BESS Unix domain socket port
please see the BESS documentation.

https://github.com/NetSys/bess/wiki/Built-In-Modules-and-Ports

BESS transport does not require any special privileges.

Configuring Legacy transports
=============================

Legacy transports are now considered obsolete. Please use the vector
versions.

***********
Running UML
***********

This section assumes that either the user-mode-linux package from the
distribution or a custom built kernel has been installed on the host.

These add an executable called linux to the system. This is the UML
kernel. It can be run just like any other executable.
It will take most normal linux kernel arguments as command line
arguments.  Additionally, it will need some UML specific arguments
in order to do something useful.

Arguments
=========

Mandatory Arguments:
--------------------

* ``mem=int[K,M,G]`` - amount of memory. The value is in bytes by
  default; K, M or G qualifiers are also accepted.

* ``ubdX[s,d,c,t]=`` virtual disk specification. This is not really
  mandatory, but it is likely to be needed in nearly all cases so we can
  specify a root file system.
  The simplest possible image specification is the name of the image
  file for the filesystem (created using one of the methods described
  in `Creating an image`_).

  * UBD devices support copy on write (COW). The changes are kept in
    a separate file which can be discarded allowing a rollback to the
    original pristine image.  If COW is desired, the UBD image is
    specified as: ``cow_file,master_image``.
    Example: ``ubd0=Filesystem.cow,Filesystem.img``

  * UBD devices can be set to use synchronous IO. Any writes are
    immediately flushed to disk. This is done by adding ``s`` after
    the ``ubdX`` specification.

  * UBD performs some heuristics on devices specified as a single
    filename to make sure that a COW file has not been specified as
    the image. To turn them off, use the ``d`` flag after ``ubdX``.

  * UBD supports TRIM - asking the Host OS to reclaim any unused
    blocks in the image. To turn it off, specify the ``t`` flag after
    ``ubdX``.

* ``root=`` root device - most likely ``/dev/ubd0`` (this is a Linux
  filesystem image)
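
The ``ubdX`` forms described above can be generated by a small helper.
This is a sketch only: ``ubd_arg`` is a hypothetical function, the flag
letters are passed through unchecked, and ``Scratch.img`` is a made-up
file name::

   #!/bin/sh
   # ubd_arg INDEX FLAGS IMAGE [COW_FILE]
   # FLAGS is a string of the flag letters documented above (for example
   # "s" for synchronous IO), or empty for the defaults.
   ubd_arg() {
       idx=$1
       flags=$2
       image=$3
       cow=${4:-}
       if [ -n "$cow" ]; then
           # COW layering: the private COW file comes first
           printf 'ubd%s%s=%s,%s\n' "$idx" "$flags" "$cow" "$image"
       else
           printf 'ubd%s%s=%s\n' "$idx" "$flags" "$image"
       fi
   }

   ubd_arg 0 "" Filesystem.img
   ubd_arg 0 "" Filesystem.img Filesystem.cow
   ubd_arg 1 s Scratch.img

The three calls print ``ubd0=Filesystem.img``,
``ubd0=Filesystem.cow,Filesystem.img`` and ``ubd1s=Scratch.img``.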

Important Optional Arguments
----------------------------

If UML is run as "linux" with no extra arguments, it will try to start an
xterm for every console configured inside the image (up to 6 in most
Linux distributions). This makes it nice and easy to use UML on a host
with a GUI. It is, however, the wrong approach if UML is to be used as a
testing harness or run in a text-only environment.

In order to change this behaviour we need to specify an alternative console
and wire it to one of the supported "line" channels. For this we need to map a
console to use something different from the default xterm.

Example which will divert console number 1 to stdin/stdout::

   con1=fd:0,fd:1

UML supports a wide variety of serial line channels which are specified using
the following syntax::

   conX=channel_type:options[,channel_type:options]

If the channel specification contains two parts separated by comma, the first
one is input, the second one output.

* The null channel - discard all input or output. Example: ``con=null``
  will set all consoles to null by default.

* The fd channel - use file descriptor numbers for input/output. Example:
  ``con1=fd:0,fd:1``.

* The port channel - listen on tcp port number. Example: ``con1=port:4321``

* The pty and pts channels - use system pty/pts.

* The tty channel - bind to an existing system tty. Example:
  ``con1=tty:/dev/tty8`` will make UML use the host 8th console (usually
  unused).

* The xterm channel - this is the default - bring up an xterm on this channel
  and direct IO to it. Note, that in order for xterm to work, the host must
  have the UML distribution package installed. This usually contains the
  port-helper and other utilities needed for UML to communicate with the xterm.
  Alternatively, these need to be compiled and installed from source.

All options applicable to consoles also apply to UML serial lines which are
presented as ttyS inside UML.

Starting UML
============

We can now run UML::

   # linux mem=2048M umid=TEST \
    ubd0=Filesystem.img \
    vec0:transport=tap,ifname=tap0,depth=128,gro=1 \
    root=/dev/ubda con=null con0=null,fd:2 con1=fd:0,fd:1

This will run an instance with ``2048M`` of RAM and try to use the image file
called ``Filesystem.img`` as root. It will connect to the host using tap0.
All consoles except ``con1`` will be disabled and console 1 will
use standard input/output, making it appear in the same terminal in which
it was started.

Logging in
==========

If you have not set up a password when generating the image, you will have to
shut down the UML instance, mount the image, chroot into it and set it - as
described in the `Creating an image`_ section.  If the password is already set,
you can just log in.

The UML Management Console
==========================

In addition to managing the image from "the inside" using normal sysadmin tools,
it is possible to perform a number of low level operations using the UML
management console. The UML management console is a low-level interface to the
kernel on a running UML instance, somewhat like the i386 SysRq interface. Since
there is a full-blown operating system under UML, there is much greater
flexibility possible than with the SysRq mechanism.

There are a number of things you can do with the mconsole interface:

* Get the kernel version
* Add and remove devices
* Halt or reboot the machine
* Send SysRq commands
* Pause and resume the UML
* Inspect processes running inside UML
* Inspect UML internal /proc state

You need the mconsole client (uml\_mconsole) which is a part of the UML
tools package available in most Linux distributions.

You also need ``CONFIG_MCONSOLE`` (under 'General Setup') enabled in the UML
kernel.  When you boot UML, you'll see a line like::

   mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole

If you specify a unique machine id on the UML command line, i.e.
``umid=debian``, you'll see this::

   mconsole initialized on /home/jdike/.uml/debian/mconsole

That file is the socket that uml_mconsole will use to communicate with
UML.  Run it with either the umid or the full path as its argument::

   # uml_mconsole debian

or::

   # uml_mconsole /home/jdike/.uml/debian/mconsole
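
A script can derive the socket path from the umid instead of parsing the
boot messages. The sketch below assumes the socket lives under ``~/.uml``
of the user who started UML, as in the messages above;
``mconsole_socket`` is a hypothetical helper name::

   #!/bin/sh
   # mconsole_socket UMID
   # Prints the expected socket path, assuming the default ~/.uml directory.
   mconsole_socket() {
       printf '%s/.uml/%s/mconsole\n' "$HOME" "$1"
   }

   sock=$(mconsole_socket debian)
   if [ -S "$sock" ]; then
       echo "found $sock"
   else
       echo "no mconsole socket at $sock"
   fi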

You'll get a prompt, at which you can run one of these commands:

* version
* help
* halt
* reboot
* config
* remove
* sysrq
* cad
* stop
* go
* proc
* stack

version
-------

This command takes no arguments.  It prints the UML version::

   (mconsole)  version
   OK Linux OpenWrt 4.14.106 #0 Tue Mar 19 08:19:41 2019 x86_64

There are a couple of actual uses for this.  It's a simple no-op which
can be used to check that a UML is running.  It's also a way of
sending a device interrupt to the UML. UML mconsole is treated internally as
a UML device.
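
The "is it running" check lends itself to scripting, for instance to wait
until a freshly started instance answers. The following is a sketch under
two assumptions: that ``uml_mconsole`` accepts the command as extra
arguments after the umid, and that a successful reply starts with ``OK``
as in the transcript above. ``wait_for_uml`` is a hypothetical name::

   #!/bin/sh
   # wait_for_uml UMID [TRIES]
   # Polls the mconsole socket once per second until "version" answers
   # with a line starting with "OK", or until TRIES attempts have failed.
   wait_for_uml() {
       umid=$1
       tries=${2:-30}
       while [ "$tries" -gt 0 ]; do
           if uml_mconsole "$umid" version 2>/dev/null | grep -q '^OK'; then
               return 0
           fi
           tries=$((tries - 1))
           sleep 1
       done
       return 1
   }

It could be used as ``wait_for_uml debian 60 && echo "UML is up"``.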

help
----

This command takes no arguments. It prints a short help screen with the
supported mconsole commands.

halt and reboot
---------------

These commands take no arguments.  They shut the machine down immediately, with
no syncing of disks and no clean shutdown of userspace.  So, they are
pretty close to crashing the machine::

   (mconsole)  halt
   OK

config
------

"config" adds a new device to the virtual machine. This is supported
by most UML device drivers. It takes one argument, which is the
device to add, with the same syntax as the kernel command line::

   (mconsole) config ubd3=/home/jdike/incoming/roots/root_fs_debian22

remove
------

"remove" deletes a device from the system.  Its argument is just the
name of the device to be removed. The device must be idle in whatever
sense the driver considers necessary.  In the case of the ubd driver,
the removed block device must not be mounted, swapped on, or otherwise
open, and in the case of the network driver, the device must be down::

   (mconsole)  remove ubd3

sysrq
-----

This command takes one argument, which is a single letter.  It calls the
generic kernel's SysRq driver, which does whatever is called for by
that argument.  See the SysRq documentation in
Documentation/admin-guide/sysrq.rst in your favorite kernel tree to
see what letters are valid and what they do.

cad
---

This invokes the ``Ctrl-Alt-Del`` action in the running image.  What exactly
this ends up doing is up to init, systemd, etc.  Normally, it reboots the
machine.
827
828stop
829----
830
831This puts the UML in a loop reading mconsole requests until a 'go'
832mconsole command is received. This is very useful as a
833debugging/snapshotting tool.
834
835go
836--
837
838This resumes a UML after being paused by a 'stop' command. Note that
839when the UML has resumed, TCP connections may have timed out and if
840the UML is paused for a long period of time, crond might go a little
841crazy, running all the jobs it didn't do earlier.
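
As a rough illustration of the snapshotting use mentioned for 'stop', the
two commands can bracket a copy of a disk image taken on the host. This is
only a sketch: it assumes the ``uml_mconsole`` client from the UML utilities,
invoked as ``uml_mconsole <umid> <command>``, and GNU ``cp``; the function
name ``snapshot_uml`` and its arguments are made up for this example. Data
still held in the guest's page cache is not flushed by 'stop', so the copy
is at best crash-consistent::

   # Hypothetical helper: pause a UML, copy one of its disk images, resume.
   # $1 = umid of the instance, $2 = image (or COW) file, $3 = destination
   snapshot_uml() {
       # Assumes uml_mconsole exits non-zero when the request fails
       uml_mconsole "$1" stop || return 1
       # Keep holes in the image sparse in the copy
       cp --sparse=always "$2" "$3"
       rc=$?
       # Always resume the guest, even if the copy failed
       uml_mconsole "$1" go
       return "$rc"
   }

It would be called as ``snapshot_uml <umid> root_fs_cow root_fs_cow.snap``.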
842
843proc
844----
845
This takes one argument - the name of a file in /proc which is printed
to the mconsole standard output.
848
849stack
850-----
851
This takes one argument - the pid number of a process. Its stack is
printed to standard output.
854
855*******************
856Advanced UML Topics
857*******************
858
859Sharing Filesystems between Virtual Machines
860============================================
861
862Don't attempt to share filesystems simply by booting two UMLs from the
863same file.  That's the same thing as booting two physical machines
864from a shared disk.  It will result in filesystem corruption.
865
866Using layered block devices
867---------------------------
868
869The way to share a filesystem between two virtual machines is to use
870the copy-on-write (COW) layering capability of the ubd block driver.
871Any changed blocks are stored in the private COW file, while reads come
872from either device - the private one if the requested block is valid in
873it, the shared one if not.  Using this scheme, the majority of data
874which is unchanged is shared between an arbitrary number of virtual
875machines, each of which has a much smaller file containing the changes
876that it has made.  With a large number of UMLs booting from a large root
877filesystem, this leads to a huge disk space saving.
878
879Sharing file system data will also help performance, since the host will
880be able to cache the shared data using a much smaller amount of memory,
881so UML disk requests will be served from the host's memory rather than
its disks.  There is a major caveat in doing this on multisocket NUMA
machines.  On such hardware, running many UML instances with a shared
master image and COW changes may cause issues like NMIs from excess
inter-socket traffic.
886
887If you are running UML on high end hardware like this, make sure to
888bind UML to a set of logical cpus residing on the same socket using the
889``taskset`` command or have a look at the "tuning" section.
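
One way to find such a set of logical CPUs is to parse the output of
``lscpu``. The sketch below assumes the util-linux ``lscpu -p=CPU,SOCKET``
output format (comment lines starting with ``#`` followed by ``cpu,socket``
pairs); ``cpus_on_socket`` is a name invented for this example::

   # Print a comma-separated list of the logical CPUs on socket $1,
   # reading "cpu,socket" lines from standard input
   cpus_on_socket() {
       awk -F, -v s="$1" '!/^#/ && $2 == s { print $1 }' | paste -sd, -
   }

The output of ``lscpu -p=CPU,SOCKET | cpus_on_socket 0`` is in a form that
can be passed directly to ``taskset -c``.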
890
891To add a copy-on-write layer to an existing block device file, simply
892add the name of the COW file to the appropriate ubd switch::
893
894   ubd0=root_fs_cow,root_fs_debian_22
895
896where ``root_fs_cow`` is the private COW file and ``root_fs_debian_22`` is
897the existing shared filesystem.  The COW file need not exist.  If it
898doesn't, the driver will create and initialize it.
899
900Disk Usage
901----------
902
UML has TRIM support which will release any unused space in its disk
image files to the underlying OS. Use either ``ls -ls`` or ``du`` to
check how much space an image file actually occupies.
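
The reason is that image files are typically sparse, so the size reported
by a plain ``ls -l`` is the apparent size rather than the space in use. A
small sketch on a scratch file (``demo.img`` is a throwaway name for this
example; ``truncate`` and ``du --apparent-size`` are GNU coreutils)::

   # Create a 64 MiB file that consists entirely of a hole
   truncate -s 64M demo.img
   # First column: blocks actually allocated; size column: apparent size
   ls -ls demo.img
   # Space really used on the host, then the apparent size
   du -h demo.img
   du -h --apparent-size demo.img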
906
COW validity
------------
909
910Any changes to the master image will invalidate all COW files. If this
911happens, UML will *NOT* automatically delete any of the COW files and
912will refuse to boot. In this case the only solution is to either
913restore the old image (including its last modified timestamp) or remove
914all COW files which will result in their recreation. Any changes in
915the COW files will be lost.
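
Since such a restore has to include the last modified timestamp, any backup
of a master image should preserve it. A minimal sketch using ``cp -p`` on a
stand-in file (the file names are placeholders for this example, and
``stat -c`` is the GNU coreutils form)::

   # Stand-in for a master image
   printf 'master image contents\n' > master.img
   # -p preserves the modification time (and mode/ownership where permitted)
   cp -p master.img master.img.orig
   # ... master.img gets modified by accident ...
   printf 'oops\n' >> master.img
   # Restore both the contents and the original timestamp
   cp -p master.img.orig master.img
   # Both files should now report the same modification time
   stat -c '%Y %n' master.img master.img.orig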
916
917Cows can moo - uml_moo : Merging a COW file with its backing file
918-----------------------------------------------------------------
919
920Depending on how you use UML and COW devices, it may be advisable to
921merge the changes in the COW file into the backing file every once in
922a while.
923
924The utility that does this is uml_moo.  Its usage is::
925
926   uml_moo COW_file new_backing_file
927
928
929There's no need to specify the backing file since that information is
930already in the COW file header.  If you're paranoid, boot the new
931merged file, and if you're happy with it, move it over the old backing
932file.
933
934``uml_moo`` creates a new backing file by default as a safety measure.
935It also has a destructive merge option which will merge the COW file
936directly into its current backing file.  This is really only usable
937when the backing file only has one COW file associated with it.  If
938there are multiple COWs associated with a backing file, a -d merge of
939one of them will invalidate all of the others.  However, it is
940convenient if you're short of disk space, and it should also be
941noticeably faster than a non-destructive merge.
942
943``uml_moo`` is installed with the UML distribution packages and is
944available as a part of UML utilities.
945
946Host file access
947==================
948
949If you want to access files on the host machine from inside UML, you
950can treat it as a separate machine and either nfs mount directories
951from the host or copy files into the virtual machine with scp.
952However, since UML is running on the host, it can access those
953files just like any other process and make them available inside the
954virtual machine without the need to use the network.
955This is possible with the hostfs virtual filesystem.  With it, you
956can mount a host directory into the UML filesystem and access the
957files contained in it just as you would on the host.
958
959*SECURITY WARNING*
960
961Hostfs without any parameters to the UML Image will allow the image
962to mount any part of the host filesystem and write to it. Always
963confine hostfs to a specific "harmless" directory (for example ``/var/tmp``)
964if running UML. This is especially important if UML is being run as root.
965
966Using hostfs
967------------
968
969To begin with, make sure that hostfs is available inside the virtual
970machine with::
971
972   # cat /proc/filesystems
973
974``hostfs`` should be listed.  If it's not, either rebuild the kernel
975with hostfs configured into it or make sure that hostfs is built as a
976module and available inside the virtual machine, and insmod it.
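
The check can also be scripted. The sketch below assumes the usual
``/proc/filesystems`` layout, where the filesystem name is the last field
on each line; ``have_fs`` and its optional second argument are invented for
this example::

   # Return success if filesystem type $1 is listed in $2
   # (default: /proc/filesystems)
   have_fs() {
       awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
           "${2:-/proc/filesystems}"
   }

Inside the virtual machine this could be used as
``have_fs hostfs || modprobe hostfs``, provided hostfs was built as a module.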
977
978
979Now all you need to do is run mount::
980
981   # mount none /mnt/host -t hostfs
982
983will mount the host's ``/`` on the virtual machine's ``/mnt/host``.
984If you don't want to mount the host root directory, then you can
985specify a subdirectory to mount with the -o switch to mount::
986
987   # mount none /mnt/home -t hostfs -o /home
988
will mount the host's ``/home`` on the virtual machine's ``/mnt/home``.
990
991hostfs as the root filesystem
992-----------------------------
993
994It's possible to boot from a directory hierarchy on the host using
995hostfs rather than using the standard filesystem in a file.
996To start, you need that hierarchy.  The easiest way is to loop mount
997an existing root_fs file::
998
999   #  mount root_fs uml_root_dir -o loop
1000
1001
1002You need to change the filesystem type of ``/`` in ``etc/fstab`` to be
1003'hostfs', so that line looks like this::
1004
1005   /dev/ubd/0       /        hostfs      defaults          1   1
1006
1007Then you need to chown to yourself all the files in that directory
1008that are owned by root.  This worked for me::
1009
1010   #  find . -uid 0 -exec chown jdike {} \;
1011
1012Next, make sure that your UML kernel has hostfs compiled in, not as a
1013module.  Then run UML with the boot device pointing at that directory::
1014
1015   ubd0=/path/to/uml/root/directory
1016
1017UML should then boot as it does normally.
1018
1019Hostfs Caveats
1020--------------
1021
Hostfs does not track changes made to the host filesystem from outside
UML. As a result, if a file is changed on the host without UML's
knowledge, UML will not know about it and its own in-memory cache of
the file may be out of date. While it is possible to fix this, it is not
something which is being worked on at present.
1027
1028Tuning UML
1029============
1030
UML at present is strictly uniprocessor. It will, however, spin up a
number of threads to handle various functions.
1033
The UBD driver, SIGIO and the MMU emulation do that. If the system is
idle, these threads will be migrated to other processors on an SMP host.
This, unfortunately, will usually result in LOWER performance because of
all of the cache/memory synchronization traffic between cores. As a
result, UML will usually benefit from being pinned on a single CPU,
especially on a large system. This can result in performance differences
of 5 times or higher on some benchmarks.
1041
Similarly, on large multi-node NUMA systems UML will benefit if all of
its memory is allocated from the same NUMA node it will run on. The
OS will *NOT* do that by default. In order to do that, the sysadmin
needs to create a suitable tmpfs ramdisk bound to a particular node
and use that as the source for UML RAM allocation by specifying it
in one of the ``TMPDIR``, ``TMP`` or ``TEMP`` environment variables,
which UML consults for that purpose. If that fails, it will
look for shmfs mounted under ``/dev/shm``. If everything else fails, it
will use ``/tmp/`` regardless of the filesystem type used for it::
1051
1052   mount -t tmpfs -ompol=bind:X none /mnt/tmpfs-nodeX
1053   TEMP=/mnt/tmpfs-nodeX taskset -cX linux options options options..
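
The lookup order described above can be approximated in a few lines of
shell. This is a sketch of the documented order only, not UML's actual
code (which may apply further checks to each candidate); ``uml_tmpdir`` is
a name made up for this example and ``stat -f -c %T`` is the GNU coreutils
form::

   # Print the directory that would be picked according to the order
   # described in the text: TMPDIR, TMP, TEMP, /dev/shm, /tmp
   uml_tmpdir() {
       for dir in "$TMPDIR" "$TMP" "$TEMP"; do
           if [ -n "$dir" ]; then
               echo "$dir"
               return
           fi
       done
       # Fall back to /dev/shm when a tmpfs is mounted there
       if [ "$(stat -f -c %T /dev/shm 2>/dev/null)" = tmpfs ]; then
           echo /dev/shm
       else
           echo /tmp
       fi
   }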
1054
1055*******************************************
1056Contributing to UML and Developing with UML
1057*******************************************
1058
1059UML is an excellent platform to develop new Linux kernel concepts -
1060filesystems, devices, virtualization, etc. It provides unrivalled
1061opportunities to create and test them without being constrained to
1062emulating specific hardware.
1063
Example - want to see how Linux will work with 4096 "proper" network
devices?
1066
1067Not an issue with UML. At the same time, this is something which
1068is difficult with other virtualization packages - they are
1069constrained by the number of devices allowed on the hardware bus
1070they are trying to emulate (for example 16 on a PCI bus in qemu).
1071
If you have something to contribute such as a patch, a bugfix, a
new feature, please send it to ``linux-um@lists.infradead.org``.

Please follow all standard Linux patch guidelines such as cc-ing
relevant maintainers and running ``./scripts/checkpatch.pl`` on your patch.
For more details see ``Documentation/process/submitting-patches.rst``.
1078
Note - the list does not accept HTML or attachments; all emails must
be formatted as plain text.
1081
Developing always goes hand in hand with debugging. First of all,
you can always run UML under gdb and there is a section later on
describing how to do that. That, however, is not the only way to
debug a Linux kernel. Quite often adding tracing statements and/or
using UML specific approaches such as ptracing the UML kernel process
are significantly more informative.
1088
1089Tracing UML
1090=============
1091
When running, UML consists of a main kernel thread and a number of
1093helper threads. The ones of interest for tracing are NOT the ones
1094that are already ptraced by UML as a part of its MMU emulation.
1095
These are usually the first three threads visible in a ps display.
The one with the lowest PID number and using most CPU is usually the
kernel thread. The other threads are the disk (ubd) device helper
thread and the sigio helper thread.
Running strace on the kernel thread usually results in the following
picture::
1101
1102   host$ strace -p 16566
1103   --- SIGIO {si_signo=SIGIO, si_code=POLL_IN, si_band=65} ---
1104   epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1
1105   epoll_wait(4, [], 64, 0)                = 0
1106   rt_sigreturn({mask=[PIPE]})             = 16967
1107   ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0
1108   ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0
1109   ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0
1110   ptrace(PTRACE_SETREGS, 16967, NULL, 0xd5f34f38) = 0
1111   ptrace(PTRACE_SETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=2696}]) = 0
1112   ptrace(PTRACE_SYSEMU, 16967, NULL, 0)   = 0
1113   --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=16967, si_uid=0, si_status=SIGTRAP, si_utime=65, si_stime=89} ---
1114   wait4(16967, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP | 0x80}], WSTOPPED|__WALL, NULL) = 16967
1115   ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0
1116   ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0
1117   ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0
1118   timer_settime(0, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=2830912}}, NULL) = 0
1119   getpid()                                = 16566
1120   clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=1, tv_nsec=0}, NULL) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
1121   --- SIGALRM {si_signo=SIGALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=1631716592, ptr=0x614204f0}} ---
1122   rt_sigreturn({mask=[PIPE]})             = -1 EINTR (Interrupted system call)
1123
This is a typical picture from a mostly idle UML instance:
1125
* UML interrupt controller uses epoll - this is UML waiting for IO
  interrupts::

     epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1
1130
* The sequence of ptrace calls is part of MMU emulation and running the
  UML userspace.
* ``timer_settime`` is part of the UML high res timer subsystem mapping
  timer requests from inside UML onto the host high resolution timers.
* ``clock_nanosleep`` is UML going into idle (similar to the way a PC
  will execute an ACPI idle).
1137
As you can see, UML will generate quite a bit of output even in idle. The
output can be very informative when observing IO. It shows the actual IO
calls, their arguments and return values.
1141
1142Kernel debugging
1143================
1144
1145You can run UML under gdb now, though it will not necessarily agree to
1146be started under it. If you are trying to track a runtime bug, it is
1147much better to attach gdb to a running UML instance and let UML run.
1148
1149Assuming the same PID number as in the previous example, this would be::
1150
1151   # gdb -p 16566
1152
This will STOP the UML instance, so you must enter ``cont`` at the GDB
command line to request it to continue. It may be a good idea to make
this into a gdb script and pass it to gdb as an argument.
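
A minimal sketch of such a script follows. The file name ``uml-cont.gdb``
is arbitrary and the script contains only the ``continue`` command;
anything else (breakpoints, signal handling) is left to the reader::

   # Write a one-line gdb command file that resumes the attached UML
   printf 'continue\n' > uml-cont.gdb

With the PID from the example above, it would be passed to gdb using the
``-x`` option: ``gdb -p 16566 -x uml-cont.gdb``.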
1156
1157Developing Device Drivers
1158=========================
1159
1160Nearly all UML drivers are monolithic. While it is possible to build a
1161UML driver as a kernel module, that limits the possible functionality
1162to in-kernel only and non-UML specific.  The reason for this is that
1163in order to really leverage UML, one needs to write a piece of
1164userspace code which maps driver concepts onto actual userspace host
1165calls.
1166
1167This forms the so called "user" portion of the driver. While it can
1168reuse a lot of kernel concepts, it is generally just another piece of
1169userspace code. This portion needs some matching "kernel" code which
1170resides inside the UML image and which implements the Linux kernel part.
1171
1172*Note: There are very few limitations in the way "kernel" and "user" interact*.
1173
1174UML does not have a strictly defined kernel to host API. It does not
1175try to emulate a specific architecture or bus. UML's "kernel" and
1176"user" can share memory, code and interact as needed to implement
1177whatever design the software developer has in mind. The only
1178limitations are purely technical. Due to a lot of functions and
1179variables having the same names, the developer should be careful
1180which includes and libraries they are trying to refer to.
1181
As a result, a lot of userspace code consists of simple wrappers.
For example, ``os_close_file()`` is just a wrapper around ``close()``
which ensures that the userspace function close does not clash
with similarly named function(s) in the kernel part.
1186
1187Security Considerations
1188-----------------------
1189
Drivers or any new functionality should default to not
accepting arbitrary filenames, BPF code or other parameters
which can affect the host from inside the UML instance.
1193For example, specifying the socket used for IPC communication
1194between a driver and the host at the UML command line is OK
1195security-wise. Allowing it as a loadable module parameter
1196isn't.
1197
If such functionality is desirable for a particular application
(e.g. loading BPF "firmware" for raw socket network transports),
it should be off by default and should be explicitly turned on
as a command line parameter at startup.
1202
1203Even with this in mind, the level of isolation between UML
1204and the host is relatively weak. If the UML userspace is
1205allowed to load arbitrary kernel drivers, an attacker can
1206use this to break out of UML. Thus, if UML is used in
1207a production application, it is recommended that all modules
1208are loaded at boot and kernel module loading is disabled
1209afterwards.
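
One way to do the latter from inside the UML instance is the generic
``kernel.modules_disabled`` sysctl, which cannot be switched back off
without a reboot. The sketch below assumes the UML kernel exposes it at the
usual ``/proc/sys`` path; the helper name and its optional path argument
are made up for this example::

   # Hypothetical helper, to be run as root inside the UML instance once
   # all required modules have been loaded. $1 optionally overrides the
   # sysctl path (useful for a dry run against a scratch file).
   lock_module_loading() {
       knob=${1:-/proc/sys/kernel/modules_disabled}
       echo 1 > "$knob"
   }

Calling ``lock_module_loading`` with no argument at the end of the boot
scripts applies the setting for real.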
1210