xref: /openbmc/linux/Documentation/process/botching-up-ioctls.rst (revision c900529f3d9161bfde5cca0754f83b4d3c3e0220)
15ecd0a06SJonathan Corbet=================================
25ecd0a06SJonathan Corbet(How to avoid) Botching up ioctls
35ecd0a06SJonathan Corbet=================================
45ecd0a06SJonathan Corbet
5e7b4311eSAlexander A. KlimovFrom: https://blog.ffwll.ch/2013/11/botching-up-ioctls.html
65ecd0a06SJonathan Corbet
75ecd0a06SJonathan CorbetBy: Daniel Vetter, Copyright © 2013 Intel Corporation
85ecd0a06SJonathan Corbet
95ecd0a06SJonathan CorbetOne clear insight kernel graphics hackers gained in the past few years is that
105ecd0a06SJonathan Corbettrying to come up with a unified interface to manage the execution units and
115ecd0a06SJonathan Corbetmemory on completely different GPUs is a futile effort. So nowadays every
125ecd0a06SJonathan Corbetdriver has its own set of ioctls to allocate memory and submit work to the GPU.
135ecd0a06SJonathan CorbetWhich is nice, since there's no more insanity in the form of fake-generic, but
145ecd0a06SJonathan Corbetactually only used once interfaces. But the clear downside is that there's much
155ecd0a06SJonathan Corbetmore potential to screw things up.
165ecd0a06SJonathan Corbet
175ecd0a06SJonathan CorbetTo avoid repeating all the same mistakes again I've written up some of the
185ecd0a06SJonathan Corbetlessons learned while botching the job for the drm/i915 driver. Most of these
195ecd0a06SJonathan Corbetonly cover technicalities and not the big-picture issues like what the command
205ecd0a06SJonathan Corbetsubmission ioctl exactly should look like. Learning these lessons is probably
215ecd0a06SJonathan Corbetsomething every GPU driver has to do on its own.
225ecd0a06SJonathan Corbet
235ecd0a06SJonathan Corbet
245ecd0a06SJonathan CorbetPrerequisites
255ecd0a06SJonathan Corbet-------------
265ecd0a06SJonathan Corbet
275ecd0a06SJonathan CorbetFirst the prerequisites. Without these you have already failed, because you
285ecd0a06SJonathan Corbetwill need to add a 32-bit compat layer:
295ecd0a06SJonathan Corbet
305ecd0a06SJonathan Corbet * Only use fixed sized integers. To avoid conflicts with typedefs in userspace
315ecd0a06SJonathan Corbet   the kernel has special types like __u32, __s64. Use them.
325ecd0a06SJonathan Corbet
335ecd0a06SJonathan Corbet * Align everything to the natural size and use explicit padding. 32-bit
345ecd0a06SJonathan Corbet   platforms don't necessarily align 64-bit values to 64-bit boundaries, but
355ecd0a06SJonathan Corbet   64-bit platforms do. So we always need padding to the natural size to get
365ecd0a06SJonathan Corbet   this right.
375ecd0a06SJonathan Corbet
385ecd0a06SJonathan Corbet * Pad the entire struct to a multiple of 64-bits if the structure contains
395ecd0a06SJonathan Corbet   64-bit types - the structure size will otherwise differ on 32-bit versus
405ecd0a06SJonathan Corbet   64-bit. Having a different structure size hurts when passing arrays of
415ecd0a06SJonathan Corbet   structures to the kernel, or if the kernel checks the structure size, which
425ecd0a06SJonathan Corbet   e.g. the drm core does.
435ecd0a06SJonathan Corbet
4453379797SFederico Vaga * Pointers are __u64, cast from/to a uintptr_t on the userspace side and
455ecd0a06SJonathan Corbet   from/to a void __user * in the kernel. Try really hard not to delay this
465ecd0a06SJonathan Corbet   conversion or worse, fiddle the raw __u64 through your code since that
475ecd0a06SJonathan Corbet   diminishes the checking tools like sparse can provide. The macro
485ecd0a06SJonathan Corbet   u64_to_user_ptr can be used in the kernel to avoid warnings about integers
49b275fb60SChris Packham   and pointers of different sizes.
505ecd0a06SJonathan Corbet
515ecd0a06SJonathan Corbet
525ecd0a06SJonathan CorbetBasics
535ecd0a06SJonathan Corbet------
545ecd0a06SJonathan Corbet
555ecd0a06SJonathan CorbetWith the joys of writing a compat layer avoided we can take a look at the basic
565ecd0a06SJonathan Corbetfumbles. Neglecting these will make backward and forward compatibility a real
575ecd0a06SJonathan Corbetpain. And since getting things wrong on the first attempt is guaranteed you
585ecd0a06SJonathan Corbetwill have a second iteration or at least an extension for any given interface.
595ecd0a06SJonathan Corbet
605ecd0a06SJonathan Corbet * Have a clear way for userspace to figure out whether your new ioctl or ioctl
615ecd0a06SJonathan Corbet   extension is supported on a given kernel. If you can't rely on old kernels
625ecd0a06SJonathan Corbet   rejecting the new flags/modes or ioctls (since doing that was botched in the
635ecd0a06SJonathan Corbet   past) then you need a driver feature flag or revision number somewhere.
645ecd0a06SJonathan Corbet
655ecd0a06SJonathan Corbet * Have a plan for extending ioctls with new flags or new fields at the end of
665ecd0a06SJonathan Corbet   the structure. The drm core checks the passed-in size for each ioctl call
675ecd0a06SJonathan Corbet   and zero-extends any mismatches between kernel and userspace. That helps,
685ecd0a06SJonathan Corbet   but isn't a complete solution since newer userspace on older kernels won't
695ecd0a06SJonathan Corbet   notice that the newly added fields at the end get ignored. So this still
705ecd0a06SJonathan Corbet   needs a new driver feature flags.
715ecd0a06SJonathan Corbet
725ecd0a06SJonathan Corbet * Check all unused fields and flags and all the padding for whether it's 0,
735ecd0a06SJonathan Corbet   and reject the ioctl if that's not the case. Otherwise your nice plan for
745ecd0a06SJonathan Corbet   future extensions is going right down the gutters since someone will submit
755ecd0a06SJonathan Corbet   an ioctl struct with random stack garbage in the yet unused parts. Which
765ecd0a06SJonathan Corbet   then bakes in the ABI that those fields can never be used for anything else
775ecd0a06SJonathan Corbet   but garbage. This is also the reason why you must explicitly pad all
785ecd0a06SJonathan Corbet   structures, even if you never use them in an array - the padding the compiler
795ecd0a06SJonathan Corbet   might insert could contain garbage.
805ecd0a06SJonathan Corbet
815ecd0a06SJonathan Corbet * Have simple testcases for all of the above.
825ecd0a06SJonathan Corbet
835ecd0a06SJonathan Corbet
845ecd0a06SJonathan CorbetFun with Error Paths
855ecd0a06SJonathan Corbet--------------------
865ecd0a06SJonathan Corbet
875ecd0a06SJonathan CorbetNowadays we don't have any excuse left any more for drm drivers being neat
885ecd0a06SJonathan Corbetlittle root exploits. This means we both need full input validation and solid
895ecd0a06SJonathan Corbeterror handling paths - GPUs will die eventually in the oddmost corner cases
905ecd0a06SJonathan Corbetanyway:
915ecd0a06SJonathan Corbet
925ecd0a06SJonathan Corbet * The ioctl must check for array overflows. Also it needs to check for
935ecd0a06SJonathan Corbet   over/underflows and clamping issues of integer values in general. The usual
945ecd0a06SJonathan Corbet   example is sprite positioning values fed directly into the hardware with the
955ecd0a06SJonathan Corbet   hardware just having 12 bits or so. Works nicely until some odd display
965ecd0a06SJonathan Corbet   server doesn't bother with clamping itself and the cursor wraps around the
975ecd0a06SJonathan Corbet   screen.
985ecd0a06SJonathan Corbet
995ecd0a06SJonathan Corbet * Have simple testcases for every input validation failure case in your ioctl.
1005ecd0a06SJonathan Corbet   Check that the error code matches your expectations. And finally make sure
1015ecd0a06SJonathan Corbet   that you only test for one single error path in each subtest by submitting
1025ecd0a06SJonathan Corbet   otherwise perfectly valid data. Without this an earlier check might reject
1035ecd0a06SJonathan Corbet   the ioctl already and shadow the codepath you actually want to test, hiding
1045ecd0a06SJonathan Corbet   bugs and regressions.
1055ecd0a06SJonathan Corbet
1065ecd0a06SJonathan Corbet * Make all your ioctls restartable. First X really loves signals and second
1075ecd0a06SJonathan Corbet   this will allow you to test 90% of all error handling paths by just
1085ecd0a06SJonathan Corbet   interrupting your main test suite constantly with signals. Thanks to X's
1095ecd0a06SJonathan Corbet   love for signal you'll get an excellent base coverage of all your error
1105ecd0a06SJonathan Corbet   paths pretty much for free for graphics drivers. Also, be consistent with
1115ecd0a06SJonathan Corbet   how you handle ioctl restarting - e.g. drm has a tiny drmIoctl helper in its
1125ecd0a06SJonathan Corbet   userspace library. The i915 driver botched this with the set_tiling ioctl,
1135ecd0a06SJonathan Corbet   now we're stuck forever with some arcane semantics in both the kernel and
1145ecd0a06SJonathan Corbet   userspace.
1155ecd0a06SJonathan Corbet
1165ecd0a06SJonathan Corbet * If you can't make a given codepath restartable make a stuck task at least
1175ecd0a06SJonathan Corbet   killable. GPUs just die and your users won't like you more if you hang their
1185ecd0a06SJonathan Corbet   entire box (by means of an unkillable X process). If the state recovery is
1195ecd0a06SJonathan Corbet   still too tricky have a timeout or hangcheck safety net as a last-ditch
1205ecd0a06SJonathan Corbet   effort in case the hardware has gone bananas.
1215ecd0a06SJonathan Corbet
1225ecd0a06SJonathan Corbet * Have testcases for the really tricky corner cases in your error recovery code
1235ecd0a06SJonathan Corbet   - it's way too easy to create a deadlock between your hangcheck code and
1245ecd0a06SJonathan Corbet   waiters.
1255ecd0a06SJonathan Corbet
1265ecd0a06SJonathan Corbet
1275ecd0a06SJonathan CorbetTime, Waiting and Missing it
1285ecd0a06SJonathan Corbet----------------------------
1295ecd0a06SJonathan Corbet
1305ecd0a06SJonathan CorbetGPUs do most everything asynchronously, so we have a need to time operations and
1315ecd0a06SJonathan Corbetwait for outstanding ones. This is really tricky business; at the moment none of
1325ecd0a06SJonathan Corbetthe ioctls supported by the drm/i915 get this fully right, which means there's
1335ecd0a06SJonathan Corbetstill tons more lessons to learn here.
1345ecd0a06SJonathan Corbet
1355ecd0a06SJonathan Corbet * Use CLOCK_MONOTONIC as your reference time, always. It's what alsa, drm and
1365ecd0a06SJonathan Corbet   v4l use by default nowadays. But let userspace know which timestamps are
1375ecd0a06SJonathan Corbet   derived from different clock domains like your main system clock (provided
1385ecd0a06SJonathan Corbet   by the kernel) or some independent hardware counter somewhere else. Clocks
1395ecd0a06SJonathan Corbet   will mismatch if you look close enough, but if performance measuring tools
1405ecd0a06SJonathan Corbet   have this information they can at least compensate. If your userspace can
1415ecd0a06SJonathan Corbet   get at the raw values of some clocks (e.g. through in-command-stream
1425ecd0a06SJonathan Corbet   performance counter sampling instructions) consider exposing those also.
1435ecd0a06SJonathan Corbet
1445ecd0a06SJonathan Corbet * Use __s64 seconds plus __u64 nanoseconds to specify time. It's not the most
1455ecd0a06SJonathan Corbet   convenient time specification, but it's mostly the standard.
1465ecd0a06SJonathan Corbet
1475ecd0a06SJonathan Corbet * Check that input time values are normalized and reject them if not. Note
1485ecd0a06SJonathan Corbet   that the kernel native struct ktime has a signed integer for both seconds
1495ecd0a06SJonathan Corbet   and nanoseconds, so beware here.
1505ecd0a06SJonathan Corbet
1515ecd0a06SJonathan Corbet * For timeouts, use absolute times. If you're a good fellow and made your
1525ecd0a06SJonathan Corbet   ioctl restartable relative timeouts tend to be too coarse and can
1535ecd0a06SJonathan Corbet   indefinitely extend your wait time due to rounding on each restart.
1545ecd0a06SJonathan Corbet   Especially if your reference clock is something really slow like the display
1555ecd0a06SJonathan Corbet   frame counter. With a spec lawyer hat on this isn't a bug since timeouts can
1565ecd0a06SJonathan Corbet   always be extended - but users will surely hate you if their neat animations
1575ecd0a06SJonathan Corbet   starts to stutter due to this.
1585ecd0a06SJonathan Corbet
1595ecd0a06SJonathan Corbet * Consider ditching any synchronous wait ioctls with timeouts and just deliver
1605ecd0a06SJonathan Corbet   an asynchronous event on a pollable file descriptor. It fits much better
1615ecd0a06SJonathan Corbet   into event driven applications' main loop.
1625ecd0a06SJonathan Corbet
1635ecd0a06SJonathan Corbet * Have testcases for corner-cases, especially whether the return values for
1645ecd0a06SJonathan Corbet   already-completed events, successful waits and timed-out waits are all sane
1655ecd0a06SJonathan Corbet   and suiting to your needs.
1665ecd0a06SJonathan Corbet
1675ecd0a06SJonathan Corbet
1685ecd0a06SJonathan CorbetLeaking Resources, Not
1695ecd0a06SJonathan Corbet----------------------
1705ecd0a06SJonathan Corbet
1715ecd0a06SJonathan CorbetA full-blown drm driver essentially implements a little OS, but specialized to
1725ecd0a06SJonathan Corbetthe given GPU platforms. This means a driver needs to expose tons of handles
1735ecd0a06SJonathan Corbetfor different objects and other resources to userspace. Doing that right
1745ecd0a06SJonathan Corbetentails its own little set of pitfalls:
1755ecd0a06SJonathan Corbet
1765ecd0a06SJonathan Corbet * Always attach the lifetime of your dynamically created resources to the
1775ecd0a06SJonathan Corbet   lifetime of a file descriptor. Consider using a 1:1 mapping if your resource
1785ecd0a06SJonathan Corbet   needs to be shared across processes -  fd-passing over unix domain sockets
1795ecd0a06SJonathan Corbet   also simplifies lifetime management for userspace.
1805ecd0a06SJonathan Corbet
1815ecd0a06SJonathan Corbet * Always have O_CLOEXEC support.
1825ecd0a06SJonathan Corbet
1835ecd0a06SJonathan Corbet * Ensure that you have sufficient insulation between different clients. By
1845ecd0a06SJonathan Corbet   default pick a private per-fd namespace which forces any sharing to be done
1855ecd0a06SJonathan Corbet   explicitly. Only go with a more global per-device namespace if the objects
1865ecd0a06SJonathan Corbet   are truly device-unique. One counterexample in the drm modeset interfaces is
1875ecd0a06SJonathan Corbet   that the per-device modeset objects like connectors share a namespace with
1885ecd0a06SJonathan Corbet   framebuffer objects, which mostly are not shared at all. A separate
1895ecd0a06SJonathan Corbet   namespace, private by default, for framebuffers would have been more
1905ecd0a06SJonathan Corbet   suitable.
1915ecd0a06SJonathan Corbet
1925ecd0a06SJonathan Corbet * Think about uniqueness requirements for userspace handles. E.g. for most drm
1935ecd0a06SJonathan Corbet   drivers it's a userspace bug to submit the same object twice in the same
1945ecd0a06SJonathan Corbet   command submission ioctl. But then if objects are shareable userspace needs
1955ecd0a06SJonathan Corbet   to know whether it has seen an imported object from a different process
1965ecd0a06SJonathan Corbet   already or not. I haven't tried this myself yet due to lack of a new class
1975ecd0a06SJonathan Corbet   of objects, but consider using inode numbers on your shared file descriptors
1985ecd0a06SJonathan Corbet   as unique identifiers - it's how real files are told apart, too.
1995ecd0a06SJonathan Corbet   Unfortunately this requires a full-blown virtual filesystem in the kernel.
2005ecd0a06SJonathan Corbet
2015ecd0a06SJonathan Corbet
2025ecd0a06SJonathan CorbetLast, but not Least
2035ecd0a06SJonathan Corbet-------------------
2045ecd0a06SJonathan Corbet
2055ecd0a06SJonathan CorbetNot every problem needs a new ioctl:
2065ecd0a06SJonathan Corbet
2075ecd0a06SJonathan Corbet * Think hard whether you really want a driver-private interface. Of course
2085ecd0a06SJonathan Corbet   it's much quicker to push a driver-private interface than engaging in
2095ecd0a06SJonathan Corbet   lengthy discussions for a more generic solution. And occasionally doing a
2105ecd0a06SJonathan Corbet   private interface to spearhead a new concept is what's required. But in the
211*d56b699dSBjorn Helgaas   end, once the generic interface comes around you'll end up maintaining two
2125ecd0a06SJonathan Corbet   interfaces. Indefinitely.
2135ecd0a06SJonathan Corbet
2145ecd0a06SJonathan Corbet * Consider other interfaces than ioctls. A sysfs attribute is much better for
2155ecd0a06SJonathan Corbet   per-device settings, or for child objects with fairly static lifetimes (like
2165ecd0a06SJonathan Corbet   output connectors in drm with all the detection override attributes). Or
2175ecd0a06SJonathan Corbet   maybe only your testsuite needs this interface, and then debugfs with its
2185ecd0a06SJonathan Corbet   disclaimer of not having a stable ABI would be better.
2195ecd0a06SJonathan Corbet
2205ecd0a06SJonathan CorbetFinally, the name of the game is to get it right on the first attempt, since if
2215ecd0a06SJonathan Corbetyour driver proves popular and your hardware platforms long-lived then you'll
2225ecd0a06SJonathan Corbetbe stuck with a given ioctl essentially forever. You can try to deprecate
2235ecd0a06SJonathan Corbethorrible ioctls on newer iterations of your hardware, but generally it takes
2245ecd0a06SJonathan Corbetyears to accomplish this. And then again years until the last user able to
2255ecd0a06SJonathan Corbetcomplain about regressions disappears, too.
226