.. SPDX-License-Identifier: GPL-2.0+

=======
IOMMUFD
=======

:Author: Jason Gunthorpe
:Author: Kevin Tian

Overview
========

IOMMUFD is the user API to control the IOMMU subsystem as it relates to managing
IO page tables from userspace using file descriptors. It is intended to be
general and consumable by any driver that wants to expose DMA to userspace.
These drivers are eventually expected to deprecate any internal IOMMU logic
they may already/historically implement (e.g. vfio_iommu_type1.c).

At minimum iommufd provides universal support for managing I/O address spaces
and I/O page tables for all IOMMUs, with room in the design to add non-generic
features to cater to specific hardware functionality.

In this context the capital letter (IOMMUFD) refers to the subsystem while the
small letter (iommufd) refers to the file descriptors created via /dev/iommu for
use by userspace.

Key Concepts
============

User Visible Objects
--------------------

The following IOMMUFD objects are exposed to userspace:

- IOMMUFD_OBJ_IOAS, representing an I/O address space (IOAS), allowing map/unmap
  of user space memory into ranges of I/O Virtual Address (IOVA).

  The IOAS is a functional replacement for the VFIO container, and like the VFIO
  container it copies an IOVA map to a list of iommu_domains held within it.

- IOMMUFD_OBJ_DEVICE, representing a device that is bound to iommufd by an
  external driver.

- IOMMUFD_OBJ_HW_PAGETABLE, representing an actual hardware I/O page table
  (i.e. a single struct iommu_domain) managed by the iommu driver.

  The IOAS has a list of HW_PAGETABLES that share the same IOVA mapping and
  it will synchronize its mapping with each member HW_PAGETABLE.

All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.

The diagram below shows the relationships between user-visible objects and
kernel datastructures (external to iommufd). The numbers refer to the
operations that create the objects and links::

  _________________________________________________________
 |                         iommufd                         |
 |       [1]                                               |
 |  _________________                                      |
 | |                 |                                     |
 | |                 |                                     |
 | |                 |                                     |
 | |                 |                                     |
 | |                 |                                     |
 | |                 |                                     |
 | |                 |        [3]                 [2]      |
 | |                 |    ____________         __________  |
 | |      IOAS       |<--|            |<------|          | |
 | |                 |   |HW_PAGETABLE|       |  DEVICE  | |
 | |                 |   |____________|       |__________| |
 | |                 |         |                   |       |
 | |                 |         |                   |       |
 | |                 |         |                   |       |
 | |                 |         |                   |       |
 | |                 |         |                   |       |
 | |_________________|         |                   |       |
 |         |                   |                   |       |
 |_________|___________________|___________________|_______|
           |                   |                   |
           |              _____v______      _______v_____
           | PFN storage |            |    |             |
           |------------>|iommu_domain|    |struct device|
                         |____________|    |_____________|

1. IOMMUFD_OBJ_IOAS is created via the IOMMU_IOAS_ALLOC uAPI. An iommufd can
   hold multiple IOAS objects. IOAS is the most generic object and does not
   expose interfaces that are specific to single IOMMU drivers. All operations
   on the IOAS must operate equally on each of the iommu_domains inside of it.

2. IOMMUFD_OBJ_DEVICE is created when an external driver calls the IOMMUFD kAPI
   to bind a device to an iommufd. The driver is expected to implement a set of
   ioctls to allow userspace to initiate the binding operation. Successful
   completion of this operation establishes the desired DMA ownership over the
   device. The driver must also set the driver_managed_dma flag and must not
   touch the device until this operation succeeds.

3. IOMMUFD_OBJ_HW_PAGETABLE is created when an external driver calls the IOMMUFD
   kAPI to attach a bound device to an IOAS. Similarly the external driver uAPI
   allows userspace to initiate the attaching operation. If a compatible
   pagetable already exists then it is reused for the attachment. Otherwise a
   new pagetable object and iommu_domain is created. Successful completion of
   this operation sets up the linkages among IOAS, device and iommu_domain.
   Once this completes the device can perform DMA.

   Every iommu_domain inside the IOAS is also represented to userspace as a
   HW_PAGETABLE object.

   .. note::

      Future IOMMUFD updates will provide an API to create and manipulate the
      HW_PAGETABLE directly.

A device can bind to only one iommufd, because of the DMA ownership claim, and
can attach to at most one IOAS object (PASID is not supported yet).

Kernel Datastructure
--------------------

User visible objects are backed by the following datastructures:

- iommufd_ioas for IOMMUFD_OBJ_IOAS.
- iommufd_device for IOMMUFD_OBJ_DEVICE.
- iommufd_hw_pagetable for IOMMUFD_OBJ_HW_PAGETABLE.

Some terminology used when looking at these datastructures:

- Automatic domain - refers to an iommu domain created automatically when
  attaching a device to an IOAS object. This is compatible with the semantics
  of VFIO type1.

- Manual domain - refers to an iommu domain designated by the user as the
  target pagetable to be attached to by a device. Though currently there are
  no uAPIs to directly create such a domain, the datastructure and algorithms
  are ready for handling that use case.

- In-kernel user - refers to something like a VFIO mdev that is using the
  IOMMUFD access interface to access the IOAS. This starts by creating an
  iommufd_access object that is similar to the domain binding a physical device
  would do. The access object will then allow converting IOVA ranges into struct
  page * lists, or doing direct read/write to an IOVA.

iommufd_ioas serves as the metadata datastructure to manage how IOVA ranges are
mapped to memory pages, composed of:

- struct io_pagetable holding the IOVA map
- struct iopt_area's representing populated portions of IOVA
- struct iopt_pages representing the storage of PFNs
- struct iommu_domain representing the IO page table in the IOMMU
- struct iopt_pages_access representing in-kernel users of PFNs
- struct xarray pinned_pfns holding a list of pages pinned by in-kernel users

Each iopt_pages represents a logical linear array of full PFNs. The PFNs are
ultimately derived from userspace VAs via an mm_struct. Once they have been
pinned the PFNs are stored in IOPTEs of an iommu_domain or inside the pinned_pfns
xarray if they have been pinned through an iommufd_access.

PFNs have to be copied between all combinations of storage locations, depending
on what domains are present and what kinds of in-kernel "software access" users
exist. The mechanism ensures that a page is pinned only once.
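
The pin-once rule can be illustrated with a small toy model. The sketch below
is not the kernel implementation: every name in it (toy_pages, toy_get_page,
toy_put_page, toy_demo) is hypothetical, and the "pin" is only a counter. It
assumes just what the text above states, namely that a PFN may be held by
iommu_domains and by in-kernel access users at the same time, and that the
page behind it is pinned once no matter how many holders exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model only: none of these names exist in the kernel. Each page counts
 * how many storage locations hold its PFN and stays pinned exactly while
 * that count is non-zero.
 */
#define TOY_NPAGES 8

enum toy_user { TOY_DOMAIN, TOY_ACCESS };

struct toy_pages {
	unsigned int domain_users[TOY_NPAGES]; /* PFN held in domain IOPTEs */
	unsigned int access_users[TOY_NPAGES]; /* PFN held for in-kernel users */
	unsigned int pin_calls;                /* times a page was really pinned */
	unsigned int unpin_calls;              /* times a page was really unpinned */
};

static unsigned int toy_total_users(const struct toy_pages *p, size_t idx)
{
	return p->domain_users[idx] + p->access_users[idx];
}

/* Add one holder of page idx; pin only if nobody held the PFN before. */
static void toy_get_page(struct toy_pages *p, size_t idx, enum toy_user who)
{
	if (toy_total_users(p, idx) == 0)
		p->pin_calls++; /* stand-in for the real pinning step */
	if (who == TOY_DOMAIN)
		p->domain_users[idx]++;
	else
		p->access_users[idx]++;
}

/* Drop one holder of page idx; unpin only when the last holder goes away. */
static bool toy_put_page(struct toy_pages *p, size_t idx, enum toy_user who)
{
	unsigned int *count = who == TOY_DOMAIN ? &p->domain_users[idx]
						: &p->access_users[idx];

	if (*count == 0)
		return false; /* unbalanced put, nothing to release */
	(*count)--;
	if (toy_total_users(p, idx) == 0)
		p->unpin_calls++;
	return true;
}

/* Walk one page through both kinds of holder and report the pin count. */
static unsigned int toy_demo(void)
{
	struct toy_pages p = { 0 };

	toy_get_page(&p, 0, TOY_DOMAIN); /* first holder: page is pinned */
	toy_get_page(&p, 0, TOY_ACCESS); /* second holder: reuses the pin */
	assert(p.pin_calls == 1);
	toy_put_page(&p, 0, TOY_DOMAIN); /* one holder remains: still pinned */
	assert(p.unpin_calls == 0);
	toy_put_page(&p, 0, TOY_ACCESS); /* last holder gone: unpinned once */
	assert(p.unpin_calls == 1);
	return p.pin_calls;
}
```

In this model the two kinds of holder are counted separately only to mirror the
two storage locations named above. The pin decision depends on their sum, which
is why adding an access user to a page already mapped into a domain does not
pin it again.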

An io_pagetable is composed of iopt_areas pointing at iopt_pages, along with a
list of iommu_domains that mirror the IOVA to PFN map.

Multiple io_pagetable-s, through their iopt_area-s, can share a single
iopt_pages which avoids multi-pinning and double accounting of page
consumption.

iommufd_ioas is shareable between subsystems, e.g. VFIO and VDPA, as long as
devices managed by different subsystems are bound to the same iommufd.

IOMMUFD User API
================

.. kernel-doc:: include/uapi/linux/iommufd.h

IOMMUFD Kernel API
==================

The IOMMUFD kAPI is device-centric with group-related tricks managed behind the
scenes. This allows the external drivers calling such kAPI to implement a simple
device-centric uAPI for connecting their devices to an iommufd, instead of
explicitly imposing the group semantics in their uAPI as VFIO does.

.. kernel-doc:: drivers/iommu/iommufd/device.c
   :export:

.. kernel-doc:: drivers/iommu/iommufd/main.c
   :export:

VFIO and IOMMUFD
----------------

Connecting a VFIO device to iommufd can be done in two ways.

The first is a VFIO compatible way that directly implements the /dev/vfio/vfio
container IOCTLs by mapping them into io_pagetable operations. Doing so allows
the use of iommufd in legacy VFIO applications by symlinking /dev/vfio/vfio to
/dev/iommu or extending VFIO to SET_CONTAINER using an iommufd instead of a
container fd.

The second approach directly extends VFIO to support a new set of device-centric
user APIs based on the aforementioned IOMMUFD kernel API. It requires userspace
changes but better matches the IOMMUFD API semantics and makes it easier to
support new iommufd features than the first approach does.

Currently both approaches are still work-in-progress.

There are still a few gaps to be resolved to catch up with VFIO type1, as
documented in iommufd_vfio_check_extension().

Future TODOs
============

Currently IOMMUFD supports only kernel-managed I/O page tables, similar to VFIO
type1. New features on the radar include:

 - Binding iommu_domain's to PASID/SSID
 - Userspace page tables, for ARM, x86 and S390
 - Kernel bypass'd invalidation of user page tables
 - Re-use of the KVM page table in the IOMMU
 - Dirty page tracking in the IOMMU
 - Runtime Increase/Decrease of IOPTE size
 - PRI support with faults resolved in userspace