QEMU virtio-fs shared file system daemon
========================================

Synopsis
--------

**virtiofsd** [*OPTIONS*]

Description
-----------

Share a host directory tree with a guest through a virtio-fs device. This
program is a vhost-user backend that implements the virtio-fs device. Each
virtio-fs device instance requires its own virtiofsd process.

This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
but should work with any virtual machine monitor (VMM) that supports
vhost-user. See the Examples section below.

This program must be run as the root user. The program drops privileges where
possible during startup although it must be able to create and access files
with any uid/gid:

* The ability to invoke syscalls is limited using seccomp(2).
* Linux capabilities(7) are dropped.

In "namespace" sandbox mode the program switches into a new file system
namespace and invokes pivot_root(2) to make the shared directory tree its root.
A new pid and net namespace is also created to isolate the process.

In "chroot" sandbox mode the program invokes chroot(2) to make the shared
directory tree its root. This mode is intended for container environments where
the container runtime has already set up the namespaces and the program does
not have permission to create namespaces itself.

Both sandbox modes prevent "file system escapes" due to symlinks and other file
system objects that might lead to files outside the shared directory.

Options
-------

.. program:: virtiofsd

.. option:: -h, --help

  Print help.

.. option:: -V, --version

  Print version.

.. option:: -d

  Enable debug output.

.. option:: --syslog

  Print log messages to syslog instead of stderr.

.. option:: -o OPTION

  * debug -
    Enable debug output.

  * flock|no_flock -
    Enable/disable flock.
    The default is ``no_flock``.

  * modcaps=CAPLIST -
    Modify the list of capabilities allowed; CAPLIST is a colon-separated
    list of capabilities, each preceded by either + or -, e.g.
    ``+sys_admin:-chown``.

  * log_level=LEVEL -
    Print only log messages matching LEVEL or more severe. LEVEL is one of
    ``err``, ``warn``, ``info``, or ``debug``. The default is ``info``.

  * posix_lock|no_posix_lock -
    Enable/disable remote POSIX locks. The default is ``no_posix_lock``.

  * readdirplus|no_readdirplus -
    Enable/disable readdirplus. The default is ``readdirplus``.

  * sandbox=namespace|chroot -
    Sandbox mode:

    - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
      the shared directory.
    - chroot: chroot(2) into the shared directory (use in containers).

    The default is "namespace".

  * source=PATH -
    Share host directory tree located at PATH. This option is required.

  * timeout=TIMEOUT -
    I/O timeout in seconds. The default depends on the cache= option.

  * writeback|no_writeback -
    Enable/disable writeback cache. The cache allows the FUSE client to buffer
    and merge write requests. The default is ``no_writeback``.

  * xattr|no_xattr -
    Enable/disable extended attributes (xattr) on files and directories. The
    default is ``no_xattr``.

  * posix_acl|no_posix_acl -
    Enable/disable POSIX ACL support. POSIX ACLs are disabled by default.

  * security_label|no_security_label -
    Enable/disable security label support. Security labels are disabled by
    default. This allows the client to send a MAC label for a file during
    file creation. Typically this is expected to be an SELinux security
    label. The server will try to set that label on the newly created file
    atomically wherever possible.

  * killpriv_v2|no_killpriv_v2 -
    Enable/disable ``FUSE_HANDLE_KILLPRIV_V2`` support. KILLPRIV_V2 is enabled
    by default as long as the client supports it.
    Enabling this option helps
    with performance in the write path.

.. option:: --socket-path=PATH

  Listen on vhost-user UNIX domain socket at PATH.

.. option:: --socket-group=GROUP

  Set the vhost-user UNIX domain socket gid to GROUP.

.. option:: --fd=FDNUM

  Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
  The file descriptor must already be listening for connections.

.. option:: --thread-pool-size=NUM

  Restrict the number of worker threads per request queue to NUM. The default
  is 0.

.. option:: --cache=none|auto|always

  Select the desired trade-off between coherency and performance. ``none``
  forbids the FUSE client from caching to achieve best coherency at the cost of
  performance. ``auto`` acts similar to NFS with a 1 second metadata cache
  timeout. ``always`` sets a long cache lifetime at the expense of coherency.
  The default is ``auto``.

Extended attribute (xattr) mapping
----------------------------------

By default the names of xattrs used by the client are passed through to the
server file system. This can be a problem where either those xattr names are
used by something on the server (e.g. selinux client/server confusion) or if
``virtiofsd`` is running in a container with restricted privileges where it
cannot access some attributes.

Mapping syntax
~~~~~~~~~~~~~~

A mapping of xattr names can be made using ``-o xattrmap=mapping`` where the
``mapping`` string consists of a series of rules.

The first matching rule terminates the mapping.
The set of rules must include a terminating rule to match any remaining attributes
at the end.

Each rule consists of a number of fields separated with a separator that is the
first non-white space character in the rule. This separator must then be used
for the whole rule.
White space may be added before and after each rule.

Using ':' as the separator, a rule is of the form:

``:type:scope:key:prepend:``

**scope** is:

- 'client' - match 'key' against a xattr name from the client for
  setxattr/getxattr/removexattr
- 'server' - match 'prepend' against a xattr name from the server
  for listxattr
- 'all' - can be used to make a single rule where both the server
  and client matches are triggered.

**type** is one of:

- 'prefix' - is designed to prepend and strip a prefix; the modified
  attributes then being passed on to the client/server.

- 'ok' - Causes the rule set to be terminated when a match is found
  while allowing matching xattrs through unchanged.
  It is intended both as a way of explicitly terminating
  the list of rules, and to allow some xattrs to skip following rules.

- 'bad' - If a client tries to use a name matching 'key' it's
  denied using EPERM; when the server passes an attribute
  name matching 'prepend' it's hidden. In many ways its use is very like
  'ok' as either an explicit terminator or for special handling of certain
  patterns.

- 'unsupported' - If a client tries to use a name matching 'key' it's
  denied using ENOTSUP; when the server passes an attribute
  name matching 'prepend' it's hidden. In many ways its use is very like
  'ok' as either an explicit terminator or for special handling of certain
  patterns.

**key** is a string tested as a prefix on an attribute name originating
on the client. It may be empty in which case a 'client' rule
will always match on client names.

**prepend** is a string tested as a prefix on an attribute name originating
on the server, and used as a new prefix. It may be empty
in which case a 'server' rule will always match on all names from
the server.
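
The rule format and first-match behaviour described above can be illustrated
with a short Python sketch. It is not virtiofsd's implementation: the function
names ``parse_rules`` and ``map_client_name`` are hypothetical, only the client
direction (names passed to setxattr/getxattr/removexattr) is modelled, and it
assumes that an 'all' rule is tested against client names in the same way as a
'client' rule.

.. code-block:: python

   import errno


   def parse_rules(mapping):
       """Split a mapping string into (type, scope, key, prepend) tuples."""
       rules = []
       rest = mapping.lstrip()
       while rest:
           # The first non-white-space character of a rule is its separator.
           sep = rest[0]
           fields = rest[1:].split(sep, 4)
           if len(fields) < 5:
               raise ValueError("incomplete rule: %r" % rest)
           rule_type, scope, key, prepend, rest = fields
           rules.append((rule_type, scope, key, prepend))
           # White space (including new lines) may separate rules.
           rest = rest.lstrip()
       return rules


   def map_client_name(rules, name):
       """Return the xattr name to use on the server, or raise OSError."""
       for rule_type, scope, key, prepend in rules:
           # Client names are tested against 'key'; an empty key matches all.
           if scope not in ("client", "all") or not name.startswith(key):
               continue
           # The first matching rule terminates the search.
           if rule_type == "prefix":
               return prepend + name
           if rule_type == "ok":
               return name
           if rule_type == "bad":
               raise OSError(errno.EPERM, "xattr name denied")
           if rule_type == "unsupported":
               raise OSError(errno.ENOTSUP, "xattr name not supported")
           raise ValueError("unknown rule type: %r" % rule_type)
       raise ValueError("rule set has no terminating rule for %r" % name)

For instance, ``map_client_name(parse_rules(":prefix:client:trusted.:user.virtiofs.::ok:all:::"), "trusted.foo")``
returns ``user.virtiofs.trusted.foo``, while any other name falls through to
the terminating 'ok' rule and is returned unchanged.
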

e.g.:

  ``:prefix:client:trusted.:user.virtiofs.:``

  will match 'trusted.' attributes in client calls and prefix them before
  passing them to the server.

  ``:prefix:server::user.virtiofs.:``

  will strip 'user.virtiofs.' from all server replies.

  ``:prefix:all:trusted.:user.virtiofs.:``

  combines the previous two cases into a single rule.

  ``:ok:client:user.::``

  will allow get/set xattr for 'user.' xattrs and ignore
  following rules.

  ``:ok:server::security.:``

  will pass 'security.' xattrs in listxattr from the server
  and ignore following rules.

  ``:ok:all:::``

  will terminate the rule search passing any remaining attributes
  in both directions.

  ``:bad:server::security.:``

  would hide 'security.' xattrs in listxattr from the server.

A simpler 'map' type provides a shorter syntax for the common case:

``:map:key:prepend:``

The 'map' type adds a number of separate rules to add **prepend** as a prefix
to the matched **key** (or all attributes if **key** is empty).
There may be at most one 'map' rule and it must be the last rule in the set.

Note: When the 'security.capability' xattr is remapped, the daemon has to do
extra work to remove it during many operations, which the host kernel normally
does itself.

Security considerations
~~~~~~~~~~~~~~~~~~~~~~~

Operating systems typically partition the xattr namespace using
well defined name prefixes. Each partition may have different
access controls applied.
For example, on Linux there are multiple
partitions:

* ``system.*`` - access varies depending on attribute & filesystem
* ``security.*`` - only processes with CAP_SYS_ADMIN
* ``trusted.*`` - only processes with CAP_SYS_ADMIN
* ``user.*`` - any process granted by file permissions / ownership

Other operating systems, such as FreeBSD, have different name prefixes
and access control rules.

When remapping attributes on the host, it is important to
ensure that the remapping does not allow a guest user to
evade the guest access control rules.

Consider if ``trusted.*`` from the guest was remapped to
``user.virtiofs.trusted.*`` in the host. An unprivileged
user in a Linux guest has the ability to write to xattrs
under ``user.*``. Thus the user can evade the access
control restriction on ``trusted.*`` by instead writing
to ``user.virtiofs.trusted.*``.

As noted above, the partitions used and access controls
applied will vary across guest operating systems, so it is
not wise to try to predict what the guest OS will use.

The simplest way to avoid an insecure configuration is
to remap all xattrs at once, to a given fixed prefix.
This is shown in example (1) below.

If selectively mapping only a subset of xattr prefixes,
then rules must be added to explicitly block direct
access to the target of the remapping. This is shown
in example (2) below.

Mapping examples
~~~~~~~~~~~~~~~~

1) Prefix all attributes with 'user.virtiofs.'

::

  -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"


This uses two rules, using : as the field separator;
the first rule prefixes and strips 'user.virtiofs.',
the second rule hides any non-prefixed attributes that
the host set.

This is equivalent to the 'map' rule:

::

  -o xattrmap=":map::user.virtiofs.:"

2) Prefix 'trusted.'
   attributes, allow others through

::

  "/prefix/all/trusted./user.virtiofs./
  /bad/server//trusted./
  /bad/client/user.virtiofs.//
  /ok/all///"


Here there are four rules, using / as the field
separator, and also demonstrating that new lines can
be included between rules.
The first rule is the prefixing of 'trusted.' and
stripping of 'user.virtiofs.'.
The second rule hides unprefixed 'trusted.' attributes
on the host.
The third rule stops a guest from explicitly setting
the 'user.virtiofs.' path directly to prevent access
control bypass on the target of the earlier prefix
remapping.
Finally, the fourth rule lets all remaining attributes
through.

This is equivalent to the 'map' rule:

::

  -o xattrmap="/map/trusted./user.virtiofs./"

3) Hide 'security.' attributes, and allow everything else

::

  "/bad/all/security./security./
   /ok/all///"

The first rule combines what could be separate client and server
rules into a single 'all' rule, matching 'security.' in either
client arguments or lists returned from the host. This stops
the client seeing any 'security.' attributes on the server and
stops it setting any.

SELinux support
---------------

SELinux support can be enabled by running virtiofsd with the option
``-o security_label``. However, this will try to save the guest's security
context in the xattr ``security.selinux`` on the host, which might fail if
the host's SELinux policy does not permit virtiofsd to do this operation.

Hence, it is preferable to remap the guest's ``security.selinux`` xattr to,
say, ``trusted.virtiofs.security.selinux`` on the host:

``-o xattrmap=:map:security.selinux:trusted.virtiofs.:``

This makes sure that the guest's and host's SELinux xattrs on the same file
remain separate and do not interfere with each other.
It also allows both
host and guest to implement their own separate SELinux policies.

Setting a trusted xattr on the host requires CAP_SYS_ADMIN, so this
capability needs to be added to the daemon:

``-o modcaps=+sys_admin``

Giving CAP_SYS_ADMIN increases the risk to the system. virtiofsd becomes more
powerful and, if it gets compromised, it can do a lot of damage to the host
system. Keep this trade-off in mind when making a decision.

Examples
--------

Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
``/var/run/vm001-vhost-fs.sock``:

.. parsed-literal::

  host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
  host# |qemu_system| \\
        -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\
        -device vhost-user-fs-pci,chardev=char0,tag=myfs \\
        -object memory-backend-memfd,id=mem,size=4G,share=on \\
        -numa node,memdev=mem \\
        ...
  guest# mount -t virtiofs myfs /mnt
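
Two values in the commands above have to agree: the socket path given to
``--socket-path`` and to ``-chardev``, and the tag given to ``-device`` and to
``mount``. The following Python sketch, offered only as an illustration, builds
the three argument lists from one set of values so they cannot drift apart.
The function name ``build_commands``, the ``qemu-system-x86_64`` binary name
and the ``mount_point`` parameter are assumptions; the options elided with
``...`` in the example are not filled in, so the QEMU list is not a complete
command line.

.. code-block:: python

   def build_commands(socket_path, source, tag, mem_size="4G",
                      mount_point="/mnt", qemu="qemu-system-x86_64"):
       """Return (virtiofsd, qemu, mount) argument lists for one share."""
       # Host: the daemon listens on the vhost-user socket and exports source.
       virtiofsd_cmd = [
           "virtiofsd",
           "--socket-path=" + socket_path,
           "-o", "source=" + source,
       ]
       # Host: QEMU connects to the same socket and exposes the device as tag.
       qemu_cmd = [
           qemu,
           "-chardev", "socket,id=char0,path=" + socket_path,
           "-device", "vhost-user-fs-pci,chardev=char0,tag=" + tag,
           "-object", "memory-backend-memfd,id=mem,size=%s,share=on" % mem_size,
           "-numa", "node,memdev=mem",
       ]
       # Guest: the file system is mounted by the same tag.
       mount_cmd = ["mount", "-t", "virtiofs", tag, mount_point]
       return virtiofsd_cmd, qemu_cmd, mount_cmd


   if __name__ == "__main__":
       for cmd in build_commands("/var/run/vm001-vhost-fs.sock",
                                 "/var/lib/fs/vm001", "myfs"):
           print(" ".join(cmd))
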