qemu

FORK: QEMU emulator
git clone https://git.neptards.moe/neptards/qemu.git

virtiofsd.rst (13316B)


      1 QEMU virtio-fs shared file system daemon
      2 ========================================
      3 
      4 Synopsis
      5 --------
      6 
      7 **virtiofsd** [*OPTIONS*]
      8 
      9 Description
     10 -----------
     11 
     12 Share a host directory tree with a guest through a virtio-fs device.  This
     13 program is a vhost-user backend that implements the virtio-fs device.  Each
     14 virtio-fs device instance requires its own virtiofsd process.
     15 
     16 This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
     17 but should work with any virtual machine monitor (VMM) that supports
     18 vhost-user.  See the Examples section below.
     19 
     20 This program must be run as the root user.  The program drops privileges where
     21 possible during startup although it must be able to create and access files
     22 with any uid/gid:
     23 
     24 * The ability to invoke syscalls is limited using seccomp(2).
     25 * Linux capabilities(7) are dropped.
     26 
     27 In "namespace" sandbox mode the program switches into a new file system
     28 namespace and invokes pivot_root(2) to make the shared directory tree its root.
     29 New pid and net namespaces are also created to isolate the process.
     30 
     31 In "chroot" sandbox mode the program invokes chroot(2) to make the shared
     32 directory tree its root. This mode is intended for container environments where
     33 the container runtime has already set up the namespaces and the program does
     34 not have permission to create namespaces itself.
     35 
     36 Both sandbox modes prevent "file system escapes" due to symlinks and other file
     37 system objects that might lead to files outside the shared directory.
     38 
     39 Options
     40 -------
     41 
     42 .. program:: virtiofsd
     43 
     44 .. option:: -h, --help
     45 
     46   Print help.
     47 
     48 .. option:: -V, --version
     49 
     50   Print version.
     51 
     52 .. option:: -d
     53 
     54   Enable debug output.
     55 
     56 .. option:: --syslog
     57 
     58   Print log messages to syslog instead of stderr.
     59 
     60 .. option:: -o OPTION
     61 
     62   * debug -
     63     Enable debug output.
     64 
     65   * flock|no_flock -
     66     Enable/disable flock.  The default is ``no_flock``.
     67 
     68   * modcaps=CAPLIST -
     69     Modify the list of capabilities allowed; CAPLIST is a colon-separated
     70     list of capabilities, each preceded by either + or -, e.g.
     71     ``+sys_admin:-chown``.
     72 
     73   * log_level=LEVEL -
     74     Print only log messages matching LEVEL or more severe.  LEVEL is one of
     75     ``err``, ``warn``, ``info``, or ``debug``.  The default is ``info``.
     76 
     77   * posix_lock|no_posix_lock -
     78     Enable/disable remote POSIX locks.  The default is ``no_posix_lock``.
     79 
     80   * readdirplus|no_readdirplus -
     81     Enable/disable readdirplus.  The default is ``readdirplus``.
     82 
     83   * sandbox=namespace|chroot -
     84     Sandbox mode:
     85     - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
     86     the shared directory.
     87     - chroot: chroot(2) into shared directory (use in containers).
     88     The default is "namespace".
     89 
     90   * source=PATH -
     91     Share host directory tree located at PATH.  This option is required.
     92 
     93   * timeout=TIMEOUT -
     94     I/O timeout in seconds.  The default depends on the ``--cache`` option.
     95 
     96   * writeback|no_writeback -
     97     Enable/disable writeback cache. The cache allows the FUSE client to buffer
     98     and merge write requests.  The default is ``no_writeback``.
     99 
    100   * xattr|no_xattr -
    101     Enable/disable extended attributes (xattr) on files and directories.  The
    102     default is ``no_xattr``.
    103 
    104   * posix_acl|no_posix_acl -
    105     Enable/disable POSIX ACL support.  POSIX ACLs are disabled by default.
    106 
    107   * security_label|no_security_label -
    108     Enable/disable security label support.  Security labels are disabled by
    109     default.  This allows the client to send a MAC label for a file during
    110     file creation.  Typically this is expected to be an SELinux security
    111     label.  The server will try to set that label on the newly created file
    112     atomically wherever possible.
    113 
    114   * killpriv_v2|no_killpriv_v2 -
    115     Enable/disable ``FUSE_HANDLE_KILLPRIV_V2`` support.  KILLPRIV_V2 is enabled
    116     by default as long as the client supports it.  Enabling this option
    117     improves performance in the write path.
    118 
    119 .. option:: --socket-path=PATH
    120 
    121   Listen on vhost-user UNIX domain socket at PATH.
    122 
    123 .. option:: --socket-group=GROUP
    124 
    125   Set the vhost-user UNIX domain socket gid to GROUP.
    126 
    127 .. option:: --fd=FDNUM
    128 
    129   Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
    130   The file descriptor must already be listening for connections.
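
The requirement that the descriptor is already listening can be illustrated
with a short Python sketch. It is only an illustration, not part of virtiofsd:
the helper name ``listening_socket_argv`` and the paths are hypothetical, and
only the ``--fd`` and ``-o source=`` options documented here are used.

```python
import os
import socket
import tempfile


def listening_socket_argv(sock_path, source_dir):
    """Create a listening UNIX socket and build a virtiofsd argv using --fd.

    Hypothetical helper for illustration; it does not start the daemon.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(sock_path)
    # --fd expects a descriptor that is already listening for connections.
    sock.listen(1)
    argv = ["virtiofsd", f"--fd={sock.fileno()}", "-o", f"source={source_dir}"]
    return sock, argv


if __name__ == "__main__":
    tmp_dir = tempfile.mkdtemp()
    sock, argv = listening_socket_argv(
        os.path.join(tmp_dir, "vhost-fs.sock"), "/var/lib/fs/vm001"
    )
    print(argv)
    # Launching is left commented out: it needs root and an installed virtiofsd.
    # pass_fds keeps the descriptor open under the same number in the child.
    # import subprocess
    # subprocess.Popen(argv, pass_fds=[sock.fileno()])
    sock.close()
```

Creating the socket in a supervising process lets that process control the
socket's ownership and permissions before the daemon starts.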
    131 
    132 .. option:: --thread-pool-size=NUM
    133 
    134   Restrict the number of worker threads per request queue to NUM.  The default
    135   is 0.
    136 
    137 .. option:: --cache=none|auto|always
    138 
    139   Select the desired trade-off between coherency and performance.  ``none``
    140   forbids the FUSE client from caching to achieve best coherency at the cost of
    141   performance.  ``auto`` behaves similarly to NFS with a 1 second metadata cache
    142   timeout.  ``always`` sets a long cache lifetime at the expense of coherency.
    143   The default is ``auto``.
    144 
    145 Extended attribute (xattr) mapping
    146 ----------------------------------
    147 
    148 By default the names of xattrs used by the client are passed through to the server
    149 file system.  This can be a problem where either those xattr names are used
    150 by something on the server (e.g. SELinux client/server confusion) or if
    151 ``virtiofsd`` is running in a container with restricted privileges where it
    152 cannot access some attributes.
    153 
    154 Mapping syntax
    155 ~~~~~~~~~~~~~~
    156 
    157 A mapping of xattr names can be made using ``-o xattrmap=mapping`` where the ``mapping``
    158 string consists of a series of rules.
    159 
    160 The first matching rule terminates the mapping.
    161 The set of rules must include a terminating rule to match any remaining attributes
    162 at the end.
    163 
    164 Each rule consists of a number of fields separated with a separator that is the
    165 first non-white space character in the rule.  This separator must then be used
    166 for the whole rule.
    167 White space may be added before and after each rule.
    168 
    169 Using ':' as the separator a rule is of the form:
    170 
    171 ``:type:scope:key:prepend:``
    172 
    173 **scope** is:
    174 
    175 - 'client' - match 'key' against an xattr name from the client for
    176              setxattr/getxattr/removexattr
    177 - 'server' - match 'prepend' against an xattr name from the server
    178              for listxattr
    179 - 'all' - can be used to make a single rule where both the server
    180           and client matches are triggered.
    181 
    182 **type** is one of:
    183 
    184 - 'prefix' - is designed to prepend and strip a prefix;  the modified
    185   attributes then being passed on to the client/server.
    186 
    187 - 'ok' - Causes the rule set to be terminated when a match is found
    188   while allowing matching xattrs through unchanged.
    189   It is intended both as a way of explicitly terminating
    190   the list of rules, and to allow some xattrs to skip following rules.
    191 
    192 - 'bad' - If a client tries to use a name matching 'key' it is
    193   denied using EPERM; when the server passes an attribute
    194   name matching 'prepend' it is hidden.  In many ways its use is very like
    195   'ok' as either an explicit terminator or for special handling of certain
    196   patterns.
    197 
    198 - 'unsupported' - If a client tries to use a name matching 'key' it is
    199   denied using ENOTSUP; when the server passes an attribute
    200   name matching 'prepend' it is hidden.  In many ways its use is very like
    201   'ok' as either an explicit terminator or for special handling of certain
    202   patterns.
    203 
    204 **key** is a string tested as a prefix on an attribute name originating
    205 on the client.  It may be empty in which case a 'client' rule
    206 will always match on client names.
    207 
    208 **prepend** is a string tested as a prefix on an attribute name originating
    209 on the server, and used as a new prefix.  It may be empty
    210 in which case a 'server' rule will always match on all names from
    211 the server.
    212 
    213 e.g.:
    214 
    215   ``:prefix:client:trusted.:user.virtiofs.:``
    216 
    217   will match 'trusted.' attributes in client calls and prefix them before
    218   passing them to the server.
    219 
    220   ``:prefix:server::user.virtiofs.:``
    221 
    222   will strip 'user.virtiofs.' from all server replies.
    223 
    224   ``:prefix:all:trusted.:user.virtiofs.:``
    225 
    226   combines the previous two cases into a single rule.
    227 
    228   ``:ok:client:user.::``
    229 
    230   will allow get/set xattr for 'user.' xattrs and ignore
    231   following rules.
    232 
    233   ``:ok:server::security.:``
    234 
    235   will pass 'security.' xattrs in listxattr from the server
    236   and ignore following rules.
    237 
    238   ``:ok:all:::``
    239 
    240   will terminate the rule search passing any remaining attributes
    241   in both directions.
    242 
    243   ``:bad:server::security.:``
    244 
    245   would hide 'security.' xattrs in listxattr from the server.
    246 
    247 A simpler 'map' type provides a shorter syntax for the common case:
    248 
    249 ``:map:key:prepend:``
    250 
    251 The 'map' type adds a number of separate rules to add **prepend** as a prefix
    252 to the matched **key** (or all attributes if **key** is empty).
    253 There may be at most one 'map' rule and it must be the last rule in the set.
    254 
    255 Note: When the 'security.capability' xattr is remapped, the daemon has to do
    256 extra work to remove it during many operations, which the host kernel normally
    257 does itself.
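
The rule semantics described above can be modelled with a short Python sketch.
It is a simplified illustration written from this description only, not the
daemon's implementation: the function names are hypothetical, the 'map' type is
not handled, and the real daemon may apply additional checks.

```python
import errno


def parse_rules(mapping):
    """Split a mapping string into (type, scope, key, prepend) tuples."""
    rules = []
    pos = 0
    while True:
        # White space may appear before and after each rule.
        while pos < len(mapping) and mapping[pos].isspace():
            pos += 1
        if pos >= len(mapping):
            return rules
        sep = mapping[pos]  # first non-white-space character of the rule
        pos += 1
        fields = []
        for _ in range(4):
            end = mapping.index(sep, pos)  # ValueError if the rule is malformed
            fields.append(mapping[pos:end])
            pos = end + 1
        rules.append(tuple(fields))


def map_client_name(rules, name):
    """Translate a name used by the client; the first matching rule wins."""
    for rtype, scope, key, prepend in rules:
        if scope not in ("client", "all") or not name.startswith(key):
            continue
        if rtype == "prefix":
            return prepend + name
        if rtype == "ok":
            return name
        if rtype == "bad":
            raise OSError(errno.EPERM, "xattr name denied", name)
        if rtype == "unsupported":
            raise OSError(errno.ENOTSUP, "xattr name unsupported", name)
    raise ValueError("rule set has no terminating rule for " + name)


def map_server_name(rules, name):
    """Translate a name listed by the server; None means the name is hidden."""
    for rtype, scope, key, prepend in rules:
        if scope not in ("server", "all") or not name.startswith(prepend):
            continue
        if rtype == "prefix":
            return name[len(prepend):]  # strip the prefix
        if rtype == "ok":
            return name
        return None  # 'bad' and 'unsupported' hide the attribute
    raise ValueError("rule set has no terminating rule for " + name)


if __name__ == "__main__":
    rules = parse_rules(":prefix:all::user.virtiofs.::bad:all:::")
    print(map_client_name(rules, "trusted.foo"))
    print(map_server_name(rules, "user.virtiofs.trusted.foo"))
    print(map_server_name(rules, "security.selinux"))
```

With that two-rule mapping, every client name gains the 'user.virtiofs.' prefix
on its way to the server, and any server attribute without the prefix is hidden
from the client.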
    258 
    259 Security considerations
    260 ~~~~~~~~~~~~~~~~~~~~~~~
    261 
    262 Operating systems typically partition the xattr namespace using
    263 well defined name prefixes. Each partition may have different
    264 access controls applied. For example, on Linux there are multiple
    265 partitions:
    266 
    267  * ``system.*`` - access varies depending on attribute & filesystem
    268  * ``security.*`` - only processes with CAP_SYS_ADMIN
    269  * ``trusted.*`` - only processes with CAP_SYS_ADMIN
    270  * ``user.*`` - any process granted by file permissions / ownership
    271 
    272 Other operating systems, such as FreeBSD, have different name
    273 prefixes and access control rules.
    274 
    275 When remapping attributes on the host, it is important to
    276 ensure that the remapping does not allow a guest user to
    277 evade the guest access control rules.
    278 
    279 Consider what happens if ``trusted.*`` from the guest is remapped to
    280 ``user.virtiofs.trusted.*`` in the host. An unprivileged
    281 user in a Linux guest has the ability to write to xattrs
    282 under ``user.*``. Thus the user can evade the access
    283 control restriction on ``trusted.*`` by instead writing
    284 to ``user.virtiofs.trusted.*``.
    285 
    286 As noted above, the partitions used and the access controls
    287 applied vary across guest operating systems, so it is not wise to
    288 try to predict what the guest OS will use.
    289 
    290 The simplest way to avoid an insecure configuration is
    291 to remap all xattrs at once, to a given fixed prefix.
    292 This is shown in example (1) below.
    293 
    294 If selectively mapping only a subset of xattr prefixes,
    295 then rules must be added to explicitly block direct
    296 access to the target of the remapping. This is shown
    297 in example (2) below.
    298 
    299 Mapping examples
    300 ~~~~~~~~~~~~~~~~
    301 
    302 1) Prefix all attributes with 'user.virtiofs.'
    303 
    304 ::
    305 
    306  -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"
    307 
    308 
    309 This uses two rules, using : as the field separator;
    310 the first rule prefixes and strips 'user.virtiofs.',
    311 the second rule hides any non-prefixed attributes that
    312 the host set.
    313 
    314 This is equivalent to the 'map' rule:
    315 
    316 ::
    317 
    318  -o xattrmap=":map::user.virtiofs.:"
    319 
    320 2) Prefix 'trusted.' attributes, allow others through
    321 
    322 ::
    323 
    324    "/prefix/all/trusted./user.virtiofs./
    325     /bad/server//trusted./
    326     /bad/client/user.virtiofs.//
    327     /ok/all///"
    328 
    329 
    330 Here there are four rules, using / as the field
    331 separator, and also demonstrating that new lines can
    332 be included between rules.
    333 The first rule is the prefixing of 'trusted.' and
    334 stripping of 'user.virtiofs.'.
    335 The second rule hides unprefixed 'trusted.' attributes
    336 on the host.
    337 The third rule stops a guest from explicitly setting
    338 the 'user.virtiofs.' path directly to prevent access
    339 control bypass on the target of the earlier prefix
    340 remapping.
    341 Finally, the fourth rule lets all remaining attributes
    342 through.
    343 
    344 This is equivalent to the 'map' rule:
    345 
    346 ::
    347 
    348  -o xattrmap="/map/trusted./user.virtiofs./"
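
The two equivalences stated in examples (1) and (2) can be written down as a
small Python sketch. The function ``expand_map`` is hypothetical and only
reproduces those two documented expansions; how the daemon expands a 'map' rule
internally is not described here and may differ.

```python
def expand_map(key, prepend, sep=":"):
    """Return the explicit rules that examples (1) and (2) equate to a 'map' rule.

    Hypothetical helper for illustration; sep is the field separator to use.
    """
    if sep in key or sep in prepend:
        raise ValueError("separator must not occur inside key or prepend")

    def rule(*fields):
        # A rule is its fields joined by the separator, with one separator
        # at the start and one at the end.
        return sep + sep.join(fields) + sep

    if key == "":
        # Example (1): prefix everything, hide anything left unprefixed.
        rules = [
            rule("prefix", "all", "", prepend),
            rule("bad", "all", "", ""),
        ]
    else:
        # Example (2): prefix only 'key', hide unprefixed 'key' names on the
        # host, block direct client access to the target, pass the rest.
        rules = [
            rule("prefix", "all", key, prepend),
            rule("bad", "server", "", key),
            rule("bad", "client", prepend, ""),
            rule("ok", "all", "", ""),
        ]
    return "".join(rules)


if __name__ == "__main__":
    print(expand_map("", "user.virtiofs."))
    print(expand_map("trusted.", "user.virtiofs.", sep="/"))
```

Spelling the rules out this way makes it visible that the selective form needs
the extra 'bad' rules to protect the remapping target, which the shorter 'map'
syntax is stated to cover.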
    349 
    350 3) Hide 'security.' attributes, and allow everything else
    351 
    352 ::
    353 
    354     "/bad/all/security./security./
    355      /ok/all///"
    356 
    357 The first rule combines what could be separate client and server
    358 rules into a single 'all' rule, matching 'security.' in either
    359 client arguments or lists returned from the host.  This stops
    360 the client seeing any 'security.' attributes on the server and
    361 stops it setting any.
    362 
    363 SELinux support
    364 ---------------
    365 SELinux support can be enabled by running virtiofsd with the option
    366 "-o security_label". However, this will try to save the guest's security context
    367 in the xattr security.selinux on the host, and it might fail if the host's SELinux
    368 policy does not permit virtiofsd to do this operation.
    369 
    370 Hence, it is preferred to remap the guest's "security.selinux" xattr to, say,
    371 "trusted.virtiofs.security.selinux" on the host.
    372 
    373 "-o xattrmap=:map:security.selinux:trusted.virtiofs.:"
    374 
    375 This ensures that the guest's and host's SELinux xattrs on the same file
    376 remain separate and do not interfere with each other, and it allows both
    377 host and guest to implement their own separate SELinux policies.
    378 
    379 Setting a trusted xattr on the host requires CAP_SYS_ADMIN, so this
    380 capability needs to be added to the daemon.
    381 
    382 "-o modcaps=+sys_admin"
    383 
    384 Giving CAP_SYS_ADMIN increases the risk to the system. virtiofsd becomes more
    385 powerful and, if it gets compromised, it can do a lot of damage to the host system.
    386 Keep this trade-off in mind while making a decision.
    387 
    388 Examples
    389 --------
    390 
    391 Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
    392 ``/var/run/vm001-vhost-fs.sock``:
    393 
    394 .. parsed-literal::
    395 
    396   host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
    397   host# |qemu_system| \\
    398         -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\
    399         -device vhost-user-fs-pci,chardev=char0,tag=myfs \\
    400         -object memory-backend-memfd,id=mem,size=4G,share=on \\
    401         -numa node,memdev=mem \\
    402         ...
    403   guest# mount -t virtiofs myfs /mnt