[v15,00/20] Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Message ID 20240501085210.2213060-1-michael.roth@amd.com

Message

Michael Roth May 1, 2024, 8:51 a.m. UTC
This patchset is also available at:

  https://github.com/amdese/linux/commits/snp-host-v15

and is based on top of the series:
  
  "Add SEV-ES hypervisor support for GHCB protocol version 2"
  https://lore.kernel.org/kvm/20240501071048.2208265-1-michael.roth@amd.com/
  https://github.com/amdese/linux/commits/sev-init2-ghcb-v1

which in turn is based on commit 20cc50a0410f (just before v14 SNP patches):

  https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue


Patch Layout
------------

01-02: These patches revert+replace the existing .gmem_validate_fault
       hook with a similar .private_max_mapping_level hook, as suggested
       by Sean[1]

03-04: These patches add some basic infrastructure and introduce a new
       KVM_X86_SNP_VM vm_type to handle differences versus the existing
       KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types.

05-07: These implement the KVM API to handle the creation of a
       cryptographic launch context, encrypt/measure the initial image
       into guest memory, and finalize it before launching it.

08-12: These implement handling for various guest-generated events such
       as page state changes, onlining of additional vCPUs, etc.

13-16: These implement the gmem/mmu hooks needed to prepare gmem-allocated
       pages before mapping them into guest private memory ranges, as
       well as to clean them up prior to returning them to the host for
       use as normal memory. Because this supplants certain activities
       like issuing WBINVDs during KVM MMU invalidations, there's also
       a patch that avoids duplicating that work and the unnecessary
       overhead.

17:    With all the core support in place, this patch adds a kvm_amd module
       parameter to enable SNP support.

18-20: These patches all deal with the servicing of guest requests to handle
       things like attestation, as well as some related host-management
       interfaces.

[1] https://lore.kernel.org/kvm/ZimnngU7hn7sKoSc@google.com/#t


Testing
-------

For testing this via QEMU, use the following tree:

  https://github.com/amdese/qemu/commits/snp-v4-wip3c

A patched OVMF is also needed due to upstream KVM no longer supporting MMIO
ranges that are mapped as private. It is recommended you build the AmdSevX64
variant as it provides the kernel-hashing support present in this series:

  https://github.com/amdese/ovmf/commits/apic-mmio-fix1d

A basic command-line invocation for SNP would be:

 qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
  -machine q35,confidential-guest-support=sev0,memory-backend=ram1
  -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=
  -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd

With kernel-hashing and certificate data supplied:

 qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
  -machine q35,confidential-guest-support=sev0,memory-backend=ram1
  -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=,certs-path=/home/mroth/cert.blob,kernel-hashes=on
  -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d-AmdSevX64.fd
  -kernel /boot/vmlinuz-$ver
  -initrd /boot/initrd.img-$ver
  -append "root=UUID=d72a6d1c-06cf-4b79-af43-f1bac4f620f9 ro console=ttyS0,115200n8"

With standard X64 OVMF package with separate image for persistent NVRAM:

 qemu-system-x86_64 -smp 32,maxcpus=255 -cpu EPYC-Milan-v2
  -machine q35,confidential-guest-support=sev0,memory-backend=ram1
  -object memory-backend-memfd,id=ram1,size=4G,share=true,reserve=false
  -object sev-snp-guest,id=sev0,cbitpos=51,reduced-phys-bits=1,id-auth=
  -bios OVMF_CODE-upstream-20240410-apic-mmio-fix1d.fd 
  -drive if=pflash,format=raw,unit=0,file=OVMF_VARS-upstream-20240410-apic-mmio-fix1d.fd,readonly=off


Known issues / TODOs
--------------------

 * Base tree in some cases reports "Unpatched return thunk in use. This should
   not happen!" the first time it runs an SVM/SEV/SNP guest. This is a recent
   regression upstream and unrelated to this series:

     https://lore.kernel.org/linux-kernel/CANpmjNOcKzEvLHoGGeL-boWDHJobwfwyVxUqMq2kWeka3N4tXA@mail.gmail.com/T/

 * 2MB hugepage support has been dropped pending discussion on how we plan to
   re-enable it in gmem.

 * Host kexec should work, but there is a known issue with host kdump support
   while SNP guests are running that will be addressed as a follow-up.

 * SNP kselftests are currently a WIP and will be included as part of SNP
   upstreaming efforts in the near-term.


SEV-SNP Overview
----------------

This part of the Secure Nested Paging (SEV-SNP) series focuses on the
changes required to add KVM support for SEV-SNP. This series builds upon
SEV-SNP guest support, which is now in mainline, and SEV-SNP host
initialization support, which is now in linux-next.

While this series provides the basic building blocks to support booting
SEV-SNP VMs, it does not cover all the security enhancements introduced
by SEV-SNP, such as interrupt protection, which will be added in the
future.

With SNP, when pages are marked as guest-owned in the RMP table, they are
assigned to a specific guest/ASID, as well as a specific GFN within the
guest. Any attempt to map such a page in the RMP table to a different
guest/ASID, or a different GFN within a guest/ASID, will result in an RMP
nested page fault.

Prior to accessing a guest-owned page, the guest must validate it with
the special PVALIDATE instruction, which sets the validated bit in the
page's RMP entry. This is the only way to set the validated bit outside
of the initial pre-encrypted guest payload/image; any attempt outside the
guest to modify the RMP entry from that point forward will result in the
validated bit being cleared, at which point the guest will trigger an
exception if it attempts to access that page, so it can be made aware of
possible tampering.
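To make the ownership/validation rules above concrete, here is a rough C
model (my own sketch for illustration only; field names and widths are
simplified and do not reflect the actual hardware RMP entry layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of an RMP entry; not the hardware layout. */
struct rmp_entry {
    bool     assigned;   /* page is guest-owned */
    uint32_t asid;       /* owning guest's ASID */
    uint64_t gfn;        /* GFN the page is assigned to in that guest */
    bool     validated;  /* set only by the guest, via PVALIDATE */
};

/* The check performed on a private access: the page must be assigned
 * to this ASID at this GFN, otherwise an RMP nested page fault is
 * generated. */
static bool rmp_check_private_access(const struct rmp_entry *e,
                                     uint32_t asid, uint64_t gfn)
{
    return e->assigned && e->asid == asid && e->gfn == gfn;
}

/* Any modification of an assigned entry from outside the guest clears
 * the validated bit, so the guest faults on its next access and can
 * detect possible tampering. */
static void rmp_host_modify(struct rmp_entry *e, uint32_t asid,
                            uint64_t gfn)
{
    e->asid = asid;
    e->gfn = gfn;
    e->validated = false;
}
```

This is only meant to convey the invariants; the real RMP checks are
performed by hardware/firmware, not kernel code.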

One exception to this is the initial guest payload, which is pre-validated
by the firmware prior to launch. The guest can use Guest Message requests
to fetch an attestation report, which includes the measurement of the
initial image, so the guest can verify it was booted with the expected
image/environment.

After boot, guests can use Page State Change requests to switch pages
between shared/hypervisor-owned and private/guest-owned to share data for
things like DMA, virtio buffers, and other GHCB requests.

In this implementation of SEV-SNP, private guest memory is managed by a new
kernel framework called guest_memfd (gmem). With gmem, a new
KVM_SET_MEMORY_ATTRIBUTES KVM ioctl has been added to tell the KVM
MMU whether a particular GFN should be backed by shared (normal) memory or
private (gmem-allocated) memory. To tie into this, Page State Change
requests are forwarded to userspace via KVM_EXIT_VMGEXIT exits, which will
then issue the corresponding KVM_SET_MEMORY_ATTRIBUTES call to set the
private/shared state in the KVM MMU.
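As a sketch of the userspace side of this flow (the constant values below
mirror the Linux UAPI headers as of this writing, but treat them as
illustrative and verify against your tree):

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror include/uapi/linux/kvm.h and kvm_para.h as of this
 * writing; verify against your kernel tree before relying on them. */
#define KVM_HC_MAP_GPA_RANGE          12
#define KVM_MAP_GPA_RANGE_ENC_STAT(n) ((uint64_t)(n) << 4)
#define KVM_MAP_GPA_RANGE_ENCRYPTED   KVM_MAP_GPA_RANGE_ENC_STAT(1)
#define KVM_MEMORY_ATTRIBUTE_PRIVATE  (1ULL << 3)

/* When a page-state change request reaches userspace, the VMM inspects
 * the flags and issues KVM_SET_MEMORY_ATTRIBUTES for the GPA range:
 * mark it private if the guest asked for it to be encrypted, otherwise
 * clear the attribute so it is treated as shared. */
static uint64_t psc_flags_to_attributes(uint64_t map_gpa_flags)
{
    return (map_gpa_flags & KVM_MAP_GPA_RANGE_ENCRYPTED)
           ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0;
}
```

The actual ioctl plumbing (struct kvm_memory_attributes, the exit-reason
dispatch) is omitted here for brevity.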

The gmem / KVM MMU hooks implemented in this series will then update the RMP
table entries for the backing PFNs to set them to guest-owned/private when
mapping private pages into the guest via KVM MMU, or use the normal KVM MMU
handling in the case of shared pages where the corresponding RMP table
entries are left in the default shared/hypervisor-owned state.
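A toy model of the page life cycle these hooks manage (behavior
paraphrased from the series; the names here are illustrative, not the
actual hook names):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified per-page RMP state for illustration. */
enum rmp_page_state { RMP_SHARED, RMP_GUEST_OWNED };

struct gmem_page { enum rmp_page_state state; };

/* Prepare hook: runs before a gmem-allocated PFN is mapped into a
 * private GPA range; flips the backing page to guest-owned. */
static void gmem_prepare(struct gmem_page *p)
{
    p->state = RMP_GUEST_OWNED;
}

/* Invalidate hook: runs before the PFN is returned to the host for use
 * as normal memory; restores the default shared/hypervisor-owned state
 * so host accesses won't trigger RMP faults. */
static void gmem_invalidate(struct gmem_page *p)
{
    p->state = RMP_SHARED;
}

/* While guest-owned, a plain host write to the page would take an RMP
 * fault. */
static bool host_write_faults(const struct gmem_page *p)
{
    return p->state == RMP_GUEST_OWNED;
}
```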

Feedback/review is very much appreciated!

-Mike


Changes since v14:

 * switch to vendor-agnostic KVM_HC_MAP_GPA_RANGE exit for forwarding
   page-state change requests to userspace instead of an SNP-specific exit
   (Sean)
 * drop SNP_PAUSE_ATTESTATION/SNP_RESUME_ATTESTATION interfaces, instead
   add handling in KVM_EXIT_VMGEXIT so that VMMs can implement their own
   mechanisms for keeping userspace-supplied certificates in-sync with
   firmware's TCB/endorsement key (Sean)
 * carve out SEV-ES-specific handling for GHCB protocol 2, add control of 
   the protocol version, and post as a separate prereq patchset (Sean)
 * use more consistent error-handling in snp_launch_{start,update,finish},
   simplify logic based on review comments (Sean)
 * rename .gmem_validate_fault to .private_max_mapping_level and rework
   logic based on review suggestions (Sean)
 * reduce number of pr_debug()'s in series, avoid multiple WARN's in
   succession (Sean)
 * improve documentation and comments throughout

Changes since v13:

 * rebase to new kvm-coco-queue and wire up to PFERR_PRIVATE_ACCESS (Paolo)
 * handle setting kvm->arch.has_private_mem in same location as
   kvm->arch.has_protected_state (Paolo)
 * add flags and additional padding fields to
   snp_launch{start,update,finish} APIs to address alignment and
   expandability (Paolo)
 * update snp_launch_update() to update input struct values to reflect
   current progress of the command in situations where multiple calls are
   needed (Paolo)
 * update snp_launch_update() to avoid copying/accessing 'src' parameter
   when dealing with zero pages. (Paolo)
 * update snp_launch_update() to use u64 as length input parameter instead
   of u32 and adjust padding accordingly
 * modify ordering of SNP_POLICY_MASK_* definitions to be consistent with
   bit order of corresponding flags
 * let firmware handle enforcement of policy bits corresponding to
   user-specified minimum API version
 * add missing "0x" prefixes in pr_debug()'s for snp_launch_start()
 * fix handling of VMSAs during in-place migration (Paolo)

Changes since v12:

 * rebased to latest kvm-coco-queue branch (commit 4d2deb62185f)
 * add more input validation for SNP_LAUNCH_START, especially for handling
   things like MBO/MBZ policy bits, and API major/minor minimums. (Paolo)
 * block SNP KVM instances from being able to run legacy SEV commands (Paolo)
 * don't attempt to measure VMSA for vcpu 0/BSP before the others, let
   userspace deal with the ordering just like with SEV-ES (Paolo)
 * fix up docs for SNP_LAUNCH_FINISH (Paolo)
 * introduce svm->sev_es.snp_has_guest_vmsa flag to better distinguish
   handling for guest-mapped vs non-guest-mapped VMSAs, rename
   'snp_ap_create' flag to 'snp_ap_waiting_for_reset' (Paolo)
 * drop "KVM: SEV: Use a VMSA physical address variable for populating VMCB"
   as it is no longer needed due to above VMSA rework
 * replace pr_debug_ratelimited() messages for RMP #NPFs with a single trace
   event
 * handle transient PSMASH_FAIL_INUSE return codes in kvm_gmem_invalidate(),
   switch to WARN_ON*()'s to indicate remaining error cases are not expected
   and should not be seen in practice. (Paolo)
 * add a cond_resched() in kvm_gmem_invalidate() to avoid soft lock-ups when
   cleaning up large guest memory ranges.
 * rename VLEK_REQUIRED to VCEK_DISABLE. it'd be more applicable if another
   key type ever gets added.
 * don't allow attestation to be paused while an attestation request is
   being processed by firmware (Tom)
 * add missing Documentation entry for SNP_VLEK_LOAD
 * collect Reviewed-by's from Paolo and Tom


----------------------------------------------------------------
Ashish Kalra (1):
      KVM: SEV: Avoid WBINVD for HVA-based MMU notifications for SNP

Brijesh Singh (8):
      KVM: SEV: Add initial SEV-SNP support
      KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command
      KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command
      KVM: SEV: Add KVM_SEV_SNP_LAUNCH_FINISH command
      KVM: SEV: Add support to handle GHCB GPA register VMGEXIT
      KVM: SEV: Add support to handle RMP nested page faults
      KVM: SVM: Add module parameter to enable SEV-SNP
      KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event

Michael Roth (10):
      Revert "KVM: x86: Add gmem hook for determining max NPT mapping level"
      KVM: x86: Add hook for determining max NPT mapping level
      KVM: SEV: Select KVM_GENERIC_PRIVATE_MEM when CONFIG_KVM_AMD_SEV=y
      KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT
      KVM: SEV: Add support to handle Page State Change VMGEXIT
      KVM: SEV: Implement gmem hook for initializing private pages
      KVM: SEV: Implement gmem hook for invalidating private pages
      KVM: x86: Implement hook for determining max NPT mapping level
      KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
      crypto: ccp: Add the SNP_VLEK_LOAD command

Tom Lendacky (1):
      KVM: SEV: Support SEV-SNP AP Creation NAE event

 Documentation/virt/coco/sev-guest.rst              |   19 +
 Documentation/virt/kvm/api.rst                     |   87 ++
 .../virt/kvm/x86/amd-memory-encryption.rst         |  110 +-
 arch/x86/include/asm/kvm-x86-ops.h                 |    2 +-
 arch/x86/include/asm/kvm_host.h                    |    5 +-
 arch/x86/include/asm/sev-common.h                  |   25 +
 arch/x86/include/asm/sev.h                         |    3 +
 arch/x86/include/asm/svm.h                         |    9 +-
 arch/x86/include/uapi/asm/kvm.h                    |   48 +
 arch/x86/kvm/Kconfig                               |    3 +
 arch/x86/kvm/mmu.h                                 |    2 -
 arch/x86/kvm/mmu/mmu.c                             |   27 +-
 arch/x86/kvm/svm/sev.c                             | 1538 +++++++++++++++++++-
 arch/x86/kvm/svm/svm.c                             |   44 +-
 arch/x86/kvm/svm/svm.h                             |   52 +
 arch/x86/kvm/trace.h                               |   31 +
 arch/x86/kvm/x86.c                                 |   17 +
 drivers/crypto/ccp/sev-dev.c                       |   36 +
 include/linux/psp-sev.h                            |    4 +-
 include/uapi/linux/kvm.h                           |   23 +
 include/uapi/linux/psp-sev.h                       |   27 +
 include/uapi/linux/sev-guest.h                     |    9 +
 virt/kvm/guest_memfd.c                             |    4 +-
 23 files changed, 2081 insertions(+), 44 deletions(-)

Comments

Michael Roth May 7, 2024, 6:14 p.m. UTC | #1
On Tue, May 07, 2024 at 08:04:50PM +0200, Paolo Bonzini wrote:
> On Wed, May 1, 2024 at 11:03 AM Michael Roth <michael.roth@amd.com> wrote:
> >
> > This patchset is also available at:
> >
> >   https://github.com/amdese/linux/commits/snp-host-v15
> >
> > and is based on top of the series:
> >
> >   "Add SEV-ES hypervisor support for GHCB protocol version 2"
> >   https://lore.kernel.org/kvm/20240501071048.2208265-1-michael.roth@amd.com/
> >   https://github.com/amdese/linux/commits/sev-init2-ghcb-v1
> >
> > which in turn is based on commit 20cc50a0410f (just before v14 SNP patches):
> >
> >   https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=kvm-coco-queue
> 
> I have mostly reviewed this, with the exception of the
> snp_begin/complete_psc parts.

Thanks Paolo. We actually recently uncovered some issues with
snp_begin/complete_psc using some internal kvm-unit-tests that exercise
some edge cases, so I would hold off on reviewing that. Will send a
fix-up patch today after a bit more testing.

> 
> I am not sure about removing all the pr_debug() - I am sure it will be
> a bit more painful for userspace developers to figure out what exactly
> has gone wrong in some cases. But we can add them later, if needed -
> I'm certainly not going to make a fuss about it.

Yah, they do tend to be useful for that purpose. I think if we do add
them back we can consolidate the information a little better versus what
I had previously.

-Mike

Paolo Bonzini May 10, 2024, 1:50 p.m. UTC | #2
On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> > +      * Since software-protected VMs don't have a notion of a shared vs.
> > +      * private that's separate from what KVM is tracking, the above
> > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > +      * special handling for that case for now.
>
> Very technically, it can occur if userspace _just_ modified the attributes.  And
> as I've said multiple times, at least for now, I want to avoid special casing
> SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> is to allow testing flows that are impossible to exercise without SNP/TDX hardware.

Yep, it is not like they have to be optimized.

> > +      */
> > +     if (kvm_slot_can_be_private(fault->slot) &&
> > +         !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > +           vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
>
> Heh, !(x && y) kills me, I misread this like 4 times.
>
> Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
> cause in any reasonable way.  Can't this simply be?
>
>         if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
>                 return false;

You beat me to it by seconds. And it can also be guarded by a check on
kvm->arch.has_private_mem to avoid the attributes lookup.

> Which is much, much more self-explanatory.

Both more self-explanatory and more correct.

Paolo
Michael Roth May 10, 2024, 3:27 p.m. UTC | #3
On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > +      * Since software-protected VMs don't have a notion of a shared vs.
> > > +      * private that's separate from what KVM is tracking, the above
> > > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > +      * special handling for that case for now.
> >
> > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > as I've said multiple times, at least for now, I want to avoid special casing
> > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> 
> Yep, it is not like they have to be optimized.

Ok, I thought there were maybe some future plans to use sw-protected VMs
to get some added protections from userspace. But even then there'd
probably still be extra considerations for how to handle access tracking
so white-listing them probably isn't right anyway.

I was also partly tempted to take this route because it would cover this
TDX patch as well:

  https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

and avoid any weirdness about checking kvm_mem_is_private() without
checking mmu_invalidate_seq, but I think those cases all end up
resolving themselves eventually, and I've added some comments around that.

> 
> > > +      */
> > > +     if (kvm_slot_can_be_private(fault->slot) &&
> > > +         !(IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
> > > +           vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM))
> >
> > Heh, !(x && y) kills me, I misread this like 4 times.
> >
> > Anyways, I don't like the heuristic.  It doesn't tie the restriction back to the
> > cause in any reasonable way.  Can't this simply be?
> >
> >         if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> >                 return false;
> 
> You beat me to it by seconds. And it can also be guarded by a check on
> kvm->arch.has_private_mem to avoid the attributes lookup.

I re-tested with things implemented this way and everything seems to
look good. It's not clear to me whether this would cover the cases the
above-mentioned TDX patch handles, but no biggie if that's still needed.

The new version of the patch is here:

  https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

And I've updated my branches to replace the old patch and also
incorporate Sean's suggestions for patch 22:

  https://github.com/mdroth/linux/commits/snp-host-v15c3-unsquashed

and have them here with things already squashed in/relocated:

  https://github.com/mdroth/linux/commits/snp-host-v15c3

Thanks for the feedback Sean, Paolo.

-Mike
  
> 
> > Which is much, much more self-explanatory.
> 
> Both more self-explanatory and more correct.
> 
> Paolo
> 
>
Michael Roth May 10, 2024, 3:36 p.m. UTC | #4
On Fri, May 10, 2024 at 06:58:45AM -0700, Sean Christopherson wrote:
> On Thu, May 09, 2024, Michael Roth wrote:
> > The intended logic when handling #NPFs with the RMP bit set (31) is to
> > first check to see if the #NPF requires a shared<->private transition
> > and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> > get forwarded on to userspace before proceeding with any handling of
> > other potential RMP fault conditions like needing to PSMASH the RMP
> > entry/etc (which will be done later if the guest still re-faults after
> > the KVM_EXIT_MEMORY_FAULT is processed by userspace).
> > 
> > The determination of whether any userspace handling of
> > KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> > of kvm_mmu_page_fault(). However, the current code misinterprets the
> > return code, expecting 0 to indicate a userspace exit rather than less
> > than 0 (-EFAULT). This leads to the following unexpected behavior:
> > 
> >   - for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
> >     conversions, warnings get printed from sev_handle_rmp_fault()
> >     because it does not expect to be called for GPAs where
> >     KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
> >     generally do this, but it is allowed and should be handled
> >     similarly to private->shared conversions rather than triggering any
> >     sort of warnings
> > 
> >   - if gmem support for 2MB folios is enabled (via currently out-of-tree
> >     code), implicit shared<->private conversions will always result in
> >     a PSMASH being attempted, even if it's not actually needed to
> >     resolve the RMP fault. This doesn't cause any harm, but results in a
> >     needless PSMASH and zapping of the sPTE
> > 
> > Resolve these issues by calling sev_handle_rmp_fault() only when
> > kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> > indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
> > simplify the code slightly and fix up the associated comments for better
> > clarity.
> > 
> > Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
> > 
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > ---
> >  arch/x86/kvm/svm/svm.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 426ad49325d7..9431ce74c7d4 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> >  				svm->vmcb->control.insn_len);
> >  
> >  	/*
> > -	 * rc == 0 indicates a userspace exit is needed to handle page
> > -	 * transitions, so do that first before updating the RMP table.
> > +	 * rc < 0 indicates a userspace exit may be needed to handle page
> > +	 * attribute updates, so deal with that first before handling other
> > +	 * potential RMP fault conditions.
> >  	 */
> > -	if (error_code & PFERR_GUEST_RMP_MASK) {
> > -		if (rc == 0)
> > -			return rc;
> > +	if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
> 
> This isn't correct either.  A return of '0' also indicates "exit to userspace",
> it just doesn't happen with SNP because '0' is returned only when KVM attempts
> emulation, and that too gets short-circuited by svm_check_emulate_instruction().
> 
> And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> values overload is ubiquitous enough that it should be relatively self-explanatory.
> 
> Or if you prefer to keep a comment, drop the part that specifically calls out
> attributes updates, because that incorrectly implies that's the _only_ reason
> why KVM checks the return.  But my vote is to drop the comment, because it
> essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> makes the reader go "well, yeah".

Ok, I think I was just paranoid after missing this. I've gone ahead and
dropped the comment, and hopefully it's now drilled into my head enough
that it's obvious to me now as well :) I've also changed the logic to
skip the extra RMP handling for rc==0 as well (should that ever arise
for any future reason):

  https://github.com/mdroth/linux/commit/0a0ba0d7f7571a31f0abc68acc51f24c2a14a8cf

Thanks!

-Mike
Sean Christopherson May 10, 2024, 3:59 p.m. UTC | #5
On Fri, May 10, 2024, Michael Roth wrote:
> On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > +      * Since software-protected VMs don't have a notion of a shared vs.
> > > > +      * private that's separate from what KVM is tracking, the above
> > > > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > +      * special handling for that case for now.
> > >
> > > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > > as I've said multiple times, at least for now, I want to avoid special casing
> > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> > 
> > Yep, it is not like they have to be optimized.
> 
> Ok, I thought there were maybe some future plans to use sw-protected VMs
> to get some added protections from userspace. But even then there'd
> probably still be extra considerations for how to handle access tracking
> so white-listing them probably isn't right anyway.
> 
> I was also partly tempted to take this route because it would cover this
> TDX patch as well:
> 
>   https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/

Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
fixing, just in a less precise way.  S-EPT entries only support RWX=0 and RWX=111b,
i.e. it should be impossible to have a write-fault to a present S-EPT entry.

And if TDX is running afoul of this code:

	if (!fault->present)
		return !kvm_ad_enabled();

then KVM should do the sane thing and require A/D support be enabled for TDX.

And if it's something else entirely, that changelog has some explaining to do.

> and avoid any weirdness about checking kvm_mem_is_private() without
> checking mmu_invalidate_seq, but I think those cases all end up
> resolving themselves eventually, and I've added some comments around that.

Yep, checking state that is protected by mmu_invalidate_seq outside of mmu_lock
is definitely allowed, e.g. the entire fast page fault path operates outside of
mmu_lock and thus outside of mmu_invalidate_seq's purview.

It's a-ok because the SPTE updates are done with an atomic CMPXCHG, and so KVM only needs
to ensure its page tables aren't outright _freed_.  If the zap triggered by the
attributes change "wins", then the fast #PF path will fail the CMPXCHG and be an
expensive NOP.  If the fast #PF wins, the zap will pave over the fast #PF fix,
and the IPI+flush that is needed for all zaps, to ensure vCPUs don't have stale
references, does the rest.

And if there's an attributes race that causes the fast #PF to bail early, the vCPU
will see the correct state on the next page fault.
Paolo Bonzini May 10, 2024, 4:01 p.m. UTC | #6
On 5/10/24 15:58, Sean Christopherson wrote:
> On Thu, May 09, 2024, Michael Roth wrote:
>> The intended logic when handling #NPFs with the RMP bit set (31) is to
>> first check to see if the #NPF requires a shared<->private transition
>> and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
>> get forwarded on to userspace before proceeding with any handling of
>> other potential RMP fault conditions like needing to PSMASH the RMP
>> entry/etc (which will be done later if the guest still re-faults after
>> the KVM_EXIT_MEMORY_FAULT is processed by userspace).
>>
>> The determination of whether any userspace handling of
>> KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
>> of kvm_mmu_page_fault(). However, the current code misinterprets the
>> return code, expecting 0 to indicate a userspace exit rather than less
>> than 0 (-EFAULT). This leads to the following unexpected behavior:
>>
>>    - for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
>>      conversions, warnings get printed from sev_handle_rmp_fault()
>>      because it does not expect to be called for GPAs where
>>      KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
>>      generally do this, but it is allowed and should be handled
>>      similarly to private->shared conversions rather than triggering any
>>      sort of warnings
>>
>>    - if gmem support for 2MB folios is enabled (via currently out-of-tree
>>      code), implicit shared<->private conversions will always result in
>>      a PSMASH being attempted, even if it's not actually needed to
>>      resolve the RMP fault. This doesn't cause any harm, but results in a
>>      needless PSMASH and zapping of the sPTE
>>
>> Resolve these issues by calling sev_handle_rmp_fault() only when
>> kvm_mmu_page_fault()'s return code is greater than or equal to 0,
>> indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
>> simplify the code slightly and fix up the associated comments for better
>> clarity.
>>
>> Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>> ---
>>   arch/x86/kvm/svm/svm.c | 10 ++++------
>>   1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 426ad49325d7..9431ce74c7d4 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>>   				svm->vmcb->control.insn_len);
>>   
>>   	/*
>> -	 * rc == 0 indicates a userspace exit is needed to handle page
>> -	 * transitions, so do that first before updating the RMP table.
>> +	 * rc < 0 indicates a userspace exit may be needed to handle page
>> +	 * attribute updates, so deal with that first before handling other
>> +	 * potential RMP fault conditions.
>>   	 */
>> -	if (error_code & PFERR_GUEST_RMP_MASK) {
>> -		if (rc == 0)
>> -			return rc;
>> +	if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
> 
> This isn't correct either.  A return of '0' also indicates "exit to userspace",
> it just doesn't happen with SNP because '0' is returned only when KVM attempts
> emulation, and that too gets short-circuited by svm_check_emulate_instruction().
> 
> And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> values overload is ubiquitous enough that it should be relatively self-explanatory.
> 
> Or if you prefer to keep a comment, drop the part that specifically calls out
> attributes updates, because that incorrectly implies that's the _only_ reason
> why KVM checks the return.  But my vote is to drop the comment, because it
> essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> makes the reader go "well, yeah".

So IIUC you're suggesting

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 426ad49325d7..c39eaeb21981 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2068,16 +2068,11 @@ static int npf_interception(struct kvm_vcpu *vcpu)
  				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
  				svm->vmcb->control.insn_bytes : NULL,
  				svm->vmcb->control.insn_len);
+	if (rc <= 0)
+		return rc;
  
-	/*
-	 * rc == 0 indicates a userspace exit is needed to handle page
-	 * transitions, so do that first before updating the RMP table.
-	 */
-	if (error_code & PFERR_GUEST_RMP_MASK) {
-		if (rc == 0)
-			return rc;
+	if (error_code & PFERR_GUEST_RMP_MASK)
  		sev_handle_rmp_fault(vcpu, fault_address, error_code);
-	}
  
  	return rc;
  }

?

So, we're... a bit tight for 6.10 to include SNP and that is an
understatement.  My plan is to merge it for 6.11, but do so
immediately after the merge window ends.  In other words, it
is a delay in terms of release but not in terms of time.  I
don't want QEMU and kvm-unit-tests work to be delayed any
further, in particular.

Once we sort out the loose ends of patches 21-23, you could send
it as a pull request.

Paolo
Michael Roth May 10, 2024, 4:37 p.m. UTC | #7
On Fri, May 10, 2024 at 06:01:52PM +0200, Paolo Bonzini wrote:
> On 5/10/24 15:58, Sean Christopherson wrote:
> > On Thu, May 09, 2024, Michael Roth wrote:
> > > The intended logic when handling #NPFs with the RMP bit set (31) is to
> > > first check to see if the #NPF requires a shared<->private transition
> > > and, if so, to go ahead and let the corresponding KVM_EXIT_MEMORY_FAULT
> > > get forwarded on to userspace before proceeding with any handling of
> > > other potential RMP fault conditions like needing to PSMASH the RMP
> > > entry/etc (which will be done later if the guest still re-faults after
> > > the KVM_EXIT_MEMORY_FAULT is processed by userspace).
> > > 
> > > The determination of whether any userspace handling of
> > > KVM_EXIT_MEMORY_FAULT is needed is done by interpreting the return code
> > > of kvm_mmu_page_fault(). However, the current code misinterprets the
> > > return code, expecting 0 to indicate a userspace exit rather than less
> > > than 0 (-EFAULT). This leads to the following unexpected behavior:
> > > 
> > >    - for KVM_EXIT_MEMORY_FAULTs resulting from implicit shared->private
> > >      conversions, warnings get printed from sev_handle_rmp_fault()
> > >      because it does not expect to be called for GPAs where
> > >      KVM_MEMORY_ATTRIBUTE_PRIVATE is not set. Standard Linux guests don't
> > >      generally do this, but it is allowed and should be handled
> > >      similarly to private->shared conversions rather than triggering any
> > >      sort of warnings
> > > 
> > >    - if gmem support for 2MB folios is enabled (via currently out-of-tree
> > >      code), implicit shared<->private conversions will always result in
> > >      a PSMASH being attempted, even if it's not actually needed to
> > >      resolve the RMP fault. This doesn't cause any harm, but results in a
> > >      needless PSMASH and zapping of the sPTE
> > > 
> > > Resolve these issues by calling sev_handle_rmp_fault() only when
> > > kvm_mmu_page_fault()'s return code is greater than or equal to 0,
> > > indicating a KVM_EXIT_MEMORY_FAULT/-EFAULT isn't needed. While here,
> > > simplify the code slightly and fix up the associated comments for better
> > > clarity.
> > > 
> > > Fixes: ccc9d836c5c3 ("KVM: SEV: Add support to handle RMP nested page faults")
> > > 
> > > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > > ---
> > >   arch/x86/kvm/svm/svm.c | 10 ++++------
> > >   1 file changed, 4 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 426ad49325d7..9431ce74c7d4 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -2070,14 +2070,12 @@ static int npf_interception(struct kvm_vcpu *vcpu)
> > >   				svm->vmcb->control.insn_len);
> > >   	/*
> > > -	 * rc == 0 indicates a userspace exit is needed to handle page
> > > -	 * transitions, so do that first before updating the RMP table.
> > > +	 * rc < 0 indicates a userspace exit may be needed to handle page
> > > +	 * attribute updates, so deal with that first before handling other
> > > +	 * potential RMP fault conditions.
> > >   	 */
> > > -	if (error_code & PFERR_GUEST_RMP_MASK) {
> > > -		if (rc == 0)
> > > -			return rc;
> > > +	if (rc >= 0 && error_code & PFERR_GUEST_RMP_MASK)
> > 
> > This isn't correct either.  A return of '0' also indicates "exit to userspace",
> > it just doesn't happen with SNP because '0' is returned only when KVM attempts
> > emulation, and that too gets short-circuited by svm_check_emulate_instruction().
> > 
> > And I would honestly drop the comment, KVM's less-than-pleasant 1/0/-errno return
> > values overload is ubiquitous enough that it should be relatively self-explanatory.
> > 
> > Or if you prefer to keep a comment, drop the part that specifically calls out
> > attributes updates, because that incorrectly implies that's the _only_ reason
> > why KVM checks the return.  But my vote is to drop the comment, because it
> > essentially becomes "don't proceed to step 2 if step 1 failed", which kind of
> > makes the reader go "well, yeah".
> 
> So IIUC you're suggesting
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 426ad49325d7..c39eaeb21981 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2068,16 +2068,11 @@ static int npf_interception(struct kvm_vcpu *vcpu)
>  				static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
>  				svm->vmcb->control.insn_bytes : NULL,
>  				svm->vmcb->control.insn_len);
> +	if (rc <= 0)
> +		return rc;
> -	/*
> -	 * rc == 0 indicates a userspace exit is needed to handle page
> -	 * transitions, so do that first before updating the RMP table.
> -	 */
> -	if (error_code & PFERR_GUEST_RMP_MASK) {
> -		if (rc == 0)
> -			return rc;
> +	if (error_code & PFERR_GUEST_RMP_MASK)
>  		sev_handle_rmp_fault(vcpu, fault_address, error_code);
> -	}
>  	return rc;
>  }
> 
> ?
> 
> So, we're... a bit tight for 6.10 to include SNP and that is an
> understatement.  My plan is to merge it for 6.11, but do so
> immediately after the merge window ends.  In other words, it
> is a delay in terms of release but not in terms of time.  I
> don't want QEMU and kvm-unit-tests work to be delayed any
> further, in particular.

That's unfortunate, I'd thought from the PUCK call that we still had
some time to stabilize things before merge window. But whatever you
think is best.

> 
> Once we sort out the loose ends of patches 21-23, you could send
> it as a pull request.

Ok, as a pull request against kvm/next, or kvm/queue?

Thanks,

Mike

> 
> Paolo
> 
>
Paolo Bonzini May 10, 2024, 4:59 p.m. UTC | #8
On 5/10/24 18:37, Michael Roth wrote:
>> So, we're... a bit tight for 6.10 to include SNP and that is an
>> understatement.  My plan is to merge it for 6.11, but do so
>> immediately after the merge window ends.  In other words, it
>> is a delay in terms of release but not in terms of time.  I
>> don't want QEMU and kvm-unit-tests work to be delayed any
>> further, in particular.
>
> That's unfortunate, I'd thought from the PUCK call that we still had
> some time to stabilize things before merge window. But whatever you
> think is best.

Well, the merge window starts next Sunday, doesn't it?  If there's an
-rc8 I agree there's some leeway, but that is not too likely.

>> Once we sort out the loose ends of patches 21-23, you could send
>> it as a pull request.
> Ok, as a pull request against kvm/next, or kvm/queue?

Against kvm/next.

Paolo
Paolo Bonzini May 10, 2024, 5:25 p.m. UTC | #9
On Fri, May 10, 2024 at 6:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> Well, the merge window starts next sunday, doesn't it?  If there's an
> -rc8 I agree there's some leeway, but that is not too likely.
>
> >> Once we sort out the loose ends of patches 21-23, you could send
> >> it as a pull request.
> > Ok, as a pull request against kvm/next, or kvm/queue?
>
> Against kvm/next.

Ah no, only kvm/queue has the preparatory hooks - they make no sense
without something that uses them.  kvm/queue is ready now.

Also, please send the pull request "QEMU style", i.e. with patches
as replies.

If there's an -rc8, I'll probably pull it on Thursday morning.

Paolo
Isaku Yamahata May 10, 2024, 5:47 p.m. UTC | #10
On Fri, May 10, 2024 at 08:59:09AM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, May 10, 2024, Michael Roth wrote:
> > On Fri, May 10, 2024 at 03:50:26PM +0200, Paolo Bonzini wrote:
> > > On Fri, May 10, 2024 at 3:47 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >
> > > > > +      * Since software-protected VMs don't have a notion of a shared vs.
> > > > > +      * private that's separate from what KVM is tracking, the above
> > > > > +      * KVM_EXIT_MEMORY_FAULT condition wouldn't occur, so avoid the
> > > > > +      * special handling for that case for now.
> > > >
> > > > Very technically, it can occur if userspace _just_ modified the attributes.  And
> > > > as I've said multiple times, at least for now, I want to avoid special casing
> > > > SW-protected VMs unless it is *absolutely* necessary, because their sole purpose
> > > > is to allow testing flows that are impossible to exercise without SNP/TDX hardware.
> > > 
> > > Yep, it is not like they have to be optimized.
> > 
> > Ok, I thought there were maybe some future plans to use sw-protected VMs
> > to get some added protections from userspace. But even then there'd
> > probably still be extra considerations for how to handle access tracking
> > so white-listing them probably isn't right anyway.
> > 
> > I was also partly tempted to take this route because it would cover this
> > TDX patch as well:
> > 
> >   https://lore.kernel.org/lkml/91c797997b57056224571e22362321a23947172f.1705965635.git.isaku.yamahata@intel.com/
> 
> Hmm, I'm pretty sure that patch is trying to fix the exact same issue you are
> fixing, just in a less precise way.  S-EPT entries only support RWX=0 and RWX=111b,
> i.e. it should be impossible to have a write-fault to a present S-EPT entry.
> 
> And if TDX is running afoul of this code:
> 
> 	if (!fault->present)
> 		return !kvm_ad_enabled();
> 
> then KVM should do the sane thing and require A/D support be enabled for TDX.
> 
> And if it's something else entirely, that changelog has some explaining to do.

Yes, it's for the KVM_EXIT_MEMORY_FAULT case.  Because Secure-EPT entries are
either non-present or allow all of RWX, the fast page fault path always
returns RET_PF_INVALID via the is_shadow_present_pte() check.

I lightly tested the patch at [1] and it works for TDX KVM.

[1] https://github.com/mdroth/linux/commit/39643f9f6da6265d39d633a703c53997985c1208

For what it's worth, for that patch,
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Michael Roth May 10, 2024, 7:08 p.m. UTC | #11
On Fri, May 10, 2024 at 07:09:07PM +0200, Paolo Bonzini wrote:
> On 5/10/24 03:58, Michael Roth wrote:
> > There are a few edge-cases that the current processing for GHCB PSC
> > requests doesn't handle properly:
> > 
> >   - KVM properly ignores SMASH/UNSMASH ops when they are embedded in a
> >     PSC request buffer which contains other PSC operations, but
> >     inadvertently forwards them to userspace as private->shared PSC
> >     requests if they appear at the end of the buffer. Make sure these are
> >     ignored instead, just like cases where they are not at the end of the
> >     request buffer.
> > 
> >   - Current code handles non-zero 'cur_page' fields when they are at the
> >     beginning of a new GPA range, but it is not handling properly when
> >     iterating through subsequent entries which are otherwise part of a
> >     contiguous range. Fix up the handling so that these entries are not
> >     combined into a larger contiguous range that include unintended GPA
> >     ranges and are instead processed later as the start of a new
> >     contiguous range.
> > 
> >   - The page size variable used to track 2M entries in KVM for inflight PSCs
> >     might be artifically set to a different value, which can lead to
> >     unexpected values in the entry's final 'cur_page' update. Use the
> >     entry's 'pagesize' field instead to determine what the value of
> >     'cur_page' should be upon completion of processing.
> > 
> > While here, also add a small helper for clearing in-flight PSCs
> > variables and fix up comments for better readability.
> > 
> > Fixes: 266205d810d2 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> 
> There are some more improvements that can be made to the readability of
> the code... this one is already better than the patch is fixing up, but I
> don't like the code that is in the loop even though it is unconditionally
> followed by "break".
> 
> Here's my attempt at replacing this patch, which is really more of a
> rewrite of the whole function...  Untested beyond compilation.

Thanks for the suggested rework. I tested with/without 2MB pages and
everything worked as-written. This is the full/squashed patch I plan to
include in the pull request:

  https://github.com/mdroth/linux/commit/91f6d31c4dfc88dd1ac378e2db6117b0c982e63c

-Mike

> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 35f0bd91f92e..6e612789c35f 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3555,23 +3555,25 @@ struct psc_buffer {
>  static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);
> -static int snp_complete_psc(struct kvm_vcpu *vcpu)
> +static void snp_complete_psc(struct vcpu_svm *svm, u64 psc_ret)
> +{
> +	svm->sev_es.psc_inflight = 0;
> +	svm->sev_es.psc_idx = 0;
> +	svm->sev_es.psc_2m = 0;
> +	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> +}
> +
> +static void __snp_complete_one_psc(struct vcpu_svm *svm)
>  {
> -	struct vcpu_svm *svm = to_svm(vcpu);
>  	struct psc_buffer *psc = svm->sev_es.ghcb_sa;
>  	struct psc_entry *entries = psc->entries;
>  	struct psc_hdr *hdr = &psc->hdr;
> -	__u64 psc_ret;
>  	__u16 idx;
> -	if (vcpu->run->hypercall.ret) {
> -		psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> -		goto out_resume;
> -	}
> -
>  	/*
>  	 * Everything in-flight has been processed successfully. Update the
> -	 * corresponding entries in the guest's PSC buffer.
> +	 * corresponding entries in the guest's PSC buffer and zero out the
> +	 * count of in-flight PSC entries.
>  	 */
>  	for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
>  	     svm->sev_es.psc_inflight--, idx++) {
> @@ -3581,17 +3583,22 @@ static int snp_complete_psc(struct kvm_vcpu *vcpu)
>  	}
>  	hdr->cur_entry = idx;
> +}
> +
> +static int snp_complete_one_psc(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	struct psc_buffer *psc = svm->sev_es.ghcb_sa;
> +
> +	if (vcpu->run->hypercall.ret) {
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> +		return 1; /* resume guest */
> +	}
> +
> +	__snp_complete_one_psc(svm);
>  	/* Handle the next range (if any). */
>  	return snp_begin_psc(svm, psc);
> -
> -out_resume:
> -	svm->sev_es.psc_idx = 0;
> -	svm->sev_es.psc_inflight = 0;
> -	svm->sev_es.psc_2m = false;
> -	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> -
> -	return 1; /* resume guest */
>  }
>  static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
> @@ -3601,18 +3608,20 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
>  	struct psc_hdr *hdr = &psc->hdr;
>  	struct psc_entry entry_start;
>  	u16 idx, idx_start, idx_end;
> -	__u64 psc_ret, gpa;
> +	u64 gfn;
>  	int npages;
> -
> -	/* There should be no other PSCs in-flight at this point. */
> -	if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
> -		psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> -		goto out_resume;
> -	}
> +	bool huge;
>  	if (!(vcpu->kvm->arch.hypercall_exit_enabled & (1 << KVM_HC_MAP_GPA_RANGE))) {
> -		psc_ret = VMGEXIT_PSC_ERROR_GENERIC;
> -		goto out_resume;
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> +		return 1;
> +	}
> +
> +next_range:
> +	/* There should be no other PSCs in-flight at this point. */
> +	if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
> +		return 1;
>  	}
>  	/*
> @@ -3624,97 +3633,99 @@ static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
>  	idx_end = hdr->end_entry;
>  	if (idx_end >= VMGEXIT_PSC_MAX_COUNT) {
> -		psc_ret = VMGEXIT_PSC_ERROR_INVALID_HDR;
> -		goto out_resume;
> -	}
> -
> -	/* Nothing more to process. */
> -	if (idx_start > idx_end) {
> -		psc_ret = 0;
> -		goto out_resume;
> +		snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_HDR);
> +		return 1;
>  	}
>  	/* Find the start of the next range which needs processing. */
>  	for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
> -		__u16 cur_page;
> -		gfn_t gfn;
> -		bool huge;
> -
>  		entry_start = entries[idx];
> -
> -		/* Only private/shared conversions are currently supported. */
> -		if (entry_start.operation != VMGEXIT_PSC_OP_PRIVATE &&
> -		    entry_start.operation != VMGEXIT_PSC_OP_SHARED)
> -			continue;
> -
>  		gfn = entry_start.gfn;
> -		cur_page = entry_start.cur_page;
>  		huge = entry_start.pagesize;
> +		npages = huge ? 512 : 1;
> -		if ((huge && (cur_page > 512 || !IS_ALIGNED(gfn, 512))) ||
> -		    (!huge && cur_page > 1)) {
> -			psc_ret = VMGEXIT_PSC_ERROR_INVALID_ENTRY;
> -			goto out_resume;
> +		if (entry_start.cur_page > npages || !IS_ALIGNED(gfn, npages)) {
> +			snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_ENTRY);
> +			return 1;
>  		}
> +		if (entry_start.cur_page) {
> +			/*
> +			 * If this is a partially-completed 2M range, force 4K
> +			 * handling for the remaining pages since they're effectively
> +			 * split at this point. Subsequent code should ensure this
> +			 * doesn't get combined with adjacent PSC entries where 2M
> +			 * handling is still possible.
> +			 */
> +			npages -= entry_start.cur_page;
> +			gfn += entry_start.cur_page;
> +			huge = false;
> +		}
> +		if (npages)
> +			break;
> +
>  		/* All sub-pages already processed. */
> -		if ((huge && cur_page == 512) || (!huge && cur_page == 1))
> -			continue;
> -
> -		/*
> -		 * If this is a partially-completed 2M range, force 4K handling
> -		 * for the remaining pages since they're effectively split at
> -		 * this point. Subsequent code should ensure this doesn't get
> -		 * combined with adjacent PSC entries where 2M handling is still
> -		 * possible.
> -		 */
> -		svm->sev_es.psc_2m = cur_page ? false : huge;
> -		svm->sev_es.psc_idx = idx;
> -		svm->sev_es.psc_inflight = 1;
> -
> -		gpa = gfn_to_gpa(gfn + cur_page);
> -		npages = huge ? 512 - cur_page : 1;
> -		break;
>  	}
> +	if (idx > idx_end) {
> +		/* Nothing more to process. */
> +		snp_complete_psc(svm, 0);
> +		return 1;
> +	}
> +
> +	svm->sev_es.psc_2m = huge;
> +	svm->sev_es.psc_idx = idx;
> +	svm->sev_es.psc_inflight = 1;
> +
>  	/*
>  	 * Find all subsequent PSC entries that contain adjacent GPA
>  	 * ranges/operations and can be combined into a single
>  	 * KVM_HC_MAP_GPA_RANGE exit.
>  	 */
> -	for (idx = svm->sev_es.psc_idx + 1; idx <= idx_end; idx++) {
> +	while (++idx <= idx_end) {
>  		struct psc_entry entry = entries[idx];
>  		if (entry.operation != entry_start.operation ||
> -		    entry.gfn != entry_start.gfn + npages ||
> -		    !!entry.pagesize != svm->sev_es.psc_2m)
> +		    entry.gfn != gfn + npages ||
> +		    entry.cur_page ||
> +		    !!entry.pagesize != huge)
>  			break;
>  		svm->sev_es.psc_inflight++;
> -		npages += entry_start.pagesize ? 512 : 1;
> +		npages += huge ? 512 : 1;
>  	}
> -	vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
> -	vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> -	vcpu->run->hypercall.args[0] = gpa;
> -	vcpu->run->hypercall.args[1] = npages;
> -	vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
> -				       ? KVM_MAP_GPA_RANGE_ENCRYPTED
> -				       : KVM_MAP_GPA_RANGE_DECRYPTED;
> -	vcpu->run->hypercall.args[2] |= entry_start.pagesize
> -					? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
> -					: KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
> -	vcpu->arch.complete_userspace_io = snp_complete_psc;
> +	switch (entry_start.operation) {
> +	case VMGEXIT_PSC_OP_PRIVATE:
> +	case VMGEXIT_PSC_OP_SHARED:
> +		vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
> +		vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
> +		vcpu->run->hypercall.args[0] = gfn_to_gpa(gfn);
> +		vcpu->run->hypercall.args[1] = npages;
> +		vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
> +			? KVM_MAP_GPA_RANGE_ENCRYPTED
> +			: KVM_MAP_GPA_RANGE_DECRYPTED;
> +		vcpu->run->hypercall.args[2] |= huge
> +			? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
> +			: KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
> +		vcpu->arch.complete_userspace_io = snp_complete_one_psc;
> -	return 0; /* forward request to userspace */
> +		return 0; /* forward request to userspace */
> -out_resume:
> -	svm->sev_es.psc_idx = 0;
> -	svm->sev_es.psc_inflight = 0;
> -	svm->sev_es.psc_2m = false;
> -	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, psc_ret);
> +	default:
> +		/*
> +		 * Only shared/private PSC operations are currently supported, so if the
> +		 * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
> +		 * then consider the entire range completed and avoid exiting to
> +		 * userspace. In theory snp_complete_psc() can be called directly
> +		 * at this point to complete the current range and start the next one,
> +		 * but that could lead to unexpected levels of recursion.
> +		 */
> +		__snp_complete_one_psc(svm);
> +		goto next_range;
> +	}
> -	return 1; /* resume guest */
> +	unreachable();
>  }
>  static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu)
> 
>
Borislav Petkov May 14, 2024, 8:10 a.m. UTC | #12
On May 10, 2024 6:59:37 PM GMT+02:00, Paolo Bonzini <pbonzini@redhat.com> wrote:
>Well, the merge window starts next Sunday, doesn't it?  If there's an -rc8 I agree there's some leeway, but that is not too likely.

Nah, the merge window just opened yesterday.