
Automating the deployment of my Homelab AI Infrastructure


In a previous post, I wrote about using my VMware lab with an NVIDIA Tesla P4 for running some AI services. However, that deployment had the GPU in passthrough mode (which I will refer to as "GPU" below). I wanted to take this to the next level, and I also wanted to automate most of the steps. This was for a few reasons: firstly, I wanted to get better at automation in general; secondly, I found the setup brittle and wanted to improve the reliability of deployments. This post covers using automation to deploy the VM infrastructure required to run AI workloads.

I also decided to bite the bullet and update the graphics card I was using to something a bit more modern and capable. After a bit of searching, I decided on an NVIDIA A10.

Here is my write-up, and a link to my GitHub repository with the relevant code.

GPU vs vGPU

For anyone not familiar, it may be worth giving an overview of the differences between GPU and vGPU. What I describe as "GPU" is passing the entire graphics PCI device through to a single VM within vSphere. This method has some advantages: firstly, it doesn't require any special licences or drivers, as the entire PCI card gets passed into the VM as-is. However, it has some downsides. The two most important ones for me are that the card can only be presented to a single VM at a time, and that the VM cannot be snapshotted while it is powered on. This made backups convoluted, and as I was changing configurations a lot, it became tedious. I also wanted to be able to pass the card through to multiple VMs.

vGPU is only officially supported on "datacentre" cards from NVIDIA. It virtualizes the graphics card and allows you to share it across multiple VMs, in a similar way to what vSphere has done for compute virtualization for years. It lets you split the card into a set of predefined profiles and attach them to multiple virtual machines at the same time.

Pre-reqs

There are quite a lot of pre-reqs that need to be in place to utilise the attached deployment scripts, so it's worth ensuring that all of these are complete.

  • Obviously, you need a vSphere host and vCenter licensed at least at the Enterprise level (I am using Enterprise Plus on 8.0u3)
  • An NVIDIA datacenter graphics card and the associated host and guest drivers. I am using an A10 with the 535.247 driver branch (host 535.247.02, guest 535.247.01)
  • NVIDIA vGPU licence. Trials are available here if needed
  • NFS Server (used for NVIDIA software deployment)
  • Domain hosted with Cloudflare and API token with Zone:DNS:Edit permissions.
  • SSH access to the vGPU VM with sudo permissions
  • Internal DNS Records created for
    • vGPU VM
    • Traefik Dashboard
    • Test NGINX Server – Optional
    • NFS Server (Used for vGPU Software install)

Host Preparation

To make vGPU work successfully, there are two elements required: the vSphere host needs the NVIDIA host driver installed, and the guest VM needs the matching guest driver (covered in the guest setup later). For now, I'm using the '535' version of the driver, which is the LTS release. To utilise vGPU with vSphere you need an NVIDIA account to obtain the software. Once this has been obtained, copy the host driver to a VMware datastore, place the host in maintenance mode, and then install the driver.
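Entering maintenance mode from an SSH session on the host is a one-liner; this is just a sketch, and the vSphere Client works equally well. The driver install itself follows below.

esxcli system maintenanceMode set --enable true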

esxcli software vib install -d /vmfs/volumes/02cb3d2b-457ccb84/nvidia/NVIDIA-GRID-vSphere-8.0-535.247.02-535.247.01-539.28.zip

If it worked successfully, you should get a result like the below

Installation Result
   Message: Operation finished successfully.
   VIBs Installed: NVD_bootbank_NVD-VMware_ESXi_8.0.0_Driver_535.247.02-1OEM.800.1.0.20613240
   VIBs Removed: 
   VIBs Skipped: 
   Reboot Required: false
   DPU Results: 

I always choose to restart the host after installing the driver. I have been bitten in the past when the installer said a reboot wasn't required but it was.
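If you want to do the reboot from the same SSH session, something like the following works while the host is still in maintenance mode (a sketch; the reason string is just a label):

esxcli system shutdown reboot -r "NVIDIA vGPU host driver install"
# after the host comes back up, take it out of maintenance mode
esxcli system maintenanceMode set --enable false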

Once the host has restarted, SSH into it and validate that the driver is talking to the card correctly. This can be done with the nvidia-smi command. Running it on the ESXi host will show something similar to the below if it's working.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.02                Driver Version: 535.247.02     CUDA Version: N/A  |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     On  | 00000000:03:00.0 Off |                    0 |
|  0%   41C    P8              22W / 150W |  11200MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
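With the host driver loaded, you can also check which vGPU profiles the card supports from the same ESXi shell. This is a sketch using nvidia-smi's vgpu query options:

nvidia-smi vgpu -s   # vGPU types supported by each GPU
nvidia-smi vgpu -c   # vGPU types that can currently be created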

Guest Setup

I have built some Ansible playbooks that perform a number of activities to make it MUCH easier to get the guest up and running for AI workloads and other containers. These have only been tested with Ubuntu but will probably work with other Linux distros without many changes.

At a high level, they deploy Docker, install the NVIDIA guest drivers and licence, install the NVIDIA Container Toolkit, and deploy Traefik as a reverse proxy.

Ansible Details

I have kept these as separate playbooks for now. This hopefully makes it easier to follow and/or troubleshoot if needed. The playbooks are intended to be run in order.

I am using SemaphoreUI to handle the deployment, but this isn't required.
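If you'd rather not use Semaphore, the same playbooks can be run straight from the Ansible CLI in the order below. This is only a sketch with hypothetical inventory, playbook, and variable filenames; use the actual names from the repository.

# filenames below are illustrative
ansible-playbook -i inventory.yml docker.yml
ansible-playbook -i inventory.yml nvidia-guest-driver.yml -e @nvidia-vars.yml
ansible-playbook -i inventory.yml nvidia-container-toolkit.yml
ansible-playbook -i inventory.yml traefik.yml -e @traefik-vars.yml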

1. Docker deployment

I have covered the deployment of Docker already in this post. Obviously, you don’t have to use Semaphore to do this (especially as Semaphore requires Docker in the first place). However, you can use the Ansible playbook with some amendments.

No parameters are required to be set for the Docker deployment. Everything is set in the playbook.
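Once the playbook has run, a quick way to confirm Docker is healthy before moving on (a sketch):

docker --version
sudo docker run --rm hello-world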

2. Install NVIDIA Guest Drivers

This playbook configures an NFS client on the Ubuntu server and then copies both the vGPU license file and the driver from my NFS storage before installing them.

A number of parameters need to be defined or amended to run this successfully in your environment.

The following parameters need to be set in SemaphoreUI as a variable group.

Jump to the full Semaphore Instructions at the bottom

| Variable | Default | Description |
|---|---|---|
| nvidia_vgpu_driver_file | NVIDIA-Linux-x86_64-535.247.01-grid.run | Driver installer filename |
| nvidia_vgpu_licence_file | client_configuration_token_04-08-2025-16-54-19.tok | License token filename |
| nvidia_vgpu_driver_path | /tmp/<driver_file> | Local path for installer |
| nvidia_nfs_server | nas.jameskilby.cloud | NFS server hostname |
| nvidia_nfs_export_path | /mnt/pool1/ISO/nvidia | NFS export path |
| nvidia_nfs_mount_point | /mnt/iso/nvidia | Local mount point |
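For anyone wanting to understand (or troubleshoot) what this playbook is doing, the manual equivalent looks roughly like the following. This is a sketch assuming the default values from the table above, with the licence token placed in the standard vGPU client token directory:

# mount the NFS export holding the driver and licence token
sudo mkdir -p /mnt/iso/nvidia
sudo mount -t nfs nas.jameskilby.cloud:/mnt/pool1/ISO/nvidia /mnt/iso/nvidia

# copy and run the grid guest driver installer
cp /mnt/iso/nvidia/NVIDIA-Linux-x86_64-535.247.01-grid.run /tmp/
sudo sh /tmp/NVIDIA-Linux-x86_64-535.247.01-grid.run --silent

# install the licence token and restart the licensing service
sudo cp /mnt/iso/nvidia/client_configuration_token_04-08-2025-16-54-19.tok /etc/nvidia/ClientConfigToken/
sudo systemctl restart nvidia-gridd

# confirm the guest can see the vGPU and has picked up a licence
nvidia-smi -q | grep -i -A1 "License Status"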

3. Install NVIDIA Container Toolkit

This playbook installs the NVIDIA container toolkit and validates that everything is working by running a Docker container and executing the nvidia-smi command from within a container.

No parameters are required to be set.
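Under the hood this boils down to registering the NVIDIA runtime with Docker and then running a test workload. A sketch of the equivalent manual steps is below; the Ubuntu image is just the stock one, as the toolkit injects nvidia-smi into the container at run time:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# validation: nvidia-smi from inside a container should show the vGPU
sudo docker run --rm --gpus all ubuntu:22.04 nvidia-smi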

4. Install Traefik

Traefik is a reverse proxy/ingress controller. I am using it effectively as a load balancer in front of my web-based homelab services. I also have it integrated with Let’s Encrypt and Cloudflare so that it will automatically obtain a trusted certificate for my internal services. This has the added benefit that I don’t need to remember the relevant port the containers are running on.
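To illustrate how services end up behind Traefik, here is a hypothetical example of exposing a container with Docker labels. The router name, hostname, entrypoint, and certificate resolver names here are assumptions and need to match how the playbook actually configures Traefik:

docker run -d --name whoami --network traefik \
  --label 'traefik.enable=true' \
  --label 'traefik.http.routers.whoami.rule=Host(`whoami.jameskilby.cloud`)' \
  --label 'traefik.http.routers.whoami.entrypoints=websecure' \
  --label 'traefik.http.routers.whoami.tls.certresolver=letsencrypt' \
  traefik/whoami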

This playbook needs a lot of variables, as can be seen below. In most cases the defaults are OK.

When they are all input to Semaphore it will look something like this.

(Screenshot: Semaphore variable group with all Traefik variables set)

Rather than typing all of the values out, you can copy the JSON below and then just tweak what you need to.

{
  "nvidia_vgpu_driver_file": "NVIDIA-Linux-x86_64-535.247.01-grid.run",
  "nvidia_vgpu_licence_file": "client_configuration_token_04-08-2025-16-54-19.tok",
  "nvidia_vgpu_driver_path": "/tmp/<driver_file>",
  "nvidia_nfs_server": "nas.jameskilby.cloud",
  "nvidia_nfs_export_path": "/mnt/pool1/ISO/nvidia",
  "nvidia_nfs_mount_point": "/mnt/iso/nvidia",
  "traefik_deploy_test_service": "true",
  "traefik_healthcheck_poll_retries": "12",
  "traefik_healthcheck_poll_delay": "5",
  "traefik_docker_image": "traefik:v3.6.4",
  "traefik_name": "traefik",
  "traefik_config_path": "/etc/traefik",
  "traefik_acme_path": "/etc/traefik/acme",
  "traefik_docker_network": "traefik",
  "traefik_http_port": "80",
  "traefik_https_port": "443",
  "acme_dns_delay": "10",
  "acme_dns_resolvers": "['1.1.1.1:53', '8.8.8.8:53']",
  "traefik_log_level": "info",
  "traefik_log_format": "json",
  "traefik_log_max_size": "10m",
  "traefik_log_max_files": "3",
  "traefik_healthcheck_interval": "30s",
  "traefik_healthcheck_timeout": "10s",
  "traefik_healthcheck_retries": "3",
  "traefik_healthcheck_start_period": "30s",
  "traefik_test_container": "nginx-test",
  "traefik_test_image": "nginx:alpine",
  "traefik_test_domain": "test.jameskilby.cloud",
  "traefik_dashboard_user": "admin"
}

Variable Definitions

| Variable | Default | Description |
|---|---|---|
| traefik_deploy_test_service | true | Set to false to skip test nginx deployment |
| traefik_healthcheck_poll_retries | 12 | Number of health check poll attempts |
| traefik_healthcheck_poll_delay | 5 | Seconds between health check polls |
| traefik_docker_image | traefik:v3 | Traefik Docker image |
| traefik_name | traefik | Container name |
| traefik_config_path | /etc/traefik | Config directory |
| traefik_acme_path | /etc/traefik/acme | ACME/cert directory |
| traefik_docker_network | traefik | Docker network name |
| traefik_http_port | 80 | HTTP port |
| traefik_https_port | 443 | HTTPS port |
| acme_dns_delay | 10 | Seconds before DNS check |
| acme_dns_resolvers | ['1.1.1.1:53', '8.8.8.8:53'] | DNS resolvers |
| traefik_log_level | INFO | Log level |
| traefik_log_format | json | Log format |
| traefik_log_max_size | 10m | Max log size |
| traefik_log_max_files | 3 | Max log files |
| traefik_healthcheck_interval | 30s | Health check interval |
| traefik_healthcheck_timeout | 10s | Health check timeout |
| traefik_healthcheck_retries | 3 | Health check retries |
| traefik_healthcheck_start_period | 30s | Health check grace period |
| traefik_test_container | nginx-test | Test container name |
| traefik_test_image | nginx:alpine | Test container image |
| traefik_test_domain | test.<wildcard_domain> | Test service domain |
| traefik_dashboard_user | admin | Dashboard username |

It also needs two additional variables that should be treated as secrets, as they are sensitive: the Traefik admin dashboard password hash and the Cloudflare API token.

(Screenshot: Semaphore secret variables)

To generate the hash for the password, the easiest way is with Docker:

docker run --rm httpd:2 htpasswd -nbB admin 'your_password_here'
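Similarly, before storing the Cloudflare API token as a secret, it's worth confirming the token is valid and scoped correctly. Cloudflare provides a token verification endpoint for this; CF_API_TOKEN below is a placeholder for your token:

curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify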

Semaphore Implementation Instructions

Assuming you are going to use SemaphoreUI to execute the playbooks, these are the steps you will need to take. If you haven't already set it up, review my guide here.

The first step is to define a new repository that the playbooks will be executed from. As my Git repo is public, no authentication is required. You also need to specify the branch; in this case I am using main.

Repository

(Screenshot: Semaphore repository configuration)

Key Store

You also need to define the authentication from Semaphore to the target workload. This is done in the Key Store section.

I do this in two parts. The first is authentication, which I choose to do with SSH keys.

I then also store the become password securely in Semaphore for use with sudo.

Inventory

(Screenshot: Semaphore inventory configuration)

Task Templates

Variable Group

We will need to create a variable group and set the variables when we come to install the NVIDIA drivers.

Review the variable table above and set the values to match your environment. This is what mine looks like:

(Screenshot: variable group values)

Then ensure that the task template is set to use it.

(Screenshot: task template using the variable group)
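With everything configured, running the templates in order should leave you with a working stack. A quick end-to-end check from another machine (a sketch, using the test hostname from my environment) confirms Traefik is serving the optional test service over HTTPS with a trusted certificate:

curl -I https://test.jameskilby.cloud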

Similar Posts