Private LLMs on KubeEdge and WasmEdge [How-To 2026]
Bottom Line
Use KubeEdge for control-plane reach and WasmEdge for portable inference binaries. The practical pattern is to pin a small GGUF model first, prove it on both x86_64 and arm64 nodes, then scale out with node labels and local model caches.
Key Takeaways
- As of May 14, 2026, KubeEdge v1.23.0 is the latest release and WasmEdge 0.15.0 is the latest stable release.
- KubeEdge edge nodes need CloudCore ports 10000 and 10002 reachable, plus a matching --advertise-address.
- WasmEdge installs wasi_nn-ggml with one flag, and the same .wasm app runs on both x86_64 and arm64.
- For containerd-backed edge nodes, keadm join uses --remote-runtime-endpoint=unix:///run/containerd/containerd.sock.
- Start with a 1B GGUF model for cross-arch smoke tests, then promote larger models only after logs and memory usage are clean.
Private LLMs at the edge stop looking exotic once you separate control plane from inference runtime. KubeEdge gives you Kubernetes-native reach to remote nodes, while WasmEdge gives you a portable WebAssembly runtime that can execute the same LLM app on x86_64 and arm64. As of May 14, 2026, the latest verified releases are KubeEdge v1.23.0 and WasmEdge 0.15.0, which is a solid baseline for a heterogeneous cluster rollout.
Prerequisites
Before you start
- One Kubernetes control-plane node with a working kubectl context.
- At least two edge nodes: one x86_64 and one arm64, both running containerd.
- sudo access on cloud and edge hosts.
- Outbound access from edge nodes to GitHub and your model source.
- Enough RAM for your first model. Start with a 1B GGUF model; move to 3B, 7B, or larger only after you measure memory headroom.
- Local disk on each edge node for a model cache such as /var/lib/llm (see the quick check after this list).
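Before pulling any model, a quick check on each edge node confirms the architecture, memory headroom, and cache disk space you have to work with. A minimal sketch using standard Linux tools; the /var/lib/llm path is the cache location suggested above:
# Architecture reported by the kernel: x86_64 or aarch64 (arm64)
uname -m
# CPU count and memory headroom before loading any GGUF file
nproc
free -h
# Space available for the local model cache
sudo mkdir -p /var/lib/llm
df -h /var/lib/llm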
Bottom Line
Use KubeEdge to join remote nodes and schedule by architecture, then use WasmEdge plus wasi_nn-ggml to run the same WebAssembly LLM app across them. Prove the smallest useful model first, then scale up model size and node count.
Build the KubeEdge Control Plane
Step 1: Install keadm and bootstrap CloudCore
- Set your cloud-side values.
export KUBEEDGE_VERSION=v1.23.0
export CLOUD_IP=10.0.0.10
- Install the keadm binary on the cloud node.
wget https://github.com/kubeedge/kubeedge/releases/download/${KUBEEDGE_VERSION}/keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz
tar -zxvf keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz
sudo cp keadm-${KUBEEDGE_VERSION}-linux-amd64/keadm/keadm /usr/local/bin/keadm
- Initialize KubeEdge. The official docs require edge access to CloudCore on 10000 and 10002, and the advertise address must be the IP edge nodes can actually reach.
sudo keadm init \
--advertise-address="${CLOUD_IP}" \
--kubeedge-version="${KUBEEDGE_VERSION}" \
--kube-config="$HOME/.kube/config"
- Confirm CloudCore is up.
kubectl get all -n kubeedge
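Before moving on to Step 2, it can save a failed join to confirm from each edge host that both CloudCore ports are actually reachable. A minimal check, assuming netcat (nc) is installed on the edge nodes; replace 10.0.0.10 with your own CLOUD_IP:
# Run on each edge node; both checks should succeed before you attempt keadm join
nc -zv 10.0.0.10 10000
nc -zv 10.0.0.10 10002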
Step 2: Generate a join token and attach each edge node
- Get the token from the cloud side.
TOKEN=$(sudo keadm gettoken)
echo "$TOKEN"
- Install keadm on each edge node using the matching architecture package.
# amd64 edge
wget https://github.com/kubeedge/kubeedge/releases/download/${KUBEEDGE_VERSION}/keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz
# arm64 edge
wget https://github.com/kubeedge/kubeedge/releases/download/${KUBEEDGE_VERSION}/keadm-${KUBEEDGE_VERSION}-linux-arm64.tar.gz
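The edge tarballs unpack the same way as the cloud-side package, so the install mirrors Step 1. Shown here for the amd64 archive; substitute arm64 on the arm node:
tar -zxvf keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz
sudo cp keadm-${KUBEEDGE_VERSION}-linux-amd64/keadm/keadm /usr/local/bin/keadm
keadm version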
- Join each edge node to CloudCore. Per the current KubeEdge runtime docs, containerd uses unix:///run/containerd/containerd.sock.
sudo keadm join \
--cloudcore-ipport="${CLOUD_IP}:10000" \
--token="${TOKEN}" \
--remote-runtime-endpoint=unix:///run/containerd/containerd.sock \
--cgroupdriver=systemd
- Verify the nodes from the cloud side.
kubectl get nodes -o wide
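If a node lingers in NotReady, the fastest signal is usually EdgeCore itself. On current releases, keadm-installed EdgeCore runs as a systemd unit named edgecore, so a quick look from the edge host is:
sudo systemctl status edgecore --no-pager
sudo journalctl -u edgecore -n 50 --no-pager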
If you need to paste logs or tokens into tickets or chat, scrub them first with TechBytes' Data Masking Tool. KubeEdge join output often includes sensitive connection data you do not want in screenshots.
Install WasmEdge on Edge Nodes
Step 3: Install WasmEdge and the GGUF inference plug-in
WasmEdge's official installer can pin a specific version and install the wasi_nn-ggml plug-in in one pass. That plug-in is the piece that lets your Wasm app call into GGUF-backed LLM inference.
export WASMEDGE_VERSION=0.15.0
sudo apt-get update
sudo apt-get install -y curl git ca-certificates libopenblas-dev
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | \
sudo bash -s -- -p /usr/local -v ${WASMEDGE_VERSION} --plugins wasi_nn-ggml
/usr/local/bin/wasmedge --version
- Use the same install command on both x86_64 and arm64 nodes.
- The WASI-NN backends are exclusive, so install only the backend you actually need.
- CPU-only nodes usually need libopenblas-dev; the LlamaEdge project explicitly calls this out for CPU installs.
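With the install finished, it is worth confirming the plug-in actually landed where your workloads will look for it. The GGML-backed WASI-NN plug-in typically shows up as a single shared object (usually libwasmedgePluginWasiNN.so) under the install prefix; the paths below assume the -p /usr/local prefix used above:
# The wasi_nn-ggml plug-in should appear in the plug-in directory
ls -l /usr/local/lib/wasmedge/
# Confirm OpenBLAS is present on CPU-only nodes
ldconfig -p | grep -i openblas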
Optional: enable runwasi for native Wasm RuntimeClass scheduling
If you want containerd to launch Wasm workloads directly, WasmEdge's official docs use the runwasi shim:
git clone https://github.com/containerd/runwasi.git
cd runwasi
./scripts/setup-linux.sh
make build-wasmedge
INSTALL="sudo install" LN="sudo ln -sf" make install-wasmedge
sudo tee -a /etc/containerd/config.toml > /dev/null <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.wasmedge]
  runtime_type = "io.containerd.wasmedge.v1"
EOF
sudo systemctl restart containerd
You do not need this shim for the smoke test below, but it is the clean next step if you want pure Wasm RuntimeClass objects later.
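If you do wire up runwasi later, the Kubernetes side is a single RuntimeClass whose handler matches the runtime name registered in the containerd config above. A minimal sketch, assuming the runtime was registered as wasmedge:
cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge
handler: wasmedge
EOF
Pods that set runtimeClassName: wasmedge will then be launched through the shim instead of a regular container image.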
Deploy a Private LLM Workload
Step 4: Label nodes by architecture
kubectl label node edge-amd64 llm=true arch=amd64 --overwrite
kubectl label node edge-arm64 llm=true arch=arm64 --overwrite
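A quick sanity check that both labels landed on the intended nodes, using kubectl's label-column flag:
kubectl get nodes -l llm=true -L arch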
Step 5: Run a cross-arch smoke test with LlamaEdge and a private GGUF model
The LlamaEdge project publishes a portable llama-chat.wasm app, and its docs show a verified WasmEdge invocation using --nn-preload and -p llama-3-chat. The job below downloads the Wasm app and a small GGUF model in an init container, then calls the host-installed WasmEdge binary through a hostPath mount.
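Before wrapping the invocation in a Job, it can be worth running it once directly on an edge host. A minimal sketch that stages the assets in the /var/lib/llm cache from the prerequisites and mirrors the Job's invocation:
sudo mkdir -p /var/lib/llm && cd /var/lib/llm
sudo curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
sudo curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf \
  llama-chat.wasm -p llama-3-chat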
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-smoke-amd64
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        llm: "true"
        arch: amd64
      volumes:
        - name: workspace
          emptyDir: {}
        - name: wasmedge-bin
          hostPath:
            path: /usr/local/bin/wasmedge
            type: File
        - name: wasmedge-lib
          hostPath:
            path: /usr/local/lib/wasmedge
            type: Directory
      initContainers:
        - name: fetch-assets
          image: ubuntu:24.04
          command: ["bash", "-lc"]
          args:
            - |
              apt-get update && apt-get install -y curl ca-certificates
              cd /workspace
              curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
              curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      containers:
        - name: infer
          image: ubuntu:24.04
          env:
            - name: WASMEDGE_PLUGIN_PATH
              value: /host/usr/local/lib/wasmedge
          command: ["bash", "-lc"]
          args:
            - |
              cat <<'PROMPT' | /host/usr/local/bin/wasmedge \
                --dir /workspace:/workspace \
                --nn-preload default:GGML:AUTO:/workspace/Llama-3.2-1B-Instruct-Q5_K_M.gguf \
                /workspace/llama-chat.wasm \
                -p llama-3-chat
              Give a one-sentence answer: why run LLMs on edge nodes?
              PROMPT
          volumeMounts:
            - name: workspace
              mountPath: /workspace
            - name: wasmedge-bin
              mountPath: /host/usr/local/bin/wasmedge
              readOnly: true
            - name: wasmedge-lib
              mountPath: /host/usr/local/lib/wasmedge
              readOnly: true
EOF
Duplicate the job for arm64 by changing metadata.name and nodeSelector.arch. The important part is that the same llama-chat.wasm binary remains unchanged across both nodes.
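If you keep the manifest in a file instead of a heredoc (say llm-smoke.yaml, a hypothetical filename), the arm64 variant is a two-field substitution:
sed -e 's/llm-smoke-amd64/llm-smoke-arm64/' \
    -e 's/arch: amd64/arch: arm64/' \
    llm-smoke.yaml | kubectl apply -f -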
Verify the Deployment
Expected cluster checks
kubectl get nodes
kubectl get jobs
kubectl logs job/llm-smoke-amd64
- Node status: your KubeEdge nodes should show Ready.
- Job status: the smoke-test job should move to Complete.
- Logs: you should see the prompt banner and an assistant response, which proves the GGUF file loaded through WASI-NN.
What good output looks like
NAME STATUS ROLES AGE VERSION
edge-amd64 Ready agent,edge ... ...
edge-arm64 Ready agent,edge ... ...
NAME STATUS COMPLETIONS DURATION AGE
llm-smoke-amd64 Complete 1/1 ... ...
Do not overfit to exact text generation. A different, but coherent, answer is still success. What matters is that the job schedules to the intended edge node, loads the private model locally, and returns output without a WASI-NN or plug-in failure.
Troubleshooting and What's Next
Top 3 issues to fix first
- Edge node will not join or stays NotReady: verify CloudCore reachability on 10000 and 10002, and make sure --advertise-address and --cloudcore-ipport point to the same reachable cloud IP.
- Inference fails with plug-in or backend errors: confirm WasmEdge was installed with --plugins wasi_nn-ggml, keep WASMEDGE_PLUGIN_PATH=/usr/local/lib/wasmedge, and install libopenblas-dev on CPU-only Ubuntu nodes.
- Container runtime mismatch: on containerd-backed edge nodes, KubeEdge's current runtime docs require --remote-runtime-endpoint=unix:///run/containerd/containerd.sock. If your host uses systemd cgroups, keep --cgroupdriver=systemd aligned (see the driver check after this list).
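For the cgroup driver question, the two configs to compare live on the edge host. A quick check, assuming containerd's default config and keadm's standard EdgeCore config location at /etc/kubeedge/config/edgecore.yaml:
# What containerd is actually using
containerd config dump | grep -i systemdcgroup
# What EdgeCore was told at join time
grep -i cgroupdriver /etc/kubeedge/config/edgecore.yaml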
What's next
- Promote the smoke test into a long-running API deployment by packaging the LlamaEdge server variant you standardize on.
- Pre-stage GGUF files on each edge node and switch from ad hoc downloads to a managed local cache or OCI artifact flow.
- Add scheduling rules for arch, RAM tier, and GPU presence so bigger models never land on the wrong node (see the labeling sketch after this list).
- Run all YAML through a formatter before shipping it to your repo; TechBytes' Code Formatter is a quick way to normalize examples and keep reviews boring.
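A starting point for the tiering idea above, using hypothetical ram-tier and gpu labels (pick names that match your own inventory) and extending the same nodeSelector pattern used in the smoke-test Job:
# Hypothetical tier labels for illustration only
kubectl label node edge-amd64 ram-tier=32 gpu=none --overwrite
kubectl label node edge-arm64 ram-tier=16 gpu=none --overwrite
# Then widen the Job or Deployment nodeSelector, for example:
#   nodeSelector:
#     llm: "true"
#     arch: amd64
#     ram-tier: "32"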
Frequently Asked Questions
Can KubeEdge run the same private LLM across amd64 and arm64 edge nodes?
Yes. The same llama-chat.wasm binary runs unchanged on both architectures; schedule it with node labels such as arch=amd64 or arch=arm64. You still need to validate model size, RAM, and any accelerator differences on each node class.
What KubeEdge flags matter most when joining containerd-backed edge nodes?
The join token and --cloudcore-ipport, and for current containerd-backed installs you should also set --remote-runtime-endpoint=unix:///run/containerd/containerd.sock. If your host uses systemd cgroups, keep --cgroupdriver=systemd aligned so EdgeCore and the runtime do not drift.
Do I need runwasi to deploy WasmEdge-based LLM workloads?
No. The smoke test in this guide calls the host-installed WasmEdge binary through a hostPath mount; runwasi only becomes necessary when you want containerd to launch pure Wasm workloads through a RuntimeClass.
Why start with a 1B GGUF model instead of a 7B model?
A 1B model keeps the cross-arch smoke test fast and fits comfortably in RAM on constrained edge nodes. Promote to 3B, 7B, or larger only after logs and memory usage are clean on both architectures.