Private LLMs on KubeEdge and WasmEdge [How-To 2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · May 14, 2026 · 10 min read

Bottom Line

Use KubeEdge for control-plane reach and WasmEdge for portable inference binaries. The practical pattern is to pin a small GGUF model first, prove it on both x86_64 and arm64 nodes, then scale out with node labels and local model caches.

Key Takeaways

  • As of May 14, 2026, the latest stable releases are KubeEdge v1.23.0 and WasmEdge 0.15.0.
  • KubeEdge edge nodes need CloudCore ports 10000 and 10002 reachable, plus a matching --advertise-address.
  • WasmEdge installs the wasi_nn-ggml plug-in with one flag, and the same .wasm app runs on both x86_64 and arm64.
  • For containerd-backed edge nodes, keadm join uses --remote-runtime-endpoint=unix:///run/containerd/containerd.sock.
  • Start with a 1B GGUF model for cross-arch smoke tests, then promote larger models only after logs and memory usage are clean.

Private LLMs at the edge stop looking exotic once you separate control plane from inference runtime. KubeEdge gives you Kubernetes-native reach to remote nodes, while WasmEdge gives you a portable WebAssembly runtime that can execute the same LLM app on x86_64 and arm64. As of May 14, 2026, the latest verified releases are KubeEdge v1.23.0 and WasmEdge 0.15.0, which is a solid baseline for a heterogeneous cluster rollout.

Prerequisites

Before you start

  • One Kubernetes control-plane node with a working kubectl context.
  • At least two edge nodes: one x86_64 and one arm64, both running containerd.
  • sudo access on cloud and edge hosts.
  • Outbound access from edge nodes to GitHub and your model source.
  • Enough RAM for your first model. Start with a 1B GGUF model; move to 3B, 7B, or larger only after you measure memory headroom.
  • Local disk on each edge node for a model cache such as /var/lib/llm.
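
A quick preflight pass on each edge host catches most surprises before you install anything. A minimal sketch, assuming Ubuntu-style hosts with containerd already installed (the /var/lib/llm path is just the cache location suggested above):

# Run on each edge host.
echo "Architecture: $(uname -m)"            # expect x86_64 or aarch64
free -h | awk '/^Mem:/ {print "Total RAM:", $2}'
systemctl is-active containerd              # expect "active"
sudo mkdir -p /var/lib/llm                  # model cache directory
df -h /var/lib/llm | tail -n 1              # confirm disk headroom for GGUF files
curl -sSfI https://github.com > /dev/null && echo "GitHub reachable"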

Bottom Line

Use KubeEdge to join remote nodes and schedule by architecture, then use WasmEdge plus wasi_nn-ggml to run the same WebAssembly LLM app across them. Prove the smallest useful model first, then scale up model size and node count.

Build the KubeEdge Control Plane

Step 1: Install keadm and bootstrap CloudCore

  1. Set your cloud-side values.
export KUBEEDGE_VERSION=v1.23.0
export CLOUD_IP=10.0.0.10
  2. Install the keadm binary on the cloud node.
wget https://github.com/kubeedge/kubeedge/releases/download/${KUBEEDGE_VERSION}/keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz
tar -zxvf keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz
sudo cp keadm-${KUBEEDGE_VERSION}-linux-amd64/keadm/keadm /usr/local/bin/keadm
  3. Initialize KubeEdge. The official docs require edge access to CloudCore on 10000 and 10002, and the advertise address must be the IP edge nodes can actually reach.
sudo keadm init \
  --advertise-address="${CLOUD_IP}" \
  --kubeedge-version="${KUBEEDGE_VERSION}" \
  --kube-config="$HOME/.kube/config"
  4. Confirm CloudCore is up.
kubectl get all -n kubeedge
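
If you want a scriptable readiness gate instead of eyeballing the output, kubectl can wait on the rollout directly. A small sketch, assuming keadm installed CloudCore as the default cloudcore Deployment in the kubeedge namespace:

# Block until CloudCore is rolled out, then list its pods.
kubectl -n kubeedge rollout status deployment/cloudcore --timeout=180s
kubectl -n kubeedge get pods -o wide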

Step 2: Generate a join token and attach each edge node

  1. Get the token from the cloud side.
TOKEN=$(sudo keadm gettoken)
echo "$TOKEN"
  2. Install keadm on each edge node using the matching architecture package.
# amd64 edge
wget https://github.com/kubeedge/kubeedge/releases/download/${KUBEEDGE_VERSION}/keadm-${KUBEEDGE_VERSION}-linux-amd64.tar.gz

# arm64 edge
wget https://github.com/kubeedge/kubeedge/releases/download/${KUBEEDGE_VERSION}/keadm-${KUBEEDGE_VERSION}-linux-arm64.tar.gz
  3. Join each edge node to CloudCore. Per the current KubeEdge runtime docs, containerd uses unix:///run/containerd/containerd.sock.
sudo keadm join \
  --cloudcore-ipport="${CLOUD_IP}:10000" \
  --token="${TOKEN}" \
  --remote-runtime-endpoint=unix:///run/containerd/containerd.sock \
  --cgroupdriver=systemd
  4. Verify the nodes from the cloud side.
kubectl get nodes -o wide
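
If a node never shows up, test CloudCore reachability from the edge host before digging through logs. A quick sketch, run on the edge node, assuming netcat is installed and CLOUD_IP matches the cloud side's advertise address:

# Confirm both CloudCore ports from the edge side.
CLOUD_IP=10.0.0.10   # hypothetical; use your --advertise-address value
for port in 10000 10002; do
  nc -zv -w 5 "${CLOUD_IP}" "${port}" || echo "port ${port} unreachable"
done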

If you need to paste logs or tokens into tickets or chat, scrub them first with TechBytes' Data Masking Tool. KubeEdge join output often includes sensitive connection data you do not want in screenshots.

Install WasmEdge on Edge Nodes

Step 3: Install WasmEdge and the GGUF inference plug-in

WasmEdge's official installer can pin a specific version and install the wasi_nn-ggml plug-in in one pass. That plug-in is the piece that lets your Wasm app call into GGUF-backed LLM inference.

export WASMEDGE_VERSION=0.15.0
sudo apt-get update
sudo apt-get install -y curl git ca-certificates libopenblas-dev
curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | \
  sudo bash -s -- -p /usr/local -v ${WASMEDGE_VERSION} --plugins wasi_nn-ggml
/usr/local/bin/wasmedge --version
  • Use the same install command on both x86_64 and arm64 nodes.
  • The WASI-NN backends are exclusive, so install only the backend you actually need.
  • CPU-only nodes usually need libopenblas-dev; the LlamaEdge project explicitly calls this out for CPU installs.
Pro tip: Keep the Wasm app portable and the model local. The same .wasm artifact can move across architectures, but large GGUF files should stay cached on the node that serves them.
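
To double-check that the plug-in landed where WasmEdge expects it, list the plug-in directory. A minimal check, assuming the default /usr/local prefix used in the install command above:

# Plug-ins live under <prefix>/lib/wasmedge.
ls -l /usr/local/lib/wasmedge/
# Expect a WASI-NN plug-in library (libwasmedgePluginWasiNN.so); if it is
# missing, rerun the installer with --plugins wasi_nn-ggml.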

Optional: enable runwasi for native Wasm RuntimeClass scheduling

If you want containerd to launch Wasm workloads directly, WasmEdge's official docs use the runwasi shim:

git clone https://github.com/containerd/runwasi.git
cd runwasi
./scripts/setup-linux.sh
make build-wasmedge
INSTALL="sudo install" LN="sudo ln -sf" make install-wasmedge

# TOML requires the table header and its key on separate lines.
printf '%s\n' \
  '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.wasmedge]' \
  'runtime_type = "io.containerd.wasmedge.v1"' | \
  sudo tee -a /etc/containerd/config.toml > /dev/null
sudo systemctl restart containerd

You do not need this shim for the smoke test below, but it is the clean next step if you want pure Wasm RuntimeClass objects later.
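
For reference, the RuntimeClass itself is tiny. A sketch, assuming the wasmedge runtime name registered in the containerd config above:

cat <<'EOF' | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge
# handler must match the containerd runtime name registered above.
handler: wasmedge
EOF

Pods then opt in with spec.runtimeClassName: wasmedge.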

Deploy a Private LLM Workload

Step 4: Label nodes by architecture

kubectl label node edge-amd64 llm=true arch=amd64 --overwrite
kubectl label node edge-arm64 llm=true arch=arm64 --overwrite
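
Verify the labels before applying workloads so a typo does not leave a nodeSelector unsatisfiable:

# -l filters to labeled nodes; -L prints the arch label as a column.
kubectl get nodes -l llm=true -L arch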

Step 5: Run a cross-arch smoke test with LlamaEdge and a private GGUF model

The LlamaEdge project publishes a portable llama-chat.wasm app, and its docs show a verified WasmEdge invocation using --nn-preload and -p llama-3-chat. The job below downloads the Wasm app and a small GGUF model in an init container, then calls the host-installed WasmEdge binary through a hostPath mount.

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-smoke-amd64
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        llm: "true"
        arch: amd64
      volumes:
      - name: workspace
        emptyDir: {}
      - name: wasmedge-bin
        hostPath:
          path: /usr/local/bin/wasmedge
          type: File
      - name: wasmedge-lib
        hostPath:
          # Mount the whole lib dir so libwasmedge and its plug-ins resolve.
          path: /usr/local/lib
          type: Directory
      initContainers:
      - name: fetch-assets
        image: ubuntu:24.04
        command: ["bash", "-lc"]
        args:
        - |
          apt-get update && apt-get install -y curl ca-certificates
          cd /workspace
          curl -LO https://github.com/LlamaEdge/LlamaEdge/releases/latest/download/llama-chat.wasm
          curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      containers:
      - name: infer
        image: ubuntu:24.04
        env:
        - name: WASMEDGE_PLUGIN_PATH
          value: /host/usr/local/lib/wasmedge
        # Lets the mounted binary find libwasmedge from the host install,
        # assuming a dynamically linked wasmedge build.
        - name: LD_LIBRARY_PATH
          value: /host/usr/local/lib
        command: ["bash", "-lc"]
        args:
        - |
          cat <<'PROMPT' | /host/usr/local/bin/wasmedge \
            --dir /workspace:/workspace \
            --nn-preload default:GGML:AUTO:/workspace/Llama-3.2-1B-Instruct-Q5_K_M.gguf \
            /workspace/llama-chat.wasm \
            -p llama-3-chat
          Give a one-sentence answer: why run LLMs on edge nodes?
          PROMPT
        volumeMounts:
        - name: workspace
          mountPath: /workspace
        - name: wasmedge-bin
          mountPath: /host/usr/local/bin/wasmedge
          readOnly: true
        - name: wasmedge-lib
          mountPath: /host/usr/local/lib
          readOnly: true
EOF

Duplicate the job for arm64 by changing metadata.name and nodeSelector.arch. The important part is that the same llama-chat.wasm binary remains unchanged across both nodes.
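
Rather than hand-editing a second manifest, you can substitute the two fields that change. A sketch, assuming you saved the Job above to a hypothetical llm-smoke.yaml:

# Generate and apply the arm64 variant from the amd64 manifest.
sed -e 's/llm-smoke-amd64/llm-smoke-arm64/' \
    -e 's/arch: amd64/arch: arm64/' \
    llm-smoke.yaml | kubectl apply -f -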

Watch out: This job is a validation pattern, not a production API tier. Once it is stable, replace the one-shot prompt with the repo's llama-api-server project or another long-running service wrapper, and move model downloads into a durable cache.

Verify the Deployment

Expected cluster checks

kubectl get nodes
kubectl get jobs
kubectl logs job/llm-smoke-amd64
  • Node status: your KubeEdge nodes should show Ready.
  • Job status: the smoke-test job should move to Completed.
  • Logs: you should see the prompt banner and an assistant response, which proves the GGUF file loaded through WASI-NN.
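
For scripted verification, kubectl wait can gate a pipeline on the job outcome:

# Fail fast: wait up to 10 minutes, then pull the tail of the logs.
kubectl wait --for=condition=complete job/llm-smoke-amd64 --timeout=600s
kubectl logs job/llm-smoke-amd64 | tail -n 20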

What good output looks like

NAME          STATUS    ROLES       AGE   VERSION
edge-amd64    Ready     agent,edge  ...   ...
edge-arm64    Ready     agent,edge  ...   ...

NAME              STATUS     COMPLETIONS   DURATION   AGE
llm-smoke-amd64   Complete   1/1           ...        ...

Do not overfit to the exact generated text. A different but coherent answer is still a success. What matters is that the job schedules to the intended edge node, loads the private model locally, and returns output without a WASI-NN or plug-in failure.

Troubleshooting and What's Next

Top 3 issues to fix first

  • Edge node will not join or stays NotReady: verify CloudCore reachability on 10000 and 10002, and make sure --advertise-address and --cloudcore-ipport point to the same reachable cloud IP.
  • Inference fails with plug-in or backend errors: confirm WasmEdge was installed with --plugins wasi_nn-ggml, keep WASMEDGE_PLUGIN_PATH=/usr/local/lib/wasmedge, and install libopenblas-dev on CPU-only Ubuntu nodes.
  • Container runtime mismatch: on containerd-backed edge nodes, KubeEdge's current runtime docs require --remote-runtime-endpoint=unix:///run/containerd/containerd.sock. If your host uses systemd cgroups, keep --cgroupdriver=systemd aligned.
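
When a join or readiness problem is not obvious from the cloud side, the EdgeCore service log on the edge host is usually the fastest signal. A sketch, assuming keadm installed EdgeCore as a systemd unit:

# On the affected edge node:
sudo systemctl status edgecore --no-pager
sudo journalctl -u edgecore --since "10 min ago" --no-pager | tail -n 40
# For plug-in errors, confirm the library is present:
ls /usr/local/lib/wasmedge/ | grep -i wasinn || echo "WASI-NN plug-in missing"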

What's next

  • Promote the smoke test into a long-running API deployment by packaging the LlamaEdge server variant you standardize on.
  • Pre-stage GGUF files on each edge node and switch from ad hoc downloads to a managed local cache or OCI artifact flow.
  • Add scheduling rules for arch, RAM tier, and GPU presence so bigger models never land on the wrong node.
  • Run all YAML through a formatter before shipping it to your repo; TechBytes' Code Formatter is a quick way to normalize examples and keep reviews boring.

Frequently Asked Questions

Can KubeEdge run the same private LLM across amd64 and arm64 edge nodes?
Yes. The useful split is to keep the .wasm inference app portable and let KubeEdge schedule by node labels such as arch=amd64 or arch=arm64. You still need to validate model size, RAM, and any accelerator differences on each node class.
What KubeEdge flags matter most when joining containerd-backed edge nodes?
The critical flag is --cloudcore-ipport, and for current containerd-backed installs you should also set --remote-runtime-endpoint=unix:///run/containerd/containerd.sock. If your host uses systemd cgroups, keep --cgroupdriver=systemd aligned so EdgeCore and the runtime do not drift.
Do I need runwasi to deploy WasmEdge-based LLM workloads?
Not for a first smoke test. You can install WasmEdge on the host and invoke it from a standard Kubernetes job, then add runwasi later if you want a cleaner RuntimeClass-based path for native Wasm scheduling.
Why start with a 1B GGUF model instead of a 7B model?
Because the first milestone is operational proof, not benchmark bragging rights. A smaller GGUF model reduces download time, startup latency, and RAM pressure, which makes it much easier to isolate networking, plug-in, and scheduling problems.
