When OpenAI launched GPT-5, they removed access to all previous models and forced everyone to upgrade. For me, that was a disaster - GPT-5 was much slower in my workflow, and it seriously stalled my productivity.
Around the same time, GPT-OSS dropped, and I figured… why rely on someone else’s servers when I can run my own LLM locally?
So, I decided to self-host LLaMA 3 with a clean, easy-to-use web interface - and it’s been great! Here’s how I did it.
This guide assumes you want to run locally on your own laptop, which is the easiest and fastest way to get started.
You’ll need:
- Docker and Docker Compose installed
- Optionally, an NVIDIA GPU with up-to-date drivers (everything also runs on CPU, just slower)
If you prefer a permanent, always-on setup, you can run this on a remote machine instead (see my previous Linux server setup guide).
If you go this route, I recommend using Tailscale to securely connect to your server from anywhere without exposing it to the public internet.
Tailscale is essentially a secure, peer-to-peer VPN between your devices, so your server will feel like it’s on your home network.
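If you go that way, the remote flow is just a couple of commands. Here's a minimal sketch for a Linux server, using Tailscale's official install script:

```bash
# On the server: install Tailscale and join your tailnet
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# Note the server's Tailscale IP (usually 100.x.y.z)
tailscale ip -4
```

Once the stack below is running, any device on your tailnet can reach the UI at http://&lt;tailscale-ip&gt;:3000 without exposing anything publicly.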
We’ll be running two services:
- Ollama, which downloads and runs the models
- Open WebUI, which gives you a clean chat interface on top of it

The ghcr.io/open-webui/open-webui:ollama image bundles both, so a single container is all we need.
Here’s a minimal docker-compose.yml:
```yaml
services:
  openwebui:
    image: ghcr.io/open-webui/open-webui:ollama  # bundles Ollama and Open WebUI
    container_name: openwebui
    restart: unless-stopped
    ports:
      - "3000:8080"                    # WebUI listens on 8080 inside, exposed as 3000
    volumes:
      - ollama_data:/root/.ollama      # Model storage
      - webui_data:/app/backend/data   # Chat history, user data
    # The two settings below are only needed for GPU acceleration
    # (they require NVIDIA drivers and the NVIDIA Container Toolkit on the host)
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
  webui_data:
```
To bring it up:
docker compose up -d
Once it’s running, open http://localhost:3000 in your browser. You’ll be greeted by the Open WebUI interface.
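If the page doesn't load right away, the container may still be pulling or starting up. You can check from the same directory as your docker-compose.yml:

```bash
# Confirm the container is running and follow its startup logs
docker compose ps
docker compose logs -f openwebui   # Ctrl+C to stop following
```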
Unless your laptop packs a serious GPU (think RTX 3090 or better), you’re probably not going to run a full-scale 70B parameter model. But don’t worry — smaller models can still be surprisingly capable.
To see what you're working with, run:

nvidia-smi
If you see something like RTX 3060, 3070, or 3080, you’re in good shape to run 7B–13B models. If it shows “No devices were found,” you’ll either need to install the correct drivers or skip GPU entirely (though performance will take a big hit).
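It's also worth confirming that the container itself can see the GPU. Assuming the NVIDIA Container Toolkit is set up on the host (which the runtime: nvidia line relies on), the same tool works through docker exec:

```bash
# The GPU listed here should match what you saw on the host
docker exec -it openwebui nvidia-smi
```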
I went with LLaMA 3 8B — it’s fast enough on my RTX 3070, has solid reasoning, and supports system prompts and function calling. If you’re on a weaker GPU (or just want snappier responses), consider Mistral 7B or LLaMA 2 7B. Ollama also supports quantized models (like llama3:8b-q4_K_M) to reduce memory usage with minimal accuracy loss.
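A rough way to decide what will fit: check how much VRAM is actually free before pulling anything. As a ballpark, a 4-bit quantized 7B–8B model wants somewhere around 5–6 GB once you include context.

```bash
# Show GPU name plus total and free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```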
To download a model, open a shell inside the container and pull whichever one you chose.
docker exec -it openwebui bash
This gives you a shell inside the container, where you can run ollama commands directly.
ollama pull llama3:8b
You’ll see download progress — this can take a few minutes depending on your connection and the model size.
Once the download finishes, you can try the model right from the terminal:

ollama run llama3
You can type in prompts directly; type /bye (or press Ctrl+D) to exit.
Exit the container when you’re done:
exit
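You don't actually need to keep a shell open in the container; docker exec works for one-off commands too, which is handy for quick checks and scripting:

```bash
# List the models Ollama has downloaded so far
docker exec openwebui ollama list

# Run a single prompt non-interactively
docker exec openwebui ollama run llama3 "Explain Docker Compose in one sentence."
```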
Back in your browser, open http://localhost:3000. You’ll now see llama3 available under the “Models” tab, ready to use in the GUI.
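If you also want to hit the model programmatically (great for prototyping), Ollama's HTTP API listens on port 11434 inside the container. It isn't exposed in the compose file above, so this sketch assumes you've added "11434:11434" to the ports section and restarted the stack:

```bash
# Ask Ollama for a completion over its HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a haiku about self-hosting.",
  "stream": false
}'
```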
- Local, fast, and free: No API calls, no latency, no token limits
- Full control: Customize models, UI, and even write plugins (see the sketch below)
- Offline: No internet? No problem.
- Fun: There’s something incredibly satisfying about running your own AI
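On the "customize models" point: Ollama lets you bake a system prompt and default parameters into a variant of your own via a Modelfile. A minimal sketch, run inside the container (the notes-assistant name is just an example):

```bash
# Inside the container shell (docker exec -it openwebui bash)
cat > /tmp/Modelfile <<'EOF'
FROM llama3
SYSTEM """You answer tersely, in bullet points, and state your assumptions."""
PARAMETER temperature 0.3
EOF

# Register the variant with Ollama; it will then appear in Open WebUI's model list
ollama create notes-assistant -f /tmp/Modelfile
```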
Is it better than GPT-4 or GPT-5? Not always - but for offline coding help, writing, note-taking, and prototyping, it’s more than enough. But don’t take my word for it; here’s what LLaMA 3 has to say:

Running your own LLM might sound intimidating — but thanks to Ollama and Open WebUI, it’s actually easier than ever.
Whether you’re fed up with GPT-5 like I was, or just curious about running models locally - it’s a great way to take back control and see what these tools can do on your own terms.