A Coding Implementation to Build a Complete Self-Hosted LLM Workflow with Ollama, REST API, and Gradio Chat Interface

In this tutorial, we implement a fully functional Ollama environment inside Google Colab to replicate a self-hosted LLM workflow. We begin by installing Ollama directly on the Colab VM using the official Linux installer and then launch the Ollama server in the background to expose the HTTP API on localhost:11434. After verifying the service, we pull lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, which balance resource constraints with usability in a CPU-only environment. To interact with these models programmatically, we use the /api/chat endpoint via Python’s requests module with streaming enabled, which allows token-level output to be captured incrementally. Finally, we layer a Gradio-based UI on top of this client so we can issue prompts, maintain multi-turn history, configure parameters like temperature and context size, and view results in real time. Check out the Full Codes here.
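
Before building anything, it helps to see the shape of the call we are working toward. The sketch below is purely illustrative: it assumes the server is already running on localhost:11434 and the model has been pulled, which the following sections set up. It sends a single non-streaming request to /api/chat and prints the reply.

import requests

# Illustrative only: one non-streaming request to Ollama's /api/chat.
# Assumes the server is already up on localhost:11434 and the model is pulled.
resp = requests.post(
   "http://127.0.0.1:11434/api/chat",
   json={
       "model": "qwen2.5:0.5b-instruct",
       "messages": [{"role": "user", "content": "Say hello in one sentence."}],
       "stream": False,  # return one JSON object instead of a token stream
   },
   timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])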

import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path


def sh(cmd, check=True):
   """Run a shell command, stream output."""
   p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for line in p.stdout:
       print(line, end="")
   p.wait()
   if check and p.returncode != 0:
       raise RuntimeError(f"Command failed: {cmd}")


if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
   print("🔧 Installing Ollama ...")
   sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
   print("✅ Ollama already installed.")


try:
   import gradio 
except Exception:
   print("🔧 Installing Gradio ...")
   sh("pip -q install gradio==4.44.0")

We first check if Ollama is already installed on the system, and if not, we install it using the official script. At the same time, we ensure Gradio is available by importing it or installing the required version when missing. This way, we prepare our Colab environment for running the chat interface smoothly. Check out the Full Codes here.
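
As an optional sanity check, we can confirm both tools are callable before moving on; a small sketch:

# Optional sanity check: confirm the Ollama CLI and Gradio are usable.
sh("ollama --version", check=False)   # prints the installed client version
import gradio
print("Gradio version:", gradio.__version__)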

def start_ollama():
   try:
       requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
       print("✅ Ollama server already running.")
       return None
   except Exception:
       pass
   print("🚀 Starting Ollama server ...")
   proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for _ in range(60):
       time.sleep(1)
       try:
           r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
           if r.ok:
               print("✅ Ollama server is up.")
               break
       except Exception:
           pass
   else:
       raise RuntimeError("Ollama did not start in time.")
   return proc


server_proc = start_ollama()

We start the Ollama server in the background and keep checking its health endpoint until it responds successfully. By doing this, we ensure the server is running and ready before sending any API requests. Check out the Full Codes here.
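
With the server up, the same /api/tags endpoint we polled for health also reports which models are already stored locally; a small sketch of inspecting it directly:

# Inspect the running server: /api/tags lists locally available models.
r = requests.get("http://127.0.0.1:11434/api/tags", timeout=5)
r.raise_for_status()
models = r.json().get("models", [])
print(f"{len(models)} model(s) on this server:")
for m in models:
   print(" -", m.get("name"))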

MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"🧠 Using model: {MODEL}")
try:
   tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
   have = any(m.get("name")==MODEL for m in tags.get("models", []))
except Exception:
   have = False


if not have:
   print(f"⬇️  Pulling model {MODEL} (first time only) ...")
   sh(f"ollama pull {MODEL}")

We define the default model to use, check if it is already available on the Ollama server, and if not, we automatically pull it. This ensures that the chosen model is ready before we start running any chat sessions. Check out the Full Codes here.
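
If we would rather not shell out to the CLI, Ollama also exposes model pulling over HTTP. The sketch below assumes the /api/pull endpoint and its "name"/"status" fields as described in Ollama's public API documentation, so treat it as an alternative rather than a drop-in replacement for the CLI call above.

def pull_model_via_api(model):
   """Sketch: pull a model through Ollama's REST API instead of the CLI.
   Endpoint and field names ("name", "status") assumed from Ollama's API docs."""
   with requests.post("http://127.0.0.1:11434/api/pull",
                      json={"name": model}, stream=True) as r:
       r.raise_for_status()
       for line in r.iter_lines():
           if line:
               print(json.loads(line.decode("utf-8")).get("status", ""))

# pull_model_via_api(MODEL)   # uncomment to pull over HTTP instead of the CLI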

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"


def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
   """Yield streaming text chunks from Ollama /api/chat."""
   payload = {
       "model": model,
       "messages": messages,
       "stream": True,
       "options": {"temperature": float(temperature)}
   }
   if num_ctx:
       payload["options"]["num_ctx"] = int(num_ctx)
   with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
       r.raise_for_status()
       for line in r.iter_lines():
           if not line:
               continue
           data = json.loads(line.decode("utf-8"))
           if "message" in data and "content" in data["message"]:
               yield data["message"]["content"]
           if data.get("done"):
               break

We create a streaming client for the Ollama /api/chat endpoint, where we send messages as JSON payloads and yield tokens as they arrive. This lets us handle responses incrementally, so we see the model’s output in real time instead of waiting for the full completion. Check out the Full Codes here.
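
A quick usage example: because the client is a generator, collecting the whole reply is just a join over the streamed chunks.

# Usage example: drain the generator to get the complete reply as one string.
full_reply = "".join(
   ollama_chat_stream(
       [{"role": "user", "content": "Name two benefits of running an LLM locally."}],
       temperature=0.2,
   )
)
print(full_reply)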

def smoke_test():
   print("\n🧪 Smoke test:")
   sys_msg = {"role":"system","content":"You are concise. Use short bullets."}
   user_msg = {"role":"user","content":"Give 3 quick tips to sleep better."}
   out = []
   for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
       print(chunk, end="")
       out.append(chunk)
   print("\n🧪 Done.\n")
try:
   smoke_test()
except Exception as e:
   print("⚠️ Smoke test skipped:", e)

We run a quick smoke test by sending a simple prompt through our streaming client to confirm that the model responds correctly. This helps us verify that Ollama is installed, the server is running, and the chosen model is working before we build the full chat UI. Check out the Full Codes here.
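
Beyond the text itself, the final streamed object (the one where "done" is true) also carries run statistics. The field names below (eval_count, eval_duration, total_duration) are taken from Ollama's API documentation, so treat them as assumptions; a small sketch that surfaces them:

def chat_stats(messages, model=MODEL):
   """Sketch: return the run statistics Ollama attaches to the final chunk."""
   payload = {"model": model, "messages": messages, "stream": True}
   with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
       r.raise_for_status()
       for line in r.iter_lines():
           if not line:
               continue
           data = json.loads(line.decode("utf-8"))
           if data.get("done"):
               return {k: data.get(k) for k in ("eval_count", "eval_duration", "total_duration")}

# print(chat_stats([{"role": "user", "content": "Hi"}]))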

import gradio as gr


SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."


def chat_fn(message, history, temperature, num_ctx):
   msgs = [{"role":"system","content":SYSTEM_PROMPT}]
   for u, a in history:
       if u: msgs.append({"role":"user","content":u})
       if a: msgs.append({"role":"assistant","content":a})
   msgs.append({"role":"user","content": message})
   acc = ""
   try:
       for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
           acc += part
           yield acc
   except Exception as e:
       yield f"⚠️ Error: {e}"


with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
   gr.Markdown("# 🦙 Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
   with gr.Row():
       temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
       num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
   chat = gr.Chatbot(height=460)
   msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
   clear = gr.Button("Clear")


   def user_send(m, h):
       m = (m or "").strip()
       if not m: return "", h
       return "", h + [[m, None]]


   def bot_reply(h, temperature, num_ctx):
       # Guard against blank submissions: only reply when the last turn is an
       # unanswered user message.
       if not h or h[-1][1] is not None:
           yield h
           return
       u = h[-1][0]
       for partial in chat_fn(u, h[:-1], temperature, int(num_ctx)):
           h[-1][1] = partial
           yield h


   msg.submit(user_send, [msg, chat], [msg, chat]).then(
       bot_reply, [chat, temp, num_ctx], [chat]
   )
   clear.click(lambda: None, None, chat)


print("🌐 Launching Gradio ...")
demo.launch(share=True)

We integrate Gradio to build an interactive chat UI on top of the Ollama server, where user input and conversation history are converted into the correct message format and streamed back as model responses. The sliders let us adjust parameters like temperature and context length, while the chat box and clear button provide a simple, real-time interface for testing different prompts.
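
When we are done experimenting, we can shut down the background server the notebook started; a small cleanup sketch using the server_proc handle from earlier (it is None when the server was already running):

def stop_ollama(proc):
   # Terminate the background "ollama serve" process started by start_ollama().
   # proc is None when the server was already running before this notebook.
   if proc is not None:
       proc.terminate()
       proc.wait(timeout=10)

# stop_ollama(server_proc)   # call once you are finished with the chat UI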

In conclusion, we establish a reproducible pipeline for running Ollama in Colab: installation, server startup, model management, API access, and user interface integration. The system uses Ollama’s REST API as the core interaction layer, providing both command-line and Python streaming access, while Gradio handles session persistence and chat rendering. This approach preserves the “self-hosted” design described in the original guide but adapts it for Colab’s constraints, where Docker and GPU-backed Ollama images are not practical. The result is a compact yet technically complete framework that lets us experiment with multiple LLMs, adjust generation parameters dynamically, and test conversational AI locally within a notebook environment.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
