I wrote about setting up open-source models locally in one of my previous posts, so today I'll go a step further and build an API that makes a local model available online.
I'll show three of the most popular approaches for creating AI APIs:
- simple JSON request/response
- text streaming
- Server-Sent Events
My tech choices
- Hono - a web framework that is at least 4x faster than Express, lightweight, and runtime-agnostic.
- Vercel AI SDK - a model-agnostic framework for building AI apps, my favorite one at the moment
- Ollama - a tool for running open LLMs locally
Example 1: simple JSON request/response
Let's look at the full example of an API server that uses the simple request/response approach.
import { serve } from '@hono/node-server'
import { generateText, streamText } from 'ai'
import { Hono } from 'hono'
import { stream, streamSSE } from 'hono/streaming'
import { ollama } from 'ollama-ai-provider'

const app = new Hono()

app.post('/simple', async (c) => {
  try {
    const body = await c.req.json()

    if (!body || !body.question) {
      return c.json({ error: 'No question provided', success: false }, 400)
    }

    const prompt = body.question

    // Get the full text response from the LLM.
    const { text } = await generateText({
      model: ollama('gemma3n:e2b'),
      prompt,
    })

    // And return it as a JSON response.
    return c.json({
      text,
      success: true,
    })
  } catch (error) {
    console.error('Simple server error:', error)
    return c.json(
      {
        error: 'Failed to perform the request',
        success: false,
      },
      500
    )
  }
})

serve(
  {
    fetch: app.fetch,
    port: 3000,
  },
  (info) => {
    console.log(`Server running on http://localhost:${info.port}`)
  }
)
As you can see, when the server receives a POST request to http://localhost:3000/simple with {"question": "How are you?"} in the request body, it asks the LLM for an answer, wraps the answer in JSON, and returns it to the user.
The example uses the generateText() function from the Vercel AI SDK to get the full LLM response before replying to the user.
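To quickly test the endpoint, a client can call it with fetch. Here is a minimal sketch (the URL and payload follow the example above):

// Ask the /simple endpoint a question and read the JSON answer.
const response = await fetch('http://localhost:3000/simple', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ question: 'How are you?' }),
})

const data = await response.json()
console.log(data.success ? data.text : data.error)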
There is nothing wrong with this approach, but the user gets nothing until the LLM finishes generating the whole answer, which is not ideal in terms of usability. Let's see how we can improve it with response streaming.
Example 2: text streaming
Here I won't repeat the server setup, request validation, and error handling because they are basically the same; I'll focus only on the part that changes.
app.post('/stream', async (c) => {
  const body = await c.req.json()
  const prompt = body.question

  // Prepare LLM response stream.
  const { textStream } = await streamText({
    model: ollama('gemma3n:e2b'),
    prompt,
  })

  // Pipe the LLM response stream to the HTTP response stream.
  return stream(c, async (stream) => {
    stream.onAbort(() => {
      console.log('Stream aborted!')
    })

    await stream.pipe(textStream)
  })
})
Thanks to Hono's stream() helper, we basically pipe the text stream from the LLM directly into the HTTP response. This lets the frontend display parts of the LLM answer as they arrive, so the user can see the LLM's progress and start reading the partial answer as soon as possible. It's a popular pattern in chat-based AI interfaces today, but it's not the best one.
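On the client side, the streamed body can be read with the standard fetch API. Here is a minimal sketch of the reading loop (my own illustration, not part of the server code above):

// Ask the /stream endpoint a question and read the answer chunk by chunk.
const response = await fetch('http://localhost:3000/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ question: 'How are you?' }),
})

const reader = response.body!.getReader()
const decoder = new TextDecoder()
let answer = ''

while (true) {
  const { done, value } = await reader.read()
  if (done) break

  // Each chunk is a piece of the LLM answer, so the UI can render the partial text right away.
  answer += decoder.decode(value, { stream: true })
  console.log(answer)
}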
Example 3: Server-Sent Events
A better approach is to use Server-Sent Events (SSE), which are natively supported in browsers via the EventSource API and easier to implement on the frontend. With SSE, you don't need to manually parse stream chunks or manage reconnection.
app.post('/sse', async (c) => {
  const body = await c.req.json()
  const prompt = body.question

  // Prepare LLM response stream.
  const { textStream } = await streamText({
    model: ollama('gemma3n:e2b'),
    prompt,
  })

  return streamSSE(c, async (stream) => {
    stream.onAbort(() => {
      console.log('SSE aborted!')
    })

    // Stream chunks one by one as message events.
    let id = 0
    for await (const chunk of textStream) {
      await stream.writeSSE({
        data: JSON.stringify({ text: chunk }),
        event: 'message',
        id: String(id++),
      })
    }

    // Send completion event.
    await stream.writeSSE({
      data: JSON.stringify({ completed: true }),
      event: 'complete',
      id: String(id++),
    })
  })
})
Hono gives us wings again here, now with the streamSSE() helper.
In addition to the advantages I mentioned above, SSE lets us return different kinds of information: LLM answer chunks, state updates (started, completed), logs, and so on.
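For instance, a browser client could consume these events with EventSource and handle the message and complete events separately. Keep in mind that EventSource can only make GET requests, so this sketch assumes a hypothetical GET variant of the route that accepts the question as a query parameter (the example above uses POST):

// Subscribe to a (hypothetical) GET variant of the SSE endpoint.
const source = new EventSource('http://localhost:3000/sse?question=How%20are%20you')

let answer = ''

// Each 'message' event carries one chunk of the LLM answer.
source.addEventListener('message', (event) => {
  const { text } = JSON.parse(event.data)
  answer += text
  console.log(answer)
})

// The 'complete' event signals that the answer is finished.
source.addEventListener('complete', () => {
  source.close()
})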
In a real app, you'd probably want to use text streaming or SSE to give the user more responsiveness and interactivity, but for non-chat UIs, a simple JSON response is still a valid option.
You can find all the examples from this post in kometolabs/ai-api-examples, along with some ways to test them.