# AI Workers

## Auto (multimodal router)

One endpoint that routes text, vision, audio, and document tasks to the right Workers AI models.

Auto is a multimodal “router” endpoint: it inspects each request and dispatches it to the appropriate pipeline (text reasoning, vision, speech-to-text, or document parsing), then optionally generates files or audio on request.
### Endpoint

- URL: `POST /auto`
- Auth: `Authorization: Bearer <api_key>`
- Content-Type: `application/json`
- Base URL: `https://aiworker.linconwaves.com` (or your deployment)
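A minimal TypeScript sketch of this call, assuming a fetch-capable runtime (Node 18+, Deno, or a browser); `callAuto` is a hypothetical helper name, not part of the API:

```ts
const BASE_URL = "https://aiworker.linconwaves.com";

// POST a JSON body to /auto with the documented auth header.
async function callAuto(body: unknown, apiKey: string): Promise<unknown> {
  const res = await fetch(`${BASE_URL}/auto`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Auto request failed: ${res.status}`);
  return res.json();
}
```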
### Capabilities

- Text reasoning: uses `@cf/openai/gpt-oss-120b` for intent, planning, and replies.
- Vision: uses `llava-1p5-7b-hf`, then falls back to `uform-gen2-qwen-500m`, to describe images.
- Audio (STT): uses `whisper-large-v3-turbo` (or `whisper` fallback) to transcribe audio files.
- Document parsing: extracts text from PDFs and plaintext uploads; summarizes and proposes follow-ups.
- File generation (on request): can generate Markdown, PDF, DOCX, XLSX, or CSV from the latest assistant content.
- TTS (on request): can render the latest assistant content as audio with `aura-2-en`.
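The routing itself happens inside the worker. As a rough mental model only (the names `Attachment` and `routeRequest` are hypothetical, not the actual implementation), dispatch by attachment type could look like:

```ts
// Illustrative sketch of Auto's routing decision, not the real code.
type Attachment = {
  name: string;
  mime: string;
  type: "image" | "audio" | "video" | "file";
  data: string;
};

function routeRequest(attachments: Attachment[] = []): string {
  for (const a of attachments) {
    if (a.type === "image") return "vision";   // LLaVA, UForm fallback
    if (a.type === "audio") return "stt";      // Whisper
    if (a.type === "file") return "document";  // PDF/text extraction
  }
  return "text"; // plain chat goes straight to gpt-oss-120b
}
```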
### When to use

Use `/auto` when you want a single endpoint to handle:
- Pure text chat and reasoning.
- Images that need a description or follow-up questions.
- Audio files that need transcription and next-step guidance.
- Documents (PDF/Doc/Excel/text) that need quick summaries or extractions.
- On-demand file generation or TTS after the assistant has produced content.
### Inputs

Send a JSON body with:

```json
{
  "messages": [{ "role": "user", "content": "your text" }],
  "attachments": [
    {
      "name": "file.png",
      "mime": "image/png",
      "type": "image",
      "data": "<data-url-or-base64>"
    }
  ],
  "conversationId": "optional-conversation-id"
}
```

- `messages`: chat-style array (roles: `user`, `assistant`, `system`).
- `attachments`: optional array of uploaded files. Supported `type` values: `image`, `audio`, `video`, `file`. PDFs and common office docs go in `type: "file"`.
- `conversationId`: optional; when provided, Auto will thread memory via chat history.
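A sketch of building this payload in Node, assuming the image lives on disk; the `conversationId` literal is a placeholder:

```ts
import { readFileSync } from "node:fs";

// Build the documented Inputs shape with a base64-encoded image attachment.
const png = readFileSync("photo.png");
const body = {
  messages: [{ role: "user", content: "What is in this image?" }],
  attachments: [
    {
      name: "photo.png",
      mime: "image/png",
      type: "image",
      // "data" accepts a data URL or raw base64; a data URL is used here.
      data: `data:image/png;base64,${png.toString("base64")}`,
    },
  ],
  conversationId: "demo-thread-1", // optional placeholder
};
```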
### Outputs

Auto returns a standard chat-like JSON response with:

- `response`: assistant text.
- `generatedAttachments` (optional): files produced on request (md/pdf/docx/xlsx/csv or audio).
- `conversationId` / `conversationSlug`: conversation tracking.
- `memoryMessages`: recent messages used for summarization.

Binary media (e.g., generated audio) is uploaded to storage and returned as attachment metadata (`type`, `mime`, `size`, `r2Key`, `url`).
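Hand-written TypeScript types for consuming the response, based only on the fields listed above (exact optionality is an assumption, not part of the spec):

```ts
// Attachment metadata as described in the Outputs section.
interface AutoAttachmentMeta {
  type: string;
  mime: string;
  size: number;
  r2Key: string;
  url: string;
}

// Top-level response shape; optional fields are an assumption.
interface AutoResponse {
  response: string;
  generatedAttachments?: AutoAttachmentMeta[];
  conversationId?: string;
  conversationSlug?: string;
  memoryMessages?: { role: string; content: string }[];
}

function printResult(data: AutoResponse): void {
  console.log(data.response);
  for (const att of data.generatedAttachments ?? []) {
    console.log(`generated ${att.mime} (${att.size} bytes) -> ${att.url}`);
  }
}
```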
### Behavior notes

- Vision payloads are sent as compressed `image: number[]` arrays, matching the Workers AI vision schema (see the sketch after this list).
- Audio is transcoded to 16 kHz mono WAV for STT; TTS uses `aura-2-en`.
- Document parsing uses PDF text extraction (where possible) or a safe text preview, then summarizes.
- File generation and TTS only happen when explicitly requested in the user text (e.g., “generate a pdf”, “make audio”), and only after the assistant produces content.
- If vision fails to read an image, the assistant will ask for clarification/re-upload instead of guessing.
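For the first note, a sketch of the final payload shape: Workers AI vision models take the image as a plain array of byte values. Whatever compression Auto applies beforehand is internal to the worker and not shown here.

```ts
import { readFileSync } from "node:fs";

// Raw image bytes become a number[] — the shape the vision schema expects.
const bytes = readFileSync("photo.png");
const visionInput = {
  prompt: "Describe this image",
  image: Array.from(bytes), // number[] of byte values (0-255)
};
```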
### Example request

```bash
curl -X POST https://aiworker.linconwaves.com/auto \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is in this image?" }
    ],
    "attachments": [
      {
        "name": "photo.png",
        "type": "image",
        "mime": "image/png",
        "data": "data:image/png;base64,iVBORw0KGgo..."
      }
    ]
  }'
```

Auto will:
- Describe the image via LLaVA (fallback UForm).
- Pass the description into GPT-OSS reasoning.
- Reply with a concise description and a follow-up question.
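Because file generation and TTS trigger only on an explicit request, a natural next turn reuses the returned `conversationId` and asks for a file. This is a hedged sketch; the id literal is a placeholder:

```ts
// Follow-up turn in the same conversation, explicitly requesting a PDF,
// which is what triggers generation.
const followUp = {
  messages: [
    { role: "user", content: "Generate a PDF of that description." },
  ],
  conversationId: "<conversationId-from-first-response>",
};
// POST followUp to /auto as before; the file comes back in
// generatedAttachments with its r2Key and url.
```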