# AI Workers

## Auto (multimodal router)

One endpoint that routes text, vision, audio, and document tasks to the right Workers AI models.

Auto is a multimodal “router” endpoint: it inspects each request and dispatches it to the appropriate pipeline (text reasoning, vision, speech-to-text, or document parsing), then optionally generates files or audio on request.
### Endpoint

- URL: `POST /auto`
- Auth: `Authorization: Bearer <api_key>`
- Content-Type: `application/json`
- Base URL: `https://aiworker.linconwaves.com` (or your deployment)
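A minimal TypeScript sketch of this call, assuming a fetch-capable runtime (Node 18+, Deno, or a browser); `callAuto` is a hypothetical helper name, not part of the API:

```ts
const BASE_URL = "https://aiworker.linconwaves.com";

// POST a JSON body to /auto with the documented auth header.
async function callAuto(body: unknown, apiKey: string): Promise<unknown> {
  const res = await fetch(`${BASE_URL}/auto`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Auto request failed: ${res.status}`);
  return res.json();
}
```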
### Capabilities

- Text reasoning: uses `@cf/openai/gpt-oss-120b` for intent, planning, and replies.
- Vision: uses `llava-1p5-7b-hf`, then falls back to `uform-gen2-qwen-500m`, to describe images.
- Audio (STT): uses `whisper-large-v3-turbo` (or `whisper` fallback) to transcribe audio files.
- Document parsing: extracts text from PDFs and plaintext uploads; summarizes and proposes follow-ups.
- File generation (on request): can generate Markdown, PDF, DOCX, XLSX, or CSV from the latest assistant content.
- TTS (on request): can render the latest assistant content as audio with `aura-2-en`.
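The routing itself happens inside the worker. As a rough mental model only (the names `Attachment` and `routeRequest` are hypothetical, not the actual implementation), dispatch by attachment type could look like:

```ts
// Illustrative sketch of Auto's routing decision, not the real code.
type Attachment = {
  name: string;
  mime: string;
  type: "image" | "audio" | "video" | "file";
  data: string;
};

function routeRequest(attachments: Attachment[] = []): string {
  for (const a of attachments) {
    if (a.type === "image") return "vision";   // LLaVA, UForm fallback
    if (a.type === "audio") return "stt";      // Whisper
    if (a.type === "file") return "document";  // PDF/text extraction
  }
  return "text"; // plain chat goes straight to gpt-oss-120b
}
```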
### When to use

Use `/auto` when you want a single endpoint to handle:
- Pure text chat and reasoning.
- Images that need a description or follow-up questions.
- Audio files that need transcription and next-step guidance.
- Documents (PDF/Doc/Excel/text) that need quick summaries or extractions.
- On-demand file generation or TTS after the assistant has produced content.
### Inputs

Send a JSON body with:

```json
{
  "messages": [{ "role": "user", "content": "your text" }],
  "attachments": [
    {
      "name": "file.png",
      "mime": "image/png",
      "type": "image",
      "data": "<data-url-or-base64>"
    }
  ],
  "conversationId": "optional-conversation-id"
}
```

- `messages`: chat-style array (roles: `user`, `assistant`, `system`).
- `attachments`: optional array of uploaded files. Supported `type` values: `image`, `audio`, `video`, `file`. PDFs and common office docs go in `type: "file"`.
- `conversationId`: optional; when provided, Auto will thread memory via chat history.
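A sketch of building this payload in Node, assuming the image lives on disk; the `conversationId` literal is a placeholder:

```ts
import { readFileSync } from "node:fs";

// Build the documented Inputs shape with a base64-encoded image attachment.
const png = readFileSync("photo.png");
const body = {
  messages: [{ role: "user", content: "What is in this image?" }],
  attachments: [
    {
      name: "photo.png",
      mime: "image/png",
      type: "image",
      // "data" accepts a data URL or raw base64; a data URL is used here.
      data: `data:image/png;base64,${png.toString("base64")}`,
    },
  ],
  conversationId: "demo-thread-1", // optional placeholder
};
```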
### Outputs

Auto returns a standard chat-like JSON response with:

- `response`: assistant text.
- `generatedAttachments` (optional): files produced on request (md/pdf/docx/xlsx/csv or audio).
- `conversationId` / `conversationSlug`: conversation tracking.
- `memoryMessages`: recent messages used for summarization.

Binary media (e.g., generated audio) is uploaded to storage and returned as attachment metadata (`type`, `mime`, `size`, `r2Key`, `url`).
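Hand-written TypeScript types for consuming the response, based only on the fields listed above (exact optionality is an assumption, not part of the spec):

```ts
// Attachment metadata as described in the Outputs section.
interface AutoAttachmentMeta {
  type: string;
  mime: string;
  size: number;
  r2Key: string;
  url: string;
}

// Top-level response shape; optional fields are an assumption.
interface AutoResponse {
  response: string;
  generatedAttachments?: AutoAttachmentMeta[];
  conversationId?: string;
  conversationSlug?: string;
  memoryMessages?: { role: string; content: string }[];
}

function printResult(data: AutoResponse): void {
  console.log(data.response);
  for (const att of data.generatedAttachments ?? []) {
    console.log(`generated ${att.mime} (${att.size} bytes) -> ${att.url}`);
  }
}
```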
### Behavior notes

- Vision payloads are sent as compressed `image: number[]` arrays, matching the Workers AI vision schema (see the sketch after this list).
- Audio is transcoded to 16 kHz mono WAV for STT; TTS uses `aura-2-en`.
- Document parsing uses PDF text extraction (where possible) or a safe text preview, then summarizes.
- File generation and TTS only happen when explicitly requested in the user text (e.g., “generate a pdf”, “make audio”), and only after the assistant produces content.
- If vision fails to read an image, the assistant will ask for clarification/re-upload instead of guessing.
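For the first note, a sketch of the final payload shape: Workers AI vision models take the image as a plain array of byte values. Whatever compression Auto applies beforehand is internal to the worker and not shown here.

```ts
import { readFileSync } from "node:fs";

// Raw image bytes become a number[] — the shape the vision schema expects.
const bytes = readFileSync("photo.png");
const visionInput = {
  prompt: "Describe this image",
  image: Array.from(bytes), // number[] of byte values (0-255)
};
```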
### Example request

```bash
curl -X POST https://aiworker.linconwaves.com/auto \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "What is in this image?" }
    ],
    "attachments": [
      {
        "name": "photo.png",
        "type": "image",
        "mime": "image/png",
        "data": "data:image/png;base64,iVBORw0KGgo..."
      }
    ]
  }'
```

Auto will:
- Describe the image via LLaVA (fallback UForm).
- Pass the description into GPT-OSS reasoning.
- Reply with a concise description and a follow-up question.
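Because file generation and TTS trigger only on an explicit request, a natural next turn reuses the returned `conversationId` and asks for a file. This is a hedged sketch; the id literal is a placeholder:

```ts
// Follow-up turn in the same conversation, explicitly requesting a PDF,
// which is what triggers generation.
const followUp = {
  messages: [
    { role: "user", content: "Generate a PDF of that description." },
  ],
  conversationId: "<conversationId-from-first-response>",
};
// POST followUp to /auto as before; the file comes back in
// generatedAttachments with its r2Key and url.
```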