Captioning

Create, import, and auto‑generate captions. This editor supports manual SRT/VTT workflows and automatic speech‑to‑text via a server endpoint that can point to a Cloudflare Worker running Whisper, OpenAI Whisper, or your own/local transcription service.

Two modes

  • Manual: Import SRT/VTT or paste a transcript to generate timed captions, then edit and style.
  • Automatic (AI): Pick a video, the app extracts audio, uploads it to /api/transcription, and converts the result back into a caption track (with word timings when available).

Manual workflow

  • Create a track: In the Captions panel, click “New Track”, set a name and language.
  • Import SRT/VTT: Use “Add Captions → Import SRT/VTT”. Files are parsed client‑side and merged into the active track. The parser supports common SRT and basic VTT.
  • Paste transcript: Paste text and click “Generate Captions”. The editor splits the text into readable chunks and spreads them evenly across the project duration (see the sketch after this list). You can retime any item on the timeline.
  • Style and position: Choose presets or tweak family, size, colors, outline, alignment, and top/center/bottom placement. Styles apply per‑track.
  • Export: Download as .srt or .vtt for use outside the editor.
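
The transcript‑to‑captions step is essentially: split the text into readable chunks, then distribute those chunks evenly over the project duration. A minimal sketch of that idea in TypeScript (the CaptionItem shape and the maxChars limit are illustrative, not the editor’s exact internals):

interface CaptionItem {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Split a transcript into readable chunks and spread them across a duration.
function transcriptToCaptions(transcript: string, durationSec: number, maxChars = 42): CaptionItem[] {
  const words = transcript.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  let current = '';
  for (const word of words) {
    if (current && (current + ' ' + word).length > maxChars) {
      chunks.push(current);
      current = word;
    } else {
      current = current ? current + ' ' + word : word;
    }
  }
  if (current) chunks.push(current);

  // Spread the chunks evenly over the project duration.
  const per = durationSec / Math.max(chunks.length, 1);
  return chunks.map((text, i) => ({ start: i * per, end: (i + 1) * per, text }));
}

Each generated item can then be retimed on the timeline like any imported caption.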

Automatic (AI) workflow

Automatic captioning is powered by a thin Next.js API at /api/transcription that forwards your audio/video to a back‑end service (default: a Cloudflare Worker running Whisper). You can swap that service for OpenAI Whisper or a local server without changing the UI.

  • Pick a video in “Automatic Caption Generation”. The app extracts mono 16kHz audio (FFmpeg WASM fast path; MediaBunny/Web Audio fallbacks) and uploads a small WAV/WEBM.
  • Large inputs: Videos longer than ~10 minutes or >80MB are chunked (~3‑minute windows) and stitched back together to improve reliability.
  • Server call: The client posts FormData with fields audio, task (transcribe or translate), language, and vad_filter to /api/transcription.
  • Response shape: The API returns { text, duration, segments[] } where each segment has start, end, text, and optionally words[] with word‑level timings. Segments are mapped to caption items on the track.
  • Timeline alignment: Generated captions are offset by the clip’s positionStart so they line up with edits (see the sketch after this list).
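
A rough client‑side sketch of that round trip, assuming the response shape above (the function and field names here are illustrative, not the editor’s exact code):

interface TranscriptionSegment { start: number; end: number; text: string; words?: { start: number; end: number; word: string }[] }
interface TranscriptionResult { text: string; duration?: number; segments: TranscriptionSegment[] }

// Upload extracted audio and convert the returned segments into caption items
// aligned to the clip's position on the timeline.
async function transcribeClip(audio: Blob, positionStart: number, language = 'en') {
  const fd = new FormData();
  fd.append('audio', audio, 'audio.wav');
  fd.append('task', 'transcribe'); // or 'translate'
  fd.append('language', language);
  fd.append('vad_filter', 'true');

  const res = await fetch('/api/transcription', { method: 'POST', body: fd });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
  const data: TranscriptionResult = await res.json();

  // Offset each segment by the clip's positionStart so captions line up with edits.
  return data.segments.map((s) => ({
    start: positionStart + s.start,
    end: positionStart + s.end,
    text: s.text.trim(),
  }));
}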

Cloudflare Whisper Worker (default)

The server route aivideoeditor/app/api/transcription/route.ts validates the request origin, applies rate limiting, and forwards the uploaded audio to your Cloudflare Worker. Configure it via environment variables:

# Server-side token your Worker expects (hex string)
TRANSCRIPTION_BEARER_TOKEN=0123456789abcdef0123456789abcdef
# Your Worker URL (defaults to example worker if unset)
TRANSCRIPTION_WORKER_URL=https://<your-worker>.<your-account>.workers.dev

# Origin security (avoid 403 Unauthorized origin)
NEXT_PUBLIC_SITE_URL=https://your.app
ALLOWED_ORIGINS=https://your.app,https://www.your.app
# Optional (Electron/file:// contexts)
ALLOW_NULL_ORIGIN=true
  • Headers: The server adds Authorization: Bearer <TRANSCRIPTION_BEARER_TOKEN> when calling the Worker (see the sketch after this list).
  • File types: WAV/WEBM/MP3/MP4 are accepted by the route; max ~100MB per request before chunking kicks in.
  • Rate limit: 10 requests/minute per deployment token (tune in code if needed).
  • Security: Production requests must come from an allowed origin; configure ALLOWED_ORIGINS and NEXT_PUBLIC_SITE_URL.
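
Inside the route, the forwarding step boils down to posting the received FormData to the Worker with the bearer token. A simplified sketch (the real route.ts also performs origin checks, rate limiting, and file validation; formData is assumed to have been read from the incoming request already):

// Simplified forwarding sketch inside the POST handler.
// NextResponse comes from 'next/server', as in the existing route.
const workerUrl = process.env.TRANSCRIPTION_WORKER_URL!; // the real route falls back to an example Worker when unset

const upstream = await fetch(workerUrl, {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.TRANSCRIPTION_BEARER_TOKEN}` },
  body: formData, // forwards audio, task, language, vad_filter as received
});

if (!upstream.ok) {
  return NextResponse.json({ error: 'Transcription service error' }, { status: upstream.status });
}
return NextResponse.json(await upstream.json()); // { text, duration, segments[] }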

Switch to OpenAI Whisper

To use OpenAI’s Whisper instead of a Worker, point the server route to OpenAI’s Audio Transcriptions API and keep the same response shape expected by the client.

  1. Add OPENAI_API_KEY to your environment (server‑side only).
  2. Update aivideoeditor/app/api/transcription/route.ts to call OpenAI and return { text, duration, segments }. Example minimal server logic:
// inside the POST handler, after reading the 'audio' FormData field into audioFile
const fd = new FormData();
fd.append('file', audioFile);
fd.append('model', 'whisper-1');
fd.append('response_format', 'verbose_json');

const openaiRes = await fetch('https://api.openai.com/v1/audio/transcriptions', {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  body: fd,
});
if (!openaiRes.ok) {
  return NextResponse.json({ error: 'OpenAI transcription failed' }, { status: openaiRes.status });
}
const data = await openaiRes.json();

// Map OpenAI's verbose_json response to the shape the client expects
return NextResponse.json({
  text: data.text || '',
  duration: data.duration,
  segments: (data.segments || []).map((s: any) => ({ start: s.start, end: s.end, text: s.text, words: s.words })),
});

Keep origin validation in place. You can leave TRANSCRIPTION_WORKER_URL unset when using OpenAI directly from the route.

Use a local Whisper server

You can run Faster‑Whisper locally and point the server route to it. Return the same JSON structure the client expects. Example FastAPI service (Python):

import io

from fastapi import FastAPI, File, UploadFile
from faster_whisper import WhisperModel
import uvicorn

app = FastAPI()
model = WhisperModel('base', device='cpu')

@app.post('/transcribe')
async def transcribe(audio: UploadFile = File(...)):
    # faster-whisper expects a path or file-like object, so wrap the uploaded bytes
    data = io.BytesIO(await audio.read())
    # word_timestamps=True populates per-word timings on each segment
    segments, info = model.transcribe(data, vad_filter=True, word_timestamps=True)
    segs = []
    for s in segments:
        item = {'start': float(s.start), 'end': float(s.end), 'text': s.text}
        if s.words:
            item['words'] = [{'start': float(w.start), 'end': float(w.end), 'word': w.word} for w in s.words]
        segs.append(item)
    return {
        'text': ' '.join(s['text'].strip() for s in segs),
        'duration': float(info.duration),
        'segments': segs,
    }

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=9000)
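
This assumes fastapi, uvicorn, faster-whisper, and python-multipart are installed; FastAPI needs python-multipart to parse file uploads.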
  • Set TRANSCRIPTION_WORKER_URL=http://localhost:9000/transcribe and keep TRANSCRIPTION_BEARER_TOKEN as a 32‑hex string (the local server can ignore it or validate it).
  • Return fields text, segments[], and optionally duration to enable duration estimation when the file isn’t seekable.

Security, CSP, and limits

  • Origin checks: In production, /api/transcription rejects requests unless ALLOWED_ORIGINS and NEXT_PUBLIC_SITE_URL include your domain(s) (see the sketch after this list).
  • CSP: The app allows posting to its own origin by default. If you proxy to another host from the client, add it to connect-src (not needed for the server‑side proxy used here).
  • Size/timeouts: The route accepts typical audio/video types, up to ~100MB per chunk. Very long videos are processed in multiple requests transparently.
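
The origin check amounts to comparing the request’s Origin header against the configured allow list. A simplified sketch of that logic (variable and function names are illustrative, not the exact route.ts code):

// Build the allow list from NEXT_PUBLIC_SITE_URL and ALLOWED_ORIGINS.
function isAllowedOrigin(origin: string | null): boolean {
  const allowed = new Set(
    [process.env.NEXT_PUBLIC_SITE_URL, ...(process.env.ALLOWED_ORIGINS ?? '').split(',')]
      .map((o) => o?.trim())
      .filter(Boolean) as string[]
  );
  if (origin === null) return process.env.ALLOW_NULL_ORIGIN === 'true'; // Electron / file:// contexts
  return allowed.has(origin);
}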

Troubleshooting

403 Unauthorized origin

Set NEXT_PUBLIC_SITE_URL and include your URL in ALLOWED_ORIGINS. For Electron or file:// contexts, set ALLOW_NULL_ORIGIN=true.

“Transcription service not configured”

Add a valid TRANSCRIPTION_BEARER_TOKEN (hex) and optionally TRANSCRIPTION_WORKER_URL. For OpenAI, add OPENAI_API_KEY and adjust the route.

No speech / empty result

Verify the clip has an audio track, choose the correct language, and try enabling VAD filtering.

Large files fail

The client auto‑chunks long videos. If your back‑end struggles with big files, keep chunkLen small (≈180s) or raise server limits.