Captioning
Create, import, and auto‑generate captions. This editor supports manual SRT/VTT workflows and automatic speech‑to‑text via a server endpoint that can point to a Cloudflare Worker running Whisper, OpenAI Whisper, or your own/local transcription service.
Two modes
- Manual: Import SRT/VTT or paste a transcript to generate timed captions, then edit and style.
- Automatic (AI): Pick a video; the app extracts audio, uploads it to `/api/transcription`, and converts the result back into a caption track (with word timings when available).
Manual workflow
- Create a track: In the Captions panel, click “New Track”, set a name and language.
- Import SRT/VTT: Use “Add Captions → Import SRT/VTT”. Files are parsed client‑side and merged into the active track; the parser handles standard SRT and basic VTT.
- Paste transcript: Paste text and click “Generate Captions”. The editor splits text into readable chunks and spreads them across the project duration. You can retime any item on the timeline.
- Style and position: Choose presets or tweak family, size, colors, outline, alignment, and top/center/bottom placement. Styles apply per‑track.
- Export: Download as `.srt` or `.vtt` for use outside the editor (see the sample below).
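For reference, here is a minimal SRT file of the kind the importer and exporter handle; VTT differs mainly in its `WEBVTT` header and in using `.` instead of `,` in timestamps:

```
1
00:00:01,000 --> 00:00:03,500
Welcome to the project.

2
00:00:03,500 --> 00:00:06,000
Captions can be edited after import.
```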
Automatic (AI) workflow
Automatic captioning is powered by a thin Next.js API at `/api/transcription` that forwards your audio/video to a back‑end service (default: a Cloudflare Worker running Whisper). You can swap that service for OpenAI Whisper or a local server without changing the UI.
- Pick a video in “Automatic Caption Generation”. The app extracts mono 16kHz audio (FFmpeg WASM fast path; MediaBunny/Web Audio fallbacks) and uploads a small WAV/WEBM.
- Large inputs: Videos longer than ~10 minutes or >80MB are chunked (~3‑minute windows) and stitched back together to improve reliability.
- Server call: The client posts `FormData` with fields `audio`, `task` (`transcribe` or `translate`), `language`, and `vad_filter` to `/api/transcription`.
- Response shape: The API returns `{ text, duration, segments[] }`, where each segment has `start`, `end`, `text`, and optionally `words[]` with word‑level timings. Segments are mapped to caption items on the track.
- Timeline alignment: Generated captions are offset by the clip’s `positionStart` so they line up with edits (see the client sketch after this list).
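A minimal sketch of the client side of this exchange, assuming an already‑extracted `audioBlob` and the clip’s `positionStart`; the function name and caption mapping are illustrative, not the editor’s actual code:

```ts
// Sketch: post extracted audio to /api/transcription and map the
// response segments to timeline-aligned caption items.
interface Word { start: number; end: number; word: string }
interface Segment { start: number; end: number; text: string; words?: Word[] }
interface TranscriptionResponse { text: string; duration?: number; segments: Segment[] }

async function transcribeClip(audioBlob: Blob, positionStart: number) {
  const fd = new FormData();
  fd.append('audio', audioBlob, 'clip.wav');
  fd.append('task', 'transcribe'); // or 'translate'
  fd.append('language', 'en');
  fd.append('vad_filter', 'true');

  const res = await fetch('/api/transcription', { method: 'POST', body: fd });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
  const data: TranscriptionResponse = await res.json();

  // Offset each segment by the clip's positionStart so captions land
  // where the clip sits on the timeline, not at absolute media time.
  return data.segments.map((s) => ({
    start: s.start + positionStart,
    end: s.end + positionStart,
    text: s.text.trim(),
    words: s.words,
  }));
}
```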
Cloudflare Whisper Worker (default)
The server route `aivideoeditor/app/api/transcription/route.ts` validates the request origin, applies rate limiting, and forwards the uploaded audio to your Cloudflare Worker. Configure it via environment variables:
```bash
# Server-side token your Worker expects (hex string)
TRANSCRIPTION_BEARER_TOKEN=0123456789abcdef0123456789abcdef

# Your Worker URL (defaults to example worker if unset)
TRANSCRIPTION_WORKER_URL=https://<your-worker>.<your-account>.workers.dev

# Origin security (avoid 403 Unauthorized origin)
NEXT_PUBLIC_SITE_URL=https://your.app
ALLOWED_ORIGINS=https://your.app,https://www.your.app

# Optional (Electron/file:// contexts)
ALLOW_NULL_ORIGIN=true
```
- Headers: The server adds `Authorization: Bearer TRANSCRIPTION_BEARER_TOKEN` when calling the Worker (see the sketch after this list).
- File types: WAV/WEBM/MP3/MP4 are accepted by the route; max ~100MB per request before chunking kicks in.
- Rate limit: 10 requests/minute per deployment token (tune in code if needed).
- Security: Production requests must come from an allowed origin; configure `ALLOWED_ORIGINS` and `NEXT_PUBLIC_SITE_URL`.
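For orientation, a stripped‑down sketch of the forwarding step; the real route also performs the origin validation and rate limiting described above, which are omitted here:

```ts
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  // Origin validation and rate limiting run before this point (omitted).
  const formData = await req.formData();

  // Forward the uploaded audio and options to the Whisper Worker,
  // authenticating with the shared bearer token.
  const workerRes = await fetch(process.env.TRANSCRIPTION_WORKER_URL!, {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.TRANSCRIPTION_BEARER_TOKEN}` },
    body: formData,
  });

  if (!workerRes.ok) {
    return NextResponse.json({ error: 'Transcription failed' }, { status: workerRes.status });
  }

  // Pass the Worker's { text, duration, segments[] } payload through unchanged.
  return NextResponse.json(await workerRes.json());
}
```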
Switch to OpenAI Whisper
To use OpenAI’s Whisper instead of a Worker, point the server route to OpenAI’s Audio Transcriptions API and keep the same response shape expected by the client.
- Add `OPENAI_API_KEY` to your environment (server‑side only).
- Update `aivideoeditor/app/api/transcription/route.ts` to call OpenAI and return `{ text, duration, segments }`. Example minimal server logic:
```ts
// Inside the POST handler, after reading the 'audio' field from FormData
const openaiRes = await fetch('https://api.openai.com/v1/audio/transcriptions', {
  method: 'POST',
  headers: { Authorization: 'Bearer ' + process.env.OPENAI_API_KEY },
  body: (() => {
    const fd = new FormData();
    fd.append('file', audioFile);
    fd.append('model', 'whisper-1');
    fd.append('response_format', 'verbose_json');
    return fd;
  })(),
});
const data = await openaiRes.json();

// Map the OpenAI verbose_json response to the shape the client expects
return NextResponse.json({
  text: data.text || '',
  duration: data.duration,
  segments: (data.segments || []).map((s: any) => ({
    start: s.start,
    end: s.end,
    text: s.text,
    words: s.words,
  })),
});
```
Keep origin validation in place. You can leave `TRANSCRIPTION_WORKER_URL` unset when using OpenAI directly from the route.
Use a local Whisper server
You can run Faster‑Whisper locally and point the server route to it. Return the same JSON structure the client expects. Example FastAPI service (Python):
```python
import io

from fastapi import FastAPI, File, UploadFile
from faster_whisper import WhisperModel
import uvicorn

app = FastAPI()
model = WhisperModel('base', device='cpu')

@app.post('/transcribe')
async def transcribe(audio: UploadFile = File(...)):
    # faster-whisper expects a path or file-like object, so wrap the raw bytes
    segments, info = model.transcribe(io.BytesIO(await audio.read()), vad_filter=True)
    segs = []
    for s in segments:
        item = {'start': float(s.start), 'end': float(s.end), 'text': s.text}
        # Word timings are only populated when transcribe() is called
        # with word_timestamps=True; the check below keeps them optional.
        if hasattr(s, 'words') and s.words:
            item['words'] = [
                {'start': float(w.start), 'end': float(w.end), 'word': w.word}
                for w in s.words
            ]
        segs.append(item)
    return {'text': ' '.join(s['text'] for s in segs), 'segments': segs}

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=9000)
```
- Set `TRANSCRIPTION_WORKER_URL=http://localhost:9000/transcribe` and keep `TRANSCRIPTION_BEARER_TOKEN` as a 32‑hex string (the local server can ignore it or validate it).
- Return the fields `text`, `segments[]`, and optionally `duration` to enable duration estimation when the file isn’t seekable (example response below).
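For reference, an illustrative response payload in the shape the route and client expect (the values are made up):

```json
{
  "text": "Welcome to the project. Captions can be edited after import.",
  "duration": 6.0,
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "Welcome to the project.",
      "words": [
        { "start": 0.0, "end": 0.6, "word": "Welcome" },
        { "start": 0.6, "end": 0.9, "word": "to" },
        { "start": 0.9, "end": 1.2, "word": "the" },
        { "start": 1.2, "end": 1.8, "word": "project." }
      ]
    },
    { "start": 3.5, "end": 6.0, "text": "Captions can be edited after import." }
  ]
}
```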
Security, CSP, and limits
- Origin checks: In production, `/api/transcription` rejects requests unless `ALLOWED_ORIGINS` and `NEXT_PUBLIC_SITE_URL` include your domain(s) (a sketch of the check follows this list).
- CSP: The app allows posting to its own origin by default. If you proxy to another host from the client, add it to `connect-src` (not needed for the server‑side proxy used here).
- Size/timeouts: The route accepts typical audio/video types, ~100MB per chunk. Very long videos are processed in multiple requests transparently.
Troubleshooting
403 Unauthorized origin
Set `NEXT_PUBLIC_SITE_URL` and include your URL in `ALLOWED_ORIGINS`. For Electron or `file://` contexts, set `ALLOW_NULL_ORIGIN=true`.
“Transcription service not configured”
Add a valid `TRANSCRIPTION_BEARER_TOKEN` (hex) and optionally `TRANSCRIPTION_WORKER_URL`. For OpenAI, add `OPENAI_API_KEY` and adjust the route.
No speech / empty result
Verify the clip has an audio track, choose the correct language, and try enabling VAD filtering.
Large files fail
The client auto‑chunks long videos. If your back‑end struggles with big files, keep `chunkLen` small (≈180 s) or raise server limits.