Implementing AI Auto-Subtitle Generation for Video in a Mobile App
AI subtitles usually mean Whisper or a comparable speech-to-text model. The task looks simple on the surface: send video, get text with timecodes. But the mobile context raises several non-trivial questions: whether to transcribe on-device or via an API, how to render subtitles over video, and how to let the user edit the result.
Whisper on-device vs API
Whisper via the OpenAI API is the simplest path. Send audio (up to 25 MB) and get back JSON with segments (timecodes + text):
POST https://api.openai.com/v1/audio/transcriptions
model=whisper-1&response_format=verbose_json&timestamp_granularities[]=word
verbose_json with word granularity gives a timecode per word, which is needed for subtitle sync. Processing time: 2–4 sec for a ~10-second clip, 10–20 sec for a one-minute video.
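The endpoint takes the request as multipart/form-data, with the audio in a `file` part next to the text fields. As a rough sketch, the body could be assembled by hand as below. `buildMultipartBody` and `transcriptionFields` are hypothetical names, not part of any SDK; a real app would more likely use the multipart support of its HTTP client.

```kotlin
import java.io.ByteArrayOutputStream

// Hypothetical helper: encodes text fields plus one file part as multipart/form-data.
fun buildMultipartBody(
    boundary: String,
    fields: List<Pair<String, String>>,
    fileField: String,
    fileName: String,
    fileBytes: ByteArray
): ByteArray {
    val out = ByteArrayOutputStream()
    fun write(s: String) = out.write(s.toByteArray(Charsets.UTF_8))

    // One part per plain text field
    for ((name, value) in fields) {
        write("--$boundary\r\n")
        write("Content-Disposition: form-data; name=\"$name\"\r\n\r\n")
        write("$value\r\n")
    }
    // The audio file part
    write("--$boundary\r\n")
    write("Content-Disposition: form-data; name=\"$fileField\"; filename=\"$fileName\"\r\n")
    write("Content-Type: application/octet-stream\r\n\r\n")
    out.write(fileBytes)
    // Closing boundary
    write("\r\n--$boundary--\r\n")
    return out.toByteArray()
}

// Text fields matching the request shown above
val transcriptionFields = listOf(
    "model" to "whisper-1",
    "response_format" to "verbose_json",
    "timestamp_granularities[]" to "word"
)
```

The resulting bytes go out in a POST with `Content-Type: multipart/form-data; boundary=<boundary>` and an `Authorization: Bearer <API key>` header.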
On-device Whisper is realistic on iOS 16+ via WhisperKit (built on swift-transformers). The whisper-small model is 244 MB and runs at ~0.3× real-time speed on iPhone 14 (i.e. one minute of audio takes about 3 minutes to process). whisper-tiny is 77 MB and runs at 0.7× real time, but its accuracy is notably worse. Russian is recognized worse than English.
Android: whisper.cpp via JNI, or openai-whisper-tflite, but both are trickier to build. For most apps the API is the simpler choice.
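The real-time figures translate into waiting time by simple division. A minimal sketch, assuming the speed factor means "seconds of audio processed per second of wall time"; the function name is made up for illustration:

```kotlin
// speedFactor is "times real time": 0.3 means 0.3 s of audio is processed per second.
fun estimateProcessingSeconds(audioSeconds: Double, speedFactor: Double): Double {
    require(speedFactor > 0.0) { "speedFactor must be positive" }
    return audioSeconds / speedFactor
}
```

For one minute of audio this gives about 200 seconds at 0.3× and about 86 seconds at 0.7×, which helps decide whether to show a progress bar or push the work to the background.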
Extract audio from video on client
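Before sending to Whisper, extract the audio track; sending the whole video is redundant:
// iOS: AVAssetExportSession for audio extraction
import AVFoundation

// Error type used below, defined here so the snippet compiles on its own
enum SubtitleError: Error { case exportFailed }

func extractAudio(from videoURL: URL) async throws -> URL {
    let asset = AVURLAsset(url: videoURL)
    guard let exportSession = AVAssetExportSession(
        asset: asset, presetName: AVAssetExportPresetAppleM4A
    ) else { throw SubtitleError.exportFailed }
    // Write to a unique file in the temporary directory
    let outputURL = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString + ".m4a")
    exportSession.outputURL = outputURL
    exportSession.outputFileType = .m4a
    await exportSession.export()
    // export() does not throw; failure is reported through status and error
    guard exportSession.status == .completed else {
        throw exportSession.error ?? SubtitleError.exportFailed
    }
    return outputURL
}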
An m4a/mp3 track is 3–5× smaller than the original video, so the upload is faster and cheaper.
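Whether an extracted track fits under the 25 MB upload limit can be estimated from duration and bitrate before uploading. A small sketch under stated assumptions: the limit is read conservatively as 25,000,000 bytes, the bitrate is constant, and both function names are hypothetical.

```kotlin
// Assumption: "25 MB" interpreted conservatively as 25,000,000 bytes.
val MAX_UPLOAD_BYTES = 25_000_000L

// Approximate size of a constant-bitrate audio track.
fun estimateAudioBytes(durationSeconds: Double, bitrateKbps: Int): Long =
    (durationSeconds * bitrateKbps * 1000 / 8).toLong()

// True if the track is expected to fit into a single transcription request.
fun fitsInSingleRequest(durationSeconds: Double, bitrateKbps: Int): Boolean =
    estimateAudioBytes(durationSeconds, bitrateKbps) <= MAX_UPLOAD_BYTES
```

At 128 kbps one minute is about 0.96 MB, so roughly 26 minutes of audio fit into one request; longer recordings would need a lower bitrate or splitting into parts.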
Subtitle segments: on-client processing
Whisper's verbose_json returns segments with start, end, and text; with word granularity it also returns a list of words with their own start and end. Slice the words into cues of 5–7 words each for readability:
// Android: split into subtitles
// One entry of the "words" array in verbose_json
data class WhisperWord(val word: String, val start: Double, val end: Double)
data class SubtitleCue(val start: Double, val end: Double, val text: String)

fun segmentsToSubtitles(words: List<WhisperWord>, maxWords: Int = 7): List<SubtitleCue> {
    val cues = mutableListOf<SubtitleCue>()
    val chunk = mutableListOf<WhisperWord>()
    for (word in words) {
        chunk.add(word)
        val text = word.word.trim()
        // Close the cue at the word limit or at sentence-ending punctuation
        if (chunk.size >= maxWords || text.endsWith(".") || text.endsWith("!") || text.endsWith("?")) {
            cues.add(SubtitleCue(
                start = chunk.first().start,
                end = chunk.last().end,
                text = chunk.joinToString(" ") { it.word.trim() }
            ))
            chunk.clear()
        }
    }
    // Flush the remaining words as the last cue
    if (chunk.isNotEmpty()) {
        cues.add(SubtitleCue(chunk.first().start, chunk.last().end,
            chunk.joinToString(" ") { it.word.trim() }))
    }
    return cues
}
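Cues in this shape are easy to serialize to SRT, the plain-text subtitle format that most players and FFmpeg accept. A minimal sketch; `formatSrtTime` and `toSrt` are illustrative names, and `SubtitleCue` is repeated so the snippet stands alone:

```kotlin
import java.util.Locale

data class SubtitleCue(val start: Double, val end: Double, val text: String)

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
fun formatSrtTime(seconds: Double): String {
    val totalMs = Math.round(seconds * 1000)
    val h = totalMs / 3_600_000
    val m = totalMs / 60_000 % 60
    val s = totalMs / 1000 % 60
    val ms = totalMs % 1000
    return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", h, m, s, ms)
}

// Each SRT entry is: 1-based index, time range, text, then a blank line.
fun toSrt(cues: List<SubtitleCue>): String =
    cues.withIndex().joinToString("\n") { (i, cue) ->
        "${i + 1}\n${formatSrtTime(cue.start)} --> ${formatSrtTime(cue.end)}\n${cue.text}\n"
    }
```

Writing the returned string to a `.srt` file gives a subtitle track that can be shipped alongside the video or passed to an encoder.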
Render subtitles over video
Two approaches:
Overlay during playback: a UILabel/TextView placed over AVPlayerLayer/ExoPlayer. Update the text on a timer from player.currentTime(). Simple, and it doesn't modify the source file.
Burned-in subtitles: the FFmpeg subtitles filter renders the subtitles into the video pixels, so they are permanently visible in any player:
ffmpeg -i input.mp4 -vf "subtitles=subs.srt:force_style='FontSize=20,PrimaryColour=&HFFFFFF'" output.mp4
Burned-in is better for sharing, overlay is better for editing.
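For the overlay approach, each timer tick has to map the current playback position to the cue that should be on screen. One possible sketch uses a binary search, assuming cues are sorted by start time and do not overlap; `activeCue` is a hypothetical helper, and `SubtitleCue` is repeated so the snippet stands alone:

```kotlin
data class SubtitleCue(val start: Double, val end: Double, val text: String)

// Returns the cue covering `time` (in seconds), or null during gaps.
// Assumes cues are sorted by start time and do not overlap.
fun activeCue(cues: List<SubtitleCue>, time: Double): SubtitleCue? {
    var lo = 0
    var hi = cues.size - 1
    while (lo <= hi) {
        val mid = (lo + hi) ushr 1
        val cue = cues[mid]
        when {
            time < cue.start -> hi = mid - 1   // look in earlier cues
            time >= cue.end -> lo = mid + 1    // look in later cues
            else -> return cue                 // start <= time < end
        }
    }
    return null
}
```

The overlay then shows `activeCue(cues, position)?.text` and hides the label when the result is null.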
Edit corrections
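Whisper makes mistakes. Let the user correct them:
// iOS: editable subtitle cues (SwiftUI view state; `text` must be a `var` on SubtitleCue)
@State private var subtitles: [SubtitleCue] = []
@State private var editingIndex: Int? = nil

// Inside the view's body: show a text field for the cue being edited
if let index = editingIndex, subtitles.indices.contains(index) {
    TextField("Edit subtitle", text: $subtitles[index].text)
}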
Store edits locally or sync them to the backend.
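One way to keep stored edits small is to save only the corrected texts keyed by cue index and reapply them over the original transcription. This is a sketch of that idea, not a prescribed storage format; `applyEdits` is a hypothetical helper, and `SubtitleCue` is repeated so the snippet stands alone:

```kotlin
data class SubtitleCue(val start: Double, val end: Double, val text: String)

// Applies user edits (cue index -> corrected text) on top of the transcription.
// Indices without a matching cue are ignored; timings are left untouched.
fun applyEdits(cues: List<SubtitleCue>, edits: Map<Int, String>): List<SubtitleCue> =
    cues.mapIndexed { i, cue -> edits[i]?.let { cue.copy(text = it) } ?: cue }
```

Keeping the original cues unchanged also makes it possible to offer a "reset to transcription" action.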
Timeline
Whisper API integration + audio extraction + subtitle rendering: 4–6 days. With the on-device option, user editing, and burned-in export: 2–3 weeks.







