# Implementing Speech Recognition from Video Files

Extracting audio from video and transcribing it is a standard task when creating subtitles, indexing video archives, and analyzing webinars. Key point: the quality of the original audio track determines WER more than the choice of model.

### Extraction and Transcription Pipeline

```python
import os
import subprocess
import tempfile
from pathlib import Path

from faster_whisper import WhisperModel


def extract_audio_from_video(video_path: str) -> str:
    """Extract the audio track from a video file via FFmpeg."""
    fd, output_path = tempfile.mkstemp(suffix='.wav')  # mkstemp instead of the deprecated, race-prone mktemp
    os.close(fd)
    cmd = [
        'ffmpeg', '-i', video_path,
        '-vn',                    # drop the video stream
        '-ar', '16000',           # 16 kHz sample rate for ASR
        '-ac', '1',               # mono
        '-acodec', 'pcm_s16le',   # PCM 16-bit
        '-af', 'loudnorm',        # loudness normalization
        output_path,
        '-y', '-loglevel', 'error'
    ]
    subprocess.run(cmd, check=True)
    return output_path
def transcribe_video(video_path: str, model: WhisperModel) -> dict:
    audio_path = extract_audio_from_video(video_path)
    try:
        segments, info = model.transcribe(
            audio_path,
            vad_filter=True,
            word_timestamps=True,
            language="ru"
        )
        return {
            "language": info.language,
            "segments": [
                {
                    "start": seg.start,
                    "end": seg.end,
                    "text": seg.text
                }
                for seg in segments
            ]
        }
    finally:
        Path(audio_path).unlink(missing_ok=True)
```

FFmpeg handles all common containers: MP4, MKV, AVI, MOV, WebM, and FLV. Webinar recordings (Zoom, Teams) often have low-bitrate audio, so we apply the `loudnorm` filter and, optionally, `highpass=f=200` to remove low-frequency noise.

### Multi-track video processing

In video conferences, each participant can have a separate audio track:

```python
# Requires the ffmpeg-python package (pip install ffmpeg-python)
import ffmpeg

# Inspect the container for audio streams
probe = ffmpeg.probe(video_path)
audio_streams = [s for s in probe['streams'] if s['codec_type'] == 'audio']
# Process each track separately for diarization
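# A sketch (the helper name is ours, not a library API): build the FFmpeg
# command that extracts audio track N via the `-map 0:a:N` stream selector,
# then run it with subprocess.run(track_cmd(...), check=True).
def track_cmd(video_path: str, index: int, out_path: str) -> list[str]:
    return [
        'ffmpeg', '-i', video_path,
        '-map', f'0:a:{index}',   # select the N-th audio stream
        '-vn', '-ar', '16000', '-ac', '1', '-acodec', 'pcm_s16le',
        out_path, '-y', '-loglevel', 'error'
    ]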
```

### Subtitle generation

From the transcription result, we automatically generate SRT/VTT:

```python
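# The converter below assumes this helper; a minimal sketch of SRT's
# HH:MM:SS,mmm timestamp format:
def format_timestamp(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"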
def to_srt(segments) -> str:
    lines = []
    for i, seg in enumerate(segments, 1):
        start = format_timestamp(seg['start'])
        end = format_timestamp(seg['end'])
        lines.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(lines)
```

Estimated effort: a basic script takes about a day; a batch system with a queue, 3–4 days.
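The same segments convert to WebVTT with two small changes: a `WEBVTT` header and a dot instead of a comma as the millisecond separator. A minimal sketch (function names are ours):

```python
def format_vtt_timestamp(seconds: float) -> str:
    """WebVTT uses HH:MM:SS.mmm with a dot separator."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def to_vtt(segments) -> str:
    lines = ["WEBVTT", ""]  # mandatory file header, then a blank line
    for seg in segments:
        start = format_vtt_timestamp(seg['start'])
        end = format_vtt_timestamp(seg['end'])
        lines.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(lines)
```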