Microservice for Audio Transcription and Diarization
citeck-stt-sidecar is a microservice for audio transcription and diarization (speaker identification) that runs as a sidecar in the Citeck platform. It processes conference recordings from the citeck-meeting-recorder / citeck-ai pipeline.
How It Works
The service accepts an audio file, converts it to the required format, splits it into chunks for transcription, and, if requested, identifies which participant is speaking in each segment.
Audio Conversion
Any input format (WebM, Opus, etc.) is converted to WAV 16 kHz / mono / PCM16 via ffmpeg. Temporary files created during conversion are deleted in finally blocks, so they are cleaned up even when conversion fails.
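The conversion step can be sketched as follows. This is a minimal illustration, not the service's actual code: it assumes ffmpeg is available on PATH and calls it through subprocess rather than pydub, and the names `build_ffmpeg_command` and `convert_to_wav` are hypothetical.

```python
import os
import subprocess
import tempfile


def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg call that writes 16 kHz / mono / PCM16 WAV."""
    return [
        "ffmpeg", "-y",        # overwrite dst if it already exists
        "-i", src,             # input in any format ffmpeg can decode
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(data: bytes, suffix: str = ".webm") -> bytes:
    """Convert uploaded audio bytes to WAV (hypothetical helper)."""
    src_fd, src_path = tempfile.mkstemp(suffix=suffix)
    dst_fd, dst_path = tempfile.mkstemp(suffix=".wav")
    os.close(dst_fd)  # ffmpeg writes to this path itself
    try:
        with os.fdopen(src_fd, "wb") as f:
            f.write(data)
        subprocess.run(
            build_ffmpeg_command(src_path, dst_path),
            check=True,
            capture_output=True,
        )
        with open(dst_path, "rb") as f:
            return f.read()
    finally:
        # Runs on success and on failure, so no temp files are left behind.
        for path in (src_path, dst_path):
            if os.path.exists(path):
                os.remove(path)
```

Keeping the command construction in its own function makes the ffmpeg flags easy to inspect without running ffmpeg.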
Chunked Transcription
Audio is split into 25-second chunks with a 1-second overlap. This lets the service process long recordings without running out of memory and keeps words from being lost at chunk boundaries. Each chunk is transcribed by GigaAM-v3 (Sber), the primary model for Russian speech recognition. The result is a list of segments with timestamps.
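The chunk boundaries described above can be computed as in the sketch below. The function name `chunk_bounds` and its parameters are hypothetical; only the 25-second chunk length and 1-second overlap come from the description.

```python
def chunk_bounds(duration: float, chunk: float = 25.0,
                 overlap: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks."""
    if chunk <= overlap:
        raise ValueError("chunk length must exceed the overlap")
    step = chunk - overlap  # each chunk starts 24 s after the previous one
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        if end >= duration:
            break  # the last chunk reached the end of the recording
        start += step
    return bounds


print(chunk_bounds(60.0))
# [(0.0, 25.0), (24.0, 49.0), (48.0, 60.0)]
```

Each chunk repeats the last second of the previous one, so a word that straddles a boundary is fully contained in at least one chunk.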
Diarization
Diarization is performed with the pyannote.audio library. The diarization model is loaded lazily, only on the first request to /transcribe-diarize. If the HF_TOKEN variable is not set, diarization is disabled gracefully and the service keeps running in transcription-only mode.
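One way to get this behavior is a small wrapper that defers loading until first use, sketched below. The class `LazyDiarizer` and the injected `loader` callable are hypothetical; in the real service the loader would construct the pyannote.audio pipeline, which is not shown here.

```python
import os
import threading
from typing import Any, Callable, Optional


class LazyDiarizer:
    """Loads the diarization pipeline on first use (hypothetical wrapper)."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader      # assumed to build the pipeline from a token
        self._pipeline: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Optional[Any]:
        """Return the pipeline, or None when HF_TOKEN is not set."""
        token = os.environ.get("HF_TOKEN")
        if not token:
            return None            # transcription-only mode
        with self._lock:           # avoid loading twice under concurrency
            if self._pipeline is None:
                self._pipeline = self._loader(token)
        return self._pipeline
```

A request handler can call `get()` and fall back to plain transcription when it returns `None`.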
Once both transcription and diarization results are available, the service matches them: each text segment is assigned the speaker whose diarization segment has the largest time overlap with it.
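The matching rule can be illustrated with a short sketch. The dictionary keys (`start`, `end`, `text`, `speaker`), the `UNKNOWN` fallback label, and the function names are assumptions made for this example, not the service's actual data model.

```python
def overlap(a_start: float, a_end: float,
            b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(text_segments: list[dict], speaker_turns: list[dict],
                    unknown: str = "UNKNOWN") -> list[dict]:
    """Label each text segment with the speaker of the best-overlapping turn."""
    result = []
    for seg in text_segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in speaker_turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        result.append({**seg, "speaker": best_speaker})
    return result


turns = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 5.0, "end": 10.0, "speaker": "SPEAKER_01"},
]
segments = [
    {"start": 0.0, "end": 4.0, "text": "hello"},
    {"start": 4.0, "end": 9.0, "text": "hi there"},
]
labeled = assign_speakers(segments, turns)
print([s["speaker"] for s in labeled])
# ['SPEAKER_00', 'SPEAKER_01']
```

The second segment overlaps the first turn by 1 second and the second turn by 4 seconds, so it is attributed to the second speaker.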
Processing Flow
citeck-ai (WebSocket handler)
↓
citeck-stt-sidecar (FastAPI :8090)
├── Accept the audio file (multipart, WebM or WAV)
├── Convert to WAV 16 kHz/mono/PCM16 (ffmpeg)
├── Transcribe in 25 s chunks (GigaAM)
├── [optional] Diarization (pyannote)
├── Merge transcription segments with speakers
└── Return JSON: text, segments, number of speakers, duration
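The final step of the flow, assembling the JSON response, might look like the sketch below. The field names (`text`, `segments`, `num_speakers`, `duration`) and the function `build_response` are assumptions for illustration; the actual response schema is not specified here.

```python
import json


def build_response(segments: list[dict], duration: float) -> dict:
    """Assemble a response payload from merged segments (hypothetical schema)."""
    # Count distinct speaker labels; segments without a label are ignored.
    speakers = {s["speaker"] for s in segments if s.get("speaker")}
    full_text = " ".join(
        s["text"].strip() for s in segments if s["text"].strip()
    )
    return {
        "text": full_text,
        "segments": segments,
        "num_speakers": len(speakers),
        "duration": round(duration, 2),
    }


payload = build_response(
    [
        {"start": 0.0, "end": 4.0, "text": "hello", "speaker": "SPEAKER_00"},
        {"start": 4.0, "end": 9.0, "text": "hi there", "speaker": "SPEAKER_01"},
    ],
    duration=9.0,
)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

In transcription-only mode the segments carry no speaker labels, so the speaker count in this sketch is 0.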
API
| Endpoint | Method | Description |
|---|---|---|
| | | Service status and model availability |
| | | Audio file transcription (text only, without diarization) |
| | | Transcription + speaker identification |
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| | — | HuggingFace token (required for diarization via pyannote) |
| | | Path to the GigaAM model cache (mounted as a volume from citeck-ai) |
| | | Service port |
Technology Stack
FastAPI + Uvicorn — asynchronous REST API (port 8090)
GigaAM-v3 (Sber) — Russian speech recognition
pyannote.audio — diarization
pydub + ffmpeg — audio conversion (WebM/Opus → WAV)
Python 3.11, deployed in Docker
Key Files
| File | Purpose |
|---|---|
| | FastAPI entry point, model lifecycle management, merging transcription and diarization results |
| | |
| | |
| | Converts any format to WAV 16 kHz/mono via ffmpeg, cleans up temporary files |
| | Python 3.11-slim, 2 GB memory limit, models are mounted as a volume from citeck-ai |