Microservice for Audio Transcription and Diarization 

The Citeck STT sidecar (citeck-stt-sidecar) is a microservice for audio transcription and diarization (speaker identification), integrated as a sidecar into the Citeck platform. It processes conference recordings from the citeck-meeting-recorder / citeck-ai pipeline.

How It Works 

The service accepts an audio file, converts it to the required format, splits it into chunks for transcription, and, if requested, identifies which participant is speaking in each segment.

Audio Conversion 

Any supported input format (WebM, Opus, etc.) is converted to 16 kHz mono 16-bit PCM WAV via ffmpeg. Temporary files created during conversion are deleted in finally blocks, so they are cleaned up even when conversion fails.
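The conversion step could look roughly like the sketch below. It is a minimal illustration, not the service's actual code: the names `build_ffmpeg_command` and `convert_to_wav` are hypothetical, and the sketch calls the ffmpeg CLI through `subprocess`, whereas the real `app/audio_utils.py` may go through pydub instead.

```python
import os
import subprocess
import tempfile
from typing import List


def build_ffmpeg_command(src_path: str, dst_path: str) -> List[str]:
    """Build an ffmpeg call producing 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src_path,        # input in any format ffmpeg can decode
        "-ar", "16000",        # sample rate: 16 kHz
        "-ac", "1",            # channels: mono
        "-c:a", "pcm_s16le",   # codec: signed 16-bit little-endian PCM
        dst_path,
    ]


def convert_to_wav(data: bytes, suffix: str = ".webm") -> bytes:
    """Convert raw audio bytes to WAV; temp files are removed in `finally`."""
    fd, src_path = tempfile.mkstemp(suffix=suffix)
    dst_path = src_path + ".wav"
    try:
        with os.fdopen(fd, "wb") as src:
            src.write(data)
        subprocess.run(
            build_ffmpeg_command(src_path, dst_path),
            check=True,            # raise CalledProcessError on failure
            capture_output=True,   # keep ffmpeg logs out of stdout
        )
        with open(dst_path, "rb") as dst:
            return dst.read()
    finally:
        # Runs on success and on failure alike.
        for path in (src_path, dst_path):
            if os.path.exists(path):
                os.remove(path)
```

Keeping the command builder separate from the I/O makes the ffmpeg flags easy to check without running ffmpeg.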

Chunked Transcription 

Audio is split into 25-second chunks with a 1-second overlap. This allows long recordings to be processed without exhausting memory and prevents words from being lost at chunk boundaries. Each chunk is transcribed by GigaAM-v3 (Sber), the primary model used for Russian speech recognition. The result is a list of segments with timestamps.
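The chunk boundaries can be illustrated with a short sketch. It assumes that "1-second overlap" means each chunk starts 24 seconds after the previous one. The function name `chunk_bounds` is hypothetical, and the actual `GigaAmTranscriber` may compute boundaries differently.

```python
from typing import List, Tuple


def chunk_bounds(
    duration: float, chunk: float = 25.0, overlap: float = 1.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks."""
    if chunk <= overlap:
        raise ValueError("chunk length must exceed overlap")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        # Step forward by less than a full chunk so neighbours overlap.
        start += chunk - overlap
    return bounds


print(chunk_bounds(60.0))
# [(0.0, 25.0), (24.0, 49.0), (48.0, 60.0)]
```

Because only one chunk is in memory at a time, peak memory depends on the chunk length rather than on the length of the recording.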

Diarization 

Diarization is performed by the pyannote.audio library. The pipeline is loaded lazily, only on the first request to /transcribe-diarize. If the HF_TOKEN environment variable is not set, diarization is disabled gracefully and the service continues running in transcription-only mode.
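A minimal sketch of the lazy-loading pattern is shown below. The class `LazyDiarizer` and its `loader` argument are hypothetical. In the real service the loader would construct the pyannote pipeline; that call is deliberately left out here, so the sketch shows only the load-on-first-use and missing-token behaviour.

```python
import os
import threading
from typing import Any, Callable, Optional


class LazyDiarizer:
    """Loads a diarization pipeline on first use, or never if no token."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader          # hypothetical: builds the pipeline
        self._pipeline: Optional[Any] = None
        self._lock = threading.Lock()  # guard against concurrent first calls

    def get_pipeline(self) -> Optional[Any]:
        token = os.environ.get("HF_TOKEN")
        if not token:
            # No token: diarization is unavailable, transcription still works.
            return None
        with self._lock:
            if self._pipeline is None:
                self._pipeline = self._loader(token)
        return self._pipeline
```

A request handler could call `get_pipeline()` and fall back to plain transcription when it returns `None`.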

After receiving transcription and diarization results, the service matches segments: for each text segment, the speaker is determined by the maximum overlap with the diarization segments.
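The matching rule can be sketched as follows. This is an illustration under assumptions: the dictionary keys (`start`, `end`, `text`, `speaker`), the `UNKNOWN` fallback label, and the function names are hypothetical. The sketch picks the single diarization segment with the largest overlap, while the real code in `app/main.py` might instead sum overlap per speaker.

```python
from typing import Dict, List


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Dict], turns: List[Dict], unknown: str = "UNKNOWN"
) -> List[Dict]:
    """Label each text segment with the speaker of the best-overlapping turn."""
    result = []
    for seg in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        result.append({**seg, "speaker": best_speaker})
    return result


segments = [
    {"start": 0.0, "end": 4.0, "text": "hello"},
    {"start": 4.0, "end": 9.0, "text": "hi there"},
]
turns = [
    {"start": 0.0, "end": 3.5, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 10.0, "speaker": "SPEAKER_01"},
]
print([s["speaker"] for s in assign_speakers(segments, turns)])
# ['SPEAKER_00', 'SPEAKER_01']
```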

Processing Flow 

citeck-ai (WebSocket handler)
    ↓
citeck-stt-sidecar (FastAPI :8090)
    ├── Accept audio file (multipart, WebM or WAV)
    ├── Convert to WAV 16 kHz/mono/PCM16 (ffmpeg)
    ├── Transcribe in 25 s chunks (GigaAM)
    ├── [optional] Diarization (pyannote)
    ├── Merge transcription segments with speakers
    └── Return JSON: text, segments, speaker count, duration

API 

Endpoint	Method	Description
`/health`	`GET`	Service status and model availability
`/transcribe`	`POST`	Audio file transcription (text only, without diarization)
`/transcribe-diarize`	`POST`	Transcription + speaker identification
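As a usage illustration, a standard-library client for the `/health` endpoint might look like the sketch below. The base URL `http://localhost:8090` and the helper names are assumptions, as is the expectation that `/health` responds with a JSON body. The response fields are not documented here, so the sketch returns the parsed body as-is.

```python
import json
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://localhost:8090"  # assumption: service reachable locally


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /health and return the parsed JSON body."""
    url = endpoint("/health", base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```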

Environment Variables 

Variable	Default	Purpose
`HF_TOKEN`	—	HuggingFace token (required for diarization via pyannote)
`GIGAAM_MODELS_PATH`	`../citeck-ai/models`	Path to the GigaAM model cache (mounted as a volume from citeck-ai)
`PORT` / `STT_PORT`	`8090`	Service port
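Reading these variables could be sketched as below. The `Settings` class and `load_settings` function are hypothetical, and the precedence of `STT_PORT` over `PORT` is an assumption, since the table does not say which one wins when both are set.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    models_path: str
    port: int

    @property
    def diarization_enabled(self) -> bool:
        # Diarization needs a HuggingFace token.
        return self.hf_token is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # Assumption: STT_PORT takes precedence over PORT.
    port = env.get("STT_PORT") or env.get("PORT") or "8090"
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # empty string counts as unset
        models_path=env.get("GIGAAM_MODELS_PATH", "../citeck-ai/models"),
        port=int(port),
    )
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary.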

Technology Stack 

FastAPI + Uvicorn — asynchronous REST API (port 8090)
GigaAM-v3 (Sber) — Russian speech recognition
pyannote.audio — diarization
pydub + ffmpeg — audio conversion (WebM/Opus → WAV)
Python 3.11, deployed in Docker

Key Files 

File	Purpose
`app/main.py`	FastAPI entry point, model lifecycle management, merging transcription and diarization results
`app/transcriber.py`	`GigaAmTranscriber`: chunked audio processing (25s with 1s overlap), returns segments with timestamps
`app/diarizer.py`	`SpeakerDiarizer`: lazy loading of the pyannote pipeline on first request (requires `HF_TOKEN`)
`app/audio_utils.py`	Converts any format to WAV 16 kHz/mono via ffmpeg, cleans up temporary files
`Dockerfile` / `docker-compose.yml`	Python 3.11-slim, 2 GB memory limit, models are mounted as a volume from citeck-ai