Microservice for Audio Transcription and Diarization

citeck-stt-sidecar is a microservice for audio transcription and diarization (speaker identification), integrated as a sidecar into the Citeck platform. It processes conference recordings from the citeck-meeting-recorder / citeck-ai pipeline.

How It Works

The service accepts an audio file, converts it to the required format, splits it into chunks for transcription, and, when diarization is requested, identifies which participant is speaking in each segment.

Audio Conversion

Any input format (WebM, Opus, etc.) is converted to WAV 16 kHz / mono / PCM16 via ffmpeg. Temporary files created during conversion are deleted in finally blocks, so they are cleaned up even when conversion fails.
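A minimal sketch of this step, assuming ffmpeg is on PATH and is called directly through subprocess (the service itself appears to go through pydub). The function names and the injectable `run` parameter are hypothetical and exist only to make the cleanup behaviour easy to exercise.

```python
import os
import subprocess
import tempfile


def build_ffmpeg_command(src_path: str, dst_path: str) -> list[str]:
    """Return an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src_path,        # input in any format ffmpeg can decode
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # signed 16-bit little-endian PCM
        dst_path,
    ]


def convert_to_wav_bytes(src_path: str, run=subprocess.run) -> bytes:
    """Convert src_path to WAV and return its bytes (hypothetical helper)."""
    fd, tmp_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)  # ffmpeg writes to the path itself, not to this descriptor
    try:
        run(build_ffmpeg_command(src_path, tmp_path), check=True, capture_output=True)
        with open(tmp_path, "rb") as fh:
            return fh.read()
    finally:
        # Runs on success and on failure, so the temp file never outlives the call.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Returning bytes keeps the temporary file entirely inside the function. A variant that hands the path to the transcriber would move the finally block to the caller instead.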

Chunked Transcription

Audio is split into 25-second chunks with a 1-second overlap. Chunking lets long recordings be processed without exhausting memory, and the overlap prevents words from being lost at chunk boundaries. Each chunk is transcribed by GigaAM-v3 (Sber), the primary Russian speech recognition model. The result is a list of segments with timestamps.
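The chunk arithmetic can be sketched as follows. With a 25 s window and a 1 s overlap, each new chunk starts 24 s after the previous one. The function name is hypothetical, and how the service reconciles text duplicated inside the overlap is not covered here.

```python
def chunk_bounds(duration: float, chunk: float = 25.0, overlap: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for overlapping chunks."""
    if chunk <= overlap:
        raise ValueError("chunk must be longer than overlap")
    step = chunk - overlap  # distance between consecutive chunk starts
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)  # the last chunk may be shorter
        bounds.append((start, end))
        if end >= duration:
            break
        start += step
    return bounds


print(chunk_bounds(60.0))
# [(0.0, 25.0), (24.0, 49.0), (48.0, 60.0)]
```

Segment timestamps produced inside a chunk would then be shifted by that chunk's start time to place them on the timeline of the full recording.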

Diarization

Diarization is performed by the pyannote.audio library. Its pipeline is loaded lazily, only on the first request to /transcribe-diarize. If the HF_TOKEN variable is not set, diarization is disabled gracefully, and the service continues running in transcription-only mode.
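One way to express lazy loading with graceful degradation is sketched below. The `LazyDiarizer` class and its `loader` callable are hypothetical: in the real service the loader would presumably wrap the pyannote pipeline construction, which is left out here so the sketch does not depend on a specific pyannote.audio version.

```python
from typing import Any, Callable, Optional


class LazyDiarizer:
    """Defers the expensive pipeline load until it is first needed."""

    def __init__(self, token: Optional[str], loader: Callable[[str], Any]):
        self._token = token
        self._loader = loader      # hypothetical: builds the pipeline from a token
        self._pipeline: Optional[Any] = None

    @property
    def available(self) -> bool:
        # Without a token diarization is off, but the service keeps running.
        return bool(self._token)

    def get_pipeline(self) -> Optional[Any]:
        if not self.available:
            return None
        if self._pipeline is None:
            # The first call pays the loading cost; later calls reuse the instance.
            self._pipeline = self._loader(self._token)
        return self._pipeline
```

A handler for /transcribe-diarize could call `get_pipeline()` and fall back to plain transcription when it returns None. A /health handler could report `available` without triggering the load.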

After receiving transcription and diarization results, the service matches segments: for each text segment, the speaker is determined by the maximum overlap with the diarization segments.
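The matching rule can be sketched as below, assuming both inputs are lists of dicts with `start` and `end` in seconds (these field names and the `UNKNOWN` fallback are assumptions). This version picks the single diarization segment with the largest overlap. Summing overlap per speaker is a close alternative, and the source does not say which one the service uses.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(text_segments: list[dict], speaker_turns: list[dict], unknown: str = "UNKNOWN") -> list[dict]:
    """Label each text segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in text_segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn in speaker_turns:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


segments = [
    {"start": 0.0, "end": 5.0, "text": "hello"},
    {"start": 5.0, "end": 9.0, "text": "hi there"},
]
turns = [
    {"start": 0.0, "end": 4.5, "speaker": "SPEAKER_00"},
    {"start": 4.5, "end": 10.0, "speaker": "SPEAKER_01"},
]
print([s["speaker"] for s in assign_speakers(segments, turns)])
# ['SPEAKER_00', 'SPEAKER_01']
```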

Processing Flow

citeck-ai (WebSocket handler)
    ↓
citeck-stt-sidecar (FastAPI :8090)
    ├── Accept the audio file (multipart, WebM or WAV)
    ├── Convert to WAV 16 kHz/mono/PCM16 (ffmpeg)
    ├── Transcribe in 25 s chunks (GigaAM)
    ├── [optional] Diarization (pyannote)
    ├── Merge transcription segments with speakers
    └── Return JSON: text, segments, speaker count, duration
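The last step of the flow could assemble its JSON payload roughly as follows. The key names (`text`, `segments`, `num_speakers`, `duration`) are assumptions for illustration. The source lists only which pieces of information are returned, not how they are named.

```python
def build_response(segments: list[dict], duration: float) -> dict:
    """Assemble a JSON-serializable result from labeled segments (hypothetical shape)."""
    # Segments without a "speaker" key (transcription-only mode) count as zero speakers.
    speakers = {s["speaker"] for s in segments if s.get("speaker")}
    texts = [s["text"].strip() for s in segments if s["text"].strip()]
    return {
        "text": " ".join(texts),
        "segments": segments,
        "num_speakers": len(speakers),
        "duration": round(duration, 2),
    }
```

In transcription-only mode the same function works unchanged, with the speaker count coming out as 0.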

API

Endpoint            | Method | Description
--------------------|--------|------------------------------------------------------------
/health             | GET    | Service status and model availability
/transcribe         | POST   | Audio file transcription (text only, without diarization)
/transcribe-diarize | POST   | Transcription + speaker identification
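A client call can be sketched with the standard library alone. The multipart form field name `file` and the default `audio/webm` content type are assumptions, so check the FastAPI handler signature before relying on them.

```python
import urllib.request
import uuid


def build_transcribe_request(
    base_url: str,
    filename: str,
    audio: bytes,
    endpoint: str = "/transcribe-diarize",
    content_type: str = "audio/webm",
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one audio file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # "file" is an assumed field name, not confirmed by the source.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        base_url.rstrip("/") + endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned request to `urllib.request.urlopen` sends it, and the response body can then be parsed with `json.load`. Setting `endpoint="/transcribe"` targets the text-only route.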

Environment Variables

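Variable           | Default             | Purpose
-------------------|---------------------|--------------------------------------------------------------------
HF_TOKEN           | (none)              | HuggingFace token (required for diarization via pyannote)
GIGAAM_MODELS_PATH | ../citeck-ai/models | Path to the GigaAM model cache (mounted as a volume from citeck-ai)
PORT / STT_PORT    | 8090                | Service port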
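Reading these variables might look like the sketch below. The `Settings` class and `load_settings` function are hypothetical, and the precedence of STT_PORT over PORT is an assumption, since the source does not say which one wins when both are set.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]   # None means diarization is disabled
    models_path: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read service configuration, falling back to the documented defaults."""
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,  # treat an empty string as unset
        models_path=env.get("GIGAAM_MODELS_PATH", "../citeck-ai/models"),
        # Assumed precedence: STT_PORT, then PORT, then the default 8090.
        port=int(env.get("STT_PORT") or env.get("PORT") or 8090),
    )
```

Accepting the environment as a mapping makes the function easy to check with a plain dict, for example `load_settings({"PORT": "9000"}).port`.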

Technology Stack

  • FastAPI + Uvicorn — asynchronous REST API (port 8090)

  • GigaAM-v3 (Sber) — Russian speech recognition

  • pyannote.audio — diarization

  • pydub + ffmpeg — audio conversion (WebM/Opus → WAV)

  • Python 3.11, deployed in Docker

Key Files

File                            | Purpose
--------------------------------|------------------------------------------------------------------------------------------------
app/main.py                     | FastAPI entry point, model lifecycle management, merging transcription and diarization results
app/transcriber.py              | GigaAmTranscriber: chunked audio processing (25s with 1s overlap), returns segments with timestamps
app/diarizer.py                 | SpeakerDiarizer: lazy loading of the pyannote pipeline on first request (requires HF_TOKEN)
app/audio_utils.py              | Converts any format to WAV 16 kHz/mono via ffmpeg, cleans up temporary files
Dockerfile / docker-compose.yml | Python 3.11-slim, 2 GB memory limit, models are mounted as a volume from citeck-ai