I’m struggling with inconsistent output from Gemini 2.5 Pro when transcribing long multi-speaker audio files.
I am trying to transcribe a long audio recording (about 4 hours) using Gemini 2.5 Pro. The language is fairly uncommon; Gemini handles it, while Whisper does not. My approach is the following (a rough sketch of the pipeline follows the list):
(1) Remove non-speech segments using SpeechBrain’s VAD model
(2) Identify speaker segments using pyannote/speaker-diarization-3.1
(3) Merge consecutive speaker segments into groups of at most 30 minutes
(4) Pass the resulting 30-minute audio files to Gemini 2.5 Pro
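Roughly, steps (1)-(3) fit together like this (simplified; the file paths, the Hugging Face token, and the chunking logic are just illustrative):

```python
from speechbrain.inference.VAD import VAD
from pyannote.audio import Pipeline

AUDIO = "full_recording.wav"

# (1) Detect speech regions; non-speech parts are cut out before diarization.
vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty")
speech_boundaries = vad.get_speech_segments(AUDIO)  # [start, end] pairs in seconds

# (2) Speaker diarization (run on the original file here for brevity).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"
)
diarization = diarizer(AUDIO)
segments = [
    {"start": turn.start, "end": turn.end, "speaker": speaker}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

# (3) Merge consecutive segments into chunks of at most 30 minutes.
MAX_CHUNK_SECONDS = 30 * 60
chunks, current = [], []
for seg in segments:
    if current and seg["end"] - current[0]["start"] > MAX_CHUNK_SECONDS:
        chunks.append(current)
        current = []
    current.append(seg)
if current:
    chunks.append(current)
# Each chunk is then exported as its own ~30-minute audio file for step (4).
```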
Since each 30-minute chunk contains multiple speakers and I already have pre-computed timings for each speaker segment, I need Gemini's output to line up with those timings.
To achieve this, I have tried two things (a sketch of variant (1) follows the list):
(1) Pass the whole 30-minute audio file along with the speaker timestamps (a list of start_time=MM:SS and end_time=MM:SS pairs) to Gemini and ask it to transcribe each subsegment.
(2) Split the 30-minute audio into one file per speaker segment and ask the model to transcribe each file individually.
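For variant (1), the call looks roughly like this (using the google-genai SDK; the helper names, prompt wording, and placeholder API key are just illustrative):

```python
from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")  # placeholder

def format_segments(segments):
    # segments: [{"start": "00:12", "end": "01:05"}, ...] for one 30-minute chunk
    return "\n".join(
        f"{i + 1}. start_time={s['start']} end_time={s['end']}"
        for i, s in enumerate(segments)
    )

def transcribe_chunk(chunk_path, segments):
    audio = client.files.upload(file=chunk_path)
    prompt = (
        "Transcribe each of the following segments of the attached audio.\n"
        "Respond with a JSON array containing exactly one transcribed_text string "
        "per segment, in the order listed.\n"
        f"Segments:\n{format_segments(segments)}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[audio, prompt],
    )
    return response.text
```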
In both scenarios I ask the model to respond with a JSON array of transcribed_text entries, and the array needs to have exactly as many entries as there are speaker segments.
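This is the shape I expect back, and roughly how I check it (the helper is illustrative):

```python
import json

def parse_response(response_text, segments):
    # Expect a JSON array with exactly one transcription per speaker segment.
    transcripts = json.loads(response_text)
    if len(transcripts) != len(segments):
        raise ValueError(
            f"expected {len(segments)} transcriptions, got {len(transcripts)}"
        )
    return transcripts
```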
I am facing the following problems:
- The model often returns fewer transcriptions than there are speaker segments in the audio
- The model returns incomplete JSON (it starts hallucinating after a while)
What is the best approach to take here?