I’m struggling with inconsistent output from Gemini 2.5 Pro when transcribing long multi-speaker audio files.
I am trying to transcribe a long audio recording (about 4 hours) using Gemini 2.5 Pro. The language is fairly uncommon; Gemini handles it, while Whisper does not. My approach is the following (a rough sketch of the pipeline follows the list):
(1) Remove non-speech segments using SpeechBrain’s VAD model
(2) Identify speaker segments using pyannote/speaker-diarization-3.1
(3) Merge consecutive speaker segments into groups of at most 30 minutes
(4) Pass the resulting 30-minute audio files to Gemini 2.5 Pro
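Roughly, steps (1)-(3) fit together like this (simplified; the file paths, the Hugging Face token, and the chunking logic are just illustrative):

```python
from speechbrain.inference.VAD import VAD
from pyannote.audio import Pipeline

AUDIO = "full_recording.wav"

# (1) Detect speech regions; non-speech parts are cut out before diarization.
vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty")
speech_boundaries = vad.get_speech_segments(AUDIO)  # [start, end] pairs in seconds

# (2) Speaker diarization (run on the original file here for brevity).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"
)
diarization = diarizer(AUDIO)
segments = [
    {"start": turn.start, "end": turn.end, "speaker": speaker}
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

# (3) Merge consecutive segments into chunks of at most 30 minutes.
MAX_CHUNK_SECONDS = 30 * 60
chunks, current = [], []
for seg in segments:
    if current and seg["end"] - current[0]["start"] > MAX_CHUNK_SECONDS:
        chunks.append(current)
        current = []
    current.append(seg)
if current:
    chunks.append(current)
# Each chunk is then exported as its own ~30-minute audio file for step (4).
```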
Since each 30-minute chunk contains multiple speakers and I already have pre-computed timings for each speaker segment, I need Gemini's output to line up with those timings.
To achieve this, I have tried two things (a sketch of variant (1) follows the list):
(1) Pass the whole 30-minute audio file along with the speaker timestamps (a list of start_time=MM:SS and end_time=MM:SS pairs) to Gemini and ask it to transcribe each subsegment.
(2) Split the 30-minute audio into one file per speaker segment and ask the model to transcribe each file individually.
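For variant (1), the call looks roughly like this (using the google-genai SDK; the helper names, prompt wording, and placeholder API key are just illustrative):

```python
from google import genai

client = genai.Client(api_key="GEMINI_API_KEY")  # placeholder

def format_segments(segments):
    # segments: [{"start": "00:12", "end": "01:05"}, ...] for one 30-minute chunk
    return "\n".join(
        f"{i + 1}. start_time={s['start']} end_time={s['end']}"
        for i, s in enumerate(segments)
    )

def transcribe_chunk(chunk_path, segments):
    audio = client.files.upload(file=chunk_path)
    prompt = (
        "Transcribe each of the following segments of the attached audio.\n"
        "Respond with a JSON array containing exactly one transcribed_text string "
        "per segment, in the order listed.\n"
        f"Segments:\n{format_segments(segments)}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[audio, prompt],
    )
    return response.text
```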
In both scenarios I ask the model to respond with a JSON array of transcribed_text entries, and the array needs to have exactly as many entries as there are speaker segments.
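This is the shape I expect back, and roughly how I check it (the helper is illustrative):

```python
import json

def parse_response(response_text, segments):
    # Expect a JSON array with exactly one transcription per speaker segment.
    transcripts = json.loads(response_text)
    if len(transcripts) != len(segments):
        raise ValueError(
            f"expected {len(segments)} transcriptions, got {len(transcripts)}"
        )
    return transcripts
```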
I am facing the following problems:
- The model often returns fewer transcriptions than there are speaker segments in the audio
- The model returns incomplete JSON (it starts hallucinating after a while)
What is the best approach to take here?