How to get consistent Multi-Speaker Transcription output from Gemini 2.5 Pro?

I’m struggling with inconsistent output from Gemini 2.5 Pro when transcribing long multi-speaker audio files.

I am trying to transcribe a long audio file (4 h) using Gemini 2.5 Pro. The language is an uncommon one: Gemini handles it, but Whisper doesn’t. My approach is the following:
(1) Remove non-speech segments using SpeechBrain’s VAD model
(2) Identify speaker segments using pyannote/speaker-diarization-3.1
(3) Merge consecutive speaker segments into groups of at most 30 minutes
(4) Pass each 30-minute audio file to Gemini 2.5 Pro
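
For readers who want to reproduce step (3), here is a minimal sketch of how the grouping could look. It is not my exact code: the `Segment` class and both function names are made up for illustration, and it assumes the diarization output has already been converted to (speaker, start, end) values in seconds.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one diarized span (times in seconds)."""
    speaker: str
    start: float
    end: float


def merge_consecutive(segments):
    """Merge neighbouring segments that belong to the same speaker."""
    merged = []
    for seg in segments:
        if merged and merged[-1].speaker == seg.speaker:
            # Same speaker keeps talking: extend the previous span.
            merged[-1] = Segment(seg.speaker, merged[-1].start, seg.end)
        else:
            merged.append(Segment(seg.speaker, seg.start, seg.end))
    return merged


def group_by_duration(segments, max_seconds=30 * 60):
    """Pack segments, in order, into groups of at most max_seconds of speech.

    A single segment longer than max_seconds ends up alone in its own group.
    """
    groups, current, total = [], [], 0.0
    for seg in segments:
        duration = seg.end - seg.start
        if current and total + duration > max_seconds:
            groups.append(current)
            current, total = [], 0.0
        current.append(seg)
        total += duration
    if current:
        groups.append(current)
    return groups


segments = [
    Segment("A", 0, 600),
    Segment("A", 600, 900),
    Segment("B", 900, 1500),
    Segment("A", 1500, 2400),
]
merged = merge_consecutive(segments)
groups = group_by_duration(merged)
```

Each group is then cut out of the original recording and sent as one request.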

Since each 30-minute audio file contains multiple speakers, and we have pre-computed timings for each speaker, I need Gemini’s output to match these timings.
To achieve this, I have tried two things:
(1) Pass the whole 30-minute audio file along with the speaker timestamps (a list of start_time=MM:SS and end_time=MM:SS) to the Gemini model and ask it to transcribe each subsegment.
(2) Split the 30-minute audio file into individual files for each speaker and ask the model to transcribe each file individually.
In both scenarios I ask the model to respond with a JSON array of transcribed_text whose length must equal the number of speakers.
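
To make approach (1) concrete, this is roughly how the timestamp list and the length check on the reply could be written. It is only a sketch: the prompt wording, the helper names, and the choice to expect one string per timed segment are assumptions for illustration, and the actual Gemini call is left out.

```python
import json


def mmss(seconds):
    """Format a time offset in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(segments):
    """Build the instruction text from a list of (start, end) pairs in seconds."""
    lines = [
        f"{i + 1}. start_time={mmss(start)} end_time={mmss(end)}"
        for i, (start, end) in enumerate(segments)
    ]
    return (
        f"Transcribe each of the {len(segments)} segments listed below.\n"
        f"Respond with a JSON array of exactly {len(segments)} strings, "
        "one transcribed_text per segment, in the same order.\n"
        + "\n".join(lines)
    )


def parse_transcripts(raw, expected_count):
    """Return the list of transcripts, or None if the reply is unusable.

    A reply is rejected when it is not valid JSON (for example, truncated),
    is not a list of strings, or has the wrong number of entries.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list) or len(data) != expected_count:
        return None
    if not all(isinstance(item, str) for item in data):
        return None
    return data


prompt = build_prompt([(0, 75), (75, 140)])
```

A `None` result is what I currently treat as a failed request; both problems listed below show up as `None` here.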

I am facing the following problems:

  • The model often returns fewer transcriptions than there are speakers in the audio
  • The model returns incomplete JSON (it starts hallucinating after a while)

What is the best approach to take here?

@DEDI
Welcome to the community,

Have you tried reducing the temperature in the Gemini call?
Also, in your case, could you specify the number of speakers in the audio in the prompt and ask Gemini to give an individual transcription for each?

An alternative approach is to split the audio into smaller chunks, for example 6-7 sub-samples of about 5 minutes each (with some overlap), ask Gemini to transcribe each one into plain text, and then do a second pass over the combined text to correct the transcriptions and apply any final formatting (such as JSON).
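
A rough sketch of how the overlapping chunk boundaries could be computed. The function name and the 15-second overlap are just assumptions for illustration, so tune them to your audio.

```python
def chunk_windows(total_seconds, chunk_seconds=300.0, overlap_seconds=15.0):
    """Return (start, end) windows covering the audio with a fixed overlap."""
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    step = chunk_seconds - overlap_seconds
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the audio
        start += step
    return windows


# A 30-minute file with 5-minute chunks and 15 seconds of overlap.
windows = chunk_windows(30 * 60)
```

With these values a 30-minute file gives 7 windows, the last one shorter than the rest. The overlap means a sentence cut at one boundary appears whole in the next chunk, which the second pass can use to stitch the text together.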

I think this might give you better results.