Name	LocalWhisperTranscriber
Folder path	/audio/
Most common position in a pipeline	As the first component in an indexing pipeline
Mandatory input variables	“sources”: A list of paths or binary streams that you want to transcribe
Output variables	“documents”: A list of documents

Overview

The component needs to know which Whisper model to use. Specify it in the model parameter when you initialize the component, for example model="small".

See other optional parameters you can specify in our API documentation.

See the Whisper API documentation and the official Whisper GitHub repo for the supported audio formats and languages.

To work with the LocalWhisperTranscriber, install torch and Whisper first with the following commands:

pip install "transformers[torch]"
pip install -U openai-whisper
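
After installing, it can help to confirm that both packages are importable before you construct the component. The following is a minimal sketch that uses only the standard library. missing_modules is a hypothetical helper, and the sketch assumes the packages are importable under the names torch and whisper.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names: PyTorch as "torch", openai-whisper as "whisper".
missing = missing_modules(["torch", "whisper"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("torch and whisper are importable")
```

If the check reports a missing module, rerun the corresponding pip command in the same environment that runs your pipeline.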

Usage

On its own

Here’s an example of how to use LocalWhisperTranscriber on its own:

from haystack.components.audio import LocalWhisperTranscriber

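whisper = LocalWhisperTranscriber(model="small")
# Load the Whisper model into memory before the first run
whisper.warm_up()
# The result is a dictionary with a "documents" key
transcription = whisper.run(sources=["path/to/audio/file"])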
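
The run result holds the transcriptions under the "documents" key. As a rough sketch of how you might read the text back out, the hypothetical helper below collects the content of each document. So that it runs without a Whisper model, it uses simple stand-in objects in place of real Haystack documents. In practice you would pass it the dictionary returned by run.

```python
from types import SimpleNamespace


def transcript_texts(result):
    """Collect the text of every document in a transcriber result."""
    return [doc.content for doc in result["documents"]]


# Stand-in for the dictionary returned by run(); a real result
# holds Haystack Document objects rather than SimpleNamespace.
example_result = {
    "documents": [
        SimpleNamespace(content="Hello world."),
        SimpleNamespace(content="Second file."),
    ]
}

for text in transcript_texts(example_result):
    print(text)
```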

In a pipeline

This example shows an indexing pipeline that takes audio files, transcribes them, and stores the text as documents in a document store. The directory passed to Path (“.”, the current directory, in this example) must contain only audio files, because every entry in it is passed to the transcriber.

from pathlib import Path
from haystack import Pipeline
from haystack.components.audio import LocalWhisperTranscriber
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=LocalWhisperTranscriber(model="small"), name="transcriber")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=DocumentSplitter(split_by="sentence", split_length=10), name="splitter"
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

p.connect("transcriber.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

p.run({"transcriber": {"sources": list(Path(".").iterdir())}})
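
If the directory might also contain other files, one option is to filter it before building the list of sources. The sketch below is an assumption-laden illustration: list_audio_files is a hypothetical helper, and the extension set is only an example, not the list of formats Whisper supports.

```python
from pathlib import Path

# Illustrative extension list (an assumption); adjust it to the formats you use.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def list_audio_files(directory, extensions=AUDIO_EXTENSIONS):
    """Return the sorted files in directory whose suffix is in extensions."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


sources = list_audio_files(".")
print(f"Found {len(sources)} audio file(s)")
```

The resulting list could then replace list(Path(".").iterdir()) in the sources entry passed to p.run.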