
Generates captions for images.

Module transformers


class TransformersImageToText(BaseImageToText)

A transformer-based model that generates captions for images using Hugging Face's transformers framework.


   image_to_text = TransformersImageToText()

   image_file_paths = ["/path/to/images/apple.jpg",
                       "/path/to/images/cat.jpg"]

   # Generate captions
   documents = image_to_text.generate_captions(image_file_paths=image_file_paths)

   # Show results (List of Documents, containing caption and image file path)
   print(documents)

   # [
   #     {
   #         "content": "a red apple is sitting on a pile of hay",
   #         "meta": {
   #             "image_path": "/path/to/images/apple.jpg",
   #         },
   #     },
   #     ...
   # ]

def __init__(model_name_or_path: str = "Salesforce/blip-image-captioning-base",
             model_version: Optional[str] = None,
             generation_kwargs: Optional[dict] = None,
             use_gpu: bool = True,
             batch_size: int = 16,
             progress_bar: bool = True,
             use_auth_token: Optional[Union[str, bool]] = None,
             devices: Optional[List[Union[str, "torch.device"]]] = None)

Load an Image-to-Text model from transformers.


  • model_name_or_path: Directory of a saved model or the name of a public model. To find these models:
  1. Visit Hugging Face image-to-text models.
  2. Open the model you want to check.
  3. On the model page, go to the "Files and Versions" tab.
  4. Open the config.json file and make sure the architectures field contains VisionEncoderDecoderModel, BlipForConditionalGeneration, or Blip2ForConditionalGeneration.
  • model_version: The version of the model to use from the Hugging Face model hub. This can be the tag name, branch name, or commit hash.
  • generation_kwargs: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in Hugging Face documentation.
  • use_gpu: Whether to use GPU (if available).
  • batch_size: Number of images to process at a time.
  • progress_bar: Whether to show a progress bar.
  • use_auth_token: The API token used to download private models from Hugging Face. If set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For more information, see from_pretrained() in Hugging Face documentation.
  • devices: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects or strings is supported (for example [torch.device('cuda:0'), "mps", "cuda:1"]). If you set use_gpu=False, the devices parameter is not used and a single CPU device is used for inference.
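The generation_kwargs dictionary is forwarded to the underlying model's generate() call. A minimal sketch of assembling it, using standard transformers generation arguments (the specific values here are illustrative, not defaults):

```python
# Sketch: keyword arguments forwarded verbatim to the Hugging Face
# generate() method. Keys are standard transformers generation
# parameters; the values below are illustrative choices.
generation_kwargs = {
    "max_new_tokens": 50,    # cap the caption length in tokens
    "num_beams": 4,          # beam search width
    "early_stopping": True,  # stop when all beams have finished
}

# The node would then be constructed roughly as:
#   image_to_text = TransformersImageToText(
#       model_name_or_path="Salesforce/blip-image-captioning-base",
#       generation_kwargs=generation_kwargs,
#   )
```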


def generate_captions(image_file_paths: List[str],
                      generation_kwargs: Optional[dict] = None,
                      batch_size: Optional[int] = None) -> List[Document]

Generate captions for the image files you specify.


  • image_file_paths: Paths to the images for which you want to generate captions.
  • generation_kwargs: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in Hugging Face documentation.
  • batch_size: Number of images to process at a time.


List of Documents. Document.content is the caption. Document.meta["image_path"] contains the path to the image file.
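Consuming the returned list can be sketched as follows. A minimal stand-in dataclass is used here in place of the real Document class (which additionally carries fields such as id and score), so the shape of the content and meta fields is only illustrative:

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical stand-in mirroring only the content/meta fields
# of the returned Document objects, for illustration.
@dataclass
class Document:
    content: str
    meta: Dict[str, str] = field(default_factory=dict)

# What generate_captions() might return for one image:
documents = [
    Document(
        content="a red apple is sitting on a pile of hay",
        meta={"image_path": "/path/to/images/apple.jpg"},
    ),
]

# Pair each caption with its source image path.
for doc in documents:
    print(f"{doc.meta['image_path']}: {doc.content}")
```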