
Generates captions for images.

Module base

BaseImageToText

class BaseImageToText(BaseComponent)

Abstract class for ImageToText.

BaseImageToText.generate_captions

@abstractmethod
def generate_captions(image_file_paths: List[str],
                      generation_kwargs: Optional[dict] = None,
                      batch_size: Optional[int] = None) -> List[Document]

Abstract method for generating captions.

Arguments:

  • image_file_paths: Paths to the images for which you want to generate captions.
  • generation_kwargs: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in the Hugging Face documentation.
  • batch_size: Number of images to process at a time.

Returns:

List of Documents. Document.content is the caption. Document.meta["image_file_path"] contains the path to the image file.
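
A custom captioner only needs to implement generate_captions() and return one Document per image. The following is a minimal sketch; the import paths and the placeholder caption are illustrative assumptions, not part of the documented API.

   from typing import List, Optional

   from haystack import Document
   from haystack.nodes.image_to_text.base import BaseImageToText  # assumed import path

   class DummyImageToText(BaseImageToText):
       """Toy captioner that returns a fixed caption for every image."""

       def generate_captions(self,
                             image_file_paths: List[str],
                             generation_kwargs: Optional[dict] = None,
                             batch_size: Optional[int] = None) -> List[Document]:
           # One Document per image: content holds the caption,
           # meta keeps the path to the source image file.
           return [
               Document(content="a placeholder caption",
                        meta={"image_file_path": path})
               for path in image_file_paths
           ]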

Module transformers

TransformersImageToText

class TransformersImageToText(BaseImageToText)

A transformer-based model that generates captions for images using Hugging Face's transformers framework.

For an up-to-date list of available models, see Hugging Face image-to-text models.

Example

   from haystack.nodes import TransformersImageToText  # assumed Haystack v1 import path

   # Set up the captioning node (defaults to the nlpconnect/vit-gpt2-image-captioning model)
   image_to_text = TransformersImageToText()

   image_file_paths = ["/path/to/images/apple.jpg",
                       "/path/to/images/cat.jpg"]

   # Generate captions
   documents = image_to_text.generate_captions(image_file_paths=image_file_paths)

   # Show the results (a list of Documents with the caption and the image file path)
   print(documents)

   [
       {
           "content": "a red apple is sitting on a pile of hay",
           ...
           "meta": {
                       "image_path": "/path/to/images/apple.jpg",
                       ...
                   },
           ...
       },
       ...
   ]

TransformersImageToText.__init__

def __init__(model_name_or_path: str = "nlpconnect/vit-gpt2-image-captioning",
             model_version: Optional[str] = None,
             generation_kwargs: Optional[dict] = None,
             use_gpu: bool = True,
             batch_size: int = 16,
             progress_bar: bool = True,
             use_auth_token: Optional[Union[str, bool]] = None,
             devices: Optional[List[Union[str, torch.device]]] = None)

Load an image-to-text model from transformers.

For an up-to-date list of available models, see Hugging Face image-to-text models.

Arguments:

  • model_name_or_path: Directory of a saved model or the name of a public model. For a full list of models, see Hugging Face image-to-text models.
  • model_version: The version of the model to use from the Hugging Face model hub. This can be the tag name, branch name, or commit hash.
  • generation_kwargs: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in the Hugging Face documentation; an example follows this list.
  • use_gpu: Whether to use GPU (if available).
  • batch_size: Number of images to process at a time.
  • progress_bar: Whether to show a progress bar.
  • use_auth_token: The API token used to download private models from Hugging Face. If set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For more information, see from_pretrained() in the Hugging Face documentation.
  • devices: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. You can pass torch device objects or strings (for example, [torch.device('cuda:0'), "mps", "cuda:1"]). If you set use_gpu=False, the devices parameter is ignored and a single CPU device is used for inference.
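
The following sketch shows a typical setup with explicit generation settings. The import path assumes Haystack v1, and max_new_tokens is a standard Hugging Face generate() argument used here purely for illustration.

   from haystack.nodes import TransformersImageToText  # assumed Haystack v1 import path

   image_to_text = TransformersImageToText(
       model_name_or_path="nlpconnect/vit-gpt2-image-captioning",
       # Forwarded to the Hugging Face generate() call.
       generation_kwargs={"max_new_tokens": 30},
       use_gpu=True,
       batch_size=16,
       progress_bar=True,
   )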

TransformersImageToText.generate_captions

def generate_captions(image_file_paths: List[str],
                      generation_kwargs: Optional[dict] = None,
                      batch_size: Optional[int] = None) -> List[Document]

Generate captions for the image files you specify.

Arguments:

  • image_file_paths: Paths to the images for which you want to generate captions.
  • generation_kwargs: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in the Hugging Face documentation.
  • batch_size: Number of images to process at a time.

Returns:

List of Documents. Document.content is the caption. Document.meta["image_file_path"] contains the path to the image file.
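
A short usage sketch, assuming image_to_text is a TransformersImageToText instance created as shown above; the per-call generation_kwargs and batch_size values are illustrative.

   # Override the init-time generation settings and batch size for this call only.
   documents = image_to_text.generate_captions(
       image_file_paths=["/path/to/images/apple.jpg",
                         "/path/to/images/cat.jpg"],
       generation_kwargs={"max_new_tokens": 20},
       batch_size=8,
   )

   for doc in documents:
       # Document.content is the caption; doc.meta keeps the path to the image file.
       print(doc.content, doc.meta)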