Generates captions for images.
Module transformers
TransformersImageToText
class TransformersImageToText(BaseImageToText)
A transformer-based model to generate captions for images using Hugging Face's transformers framework.
Example
from haystack.nodes import TransformersImageToText

image_to_text = TransformersImageToText()

image_file_paths = ["/path/to/images/apple.jpg",
                    "/path/to/images/cat.jpg"]

# Generate captions
documents = image_to_text.generate_captions(image_file_paths=image_file_paths)

# Show results (list of Documents, each containing the caption and the image file path)
print(documents)

[
    {
        "content": "a red apple is sitting on a pile of hay",
        ...
        "meta": {
            "image_path": "/path/to/images/apple.jpg",
            ...
        },
        ...
    },
    ...
]
TransformersImageToText.__init__
def __init__(model_name_or_path: str = "Salesforce/blip-image-captioning-base",
model_version: Optional[str] = None,
generation_kwargs: Optional[dict] = None,
use_gpu: bool = True,
batch_size: int = 16,
progress_bar: bool = True,
use_auth_token: Optional[Union[str, bool]] = None,
devices: Optional[List[Union[str, "torch.device"]]] = None)
Load an Image-to-Text model from transformers.
Arguments:
model_name_or_path
: Directory of a saved model or the name of a public model. To find these models:
- Visit Hugging Face image-to-text models.
- Open the model you want to check.
- On the model page, go to the "Files and Versions" tab.
- Open the config.json file and make sure the architectures field contains VisionEncoderDecoderModel, BlipForConditionalGeneration, or Blip2ForConditionalGeneration.
model_version
: The version of the model to use from the Hugging Face model hub. This can be the tag name, branch name, or commit hash.
generation_kwargs
: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in Hugging Face documentation.
use_gpu
: Whether to use GPU (if available).
batch_size
: Number of documents to process at a time.
progress_bar
: Whether to show a progress bar.
use_auth_token
: The API token used to download private models from Hugging Face. If set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For more information, see from_pretrained() in Hugging Face documentation.
devices
: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects or strings is supported (for example, [torch.device('cuda:0'), "mps", "cuda:1"]). If you set use_gpu=False, the devices parameter is not used and a single CPU device is used for inference.
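For example, a minimal initialization sketch (the specific generation_kwargs keys, max_length and num_beams, are standard Hugging Face generate() arguments chosen here for illustration, and the haystack.nodes import path is assumed):

from haystack.nodes import TransformersImageToText

image_to_text = TransformersImageToText(
    model_name_or_path="Salesforce/blip-image-captioning-base",
    generation_kwargs={"max_length": 30, "num_beams": 4},  # forwarded to the model's generate() method
    use_gpu=True,
    batch_size=8,
    progress_bar=True,
    devices=["cuda:0"],  # restrict inference to the first GPU; ignored if use_gpu=False
)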
TransformersImageToText.generate_captions
def generate_captions(image_file_paths: List[str],
generation_kwargs: Optional[dict] = None,
batch_size: Optional[int] = None) -> List[Document]
Generate captions for the image files you specify.
Arguments:
image_file_paths
: Paths to the images for which you want to generate captions.
generation_kwargs
: Dictionary containing arguments for the generate() method of the Hugging Face model. See generate() in Hugging Face documentation.
batch_size
: Number of images to process at a time.
Returns:
List of Documents. Document.content is the caption. Document.meta["image_path"] contains the path to the image file.
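A short sketch of calling generate_captions directly with per-call settings (the max_length value is illustrative):

captions = image_to_text.generate_captions(
    image_file_paths=["/path/to/images/apple.jpg", "/path/to/images/cat.jpg"],
    generation_kwargs={"max_length": 20},  # arguments passed to the model's generate() method for this call
    batch_size=4,
)

# Each Document carries the caption and the source image path
for doc in captions:
    print(doc.meta["image_path"], "->", doc.content)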
TransformersImageToText.run
def run(file_paths: Optional[List[str]] = None,
documents: Optional[List[Document]] = None)
Arguments:
file_paths
: Paths to the images for which you want to generate captions.
documents
: List of image Documents to process into text.
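run is typically invoked by a pipeline rather than called directly. A minimal sketch, assuming a Haystack v1 Pipeline in which the node receives the file paths passed to pipeline.run() (the pipeline wiring and the "documents" output key are assumptions, not taken from this page):

from haystack import Pipeline
from haystack.nodes import TransformersImageToText

image_to_text = TransformersImageToText()

pipeline = Pipeline()
# The node is fed the file paths given to pipeline.run()
pipeline.add_node(component=image_to_text, name="image_to_text", inputs=["File"])

result = pipeline.run(file_paths=["/path/to/images/apple.jpg"])
print(result["documents"])  # captions as Documents (assumed output key)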
TransformersImageToText.run_batch
def run_batch(file_paths: Optional[List[str]] = None,
documents: Optional[List[Document]] = None)
Arguments:
file_paths
: Paths to the images for which you want to generate captions.
documents
: List of image Documents to process into text.
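run_batch mirrors run for batched pipeline execution. When called directly, a node of this kind conventionally returns a (results, stream) tuple; the sketch below assumes that convention and the "documents" output key:

results, _ = image_to_text.run_batch(
    file_paths=["/path/to/images/apple.jpg",
                "/path/to/images/cat.jpg"]
)
print(results["documents"])  # one caption Document per input image (assumed)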