In this tutorial, we walk through the steps to extract all images from a PDF file using the Python library PyMuPDF.
1. Install the PyMuPDF library
pip install PyMuPDF
2. Import the libraries
import fitz  # the PyMuPDF module
from PIL import Image
import io
3. Open a PDF file and save all images
filename = "my_file.pdf"
# open the file
with fitz.open(filename) as my_pdf_file:
    # loop through every page (page numbers start at 1)
    for page_number in range(1, len(my_pdf_file) + 1):
        # access the individual page (indices start at 0)
        page = my_pdf_file[page_number - 1]
        # list all images on the page
        # (get_images() was named getImageList() in older PyMuPDF releases)
        images = page.get_images()
        # check whether there are any images
        if images:
            print(f"There are {len(images)} image/s on page number {page_number}[+]")
        else:
            print(f"There are no image/s on page number {page_number}[!]")
        # loop through all images on the page
        for image_number, image in enumerate(images, start=1):
            # access the image xref
            xref_value = image[0]
            # extract the image information
            # (extract_image() was named extractImage() in older releases)
            base_image = my_pdf_file.extract_image(xref_value)
            # access the image itself
            image_bytes = base_image["image"]
            # get the image extension
            ext = base_image["ext"]
            # load the image
            image = Image.open(io.BytesIO(image_bytes))
            # save the image locally; passing a file name lets Pillow open and close the file
            image.save(f"Page{page_number}Image{image_number}.{ext}")
The example code saves the images in a PDF in three steps.
(1) Get the current PDF page
page = my_pdf_file[page_number-1]
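The `page_number-1` offset is needed because pages are indexed from 0 while the printed page numbers start at 1. Here is a minimal sketch of the same bookkeeping using `enumerate(..., start=1)`. The list `pages` and the helper `numbered_pages` are hypothetical stand-ins for an opened document, so the snippet runs without a PDF:

```python
# Hypothetical stand-in for an opened document: any sequence of pages works.
pages = ["first page", "second page", "third page"]


def numbered_pages(document):
    """Return (page_number, page) pairs with 1-based page numbers."""
    return list(enumerate(document, start=1))


for page_number, page in numbered_pages(pages):
    # page_number is 1-based; the same page sits at index page_number - 1
    print(page_number, page is pages[page_number - 1])
```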
(2) List all images on the current page
images = page.get_images()  # named getImageList() in older PyMuPDF releases
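Each entry in this list is a tuple whose first item is the image's xref, which is what the extraction step uses. If the same image (a logo, say) is placed on several pages, its xref can show up in the list of each of those pages. Below is a minimal sketch of collecting distinct xrefs, assuming you want each image only once. `unique_xrefs` and the sample tuples are hypothetical and not part of PyMuPDF:

```python
def unique_xrefs(image_lists):
    """Collect each xref once, in first-seen order, from per-page image lists."""
    seen = set()
    ordered = []
    for images in image_lists:
        for image in images:
            xref = image[0]  # the first tuple item is the xref
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


# Made-up sample data: two pages that share the image with xref 12.
page_1 = [(12, 0, 640, 480), (15, 0, 200, 100)]
page_2 = [(12, 0, 640, 480)]
print(unique_xrefs([page_1, page_2]))  # [12, 15]
```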
(3) Get the image data and save it
# access the image xref
xref_value = image[0]
# extract the image information
# (extract_image() was named extractImage() in older PyMuPDF releases)
base_image = my_pdf_file.extract_image(xref_value)
# access the image itself
image_bytes = base_image["image"]
# get the image extension
ext = base_image["ext"]
# load the image
image = Image.open(io.BytesIO(image_bytes))
# save the image locally; passing a file name lets Pillow open and close the file
image.save(f"Page{page_number}Image{image_number}.{ext}")
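Because `base_image["image"]` already holds the image encoded in the format named by `base_image["ext"]`, the bytes can also be written to disk directly, without loading them into Pillow first. Here is a minimal sketch of that alternative using only the standard library. The helper `save_image_bytes`, its `out_dir` parameter and the dummy bytes are hypothetical, chosen for illustration:

```python
import tempfile
from pathlib import Path


def save_image_bytes(image_bytes, ext, page_number, image_number, out_dir="."):
    """Write already-encoded image bytes to Page<N>Image<M>.<ext> and return the path."""
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"Page{page_number}Image{image_number}.{ext}"
    target.write_bytes(image_bytes)  # write_bytes opens and closes the file itself
    return target


# Dummy bytes stand in for base_image["image"]; they are not a real image.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_image_bytes(b"not a real image", "png", 1, 2, out_dir=tmp_dir)
    print(path.name)  # Page1Image2.png
```

Writing the bytes as they are keeps the image exactly as it is stored in the PDF, whereas going through Pillow decodes and re-encodes it.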