In this tutorial, we will walk through the steps to extract tables from a PDF file using the Python tabula-py library.
1.Install the tabula-py library
pip install tabula-py
2.Import library
from tabula import read_pdf
3.Extract all tables in a pdf file
pdf_file = "test.pdf"
# list all tables
tables = read_pdf(pdf_file, pages='all')
4.Iterate over all tables and convert them to csv files
table_number = 1
for table in tables:
    # drop every column that contains a NaN value
    table = table.dropna(axis="columns")
    if not table.empty:
        print(f"Table {table_number}")
        print(table)
        # convert the table dataframe into a csv file
        table.to_csv(f'table{table_number}.csv')
        table_number += 1
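The loop in step 4 can also be wrapped in a small reusable function. The following is only a sketch under some assumptions: the name save_nonempty_tables and its prefix and how parameters are made up for illustration and are not part of tabula-py. It takes any list of pandas DataFrames, such as the list returned by read_pdf. Unlike step 4, it defaults to how="all", so a column is dropped only when every value in it is NaN, and it passes index=False so the row index is not written to the CSV.

```python
import pandas as pd


def save_nonempty_tables(tables, prefix="table", how="all"):
    """Write each non-empty DataFrame in `tables` to '<prefix><n>.csv'.

    Hypothetical helper for illustration. Returns the list of file
    names written, in order.
    """
    written = []
    table_number = 1
    for table in tables:
        # how="all" drops only columns in which every value is NaN;
        # how="any" (the pandas default) drops a column that has even one NaN.
        table = table.dropna(axis="columns", how=how)
        if table.empty:
            # nothing left to save, so the counter is not advanced
            continue
        file_name = f"{prefix}{table_number}.csv"
        # index=False leaves the DataFrame's row index out of the CSV
        table.to_csv(file_name, index=False)
        written.append(file_name)
        table_number += 1
    return written
```

With the tables list from step 3, calling save_nonempty_tables(tables) would write table1.csv, table2.csv and so on to the current directory. Passing how="any" should reproduce the column filtering used in step 4.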
Run the code from steps 2 to 4 and you will get one csv file (table1.csv, table2.csv, and so on) for each non-empty table: