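Python can read a PDF file, extract the text from it, and print the contents. To do this, first install the required module, PyPDF2. The command below installs it; pip should already be available in your Python environment.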
pip install pypdf2
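Before moving on, it can help to confirm that the module is importable from the interpreter you plan to use. The following is a minimal sketch that relies only on the standard library; the helper name is_installed is hypothetical and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # The import name is case-sensitive even though the pip package name is not
    print("PyPDF2 importable:", is_installed("PyPDF2"))
```

If this reports False, pip most likely installed the package into a different Python environment than the one running the script.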
Once the module is installed, you can read PDF files with the methods it provides.
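from PyPDF2 import PdfReader  # PyPDF2 3.x API; older releases used PdfFileReader

# Raw string so the backslash in the Windows-style path is kept literally
pdf_name = r'path\Yiibaipoint.pdf'
reader = PdfReader(pdf_name)
page = reader.pages[0]  # first page (formerly getPage(0))
page_content = page.extract_text()  # formerly extractText()
print(page_content)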
Running the program above produces the following output:
Yiibai Point originated from the idea that there exists a class of readers who respond better
to online content and prefer to learn new skills at their own pace from the comforts of their
drawing rooms.
The journey commenced with a single tutorial on HTML in 2006 and elated by the response
it generated, we worked our way to adding fresh tutorials to our repository which now
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.
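Reading a page by index fails if the file has fewer pages than expected. As a hedged sketch, the single-page read can be wrapped in a small helper that checks the index first. The function names read_page_text and open_reader are hypothetical; the code assumes the PyPDF2 3.x interface (PdfReader, pages, extract_text).

```python
def read_page_text(reader, index=0):
    """Return the text of one page, raising IndexError with a clear message."""
    pages = reader.pages
    if not 0 <= index < len(pages):
        raise IndexError(
            f"page {index} requested, but the file has {len(pages)} page(s)"
        )
    # extract_text() can return an empty string for scanned or image-only pages
    return pages[index].extract_text() or ""


def open_reader(path):
    """Open a PDF with PyPDF2, imported lazily so read_page_text stays dependency-free."""
    from PyPDF2 import PdfReader
    return PdfReader(path)
```

A call such as read_page_text(open_reader(r'path\Yiibaipoint.pdf'), 0) would then return the same text as the example above, assuming the file exists at that path.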
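Reading multiple pages
To read a PDF with multiple pages and print each page along with its page number, loop over the pages and call get_page_number() (named getPageNumber() in PyPDF2 releases before 3.0). The example below uses a two-page PDF file; the content of each page is printed under its own page heading.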
from PyPDF2 import PdfReader  # PyPDF2 3.x API; older releases used PdfFileReader

pdf_name = r'Path\Yiibaispoint2.pdf'
reader = PdfReader(pdf_name)
# Walk the pages in order; get_page_number() returns a zero-based index
for page in reader.pages:
    print('Page No - ' + str(1 + reader.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)
Running the example above produces the following result:
Page No - 1
Yiibai Point originated from the idea that there exists a class of readers who respond better to
online content and prefer to learn new skills at their own pace from the comforts of their drawing
rooms.
Page No - 2
The journey commenced with a single tutorial on HTML in 2006 and elated by the response it
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts
a wealth of tutorials and allied articles on topics ranging from programming languages to web
designing to academics and much more.
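The page loop can also be written so that it returns the text instead of printing it, which makes the result easier to save or search. The sketch below is one possible arrangement, not the only one: iter_pages and format_pages are hypothetical helper names, and the page number comes from enumerate() rather than from a lookup on the reader. It assumes a reader object exposing pages and extract_text(), as PyPDF2 3.x does.

```python
def iter_pages(reader):
    """Yield (page_number, text) pairs, numbering pages from 1."""
    for number, page in enumerate(reader.pages, start=1):
        # Fall back to an empty string when a page has no extractable text
        yield number, page.extract_text() or ""


def format_pages(reader):
    """Join all pages into one string, each under a 'Page No - N' heading."""
    parts = []
    for number, text in iter_pages(reader):
        parts.append(f"Page No - {number}\n{text}")
    return "\n".join(parts)
```

Calling print(format_pages(reader)) on a reader opened for the two-page file should give output laid out like the result shown above.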