Processing PDF Files

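Python can read a PDF file and print its content once the text has been extracted from it. To do this, first install the required module, PyPDF2, using the command below. pip should already be installed in your Python environment.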

pip install pypdf2
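
Before running the examples, it can help to confirm that the module is actually importable from the interpreter you are using. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "PyPDF2" is the import name used throughout this tutorial
    if is_installed("PyPDF2"):
        print("PyPDF2 is available")
    else:
        print("PyPDF2 is missing; run: pip install pypdf2")
```

If the check reports the module as missing even after installation, pip may have installed it into a different Python environment than the one running the script.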

Once the module is installed successfully, its methods can be used to read a PDF file.

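import PyPDF2

# Raw string so the backslash in the Windows-style path is kept literally
pdfName = r'path\Yiibaipoint.pdf'
# PdfReader replaces the older PdfFileReader, which was removed in PyPDF2 3.0
read_pdf = PyPDF2.PdfReader(pdfName)
page = read_pdf.pages[0]              # first page (formerly getPage(0))
page_content = page.extract_text()    # formerly extractText()
print(page_content)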

Running the program above produces the following output:

Yiibai Point originated from the idea that there exists a class of readers who respond better 
to online content and prefer to learn new skills at their own pace from the comforts of their 
drawing rooms.

The journey commenced with a single tutorial on HTML in 2006 and elated by the response 
it generated, we worked our way to adding fresh tutorials to our repository which now 
proudly flaunts a wealth of tutorials and allied articles on topics ranging from programming
languages to web designing to academics and much more.

Reading Multiple Pages

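To read a PDF with multiple pages and print each page along with its page number, loop over the pages and look up each page's number with the reader's getPageNumber() function (named get_page_number() in newer PyPDF2 releases). The example below uses a PDF file with two pages, and the content is printed under two separate page headings.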

import PyPDF2

pdfName = r'Path\Yiibaispoint2.pdf'
read_pdf = PyPDF2.PdfReader(pdfName)

# len(read_pdf.pages) replaces the older getNumPages()
for i in range(len(read_pdf.pages)):
    page = read_pdf.pages[i]
    # get_page_number() (formerly getPageNumber()) is zero-based, so add 1
    print('Page No - ' + str(1 + read_pdf.get_page_number(page)))
    page_content = page.extract_text()
    print(page_content)

Running the example code above gives the following result:

Page No - 1
Yiibai Point originated from the idea that there exists a class of readers who respond better to 
online content and prefer to learn new skills at their own pace from the comforts of their drawing 
rooms. 


Page No - 2

The journey commenced with a single tutorial on HTML in 2006 and elated by the response it 
generated, we worked our way to adding fresh tutorials to our repository which now proudly flaunts 
a wealth of tutorials and allied articles on topics ranging from p
rogramming languages to web 
designing to academics and much more.
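
The page-by-page loop above can also be wrapped in a small reusable helper that returns the text instead of printing it. The following is a sketch under stated assumptions: the function names format_pages and read_pdf_text are hypothetical, the code uses the PdfReader / pages / extract_text() names found in newer PyPDF2 releases, and the fallback to an empty string is a defensive choice for pages that yield no text.

```python
def format_pages(pages):
    """Return the text of each page under a 'Page No - N' heading.

    `pages` is any iterable of objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2 PdfReader.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        # Defensive: treat a missing or empty extraction result as ""
        text = page.extract_text() or ""
        chunks.append(f"Page No - {number}\n{text.strip()}")
    return "\n\n".join(chunks)


def read_pdf_text(path):
    """Open the PDF at `path` and return the formatted text of all pages."""
    # Imported here so format_pages can be tried without PyPDF2 installed
    import PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return format_pages(reader.pages)
```

With the two-page file from the example, print(read_pdf_text(r'Path\Yiibaispoint2.pdf')) would print each page under its own heading. Numbering with enumerate avoids a separate page-number lookup for every page.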
