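Tagging Words

Tagging is a basic feature of text processing: each word is labeled with its grammatical category (part of speech). The text is first split into tokens with word_tokenize, and the pos_tag function then assigns a tag to each token.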

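import nltk

# Requires the NLTK tokenizer and tagger data; install them once with
# nltk.download() (exact resource names depend on the NLTK version).
text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text = nltk.pos_tag(text)  # list of (word, tag) tuples
print(tagged_text)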

Running the example above produces the following output:

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), 
('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), 
('the', 'DT'), ('nest', 'JJS')]
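
Because the result is an ordinary list of (word, tag) tuples, it can be filtered with plain Python. The sketch below is illustrative only: it copies the tagged output shown above into a literal so that it runs without NLTK, and the helper name words_with_tag_prefix is hypothetical, not part of NLTK.

```python
# Tagged output copied from the example above; no NLTK call is made here.
tagged_text = [
    ('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
    ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'),
    ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS'),
]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = words_with_tag_prefix(tagged_text, 'NN')
# Verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = words_with_tag_prefix(tagged_text, 'VB')

print(nouns)
print(verbs)
```

Matching on a tag prefix rather than an exact tag keeps singular, plural, and proper nouns together, which is usually what is wanted when collecting "all nouns".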

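Tag Descriptions

The meaning of each tag can be looked up with the following program, which prints the descriptions built into NLTK.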

import nltk

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

Running the program above gives the following output:

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
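
The descriptions can also be attached to tagged words directly. The following is a minimal sketch, not NLTK's own API: the TAG_DESCRIPTIONS dictionary and the describe helper are hypothetical names, and the dictionary holds only the three descriptions printed above.

```python
# Descriptions copied from the upenn_tagset output above (only three tags).
TAG_DESCRIPTIONS = {
    'NN': 'noun, common, singular or mass',
    'IN': 'preposition or conjunction, subordinating',
    'DT': 'determiner',
}


def describe(tagged, descriptions=TAG_DESCRIPTIONS):
    """Pair each word with its tag description, or 'unknown' if not listed."""
    return [(word, tag, descriptions.get(tag, 'unknown')) for word, tag in tagged]


sample = [('the', 'DT'), ('serpent', 'NN'), ('from', 'IN'), ('eats', 'VBZ')]
for word, tag, meaning in describe(sample):
    print(f'{word:<8} {tag:<4} {meaning}')
```

Tags missing from the dictionary, such as VBZ here, fall back to 'unknown' instead of raising an error.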

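Tagging a Corpus

Corpus data can be tagged in the same way, which lets us inspect the tag assigned to each word in that corpus. The following code shows one way to do this: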

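import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# Load the raw text of Blake's poems from the Gutenberg corpus.
sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)

# Tag only the first two sentences to keep the output short.
for sentence in tokenized[:2]:
    words = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(words)
    print(tagged)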

Running the example above produces the following output:

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'),
 (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'),
 ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'),
 ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'),
 ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'),
 (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'),
 (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'),
 ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'),
 ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'),
 ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
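
A common next step after tagging a corpus is to count how often each tag occurs. The sketch below assumes the tagged sentences are already available as lists of (word, tag) tuples. To stay runnable without the corpus data it uses a short excerpt of the output above, and count_tags is a hypothetical helper name rather than an NLTK function.

```python
from collections import Counter

# Excerpt of the tagged output above, wrapped in an outer list so the
# helper can accept any number of tagged sentences.
tagged_sentences = [
    [('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'),
     ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'),
     ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP')],
]


def count_tags(sentences):
    """Count tag occurrences across all tagged sentences (hypothetical helper)."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


tag_counts = count_tags(tagged_sentences)
# Print the three most frequent tags with their counts.
for tag, count in tag_counts.most_common(3):
    print(tag, count)
```

With real data, the list passed to count_tags could be built by collecting the tagged value from each iteration of the loop in the corpus example instead of printing it.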