MaDi's Blog

A little corner where I document what I learn on the road to becoming a software engineer


NLP Tokenization and Statistical Analysis (II): NLTK and WordNet

NLP tokenization varies by language: Chinese is commonly handled with the jieba package, while English text is usually processed with NLTK. This post uses NLTK to tokenize English sentences and combines it with WordNet, a word-sense network, to help us find synonyms and even compute the structural similarity between word categories.

A quick recap of the natural language processing (NLP) pipeline:

corpus => document (file) => paragraphs/sentences => tokens (words)

NLTK (Natural Language Toolkit)

First, download the NLTK data:

import nltk

nltk.download() # opens a GUI to help download the data
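If you would rather skip the GUI, here is a minimal sketch of downloading just the resources this post relies on (these are standard NLTK data identifiers, though exact names can vary slightly across NLTK versions):

import nltk

# Download only the resources used in this post, without the GUI
for resource in ["punkt", "averaged_perceptron_tagger",
                 "stopwords", "wordnet", "omw-1.4"]:
    nltk.download(resource)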

Let's test with an English quote found online:

text = "It's true that we don't know what we've got until we lose it, but it's also true that we don't know what we've been losing until it arrives. It is better to stay silent and be thought a fool, than to open one’s mouth and remove all doubt."

Sentence segmentation with sent_tokenize

from nltk import sent_tokenize

sentences = sent_tokenize(text) # split into sentences

["It's true that we don't know what we've got until we lose it, but it's also true that we don't know what we've been losing until it arrives.",
'It is better to stay silent and be thought a fool, than to open one’s mouth and remove all doubt.']

Word tokenization with word_tokenize

from nltk import word_tokenize

tokens = word_tokenize(text) # split into word tokens

['It',
"'s",
'true',
'that',
'we',
'do',
"n't"…

Finding high-frequency words with FreqDist

from nltk.probability import FreqDist

fdist = FreqDist(tokens)

FreqDist({'we': 5, 'it': 3, 'It': 2, "'s": 2, 'true': 2, 'that': 2, 'do': 2, "n't": 2, 'know': 2, 'what': 2, …})

# the 10 most frequent tokens
freq_top10 = fdist.most_common(10)

[('we', 5),
('it', 3),
('It', 2),
("'s", 2),
('true', 2),
('that', 2),
('do', 2),
("n't", 2),
('know', 2),
('what', 2)]

Loading the English stopword list

from nltk.corpus import stopwords

print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves'…

print([token for token in tokens if token not in stopwords.words('english')])

['It', "'s", 'true', "n't", 'know', "'ve", 'got', 'lose', ',', "'s", 'also', 'true', "n't", 'know', "'ve", 'losing', 'arrives', '.', 'It', 'better', 'stay', 'silent', 'thought', 'fool', ',', 'open', 'one', '’', 'mouth', 'remove', 'doubt', '.']
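Combining the two steps above, here is a minimal sketch (variable names are my own) that counts frequencies only over the tokens that survive stopword filtering, so content words rise to the top:

from nltk.probability import FreqDist
from nltk.corpus import stopwords

# Count frequencies over the filtered tokens
filtered = [token for token in tokens if token not in stopwords.words('english')]
fdist_filtered = FreqDist(filtered)
print(fdist_filtered.most_common(5))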

Part-of-speech (POS) tagging with pos_tag

from nltk import pos_tag

pos = pos_tag(tokens) # tag the whole token list at once

[('It', 'PRP'),
("'s", 'VBZ'),
('true', 'JJ'),
('that', 'IN'),
('we', 'PRP')…
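If a tag is unfamiliar, NLTK can explain it; a small sketch (this requires downloading the 'tagsets' data package first):

import nltk

# Look up what the Penn Treebank tag 'PRP' means
# (requires: nltk.download('tagsets'))
nltk.help.upenn_tagset('PRP')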

Stemming (stripping suffixes)

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("writing")) # strip the suffix

write

Lemmatization (restoring the base form by POS)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v")) # verb
print(lemmatizer.lemmatize("women", pos="n")) # noun
print(lemmatizer.lemmatize("better", pos="a")) # adjective

go
woman
good
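The two approaches can disagree: stemming just chops characters by rule, while lemmatization consults WordNet. A quick side-by-side sketch (the word choices are mine):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming can produce non-words ('studi'), while the
# lemmatizer returns a real dictionary form ('study')
for word in ["studies", "women", "better"]:
    print(word, "->", stemmer.stem(word), "/",
          lemmatizer.lemmatize(word, pos="n"))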

WordNet

WordNet is a semantic network, primarily for English, that organizes "word-sense" relationships. Through it we can find hypernym/hyponym relations and synonyms in a text, and handle morphology (e.g. verb tenses, plural nouns).

First, import the module:

from nltk.corpus import wordnet as wn

Finding synonyms with synsets

# get the set of synsets
wn.synsets('engineer')

[Synset('engineer.n.01'), Synset('engineer.n.02'), Synset('engineer.v.01'), Synset('mastermind.v.01')]

# list the lemmas in a chosen synset
wn.synset('engineer.n.01').lemma_names()

['engineer', 'applied_scientist', 'technologist']

# list the lemmas of every synset
for synset in wn.synsets('engineer'):
    print(synset.lemma_names())

['engineer', 'applied_scientist', 'technologist']
['engineer', 'locomotive_engineer', 'railroad_engineer', 'engine_driver']
['engineer']
['mastermind', 'engineer', 'direct', 'organize', 'organise', 'orchestrate']

Viewing a synset's definition with definition()

wn.synset('engineer.n.01').definition()

'a person who uses scientific knowledge to solve practical problems'

hypernyms() and hyponyms() find hypernyms and hyponyms. WordNet classifies words, so there are hierarchical relationships among them, much like the biological taxonomy of kingdom, phylum, class, order, family, genus, species. Hypernyms and hyponyms are simply the words one level up and one level down in that hierarchy.

# hypernyms of engineer
wn.synset('engineer.n.01').hypernyms()

[Synset('person.n.01')]

# hyponyms of engineer
wn.synset('engineer.n.01').hyponyms()

[Synset('aeronautical_engineer.n.01'),
Synset('aerospace_engineer.n.01'),
Synset('army_engineer.n.01'),
Synset('automotive_engineer.n.01'),
Synset('civil_engineer.n.01')…

# the root hypernym of engineer
wn.synset('engineer.n.01').root_hypernyms()

[Synset('entity.n.01')]
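To see the whole chain from the root down to engineer, a synset can list every hypernym path it sits on; a small sketch that prints the first one:

# Print one full hypernym chain, from 'entity' down to 'engineer'
# (hypernym_paths() returns a list of paths; we take the first)
for synset in wn.synset('engineer.n.01').hypernym_paths()[0]:
    print(synset.name())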

lowest_common_hypernyms finds the lowest common hypernym shared by two synsets, and path_similarity computes their structural similarity.

# take engineer as the example
engineer = wn.synset("engineer.n.01")
hacker = wn.synset("hacker.n.01")
elephant = wn.synset("elephant.n.01")

print('engineer vs engineer: ', engineer.lowest_common_hypernyms(engineer))
print('engineer vs hacker: ', engineer.lowest_common_hypernyms(hacker))
print('engineer vs elephant: ', engineer.lowest_common_hypernyms(elephant))

engineer vs engineer: [Synset('engineer.n.01')]
engineer vs hacker: [Synset('person.n.01')]
engineer vs elephant: [Synset('organism.n.01')]

print('engineer vs engineer: ', engineer.path_similarity(engineer))
print('engineer vs hacker: ', engineer.path_similarity(hacker))
print('engineer vs elephant: ', engineer.path_similarity(elephant))

engineer vs engineer: 1.0
engineer vs hacker: 0.16666666666666666
engineer vs elephant: 0.1
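path_similarity is computed as 1 / (shortest path length + 1) over the hypernym/hyponym graph, so 0.1666… means engineer and hacker are five edges apart, and 0.1 means engineer and elephant are nine apart. A quick sketch to verify, using the underlying NLTK method:

# 1 / (5 + 1) = 0.1666..., matching path_similarity above
print(engineer.shortest_path_distance(hacker))    # expected: 5
print(engineer.shortest_path_distance(elephant))  # expected: 9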

Finally, we can combine WordNet with WordNetLemmatizer to automatically lemmatize an entire sentence according to each token's POS tag.

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# map a Penn Treebank tag to a WordNet POS constant
def get_pos(tag):
    if tag.startswith('J'):    # adjective
        return wordnet.ADJ
    elif tag.startswith('V'):  # verb
        return wordnet.VERB
    elif tag.startswith('N'):  # noun
        return wordnet.NOUN
    elif tag.startswith('R'):  # adverb
        return wordnet.ADV
    else:
        return wordnet.NOUN    # default to noun when nothing matches
# test sentence
text = "Hello, my name is Madi. I love playing basketball."

# tokenization / POS tagging
tokens = word_tokenize(text)
postag_tokens = pos_tag(tokens)

# WordNet lemmatizer
wnl = WordNetLemmatizer()

# the sentence with every token lemmatized
print(" ".join([wnl.lemmatize(tag[0], pos=get_pos(tag[1])) for tag in postag_tokens]))

Hello , my name be Madi . I love play basketball .
