[Douban] Movie Rating System (Part 1)


From the column: Antenna's Python study notes

Original author: Charles的皮卡丘 (WeChat public account)

1. Project Content

Teach the machine to analyze reviews of different movies and then, based on those reviews, score each reviewed movie (on a 5-point scale).

Concretely:

(1) Write a Python crawler to scrape Douban movie reviews and their corresponding ratings as training data.

(2) Train a neural network on the collected data to obtain a model.

2. Scraping the Douban Top250 Movie Chart

(1) Main techniques

This mainly uses the requests module and the BeautifulSoup (bs4) module.

pip install requests
pip install bs4
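Before the full crawler, here is a minimal sketch of how BeautifulSoup pulls a title and a rating out of the page markup. The HTML snippet below is a made-up stand-in for one Top250 entry, not real Douban output, but it uses the same class names the crawler searches for:

```python
import bs4

# Hypothetical snippet mimicking one entry of the Top250 list page
html = '''
<div class="hd"><a href="#"><span class="title">肖申克的救贖</span></a></div>
<span class="rating_num">9.7</span>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
# navigate div.hd -> <a> -> first <span> for the title
title = soup.find('div', class_='hd').a.span.text
# the rating is the text of span.rating_num
rating = soup.find('span', class_='rating_num').text
print(title, rating)  # 肖申克的救贖 9.7
```

The real page repeats these blocks 25 times, which is why the crawler uses find_all and collects the results into lists.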

(2) Source code

import requests
import bs4
import re

def open_url(url):
    # use a proxy (optional, so it is commented out)
    # proxies = {'http': '127.0.0.1:1080', 'https': '127.0.0.1:1080'}
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.18 Safari/537.36'}
    # res = requests.get(url, headers=headers, proxies=proxies)
    res = requests.get(url, headers=headers)
    return res

def find_movies(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # movie titles
    movies = []
    targets = soup.find_all("div", class_="hd")
    for each in targets:
        movies.append(each.a.span.text)
    # ratings
    ranks = []
    targets = soup.find_all("span", class_="rating_num")
    for each in targets:
        ranks.append('Rating: %s' % each.text)
    # details
    messages = []
    targets = soup.find_all("div", class_="bd")
    for each in targets:
        try:
            messages.append(each.p.text.split('\n')[1].strip() + each.p.text.split('\n')[2].strip())
        except IndexError:
            continue
    # combine
    result = []
    length = len(movies)
    for i in range(length):
        result.append(movies[i] + ranks[i] + messages[i] + '\n')
    return result

# find out how many pages there are in total
def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('span', class_='next').previous_sibling.previous_sibling.text
    return int(depth)

def main():
    host = "https://movie.douban.com/top250"
    res = open_url(host)
    depth = find_depth(res)
    result = []
    for i in range(depth):
        url = host + '/?start=' + str(25 * i)
        res = open_url(url)
        result.extend(find_movies(res))
    with open("豆瓣Top250電影.txt", 'w', encoding='utf-8') as f:
        for each in result:
            f.write(each)

if __name__ == "__main__":
    main()
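The least obvious line above is the one inside find_depth: the total page count sits in the &lt;a&gt; tag just before the "next" span, with a whitespace text node between them, hence the two previous_sibling hops. A self-contained sketch on a made-up pagination snippet (the class names match the real page; the markup itself is simplified):

```python
import bs4

# Hypothetical paginator: page count "10" sits just before span.next,
# separated by a newline text node
html = ('<div class="paginator"><a href="?start=225">10</a>\n'
        '<span class="next"><a href="?start=25">後頁</a></span></div>')

soup = bs4.BeautifulSoup(html, 'html.parser')
# first previous_sibling is the '\n' text node, second is the <a>10</a> tag
depth = soup.find('span', class_='next').previous_sibling.previous_sibling.text
print(int(depth))  # 10
```

This is fragile by design: if Douban ever changes the whitespace or the layout around the paginator, the double-hop breaks.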

3. Using a Neural Network to Automatically Recognize and Classify English Articles

(1) Main techniques

Keras

pip install keras

Installing TensorFlow on Windows 7

Visit the page Python Extension Packages for Windows, which hosts unofficial prebuilt wheels.

Note: I later found this was a Python version issue; 64-bit Python 3.7 on Windows 7 was not yet supported, while on Python 3.6 you can directly pip install tensorflow.

pip install tensorflow

(2) Source code

import re
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation

# paper1 .. paper20 held the 20 English paragraphs; their contents have been
# removed (see the note in (3) below), so empty placeholders remain
paper1 = paper2 = paper3 = paper4 = paper5 = ""
paper6 = paper7 = paper8 = paper9 = paper10 = ""
paper11 = paper12 = paper13 = paper14 = paper15 = ""
paper16 = paper17 = paper18 = paper19 = paper20 = ""
papers = [paper1, paper2, paper3, paper4, paper5, paper6, paper7, paper8,
          paper9, paper10, paper11, paper12, paper13, paper14, paper15,
          paper16, paper17, paper18, paper19, paper20]
# [1, 0] = finance, [0, 1] = sports
labels = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [1, 0],
                   [1, 0], [1, 0], [1, 0], [1, 0], [1, 0],
                   [1, 0], [1, 0], [1, 0], [1, 0], [1, 0],
                   [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]])

# split on non-word characters to tokenize
sp = re.compile(r'\W')
voca = []
for paper in papers:
    filteredwords = [_.lower() for _ in sp.split(paper) if _]
    voca.extend(filteredwords)
voca = list(set(voca))  # deduplicate

# bag-of-words count vectors
x_data = []
for paper in papers:
    vword = np.zeros(761)  # 761 was the vocabulary size of the original 20 paragraphs
    filteredwords = [_.lower() for _ in sp.split(paper) if _]
    for word in filteredwords:
        vword[voca.index(word)] += 1
    x_data.append(vword)
x_data = np.array(x_data)

model = Sequential()
model.add(Dense(761, input_dim=761))
model.add(Activation('sigmoid'))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# train on the last 18 samples, hold out the first 2 for validation
model.fit(x_data[2:, :], labels[2:, :], batch_size=18, epochs=10, verbose=True)
model.evaluate(x_data[:2, :], labels[:2, :], batch_size=2, verbose=1)

# print(model.predict(x_data[:2, :]))
pre_result = model.predict(x_data[:2, :])
if pre_result[0][0] > pre_result[0][1]:
    print('This article is a finance article')
else:
    print('This article is a sports article')
if pre_result[1][0] > pre_result[1][1]:
    print('This article is a finance article')
else:
    print('This article is a sports article')
input()

(3) Notes

Twenty English paragraphs were taken from the web: 15 on finance and 5 on sports, each labeled into one of two classes. Eighteen paragraphs are used for training and the remaining two for prediction and validation.

Labeling rules:

Finance articles are labeled [1, 0].

Sports articles are labeled [0, 1].
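The bag-of-words encoding and one-hot labeling described above can be sketched on two toy sentences (made up here, since the real paragraphs were removed from the source):

```python
import re
import numpy as np

# two toy "papers" standing in for the 20 real paragraphs
papers = ['Stocks fell sharply today.', 'The team won the match today.']
labels = np.array([[1, 0], [0, 1]])  # [1, 0] = finance, [0, 1] = sports

# tokenize on non-word characters, lowercase, build the vocabulary
sp = re.compile(r'\W')
voca = []
for paper in papers:
    voca.extend(w.lower() for w in sp.split(paper) if w)
voca = sorted(set(voca))  # deduplicated vocabulary

# one count vector per paper: vector[i] = occurrences of voca[i]
x_data = []
for paper in papers:
    vword = np.zeros(len(voca))
    for word in (w.lower() for w in sp.split(paper) if w):
        vword[voca.index(word)] += 1
    x_data.append(vword)
x_data = np.array(x_data)
print(x_data.shape)  # (2, 8): 2 papers, 8 distinct words
```

Note that word order is discarded entirely; only counts survive, e.g. "the" appears twice in the second sentence, so its entry is 2.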


I am not yet very familiar with Keras, so I will leave this as an open question and dig into it later. (Following a revision suggestion, the contents of the paper variables have all been removed here.)
