def jaccard_similarity(A, B):
    # Intersection of sets A and B
    numerator = A.intersection(B)
    # Union of sets A and B
    denominator = A.union(B)
    # Ratio of the two sizes
    similarity = len(numerator) / len(denominator)
    return similarity

similarity = jaccard_similarity(A, B)
print(similarity)
The result is 0.25, which matches the manual calculation.
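One edge case is worth noting: the ratio is undefined when both sets are empty, because the union then has size zero and the division raises ZeroDivisionError. The following is a minimal sketch of a guarded variant. The name safe_jaccard_similarity and the empty_value parameter are hypothetical, and returning 1.0 for two empty sets is only one possible convention, not something the original text prescribes.

```python
def safe_jaccard_similarity(A, B, empty_value=1.0):
    """Jaccard similarity that does not fail when both sets are empty.

    empty_value is a hypothetical parameter: the value returned when the
    union is empty (assumed convention: two empty sets count as identical).
    """
    union = A | B
    if not union:
        # Both sets are empty, so the ratio 0/0 is undefined
        return empty_value
    return len(A & B) / len(union)


print(safe_jaccard_similarity(set(), set()))
print(safe_jaccard_similarity({1, 2}, {2, 3}))
```

Pass empty_value=0.0 if two empty sets should be treated as dissimilar in your application.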
8. Computing the Jaccard distance in Python
Using the same data, compute the Jaccard distance:
def jaccard_distance(A, B):
    # Symmetric difference of the two sets
    numerator = A.symmetric_difference(B)
    # Union of the two sets
    denominator = A.union(B)
    # Ratio of the two sizes
    distance = len(numerator) / len(denominator)
    return distance

distance = jaccard_distance(A, B)
print(distance)
The result is 0.75, which matches the manual calculation.
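The two results sum to 1, which is no coincidence: the symmetric difference is the union minus the intersection, so the distance always equals 1 minus the similarity. The sketch below checks this on two hypothetical sets. They were chosen only so that the values come out as 0.25 and 0.75 and are not necessarily the A and B used in this article.

```python
def jaccard_similarity(A, B):
    # |A ∩ B| / |A ∪ B|
    return len(A & B) / len(A | B)


def jaccard_distance(A, B):
    # |A △ B| / |A ∪ B|
    return len(A ^ B) / len(A | B)


# Hypothetical example sets (assumed for illustration only)
A = {"apple", "milk"}
B = {"milk", "eggs", "sugar"}

similarity = jaccard_similarity(A, B)  # 1 shared item out of 4 distinct items
distance = jaccard_distance(A, B)      # 3 non-shared items out of 4 distinct items

print(similarity, distance, similarity + distance)
```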
9. Asymmetric binary variables in Python
# Import modules
import numpy as np
from scipy.spatial.distance import jaccard
from sklearn.metrics import jaccard_score
Create two vectors from the matrix:
     Apple  Tomato  Eggs  Milk  Coffee  Sugar
A    1      0       0     1     1       1
B    0      0       1     1     1       0
A = np.array([1, 0, 0, 1, 1, 1])
B = np.array([0, 0, 1, 1, 1, 0])

similarity = jaccard_score(A, B)
distance = jaccard(A, B)

print(f'Jaccard similarity is equal to: {similarity}')
print(f'Jaccard distance is equal to: {distance}')
The output is:
Jaccard similarity is equal to: 0.4
Jaccard distance is equal to: 0.6
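These values can be checked by hand from the table. For asymmetric binary variables, positions where both vectors are 0 (here, Tomato) are ignored, and only the three other kinds of position are counted. The sketch below does the count in plain Python. The names m11, m10 and m01 are illustrative labels for the counts, not part of any library.

```python
A = [1, 0, 0, 1, 1, 1]
B = [0, 0, 1, 1, 1, 0]

pairs = list(zip(A, B))

# Positions where both vectors are 1 (Milk, Coffee)
m11 = sum(1 for a, b in pairs if a == 1 and b == 1)
# Positions where only A is 1 (Apple, Sugar)
m10 = sum(1 for a, b in pairs if a == 1 and b == 0)
# Positions where only B is 1 (Eggs)
m01 = sum(1 for a, b in pairs if a == 0 and b == 1)

# Positions where both are 0 (Tomato) are left out of the denominator
similarity = m11 / (m11 + m10 + m01)
distance = (m10 + m01) / (m11 + m10 + m01)

print(m11, m10, m01)
print(similarity, distance)
```

With m11 = 2, m10 = 2 and m01 = 1, the similarity is 2/5 and the distance is 3/5, in agreement with the library output.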
10. Computing Jaccard similarity for Chinese text in Python
import pandas as pd
import jieba
import re
# Load the data and the Chinese stopword list
data = pd.read_excel("https://file.lianxh.cn/data/m/mda.xlsx")
stopwords = pd.read_csv("https://file.lianxh.cn/data/c/cn_stopwords.txt", names=["stopwords"])
# Build the stopword set once so each lookup is fast
stopword_set = set(stopwords["stopwords"])

# Define the word segmentation function
def cut_words(text):
    words_list = []
    text = re.sub(r"[\W\d]", "", text)  # Strip symbols and digits
    words = jieba.lcut(text)
    for word in words:
        if word not in stopword_set:
            words_list.append(word)
    return " ".join(words_list)
# Segment the text in the BusDA column
data["BusDA"] = data["BusDA"].apply(cut_words)
data
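Because cut_words returns the tokens joined by spaces, two segmented texts can be compared by splitting them back into token sets and applying the same ratio as before. The following is only a sketch under stated assumptions: token_jaccard is a hypothetical helper, the two sample strings are made up rather than taken from mda.xlsx, and each text is assumed to contain at least one token.

```python
def token_jaccard(text_a, text_b):
    """Jaccard similarity between two space-separated token strings.

    Hypothetical helper; assumes at least one of the texts is non-empty,
    otherwise the union is empty and the division fails.
    """
    tokens_a = set(text_a.split())
    tokens_b = set(text_b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up segmented texts, in the same format that cut_words produces
doc_1 = "公司 营业 收入 增长"
doc_2 = "公司 净利润 增长"

# Shared tokens: 公司, 增长; distinct tokens overall: 5
print(token_jaccard(doc_1, doc_2))
```

Applied to the real data, the same function could take any two entries of data["BusDA"] after segmentation.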