import pandas as pd
wordsWeWant = ["ball", "bat", "ball-sports"]
words = [
"football, ball-sports, ball",
"ball, bat, ball, ball, ball, ballgame, football, ball-sports",
"soccer",
"football, basketball, roundball, ball" ]
df = pd.DataFrame({"WORDS":words})
df["WORDS_list"] = df["WORDS"].str.split(",")
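This produces a DataFrame column full of string values, which are always separated by commas with no spaces in between (the values can contain hyphens, underscores, digits, and other non-letter characters). In addition, a substring can occur multiple times, and it can appear before or after a partial match (partial matches should not be returned, only exact ones).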
WORDS                                                         WORDS_list
football, ball-sports, ball                                   ['football', ' ball-sports', ' ball']
ball, bat, ball, ball, ball, ballgame, football, ball-sports  ['ball', ' bat', ' ball', ' ball', ' ball', ' ballgame', ' football', ' ball-sports']
soccer                                                        ['soccer']
football, basketball, roundball, ball                         ['football', ' basketball', ' roundball', ' ball']
(Sorry, I don't know how to paste the output DataFrame or how to paste from Excel.)
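What I want is a new column that contains the matches with no duplicates. I tried using some regular expressions but could not get them to work as expected. Next, I tried a set intersection, but when I convert the column to lists (i.e. "WORDS_list") and run this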
df["WORDS_list"].apply(lambda x: list(set(x).intersection(set(wordsWeWant))))
I end up with unexpected results (see below):
0 []
1 [ball]
2 []
3 []
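My real dataset can be very large, and I need to check for multiple items in each string, so I would like to avoid a nested for loop that iterates over wordsWeWant for every row of the "WORDS" column. I was thinking that .map or .apply would be a faster approach. If the returned column is a list, it could then be converted into a string of words separated by a comma and a space.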
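
For reference, here is a minimal sketch of the kind of result I am after. It assumes the unexpected output comes from the leading spaces that str.split(",") leaves on each token (e.g. ' ball' instead of 'ball'), so it strips every token before comparing. The names exact_matches, MATCHES_list and MATCHES are made up for illustration, and I have not checked how this performs on a large dataset.

```python
import pandas as pd

words_we_want = ["ball", "bat", "ball-sports"]
df = pd.DataFrame({"WORDS": [
    "football, ball-sports, ball",
    "ball, bat, ball, ball, ball, ballgame, football, ball-sports",
    "soccer",
    "football, basketball, roundball, ball",
]})

# A set gives constant-time membership checks for each token.
wanted = set(words_we_want)

def exact_matches(cell):
    """Return the wanted words found in one cell, without duplicates."""
    # Split on commas and strip whitespace so ' ball' becomes 'ball'.
    tokens = [t.strip() for t in cell.split(",")]
    # dict.fromkeys drops repeats while keeping first-seen order.
    # Whole-token comparison means 'football' never matches 'ball'.
    return [t for t in dict.fromkeys(tokens) if t in wanted]

df["MATCHES_list"] = df["WORDS"].apply(exact_matches)
# Join each list into a single string separated by a comma and a space.
df["MATCHES"] = df["MATCHES_list"].str.join(", ")
print(df[["WORDS", "MATCHES"]])
```

With the sample data, row 1 should end up as "ball, bat, ball-sports" and the "soccer" row as an empty string. Unlike set.intersection, this keeps the words in the order they first appear in the string.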