This can be done by using apply to search for df2's patterns in each row of df1. This will require runtime proportional to O(n*m)
, where n is the number of rows in df1, and m is the number of rows in df2. This is not very efficient, but that's fine for small dataframes.
Once we identify the matches between df1 and df2, we can merge the two dataframes. After that, we just need to clean up the dataframe and drop unneeded columns.
Code:
import pandas as pd
import fnmatch
df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})
def wildcard_search(pattern):
# Comment this line to require exact match
pattern = "*" + pattern + "*"
# Apply pattern to every A values within df1
matching = df1['A'].apply(lambda x: fnmatch.fnmatch(x, pattern))
# Get index of largest member
first_match = matching.idxmax()
# If we have all zeros, then first_match will refer to the first
# zero. Check for this.
if matching.loc[first_match] == 0:
return None
# print(first_match)
return df1.loc[first_match, 'A']
# Using df2 patterns, search through df1. Record values found.
df2['merge_key'] = df2['A'].apply(wildcard_search)
# Merge two dataframes, on cols merge_key and A
res = df2.merge(
df1,
left_on='merge_key',
right_on='A',
suffixes=("_x", "") # Don't add a suffix to df1's columns
)
# Reorder cols, drop unneeded
res = res[['A', 'B', 'C']]
print(res)
This answer is adapted from this post.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…