regex - Merge two dataframes on string columns with values containing wilcards as for like in SQL - Python

Question

Welcome To Ask or Share your Answers For Others

regex - Merge two dataframes on string columns with values containing wilcards as for like in SQL - Python

asked Feb 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - Merge two dataframes on string columns with values containing wilcards as for like in SQL - Python

I want to merge 2 dataframes on string columns with values containing wildcards as we can do with like in SQL.

Example :

import pandas as pd

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

df1
                                    A  B
0      He eat an apple in his office.  1
1  There are many apples on the tree.  2

df2
            A  C
0  apple*tree  4
1   apple*pie  9


pd.merge(df1, df2, on = ['A']) 

# What it gives me :

Empty DataFrame
Columns: [A, B, C]
Index: []


# What I want:
                                    A  B  C
0  There are many apples on the tree.  2  4

I want to join the two dataframes and "apple*tree" of df2 has to match the sentence "There are many apples on the tree." of df1.

Can you help me to do this please?

I have found the function fnmatch.fnmatch(string, pattern) but can I use it in this case with a merge?

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-02-06T00:14:59+0000

This can be done by using apply to search for df2's patterns in each row of df1. This will require runtime proportional to O(n*m), where n is the number of rows in df1, and m is the number of rows in df2. This is not very efficient, but that's fine for small dataframes.

Once we identify the matches between df1 and df2, we can merge the two dataframes. After that, we just need to clean up the dataframe and drop unneeded columns.

Code:

import pandas as pd
import fnmatch

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

def wildcard_search(pattern):
    # Comment this line to require exact match
    pattern = "*" + pattern + "*"
    # Apply pattern to every A values within df1
    matching = df1['A'].apply(lambda x: fnmatch.fnmatch(x, pattern))
    # Get index of largest member
    first_match = matching.idxmax()
    # If we have all zeros, then first_match will refer to the first
    # zero. Check for this.
    if matching.loc[first_match] == 0:
        return None
    # print(first_match)
    return df1.loc[first_match, 'A']

# Using df2 patterns, search through df1. Record values found.
df2['merge_key'] = df2['A'].apply(wildcard_search)

# Merge two dataframes, on cols merge_key and A
res = df2.merge(
    df1,
    left_on='merge_key',
    right_on='A',
    suffixes=("_x", "")  # Don't add a suffix to df1's columns
)
# Reorder cols, drop unneeded
res = res[['A', 'B', 'C']]
print(res)

This answer is adapted from this post.

Categories

regex - Merge two dataframes on string columns with values containing wilcards as for like in SQL - Python

regex - Merge two dataframes on string columns with values containing wilcards as for like in SQL - Python

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags