I am trying to collect stats on my data that has two columns- subject and _time. I have a list of regex patterns that I would like to filter the data to derive a summary dataframe showing the pattern that has a match or more, the timestamp of the first instance matched, the timestamp of the last instance matched, and then its count (number of times that pattern matched with the subjects).
Here is a list of my regex patterns:
['^Payment Issue', 'Invoice\s#(([0-9])\w+)\sfrom\s([a-zA-Z]\w+)', '(Cares+message)']
Here is the data that is collected as a dictionary and turned into a DataFrame:
subject type _time
0 Payment Issue email 2020-12-09 19:45:48.013140
1 Payment Issue email 2020-13-09 19:45:48.013140
2 Invoice\ email 2020-12-09 19:45:48.013140
3 Hello! email 2020-14-11 19:45:48.013140
I want to create a new DataFrame that has the following columns- pattern, first_seen, last_seen, count. "regex_pattern" is the pattern that matched with the subject, "first_seen" would be the value of the earliest _time, "last_seen" would be the latest _time, and count would be the number of times the pattern matched with subjects.
This is what I want the output to look like:
pattern first_seen last_seen count
0 ^Payment Issue 2020-12-09 19:45:48.013140 2020-13-09 19:45:48.013140 2
1 Invoice\s#(([0-9])\w+)\sfrom\s([a-zA-Z]\w+) 2020-12-09 19:45:48.013140 2020-12-09 19:45:48.013140 1
2 (Cares+message) NA NA 0
Here is my python so far:
# Collect a list of regex patterns to query against the data
patterns = ["^Payment Issue", "Invoice\s#(([0-9])\w+)\sfrom\s([a-zA-Z]\w+)", "(Cares+message)"]
# Collect the table that holds the dictionary of emails sent
email_result_dict = {'subject': ['Payment Issue ', 'Payment Issue ', 'Invoice\ ', 'Hello!'],
'type': ['email', 'email', 'email', 'email'],
'_time': [' 2020-12-09 19:45:48.013140', '2020-13-09 19:45:48.013140',
'2020-12-09 19:45:48.013140', '2020-14-09 19:45:48.013140']}
# Put data results in to a DataFrame
df2 = pd.DataFrame.from_dict(email_result_dict)
# List of subject values from the data
value_list = []
# Gather all subjects
snd = [result['subject'] for result in email_result_dict]
value_list.extend(snd)
df = pd.DataFrame()
for j in patterns:
p = re.compile(j, re.IGNORECASE)
for i in value_list:
matches = p.findall(i)
if len(matches) != 0:
df = df.append({'regex': j, 'matched': matches}, ignore_index=True)
# Create the output DataFrame
new_dataFrame = df.groupby('regex')['matched'].count().reset_index()
new_dataFrame.columns = ['regex', 'count']
This script runs but is missing the first_seen and last_seen data. Any ideas or suggestions as to how to get my expected output in a dataframe? I have tried everything and keep getting stuck on getting the first_seen and last_seen values for each match that the count is greater than 0.