python - Extracting matches with the original case used in the pattern during a case insensitive search

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

A regex match returns the text that was matched. What if I instead want the part of the pattern, as it was written, that matched that text?

See the below example:

>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']

But I want the output to look like this: ['ERP', 'Gap', 'ERP', 'ERP']

Because if I do a group by and sum on the original output, I would get the following output as a dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

But what if I want the output to look like

ERP 3
Gap 2

in line with the keywords I am searching for?

MORE CONTEXT

I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"

I want to count the number of times each keyword appears in the string. With plain pattern matching, I get the following output: [ERP, erp, ErP, GAP, gap].

If I then aggregate and count, I get the following dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

While I want the output to look like this:

ERP 3
Gap 2

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:33:49+0000

You may build the pattern dynamically to include indices of the words you search for in the group names and then grab those pattern parts that matched:

import re

words = ["ERP", "Gap"]
# Map each group name (g0, g1, ...) back to the keyword as written in the list
words_dict = {f'g{i}': item for i, item in enumerate(words)}

# One named group per keyword, wrapped in word boundaries
rx = rf"\b(?:{'|'.join([rf'(?P<g{i}>{item})' for i, item in enumerate(words)])})\b"

text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'

results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    # Only one named group is non-empty per match; look up its keyword
    results.append([words_dict.get(key) for key, value in match.groupdict().items() if value][0])

print(results)  # => ['ERP', 'Gap', 'ERP', 'ERP']

The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:

\b - a word boundary
(?: - start of a non-capturing group encapsulating pattern parts:
- (?P<g0>ERP) - Group "g0": ERP
- | - or
- (?P<g1>Gap) - Group "g1": Gap
) - end of the group
\b - a word boundary.
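
To make the group structure concrete, here is a small sketch that prints groupdict() for every match. The pattern is written out by hand, with the word boundaries taken from the breakdown above; the variable names pattern and rows are only illustrative.

```python
import re

# Hand-written form of the generated pattern (assumed to include word boundaries)
pattern = re.compile(r"\b(?:(?P<g0>ERP)|(?P<g1>Gap))\b", re.IGNORECASE)
text = "ERP is integral part of GAP, so erp can never be ignored, ErP!"

# Each match reports every named group; groups that did not take part are None
rows = [m.groupdict() for m in pattern.finditer(text)]
for row in rows:
    print(row)
```

Each printed dictionary should have exactly one value that is not None, and its key (g0 or g1) identifies which keyword produced the match.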


Note that the [0] index in [words_dict.get(key) for key,value in match.groupdict().items() if value][0] works in all cases because, whenever there is a match, exactly one named group has matched.
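
Because only one named group takes part in each match, a possible variation is to read match.lastgroup instead of filtering groupdict(), and to feed the result into collections.Counter to get the counts asked for in the question. This is a sketch rather than part of the original answer: the helper name count_keywords is hypothetical, and the re.escape call and the word boundaries are assumptions added here.

```python
import re
from collections import Counter

def count_keywords(words, text):
    """Count case-insensitive hits per keyword, keyed by the keyword's own spelling."""
    # Group name -> keyword as written in the list (g0, g1, ... as in the answer)
    names = {f"g{i}": word for i, word in enumerate(words)}
    # re.escape is an added precaution for keywords containing regex metacharacters
    alternatives = "|".join(
        f"(?P<{name}>{re.escape(word)})" for name, word in names.items()
    )
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    # lastgroup is the name of the one named group that matched
    return Counter(names[m.lastgroup] for m in pattern.finditer(text))

counts = count_keywords(["ERP", "Gap"], "ERP, erp, ErP, GAP, gap")
for keyword, n in counts.items():
    print(keyword, n)
```

For the string from the question's extra context, this should print ERP 3 and Gap 2. The Counter can be turned into a dataframe afterwards if needed.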
