Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Extracting matches with the original case used in the pattern during a case insensitive search

A regex match returns the text that matched. What if I want the part of the pattern that produced the match instead?

See the example below:

>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']

But I want the output to look like this: ['ERP', 'Gap', 'ERP', 'ERP']

Because if I group and count the raw matches (for a string such as "ERP, erp, ErP, GAP, gap"), I get the following dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

But what if I want the output to look like

ERP 3
Gap 2

in line with the keywords I am searching for?

MORE CONTEXT

I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"

I want to count the number of times each keyword appears in the string. With plain pattern matching, I get the following output: [ERP, erp, ErP, GAP, gap].

If I then aggregate and count, I get the following dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

While I want the output to look like this:

ERP 3
Gap 2

1 Answer

You may build the pattern dynamically so that each keyword gets its own named group, with the keyword's index in the group name, and then look up which group matched:

import re

words = ["ERP", "Gap"]
# Map each group name (g0, g1, ...) back to the keyword as originally written
words_dict = {f'g{i}': item for i, item in enumerate(words)}

# One named group per keyword, wrapped in word boundaries
rx = rf"\b(?:{'|'.join([rf'(?P<g{i}>{item})' for i, item in enumerate(words)])})\b"

text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'

results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    # Only the group that matched has a value; translate its name to the keyword
    results.append([words_dict.get(key) for key, value in match.groupdict().items() if value][0])

print(results) # => ['ERP', 'Gap', 'ERP', 'ERP']
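
To get the per-keyword counts the question asks for, the normalized matches can then be tallied. The following is a minimal sketch, not part of the original answer: it assumes the standard-library collections.Counter as a stand-in for the dataframe group-by, the helper name count_keywords is hypothetical, and re.escape is added on the assumption that the keywords are meant as literal text.

```python
import re
from collections import Counter


def count_keywords(words, text):
    """Count case-insensitive hits, keyed by each keyword's original spelling."""
    # Group names g0, g1, ... map back to the keywords as written in `words`.
    names = {f"g{i}": word for i, word in enumerate(words)}
    alternatives = "|".join(
        f"(?P<g{i}>{re.escape(word)})" for i, word in enumerate(words)
    )
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

    counts = Counter()
    for match in pattern.finditer(text):
        # Exactly one named group is non-None for any given match.
        key = next(k for k, v in match.groupdict().items() if v is not None)
        counts[names[key]] += 1
    return counts


print(count_keywords(["ERP", "Gap"], "ERP, erp, ErP, GAP, gap"))
# Counter({'ERP': 3, 'Gap': 2})
```

The result is a dict subclass, so it can be handed to whatever tabular tool is in use.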

The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:

  • \b - a word boundary
  • (?: - start of a non-capturing group encapsulating pattern parts:
    • (?P<g0>ERP) - Group "g0": ERP
    • | - or
    • (?P<g1>Gap) - Group "g1": Gap
  • ) - end of the group
  • \b - a word boundary.
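
The breakdown above can be checked by building the pattern and printing it. This is a short sketch; the sample strings "mind the gaps" and "mind the gap" are illustrations of my own, not taken from the original post.

```python
import re

words = ["ERP", "Gap"]

# One named group per keyword, joined with | and wrapped in word boundaries.
alternatives = "|".join(f"(?P<g{i}>{word})" for i, word in enumerate(words))
pattern = rf"\b(?:{alternatives})\b"
print(pattern)  # \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b

rx = re.compile(pattern, re.IGNORECASE)

# The word boundaries keep the keywords from matching inside longer words.
print(rx.search("mind the gaps"))         # None
print(rx.search("mind the gap").group())  # gap
```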

Note that the [0] index in [words_dict.get(key) for key,value in match.groupdict().items() if value][0] is safe in all cases: whenever there is a match, exactly one of the named groups has matched, so the list always holds exactly one item.
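
Because only one named group takes part in each match, the lookup can also be written with the match object's lastgroup attribute, which holds the name of the last matched capturing group. This is an alternative sketch rather than the code from the answer; re.escape is my addition, on the assumption that the keywords are literal text and contain no groups of their own.

```python
import re

words = ["ERP", "Gap"]
words_dict = {f"g{i}": word for i, word in enumerate(words)}
alternatives = "|".join(
    f"(?P<g{i}>{re.escape(word)})" for i, word in enumerate(words)
)
rx = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

text = "ERP is integral part of GAP, so erp can never be ignored, ErP!"

# match.lastgroup is "g0" or "g1", whichever alternative matched.
results = [words_dict[m.lastgroup] for m in rx.finditer(text)]
print(results)  # ['ERP', 'Gap', 'ERP', 'ERP']
```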

