Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - quanteda kwic regex operation

Further edit to the original question.
The question arose from my expectation that regexes here would work the same, or nearly the same, as in grep or in a programming language. Below is what I expected; the fact that it did not happen prompted my question (using Cygwin):

echo "regex unusual operation will deport into a different" > out.txt
grep "will * dep" out.txt
"regex unusual operation will deport into a different"


Original question
I am trying to follow https://github.com/kbenoit/ITAUR/blob/master/README.md
to learn quanteda, having seen that everybody who uses this package finds it very good.
In demo.R, line 22, I find:
kwic(immigCorpus, "deport", window = 3)  

Its output is:

[BNP, 157]        The BNP will | deport | all foreigners convicted  
[BNP, 1946]                . 2. | Deport | all illegal immigrants    
[BNP, 1952] immigrants We shall | deport | all illegal immigrants  
[BNP, 2585]  Criminals We shall | deport | all criminal entrants  

To learn the basics, I run

kwic(immigCorpus, "will *depo", window = 3, valuetype = "regex")

expecting to get

[BNP, 157]        The BNP will | deport | all foreigners convicted

but I get:

kwic object with 0 rows

Similar attempts, such as

kwic(immigCorpus, ".*will *depo.*", window = 3, valuetype = "regex")

give the same result:

kwic object with 0 rows

Why is that? Is it because of tokenization? If so, how should I write the regex?

PS Thanks for this great package


1 Answer


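You are trying to match a phrase with your pattern. kwic() searches the tokenized text, and each pattern is matched against individual tokens, so a regex that spans a space, such as "will *depo", cannot match any single token. By default, the pattern argument is treated as a space-separated list of keywords, and the search is performed against this list rather than against a multi-word sequence. To match consecutive tokens, wrap the pattern in phrase(), which gives your expected result: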

> kwic(immigCorpus, phrase("will deport"), window = 3)
[BNP, 156:157] - The BNP | will deport | all foreigners convicted
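
The difference between grep-style matching and token-level matching can be sketched in base R. This is only an illustration, not quanteda's implementation: the sample sentence is pieced together from the output in the question, and splitting on spaces is an assumed stand-in for real tokenization.

```r
# Base R only; this imitates token-level matching and is not quanteda's code.
txt <- "The BNP will deport all foreigners convicted"

# grep-style: the regex sees the whole line, spaces included.
whole_line_hit <- grepl("will *dep", txt)

# Token-style: split on spaces first (assumed stand-in for tokenization),
# then test the same kind of regex against each token separately.
toks <- strsplit(txt, " +")[[1]]
per_token_hits <- grepl("will *depo", toks)

# A regex that fits inside one token still works at the token level.
single_token_hits <- toks[grepl("depo", toks)]

print(whole_line_hit)
print(any(per_token_hits))
print(single_token_hits)
```

No token contains both "will" and "depo", so the per-token search comes back empty even though the whole-line search succeeds.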

Setting valuetype = "regex" only makes sense if the pattern actually is a regex. For example, to get both "shall deport" and "will deport", use

> kwic(immigCorpus, phrase("(will|shall) deport"), window = 3, valuetype = "regex")

   [BNP, 156:157]             - The BNP | will deport  | all foreigners convicted
 [BNP, 1951:1952] illegal immigrants We | shall deport | all illegal immigrants  
 [BNP, 2584:2585]  Foreign Criminals We | shall deport | all criminal entrants 
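
One way to picture a multi-word regex pattern is as one regex per token, applied to consecutive tokens. The base R sketch below assumes that reading. `match_phrase` is a hypothetical helper, not a quanteda function. The sample tokens are taken from the output above. Matching is made case-insensitive on the assumption that kwic does the same by default, which the "Deport" hit in the question's output suggests.

```r
# Hypothetical helper (not part of quanteda): return the start positions where
# each regex in `patterns` matches the corresponding consecutive token.
match_phrase <- function(toks, patterns) {
  n <- length(patterns)
  if (length(toks) < n) return(integer(0))
  starts <- seq_len(length(toks) - n + 1)
  hit <- vapply(starts, function(i) {
    window <- toks[i:(i + n - 1)]
    all(mapply(function(p, tok) grepl(p, tok, ignore.case = TRUE),
               patterns, window))
  }, logical(1))
  starts[hit]
}

# Space-split tokens as an assumed stand-in for real tokenization.
toks <- strsplit("Criminals We shall deport all criminal entrants", " +")[[1]]

# Split the phrase pattern on spaces: one regex per token.
patterns <- strsplit("(will|shall) deport", " ", fixed = TRUE)[[1]]

starts <- match_phrase(toks, patterns)
matched <- paste(toks[starts:(starts + 1)], collapse = " ")
print(starts)
print(matched)
```

The space in the pattern separates the per-token regexes and is never matched as a character, which is why " *" between words has no effect here.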

See the kwic documentation for details.

