I'm trying to split a series of sentences into separate words, that is, to tokenize the text.
I have found an R package, splitstackshape, that does almost what I want, except that it truncates the output to the first and last 5 rows.
This is what I need to do:
id text
1 Lorem ipsum dolor sit amet
2 consectetur adipiscing elit
3 Donec euismod enim quis
4 nunc fringilla sodales
5 Etiam tempor ligula vitae
6 pellentesque dictum
7 Quisque non justo scelerisque
8 est facilisis congue quis vel
9 Phasellus ex lorem
10 eleifend at magna vel
11 egestas eleifend massa
Output:
id word
1 Lorem
1 ipsum
1 dolor
1 sit
1 amet
2 consectetur
2 adipiscing
...
That is, I need each word in a separate row, alongside the ID of the sentence it belongs to.
I tried cSplit(data, "text", " ", "long"), but it truncates the output.
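For reference, the reshaping described above can be sketched in base R without any package. This is only a minimal illustration, not a fix for cSplit itself: the data frame name `data` and its columns `id` and `text` are assumed from the example table, only the first three rows are reproduced, and words are assumed to be separated by single spaces. The result is a plain data.frame rather than the data.table that cSplit returns.

```r
# Hypothetical sample data mirroring the first rows of the table above
data <- data.frame(
  id = 1:3,
  text = c("Lorem ipsum dolor sit amet",
           "consectetur adipiscing elit",
           "Donec euismod enim quis"),
  stringsAsFactors = FALSE
)

# Split each sentence on single spaces; gives a list with one
# character vector of words per row
words <- strsplit(data$text, " ", fixed = TRUE)

# Repeat each id once per word in its sentence, then flatten the words
result <- data.frame(
  id = rep(data$id, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)

print(result)
```

If the real text contains runs of whitespace or punctuation, the split pattern would need adjusting, for example a regular expression instead of a fixed space.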
Update. FYI, here is how to do the reverse