Does anyone know of a Java library that finds sentence boundaries? I'm imagining something like a smart StringTokenizer that knows about all of the sentence terminators that different languages use.
Here's my experience with BreakIterator:
Using the example here:
I have the following Japanese:
今日はパソコンを買った。高性能のマックは早い!とても快適です。
In ASCII, written as Java Unicode escapes, it looks like this:
\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002
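To double-check what is actually in a string, including invisible characters such as a leading U+FEFF, a small helper can dump every UTF-16 code unit as an escape. This is only a sketch: the names `UnicodeEscapeDemo` and `toUnicodeEscapes` are made up for illustration and are not part of the sample.

```java
final class UnicodeEscapeDemo {

    // Hypothetical helper: renders every UTF-16 code unit of the text
    // as a backslash-u escape with four lowercase hex digits.
    static String toUnicodeEscapes(String text) {
        StringBuilder escaped = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            escaped.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        // Any string can go here; this one is the first sentence of the text above.
        String text = "今日はパソコンを買った。";
        System.out.println(toUnicodeEscapes(text));
        // The number of code units is what BreakIterator offsets count against.
        System.out.println("length = " + text.length());
    }
}
```

Because the dump shows one escape per code unit, a character that is invisible in an editor still shows up and still counts toward any index.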
Here's the part of that sample that I changed:
static void sentenceExamples() {
    // Request a sentence iterator for the Japanese locale
    Locale currentLocale = new Locale("ja", "JP");
    BreakIterator sentenceIterator =
        BreakIterator.getSentenceInstance(currentLocale);
    String someText = "今日はパソコンを買った。高性能のマックは早い!とても快適です。";
When I look at the boundary indices, I see this:
0|13|24|32
But those indices don't correspond to any sentence terminators.
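One thing worth checking here: BreakIterator reports offsets between characters, not the index of a terminator itself. Assuming the string really does start with U+FEFF, as the escaped form suggests, the first 。 sits at index 12, the ! at index 23 and the final 。 at index 31, so 13, 24 and 32 would each be the position just after a terminator. A minimal sketch of turning consecutive boundaries into sentences with substring is below. `SentenceSplitDemo` and `splitSentences` are hypothetical names, and the actual break positions depend on the JDK's rules for the locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitDemo {

    // Hypothetical helper: cuts the text at each boundary the iterator reports.
    // Each boundary is an offset between characters, so substring(start, end)
    // yields one sentence including its terminator.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            sentences.add(text.substring(start, end));
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The text from the question, written with escapes so the leading U+FEFF is explicit.
        String text = "\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
                + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
                + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";

        // Print each piece together with the offsets it spans.
        int offset = 0;
        for (String sentence : splitSentences(text, Locale.JAPAN)) {
            int end = offset + sentence.length();
            System.out.println(offset + ".." + end + ": " + sentence);
            offset = end;
        }
    }
}
```

Printing the offsets next to each piece makes it easy to compare them with the indices above and to see whether a leading invisible character is shifting everything by one.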