string - Java library that finds sentence boundaries

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

Using the example here: I have the following Japanese:

今日はパソコンを買った。高性能のマックは早い！とても快適です。

In ASCII (as Unicode escapes), it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of that sample that I changed:

static void sentenceExamples() {

  Locale currentLocale = new Locale("ja", "JP");
  BreakIterator sentenceIterator =
     BreakIterator.getSentenceInstance(currentLocale);
  String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
  // ... the rest of the sample is unchanged
}

When I look at the boundary indices, I see this:

0|13|24|32

But those indices don't correspond to any sentence terminators.

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:22:58+0000

You want to look into the internationalized BreakIterator classes in java.text. BreakIterator.getSentenceInstance is a good starting point for sentence boundaries.
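
As a rough illustration of that suggestion, here is a minimal sketch that splits text into sentences with java.text.BreakIterator. The class name SentenceSplitter and its split method are hypothetical names made up for this example; only the BreakIterator calls come from the JDK. The sample text is the Japanese string from the question, written as Unicode escapes (without the leading \ufeff) so the source file stays plain ASCII.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of any library.
final class SentenceSplitter {

    // Returns the sentences of text, cut at the boundaries that
    // java.text.BreakIterator reports for the given locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // next() returns the offset just past the current sentence,
        // that is, the index after the terminator rather than the
        // index of the terminator itself.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The Japanese text from the question, as Unicode escapes.
        String text = "\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
                + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
                + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";
        for (String sentence : split(text, Locale.JAPAN)) {
            System.out.println(sentence);
        }
    }
}
```

Note that each boundary is an offset between characters, so it falls one position after a terminator. If the string in the question really does begin with \ufeff, as its escaped form suggests, it is 32 characters long with terminators at indices 12, 23 and 31, and the reported boundaries 13, 24 and 32 would then sit directly after 。, ！ and 。 respectively.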
