The H2 answer is good, but probably overkill. All the words in the English language won't amount to more than a few MB, so just use a Set. You could drop this into RAnders00's program:
public static void read50Gigs(String fileLocation, String newFileLocation) {
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {
        while (scanner.hasNext()) {
            words.add(scanner.next());
        }
        System.out.println("words size " + words.size());
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
To get an estimate of how many distinct words to expect, I ran this on War and Peace (from Project Gutenberg):
public static void read50Gigs(String fileLocation, String newFileLocation) {
    try {
        Set<String> words = Files.lines(Paths.get("war and peace.txt"))
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", ""))
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());
        System.out.println("words size " + words.size()); // 22100
        Files.write(Paths.get("out.txt"), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
It completed in under a second. Note that you can't use Files.lines unless your huge source file actually contains line breaks. With line breaks, it processes the file line by line, so it won't use too much memory.
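If the file has no line breaks, the Scanner approach from the first snippet still works, because Scanner pulls one whitespace-delimited token at a time from the stream regardless of line structure. A minimal, self-contained sketch (the class name, helper method, and sample input are mine, not from the question):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public class DistinctWords {
    // Streams tokens with Scanner, so it works even when the input has no
    // line breaks; only the set of distinct words is held in memory.
    public static Set<String> distinctWords(Path source) throws IOException {
        Set<String> words = new HashSet<>();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("words", ".txt");
        // One long line, no line breaks, with repeated words
        Files.writeString(tmp, "the quick brown fox jumps over the lazy dog the end");
        System.out.println(distinctWords(tmp).size()); // 9 distinct words
        Files.delete(tmp);
    }
}
```

Since the whole input is one "line", Files.lines would materialize it as a single giant String, while Scanner keeps only its buffer plus the word set.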