Delete duplicate words in a big text file - Java

asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I have a text file larger than 50 GB and want to delete the duplicate words from it. I have heard that loading every word from the file into a HashSet needs a very large amount of RAM. Is there a good way to delete every duplicate word from the file? The words are separated by whitespace, like this:

word1 word2 word3 ... ...

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:26:48+0000

The H2 answer is good, but maybe overkill. All the words in the English language won't take more than a few MB, so the set of unique words should fit in memory even though the file itself is 50 GB. Just use a set. You could use this in RAnders00's program.

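import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    // Only the unique words are kept in memory, not the whole file.
    Set<String> words = new HashSet<>();
    try (FileInputStream fileInputStream = new FileInputStream(fileLocation);
         Scanner scanner = new Scanner(fileInputStream)) {

        // Scanner splits on whitespace by default, so line breaks are not required.
        while (scanner.hasNext()) {
            String nextWord = scanner.next();
            words.add(nextWord);
        }
        System.out.println("words size " + words.size());
        // Writes one word per line; TRUNCATE_EXISTING avoids leftover bytes from an older file.
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}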
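The HashSet version above writes the words in arbitrary order and only after the whole file has been read. If the cleaned file should keep each word at its first position, a streaming variant could look like the sketch below. The class name OrderedDedup, the method dedupe, the UTF-8 encoding, and the single-space separator are assumptions for illustration, not part of the original answer.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper: names and encoding are assumptions for this sketch.
final class OrderedDedup {

    /** Copies each word to target the first time it is seen; returns the number of unique words. */
    static long dedupe(Path source, Path target) throws IOException {
        Set<String> seen = new HashSet<>();
        try (Scanner scanner = new Scanner(source, "UTF-8");
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            while (scanner.hasNext()) {
                String word = scanner.next();
                // Set.add returns true only for words not seen before.
                if (seen.add(word)) {
                    if (seen.size() > 1) {
                        writer.write(' '); // separator before every word except the first
                    }
                    writer.write(word);
                }
            }
            // Scanner swallows read errors; rethrow the last one if there was any.
            IOException failure = scanner.ioException();
            if (failure != null) {
                throw failure;
            }
        }
        return seen.size();
    }

    public static void main(String[] args) throws IOException {
        long unique = dedupe(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("unique words: " + unique);
    }
}
```

As with the set-based version, memory use grows with the number of distinct words, not with the size of the file, so this only works when the vocabulary is small enough to fit in RAM.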

To estimate how many distinct words to expect, I ran this on War and Peace (from Project Gutenberg):

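import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static void read50Gigs(String fileLocation, String newFileLocation) {
    // Files.lines keeps the file open, so close the stream when done.
    try (Stream<String> lines = Files.lines(Paths.get(fileLocation))) { // e.g. "war and peace.txt"
        Set<String> words = lines
                .map(s -> s.replaceAll("[^a-zA-Z\\s]", "")) // keep letters and whitespace only
                .flatMap(Pattern.compile("\\s")::splitAsStream)
                .collect(Collectors.toSet());

        System.out.println("words size " + words.size()); // 22100 for War and Peace
        Files.write(Paths.get(newFileLocation), words,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);

    } catch (IOException e) {
        throw new RuntimeException(e); // do not swallow I/O errors silently
    }
}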

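It completed in 0 seconds. Note that Files.lines only helps if your huge source file has line breaks. It reads lazily, one line at a time, so memory use stays low, but a 50 GB file that consists of a single line would have to be held in memory as one string. The Scanner version above splits on whitespace, so it does not depend on line breaks.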
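For a file without line breaks, the same stream-style pipeline can run over whitespace-delimited tokens instead of lines. The following is a minimal sketch assuming Java 9 or later (for Scanner.tokens()) and UTF-8 input. The class name TokenStreamDedup, the method uniqueWords, and the choice of a TreeSet for sorted output are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper: names are assumptions for this sketch.
final class TokenStreamDedup {

    /** Returns the distinct words of the file, stripped of non-letters, in sorted order. */
    static Set<String> uniqueWords(Path source) throws IOException {
        try (Scanner scanner = new Scanner(source, "UTF-8")) {
            return scanner.tokens()                          // one token per whitespace-separated word
                    .map(s -> s.replaceAll("[^a-zA-Z]", "")) // drop punctuation and digits
                    .filter(s -> !s.isEmpty())               // skip tokens that had no letters
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> words = uniqueWords(Paths.get(args[0]));
        System.out.println("words size " + words.size());
        Files.write(Paths.get(args[1]), words); // one word per line
    }
}
```

Because the tokens are read one at a time, a single 50 GB line is never loaded as a whole; only the set of distinct words stays in memory.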
