wordpress - PHP swear word filter

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm working on a WordPress plugin that replaces the bad words from the comments with random new ones from a list.

I now have 2 arrays: one containing the bad words and another containing the good words.

$bad = array("bad", "words", "here");
$good = array("good", "words", "here");

Since I'm a beginner, I got stuck at some point.

In order to replace the bad words, I've been using $newstring = str_replace($bad, $good, $string);.

My first problem is that I want to turn off case sensitivity, so that I don't have to list every variant such as "bad", "Bad", "BAD", "bAd", "BAd", etc. However, I need the new word to keep the format of the original word: for example, if I write "Bad", it should be replaced with "Words", but if I type "bad", it should be replaced with "words".

My first thought was to use str_ireplace, but it forgets whether the original word had a capital letter.

The second problem is that I don't know how to deal with the users that type like this: "b a d", "w o r d s", etc. I need an idea.

In order to make it select a random word, I think I can use $new = $good[rand(0, count($good)-1)]; then $newstring = str_replace($bad, $new, $string);. If you have a better idea, I'm here to listen.

The general look of my script:

function noswear($string)
{
    if ($string)
    {
        $bad = array("bad", "words");
        $good = array("good", "words");
        $newstring = str_replace($bad, $good, $string);
        return $newstring;
    }
}

echo noswear("I see bad words coming!");

Thank you in advance for your help!

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:20:45+0000

Precursor

There are (as has been pointed out in the comments numerous times) gaping holes for you - and/or your code - to fall into through implementing such a feature, to name but a few:

People will add characters to fool the filter
People will become creative (e.g. innuendo)
People will use passive aggression and sarcasm
People will use sentences/phrases not just words

You'd do better to implement a moderation/flagging system where people can flag offensive comments which can then be edited/removed by mods, users, etc.

On that understanding, let us proceed...

Solution

Given that you:

Have a forbidden word list $bad_words
Have a replacement word list $good_words
Want to replace bad words regardless of case
Want to replace bad words with random good words
Have a correctly escaped bad word list: see http://php.net/preg_quote

You can very easily use PHP's preg_replace_callback function:

$input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';

$bad_words  = array('could', 'would', 'should');
$good_words = array('might', 'will');

function replace_words($matches){
    global $good_words;
    return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
}

echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);

Okay, so the pattern passed to preg_replace_callback is built by joining all of the bad words with |, and the callback is then called once for every match. The pattern has the format:

/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i

The i modifier makes it case insensitive so both bad and Bad would match.

The function replace_words then takes the matched word and its boundaries (either blank or a white space character) and replaces it with the boundaries and a random good word.

global $good_words; <-- Makes the $good_words variable accessible from within the function
$matches[1] <-- The word boundary before the matched word
$matches[3] <-- The word boundary after  the matched word
$good_words[rand(0, count($good_words)-1)] <-- Selects a random good word from $good_words
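As a small variation on the random selection described above, array_rand() can pick a random key directly instead of computing the index by hand. This is only a sketch: the helper name pick_replacement is made up for illustration and is not part of the original answer, and it takes the list as a parameter instead of relying on global.

```php
<?php
// Hypothetical helper (not from the original answer): returns one random
// entry from a non-empty list of replacement words.
function pick_replacement(array $good_words): string
{
    // array_rand() returns a random key of the given array
    return $good_words[array_rand($good_words)];
}

$good_words = array('might', 'will');
echo pick_replacement($good_words), "\n"; // prints either "might" or "will"
```

Passing the list in as an argument also makes the helper easier to reuse from an anonymous function or a wrapper.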

Anonymous function

You could rewrite the above as a one-liner using an anonymous function in the preg_replace_callback call:

echo preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );

Function wrapper

If you're going to use it multiple times, you may also write it as a self-contained function. In that case you will most likely want to pass the good/bad word lists in when calling it (or hard-code them in the function), but that depends on how you derive them...

function clean_string($input_string, $bad_words, $good_words){
    return preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );
}

echo clean_string($input_string, $bad_words, $good_words);

Output

Running the above functions consecutively with the input and word lists shown in the first example:

This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?

Of course the replacement words are chosen randomly so if I refreshed the page I'd get something else... But this shows what does/doesn't get replaced.
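Note that in the output above "Could" was replaced by a lowercase word, so the capitalisation asked for in the question is not kept. One possible way to handle that, shown here as a rough sketch rather than a definitive solution, is to inspect the matched word in the callback and copy its style onto the replacement. The names match_case and clean_string_keep_case are hypothetical, and only three styles (all caps, first letter capitalised, anything else) are distinguished.

```php
<?php
// Hypothetical helper: copy the capitalisation style of $original onto $replacement.
function match_case(string $original, string $replacement): string
{
    if (strlen($original) > 1 && strtoupper($original) === $original) {
        return strtoupper($replacement);   // "COULD" -> "MIGHT"
    }
    if (ucfirst(strtolower($original)) === $original) {
        return ucfirst($replacement);      // "Could" -> "Might"
    }
    return $replacement;                   // "could", "cOuld" -> "might"
}

// Hypothetical variant of clean_string() that keeps the case of the matched word.
function clean_string_keep_case($input_string, array $bad_words, array $good_words)
{
    // Escape each word so it is treated literally inside the pattern
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $bad_words);

    return preg_replace_callback(
        '/(^|\b|\s)(' . implode('|', $escaped) . ')(\b|\s|$)/i',
        function ($matches) use ($good_words) {
            $good = $good_words[array_rand($good_words)];
            // $matches[2] is the bad word exactly as the user typed it
            return $matches[1] . match_case($matches[2], $good) . $matches[3];
        },
        $input_string
    );
}

echo clean_string_keep_case(
    'Could it? It SHOULD, it would.',
    array('could', 'would', 'should'),
    array('might')
);
// prints: Might it? It MIGHT, it might.
```

Mixed styles such as "bAd" fall through to the plain replacement, which seems a reasonable default for a small filter.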

N.B.

Escaping `$bad_words`

foreach($bad_words as $key=>$word){
    $bad_words[$key] = preg_quote($word);
}
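One detail worth checking when escaping: preg_quote() only escapes the pattern delimiter if it is passed as the second argument. Since the patterns in this answer use / as the delimiter, a word containing / would otherwise break the regex. A minimal sketch, with a made-up word list chosen purely to show the effect:

```php
<?php
// Example list (illustrative only) containing regex metacharacters
$bad_words = array('could', 'a/b', 'c.d');

// Escape regex metacharacters and the "/" delimiter used by the pattern
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $bad_words);

print_r($escaped); // could, a\/b, c\.d
```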

Word boundaries

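In this code I've used \b, \s, and ^ or $ as word boundaries, and there is a good reason for this. While white space, start of string, and end of string are all considered word boundaries, \b will not match in all cases, for example:

\b\$h1t\b <---Will not match

This is because \b only matches at a position between a word character (i.e. [a-zA-Z0-9_]) and a non-word character, and characters like $ don't count as word characters. When $h1t follows a space or starts the string, there is no word character on either side of that position, so \b fails there.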
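The difference can be checked with a short experiment. This sketch assumes a sample sentence of my own choosing and compares a pattern that relies on \b alone with the (^|\b|\s) ... (\b|\s|$) alternation used in this answer:

```php
<?php
$text = 'some $h1t here';

// \b alone: there is no word boundary between the space and "$"
$with_b_only = preg_match('/\b\$h1t\b/', $text);

// Alternation used in this answer: \s matches the space before "$"
$with_alternation = preg_match('/(^|\b|\s)(\$h1t)(\b|\s|$)/', $text);

var_dump($with_b_only);      // int(0)
var_dump($with_alternation); // int(1)
```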

Misc

Depending on the size of your word list there are a couple of potential hiccups. From a system design perspective it's generally bad form to have huge regexes for a couple of reasons:

It can be difficult to maintain
It's difficult to read/understand what it does
It's difficult to find errors
It can be memory intensive if the list is too large

Given that the regex pattern is built by PHP from the word list rather than written by hand, the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging/moderation system).

To clarify, I don't see a problem with having a small word list to filter out specific expletives, as it serves a purpose: to stop users from having an outburst at one another. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words and if that doesn't work then, for the last time, implement a flagging/moderation system.
