Replace blacklisted word unless it's a part of whitelisted word - string
Alright, so there's the unapproved Advanced Swear Filter, which apparently should have this feature. It doesn't work properly, so copying its filtering algorithm is out of the question.
What I'm trying to do is filter out curse words from a string, as long as they're not whitelisted. An example: I blacklisted "shit" and whitelisted the word "doshite" (sorry, couldn't think of anything better). If the user says "If you want to say why in Japanese, you say doshite.", nothing happens. If they say "Fuck off dipshit", it turns into "Fuck off dip****". How would you go about this? I have the blacklisted words stored in Array: blacklist and the whitelisted words in Array: whitelist. I'd also like to avoid the breaking-into-single-words method entirely, as I'll remove the whitespace before filtering the string anyway. |
Re: Replace blacklisted word unless it's a part of whitelisted word - string
Removing whitespace actually makes it harder. The best way I can think of is to use a regex that finds the word containing "shit", with wildcards on both sides to capture the whole word. Then you can run that word through your whitelist to see whether you should block it or not.
regex: \b\w*shit\w*\b This should return "doshite", which you then run through your whitelist. If it's not in the whitelist, assume it's a bad word. (Do NOT remove spaces.) |
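For reference, the whole-word approach above could be sketched like this. It is written in Python purely for illustration (not AMXX Pawn), and the function name `censor_words` and the sample lists are assumptions, not something from this thread.

```python
import re

def censor_words(text, blacklist, whitelist):
    """Star out a blacklisted term inside any word that is not whitelisted.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    allowed = {w.lower() for w in whitelist}
    for bad in blacklist:
        if not bad:
            continue  # an empty entry would match every word
        # \b\w*bad\w*\b grabs the entire word surrounding the blacklisted term.
        pattern = re.compile(r"\b\w*" + re.escape(bad) + r"\w*\b", re.IGNORECASE)

        def replace(match):
            word = match.group(0)
            if word.lower() in allowed:
                return word  # whole word is whitelisted, leave it alone
            # Otherwise star out only the blacklisted part of the word.
            return re.sub(re.escape(bad), "*" * len(bad), word, flags=re.IGNORECASE)

        text = pattern.sub(replace, text)
    return text
```

With blacklist `["shit"]` and whitelist `["doshite"]`, the sentence containing "doshite" comes back unchanged and "Fuck off dipshit" becomes "Fuck off dip****". This only works while word boundaries are intact, which is why the spaces have to stay.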
Re: Replace blacklisted word unless it's a part of whitelisted word - string
I have to remove spaces, as people just write "s h i t" instead, which draws even more attention and is exactly the opposite of what I'm trying to achieve.
|
Re: Replace blacklisted word unless it's a part of whitelisted word - string
Negative lookbehind would be well suited in this case.
Code:
(?<!do)shit
Code:
(?<!do)shit(?!e)
Whitespace is really not good to remove, because you could end up with new words that are blacklisted. Since whitespace is easy to account for, it would be better to remove all special characters (.,-_*! and so on) and just do:
Code:
(?<!do)\s*s\s*h\s*i\s*t |
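To see what the first two expressions actually accept, here is a small comparison in Python (used only for illustration; the sample word list and the helper name `matched` are assumptions). Note that adding the `(?!e)` lookahead also lets "shite" through.

```python
import re

# Sample words taken from the discussion in this thread.
samples = ["shit", "shite", "dipshit", "doshit", "doshite"]

lookbehind_only = re.compile(r"(?<!do)shit")
with_lookahead = re.compile(r"(?<!do)shit(?!e)")

def matched(pattern, words):
    """Return the words in which the pattern finds at least one match."""
    return [w for w in words if pattern.search(w)]

hits_behind = matched(lookbehind_only, samples)  # ['shit', 'shite', 'dipshit']
hits_both = matched(with_lookahead, samples)     # ['shit', 'dipshit']
```

The lookbehind and the lookahead are independent conditions here, so "doshit" and "shite" are each let through on their own, not only the combined "doshite".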
Re: Replace blacklisted word unless it's a part of whitelisted word - string
I guess I could keep the whitespace. Anyway, I have very little experience with regex, especially AMXX-wise. How do I go about this if I have exactly two arrays, one with blacklisted words and one with whitelisted words?
Also, "shite" really needs to be filtered. I take it there's no other option than breaking the whole string into words and then comparing them one by one? |
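One way to work from exactly two arrays without splitting the string into words is to mark every character covered by a whitelisted entry first, and then star out only those blacklist hits that are not fully covered. The following is a rough Python sketch of that idea (not AMXX Pawn); the function name `censor` is an assumption, and it presumes that lower-casing does not change the string length, which holds for ASCII chat text.

```python
def censor(text, blacklist, whitelist):
    """Star out blacklisted substrings unless a whitelisted entry covers them."""
    lower = text.lower()  # assumes lower() keeps the length (true for ASCII)
    protected = [False] * len(text)

    # Pass 1: mark every character that belongs to a whitelisted occurrence.
    for good in whitelist:
        g = good.lower()
        if not g:
            continue
        start = lower.find(g)
        while start != -1:
            for i in range(start, start + len(g)):
                protected[i] = True
            start = lower.find(g, start + 1)

    # Pass 2: star out blacklist hits that are not entirely protected.
    chars = list(text)
    for bad in blacklist:
        b = bad.lower()
        if not b:
            continue
        start = lower.find(b)
        while start != -1:
            end = start + len(b)
            if not all(protected[start:end]):
                for i in range(start, end):
                    chars[i] = "*"
            start = lower.find(b, start + 1)
    return "".join(chars)
```

With blacklist `["shit"]` and whitelist `["doshite"]`, "doshite" is left alone while "shite" still gets filtered to "****e", so no word-by-word comparison is needed.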
Re: Replace blacklisted word unless it's a part of whitelisted word - string
My last post is based on blacklist only.
You check the whole string for results using the last expression I posted. It will match everything that contains "shit" with whitespace in between unless it has a prefix of "do". Perhaps removing the first whitespace check would be good. Otherwise you might end up missing for example "You don't do shit". The following will match "shit" and part of "shite", "dipshit", etc., but not "doshit" (or "doshite").
Code:
(?<!do)s\s*h\s*i\s*t
I'll show it without the whitespace checks so it will be easier to read. The following will match the whole word of "shit", "shite", "dipshit" but not "doshit".
Code:
(?<!do)(?:dip)?shit(?:e)?
Code:
(?<!do)s\s*h\s*i\s*t(?:\s*e)?
Code:
(?<!do)s\s*h\s*i\s*t[\w]* |
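The whitespace-tolerant form does not have to be typed by hand for every word; it can be generated from the word itself. A minimal Python sketch (illustration only, not AMXX Pawn; `spaced_pattern`, `star_out` and the `protected_prefix` parameter are hypothetical names):

```python
import re

def spaced_pattern(word, protected_prefix=""):
    """Build e.g. (?<!do)s\\s*h\\s*i\\s*t from the word 'shit' and prefix 'do'."""
    # Allow optional whitespace between every pair of letters.
    body = r"\s*".join(re.escape(ch) for ch in word)
    lookbehind = "(?<!" + re.escape(protected_prefix) + ")" if protected_prefix else ""
    return re.compile(lookbehind + body, re.IGNORECASE)

def star_out(match):
    """Replace every non-whitespace character of the match with '*'."""
    return "".join(c if c.isspace() else "*" for c in match.group(0))

pattern = spaced_pattern("shit", protected_prefix="do")
spaced_result = pattern.sub(star_out, "s h i t happens")  # '* * * * happens'
```

Keeping the whitespace inside the replacement means the message length and layout stay the same after filtering.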
Re: Replace blacklisted word unless it's a part of whitelisted word - string
I really need it with dynamic lists though, otherwise it's pointless.
|
Re: Replace blacklisted word unless it's a part of whitelisted word - string
Dynamic lists? What do you mean?
|
Re: Replace blacklisted word unless it's a part of whitelisted word - string
I mean that the user has the ability to write his own whitelist or blacklist entries into a config file which then gets parsed and its content saved to the arrays.
|
Re: Replace blacklisted word unless it's a part of whitelisted word - string
Well, anything can be read from a file, meaning everything can be made into a dynamic list.
I gave you some options to pick from. This is the best I can come up with at the moment: Code:
Code:
PRE: shit doshite dipshit shite |
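For dynamic lists, one possibility is to generate the lookaround expression from the two arrays at load time. The sketch below is in Python for illustration only (not AMXX Pawn), and `build_pattern` and `censor` are hypothetical names. For every whitelisted word that contains the blacklisted one, it adds a guard meaning "not preceded by the whitelisted prefix while the whitelisted remainder follows".

```python
import re

def build_pattern(bad, whitelist):
    """Compile a regex for `bad` that skips occurrences inside whitelisted words."""
    bad = bad.lower()
    guards = []
    for good in whitelist:
        good = good.lower()
        start = good.find(bad)
        while start != -1:
            prefix, rest = good[:start], good[start:]
            # e.g. 'doshite' with 'shit' yields the guard (?!(?<=do)shite)
            behind = "(?<=" + re.escape(prefix) + ")" if prefix else ""
            guards.append("(?!" + behind + re.escape(rest) + ")")
            start = good.find(bad, start + 1)
    return re.compile("".join(guards) + re.escape(bad), re.IGNORECASE)

def censor(text, blacklist, whitelist):
    """Apply one generated pattern per blacklisted word."""
    for bad in blacklist:
        if not bad:
            continue  # an empty entry would match at every position
        text = build_pattern(bad, whitelist).sub("*" * len(bad), text)
    return text
```

Unlike `(?<!do)shit(?!e)`, the combined guard only spares the exact whitelisted combination, so "shite" and "doshit" are still filtered while "doshite" is not.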