The characters you are talking about are also UTF-8.
You can tell single-byte (ANSI) characters apart from multibyte ones by calling the
GetCharBytes function, which returns the number of bytes a character occupies.
Personally, I filter nicknames by removing all multibyte UTF-8 characters (that's possibly not the correct approach)
and also removing all chars with code < 32 (control characters).
If you look at Client Name Fixer by @CrazyHackGUT, he uses a technique of removing all characters that have a char byte length > 2. That is possibly the correct way to do what you are asking about.
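
To make the idea concrete, here is a rough sketch of that kind of filter in plain C (not SourcePawn, and not the code from either plugin). utf8_char_bytes and filter_name are hypothetical names I made up for illustration. In a plugin, GetCharBytes would take the place of utf8_char_bytes. The max_bytes parameter is an assumption that lets you pick the limit: 1 drops every multibyte character, 2 drops only characters longer than 2 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character that starts at s,
 * derived from its lead byte. Invalid lead bytes are reported as 1 byte. */
int utf8_char_bytes(const char *s)
{
    unsigned char c = (unsigned char)s[0];
    if (c < 0x80)           return 1; /* plain ASCII */
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;                         /* stray continuation or invalid byte */
}

/* Hypothetical filter: copies `in` to `out`, dropping
 *  - control characters (code < 32),
 *  - characters longer than max_bytes bytes,
 *  - stray bytes >= 0x80 that do not start a valid sequence.
 * `out` is always NUL-terminated when out_size > 0. */
void filter_name(const char *in, char *out, size_t out_size, int max_bytes)
{
    size_t len = strlen(in);
    size_t i = 0, o = 0;

    if (out_size == 0)
        return;

    while (i < len) {
        unsigned char c = (unsigned char)in[i];
        size_t n = (size_t)utf8_char_bytes(in + i);

        if (i + n > len)
            break; /* truncated sequence at the end of the string */

        int is_control = (n == 1 && c < 32);
        int is_stray   = (n == 1 && c >= 0x80);
        int too_long   = ((int)n > max_bytes);

        if (!is_control && !is_stray && !too_long) {
            if (o + n >= out_size)
                break; /* no room left for this character plus the NUL */
            memcpy(out + o, in + i, n);
            o += n;
        }
        i += n;
    }
    out[o] = '\0';
}
```

Calling filter_name(name, buffer, sizeof buffer, 2) keeps ASCII and 2-byte characters (Latin accents, Cyrillic and so on) and drops 3-byte and 4-byte ones such as most symbols and emoji. Passing 1 instead gives the stricter "remove everything multibyte" behaviour.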
For reference, here is a plugin I'm using (that's for L4D).