Regex to match Cyrillic letters


Pelipoika
03-16-2016, 09:41
How can I match Cyrillic letters (https://en.wikipedia.org/wiki/List_of_Cyrillic_letters) with a regex?

klippy
03-16-2016, 10:14
A single Cyrillic character can probably be matched with

[\u0400-\u04FF]

according to this page: https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode

Pelipoika
03-16-2016, 10:20
Regex rgx = new Regex("[\0400-\04FF]");

gives error 043: character constant exceeds range for packed string

klippy
03-16-2016, 10:27
Try

[\\u0400-\\u04FF]

I don't know whether the regex engine will parse that properly, though. I haven't done much with regexes.

Pelipoika
03-16-2016, 10:29
Didn't like that either:
error 010: invalid function or declaration
error 008: must be a constant expression; assumed zero

klippy
03-16-2016, 10:36
That doesn't seem right. Did you put the pattern inside quotation marks? Post that line (and maybe a few lines around it).

Pelipoika
03-16-2016, 10:41
public void OnPluginStart()
{
    HookUserMessage(GetUserMessageId("SayText2"), UsrMsg_SayText2, true);
}

Regex rgx = new Regex("[\\u0400-\\u04FF]");

public Action UsrMsg_SayText2(UserMsg msg_id, Handle msg, const int[] players, int playersNum, bool reliable, bool init)
{
    char params[4][64];
    int client = BfReadByte(msg);

    for (int i = 0; i < 4; i++)
        BfReadString(msg, params[i], sizeof(params[]));

    if(rgx.Match(params[1]))
    {

    }
}

klippy
03-16-2016, 12:55
The initialization has to be done inside a function body; you can't leave it outside like that.

shavit
03-16-2016, 13:05
Yeah. And if you insist on making the regex a global, you can declare it outside a function and then assign it with 'regex = new Regex();' inside one.
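
A minimal sketch of that layout, in case it helps: the global is only declared at file scope and is assigned in OnPluginStart. The name g_rgxCyrillic is made up for illustration, and the pattern and flag are the ones that end up working later in this thread, not something this post has established.

```sourcepawn
#include <sourcemod>
#include <regex>

// Declared at global scope but not initialized here: "new Regex(...)" is
// not a constant expression, so it cannot be used as a global initializer.
// (g_rgxCyrillic is a hypothetical name.)
Regex g_rgxCyrillic;

public void OnPluginStart()
{
    // Assigned inside a function body instead.
    // Pattern and flag are taken from the end of this thread.
    g_rgxCyrillic = new Regex("[\xd0\x80-\xd3\xbf]", PCRE_UTF8);
}
```

Any callback that runs after OnPluginStart can then use g_rgxCyrillic, ideally after checking that it is not null.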

Pelipoika
03-16-2016, 13:21
That compiled, but

Regex rgx = new Regex("[\\u0400-\\u04FF]");
if(rgx != INVALID_HANDLE)
{
    if(rgx.Match(params[1]))
        PrintToServer("%N", client);
}
else
    PrintToServer("Invalid handle");

always prints "Invalid handle".

Grey83
03-16-2016, 16:52
Maybe the PCRE_UTF8 flag is needed for Unicode?

Regex rgx = new Regex("[\\u0400-\\u04FF]", PCRE_UTF8);

Pelipoika
03-17-2016, 05:36
Still an invalid handle

klippy
03-17-2016, 06:26
I think that's because the Regex pattern couldn't be compiled. You can check that by retrieving the error while constructing the handle, as you can see here:

public native Regex(const char[] pattern, int flags = 0, char[] error="", int maxLen = 0, RegexError &errcode = REGEX_ERROR_NONE);

Just pass the extra arguments to the constructor.
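
A rough sketch of what that could look like, using the constructor signature quoted above. The variable names (error, errcode) and the 256-byte buffer size are arbitrary choices for illustration. The likely cause here is that PCRE does not support the \u escape, but treat that as an assumption until the error text confirms it.

```sourcepawn
#include <sourcemod>
#include <regex>

public void OnPluginStart()
{
    char error[256];      // receives the compile error text (size is arbitrary)
    RegexError errcode;   // receives the error code

    // Same pattern as before, but with the error outputs filled in.
    Regex rgx = new Regex("[\\u0400-\\u04FF]", PCRE_UTF8, error, sizeof(error), errcode);
    if (rgx == null)
    {
        // Compilation failed; report why instead of a bare "Invalid handle".
        PrintToServer("Regex failed to compile: %s (code %d)", error, errcode);
        return;
    }

    // Only reached if the pattern compiled.
    delete rgx;
}
```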

That being said, I don't know whether there is a way to achieve that when strings are made of byte-long characters. Maybe by splitting each Unicode character into two bytes? I don't know how that would work, but try the following:

Regex rgx = new Regex("[\xd0\x80-\xd3\xbf]", PCRE_UTF8);


Character hex codes are according to this page: http://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024.
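
For reference, the two byte pairs in that pattern are just the UTF-8 encodings of U+0400 and U+04FF, the ends of the Cyrillic block. Here is a small sketch of the arithmetic, assuming the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx) for code points U+0080 to U+07FF. PrintUtf8TwoByte is a made-up helper name.

```sourcepawn
#include <sourcemod>

// Splits a code point in the range U+0080..U+07FF into its two UTF-8 bytes.
// (Hypothetical helper, for illustration only.)
void PrintUtf8TwoByte(int codepoint)
{
    int lead  = 0xC0 | (codepoint >> 6);    // 110xxxxx: top 5 bits
    int trail = 0x80 | (codepoint & 0x3F);  // 10xxxxxx: low 6 bits
    PrintToServer("U+%X -> bytes %x %x", codepoint, lead, trail);
}

public void OnPluginStart()
{
    PrintUtf8TwoByte(0x0400);  // lead = 0xD0, trail = 0x80
    PrintUtf8TwoByte(0x04FF);  // lead = 0xD3, trail = 0xBF
}
```

With PCRE_UTF8 set, each byte pair is read as a single character, so the class spans U+0400 through U+04FF, the same range as the original [\u0400-\u04FF] idea.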

Pelipoika
03-17-2016, 06:41
That works, thank you