strlen() with multi-byte characters in mind


ElooKoN
        ^_^
Join Date: Apr 2011
Old 05-26-2013 , 08:58   strlen() with multi-byte characters in mind
Reply With Quote #1

Hello,

I'm wondering how to count the length of a string with multi-byte characters in mind.

If I count the length of "Apple" it will result in 5 but when I count "Äpfel" the result will be 6 (because of the multi-byte "Ä").

Does anyone have a solution for this problem?

Sincerely,
Lawrence
ElooKoN is offline
Impact123
Veteran Member
Join Date: Oct 2011
Location: Germany
Old 05-26-2013 , 09:54   Re: strlen() with multi-byte characters in mind
Reply With Quote #2

This should do just fine.

PHP Code:
stock StrLenMB(const String:str[])
{
    new len = strlen(str);
    new count;

    for(new i; i < len; i++)
    {
        // Count every byte that is not a UTF-8 trailing byte (10******)
        count += ((str[i] & 0xc0) != 0x80) ? 1 : 0;
    }

    return count;
}

The logic was taken from here.

Here is a slightly slower example using the native IsCharMB function.
PHP Code:
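stock StrLenMB2(const String:str[])
{
    new len = strlen(str);
    new count;
    new bytes;

    for(new i; i < len; i++)
    {
        // IsCharMB returns the byte count of a multi-byte character, 0 otherwise
        bytes = IsCharMB(str[i]);

        if(bytes > 0)
        {
            // Skip the remaining bytes of this character
            i += (bytes - 1);
        }

        count++;
    }

    return count;
}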

Yours sincerely
Impact
__________________

Last edited by Impact123; 05-28-2013 at 19:59.
Impact123 is offline
ElooKoN
        ^_^
Join Date: Apr 2011
Old 05-26-2013 , 10:16   Re: strlen() with multi-byte characters in mind
Reply With Quote #3

Thanks for your answer

Just tried:

Code:
PrintToAll("%d, %d, %d", StrLenMB("Blookoa"), StrLenMB("Blöökoa"), StrLenMB("Blooköa"));
This will result in

Code:
7, 6, 7
But

Code:
PrintToAll("%d, %d", StrLenMB("Apple"), StrLenMB("Äpfel"));
will (correctly) result in

Code:
5, 5
Curious : ))

Last edited by ElooKoN; 05-26-2013 at 10:18.
ElooKoN is offline
Impact123
Veteran Member
Join Date: Oct 2011
Location: Germany
Old 05-26-2013 , 12:48   Re: strlen() with multi-byte characters in mind
Reply With Quote #4

I've updated the stock I posted above; it should work now.

Yours sincerely
Impact
__________________
Impact123 is offline
ElooKoN
        ^_^
Join Date: Apr 2011
Old 05-26-2013 , 21:29   Re: strlen() with multi-byte characters in mind
Reply With Quote #5

Seems to work correctly, thank you!
ElooKoN is offline
htcarnage
Senior Member
Join Date: Oct 2009
Old 05-27-2013 , 03:51   Re: strlen() with multi-byte characters in mind
Reply With Quote #6

For uneducated readers, can you explain what your code means? Particularly

Code:
count += ((str[i] & 0xc0) != 0x80) ? 1 : 0;
this part
__________________
htcarnage is offline
11530
Veteran Member
Join Date: Sep 2011
Location: Underworld
Old 05-27-2013 , 12:24   Re: strlen() with multi-byte characters in mind
Reply With Quote #7

Variable-width encoding, as the name suggests, may use more than one byte to encode a single symbol. These multi-byte characters consist of a lead byte followed by one or more trailing bytes (up to three in current UTF-8; the original design allowed up to five). Without going into too much detail, we can ignore the trailing bytes and count only the lead byte when calculating the string's length.

So how do we know when we have a trailing byte to ignore? They always take the form 10******, i.e. a '10' in the two high-order bits. To extract those two bits we perform a bitwise AND with the bit-mask 11000000 (0xc0 in hexadecimal). We then compare the result to 10000000 (0x80 in hexadecimal), the pattern we want to find. If the first two bits are not '10', the byte is not a trailing byte, so it is either a lead byte or a single-byte character, and we add one to the length of our string so far.

e.g. Take the character Ä (U+00C4). This takes two bytes to encode (0xC3 0x84), which in binary are:
Code:
11000011 10000100
Using the formula we would count the first eight bits (since it is a leading byte) and we wouldn't count the second set of eight which is a trailing byte (we know this because it begins with '10').
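
To make the check concrete, here is a minimal sketch in old-style SourcePawn that applies the same mask to the bytes of "Äpfel". The helper name IsTrailingByte is made up for illustration and is not a SourceMod native, and the bytes are written out by hand so the result does not depend on how the source file is saved.

```sourcepawn
#include <sourcemod>

// Hypothetical helper, not from the posts above: true when the byte has
// the form 10******, i.e. it is a UTF-8 trailing byte.
stock bool:IsTrailingByte(byte)
{
    return (byte & 0xc0) == 0x80;
}

public OnPluginStart()
{
    // "Äpfel" as UTF-8 bytes: Ä = 0xC3 0x84, then 'p', 'f', 'e', 'l'.
    new bytes[] = { 0xC3, 0x84, 'p', 'f', 'e', 'l' };
    new count;

    for (new i = 0; i < sizeof(bytes); i++)
    {
        // Count lead bytes and single-byte characters, skip trailing bytes.
        if (!IsTrailingByte(bytes[i]))
        {
            count++;
        }
    }

    // 6 bytes, 5 characters.
    PrintToServer("bytes: %d, characters: %d", sizeof(bytes), count);
}
```

Only 0x84 matches the 10****** pattern, so it is the one byte skipped, which is why "Äpfel" comes out as 5 rather than 6.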

Hopefully all of this is correct and makes sense.
__________________

Last edited by 11530; 05-30-2013 at 08:49.
11530 is offline
ElooKoN
        ^_^
Join Date: Apr 2011
Old 05-27-2013 , 13:05   Re: strlen() with multi-byte characters in mind
Reply With Quote #8

+1
ElooKoN is offline
berni
SourceMod Plugin Approver
Join Date: May 2007
Location: Austria
Old 05-28-2013 , 06:46   Re: strlen() with multi-byte characters in mind
Reply With Quote #9

Note that there is also a native SourceMod function, "IsCharMB".
__________________
Why reinvent the wheel ? Download smlib with over 350 useful functions.

When people ask me "Plz" just because it's shorter than "Please" I feel perfectly justified to answer "No" because it's shorter than "Yes"
powered by Core i7 3770k | 32GB DDR3 1886Mhz | 2x Vertex4 SSD Raid0
berni is offline
Impact123
Veteran Member
Join Date: Oct 2011
Location: Germany
Old 05-28-2013 , 11:09   Re: strlen() with multi-byte characters in mind
Reply With Quote #10

I added another example using the native IsCharMB function, but it may not be perfect.

Yours sincerely
Impact
__________________

Last edited by Impact123; 05-28-2013 at 11:15.
Impact123 is offline