8. Mbs

The mbs(3m) module provides extended string functions that work with the locale-dependent encoding, such as UTF-8 or any 8-bit encoding. They are useful for identifying complete substrings of UTF-8 sequences, for example when terminal output must account for the number of display positions that a sequence of characters will occupy. More generally, the objective of this module is to emulate the behavior of non-multibyte Unicode string manipulation, like that of UTF-16 and Java encodings, although such behavior has not been verified.

Please note some of these functions are not actively used by the author. They have been tested but should be considered experimental and may be subject to change or removal.

8.1. Multibyte string functions

These functions convert multibyte sequences into UCS codes in order to determine the number of characters in a string (read: the number of display positions, provided your strings are not polluted with control characters), determine the size of a complete valid sequence of characters, create a copy of a complete valid sequence of characters, return the substring starting at an offset measured in characters, and so on.

Which encoding is used depends on the locale. Programs that use these functions will exhibit the same behavior in many different locales. Developers can test the success of their work by running their program in the UTF-8 locale, provided they have a capable terminal, a Unicode font, supporting mbtowc(3) and wctomb(3) functions, and a __STDC_ISO_10646__ environment. Although this may not be obvious, the Linux glibc 2.2 and Solaris with dtterm environments appear to meet these requirements.

To execute a program in the UTF-8 locale on a glibc 2.2+ Linux system try:

  plain$ xterm -u8 -fn '-*-fixed-*-*-*-*-12-*-*-*-*-*-iso10646-1'
  xterm$ LANG=en_US.UTF-8 ./someprogram
  
For more information on UTF-8 and i18n, particularly on Linux, read the UTF-8 and Unicode FAQ for Unix/Linux.

The mbslen function
Synopsis

  #include <mba/mbs.h>
  int mbslen(const char *src);
Description
The mbslen function will return the number of characters in the multibyte string pointed to by src. Characters in this context are control characters and complete multibyte sequences. Combining characters are not reduced. See mbswidth(3m) for calculating display positions.

The mbsnlen function
Synopsis

  #include <mba/mbs.h>
  int mbsnlen(const char *src, size_t sn, int cn);
Description
The mbsnlen function will return the number of characters in the multibyte string pointed to by src. Characters in this context are control characters and complete multibyte sequences. Combining characters are not reduced. See mbswidth(3m) for calculating display positions. No more than sn bytes of src will be examined and no more than cn characters will be converted to make the determination. Either or both of sn and cn can be -1, indicating that the constraint should be ignored (no limit).
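
The counting rule described above can be sketched in portable C. This is not the library's implementation: it assumes the input is UTF-8 and decodes it by hand rather than through the locale's conversion functions, it does not reject overlong forms, and the names utf8_seqlen and sketch_mbsnlen are hypothetical. It also assumes that counting simply stops at the first invalid or incomplete sequence, which may not match how the library reports such input.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length (1..4) of the complete UTF-8 sequence at s, given that at
 * most n bytes may be examined. Returns 0 at the terminating '\0' or when
 * n is 0, and -1 if the sequence is malformed or cut short. */
static int utf8_seqlen(const unsigned char *s, size_t n)
{
    int len, i;

    if (n == 0 || s[0] == '\0') return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xE0) == 0xC0) len = 2;
    else if ((s[0] & 0xF0) == 0xE0) len = 3;
    else if ((s[0] & 0xF8) == 0xF0) len = 4;
    else return -1;
    if ((size_t)len > n) return -1;
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return -1; /* also stops at '\0' */
    return len;
}

/* Hypothetical stand-in for mbsnlen: count complete sequences in src,
 * examining at most sn bytes and counting at most cn characters.
 * (size_t)-1 for sn and a negative cn mean "no limit". */
int sketch_mbsnlen(const char *src, size_t sn, int cn)
{
    const unsigned char *s = (const unsigned char *)src;
    int count = 0, len;

    assert(src != NULL);
    while ((cn < 0 || count < cn) && (len = utf8_seqlen(s, sn)) > 0) {
        s += len;
        sn -= (size_t)len;
        count++;
    }
    return count;
}
```

Passing (size_t)-1 for sn works without a special case because it converts to the largest size_t value, so the byte limit is never reached before the terminating '\0'.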

The mbssize function
Synopsis

  #include <mba/mbs.h>
  size_t mbssize(const char *src);
Description
The mbssize function returns the number of bytes in a complete character sequence. Note this will not be the same as strlen(3) if there is an incomplete multibyte sequence at the end of the string.

The mbsnsize function
Synopsis

  #include <mba/mbs.h>
  size_t mbsnsize(const char *src, size_t sn, int cn);
Description
The mbsnsize function returns the number of bytes in a complete character sequence. No more than sn bytes of src will be examined and no more than cn characters will be converted. Note this will not be the same as strnlen(3) if the sn or cn constraints end on an incomplete multibyte sequence or if the '\0' is encountered in the middle of an incomplete multibyte sequence.
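
To illustrate how the size of a complete character sequence can differ from the raw byte length, here is a minimal sketch. It assumes UTF-8 input decoded by hand (the library relies on the locale's conversion functions instead), performs no overlong-form checks, and uses the hypothetical names utf8_seqlen and sketch_mbsnsize.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length (1..4) of the complete UTF-8 sequence at s, given that at
 * most n bytes may be examined. Returns 0 at the terminating '\0' or when
 * n is 0, and -1 if the sequence is malformed or cut short. */
static int utf8_seqlen(const unsigned char *s, size_t n)
{
    int len, i;

    if (n == 0 || s[0] == '\0') return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xE0) == 0xC0) len = 2;
    else if ((s[0] & 0xF0) == 0xE0) len = 3;
    else if ((s[0] & 0xF8) == 0xF0) len = 4;
    else return -1;
    if ((size_t)len > n) return -1;
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return -1; /* also stops at '\0' */
    return len;
}

/* Hypothetical stand-in for mbsnsize: number of bytes occupied by the
 * leading run of complete sequences, examining at most sn bytes and
 * converting at most cn characters (negative cn means no limit). */
size_t sketch_mbsnsize(const char *src, size_t sn, int cn)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t size = 0;
    int count = 0, len;

    assert(src != NULL);
    while ((cn < 0 || count < cn) &&
           (len = utf8_seqlen(s + size, sn - size)) > 0) {
        size += (size_t)len;
        count++;
    }
    return size;
}
```

For the five-byte string "abc\xE2\x82", whose last two bytes are the start of an unfinished three-byte sequence, strlen(3) reports 5 while this sketch reports 3.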

The mbsdup function
Synopsis

  #include <mba/mbs.h>
  char *mbsdup(const char *src);
Description
The mbsdup function will return a copy of the multibyte string at src. An incomplete multibyte sequence at the end of the string will not be copied. Only a complete valid multibyte string will be returned.

The mbsndup function
Synopsis

  #include <mba/mbs.h>
  char *mbsndup(const char *src, size_t sn, int cn);
Description
The mbsndup function will return a copy of the multibyte string at src. No more than sn bytes of src will be examined and no more than cn characters will be converted. If the sn or cn constraints end on an incomplete multibyte sequence or if the '\0' is encountered in the middle of an incomplete multibyte sequence those extra bytes will not be copied. Only a complete multibyte string will be returned.

The mbsoff function
Synopsis

  #include <mba/mbs.h>
  char *mbsoff(char *src, int off);
Description
The mbsoff function will return the substring of src that starts at off. The off parameter is measured in characters, where a character is either a display position or a control character. Strings rarely contain control characters, and from an ADT perspective they should not.

The mbsnoff function
Synopsis

  #include <mba/mbs.h>
  char *mbsnoff(char *src, int off, size_t sn);
Description
The mbsnoff function will return the substring of src that starts off characters into the string. No more than sn bytes of src will be examined. If the sn parameter is exhausted, a pointer to the next valid multibyte character sequence following the sn position is returned.
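
The idea of advancing by characters rather than bytes can be sketched as follows. This is a simplified illustration, not the library's code: it assumes UTF-8 input decoded by hand, treats every complete sequence as one character (combining and double-width characters get no special handling), and, when the sn bytes or the string run out, simply returns the position reached. The names utf8_seqlen and sketch_mbsnoff are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length (1..4) of the complete UTF-8 sequence at s, given that at
 * most n bytes may be examined. Returns 0 at the terminating '\0' or when
 * n is 0, and -1 if the sequence is malformed or cut short. */
static int utf8_seqlen(const unsigned char *s, size_t n)
{
    int len, i;

    if (n == 0 || s[0] == '\0') return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xE0) == 0xC0) len = 2;
    else if ((s[0] & 0xF0) == 0xE0) len = 3;
    else if ((s[0] & 0xF8) == 0xF0) len = 4;
    else return -1;
    if ((size_t)len > n) return -1;
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return -1; /* also stops at '\0' */
    return len;
}

/* Hypothetical stand-in for mbsnoff: skip off complete sequences,
 * examining at most sn bytes, and return a pointer to what follows. */
const char *sketch_mbsnoff(const char *src, int off, size_t sn)
{
    const unsigned char *s = (const unsigned char *)src;
    int len;

    assert(src != NULL && off >= 0);
    while (off > 0 && (len = utf8_seqlen(s, sn)) > 0) {
        s += len;
        sn -= (size_t)len;
        off--;
    }
    return (const char *)s;
}
```

Because the pointer only ever moves by whole sequences, the result never points into the middle of a multibyte character, which plain pointer arithmetic on src cannot guarantee.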

The mbschr function
Synopsis

  #include <mba/mbs.h>
  char *mbschr(char *src, wchar_t wc);
Description
The mbschr function will return a substring pointing to the first occurrence of the character wc in the multibyte string represented by src.

The mbsnchr function
Synopsis

  #include <mba/mbs.h>
  char *mbsnchr(char *src, size_t sn, int cn, wchar_t wc);
Description
The mbsnchr function will return a substring pointing to the first occurrence of the character wc in the multibyte string represented by src. No more than sn bytes of src will be examined and no more than cn characters will be converted. Either or both of sn and cn may be -1, indicating that the constraint should be ignored (no limit).
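
Searching a multibyte string for a character means decoding each sequence and comparing code points, not bytes. The sketch below shows that under stated assumptions: the input is UTF-8 and is decoded by hand (no overlong-form checks), the code point is passed as an unsigned long rather than a wchar_t, a search that reaches invalid input returns NULL, and the names utf8_decode and sketch_mbsnchr are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from at most n bytes at s into *cp.
 * Returns its byte length (1..4), 0 at the terminating '\0' or when n
 * is 0, and -1 if the sequence is malformed or cut short. */
static int utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c;
    int len, i;

    if (n == 0 || s[0] == '\0') return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return -1;
    if ((size_t)len > n) return -1;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1; /* also stops at '\0' */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical stand-in for mbsnchr: pointer to the first sequence in
 * src that decodes to wc, or NULL. At most sn bytes are examined and at
 * most cn characters converted (negative cn means no limit). */
const char *sketch_mbsnchr(const char *src, size_t sn, int cn,
                           unsigned long wc)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned long cp;
    int count = 0, len;

    assert(src != NULL);
    while ((cn < 0 || count < cn) && (len = utf8_decode(s, sn, &cp)) > 0) {
        if (cp == wc) return (const char *)s;
        s += len;
        sn -= (size_t)len;
        count++;
    }
    return NULL;
}
```

A byte-oriented strchr(3) cannot do this for characters above U+007F, since their encoding spans several bytes and a single byte value may appear inside unrelated sequences in some encodings.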

The mbswidth function
Synopsis

  #include <mba/mbs.h>
  int mbswidth(const char *src, size_t sn, int wn);
Description
The mbswidth function will return the number of display positions a multibyte sequence will occupy. No more than sn bytes of src will be examined and no more than wn display positions will be considered. Control characters are considered to occupy 1 display position (so there should be no control characters in the src string).
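
The difference between counting characters and counting display positions can be sketched as below. Several things here are assumptions made for illustration only: the input is UTF-8 decoded by hand, the width table in sketch_width is deliberately tiny and coarse (a real program would consult wcwidth(3) or a full table), a negative wn is treated as "no limit", and a character that would push the total past wn ends the scan. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from at most n bytes at s into *cp.
 * Returns its byte length (1..4), 0 at the terminating '\0' or when n
 * is 0, and -1 if the sequence is malformed or cut short. */
static int utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c;
    int len, i;

    if (n == 0 || s[0] == '\0') return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return -1;
    if ((size_t)len > n) return -1;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1; /* also stops at '\0' */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Coarse, illustrative width table (an assumption, not the library's). */
static int sketch_width(unsigned long cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) return 0;   /* combining marks */
    if ((cp >= 0x1100 && cp <= 0x115F) ||         /* Hangul Jamo */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||         /* CJK blocks */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||         /* Hangul syllables */
        (cp >= 0xFF00 && cp <= 0xFF60)) return 2; /* fullwidth forms */
    return 1; /* everything else, control characters included */
}

/* Hypothetical stand-in for mbswidth: display positions occupied by src,
 * examining at most sn bytes and considering at most wn positions. */
int sketch_mbswidth(const char *src, size_t sn, int wn)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned long cp;
    int width = 0, len, w;

    assert(src != NULL);
    while ((len = utf8_decode(s, sn, &cp)) > 0) {
        w = sketch_width(cp);
        if (wn >= 0 && width + w > wn) break;
        width += w;
        s += len;
        sn -= (size_t)len;
    }
    return width;
}
```

With this table, the three-character string "a", U+4E2D, "b" occupies four display positions, while "e" followed by the combining acute accent U+0301 is two characters but occupies one.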

The mbssub function
Synopsis

  #include <mba/mbs.h>
  char *mbssub(char *src, size_t sn, int wn);
Description
The mbssub function will return a substring of the multibyte sequence src that is no larger in size than sn and will occupy no more than wn display positions should it be printed on a multibyte (UTF-8) capable display.
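
Truncating a string to fit a fixed number of terminal columns is the typical use for this kind of function. The sketch below illustrates the idea only: it assumes UTF-8 input decoded by hand, uses a deliberately tiny width table in place of wcwidth(3), treats a negative wn as "no limit", and returns the byte length of the fitting prefix instead of a pointer, so it makes no claim about what mbssub itself returns. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from at most n bytes at s into *cp.
 * Returns its byte length (1..4), 0 at the terminating '\0' or when n
 * is 0, and -1 if the sequence is malformed or cut short. */
static int utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    unsigned long c;
    int len, i;

    if (n == 0 || s[0] == '\0') return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return -1;
    if ((size_t)len > n) return -1;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1; /* also stops at '\0' */
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Coarse, illustrative width table (an assumption, not the library's). */
static int sketch_width(unsigned long cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) return 0;   /* combining marks */
    if ((cp >= 0x1100 && cp <= 0x115F) ||         /* Hangul Jamo */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||         /* CJK blocks */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||         /* Hangul syllables */
        (cp >= 0xFF00 && cp <= 0xFF60)) return 2; /* fullwidth forms */
    return 1; /* everything else, control characters included */
}

/* Byte length of the longest prefix of src made of complete sequences
 * that fits in sn bytes and occupies at most wn display positions. */
size_t sketch_fit_prefix(const char *src, size_t sn, int wn)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned long cp;
    size_t size = 0;
    int width = 0, len, w;

    assert(src != NULL);
    while ((len = utf8_decode(s + size, sn - size, &cp)) > 0) {
        w = sketch_width(cp);
        if (wn >= 0 && width + w > wn) break;
        width += w;
        size += (size_t)len;
    }
    return size;
}
```

A caller could print the fitting prefix with printf("%.*s", (int)n, src), where n is the returned length. Because zero-width combining marks never push the total past wn, they stay attached to the base character that precedes them.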


Copyright 2002 Michael B. Allen <mballen@erols.com>