Ingres CL CM
From Ingres Community Wiki
Compatibility Library Specification – CM
Abstract
This is the specification of the CM facility provided by the compatibility library for multi-byte character operations.
Revision: 1.1, 10-Nov-1997.
Document History
- Revision 1.1, last modified 10-Nov-1997.
- Converted to HTML.
- Revision 1.0, last modified 29-Aug-91.
- Fixed typos.
- Changed c++ in examples to p++ to avoid confusion with the language C++.
Specification
Introduction
The CM (character manipulation) library allows programs to reference characters within strings. The routines allow programmers to deal with characters regardless of whether a character is one or two bytes long. This allows us to deal with Japanese Kanji character sets.
Library
CL
Intended Uses
CM provides functions to manipulate and classify possibly double-byte character objects. In order to provide character manipulation without relying on a one byte character set, the CM module is used for the following common programming tasks:
- Point to the next or previous character within a string (in place of ++c, --c). Also, increment and decrement byte counters associated with strings (in place of ++i, --i).
- Check attributes of a character (e.g., is the next character a digit, a printable character, etc.).
- Copy characters from one string to another (in place of *c=*d). These can also convert the case of a character in the copy.
- Compare two characters, either with or without case significance.
To support local collation sequences, routines to read and write a collation sequence description file were added. This file type is separate from DI because it is needed by FE programs, and it is not SI because that is not legal for the DB server. The routines added were:
| CMopen_col | open collation file for reading |
| CMread_col | read collation file |
| CMclose_col | close collation file |
| CMdump_col | create and write a collation file. |
To support multiple character sets, CM provides the ability to switch character sets on an installation specific basis. CM had been using #ifdef's and compiling in separate attribute tables for different character sets, but this was untenable, requiring new executables to be built and shipped to support any new character set, as opposed to simply providing a definition file for, say, the Greek character set.
The traditional practice of adding or subtracting a constant 'A' - 'a' to translate between upper and lower case does not work in all character sets, so a case translation table is also provided.
Assumptions
We assume that we can support most 8-bit character sets. We can also support JIS type mixed 8 and 16-bit character sets. These use the convention that the value of a leading byte indicates whether it is the first byte of a multi-byte character.
This abstraction does not assume that only one or two byte characters can be supported.
This abstraction does assume that strings are terminated, at the start of a character, with a single byte NULL terminator (referred to as EOS).
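The lead-byte convention and the EOS assumption above can be illustrated with a minimal sketch. The lead-byte ranges below are Shift-JIS style values chosen for illustration only, and `is_lead_byte`, `char_bytes` and `count_chars` are hypothetical names, not part of the CM interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lead-byte rule (Shift-JIS style ranges), for illustration only. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Bytes occupied by the character at p: 2 for a complete double-byte
 * character, otherwise 1. The check on p[1] keeps the scan from stepping
 * over the EOS if a lead byte is the last byte of the string. */
static size_t char_bytes(const char *p)
{
    return (is_lead_byte((unsigned char)p[0]) && p[1] != '\0') ? 2 : 1;
}

/* Count characters (not bytes) in an EOS-terminated string. */
size_t count_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        s += char_bytes(s);   /* no state is kept between characters */
        n++;
    }
    return n;
}
```

Because the size of each character is decided from its first byte alone, the scan needs no shift state, which is why shift-in/shift-out encodings fall outside this model.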
This abstraction cannot be used for EBCDIC "shift-in/shift-out" 16-bit character sets, which require state information to be kept during string manipulation.
Only the single-byte character sets are runtime switchable. It is not possible to switch between multiple-byte and single-byte character sets via the runtime switch. This means that the CM attribute routines and CMnext will work as efficiently as they do now for the single-byte case.
Only a single character set is intended to be in use in a given database, and that set must be chosen once, and never changed. In fact, the character set may be left constant for an entire installation.
The main reason for specifying a character set in CMset_attr() is to allow frontends to use the appropriate character set when connected across a net.
This is a very important assumption. All sorts of havoc can, and will, occur if the character set attributes are changed once a database really contains any data. Hence, it is important to note that the installation tool is the only caller of CMwrite_attr().
There is an implicit assumption in mainline code that an upper (lower) case character automatically has a lower (upper) case counterpart.
CMcmp*** routines might also be theoretically affected by this, except that if mainline code is really attempting to do character string sorting which is to reflect local character set conventions, it should be doing so through interfaces making use of the ADT collating sequence definitions. Therefore, CMcmp*** is left simple-minded.
Definitions and Concepts
| Character Value | In single-byte, used to mean an integer value in the range 0 - 255, which is assumed to represent a character of some sort, the attributes of that character to be determined by installation specific definition. |
| Character attributes | The CM interface includes several routines to classify characters by type: CMalpha, CMnmstart, CMnmchar, CMprint, CMdigit, CMlower, CMupper, CMwhite, CMspace, CMhex, CMoper. |
| Whether or not a given character value falls into one of these classifications is dependent on the character set being used. | |
| There are also routines CMtolower and CMtoupper to translate character case. What character value is the alternate case counterpart of any other is also dependent on the character set being used. | |
| CMATTR structure | The CMATTR structure contains the attribute array, and the case translation table defining a character set. The CM calls use pointers to an attribute array and case translation table which may be initialized from such a structure. |
| This structure is surfaced to allow it to be filled in by the installation tool, which is mainline code calling CMwrite_attr(). The human readable text description of a character set is read by this program and used to create a CMATTR, which is then written through CMwrite_attr(). | |
| Character attribute file | The CM routines allow character set information to be read from, or written to, a file. The file name and location is hidden. The read operation includes initializing the CM module so that all the routines reflect the information read in. |
| The write operation is only used by a tool which may be used as part of the installation procedure, or as part of the build procedure for preparing a release. Which of these scenarios is followed is an issue of whether character sets are something defined by VAR's and customers, who may add to the current set, or whether they are more tightly controlled by some governing body. | |
| Default character set | If no attribute file exists, or if the call to read one is never made, the CM attribute routines will reflect a default, built-in character set (i.e., the one currently compiled into the CM module). |
| Double Byte Character | Any character that requires two bytes to represent. All Kanji characters are double byte, but other characters are double byte as well. |
| Byte Count | This is the number of bytes (8-bit) since the start of a string. Note that this is not necessarily the number of characters (as some of the characters may be double byte characters). |
| Next Character | The next character in a string is the character represented by the byte (or bytes) at which the string pointer is currently positioned. |
| Alphabetic Character | Any of the common alphabetic characters within a character set, which includes [a-z] [A-Z], and any additional characters (such as a with umlaut, or e with accent) commonly used in the alphabet for the character set. For the Japanese character set, this includes katakana characters. This does not include the underscore character. |
| Leading Name Character | Any valid character which can begin an INGRES name (table, column, user, form, etc.) It can be any alphabetic character or a kanji character. |
| Trailing Name Character | Any valid character which can be contained in an INGRES name (table, column, user, form, etc.) This is defined as a superset of the SQL standard. It can be any alphabetic or kanji character or a "$", "@", "#", an underscore character, or a digit. |
| Printing Character | A printing character is any character that can be displayed directly on the terminal as a single character. Control characters are not included. |
| Control Characters | Any non-printing character that requires special processing to display (e.g., putting a caret in front of it). |
| Whitespace Characters | Any character used for space control only. These are space, double byte space, tab, line feed, carriage return, and form feed. |
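As a rough illustration of the leading and trailing name character definitions, the sketch below validates a name in a plain single-byte ASCII setting. It follows the definitions as worded in the table above (so an underscore is accepted only after the first character) and ignores kanji. `name_start`, `name_char` and `is_valid_name` are hypothetical helpers standing in for tests such as CMnmstart and CMnmchar; they are not the library's implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for a leading-name test: alphabetic characters only (ASCII here). */
static int name_start(unsigned char c)
{
    return isalpha(c) != 0;
}

/* Stand-in for a trailing-name test: alphabetic, digit, '_', '$', '@' or '#'. */
static int name_char(unsigned char c)
{
    return isalnum(c) || c == '_' || c == '$' || c == '@' || c == '#';
}

/* Return 1 if s is a non-empty name made only of valid name characters. */
int is_valid_name(const char *s)
{
    assert(s != NULL);
    if (!name_start((unsigned char)*s))
        return 0;               /* also rejects the empty string */
    for (s++; *s != '\0'; s++)
        if (!name_char((unsigned char)*s))
            return 0;
    return 1;
}
```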
The general abstraction is that all character manipulation within C is done using these routines, rather than the standard C conventions. Instead of dealing with characters as independent objects, the programmer always uses string pointers pointing to the "next character" in a string, which may be one or two bytes. The CM macros are used to abstract out the one-byte/two-byte knowledge.
These routines are all defined as macros, SO YOU MUST INCLUDE <cm.h> IN ORDER TO USE ANY OF THE CM ROUTINES.
The macros serve two purposes. First, they allow an (almost) portable set of routines to be written for these functions. More importantly, they allow us to provide two definitions for the macros, depending on whether or not any two byte characters are in the character set. In the case of character sets which do not contain any two byte characters, the CM routines convert into the familiar C coding conventions (++c, ++i, etc.). Only in the case of character sets which contain two byte characters are the penalties for extra checks realized.
The base information used for the CM routines is contained in a character set dependent table local to the CM routines. Clients of the CM library never need refer to this table directly, but implementors need to change it for different character set implementations. The compiler constant DOUBLEBYTE (for double-byte character set) is used to indicate that double byte characters are used in the character set. Normally, CM clients need not be aware of this compiler flag, but in unusual circumstances, it may be used.
Only when manipulating characters within strings are these routines needed. If you are processing strings as sequences of bytes (as you do with inline string copies), you can use the more familiar p++ conventions of C.
This approach obviously greatly diminishes the role of the char (as opposed to char *) datatype. Only in very limited circumstances are char variables allowed. The current exceptions are for use in scanners (in order to allow switch statements) and internally encoded strings (which are known to provide certain special characters).
Moving backwards within strings is highly discouraged under this abstraction. While single byte character manipulation will translate the CMprev call to a --c (low cost), the double byte implementation needs to reprocess a string from its beginning to move back a single character. Use of byte counters is also discouraged. Instead, if you need to know how many bytes are between the current pointer and the start of the string, you should use pointer arithmetic. For compatibility with current code, the CMbytedec and CMbyteinc routines are provided for counting bytes. However, there is an important order dependency when using them in conjunction with CMnext and CMprev. In particular, you must make sure that you call CMnext AFTER calling CMbyteinc, and call CMprev BEFORE calling CMbytedec, in order to make sure that you are counting bytes in the correct character.
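The order dependency when moving forward can be sketched as follows. The `X_*` macros are hypothetical stand-ins written for this example (they are not the definitions in <cm.h>), and the lead-byte ranges are an assumed Shift-JIS style rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lead-byte rule, for illustration only. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Hypothetical stand-ins for CMbytecnt, CMbyteinc and CMnext. */
#define X_BYTECNT(p) \
    ((is_lead_byte((unsigned char)(p)[0]) && (p)[1] != '\0') ? 2 : 1)
#define X_BYTEINC(cnt, p) ((cnt) += X_BYTECNT(p))
#define X_NEXT(p)         ((p) += X_BYTECNT(p))

/* Byte offset of the n-th character (0-based), or of EOS if the string
 * holds fewer than n characters. */
int byte_offset_of(const char *s, int n)
{
    const char *p = s;
    int count = 0;

    assert(s != NULL);
    while (n-- > 0 && *p != '\0') {
        X_BYTEINC(count, p);  /* count first, while p is still on this character */
        X_NEXT(p);            /* then advance the pointer */
    }
    /* Pointer arithmetic yields the same value without a counter. */
    assert(count == (int)(p - s));
    return count;
}
```

Swapping the two calls inside the loop would size the counter from the character after the one being skipped, which is the mistake the ordering rule guards against. The final assertion shows why pointer arithmetic is the simpler choice when only the offset is needed.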
CM Routines for Checking Character Attributes:
| CMalpha | test for alphabetic character |
| CMdbl1st | test for 1st byte of double byte |
| CMnmstart | test for leading character in name |
| CMnmchar | test for trailing character in name |
| CMprint | test for printing character |
| CMdigit | test for digit |
| CMcntrl | test for control character |
| CMlower | test for lower case character |
| CMupper | test for upper case character |
| CMwhite | test for white space |
| CMspace | test for space character |
| CMhex | test for hexadecimal digit |
| CMoper | test for operator character |
CM Routines for String Movement:
| CMnext | increment character pointer |
| CMprev | decrement character pointer |
| CMbyteinc | increment byte counter |
| CMbytedec | decrement byte counter |
| CMbytecnt | count the bytes in the next character |
CM Routines for Copying Characters:
| CMcpychar | copy character to string |
| CMcpyinc | copy character to string and increment |
| CMtolower | copy lower case character to string |
| CMtoupper | copy upper case character to string |
| CMcopy | copy a character string of specified length |
CM Routines for Comparing Characters:
| CMcmpnocase | compare two characters, ignoring case |
| CMcmpcase | compare two characters for exact match |
Header File <cm.h>
The header file <cm.h> must be included before using any of the interfaces provided. Many are macros. It also defines the following.
CM_MAXATTRNAME
Maximum length for name identifying character set.
#define CM_MAXATTRNAME 8
CMATTR - CM attribute structure
Structure which defines the attribute array and case translation table for a character set. Both items are arrays indexed by character value. The attr array contains the bits used for the character classification routines. The xcase array contains the alternate case character value if the character is upper or lower case alpha. It is meaningless otherwise.
typedef struct
{
    u_i2 attr[256];   /* attribute bits */
    char xcase[256];  /* case translation */
} CMATTR;
Attribute Bits
Historical note:
The list below represents the earlier compiled-in set, with only a few modifications. The old _IU and _IL (0100 and 0200), which are "international upper" and "international lower", are superseded by this interface. The NCALPHA bit is added, indicating an alpha character that does not have an alternate case representation, and hence is not considered to be either upper or lower case. 0400 was the "katakana" bit, which already had this interpretation where used.
# define CM_A_UPPER 1 /* Upper case ALPHA */
# define CM_A_LOWER 2 /* Lower case ALPHA */
# define CM_A_NCALPHA 4 /* non-cased alpha */
# define CM_A_DIGIT 8 /* DIGIT */
# define CM_A_SPACE 0x10 /* SPACE */
# define CM_A_PRINT 0x20 /* PRINTABLE */
# define CM_A_CONTROL 0x40 /* CONTROL */
# define CM_A_DBL1 0x80 /* 1st byte of double byte character */
# define CM_A_DBL2 0x100 /* 2nd byte of double byte character */
# define CM_A_NMSTART 0x200 /* Leading character of a name */
# define CM_A_NMCHAR 0x400 /* Trailing character of a name */
# define CM_A_HEX 0x800 /* Hexadecimal Digit */
# define CM_A_OPER 0x1000 /* Operator Character */
# define CM_A_ALPHA (CM_A_UPPER|CM_A_LOWER|CM_A_NCALPHA)
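To show how an attribute array and a case translation table might be used together, here is a minimal sketch that fills a CMATTR-shaped structure for plain ASCII letters and digits and then classifies and converts characters by table lookup. The `u_i2` typedef (assumed to be a 16-bit unsigned type) and the helpers `fill_ascii`, `is_digit` and `to_upper` are assumptions made for the example. The real tables are produced by the installation tool from a character set description.

```c
#include <assert.h>
#include <string.h>

typedef unsigned short u_i2;   /* assumed definition of the CL type u_i2 */

#define CM_A_UPPER 1
#define CM_A_LOWER 2
#define CM_A_DIGIT 8

typedef struct
{
    u_i2 attr[256];   /* attribute bits */
    char xcase[256];  /* case translation */
} CMATTR;

/* Fill a table describing ASCII letters and digits. The letter pairs are
 * listed explicitly so that no 'A' - 'a' arithmetic is assumed. */
void fill_ascii(CMATTR *t)
{
    const char *lo = "abcdefghijklmnopqrstuvwxyz";
    const char *up = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *dg = "0123456789";
    int i;

    assert(t != NULL);
    memset(t, 0, sizeof *t);
    for (i = 0; dg[i] != '\0'; i++)
        t->attr[(unsigned char)dg[i]] = CM_A_DIGIT;
    for (i = 0; lo[i] != '\0'; i++) {
        unsigned char l = (unsigned char)lo[i];
        unsigned char u = (unsigned char)up[i];
        t->attr[l] = CM_A_LOWER;  t->xcase[l] = (char)u;
        t->attr[u] = CM_A_UPPER;  t->xcase[u] = (char)l;
    }
}

/* Classification is a mask test on the attribute word. */
int is_digit(const CMATTR *t, unsigned char c)
{
    return (t->attr[c] & CM_A_DIGIT) != 0;
}

/* Case conversion consults xcase only for characters marked lower case. */
unsigned char to_upper(const CMATTR *t, unsigned char c)
{
    return (t->attr[c] & CM_A_LOWER) ? (unsigned char)t->xcase[c] : c;
}
```

Supporting another single-byte character set then means supplying different table contents rather than recompiling the lookup code.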
COL_BLOCK - size of an in-memory collation description
This constant is the size of the buffer that needs to be handed to CMread_col.
Executable Interface
The following function-like interfaces are provided.
CMalpha - test for alphabetic character
Test next character in string for alphabetic.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be checked |
Outputs:
| None |
Returns:
| FALSE | if the character is not alphabetic. |
Definition:
bool CMalpha(str) char *str;
CMbytecnt - count bytes in the next character
Count the number of bytes in the next character, returning either 1 or 2. This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character within a string. |
Outputs:
| None |
Returns:
| The number of bytes in the next character, either 1 or 2. |
Definition:
i4 CMbytecnt(str) char *str;
CMbytedec - decrement a byte counter
Decrement a byte counter one or two bytes within a string, depending on whether the next character in the string pointed to by 'str' is one or two bytes long. This mimics the --i expression of C when used in manipulating string pointers, but takes into account 2 byte characters.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>. CMbytedec has a side effect, and has to be implemented as a macro.
Inputs:
| count | byte counter depending on 'str' |
| str | pointer to character within a string. |
Outputs:
| count | This will change the value of the parameter 'count' either to 'count-1' or 'count-2'. |
Returns:
| The return value is set to the value of 'count', after decrement (as done by --i). |
Side Effects:
| The count argument is decremented by 1 or 2. Note that this is NOT a pointer argument. |
Definition:
i4 CMbytedec(count, str) i4 count; char *str;
CMbyteinc - increment a byte counter
Increment a byte counter one or two bytes within a string, depending on whether the next character in the string pointed to by 'str' is one or two bytes long. This mimics the ++i expression of C when used in manipulating string pointers, but takes into account 2 byte characters.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>. CMbyteinc has a side effect, and has to be implemented as a macro.
Inputs:
| count | byte counter depending on 'str' |
| str | pointer to character within a string. |
Outputs:
| count | This will change the value of the parameter 'count' either to 'count+1' or 'count+2'. |
Returns:
| The return value is set to the value of 'count', after increment (as done by ++i). |
Side Effects:
| The count argument is incremented by 1 or 2. Note that this is NOT a pointer argument. |
Definition:
i4 CMbyteinc(count, str) i4 count; char *str;
CMclose_col - close collation file
Close an open collation file.
Inputs:
| None |
Outputs:
| syserr | System specific error |
Returns:
| OK | if operation succeeded, otherwise system specific error status. |
Side Effects:
| Releases semaphore which may have been acquired during CMopen_col. |
Definition:
STATUS CMclose_col(syserr) CL_ERR_DESC *syserr;
CMcmpcase - compare characters, case sensitive
Compare the next character in each of two strings, either single or double byte. (Case is significant.)
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str1 | pointer to first character to compare. |
| str2 | pointer to second character to be compared. |
Outputs:
| None |
Returns:
| < 0 | if char1 < char2 |
| 0 | if char1 == char2 |
| > 0 | if char1 > char2 |
| (char1 and char2 are the next characters, possibly double byte, in the strings str1 and str2.) |
Definition:
i4 CMcmpcase(str1, str2) char *str1; char *str2;
CMcmpnocase - compare characters, ignoring case
Compare the next character in each of two strings, either single or double byte characters. (Case is not significant)
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str1 | pointer to first character to compare. |
| str2 | pointer to second character to be compared. |
Outputs:
| None. |
Returns:
| < 0 | if char1 < char2 |
| 0 | if char1 == char2 |
| > 0 | if char1 > char2 |
Definition:
i4 CMcmpnocase(str1, str2) char *str1; char *str2;
CMcntrl - test for non-printing character
Test next character in string for non-printing character
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be checked |
Outputs:
| None |
Returns:
| FALSE | if the character is not a control character. |
Definition:
bool CMcntrl(str) char *str;
CMcopy - copy characters from source to destination
CMcopy will copy (not necessarily null-terminated) character data from a source buffer into a destination buffer of a given length. Like STlcopy, CMcopy returns the number of bytes it copied (which may differ from the number of bytes requested to be copied if the last character was double byte and could not be completely copied).
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>. Also note that because some of the macro expansions may use copy_len twice, you should be aware of possible side effects (i.e., avoid passing expressions that use the ++ and -- operators).
Inputs:
| source | Pointer to beginning of source buffer. |
| copy_len | Number of bytes to copy. |
| dest | Pointer to beginning of destination buffer. |
Outputs:
| None |
Returns:
| Number of bytes actually copied (this may be less than copy_len in the case of double-byte characters, but will never be more). |
Definition:
u_i4 CMcopy( source, copy_len, dest ) char *source; u_i4 copy_len; char *dest;
Example:
u_i4 length; length = CMcopy(source, copy_len, dest);
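The behaviour described for the return value (never copying half of a double-byte character) can be sketched as below. This is an illustrative stand-in under an assumed Shift-JIS style lead-byte rule, not the CMcopy macro itself. `copy_whole_chars` and `is_lead_byte` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lead-byte rule, for illustration only. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Copy at most copy_len bytes of whole characters from src to dst and
 * return the number of bytes copied. The data need not be EOS-terminated. */
size_t copy_whole_chars(const char *src, size_t copy_len, char *dst)
{
    size_t done = 0;

    assert(src != NULL && dst != NULL);
    while (done < copy_len) {
        size_t n = is_lead_byte((unsigned char)src[done]) ? 2 : 1;

        if (done + n > copy_len)
            break;              /* would split a double-byte character */
        memcpy(dst + done, src + done, n);
        done += n;
    }
    return done;
}
```

A caller that needs a terminated result would write the EOS at `dst[returned_length]` itself, since the copy does not add one.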
CMcpychar - copy one character to another
Copy one character (either one or two bytes) from string 'src' to the current position in string 'dst'. This mimics the (*d = *s) expression of C.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| src | pointer to character, which is positioned at the point in string from which character is to be copied. |
| dst | pointer in string into which character is to be copied. |
Outputs:
| None |
Returns:
| None |
Definition:
VOID CMcpychar(src, dst) char *src; char *dst;
CMcpyinc - copy one character to another, incrementing pointer
Copy one character (either one or two bytes) from string 'src' to the next position in string 'dst', incrementing the pointers for both strings by one or two bytes. This mimics the (*dest++ = *src++) expression of C.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| src | pointer to character, which is positioned at the point in string from which character is to be copied. |
| dst | pointer in string into which character is to be copied. |
Outputs:
| src | incremented by 1 or 2 bytes |
| dst | incremented by 1 or 2 bytes. |
Returns:
| None |
Side Effects:
| The passed strings, 'src' and 'dst', are both changed upon return. |
Definition:
VOID CMcpyinc(src, dst) char *src; char *dst;
CMdbl1st - test for first byte of a double-byte character
Test next character in string for leading byte of a two byte character
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be checked |
Outputs:
| None |
Returns:
| FALSE | if the byte at the string is not the first of a two-byte character. |
Definition:
bool CMdbl1st(str) char *str;
CMdigit - test for digit character
Test next character in string for digit
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be checked |
Outputs:
| None |
Returns:
| FALSE | if the character is not a digit. |
Definition:
bool CMdigit(str) char *str;
CMdump_col - open and write a collation file
Create and write a collation file.
Inputs:
| colname | collation name |
| tablep | collation table pointer |
| tablen | collation table length |
Outputs:
| syserr | System specific error |
Returns:
| OK | if operation succeeded, otherwise system specific error status. |
Definition:
STATUS CMdump_col(colname, tablep, tablen, syserr) char *colname; PTR tablep; i4 tablen; CL_ERR_DESC *syserr;
CMhex - test for hexadecimal digit character.
Checks that the current character of the input string is a hexadecimal digit. That is, one of [0-9A-Fa-f].
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | Pointer to character within a string. |
Outputs:
| None |
Returns:
| FALSE | if not a hexadecimal digit. |
Definition:
bool CMhex(str) char *str;
CMlower - test for lower case
Test next character in string for lower case.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be checked |
Outputs:
| None |
Returns:
| FALSE | if not a lower case character. |
Definition:
bool CMlower(str) char *str;
CMnext - increment character string pointer
Move string pointer forward one character within a string. This mimics the ++c expression in C, but takes into account whether the next character is one or two bytes long.
Note that if you use this routine in conjunction with either CMbyteinc or CMbytedec, you should call the CMbyte* routine first to make sure that you are pointing to the same character.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be incremented |
Outputs:
| str | This will change the value of the parameter 'str' to either 'str+1' or 'str+2', depending on the size of the character. |
Returns:
| The incremented value sent to the routine (as done by ++c). |
Definition:
char * CMnext(str) char *str;
CMnmchar - check trailing character as an Ingres name
Within a string, check the next character to see if it is a valid character within an INGRES name.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| nm | pointer to character within a name |
Outputs:
| None |
Returns:
| FALSE | if the character is not valid in a name. |
Definition:
bool CMnmchar(nm) char *nm;
CMnmstart - check leading character of an Ingres name
Within a string, check the next character to see if it is a valid character as the first position of an INGRES name.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| nm | pointer to character within a name |
Outputs:
| None |
Returns:
| FALSE | if the character is not valid as the start of a name. |
Definition:
bool CMnmstart(nm) char *nm;
CMopen_col - open collation file for reading
Open collation file for reading.
Inputs:
| colname | collation name |
Outputs:
| syserr | System specific error |
Returns:
| OK | if operation succeeded, otherwise system specific error status. |
Side Effects:
| May acquire a semaphore to protect itself from reentry and hold that semaphore until the file is closed. CMclose_col must be called to release this semaphore. |
Definition:
STATUS CMopen_col(colname, syserr) char *colname; CL_ERR_DESC *syserr;
CMoper - test for operator character
Test next character in a string for membership in the set of operators. This set of operators constitutes the set of all INGRES operators that are used throughout the query languages and their various extensions. These operators also include all the operators of host languages in which a query language may be embedded. The set of operators is made up of all printable characters, less the set of alphanumeric and space characters. This set of operators may change across different character sets (e.g., the EBCDIC value for ^), and languages (e.g., the Spanish question prefix '¿', the upside down '?').
For example, the set of ASCII (American) operators is: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~
Inputs:
| str | Pointer to character to be checked. |
Outputs:
| None |
Returns:
| FALSE | if the character is not an operator. |
Definition:
bool CMoper(str) char *str;
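The ASCII operator set listed above can be reproduced with the standard <ctype.h> tests, as in the sketch below. Note that the printed list omits the underscore, so the sketch excludes it explicitly. That exclusion is an inference from the list, not a documented rule. `is_ascii_oper` and `count_ascii_opers` are hypothetical helpers, and the sketch assumes the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>

/* Return 1 if c (a 7-bit ASCII code) belongs to the operator set:
 * printable, not alphanumeric, not a space, and not the underscore. */
int is_ascii_oper(int c)
{
    assert(c >= 0 && c <= 127);
    return isprint(c) && !isalnum(c) && !isspace(c) && c != '_';
}

/* Count the members of the set across the 7-bit ASCII range. */
int count_ascii_opers(void)
{
    int c, n = 0;

    for (c = 0; c <= 127; c++)
        if (is_ascii_oper(c))
            n++;
    return n;   /* 31, matching the number of characters in the list above */
}
```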
CMprev - decrement a character pointer
Move backwards one character within a string. This mimics the --c expression in C, but takes into account whether the previous character is one or two bytes long.
Note that if you use this routine in conjunction with either CMbyteinc or CMbytedec, you should call the CMprev routine first to make sure that you are pointing to the same character. THE USE OF THIS ROUTINE IS HIGHLY DISCOURAGED UNLESS ABSOLUTELY NECESSARY. PLEASE TRY TO CODE SUCH THAT MOVING BACKWARDS WITHIN A STRING IS NOT NEEDED.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be decremented |
| startpos | start pointer of the string being processed (or a location in the string which you know contains the start of a character), as double byte processing must reprocess the string from the start to find the previous character. |
Outputs:
| str | This will change the value of the parameter 'str' to either 'str-1' or 'str-2', depending on the size of the character. |
Returns:
| The value of 'str' after decrement (as done by --c). |
Definition:
char * CMprev(str, startpos) char *str; char *startpos;
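The reason a start position is required can be seen in a small sketch of the double-byte case: the only reliable way to find the previous character is to walk forward from a known character boundary. The function below is a hypothetical stand-in (`prev_char`, with an assumed Shift-JIS style lead-byte rule), not the CMprev macro.

```c
#include <assert.h>

/* Hypothetical lead-byte rule, for illustration only. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Return a pointer to the character immediately before str, found by
 * rescanning forward from startpos, which must be a character boundary
 * at or before str. Returns startpos when str == startpos. */
const char *prev_char(const char *str, const char *startpos)
{
    const char *p = startpos;
    const char *prev = startpos;

    assert(startpos <= str);
    while (p < str) {
        prev = p;                                     /* remember this boundary */
        p += is_lead_byte((unsigned char)*p) ? 2 : 1; /* step one character */
    }
    return prev;
}
```

Looking only at the byte just before `str` is not enough in encodings where the second byte of a character can have the same value as a lead byte. The cost of the rescan grows with the distance from `startpos`, which is why backward movement is discouraged.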
CMprint - test for printing character
Test next character in string for printing character
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer to character to be checked |
Outputs:
| None |
Returns:
| FALSE | if the character is not printable. |
Definition:
bool CMprint(str) char *str;
CMread_col - read from collation file
Read a COL_BLOCK size record from open collation file.
Inputs:
| bufp | pointer to buffer |
Outputs:
| syserr | System specific error |
Returns:
| OK | if operation succeeded, otherwise system specific error status. |
Side Effects:
| Moves open file to next record. |
Definition:
STATUS CMread_col(bufp, syserr) char *bufp; CL_ERR_DESC *syserr;
CMset_attr - set character attributes
Sets character attributes and case translation to correspond to a given character set. If the named character set has never been installed, ensure that the default character set is in effect.
This routine must be called at startup by all executables, and at a convenient entry into INGRES code for user 3GL applications. Having been called once, it need never be called again, although it won't hurt anything if that happens.
Inputs:
| name | name of character set to be chosen, at most CM_MAXATTRNAME in length. |
Outputs:
| None |
Returns:
| OK | if operation succeeded, otherwise system specific error status. |
Side Effects:
| May perform file i/o on a "hidden" file name. |
Definition:
STATUS CMset_attr(name,err) char *name; CL_ERR_DESC *err;
CMspace - check for space or double byte space
Within a string, check to see if the next character is a space or double byte space.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| str | pointer positioned at the point in the string to check a character for space. (Also see CMwhite()). |
Outputs:
| None |
Returns:
| FALSE | if the character is not a space. |
Definition:
bool CMspace(str) char *str;
CMtolower - convert to lower case
Copy one character (either one or two bytes) from string 'src' to string 'dst', with possible conversion to lower case.
Note that the copy can be done in place ('src' = 'dst').
This routine must be "safe", that is, if the source character is not upper case, it is simply copied to the destination without conversion.
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| src | pointer to character, which is positioned at the point in string from which character is to be copied. |
Outputs:
| dst | pointer in string into which lower case version of character is to be placed. |
Returns:
| None |
Definition:
VOID CMtolower(src, dst) char *src; char *dst;
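A "safe", in-place conversion over a whole string might look like the sketch below: single-byte upper case characters are converted, everything else (including both bytes of a double-byte character) is left alone. It uses the standard `tolower` in the default "C" locale as a stand-in for a case translation table and an assumed Shift-JIS style lead-byte rule. `lower_in_place` is a hypothetical helper, not the CMtolower macro.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical lead-byte rule, for illustration only. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Lower-case an EOS-terminated string in place, one character at a time. */
void lower_in_place(char *s)
{
    assert(s != NULL);
    while (*s != '\0') {
        if (is_lead_byte((unsigned char)s[0]) && s[1] != '\0') {
            s += 2;   /* double-byte character: copy unchanged */
        } else {
            *s = (char)tolower((unsigned char)*s);
            s++;
        }
    }
}
```

The second byte of a double-byte character may have the same value as an ASCII letter. Stepping by whole characters keeps such bytes from being altered, which a byte-by-byte `tolower` loop would not guarantee.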
CMtoupper - convert to upper case
Copy one character (either one or two bytes) from string 'src' to string 'dst', with possible conversion to upper case.
This routine must be "safe", that is, if the source character is not lower case, it is simply copied to the destination without conversion.
Note that the copy can be done in place ('src' = 'dst').
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
| src | pointer to character, which is positioned at the point in string from which character is to be copied. |
Outputs:
| dst | pointer in string into which upper case version of character is to be placed. |
Returns:
| None |
Definition:
VOID CMtoupper(src, dst) char *src; char *dst;
CMunctrl - return string representation of a control character
Returns a pointer to a string representing the character that was passed to it. This routine is most useful when one wants to output a representation of a control character to a user's terminal. If the character is a printable character, the return string only contains the character.
In order to allow this routine to be used in a server environment, it may not be implemented in such a fashion as to return a static buffer overwritten by the next call.
Also, please note that the ASCII and EBCDIC versions will be different. An implementation is free to pick an appropriate representation of control characters for the system. For example, on ASCII systems control characters in the range 0-037 and 0177 (octal) become their upper-case equivalents preceded by a ^ character. On VMS systems the control characters above 0177 (octal) become the strings VMS displays them as. Those that have no VMS display are returned as <XN> where N is their hex value.
Inputs:
str pointer to the next character in a string.
Outputs:
None
Returns:
printable string representation of character.
Definition:
char* CMunctrl(str) char *str;
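The ASCII convention described above can be sketched as follows. This is only an illustration, not the CL implementation: sk_unctrl is a hypothetical name, it handles single-byte ASCII only, and it writes into a caller-supplied buffer (unlike CMunctrl, which takes only str) as one possible way to avoid a static buffer that the next call would overwrite. Rendering 0177 as ^? follows the usual unctrl convention and is an assumption here.

```c
/* Hypothetical sketch, not the CL routine: build a printable form of the
 * single-byte ASCII character at *str in buf (at least 3 bytes long). */
char *sk_unctrl(const char *str, char *buf)
{
    unsigned char ch = (unsigned char)*str;

    if (ch < 040 || ch == 0177) {
        /* Flipping bit 0100 maps 001 to 'A', 011 to 'I', and 0177 to '?'. */
        buf[0] = '^';
        buf[1] = (char)(ch ^ 0100);
        buf[2] = '\0';
    } else {
        /* Anything else is returned as the character itself. */
        buf[0] = (char)ch;
        buf[1] = '\0';
    }
    return buf;
}
```

Because the result lives in the caller's buffer, two threads or sessions can call the sketch at the same time without overwriting each other's output.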
CMupper - test for upper case
Test next character in string for upper case (Is *str in [A-Z]?).
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Inputs:
str pointer to character to be checked
Outputs:
None
Returns:
FALSE if the character is not upper case.
Definition:
bool CMupper(str) char *str;
CMwhite - check white space
Within a string, check to see if the next character is white space (for example, a space, tab, LF, NL, CR, FF or double byte space.)
This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.
Also see CMspace().
Inputs:
str pointer positioned at the point in the string to check a character for white space.
Outputs:
None
Returns:
FALSE if the character is not whitespace.
Definition:
bool CMwhite(str) char *str;
CMwrite_attr - write character attribute file
Use a passed-in CMATTR to define the character attributes and case translation for a given character set. Once this call is successful, CMset_attr calls in all other processes from that point on will reflect the given CMATTR (i.e., this routine produces a persistent copy, such as a file, accessed by CMset_attr()).
NOTE: This routine is ONLY to be called by a program used during installation or release procedures.
It is important that once a DB is populated, this does not get used except perhaps in recovery situations, and then only to restore the same character attributes originally defined.
Inputs:
name name of character set, at most CM_MAXATTRNAME in length.
attr character attributes and case translation.
Outputs:
None
Returns:
OK if operation succeeded, otherwise system specific error status.
Side Effects: Performs file I/O on a "hidden" file name.
Definition:
STATUS CMwrite_attr(name,attr,err) char *name; CMATTR *attr; CL_ERR_DESC *err;
Examples:
Indexing Through Strings
The common C convention of p++ does not work with mixed 1 and 2 byte characters. In particular, the following C code does NOT work with Kanji.
/* ERROR ERROR ERROR ERROR */
char *string,*x;
i4 i;
....
for (i=0,x=string; *x!='\0'; x++,i++)
{
/* process next character */
}
/* ERROR ERROR ERROR ERROR */
Instead, use the new functions in the CL to increment and decrement pointers
within strings, CMnext and CMprev, and pointer arithmetic, so the code above becomes:
char *string,*x;
i4 i;
....
for (x=string; *x!=EOS; CMnext(x))
{
/* process next character */
}
i = x - string; /* number of bytes scanned */
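To see why the character count and the byte count differ, the loop can be tried without the CL. The sketch below uses a hypothetical stand-in, sk_bytecnt, in place of the CM attribute lookup, and assumes purely for illustration that any byte with the high bit set starts a two-byte character. The real CMnext consults the installed character set instead.

```c
#include <stddef.h>

/* Assumption for illustration only: a byte with the high bit set is the
 * first byte of a two-byte character. A lead byte directly before the
 * terminator is treated as one byte so the scan never passes the end. */
static int sk_bytecnt(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    return ((u[0] & 0x80) != 0 && u[1] != '\0') ? 2 : 1;
}

/* Step through the string one character at a time, as CMnext would,
 * and report how many characters and how many bytes were seen. */
void sk_measure(const char *string, size_t *nchars, size_t *nbytes)
{
    const char *x;
    size_t chars = 0;

    for (x = string; *x != '\0'; x += sk_bytecnt(x))
        chars++;

    *nchars = chars;
    *nbytes = (size_t)(x - string);   /* pointer arithmetic gives bytes */
}
```

For a string holding 'a', one two-byte character, and 'b', the sketch reports three characters in four bytes.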
If you need to use the byte counter routines, it is EXTREMELY important to put the
CMbyteinc and CMnext calls in that order, as incrementing the counter AFTER the call
will obviously count the bytes in the wrong character. When using CMprev and
CMbytedec, you should specify the calls with CMprev first, for the same reason.
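The ordering rule can be illustrated with stand-ins for the two macros. In this sketch, sk_bytecnt and sk_prefix_bytes are hypothetical names, and the rule that a high-bit byte starts a two-byte character is an assumption made only so the example is self-contained. The function returns the number of bytes occupied by the first nchars characters of a string.

```c
#include <stddef.h>

/* Assumption for illustration only: a high-bit byte starts a two-byte
 * character (unless it is directly followed by the terminator). */
static int sk_bytecnt(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    return ((u[0] & 0x80) != 0 && u[1] != '\0') ? 2 : 1;
}

/* Count the bytes used by the first nchars characters of string. */
size_t sk_prefix_bytes(const char *string, size_t nchars)
{
    const char *x = string;
    size_t bytes = 0;

    while (nchars > 0 && *x != '\0') {
        bytes += sk_bytecnt(x);   /* counter first: x still points at this character */
        x += sk_bytecnt(x);       /* pointer second: now x is on the next character */
        nchars--;
    }
    return bytes;
}
```

Swapping the two statements in the loop body would add the width of the following character instead, which is the mistake the paragraph above warns about.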
Note that you should try hard to avoid moving backwards in a string, through the
CMprev* routines, as Kanji must reprocess a string from the beginning to move
back one character.
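The cost of moving backwards comes from the fact that the byte just before the current position may be either a one-byte character or the second half of a two-byte character, and the two cases cannot always be told apart locally. A minimal sketch of the rescan, using the hypothetical names sk_bytecnt and sk_prev and the illustrative assumption that a high-bit byte starts a two-byte character:

```c
/* Assumption for illustration only: a high-bit byte starts a two-byte
 * character (unless it is directly followed by the terminator). */
static int sk_bytecnt(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    return ((u[0] & 0x80) != 0 && u[1] != '\0') ? 2 : 1;
}

/* Return a pointer to the character that ends at pos, found by walking
 * forward from the start of the string. Returns start if pos == start. */
const char *sk_prev(const char *start, const char *pos)
{
    const char *p = start;
    const char *prev = start;

    while (p < pos) {
        prev = p;               /* remember where this character began */
        p += sk_bytecnt(p);     /* step forward one whole character */
    }
    return prev;
}
```

The work grows with the distance from the start of the string, so a loop that repeatedly steps backwards over a long string does quadratic work.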
If you are simply moving through bytes in a string, and don't concern yourself with
whether they are characters or not, simply use the p++ and i++ C coding conventions.
Skipping White Space in Strings
You should not check for blank directly, as a two-byte blank may be in place. In particular, the following code, which simply skips any whitespace in a string, will NOT work.
/* ERROR ERROR ERROR ERROR */
for(x=string; (*x==' ')||(*x=='\n')||(*x=='\t'); x++);
...
/* ERROR ERROR ERROR ERROR */
Instead, use the CMwhite and CMspace macros.
for(x=string; CMwhite(x); CMnext(x));
...
The CMspace macro is used to check for the space character only, instead of any white space.
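The effect of a two-byte blank can be shown without the CL. In the sketch below, sk_white and sk_bytecnt are hypothetical stand-ins for CMwhite and CMnext, the rule that a high-bit byte starts a two-byte character is an assumption, and the byte pair 0x81 0x40 is used as an example of a double-byte space (it is the ideographic space in Shift-JIS). The real macros take these facts from the installed character set.

```c
/* Assumption for illustration only: a high-bit byte starts a two-byte
 * character (unless it is directly followed by the terminator). */
static int sk_bytecnt(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    return ((u[0] & 0x80) != 0 && u[1] != '\0') ? 2 : 1;
}

/* Stand-in for CMwhite: single-byte white space or the example
 * double-byte space 0x81 0x40. */
static int sk_white(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    if (u[0] == 0x81 && u[1] == 0x40)
        return 1;
    return u[0] == ' ' || u[0] == '\t' || u[0] == '\n' ||
           u[0] == '\r' || u[0] == '\f';
}

/* Return a pointer to the first character that is not white space. */
const char *sk_skip_white(const char *string)
{
    const char *x;

    for (x = string; sk_white(x); x += sk_bytecnt(x))
        ;
    return x;
}
```

A byte-wise test against ' ' would stop at the 0x81 byte and treat the double-byte space as the start of a token.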
Checking for Valid Names
You should be consistent in checking for valid Ingres names (tables, columns, reports, etc.) Only file names do not follow these conventions. Do not use the CMalpha macro. Instead, use the CMnmstart and CMnmchar macros. The first is used to check if a character is valid as the leading character in an Ingres name, and the second is used for succeeding characters.
/* ERROR ERROR ERROR ERROR */
char *name;
char *cp;
if (!CMalpha(name))
return(FAIL);
for(cp=name+1; *cp!=EOS; cp++)
{
if (!CMalpha(cp) && !CMdigit(cp) && (*cp!='_'))
return(FAIL);
}
/* ERROR ERROR ERROR ERROR */
Instead, use the following.
char *name;
char *cp;
cp = name;
if (!CMnmstart(cp))
return(FAIL);
for(CMnext(cp); *cp!=EOS; CMnext(cp))
{
if (!CMnmchar(cp))
return(FAIL);
}
Upper and Lower Casing
Code such as this:
/* ERROR ERROR ERROR ERROR */
char *oldstring,*newstring;
char *c,*d;
c = oldstring;
d = newstring;
for(; *c!=EOS; c++,d++)
{
*d = CHtolower(*c);
}
/* ERROR ERROR ERROR ERROR */
should be changed to be:
char *oldstring,*newstring;
char *c,*d;
c = oldstring;
d = newstring;
for(; *c!=EOS; CMnext(c), CMnext(d))
{
CMtolower(c,d);
}
Note that CMtolower converts the characters, but leaves the 'c' and 'd' pointers unchanged. The operation in the example should be done using the CVlower routine instead of the inline loop shown.
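The reason a byte-wise conversion is wrong can be seen in a self-contained sketch. Here sk_bytecnt and sk_lower_copy are hypothetical names, the rule that a high-bit byte starts a two-byte character is an assumption made for illustration, and two-byte characters are simply copied unchanged, which may differ from what a real character set defines. The second byte of a two-byte character can have the same value as an upper case letter, and it must not be altered.

```c
#include <ctype.h>

/* Assumption for illustration only: a high-bit byte starts a two-byte
 * character (unless it is directly followed by the terminator). */
static int sk_bytecnt(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;
    return ((u[0] & 0x80) != 0 && u[1] != '\0') ? 2 : 1;
}

/* Copy src to dst one character at a time, lower-casing one-byte
 * characters and copying two-byte characters untouched. dst must have
 * room for the same number of bytes as src, including the terminator. */
void sk_lower_copy(const char *src, char *dst)
{
    while (*src != '\0') {
        int len = sk_bytecnt(src);

        if (len == 2) {
            dst[0] = src[0];
            dst[1] = src[1];
        } else {
            dst[0] = (char)tolower((unsigned char)src[0]);
        }
        src += len;
        dst += len;
    }
    *dst = '\0';
}
```

With the input bytes 'A', 0x81, 'B', 'c', the sketch produces 'a', 0x81, 'B', 'c': the 'B' is the second byte of a two-byte character and is left alone.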
Writing Scanners
The conventions provided by the CM routines allow us to write C code without ever dealing with the 'char' datatype, and this works well in almost all instances in the code.
One exception to this rule is scanners, because it is very convenient to use switch statements on the next 'char' in a string for efficient code. Because this is very efficient, and because mixed one and two byte character sets still leave most of the one-byte characters of interest to scanners as one byte, use of a switch statement is still allowed. However, you should be careful to avoid copying the 'next' char into a temporary variable and manipulating it as an independent object.
Using the double byte character conventions, a typical scanner would look something like this:
u_char *Qry_prev, *Qry_next;
u_char *next_char = Qry_next;
Qry_prev = Qry_next;
while (next_char <= Qry_end)
{
switch (*next_char)
{
case '\t': case '\r':
CMnext(next_char);
continue;
case '\n':
CMnext(next_char);
yyline++;
continue;
}
if (CMspace(next_char))
{
CMnext(next_char);
continue;
}
if (CMnmchar(next_char))
{
CMtolower(next_char, next_char);
......
}
As you can see, the use of switch statements is allowed, if you are careful.
