Ingres CL CM

From Ingres Community Wiki


Compatibility Library Specification – CM

Abstract

This is the specification of the CM facility provided by the compatibility library for multi-byte character operations.

Revision: 1.1, 10-Nov-1997.

Document History

  • Revision 1.1, last modified 10-Nov-1997.
    • Converted to HTML.
  • Revision 1.0, last modified 29-Aug-91.
    • Fixed typos.
    • Changed c++ in examples to p++ to avoid confusion with the language C++.

Specification

Introduction

The CM (character manipulation) library allows programs to reference characters within strings. The routines let programmers deal with characters regardless of whether a character is one or two bytes long. This allows us to deal with Japanese Kanji character sets.

Library

CL

Intended Uses

CM provides functions to manipulate and classify possibly double-byte character objects. In order to provide character manipulation without relying on a one byte character set, the CM module is used for the following common programming tasks:

  • Point to the next or previous character within a string (in place of ++c, --c). Also, increment and decrement byte counters associated with strings (in place of ++i, --i).
  • Check attributes of a character (e.g., is the next character a digit, a printable character, etc.).
  • Copy characters from one string to another (in place of *c=*d). These can also convert the case of a character in the copy.
  • Compare two characters, either with or without case significance.

To support local collation sequences, routines to read and write collation sequence description files were added. This file type is separate from DI because it is needed by FE programs, and it is not SI because that is not legal for the DB server. The routines added were:

CMopen_col open collation file for reading
CMread_col read collation file
CMclose_col close collation file
CMdump_col create and write a collation file.

To support multiple character sets, CM provides the ability to switch character sets on an installation specific basis. CM had been using #ifdef's and compiling in separate attribute tables for different character sets, but this was untenable, requiring new executables to be built and shipped to support any new character set, as opposed to simply providing a definition file for, say, the Greek character set.

The traditional practice of adding or subtracting a constant 'A' - 'a' to translate between upper and lower case does not work in all character sets, so a case translation table is also provided.
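
The point about case translation can be sketched as follows. This is only an illustration, not the CM implementation: the names demo_lower, demo_xcase, demo_pair, demo_init and demo_toupper are hypothetical, and the single 8-bit pair used (0x82 and 0x90, the lower and upper case e-acute in code page 437) is chosen only because its two values are not a fixed distance of 'A' - 'a' apart.

```c
/* Hypothetical stand-ins for a character set's tables (not the CM API). */
static unsigned char demo_lower[256];   /* nonzero if value is lower case   */
static unsigned char demo_xcase[256];   /* other-case counterpart, if cased */

/* Register one upper/lower pair in both directions. */
static void demo_pair(unsigned char upper, unsigned char lower)
{
    demo_lower[lower] = 1;
    demo_xcase[lower] = upper;
    demo_xcase[upper] = lower;
}

/* Build a table for the ASCII letters plus one code page 437 pair.
 * The loop assumes an ASCII host, where A-Z and a-z are contiguous. */
static void demo_init(void)
{
    int i;

    for (i = 0; i < 26; i++)
        demo_pair((unsigned char)('A' + i), (unsigned char)('a' + i));
    demo_pair(0x90, 0x82);  /* E acute / e acute in code page 437 */
}

/* "Safe" conversion: values that are not lower case pass through. */
static unsigned char demo_toupper(unsigned char c)
{
    return demo_lower[c] ? demo_xcase[c] : c;
}
```

With this table, demo_toupper(0x82) yields 0x90, whereas adding 'A' - 'a' to 0x82 would yield 0x62, which is the letter 'b'.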

Assumptions

We assume that we can support most 8-bit character sets. We can also support JIS type mixed 8 and 16-bit character sets. These use the convention that the leading byte of a character indicates whether it is the first byte of a multi-byte character.

This abstraction does not assume that only one or two byte characters can be supported.

This abstraction does assume that strings are terminated, at the start of a character, with a single byte NULL terminator (referred to as EOS).

This abstraction cannot be used for EBCDIC "shift-in/shift-out" 16-bit character sets, which require state information to be kept during string manipulation.

Only the single-byte character sets are runtime switchable. It is not possible to switch between multiple-byte and single-byte character sets via the runtime switch. This means that the CM attribute routines and CMnext will work as efficiently as they do now for the single-byte case.

Only a single character set is intended to be in use in a given database, and that set must be chosen once, and never changed. In fact, the character set may be left constant for an entire installation.

The main reason for specifying a character set in CMset_attr() is to allow frontends to use the appropriate character set when connected across a net.

This is a very important assumption. All sorts of havoc can, and will, occur if the character set attributes are changed once a database really contains any data. Hence, it is important to note that the installation tool is the only caller of CMwrite_attr().

There is an implicit assumption in mainline code that an upper (lower) case character automatically has a lower (upper) case counterpart.

CMcmp*** routines might also be theoretically affected by this. However, if mainline code is really attempting character string sorting that reflects local character set conventions, it should do so through interfaces that make use of the ADT collating sequence definitions. Therefore, CMcmp*** is left simple minded.

Definitions and Concepts

Character Value: In the single-byte case, an integer value in the range 0 - 255 that is assumed to represent a character of some sort. The attributes of that character are determined by an installation-specific definition.
Character attributes: The CM interface includes several routines to classify characters by type: CMalpha, CMnmstart, CMnmchar, CMprint, CMdigit, CMlower, CMupper, CMwhite, CMspace, CMhex, CMoper.
Whether or not a given character value falls into one of these classifications is dependent on the character set being used.
There are also routines CMtolower and CMtoupper to translate character case. Which character value is the alternate case counterpart of any other is also dependent on the character set being used.
CMATTR structure: The CMATTR structure contains the attribute array and the case translation table defining a character set. The CM calls use pointers to an attribute array and case translation table which may be initialized from such a structure.
This structure is surfaced to allow it to be filled in by the installation tool, which is mainline code calling CMwrite_attr(). The human readable text description of a character set is read by this program and used to create a CMATTR, which is then written through CMwrite_attr().
Character attribute file: The CM routines allow character set information to be read from, or written to, a file. The file name and location are hidden. The read operation includes initializing the CM module so that all the routines reflect the information read in.
The write operation is only used by a tool which may be used as part of the installation procedure, or as part of the build procedure for preparing a release. Which of these scenarios is followed depends on whether character sets are something defined by VARs and customers, who may add to the current set, or whether they are more tightly controlled by some governing body.
Default character set: If no attribute file exists, or if the call is never made to read one, the CM attribute routines will reflect a default, built-in character set (i.e., the one currently compiled into the CM module).
Double Byte Character: Any character that requires two bytes to represent. All Kanji characters are double byte, but other characters are double byte as well.
Byte Count: The number of bytes (8-bit) since the start of a string. Note that this is not necessarily the number of characters (as some of the characters may be double byte characters).
Next Character: The next character in a string is the character represented by the byte (or bytes) at which the string pointer is currently positioned.
Alphabetic Character: Any of the common alphabetic characters within a character set, which includes [a-z], [A-Z], and any additional characters (such as a with umlaut, or e with accent) commonly used in the alphabet for the character set. For the Japanese character set, this includes katakana characters. This does not include the underscore character.
Leading Name Character: Any valid character which can begin an INGRES name (table, column, user, form, etc.). It can be any alphabetic character or a kanji character.
Trailing Name Character: Any valid character which can be contained in an INGRES name (table, column, user, form, etc.). This is defined as a superset of the SQL standard. It can be any alphabetic or kanji character, a "$", "@", "#", an underscore character, or a digit.
Printing Character: Any character that can be displayed directly on the terminal as a single character. Control characters are not included.
Control Characters: Any non-printing character that requires special processing to display (e.g., to put a caret in front of it).
Whitespace Characters: Any character used for space control only. These are space, double byte space, tab, line feed, carriage return, and form feed.

The general abstraction is that all character manipulation within C is done using these routines, rather than the standard C conventions. Instead of dealing with characters as independent objects, the programmer always uses string pointers pointing to the "next character" in a string, which may be one or two bytes. The CM macros are used to abstract out the one-byte/two-byte knowledge.
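
The following sketch illustrates the "next character" style of traversal without depending on <cm.h>. Everything in it is a stand-in: demo_dbl1st, demo_bytecnt and demo_charcount are hypothetical names, and the lead-byte rule is an assumed Shift-JIS-like range, whereas real CM code would call CMdbl1st, CMbytecnt and CMnext and take the rule from the character set's attribute table.

```c
#include <stddef.h>

/* Hypothetical lead-byte rule (Shift-JIS-like ranges, an assumption). */
static int demo_dbl1st(const char *p)
{
    unsigned char b = (unsigned char)*p;

    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Stand-in for CMbytecnt: 1 or 2 bytes in the next character.
 * The check on p[1] is a defensive extra for truncated input. */
static int demo_bytecnt(const char *p)
{
    return (demo_dbl1st(p) && p[1] != '\0') ? 2 : 1;
}

/* Count characters (not bytes) up to EOS, advancing by whole
 * characters the way CMnext would. */
static size_t demo_charcount(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        s += demo_bytecnt(s);
        n++;
    }
    return n;
}
```

For a five-byte string holding 'a', 'b', one two-byte character and 'c', demo_charcount reports four characters, which is the distinction between byte count and character count drawn in the definitions above.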

These routines are all defined as macros, SO YOU MUST INCLUDE <cm.h> IN ORDER TO USE ANY OF THE CM ROUTINES.

The macros serve two purposes. First, they allow an (almost) portable set of routines to be written for these functions. More importantly, they allow us to provide two definitions for the macros, depending on whether or not any two byte characters are in the character set. In the case of character sets which do not contain any two byte characters, the CM routines convert into the familiar C coding conventions (++c, ++i, etc.). Only in the case of character sets which contain two byte characters are the penalties for extra checks realized.

The base information used for the CM routines is contained in a character set dependent table local to the CM routines. Clients of the CM library never need refer to this table directly, but implementors need to change it for different character set implementations. The compiler constant DOUBLEBYTE (for double-byte character set) is used to indicate that double byte characters are used in the character set. Normally, CM clients need not be aware of this compiler flag, but in unusual circumstances, it may be used.

Only when manipulating characters within strings are these routines needed. If you are processing strings as sequences of bytes (as you do with inline string copies), you can use the more familiar p++ conventions of C.

This approach obviously greatly diminishes the role of the char (as opposed to char *) datatype. Only in very limited circumstances are char variables allowed. The current exceptions are for use in scanners (in order to allow switch statements) and internally encoded strings (which are known to provide certain special characters).

Moving backwards within strings is highly discouraged under this abstraction. While single byte character manipulation will translate the CMprev call to a --c (low cost), the double byte implementation needs to reprocess a string from its beginning to move back a single character. Use of byte counters is also discouraged. Instead, if you need to know how many bytes are between the current pointer and the start of the string, you should use pointer arithmetic. For compatibility with current code, the CMbytedec and CMbyteinc routines are provided for counting bytes. However, there is an important order dependency when using them in conjunction with CMnext and CMprev. In particular, you must make sure that you call CMnext AFTER calling CMbyteinc, and call CMprev BEFORE calling CMbytedec, in order to make sure that you are counting the bytes of the correct character.
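
The forward half of this ordering rule can be sketched with stand-in macros. The DEMO_* macros and demo_offset_of below are hypothetical and are not the <cm.h> definitions; the lead-byte range is an assumption made only so that the sketch is self-contained.

```c
/* Hypothetical stand-ins, not the real <cm.h> definitions. */
#define DEMO_DBL1ST(p) \
    ((unsigned char)*(p) >= 0x81 && (unsigned char)*(p) <= 0x9F)
#define DEMO_BYTECNT(p)        (DEMO_DBL1ST(p) ? 2 : 1)
#define DEMO_BYTEINC(count, p) ((count) += DEMO_BYTECNT(p))
#define DEMO_NEXT(p)           ((p) += DEMO_BYTECNT(p))

/* Return the byte offset of the n-th character (0-based) in s. */
static int demo_offset_of(const char *s, int n)
{
    int count = 0;

    while (n-- > 0 && *s != '\0') {
        DEMO_BYTEINC(count, s);  /* first: size is read from the character at s */
        DEMO_NEXT(s);            /* then: move s past that same character */
    }
    return count;
}
```

If the two calls were swapped, the counter would be advanced by the size of the following character rather than the one just passed, which is the error the ordering rule guards against.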

CM Routines for Checking Character Attributes:
CMalpha test for alphabetic character
CMdbl1st test for 1st byte of double byte
CMnmstart test for leading character in name
CMnmchar test for trailing character in name
CMprint test for printing character
CMdigit test for digit
CMcntrl test for control character
CMlower test for lower case character
CMupper test for upper case character
CMwhite test for white space
CMspace test for space character
CMhex test for hexadecimal digit
CMoper test for operator character

CM Routines for String Movement:
CMnext increment character pointer
CMprev decrement character pointer
CMbyteinc increment byte counter
CMbytedec decrement byte counter
CMbytecnt count the bytes in the next character

CM Routines for Copying Characters:
CMcpychar copy character to string
CMcpyinc copy character to string and increment
CMtolower copy lower case character to string
CMtoupper copy upper case character to string
CMcopy copy a character string of specified length

CM Routines for Comparing Characters:
CMcmpnocase compare two characters, ignoring case
CMcmpcase compare two characters for exact match

Header File <cm.h>

The header file <cm.h> must be included before using any of the interfaces provided. Many are macros. It also defines the following.

CM_MAXATTRNAME

Maximum length for name identifying character set.

#define	CM_MAXATTRNAME	8

CMATTR - CM attribute structure

Structure which defines the attribute array and case translation table for a character set. Both items are arrays indexed by character value. The attr array contains the bits used for the character classification routines. The xcase array contains the alternate case character value if the character is upper or lower case alpha. It is meaningless otherwise.

typedef struct
{
     u_i2 attr[256];		/* attribute bits */
     char xcase[256];	/* case translation */
} CMATTR;

Attribute Bits

Historical note:

The list below represents the earlier compiled-in set, with only a few modifications. The old _IU and _IL (0100 and 0200), which are "international upper" and "international lower", are superseded by this interface. The NCALPHA bit is added, indicating an alpha character that does not have an alternate case representation, and hence is not considered to be either upper or lower case. 0400 was the "katakana" bit, which already had this interpretation where used.

# define	CM_A_UPPER	1	/* Upper case ALPHA */
# define	CM_A_LOWER	2	/* Lower case ALPHA */
# define	CM_A_NCALPHA	4	/* non-cased alpha */
# define	CM_A_DIGIT	8	/* DIGIT */
# define	CM_A_SPACE	0x10	/* SPACE */
# define	CM_A_PRINT	0x20	/* PRINTABLE */
# define	CM_A_CONTROL	0x40	/* CONTROL */
# define	CM_A_DBL1	0x80	/* 1st byte of double byte character */
# define	CM_A_DBL2	0x100	/* 2nd byte of double byte character */
# define	CM_A_NMSTART	0x200	/* Leading character of a name */
# define	CM_A_NMCHAR	0x400	/* Trailing character of a name */
# define	CM_A_HEX	0x800	/* Hexadecimal Digit */
# define	CM_A_OPER	0x1000	/* Operator Character */
# define	CM_A_ALPHA	(CM_A_UPPER|CM_A_LOWER|CM_A_NCALPHA)
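
As an illustration of how such bits can drive the classification routines, here is a minimal sketch of a table lookup. It is an assumption about the general technique, not the actual <cm.h> macros: demo_u_i2, demo_attr, demo_init_attr, DEMO_ALPHA and DEMO_HEX are hypothetical names, and the table is filled for ASCII letters and digits only.

```c
typedef unsigned short demo_u_i2;   /* stand-in for the CL type u_i2 */

/* Subset of the attribute bits listed in this specification. */
#define CM_A_UPPER    1
#define CM_A_LOWER    2
#define CM_A_NCALPHA  4
#define CM_A_DIGIT    8
#define CM_A_HEX      0x800
#define CM_A_ALPHA    (CM_A_UPPER|CM_A_LOWER|CM_A_NCALPHA)

static demo_u_i2 demo_attr[256];    /* hypothetical attribute array */

/* Fill the table for ASCII letters and digits only. */
static void demo_init_attr(void)
{
    const char *p;

    for (p = "0123456789"; *p != '\0'; p++)
        demo_attr[(unsigned char)*p] |= CM_A_DIGIT | CM_A_HEX;
    for (p = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; *p != '\0'; p++)
        demo_attr[(unsigned char)*p] |= CM_A_UPPER;
    for (p = "abcdefghijklmnopqrstuvwxyz"; *p != '\0'; p++)
        demo_attr[(unsigned char)*p] |= CM_A_LOWER;
    for (p = "ABCDEFabcdef"; *p != '\0'; p++)
        demo_attr[(unsigned char)*p] |= CM_A_HEX;
}

/* Classification as a mask test on the byte the pointer addresses. */
#define DEMO_ALPHA(str) (demo_attr[(unsigned char)*(str)] & CM_A_ALPHA)
#define DEMO_HEX(str)   (demo_attr[(unsigned char)*(str)] & CM_A_HEX)
```

A mask test of this kind yields zero or some nonzero bit pattern rather than exactly 1, so callers of such a macro should test the result for truth instead of comparing it with a particular value.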

COL_BLOCK - size of an in-memory collation description

This constant is the size of the buffer that needs to be handed to CMread_col.

Executable Interface

The following function-like interfaces are provided.

CMalpha - test for alphabetic character

Test next character in string for alphabetic.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if the character is not alphabetic.

Definition:

   bool
   CMalpha(str)
   char    *str;

CMbytecnt - count bytes in the next character

Count the number of bytes in the next character, returning either 1 or 2. This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character within a string.

Outputs:

None

Returns:

The number of bytes in the next character, either 1 or 2.

Definition:

   i4
   CMbytecnt(str)
   char    *str;

CMbytedec - decrement a byte counter

Decrement a byte counter one or two bytes within a string, depending on whether the next character in the string pointed to by 'str' is one or two bytes long. This mimics the --i expression of C when used in manipulating string pointers, but takes into account 2 byte characters.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>. CMbytedec has a side effect, and has to be implemented as a macro.

Inputs:

count byte counter depending on 'str'
str pointer to character within a string.

Outputs:

count This will change the value of the parameter 'count' either to 'count-1' or 'count-2'.

Returns:

The return value is set to the value of 'count', after decrement (as done by --i).

Side Effects:

The count argument is decremented by 1 or 2. Note that this is NOT a pointer argument.

Definition:

   i4
   CMbytedec(count, str)
   i4      count;
   char    *str;

CMbyteinc - increment a byte counter

Increment a byte counter one or two bytes within a string, depending on whether the next character in the string pointed to by 'str' is one or two bytes long. This mimics the ++i expression of C when used in manipulating string pointers, but takes into account 2 byte characters.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>. CMbyteinc has a side effect, and has to be implemented as a macro.

Inputs:

count byte counter depending on 'str'
str pointer to character within a string.

Outputs:

count This will change the value of the parameter 'count' either to 'count+1' or 'count+2'.

Returns:

The return value is set to the value of 'count', after increment (as done by ++i).

Side Effects:

The count argument is incremented by 1 or 2. Note that this is NOT a pointer argument.

Definition:

   i4
   CMbyteinc(count, str)
   i4      count;
   char    *str;

CMclose_col - close collation file

Close an open collation file.

Inputs:

None

Outputs:

syserr System specific error

Returns:

OK if operation succeeded, otherwise system specific error status.

Side Effects:

Releases semaphore which may have been acquired during CMopen_col.

Definition:

   STATUS
   CMclose_col(syserr)
   CL_ERR_DESC		*syserr;

CMcmpcase - compare characters, case sensitive

Compare the next character in each of two strings, either single or double byte. (Case is significant.)

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str1 pointer to first character to compare.
str2 pointer to second character to be compared.

Outputs:

None

Returns:

< 0 if char1 < char2
0 if char1 == char2
> 0 if char1 > char2
(char1 and char2 are the next characters, possibly double byte, in the strings str1 and str2 respectively.)

Definition:

   i4
   CMcmpcase(str1, str2)
   char    *str1;
   char    *str2;

CMcmpnocase - compare characters, ignoring case

Compare the next character in each of two strings, either single or double byte characters. (Case is not significant)

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str1 pointer to first character to compare.
str2 pointer to second character to be compared.

Outputs:

None.

Returns:

< 0 if char1 < char2
0 if char1 == char2
> 0 if char1 > char2

Definition:

   i4
   CMcmpnocase(str1, str2)
   char    *str1;
   char    *str2;

CMcntrl - test for non-printing character

Test next character in string for non-printing character

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if the character is not a control character.

Definition:

   bool
   CMcntrl(str)
   char    *str;

CMcopy - copy characters from source to destination

CMcopy will copy (not necessarily null-terminated) character data from a source buffer into a destination buffer of a given length. Like STlcopy, CMcopy returns the number of bytes it copied (which may differ from the number of bytes requested to be copied if the last character was double byte and could not be completely copied).

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>. Also note that because some of the macro expansions may use copy_len twice, you should be aware of possible side effects (i.e., ++ and -- operators).

Inputs:

source Pointer to beginning of source buffer.
copy_len Number of bytes to copy.
dest Pointer to beginning of destination buffer.

Outputs:

None

Returns:

Number of bytes actually copied (this may be less than copy_len in the case of double-byte characters, but will never be more).

Definition:

   u_i4
   CMcopy( source, copy_len, dest )
   char *source;
   u_i4  copy_len;
   char *dest;

Example:

u_i4 length;
length = CMcopy(source, copy_len, dest);
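
The truncation behaviour described above can be sketched as follows. This is not the CMcopy implementation: demo_lead_byte and demo_copy are hypothetical names, and the lead-byte range is an assumed Shift-JIS-like rule standing in for the character set's real attribute table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lead-byte rule (an assumption, not taken from CM). */
static int demo_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Copy at most copy_len bytes from source to dest without splitting
 * a two-byte character; return the number of bytes actually copied. */
static size_t demo_copy(const char *source, size_t copy_len, char *dest)
{
    size_t n = 0;

    while (n < copy_len) {
        size_t w = demo_lead_byte((unsigned char)source[n]) ? 2 : 1;

        if (n + w > copy_len)
            break;              /* next character would not fit whole */
        n += w;
    }
    memcpy(dest, source, n);
    return n;
}
```

When the requested length would end in the middle of a two-byte character, the result is one byte short of copy_len, which is why callers should use the returned length rather than the length they asked for.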

CMcpychar - copy one character to another

Copy one character (either one or two bytes) from string 'src' to the current position in string 'dst'. This mimics the (*d = *s) expression of C.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

src pointer to character, which is positioned at the point in string from which character is to be copied.
dst pointer in string into which character is to be copied.

Outputs:

None

Returns:

None

Definition:

   VOID
   CMcpychar(src, dst)
   char    *src;
   char    *dst;

CMcpyinc - copy one character to another, incrementing pointer

Copy one character (either one or two bytes) from string 'src' to the next position in string 'dst', incrementing the pointers for both strings by one or two bytes. This mimics the (*dst++ = *src++) expression of C.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

src pointer to character, which is positioned at the point in string from which character is to be copied.
dst pointer in string into which character is to be copied.

Outputs:

src incremented by 1 or 2 bytes
dst incremented by 1 or 2 bytes.

Returns:

None

Side Effects:

The passed strings, 'src' and 'dst', are both changed upon return.

Definition:

   VOID
   CMcpyinc(src, dst)
   char    *src;
   char    *dst;

CMdbl1st - test for first byte of a double-byte character

Test next character in string for leading byte of a two byte character

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if the byte at the string is not the first of a two-byte character.

Definition:

   bool
   CMdbl1st(str)
   char    *str;

CMdigit - test for digit character

Test next character in string for digit

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if the character is not a digit.

Definition:

   bool
   CMdigit(str)
   char    *str;

CMdump_col - create and write a collation file

Create and write a collation file.

Inputs:

colname collation name
tablep collation table pointer
tablen collation table length

Outputs:

syserr System specific error

Returns:

OK if operation succeeded, otherwise system specific error status.

Definition:

   STATUS
   CMdump_col(colname, tablep, tablen, syserr)
   char			*colname;
   PTR			tablep;
   i4 			tablen;
   CL_ERR_DESC		*syserr;

CMhex - test for hexadecimal digit character.

Checks that the current character of the input string is a hexadecimal digit. That is, one of [0-9A-Fa-f].

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str Pointer to character within a string.

Outputs:

None

Returns:

FALSE if not a hexadecimal digit.

Definition:

   bool
   CMhex(str)
   char *str;

CMlower - test for lower case

Test next character in string for lower case.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if not a lower case character.

Definition:

   bool
   CMlower(str)
   char    *str;

CMnext - increment character string pointer

Move string pointer forward one character within a string. This mimics the ++c expression in C, but takes into account whether the next character is one or two bytes long.

Note that if you use this routine in conjunction with either CMbyteinc or CMbytedec, you should call the CMbyte* routine first to make sure that you are pointing to the same character.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be incremented

Outputs:

str This will change the value of the parameter 'str' to either 'str+1' or 'str+2', depending on the size of the character.

Returns:

The incremented value sent to the routine (as done by ++c).

Definition:

   char *
   CMnext(str)
   char    *str;

CMnmchar - check trailing character as an Ingres name

Within a string, check the next character to see if it is a valid character within an INGRES name.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

nm pointer to character within a name

Outputs:

None

Returns:

FALSE if the character is not valid in a name.

Definition:

   bool
   CMnmchar(nm)
   char    *nm;

CMnmstart - check leading character of an Ingres name

Within a string, check the next character to see if it is a valid character as the first position of an INGRES name.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

nm pointer to character within a name

Outputs:

None

Returns:

FALSE if the character is not valid as the start of a name.

Definition:

   bool
   CMnmstart(nm)
   char    *nm;
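
A name check built from the two tests might look like the sketch below. It is ASCII-only and uses <ctype.h> stand-ins that follow the Leading and Trailing Name Character definitions given earlier; demo_nmstart, demo_nmchar and demo_valid_name are hypothetical names, and real code would call CMnmstart, CMnmchar and CMnext so that double byte characters are handled as well.

```c
#include <ctype.h>
#include <string.h>

/* Stand-in for CMnmstart: alphabetic characters only (ASCII sketch). */
static int demo_nmstart(const char *p)
{
    return isalpha((unsigned char)*p) != 0;
}

/* Stand-in for CMnmchar: alphabetic, digit, '$', '@', '#' or '_'.
 * EOS is rejected explicitly because strchr would match it. */
static int demo_nmchar(const char *p)
{
    unsigned char c = (unsigned char)*p;

    return c != '\0' && (isalnum(c) || strchr("$@#_", c) != NULL);
}

/* Check a whole EOS-terminated name, one character at a time. */
static int demo_valid_name(const char *nm)
{
    if (!demo_nmstart(nm))
        return 0;
    for (nm++; *nm != '\0'; nm++)   /* single-byte step; CMnext(nm) in real code */
        if (!demo_nmchar(nm))
            return 0;
    return 1;
}
```

Under these assumptions "emp_2024$" is accepted, while "9lives" fails the leading-character test and "bad-name" fails on the hyphen.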

CMopen_col - open collation file for reading

Open collation file for reading.

Inputs:

colname collation name

Outputs:

syserr System specific error

Returns:

OK if operation succeeded, otherwise system specific error status.

Side Effects:

May acquire a semaphore to protect itself from reentry and hold that semaphore until the file is closed. CMclose_col must be called to release this semaphore.

Definition:

   STATUS
   CMopen_col(colname, syserr)
   char			*colname;
   CL_ERR_DESC		*syserr;

CMoper - test for operator character

Test next character in a string for membership in the set of operators. This set of operators constitutes the set of all INGRES operators that are used throughout the query languages and their various extensions. These operators also include all the operators of host languages in which a query language may be embedded. The set of operators is made up of all printable characters, less the set of alphanumeric and space characters. This set of operators may change across different character sets (e.g., the EBCDIC value for ^), and languages (e.g., the Spanish question prefix '¿' - the upside down '?').

For example, the set of ASCII (American) operators is: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | } ~

Inputs:

str Pointer to character to be checked.

Outputs:

None

Returns:

FALSE if the character is not an operator.

Definition:

   bool
   CMoper(str)
   char *str;

CMprev - decrement a character pointer

Move backwards one character within a string. This mimics the --c expression in C, but takes into account whether the previous character is one or two bytes long.

Note that if you use this routine in conjunction with either CMbyteinc or CMbytedec, you should call the CMprev routine first to make sure that you are pointing to the same character. THE USE OF THIS ROUTINE IS HIGHLY DISCOURAGED UNLESS ABSOLUTELY NECESSARY. PLEASE TRY TO CODE SUCH THAT MOVING BACKWARDS WITHIN A STRING IS NOT NEEDED.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be decremented
startpos start pointer of the string being processed (or a location in the string which you know contains the start of a character), as double byte processing must reprocess the string from the start to find the previous character.

Outputs:

str This will change the value of the parameter 'str' to either 'str-1' or 'str-2', depending on the size of the character.

Returns:

The value of 'str' after decrement (as done by --c).

Definition:

   char *
   CMprev(str, startpos)
   char    *str;
   char    *startpos;
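
Why the double byte case needs startpos can be seen in a small sketch. In encodings of this family the second byte of a two-byte character may have a value that is also a valid single-byte character, so the byte just before str cannot be classified on its own. The names demo_is_lead and demo_prev are hypothetical, and the lead-byte range is an assumed Shift-JIS-like rule; this is not the CMprev macro itself.

```c
/* Hypothetical lead-byte rule (an assumption, not taken from CM). */
static int demo_is_lead(const char *p)
{
    unsigned char b = (unsigned char)*p;

    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Return the start of the character that precedes str, found by
 * rescanning forward from startpos, which must itself be the start
 * of a character. Returns startpos if str is already at startpos. */
static const char *demo_prev(const char *str, const char *startpos)
{
    const char *p = startpos;
    const char *prev = startpos;

    while (p < str) {
        prev = p;                       /* remember where this character began */
        p += demo_is_lead(p) ? 2 : 1;   /* step over one whole character */
    }
    return prev;
}
```

The cost of each call grows with the distance from startpos, which is the reason backward movement is discouraged above.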

CMprint - test for printing character

Test next character in string for printing character

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if the character is not printable.

Definition:

   bool
   CMprint(str)
   char    *str;

CMread_col - read from collation file

Read a COL_BLOCK size record from open collation file.

Inputs:

bufp pointer to buffer

Outputs:

syserr System specific error

Returns:

OK if operation succeeded, otherwise system specific error status.

Side Effects:

Moves open file to next record.

Definition:

   STATUS
   CMread_col(bufp, syserr)
   char		*bufp;
   CL_ERR_DESC		*syserr;

CMset_attr - set character attributes

Sets character attribute and case translation to correspond to a given character set. If the named character set has never been installed, assure that the default character set is in effect.

This routine must be called at startup by all executables, and at a convenient entry into INGRES code for user 3GL applications. Having been called once, it need never be called again, although it won't hurt anything if that happens.

Inputs:

name name of character set to be chosen, at most CM_MAXATTRNAME in length.

Outputs:

None

Returns:

OK if operation succeeded, otherwise system specific error status.

Side Effects:

May perform file i/o on a "hidden" file name.

Definition:

   STATUS
   CMset_attr(name, err)
   char		*name;
   CL_ERR_DESC	*err;

CMspace - check for space or double byte space

Within a string, check to see if the next character is a space or double byte space.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

str pointer positioned at the point in the string to check a character for space. (Also see CMwhite()).

Outputs:

None

Returns:

FALSE if the character is not a space.

Definition:

   bool
   CMspace(str)
   char    *str;

CMtolower - convert to lower case

Copy one character (either one or two bytes) from string 'src' to string 'dst', with possible conversion to lower case.

Note that the copy can be done in place ('src' = 'dst').

This routine must be "safe", that is, if the source character is not upper case, it is simply copied to the destination without conversion.

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

src pointer to character, which is positioned at the point in string from which character is to be copied.

Outputs:

dst pointer in string into which lower case version of character is to be placed.

Returns:

None

Definition:

   VOID
   CMtolower(src, dst)
   char    *src;
   char    *dst;

CMtoupper - convert to upper case

Copy one character (either one or two bytes) from string 'src' to string 'dst', with possible conversion to upper case.

This routine must be "safe", that is, if the source character is not lower case, it is simply copied to the destination without conversion.

Note that the copy can be done in place ('src' = 'dst').

This routine, which is a system-independent macro, requires the user to include the global header <cm.h>.

Inputs:

src pointer to character, which is positioned at the point in string from which character is to be copied.

Outputs:

dst pointer in string into which upper case version of character is to be placed.

Returns:

None

Definition:

   VOID
   CMtoupper(src, dst)
   char    *src;
   char    *dst;

CMunctrl - return string representation of a control character

Returns a pointer to a string representing the character that was passed to it. This routine is most useful when one wants to output a representation of a control character to a user's terminal. If the character is a printable character, the return string only contains the character.

In order to allow this routine to be used in a server environment, it may not be implemented in such a fashion as to return a static buffer overwritten by the next call.

Also, please note that the ASCII and EBCDIC versions will be different. An implementation is free to pick an appropriate representation of control characters for the system. For example, on ASCII systems control characters in the range 0-037 and 0177 (octal) become their upper-case equivalents preceded by a ^ character. On VMS systems the control characters above 0177 (octal) become the strings VMS displays them as. Those that have no VMS display are returned as <XN> where N is their hex value.

Inputs:

str pointer to the next character in a string.

Outputs:

None

Returns:

printable string representation of character.

Definition:

   char*
   CMunctrl(str)
   char          *str;
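
The ASCII mapping described above can be sketched as follows. This is an illustration only: ascii_unctrl is a hypothetical name, it handles one-byte characters only, and it writes into a caller-supplied buffer, which is NOT the CMunctrl signature but is one simple way to avoid a static buffer that the next call would overwrite. Showing 0177 as ^? follows the common convention and is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* Write a printable representation of the one-byte ASCII character c into
   buf, which must hold at least 3 bytes, and return buf.
   ASSUMPTION: the caller-supplied buffer is a device of this sketch only. */
static char *ascii_unctrl(unsigned char c, char *buf)
{
    assert(buf != NULL);
    if (c < 040 || c == 0177)
    {
        /* Control character: '^' followed by the character with bit 0100
           flipped, so 001 -> 'A', 000 -> '@' and 0177 -> '?'. */
        buf[0] = '^';
        buf[1] = (char)(c ^ 0100);
        buf[2] = '\0';
    }
    else
    {
        /* Anything else: the string contains only the character itself. */
        buf[0] = (char)c;
        buf[1] = '\0';
    }
    return buf;
}
```

Each caller owns its own buffer, so two threads in a server can format control characters at the same time without interfering with each other.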

CMupper - test for upper case

Test next character in string for upper case (Is *str in [A-Z]?).

This routine, which is a system-independent MACRO, requires the user to include the global header <cm.h>.

Inputs:

str pointer to character to be checked

Outputs:

None

Returns:

FALSE if the character is not upper case.

Definition:

   bool
   CMupper(str)
   char    *str;

CMwhite - check white space

Within a string, check to see if the next character is white space (for example, a space, tab, LF, NL, CR, FF or double byte space).

This routine, which is a system-independent MACRO, requires the user to include the global header <cm.h>.

Also see CMspace().

Inputs:

str pointer positioned at the point in the string to check a character for white space.

Outputs:

None

Returns:

FALSE if the character is not whitespace.

Definition:

   bool
   CMwhite(str)
   char    *str;

CMwrite_attr - write character attribute file

Use a passed-in CMATTR to define the character attributes and case translation for a given character set. Once this call is successful, CMset_attr calls in all other processes from that point on will reflect the given CMATTR (i.e., this routine produces a persistent copy, such as a file, accessed by CMset_attr()).

NOTE: This routine is ONLY to be called by a program used during installation or release procedures.

It is important that once a DB is populated, this does not get used except perhaps in recovery situations, and then only to restore the same character attributes originally defined.

Inputs:

name name of character set, at most CM_MAXATTRNAME in length.

attr character attributes and case translation.

Outputs:

None

Returns:

OK if operation succeeded, otherwise system specific error status.

Side Effects:

Performs file i/o on a "hidden" file name.

Definition:

STATUS
CMwrite_attr(name,attr,err)
char *name;
CMATTR *attr;
CL_ERR_DESC *err;

Examples:

Indexing Through Strings

The common C convention of p++ does not work with mixed 1 and 2 byte characters. In particular, the following C-Code does NOT work with Kanji.

/*	ERROR ERROR ERROR ERROR */
char		*string,*x;
i4		i;
                 ....
for (i=0,x=string; *x!='\0'; x++,i++)
{
/* process next character */
}
/*	ERROR ERROR ERROR ERROR */

Instead, use the new functions in the CL to increment and decrement pointers within strings, CMnext and CMprev, together with pointer arithmetic, so the code above becomes:

char		*string,*x;
i4		i;
                  ....
for (x=string; *x!=EOS; CMnext(x))
{
 /* process next character */
}
i = x - string;

If you need to use the byte counter routines, it is EXTREMELY important to put the CMbyteinc and CMnext calls in that order, as incrementing the counter AFTER the call will obviously count the bytes in the wrong character. When using CMprev and CMbytedec, you should specify the calls with CMprev first, for the same reason.

Note that you should try hard to avoid moving backwards in a string through the CMprev* routines, as Kanji must reprocess a string from the beginning to move back one character.

If you are simply moving through bytes in a string, and don't concern yourself with whether they are characters or not, simply use the p++ and i++ C coding conventions.
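
The ordering rule for the byte counter can be illustrated without <cm.h>. The sketch below uses a toy encoding in which any byte of 0x80 or above starts a two-byte character; that rule and the names toy_len and toy_bytes_in_chars are assumptions made for illustration, standing in for the real CM macros.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION (toy encoding, not the <cm.h> rule): any byte >= 0x80 that is
   followed by another byte is the first byte of a two-byte character. */
static size_t toy_len(const unsigned char *p)
{
    return (*p >= 0x80 && p[1] != '\0') ? 2 : 1;
}

/* Return the number of bytes occupied by the first n characters of s, or by
   the whole string if it holds fewer than n characters. */
static size_t toy_bytes_in_chars(const char *s, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t bytes = 0;
    assert(s != NULL);
    while (n-- > 0 && *p != '\0')
    {
        bytes += toy_len(p);  /* count first, while p is on this character */
        p += toy_len(p);      /* then advance to the next character */
    }
    return bytes;
}
```

Swapping the two statements in the loop body would measure the character after the one just passed, which is the mistake the ordering rule guards against.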

Skipping White Space in Strings

You should not check for blank directly, as a two-byte blank may be in place. In particular, the following code, which simply skips any whitespace in a string, will NOT work.

/*	ERROR ERROR ERROR ERROR */
for(x=string; (*x==' ')||(*x=='\n')||(*x=='\t'); x++);
                ...
/* 	ERROR ERROR ERROR ERROR */

Instead, use the CMwhite and CMspace macros.

for(x=string; CMwhite(x); CMnext(x));
                ...

The CMspace macro is used to check for the space character only, instead of any white space.
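
To see why a byte-by-byte test misses a two-byte blank, here is a self-contained sketch that does not use <cm.h>. It assumes the double-byte space is the byte pair 0x81 0x40 (as in Shift-JIS) and that the one-byte white space characters are space, tab, newline, carriage return and form feed. The names toy_white_len and toy_skip_white are hypothetical stand-ins, not the CMwhite definition.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTIONS (toy stand-ins, not the <cm.h> definitions): the double-byte
   space is 0x81 0x40; one-byte white space is ' ', \t, \n, \r and \f.
   Returns the length of the white space character at p, or 0 if none. */
static size_t toy_white_len(const unsigned char *p)
{
    if (p[0] == 0x81 && p[1] == 0x40)
        return 2;                       /* double-byte space */
    if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r' || *p == '\f')
        return 1;                       /* one-byte white space */
    return 0;                           /* not white space */
}

/* Return a pointer to the first non-white character of s. */
static const char *toy_skip_white(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    assert(s != NULL);
    while ((len = toy_white_len(p)) != 0)
        p += len;                       /* step over the whole character */
    return (const char *)p;
}
```

A loop that compares *x with ' ' alone would stop at the 0x81 byte and treat the double-byte space as the start of the next token.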

Checking for Valid Names

You should be consistent in checking for valid Ingres names (tables, columns, reports, etc.) Only file names do not follow these conventions. Do not use the CMalpha macro. Instead, use the CMnmstart and CMnmchar macros. The first is used to check if a character is valid as the leading character in an Ingres name, and the second is used for succeeding characters.

/* ERROR ERROR ERROR ERROR */
char	*name;
char	*cp;
if (!CMalpha(name))
   return(FAIL);
for(cp=name+1; *cp!=EOS; cp++)
{
     if (!CMalpha(cp) && !CMdigit(cp) && (*cp!='_'))
          return(FAIL);
}
/* ERROR ERROR ERROR ERROR */

Instead, use the following.

char	*name;
char	*cp;
cp = name;
if (!CMnmstart(cp))
     return(FAIL);
for(CMnext(cp); *cp!=EOS; CMnext(cp))
{
     if (!CMnmchar(cp))
          return(FAIL);
}
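
The same loop shape can be tried out without <cm.h> by substituting toy rules for the two macros. Everything about the rules below is an assumption made for illustration: a leading character may be a letter, an underscore or any two-byte character, and later characters may also be digits. The real rules are whatever CMnmstart and CMnmchar implement and may differ. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASSUMPTION (toy encoding): any byte >= 0x80 that is followed by another
   byte is the first byte of a two-byte character. */
static size_t toy_len(const unsigned char *p)
{
    return (*p >= 0x80 && p[1] != '\0') ? 2 : 1;
}

/* ASSUMED rule: a name may start with a letter, '_' or a two-byte character. */
static int toy_nmstart(const unsigned char *p)
{
    return toy_len(p) == 2 || isalpha(*p) || *p == '_';
}

/* ASSUMED rule: later characters may also be digits. */
static int toy_nmchar(const unsigned char *p)
{
    return toy_nmstart(p) || isdigit(*p);
}

/* Return 1 if name is valid under the toy rules, 0 otherwise. */
static int toy_valid_name(const char *name)
{
    const unsigned char *cp = (const unsigned char *)name;
    assert(name != NULL);
    if (!toy_nmstart(cp))
        return 0;
    for (cp += toy_len(cp); *cp != '\0'; cp += toy_len(cp))
    {
        if (!toy_nmchar(cp))
            return 0;
    }
    return 1;
}
```

Note that the loop never looks at the second byte of a two-byte character on its own, so a trailing byte that happens to equal a punctuation character cannot cause a false rejection.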

Upper and Lower Casing

Code such as this:

/* ERROR ERROR ERROR ERROR */
char	*oldstring,*newstring;
char	*c,*d;
c = oldstring;
d = newstring;
for(; *c!=EOS; c++,d++)
{
   *d = CHtolower(*c);
}
/* ERROR ERROR ERROR ERROR */

should be changed to be:

char	*oldstring,*newstring;
char	*c,*d;
c = oldstring;
d = newstring;
for(; *c!=EOS; CMnext(c), CMnext(d))
{
    CMtolower(c,d);
}
*d = EOS;	/* terminate the copy */

Note that CMtolower converts the characters, but leaves the 'c' and 'd' pointers unchanged. The operation in the example should be done using the CVlower routine instead of the inline loop shown.

Writing Scanners

The conventions provided by the CM routines allow us to write C code without ever dealing with the 'char' datatype, and this works well in almost all instances in the code.

One exception to this rule is scanners, because it is very convenient to use switch statements on the next 'char' in a string. Because this is very efficient, and because mixed one and two byte character sets still leave most of the one-byte characters of interest to scanners as one byte, use of a switch statement is still allowed. However, you should be careful to avoid copying the 'next' char into a temporary variable and manipulating it as an independent object.

Using the double byte character conventions, a typical scanner would look something like this:

u_char		*Qry_prev, *Qry_next;
u_char		*next_char = Qry_next;
Qry_prev = Qry_next;
while (next_char <= Qry_end)
{
    switch (*next_char)
    {
        case '\t': case '\r':
            CMnext(next_char);
            continue;
        case '\n':
            CMnext(next_char);
            yyline++;
            continue;
    }
    if (CMspace(next_char))
    {
        CMnext(next_char);
        continue;
    }
    if (CMnmchar(next_char))
    {
        CMtolower(next_char, next_char);
        ......
    }
}

As you can see, the use of switch statements is allowed, if you are careful.
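
A complete miniature of this pattern, runnable without <cm.h>, is sketched below. It counts newlines with a switch on the next byte and steps over everything else one character at a time. The toy encoding (any byte of 0x80 or above starts a two-byte character) and the names toy_len and toy_count_lines are assumptions for illustration, not part of the CM facility.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION (toy encoding): any byte >= 0x80 that is followed by another
   byte is the first byte of a two-byte character. */
static size_t toy_len(const unsigned char *p)
{
    return (*p >= 0x80 && p[1] != '\0') ? 2 : 1;
}

/* Count the newline characters in text. */
static int toy_count_lines(const char *text)
{
    const unsigned char *next_char = (const unsigned char *)text;
    int lines = 0;
    assert(text != NULL);
    while (*next_char != '\0')
    {
        /* Switch on the byte in place; it is never copied into a temporary. */
        switch (*next_char)
        {
        case '\n':
            lines++;
            /* fall through */
        case '\t': case '\r': case ' ':
            next_char++;        /* known one-byte character */
            continue;
        }
        /* Anything else may be two bytes wide: step by whole character. */
        next_char += toy_len(next_char);
    }
    return lines;
}
```

Because two-byte characters are always stepped over as a unit, the switch only ever sees the first byte of a character, so a second byte that happens to equal '\n' is not counted as a line break.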

