A Short Excursion Through The Ingres Back-End

From Ingres Community Wiki

Jump to: navigation, search

Contents

A Short Excursion Through The Ingres Back-End

by Karl Schendel Ingres Development


This document is a brief tour of the back-end DBMS server code of Ingres. I'll talk a little about how Ingres got to where it was, and try to explain the usually good reasons for its little peculiarities.

I'll have little to say here about the front-end tools. The character based front-end has been reasonably stable for some time, and there hasn't been much done to it. (Which is a good opportunity for someone!) OpenROAD and the various API interfaces are outside the scope of this document.

Although I am reasonably sure that I have my facts straight, all opinions as to goodness and badness are my own.

The codeline history

The current Ingres code line is about 15+ years old. Some of it, especially in the parser and optimizer, is based on even older code. This is not a criticism. The code has in general maintained a high quality level over the years. There are a few oddities directly traceable to its age, though; I'll get into more detail in a bit.

Ingres went through an initial 5 versions. The original Ingres code "just grew", and turned into a bit of a mess. Starting with version 6, the entire server was rewritten. The new code reused a lot of the algorithms and even code from earlier versions, but re-layered and cleaned up. Ingres version 6 went through several iterations, the most stable and most used being 6.4. The next version was renamed OpenIngres and re-versioned back to 1; the stable OpenIngres release was 1.2. (This was about the time that CA bought the product.) OpenIngres 2.0 was re-launched as Ingres II, which went through versions 2.0, 2.5, and 2.6. Ingres r3 was the version released as open source by CA. Post-divestiture, (the new) Ingres Corp has released Ingres 2006 (9.0.4), Ingres 2006 Release 2 (9.1.0), and as of this update is releasing Ingres 2006 Release 3 (9.2.0).

A very early variant of Ingres, very pre-6.0, has kicked around the net for years. It has sometimes been known as "university Ingres". I believe that a descendant of that code was used in the Postgres project at UC Berkeley, which then grew into PostgreSQL. Thus, Ingres and PostgreSQL are very distantly related, but neither is a descendant of the other, as some have erroneously claimed.

There have been 2 entirely separate development groups working on Ingres over the years. The original developers were part of RTI (aka the original Ingres Corp). When RTI/Ingres/ASK dissolved into mush, that entire development team disappeared. The second development team was brought together by CA. The CA team was never very large, but in my opinion it was of very high quality (and remains so, in their new home). CA as a corporation always supported Ingres, more or less, but unfortunately they were reluctant to invest in major long-term improvements until very late in the game. After Ingres was divested and Ingres Corp was formed, the vast majority of development and support engineers followed the product to the new Ingres Corp.

The original team seems to have been somewhat dogmatic, perhaps even bureaucratic, in their approach. (Caveat here: I did not know them personally.) This has led to some good results (the rigorous facility layering which has prevented the back-end from turning into a mush of spaghetti), and some not so good results (excessive bit-fiddling as you move from one sublayer to another; different facilities recomputing similar things; multiplicity of similar but not identical data structures). The current team has worked in a much more pragmatic environment. Thus, some of the less successful code from olden times is still there, simply because it's not hurting any customers directly, and the current developers have been focused on customer visible improvements.

Perhaps the most unfortunate effect of the transition is that when the RTI team disappeared, their goals and agenda went with them. We have relatively little knowledge about what they felt the weak points were, and where they wanted the code to go.

As an example, a particularly visible leftover from the RTI-to-CA transition is in the DDL arena. It appears that RTI was in the process of migrating DDL execution from a call-by-hand sequence ("psy") to a compiled query plan executed in the main query execution facility ("qeq"). The latter is more complex, and takes DDL through a number of unnecessary-looking phases; but perhaps was meant to lead to the ability to (eg) put DDL in database procedures. In any case, the transition was interrupted part-way through. We don't know what their end-point design, and the transition was never completed. Either completing or undoing the project would have been a significant effort with little external benefit. So, we have an unnecessarily large and complicated DDL code path.


VAX/VMS, compilers, ports, and the CLF

A C programmer first reading the Ingres code is likely to ask questions like "Why the heck aren't they using strcpy", or "What's up with this GLOBALDEF / GLOBALREF stuff?", or "Why do they use i4 instead of int?" As it happens, the answer is NOT "because we like to be different".

Unless you've been in the porting business for the last 20 years or so, it's likely that Ingres has seen many more C compilers than you have. Nowadays, most C compilers are pretty reliable; but if you go back 15 or 20 years, there were some very nasty ones. Some of the coding standards for Ingres grew out of a healthy fear of C compiler or library bugs.

The multiplicity of platforms that Ingres has run on has had a very similar effect on the code. In C, "int" can be thought of as an optimization, rather than a portable declaration. Fortunately, these days we mostly have LPI32 (long-pointer-int 32 bits) or LP64 (int 32 bits, long-pointer 64 bits). It turns out to be a lot more portable, though, and actually easier in my opinion to code, when you use typedefs that say exactly what you mean. "I need a 4-byte integer here."

For quite some time in the RTI days, a primary (if not THE primary) development platform for Ingres was VAX/VMS. The VAX C compiler had some interesting peculiarities, and that's why we have GLOBALDEF and some of the other funny looking definitions. VMS wasn't Unix, and at the time did not have a 100% unix-like C library, and that's why you see STcopy instead of strcpy. (By the way, many of the lower level things like STcopy have been #defined to the real C library routines, on most platforms.) Not all of the Ingres idioms came from VMS. Some of the early Unix compilers and libraries were strange enough in their own way. (Eg, on a few, structure assignment either didn't work or worked poorly, hence STRUCT_ASSIGN_MACRO.)

We are slowly Ingres away from the bad old days of by-hand portability definitions, and taking advantage of C standards. For instance, the READONLY declaration can now be replaced by const, and VOID with void. We're slowly moving to the proper use of void * as a generic pointer, instead of a char *. And so on. This sort of work is being done as opportunity allows, and it would be great to have someone make one pass through and get it all.

One other side effect of the old VMS development environment seems to have been a reluctance on the part of the original developers to make mass changes. (The reason for this was explained to me as being partly due to days-long build times on a VAX 780.) In some places you will see a page or two of code just to translate A to B, instead of changing A to B everywhere. Eventually, we would like to get that sort of thing fixed.

The hardware available 15 years ago has left its mark on the Ingres architecture to this day. Memory was scarce, and sometimes you will see many lines of code just to save a few bytes or a memory allocation. (Oddly enough, you will also sometimes see the same object copied 3 or 4 times from facility to facility, instead of layering the data structure itself so that it can be passed along nicely.) Another historical mark is that in the past, architectures were single-CPU architectures with strict memory ordering. Ingres doesn't really understand today's more relaxed memory architectures. We pretty much mostly get away with it, but ideally Ingres would have a proper atomic-access data type and atomic operations.


QUEL

Originally, Ingres did not support SQL; it supported QUEL, a much nicer and more regular language. Ingres version 5 had an internal SQL-to-QUEL translator. The code you are looking at has always supported SQL natively, but QUEL remains. I have not heard of any plans to eliminate QUEL, although it has not been enhanced much.

QUEL, and QUEL aggregates in particular, impose design requirements on the parser, optimizer, and query execution engine that go beyond what SQL needs. If you plan on doing any serious work in any of those facilities, you should gain at least some acquaintance of QUEL. And if you plan on doing anything serious in the optimizer, you MUST have some understanding of QUEL and its aggregates; otherwise some of the things it does don't make any sense. For example, Ingres tries to rewrite a NOT IN/NOT EXISTS into an outer join, but when it can't do that, it rewrites it into a QUEL-like ANY aggregate expression.

I'll now go through the main back-end facilities and discuss each one briefly. I will discuss a very little about the internal organization of each facility, and give my opinions on what is easy and obvious vs subtle and delicate. I'll mention a few of my thoughts about improving the general architecture and code.

(Unfortunately, following the principle that good news is no news, I don't have a whole lot to say about the good things! My comments are not meant to denigrate the code wholesale; there are plenty of nice parts.)

System Control Facility (SCF)

SCF is the main program, driver, and dispatcher for the back-end. The most prominent feature is the Sequencer, which is the code that steers a query through the various facilities for parsing, optimization, and execution. SCF works closely with the CLF subsystem "CS" (control system) to do thread (session) management.

The parts of SCF are:

  • sc0 – Low level utilities, such as memory management
  • sca – An interface for user defined data type definition
  • scc – Communications services. scc interfaces with the GCF (general communications facility), which I won't talk about (mostly because I don't know much about it!)
  • scd – Backend main entry point, session startup and shutdown, and server configuration options
  • sce – Dbevent handling. Dbevents were once known as alerters, which is a good description of their function.
  • scf – SCF facility call interface (scf_call)
  • scs – The sequencer, and sequencer-related routines: internal session dispatchers, session privilege setting, etc. There are a couple of special-purpose sequencers, one for Star (the distributed server) and one for RAAT (a deprecated row-at-a-time interface).
  • scu – Information fetchers for various "get session information" requests

While SCF might sound like a logical place to start reading Ingres code, it isn't really. You will do better to have an overall understanding of "the life of a query", and you can assume that control magically goes to the right place at the right time. (Indeed, I would suggest starting casual reading with QEF, or PSF if your inclinations lie with parsing.)

The Sequencer itself in particular is a nightmarishly long, complex, and delicate routine that goes on and on and on and on, and knows way too much about special cases to suit me. I believe that at one point there was a project to redesign the Sequencer, but as far as I know any result of that work was lost. A Sequencer rewrite (e.g. to table drive it, and break it up into rationally sized pieces) would be a Good Thing, but is not for the faint of heart. Right now, it ain't really broke, and nobody is rushing to fix it.

By the way, some of the peculiarities of SCF in general and Sequencer in particular are caused by the multiple ways in which queries can arrive at the back-end. While everything uses the GCA protocol, the specifics vary considerably. Embedded SQL programs use "libq" and send query parts in one way, while API-based front-ends (eg ODBC) may send the same information in some other specific order. I am unaware of any comprehensive document that lists all of the variations, and no Sequencer rewrite should be attempted without that sort of information.

General flow. Things get started in the scd routines, in scdmain.c. scdinit.c contains server initialization and shutdown controllers. New threads call scb_alloc_scb in scdmain.c to create a Session Control Block, and go to scsinit.c (scs_initiate) to actually get initialized and running. Some of these calls, eg the one to scb_alloc_scb, are actually done via function pointers handed to CS (in the CL) at server startup; if you grep for direct calls you won't find them. For normal user threads, the main loop in CS calls scs_sequencer to process user input (queries), guide the query through execution, and return the results.

The sequencer calls the parser and if necessary the optimizer. These work pretty much like straight routine calls. Then it will call QEF for query execution. The call to QEF is more of a co-routine relationship, because QEF may well decide that it needs additional parsing done. (Of a DB procedure, say.) QEF does not call sequencer as a subroutine, it returns to the sequencer with various flags set. The sequencer uses the flags to decide what to do (parse and optimize the DB procedure), and then calls QEF again. In essence this is a resume back to QEF, to continue query execution. A similar situation exists when QEF is returning rows to the client: QEF will fill a buffer with some rows, and then return to the sequencer. The sequencer sends the rows out to the client and then calls QEF again to get more rows. There are a few different specific scenarios for returning rows, which means more flags and indicators.

Regular DML goes through PSF, OPF, and then QEF. Some queries, mostly DDL and the SET-type statements, are parsed through PSF and then sent to PSY (a subfacility of PSF) for execution. OPF does not get involved, and QEF is called indirectly via PSY. A few queries, MODIFY and CREATE INDEX and a few others, are parsed with PSF and then sent directly to special QEU routines in QEF for execution. This means that the parser is generating a variety of output depending on the query: query plans for DML, PSY and QEU cb's for the PSY stuff, QEU and DMU cb's for the direct-to-QEF queries. Fortunately, the sequencer doesn't care much, because whatever the parser generates is placed into QSF. The sequencer just has to pass the QSF handle around. This is an oversimplification (unfortunately) but a decent first approximation.

Some sessions don't use the regular Sequencer. "Monitor" sessions are controlled by scsmntr.c. RAAT sessions go through scsraat.c. Star sessions are sequenced by scsstar.c. Internal DMF threads are sent off to DMF early in life, and never come back to SCF. Factotum threads are initialized by scs_initiate and then are dispatched off to the caller specified factotum handler. When (if) a DMF or factotum thread returns to SCF, they are terminated. (Many DMF threads stay in DMF for the life of the server, once started.)

There's lots more, including interrupts and DBevents, but this will serve as a starting point.

Cursors. Cursor operations look a little different from regular SQL statements. In one sense, the entire cursor lifetime is a "statement"; Ingres has to retain query state for successive cursor fetches and updates. In another sense, each individual fetch or update arrives as a "statement". (The ESQL or API client pre-digests simple cursor fetch and update statements, so that they don't actually have to be parsed.) There are cursor control blocks in SCF and elsewhere to track cursors globally throughout their lifetime.


The Parser (PSF)

The parser's job is to parse a query and produce some suitable data structure, or an error. The result data structure is usually a query tree, but in the case of DDL or control statements may be some other kind of control block. Oddly enough, the parser facility also contains routines to execute (!) some DDL and control statements.

  • psf – The main parser call interface(s)
  • psl – This is where the actual parsing happens. There are a pair of YACC grammars, one for QUEL and one (much larger!) for SQL. Each has its own token scanner, but semantic routines are generally shared between QUEL and SQL.
  • psq – Support utilities for query parsing, including some session and parser startup and shutdown utilities.
  • pst – Parse-tree utilities for creating and doing various things to parse tree nodes.
  • psy – query rewrite and DDL execution. There are two things here. The query rewrite routines do view and permit processing, merging view and QUEL permit clauses with the user query. The execution routines are a bit of an oddity. Once an SQL or QUEL statement is parsed, the Sequencer might send it to one of three places: to the optimizer (for normal DML), directly to query execution (for some physical-DDL things like CREATE INDEX and MODIFY), or to PSY. Much of the DDL outside of CREATE TABLE arrives at PSY, which will in turn call out to query execution to do the real work.

The parser in general is not terribly difficult or delicate. A novice's biggest problem with the parser will likely be that there is no one, comprehensive document that diagrams exactly what parse trees result from various queries. (psfparse.h in back/hdr/hdr comes closest.) The node definitions in the header files are clear enough, but the nodes can be hooked together in unexpected ways. (For instance, various ON clauses in the FROM clause are sort of all smooshed together, with an ID number showing which bits belong with which clauses. N-operand functions are built differently from 1- or 2-operand functions. And so on.)

If one were to have such a tree diagram document, I rather suspect that one would discover that the parser output is sometimes more complex and arbitrary than necessary. For instance, CREATE TABLE parsing produces a "DMU_CB", while CREATE TABLE AS SELECT parsing produces something entirely different. It would be nice to clean some of this up.

A project to add a readable parse tree display would be very useful. There's some dumping facilities now, but they are limited, stale, and not very readable. PS145 isn't too bad as long as the tree is not too complex, although it omits the header. (Actually, it doesn't, but manages to output the header trace to the wrong place, ie the DBMS log!)

When reading the parser code, a certain amount of what you read will be couched in QUEL-like terminology. A "resdom" is a result column, a "domain" is a column, and a "variable" or "range variable" is a table (or more accurately, a relvar).

Call interface and control blocks. PSF does not quite follow the usual model of having one xxx_call facility interface: it has two. Queries are parsed with psq_call, and (some) queries are executed via psy_call. PSF also is a bit more tricksy with its control blocks than other facilities. There is a PSQ_CB which lives with the query, and is essentially the parser's equivalent of a Request Control Block (except that it tends to live along with the query, not just the parser request). There is also a PSS_SESS_CB which contains the session specific data. The PSQ_CB is exposed globally to the back-end; in particular, the sequencer pokes around in it from time to time. The PSS_SESS_CB is private to the parser.

The PSY (qrymod) interface uses a PSY_CB request block instead of a PSQ_CB.

Grammars and lex. Both grammars (QUEL and SQL) are yacc grammars. A private yacc ("byacc") is used, but this is mostly to generate re-entrant code and maybe some Ingres specific header stuff. Both lexes (token scanners) are hand coded. A massive amount of parser state is maintained, in a PSS_YYVARS structure.

The QUEL grammar is reasonably clean and regular. The SQL grammar is not nearly as successful, and certainly much of that is SQL's fault. YACC parsers require a reserved-word language, and SQL really wants to be a keyword language. (Ingres deals with this by an error recovery trick that sends RW tokens back around as identifier tokens, which sort of mostly works.) The SQL token scanner will return many double keywords as single "reserved word" tokens to try to help out. Watch out for the WITH PARTITION= clause in CREATE TABLE and MODIFY: in an attempt to simplify parsing and reduce the number of keywords, the token scanner recognizes a completely different (and much reduced) set of keywords in the scope of the PARTITION definition. A keyword oriented (instead of reserved-word oriented) top-down parser would probably suit SQL better. A top-down SQL parser rewrite would be a fun, if major, project!

While the grammars are separate, QUEL and SQL share most if not all of the semantic action subroutines for specific statements, especially DDL. (The DML is too different to share much of anything, it would seem.)

Parser result. As noted earlier, the result of parsing a query can be a variety of things, depending on what was parsed. The major players are a query tree (see psfparse.h); or a PSY_CB (ditto); or a QEU_CB (see qeuqcb.h); or a DMU_CB (see dmucb.h). Notice that CREATE TABLE and CREATE TABLE AS SELECT produce radically different things describing the result table; the former produces a DMU_CB, the latter produces a decorated SELECT query tree.

The parser result is always placed into a QSF memory stream. The parser has its own PSF memory pool, but it's used mostly for occasional working storage, and cursor stuff. Parsing result control blocks are generated right into the QSF stream.

CREATE SCHEMA. If you are doing anything with DDL parsing, watch out for CREATE SCHEMA. This statement encloses a series of DDL statements and essentially turns off all semantic actions for the enclosed statements. Only those productions declared with %nobypass actually have their semantic actions executed. The enclosed statements have their query text preserved, using a tricky text-stream mechanism and some help from the token scanner. The result of the CREATE SCHEMA statement is a bunch of "execute immediate" query tree nodes pointing to the preserved DDL statement text.

Grants/permits. QUEL alert: the code that does much of the parse-time processing of grants and privileges (psypermit.c) will seem awfully confusing if all you know is SQL. Much of the code is concerned with QUEL "integrities". These are not ANSI integrity constraints; rather, they are a type of grant-like permit that allows a WHERE clause. QUEL integrity permits do not presently apply to SQL statements, although that seems to be an arbitrary coding decision that could easily be reversed. (SQL grants do apply to QUEL statements.)

Errors. The parser is full of error calls that look like: psf_error(1234, ...) where 1234 is some sort of random looking number. To translate the error, convert the number 1234 to hex, and look up the corresponding UShhhh error code in common/hdr/erusf.msg. (Newer code uses an E_USnnnn mnemonic instead of a bare random number, and this is much preferred.) User syntax errors should be E_US errors rather than E_PS errors; E_PS errors should be reserved for internal errors or psy execution-time errors.

The parser depends a little too much on generic error messages, which don't show enough context and aren't as useful as they could be. Improving syntax error messages would be a useful project.



The Optimizer Facility (OPF)

The optimizer basically has three phases: rewrite, enumeration, and code generation; sometimes referred to as opa, opf (or opn), and opc. Rewrite figures out the aggregate structure of the query and sets up independently optimizable pieces, or subqueries. The preparation for enumeration normalizes the WHERE clause into CNF (conjunctive normal form, AND's of OR's), and does a lot of relabeling. Enumeration chooses the best cost tree, or CO tree, for each subquery; the CO tree is what users see as the QEP. And finally, code generation takes the CO tree and subquery structure, and produces a query plan (QP) that QEF can execute.

OPF is difficult, subtle, and not to be lightly messed with. The somewhat peculiar form of the parse tree doesn't help, especially when outer joins are around. Some of the enumeration and rewrite heuristics depend on specific details of the parse tree layout (and various bitmaps) at any given point, and changing one heuristic can cause a later one to fall over.

The compilation part, OPC, is actually pretty straightforward, as long as you understand what sort of output you need. Once again, there is no comprehensive document that describes the various query plan (QP) nodes and their variations; most of what you need is in qefact.h and qefnode.h. However, in my opinion, the QP tree details are simpler and more obvious than the parse tree details.

The subdirectory structure of OPF is a little different than the other facilities; it is subdivided into more pieces. The top level "opf_call" routines are actually in ops (don't look for an opf subdirectory). The opq subdirectory is not part of the back-end; it's for the optimizedb and statdump programs (the histogram statistics generator, the equivalent to UPDATE STATISTICS in other DBMS's).

  • opa – Aggregate processor and query rewrite. The initial phase of the optimizer.
  • opb – Boolean factor extraction
  • opc – Code generation (the OPC phase)
  • opd – Special routines for Ingres/Star query optimization
  • ope – Equivalence class detection
  • oph – Histogram processing and estimation
  • opj – "joinop", enumeration driver and pre/post processing
  • opl – Outer join routines (somewhat miscellaneous)
  • opn – Enumeration and cost tree evaluation
  • opo – "Ordering" processors, manipulations of sort orderings
  • opq – optimizedb and statdump, not part of the backend
  • ops – Facility entry point, optimizer's "sequencer"
  • opt – Trace point and debug printers
  • opu – Utilities
  • opv – "Variable", table info extraction from query tree, index selection, etc
  • opx – Error handling
  • opz – Joinop attribute handling

In addition to the terms we've already met from PSF, you'll see "eqclass" or equivalence class: a set of columns that are known to be equal thanks to joins or WHERE predicates. A "joinop attribute" is just a column or expression, relabeled for the optimizer globally instead of reusing numbers table by table. (So that the optimizer can just look at a joinop att number instead of a (table,column) pair.) Sometimes attributes aren't columns; a "fattr" or function attribute is an expression. All of this extra nomenclature makes reading the optimizer reminiscent of the sci-fi story "The Answerer" – to ask a question, you already have to know an awful lot of the answer.

General flow. The main entry point opf_call is in the ops subdirectory. The controlling routine is ops_sequencer in opsseq.c. Some DDL queries bypass most of OPF and go directly to code generation in OPC. Normal DML statements are headed by a PST_PROCEDURE node, and so go through ops_dbprocedure and then ops_statement. (ops_dbprocedure is misleading when you see it on the call stack, it doesn't imply that a real DB procedure is being processed.)

There are lots of special cases, but ordinary DML-ish things first go to opa_aggregate. This phase breaks the query tree into a collection of trees based on aggregate processing, and does a number of query transformations. OPA is almost entirely QUEL-ish, and can be extremely mysterious! Next, opj_joinop is called, which decides on a query plan shape for each subtree.

opj_joinop first prepares various data structures, normalizes the query tree into CNF, and extracts lots of info from the query tree. (variables, joinables, joinop attributes, eqclasses, boolean factors, etc.) Then, joinop calls one of two different enumeration strategies, old or new. (The new strategy is called the "greedy" enumeration.) The classic Ingres enumeration generates all possible plan shapes, trimmed by heuristics, and costs them all out. This is an exponential (indeed, factorial) type of algorithm, so there is a timeout that kicks in if a "good enough" plan is found. The classic enumeration works fine for up to maybe 7-9 tables (including indexes) in a query. Beyond that, timeout is probable, and the quality of plan found (and the length of the search) is very sensitive to available cost statistics, heuristics, and luck. The greedy enumeration kicks in when there are enough tables (about 10) in a subquery. Greedy enumeration searches all combinations of 3 tables for the best combination and best plan shape for those 3. The resulting subtree is substituted as a "table" for the originals, and the process repeated until the a final plan is found, or timeout occurs.

After enumeration, opj does some post-processing, partly to finalize join choices, and partly to regenerate some stuff (multicolumn sort orderings) that got thrown away when the temporary enumeration memory stream was closed. (!)

Once a plan outline (cost tree) is decided on, opc_querycomp is called to generate the actual "executable" query plan (QP). (By the way, the display that Ingres users know as a Query Execution Plan, QEP, corresponds more closely to the cost-tree than to the QP.) The QP is built into a QSF memory stream and consists of three main pieces: a QEF node tree (or set of trees) that reflects the query subtrees; a bunch of ADF compiled expressions (CX's) to execute the actual query clauses (testing, arithmetic, that sort of thing); and resource lists indicating what tables the query needs. A CX is not machine code, it's a list of interpreted bytecodes that can be executed by the CX interpreter in ADF at query runtime. (Generating machine code in a semi-portable or at least regular manner would be a fascinating and useful research problem; most nontrivial queries spend a good bit of their time interpreting CX's.)

During the entire optimization process, the original parser query tree tends to get rewritten in various ways at various times. Some of these rewrites are essential to later phases, while others (eg the not-in to outer join transform) are just optimizations. A good source of optimizer bugs is to modify a rewrite heuristic in an earlier phase, producing a subtree variant that a later phase doesn't expect or knows "can't happen". The same sort of thing happens to the bit-maps that represent equivalence classes (ie columns or expressions) that denote values available or required at different parts of the tree.

Join selection. One might imagine that join selection is done as part of plan enumeration; i.e. by treating "a KJOIN b" and "a FSMJOIN b" as two different plan shapes, generating the allowable variants, and costing it all out. It doesn't exactly work that way, partly to reduce the number of plans that need to be costed. There are various heuristics that take a proposed plan shape and decide what sort of join to attempt at each node. [more info to be supplied]

Hash joins and FSM-joins. Hash joins are a recent addition to Ingres, and follow the aforementioned model of messing with the costing phase as little as possible. A hash join is simply substituted for an FSM-join in a cost tree candidate when conditions look right (according to various heuristics). The two possibilities (hash vs fsm) are not costed out separately. This may possibly lead to non-optimal plans where a sort has to be reinserted, but simplifies the code and reduces the number of plans examined.

This is just one example of the sort of heuristics applied in the optimizer. It's simply infeasible to do a true exhaustive costing and enumeration for most non-trivial queries.

Coding. Much of the older optimizer code was written to not annoy the hardware and compilers of its day: loops counting backwards to zero, assignment expressions inside if and loop conditions, and "tight" or at least tight looking code. Read carefully.



The Query Execution Facility (QEF)

QEF is where query execution actually happens. Inside QEF, you can distinguish two kinds of query: queries with a Query Plan (QP), and queries executed on an ad-hoc basis (most DDL and such). (You could count distributed queries as a third kind, although they generally run off of a query plan.) The subdirectories inside QEF are:

  • qec – The QEF controller: facility startup and shutdown, and some utilities.
  • qed – Routines for distributed queries (Ingres/Star)
  • qee – Query execution preparation. Routines for setting up and tearing down the data structures needed for execution.
  • qef – The qef_call interface and some utility things.
  • qek – Some readonly constant definitions
  • qel – Register-as-link execution (Star/Gateway stuff)
  • qen – The query node executors: orig node, joins, sorting
  • qeq – The query action executors and plan execution drivers. I'll say more about nodes vs actions below.
  • qet – Transaction control, mostly an interface to DMF
  • qeu – "Utility", in this case meaning queries not executed via a query plan. A large part of QEU is devoted to DDL. There are also routines for doing table or catalog I/O callable from other facilities; these are marginally simpler than the equivalent DMF calls.

Speaking very roughly, the two main components of a query plan (QP) are query nodes and query actions. The distinction is not real clear-cut, but in general, actions are at the top and are executed "by hand" from the qeq driver. Nodes are below actions, and are executed in coroutine fashion. Ingres uses a streaming or "pull" model, where the top level action asks for a row, and the request percolates down through the nodes until it's satisfied. This model has the benefit of being easily parallelized, and not requiring space for an entire node result (unless the node is one that requires it, such as a sort). The downside is some extra overhead, as control runs repeatedly up and down the node tree to produce rows.

The interface with the Sequencer is again essentially a co-routine setup. QEF may have more rows to send, or it may require that a rule or procedure be re-parsed. Some of this code, especially in the face of cursors, rules (triggers) and DB procedures, seems to have "just happened" and is full of delicate looking state flags and counters.

Plan preparation is in itself no big deal; the complexity lies in the data structures produced. The QEF data structures are very complicated, possibly more so than necessary. Some of this no doubt arises from the fact that the central piece, the QP itself with all its nodes and actions, must be read-only. Any query state has to live in auxiliary data structures allocated at plan preparation time.

QEU is a whole different subject. Most of QEU is devoted to DDL, and there's a tremendous amount of hand-coded catalog accessing and updating. (This is also the area where we see most of the effects of the interrupted project to move DDL into a plan execution mode.) QEU tends to not get much attention, because there's an awful lot of it, and it all seems to work, and people have other things to do. Perhaps there's room for some good cleanup here. Or not.

General flow. Queries that actually have a query plan (QP) arrive at qeq_query for preparation and dispatch. Queries that don't have a query plan generally arrive somewhere in qeu, either directly from the Sequencer (see qeu_dbu), or via psy. (Many of the psy wrappers do an RDF_UPDATE which in turn ends up calling some qeu function.)

In some cases, the same statement can be executed in different ways depending on context. Two examples: a CREATE TABLE generates a little QP that dispatches out of qeq_query to the table creator utility, qeu_dbu. A CREATE TABLE AS SELECT generates a regular select-like QP topped by a CREATE DMU action, which qeq_query dispatches to qea_dmu, which in turn calls qeu_create_table (a subroutine of qeu_dbu)! Another example is MODIFY. A regular MODIFY goes from the sequencer directly to qeu_dbu. Our friend CREATE TABLE AS SELECT can generate a MODIFY DMU as part of its QP; once again qeq_query hands off to qea_dmu which for the MODIFY case calls qeu_dbu. It can get confusing, and the name similarity doesn't help (dmu vs dbu vs dmt and qea_ vs qeu_).

For most queries, the qeq driver first constructs a DSH (data segment header); this is what qee.c is all about. The DSH coordinates all the info needed to run an instance of the query: row buffers, lists of control blocks, query state, and so forth. Query setup will also do various size and bucket computations for hash joins / hash aggregations.

After everything is set up, qeq calls the appropriate qea_xxx action handler, which will in turn call node handlers to fetch rows.

Read-only QP. The entire query plan (QP and CX's and all) coming out of the optimizer is treated by QEF as being read-only. This leads to some complexities in qee's plan preparation, since it has to store anything specific to an execution separately. The data structures involved in an actual execution are primarily the DSH, one per execution; and QEN_STATUS entries, one per QP node. There are also row buffers to hold CX and row-fetch results, and other stuff too. If a query is executed in parallel, these structures are replicated for each parallel child so that execution can proceed independently per thread.

The reason that a QP is read-only is that a query can be "pre-compiled", or REPEATED. A REPEATED query's QP is stored in a ULH hash table. The front-end client does the equivalent of "execute query number 27"; the sequencer asks QEF if it knows of a query #27; and if the answer is Yes, most if not all of parsing and optimizing can be skipped. Multiple sessions can ask to execute the same QP simultaneously, and so QEF must isolate the execution dependent parts from the read-only parts. (If QEF does not have the query, it is parsed, optimized, and the QP stored and tagged with a client supplied tag.)

Query validation. No table control locks are taken out during query parsing and compilation. It may be that by the time the QP reaches QEF, it's been invalidated by some sort of table structure change. Query validation was originally designed to check the timestamps on all relevant tables, and send the query around to be re-parsed if necessary. Since we're checking timestamps, we might as well open (in the DMF sense) the tables while we're at it. Thus, all the tables that a query needs are opened at the start of the query.

The advent of partitioned tables and parallel query has muddied these waters quite a bit. A partitioned table master is still opened at validation time, but individual partitions are opened as needed. (Opens are non-trivial and we don't want to open partitions we might not look at.) In a parallel query environment, child threads have to re-open the tables they will work with, so that they have an independent DMF access id. (Sort of like reopening a unix file to get an independently seek-able file descriptor.)

Node executors. The qen node executors are conceptually pretty simple, but the code tends to be cluttery. That's partly because a node executor really wants to be a co-routine. The top level action calls the top node to get rows, and that node calls its children nodes, and so on down the tree until a row appears. Then the row is handed back up the tree until it reaches the top. (Conceptually, the various nodes could run independently in parallel, and that's basically what parallel query is all about.) When a node executor has a row for its caller, what it really wants to do is resume to the caller, but C doesn't have any such thing. So it has to save a bunch of state instead, and return. When it's called again for another row, the executor has to decide from its state where it left off; it can't just start over from the top. As you read (or modify) node executors, keep in mind that the routine prolog and setup will be executed once for every row returned by the node.

The addition of outer join to the node executors (post 6.4) didn't help matters. Some of the executors could probably be cleaned up and streamlined a bit in detail.

In a perfect world, the node handlers would be coded as actual coroutines, but C doesn't support any such thing. There have been occasional research projects claiming to provide coroutine support for C, and these might be worth investigation.

Node actions. The node actions are mostly pretty straightforward, since they're at the top of the heap. Where things get tricky is if we're doing an action and it turns out that we need to re-parse something (a database procedure, say). This involves setting a bunch of flags and returning to the sequencer. See the earlier discussion about the sequencer and QEF.


Data Manipulation Facility (DMF)

DMF is the largest single facility in the server. Fundamentally, it's what mainframers used to call an "access method". It gets rows and puts rows. There's so much that goes along with getting and putting rows, though, that DMF is very large.

DMF is not just part of a DBMS server. DMF exists in a standalone form as the dmfjsp "journal support program". Dmfjsp implements the checkpoint, rollforward, archiver, and fastload commands, plus a few others. (DMF running in a standalone command is always essentially single-threaded; normal server DMF is multi-threaded.)

The subdirectory pieces are:

  • dma – Routines for doing security audit calls. These are largely inactive unless C2 security auditing has been enabled.
  • dmd – Debug routines. Debugging, tracing, control block dumps. DMF is (unfortunately) unique in having decent basic support for dumping control blocks if something bad happens. It would be nice to see more of this in other facilities, especially PSF and OPF.
  • dmf – The dmf_call interface, and the various stand-alone front-ends to stand-alone DMF.
  • dml – The "logical" layer. DML is the first stop for handling most of the DMF requests. This layer does a bunch of basic setup, and disassembles requests into DMF-speak. (See also DMU.)
  • dmm – A set of specialty routines for creating core catalogs and doing physical-level version upgrades. Note that all the code for upgrading any Ingres database version back to release 6.x is still included; there may still be 6.4 databases out there to be upgraded. (For that matter, there are still a couple of release 5 sites.)
  • dmp – The "physical" layer. This is actually 3 different layers. Files named "dm2xxx" are the first stop after the logical layer, and are drivers and dispatchers. Files named "dm1xxx" contain most of the storage-structure specific code, plus details of row accessing. Files named "dm0xxx" are bottom level utilities. One in particular is dm0p.c, which contains the (entire!) buffer manager. Understanding the table and record control blocks (TCB and RCB) is crucial in dmp.
  • dmse – Sorting. The DMF sorter is a general-purpose disk-based heapsort that can run in a parallel mode.
  • dmu – Table utilities, meaning table create, drop, modify, index, alter. The "logical" layer of dmu perhaps really belongs in dml, as they are called directly by dmf_call and do the same sort of validation and disassembly that dml does. Files named "dm2uxxx" are the physical layer of table create and drop.
  • dmve – Recovery, meaning log undo and redo. Each type of transaction log record has a dmve routine, and each routine can do UNDO, REDO, or DO as appropriate. UNDO occurs when rolling back a transaction; REDO occurs when replaying a possibly done transaction (during crash recovery in fast-commit mode); and DO means replaying a transaction out of journal files.
  • dmxe – The physical layer for transaction control requests. Transaction begin, commit, and abort.
  • lg – Ingres uses a variant of the usual write-ahead logging protocol. Log records are formatted elsewhere in DMF (in dmp!dm0l.c), and are written via the lg subsystem. lg was once a stand-alone subsystem, and hence is relatively self-contained, almost like one of the regular facilities.
  • lgk – The logging and locking subsystems share some utilities, mostly initialization and parameter reading, and the shared utilities are in lgk.
  • lk – Ingres is a locking DBMS, and lk handles all the lock requests coming from elsewhere in DMF (mostly the buffer manager, but other places too). Like lg, lk was once a separate subsystem, and is relatively self-contained.

DMF, like QEF, is not terribly hard to read and work with. There are a few gotchas though. The first caution is simply the sheer size of DMF. It's not all one big blob, and the various pieces are reasonably well interfaced; but there is still a lot to grasp.

Some of the services provided by DMF are a little subtle, particularly in its record updating and locking protocols. Row level locking adds another layer of complexity, because there's a whole extra set of rules about how you can update rows and allocate space if you're row-locking. If you are down in the row level stuff in dmp, make sure you understand what you are about. As usual, there does not seem to be any available big-picture description of the locking protocols.

The buffer manager seems awfully complex, but remember that it must run in a variety of modes. In normal operation, there's a single shared page-cache; but the buffer manager can run in a multi-cache mode as well. In fact, the recovery server (which most of the time we ignore) runs in a private-cache mode. If the main server does a pass-abort (handing off a rollback failure to the recovery server), the buffer manager on both sides has to Do The Right Thing. Then there's DMCM, a clustered fast-commit mode that uses locking tricks to communicate amongst caches. And, while pages are normally shared, other DMF data is not. For instance, table control blocks are not shared, and a multi-server installation has to worry about stale data in TCB's. The locking system provides small installation-wide shared values, which are used to implement a versioning scheme to catch stale pages and TCB's.

For all its complexity, the buffer manager is pretty simplistic in its basic page management. It would be interesting to instrument a server and run it on a variety of loads, to see if better page replacement algorithms would be worth doing.

The Ingres logging algorithms are very similar to the Aries recovery system, as written up in TODS volume 17 (1992). The Aries article is quite readable, and worth reading if you plan to get intimate with the Ingres logging and recovery algorithms. Ingres differs from Aries in numerous details, but the big picture is essentially the same. (I have no idea what connection there was between Ingres and the Aries research, if any at all.)

Moving more data into a shared area, like some other DBMS's, might seem attractive. The non-shared environment cannot be ignored, though, because we support clustering.

DMF has to worry more about concurrency than most other facilities, since it's dealing with a lot more shared data. Mutexes need to be acquired in a uniform order to prevent deadlocks. Unfortunately that ordering is not usually documented. It's not too much of a problem at higher levels, but can be a bit of a trap in lg and lk. Lg funnels log-write requests from many threads into a few logwriters, and lk by definition operates on a shared resource database. In both lg and lk the mutex scheme is fine-grained and you need to be careful to avoid mutex deadlock. Someone needs to document the require mutex orderings.

The DMF memory allocation routines were revamped for 9.1.0, to improve performance under concurrent loads.

The accumulation of time is evident in a few places (for instance, it's no longer clear to anyone what the difference between a page TOSS and a page invalidate is); but in general DMF is in decent condition overall.

Major control blocks. DMF has a DM_SVCB global server control block. Each session has a DML_SCB session control block (not to be confused with the SCF SCB). There is a flock of request blocks: DMT_CB, DMR_CB, DMX_CB, DMU_CB are the usual ones (for table, row, transaction, and utility requests respectively). There is also a DMT_SHW_CB for table info "show" requests, a DMC_CB for control requests, a DMD_CB for debug requests, a DMM_CB for database maintenance requests, and a DMS_CB for sequence requests.

Each user DMF thread has a DML_SCB. If the thread is part of a transaction, it has a DML_XCB transaction control block. (More on this below, see transactions.) A thread and its parallel-query children all share an ODCB open-database control block.

Each database has a DCB, database control block. Each table has a TCB, table control block. TCB's are expensive to build, so they are cached as long as possible even if nobody is using the table at the moment. TCB's are hashed and stored using a HCB hash control block. (DMF does not use ULH hashing.) To access a table, DMF callers have to "open" it, which results in an RCB (record or row control block). So, an RCB handle is similar to a Unix file descriptor.

Transaction aware DML requires that irreversible file actions (deletes) be deferred to COMMIT. Each such operation has an XCCB attached to a list off the XCB. Short-lived temporary files are also remembered on the XCCB list (in case QEF forgets to clean them up). By extension, long-lived (session) temporary tables are also remembered in an XCCB, but that list is hung off the session's SCB.

Nearly everything is on a doubly-linked list (which Ingres calls generically a queue). In some cases (XCB, ODCB lists) there's a list when a simple pointer would suffice.

[needed: discuss buffer manager structures a little]

The locking and logging subsystems have their own control blocks, allocated from the lock/log shared memory. The logging system major control blocks are one LGD for subsystem global data; an LFB to manage the transaction log file (multiple LFB's in the cluster case only); one LDB per database attached to the logging system; one LXB per active transaction; one LBB per log buffer. The locking system major control blocks are one LKD for global data; one LLB per lock list; one LKB per lock; one RSB per locked resource; and a couple of hash tables, one for locks and one for resources.

Transactions. At its most basic, a transaction is an 8-byte ID number known to the locking and logging subsystems. These are the subsystems that really need to know the most about the scope and outcome of a transaction. In addition to the transaction ID, a DMF session receives a pair of handles, a lock ID and a log ID. These handles are used for locking and logging system calls. The session's XCB tracks the log and lock ID's, and other stuff DMF needs for either commit or abort.

In a parallel query situation, the child threads must have their own lock and log ID's. (Otherwise the lock or log subsystems could get very confused, eg thinking that a session requesting a lock was already in a wait state.) However we really want them to be part of the same transaction and not fight amongst themselves for lock/log resources. So, the child threads connect to the parent transaction. The lock/log ID's given to the child are in a sense indirect handles to the parent transaction; the lock and log transactions are shared among all query threads. That's why each child has its own XCB, to track its own lock/log ID's.

There are times when lower level DMF functions are run with no transaction context; for example, rollforwarddb. There will be no XCB around (because there's no session either), but there may be lock and log context. That's why the XCB and SCB are logical layer control blocks, not physical layer control blocks. The physical layer can do its job without XCB's and SCB's and such.

DMVE. Transaction log records can be used in one of three ways: undo, redo, or do. The routines in dmve apply transaction log records. Typically you will find one source file per log record, and that routine will dispatch internally depending on whether it's running undo, redo, or do.

Undo is transaction abort. This can take place in a normal execution context (rollback), or in an isolated context (crash recovery, rollforward). Redo is rerunning a transaction that has already executed (in full or in part), to ensure that the database state matches the transaction log state. Redo is only done by the recovery server at crash recovery time. (Thus, it's in a normal threaded-DMF context, but can assume that nothing else is happening.) "Do" is the execution of a transaction that should not have happened yet. Do can only occur when rerunning a transaction from journals during rollforward. (Thus, "do" only happens in a non-threaded DMF context, specifically rollforwarddb.)

Deferred update. Deferred update is a feature whereby updates are conceptually collected and done after all update candidates are examined. The idea is to prevent the "halloween" problem, where a row is updated, moved to a later point in the table for whatever reason, and then re-retrieved as QEF continues to run the query.

Deferred update is implemented by marking rows that have been updated by a transaction, so that they will not be retrieved again by the same transaction. This marking process is done in various ways depending on the page type. [details to be supplied someday by someone...]

Blobs. A blob is also called a large object or a peripheral object in Ingres. Blob data is stored in one or more extension tables automagically created by the server. There is (at least) one extension table per blob column; more can be created if one fills up due to eg filesize limits.

We don't want to carry around the actual blob data throughout the server, so instead a coupon is passed around. When the blob data arrives at the server, it's stored somewhere, and a coupon is generated. When we put the row containing the blob, down inside DMF, DMF ensures that the blob data is in the extension table where it belongs; if it's not, DMF moves it there. Retrieval works similarly; a row fetched out of DMF contains a coupon in place of the actual blob. At some point, either in ADF code if the blob is operated on, or in the sequencer as part of sending the result to the user, the coupon is redeemed: it's handed to DMF, which pulls the blob data out of the extension table and gives it to the caller.

A coupon consists of a permanent part and a temporary part. The permanent part identifies the table containing the blob, and the rows in that table that hold the blob. (The table might be the actual ET, or it might be a holding table.) The rows are identified via a table_key value attached to each of the ET rows. There's also a sequence number so that the blob pieces come out in the right order. The temporary part is whatever DMF needs to keep track of the blob data.

When blobs are being sent from the front-end to Ingres to be stored, if the target table is known, the coupon-izer can store the blob right into the ET where it belongs. Unfortunately this isn't usually the case! A peculiarity of Embedded SQL code generation sends the data first, and then the query. So the sequencer may have to deal with an unattached blob arriving, with no idea where it goes. Any such blob is initially stored in a temporary holding table; when the row is finally stored, DMF moves the blob rows from the temporary table to the proper ET.

See dmf/dml/dmpe.c for most of the DMF level blob handling. The couponizer and coupon redeemers are part of the sequencer in SCF. ADF uses the fexi mechanism to call into dmpe when it needs to do blob handling.

[Much, much more can be said even at a general level...DMF really needs its own document even for an introduction!]


Relational Definition Facility (RDF)

RDF is essentially a translating cache. Its job is to hold system catalog information about tables, views, and other database objects (DB procedures and the like), and cache them for other facilities. The parser and optimizer are the primary users of RDF.

RDF is a single subdirectory, rdf, with everything in one place.

RDF actually maintains three caches (more, if you include Star stuff, and depending on how you want to count it). There's the table info cache (relcache); the query tree cache (qtree cache); and the column-defaults cache. In RDF, a "qtree" is pretty much anything that is a catalog object but not a base table: views, DB procedures, permits, QUEL integrities, constraints, rules, etc. Note that some things like views can be both tables and qtrees.

The caching part of RDF caches table, qtree, and column-default-value information, and hands it out as requested. It uses the ULH utility hash table routines to implement each cache.

The catalog-interpreting part of RDF isolates the parser and optimizer from the details of the system catalogs. For base table information, DMF gets into this act as well, because RDF asks DMF for base table definitions. The result is presented pretty much (not quite) the same as DMF returns it. For other table-like objects, and especially for qtree stuff, RDF reads the catalogs more-or-less directly. The catalog rows may be presented as-is (eg protection tuples) or substantially reformatted and interpreted (partition definitions).

RDF is a bit of a mess. Its biggest problem is that it's not a reliable cache. There is no proper interface with DMF such that DMF can tell RDF when relevant catalogs have been updated (perhaps by another server), and is out of date. RDF will blithely hand out bogus information, and it's up to the callers to detect the problem by checking timestamps, etc and invalidating the stale data out of RDF. The RDF code itself is garrulous, with a lot of marginally helpful commentary and negative upside-down logic.

A second problem with RDF is that it does not cache negative findings. For an unqualified table reference "foo", the parser first looks up "user.foo" and then "dba.foo". In most production environments, "dba.foo" is the one that exists. RDF will waste a lot of time querying the system catalogs for "user.foo" since it forgets that the lookup failed the last time.

Because RDF is to some degree a system catalog interpreter, it ends up also being a system catalog "de-interpreter", ie writer. A number of catalog update functions go through RDF. (Which in turn call QEU utilities to do the actual row writing.) It's not entirely clearly drawn which updating is done by RDF, and which is done by QEF/QEU. In fact this might have been one of the areas in flux at the RTI/CA transition, although that is pure speculation on my part.

I personally think that RDF is a wrong-headed idea, and can't get interested in improvements in detail to RDF. The effort might be better spent in designing a more generic system-catalog cache that is better integrated with DMF, and has better stale-data detection.

Info blocks. When a caller asks for RDF information, an "info block" is returned. This info block belongs to RDF memory. RDF maintains a reference count and is (supposed to be) careful to not reclaim the memory as long as anyone is using it. To indicate that you are done with the RDF information that you asked for, you must do an RDF UNFIX call. Getting requests and unfixes out of sync will lead to the classic memory leak or garbled pointer crash problems.

Private vs master. Since many sessions may request the same information, RDF will try to hand out the same info block to everyone. This is the "system" or "master" info block. It turns out that when you request RDF information (particularly for a table), RDF does not simply build all known information for that table. It retrieves only what the caller asked for. If session A is the first requestor for table and attribute info on table T, an info block is built and it becomes the master info block,

Now, suppose that while A is using the info block, session B asks for table, attribute, and key information for T. You might think that B would either just update the master, or wait until A was done and update the master. In fact, neither happens. Instead, B is given a private copy of the info block containing everything that it asked for. The reason for doing this is that reading catalog information takes (shared) locks. RDF knows nothing about what sessions A or B are doing, and session B may have updated some catalog already. If B is put into a wait-state, and then session A attempts to lock the catalogs that B has locked, an (undetected) deadlock occurs. (The deadlock is undetected because the locking system only examines the lock-wait graph, it doesn't know anything about mutex waits or condition waits.)

A further effect is that if session B previously had requested limited information on T, it might have been given a pointer to the master info block. Callers are allowed to request additional information without unfixing what they have so far. So B makes a followup request to RDF, and gets a different info block address back. B has to be careful to 1) use the new info block pointer, and not the old one; and 2) unfix the new info block pointer, and not the old one.

Star. Speaking generally, a caller (parser or optimizer) wants the same kind of information about tables and views regardless of whether the table is a true table, or a view, or a Star distributed view, or a Star registration. In a Star server, some catalog information comes from the distributed catalogs in the coordinator database (CDB); other information may come from the local databases where the actual objects reside. RDF deals with making the appropriate catalogs requests for Star. Throughout RDF, you will see significant IF-blocks separating Star code from non-Star code.

Coding. RDF is not the most elegantly coded facility in the backend; it takes a little getting used to. Beware of routines where the "normal" control flow is deep into IF-statements, where one might normally expect to read special-case code. The IF nesting will sometimes end unexpectedly and you have to be careful about where control goes next. Out-of-date or inaccurate commenting seems to be more of a problem in RDF than elsewhere.

Memory allocation. RDF objects are allocated from ULM streams running from the RDF memory pool. RDF objects are associated with a ULH hash table entry; each hash table entry has its own stream. The master info block and everything it points to will come from that hash table entry stream. If a private info block is needed, a separate stream is created for that private info block and everything new that the info block points to. (The private info block may also point to existing master items, which is probably a Bad Idea.) If a private info block is unfixed, its stream is closed, freeing all the private objects. If a master info block is unfixed, nothing happens immediately. A master info block that is no longer used by anyone marks the ULH hash entry as unused, and ULH and/or RDF can reclaim that entry as needed. If the hash entry is reclaimed or invalidated, its stream is closed, destroying the info block and everything it points to.

Control blocks. RDF doesn't follow the model of having a session control block (there is an RDF server control block). There are session-start and session-end RDF functions, but they don't do much (in fact the session-end function is never used). RDF uses a couple of Request Control Blocks (there is an RDF_CB, and an RDF_RB, which go together). Privately within RDF, there is an RDF_GLOBAL request state block that persists for the life of the request.

Invalidation. RDF always (nearly always?) believes that whatever is in its cache is valid. It's up to the outside world to discover somehow that RDF contains stale information, and invalidate it. (RDF does help a little by propagating invalidation requests to other servers in the installation via an internal DBevent.) DMF does ensure that temp tables are invalidated out of RDF when they are destroyed; a temporary table is fundamentally a DMF construct, and DMF is the only facility that knows for sure when a temp table is destroyed.



Query Storage Facility (QSF)

QSF is basically a holding tank for stuff that's passed from one facility to another. It holds query text, query trees, and query plans. In other words, it's Yet Another Memory Allocator.

QSF has at least two reasons for existing. First, QSF is able to track persistent, "named" objects; these correspond to repeated queries and database procedures. Second, the facility layering rules imply that facility A shouldn't be meddling in facility B's memory pools. Since the server needs to pass query text, query trees, and query plans around, the answer is to define a neutral facility just for holding cross facility objects.

QSF is rather over-designed for its task, but it's not a large facility. I spent some time a while ago (in 2005) upgrading and revamping the comments in the header files and in qso.c, so it might be best to refer to the source for more details and information. The qsf subdirectories are:

  • qsf – The qsf_call interface
  • qso – Object storage, where most of QSF actually happens
  • qsr – A couple routines for server startup and shutdown
  • qss – A couple more routines for session startup and shutdown


Utility Facility (ULF)

This facility is a collection of utility routines for doing server error logging, memory allocation, hash tables, and a few other things. It hasn't changed much over the years, which may or may not indicate opportunity for improvement.

One area worth discussing is ULM, the utility memory piece. ULM forms the basis of memory allocation for the entire server, aside from DMF and SCF which do their own things. The idea behind ULM is rather clever, in that one has memory "pools" and "streams". A pool is an expandable memory area with its own upper size limit. (A pool need not be in one contiguous piece.) A stream is a tagged series of allocations from the pool. One opens a stream and allocates memory from it. When the stream is closed, all those allocations are returned to the pool. This is convenient, in that there's no need to track individual allocations. Oddly, there is no way to allocate just one piece of memory, short of opening a stream.

The implementation behind ULM is old and not particularly thread-efficient. The way that streams are handled is list-based and hard on the cache. There's plenty of opportunity here for doing better.

  • ulb, ulc, uls – one or two routines each (kind of silly, probably ought to be combined)
  • uld – "data": tree printer utilities, name lookups
  • ule – error / logging utility for the backend
  • ulh – utility hash table module
  • ulm – utility memory, i.e. memory pools and streams
  • ult – trace point bitvector and value utilities
  • ulx – backend level trap (signal) and exception handler


Abstract Data Facility (ADF)

ADF is not really a back-end facility; it's in common and can be called by front-end programs too. That's the first thing to keep in mind when working with ADF: you can't arbitrarily change the interface unless you worry about front-ends. Also, you need to keep in mind the ability for users to write their own datatypes (the OME, Object Management Extension). If you change the ADF interface, you might break existing OME code, or at least require users to recompile.

The main parts of ADF are:

  • adc – routines designed mostly for use by the back-end parser or optimizer facilities. Some get called by DMF as well.
  • ade – Compiled expression (CX) utilities. OPC calls the adecompile routines to compile an expression into something like bytecodes, and QEF calls the adeexecute routines to evaluate a compiled expression. ADE does some work directly, and calls adu utilities for more complex operations.
  • adf – A few miscellaneous routines related to ADF set-up. There isn't an "adf_call" interface like other facilities have.
  • adg – ADF builtin function and operator tables, and ADF startup code. There is some extra complexity here because ADF is extendible, via OME.
  • adi – Internal low-level utilities, mostly for working with data types, function instances, and operator instances. (A "function instance" is a specific instance of a function or operator, say "+", for a specific pair of operand types, say i4.)
  • adl – A utility for Unicode mapping table compilation, not part of the ADF runtime library
  • adm and adn – Character set descriptor files
  • adt – Utilities called by the back-end, mostly for doing various kinds of comparisons (compare the keys of two rows, compare a row key and a value, etc)
  • adu – This is where most of the actual expression and function semantics live. ADU contains code for string functions, date functions, coercions, and more.
  • adx – Error handling and mapping.

ADF is pretty straightforward and easy enough to read. Most changes involving adding new functionality to adu, or occasionally adc or adt. If you are defining new functions or operators, read the README file in the adg subdirectory for instructions.


Compatibility Library (CL or CLF)

The CL is a porting layer to insulate the rest of Ingres from compiler and platform variations. Some pieces of the CL are pretty simple, like a few extra string functions. (Ingres uses its own STxxx functions instead of the C-library onces, for historical reasons; but nowadays nearly all STxxx functions are #define'd to be the "real" library functions.) Other pieces of the CL have significant functionality, like the CS and DI pieces.

There's also a GL (General Library), which is usually considered together with the CL. No doubt there was once some agenda for making GL separate from CL, but I don't know what it was.

The most interesting pieces of the CL are:

  • cm – Character manipulation. Ingres uses CM functions and macros instead of normal char * operations in cases where the character string might be double-byte. Historically, Ingres could be built with or without double-byte capability, although this is no longer done. All the CM and double-byte code long predates the more recent wchar_t wide-char type.
  • cs – Control System. This is where Ingres thread control is done. Ingres threads can be internal (one OS thread, all Ingres threads at user level), or true multi-threaded. The CS code is mostly for Ingres internal threads. Unlike most of the CL, CS (and CSMT) are meant for use by the server backends (not frontends).
  • csmt – Control System, Multi-Threaded. Platforms that have decent posix-like thread support can run Ingres in an OS-threaded mode. (A server can be built OS-threaded but run in internal threads mode. So CSMT supplements CS, but does not replace it.)
  • cv – Conversions. Alpha to numeric, float to alpha, etc. Most of these are available in some form in the C library nowadays. Ingres still uses its own CV routines mostly for consistent support of the Ingres data types.
  • di – Disk utility. DI abstracts platform variations in disk I/O calls. It also provides two key things: a file descriptor stealing algorithm to support many open files per process, and pseudo-asynchronous slave I/O. FD stealing was essential when Unixes allowed only a few dozen open files per process, and is still necessary for servers supporting massive loads. I/O slaves are used with internal threads: an I/O request is sent to one of several slaves, so that the main server process does not block. (When running OS threaded, slaves are not needed; a blocking thread does not affect other threads.)
  • er – Error message lookup. Ingres error messages are kept in an external file, partly to save space and partly to allow language translation. ER does message fetching and parameter substitution.
  • ex – Exception handling. Each facility can maintain its own exception trap handler, and EX allows handlers to be added and removed.
  • fp – Floating point utilities
  • gc – GCF network communication utilities
  • handy – Miscellaneous platform specific routines
  • lo – LOcation utilities. This was an attempt to abstract away the variations between Unix, VMS, and Windows filenames. One can ask LO for a specific part of a filename string, or ask it to construct a filename with a specified path and name. Unfortunately the implementation is tricky; there are multiple pointers which look independent, but often point to a share buffer area. It's easy to make mistakes.
  • me – Low level memory allocation. If other memory utilities don't suit, the server falls back on MEreqmem and MEfree. Like much of the other memory allocation in Ingres, these have some implementation problems, and would be a juicy target for someone looking for a useful project.
  • mh – Math routines, including packed decimal arithmetic.
  • nm – Ingres "environment" variable routines, including the ingsetenv, ingprenv, and ingunset utilities for Unix / Windows. (On VMS one just uses logicals.)
  • pc – Process Call, utilities for spawning a subprocess and getting the answer.
  • si – Standard I/O. This is mostly a historical artifact, going back to the days when many stdio implementations were broken or inefficient, or even non-existent (as on VMS).
  • st – String utilities. Most of st has been ifdef'ed away into C library calls, but there are a number of utilities that aren't in the standard C library.
  • tm – Date and time utilities
  • tr – Tracing utilities. This is sort of a super-printf, normally used by debug tracers, but in fact usable by pretty much anything.
  • ut – Not used by the backend. ut is used by front-end programs to call compilers, linkers, and provide a standard command-line parameter interface.

Some of the subdirectories in GL:

  • bt – Bit vector utilities
  • cx – Cluster support. The cx utilities talk to OS supplied distributed lock managers, so that the rest of Ingres sees one interface.
  • hsh – An incremental hash function.
  • mo – "Managed Objects", code used to support the IMA feature. Routines for reading or writing data points out of structures in a table driven manner.
  • pm – Parameter Manager. Code for managing the unified Ingres configuration file. (Pretty grotty stuff.)
  • sp – An implementation of splay trees, which I've only seen used in conjunction with MO.

There is more to both libraries, but I've mentioned the highlights.

There are actually three CL's: the Windows CL, the Unix CL (covering all Unixes via header definitions and ifdef's), and the VMS CL. Often, the unix and windows implementations of a CL subdirectory use the identical code. The VMS implementations are usually separate.

Most of the CL is pretty simple, outside of CS (and DI to some extent). Although CS and SCF don't intermingle directly, they know a fair amount about either other; particularly with regard to thread states and statuses. Also, the CS and the slave code in DI are very closely interconnected.

Outside of CS and DI, the CL is mostly a big pile of unrelated routines. There is probably plenty of room for improvement in detail, but keep in mind that you are working on the underpinnings of the entire Ingres system – not just the server, but the front-ends and all the communications pieces. Work conservatively.

Other Facilities

There are a few other facilities that I have not mentioned, mostly because I have little to say about them.

RQF (Remote Query Facility) and TPF (Two-Phase commit Facility) support distributed Ingres, Ingres/Star. Unless you are doing Star work, you merely need to make sure that you don't break them. TPF in particular was coded using a very unfortunate style that embedded a sequence number in every data member and routine name, making the code all but unreadable.

SXF (Security Extensions Facility) does security auditing. SXF is largely idle if C2 security is not enabled in the configuration. It's a good idea to turn on C2 auditing every now and then to make sure it still works, but otherwise SXF doesn't get much attention.

The GWF (gateway facility) in the back-end is used for mapping DMF table or row operation requests into a foreign data source. The foreign source is a registration, rather than a true table. The registration is defined with a name and columns, like any other table. At a certain point within DMF, DMF realizes that it has a registration and not a table, and calls the gateway facility to pass the request to the external source.

GWF in a server is mostly a stub, but it does implement two internal "foreign" data sources: the SXA gateway, and the IMA gateway. The SXA gateway allows one to read SXF audit files as if they were tables. The IMA gateway maps table requests into reads or writes of named objects, which in turn are handled as reads and writes of server internal data structures (via the MO section of GL).

There are some other facilities under the "back" grouping which are used to build an ICE server: ascf, aswf, urs.

The cuf area under common has a variety of miscellaneous subroutines. The most interesting is CUT, containing the "common utility threads" implementation. CUT is a method for passing data from producers to consumers, which may be in different threads (ultimately, in different processes, although that is not yet implemented). CUT is relatively new (in terms of being actively used), and is still undergoing some refinement. So far it's been an effective tool, and deserves to replace hand-coding in more places within the server.

The dbutil grouping is not part of the back-end, but is closely related. Dbutil contains createdb, destroydb, sysmod, upgradedb, and verifydb utilities. These all work with system catalogs, which must be in the format expected by the DBMS server! So the dbutil programs have to track the back-end exactly.

Personal tools
Developing With