Janitors todo

From Ingres Community Wiki

Karl Schendel's To-Do List

This page is a relatively unexpurgated transcription of my personal want-to-do list. I'm not sure I would have listed it under the heading "Janitors", since there are some very difficult things here. The easy things are intermixed with hard things, and the list is in *no order whatsoever*. You might find more of the simple things at the end. Or not. Not everything I'd like to do is on this list, these are mostly non-immediate things that I've noted down so that I don't forget about them. --Karl

1. sysmod does a simple "modify catalog to hash" or btree. For the hash core catalogs, such as iirelation, it would sometimes be useful to be able to specify a minpages to allow for future table creation after the sysmod.

This project would require definition of suitable syntax; -minpages=nn on the command line is the obvious way, but perhaps there is a better way (eg. allow for N extra tables?).

2. When scanning a partitioned table, qeq_part_open allocates a new DMT_CB, DMR_CB, and DB_LOC_NAME area for every new partition. This wastes memory. When orig (qenorig.c) is done with a partition, and decides to close it, it would be useful to be able to re-use the control blocks allocated by qeq_part_open.

Note that this is somewhat easier said than done, because the ULM memory allocator doesn't have an explicit free-memory call. (You can free a stream, but not specific allocations from a stream.) So some kind of free list would be needed here.

3. update_load_tab in dmf/dmp/dm2r.c updates iirelation after a load into an existing table. It could be using the saved table's iirelation tid entry in the TCB, but it doesn't. Revise update_load_tab to operate more like update_rel (in dm2t.c) and use the saved iirelation entry's tid. [Done, damain change 662]

4. In various circumstances for update and delete, the optimizer will introduce a sort at the top of the query plan solely to perform a prefetch. (One such circumstance is when the update/delete will fire a rule that may update and possibly physically move the row.) Sorts are overkill when all that is needed is a prefetch. Invent a new query plan node that reads all of its input and stuffs it into a "hold file"; and then after all the input is read, it returns rows out of the hold file. See qef/qen/qentsort.c, it should work like that without the sorting bit. Work will be needed in opf as well to compile the plan with the new node type, and possibly in DMF to load rows into DMF hold files without actually sorting them.

5. DMF memory allocation is largely uncontrolled; the only real control is the dmf_tcb_limit. There are some mechanisms for reclaim, such as reclaiming unused TCB's, but in general this is "too little too late" once DMF can't get any more memory. Invent some sort of configuration parameters to control DMF. One possibility might be a dmf_pool_guideline that when reached causes DMF to become more conservative with allocations (e.g. smaller dm1x build pages sizes), and a further dmf_pool_limit that is a hard upper limit.

(Note that a key problem with uncontrolled DMF memory is that when DMF asks for memory it belongs to DMF forevermore. If DMF sucks up all the memory such that the server can't expand its address space any more, the other facilities are terminally starved for memory even if DMF is able to reclaim control blocks successfully. It would be better to limit DMF to less than all of memory even if that causes some queries to fail.)

6. [unix] The server opens existing nonempty tables in O_SYNC mode. This ensures that file metadata is up-to-date, and eliminates the need for an fsync() on all open tables during a consistency point. On the down side, O_SYNC is slow on many unixes. It would be extremely useful to have the capability to use fsync and/or fdatasync in place of O_SYNC on platforms where the former performs well. Specific things to worry about:

  • fsync ends up being a full sync on some platforms/filesystems (e.g. ext3 on linux)
  • DI opens and closes files to share the limited file descriptor resource. You'll need to determine what your platform/filesystem does when closing a file (e.g. is it synced or not).
  • Everything has to be synced during a consistency point, meaning data and metadata both.
  • there may be other situations where a full sync is expected (bulkload??)

7. RDF: the rdu_xlock routine sometimes returns holding the object mutex and sometimes not. (Not, when rdu_private was called to make a private copy of the object.) The rdu_xlock caller sometimes expects the object mutex to be held upon return, and sometimes doesn't. It all sort of kind of mostly works, which is hardly good enough.

The project is to figure out what rdu_xlock should do, and do it in all cases, and make the callers expect the right thing in all cases.

8. RDF: RDF doesn't cache negative findings. This causes a lot of extra work, because of the way the parser asks for table info.

Given a query "select * from t", the parser first asks for user.t (user = the session user). If not found, the parser asks for dba.t, and if that fails in some cases it asks for $ingres.t.

Now, if table t is owned by the dba, this means that every query referencing t will ask RDF for user.t. RDF will have no record of it, so RDF asks DMF, which has to ask iirel_idx for user.t which doesn't exist. If RDF would cache negative results as well as positive ones, RDF could immediately report "no such thing as user.t" and then eliminate all the extra DMF calls and iirel_idx checks.

9. RDF: there's a lot of whirling around in RDF to deal with synonyms. Most databases don't define any synonyms, or if they do, there are only a few. Investigate adding a more substantial and aggressive iisynonym cache to RDF.

10. RDF: The rest of the server doesn't believe RDF very much, for good reason. For instance, if a query fails with an error, the server asks for an RDF purge and tries again! Investigate ways to more positively assure that RDF is in sync with the reality in the catalogs. (This might be tricky or impossible in a cluster environment; or, it might need a better tie-in to DMF. DMF has an accurate mechanism for tracking table invalidations even in a cluster.)

11. Invent a new COPY variation that does a nologging COPY, and either drops or truncates the table if an error occurs. E.g. COPY .. FROM .. WITH NOLOGGING, ON_ERROR={{DROP | TRUNCATE}.

12. Invent a new heap structure variation that does away with the overflow links and relies strictly on FHDR/FMAP to decide which pages are for real and which aren't. Perhaps include empty-page tracking (i.e. pages with all rows deleted are marked free in the fmap, and re-used for row-by-row inserts).

Benefits would include simpler scanning code, easier rollback of load into nonempty table, simpler extend (no page rewrites just for the stupid overflow link).

13. [unix] TMget_stamp in clf/tm_unix/tmstamp.c is wrong and bogus. Figure out a better way. Make sure that nothing using TMget_stamp actually thinks it's getting a real time stamp.

14. PSM joins (qef/qen/qenisjoin.c) restart the inner stream back to the beginning after an out-of-order row from the outer. For large inners, this can be fairly disastrous. Invent some mechanism that remembers the inner-stream key and hold-file-tid occasionally; this in-memory index could be used to position the inner stream upon out-of-order.

(A sliding distance could help to keep this index small, e.g. start out making an index entry every K rows, then when we have 16 of them, throw half out and start recording an index entry every 2K rows, etc.)

15. Memory stats! lots and lots of work to be done here. High water marks, current usage; by bytes and by # of streams; pool used vs allocated; etc etc

  • Opf remembers almost nothing about its memory usages
  • psf likewise
  • Lots of facilities use the ULM stream mechanism, maybe could put some support into ulm to help out.
  • immediate tracepoint report as well as shutdown report.
  • most interesting stats returned via IMA as well.

Don't forget DMF which doesn't use ULM, it uses dmf/dmp/dm0m.c.

16. The ULH hash table utilities keeps no statistics either. ULH is used by RDF, and by QEF to track repeat-query DSH's. Expose any stats hiding within ULH and invent new ones (# buckets, avg/highwater chain length, hash entries avg/highwater, what else?)

17. The check-only (CO) hash join assumes that the result input is the probe side, and the check-only input is the build side. This simplifies the hash join code, but there are situations (e.g. TPCH query 4) where the check-only input is the largest one. It would be extremely useful to allow the (CO) input of hash join to be either side.

A build-result (CO) hash join would operate by running the join as usual, but after producing a joined row the hash-table entry (ie the build row) would be deleted from the hash table so that it could not join again.

18. Queries are interrupted and/or retried for various reasons, and none of them are effectively tracked. Track and report:

  • retries due to parse error
  • retries due to QEF timestamp validation
  • retries due to QEF reltups/pages changed too much
  • query interruption to reparse a rule; db proc
  •  ?? anything else?

19. Bug: The read-uncommitted isolation level puts the transaction into read only mode, but also prevents declare global temporary table. It shouldn't do the latter.

20. Bug: the DMF "add database" function sets the config file open-count (done by the RCP via lgevent) and then writes an OPENDB log record. This is backwards and violates the write-ahead logging principle. A crash before the OPENDB is written causes DB to be inconsistent with open-count. [John Dennis may have logged this as a bug]

Fix the code to write OPENDB first and then update the cnf file.

21. Partitioned table create creates the partition tables one at a time. The bad part is that each file create is logged separately and causes a log force. Figure out some way to improve this, perhaps by writing all of the fcreate log records before forcing them, and then creating the table files. This will require cooperation with QEF which at the moment feeds creates to DMF one at a time instead of N at a time.

Note that creating worker threads to help with the creates might be a nice idea, but fails in the naive form because all the threads will be colliding over iirelation. It might work to ask one thread to do iirelation and other to create files; obviously this would require a lot of work to ensure that crash / abort recovery works properly.

22. Memory allocation requires a total review and rethink, although individual pieces could be successfully addressed: - too many levels: MEreqmem in the CL, SCF get-pages function in the server, ULM, and DM0M for DMF. - Not well suited to concurrency on modern hardware, especially ULM. Long cache-unfriendly chain searches; global lists; too much locking.

Investigate some of the newer thread-friendly malloc-like packages such as google's tcmalloc. Perhaps use this to replace the ME level, and build a more lightweight ULM level for server memory streams. Or, ???

23. The DBMS server has a bad habit of defining the same flag using different names that have to have the same value. (See DMU_F_xxx attribute flags vs DMT_F_xxx via ATT_xxx; or, the relstat and relstat2 flag definitions TCB_xxx in DMF vs TBL_xxx elsewhere and some have DMU_xxxx copies). Things that must have the same value should be named with exactly one name.

Similarly, the server has a bad habit of defining N data structures for the same thing that are slightly different, such as the (at least) 3 different flavors of attribute descriptor. (DMT_ATT_ENTRY, DMT_ATTR_ENTRY, and DMU_ATTR_ENTRY and there might be more in PSF or OPF.)

Fix these things so that there is one definition for one thing, as global as necessary. Note that it would be prudent to approach this in small pieces instead of trying to clean up everything at once.

24. Database close/purge in DMF (dm2d.c) holds the DCB mutex during the entire operation. This is grossly wrong, since lots of different I/O or lock waits can occur. Get rid of this long-term DCB mutex, and invent some better way of ensuring that close/purge and add/open don't race.

25. [unix, ?windows] Get rid of the absolute pathnames in the DB config file. Cache all of the ingprenv and iidbdb locations in the server, as there are rarely more than a few and even 1000's of locations are trivial memory wise. Just use the location name in the config file and do the translation in the server.

26. (trivial) the call psf_error(...,PSF_CALLERR,...) is a no-op! psf_error basically just returns when given the PSF_CALLERR flag. Get rid of psf_error calls using PSF_CALLERR, and if they are real error situations, figure out some action more constructive than no-op.

27. On-line (concurrent) sysmod to go along with concurrent modify. The only way I can think of to do this is to stall the database in the same way that ckpdb does; then run the sysmod, purge and reopen the database (or equivalently refresh the core catalogs TCB's); and then unstall the database.

Note that stalling would only be necessary for the core catalogs, iirelation iiattribute iidevices. The other catalogs ought to be modify-able using the normal locking mechanisms. The core catalogs are special because their TCB's are permanently open as long as the database is open, so the usual TCB purge mechanism used by modify fails. That's why syslog wants exclusive access.

28. The LOcreate CL routine creates a temp file using mktemp, which is usually considered a security hole. LOcreate probably needs to be eliminated and replaces with something that creates and opens the temp file in one operation. This will impact the callers of LOcreate, and in some cases the changes are nontrivial.

29. (trivial) Eliminate all occurrences of explicit testing against the constants TRUE or FALSE. E.g. instead of

 if (thing == TRUE)

just write

 if (thing)

and similarly for other boolean-like tests. Note, this should be submitted a few files at a time (say, a subdirectory at a time, or even less); otherwise the change will be too large to safely apply.

30. When an expression is normalized to CNF (conjunctive normal form, ie AND's of OR's), useless always-true or always-false clause can be introduced. For instance:

   i BETWEEN 1 AND 3 OR i BETWEEN 10 AND 20

is normalized to:

   (i >= 1 OR i >= 10) AND (i >= 1 OR i <= 20) AND
   (i <= 3 OR i >= 10) AND (i <= 3 OR i <= 20)

Numbering clauses from 1, only tests 1, 5, 6, and 8 contribute anything useful. It would be cool if CNF normalization figured this out and dropped the extra clauses.

31. In general, constant arithmetic is missing. If you write

   i < 1+2

Ingres will dutifully generate the addition at runtime. Now granted, it does this in the "virgin" segment of the CX, which means that the addition will occur only once at query startup. However, if the parser or early-optimizer stage would do the addition, it would improve the ability to use histograms and hence generate better cost estimates, and possibly better query plans.

Take care to only do this for true literal constants and not repeated-query parameters, which look like constants but aren't.

32. The COPY statement front-end code iicopy.c uses SIread, which apparently does a "locked" getc at least on Solaris. Since there shouldn't be any possibility of another thread sharing the copy file input stream, use the _unlocked variant which goes a lot faster on Solaris. Investigate whether there are similar issues on other unix-like platforms.

33. Bug: If a database is extended to the same location name twice, with different usages (once for data and once for work), the work location can't be used with the set work locations statement. Unextend seems to have similar problems. dmc_alter is only finding the first occurrence of the location name in the config structure, and it barfs if the usage isn't of the right kind. It should probably keep looking to see if it can find a usage of the proper kind.

34. There are awkwardnesses in the parse tree design. Two that might be worth looking into:

  • the PST_TREE and PST_QLEND node types are really just "end" markers. Why not just use a null tree link instead?
  • the PST_CURVAL cursor-var node type is the same as a PST_VAR, except that it's different. The differences seem to be of no moment, but they cause code duplication e.g. in opcqual.c. See if PST_CURVAL can be eliminated in favor of PST_VAR nodes.

35. It's difficult, but possible, to get a garbage VARCHAR value into a table; garbage in the sense that the length value is too long for the declared length. (One can achieve this by fastloading bad data, for instance.) If vch_compression is turned ON, which is the default, and the sequencer sees a bad varchar value, it crashes. Teach the sequencer vch-compression code to be more defensive about bad varchar values (without impacting performance too much!)


36. Clean up compiler warnings from gcc. Fix a few files at a time, or a subdirectory at most, so that any mistakes are easy to catch and fix.

37. Clean up compiler warnings from the Sun Studio compiler on SPARC. It seems to be sensitive to a different class of warnings than gcc.

38. Much of Ingres pre-dates the availability of void * pointers. For instance, the arg to MEfree is declared as a PTR, requiring callers to either cast everything, or get compiler warnings. Use void * generic pointers where appropriate, starting with ME and facility-call wrappers. Also consider using void * for generic opaque cookies in data structures, e.g. transaction-ID type things, where currently Ingres uses PTR.

39. LOAD (dm1x) of temporary tables builds directly into the DMF cache, the so-called in-memory build. It was done this way because temp tables might not materialize to a disk file. Unfortunately if a temp table IS large, the build runs slowly and poisons the DMF cache unnecessarily. Pitch the in-memory build concept; instead, if you build a temp table and it turns out that you don't have to actually write the build buffers, have an end-phase that stuffs the build buffers into DMF cache instead of forcing to disk. This will make large temp-table builds go faster, not impact small ones, and unify the code as well.

40. ABF/4GL isn't being supported much, but it's more out of date than it really needs to be. In particular, fix up the ABF grammar to accept the CASE expression at least; bonus points for adding other current SQL syntax. (Platinum points for round-filing the separate grammar entirely and unifying the various SQL grammars.)

This page was last modified 18:07, 22 April 2008. - This page has been accessed 483 times. - Disclaimers - About Ingres Community Wiki