Motivation

The Introduction Section described some challenges in designing associative containers. This section describes the STL's solution and motivation for an alternative solution. It is organized as follows.

  1. The STL's Associative-Container Design briefly describes the STL's solution.
  2. Choice of Policies discusses possible additional policies by which to parameterize data structures.
  3. Data-Structure Genericity discusses possible problems with generic manipulation of containers based on different underlying data-structures.
  4. Mapping Semantics discusses scalability issues with the STL's non-unique-mapping associative containers.
  5. Choice of Methods discusses some reservations with the choice of methods in the STL.

The STL's Associative-Container Design

The STL (or its extensions) currently offers associative containers based on underlying red-black trees or collision-chaining hash tables. For association, containers based on trees are parameterized by a comparison functor, and containers based on hash tables are parameterized by a hash functor and an equivalence functor.

For each underlying data-structure, the STL offers four containers with different mapping semantics. A map-type uniquely maps each key to some datum, a set-type stores unique keys, a multimap-type non-uniquely maps each key to some datum, and a multiset-type non-uniquely stores keys.

Containers provide various iterator-based methods. E.g., all containers have constructors that take a pair of iterators and transactionally construct an object containing all elements in the iterators' range. Additionally, it is possible to (non-transactionally) insert a range given by iterators, or erase such a range. Other methods are implicitly range-based, e.g., it is possible to test the equivalence of two associative container objects via operator==.

Choice of Policies

In order to function efficiently in various settings, associative containers require a wide variety of policies.

For example, a hash policy instructs how to transform a key object into some non-negative integral type; e.g., a hash functor might transform "hello" into 1123002298. A hash table, though, requires transforming each key object into some non-negative integral type in some specific domain; e.g., a hash table with 128 entries might transform "hello" into position 63. The policy by which the hash value is transformed into a position within the table can dramatically affect performance.
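
The two-step transformation described above can be sketched as follows. This is a minimal illustration, not the design of any particular library: the names mod_range_hash, mask_range_hash, and position_of are hypothetical, and std::hash (C++11) stands in for the hash functor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical range-hashing policy: reduce a hash value by modulo.
// Works for any non-zero table size.
struct mod_range_hash
{
    std::size_t operator()(std::size_t hash, std::size_t table_size) const
    {
        assert(table_size != 0);
        return hash % table_size;
    }
};

// Hypothetical range-hashing policy: reduce a hash value by bit mask.
// Only valid when the table size is a power of two.
struct mask_range_hash
{
    std::size_t operator()(std::size_t hash, std::size_t table_size) const
    {
        assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
        return hash & (table_size - 1);
    }
};

// Combines a hash functor (key -> integer) with a range-hashing
// policy (integer -> position within the table).
template<class Key,
         class Hash_Fn = std::hash<Key>,
         class Range_Fn = mod_range_hash>
std::size_t position_of(const Key &r_key, std::size_t table_size,
                        Hash_Fn hash_fn = Hash_Fn(),
                        Range_Fn range_fn = Range_Fn())
{
    return range_fn(hash_fn(r_key), table_size);
}
```

Keeping the second step as a separate policy lets a table pick the cheaper mask when its size is a power of two, or the modulo when its size is arbitrary (e.g., prime), without touching the hash functor.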

Additionally, most hash-table algorithms encounter collisions. To mitigate the cost of these collisions, it sometimes is beneficial to store the hash value along with each element [clrs2001, austern01htprop]. While this improves performance for complex keys, it hampers performance for simple keys, and is best left as a policy.
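
Storing the hash value as a policy might be sketched as a compile-time switch on the entry type. The name hash_entry and its members are hypothetical and serve only to illustrate the tradeoff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical table entry, parameterized by whether the hash value
// is stored alongside the key.
template<class Key, bool Store_Hash>
struct hash_entry;

// Variant for simple keys: no extra space, keys are compared directly.
template<class Key>
struct hash_entry<Key, false>
{
    Key m_key;

    hash_entry(const Key &r_key, std::size_t) : m_key(r_key) {}

    bool matches(const Key &r_key, std::size_t) const
    {
        return m_key == r_key;
    }
};

// Variant for complex keys: the stored hash value is compared first,
// so the (possibly expensive) key comparison is usually skipped for
// colliding entries with different keys.
template<class Key>
struct hash_entry<Key, true>
{
    Key m_key;
    std::size_t m_hash;

    hash_entry(const Key &r_key, std::size_t hash)
        : m_key(r_key), m_hash(hash) {}

    bool matches(const Key &r_key, std::size_t hash) const
    {
        return m_hash == hash && m_key == r_key;
    }
};
```

For a key such as int, the second variant adds a word per entry and a redundant comparison; for a long string, it can avoid most character-by-character comparisons.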

Tree-based containers allow reasonable access while maintaining order between elements. In some cases, however, tree-based containers can be used for additional purposes. E.g., consider Figure Sets of line intervals -A, which shows an example of a tree-based set storing half-open geometric line intervals. A std::set with this structure can efficiently answer whether [20, 101) is in the set, but it cannot efficiently answer whether any interval in the set overlaps [20, 101), nor can it efficiently enumerate all intervals overlapping [20, 101). A well-known augmentation to balanced trees can support efficient answers to such questions [clrs2001]. Namely, an invariant should be maintained whereby each node also contains the maximal endpoint of any interval within its subtree, as in Figure Sets of line intervals -B. In order to maintain this invariant, though, an invariant-restoring policy is required.

Sets of line intervals.
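
The augmentation can be sketched as follows. For brevity, this sketch assumes a plain, unbalanced binary search tree keyed by the low endpoint; a real container would apply the same invariant-restoring step after each rebalancing rotation as well. All names (interval_node, restore_max, insert_interval, overlaps_any) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>

// Node storing a half-open interval [low, high), plus the maximal
// high endpoint of any interval within the node's subtree.
struct interval_node
{
    int low, high, max_high;
    std::unique_ptr<interval_node> left, right;

    interval_node(int l, int h) : low(l), high(h), max_high(h) {}
};

// The invariant-restoring step: recompute max_high from the node
// itself and from its children.
inline void restore_max(interval_node &r_nd)
{
    r_nd.max_high = r_nd.high;
    if (r_nd.left)
        r_nd.max_high = std::max(r_nd.max_high, r_nd.left->max_high);
    if (r_nd.right)
        r_nd.max_high = std::max(r_nd.max_high, r_nd.right->max_high);
}

// Unbalanced insertion ordered by low endpoint; the invariant is
// restored on every node along the insertion path.
inline void insert_interval(std::unique_ptr<interval_node> &r_root,
                            int low, int high)
{
    assert(low < high);
    if (!r_root)
    {
        r_root.reset(new interval_node(low, high));
        return;
    }
    insert_interval(low < r_root->low ? r_root->left : r_root->right,
                    low, high);
    restore_max(*r_root);
}

// Returns true if some stored interval overlaps [low, high).
// Each step descends one level, so the cost is bounded by the height.
inline bool overlaps_any(const interval_node *p_nd, int low, int high)
{
    while (p_nd != nullptr)
    {
        if (p_nd->low < high && low < p_nd->high)
            return true;
        if (p_nd->left && p_nd->left->max_high > low)
            p_nd = p_nd->left.get();
        else
            p_nd = p_nd->right.get();
    }
    return false;
}
```

The search relies on max_high to discard a whole subtree at each step, which is exactly the information a container without an invariant-restoring policy cannot keep up to date.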

Data-Structure Genericity

Consider a generic function manipulating an associative container, e.g.,

template<class Cntnr>
int some_op_sequence(Cntnr &r_cnt)
{
    ...
}

The underlying data structure affects what the function can do with the container object.

For example, if Cntnr is std::map, then the function can use std::for_each(r_cnt.find(foo), r_cnt.find(bar), foobar) in order to apply foobar to all elements between foo and bar. If Cntnr is a hash-based container, then this call's results are undefined.
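
The tree-based case might look as follows. This is a sketch under stated assumptions: apply_between and sum_data are hypothetical names, both keys are assumed to be present in the container, and the first key is assumed not to follow the second in the container's order.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>

// Applies fn to all elements from the one keyed r_first up to, but not
// including, the one keyed r_last. Meaningful only if Cntnr stores its
// elements in key order.
template<class Cntnr, class Fn>
Fn apply_between(Cntnr &r_cnt,
                 const typename Cntnr::key_type &r_first,
                 const typename Cntnr::key_type &r_last,
                 Fn fn)
{
    return std::for_each(r_cnt.find(r_first), r_cnt.find(r_last), fn);
}

// Hypothetical functor summing the data of the visited elements.
struct sum_data
{
    int total;

    sum_data() : total(0) {}

    void operator()(const std::pair<const int, int> &r_val)
    {
        total += r_val.second;
    }
};
```

Nothing in the signature of apply_between prevents instantiating it with a hash-based container; the function template itself cannot tell whether the call makes sense.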

Also, if Cntnr is tree-based, the type and object of the comparison functor can be accessed. If Cntnr is hash-based, these queries are nonsensical.

These types of problems are exacerbated when considering the wide variety of useful underlying data-structures. Figure Different underlying data structures shows different underlying data-structures (the ones currently supported in pb_assoc). A shows a collision-chaining hash-table; B shows a probing hash-table; C shows a red-black tree; D shows a splay tree; E shows a tree based on an ordered vector (the tree is implicit in the order of the elements); F shows a list-based container with update policies.

Different underlying data structures.

These underlying data structures display different behavior. For one, they can be queried for different policies. Furthermore:

  1. Containers based on C, D, and E store elements in a meaningful order; the others store elements in a meaningless (and probably time-varying) order. As a further consequence, containers based on C, D, and E can support erase operations taking an iterator and returning an iterator to the following element; the others cannot.
  2. Containers based on C, D, and E can be split and joined efficiently, while the others cannot. Containers based on C and D, furthermore, can guarantee that this is exception-free; containers based on E cannot guarantee this.
  3. Containers based on all but E can guarantee that erasing an element is exception-free; containers based on E cannot guarantee this. Containers based on all but B and E can guarantee that modifying an object of their type does not invalidate iterators or references to their elements, while containers based on B and E cannot. Containers based on C, D, and E can furthermore make a stronger guarantee, namely that modifying an object of their type does not affect the relation of iterators.

A unified tag and traits system (as used for the STL's iterators, for example) can ease generic manipulation of associative containers based on different underlying data-structures.
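
A tag and traits system of this kind might be sketched as follows, by analogy with iterator categories. The names (ordered_container_tag, unordered_container_tag, container_traits_sketch, order_preserving) are hypothetical, and std::map and std::unordered_map (C++11) stand in for tree-based and hash-based containers.

```cpp
#include <cassert>
#include <map>
#include <unordered_map>

// Hypothetical tags describing the underlying data-structure.
struct ordered_container_tag {};
struct unordered_container_tag {};

// Hypothetical traits class mapping a container type to its tag.
template<class Cntnr>
struct container_traits_sketch;

template<class K, class D, class C, class A>
struct container_traits_sketch<std::map<K, D, C, A> >
{
    typedef ordered_container_tag category;
};

template<class K, class D, class H, class E, class A>
struct container_traits_sketch<std::unordered_map<K, D, H, E, A> >
{
    typedef unordered_container_tag category;
};

// Tag dispatch: the overload is selected at compile time.
inline bool order_preserving_imp(ordered_container_tag) { return true; }
inline bool order_preserving_imp(unordered_container_tag) { return false; }

// Generic code queries the traits instead of naming a container type.
template<class Cntnr>
bool order_preserving(const Cntnr &)
{
    typedef typename container_traits_sketch<Cntnr>::category category;
    return order_preserving_imp(category());
}
```

A generic function can dispatch on the category in the same way to choose, say, a range-based algorithm for ordered containers and an element-by-element one for the others.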

Mapping Semantics

In some cases, map and set semantics are inappropriate. E.g., consider an application monitoring user activity. Such an application might be designed to track a user, the machine(s) to which the user is logged in, the application(s) the user is running on the machine, and the start time of the application. In this case, since a user might run more than a single application, there can be no unique mapping from a user to a specific datum.

The STL's non-unique mapping containers (e.g., std::multimap and std::multiset) can be used in this case. These types of containers can store two or more equivalent, non-identical keys [kleft00sets]. Figure Non-unique mapping containers in the STL's design shows possible structures of STL tree-based and hash-based multisets, respectively; in this figure, equivalent-key nodes share the same shading.

Non-unique mapping containers in the STL's design.

This design has several advantages. Foremost, it allows maps and multimaps, and sets and multisets, to share the same value_type, easing generic manipulation of containers with different mapping semantics.

Conversely, this design has possible scalability drawbacks, due to an implicit "embedding" of linked lists. Figure Embedded lists in STL multimaps -A shows a tree with shaded nodes sharing equivalent keys; Figure Embedded lists in STL multimaps -A explicitly shows the linked lists implicit in Figure Non-unique mapping containers in the STL's design. The drawbacks are the following.

  1. As mentioned before, there are several underlying data-structures, each with its set of tradeoffs. The STL's design uses an associative linked-list to store all elements with an equivalent primary key (e.g., users). Searching for a secondary key (e.g., a process) is inherently linear. While this works reasonably well when the number of distinct secondary keys is small, it does not scale well.
  2. Embedding linked lists can cause the entire structure to be inefficient. E.g., Figure Effect of embedded lists in STL multimaps -A shows a tree with several shaded nodes containing equivalent keys; note how unbalanced this tree would seem when considering all shaded nodes to be a single node. Figure Effect of embedded lists in STL multimaps -B shows a hash table with several shaded nodes containing equivalent keys; note that this can lengthen the search for other nodes as well.
  3. Embedding linked lists is only possible for some data structures. Some data structures, e.g., probing-hash tables, linear hash tables, and extendible hash tables, cannot support it.
  4. The embedded linked list design forgoes the ability to treat all elements with the same primary key as a single entity. The ability to efficiently insert (or erase) a large number of elements with the same primary key simultaneously is lost; the ability to utilize segmented iterators is lost [austern98segmented].
  5. The linked-list design uses much space. For one, in the above example, the data identifying the user must be duplicated for each application run by the user. Furthermore, the "links" in the linked list are supplied by the underlying data structure. In the case of tree-based containers, for example, the linked list utilizes the fact that each tree node contains pointers to its parent and its children; given that the order of equivalent keys is meaningless, the number of pointers exceeds the functionality supplied by a linked list.
Embedded lists in STL multimaps.
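
The first and fourth drawbacks can be illustrated with the user-monitoring example. The sketch below contrasts a std::multimap with one possible alternative, a map from each user to a set of applications; the alternative is an assumption made for illustration only, and the names flat_log, nested_log, and runs are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Each (user, application) pair is a separate element; the user's
// name is stored once per application.
typedef std::multimap<std::string, std::string> flat_log;

// Finding the user is logarithmic, but finding the application among
// the user's elements is a linear scan over the equivalent keys.
inline bool runs(const flat_log &r_log, const std::string &r_user,
                 const std::string &r_app)
{
    std::pair<flat_log::const_iterator, flat_log::const_iterator> range =
        r_log.equal_range(r_user);
    for (flat_log::const_iterator it = range.first; it != range.second; ++it)
        if (it->second == r_app)
            return true;
    return false;
}

// Each user is stored once and maps to a container of applications,
// which can be searched, inserted, or erased as a single entity.
typedef std::map<std::string, std::set<std::string> > nested_log;

// Both the primary and the secondary lookup are logarithmic.
inline bool runs(const nested_log &r_log, const std::string &r_user,
                 const std::string &r_app)
{
    nested_log::const_iterator it = r_log.find(r_user);
    return it != r_log.end() && it->second.count(r_app) != 0;
}
```

In the nested form, the data-structure holding the secondary keys can also be chosen independently of the one holding the primary keys.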

Choice of Methods

[meyers02both] points out that a class's methods should comprise only operations which depend on the class's internal structure; other operations are best designed as external functions. Possibly, therefore, the STL's associative containers lack some useful methods, and provide some redundant methods.

  1. Possibly missing methods:
    1. It is well-known that tree-based container objects can be efficiently split or joined [clrs2001]. Externally splitting or joining trees is super-linear, and, furthermore, can throw exceptions. Split and join methods, consequently, seem good choices for tree-based container methods.
    2. Suppose all elements which match a certain criterion need to be erased from an unordered container object, e.g., all elements whose keys are in a given range. Externally erasing them from the container object is super-linear, since erasing an element might reorder all iterators. Conditional erasing, therefore, seems a good choice for associative containers.
  2. Possibly redundant methods:
    1. STL associative containers provide methods for inserting a range of elements given by a pair of iterators. At best, this can be implemented as an external function, or, even more efficiently, as a join operation (for the case of tree-based containers). Moreover, these methods seem similar to constructors taking a range given by a pair of iterators; the constructors, however, are transactional, whereas the insert methods are not; this is possibly confusing.
    2. STL associative containers provide methods for erasing a range of elements given by a pair of iterators. At best, this can be implemented as an external function, or, even more efficiently, as a (small) sequence of split and join operations (for the case of tree-based containers). Moreover, the results of erasing a range are undefined for the case of containers based on unordered data-structures.
    3. Associative containers are parameterized by policies that allow testing keys, but not data, for equivalence. When comparing two associative container objects, it is at least as reasonable to expect that they are equivalent if both keys and data are equivalent, as it is reasonable to expect that they are equivalent if their keys only are equivalent. Furthermore, in some settings it makes sense that two objects are equivalent only if they store keys in the same order, whereas in other settings order does not matter. The operators operator== and operator!= are not descriptive enough for these considerations.
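
The interface of a conditional-erase operation might be sketched as follows. This sketch is written as an external function and assumes a C++11 or later library, in which erase(iterator) returns an iterator to the following element; without such a guarantee from the underlying data-structure, the loop below is not possible, which is the argument for making the operation a method. The names conditional_erase and key_in_range are hypothetical.

```cpp
#include <cassert>
#include <map>

// Erases every element for which pred returns true, and returns the
// number of elements erased. Assumes Cntnr::erase(iterator) returns an
// iterator to the element following the erased one.
template<class Cntnr, class Pred>
typename Cntnr::size_type conditional_erase(Cntnr &r_cnt, Pred pred)
{
    typename Cntnr::size_type num_erased = 0;
    for (typename Cntnr::iterator it = r_cnt.begin(); it != r_cnt.end(); )
    {
        if (pred(*it))
        {
            it = r_cnt.erase(it);
            ++num_erased;
        }
        else
            ++it;
    }
    return num_erased;
}

// Hypothetical predicate for map-like containers: true if the
// element's key lies in the half-open range [low, high).
struct key_in_range
{
    int m_low, m_high;

    key_in_range(int low, int high) : m_low(low), m_high(high) {}

    template<class Value>
    bool operator()(const Value &r_val) const
    {
        return m_low <= r_val.first && r_val.first < m_high;
    }
};
```

The predicate decides per element, so the same operation covers key ranges, data-dependent conditions, or any other criterion without requiring the container to store elements in a meaningful order.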