Load Shedding in TelegraphCQ

When real-time query response is a priority, TelegraphCQ can shed excess data from incoming data streams. The architecture it uses to do this is called Data Triage, and its basic idea is what we call "summarize what you would drop". When the wrapper clearinghouse detects that it does not have time to process a tuple, it triages the tuple, adding it to a compact summary along with the other triaged tuples. Periodically, the wrapper clearinghouse sends these summaries to the backend, where the user can run a shadow query to reconstruct, approximately, the query answers she is missing.
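The "summarize what you would drop" idea can be illustrated with a small Python sketch. This is a toy model, not TelegraphCQ code: the class TriageQueue, its capacity bound, and the use of a plain counter as the summary are all assumptions made for illustration.

```python
from collections import deque


class TriageQueue:
    """Toy model of 'summarize what you would drop' (not TelegraphCQ code)."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical bound on buffered tuples
        self.pending = deque()        # tuples waiting for the query engine
        self.dropped_count = 0        # compact summary of the triaged tuples

    def offer(self, tup):
        """Buffer the tuple if there is room; otherwise triage it."""
        if len(self.pending) < self.capacity:
            self.pending.append(tup)
        else:
            self.dropped_count += 1   # summarize instead of silently dropping

    def end_window(self):
        """Return the summary for the window that just ended and reset it."""
        summary = self.dropped_count
        self.dropped_count = 0
        return summary


q = TriageQueue(capacity=2)
for t in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    q.offer(t)
window_summary = q.end_window()   # count of tuples triaged in this window
```

In this sketch the first two tuples are buffered and the remaining two are folded into the summary, which is then emitted once at the end of the window.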

Selecting Summary Type

For unarchived streams, you can choose what type of summary to generate when triaging tuples by appending an ON OVERLOAD clause to the CREATE STREAM statement:

                                                          +-- BLOCK
                                                          +-- DROP
CREATE STREAM [stream name] TYPE UNARCHIVED ON OVERLOAD --+
                                                          |          +-- COUNTS
                                                          |          +-- REGHIST    
                                                          +-- KEEP --+-- MYHIST
                                                                     +-- WAVELET ( '[wavelet params]' )
                                                                     +-- SAMPLE

The arguments of the ON OVERLOAD clause have the following meanings:

BLOCK means to stop reading tuples if the query engine is not consuming them fast enough.
DROP means to drop triaged tuples without constructing any summaries.
KEEP COUNTS means to keep counts of the triaged tuples.
KEEP REGHIST means to build fixed-grid multidimensional histograms of triaged tuples.
KEEP MYHIST means to build MHIST multidimensional histograms (the strange name is to avoid collisions with the name of the MHIST datatype that implements these histograms).
KEEP WAVELET means to build wavelet-based histograms.
KEEP SAMPLE means to keep a reservoir sample.

The default behavior is ON OVERLOAD BLOCK.
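To give a feel for what a fixed-grid histogram summary (as in KEEP REGHIST) records, here is a minimal Python sketch. It only illustrates the general technique; the function grid_histogram, its parameters, and the bucket layout are assumptions and do not reflect how TelegraphCQ stores its histograms.

```python
from collections import Counter


def grid_histogram(tuples, lo, hi, buckets):
    """Count tuples per cell of a fixed grid over [lo, hi) in each dimension."""
    width = (hi - lo) / buckets
    hist = Counter()
    for tup in tuples:
        # Map each coordinate to a bucket index, clamping out-of-range values.
        cell = tuple(
            max(0, min(int((v - lo) / width), buckets - 1)) for v in tup
        )
        hist[cell] += 1
    return hist


# Hypothetical two-column triaged tuples.
triaged = [(1, 1), (2, 3), (9, 9), (8, 9)]
hist = grid_histogram(triaged, lo=0, hi=10, buckets=2)
```

A counts summary (KEEP COUNTS) corresponds to the degenerate case of a single cell, where only the total number of triaged tuples survives.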

NOTE: Currently, TelegraphCQ does not perform any Data Triage on archived streams. If the data rate of an archived stream exceeds the system's capacity to consume data, the wrapper clearinghouse will block.

Summarizing "Kept" Tuples

To estimate the missing results of queries containing stream-stream joins, the wrapper clearinghouse also needs to summarize the tuples that are not triaged. These "kept" summaries are needed only when the user is running a query with stream-stream joins. The parameter load_shedding_disable_kept_summaries in postgresql.conf allows the user to disable them for a slight performance improvement.
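The reason joins need summaries of the kept tuples can be seen from a small worked example in Python. If each stream is split into kept and dropped parts, the results the query never produced include kept-with-dropped pairs, so the dropped summaries alone are not enough. The per-key counts below are hypothetical, and exact counts stand in for whatever summary type is configured.

```python
from collections import Counter


def join_size(left, right):
    """Number of equijoin result tuples, given per-key tuple counts."""
    return sum(n * right[key] for key, n in left.items())


# Hypothetical per-key counts for one window of two streams, R and S.
r_kept, r_dropped = Counter({"x": 3, "y": 1}), Counter({"x": 2})
s_kept, s_dropped = Counter({"x": 1, "y": 2}), Counter({"y": 4})

# What the main query actually produced: kept tuples joined with kept tuples.
delivered = join_size(r_kept, s_kept)

# What it missed: every combination that involves at least one dropped side.
missing = (
    join_size(r_kept, s_dropped)
    + join_size(r_dropped, s_kept)
    + join_size(r_dropped, s_dropped)
)

# Sanity check against the join over the complete streams.
total = join_size(r_kept + r_dropped, s_kept + s_dropped)
```

Two of the three missing terms involve a kept side, which is why disabling kept summaries is only safe for queries without stream-stream joins.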

The Summary Streams

For a stream with the name schema.stream, TelegraphCQ will automatically create auxiliary summary streams schema.__stream_dropped and schema.__stream_kept, representing the triaged tuples and the non-triaged tuples, respectively. For all summary types except samples, the schema of a summary stream is:
(summary [summary type], window_num integer, prev_tcqtime Timestamp, tcqtime Timestamp)
where:

summary holds the summary data structure
window_num identifies the summary time window (summaries are sent at the end of each summary window)
prev_tcqtime and tcqtime represent the time interval from which the tuples in the summary came
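The naming rule and row layout above can be restated as a short Python sketch. The helper summary_stream_names, the SummaryRow type, and the stream name public.packets are hypothetical and exist only for illustration; the column names come from the schema given above.

```python
from datetime import datetime
from typing import Any, NamedTuple


def summary_stream_names(qualified_name):
    """Return the (dropped, kept) summary stream names for 'schema.stream'."""
    schema, stream = qualified_name.split(".", 1)
    return (f"{schema}.__{stream}_dropped", f"{schema}.__{stream}_kept")


class SummaryRow(NamedTuple):
    """One tuple of a non-sample summary stream."""
    summary: Any            # the summary data structure
    window_num: int         # identifies the summary time window
    prev_tcqtime: datetime  # start of the interval the summary covers
    tcqtime: datetime       # end of the interval the summary covers


dropped, kept = summary_stream_names("public.packets")
row = SummaryRow(
    summary=42,  # stand-in value; the real type depends on the summary type
    window_num=7,
    prev_tcqtime=datetime(2005, 1, 1, 0, 0, 0),
    tcqtime=datetime(2005, 1, 1, 0, 0, 10),
)
```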

Sample Summaries

The SAMPLE summary type tells the Triage subsystem to generate fixed-size samples of the tuples it triages. The schema of the sample tuples is the same as that of the original stream, with the addition of a new column __samplemult of type real (aka float4). The __samplemult column contains the number of triaged tuples that are represented by each tuple in the sample.
NOTE: Currently, the original stream needs to contain the __samplemult field. This will change!
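As an illustration of how a fixed-size sample and its multiplier relate, the following Python sketch uses the classic reservoir sampling algorithm (Algorithm R). It is not the TelegraphCQ implementation: the function reservoir_sample is hypothetical, and it assumes that every sample tuple carries the same multiplier, namely the number of triaged tuples divided by the sample size.

```python
import random


def reservoir_sample(triaged, k, rng):
    """Algorithm R: uniform fixed-size sample of a stream of unknown length."""
    sample = []
    seen = 0
    for tup in triaged:
        seen += 1
        if len(sample) < k:
            sample.append(tup)          # fill the reservoir first
        else:
            j = rng.randrange(seen)     # uniform integer in [0, seen)
            if j < k:
                sample[j] = tup         # replace with probability k / seen
    # Assumed rule: each sample tuple stands for seen / len(sample) tuples.
    mult = seen / len(sample) if sample else 0.0
    return [{**tup, "__samplemult": mult} for tup in sample]


# Hypothetical stream of 100 triaged tuples with a single column "src".
rows = reservoir_sample(({"src": i} for i in range(100)), k=10,
                        rng=random.Random(0))
```

Under this assumption, summing __samplemult over the sample recovers the number of triaged tuples, which is what lets a shadow query scale sample-based aggregates back up.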