Load Shedding in TelegraphCQ

For situations where real-time query response is a priority, TelegraphCQ can shed excess data from incoming data streams. The architecture that TelegraphCQ uses to do this is called Data Triage. The basic idea behind Data Triage is what we call "summarize what you would drop". When the wrapper clearinghouse detects that it does not have time to process a tuple, it triages the tuple, adding it to a compact summary along with the other triaged tuples. Periodically, the wrapper clearinghouse sends these summaries to the backend, where the user can run a shadow query over them to reconstruct an approximation of the query answers that are missing.
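The "summarize what you would drop" idea can be illustrated with a small Python model. This is only a sketch under stated assumptions, not TelegraphCQ code: the TriageQueue class, its capacity parameter, and the use of per-value counts as the summary are all hypothetical stand-ins for the wrapper clearinghouse and its summary types.

```python
from collections import Counter, deque


class TriageQueue:
    """Toy model of 'summarize what you would drop' (not TelegraphCQ code)."""

    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical bound on buffered tuples
        self.buffer = deque()      # tuples that will be processed normally
        self.dropped = Counter()   # compact summary of the triaged tuples

    def offer(self, tup):
        """Buffer the tuple if there is room; otherwise fold it into the summary."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(tup)
        else:
            self.dropped[tup] += 1

    def flush(self):
        """End a window: return the kept tuples and the summary, then reset."""
        kept, summary = list(self.buffer), dict(self.dropped)
        self.buffer.clear()
        self.dropped.clear()
        return kept, summary


queue = TriageQueue(capacity=3)
for value in [1, 2, 2, 3, 3, 3]:
    queue.offer(value)
kept, summary = queue.flush()
print(kept)     # [1, 2, 2]
print(summary)  # {3: 3}
```

The point of the design is that a triaged tuple is never silently lost: it still contributes to a summary whose size does not grow with the number of triaged tuples that share a value.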

Selecting Summary Type

For unarchived streams, you can choose what type of summary to generate when triaging tuples by appending an ON OVERLOAD clause to the CREATE STREAM statement:

               

                                                          +-- BLOCK
                                                          +-- DROP
CREATE STREAM [stream name] TYPE UNARCHIVED ON OVERLOAD --+
                                                          |          +-- COUNTS
                                                          |          +-- REGHIST
                                                          +-- KEEP --+-- MYHIST
                                                                     +-- WAVELET ( '[wavelet params]' )
                                                                     +-- SAMPLE

The arguments of the ON OVERLOAD clause have the following meanings:
The default behavior is ON OVERLOAD BLOCK.

NOTE: Currently, TelegraphCQ does not perform any Data Triage on archived streams. If the data rate of an archived stream exceeds the system's capacity to consume data, the wrapper clearinghouse will block.

Summarizing "Kept" Tuples

To estimate the missing results of queries containing stream-stream joins, the wrapper clearinghouse also needs to summarize the tuples that are not triaged. Summarizing these tuples is only necessary if the user is running a query with stream-stream joins. The parameter load_shedding_disable_kept_summaries in postgresql.conf allows the user to disable these summaries for a slight performance improvement.
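The reason the kept tuples must be summarized for joins is that the missing join results include pairings of a kept tuple from one stream with a triaged tuple from the other. The Python sketch below works through that arithmetic for an equijoin. It assumes exact per-key counts as the summaries; the function name and the choice of counts are illustrative assumptions, not TelegraphCQ's implementation.

```python
from collections import Counter


def missing_join_count(kept_r, dropped_r, kept_s, dropped_s):
    """Estimate how many R-S equijoin results were lost to triage.

    Each argument is a Counter mapping a join-key value to a tuple count.
    """
    missing = 0
    for key in set(kept_r) | set(dropped_r) | set(kept_s) | set(dropped_s):
        # Results the join would have produced with no triage at all.
        total = (kept_r[key] + dropped_r[key]) * (kept_s[key] + dropped_s[key])
        # Results the join actually produced from the kept tuples.
        produced = kept_r[key] * kept_s[key]
        missing += total - produced
    return missing


kept_r, dropped_r = Counter({"a": 2}), Counter({"a": 1})
kept_s, dropped_s = Counter({"a": 3}), Counter({"b": 4})
print(missing_join_count(kept_r, dropped_r, kept_s, dropped_s))  # 3
```

In this example the three missing results all pair the one triaged R tuple with kept S tuples, so they could not be estimated from the summaries of triaged tuples alone.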

The Summary Streams

For a stream with the name schema.stream, TelegraphCQ will automatically create auxiliary summary streams schema.__stream_dropped and schema.__stream_kept, representing the triaged tuples and the non-triaged tuples, respectively. For all summary types except samples, the schema of a summary stream is:
(summary [summary type], window_num integer, prev_tcqtime Timestamp, tcqtime Timestamp)
Where:

Sample Summaries

The SAMPLE summary type tells the Triage subsystem to generate fixed-size samples of the tuples it triages. The schema of these tuples is the same as that of the original stream, with the addition of a new column __samplemult of type real (aka float4). The __samplemult column contains the number of triaged tuples that are represented by each tuple in the sample.
NOTE: Currently, the original stream needs to contain the __samplemult field. This will change!
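
To show how a per-tuple multiplier such as __samplemult can be used, the following Python sketch draws a fixed-size sample and estimates the number of triaged tuples from it. The triage_sample function and the rule that every sampled tuple receives the same multiplier (triaged count divided by sample size) are assumptions made for illustration; how TelegraphCQ actually builds its samples is not described here.

```python
import random


def triage_sample(triaged, sample_size, rng):
    """Draw a fixed-size uniform sample and attach a multiplier to each row."""
    if len(triaged) <= sample_size:
        # Every triaged tuple fits in the sample, so each represents only itself.
        return [(row, 1.0) for row in triaged]
    mult = len(triaged) / sample_size
    return [(row, mult) for row in rng.sample(triaged, sample_size)]


triaged = list(range(100))
sample = triage_sample(triaged, sample_size=10, rng=random.Random(0))

# Summing the multipliers recovers the number of triaged tuples.
estimated_count = sum(mult for _, mult in sample)
print(estimated_count)  # 100.0
```

Under the same assumption, a shadow query over a sample summary would weight each sampled tuple by its multiplier, for example by summing the multiplier column to approximate a count.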