Load Shedding in TelegraphCQ
For situations where real-time query response is a priority,
TelegraphCQ can shed excess data from incoming data streams. The
architecture that TelegraphCQ uses to do this is called Data Triage.
The basic idea behind Data Triage is what we call "summarize what you
would drop". When the wrapper clearinghouse detects that it does
not have time to process a tuple, it triages
the tuple, adding it to a compact summary
along with the other triaged tuples. Periodically, the wrapper
clearinghouse sends these summaries to the backend, where the user can
use a shadow query to
approximately reconstruct the query answers she is missing.
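
The "summarize what you would drop" idea can be illustrated with a minimal sketch. This is not TelegraphCQ code: the TriageQueue class, its capacity parameter, and the use of a full buffer as the signal for "no time to process a tuple" are all assumptions made for illustration, and the summary here is a plain per-key count.

```python
from collections import Counter, deque


class TriageQueue:
    """Hypothetical bounded buffer that summarizes the tuples it cannot hold."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()            # tuples waiting for the query engine
        self.dropped_summary = Counter()  # compact summary of triaged tuples

    def offer(self, tup, key):
        """Enqueue the tuple if there is room; otherwise triage it."""
        if len(self.pending) < self.capacity:
            self.pending.append(tup)
            return True
        # Assumption: a full buffer stands in for "no time to process".
        self.dropped_summary[key] += 1
        return False

    def flush_summary(self):
        """Hand the current summary to the backend and start a new window."""
        summary, self.dropped_summary = self.dropped_summary, Counter()
        return summary


queue = TriageQueue(capacity=2)
for station, speed in [(1, 55.0), (2, 61.5), (1, 58.2), (1, 49.9)]:
    queue.offer((station, speed), key=station)
summary = queue.flush_summary()
```

In this sketch the first two tuples reach the queue and the last two are folded into the summary, so the backend still learns how many tuples per station were lost.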
Selecting Summary Type
For unarchived streams, you can choose what type of summary to generate
when triaging tuples by appending an ON OVERLOAD
clause to the CREATE STREAM statement:
                                                          +-- BLOCK
                                                          +-- DROP
CREATE STREAM [stream name] TYPE UNARCHIVED ON OVERLOAD --+
                                                          |          +-- COUNTS
                                                          |          +-- REGHIST
                                                          +-- KEEP --+-- MYHIST
                                                                     +-- WAVELET ( '[wavelet params]' )
                                                                     +-- SAMPLE
The arguments of the ON OVERLOAD clause have the following meanings:
- BLOCK means to stop reading tuples if the query
engine is not consuming them fast enough.
- DROP means to drop triaged tuples
without constructing any summaries.
- KEEP COUNTS means to keep counts of the
triaged tuples.
- KEEP REGHIST means to build fixed-grid
multidimensional histograms of triaged tuples.
- KEEP MYHIST means to build MHIST
multidimensional histograms (the strange name is to avoid collisions
with the name of the MHIST datatype that
implements these histograms).
- KEEP WAVELET means to build
wavelet-based histograms.
- KEEP SAMPLE means to keep a reservoir
sample.
The default behavior is ON OVERLOAD BLOCK.
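
To give a feel for what a fixed-grid multidimensional histogram (the KEEP REGHIST option) records, here is a small sketch. It is an illustration of the general technique only, not TelegraphCQ's implementation; the function names, the lows and widths parameters, and the example data are hypothetical.

```python
from collections import Counter


def grid_cell(tup, lows, widths):
    """Map a numeric tuple to the index of its fixed-width grid cell."""
    return tuple(int((v - lo) // w) for v, lo, w in zip(tup, lows, widths))


def build_fixed_grid_histogram(tuples, lows, widths):
    """Count how many tuples fall into each cell of a regular grid."""
    hist = Counter()
    for tup in tuples:
        hist[grid_cell(tup, lows, widths)] += 1
    return hist


# Hypothetical two-column tuples that were triaged in one summary window.
triaged = [(3.0, 12.0), (4.5, 18.0), (11.0, 12.5), (3.9, 10.1)]
hist = build_fixed_grid_histogram(triaged, lows=(0.0, 0.0), widths=(5.0, 10.0))
```

Three of the four tuples land in cell (0, 1) and one in cell (2, 1). The histogram keeps only these per-cell counts, which is what makes the summary compact.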
NOTE: Currently, TelegraphCQ
does not perform any Data Triage on archived streams. If the data rate
of an archived stream exceeds the system's capacity to consume data,
the wrapper clearinghouse will block.
Summarizing "Kept" Tuples
To estimate the missing results of queries containing stream-stream
joins, the wrapper clearinghouse also needs to summarize the tuples
that are not triaged.
Summarizing these tuples is only
necessary if the user is running a query with stream-stream joins. The
parameter load_shedding_disable_kept_summaries
in postgresql.conf allows the user to
disable these summaries for a slight performance improvement.
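
The reason kept tuples matter for joins can be seen in a small worked sketch. For an equijoin, a dropped tuple from one stream would have joined with kept tuples from the other stream, so the missing result count includes cross terms between kept and dropped tuples. The function below is a hypothetical illustration that uses exact per-key counts as the summaries. It does not reflect how TelegraphCQ evaluates shadow queries.

```python
from collections import Counter


def missing_join_count(kept_r, dropped_r, kept_s, dropped_s):
    """Estimate how many equijoin results were lost to triage.

    Per join key, the full result has (kept_r + dropped_r) * (kept_s + dropped_s)
    rows, of which kept_r * kept_s were actually produced. The remainder is:
        kept_r * dropped_s + dropped_r * kept_s + dropped_r * dropped_s
    """
    keys = set(kept_r) | set(dropped_r) | set(kept_s) | set(dropped_s)
    missing = 0
    for k in keys:
        missing += (kept_r[k] * dropped_s[k]
                    + dropped_r[k] * kept_s[k]
                    + dropped_r[k] * dropped_s[k])
    return missing


# Hypothetical per-key counts for two streams R and S in one window.
kept_r, dropped_r = Counter({1: 2}), Counter({1: 1, 2: 1})
kept_s, dropped_s = Counter({1: 3, 2: 1}), Counter({1: 1})
missing = missing_join_count(kept_r, dropped_r, kept_s, dropped_s)
```

Two of the three terms involve a kept count, so without summaries of the kept tuples only the dropped-with-dropped term could be estimated.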
The Summary Streams
For a stream with the name schema.stream, TelegraphCQ will
automatically create auxiliary summary streams schema.__stream_dropped
and schema.__stream_kept, representing the
triaged tuples and the non-triaged tuples, respectively. For all
summary types except samples, the schema of a summary stream is:
(summary [summary type], window_num integer,
prev_tcqtime Timestamp, tcqtime Timestamp)
Where:
- summary holds the summary data structure
- window_num identifies the summary time
window (summaries are sent at the end of each summary window)
- prev_tcqtime and tcqtime
represent the time interval from which the tuples in the summary came
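
As a concrete reading of the naming rule above, the following helper derives the two summary stream names from a qualified stream name. The helper and the example stream name traffic.measurements are hypothetical, and the sketch assumes that "stream" in the pattern is replaced by the actual stream name.

```python
def summary_stream_names(qualified_name):
    """Return (dropped, kept) summary stream names for 'schema.stream'."""
    schema, stream = qualified_name.split(".", 1)
    return (f"{schema}.__{stream}_dropped", f"{schema}.__{stream}_kept")


dropped_name, kept_name = summary_stream_names("traffic.measurements")
```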
Sample Summaries
The SAMPLE summary type tells the Triage
subsystem to generate fixed-size samples of the tuples it triages. The
schema of these sample tuples is the same as that of the original stream,
with the addition of a new column __samplemult of
type real (aka float4).
The __samplemult column contains the
number of triaged tuples that are represented by each tuple in the sample.
NOTE: Currently, the original stream
needs to contain the __samplemult field. This will change!
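
The role of the __samplemult column can be sketched with a standard reservoir sampler. This is an illustration under assumptions, not the Triage subsystem's code: it assumes a uniform sample in which every sampled tuple stands for the same number of triaged tuples (tuples seen divided by sample size), and the function name and example data are hypothetical.

```python
import random


def reservoir_sample(tuples, k, rng):
    """Single-pass uniform sample of at most k tuples (Algorithm R).

    Each returned tuple gets one extra trailing value that plays the role
    of __samplemult: the number of triaged tuples it represents.
    """
    sample = []
    seen = 0
    for tup in tuples:
        seen += 1
        if len(sample) < k:
            sample.append(tup)
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < k:
                sample[j] = tup
    # Assumption: every sampled tuple carries the same multiplier.
    mult = seen / len(sample) if sample else 0.0
    return [tup + (mult,) for tup in sample]


rng = random.Random(0)
triaged = [(i, 50.0 + i) for i in range(10)]
sample = reservoir_sample(triaged, k=4, rng=rng)
estimated_count = sum(row[-1] for row in sample)
```

Summing the multiplier over the sample recovers the number of triaged tuples, which is how a query over the sample can scale its aggregates back up.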