Jennifer Widom:
~~~~~~~~~~~~~~
There is a stream query repository at http://www-db.cs.berkeley.edu/stream/sqr. It holds about 40 queries in four areas: auctions, squirrels, networks and bird nests.

Jennifer then presented the four "challenge" queries in English and in CQL. Here are the queries in English. The queries in CQL can be found at the above site.

///////////////begin queries/////////////////////////

Query 1: windowing and aggregation, self-join or subquery

Stream: Packets(pID, length, time) // time may be explicit or implicit

Generate the stream of packets whose length is greater than twice the average packet length over the last 1 hour.

--------

Query 2: windowing, substreams, stored relation

Stream: SquirrelSensors(sID, region, time) // time may be explicit or implicit
Relation: SquirrelType(sID, type)

Create an alert when more than 20 type 'A' squirrels are in Jennifer's backyard. (I've purposely underspecified whether alert() occurs once when more than 20 squirrels are first detected, or at every time step with more than 20 squirrels, or something else. Perhaps not important, or perhaps a point worth discussing.)

----------

Query 3: stream self-joins

SquirrelChirps(sID, loc, time) // time may be explicit or implicit

Stream an event each time 3 different squirrels within a pairwise distance of 5 meters from each other chirp within 10 seconds of each other.

---------

Super-Bonus Query 4: windowing, stream transformations

Packets(pID, src, dest, length, time) // time may be explicit or implicit

Create a log of flow information from a stream of packets. A flow (simple definition) from a source S to a destination D ends when no packet from S to D is seen for at least 2 minutes after the last packet from S to D. The next packet from S to D starts a new flow. The flow log contains the source, destination, count of packets, and total length of packets for each flow.
//////////////////end queries////////////////////////

Dave Meier asked whether streams are ordered by time, and whether the language should be aware of this ordering. Dennis Shasha thinks the language should be aware of it. Sirish suggested that the timestamp could be an explicit attribute that can be referred to in the Where clause. Jennifer finds that syntactically unpleasant, and it could also lead to harder implementation issues: a less restricted language is harder to build optimizers for. However, Alex Buchmann pointed out that without explicit timestamps it is hard to express things like "A came before B came before C" (this is the kind of query that SQL-TS should be able to support).

Coming back to CQL: it can only express windows that end at NOW. That is, whether the window is sliding or landmark, its later end is always at the latest timestamp. Is that the right decision? It does not allow queries that correlate current with historical data (for example: join the latest hour of data on Vehicle Identification Number with the same hour of data for every day in the last year).

Jennifer pointed out one last thing: CQL uses an application-defined notion of time.

Dennis Shasha:
~~~~~~~~~~~~~
The language defines queries over the Arrable type (== ordered data set). Their language has a clause [assuming order] that tells you if the tuples are entering the system in timestamp order. Their solution to the squirrels-in-Jennifer's-backyard problem differs from the CQL and StreaQuel ones: they keep moving counts of the squirrels whose latest readings are in Jennifer's backyard, and they check when the moving count > 20.

*Dave Meier comment*: All the languages proposed so far have the number of streams in Query 3 (the chirping one) = 3 = number of squirrels in the query. Is there a way to fix the languages so that the number of streams in the From clause is independent of the number of squirrels involved in the query?
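
To make Dave Meier's point about Query 3 concrete, here is a minimal procedural sketch in which the group size is just a parameter rather than the number of streams in a From clause. It is plain Python, not CQL, StreaQuel, or any other language presented at the meeting. The function name chirp_groups, its parameters, the encoding of loc as (x, y) coordinates in meters, time in seconds, and in-order arrival of chirps are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations
from math import dist


def chirp_groups(chirps, k=3, max_dist=5.0, max_gap=10.0):
    """Yield tuples of k chirps from distinct squirrels that are pairwise
    within max_dist meters and within max_gap seconds of each other.

    chirps: iterable of (sID, (x, y), time), assumed to arrive in
    timestamp order (an assumption of this sketch).
    """
    window = deque()  # chirps seen in the last max_gap seconds
    for chirp in chirps:
        sid, loc, t = chirp
        # Expire chirps that are too old to match the new one.
        while window and t - window[0][2] > max_gap:
            window.popleft()
        # Candidates: recent chirps from other squirrels near the new chirp.
        near = [c for c in window
                if c[0] != sid and dist(c[1], loc) <= max_dist]
        for others in combinations(near, k - 1):
            distinct = len({c[0] for c in others}) == k - 1
            close = all(dist(a[1], b[1]) <= max_dist
                        for a, b in combinations(others, 2))
            if distinct and close:
                yield others + (chirp,)
        window.append(chirp)


if __name__ == "__main__":
    sample = [
        (1, (0, 0), 0),
        (2, (3, 0), 4),
        (3, (0, 4), 8),      # completes a group with squirrels 1 and 2
        (4, (100, 100), 9),  # too far away from everyone
        (1, (0, 0), 30),     # too late to match anything
    ]
    for group in chirp_groups(sample, k=3):
        print([c[0] for c in group])
```

Since every chirp in the window is at most max_gap seconds older than the newest one, any group drawn from it is automatically within max_gap seconds pairwise. Changing k changes the group size without touching the rest of the code, which is the independence the comment asks for; whether a declarative language can offer the same is the open question.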
Stan Zdonik:
~~~~~~~~~~~
Aurora does not have an SQL-like language. Instead it has a GUI. Why did they make this choice? They claim that it is hard to optimize common subexpressions, so instead they let the users do it. In the Aurora GUI, everything is a stream; there are no relations (unlike CQL).

What goes in their GUIs? Boxes. What are the boxes?
Regular Operators: Filter, Map, Union, Join, Aggregate
New Operators: WSort, Resample (Hmm... WSort in CQL/StreaQuel??)
Finally: Add windows

Like StreaQuel and CQL, they also have the notion of window size and window hop size (range and slide in CQL speak). The timestamp of a derived tuple is the minimum of the timestamps of its inputs, but it can be changed by the application. Jennifer had a question: how do you keep things going even when you don't have new tuples arriving? Answer: use a heartbeat.

* Aurora cares about QoS, and believes it should be part of the language. For blocking, they allow you to specify a timeout.
* Aurora cares about how to deal with lost and out-of-order tuples. They believe this has to do with knowing when to close out windows (see Jennifer's comment above). For disorder, they allow you to specify a slack in the query.

Mehul wants to know if the visual query is actually easier for general users to compose than, for example, a query written in a language. Hari says it is. Some databasy person in the room said the database industry was proof that it was not. Mike Stonebraker thinks that the reason for workflow is that there is lots of signal processing in the front-end. He also believes the Aurora GUI is at a higher level than, say, CQL (Jennifer thinks it is the other way - language higher level than boxes and arrows).

Carlo Zaniolo:
~~~~~~~~~~~~~~
UDAs (user-defined aggregates) are the key. You have init, next and close methods within the UDA, and you can write arbitrary code for the three stages. The language is Turing complete. They made a conscious decision not to join streams - that is because they cannot join windows.
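
The three-stage shape of a UDA described in Carlo's talk can be pictured with a small Python sketch. This is not the syntax of his language: the class name AverageUDA, the driver run_uda, and the convention that init receives the first tuple while next receives every later one are assumptions made purely for illustration.

```python
class AverageUDA:
    """Hypothetical user-defined aggregate that computes an average."""

    def init(self, value):
        # Stage 1: called once, on the first tuple, to set up the state.
        self.total = value
        self.count = 1

    def next(self, value):
        # Stage 2: called on every later tuple to fold it into the state.
        self.total += value
        self.count += 1

    def close(self):
        # Stage 3: called when the input ends, to produce the result.
        return self.total / self.count


def run_uda(uda, values):
    """Drive a UDA over an iterable of values; returns None on empty input."""
    it = iter(values)
    try:
        first = next(it)  # built-in next(), not the UDA method
    except StopIteration:
        return None
    uda.init(first)
    for value in it:
        uda.next(value)
    return uda.close()


if __name__ == "__main__":
    # Average packet length over three hypothetical packets.
    print(run_uda(AverageUDA(), [100, 200, 600]))
```

Because each stage is arbitrary code over private state, the same skeleton could hold something like the flow log of Query 4 instead of an average; only the bodies of the three methods would change.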
Interesting question raised by someone: can we view XML documents as streams?

Franklin:
~~~~~~~~
He showed a flavor of StreaQuel, the language for TelegraphCQ. The language allows combining different notions of time, and accesses both historical data and newly arriving data. (TelegraphCQ claims you need a rich model of windows over both historical and newly arriving data, especially since they care about archiving streams and querying the archive.) The language essentially has a for-loop construct to declare a set of windows of data over which the query is to be executed. The construct captures all the kinds of windows on Mike's slide (personal note: the for-loop is used to *declare* a set of windows, the language is *not* procedural).

Mike Stonebraker's comment: There is a tradeoff between simplicity and expressiveness.
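
The idea of a loop that only declares a set of windows, with the query then evaluated once per window, can be sketched in Python as follows. This is not StreaQuel syntax. The function names, the use of integer timestamps, and the choice of closed intervals [left, right] are assumptions made for illustration.

```python
def sliding_windows(start, stop, hop, width):
    """Declare one window per step t = start, start + hop, ..., up to stop.

    Each window covers the closed interval [t - width, t].
    """
    return [(t - width, t) for t in range(start, stop + 1, hop)]


def landmark_windows(landmark, stop, hop):
    """Declare windows that all begin at a fixed landmark and grow by hop."""
    return [(landmark, t) for t in range(landmark + hop, stop + 1, hop)]


def evaluate(query, tuples, windows):
    """Run query over the values falling in each declared window.

    tuples: list of (time, value) pairs; query: function from a list of
    values to a result.
    """
    return [
        query([value for time, value in tuples if left <= time <= right])
        for left, right in windows
    ]


if __name__ == "__main__":
    data = [(1, 1), (7, 2), (18, 3), (26, 4)]  # hypothetical (time, value)
    print(evaluate(sum, data, sliding_windows(10, 30, 10, 5)))
    print(evaluate(sum, data, landmark_windows(0, 30, 10)))
```

In this sketch the loops only build lists of bounds; the query itself stays a single expression applied to each window. Since a window is just a pair of bounds, nothing forces its right edge to be the latest timestamp, so windows over archived data fit the same shape.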