How do you educate people on stream processing? For pipeline-like systems, stream processing is essential IMO - backpressure, circuit breakers, etc. are critical for resilient systems. Yet I have a hard time building an engineering team that can utilize stream processing; instead, teams fall back on synchronous procedures that are easier to understand (but nearly always slower and more error-prone).
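To make the backpressure point concrete, here's a minimal Rust sketch using a bounded tokio channel (the capacity, event type, and timings are placeholders, not from any particular framework). A full send buffer makes the producer wait instead of letting work pile up unboundedly:

    // Minimal backpressure sketch: a bounded channel makes a fast
    // producer wait for a slow consumer instead of buffering forever.
    use tokio::sync::mpsc;
    use tokio::time::{sleep, Duration};

    #[tokio::main]
    async fn main() {
        // Capacity 8: once 8 events are in flight, send().await waits.
        let (tx, mut rx) = mpsc::channel::<u64>(8);

        let producer = tokio::spawn(async move {
            for i in 0..100u64 {
                // This await is the backpressure point.
                tx.send(i).await.expect("consumer dropped");
            }
        });

        while let Some(event) = rx.recv().await {
            // Simulate a slow downstream sink.
            sleep(Duration::from_millis(5)).await;
            println!("processed {event}");
        }

        producer.await.unwrap();
    }

Circuit breakers follow the same spirit: fail fast at the boundary instead of queueing work that's doomed anyway.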
serial_dev 3 hours ago [-]
It's important to consider whether it's even worth it.
I worked on stream processing, and it was fun, but I also believe it was over-engineered and brittle. The customers didn't actually want real-time data; they looked at the calculated values once a week, then made decisions based on that.
Then I joined another company that somehow had the money to pay 50-100 people, and they were using CSVs, sh scripts, batch processing, and all that. It solved the clients' needs, and they didn't need to maintain a complicated architecture or code that would otherwise have been difficult to reason about.
After I left, the first company, the one with stream processing, was bought by a competitor at a fire-sale price. Some of the tech was relevant to the acquirer, but the stream processing stuff was immediately shut down. The acquiring company had just simple batch processing, and they were printing money in comparison.
If you think it's still worth going with stream processing, give your reasoning to the team; most reasonable developers will learn it if they really believe it's a significantly better solution for the given problem.
Not to over-simplify, but if you can't convince 5 out of 10 people to learn something that would make their job better, then either the people are not up to the task, or you are wrong that stream processing would make a difference.
nemothekid 1 hours ago [-]
I agree. Unless the downstream data feeds a system that makes automated decisions (e.g., HFT or ad buying), real-time analytics is rarely worth the cost. It's almost always easier and more robust to accept high tail latencies when humans are the consumers, and as computers get faster, that tail latency shrinks anyway.
Systems that needed complex streaming architectures in 2015 could probably be handled today with fast disks and a large Postgres instance (or BigQuery).
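For a flavor of what that batch alternative looks like, here's a sketch in Rust with tokio-postgres; the connection string and the "events" table are made up for the example. One query recomputes the rollup, and you just run it on a schedule instead of keeping a streaming pipeline alive:

    use tokio_postgres::NoTls;

    #[tokio::main]
    async fn main() -> Result<(), tokio_postgres::Error> {
        // Connection string and "events" table are illustrative only.
        let (client, connection) =
            tokio_postgres::connect("host=localhost user=app dbname=analytics", NoTls).await?;

        // The connection object drives the socket; run it in the background.
        tokio::spawn(async move {
            if let Err(e) = connection.await {
                eprintln!("connection error: {e}");
            }
        });

        // One batch query recomputes the whole last-hour rollup.
        let rows = client
            .query(
                "SELECT date_trunc('minute', created_at)::text AS minute, count(*) \
                 FROM events \
                 WHERE created_at > now() - interval '1 hour' \
                 GROUP BY 1 ORDER BY 1",
                &[],
            )
            .await?;

        for row in rows {
            let minute: String = row.get(0);
            let n: i64 = row.get(1);
            println!("{minute}: {n}");
        }
        Ok(())
    }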
senderista 2 hours ago [-]
Yeah, that reminds me of a startup I worked at that did real-time analytics for digital marketing campaigns. We went to all kinds of trouble to update dashboards with five-minute latency, and real-time updates made for impressive sales demos, but I don't think we had a single customer who actually needed to make business decisions within 24 hours of looking at the data.
serial_dev 2 hours ago [-]
We were doing TV ad analytics by detecting ads on TV channels and checking their web impact (among other things). The thing is, most of these ads are deals made weeks or months in advance, so customers checked the analytics about once before a renewal… so I'm not sure it needed to be near real-time…
No? Vector is for observability: it gets your metrics/logs, transforms them if needed, and puts them in the necessary backends. Transformation is optional, and is for cases like downsampling, converting formats, or adding metadata.
ArkFlow gets data from things like databases and message queues/brokers, transforms it, and puts it back into databases and message queues/brokers. Transformation looks like a pretty central use case.
Very different scenarios. It's like saying a Renault Kangoo is a simplified equivalent of a BTR-80 because both have wheels, an engine, and space for stuff.
rockwotj 3 hours ago [-]
It's a Rust port of Redpanda Connect (Benthos), but with fewer connectors: https://github.com/redpanda-data/connect
Vector is often used for observability data (in part because it's now owned by Datadog), but it's not limited to that. It's a general-purpose stateless stream processing engine and can be used for any kind of events.
sofixa 2 hours ago [-]
Vector started out with observability data only, and that's why they got bought by Datadog.
Looks interesting. How does this compare to Arroyo and vector.dev?
necubi 3 hours ago [-]
(I'm the creator of Arroyo)
I haven't dug deep into this project, so take this with a grain of salt.
ArkFlow is a "stateless" stream processor, like Vector or Benthos (now Redpanda Connect). These are great for routing data around your infrastructure while doing simple, stateless transformations on it. They tend to be easy to run and scale, and are programmed by manually constructing the graph of operations.
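As a toy illustration of "stateless" (not ArkFlow's actual API, just the shape of the model): every event is handled on its own, so the operator keeps no memory between events and you can scale by running more copies:

    // Toy stateless pipeline: each event is transformed independently,
    // so no operator needs memory of past events.
    use futures::stream::{self, StreamExt};

    #[tokio::main]
    async fn main() {
        let events = stream::iter(vec!["5", "oops", "12", "3"]);

        let processed: Vec<u64> = events
            .filter_map(|raw| async move { raw.parse::<u64>().ok() }) // drop malformed input
            .map(|n| n * 2)                                           // per-event transform
            .filter(|n| futures::future::ready(*n > 8))               // per-event predicate
            .collect()
            .await;

        assert_eq!(processed, vec![10, 24]);
    }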
Arroyo (like Flink or RisingWave) is a "stateful" stream processor, which means it supports operations like windowed aggregations, joins, and incremental SQL view maintenance. Arroyo is programmed declaratively via SQL, which is automatically planned into a dataflow (graph) representation. The tradeoff is that state is hard to manage, and these systems are much harder to operate and scale (although we've done a lot of work with Arroyo to mitigate this!).
I wrote about the difference at length here: https://www.arroyo.dev/blog/stateful-stream-processing
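To see where the state comes from, here's a back-of-the-napkin tumbling-window count in plain Rust (timestamps and keys invented for the example). The HashMap is exactly the state a real engine has to checkpoint, restore, and repartition when scaling:

    // Why "stateful" is harder: a tumbling-window count must remember
    // partial results between events. Toy sketch only.
    use std::collections::HashMap;

    const WINDOW_SECS: u64 = 60;

    fn main() {
        // (event_time_secs, user_id) pairs, as if read off a stream.
        let events = [(5, "a"), (42, "b"), (61, "a"), (65, "a"), (130, "b")];

        // State the operator must keep: window start -> per-user counts.
        let mut windows: HashMap<u64, HashMap<&str, u64>> = HashMap::new();

        for (ts, user) in events {
            let window_start = (ts / WINDOW_SECS) * WINDOW_SECS;
            *windows.entry(window_start).or_default().entry(user).or_default() += 1;
        }

        let mut starts: Vec<_> = windows.keys().copied().collect();
        starts.sort();
        for start in starts {
            println!("window [{start}, {}): {:?}", start + WINDOW_SECS, windows[&start]);
        }
    }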
Can you help me understand how this would plug into stream processing? My immediate thought is web page interaction replays, but that seems like a somewhat exotic use case?
insane_dreamer 2 hours ago [-]
Does this include broker capabilities? If not, what's a recommended broker these days (for hosting in the cloud, i.e., on an EC2 instance)? I know AWS has its own MQTT broker, but it's quite pricey at high volumes.
A major difference seems to be converting things to Arrow and using SQL instead of a DSL (VRL).
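If it sits on something like DataFusion (my assumption, purely to illustrate the model), SQL-over-Arrow looks roughly like this; the table name and CSV file are placeholders:

    // Rough idea of SQL-over-Arrow in Rust via DataFusion: batches of
    // Arrow data queried with plain SQL instead of a DSL.
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // "events.csv" stands in for whatever source feeds the pipeline.
        ctx.register_csv("events", "events.csv", CsvReadOptions::new()).await?;

        let df = ctx
            .sql("SELECT status, count(*) AS n FROM events GROUP BY status")
            .await?;
        df.show().await?; // prints Arrow record batches
        Ok(())
    }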