Vetric

Yoav Maman

Co-Founder and CTO at Vetric


Collecting public data at scale is, first and foremost, a data movement challenge. The real difficulty isn’t the collection itself; it’s moving information from constantly changing sources, through monitoring, filtering and enrichment stages, all the way to a reliable output, without source-side changes breaking the pipeline.

Every engineering team that has built such a solution in-house quickly discovers that most of its resources end up being spent on retry logic, monitoring and resolving bottlenecks involved in moving gigabytes of data — rather than on the unique product logic itself. At Vetric we chose to attack the problem at its root: we built a managed ETL layer on top of resilient collection infrastructure that absorbs source-side changes and keeps throughput stable end-to-end. The result frees cyber and public-safety teams from fighting with their data-engineering infrastructure, so they can focus on the core of their product: identifying and analyzing threats in real time.

From Raw Data Collection to End-to-End Pipeline Management

Since our founding in 2022, Vetric has focused on collecting data from public sources and delivering it via API. Under that model, our customers were responsible for building the pipeline that monitored, filtered and enriched the data. Analyzing customer behavior over time, however, led us to an important insight: customers were repeatedly building the same complex, manual process. This pushed us to develop an ETL product embedded directly on top of our collection infrastructure, sparing them the Sisyphean engineering work.

The technological heart of this development rests on Temporal — an open-source platform that guarantees reliable execution of complex workflows even when the underlying infrastructure fails. That said, teams that run ETL at high scale on top of Temporal usually hit a “wall”: the limit on the size of data passed between steps (the payload limit).

To solve this, we built an architecture that separates orchestration from data: while Temporal schedules workflows and manages retries, the data itself is written as Parquet files in S3. Between steps we introduced consolidation functions that prepare the data for the next stage, which eliminated the bottleneck that previously prevented moving gigabytes of data inside a single workflow.
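The separation described above can be sketched as follows. This is a minimal illustration, not Vetric's implementation: it assumes each step hands the orchestrator only a small reference to its output, and it substitutes an in-memory dictionary for S3 and JSON for Parquet so that the example runs with the standard library alone. All names (`ObjectStore`, `DataRef`, `collect`, `consolidate`, `enrich`) are hypothetical.

```python
import json
from dataclasses import dataclass


class ObjectStore:
    """In-memory stand-in for a bucket (the article uses S3 with Parquet files)."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, rows: list) -> None:
        self._objects[key] = json.dumps(rows).encode("utf-8")

    def get(self, key: str) -> list:
        return json.loads(self._objects[key].decode("utf-8"))


@dataclass(frozen=True)
class DataRef:
    """What a step returns to the orchestrator: a pointer, never the rows."""
    key: str
    row_count: int


def collect(store: ObjectStore, run_id: str, shard: int, rows: list) -> DataRef:
    # Each collection shard writes its own part file.
    key = f"{run_id}/collect/part-{shard:04d}.json"
    store.put(key, rows)
    return DataRef(key, len(rows))


def consolidate(store: ObjectStore, run_id: str, parts: list) -> DataRef:
    # Consolidation step: merge many part files into one input for the next stage.
    merged = []
    for part in parts:
        merged.extend(store.get(part.key))
    key = f"{run_id}/consolidated/all.json"
    store.put(key, merged)
    return DataRef(key, len(merged))


def enrich(store: ObjectStore, run_id: str, source: DataRef) -> DataRef:
    # Reads its input by reference and writes its output back to the store.
    rows = [{**row, "enriched": True} for row in store.get(source.key)]
    key = f"{run_id}/enrich/all.json"
    store.put(key, rows)
    return DataRef(key, len(rows))


if __name__ == "__main__":
    store = ObjectStore()
    parts = [collect(store, "run-1", i, [{"id": 2 * i}, {"id": 2 * i + 1}]) for i in range(3)]
    result = enrich(store, "run-1", consolidate(store, "run-1", parts))
    print(result)
```

In a real deployment each function would plausibly run as a workflow activity, and the `DataRef` values would be the only data the orchestrator serializes, so their size stays constant regardless of how many rows a run moves.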

Optimizing for Scale: Separation of Concerns

A second scaling problem appeared the moment customers started pulling results. Our primary database (Postgres) was simultaneously serving the pipeline’s writes and the customers’ reads, and the contention for resources degraded performance for both. Our solution was to fully separate reads from writes: we introduced a read replica, a copy of the database dedicated to reads only. We routed all customer queries to it while the pipeline kept writing to the primary. Because a replica lags slightly behind the primary, we added a short delay between the end of the workflow and the moment results become available for retrieval, giving the replica enough time to fully sync so that customers do not read incomplete results.
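The routing rule and the availability delay can be sketched in a few lines. This is an illustration under stated assumptions rather than the actual configuration: the connection strings, the `dsn_for` and `WorkflowRun` names, and the 30-second delay are all hypothetical, since the article does not state the real values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed value for illustration; the article does not state the actual delay.
REPLICA_SYNC_DELAY = timedelta(seconds=30)

# Hypothetical connection strings.
PRIMARY_DSN = "postgresql://primary.example.internal/pipeline"
REPLICA_DSN = "postgresql://replica.example.internal/pipeline"


def dsn_for(caller: str) -> str:
    """Pipeline traffic goes to the primary; every other caller reads from the replica."""
    return PRIMARY_DSN if caller == "pipeline" else REPLICA_DSN


@dataclass(frozen=True)
class WorkflowRun:
    run_id: str
    finished_at: datetime

    @property
    def available_at(self) -> datetime:
        # Results are exposed only after the replica has had time to catch up.
        return self.finished_at + REPLICA_SYNC_DELAY

    def results_ready(self, now: datetime) -> bool:
        return now >= self.available_at


if __name__ == "__main__":
    run = WorkflowRun("run-1", datetime.now(timezone.utc))
    print(dsn_for("customer"), run.results_ready(datetime.now(timezone.utc)))
```

A fixed delay like this is simple but relies on replication lag staying below the chosen value; a stricter variant would check the replica's actual replay position before exposing results.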

The result is a stable system: the pipeline keeps running without breaking, customers get fast results, and the infrastructure handles both kinds of load without starving the other.

Conclusion: Building the New Standard for the Data Pipeline

The transformation we went through at Vetric reflects a deeper truth about the modern data market: our customers are no longer just looking for access to raw data — they’re looking for a partner that can manage the infrastructural complexity for them. By solving hard engineering challenges — from getting past Temporal’s payload limits to balancing read and write loads — we managed to lift the maintenance burden off our customers’ shoulders. Today, Vetric is no longer just a pipe for moving data; it’s a resilient workflow platform that lets cyber companies focus on what they do best — analyzing information and protecting the organization — while we make sure the data keeps flowing reliably and ready for action.

From API to Managed Product: How We Solved Cyber Customers’ Infrastructure Problems with a Managed Workflow Product
