What is Pravega?

Pravega is Open Source

Streaming is motivating us to rethink fundamental data processing and storage principles. As storage experts, Dell EMC is doing its part by designing a new storage primitive purpose-built for streaming data. We are open sourcing Pravega under the Apache 2.0 License to accelerate the adoption of streaming technology. Open source is right for Pravega because we believe that disruptive technologies should be owned and driven by a community of passionate open source developers.

Pravega is a Cloud Native Computing Foundation sandbox project.

Key Features

Exactly-Once Semantics

Ensure that each event is delivered and processed exactly once, with exact ordering guarantees, despite failures in clients, servers or the network.

Auto-Scaling

Unlike systems with static partitioning, Pravega can automatically scale individual data streams to accommodate changes in data ingestion rate.
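As a rough illustration of the idea (not Pravega's actual implementation), the following Python sketch splits a stream segment's routing-key range in two when its ingestion rate crosses a threshold. The `scale` function, the key-range representation, and the `split_above` threshold are assumptions made for this example; merging segments when load drops is left out.

```python
def scale(segments, rates, split_above):
    """Split every segment whose observed rate exceeds the threshold.

    segments: list of (low, high) routing-key ranges covering [0.0, 1.0)
    rates: observed events per second, one entry per segment
    split_above: hypothetical events-per-second threshold
    """
    result = []
    for (low, high), rate in zip(segments, rates):
        if rate > split_above:
            mid = (low + high) / 2
            # A hot segment is replaced by two segments, each covering half of its key range.
            result += [(low, mid), (mid, high)]
        else:
            result.append((low, high))
    return result


# The first segment is hot (1200 events/s), the second is not (300 events/s).
segments = scale([(0.0, 0.5), (0.5, 1.0)], rates=[1200, 300], split_above=1000)
print(segments)  # [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]
```

Only the hot segment is split, so the rest of the stream is left untouched while the busy key range gains parallelism.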

Distributed Computing Primitive

Pravega is great for distributed computing; it can be used as a data storage mechanism, for messaging between processes and for other distributed computing services such as leader election.

Write Efficiency

Pravega shrinks write latency to milliseconds and seamlessly scales to handle high-throughput reads and writes from thousands of concurrent clients, making it ideal for IoT and other time-sensitive applications.

Unlimited Retention

Ingest, process and retain data in streams forever. Use the same paradigm to access both real-time and historical events stored in Pravega.

Durability

Don't compromise between performance, durability and consistency. Pravega persists and protects data before the write operation is acknowledged to the client.

Storage Efficiency

Use Pravega to build pipelines of data processing, combining batch, real-time and other applications without duplicating data for every step of the pipeline.

Transaction Support

Developers can use a Pravega Transaction to ensure that a set of events is written to a stream atomically.
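To make the all-or-nothing behavior concrete, here is a minimal in-memory sketch in Python. It does not use the Pravega client; the `Stream` and `Transaction` classes and their method names are hypothetical stand-ins chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical stand-in for a stream: an append-only list of events."""
    events: list = field(default_factory=list)


class Transaction:
    """Buffers events and appends them to the stream only on commit."""

    def __init__(self, stream: Stream):
        self._stream = stream
        self._buffer = []
        self._open = True

    def write_event(self, event: str) -> None:
        if not self._open:
            raise RuntimeError("transaction is closed")
        self._buffer.append(event)  # buffered, not yet visible to readers

    def commit(self) -> None:
        if not self._open:
            raise RuntimeError("transaction is closed")
        # All buffered events become visible together.
        self._stream.events.extend(self._buffer)
        self._open = False

    def abort(self) -> None:
        # Discard the buffer; the stream is left unchanged.
        self._buffer.clear()
        self._open = False


stream = Stream()

txn = Transaction(stream)
txn.write_event("debit:alice:10")
txn.write_event("credit:bob:10")
visible_before_commit = list(stream.events)  # still empty at this point
txn.commit()

aborted = Transaction(stream)
aborted.write_event("debit:carol:5")
aborted.abort()  # this event never reaches the stream

print(stream.events)  # ['debit:alice:10', 'credit:bob:10']
```

Readers of the stream see either both events of the committed transaction or neither, and nothing from the aborted one.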

Use Cases

Consistent, high performance storage, ideal for IoT

IoT Renewable Energy: Harnessing wind power at commercial scale requires a large number of wind turbines distributed over a large area. Each wind turbine generates thousands of data points per second (e.g. temperature, rotation speed, wind direction, energy output). Collecting all this data for historical and real-time analysis is necessary for prediction of potential failures as well as controlling power distribution networks to manage the variable nature of renewable energy. Unique benefits that Pravega offers:

  • Consistent high performance for ingestion of both small and large events
  • Scalable data ingestion from large number of sensors
  • Durable and low latency storage

Coordinating distributed applications

Distributed applications like micro-services: Pravega is a storage primitive, a messaging mechanism and a distributed computing coordination framework. Using the State Synchronizer API, micro-services can use Pravega as their database, sharing data without the overhead of a separate database system. Other distributed computing problems such as discovery, leader election and many more can be solved using Pravega. Unique benefits that Pravega offers:

  • State synchronizer for sharing data between processes with strong consistency and optimistic concurrency.
  • Leader election and other distributed computing patterns implemented without a direct application dependency on other middleware such as Apache Zookeeper.
  • Consistent, reliable middleware for modern reactive, micro-services applications.
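The optimistic-concurrency pattern mentioned in the first bullet can be sketched in a few lines of Python. This is an illustrative in-memory model, not the State Synchronizer API: `SharedState`, `compare_and_set`, and `update` are hypothetical names, and a real deployment would keep the state in a Pravega stream rather than in process memory.

```python
class SharedState:
    """Hypothetical versioned state: an update applies only against the latest version."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def compare_and_set(self, expected_version, new_value) -> bool:
        if expected_version != self.version:
            return False  # someone else updated first; the caller must re-read
        self.value = new_value
        self.version += 1
        return True


def update(state, fn, max_retries=10):
    """Read, compute, and conditionally write; retry if the write loses a race."""
    for _ in range(max_retries):
        seen_version, seen_value = state.version, state.value
        if state.compare_and_set(seen_version, fn(seen_value)):
            return state.value
    raise RuntimeError("too many conflicting updates")


members = SharedState(frozenset())
update(members, lambda s: s | {"service-a"})

stale_version = members.version          # a slow process remembers this version
update(members, lambda s: s | {"service-b"})

# The slow process now tries to overwrite the state using its stale version.
stale_write_accepted = members.compare_and_set(stale_version, frozenset({"service-c"}))

print(stale_write_accepted)   # False
print(sorted(members.value))  # ['service-a', 'service-b']
```

No locks are held while a process computes its update; a conflicting write is simply rejected and retried against the newer state.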

Same storage API for both real-time and historical data

Telecommunications: Companies have vast networks of complex infrastructure distributed across the world. Each component generates its own log representing its limited view of the global state. It's crucial all logs are aggregated and analyzed in real-time to detect issues before they disrupt service. Unique benefits that Pravega offers:

  • Same Streaming API for accessing and processing both real-time and historical data
  • Stream level auto-scaling to accommodate bursts of data
  • Connectors to stream processing engines such as Flink

Distributed Pub/Sub

Multi-Player Gaming: Low-latency, high-speed message delivery is vital for online multi-player gaming platforms. Each player's movements, interactions and events need to be reflected across all connected devices as they happen in real-time. Game messaging data is also stored for uses ranging from generating player leaderboard statistics from historical data to troubleshooting. Unique benefits that Pravega offers:

  • Scalable read/write parallelism that can support millions of simultaneous players
  • Reliable and exactly-once delivery of game state and events across all connected devices
  • Real-time delivery of score updates, game stats, and in-game notifications and alerts

Exactly-once stream processing with Apache Flink

Targeted Web Advertising: Targeting advertising to a customer’s interests or needs is key to increasing its effectiveness. First, unique users are identified in streams of data by correlating specific identifiers from multiple sources. Next, a user profile is generated by aggregating historical session data - extrapolating interests or demographic criteria against which the advertising can be targeted. Finally, user interactions on a website are correlated with their profiles in real-time to embed relevant ads from a catalogue of available options. Unique benefits that Pravega offers:

  • Exactly-once on sink/source with transactional writes, checkpointing and deduplication
  • One API to access real-time and historical data
  • Integration with Flink for elastically scalable data processing

Building modern, storage efficient data pipelines

Real-time and batch analytics: Historically, developers used a so-called Lambda architecture to build analytics platforms that process big data into accurate and real-time information. This approach required duplicate application development, one to provide accurate results over historical data using tools like HDFS and one to provide approximate but timely results using a different set of tools like Apache Storm. With Pravega, developers can build one application, satisfying both batch and real-time, eliminating complicated, hard to maintain dual infrastructures. Unique benefits that Pravega offers:

  • Data is stored once, in Pravega, not duplicated per middleware stack
  • Data is protected by Pravega, not replicated 3 (or more) times by middleware
  • Pravega's exactly-once semantics allow developers to build pipelines of applications with both accurate and timely results

Real-time Cybersecurity Threat Detection

To detect cybersecurity threats in real-time, huge volumes of streaming data from servers, network infrastructure, and application logs must be analyzed as they arrive, using AI to identify possible threats. Event-driven applications must act quickly and reliably to notify system administrators and update firewall rules to block dangerous traffic.

Benefits that Pravega Offers
  • Scalable data ingestion to efficiently handle varying loads over time
  • High availability design
  • Durable and low latency storage
  • Connectors to stream processing engines such as Flink and Spark
  • Automatic deletion of older events based on a retention policy
Example Solution Architecture
  • Data collectors collect security events from servers and network devices.
  • A Flink streaming job aggregates all events from all streams, applies an AI inference model to detect threats, and outputs the following:
  1. A summary of inferred threats will be updated in the database.
  2. Threat details will be permanently stored in the Threats stream.
  • The database can be PostgreSQL, Elasticsearch, Pravega Search, or anything else supported by Flink.
  • A web server provides a UI to view a security dashboard. If using Elasticsearch or Pravega Search, this can be Kibana.
  • Additional event-driven applications can respond to events in the Threats stream, for instance, by sending a text message to system administrators or by automatically updating firewall rules to block dangerous traffic.
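The data flow in this architecture can be sketched as a small Python function. This is only a toy model under stated assumptions: a plain list and a `Counter` stand in for the Threats stream and the database, and `looks_like_threat` is a hypothetical rule standing in for the AI inference model. No Flink or Pravega API is used.

```python
from collections import Counter


def looks_like_threat(event: dict) -> bool:
    """Placeholder for the AI inference model: flags repeated failed logins."""
    return event["type"] == "login_failed" and event["attempts"] >= 5


def process(events):
    summary = Counter()   # stands in for the threat summary kept in the database
    threats_stream = []   # stands in for the Threats stream
    for event in events:
        if looks_like_threat(event):
            summary[event["source"]] += 1
            threats_stream.append(event)
    return summary, threats_stream


events = [
    {"source": "10.0.0.7", "type": "login_failed", "attempts": 6},
    {"source": "10.0.0.8", "type": "login_ok", "attempts": 1},
    {"source": "10.0.0.7", "type": "login_failed", "attempts": 9},
]

summary, threats_stream = process(events)

# A downstream event-driven application could derive firewall rules from the threats.
blocked = sorted(summary)
print(blocked)  # ['10.0.0.7']
```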
Predictive Maintenance for IoT

Benefits that Pravega Offers
  • Consistent high performance for ingestion of both small and large events
  • Transactions to guarantee that events are never duplicated, lost, or out of order
  • Watermark functionality to ensure accurate windowed aggregations
  • Durable and low latency storage
  • Same API for reading both real-time and historical data
  • Connectors to stream processing engines such as Flink and Spark
  • Automatic deletion of older events based on a retention policy
  • Scalable data ingestion from large number of sensors
Example Solution Architecture
  • IoT Device
  1. Pravega Sensor Collector or a similar component collects sensor data from IoT devices and forwards it to the Pravega stream Raw Sensors.
  • Edge Cluster
  1. A Spark streaming job reads from the Raw Sensors stream, performs inference, and writes inference results to a new Pravega stream Clean Sensors with Inference. Note that Pravega provides connectors for both Spark and Flink, so they can be used interchangeably throughout this solution.
  2. A Flink job aggregates events from the Clean Sensors with Inference stream and updates a database for serving dashboard requests.
  3. The database can be PostgreSQL, Elasticsearch, Pravega Search, or anything else supported by Flink.
  4. A web server provides a UI to view a dashboard. If using Elasticsearch or Pravega Search, this can be Kibana.
  5. A Flink job is used to continuously copy stream events from an edge cluster to a data center or cloud instance of Pravega.
  6. Pravega stream retention is configured to keep just a few days of data in the edge cluster, to reduce the storage requirements at the edge. Older events will be automatically deleted.
  • Data Center / Cloud
  1. A Flink job in the data center or cloud can aggregate events from multiple edge clusters and write to the Aggregated stream.
  2. A Spark batch job can train or retrain an AI model based on the long-term historical events in the Aggregated stream. The new model can then be deployed as a new Spark AI Inference job in the edge cluster.
  3. By default, all stream data will be stored on Long Term Storage (HDFS, NFS, Dell ECS S3) forever. Retention policies can be applied as needed to delete older events.
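The edge retention step described above (keeping only a few days of data) can be illustrated with a short Python sketch. It assumes a simple time-based policy and models the stream as a `deque` of `(timestamp, payload)` pairs; `apply_retention` and its parameters are hypothetical names, not Pravega configuration.

```python
from collections import deque


def apply_retention(stream: deque, now: float, max_age_seconds: float) -> int:
    """Drop events older than max_age_seconds from the head of the stream.

    Returns the number of events removed. Events are assumed to be stored
    oldest-first, so truncation only ever happens at the head.
    """
    removed = 0
    while stream and now - stream[0][0] > max_age_seconds:
        stream.popleft()
        removed += 1
    return removed


DAY = 24 * 60 * 60
edge_stream = deque([
    (0 * DAY, "reading-1"),
    (1 * DAY, "reading-2"),
    (4 * DAY, "reading-3"),
    (5 * DAY, "reading-4"),
])

# On day 5, keep only the last 3 days of data at the edge.
removed = apply_retention(edge_stream, now=5 * DAY, max_age_seconds=3 * DAY)
print(removed)                        # 2
print([p for _, p in edge_stream])    # ['reading-3', 'reading-4']
```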
Real-time Billing

Benefits that Pravega Offers
  • Transactions to guarantee that events are never duplicated, lost, or out of order
  • Watermark functionality to ensure accurate windowed aggregations
  • Durable and low latency storage
  • Same API for reading both real-time and historical data
  • Connectors to stream (and batch) processing engines such as Flink and Spark
  • Scalable data ingestion to efficiently handle varying loads over time
  • High availability design
Example Solution Architecture
  • Data collectors collect billable events from various devices such as electrical meters and network routers.
  • Web sites or other applications generate billable events such as when a user purchases a book or on-demand movie.
  • A Flink streaming job aggregates all events from all streams and outputs the following:
  1. A summary of all billable events for the billing period will be updated in the database.
  2. An estimate of the monthly bill, based on historical patterns, will be updated in the database.
  3. At the end of the billing period, the actual bill will be generated, stored in the database for electronic billing, and permanently stored in the Monthly Bill stream.
  • The database can be PostgreSQL, Elasticsearch, Pravega Search, or anything else supported by Flink.
  • A web server provides a UI to view a billing dashboard. If using Elasticsearch or Pravega Search, this can be Kibana.
  • Additional event-driven applications can respond to new bills in the Monthly Bill stream, for instance, by printing and mailing paper bills.
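The aggregation step can be sketched as event-time windowing driven by a watermark. The Python below is a simplified model, not Flink or Pravega code: `BillingAggregator`, the integer day-based timestamps, and the 30-day period are assumptions made for illustration. A watermark here means "no events with an earlier event time are expected any more".

```python
from collections import defaultdict


class BillingAggregator:
    """Sums billable amounts per customer and billing period (event time)."""

    def __init__(self, period: int):
        self.period = period
        self.open_totals = defaultdict(float)  # (customer, period_start) -> running total
        self.bills = []                        # stands in for the Monthly Bill stream

    def on_event(self, customer: str, event_time: int, amount: float) -> None:
        period_start = event_time - event_time % self.period
        self.open_totals[(customer, period_start)] += amount

    def on_watermark(self, watermark: int) -> None:
        """Close every period that ended at or before the watermark."""
        closed = sorted(k for k in self.open_totals if k[1] + self.period <= watermark)
        for customer, period_start in closed:
            total = self.open_totals.pop((customer, period_start))
            self.bills.append((customer, period_start, total))


agg = BillingAggregator(period=30)
agg.on_event("alice", 3, 10.0)
agg.on_event("alice", 29, 5.0)
agg.on_event("alice", 31, 7.0)   # belongs to the next billing period
agg.on_event("alice", 12, 2.5)   # arrives out of order, still counted in period 0
agg.on_watermark(30)             # period [0, 30) is now complete

print(agg.bills)  # [('alice', 0, 17.5)]
```

Because the bill is emitted only once the watermark passes the end of the period, the out-of-order event is included in the correct bill, while the running total in `open_totals` can serve the estimate shown on the dashboard.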

Use Cases

Predictive Maintenance for IoT

Harnessing wind power at commercial scale requires a large number of wind turbines distributed over a large area. Each wind turbine generates thousands of data points per second (e.g. temperature, rotation speed, wind direction, energy output). Collecting all this data for historical and real-time analysis is necessary for prediction of potential failures as well as controlling power distribution networks to manage the variable nature of renewable energy. Pravega can be a key component of a predictive maintenance solution for a wide variety of equipment, including wind turbines, machines on a factory floor, trains, computer hardware, and even roller coasters.

  • Consistent high performance for ingestion of both small and large events
  • Transactions to guarantee that events are never duplicated, lost, or out of order
  • Watermark functionality to ensure accurate windowed aggregations
  • Durable and low latency storage
  • Same API for reading both real-time and historical data
  • Connectors to stream processing engines such as Flink and Spark
  • Automatic deletion of older events based on a retention policy
  • Scalable data ingestion from large number of sensors

Real-time Billing

A typical billing system will collect billable events from a variety of sources such as online purchases, server usage metrics, network usage metrics, electrical meters, and much more. Traditional billing systems process these events once a month to provide monthly bills. In contrast, a real-time billing system is able to continuously provide billing information that is accurate up to the day (or better). This avoids surprising differences between an estimated bill based on inaccurate data and the actual bill.

  • Transactions to guarantee that events are never duplicated, lost, or out of order
  • Watermark functionality to ensure accurate windowed aggregations
  • Durable and low latency storage
  • Same API for reading both real-time and historical data
  • Connectors to stream (and batch) processing engines such as Flink and Spark
  • Scalable data ingestion to efficiently handle varying loads over time
  • High availability design

Real-time Cybersecurity Threat Detection

To detect cybersecurity threats in real-time, huge volumes of streaming data from servers, network infrastructure, and application logs must be analyzed as they arrive, using AI to identify possible threats. Event-driven applications must act quickly and reliably to notify system administrators and update firewall rules to block dangerous traffic.

  • Scalable data ingestion to efficiently handle varying loads over time
  • High availability design
  • Durable and low latency storage
  • Connectors to stream processing engines such as Flink and Spark
  • Automatic deletion of older events based on a retention policy

Change Data Capture (CDC) is becoming a popular technique for interconnecting disparate systems, for replicating state across traditional boundaries, for decomposing existing monoliths into microservices, and for recording audit trails. CDC is the idea of emitting a changelog of all INSERTs, UPDATEs, DELETEs, and schema changes performed on a database. Debezium.io is an […]

Introduction The fundamentals of stream semantics in Pravega are learned through familiarity with its client APIs. In this article, we will overview Pravega’s client APIs with a handful of simple examples. As we reach the end, you should see Pravega in action, understand the guarantees afforded by Pravega streams, and have some familiarity with several […]

Pravega is a storage system for data streams that has an innovative design and an attractive set of features to cope with today’s Stream processing requirements (e.g., event ordering, scalability, performance, etc.). The project has plenty of documentation and great blog posts that explain in detail every technical aspect of Pravega. But, if you are […]

This blog post provides an overview of how Apache Flink and Pravega Connector works under the hood to provide end-to-end exactly-once semantics for streaming data pipelines. Overview Pravega [4] is a storage system that exposes Stream as storage primitive for continuous and unbounded data. A Pravega stream is a durable, elastic, append-only, unbounded sequence of […]

Pravega allows the state to be shared in a consistent fashion across multiple cooperating processes distributed in a cluster using a State Synchronizer. This blog details how to use State Synchronizer [1] to build and maintain consistency in a distributed application. State Synchronizer In distributed systems, frequently state needs to be shared across multiple instances […]

Introduction Pravega is an open-source distributed storage system implementing streams as first-class primitive for storing/serving continuous and unbounded data [1]. A Pravega stream is a durable, elastic, append-only, and unbounded sequence of bytes providing a strong consistency model guaranteeing data durability, message ordering, and exactly-once support. Since Pravega stores sequences of bytes and not events or messages, it provides […]

If you missed my previous post, Pravega Flink connector 101, we strongly recommend you take the time to read that one first. It introduced how Flink DataStream API works with reading from and writing to Pravega streams, which lays the necessary foundation for the topics we’ll cover in this post. To briefly recap the last […]

Over the last 40 years, the European Union has built a powerful research framework through a variety of research programmes, such as FP1-9, Horizon 2020, and more recently, Horizon Europe, among others [1]. Research programmes are organized into calls that address timely and relevant societal, economic, and cultural challenges in the European landscape, including health […]

Introduction Pravega is a storage system based on the stream abstraction, providing the ability to process tail data (low-latency streaming) and historical data (catchup and batch reads). Relatedly, Apache Flink is a widely-used real-time computing engine that provides unified batch and stream processing. Flink provides high-throughput, low-latency streaming data processing, as well as support for complex event […]

Introduction Today there are billions of sensors around the world, producing a massive amount of data. Some sensor data will be used only at the edge, and some will be sent to the cloud or data centers for aggregation, analytics, and AI efforts. These sensors may measure or produce images, video, lidar, audio, acceleration, GPS, […]

We are pleased to announce Pravega 0.9.0, our first release since Pravega became part of CNCF (Cloud Native Computing Foundation). This release continues to expand the Pravega feature-set and improves the performance of mission-critical use cases, and, of course, brings improved stability overall. In 2020, Pravega community delivered several significant releases. We introduced Streaming Cache […]

Raúl Gracia and Flavio Junqueira Introduction Streaming applications commonly ingest data from a wide range of elements – e.g., sensors, users, servers – concurrently to form a single stream of events. Using a single stream to capture the parallel data flows generated by multiple such elements enables applications to better reason about data and even […]

Raul Gracia and Flavio Junqueira Introduction Streaming systems continuously ingest and process data from a variety of data sources. They build on append-only data structures to enable efficient write and read access, targeting low-latency end-to-end. As more of the data sources in applications are machines, the expected volume of continuously generated data has been growing […]
