Kafka-Confluent
Confluent initially
Lots of data systems and data-rich applications
Allows every developer to take action on what is happening in real time
"Data in Motion" - the category Confluent is defining
CIO of United States Postal Service (USPS)
129 billion mail pieces across 164M places; 232K vehicles
33,000 post offices
115 PB of data
Started their Kafka journey in 2016
Have their own data centers; 28,000 virtual and physical servers
Q: COVID-19 Pandemic - Distributing Tests
CIO team was responsible for setting up the website for ordering free at-home COVID tests
Synchronous path - accepting the order, validating the address, making sure there was no duplicate order
Order then placed on a Kafka topic - three downstream consumers sent it to inventory systems; handled 8.7M orders per hour (see the producer sketch below)
Had to process orders extremely quickly
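A minimal sketch of that hand-off, assuming the confluent-kafka Python client; the topic name and order fields here are hypothetical, since the actual USPS pipeline is not public:

```python
# Sketch: publish a validated order to a topic for downstream consumers.
# Topic name and fields are hypothetical, not USPS's real pipeline.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Invoked once per message to confirm the broker accepted it.
    if err is not None:
        print(f"delivery failed: {err}")

order = {"order_id": "1234", "address": "475 L'Enfant Plaza SW", "qty": 4}
producer.produce(
    "covid-test-orders",       # hypothetical topic name
    key=order["order_id"],     # keying by order id preserves per-order ordering
    value=json.dumps(order),
    callback=delivery_report,
)
producer.flush()  # block until outstanding messages are acknowledged
```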
Q: Other areas of Kafka use?
We need a digital representation of all of our assets: package scans, trucks, geo-replication
Ingest 30 terabytes to create an application called Informed Visibility
Mesh of 400 Kafka topics
Powers 20 downstream business applications
Pricing, logistics, condition of packages, condition of post offices
Geo-replication - need to create caches for different regions; Kafka helps keep the distributed caches in sync
We want the Kafka schema registry and enterprise topics so everything is cataloged; want to bring structure around data streaming (see the sketch below)
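A hedged sketch of what cataloging topic schemas can look like, using the Confluent Schema Registry Python client; the subject name and Avro schema are illustrative, not USPS's actual setup:

```python
# Illustrative only: register an Avro schema for a topic's values so the
# topic is cataloged and producers/consumers share a contract.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({"url": "http://localhost:8081"})

package_scan_schema = Schema(
    """
    {
      "type": "record",
      "name": "PackageScan",
      "fields": [
        {"name": "tracking_id", "type": "string"},
        {"name": "facility", "type": "string"},
        {"name": "scanned_at", "type": "long"}
      ]
    }
    """,
    schema_type="AVRO",
)

# Convention: subject "<topic>-value" holds the value schema for that topic.
schema_id = client.register_schema("package-scans-value", package_scan_schema)
print(f"registered schema id {schema_id}")
```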
Kafka History
3 major releases every year; 22 major releases
Biggest features for growth
Replication (2016) - released the replication feature, which enabled high availability for mission-critical applications
Security (2018) - allowed customers and users to authenticate in an industry-standard way
Connect (2019) - a framework that lets the community build connectors for common data sources
Kafka Streams and exactly-once semantics (2019-20) - made building streaming applications easier and let people work with data more easily (see the transactions sketch after this list)
Partition scalability (2021) - allows for 10x scalability in metadata
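As a concrete illustration of the exactly-once feature mentioned above, a minimal transactional-producer sketch with the confluent-kafka Python client; the transactional id and topic name are made up:

```python
# Sketch of exactly-once publishing via Kafka's transactional producer.
# transactional.id and topic name are hypothetical.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "demo-txn-1",  # a stable id enables zombie fencing
})

producer.init_transactions()
producer.begin_transaction()
try:
    for i in range(3):
        producer.produce("events", key=str(i), value=f"event-{i}")
    # All three messages become visible to read_committed consumers
    # atomically - or not at all if the transaction aborts.
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
```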
Evolution over time
200+ connectors - Postgres, MongoDB, ...
Client libraries in many different languages, including Python and C# (see the consumer sketch below)
Stream processing engines - Beam, Flink, Hazelcast, Spark
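To make the client-library point concrete, a minimal consumer loop with the confluent-kafka Python client; the topic name and group id are placeholders:

```python
# Minimal consumer loop using the confluent-kafka Python client.
# Topic name and group id are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",  # start from the beginning if no offsets
})
consumer.subscribe(["events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1s for the next message
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(f"{msg.topic()}[{msg.partition()}]@{msg.offset()}: {msg.value()}")
finally:
    consumer.close()
```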
Open source project
2200 contributors; 1500 reviewers
Users
80% of the Fortune 100 are users
Pinterest - Chuyan Wang
400M monthly active users
Has been working at Pinterest for 10 years
History - started using Kafka back in 2012; needed a unified, high-throughput log transportation platform
Right now - it's powering hundreds of different batch and streaming use cases
All the signals we develop for A/B experimentation flow through it
Advantage - the ability to generate signals faster and recommend real-time content to users
Data architecture - use Kafka for all data transportation, whether activity logs or production CDC; for batch processing push to S3, for stream processing use real-time Flink
4,000 brokers; 3,000 topics; 50+ clusters; 500K+ partitions
Kafka can scale well both vertically and horizontally
Operating at that big scale, Kafka is very stable - performance improvements and patches help a lot
Instance hardware failures and traffic pattern changes are the biggest challenges
Used to run Kafka on top of magnetic disks - now on NVMe SSDs (i3 or i4 instances); the higher IOPS helps a lot in a high-partition-density environment
We chose to invest heavily in automation; a project called Array helps automate operational burdens - broker replacement, upgrades, restarts, workload balancing
We've built several tools at Pinterest and open-sourced them as well - two other projects are Singer and Secor; Singer is a high-throughput logging agent that ships logs from disk to Kafka; Secor is a Kafka consumer that persists data to S3 with stronger consistency
Querybook - Pinterest's big data IDE with a notebook-style interface
Kafka in the future - data mesh: data discoverability and data proliferation; cataloging internal data by building a data catalog and data documentation; table tiering to improve batch and analytical processes
Kafka in the Cloud
Managing a distributed system is not easy - balancing clusters, upgrades, etc.
A managed service can typically translate into lower costs
Running Kafka in the cloud - brokers have locally attached storage; for a low-latency real-time experience we typically need fast, expensive local disks
Overall storage cost can end up dominating
We are, in effect, coupling the bulk of the data storage with compute
Confluent Cloud approach - tiered storage:
Users care about a low-latency real-time experience for the most recent data
For historical data they only care about good throughput, not low latency
Local hotset allows fast access to recent data
Remote tier in external object storage gives good throughput at lower cost
Decouples the bulk of storage from compute (see the sketch below)
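Confluent Cloud manages the tiering transparently, but as a hedged sketch of the same hotset/remote split in open-source Apache Kafka (KIP-405 tiered storage, available in 3.6+), a topic can keep a short local retention and a much longer total retention; names and values below are illustrative:

```python
# Illustration of the hotset/remote split using Apache Kafka's tiered
# storage (KIP-405) topic configs; the broker must also enable
# remote.log.storage.system.enable=true. Names and values are made up.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "clickstream",
    num_partitions=6,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",  # tier closed segments to remote storage
        "local.retention.ms": "3600000",  # hotset: ~1 hour on local disk
        "retention.ms": "604800000",      # total retention: 7 days, mostly remote
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()  # raises on failure
    print(f"created topic {name}")
```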