Kafka-Confluent
Confluent initially
Lots of data systems and data-rich applications
Allows every developer to take action on what is happening in real time
"Data in Motion" - the category Confluent is defining
CIO of United States Postal Service (USPS)
129 billion mail pieces across 164M places; 232K vehicles
33,000 post offices
115 PB of data
Started their Kafka journey in 2016
Have their own data centers; 28,000 virtual and physical servers
Q: COVID-19 Pandemic - Distributing Tests
CIO team was responsible for setting up the website for ordering free at-home COVID tests
Synchronous path - accepting the order, validating the address, making sure there was no duplicate order
Order then placed on a Kafka topic - three downstream consumers sent it to inventory systems; handled 8.7M orders per hour (see the producer sketch below)
Had to process orders extremely quickly
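A minimal sketch of that hand-off, assuming the confluent-kafka Python client; the topic name and order fields here are hypothetical, since the actual USPS pipeline is not public:

```python
# Sketch: publish a validated order to a topic for downstream consumers.
# Topic name and fields are hypothetical, not USPS's real pipeline.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Invoked once per message to confirm the broker accepted it.
    if err is not None:
        print(f"delivery failed: {err}")

order = {"order_id": "1234", "address": "475 L'Enfant Plaza SW", "qty": 4}
producer.produce(
    "covid-test-orders",       # hypothetical topic name
    key=order["order_id"],     # keying by order id preserves per-order ordering
    value=json.dumps(order),
    callback=delivery_report,
)
producer.flush()  # block until outstanding messages are acknowledged
```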
Q: Other areas of Kafka use?
We need a digital representation of all of our assets: package scans, trucks, geo-replication
Ingest 30 terabytes to create an application called Informed Visibility
Mesh of 400 Kafka topics
Powers 20 downstream business applications
Pricing, logistics, condition of packages, condition of post offices
Geo-replication - need to create caches for different regions; Kafka helps keep the distributed caches in sync
We want the Kafka schema registry and enterprise topics so everything is cataloged; want to bring structure around data streaming (see the sketch below)
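A hedged sketch of what cataloging topic schemas can look like, using the Confluent Schema Registry Python client; the subject name and Avro schema are illustrative, not USPS's actual setup:

```python
# Illustrative only: register an Avro schema for a topic's values so the
# topic is cataloged and producers/consumers share a contract.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({"url": "http://localhost:8081"})

package_scan_schema = Schema(
    """
    {
      "type": "record",
      "name": "PackageScan",
      "fields": [
        {"name": "tracking_id", "type": "string"},
        {"name": "facility", "type": "string"},
        {"name": "scanned_at", "type": "long"}
      ]
    }
    """,
    schema_type="AVRO",
)

# Convention: subject "<topic>-value" holds the value schema for that topic.
schema_id = client.register_schema("package-scans-value", package_scan_schema)
print(f"registered schema id {schema_id}")
```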
Kafka History
3 major releases every year; 22 major releases
Biggest features for growth
Replication (2016) - released the replication feature, which enabled high availability for mission-critical applications
Security (2018) - allowed customers and users to authenticate in an industry-standard way
Connect (2019) - a framework that lets the community build connectors for common data sources
Kafka Streams and exactly-once semantics (2019-20) - made building streaming applications easier and let people work with data more easily (see the transactions sketch after this list)
Partition scalability (2021) - allows for 10x scalability in metadata
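As a concrete illustration of the exactly-once feature mentioned above, a minimal transactional-producer sketch with the confluent-kafka Python client; the transactional id and topic name are made up:

```python
# Sketch of exactly-once publishing via Kafka's transactional producer.
# transactional.id and topic name are hypothetical.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "demo-txn-1",  # a stable id enables zombie fencing
})

producer.init_transactions()
producer.begin_transaction()
try:
    for i in range(3):
        producer.produce("events", key=str(i), value=f"event-{i}")
    # All three messages become visible to read_committed consumers
    # atomically - or not at all if the transaction aborts.
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
```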
Evolution over time
200+ connectors - Postgres, MongoDB, ...
Client libraries in many different languages, including Python and C# (see the consumer sketch below)
Stream processing engines - Beam, Flink, Hazelcast, Spark
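To make the client-library point concrete, a minimal consumer loop with the confluent-kafka Python client; the topic name and group id are placeholders:

```python
# Minimal consumer loop using the confluent-kafka Python client.
# Topic name and group id are placeholders.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",  # start from the beginning if no offsets
})
consumer.subscribe(["events"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1s for the next message
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        print(f"{msg.topic()}[{msg.partition()}]@{msg.offset()}: {msg.value()}")
finally:
    consumer.close()
```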
Open source project
2200 contributors; 1500 reviewers
Users
80% of the Fortune 100 are users
Pinterest - Chuyan Wang
400M monthly active users
Has been working at Pinterest for 10 years
History - started using Kafka back in 2012; needed a unified, high-throughput log transportation platform
Right now - it's powering hundreds of different batch and streaming use cases
All the signals we develop for A/B experimentation flow through it
Advantage - the ability to generate signals faster and recommend real-time content to users
Data architecture - use Kafka for all data transportation, whether activity logs or production CDC; for batch processing push to S3, for stream processing use real-time Flink
4,000 brokers; 3,000 topics; 50+ clusters; 500K+ partitions
Kafka can scale well both vertically and horizontally
Operating at that big scale, Kafka is very stable - performance improvements and patches help a lot
Instance hardware failures and traffic pattern changes are the biggest challenges
Used to run Kafka on top of magnetic disks - now on NVMe SSDs (i3 or i4 instances); the higher IOPS helps a lot in a high-partition-density environment
We chose to invest heavily in automation; a project called Array helps automate operational burdens - broker replacement, upgrades, restarts, workload balancing
We've built several tools at Pinterest and open-sourced them as well - two other projects are Singer and Secor; Singer is a high-throughput logging agent that ships logs from disk to Kafka; Secor is a Kafka consumer that persists data to S3 with stronger consistency
Querybook - Pinterest's big data IDE with a notebook-style interface
Kafka in the future - data mesh: data discoverability and data proliferation; cataloging internal data by building a data catalog and data documentation; table tiering to improve batch and analytical processes
Kafka in the Cloud
Managing a distributed system is not easy - balancing clusters, upgrades, etc.
A managed service can typically translate into lower costs
Running Kafka in the cloud - brokers have locally attached storage; for a low-latency real-time experience we typically need fast, expensive local disks
Overall storage cost can end up dominating
We are, in effect, coupling the bulk of the data storage with compute
Confluent Cloud approach - tiered storage:
Users care about a low-latency real-time experience for the most recent data
For historical data they only care about good throughput, not low latency
Local hotset allows fast access to recent data
Remote tier in external object storage gives good throughput at lower cost
Decouples the bulk of storage from compute (see the sketch below)
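Confluent Cloud manages the tiering transparently, but as a hedged sketch of the same hotset/remote split in open-source Apache Kafka (KIP-405 tiered storage, available in 3.6+), a topic can keep a short local retention and a much longer total retention; names and values below are illustrative:

```python
# Illustration of the hotset/remote split using Apache Kafka's tiered
# storage (KIP-405) topic configs; the broker must also enable
# remote.log.storage.system.enable=true. Names and values are made up.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "clickstream",
    num_partitions=6,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",  # tier closed segments to remote storage
        "local.retention.ms": "3600000",  # hotset: ~1 hour on local disk
        "retention.ms": "604800000",      # total retention: 7 days, mostly remote
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()  # raises on failure
    print(f"created topic {name}")
```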