c* capacity forecasting (ajay upadhyay, jyoti shandil, arun agrawal, netflix) | cassandra summit...

Post on 13-Apr-2017

TRANSCRIPT

Capacity Forecast @ Scale
CDE, Cloud Database Engineering
Netflix

Who are we?

● CDE, Cloud Database Engineering
● Providing data stores as a service
  ○ Cassandra
  ○ Dynomite
  ○ Elasticsearch and RDS

Ajay Upadhyay, Cloud Data Architect @ Netflix

Arun Agrawal, Sr. Software Engineer @ Netflix

Agenda

● Cassandra @ Netflix
● Cassandra footprint
● Capacity planning lifecycle
● Forecasting the capacity
● Q and A

Cassandra @ Netflix

• 98% of streaming data is stored in Cassandra
• Data ranges from customer details to viewing history / streaming bookmarks to billing and payment

Cassandra Footprint

Hundreds C*

Cassandra Footprint

Thousands

Capacity Planning

• Able to predict
  – Current usage and available capacity
  – Resources needing upgrade
  – Life cycle of current configuration
  – Appropriate configuration for new and existing App/Service
• Optimize
  – Under- or over-utilized resources
  – Increased business productivity

Capacity Planning

Avoid:

• Impact on business
• No service or SLA disruption
• Unplanned maintenance
• Firefighting

Life Cycle

[Diagram: stages of the capacity planning cycle]

• Capture Requirement
• Requirement Analysis / feasibility
• Proxy or Simulate Requirement
• Monitoring / Trending
• New / Increased traffic
• Optimization

Capture Requirement

– IOPs and SLA
– Maintenance overhead
– Failover
– Access pattern

IOPs and SLA

Questions                         Response
Read OPS/sec [avg, peak]          5k - 10k
Read latency requirement          95th - 20ms, 99th - 100ms
Write OPS/sec [avg, peak]         1k - 2k
Write latency requirement         95th - 20ms, 99th - 100ms
Num columns / row                 100
Avg col size / or avg row size    64k
Num of rows                       100 Mil
TTL [life cycle of data]          365 days
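As a rough illustration of how answers like these can feed a first sizing estimate, the sketch below turns the sample responses into a replicated data footprint. It assumes that "64k" means 64 KiB per row and that the replication factor is 3. Neither is stated in the slides, and the function name is hypothetical. Compression, compaction headroom, and per-node overhead are ignored.

```python
def estimate_raw_footprint_tb(num_rows, avg_row_bytes, replication_factor):
    """Return the replicated data footprint in terabytes (10**12 bytes)."""
    total_bytes = num_rows * avg_row_bytes * replication_factor
    return total_bytes / 10**12


# Sample responses from the questionnaire above.
NUM_ROWS = 100_000_000        # "100 Mil"
AVG_ROW_BYTES = 64 * 1024     # assumption: "64k" read as 64 KiB per row
REPLICATION_FACTOR = 3        # assumption: not given in the slides

if __name__ == "__main__":
    tb = estimate_raw_footprint_tb(NUM_ROWS, AVG_ROW_BYTES, REPLICATION_FACTOR)
    print(f"Estimated replicated footprint: {tb:.1f} TB")
```

If "64k" instead describes a single column, the estimate would have to be multiplied by the number of columns per row, so the ambiguity in the questionnaire is worth resolving before sizing a cluster.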

[Diagram: Gutenberg publisher service, Read / Write, Data store (C*)]

Maintenance Overhead

Type                              Response
Repairs / Compactions             Y/N
Node replacement                  Y
Backup - Full / Incrementals      Y/N

Failover

Questions                         Response
Region failover                   Y/N
SLA in case of region failover    Y/N

Access Pattern

Questions                         Response
Read                              Point read
                                  All row readers
                                  Column slices
Write                             Part existing row
                                  New rows

Proxy/Simulate Traffic

– Proxy existing traffic
– Simulate traffic
  – NDBench
  – Generate actual / synthetic traffic before final deployment using app

Optimization

• Cache
  - Application level
  - Fronting cache engine before C*
• Stagger R/W operations if possible
• Cluster sharding
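To make the application-level cache option concrete, here is a minimal read-through cache sketch. The `load_from_store` callable stands in for a C* read and is purely hypothetical, as are the class name and the LRU eviction policy. The slides do not say which cache engine or policy is used.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Tiny in-process LRU cache that fronts a slower backing store."""

    def __init__(self, load_from_store, max_entries=1024):
        self._load = load_from_store   # hypothetical callable: key -> value
        self._max = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)       # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._load(key)                  # only misses reach the store
        self._entries[key] = value
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)    # evict least recently used
        return value
```

The hit and miss counters give a quick view of how much read traffic the cache absorbs, which is the figure that matters when deciding whether the cluster behind it can be smaller.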

Trend Analysis

Continuous monitoring / trending on usage pattern

New / Increased Traffic

Capacity planning cycle begins

Capacity Forecasting

Arun Agrawal, Sr. Software Engineer

Demo

Metrics

Atlas

Previous Architecture

Pain Points

• No support for complex relationships

• Hardware failures could lead to false positives

Winston

• Bridge between Atlas and on-call
• Complex relationship modeling between metrics
• Reduce false positives
• Auto-remediation platform

Lesson Learnt

• It might already be too late to fix the system.
• Reactive rather than proactive

Requirements

• Show us the trend for the clusters.
• Warn us of what is coming if the trend continues.
• Give us time to scale the cluster.
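The requirements above amount to extrapolating a usage trend and reporting how much lead time remains. A minimal sketch, assuming a plain least-squares line over daily samples and a hypothetical usage threshold, could look like this. The function names and the linear model are illustrative choices, not the method described in the talk.

```python
import math


def fit_line(values):
    """Least-squares fit y = slope * day + intercept for day = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(values, threshold):
    """Days after the last sample until the fitted line reaches threshold.

    Returns 0 when the fitted line is already at or above the threshold,
    and None when the trend is flat or falling.
    """
    slope, intercept = fit_line(values)
    last_day = len(values) - 1
    if slope * last_day + intercept >= threshold:
        return 0
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return math.ceil(crossing_day - last_day)
```

For example, with daily disk usage of 50, 52, 54, 56 and 58 percent and a threshold of 80 percent, `days_until` returns 11, which is the lead time available for scaling the cluster if the trend holds.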

Automic (UC4)

Architecture

Aggregation

• Daily
• Instance level
• Cluster level

• Instance failures
• Adding capacity over days
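One way to picture the daily, instance-level, and cluster-level roll-up is sketched below. The tuple layout of the samples, the use of a mean, and the function name are assumptions made for illustration. Keeping the count of reporting instances next to each cluster value is one simple way to notice instance failures or capacity added between days.

```python
from collections import defaultdict
from statistics import mean


def aggregate_daily(samples):
    """Roll raw samples up to daily instance-level and cluster-level values.

    samples: iterable of (day, cluster, instance, value) tuples.
    Returns (instance_level, cluster_level) where
      instance_level[(day, cluster, instance)] is the mean of that day's samples
      cluster_level[(day, cluster)] is (mean across instances, instance count)
    """
    per_instance = defaultdict(list)
    for day, cluster, instance, value in samples:
        per_instance[(day, cluster, instance)].append(value)
    instance_level = {key: mean(vals) for key, vals in per_instance.items()}

    per_cluster = defaultdict(list)
    for (day, cluster, _instance), value in instance_level.items():
        per_cluster[(day, cluster)].append(value)
    cluster_level = {
        key: (mean(vals), len(vals)) for key, vals in per_cluster.items()
    }
    return instance_level, cluster_level
```

Averaging only over the instances that reported on a given day keeps a single failed node from dragging the cluster value toward zero, while a change in the instance count flags that the day is not directly comparable with its neighbours.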

Growth Criteria

f(x) of
– Subscribers
– Netflix content
– # Viewing sessions

ARIMA

– AR
  • Regression on prior values
– I
  • Data values are replaced with (x(i) - x(i-1))
– MA
  • Linear combination of error terms
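To show how the components listed above fit together, here is a deliberately small sketch of an ARIMA(1,1,0)-style forecast in plain Python: one round of differencing (the "I" step) followed by a least-squares regression of each difference on the previous one (the "AR" step). The moving-average term is left out to keep the example short, and all function names are hypothetical. A production forecast would more likely rely on an established statistics library.

```python
def difference(series):
    """The "I" step: replace x(i) with x(i) - x(i-1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(diffs):
    """The "AR" step: least-squares regression d(t) = c + phi * d(t-1)."""
    xs, ys = diffs[:-1], diffs[1:]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least three differenced values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0   # constant differences: no autoregressive signal
    phi = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - phi * mean_x, phi


def forecast(series, steps):
    """Forecast `steps` future values, undoing the differencing as we go."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    out = []
    for _ in range(steps):
        last_diff = c + phi * last_diff   # predict the next difference
        level += last_diff                # integrate back to the original scale
        out.append(level)
    return out
```

For a series that grows by a constant 2 per day, such as 10, 12, 14, 16, 18, the differences are all 2 and `forecast(series, 3)` continues the line with 20, 22 and 24.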

Future

• Vector Auto Regression
• Automate manual judgement

Resources

– https://www.otexts.org/fpp/8

Q & A

You may not control all the events that happen to you, but you CAN decide not to be reduced by them.

-Maya Angelou
