kickwrangler

Post on 14-Feb-2017

117 Views

Category:

Data & Analytics

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Kickwrangler

Alex McCauley

Predict which Kickstarter projects won’t deliver

It doesn’t always work out…

Kickwrangler predicts the projects most likely to not deliver

pledged to Kickstarter projects over 6 years$1,768,328,118

Data SourcesProject Features

Creator data:• # past projects (experienced?)• # they have backed

Project data:• Category• # backers• $ pledged vs goal (overfunded?)

Project OutcomesNo data/tracking of project delivery

• Sentiment analysis -> project status

• 500k tech comments

Scraped from:

predict

Algorithms: …

TopicsComment

block

Comment block

Comment block

Topic modeling(Latent Dirichlet Allocation)

Small (1k) sample

Worker 1

Worker 2

Worker 3

Pre-process text

Labeled sentiments

Linear regression:topic – sentiment mapping:

Curate: keep only 7 most relevant topics

Sentiment vs time!

Workers label sentimentsAll comments

Timeline of project sentiments

“refund”, “delay”

“received mine today”, “awesome”

“pledge”, “funded”

“update”, “status”

Classification algorithmClassification resultsBuild intuition: 2-d reduction

• Random Forest classifier

Accuracy 0.88

Precision 0.6

Recall 0.3

F1 0.43 (0.24 = random)

Most important feature:• $ pledged / $ goal – overfunded

campaigns are badSurprisingly not important:• Creator history (# previous projects)

Good

Bad

About me

Wireless batteries!

Green = charging

Photonic crystals, quasi-crystals, and Casimir forces

Validation

Mechanical Turk ratings• Remove ratings with high scatter• Inspect subset

Topic-sentiment map• Train linear regression on labeled comments• Test on holdout set:

Project status• Recall: 10/10 on “top 10 failed projects” list• Precision: test on 20 predicted failed projects

Validation: sentiment analysisMechanical Turk ratings

• Train linear regression on labeled comments, mapping topic content to sentiments

• Test on holdout set:

Topic-sentiment map

• Remove comments with standard deviation > 0.5 (S.D. = 1.3 is a random guess)

• Test on 10 well-known failure cases, gets 10/10

• Test validity on random sample of 100 projects (TODO)

Validation: project labeling

top related