
Completed • $250,000 • 173 teams

GE Flight Quest

Wed 28 Nov 2012 – Mon 11 Mar 2013

Hey,

I am interested in doing this for educational purposes, and I doubt I will be a "winner" in this competition. I would like to know how one should begin with a dataset this large. Do you do some preliminary work with data plotting, or go straight to random forest algorithms? I would like to hear your approach.

Hey Dan,

I'm just starting out too, and thought this would be an interesting one to start with. I had some ideas about where to begin: I'm going to extract the data, look at a few instances of flight history, and begin making some small basic groupings (identifying scheduled flights, looking at certain routes) to get a really good feel for the data first with some basic visualisations.

If you are interested, we could team up and share insights/models. I am going to use QlikView for the basic visualisation and mapping, and R/Python for the modelling. I was going to rent an EC2 instance for the number crunching. I'm based in the UK and will be pottering on this in the evenings.
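A minimal sketch of that kind of first-pass grouping in pandas. The column names and numbers here are made up, standing in for whatever the real flight-history files contain:

```python
import pandas as pd

# Hypothetical sample of flight-history rows; the real competition CSVs
# use different column names, so treat these purely as placeholders.
flights = pd.DataFrame({
    "origin": ["JFK", "JFK", "LAX", "LAX", "ORD"],
    "dest":   ["LAX", "LAX", "JFK", "ORD", "JFK"],
    "scheduled_min": [360, 365, 355, 240, 130],
    "actual_min":    [375, 350, 400, 250, 125],
})

# Delay relative to schedule, in minutes.
flights["delay_min"] = flights["actual_min"] - flights["scheduled_min"]

# Average delay per route: one of the "small basic groupings"
# that gives a quick feel for the data.
route_delay = (
    flights.groupby(["origin", "dest"])["delay_min"]
    .mean()
    .reset_index()
)
print(route_delay)
```

The same groupby pattern extends naturally to carrier, hour of day, or airport once the real columns are known.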

Let me know

Mark 

Mark,

I would be interested but I don't know how much I can commit to it.  Email me with some ideas and we can get started.

Dan

In our case 95% of the work was done in SQL. The data itself is very "buggy", so the whole system must be very fault tolerant. Once we had an overview of the data, we looked into reducing the biggest errors, for example those above 60 minutes. The error metric is very sensitive to such errors, but they are rather easy to catch.

Our weak model scored about 6.5 on the leaderboard. So start with simple models that you can treat as a benchmark; if you cannot achieve such a result with a simple model, there is no sense in going further (in terms of model complexity). To sum up, expect to spend more than 90% of the time on data extraction and cleaning.
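To see why misses above 60 minutes matter so much, here is a small NumPy sketch with invented numbers (not the competition's exact metric, just a squared-error illustration): one big outlier dominates the total error.

```python
import numpy as np

# Hypothetical predicted vs. actual arrival delays in minutes;
# all values here are invented for illustration.
actual = np.array([5.0, 12.0, -3.0, 90.0, 7.0])
predicted = np.array([4.0, 10.0, 0.0, 20.0, 6.0])

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))

# The single 70-minute miss contributes 4900 to the sum of squares,
# versus 15 from the other four flights combined, so flagging
# outliers above 60 minutes is the cheapest improvement available.
big_misses = np.abs(errors) > 60
print(rmse, big_misses.sum())
```

Catching just that one flight would cut the RMSE far more than polishing the four near-correct predictions.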

Daniel Parry wrote:

Hey,

I am interested in doing this for educational purposes and I doubt I will be a "winner" in this competition.  I would like to know how one should begin when dealing with datasets so large.  Do you do some preliminary work with data-plotting or go straight to the random forests algorithms?  I would like to hear your approach.  

Hey Dan,

I am very interested in using random forest algorithms on this problem. Could you give me some suggestions?
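For anyone in the same position, here is a hedged scikit-learn sketch of the basic random forest workflow on synthetic stand-in data. Nothing here is the competition's actual method; the features (departure delay, distance, hour) and all numbers are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in features (think: departure delay, distance,
# hour of day); real features would come from the cleaned flight data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# In this toy setup, arrival delay is driven mostly by the first
# feature, plus a little noise.
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Feature importances should point at the first feature as the
# dominant driver, which is a quick sanity check on the fit.
print(model.feature_importances_)
```

Once a simple benchmark exists, a forest like this is an easy next step, and the importances help decide which engineered features are worth keeping.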
