
Completed • $250,000 • 173 teams

GE Flight Quest

Wed 28 Nov 2012
– Mon 11 Mar 2013

Contradiction in rules (for final model)


Hello

I am seeing a contradiction in the rules for what is allowable for final model.

At this link: http://www.gequest.com/c/flight/details/submission-instructions it states:

Your model must be structured so that it makes each test day's predictions based on no information in the final evaluation test data other than the information from that day, which will be in an appropriately named folder.

So for prediction of Feb 20, it is not allowed to use data from Feb 15 thru Feb 19.

However in http://www.gequest.com/c/flight/data and also in https://www.gequest.com/wiki/FlightQuest.Data it states:

For each day in the test period (first for the public leaderboard, and later for the final evaluation), we will select a random time (uniformly chosen between 9am EST and 9pm EST) and select all of the flights in the air at that cutoff time. You will be provided with relevant data for each day that would be available at the chosen cutoff time. Predictions for each flight on a given day can not reference any data related to future dates in the evaluation data set.

So for prediction of Feb 20, it is allowed to use data from Feb 15 through 19 (but not Feb 21 and beyond).


So for the final evaluation set, which is correct?   Is prior-dated data allowed to be considered when making a prediction or not?

Thanks

[edit: spelling]

Additionally, it appears that this wording even implies that the model should "start clean" on each test. IOW, it would be an implicit violation if your model has ever been trained on data from the future from the standpoint of the randomly chosen test day.

This is a condition that may prove nearly impossible to implement since it means you must retrain your model fresh for each test. Any learned information in the model would need to explicitly come from past days only. Since it's impossible to identify the source of learned data in most models, we must either re-train on every test, or assume that the test period will be chosen from a hidden set of dates we won't have access to in the training set.
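Under that strict "start clean" reading, each test day would need a fresh fit on strictly earlier days only. A toy sketch of what that retraining loop would look like (all names and data here are illustrative, not the competition's actual format):

```python
# Sketch of the "start clean" reading: before predicting each test day,
# refit the model using only strictly earlier days (expensive, as noted).
def rolling_predict(days, fit, predict):
    """days: ordered mapping day -> list of observations.

    Refits before every test day so no future data leaks into the model.
    """
    out = {}
    history = []
    for day, data in days.items():
        if history:  # need at least one past day to fit on
            model = fit(history)
            out[day] = predict(model, data)
        history.append(data)
    return out

# Toy model: predict the mean of all past observations.
def fit(history):
    flat = [x for day in history for x in day]
    return sum(flat) / len(flat)

def predict(model, data):
    return [model] * len(data)

print(rolling_predict({1: [10.0], 2: [20.0], 3: [30.0]}, fit, predict))
# day 2 is predicted from day 1 only; day 3 from days 1-2 only
```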

Our apologies -- the wiki and data page were wrong; the submission instructions page was correct. Now they are all correct.

Your model must be structured to make final test data set predictions based on no days in the final test data set other than the appropriate one (none prior or later are allowed).

This rule only applies to the final test data set. Note that some of the training data will be from a period later than the public leaderboard test data set. It is fine to use this data.

"For each day in the test period (first for the public leaderboard, and later for the final evaluation), we will select a random time (uniformly chosen between 9am EST and 9pm EST) and select all of the flights in the air at that cutoff time. You will be provided with relevant data for each day that would be available at the chosen cutoff time."

"Your model must be structured so that it makes each test day's final test data set predictions based on no information in the final evaluation test data other than the information from that day, which will be in an appropriately named folder. (Reworded for clarification on 11/30/2012. See forum for explanation.)"

It would be useful to people who are new to the flight data domain if the organizers could explain the above clearly, perhaps by walking through an example that aids understanding.

  1. I am guessing that for each test day we can use all data (including Arrival Times Scheduled and Actual) for that day up until a random cutoff time and then we will predict the arrival times for the rest of that day of only the aircraft that are in the air at the time of the cutoff. This would mean no extra points for good predictions of how long it takes the plane to board, taxi down the runway, and get into the air.

Can anyone tell me if this sounds correct?
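If that guess is right, the cutoff setup might be sketched like this (field names such as `wheels_off` and the record layout are my assumptions, not the competition's actual columns):

```python
import random
from datetime import datetime, timedelta

def select_airborne(flights, day_start):
    """Pick a random cutoff between 9am and 9pm and return the flights
    that are in the air at that moment (toy version of the rule quoted
    from the data page)."""
    cutoff = day_start + timedelta(hours=random.uniform(9, 21))
    airborne = [f for f in flights
                if f["wheels_off"] <= cutoff < f["actual_arrival"]]
    return cutoff, airborne

day = datetime(2013, 2, 20)
flights = [
    {"id": "UA100", "wheels_off": day + timedelta(hours=8),
     "actual_arrival": day + timedelta(hours=23)},
    {"id": "DL200", "wheels_off": day + timedelta(hours=22),
     "actual_arrival": day + timedelta(hours=23, minutes=30)},
]
cutoff, airborne = select_airborne(flights, day)
# UA100 is airborne for any cutoff in the 9am-9pm window; DL200 has not
# departed yet, so it is never selected and earns no points either way.
print(cutoff, [f["id"] for f in airborne])
```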

Just how far does this use of time-constrained data go, given that, almost by definition, any model used will have been derived from data outside this period, including its constants, structure, formulae, etc.?

Some further clarity would be helpful to keep pedantry under control. An example could well help too.

I'm still a little bit confused about the objective of the competition.

Let's assume I am a pilot flying, right now, towards my destination airport.

The desired model, such as the final model (whose prediction performance is measured at a particular cut-off time), will give me a fairly good estimate of my actual arrival time at that airport. So I now have a better estimate of my arrival time, one which differs from the original flight plan.

Then, how can I use this knowledge to improve the 'efficiency' of my flight? Do I have to increase or reduce my plane's speed to stick closer to the original plan?

Could anyone, please, clarify the meaning of the objective to predict arrival time in terms of "Make flying more efficient?"

Thanks.

Joexjmmvhm-- that is correct.

MarkA-- your model (parameters, etc.) will be based on the training data. This model then needs to be able to look at one day of the final test data (in isolation) and make predictions. So the sequence is:

1) Build models based on training data, submitting to public leaderboard
2) Finalize code based on full training data set -- submit code -- code (which will include all model parameters, etc.) must be able to look at one day of test data in isolation and make predictions for that day
3) Final test data is released
4) Submit predictions for test data, using submitted code
5) Winners' code will be verified for compliance with these rules
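The sequence above, reduced to a minimal sketch (these names are illustrative, not the competition's actual API): parameters are frozen when code is submitted at step 2, and prediction at step 4 sees one test day in isolation.

```python
# Minimal sketch of the submission workflow described above.
class FrozenModel:
    def __init__(self, avg_delay_minutes):
        # Learned from the training data before submission (steps 1-2);
        # frozen from this point on.
        self.avg_delay_minutes = avg_delay_minutes

    def predict(self, one_day_of_test_data):
        # Steps 3-4: only this single day of test data may be read here;
        # no other days of the final test set are visible.
        return [f["scheduled_arrival"] + self.avg_delay_minutes
                for f in one_day_of_test_data]

model = FrozenModel(avg_delay_minutes=12.0)
print(model.predict([{"scheduled_arrival": 600.0}]))  # -> [612.0]
```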

Farstar-- the pilot wouldn't use these predictions directly. Instead, these predictions will feed into a broader optimization of parameters like the ones you mentioned, like speed. You can't optimize over the outcomes until you know which situations will lead to which outcomes.

David, you wrote:

1) Build models based on training data, submitting to public leaderboard
2) Finalize code based on full training data set -- submit code -- code (which will include all model parameters, etc.) must be able to look at one day of test data in isolation and make predictions for that day

You see, when you "train" a model, it can actually memorise all the data it has seen. So your position, that training the model on all the data is allowed while using data from days before the cutoff at scoring time is not, does not make much sense.

Let me give you an example. Suppose you are building a KNN model that memorises all the data in it and then simply finds the closest neighbours. Although you might think you are using only the current record (data from the current day) for scoring, records from previous days are used as well, because you need to calculate distances to them.
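To make that concrete, here is a toy 1-nearest-neighbour model whose "parameters" are literally the memorised training records (feature and target names are made up for illustration):

```python
# A 1-NN model: its only "parameters" are the stored training rows, so
# every prediction touches data from previous days.
def knn_predict(train, query):
    """train: list of (feature, delay) pairs memorised at training time."""
    nearest = min(train, key=lambda row: abs(row[0] - query))
    return nearest[1]

# (distance_nm, delay_min) pairs, one per historical flight
train = [(100.0, 5.0), (200.0, 20.0), (300.0, 45.0)]
print(knn_predict(train, 190.0))  # -> 20.0, taken from a past day's record
```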

Because you allow training on all the data, my model can memorise it all in the parameters you allow me to store.

Let me give you another example. For this challenge I want to build this type of naive model:

1. For each airport I calculate the average delay of arrival (the average difference between actual and estimated arrival times).

2. For each flight in the air, I simply subtract the estimated arrival time from the cutoff time and add the average delay for the destination airport.
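One reading of the naive model above, sketched with toy data (I take the final prediction to be the estimated arrival plus the destination's average historical delay; times are minutes since midnight and all names are illustrative):

```python
from collections import defaultdict

def fit_airport_delays(history):
    """Step 1: average (actual - estimated) arrival delay per airport,
    computed over ALL historical records -- which is the crux of the
    rules question."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in history:
        s = sums[rec["dest"]]
        s[0] += rec["actual"] - rec["estimated"]
        s[1] += 1
    return {ap: total / n for ap, (total, n) in sums.items()}

def predict(flight, delays):
    """Step 2: estimated arrival plus the destination's average delay."""
    return flight["estimated"] + delays.get(flight["dest"], 0.0)

history = [
    {"dest": "ORD", "estimated": 600, "actual": 630},
    {"dest": "ORD", "estimated": 700, "actual": 710},
    {"dest": "JFK", "estimated": 500, "actual": 505},
]
delays = fit_airport_delays(history)   # ORD: 20.0, JFK: 5.0
print(predict({"dest": "ORD", "estimated": 900}, delays))  # -> 920.0
```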

So here is the question. You say that this naive model is not allowed, because to calculate it I am using average delays for airports that I computed from all the data. To my mind this rule makes no practical sense, because it contradicts your statement that saving model parameters is allowed.

Because any parameters are stored as bytes, and all the data from previous days is stored as bytes too, you cannot allow one thing and prohibit the other.

Do you agree?

Hear, hear... I was kinda wondering about the same thing (i.e. ban on using data from other days vs permission to use parameters trained on those days)

I have been interpreting it this way.  Any data that is available from the competition before the model submission deadline (and prior to the final evaluation set is released) is fair game for the models.  Then, when the final data set is released, the model execution should be restricted to the one day of test data.  So, a KNN model could contain all the training data, but should not add any of the instances from the other days in the final evaluation set to the neighborhood.

For example, a model should produce the exact same predictions for day 4 in the final evaluation set, regardless of how many other days of data are provided in that final evaluation set.

Is this the intent?
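That interpretation has a testable consequence: predictions for a given day must be identical regardless of which other final-set days are present. A toy check (all names here are illustrative):

```python
# Determinism check for the interpretation above: a compliant model reads
# only final_set[day], so adding or removing other days changes nothing.
def predict_day(model, final_set, day):
    return model(final_set[day])

def mean_delay_model(day_data):
    # Toy model whose parameters (the +15.0) were fixed during training.
    return [t + 15.0 for t in day_data]

full = {4: [600.0], 5: [700.0], 14: [800.0]}
only_day4 = {4: [600.0]}

# Same answer for day 4 whether the other days exist or not.
assert predict_day(mean_delay_model, full, 4) == \
       predict_day(mean_delay_model, only_day4, 4)
print(predict_day(mean_delay_model, full, 4))  # -> [615.0]
```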

Fine: let's take a linear regression model as an example. If I have estimated the parameters on, e.g., all days from November, and then I use this model taking only December 4th as input, I am still, implicitly, using the other observations, because they determine what my regression parameters look like. So technically I am in violation of the rules. If you want to avoid this, then training data should not be used at all...

BJG_ wrote:

I have been interpreting it this way.  Any data that is available from the competition before the model submission deadline (and prior to the final evaluation set is released) is fair game for the models.  Then, when the final data set is released, the model execution should be restricted to the one day of test data.  So, a KNN model could contain all the training data, but should not add any of the instances from the other days in the final evaluation set to the neighborhood.

For example, a model should produce the exact same predictions for day 4 in the final evaluation set, regardless of how many other days of data are provided in that final evaluation set.

Is this the intent?

This intent makes no sense. You can imagine a model as a black box loaded with some universal knowledge that makes predictions. You come to this black box, open it, put your new day data into it, step back and then ask "Dear box, when do you think this plane that I have just told you about, will land?". The box uses ALL of its knowledge to answer this question. Both your current data and everything it has seen before.

Here is an example: the IBM Watson machine that beat humans at Jeopardy. It was so clever because it had access to HUGE amounts of text data and could search through it in real time to find answers.

Another example: you are playing chess against an AI. You are losing because it has millions of strategies loaded into it and is optimising its chances of winning by predicting which strategy is best. And you say: "Stop doing that! This is not fair! Please, do not use all the strategies you have learned from previous games. You can use data only from this game!" =)  Ha-ha, that would be funny =)

I'm guessing the primary intent is to avoid using the future data in the prediction.  They do not want a model to use data from Day 14 in the final evaluation set to make predictions for Day 4. 

There is also a chance that the customer is risk averse in this area and does not want a model whose parameters change automatically after it is deployed. They may plan to go through some testing process prior to deploying the model, and want any changes to the model to go through a similar testing process (not learned automatically).

 I am just conjecturing though.

It's not like Jeopardy, but it may be like the chess example.
I think the thing to do is to look at how this will be used in the real world.
They do not want the tool to go through all of history and also include today's data and come to some analysis.
It will not have access to all this data, nor will it have all this time to do analysis and arrive at an answer.

You go to school for x number of years, and you read many books and learn from them. Even though you do not have access to those books at some point in the future, you use what you learned to make decisions today. So you are using the data, but not directly at the time of the decision.

They want something that has learned from all the data in the past, has some formulas and logic with parameters (perhaps even entered by the user) and constants based on that data, and when presented with TODAY'S data only (e.g. today's weather, today's prior delays, today's number of flights), comes to some prediction about arrival times for planes currently in the air. It will not have access to all the historical data at the time it is used. So it's possible to say "It's raining today, so flights will be delayed", but it's not possible to say "It's been raining for the 5th straight day, so flights will be delayed". You will not know how many days in a row it has been raining, and can only predict that if it's raining today, it's x% more likely to cause delays today.
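That kind of model can be sketched as follows; the constant is a made-up stand-in for the "x%", and history survives only as that pre-computed number:

```python
# Historical data is baked into one constant at training time; prediction
# uses only today's observations. (The 1.15 is purely illustrative.)
RAIN_DELAY_FACTOR = 1.15  # learned offline, frozen before deployment

def predicted_delay_minutes(base_delay, raining_today):
    """Inputs are today's observations only; the model cannot know how
    many days in a row it has rained."""
    return base_delay * (RAIN_DELAY_FACTOR if raining_today else 1.0)

print(predicted_delay_minutes(20.0, raining_today=True))   # ~23 minutes
print(predicted_delay_minutes(20.0, raining_today=False))  # 20 minutes
```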

Which comes back to my point, essentially: your "x%" IS based on data other than today's observations ...

My understanding was that 14 days of data are provided from which to build a model. For the final, an additional day's worth of data is provided with a cutoff time. Data after the cutoff time cannot be used. For all flights remaining in the air after the cutoff time, predict the runway and gate arrival times (using the 14 days of data plus the pre-cutoff data from the extra day).

But, reading the forums I'm sure the above is incorrect!

If the final model is to be based on just one day's worth of data prior to the cutoff time, then this is not a lot of data to work with.

In a previous note, I mentioned that the challenge is to "... find a scalable algorithm to provide a real-time profile to pilots ...". Not sure how the above meets the requirements of the challenge.

Realistically, the historical flight data could/should stretch to cover at least one season (i.e. 3 months), combined with data from the previous season, with the flight data constantly updated in real time and predictions available in real time.

Is a dynamic problem being asked to be solved as if it were a static problem? Anyway, lots of confusion.

Konrad Banachewicz wrote:

Which comes back to my point, essentially: your "x%" IS based on data other than today's observations ...

Exactly my point. When you use ANY historical data to optimise your decisions, you are violating the rules. You are proposing to calculate x% based on historical data. You may do this in real time or up front, but it makes no difference.

If one CAN use this data to build a model, then one can simply declare ALL HISTORICAL data to be parameters of the model.

But I agree with jsink that the organisers' intention was to somehow restrict the complexity of the models to make them easily deployable. The problem is that you will not be able to distinguish between model parameters and stored historical data, which is why my point is that all historical data prior to the cutoff point should be allowed when making predictions.

