David, you wrote:
1) Build models based on training data, submitting to public leaderboard
2) Finalize code based on full training data set -- submit code -- code (which will include all model parameters, etc.) must be able to look at one day of test data in isolation and make predictions for that day
You see, when you "train" a model, it can actually memorise all the data it has seen. So your statement that models may be trained on all the data, but that data from before the cutoff date may not be used for scoring, does not make much sense.
Let me give you an example: suppose you build a KNN model, which memorises all of its training data and then simply finds the closest neighbours. Although you might think that you are using only the current record (data from the current day) for scoring, records from previous days are used as well, because you need to calculate distances to them.
Because you allow training on all the data, my model can memorise all of it as parameters, which you allow me to store.
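To make the KNN point concrete, here is a minimal sketch in Python. It is only an illustration: the class name TinyKNN, the feature vectors and the delay values are all made up, and a real entry would use a proper library. The thing to notice is that "fitting" does nothing except store the historical records, so the stored "parameters" are the previous days' data.

```python
import math


class TinyKNN:
    """Toy k-nearest-neighbours regressor (hypothetical, for illustration only)."""

    def __init__(self, k=1):
        self.k = k
        self.rows = []  # the model's "parameters" are the training records themselves

    def fit(self, features, targets):
        # "Training" is nothing more than memorising every record.
        self.rows = list(zip(features, targets))
        return self

    def predict(self, x):
        # Scoring one new record needs the distance to every stored record.
        nearest = sorted(self.rows, key=lambda row: math.dist(row[0], x))[: self.k]
        return sum(target for _, target in nearest) / len(nearest)


# Made-up records from "previous days": feature vector -> delay in minutes.
history_x = [(1.0, 2.0), (4.0, 6.0), (9.0, 9.0)]
history_y = [5.0, 20.0, 45.0]

model = TinyKNN(k=1).fit(history_x, history_y)
prediction = model.predict((4.5, 5.5))  # uses only "today's" record plus the stored model
```

Even though predict() is handed a single record from the current day, it cannot produce an answer without reading every row kept in model.rows.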
Let me give you another example. For this challenge I want to build this type of naive model:
1. For each airport I calculate the average arrival delay (the average difference between actual and estimated arrival times).
2. For each flight in the air I simply subtract the cutoff time from the estimated arrival time and add the average delay for the destination airport.
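The two steps above could be sketched roughly like this in Python. This is an assumption-laden illustration, not actual competition code: the function names, the airport codes, the sample times (minutes since midnight) and the choice to fall back to zero delay for an unseen airport are all hypothetical.

```python
from collections import defaultdict


def fit_airport_delays(history):
    """Step 1: average (actual - estimated) arrival time per airport.

    history is an iterable of (airport, estimated_arrival, actual_arrival),
    with times in minutes (an assumption made for this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for airport, estimated, actual in history:
        sums[airport] += actual - estimated
        counts[airport] += 1
    return {airport: sums[airport] / counts[airport] for airport in sums}


def predict_minutes_remaining(cutoff, estimated_arrival, airport, avg_delay):
    """Step 2: time left until the estimated arrival, plus the airport's average delay."""
    # Unseen airports fall back to zero delay (a choice made for this sketch).
    return (estimated_arrival - cutoff) + avg_delay.get(airport, 0.0)


# Made-up historical flights from "previous days".
history = [("SFO", 600, 610), ("SFO", 700, 720), ("JFK", 500, 495)]
avg_delay = fit_airport_delays(history)  # these averages are the stored "parameters"

remaining = predict_minutes_remaining(
    cutoff=540, estimated_arrival=600, airport="SFO", avg_delay=avg_delay
)
```

Here avg_delay is a small dictionary of numbers, but every number in it is computed from all of the historical data, which is exactly the situation the question below is about.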
So here is the question. You say that this naive model is not allowed, because it uses average airport delays that I calculated from all the data. To my mind this rule does not make practical sense, because it contradicts your statement that you allow model parameters to be saved.
Because any parameters are stored as bytes, and all the data from previous days is also stored as bytes, you cannot allow one and prohibit the other.
Do you agree?