
Completed • $250,000 • 173 teams

GE Flight Quest

Wed 28 Nov 2012 – Mon 11 Mar 2013 (22 months ago)

When I try to read the training data into my C++ program, I'm running into a problem at the end of the file. For the InitialTrainingSet 2012_11_12 flighthistory.csv, the actual data (including '\n' and spaces) is exactly 7784526 characters long. However, if I keep reading until I hit an end of file character (== '\0'), I end up reading 7809940 characters, where the extra characters are gibberish.

My text editor, too, insists that the file is 7809940 characters long, but a direct character count yields only 7784526 characters.

Anyone know what's going on?

You are mistaken. Your "direct character count" didn't take the full line endings (0x0D 0x0A) into account; specifically, it skipped the 0x0D (\r) bytes.

Total number of bytes is 7809940.

Total number of bytes excluding 0x0D is 7784526.

You can also deduce that your "direct character count" didn't take into account new lines by subtracting 7784526 from the total number of bytes 7809940. You will get 25414 which corresponds to the number of lines in that file.

This newline format (0x0D 0x0A) is the typical DOS/Windows format, while a lone 0x0A is the typical Linux format (https://en.wikipedia.org/wiki/Newline).
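The arithmetic above can be checked on any buffer with a few lines of C++. This is only a sketch: the names ByteCounts and count_bytes are made up for illustration, and it assumes the data was read without any newline translation, so the '\r' bytes are still present.

```cpp
#include <cstddef>
#include <string>

// Byte totals for a buffer that still contains its original line endings.
struct ByteCounts {
    std::size_t total;     // every byte, including '\r'
    std::size_t carriage;  // number of 0x0D ('\r') bytes
};

// Hypothetical helper: counts all bytes and the carriage returns among them.
ByteCounts count_bytes(const std::string& data) {
    ByteCounts c{data.size(), 0};
    for (char ch : data) {
        if (ch == '\r') ++c.carriage;
    }
    return c;
}
```

For the file in question, total would be 7809940, carriage would be 25414, and total - carriage would be the 7784526 figure from the "direct character count".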

OK, the plot thickens ...

Thanks for pointing out the '\r'. If I include this as a character as well, then you're right, the full size in the text editor is correct.

However, when actually reading the data in (using fread), the '\r' character seems to be completely ignored; it doesn't appear in the character stream that I get back. I suspect this is the issue.

Here's how I'm doing the input:

char* buffer = new char[large_size];

fread(buffer,1,large_size,file);

As I read through the data, because the '\r' are not present in the stream, I finish reading the last entry after 7784526 characters.  

Then a weird thing happens. 

If I query the buffer again, it jumps back in the file by exactly 25414 characters and continues spitting out the same data again until I hit the end of line.

This jump is exactly the number of missing '\r' characters, so I rather suspect this is the culprit.

Who can solve the mystery of the missing carriage returns?!

Try Python.

These kinds of problems are usually best solved in productivity focused languages with little concern for performance optimization. The underlying algorithms you design should take into consideration the nature of the problem and provide for scalability across a large dataset.

After you have validated your algorithm, then you might want to start optimizing for performance.

One of these days I'll get around to learning Python ...

If anyone else is having this same problem, I've found a work-around.  fread returns the number of characters read, so I don't have to go hunting for an end of file character.  For the data set I've been using, fread returns the smaller number 7784526.  I have no idea why the end of file is located far beyond this point.
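One plausible explanation, which nobody in this thread has confirmed: if the file is opened in text mode ("r") on Windows, the C runtime converts each "\r\n" pair to a single "\n" as it reads, so fread hands back 25414 fewer characters than the file has bytes. fread also never appends a '\0', so anything in the buffer past the returned count is not file data and should not be interpreted. A minimal sketch, assuming the goal is to get the raw bytes; read_into_buffer is a made-up name:

```cpp
#include <cstddef>
#include <cstdio>

// Reads up to `capacity` bytes from `path` into `buffer` and returns how many
// bytes were actually stored. Only buffer[0 .. result) holds file data.
// Returns 0 if the file cannot be opened.
std::size_t read_into_buffer(const char* path, char* buffer, std::size_t capacity) {
    std::FILE* file = std::fopen(path, "rb");  // "rb": binary mode, no newline translation
    if (!file) return 0;
    std::size_t n = std::fread(buffer, 1, capacity, file);
    std::fclose(file);
    return n;
}
```

With "rb" the returned count should match the size the text editor reports, '\r' bytes included. With "r" on Windows the count would be the smaller one, which is also fine as long as you stop at the returned count.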


I'm still curious about what is causing this issue, if anyone can enlighten me ...

Are you checking the file pointer for EOF correctly?

e.g.    while(!feof(fp)) { /* do stuff */ }

I always managed to mangle that part somehow (e.g. using input == EOF)

Also, this may be dependent on what OS/compiler combination you're using. I seem to remember some of them try to be fancy (read: non-standard) and deal with Linux/Windows/Mac line ending conversions, while others don't. Although it's been a very long time since I've had to deal with low-level file IO.

Seriously though, C/C++ would be [almost] the last language I would use to do this. That you're having this problem seems proof enough.

I'm actually looking at the character stream and stopping when I see the '\0' character.  Perhaps feof would work properly.
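For what it's worth, a file has no '\0' marker at its end, and a buffer from new char[] starts out uninitialized, so scanning for '\0' only stops wherever a zero byte happens to be. Note also that feof only becomes true after a read has already hit the end, so it is safer to loop on fread's return value and ask feof/ferror afterwards why the loop stopped. A rough sketch; count_stream_bytes is a made-up name:

```cpp
#include <cstdio>

// Counts the bytes in an open stream by looping on fread's return value.
// Returns -1 if the loop ended because of a read error.
long count_stream_bytes(std::FILE* fp) {
    char chunk[4096];
    long total = 0;
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, fp)) > 0) {
        total += static_cast<long>(n);  // process chunk[0 .. n) here
    }
    // fread returned 0: either end of file or an error.
    if (std::ferror(fp)) return -1;
    return total;  // end of file reached; std::feof(fp) is now nonzero
}
```

The same loop works when the data itself contains zero bytes, which a '\0' scan would mistake for the end.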
