Forgotten Features

The ancient wizards who defined the ASCII standard knew what they were doing. ASCII, for those who have not come across it, is the standard means of encoding mainly textual data as a stream of 7- or 8-bit bytes¹ for transmission or storage. It underlies how your computer represents text inside. So the character A has value 65, which is 01000001 in binary, on the wire or on the disk. Codes 32-126 are printable characters like these you are reading. Codes 0-31 (and 127) are special: they are called control codes, and one use for them is for devices to communicate instructions. A device reading code 10 knows to advance to a new line, or on reading code 8 to move one space backwards (these are how your ↵ and ← keys work, under the covers). Some of them don’t really make sense in the modern context: carriage return (code 13) in its original usage would cause the receiver to physically move the print head of a teleprinter or printer back to the left, and line feed (code 10) would physically advance the paper to the next line. Nevertheless the codes are still there, and modern operating systems know how to interpret them for modern devices.
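If you want to poke at this yourself, here is a minimal Python sketch of the numbers mentioned above. The variable names are mine and purely illustrative.

```python
# 'A' is code 65, which is 01000001 when written as an 8-bit byte.
code = ord("A")
bits = format(code, "08b")

# Codes 32-126 are the printable characters (space through tilde).
printable = "".join(chr(c) for c in range(32, 127))

# A few control codes under their conventional names.
BACKSPACE, LINE_FEED, CARRIAGE_RETURN = 8, 10, 13

# Encoding to ASCII gives exactly one byte per character.
encoded = "A\b\n\r".encode("ascii")

print(code, bits)       # 65 01000001
print(len(printable))   # 95
print(list(encoded))    # [65, 8, 10, 13]
```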

ASCII control codes also include values for structuring ASCII files or streams of bytes, for example to represent tabular data. The important thing about them is that they are outside the range of values that they delineate, so their meaning is always unambiguous. Code 28 is the file separator, 29 the group separator, 30 the record separator and 31 the unit separator. So it is easy to encode one or more tables of data in one chunk of ASCII. Or at least it should be, but…
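As a rough sketch of what that could look like in Python: here I am assuming a unit is a field, a record is a row, and a group is a whole table, which is one reasonable reading of the names rather than anything mandated. The function names `encode_tables` and `decode_tables` are made up for illustration.

```python
# The four ASCII separator control codes (28-31).
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def encode_tables(tables):
    """Encode a list of tables (each a list of rows of str fields).

    Assumed mapping: US between fields, RS between rows, GS between tables.
    """
    for table in tables:
        for row in table:
            for field in row:
                # The scheme only works if the data never contains a separator.
                if any(sep in field for sep in (FS, GS, RS, US)):
                    raise ValueError("field contains a separator control code")
    return GS.join(
        RS.join(US.join(row) for row in table)
        for table in tables
    )

def decode_tables(blob):
    """Inverse of encode_tables for non-empty tables and rows."""
    return [
        [record.split(US) for record in group.split(RS)]
        for group in blob.split(GS)
    ]

tables = [
    [["name", "quote"], ["Smith, John", 'said "hi"\non two lines']],
    [["x", "y"], ["1", "2"]],
]
assert decode_tables(encode_tables(tables)) == tables
```

Commas, quotes and newlines in the data need no escaping at all, because none of them is a delimiter. This sketch does not try to distinguish an empty table from a table holding one empty field; a real format would need to pick a convention for that.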

There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources.

Everyone seems to have entirely forgotten that these separator codes exist! There is nothing weird or exotic about them: the first edition of the standard was published in 1963 and ASCII has been baked into nearly every computer ever since! Every time I need to deal with CSV files², a format which is full of edge cases that no one can agree on, I despair a little at the state of the profession and our claims of being software engineers. And that’s before we even get onto more recent wheel-reinventing like XML, JSON, YAML… For a start, anything that uses normal printable characters³ to delineate records or otherwise impose structure becomes unwieldy as soon as you want to have one of those characters in the data. Everyone who has had to deal with angle brackets or ampersands in HTML has been bitten by this at one point! I suppose one point in CSV’s favour is that at least each delineator is only one byte, unlike the others, which have a great deal of overhead.
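To make the printable-delimiter problem concrete, here is a small illustrative comparison using Python's standard `csv` module with its default dialect. The sample row is invented.

```python
import csv
import io

row = ["Smith, John", 'said "hi"']

# Write the row as CSV; the default dialect has to quote both fields
# and double the embedded quote characters.
buf = io.StringIO()
csv.writer(buf).writerow(row)
csv_line = buf.getvalue()

# A naive reader that just splits on commas gets the wrong answer.
naive = csv_line.strip().split(",")

# With the unit separator (code 31) there is nothing to escape.
US = "\x1f"
us_line = US.join(row)

print(repr(csv_line))             # '"Smith, John","said ""hi"""\r\n'
print(len(naive))                 # 3 pieces instead of 2 fields
print(us_line.split(US) == row)   # True
```

The CSV version only round-trips if the reader implements the same quoting rules as the writer, which is exactly where applications tend to disagree.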

¹ The extra bit could be used to get another 128 chars, or for error detection.
² Many times per day.
³ There’s no reason a text editor couldn’t display something for the control codes, when it can easily show paragraph breaks as ¶ or whitespace as ◊.

About Gaius

Jus' a good ol' boy, never meanin' no harm