Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
KurtS
Alteryx Alumni (Retired)

Recently, I had to fix a problem with the older Excel output from Composer. The initial problem was that a particular large file would only open in Excel after a warning and having Excel "repair" the file. It then lost all of its formatting--and worse--some of its data. I found that the problem had to do with the conversion to Unicode, and changed the place in the Composer code that caused the problem.

 

I figured that I had the problem solved, so I re-generated a test file, and, it opened--sort of... Now, I was getting a message from Excel saying only that there was a 'problem' with the file (of course, not telling me what the problem was), with only an "OK" button. The document opened fine with all its data, and all the correct formatting. What was wrong with the file now???

 

Eventually, I found out that others who created Excel files had also come across the same problem, after a certain Security Patch had been installed. It might have to do with setting numbers of columns used. So, I had to find out where in the file I might have some bad column information. That meant reading the raw data of the file, which is in a binary format.

 

Armed with the "file format specification", I opened the file as binary. Here is what a small part of the file looks like, when opened in this way:

 

 

The column on the left has offsets from the beginning of the file; the "stuff" on the right is a conversion of the binary codes to printable characters (pretty much useless in this case); and the columns of numbers in the middle represent the raw file data. You might wonder : 'Why are there letters in there?' Well, it's hexadecimal. (You might remember from a math class in school a long time ago, being taught about using number systems besides "base 10"-- which is what humans use to count. Well, those numbers are "base 16", and since we don't have 16 actual digits, we use the letters "A" through "F" to make up the missing digits.) Hexadecimal corresponds well to actual binary (which, of course are just ones and zeros), where each digit represents one half byte.

 

The file contains sections, called records, which start with a number that tells what the record contains. I eventually discovered that the record with the problem was the Dimensions record. It helps Excel figure out how to display the file in a window, and is denoted by the number '512'. It's in the example binary above-- Can you see it? No? First, we need to convert "512" to hexadecimal. This one is actually kind of easy, as 512 is a multiple of 16... Okay, I'll help--The number we need is 200.

 

Then we also have to consider--"Endianness". I'll let you look that up in Wikipedia. Basically, it's how data gets stored in computers. Intel processors (and therefore, MS Windows) use what's called "Little-endian"--which means that the "little" numbers come first. (Like saying "4 and 20" instead of "24".) So, that means, you might need to "swap" columns of numbers, but only for data that is larger than a byte. You first need to know how big each piece of data should be. The specs tell you that. So, the number we are looking for--512-- would be "02 00" (200) in hexadecimal, and would actually be stored as "00 02" some place above.

 

Found it yet? You probably see a few of them. Most are just part of actual data. So, you can't "jump" to it. I had to find the beginning of the workbook portion, then I could move through the file as follows: Get the type number ("swap" it so you can read it, convert it to base 10, then look it up in the spec); use the spec to figure out what should be in it; if it's okay, move to the next one. Repeat this an indefinite number of times...

 

I found that this particular Dimensions record, was getting the number of rows placed for both the rows and the columns used in the file. Remember, the old Excel format limits you to 256 columns. But, since the Dimensions record is only used to help the application display the file, Excel really didn't care about whether that was accurate, until some hacker tried to exploit that problem--thus, the need for a Security Patch.

 

Now, I could go to the code, to find why the column number was wrong, and it ended up being very simple. The original author had just "cut and pasted" one piece of code to another and hadn't changed all of the variable names correctly. Therefore, the code was using the variable that was holding row information for modifying column information. Simple change--problem solved!

 

Whew....

-Kurt