Dealing with .csv files in Cocoa – writing an importer

Mac users who need to deal with large amounts of data might justifiably feel like they got the short end of the stick as far as Microsoft Office is concerned. The Mac version of Excel doesn’t support Visual Basic and isn’t multithreaded, meaning that:

a) as soon as you get above a couple of tens of thousands of rows, any formula more complicated than summing a column freezes the UI for a couple of minutes while it crunches the numbers, and

b) any data analysis you want to do which involves iteration results in formulae consisting of ten lines of densely nested brackets, which are nearly impossible to read or debug.

Like any bad programmer, I implicitly believe that my language of choice is the perfect tool for any job, and hence when recently confronted with a very large stack of data I needed to analyse I decided it would be easiest to Object Oriented the hell out of it with a small custom C application.

This meant parsing .csv files, which are about the most human-readable way of getting data out of Excel. CSV stands for Comma Separated Values; each row of your exported Excel chart is separated by a newLine character, and the data contained within each cell of the chart is separated by commas, like so:

Row 1 Column 1, Row 1 Column 2, Row 1 Column 3, Row 1 Column 4 \n
Row 2 Column 1, Row 2 Column 2, Row 2 Column 3, Row 2 Column 4 \n
…..

So far, so straightforward, right? That newLine character at the end of each row is a godsend for dividing the contents of the file into individual rows which can then each be parsed separately, by loading the .csv file into a single NSString and then using the componentsSeparatedByCharactersInSet: method of NSString to yield an array of NSStrings. It therefore seems logical to turn your .csv file into a 2-dimensional array, with a main array of rows, and subarrays containing each column’s element.

Unfortunately, things get tricky at this point. Excel has several different ways of denoting columns, so that instead of the neat, orderly arrangement in the example above, you get something like this:

Title 1, Title 2, Title 3, Title 4, Title 5, Title 6, Title 7, Title 8 \n
Guy,Ninja,,Sandwich,,,”Ninjutsu” \n
“Haggar, Mike”, Mayor, Clothesline, Ham Roast,” ” \n

Which is clearly going to be more difficult to deal with. If you try to divide up each row into columns by using [NSString componentsSeparatedByString:@”,”], you’ll end up with more columns than you wanted because of the comma separating Lastname, Firstname in the first column. Similarly, you can’t just divide things up using quotemarks; some of the columns represented aren’t separated by quotes. Also, what to you do with those bunches of commas, which denote a series of empty cells? Hey, that last line only contains five items, but there are eight columns!

In the end I chose a method which uses the steps:

1) Divide the initial string into an array of strings, each of which represents a row, by separating its components by the newLine character.
2) Divide each string in this array into an array of strings, separated by the comma character.
3) Scrutinize each string to determine whether it has been improperly terminated by a comma in the actual data value – in the example above, “Haggar, Mike” becomes two strings, ‘”Haggar’ and ‘ Mike”‘.
4) If the string starts with a quote mark but does not end with one, assume that it has been inappropriately divided; get the next string and combine the two
5) Remove whitespace – if you want to get e.g. numerical values by using, say, [aString floatValue] or [aString dateValue], leading whitespace will usually give a false result.
6) Remove leading and trailing quotes.

Here’s an example method, twoDimensionalMutableArrayOfStringsFromCSVData:. Note that it is using ARC and will leak a bunch of memory if you don’t use ARC, too. I have also separated out methods for reuniting strings which should not have been divided, and removing extraneous quote marks. Note that there are a fair number of comparisons of characters with “34” – this is the char denoting quote marks.

- (void)reuniteImproperlyDividedColumnStringsFromArray:(NSArray *)rowArray toArray:(NSMutableArray *)receiverArray {
    for (int i = 0; i < [rowArray count]; i ++) {
        NSString *stringToExamine = [rowArray objectAtIndex:i];
 
        NSUInteger column_string_length = [stringToExamine length];
        if (column_string_length > 0) {
            char firstCharacter = [stringToExamine characterAtIndex:0];
            char lastCharacter = [stringToExamine characterAtIndex:(column_string_length -1)];
            //Does the string to be examined start with a quote mark but not end with one?
            //If so, the string has been impropery divided - reconstitute it
            if ((firstCharacter == 34) && !(lastCharacter == 34) &&  i < ([rowArray count] - 1)) {
                for (int j = (i + 1); j < [rowArray count]; j ++) {
                    NSString *secondStringToExamine = [rowArray objectAtIndex:j];
                    char secondStringLastCharacter = [secondStringToExamine characterAtIndex:([secondStringToExamine length] - 1)];
                    stringToExamine = [stringToExamine stringByAppendingFormat:@", %@", secondStringToExamine];
                    if (secondStringLastCharacter == 34) {
                        i = j;
                        break;
                    }
                }
                [receiverArray addObject:stringToExamine];
                //skip the next string in the array, we've already appended it to the current string
            } else {
                //the string hasn't been improperly divided, add it to the output array "as is"
                [receiverArray addObject:stringToExamine];
            }
        } else {
            //the string is 0 characters long, put in an empty string to avoid losing a column
            [receiverArray addObject:@" "];
        }
    }
}
 
- (void)stripQuotesFromString:(NSString *)aStringToStrip andAddToArray:(NSMutableArray *)quotesStrippedMutableArray {
    if ([aStringToStrip length] > 0) {
        NSMutableString *aStrippedString = [[NSMutableString alloc] initWithString:aStringToStrip];
        char firstCharacter = [aStrippedString characterAtIndex:0];
        char lastCharacter = [aStrippedString characterAtIndex:([aStrippedString length] - 1)];
 
        if ((lastCharacter == 34) && [aStringToStrip length] > 1) {
            [aStrippedString replaceCharactersInRange:NSMakeRange(([aStrippedString length] - 1), 1) withString:@""];
        }
 
        if ((firstCharacter == 34) && [aStringToStrip length] > 0) {
            [aStrippedString replaceCharactersInRange:NSMakeRange(0, 1) withString:@""];
        }
        [quotesStrippedMutableArray addObject:aStrippedString];
    }
 
    else {
        [quotesStrippedMutableArray addObject:@""];
    }
}
 
- (NSMutableArray *)twoDimensionalMutableArrayOfStringsFromCSVData:(NSData *)csvData {
 
    NSMutableArray *outputArray = [[NSMutableArray alloc]init];
 
    //turn data loaded from .csv file into an NSString
    NSString *csvDataString = [[NSString alloc] initWithData:csvData encoding:NSUTF8StringEncoding];
    //divide into an array of strings, each representing a row in the original data
    NSArray *arrayOfRows = [csvDataString componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
 
    //work out the maximum possible number of columns
    //unfortunately, in order to do this you need to look at every single string
 
    unichar commaCharacter = [@"," characterAtIndex:0];
    NSInteger maximumPossibleNumberOfColumns = 0;
 
    for (NSString *aRowString in arrayOfRows) {
        NSInteger numberOfPossibleColumnsInThisRow = 0;
        //count commas in this row
        for (NSUInteger string_location = 0; string_location < [aRowString length]; string_location ++) {
            if ([aRowString characterAtIndex:string_location] == commaCharacter) {
                numberOfPossibleColumnsInThisRow ++;
            }
        }
        if (numberOfPossibleColumnsInThisRow > maximumPossibleNumberOfColumns) {
            maximumPossibleNumberOfColumns = numberOfPossibleColumnsInThisRow;
        }
    }
 
    //NSLog(@"%li possible columns", maximumPossibleNumberOfColumns);
 
    //parse each row string
    for (NSString *aRowString in arrayOfRows) {
        //make sure the string isn't empty
        NSUInteger string_length = [aRowString length];
        if (string_length > 0) {
            NSArray *possiblyInaccurateArrayOfColumns = [aRowString componentsSeparatedByString:@","];
            //NSLog(@"%lu", [possiblyInaccurateArrayOfColumns count]);
            NSMutableArray *arrayOfColumnsReunitingImproperlyDividedValues = [[NSMutableArray alloc]init];
 
            [self reuniteImproperlyDividedColumnStringsFromArray:possiblyInaccurateArrayOfColumns toArray:arrayOfColumnsReunitingImproperlyDividedValues];
            //end reuniting improperly divided values
 
            //there might be a bunch of columns skipped from the end of this row
            //if this is the case, fill in a bunch of "dummy" columns
 
            while ([arrayOfColumnsReunitingImproperlyDividedValues count] < maximumPossibleNumberOfColumns) {
                [arrayOfColumnsReunitingImproperlyDividedValues addObject:[NSString stringWithFormat:@" "]];
            }
 
            //strip out unnecessary quote marks and whitespace
 
            NSMutableArray *strippedQuotesMutableArray = [[NSMutableArray alloc]init];
            for (NSString *aStringToStrip in arrayOfColumnsReunitingImproperlyDividedValues) {
                [self stripQuotesFromString:aStringToStrip andAddToArray:strippedQuotesMutableArray];
            }
 
            [outputArray addObject:strippedQuotesMutableArray];
        }
    }
    return outputArray;
}

There are a couple of points which bear explanation; the first is that there seems to be some debate about which encoding .csv actually uses. According to Google, Excel for Mac uses MacRoman. For basic English characters, UTF-8 appears to work without mangling the text. Also, there are a fair few rather ugly checks against string length. Unfortunately, these are largely unavoidable as many string methods will simply explode on meeting a string of length 0. Finally, there is a use of NSMutableString; I will deal more with this in a subsequent post on exporting from .csv; the usage here is to avoid gigantic memory allocations which occur when appending strings in a tight loop. If you try to manage this sort of operation with [NSString stringByAppendingString:], you will suddenly find your program using up your entire free system memory and c..r..a..w..l..i..n..g because this method of NSString creates a new copy every time it is used. In the context of ARC, where you can’t manage your own memory within the loop readily, NSMutableString or memory pools are your best options.

Next time: sample code and writing an exporter.

Leave a Reply

Your email address will not be published. Required fields are marked *