Sunday, March 15, 2009

Parsing comma-separated values (CSV) in JavaScript

This post is about a little JavaScript function that lets you parse comma-separated (CSV) text. The function takes a CSV string as parameter and returns an array of "records", each of which is itself an array of strings. Most importantly, the function accounts for edge cases, takes care of quoted fields spanning multiple lines as well as embedded quotes.

The code relies heavily on regular expressions, but is compact enough that you can port it to your favorite language (Perl, Java, Python, whatever) in short order.

Before looking at the code itself, here is a short demo to you can try. You can paste your own test strings into the text box below for testing.


Paste the CSV text below, and click





Here is the code. For simplicity, I have assumed that fields are separated by commas and that the quoting character is the double quote, but those choices are easy to configure. The function uses two regular expressions to compute the end points of the CSV fields, and repeatedly matches these expressions against the string.

I would be delighted to hear your comments.

function parseCSV (csvString) {
var fieldEndMarker = /([,\015\012] *)/g; /* Comma is assumed as field separator */
var qFieldEndMarker = /("")*"([,\015\012] *)/g; /* Double quotes are assumed as the quote character */
var startIndex = 0;
var records = [], currentRecord = [];
do {
// If the to-be-matched substring starts with a double-quote, use the qFieldMarker regex, otherwise use fieldMarker.
var endMarkerRE = (csvString.charAt (startIndex) == '"') ? qFieldEndMarker : fieldEndMarker;
endMarkerRE.lastIndex = startIndex;
var matchArray = endMarkerRE.exec (csvString);
if (!matchArray || !matchArray.length) {
break;
}
var endIndex = endMarkerRE.lastIndex - matchArray[matchArray.length-1].length;
var match = csvString.substring (startIndex, endIndex);
if (match.charAt(0) == '"') { // The matching field starts with a quoting character, so remove the quotes
match = match.substring (1, match.length-1).replace (/""/g, '"');
}
currentRecord.push (match);
var marker = matchArray[0];
if (marker.indexOf (',') < 0) { // Field ends with newline, not comma
records.push (currentRecord);
currentRecord = [];
}
startIndex = endMarkerRE.lastIndex;
} while (true);
if (startIndex < csvString.length) { // Maybe something left over?
var remaining = csvString.substring (startIndex).trim();
if (remaining) currentRecord.push (remaining);
}
if (currentRecord.length > 0) { // Account for the last record
records.push (currentRecord);
}
return records;
};


14 comments:

spawn said...

Why all this?
Isn't it more simple to use 2 split functions : one for new lines, to detect the rows, and one for coma to detect the fields?

M. A. Sridhar said...

Thanks for the comment. I'm not sure I understand, though. If you split it into lines first, you might have a field, say, that starts toward the end of line 3 and ends in the first part of line 7. So you would have to ensure that you don't split lines that are part of that field. The algorithm becomes a little complicated in that approach.

I would definitely like to see an example to illustrate your approach, though.

prairiedock said...

This breaks if a comma appears in a quoted field. (In the example, try putting a comma before the closing quote of " quoted words").

Unknown said...

Perfect.
exactly what I was looking for.
I was just about to write it myself when I found this thanks a lot!!

Unknown said...

A couple of comments:
This looks like a nice clean framework.
What are the \015\012 for? Did you mean \010\013 ?

Also this did break on input like:
"Coming over for ""graduation
party"","woo hoo!"""

the fix I applied is:
var qFieldEndMarker = /([^"])"([,\n\r] *)/g;

Unknown said...

sorry. how about:

var qFieldEndMarker = /(("")|([^"]))"([,\n\r] *)/g;

for a fix.

Also one potential issue with this is that it works well for well-formed input but doesn't throw any errors for poorly formed input.
Your example for Two "quoted words" is one such. The line is invalid csv and should be:
"Two ""quoted words""",,,high

Jerome said...

You da man!

I have spent many hours trawling the interweb for a decent function to do this. RegEx is a bit of a closed book to me. The only one I found that worked was as slow as my 'look at each char and add it or something else to another string' option.

I could do it in .Net, and the speed would be OK, but in JScript, Yawn! 5 hours for 5,000 records! (including inserts into databases)

This one parsed the 5,000 in seconds, all embedded commas and quotes intact, no spurious delimiters.

Whoever invented CSV should be tarred and feathered, IMO.

Many thanks - I owe you a beverage of choice!

shawnderson said...

Anyone have a js function that reads in the CSV file from local machine as a string?

selfexistent said...

Your parser doesn't work for the following string (within the curly brackets):

{1,2,"A quoted field with alternating commas and double-quotes ,"","",",3,4}

selfexistent said...

That is I would expect the string (within the curly brackets):

{1,2,"A quoted field with alternating commas and double-quotes ,"","",",3,4}

to be parsed into the array:

["1","2","A quoted field with alternating commas and double-quotes ,\",\",","3","4"]

selfexistent said...

//Here's some test cases I wrote
//using jasmine unit test framework
//in javascript:

describe("String.prototype.splitCSV", function() {

var testStrs = ["1,2,A simple field,3,4",
"1,2,A simple field with a\n newline,3,4",
"1,2,A simple field with a double-quote \" inside it.,3,4",
"1,2,\"A quoted field with a double-quote inside it \"\" to make you feel happy.\",3,4",
"1,2,\"A quoted field with a comma inside it, albeit only one.\",3,4",
"1,2,\"A quoted field with two adjacent commas,, that make two.\",3,4",
"1,2,\"a,b,\"\"c,d,e\",3,4",
"1,2,\"A quoted field with alternating commas and double-quotes ,\"\",\"\",\",3,4"];

var expectedResults = [["1","2","A simple field","3","4"],
["1","2","A simple field with a\n newline","3","4"],
["1","2","A simple field with a double-quote \" inside it.","3","4"],
["1","2","A quoted field with a double-quote inside it \" to make you feel happy.","3","4"],
["1","2","A quoted field with a comma inside it, albeit only one.","3","4"],
["1","2","A quoted field with two adjacent commas,, that make two.","3","4"],
["1","2","a,b,\"c,d,e","3","4"],
["1","2","A quoted field with alternating commas and double-quotes ,\",\",","3","4"]];


beforeEach(function() {
//no setup needed for tests
});

for(var i=0;i<testStrs.length;i++)
{
var testStr = testStrs[i];
var expectedResult = expectedResults[i];
it("Expect string:\"" + testStr + "\" to be split into:" + expectedResult.toString(), (function(testStr,expectedResult){
return function(){
var result = testStr.splitCSV();
expect(result).toEqual(expectedResult);
}
})(testStr,expectedResult));
}



});

selfexistent said...

Here's a modified version of your code that only parses input as a single record with potentially large amounts of columns (fields). This version passes the above unit tests in jasmine unit test framework:

String.prototype.trim = function() { return this.replace(/^\s+|\s+$/g, ''); };



String.prototype.splitCSV = function() {
var fieldEndMarker = /([,] *)/g; // Comma is assumed as field separator
var qFieldEndMarker = /(("")|([^"]))"([,] *)/g;// Double quotes are assumed as the quote character
var startIndex = 0;
var currentRecord = [];
var csvString = this;
do {
// If the to-be-matched substring starts with a double-quote, use the qFieldMarker regex, otherwise use fieldMarker.
var endMarkerRE = (csvString.charAt (startIndex) == '"') ? qFieldEndMarker : fieldEndMarker;
endMarkerRE.lastIndex = startIndex;
var matchArray = endMarkerRE.exec (csvString);
if (!matchArray || !matchArray.length) {
break;
}
var endIndex = endMarkerRE.lastIndex - matchArray[matchArray.length-1].length;
var match = csvString.substring (startIndex, endIndex);
if (match.charAt(0) == '"') { // The matching field starts with a quoting character, so remove the quotes
match = match.substring (1, match.length-1).replace (/""/g, '"');
}
currentRecord.push (match);

startIndex = endMarkerRE.lastIndex;
} while (true);
if (startIndex < csvString.length) { // Maybe something left over?
var remaining = csvString.substring (startIndex).trim();
if (remaining) currentRecord.push (remaining);
}

return currentRecord;
};

cjjesus said...

hi.
im looking for some kind of portable database without working with any server.
objective is for a ftp host it doesent have sql or php support
or local database as read-only
my proposition is:

[database.zip]
-table1.csv
-table2.csv
-table3.csv
-table4.csv

- read zip file(zip.js)
- make select with csv tables
- display text content

hope anyone can work with this project
cheers. ;)

Anonymous said...

Awesome! I will have to study this to learn something but, as it is, it saved me a TON of time!

Thanks so much!

(btw: I seem to get a blank record following each set of records, but, I'm sure it is something I am doing. (exported csv from openoffice spreadsheet with embedded commas in numbers and text)