Sunday, March 15, 2009

Parsing comma-separated values (CSV) in JavaScript

This post is about a little JavaScript function that lets you parse comma-separated (CSV) text. The function takes a CSV string as parameter and returns an array of "records", each of which is itself an array of strings. Most importantly, the function accounts for edge cases, takes care of quoted fields spanning multiple lines as well as embedded quotes.

The code relies heavily on regular expressions, but is compact enough that you can port it to your favorite language (Perl, Java, Python, whatever) in short order.

Before looking at the code itself, here is a short demo to you can try. You can paste your own test strings into the text box below for testing.


Paste the CSV text below, and click





Here is the code. For simplicity, I have assumed that fields are separated by commas and that the quoting character is the double quote, but those choices are easy to configure. The function uses two regular expressions to compute the end points of the CSV fields, and repeatedly matches these expressions against the string.

I would be delighted to hear your comments.

function parseCSV (csvString) {
var fieldEndMarker = /([,\015\012] *)/g; /* Comma is assumed as field separator */
var qFieldEndMarker = /("")*"([,\015\012] *)/g; /* Double quotes are assumed as the quote character */
var startIndex = 0;
var records = [], currentRecord = [];
do {
// If the to-be-matched substring starts with a double-quote, use the qFieldMarker regex, otherwise use fieldMarker.
var endMarkerRE = (csvString.charAt (startIndex) == '"') ? qFieldEndMarker : fieldEndMarker;
endMarkerRE.lastIndex = startIndex;
var matchArray = endMarkerRE.exec (csvString);
if (!matchArray || !matchArray.length) {
break;
}
var endIndex = endMarkerRE.lastIndex - matchArray[matchArray.length-1].length;
var match = csvString.substring (startIndex, endIndex);
if (match.charAt(0) == '"') { // The matching field starts with a quoting character, so remove the quotes
match = match.substring (1, match.length-1).replace (/""/g, '"');
}
currentRecord.push (match);
var marker = matchArray[0];
if (marker.indexOf (',') < 0) { // Field ends with newline, not comma
records.push (currentRecord);
currentRecord = [];
}
startIndex = endMarkerRE.lastIndex;
} while (true);
if (startIndex < csvString.length) { // Maybe something left over?
var remaining = csvString.substring (startIndex).trim();
if (remaining) currentRecord.push (remaining);
}
if (currentRecord.length > 0) { // Account for the last record
records.push (currentRecord);
}
return records;
};


Wednesday, December 17, 2008

SQL queries for creating reports across multiple tables

I often run into cases where I need to extract fields from multiple tables and show them all in a report. And I need to get all the fields with just one SQL query. I want to limit it to a single query for two reasons: one, I want the database to return the records in the user-specified order (e.g., ordered on a particular field or two); and two, using a single query usually is very efficient because exploits the capabilities of the database engine's query planner.

But designing the right query is sometimes difficult. In particular, showing the values referred to by foreign keys and picklists are a bit of a challenge, particularly when the database schema allows foreign keys or picklist references to be null. So I felt it worthwhile to document this problem as best I can, along with a couple of approaches for solving it. Your comments, of course, are welcome.

A simple example


The first is a simple example with just two tables, a Person entity table containing information about a person, and a Profession picklist table that maintains the list of professions. Here are the relevant fields in these tables.

Profession
id (primary key)
display_string
display_rank (integer)

Person
person_id (primary key)
full_name
email_address
profession_id (nullable foreign key reference to Profession.id)

Our problem is to produce a single query that retrieves the full name, email address and display_string field of the profession for every Person record in the table, ordered by the Profession's display_string. The output might look like this:

My Cousin Willie will@goodfor.not
Nelson Mandela nelson@gov.za Politics
Winston Churchill churchill@gov.uk Politics
Anatoly Karpov mike@jo.com Sports
Mother Teresa mo@heaven.net Social service

The first attempted solution is this one, using a simple join:
select Person.full_name as name, Person.email_address as email, Profession.display_string as profession from Person, Profession where Person.profession_id = Profession.id order by profession;
This works, but produces only those Person records that have a non-null value in their profession_id column. Since we want to also produce the Person records that have a null profession id, we have to resort to a left join, like this:
select Person.full_name as name, Person.email_address as email, Profession.display_string as profession
from Person left join Profession on Person.profession = Profession.id
order by profession;
Alternatively, we could use a correlated subquery like this:
select Person.full_name as name, Person.email_address as email,
(select Profession.display_string from Profession where Person.profession_id = Profession.id) as profession from Person order by profession;


A more complex example


Real-world situations are rarely limited to just two tables. So to make our problem more realistic, we'll extend it to where it spans two more tables. Person and Activity tables are major entities, and Profession and ActivityStatus tables are picklist containers. In addition, Activity has a nullable foreign key reference to Person, so that we can optionally identify a contact with whom the activity is associated.

So here is a representation of the two additional tables:

ActivityStatus
id (primary key)
display_string

Activity
activity_id
subject
time
status (foreign key to ActivityStatus,id, may not be null)
person_id (foreign key to Person, may be null)


This time, we are required to produce a single report containing all Activity records, along with the contact information (full name, email address and profession) of the person with whom the activity is associated. If there is no person associated with an activity, we want those columns to show empty in the report. And just for fun, suppose we want the report ordered by the name of the person associated with the activity. So our report might look like this:

Subject AtTime Status WithWhom DoesWhat
Golf game 2 pm Completed Anatoly Karpov Sports
Uno session 1 am Pending My cousin Willie
Breakfast meeting 8 am Pending Nelson Mandela Politics
Dinner meeting 3 pm Completed Nelson Mandela Politics
Dinner meeting 7 pm Completed Winston Churchill Politics

To create this query, we can try grouping the fields we want according to the tables in which they occur, and hooking them together using joins and subqueries, like this:

select Activity.subject, Activity.time as AtTime, ActivityStatus.status as Status,
p.full_name as WithWhom,
(select Profession.display_string from Profession where Profession.id = p.profession) as DoesWhat
from Activity
left join (select full_name, profession_id from Person) as p on p.person_id = Activity.person_id,
ActivityStatus
where ActivityStatus.id = Activity.status
order by WithWhom;
Here we have used a left join to extract the fields of the Person records, and a correlated subquery to extract the picklist value of the profession. We could just as easily have used a left join for the picklist value as well. In this sense the left join and the correlated subquery are equivalent in expressive power, but their performance may vary depending on the database engine's query planner.

One other thing worth noting: The query above is Activity-centric, in the sense that it produces Activity records as the basis, coupled with information about associated Persons. This is the reason why the Activity fields occur at the main level of the select query, and the Person fields occur in the join portion. If we had instead wanted a Person-centric report, with the Person records along with any associated activities -- which would likely yield a much larger result set -- we could have written a query that puts the Person fields in the outer level and the Activity fields in the left join, like this:

select Person.full_name, Person.email_address,
(select display_string from Profession where Profession.id = Person.profession_id) as DoesWhat,
ac.subject, ac.AtTime
from Person
left join
(select Activity.subject, Activity.time as AtTime, Activity.status from Activity) as ac
on ac.person_id = Person.person_id;


Summary


Here are a few guidelines I've found useful when creating a multi-table query:
  • Figure out which of the main entities is the "base" entity, and use its fields at the top level of the query.
  • Joins are expensive, and left joins even more so. Therefore, use a left join only if (a) the foreign key involved in the join is nullable and (b) you want the null values to show in your results. Otherwise, use an equi-join.
  • A correlated subquery provides a simple means to extract a single foreign-key dereference value.


Comments welcome!

Saturday, November 15, 2008

JSP-like templates in JavaScript

A template is a piece of (usually, but not necessarily, HTML) code that contains symbolic references to data variables. You can use a template, for example, when you need to create a table with a known, fixed layout, but you want to re-create its content with different data elements.

This post is focused on templating on the client side, where we have JavaScript code that needs to expand a template using a JavaScript data structure.There are several templating techniques out there that address this situation. For my part, I needed a template mechanism that looks like JSP, except that the code fragments must be written in JavaScript, and for one reason or another, none of these fit my needs.

In particular, I wanted a templating mechanism with the following features:

  • Templates that support arbitrary JavaScript data structures (nested maps and arrays) as data models.

  • Ability to create a pre-compiled template that I could then reuse many times, with different data models.

  • JSP-like syntax, where variables can be referenced via notation like ${variableName.itemName[index]}, and scriptlets can contain arbitrary JavaScript code. I also wanted JSP-style comments (delimited by <%-- and --%>) embeddable in the scriptlets.


To create a template, I wanted code such as this:
  
var tmpl = new Template (aTemplateString);

And then, later, to process a template with a data model:
   
var expandedResult = tmpl.process (dataModel);

For example:

// Sample template string:
var string = "\
<h2>${name}</h2>\
<ul>\
<% for (var i = 0; i < positions.length; i++) { %>\
<li>${positions[i].title}, ${positions[i].company}, ${positions[i].duration}</li>\
<% } %>\
</ul>\
";
var template = new Template (string);

// Data model that can be used with this template:
var model = {
name: "Jim Smith",
positions: [
{ title: "Programmer", company: "ACME Corp", duration: "1991-1994" },
{ title: "Analyst", company: "Fluor Corp", duration: "1995-2001" }
]
}

// Expand the template using the model, and obtain a string:
var expandedResult = template.process (model);


The technique I ended up with uses a regular expression to parse the given template into its code and data segments, and generates a JavaScript function via an eval. This function is then remembered, and called whenever a template must be expanded. Here is the code:


var Template = function (aString, templateName) {
var templateString = aString;
var dataParts = [], start = 0;
var codeParts = ["var _process = function (model) { \nvar result = [];\nwith (model){\n"];
var re = /<%([\s\S]*?)%>|\$\{(.*?[^\\])\}/g ;
aString.replace (re, function (fullMatch, g1, g2, index) {
dataParts.push (aString.substring (start, index));
if (g1) {
codeParts.push ("result.push (dataParts[" + (dataParts.length-1) + "]);\n");
if (g1.substring (0,2) != "--" && g1.substring(g1.length-2,g1.length) != "--") {
// It's not a comment, it's a code fragment
codeParts.push (g1 + "\n");
}
start = index + g1.length + 4;
} else { // g2 matched
codeParts.push ("result.push (dataParts[" + (dataParts.length-1) + "]);\n");
codeParts.push ("result.push (" + g2 + ");\n");
start = index + g2.length + 3;
}
});
dataParts.push (aString.substring (start));
codeParts.push ("result.push (dataParts[" + (dataParts.length-1) + "]);\n}\nreturn result.join ('');\n}");
var codeStr = codeParts.join ("");
try {
eval (codeStr);
} catch (e) {
alert ("Template '" + templateName + "' expansion error:\n" + e.message);
}
this.process = _process;
};


You can download the code for your enjoyment. Comments welcome!