Use Gawk to process CSV files

Sommaire

Awk and Gawk are very powerful tools to process text files. Yet, CSV files are especially difficult to process. Fields are separated by comma or semicolon, depending on the country. These separators can also be used in a field inside double quotes. Which makes it very difficult for Awk to split a line into fields. Moreover, a field with double quotes can contain line breaks, which makes it difficult for Awk to process it.

The gawk-csv extension

The most promising solution to read CSV files with Gawk is the gawk-csv, which is part of the gawkextlib library of Gawk extensions. It can properly split a CSV line, and offers functions to read and write CSV.

With this extension, a CSV file can be read in 2 lines:

@include "csv"
BEGIN { CSVMODE = 1 }
#... gawk program ...

Processing a CSV file using native Gawk

Since version 4.0, Gawk offers a nice functionality to read CSV files: the FPAT, variable. FPAT will split fields depending on their content, instead of splitting the fields depending on separators. With this, you can correctly split a CSV line, provided it does not include line breaks.

This small Gawk library csv2awk below has been designed to process CSV files that may contain line breaks. To process these CSV files, it offers 2 functions: csv_split() and csv_convert().

It offers the following benefits:

  • Works in native Gawk, without requiring an external extension.
  • Works with CSV fields including line breaks.
  • Tested with LibreOffice and Excel CSV files.
  • Can be used to change a CSV file with a simple Gawk program.

It has however some limitations:

  • If a field contains a line break, it has to be between double quotes (this is only recommended by RFC 4180).
  • As is, it only works with the standard input (e.g. not the explicitly open files, command and coprocess).
  • It requires Gawk version 4.0 or more.
  • After being processed by csv_split(), the line will be separated by NULL characters (ASCII 0) and FS will be set to NULL. This enables us to use functions that will force the line to be reevaluated (sub, gsub, $0 = ...) without the line being incorrectly split.
  • As a corollary, if the original line contains NULL characters, a reevaluation of the line will split it incorrectly.

csv_split()

The csv_split() function reads the current line ($0) and translate it as a regular Gawk input line. The csv_split() takes the CSV separator as a parameter (by default, a comma will be used).

It splits the line according to the CSV fields and reads more input lines with getline if necessary. After splitting, the line is formatted as a regular Gawk input line (NF, $0, $1, $2, …), NULL-separated and FS is set to NULL.

csv_convert()

The csv_convert() function will return the current line formatted as CSV. The csv_convert() takes the CSV separator as a parameter (by default, a comma will be used).

It can be used after csv_split() to display the line as CSV. With this, you can process a CSV file with Gawk the same way you would process a regular text file.

But it can also be used in a regular Gawk script to convert the output to CSV.

The awk2csv.awk mini-library

# csv2awk.awk - CSV-file processing with Gawk 4.0+
#
# Written in 2020 by Jean-Philippe Guérard <jean-philippe.guerard@tigreraye.org>
#
# To the extent possible under law, the author(s) have dedicated all
# copyright and related and neighboring rights to this software to the
# public domain worldwide. This software is distributed without any
# warranty.
#
# The full license text is available on <http://creativecommons.org/publicdomain/zero/1.0/>.
#
# Usage:
#
# @include "csv2awk.awk"
# { csv_split( "," ) }
# ... processing ...
# { print csv_convert( "," ) }
#
function csv_split( CSV_SEPARATOR,    FIELD_PATTERN, NB, SPLIT_LINE, IS_FIELD_COMPLETE, I, J, CSV_LINE ){
  if ( CSV_SEPARATOR == "" ) { CSV_SEPARATOR = "," }
  FIELD_PATTERN = "([^" CSV_SEPARATOR "]*)|(\"([^\"]|\"\")+(\"|$))"
  # Split the CSV line into CSV_LINE
  I = 1
  split( "", CSV_LINE )
  while( 1 ) {
    sub( /\r$/, "" )
    NB = patsplit( $0, SPLIT_LINE, FIELD_PATTERN )
    for ( J = 1 ; J <= NB ; J++ ){
      IS_FIELD_COMPLETE = 0
      if ( I in CSV_LINE ) {
        CSV_LINE[ I ] = CSV_LINE[ I ] CSV_SEPARATOR SPLIT_LINE[ J ]
      } else {
        CSV_LINE[ I ] = SPLIT_LINE[ J ]
      }
      if ( CSV_LINE[ I ] ~ /^"([^"]|"")+"$/ ){
        CSV_LINE[ I ] = substr( CSV_LINE[ I ], 2, length( CSV_LINE[ I ] ) - 2 )
        IS_FIELD_COMPLETE = 1
        I++
      } else if ( CSV_LINE[ I ] ~ "^[^" CSV_SEPARATOR "\"]*$" ) {
        IS_FIELD_COMPLETE = 1
        I++
      }
    }
    if ( IS_FIELD_COMPLETE ) break
    if ( getline <= 0 ) break
    $0 = CSV_LINE[ I ] "\n" $0
    delete CSV_LINE[ I ]
  }
  # Rebuilding of the line with NULL separator
  NF = 0
  for ( I = 1 ; I <= length( CSV_LINE ) ; I++ ){
    $I = gensub( /""/, "\"", "g", CSV_LINE[ I ] )
  }
}
function csv_convert( CSV_SEPARATOR,    CSV_LINE, CSV_FIELD, MUST_ESCAPE ){
  if ( CSV_SEPARATOR == "" ) { CSV_SEPARATOR = "," }
  MUST_ESCAPE = "[\n\"" CSV_SEPARATOR "]"
  for ( I = 1 ; I <= NF ; I++ ){
    if ( $I ~ MUST_ESCAPE ){
      CSV_FIELD = "\"" gensub( /"/, "\"\"", "g", $I ) "\""
    } else {
      CSV_FIELD = $I
    }
    if ( CSV_LINE ){
      CSV_LINE = CSV_LINE CSV_SEPARATOR CSV_FIELD
    } else {
      CSV_LINE = CSV_FIELD
    }
  }
  return CSV_LINE
}

Tests

Test 1: display CSV fields

Let’s export the following data as CSV:

-------------------------------------------
|   a   |   b   |   c   |  x y  | "x " y" |
-------------------------------------------

This gives us:

a,b,c,"x y","""x "" y"""

Let’s save our library as csv2awk.awk and use it to read the CSV file we just produced:

echo 'a,b,c,"x y","""x "" y"""' \
| gawk '@include "csv2awk.awk"
        { csv_split() ; for (I=1 ; I<=NF ; I++) { print I ": " $I } }'

This will give us back our original table:

1: a
2: b
3: c
4: x y
5: "x " y"

Test 2: modifying a CSV file

This time, we will use Gawk to change the content of a CSV file. We will use the same file as in the previous example:

a,b,c,"x y","""x "" y"""

As a simple test, we will replace all double quotes by underlines:

echo 'a,b,c,"x y","""x "" y"""' \
| gawk '@include "csv2awk.awk"
        { csv_split() ; gsub( /"/, "_" ) ; print csv_convert() }'

Which gives us the expected result:

a,b,c,x y,_x _ y_