Analyzing logfiles (2)

Following on from yesterday’s post, here is a similar tool implemented in Haskell. This one is to process a log format we use for various applications, which is

<timestamp>: key1=value1 key2=value2 ... keyn=valuen

There may be any number of keys, but within a log file all entries will have the same keys. The end result will be a SQLite database with a timestamp column + one column per key.

module Main where

import Data.Time (LocalTime)
import Data.Time.Format (readTime)
import System.Locale (defaultTimeLocale)
import IO (readFile, getContents, hPutStrLn, stderr)
import Data.List (intersperse, lines)
import Data.List.Split (splitOn)
import Database.HDBC.Sqlite3 (connectSqlite3, Connection)
import Database.HDBC (run, commit, disconnect, handleSqlError, toSql, prepare, executeMany, getTables)

import System.Console.GetOpt
import System

As with OCaml the first thing is to open or import the modules. I am fully qualifying all the functions I’m using just to help familiarize myself with where things live. Haskell “knows” which of many source files is the entry point by where module Main is declared (I am not actually sure how OCaml does, just from where a let _ construct is? Something it can evaluate immediately rather than just declaring a function?) GetOpt seems to need to be imported in its entirety in order to get the constructor for its datatype.

-- | Input line of the form "Tue Jul  6 05:15:02 2010: sweep=27 broadcast=194..." to LocalTime and a list of key-value pairs
parseLine::String -> (LocalTime, [(String, Int)])
parseLine s = (log_date, kvs)
    where log_date = ((readTime defaultTimeLocale "%a %b %e %T %Y" log_date')::LocalTime)
          log_date' = s' !! 0
          kvs = map toPair ( map (splitOn "=") (splitOn " " (s' !! 1)) )
          s' = splitOn ": " s
          toPair [x, y] = (x, read y::Int)

This is the function that actually parses a line of the file. I’m sure there’s a much better way to do this with Parsec, but it very simply first splits the line into a timestamp and the data on the :, then splits the data into key=value pairs on whitespace, then finally splits each of those on =. It returns the timestamp and a list of tuples each of which is one key and its associated (integer) value.

processLogFile::String ->IO String ->FilePath ->IO ()
processLogFile tabname infile outfile = do
    conn <-connectSqlite3 outfile
    logfile <-infile

This is the function that processes an entire logfile. It takes a table name (a pure String), an input stream (impure, hence being extracted from the IO Monad with ←), an output filename (a pure String, it is not actually a file at this stage, just a name – the IO happens when it is passed as a parameter to connectSqlite3, hence another use of ← ) and returns a result of IO.

    -- if the table already exists, drop it. return 0 the right way to say "do nothing"?
    tables <-getTables conn
    case (tabname `elem` tables) of
        True  -> run conn (dropTable tabname) []
        False -> return 0

In this case I am recreating the table rather than appending new data into it, using a similar technique to ignore the error if it occurs (i.e. the table did not already exist). dropTable is defined below; it is just a function that returns the the SQL.

    -- read the first line of the logfile, extract the keys and create a table
    -- this will obviously fail if more keys are added later
    let (_, kvs) = parseLine $ head $ lines $ logfile
    run conn (createTable tabname kvs) []

Next read the first line of the log file to get the list of new column names. This Haskell syntax is equivalent to parseLine (head (lines logfile)). Read the logfile, split it on newlines, take the first one, and call parse function on it. Haskell’s laziness means that this doesn’t read the entire file just to get one line of it. createTable is a function defined below that returns SQL.

    stmt <-prepare conn (insertInto tabname kvs)
    executeMany stmt (map toSqlList (lines logfile))

    commit conn
    disconnect conn

Now a statement is prepared (this does IO, SQLite compares the statement against the underlying table structure) and the statement is applied to every line in the log file using map. If all the underlying infrastructure was set up for it I guess this could be massively parallelized as each line is independant of the others, and the toSqlList function is pure. Not much point when SQLite would do a table lock for each one tho’! Haskell’s HDBC disables autocommit so it must be done explicitly.

        -- SQL statements
        createTable tabname kvs  = "create table " ++ tabname ++ " (log_date datetime primary key, " ++ (concat $ intersperse "," (map (++ " number") (colNames kvs))) ++ ");" 
        dropTable tabname = "drop table " ++ tabname ++";"
        -- length of list of keys +1 for the log_date column
        insertInto tabname kvs = "insert into " ++ tabname ++ " values (" ++ (concat $ intersperse "," $ replicate ((length kvs) + 1) "?") ++ ")" 
        colNames kvs = map fst kvs

        -- convert the data from the parsed structure to a list of SQL types for inserting in a prepared statement
        toSqlList logline = toSql log_date : colValues kvs
                (log_date, kvs) = parseLine logline
                colValues kvs = map toSql (map snd kvs)        

These are the local functions referenced in the main body of processLogFile. Mainly they are straightforward templates for SQL statements. toSqlList creates a list of all the values that will be bound into the prepared statment (lines 35-36) cast using the toSql constructor from Database.HDBC. It does this by mapping snd (second) over the (key, value) pairs returned by parseLine to extract the values. Line 45 simply produces a string of the form ?,?,?...?.

-- handle command line options
-- default is to read STDIN, create/use a database file called log.db and load into a table named log
data Options = Options { optTabName ::String
                       , optInput   ::IO String
                       , optOutput  ::FilePath

defaultOptions = Options { optTabName  = "log"
                         , optInput    = getContents
                         , optOutput   = "log.db"

options::[OptDescr (Options -> IO Options) ]
options = [ Option "i" ["input"]    (ReqArg (\arg opt ->return opt {optInput = readFile arg}) "FILE") "Input file"
          , Option "o" ["output"]   (ReqArg (\arg opt ->return opt {optOutput = arg}) "FILE") "Output file"
          , Option "t" ["table"]    (ReqArg (\arg opt ->return opt {optTabName = arg}) "TABLE") "Table name"
          , Option "v" ["version"]  (NoArg (\_ -> do 
                                            argv0 <-getProgName
                                            hPutStrLn stderr (argv0 ++ " version 0.1")
                                            exitWith ExitSuccess)) "Print version"
          , Option "h" ["help"]     (NoArg (\_ ->do
                                            argv0 <-getProgName
                                            hPutStrLn stderr (usageInfo argv0 options)
                                            exitWith ExitSuccess)) "Show help"

This code handles both parsing the command line arguments and setting up the defaults in the case that they are not specified. First a data structure (record) for the options is defined, then this is populated with the defaults, then these are tied together with the actual command line options and their help text. This is more sophisticated than the OCaml Arg module I have been using so far but a similar facility is available for it.

main = handleSqlError $ do
    args <- getArgs
    let (actions, nonOptions, errors) = getOpt RequireOrder options args
    opts <-foldl (>>=) (return defaultOptions) actions
    let Options { optTabName = table
                , optInput   = input
                , optOutput  = output } = opts

    processLogFile table input output

-- End of file

And finally, the entry point of the program. This gets the command line arguments (which is IO), overlays the contents of defaultOptions onto anything that is undefined, performs a pattern match against the resulting data structure to bind the variables, then calls processLogFile to do the work.

So which is better for developing this particular type of tool, OCaml or Haskell? There is very little to call between them in my opinion. I’m sure I will refine that once I have used them for more substantial projects. Both give me what I’m after: a high-level type-safe mainly functional compiled language. Haskell does have the advantage of better out-of-the-box date handling and better bindings to SQLite – it can INSERT with a prepared statement. The former is a non-issue once the appropriate OCaml library is identified and installed, of course, but Haskell is more Pythonic in this sense; there is an obvious way to do it, whereas OCaml requires an inexperienced programmer (i.e. me) to make decisions that will only obvious in retrospect, e.g. which of the three (so far!) date libraries to use, Unix, CalendarLib or Ocamlnet. I don’t want to complain about OCaml’s SQLite bindings too much; after all someone has done the work and given it away. Eventually I’ll be at the point where I can just implement the missing feature and submit a patch myself…

OCaml/Haskell or Python, is the next question. There’s no denying Python is an incredibly productive language, with great libraries, and much more maintainable than Perl. But throwaway scripts never are; you will always use it again, tweak it, add options, add error handling, give it to someone else to use… And before long, it’s real tool. In some cases today’s 10-line file munger is tomorrow’s 1000-line ETL batch job on which the business depends – and which no-one dares touch in case it breaks. Of course people do do large scale projects in Python. I just don’t see the point in writing unit tests for things the type system could already have caught!


About Gaius

Jus' a good ol' boy, never meanin' no harm
This entry was posted in Haskell, Ocaml, SQLite. Bookmark the permalink.

3 Responses to Analyzing logfiles (2)

  1. FWIW I have written a minimalistic syntax extension for OCaml (ocaml-sqlexpr) that simplifies
    the use of SQLite in a type-safe way that resembles OCaml’s built-in
    (type-safe) Printf.printf, allowing you to do things like

    module S = Sqlexpr_sqlite.Make(Sqlexpr_concurrency.Id)
    (* the module is functorized over a THREAD monad, allowing you to use it e.g. with Lwt lightweight threads *)
    let db = Sqlexpr_sqlite.open_db "foo.db"
      (fun (login, password) ->
          S.execute db sql"INSERT INTO users(login, password) VALUES(%s, %s)"
            login password)
      ["someguy", "bar"; "J.random", "baz"]
    let get_password = db sql"SELECT @s{password} FROM users where name = %s"

    It uses prepared statements (which will be cached if you use sqlc”…”), and
    handles parameter binding and data conversion when it makes sense (e.g.
    promoting INT to a float). It also features (nestable) transaction support and
    the usual higher-level functions (fold, iter) with automatic retries when the
    DB is locked. Just have to get around to documenting it.

  2. Pingback: Import 2 column CSV files to some sort of DB… preliminary planning | XL-UAT

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s