\(\newcommand{\F}{\mathbf{F}}\) \(\newcommand{\E}{\mathbf{E}}\) \(\newcommand{\C}{\mathcal{C}}\) \(\newcommand{\D}{\mathcal{D}}\) \(\newcommand{\id}{\mathrm{id}}\) \(\newcommand{\ob}[1]{\mathrm{ob}(#1)}\) \(\newcommand{\hom}[1]{\mathrm{hom}(#1)}\) \(\newcommand{\Set}{\mathbf{Set}}\) \(\newcommand{\Mon}{\mathbf{Mon}}\) \(\newcommand{\Vec}{\mathbf{Vec}}\) \(\newcommand{\Grp}{\mathbf{Grp}}\) \(\newcommand{\Rng}{\mathbf{Rng}}\) \(\newcommand{\ML}{\mathbf{ML}}\) \(\newcommand{\Hask}{\mathbf{Hask}}\) \(\newcommand{\Cat}{\mathbf{Cat}}\) \(\newcommand{\fmap}{\mathtt{fmap}}\)

Parsec

by Yann Esposito
@yogsototh, +yogsototh
ENTER FULLSCREEN
HTML presentation: use arrows, space to navigate.

Parsing

Latin pars (ōrātiōnis), meaning part (of speech).

Parsing in Programming Languages

Complexity:

Method Typical Example Output Data Structure
Splitting CSV Array, Map
Regexp email + Fixed Layout Tree
Parser Programming language + Most Data Structure

Parser & culture

In Haskell Parser are really easy to use.

Generally:

Parsing Example

From String:

(1+3)*(1+5+9)

To data structure:

AST

Parsec

Parsec lets you construct parsers by combining high-order Combinators to create larger expressions.

Combinator parsers are written and used within the same programming language as the rest of the program.

The parsers are first-class citizens of the languages [...]"

Parser Libraries

In reality there are many choices:

attoparsec fast
Bytestring-lexing fast
Parsec 3 powerful, nice error reporting

Haskell Remarks (1)

spaces are meaningful

f x   -- ⇔ f(x) in C-like languages
f x y -- ⇔ f(x,y)

Haskell Remarks (2)

Don't mind strange operators (<*>, <$>).
Consider them like separators, typically commas.
They are just here to deal with types.

Informally:

toto <$> x <*> y <*> z
    -- ⇔ toto x y z
    -- ⇔ toto(x,y,z) in C-like languages

A Parsec Example

whitespaces = many (oneOf "\t ")
number = many1 digit
symbol = oneOf "!#$%&|*+-/:<=>?@^_~"
"   \t "  -- whitespaces on "   \t "
""        -- whitespaces on "32"
"32"      -- number on "32"

-- number on "   \t 32  "
"number" (line 1, column 1):
unexpected " "
expecting digit

Comparison with Regexp (Parsec)

data IP = IP Word8 Word8 Word8 Word8
ip = IP <$>
       number <*  char '.' <*>
       number <*  char '.' <*>
       number <*  char '.' <*>
       number
number = do
    x <- read <$> many1 digit
    guard (0 <= x && x < 256)
    return (fromIntegral x)

Comparison with Regexp (Perl Regexp)

# remark: 888.999.999.999 is accepted
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

# exact but difficult to read
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
  (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Also, regexp are unityped by nature.

Monadic style

number :: Parser String
number = many1 digit

number' :: Parser Int
number' = do
    -- digitString :: String
    digitString <- many1 digit
    return (read digitString)
"32" :: [Char]  -- number on "32"
32   :: Int     -- number' on "32"

Combining Monadic style (S = aSb | ε)

s = (do
        a <- string "a"
        mid <- s
        b <- string "b"
        return (a ++ mid ++ b)
    <|> string ""
""          -- s on ""
"aaabbb"    -- s on "aaabbb"
"aabb"      -- s on "aabbb"
-- s on "aaabb"
S (line1 1, column 4):
unexpected end of input
expecting "b"

Combining Applicative style (S = aSb | ε)

s = concat3 <$> string "a" <*> s <*> char "b"
    <|> string ""
    where
        concat3 x y z = x ++ y ++ z

Applicative Style usefull with Data types

data IP = IP Int Int Int Int

parseIP = IP <$>
            number <*  char '.' <*>
            number <*  char '.' <*>
            number <*  char '.' <*>
            number

monadicParseIP = do
    d1 <- number
    char '.'
    d2 <- number
    char '.'
    d3 <- number
    char '.'
    d4 <- number
    return (IP d1 d2 d3 d4)

Write number correctly

number :: Parser Int
number = do
    x <- read <$> many1 digit
    guard (0 <= x && x < 256) <?>
        "Number between 0 and 255 (here " ++ show x ++ ")"
    return (fromIntegral x)
>>> test parseIP "parseIP" "823.32.80.113"
"parseIP" (line 1, column 4):
unexpected "."
expecting digit or Number between 0 and 255 (here 823)

So

Time to do something cool

A Simple DSL

Let's write a minimal DSL

Useful definitions

try tries to parse and backtracks if it fails.

(<||>) parser1 parser2 = try parser1 <|> parser2

lexeme, just skip spaces.

lexeme parser = whitespaces *> parser <* whitespaces

Scheme

Write Yourself a Scheme in 48 hours

Remember from text to data structure. Our data structure:

data LispVal =    Atom String
                | List [LispVal]
                | DottedList [LispVal] LispVal
                | Number Integer
                | String String
                | Bool Bool

Parse String

parseString :: Parser LispVal
parseString = do
    char '"'
    x <- many (noneOf "\"")
    char '"'
    return (String x)
-- parseString on '"toto"'
(String "toto") :: LispVal
-- parseString on '" hello"'
(String " hello") :: LispVal

Parse Atom

symbol :: Parser Char
symbol = oneOf "!#$%&|*+-/:<=>?@^_~"

parseAtom :: Parser LispVal
parseAtom = do
    first <- letter <|> symbol
    rest <- many (letter <|> digit <|> symbol)
    let atom = first:rest
    return $ case atom of
                "#t" -> Bool True
                "#f" -> Bool False
                _    -> Atom atom

Test parseAtom

-- parseAtom on '#t'
(Bool True) :: LispVal
-- parseAtom on '#f'
(Bool False) :: LispVal
-- parseAtom on 'some-atom'
(Atom "some-atom") :: LispVal

Parse Number

parseNumber :: Parser LispVal
parseNumber = Number . read <$> many1 digit
-- parseNumber on '18'
Number 18 :: LispVal
-- parseNumber on '188930992344321234'
Number 188930992344321234 :: LispVal

Compose all parsers

parseExpr :: Parser LispVal
parseExpr = parseAtom
            <||> parseString
            <||> parseNumber

Test the parser

-- parseExpr on '188930992344321234'
Number 188930992344321234 :: LispVal
-- parseExpr on '#t'
Bool True :: LispVal
-- parseExpr on 'just-some-word'
Atom "just-some-word" :: LispVal
-- parseExpr on '%-symbol-start'
Atom "%-symbol-start" :: LispVal
-- parseExpr on '"a String"'
String "a String" :: LispVal

Recursive Parsers

parseList :: Parser LispVal
parseList = List <$>
    (char '(' *> sepBy parseExpr' spaces <* char ')' )

parseExpr' :: Parser LispVal
parseExpr' = parseAtom
             <||> parseString
             <||> parseNumber
             <||> parseList

Test Parse List

-- parseExpr' on '(foo (bar baz))'
List [Atom "foo",List [Atom "bar",Atom "baz"]] :: LispVal

-- parseExpr' on '(foo (bar)'
"parseExpr'" (line 1, column 11):
unexpected end of input
expecting white space, letter, "\"", digit, "(" or ")"

-- parseExpr' on '(((foo)) bar)'
List [List [List [Atom "foo"]],Atom "bar"] :: LispVal

Conclusion

So Parser are more powerful than regular expression.
Parsec make it very easy to use.
Easy to read and to manipulate.

Notice how you could use parser as any other object in Haskell. You could mapM them for example.

Any question?

/

#