java-parser.org 15 KB

Test Java Parser in Common Lisp with Esrap

This document presents a personal exercise in both Common Lisp and the use of Context-free grammars. It's one of those just-for-fun projects, but will eventually be able to be used for both code metrics and even potentially code generation and manipulation.

The parser is implemented using the Packrat Parsing technique, which provides LL$($ rsing in linear time.

WORKING The Parser [16%]

To build this parser, I'm using a somewhat newer parsing technique to do this called "Packrat Parsing". Packrat Parsing is an unlimited look-ahead parsing technique based on Parser Expression Grammars optimized for use in functional languages. To achieve this, I'm using the excellent ESRAP parsing library for Common Lisp.

DONE Common Rules

To effectively represent the programming language, there are just a few housekeeping rules to allow us to parse. Many of them are fairly simple, defining things like names and strings.

  <<string-related-rules>>

  <<number-related-rules>>

  <<identifiers>>

  <<misc-rules>>

DONE String Related

To accurately parse strings, we need to provide two rules, a rule that provides string characters, and a second rule that defines a string. Aside from these, one needs a function to ensure that certain characters are not part of strings.

  (defun not-doublequote (character)
    (not (eql #\" character)))

  (defrule string-chars
      (or (not-doublequote character)
         (and #\\ #\")))

  (defrule string
      (and #\" (* string-chars) #\")
    (:destructure (q1 string q2)
                  (declare (ignore q1 q2))
                  (list :string (text string))))

DONE Number Related

Because of the nature of computers, there is the frequent need for numeric literals. Numeric literals can, however be difficult to parse, especially if you focus on type verification. To obviate the need for type verification, it generalizes to allow for all base-10 numeric literals, parsing a sign, whole part, fractional part and signed exponent part.

  (defrule number
      (and (? (or "-" "+"))
         (+ (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"))
         (? (and "."
               (+ (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"))))
         (? (and (or "e" "E")
               (? "-")
               (+ (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9")))))
    (:lambda (list)
      (list :number (text list))))

DONE Identifiers

All modern programming languages use identifiers for the names of methods, classes, functions and variables. The Java definition is a string of alphanumeric characters starting with any alphanumeric, the _ or the $ characters with the _ or . interspersed in as appropriate. Rather than simply parse as a string, the rule will split the identifier at the . and return a list of all the parts of the identifier.

  (defrule identifier
      (and (or (alphanumericp character) "_" "$")
         (+ (or (alphanumericp character) "_" ".")))
    (:text t)
    (:lambda (string)
      (concatenate 'list
                   (list :identifier)
                   (split-sequence #\. string))))

DONE Type Signatures

DONE Type Formats

DONE Miscellaneous Rules

There are a few miscellaneous rules, consisting of things like spaces and non-integers. They are both fairly simple, and provide a way to work easily.

  (defun not-integer (string)
    (when (find-if-not #'digit-char-p string)
      t))

  (defrule spaces
      (+ (or #\Space #\Tab #\Newline))
    (:constant nil))

  (defrule package-declaration)

TODO Comments

foo

  (defrule comment
      (or (and "//" (* (or (alphanumericp character) "_" "." "+" "\"" "'" "," "." "<" ">")))
         (and "/*" (* (or (alphanumpericp character) "_" "." "+" "\"" "'" "," "<" ">"))))
    (:constant nil))

WORKING Class Rules [80%]

Given the fact that Java is an inherently object-oriented language, and in fact forces it on users, it requires class definitions for everything.

  <<class-modifiers>>

  <<super-class>>

  <<interface-types>>

  <<class-body>>

  <<class-declaration>>

DONE Class Modifiers

Java has the ability to set classes as having certain attributes, these attributes are set using class modifiers. These are public, abstract and final. However, these may occur with just one or two though it may occur with all three, and to my understanding, they are order independent. Rather than parsing as a single string, we instead return a de-duplicated list of the modifiers as a keyword list.

  (defrule class-modifier
      (or "public" "abstract" "final"))

  (defrule class-modifiers
      (* (and class-modifier (? spaces)))
    (:lambda (modifier-list)
            (let ((mods (map 'list #'(lambda (mod)
                                       (intern (string-upcase (car mod)) "KEYWORD"))
                             modifier-list)))
              (remove-duplicates mods))))

DONE Super Class

Because Java is an object oriented language, the use of classes is implicit, and with the use of classes comes inheritance; to take advantage of inheritance, you need to be able to signify the class to inherit from. The purpose of this rule is to signify the super class.

  (defrule super-class
      (and "extends" (? spaces) identifier)
    (:lambda (list)
      (cddr list)))

DONE Interface Types

Java supports the concept of interfaces and allows the programmer to declare them in the class declaration. The possible declarations are the different types, things like byte, long and boolean.

  (defrule interface-types
      (or "byte" "short" "int" "long" "char" "float" "double" "boolean"))

  (defrule interfaces
      (and "implements" spaces (* (and interface-types (? (and (? spaces) "," (? spaces))))))
    (:destructure (interstring space interfaces)
                  (declare (ignore interstring space))
                  (let ((interlist (map 'list
                                        #'(lambda (interface-spec)
                                            (intern (string-upcase (car interface-spec))
                                                    "KEYWORD"))
                                        interfaces)))
                    (remove-duplicates interlist))))

WORKING Class Body

The class body contains a set of method definitions, variable declarations and the like.

  (defrule class-body
      (* (or (alphanumericp character) "_" "." ";" spaces)))

DONE Class Declaration

To define a class in Java, you write a class declaration, which has optional Class Modifiers, the word class, an Identifier, a Super Class declaration, an Interface Type list, an opening brace, a Class Body and a closing brace.

  (defrule class
      (and (? class-modifiers)
         (? spaces)
         "class"
         (? spaces)
         identifier
         (? spaces)
         (? super-class)
         (? spaces)
         (? interfaces)
         (? spaces)
         "{"
         (? spaces)
         class-body
         (? spaces)
         "}")
    (:destructure (modifiers
                   s1
                   class
                   s2
                   class-name
                   s3
                   superclass
                   s4
                   interfaces
                   s5
                   o-brace
                   s6
                   body
                   s7
                   c-brace)
                  (declare (ignore s1 class s2 s3 s4 s5 o-brace s6 s7 c-brace))
                  (list :modifiers modifiers
                        :class class-name
                        :superclass superclass
                        :interfaces interfaces
                        :body body)))

WORKING Method Rules [0%]

foo

  <<method-modifiers>>

  <<result-type>>

  <<formal-parameter-list>>

  <<throwing-errors>>

  <<method-declaration>>

TODO Method Modifiers

foo

  (defrule method-modifier
      (or "public" "protected" "private" "static" "abstract" "final" "synchonized" "native"))

  (defrule method-modifiers
      (* (and method-modifier (? spaces)))
    (:lambda (method-modifier-list)
      (let ((mods (map 'list #'(lambda (modifier)
                                  (intern (string-upcase (car modifier))
                                          "KEYWORD"))
                       method-modifier-list)))
        (remove-duplicates mods))))

TODO Result Type

foo

  (defrule result-type
      (or interface-types "void")
    (:lambda (type)
      (intern (string-upcase type) "KEYWORD")))

TODO Formal Parameter List

foo

  (defrule formal-param
      (and interface-types (? spaces) identifier)
    (:destructure (type space ident)
                  (declare (ignore space))
                  (list (intern (string-upcase type)
                                "KEYWORD")
                        ident)))

  (defrule formal-param-list
      (* (and formal-param (? (and "," (? spaces)))))
    (:lambda (params)
      (map 'list #'car params)))

TODO Error Throws

foo

  (defrule throws
      (and "throws" (? spaces) identifier)
    (:destructure (throws s1 idents)
                  (declare (ignore throws s1))
                  idents))

TODO Method Declaration

foo

  (defrule method-declarator
      (and identifier (? spaces) "(" (? spaces) (? formal-param-list) (? spaces) ")")
    (:destructure (ident s1 opar s2 params s3 cpar)
                  (declare (ignore s1 opar s2 s3 cpar))
                  (list :name ident
                        :parameters params)))

  (defrule method-header
      (and (? method-modifiers) (? spaces) result-type (? spaces) method-declarator (? spaces) (? throws))
    (:destructure (mods s1 res-type s2 declarator s3 throws)
                  (declare (ignore s1 s2 s3))
                  (list :modifiers mods
                        :result-type res-type
                        :declarator declarator
                        :throws throws)))

  (defrule method-declaration
      (and method-header (? spaces) "{" (? spaces) class-body (? spaces) "}")
    (:destructure (header s1 obrace s2 body s3 cbrace)
                  (declare (ignore s1 obrace s2 s3 cbrace))
                  (concatenate 'list header (list :body body))))

TODO Body Rules

TODO Data Structures

TODO The Interface

TODO The Tests [0%]

TODO The First Test

TODO A Test Class

TODO A Test Tree Walker

TODO The Packaging Machinery [50%]

While this document is mostly just an experiment in parser and tokenizer construction, I do intend it to be used for things like automated source-to-source translation and various related functionality.

DONE The defpackage Forms and packages.lisp

To properly load and work with this application, a set of packages need to be defined, and to simplify the definition of such, they will be placed in a single file.

  ;;;; package.lisp

  (defpackage #:java-parser
    (:use :esrap
          :cl)
    (:import-from #:parse-number
                  #:parse-number))

  (defpackage #:java-parser.test
    (:use :lift
          :java-parser
          :cl))

TODO The Parser

The parser is a fairly serious piece of code, comprised of various order-sensitive rules, which together form a complete parser/tokenizer combination.

  ;;;; parser.lisp

  (in-package #:java-parser)


  ;; "parser.lisp" goes here

TODO The Tests

Proper development of a package such as this requires regular testing of the code to ensure it does exactly as it's supposed to, and to help prevent errors.

  ;;;; test.lisp

  (in-package #:java-parser.test)

  ;; "test.lisp"

DONE The ASDF Mumbo-Jumbo

Packaging an application, or system of functions in Lisp is fairly simple, using ASDF, or the "Another System Definiton Facility". Doing this is fairly simple, using a single file.

  ;;;; java-parser.asd

  (asdf:defsystem #:java-parser
    :serial t
    :description "A Basic Java Parser"
    :author "Samuel W. Flint <swflint@flintfam.org>"
    :license "GNU GPLv3"
    :depends-on (#:esrap
                 #:parse-number
                 #:lift)
    :components ((:file "package.lisp")
                 (:file "parser.lisp")
                 (:file "test.lisp")))

WORKING Notes and Resources