#+Title: Test Java Parser in Common Lisp with Esrap #+AUTHOR: Sam Flint #+EMAIL: swflint@flintfam.org #+DATE: \today #+INFOJS_OPT: view:info toc:nil path:http://flintfam.org/org-info.js #+OPTIONS: toc:nil H:5 ':t *:t #+PROPERTY: noweb no-export #+PROPERTY: comments noweb #+LATEX_HEADER: \usepackage[color]{showkeys} #+LATEX_HEADER: \parskip=5pt #+LATEX_HEADER: \lstset{texcl=true,breaklines=true,columns=fullflexible,frame=lines,literate={lambda}{$\lambda$}{1} {set}{$\gets$}1 {setq}{$\gets$}1 {setf}{$\gets$}1 {<=}{$\leq$}1 {>=}{$\geq$}1} #+BEGIN_abstract This document presents a personal exercise in both Common Lisp and the use of Context-free grammars. It's one of those just-for-fun projects, but will eventually be able to be used for both code metrics and even potentially code generation and manipulation. The parser is implemented using the Packrat Parsing technique, which provides LL$(k)$ parsing in linear time. #+END_abstract #+TOC: headlines 3 #+TOC: listings * WORKING The Parser [16%] To build this parser, I'm using a somewhat newer parsing technique to do this called "Packrat Parsing". Packrat Parsing is an unlimited look-ahead parsing technique based on Parser Expression Grammars optimized for use in functional languages. To achieve this, I'm using the excellent =ESRAP= parsing library for Common Lisp. ** DONE Common Rules To effectively represent the programming language, there are just a few housekeeping rules to allow us to parse. Many of them are fairly simple, defining things like names and strings. #+CAPTION: Common Rules #+Name: common-rules #+BEGIN_SRC lisp <> <> <> <> #+END_SRC *** DONE String Related To accurately parse strings, we need to provide two rules, a rule that provides string characters, and a second rule that defines a string. Aside from these, one needs a function to ensure that certain characters are not part of strings. #+CAPTION: String Related Generic Rules #+Name: string-related-rules #+BEGIN_SRC lisp (defun not-doublequote (character) (not (eql #\" character))) (defrule string-chars (or (not-doublequote character) (and #\\ #\"))) (defrule string (and #\" (* string-chars) #\") (:destructure (q1 string q2) (declare (ignore q1 q2)) (list :string (text string)))) #+END_SRC *** DONE Number Related Because of the nature of computers, there is the frequent need for numeric literals. Numeric literals can, however be difficult to parse, especially if you focus on type verification. To obviate the need for type verification, it generalizes to allow for all base-10 numeric literals, parsing a sign, whole part, fractional part and signed exponent part. #+CAPTION: Number Related Parsing Rules #+Name: number-related-rules #+BEGIN_SRC lisp (defrule number (and (? (or "-" "+")) (+ (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9")) (? (and "." (+ (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9")))) (? (and (or "e" "E") (? "-") (+ (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"))))) (:lambda (list) (list :number (text list)))) #+END_SRC *** DONE Identifiers All modern programming languages use identifiers for the names of methods, classes, functions and variables. The Java definition is a string of alphanumeric characters starting with any alphanumeric, the =_= or the =$= characters with the =_= or =.= interspersed in as appropriate. Rather than simply parse as a string, the rule will split the identifier at the =.= and return a list of all the parts of the identifier. #+CAPTION: Identifiers #+Name: identifiers #+BEGIN_SRC lisp (defrule identifier (and (or (alphanumericp character) "_" "$") (+ (or (alphanumericp character) "_" "."))) (:text t) (:lambda (string) (concatenate 'list (list :identifier) (split-sequence #\. string)))) #+END_SRC *** DONE Type Signatures *** DONE Type Formats *** DONE Miscellaneous Rules There are a few miscellaneous rules, consisting of things like spaces and non-integers. They are both fairly simple, and provide a way to work easily. #+CAPTION: Miscellaneous Rules #+Name: misc-rules #+BEGIN_SRC lisp (defun not-integer (string) (when (find-if-not #'digit-char-p string) t)) (defrule spaces (+ (or #\Space #\Tab #\Newline)) (:constant nil)) (defrule package-declaration) #+END_SRC *** TODO Comments foo #+CAPTION: Comments #+Name: comments #+BEGIN_SRC lisp (defrule comment (or (and "//" (* (or (alphanumericp character) "_" "." "+" "\"" "'" "," "." "<" ">"))) (and "/*" (* (or (alphanumpericp character) "_" "." "+" "\"" "'" "," "<" ">")))) (:constant nil)) #+END_SRC ** WORKING Class Rules [80%] Given the fact that Java is an inherently object-oriented language, and in fact forces it on users, it requires class definitions for everything. #+CAPTION: Class Rules #+Name: class-rules #+BEGIN_SRC lisp <> <> <> <> <> #+END_SRC *** DONE Class Modifiers Java has the ability to set classes as having certain attributes, these attributes are set using class modifiers. These are =public=, =abstract= and =final=. However, these may occur with just one or two though it may occur with all three, and to my understanding, they are order independent. Rather than parsing as a single string, we instead return a de-duplicated list of the modifiers as a keyword list. #+CAPTION: Class Modifiers #+Name: class-modifiers #+BEGIN_SRC lisp (defrule class-modifier (or "public" "abstract" "final")) (defrule class-modifiers (* (and class-modifier (? spaces))) (:lambda (modifier-list) (let ((mods (map 'list #'(lambda (mod) (intern (string-upcase (car mod)) "KEYWORD")) modifier-list))) (remove-duplicates mods)))) #+END_SRC *** DONE Super Class Because Java is an object oriented language, the use of classes is implicit, and with the use of classes comes inheritance; to take advantage of inheritance, you need to be able to signify the class to inherit from. The purpose of this rule is to signify the super class. #+CAPTION: Super Class #+Name: super-class #+BEGIN_SRC lisp (defrule super-class (and "extends" (? spaces) identifier) (:lambda (list) (cddr list))) #+END_SRC *** DONE Interface Types Java supports the concept of interfaces and allows the programmer to declare them in the class declaration. The possible declarations are the different types, things like =byte=, =long= and =boolean=. #+CAPTION: Interface Types #+Name: interface-types #+BEGIN_SRC lisp (defrule interface-types (or "byte" "short" "int" "long" "char" "float" "double" "boolean")) (defrule interfaces (and "implements" spaces (* (and interface-types (? (and (? spaces) "," (? spaces)))))) (:destructure (interstring space interfaces) (declare (ignore interstring space)) (let ((interlist (map 'list #'(lambda (interface-spec) (intern (string-upcase (car interface-spec)) "KEYWORD")) interfaces))) (remove-duplicates interlist)))) #+END_SRC *** WORKING Class Body The class body contains a set of method definitions, variable declarations and the like. #+CAPTION: Class Body #+Name: class-body #+BEGIN_SRC lisp (defrule class-body (* (or (alphanumericp character) "_" "." ";" spaces))) #+END_SRC *** DONE Class Declaration To define a class in Java, you write a class declaration, which has optional [[Class Modifiers][Class Modifiers]], the word =class=, an [[Identifiers][Identifier]], a [[Super Class][Super Class]] declaration, an [[Interface Types][Interface Type]] list, an opening brace, a [[Class Body][Class Body]] and a closing brace. #+CAPTION: Class Declaration #+Name: class-declaration #+BEGIN_SRC lisp (defrule class (and (? class-modifiers) (? spaces) "class" (? spaces) identifier (? spaces) (? super-class) (? spaces) (? interfaces) (? spaces) "{" (? spaces) class-body (? spaces) "}") (:destructure (modifiers s1 class s2 class-name s3 superclass s4 interfaces s5 o-brace s6 body s7 c-brace) (declare (ignore s1 class s2 s3 s4 s5 o-brace s6 s7 c-brace)) (list :modifiers modifiers :class class-name :superclass superclass :interfaces interfaces :body body))) #+END_SRC ** WORKING Method Rules [0%] foo #+CAPTION: Method Rules #+Name: method-rules #+BEGIN_SRC lisp <> <> <> <> <> #+END_SRC *** TODO Method Modifiers foo #+CAPTION: Method Modifiers #+Name: method-modifiers #+BEGIN_SRC lisp (defrule method-modifier (or "public" "protected" "private" "static" "abstract" "final" "synchonized" "native")) (defrule method-modifiers (* (and method-modifier (? spaces))) (:lambda (method-modifier-list) (let ((mods (map 'list #'(lambda (modifier) (intern (string-upcase (car modifier)) "KEYWORD")) method-modifier-list))) (remove-duplicates mods)))) #+END_SRC *** TODO Result Type foo #+CAPTION: Result Type #+Name: result-type #+BEGIN_SRC lisp (defrule result-type (or interface-types "void") (:lambda (type) (intern (string-upcase type) "KEYWORD"))) #+END_SRC *** TODO Formal Parameter List foo #+CAPTION: Formal Parameter List #+Name: formal-parameter-list #+BEGIN_SRC lisp (defrule formal-param (and interface-types (? spaces) identifier) (:destructure (type space ident) (declare (ignore space)) (list (intern (string-upcase type) "KEYWORD") ident))) (defrule formal-param-list (* (and formal-param (? (and "," (? spaces))))) (:lambda (params) (map 'list #'car params))) #+END_SRC *** TODO Error Throws foo #+CAPTION: Throwing Errors #+Name: throwing-errors #+BEGIN_SRC lisp (defrule throws (and "throws" (? spaces) identifier) (:destructure (throws s1 idents) (declare (ignore throws s1)) idents)) #+END_SRC *** TODO Method Declaration foo #+CAPTION: Method Declaration #+Name: method-declaration #+BEGIN_SRC lisp (defrule method-declarator (and identifier (? spaces) "(" (? spaces) (? formal-param-list) (? spaces) ")") (:destructure (ident s1 opar s2 params s3 cpar) (declare (ignore s1 opar s2 s3 cpar)) (list :name ident :parameters params))) (defrule method-header (and (? method-modifiers) (? spaces) result-type (? spaces) method-declarator (? spaces) (? throws)) (:destructure (mods s1 res-type s2 declarator s3 throws) (declare (ignore s1 s2 s3)) (list :modifiers mods :result-type res-type :declarator declarator :throws throws))) (defrule method-declaration (and method-header (? spaces) "{" (? spaces) class-body (? spaces) "}") (:destructure (header s1 obrace s2 body s3 cbrace) (declare (ignore s1 obrace s2 s3 cbrace)) (concatenate 'list header (list :body body)))) #+END_SRC ** TODO Body Rules ** TODO Data Structures ** TODO The Interface * TODO The Tests [0%] ** TODO The First Test ** TODO A Test Class ** TODO A Test Tree Walker * TODO The Packaging Machinery [50%] While this document is mostly just an experiment in parser and tokenizer construction, I do intend it to be used for things like automated source-to-source translation and various related functionality. ** DONE The =defpackage= Forms and =packages.lisp= To properly load and work with this application, a set of packages need to be defined, and to simplify the definition of such, they will be placed in a single file. #+CAPTION: defpackage forms #+Name: defpackage-forms #+BEGIN_SRC lisp ;;;; package.lisp (defpackage #:java-parser (:use :esrap :cl) (:import-from #:parse-number #:parse-number)) (defpackage #:java-parser.test (:use :lift :java-parser :cl)) #+END_SRC ** TODO The Parser The parser is a fairly serious piece of code, comprised of various order-sensitive rules, which together form a complete parser/tokenizer combination. #+CAPTION: Parser Packaging #+Name: parser-packaging #+BEGIN_SRC lisp ;;;; parser.lisp (in-package #:java-parser) ;; "parser.lisp" goes here #+END_SRC ** TODO The Tests Proper development of a package such as this requires regular testing of the code to ensure it does exactly as it's supposed to, and to help prevent errors. #+CAPTION: Tests #+Name: test-packaging #+BEGIN_SRC lisp ;;;; test.lisp (in-package #:java-parser.test) ;; "test.lisp" #+END_SRC ** DONE The ASDF Mumbo-Jumbo Packaging an application, or system of functions in Lisp is fairly simple, using *ASDF*, or the "Another System Definiton Facility". Doing this is fairly simple, using a single file. #+CAPTION: ASDF System Definition File #+Name: asdf-file #+BEGIN_SRC lisp ;;;; java-parser.asd (asdf:defsystem #:java-parser :serial t :description "A Basic Java Parser" :author "Samuel W. Flint " :license "GNU GPLv3" :depends-on (#:esrap #:parse-number #:lift) :components ((:file "package.lisp") (:file "parser.lisp") (:file "test.lisp"))) #+END_SRC * WORKING Notes and Resources - [[http://cs.au.dk/~amoeller/dRegAut/JavaBNF.html]] is the Java BNF that I've based the parser on. - https://scymtym.github.io/esrap is the Common Lisp parser library that I've decided to use. - To accomplish unit testing and the like, I'm using the [[http://common-lisp.net/project/lift][Lift]] framework.