Parsing PHP source code using Python

About The Project

A small Python Library for building ASTs from source code and performing simple queries on them.

Note: This project only supports ASTs for PHP as of now and is no longer in development. There are a few known issues listed here. Any contributions are welcome.

The project uses a PLY-based Parser for building ASTs for PHP. There is also an ANTLR-based compiler but it is not fully complete and compatible with the modules. Therefore, all the modules in the project use the PLY-based compiler.

The project is license under the MIT License

Installation

The project requires Python 3.x.

Dependencies for the project are stored in requirements.txt. Install them using

pip install -r requirements.txt

Getting Started

Building AST for a File

To build ast for a file, create the following test.py file in the root directory.

from src.modules.php.syntax_tree import build_syntax_tree

tree = build_syntax_tree("path/to/php/file")

Run it using python -i test.py to inspect the result of the AST built. build_syntax_tree returns a SyntaxTree object.

Building Resource Tree for a Directory

A Resource Tree is basically a collection of ASTs for all the files in a project directory along with some other information (e.g, Function and Method definitions).

To build a Resource Tree for a given directory, create a test2.py file in the root directory.

from src.modules.php.resource import build_resource_tree

r_tree = build_resource_tree("examples/php")

Then run it using python -i test2.py to inspect the results. build_resource_tree returns a ResourceTree object.

Using Traversers and Visitors

Analysing the built Abstract Syntax Trees requires you to follow the Visitor Pattern. You need to use a traverser that inherits from the built-in Traverser class and overrides its methods. The traverser can register one or more visitors that inherit from the built-in Visitor class.

Once the traverser runs, it should visit each node in the tree and dispatch all of the visitors on the visited node. The visitors can collect information or mutate the AST during the traversal.

Have a look at the predefined Visitors and Traversers to get an idea of how it works.

Here is a minimal example that uses the built-in BFTraverser and a custom visitor to print the types of all the nodes present in the AST:

from src.modules.php.traversers.bf import BFTraverser
from src.modules.php.base import Visitor 
from src.modules.php.syntax_tree import build_syntax_tree

class CustomVisitor(Visitor):
    def visit(self, node):
        print(type(node))

s_tree = build_syntax_tree("/path/to/file")
traverser = BFTraverser(s_tree) 
printer_visitor = CustomVisitor()
traverser.register_visitor(printer_visitor)

traverser.traverse()

Querying for Particular Nodes

To search for and collect nodes that meet a particular criterion, you can use the pre-defined NodeFinder visitor. It takes a boolean-valued callback function and searches for nodes that meet that callback.

For example, to search for all function calls without a parameter, you would use the following example,

from src.modules.php.syntax_tree import build_syntax_tree
from src.modules.php.visitors.finders import NodeFinder
from src.modules.php.traversers.bf import BFTraverser
from src.compiler.php.phpast import FunctionCall

def function_has_no_params(node):
    return isinstance(node, FunctionCall) and len(node.params) == 0

s_tree = build_syntax_tree("/path/to/php/file")
bft = BFTraverser(s_tree)
node_finder = NodeFinder(function_has_no_params)
bft.register_visitor(node_finder) 

bft.traverse()
print(node_finder.found)

Predefined Traversers and Visitors

Traversers:

  • BFTraverser: Carries the visitors in a Breadth-first manner
  • DFTraverser: Carries the visitors in a Depth-first manner

Visitors:

  • Printer: Prints the nodes of the AST
  • GraphBuilder: Builds a graph using the AST
  • NameFinder: Searches for specified names (Function Definitions/Variables) and collects those nodes
  • NameHighlighter: Searches and highlights specified names using the AST and a GraphBuilder instance
  • NodeFinder: Takes a boolean-valued callback function and collects all the nodes that satisfy that callback filter
  • DependencyResolver: Searches for all the Include/Require nodes and tries to expand them by connecting the SyntaxTree for the imported file

Resource Tree specific Visitors:

Work similar to the above visitors, except they require a ResourceTree instance at initialization.

  • TablesBuilder: Searches for all function/method definitions while walking a file and updates the function_table and method_table in the corresponding ResourceTree
  • ResourceCallsFinder: Searches for all the Function Calls and Method Calls and associates them with the definitons in function_table and method_table of the corresponding ResourceTree.

Known Issues

  • Poor Performance, especially in the ANTLR-based parser
  • The PLY-based parser does not interpret some constrcuts properly. For example,
    • Use declarations of the format use (const|function) (namespace)
    • Arrow Functions
    • Nested Variable Names or Accessors like $a -> {$b -> c}
  • The ANTLR-based parser has poor performance and is tens of times slower than the PLY-based parser. The AST Generation is also not fully complete

Leave a Reply