Deep Packet Inspection (DPI) lies at the core of contemporary Network Intrusion Detection/Prevention Systems and Web Application Firewalls. DPI aims to identify various malware (including spam and viruses) by inspecting both the header and the payload of each packet and comparing it to a known set of patterns. DPI is often performed on the critical path of the packet processing, thus the overall performance of the security tools is dominated by the speed of DPI.
The seminal algorithm of Aho-Corasick (AC) is the defacto standard for pattern matching in network intrusion detection systems (NIDS). Basically, the AC algorithm constructs a Deterministic Finite Automaton (DFA) for detecting all occurrences of a given set of patterns by processing the input in a single pass. The input is inspected symbol by symbol (usually each symbol is a byte), such that each symbol results in a state transition. Thus, in principle, the AC algorithm has deterministic performance, which does not depend on specific input and therefore is not vulnerable to algorithmic complexity attacks, making it very attractive to NIDS systems.
In this paper we show that, when implementing the AC algorithm in software, this property does not hold, due to the fact that contemporary pattern sets induce very large DFAs that cannot be stored entirely in cache. We then propose a novel technique to compress the representation of the Aho-Corasick automaton, so it can fit in modern cache. We compare both the performance and the memory footprint of our technique to previously-proposed implementation, under various settings and pattern sets. Our results reveal the space-time tradeoffs of DPI. Specifically, we show that our compression technique reduces the memory footprint of the best prior-art algorithm by approximately 60%, while achieving comparable throughput.
See "Space-time Tradeoffs in Software-based Deep Packet Inspection" - here.
Other "How to Build DPI Product" posts related to pattern matching: here, here, here and here.