Files
ripgrep/grep/src/word_boundary.rs
Andrew Gallant cd08707c7c grep: upgrade to regex-syntax 0.5
This update brings with it many bug fixes:

  * Better error messages are printed overall. We also include
    explicit call out for unsupported features like backreferences
    and look-around.
  * Regexes like `\s*{` no longer emit incomprehensible errors.
  * Unicode escape sequences, such as `\u{..}` are now supported.

For the most part, this upgrade was done in a straight-forward way. We
resist the urge to refactor the `grep` crate, in anticipation of it
being rewritten anyway.

Note that we removed the `--fixed-strings` suggestion whenever a regex
syntax error occurs. In practice, I've found that it results in a lot of
false positives, and I believe that its use is not as paramount now that
regex parse errors are much more readable.

Closes #268, Closes #395, Closes #702, Closes #853
2018-03-13 22:55:39 -04:00

54 lines
2.3 KiB
Rust

use syntax::hir::{self, Hir, HirKind};
/// Strips Unicode word boundaries from the given expression.
///
/// The key invariant this maintains is that the expression returned will match
/// *at least* every where the expression given will match. Namely, a match of
/// the returned expression can report false positives but it will never report
/// false negatives.
///
/// If no word boundaries could be stripped, then None is returned.
pub fn strip_unicode_word_boundaries(expr: &Hir) -> Option<Hir> {
// The real reason we do this is because Unicode word boundaries are the
// one thing that Rust's regex DFA engine can't handle. When it sees a
// Unicode word boundary among non-ASCII text, it falls back to one of the
// slower engines. We work around this limitation by attempting to use
// a regex to find candidate matches without a Unicode word boundary. We'll
// only then use the full (and slower) regex to confirm a candidate as a
// match or not during search.
//
// It looks like we only check the outer edges for `\b`? I guess this is
// an attempt to optimize for the `-w/--word-regexp` flag? ---AG
match *expr.kind() {
HirKind::Concat(ref es) if !es.is_empty() => {
let first = is_unicode_word_boundary(&es[0]);
let last = is_unicode_word_boundary(es.last().unwrap());
// Be careful not to strip word boundaries if there are no other
// expressions to match.
match (first, last) {
(true, false) if es.len() > 1 => {
Some(Hir::concat(es[1..].to_vec()))
}
(false, true) if es.len() > 1 => {
Some(Hir::concat(es[..es.len() - 1].to_vec()))
}
(true, true) if es.len() > 2 => {
Some(Hir::concat(es[1..es.len() - 1].to_vec()))
}
_ => None,
}
}
_ => None,
}
}
/// Returns true if the given expression is a Unicode word boundary.
fn is_unicode_word_boundary(expr: &Hir) -> bool {
match *expr.kind() {
HirKind::WordBoundary(hir::WordBoundary::Unicode) => true,
HirKind::WordBoundary(hir::WordBoundary::UnicodeNegate) => true,
HirKind::Group(ref x) => is_unicode_word_boundary(&x.hir),
_ => false,
}
}