When ripgrep detects a literal, it emits them as raw hex escaped byte
sequences to Regex::new. This permits literal optimizations for arbitrary
byte sequences (i.e., possibly invalid UTF-8). The problem is that
Regex::new interprets hex escaped byte sequences as *Unicode codepoints*
by default, but we want them to actually stand for their raw byte values.
Therefore, disable Unicode mode.
This is OK, since the regex is composed entirely of literals and literal
extraction does Unicode case folding.
Fixes#251
This was a subtle bug, but the big picture was that the smart case
information wasn't being carried through to the literal extraction in
some cases. When this happened, it was possible to get back an incomplete
set of literals, which would therefore miss some valid matches.
The fix to this is to actually parse the regex and determine whether
smart case applies before doing anything else. It's a little extra work,
but parsing is pretty fast.
Fixes#199
If we do, this results in extracting `foofoofoo` from `(\wfoo){3}`,
which is wrong. This does prevent us from extracting `foofoofoo` from
`foo{3}`, which is unfortunate, but we miss plenty of other stuff too.
Literal extracting needs a good rethink (all the way down into the regex
engine).
Fixes#93
In particular, if we had an inner literal and were doing a case insensitive
search, then the literals are dropped because we previously only allowed
a single inner literal to have an effect. Now we allow alternations of
inner literals, but still don't quite take full advantage.