This commit fixes a bug on Windows where directory traversals were
completely broken when attempting to scan OneDrive directories that use
the "file on demand" strategy.
The specific problem was that Rust's standard library treats OneDrive
directories as reparse points instead of directories, which causes
methods like `FileType::is_file` and `FileType::is_dir` to always return
false, even when retrieved via methods like `metadata` that purport to
follow symbolic links.
We fix this by peppering our code with checks on the underlying file
attributes exposed by Windows. We consider an entry a directory if and
only if the directory bit is set on the attributes. We are careful to
make sure that the code remains the same on non-Windows platforms.
Note that we also bump the dependency on `walkdir`, which contains a
similar fix for its traversals.
This bug is recorded upstream:
https://github.com/rust-lang/rust/issues/46484
Upstream also has a pending PR:
https://github.com/rust-lang/rust/pull/47956Fixes#705
This optimization wasn't tested too carefully, and it seems to result
in a massive amount of file handles open simultaneously. This is likely
a result of the parallel iterator, where many directories are being
traversed simultaneously.
Fixes#648
This commit fixes the symlink loop checker in the parallel directory
traverser to open fewer handles at the expense of keeping handles held
open longer.
This roughly matches the corresponding change in walkdir:
5bcc5b87eeFixes#633
The uninteresting bits of this commit involve mechanical changes for
updates to walkdir 2. The more interesting bits of this commit are the
breaking changes, although none of them should require any significant
change on users of this library. The breaking changes are as follows:
* `DirEntry::path_is_symbolic_link` has been renamed to
`DirEntry::path_is_symlink`. This matches the conventions in the
standard library, and also the corresponding name change in walkdir.
* Removed the `From<walkdir::Error> for ignore::Error` impl. This was
intended to only be used internally, but was the only thing that
made `walkdir` a public dependency of `ignore`. Therefore, we remove
it since it seems unnecessary.
* Renamed `WalkBuilder::sort_by` to `WalkBuilder::sort_by_file_name`,
and changed the type of the comparator from
Fn(&OsString, &OsString) -> cmp::Ordering + 'static
to
Fn(&OsStr, &OsStr) -> cmp::Ordering + Send + Sync + 'static
The corresponding change in `walkdir` retains the `sort_by` name, but
gives the comparator a pair of `&DirEntry` values instead of a pair
of `&OsStr` values. Ideally, `ignore` would hand off its own pair of
`&ignore::DirEntry` values, but this requires more design work. So for
now, we retain previous functionality, but leave room to make a proper
`sort_by` method.
[breaking-change]
A maximum filesize can be specified as an argument to a `WalkBuilder`.
If a file exceeds the specified size it will be ignored as part of the
resulting file/directory set.
The filesize limit never applies to directories.
When running ripgrep like this:
rg foo > output
we must be careful not to search `output` since ripgrep is actively writing
to it. Searching it can cause massive blowups where the file grows without
bound.
While this is conceptually easy to fix (check the inode of the redirection
and the inode of the file you're about to search), there are a few problems
with it.
First, inodes are a Unix thing, so we need a Windows specific solution to
this as well. To resolve this concern, I created a new crate, `same-file`,
which provides a cross platform abstraction.
Second, stat'ing every file is costly. This is not avoidable on Windows,
but on Unix, we can get the inode number directly from directory traversal.
However, this information wasn't exposed, but now it is (through both the
ignore and walkdir crates).
Fixes#286
This commit fixes two issues.
First, the iterator was executing the callback for every child of a
directory in a single thread. Therefore, if the walker was run over a
single directory, then no parallelism is used. We tweak the iterator
slightly so that we don't fall into this trap.
The second issue is a bit more subtle. In particular, we don't use the
blocking semantics of MsQueue because we *don't know when iteration
finishes*. This means that if there are a bunch of idle workers because
there is no work available to them, then they will spin and burn the
CPU. One case where this crops up is if you pipe the output of ripgrep
into `less` *and* the total number of files to search is fewer than the
number of threads ripgrep uses. We "fix" this with a very stupid
heuristic: when the queue yields no work, we sleep the thread for 1ms.
This still pegs the CPU, but not nearly as much as before.
If one really want to avoid this behavior when using ripgrep, then `-j1`
can be used to disable parallelism.
Fixes#258
When give an explicit file path on the command line like `foo` where `foo`
is a symlink, ripgrep should follow it even if `-L` isn't set. This is
consistent with the behavior of `foo/`.
Fixes#256
Previously, ignore::WalkParallel would invoke the callback for all
*explicitly* given file paths in a single thread, which effectively
meant that `rg pattern foo bar baz ...` didn't actually search foo, bar
and baz in parallel.
The code was structured that way to avoid spinning up workers if no
directory paths were given. The original intention was probably to have
a separate pool of threads responsible for searching, but ripgrep ended
up just reusing the ignore::WalkParallel workers themselves for searching,
and thereby subjected to its sub-par performance in this case.
The code has been restructured so that file paths are sent to the workers,
which brings back parallelism.
Fixes#226
Namely, passing a directory to --ignore-file caused ripgrep to allocate
memory without bound.
The issue was that I got a bit overzealous with partial error
reporting. Namely, when processing a gitignore file, we should try
to use every pattern even if some patterns are invalid globs (e.g.,
a**b). In the process, I applied the same logic to I/O errors. In this
case, it manifest by attempting to read lines from a directory, which
appears to yield Results forever, where each Result is an error of the
form "you can't read from a directory silly." Since I treated it as a
partial error, ripgrep was just spinning and accruing each error in
memory, which caused the OOM killer to kick in.
Fixes#228
This adds a new walk type in the `ignore` crate, `WalkParallel`, which
provides a way for recursively iterating over a set of paths in parallel
while respecting various ignore rules.
The API is a bit strange, as a closure producing a closure isn't
something one often sees, but it does seem to work well.
This also allowed us to simplify much of the worker logic in ripgrep
proper, where MultiWorker is now gone.
This PR introduces a new sub-crate, `ignore`, which primarily provides a
fast recursive directory iterator that respects ignore files like
gitignore and other configurable filtering rules based on globs or even
file types.
This results in a substantial source of complexity moved out of ripgrep's
core and into a reusable component that others can now (hopefully)
benefit from.
While much of the ignore code carried over from ripgrep's core, a
substantial portion of it was rewritten with the following goals in
mind:
1. Reuse matchers built from gitignore files across directory iteration.
2. Design the matcher data structure to be amenable for parallelizing
directory iteration. (Indeed, writing the parallel iterator is the
next step.)
Fixes#9, #44, #45