Walk (703337) Mac OS
PEP: | 471 |
---|---|
Title: | os.scandir() function -- a better and faster directory iterator |
Author: | Ben Hoyt <benhoyt at gmail.com> |
BDFL-Delegate: | Victor Stinner <vstinner at python.org> |
Status: | Final |
Type: | Standards Track |
Created: | 30-May-2014 |
Python-Version: | 3.5 |
Post-History: | 27-Jun-2014, 8-Jul-2014, 14-Jul-2014 |
Contents
- Specifics of proposal
- Examples
- Rejected ideas
This PEP proposes including a new directory iteration function, os.scandir(), in the standard library. This new function adds useful functionality and increases the speed of os.walk() by 2-20 times (depending on the platform and file system) by avoiding calls to os.stat() in most cases.
Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.
But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.
In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)
In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here[1].
Somewhat relatedly, many people (see Python Issue 11406[2]) are also keen on a version of os.listdir() that yields filenames as it iterates instead of returning them as one big list. This improves memory efficiency for iterating very large directories.
So, as well as providing a scandir() iterator function for calling directly, Python's existing os.walk() function can be sped up a huge amount.
The implementation of this proposal was written by Ben Hoyt (initial version) and Tim Golden (who helped a lot with the C extension module). It lives on GitHub at benhoyt/scandir[3]. (The implementation may lag behind the updates to this PEP a little.)
Note that this module has been used and tested (see 'Use in the wild' section in this PEP), so it's more than a proof-of-concept. However, it is marked as beta software and is not extensively battle-tested. It will need some cleanup and more thorough testing before going into the standard library, as well as integration into posixmodule.c.
os.scandir()
Specifically, this PEP proposes adding a single function to the os module in the standard library, scandir, that takes a single, optional string as its argument:
Like listdir, scandir calls the operating system's directory iteration system calls to get the names of the files in the given path, but it's different from listdir in two ways:
- Instead of returning bare filename strings, it returns lightweight DirEntry objects that hold the filename string and provide simple methods that allow access to the additional data the operating system may have returned.
- It returns a generator instead of a list, so that scandir acts as a true iterator instead of returning the full list immediately.
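The two differences can be seen side by side in a minimal sketch. It assumes Python 3.6 or later (the `with` form of os.scandir() was added in 3.6); the helper names names_via_listdir and names_via_scandir are hypothetical and exist only for this illustration.

```python
import os


def names_via_listdir(path):
    """Return entry names the way os.listdir() does: one fully built list of strings."""
    return os.listdir(path)


def names_via_scandir(path):
    """Yield entry names lazily, taking one DirEntry at a time from os.scandir()."""
    # Assumption: Python 3.6+, where the scandir iterator is a context manager.
    with os.scandir(path) as entries:
        for entry in entries:
            # entry is a DirEntry object, not a bare string
            yield entry.name
```

Both helpers produce the same names (in system-dependent order), but the second never holds the whole directory listing in memory at once.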
scandir() yields a DirEntry object for each file and sub-directory in path. Just like listdir, the '.' and '..' pseudo-directories are skipped, and the entries are yielded in system-dependent order. Each DirEntry object has the following attributes and methods:
- name: the entry's filename, relative to the scandir path argument (corresponds to the return values of os.listdir)
- path: the entry's full path name (not necessarily an absolute path) -- the equivalent of os.path.join(scandir_path, entry.name)
- inode(): return the inode number of the entry. The result is cached on the DirEntry object; use os.stat(entry.path, follow_symlinks=False).st_ino to fetch up-to-date information. On Unix, no system call is required.
- is_dir(*, follow_symlinks=True): similar to pathlib.Path.is_dir(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; doesn't follow symbolic links if follow_symlinks is False
- is_file(*, follow_symlinks=True): similar to pathlib.Path.is_file(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases; doesn't follow symbolic links if follow_symlinks is False
- is_symlink(): similar to pathlib.Path.is_symlink(), but the return value is cached on the DirEntry object; doesn't require a system call in most cases
- stat(*, follow_symlinks=True): like os.stat(), but the return value is cached on the DirEntry object; does not require a system call on Windows (except for symlinks); doesn't follow symbolic links (like os.lstat()) if follow_symlinks is False
All methods may perform system calls in some cases and therefore possibly raise OSError -- see the 'Notes on exception handling' section for more details.
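As an illustration of the attributes and methods listed above, here is a small sketch that gathers them into a dictionary. The function name describe_entry and the dictionary keys are hypothetical; the DirEntry members are the ones described in the list. Note that stat() follows symlinks by default, so this sketch would raise OSError for a broken symlink.

```python
import os


def describe_entry(entry):
    """Collect the attributes and method results of one DirEntry into a plain dict."""
    return {
        "name": entry.name,              # filename relative to the scanned path
        "path": entry.path,              # os.path.join(scanned_path, entry.name)
        "inode": entry.inode(),          # cached inode number
        "is_dir": entry.is_dir(),        # follows symlinks by default
        "is_file": entry.is_file(),      # follows symlinks by default
        "is_symlink": entry.is_symlink(),
        "size": entry.stat().st_size,    # may need a system call on POSIX
    }
```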
The DirEntry attribute and method names were chosen to be the same as those in the new pathlib module where possible, for consistency. The only difference in functionality is that the DirEntry methods cache their values on the entry object after the first call.
Like the other functions in the os module, scandir() accepts either a bytes or str object for the path parameter, and returns the DirEntry.name and DirEntry.path attributes with the same type as path. However, it is strongly recommended to use the str type, as this ensures cross-platform support for Unicode filenames. (On Windows, bytes filenames have been deprecated since Python 3.3.)
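The type-matching rule can be checked with a short sketch. The helper name entry_name_types is hypothetical, and the `with` form assumes Python 3.6 or later.

```python
import os


def entry_name_types(path):
    """Return the set of Python types used for DirEntry.name when scanning `path`."""
    with os.scandir(path) as entries:
        return {type(entry.name) for entry in entries}
```

Scanning a non-empty directory through a str path is expected to give {str}; scanning the same directory through os.fsencode(path) is expected to give {bytes}.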
os.walk()
As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir(). This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).
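To show why this helps, here is a deliberately simplified, top-down walk built on scandir(). It is a sketch, not the standard library's implementation: simple_walk is a hypothetical name, and the topdown, onerror and followlinks options of os.walk() are left out, as is all error handling.

```python
import os


def simple_walk(top):
    """Yield (dirpath, dirnames, filenames) top-down, like a stripped-down os.walk()."""
    dirnames, filenames = [], []
    with os.scandir(top) as entries:  # assumption: Python 3.6+ for the `with` form
        for entry in entries:
            # is_dir() usually needs no extra system call: the file type
            # comes back with the directory listing itself.
            if entry.is_dir():
                dirnames.append(entry.name)
            else:
                filenames.append(entry.name)
    yield top, dirnames, filenames
    for name in dirnames:
        new_path = os.path.join(top, name)
        # Like os.walk() with its default followlinks=False,
        # do not descend into symlinked directories.
        if not os.path.islink(new_path):
            yield from simple_walk(new_path)
```

The listdir()-based approach would call os.path.isdir() once per entry on top of the listing; here the classification is read from the DirEntry objects.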
First, a very simple example of scandir() showing use of the DirEntry.name attribute and the DirEntry.is_dir() method:
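A minimal sketch of such a function, assuming Python 3.5 or later where os.scandir() is available. The choice to skip names starting with '.' is an assumption made for this illustration.

```python
import os


def subdirs(path):
    """Yield names of directories under `path`, skipping names that start with '.'."""
    for entry in os.scandir(path):
        # entry.name is the bare filename; is_dir() follows symlinks by default.
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name
```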
This subdirs() function will be significantly faster with scandir than os.listdir() and os.path.isdir() on both Windows and POSIX systems, especially on medium-sized or large directories.
Or, for getting the total size of files in a directory tree, showing use of the DirEntry.stat() method and DirEntry.path attribute:
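A minimal sketch of such a function, assuming Python 3.5 or later. It is written to match the surrounding description (recursive, not following symlinks) and has no error handling.

```python
import os


def get_tree_size(path):
    """Return the total size in bytes of all files under `path`, recursively."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            # Recurse using the entry's full path.
            total += get_tree_size(entry.path)
        else:
            # Count the entry itself (a symlink is measured, not its target).
            total += entry.stat(follow_symlinks=False).st_size
    return total
```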
This also shows the use of the follow_symlinks parameter to is_dir() -- in a recursive function like this, we probably don't want to follow links. (To properly follow links in a recursive function like this we'd want special handling for the case where following a symlink leads to a recursive loop.)
Note that get_tree_size() will get a huge speed boost on Windows, because no extra stat calls are needed, but on POSIX systems the size information is not returned by the directory iteration functions, so this function won't gain anything there.
Notes on caching
The DirEntry objects are relatively dumb -- the name and path attributes are obviously always cached, and the is_X and stat methods cache their values (immediately on Windows via FindNextFile, and on first use on POSIX systems via a stat system call) and never refetch from the system.
For this reason, DirEntry objects are intended to be used and thrown away after iteration, not stored in long-lived data structures and the methods called again and again.
If developers want 'refresh' behaviour (for example, for watching a file's size change), they can simply use pathlib.Path objects, or call the regular os.stat() or os.path.getsize() functions which get fresh data from the operating system every call.
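The difference between the cached and the fresh value can be demonstrated with a short sketch. The function name demo_stale_stat and the file name data.bin are hypothetical; the `with` form of os.scandir() assumes Python 3.6 or later.

```python
import os
import tempfile


def demo_stale_stat():
    """Return (size before growth, DirEntry size after growth, os.stat size after growth)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.bin")
        with open(target, "wb") as f:
            f.write(b"x")                      # file is 1 byte long
        with os.scandir(tmp) as entries:
            entry = next(entries)              # the only entry in the directory
        before = entry.stat().st_size          # result is cached on the entry
        with open(target, "ab") as f:
            f.write(b"yy")                     # file grows to 3 bytes
        cached = entry.stat().st_size          # served from the cache, not refetched
        fresh = os.stat(entry.path).st_size    # asks the operating system again
        return before, cached, fresh
```

The first two values stay equal, while the third reflects the file's new size.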
Notes on exception handling
DirEntry.is_X() and DirEntry.stat() are explicitly methods rather than attributes or properties, to make it clear that they may not be cheap operations (although they often are), and they may do a system call. As a result, these methods may raise OSError.
For example, DirEntry.stat() will always make a system call on POSIX-based systems, and the DirEntry.is_X() methods will make a stat() system call on such systems if readdir() does not support d_type or returns a d_type with a value of DT_UNKNOWN, which can occur under certain conditions or on certain file systems.
Often this does not matter -- for example, os.walk() as defined in the standard library only catches errors around the listdir() calls.
Also, because the exception-raising behaviour of the DirEntry.is_X methods matches that of pathlib -- which only raises OSError in the case of permissions or other fatal errors, but returns False if the path doesn't exist or is a broken symlink -- it's often not necessary to catch errors around the is_X() calls.
However, when a user requires fine-grained error handling, it may be desirable to catch OSError around all method calls and handle as appropriate.
For example, below is a version of the get_tree_size() example shown above, but with fine-grained error handling added:
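A minimal sketch of such a version, assuming Python 3.5 or later. Reporting errors to stderr and counting a failed entry as zero bytes are choices made for this illustration.

```python
import os
import sys


def get_tree_size(path):
    """Return the total size of files under `path`, treating failed entries as size 0."""
    total = 0
    for entry in os.scandir(path):
        try:
            is_dir = entry.is_dir(follow_symlinks=False)
        except OSError as error:
            print('Error calling is_dir():', error, file=sys.stderr)
            continue  # skip this entry entirely
        if is_dir:
            total += get_tree_size(entry.path)
        else:
            try:
                total += entry.stat(follow_symlinks=False).st_size
            except OSError as error:
                # For example, the file was deleted between listing and stat().
                print('Error calling stat():', error, file=sys.stderr)
    return total
```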
The scandir module on GitHub has been forked and used quite a bit (see 'Use in the wild' in this PEP), but there's also been a fair bit of direct support for a scandir-like function from core developers and others on the python-dev and python-ideas mailing lists. A sampling:
- python-dev: a good number of +1's and very few negatives for scandir and PEP 471 on this June 2014 python-dev thread
- Nick Coghlan, a core Python developer: 'I've had the local Red Hat release engineering team express their displeasure at having to stat every file in a network mounted directory tree for info that is present in the dirent structure, so a definite +1 to os.scandir from me, so long as it makes that info available.' [source1]
- Tim Golden, a core Python developer, supports scandir enough to have spent time refactoring and significantly improving scandir's C extension module. [source2]
- Christian Heimes, a core Python developer: '+1 for something like yielddir()' [source3] and 'Indeed! I'd like to see the feature in 3.4 so I can remove my own hack from our code base.' [source4]
- Gregory P. Smith, a core Python developer: 'As 3.4beta1 happens tonight, this isn't going to make 3.4 so i'm bumping this to 3.5. I really like the proposed design outlined above.' [source5]
- Guido van Rossum on the possibility of adding scandir to Python 3.5 (as it was too late for 3.4): 'The ship has likewise sailed for adding scandir() (whether to os or pathlib). By all means experiment and get it ready for consideration for 3.5, but I don't want to add it to 3.4.' [source6]
Support for this PEP itself (meta-support?) was given by Nick Coghlan on python-dev: 'A PEP reviewing all this for 3.5 and proposing a specific os.scandir API would be a good thing.' [source7]
To date, the scandir implementation is definitely useful, but has been clearly marked 'beta', so it's uncertain how much use of it there is in the wild. Ben Hoyt has had several reports from people using it. For example:
- Chris F: 'I am processing some pretty large directories and was half expecting to have to modify getdents. So thanks for saving me the effort.' [via personal email]
- bschollnick: 'I wanted to let you know about this, since I am using Scandir as a building block for this code. Here's a good example of scandir making a radical performance improvement over os.listdir.' [source8]
- Avram L: 'I'm testing our scandir for a project I'm working on. Seems pretty solid, so first thing, just want to say nice work!' [via personal email]
- Matt Z: 'I used scandir to dump the contents of a network dir in under 15 seconds. 13 root dirs, 60,000 files in the structure. This will replace some old VBA code embedded in a spreadsheet that was taking 15-20 minutes to do the exact same thing.' [via personal email]
Others have requested a PyPI package[4] for it, which has been created. See PyPI package[5].
GitHub stats don't mean too much, but scandir does have several watchers, issues, forks, etc. Here's the run-down of the stats as of July 7, 2014:
- Watchers: 17
- Stars: 57
- Forks: 20
- Issues: 4 open, 26 closed
Also, because this PEP will increase the speed of os.walk() significantly, there are thousands of developers and scripts, and a lot of production code, that would benefit from it. For example, on GitHub, there are almost as many uses of os.walk (194,000) as there are of os.mkdir (230,000).
Naming
The only other real contender for this function's name was iterdir(). However, iterX() functions in Python (mostly found in Python 2) tend to be simple iterator equivalents of their non-iterator counterparts. For example, dict.iterkeys() is just an iterator version of dict.keys(), but the objects returned are identical. In scandir()'s case, however, the return values are quite different objects (DirEntry objects vs filename strings), so this should probably be reflected by a difference in name -- hence scandir().
See some relevant discussion on python-dev.
Wildcard support
FindFirstFile/FindNextFile on Windows support passing a 'wildcard' like *.jpg, so at first folks (this PEP's author included) felt it would be a good idea to include a windows_wildcard keyword argument to the scandir function so users could pass this in.
However, on further thought and discussion it was decided that this would be a bad idea, unless it could be made cross-platform (a pattern keyword argument or similar). This seems easy enough at first -- just use the OS wildcard support on Windows, and something like fnmatch or re afterwards on POSIX-based systems.
Unfortunately the exact Windows wildcard matching rules aren't really documented anywhere by Microsoft, and they're quite quirky (see this blog post), meaning it's very problematic to emulate using fnmatch or regexes.
So the consensus was that Windows wildcard support was a bad idea. It would be possible to add at a later date if there's a cross-platform way to achieve it, but not for the initial version.
Read more on this Nov 2012 python-ideas thread and this June 2014 python-dev thread on PEP 471.
Methods not following symlinks by default
There was much debate on python-dev (see messages in this thread) over whether the DirEntry methods should follow symbolic links or not (when the is_X() methods had no follow_symlinks parameter).
Initially they did not (see previous versions of this PEP and the scandir.py module), but Victor Stinner made a pretty compelling case on python-dev that following symlinks by default is a better idea, because:
- following links is usually what you want (in 92% of cases in the standard library, functions using os.listdir() and os.path.isdir() do follow symlinks)
- that's the precedent set by the similar functions os.path.isdir() and pathlib.Path.is_dir(), so to do otherwise would be confusing
- with the non-link-following approach, if you wanted to follow links you'd have to say something like if (entry.is_symlink() and os.path.isdir(entry.path)) or entry.is_dir(), which is clumsy
As a case in point that shows the non-symlink-following version is error prone, this PEP's author had a bug caused by getting this exact test wrong in his initial implementation of scandir.walk() in scandir.py (see Issue #4 here).
In the end there was not total agreement that the methods should follow symlinks, but there was basic consensus among the most involved participants, and this PEP's author believes that the above case is strong enough to warrant following symlinks by default.
In addition, it's straightforward to call the relevant methods with follow_symlinks=False if the other behaviour is desired.
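The two behaviours can be compared directly with a small sketch. The helper name classify_dirs is hypothetical, and the `with` form of os.scandir() assumes Python 3.6 or later.

```python
import os


def classify_dirs(path):
    """Map each entry name to (is_dir following links, is_dir not following links)."""
    with os.scandir(path) as entries:
        return {
            entry.name: (entry.is_dir(), entry.is_dir(follow_symlinks=False))
            for entry in entries
        }
```

For a real directory both values are True; for a symlink that points at a directory the first is True and the second is False. Creating such a symlink for experimentation with os.symlink() may require extra privileges on Windows.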
DirEntry attributes being properties
In some ways it would be nicer for the DirEntry is_X() and stat() to be properties instead of methods, to indicate they're very cheap or free. However, this isn't quite the case, as stat() will require an OS call on POSIX-based systems but not on Windows. Even is_dir() and friends may perform an OS call on POSIX-based systems if the dirent.d_type value is DT_UNKNOWN (on certain file systems).
Also, people would expect the attribute access entry.is_dir to only ever raise AttributeError, not OSError in the case it makes a system call under the covers. Calling code would have to have a try/except around what looks like a simple attribute access, and so it's much better to make them methods.
See this May 2013 python-dev thread where this PEP's author makes this case and there's agreement from core developers.
DirEntry fields being 'static' attribute-only objects
In this July 2014 python-dev message, Paul Moore suggested a solution that was a 'thin wrapper round the OS feature', where the DirEntry object had only static attributes: name, path, and is_X, with the st_X attributes only present on Windows. The idea was to use this simpler, lower-level function as a building block for higher-level functions.
At first there was general agreement that simplifying in this way was a good thing. However, there were two problems with this approach. First, the assumption is that the is_dir and similar attributes are always present on POSIX, which isn't the case (if d_type is not present or is DT_UNKNOWN). Second, it's a much harder-to-use API in practice, as even the is_dir attributes aren't always present on POSIX, and would need to be tested with hasattr() and then os.stat() called if they weren't present.
See this July 2014 python-dev response from this PEP's author detailing why this option is a non-ideal solution, and the subsequent reply from Paul Moore voicing agreement.
DirEntry fields being static with an ensure_lstat option
Another seemingly simpler and attractive option was suggested by Nick Coghlan in this June 2014 python-dev message: make DirEntry.is_X and DirEntry.lstat_result properties, and populate DirEntry.lstat_result at iteration time, but only if the new argument ensure_lstat=True was specified on the scandir() call.
This does have the advantage over the above in that you can easily get the stat result from scandir() if you need it. However, it has the serious disadvantage that fine-grained error handling is messy, because stat() will be called (and hence potentially raise OSError) during iteration, leading to a rather ugly, hand-made iteration loop:
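The shape of such a loop can be sketched as follows. The ensure_lstat argument was a rejected idea and does not exist, so this sketch calls the real os.scandir(), where next() rarely raises OSError; it only illustrates how awkward the control flow becomes. The names iterate_with_error_handling and handle_error are hypothetical, and the close() call assumes Python 3.6 or later.

```python
import os


def iterate_with_error_handling(path, handle_error):
    """Collect entry names, passing any OSError raised by next() to handle_error."""
    names = []
    it = os.scandir(path)
    try:
        while True:
            try:
                entry = next(it)
            except StopIteration:
                break                      # directory exhausted
            except OSError as error:
                handle_error(path, error)  # error raised while fetching an entry
                continue
            names.append(entry.name)
    finally:
        it.close()                         # assumption: Python 3.6+
    return names
```

Compare this with an ordinary for loop plus a try/except around entry.stat(), as used in the get_tree_size() example with error handling.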
Or it means that scandir() would have to accept an onerror argument -- a function to call when stat() errors occur during iteration. This seems to this PEP's author neither as direct nor as Pythonic as try/except around a DirEntry.stat() call.
Another drawback is that os.scandir() is written to make code faster. Always calling os.lstat() on POSIX would not bring any speedup. In most cases, you don't need the full stat_result object -- the is_X() methods are enough and this information is already known.
See Ben Hoyt's July 2014 reply to the discussion summarizing this and detailing why he thinks the original PEP 471 proposal is 'the right one' after all.
Return values being (name, stat_result) two-tuples
Initially this PEP's author proposed this concept as a function called iterdir_stat() which yielded two-tuples of (name, stat_result). This does have the advantage that there are no new types introduced. However, the stat_result is only partially filled on POSIX-based systems (most fields set to None and other quirks), so they're not really stat_result objects at all, and this would have to be thoroughly documented as different from os.stat().
Also, Python has good support for proper objects with attributes and methods, which makes for a saner and simpler API than two-tuples. It also makes the DirEntry objects more extensible and future-proof as operating systems add functionality and we want to include this in DirEntry.
See also some previous discussion:
- May 2013 python-dev thread where Nick Coghlan makes the original case for a DirEntry-style object.
- June 2014 python-dev thread where Nick Coghlan makes (another) good case against the two-tuple approach.
Return values being overloaded stat_result objects
Another alternative discussed was making the return values overloaded stat_result objects with name and path attributes. However, apart from this being a strange (and strained!) kind of overloading, this has the same problems mentioned above -- most of the stat_result information is not fetched by readdir() on POSIX systems, only (part of) the st_mode value.
Return values being pathlib.Path objects
With Antoine Pitrou's new standard library pathlib module, it at first seems like a great idea for scandir() to return instances of pathlib.Path. However, pathlib.Path's is_X() and stat() functions are explicitly not cached, whereas scandir has to cache them by design, because it's (often) returning values from the original directory iteration system call.
And if the pathlib.Path instances returned by scandir cached stat values, but the ordinary pathlib.Path objects explicitly don't, that would be more than a little confusing.
Guido van Rossum explicitly rejected pathlib.Path caching stat in the context of scandir here, making pathlib.Path objects a bad choice for scandir return values.
There are many possible improvements one could make to scandir, but here is a short list of some that this PEP's author has in mind:
- scandir could potentially be further sped up by calling readdir / FindNextFile say 50 times per Py_BEGIN_ALLOW_THREADS block so that it stays in the C extension module for longer, and may be somewhat faster as a result. This approach hasn't been tested, but was suggested on Issue 11406 by Antoine Pitrou. [source9]
- scandir could use a free list to avoid the cost of memory allocation for each iteration -- a short free list of 10 or maybe even 1 may help. Suggested by Victor Stinner on a python-dev thread on June 27[6].
- Original November 2012 thread Ben Hoyt started on python-ideas about speeding up os.walk()
- Python Issue 11406[2], which includes the original proposal for a scandir-like function
- Further May 2013 thread Ben Hoyt started on python-dev that refined the scandir() API, including Nick Coghlan's suggestion of scandir yielding DirEntry-like objects
- November 2013 thread Ben Hoyt started on python-dev to discuss the interaction between scandir and the new pathlib module
- June 2014 thread Ben Hoyt started on python-dev to discuss the first version of this PEP, with extensive discussion about the API
- First July 2014 thread Ben Hoyt started on python-dev to discuss his updates to PEP 471
- Second July 2014 thread Ben Hoyt started on python-dev to discuss the remaining decisions needed to finalize PEP 471, specifically whether the DirEntry methods should follow symlinks by default
- Question on StackOverflow about why os.walk() is slow and pointers on how to fix it (this inspired the author of this PEP early on)
- BetterWalk, this PEP's author's previous attempt at this, on which the scandir code is based
[1] | https://github.com/benhoyt/scandir#benchmarks |
[2] | http://bugs.python.org/issue11406 |
[3] | https://github.com/benhoyt/scandir |
[4] | https://github.com/benhoyt/scandir/issues/12 |
[5] | https://pypi.python.org/pypi/scandir |
[6] | https://mail.python.org/pipermail/python-dev/2014-June/135232.html |
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/master/pep-0471.txt