698 lines
39 KiB
Plaintext
698 lines
39 KiB
Plaintext
Article 456 of comp.os.os2.programmer:
|
|
Organization: Freshman, CIT general, Carnegie Mellon, Pittsburgh, PA
|
|
Path: holborn!uupsi!psinntp!uunet!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!cis.ohio-state.edu!news.sei.cmu.edu!bb3.andrew.cmu.edu!crabapple.srv.cs.cmu.edu!andrew.cmu.edu!pd2n+
|
|
Newsgroups: comp.os.os2.programmer
|
|
Message-ID: <QfLAc7m00YUn5IsFYX@andrew.cmu.edu>
|
|
Date: Tue, 19 Jan 1993 22:41:59 -0500
|
|
From: "Peter W. De Bonte" <pd2n+@andrew.cmu.edu>
|
|
Subject: Re: Repost: Wanted: bits and bytes of HPFS
|
|
In-Reply-To: <1jcua6INNboo@irz401.inf.tu-dresden.de>
|
|
References: <1jcua6INNboo@irz401.inf.tu-dresden.de>
|
|
Distribution: world
|
|
Lines: 682
|
|
|
|
This may not give you everything, but there is quite a bit here. I'd
|
|
tell you were I got it from, but I forgot... sorry
|
|
|
|
O /
|
|
X ------ cut here -------------------------------------------------
|
|
O \
|
|
|
|
Here's a technical description of the High Performance Files System
|
|
that was posted on the net recently.
|
|
|
|
***** Computer Library Periodicals, Jan 1990 : Doc #14753 *****
|
|
|
|
Journal: Microsoft Systems Journal Sept 1989 v4 n5 p1(13)
|
|
* Full Text COPYRIGHT Microsoft Corp. 1989.
|
|
-----------------------------------------------------------------------------
|
|
Title: Design goals and implementation of the new High Performance File
|
|
System. (includes related article on B-Trees and B+ Trees)
|
|
Author: Duncan, Roy.
|
|
|
|
Summary: The High Performance File System (HPFS) enhancement to OS/2
|
|
Version 1.2 solves all the problems of the File Allocation Table (FAT) file
|
|
system and is designed to meet the demands expected into the next few
|
|
decades. HPFS not only serves as a way to organize data on random access
|
|
block storage devices, but is also a software module that translates
|
|
file-oriented requests from applications programs to device drivers. HPFS is
|
|
also an example of an installable file system, which makes it possible to
|
|
access several incompatible volume structures on the same OS/2 system
|
|
simultaneously. Excellent throughput is achieved by the use of advanced data
|
|
structures such as intelligent caching, read-ahead and write-behind. Disk
|
|
space is managed more economically by the use of sectoring. HPFS also
|
|
includes greatly improved fault tolerance. Applications programs need only
|
|
simple modifications to make use of extended attributes and long filenames.
|
|
-----------------------------------------------------------------------------
|
|
Descriptors..
|
|
Product: OS-2 Extended Edition 1.2 (Operating system) (product
|
|
enhancement).
|
|
Topic: OS-2
|
|
File Management
|
|
Enhancements
|
|
Data Structures
|
|
Disk Space Allocation
|
|
Sectoring.
|
|
Feature: illustration
|
|
table
|
|
chart.
|
|
Caption: (Comparison of FAT and High Performance File System.)
|
|
(Overall structure of an HPFS volume.)
|
|
(Overall structure of an Fnode.)
|
|
|
|
Record#: 07 585 454.
|
|
-----------------------------------------------------------------------------
|
|
Full Text:
|
|
|
|
THE HPFS IS A WAY OF ORGANIZING DATA ON A RANDOM ACCESS BLOCK STORAGE DEVICE.
|
|
IT IS ALSO A SOFTWARE MODULE THAT TRANSLATES FILE-ORIENTED REQUESTS FROM AN
|
|
APPLICATION PROGRAM INTO MORE PRIMITIVE REQUESTS THAT A DEVICE DRIVER CAN
|
|
UNDERSTAND.
|
|
|
|
The High Performance File System (hereafter HPFS), which is making its first
|
|
appearance in the OS/2 operating systemVersion 1.2, had its genesis in the
|
|
network division of Microsoft and was designed by Gordon Letwin, the chief
|
|
architect of the OS/2 operating system. The HPFS has been designed to meet
|
|
the demands of increasingly powerful PCs, fixed disks, and networks for many
|
|
years to come and to serve as a suitable platform for object-oriented
|
|
languages, applications, and user interfaces.
|
|
|
|
The HPFS is a complex topic because it incorporates three distinct yet
|
|
interrelated file system issues. First, the HPFS is a way of organizing data
|
|
on a random access block storage device. Second, it is a software module that
|
|
translates file-oriented requests from an application program into more
|
|
primitive requests that a device driver can understand, using a variety of
|
|
creative techniques to maximize performance. Third, the HPFS is a practical
|
|
illustration of an important new OS/2 feature known as installable file
|
|
systems.
|
|
|
|
This article introduces the three aspects of the HPFS. But first, it puts
|
|
the HPFS in perspective by reviewing some of the problems that led to the
|
|
system's existence.
|
|
|
|
FAT File System
|
|
|
|
The so-called FAT file system, which is the file system used in all versions
|
|
of the MS-DOS"' operating system to date and in the first two releases of
|
|
OS/2' (Versions 1.0 and 1.1), has a dual heritage in Microsoft's earliest
|
|
programming language products and the Digital Research(R) CP/M(R) operating
|
|
system-software originally written for 8080-based and Z-80-based
|
|
microcomputers. It inherited characteristics from both ancestors that have
|
|
progressively tumed into handicaps in this new era of multitasking, protected
|
|
mode, virtual memory, and huge fixed disks.
|
|
|
|
The FAT file system revolves around the File Allocation Table for which it is
|
|
named. Each logical volume has its own FAT, which serves two important
|
|
functions: it contains the allocation information for each file on the volume
|
|
in the form of linked lists of allocation units (clusters, which are
|
|
power-of-2 multiples of sectors) and it indicates which allocation units are
|
|
free for assignment to a file that is being created or extended.
|
|
|
|
The FAT was invented by Bill Gates and Marc McDonald in1977 as a method of
|
|
managing disk space in the NCR version of standalone Microsoft* Disk BASIC.
|
|
Tim Paterson, at that time an employee of Seattle (R) Computer Products
|
|
(SCP), was introduced to the FAT concept when his company shared a booth with
|
|
Microsoft at the National Computer Conference in 1979. Paterson subsequently
|
|
incorporated FATs into the file system of 86-DOS, an operating system for
|
|
SCP's S-100 bus 8086 CPU boards. 86-DOS was eventually purchased by
|
|
Microsoft and became the starting point for MS-DOS2 Version 1.0, which was
|
|
released for the original IBM" PC in August 1981.
|
|
|
|
When the FAT was conceived, it was an excellent solution to disk management,
|
|
mainly because the floppy disks on which it was used were rarely larger than
|
|
I Mb. On such di s, the FAT was small enough to be held in memory at all
|
|
times, allowing very fast random access to any part of any file. This proved
|
|
far superior to the CP/M method of tracking disk space, in which the
|
|
information about the sectors assigned to a file might be spread across many
|
|
directory entries, which were in turn scattered randomly throughout the disk
|
|
directory.
|
|
|
|
When applied to fixed disks, however, the FAT began to look more like a bug
|
|
than a feature. It became too large to be held entirely resident and had to
|
|
be paged into memory in pieces; this paging resulted in many superfluous disk
|
|
head movements as a program was reading through a file and degraded system
|
|
throughput. In addition, because the information about free disk space was
|
|
dispersed across many sectors of FAT, it was impractical to allocate file
|
|
space contiguously, and file fragmentation became another obstacle to good
|
|
performance. Moreover, the use of relatively large clusters on fixed disks
|
|
resulted in a lot of dead space, since an average of one-half cluster was
|
|
wasted for each file. (Some network servers use clusters as large as 64Kb.)
|
|
|
|
The FAT file system's restrictions on naming files and directories are
|
|
inherited from CP/M. When Paterson was writing 86DOS, one of his primary
|
|
objectives was to make programs easy to port from CP/M to his new operating
|
|
system. He therefore adopted CP/M's limits on filenames and extensions so
|
|
the critical fields of 86-DOS File Control Blocks (FCBs) would look almost
|
|
exactly like those of CP/M. The sizes of the FCB filename and extension
|
|
fields were also propagated into the structure of disk directory entries. In
|
|
due time, 86-DOS became MSDOS and application programs for MS-DOS
|
|
proliferated beyond anyone's wildest dreams. Since most of the early
|
|
programs depended on the structure of FCBs, the 8.3 format for filenames
|
|
became irrevocably locked into the system.
|
|
|
|
During the last couple of years, Microsoft and IBM have made valiant attempts
|
|
to prolong the useful life of the FAT file system by lifting the restrictions
|
|
on volume sizes, improving allocation strategies, caching pathnames, and
|
|
moving tables and buffers into expanded memory. But these can only be
|
|
regarded as temporizing measures, because the fundamental data structures
|
|
used by the FAT file system are simply not well suited to large random access
|
|
devices.
|
|
|
|
The HPFS solves the FAT file system problems mentioned here and many others,
|
|
but it is not derived in any way from the FAT file system. The architect of
|
|
the HPFS started with a clean sheet of paper and designed a file system that
|
|
can take full advantage of a multitasking environment, and that will be able
|
|
to cope with any sort of disk device likely to arrive on microcomputers
|
|
during the next decade.
|
|
|
|
HPFS Volume Structure
|
|
|
|
HPFS volumes are a new partition type-type 7-and can exist on a fixed disk
|
|
alongside of the several previously defined FAT partition types.
|
|
IBM-compatible HPFS volumes use a sector size of 512 bytes and have a maximum
|
|
size of 2199Gb (2" sectors). Although there is no particular reason why
|
|
floppy disks can't be formatted as HPFS volumes, Microsoft plans to stick
|
|
with FAT file systems on floppy disks for the foreseeable future. (This
|
|
ensures that users will be able to transport files easily between MS-DOS and
|
|
OS/2 systems.)
|
|
|
|
An HPFS volume has very few fixed structures (Figure 1). Sectors 0-15 of a
|
|
volume (8Kb) are the BootBlock and contain a volume name, 32-bit volume ID,
|
|
and a disk bootstrap program. The bootstrap is relatively sophisticated (by
|
|
MS-DOS standards) and can use the HPFS in a restricted mode to locate and
|
|
read the operating system files wherever they might be found.
|
|
|
|
Sectors 16 and 17 are known as the SuperBlock and the SpareBlock
|
|
respectively. The SuperBlock is only modified by disk maintenance utilities.
|
|
It contains pointers to the free space bitmaps, the bad block list, the
|
|
directory block band, and the root directory. It also contains the date that
|
|
the volume was last checked out and repaired with CHKDSK/F. The SpareBlock
|
|
contains various flags and pointers that will be discussed later; it is
|
|
modified, although infrequently, as the system executes.
|
|
|
|
The remainder of the disk is divided into 8Mb bands. Each band has its own
|
|
free space bitmap in which a bit represents each sector. A bit is 0 if the
|
|
sector is in use and I if the sector is available. The bitmaps are located
|
|
at the head or tail of a band so that two bitmaps are adjacent between
|
|
alternate bands. This allows the maximum contiguous free space that can be
|
|
allocated to a file to be 16Mb. One band, located at or toward the seek
|
|
center of the disk, is called the directory block band and receives special
|
|
treatment (more about this later). Note that the band size is a
|
|
characteristic of the current implementation and may be changed in later
|
|
versions of the file system.
|
|
|
|
Files and Fnodes
|
|
|
|
Every file or directory on an HPFS volume is anchored on a fundamental file
|
|
system object called an Fnode (pronounced "eff node"). Each Fnode occupies a
|
|
single sector and contains control and access history information used
|
|
internally by the file system, extended attributes and access control lists
|
|
(more about this later), the length and the first 15 characters of the name
|
|
of the associated file or directory, and an allocation structure (Figure 2).
|
|
An Fnode is always stored near the file or directory that it represents.
|
|
|
|
The allocation structure in the Fnode can take several forms, depending on
|
|
the size and degree of contiguity of the file or directory. The HPFS views a
|
|
file as a collection of one or more runs or extents of one or more contiguous
|
|
sectors. Each run is symbolized by a pair of doublewords-a 32-bit starting
|
|
sector number and a 32-bit length in sectors (this is referred to as
|
|
runlength encoding). From an application program's point of view, the
|
|
extents are invisible; the file appears as a seamless stream of bytes.
|
|
|
|
The space reserved for allocation information in an Fnode can hold pointers
|
|
to as many as eight runs of sectors of up to 16Mb each. (This maximum run
|
|
size is a result of the band size and free space bitmap placement only; it is
|
|
not an inherent limitation of the file system.) Reasonably small files or
|
|
highly contiguous files can therefore be described completely within the
|
|
Fnode (Figure 3).
|
|
|
|
HPFS uses a new method to represent the location of files that are too large
|
|
or too fragmented for the Fnode and consist of more than eight runs. The
|
|
Fnode's allocation structure becomes the root for a B+ Tree of allocation
|
|
sectors, which in tum contain the actual pointers to the file's sector runs
|
|
(see Figure 4 and the sidebar, "BTrees and B+ Trees"). The Fnode's root has
|
|
room for 12 elements. Each allocation sector can contain, in addition to
|
|
various control information, as many as 40 pointers to sector runs.
|
|
Therefore, a two-level allocation B+ Tree can describe a file of 480 (12*40)
|
|
sector runs with a theoretical maximum size of 7.68Gb (12*40*16Mb) in the
|
|
current implementation (although the 32-bit signed offset parameter for
|
|
DosChgFilePtr effectively limits file sizes to 2Gb).
|
|
|
|
In the unlikely event that a two-level B+ Tree is not sufficient to describe
|
|
a highly fragmented file, the file system will introduce additional levels in
|
|
the tree as needed. Allocation sectors in the intermediate levels can hold
|
|
as many as 60 internal (nonterminal) B+ Tree nodes, which means that the
|
|
descriptive ability of this structure rapidly grows to numbers that are
|
|
nearly beyond comprehension. For example, a threelevel allocation B+ Tree
|
|
can describe a file with as many as 28,800 (12*60*40) sector runs.
|
|
|
|
Run-length encoding and B+ Trees of allocation sectors are a memory-efficient
|
|
way to specify a file's size and location, but they have other significant
|
|
advantages. Translating a logical file offset into a sector number is
|
|
extremely fast: the file system just needs to traverse the list (or B+ Tree
|
|
of lists) of run pointers until it finds the correct range. It can then
|
|
identify the sector within the run with a simple calculation. Run-length
|
|
encoding also makes it trivial to extend the file logically if the newly
|
|
assigned sector is contiguous with the file's previous last sector; the file
|
|
system merely needs to increment the size doubleword of the file's last run
|
|
pointer and clear the sector's bit in the appropriate freespace bitmap.
|
|
|
|
Directories
|
|
|
|
Directories, like files, are anchored on Fnodes. A pointe to the Fnode for
|
|
the root directory is found in the SuperBlock. The Fnodes for directories
|
|
other than the root are reache through subdirectory entries in their parent
|
|
directories.
|
|
|
|
Directories can grow to any size and are built up from 2Kb directory blocks,
|
|
which are allocated as four consecutive sectors on the disk. The file system
|
|
attempts to allocate directory blocks in the directory band, which is located
|
|
at or near the seek center of the disk. Once the directory band is full, the
|
|
directory blocks are allocated wherever space is available.
|
|
|
|
Each 2Kb directory block contains from one to many directory entries. A
|
|
directory entry contains several fields, including time and date stamps, an
|
|
Fnode pointer, a usage count for use by disk maintenance programs, the length
|
|
of the file or directory name, the name itself, and a B-Tree pointer. Each
|
|
entry begins with a word that contains the length of the entry. This
|
|
provides for a variable amount of flex space at the end of each entry, which
|
|
can be used by special versions of the file system and allows the directory
|
|
block to be traversed very quickly (Figure 5).
|
|
|
|
The nuniber of entries in a directory block varies with the length of names.
|
|
If the average filename length is 1 3 characters, an average directory block
|
|
will hold about 40 entries. The entries in a directory block are sorted by
|
|
the binary lexical order of their name fields (this happens to put them in
|
|
alphabetical order for the U.S. alphabet). The last entry in a directory
|
|
block is a dummy record that marks the end of the block.
|
|
|
|
When a directory gets too large to be stored in one block, it increases in
|
|
size by the addition of 2Kb blocks that are organized as a B-Tree"B-Trees and
|
|
B+ Trees"). When searching for a specific name, the file system traverses a
|
|
directory block until it either finds a match or finds a name that is
|
|
lexically greater than the target. In the latter case, the file system
|
|
extracts the BTree pointer from the entry. If there is no pointer, the
|
|
search failed; otherwise the file system follows the pointer to the next
|
|
directory block in the tree and continues the search.
|
|
|
|
A little back-of-the-envelope arithmetic yields some impressive statistics.
|
|
Assuming 40 entries per block, a two-level tree of directory blocks can hold
|
|
1640 directory entries and a three-level tree can hold an astonishing 65,640
|
|
entries. In other words, a particular file can be found (or shown not to
|
|
exist) in a typical directory of 65,640 files with a maximum of three disk
|
|
hits-the actual number of disk accesses depending on cache contents and the
|
|
location of the file's name in the directory block B-Tree. That's quite a
|
|
contrast to the FAT file system, where in the worst case more than 4000
|
|
sectors would have to be read to establish that a file was or was not present
|
|
in a directory containing the same number of files.
|
|
|
|
The B-Tree directory structure has interesting implications beyond its effect
|
|
on open and find operations. A file creation, renaming, or deletion may
|
|
result in a cascade of complex operations, as directory blocks are added or
|
|
freed or names are moved from one block to the other to keep the tree
|
|
balanced. In fact, a rename operation could theoretically fail for lack of
|
|
disk space even though the file itself is not growing. In order to avoid
|
|
this sort of disaster, the HPFS maintains a small pool of free blocks that
|
|
can be drawn from in a directory emergency; a pointer to this pool of free
|
|
blocks is stored in the SpareBlock.
|
|
|
|
Extended Attributes
|
|
|
|
File attributes are information about a file that is maintained by the
|
|
operating system outside the file's overt storage area. The FAT file system
|
|
supports only a few simple attributes (read only, system, hidden, and
|
|
archive) that are actually stored as bit flags in the file's directory entry;
|
|
these attributes are inspected or modified by special function calls and are
|
|
not accessible through the normal file open, read, and write calls.
|
|
|
|
The HPFS supports the same attributes as the FAT file system for historical
|
|
reasons, but it also supports a new form of fileassociated, highly
|
|
generalized information called Extended Attributes (EAs). Each EA is
|
|
conceptually similar to an environment variable, taking the form
|
|
|
|
name- -value
|
|
|
|
except that the value portion can be either a null-terminated (ASCIIZ) string
|
|
or binary data. In OS/2 1.2, each file or directory can have a maximum of
|
|
64Kb of EAs attached to it. This limit may be lifted in a later release of
|
|
OS/2.
|
|
|
|
The storage method for EAs can vary. If the EAs associated with a given file
|
|
or directory are small enough, they will be stored right in the Fnode. If
|
|
the total size of the EAs is too large, they are stored outside the Fnode in
|
|
sector runs, and a B+ Tree of allocation sectors can be created to describe
|
|
the runs. If a single EA gets too large, it can be pushed outside the Fnode
|
|
into a B+ Tree of its own.
|
|
|
|
The kernel API functions DosQFileInfo and DosSetFileInfo have been expanded
|
|
with new information levels that allow application programs to manipulate
|
|
extended attributes for files. The new functions DosQPathInfo and
|
|
DosSetPathInfo are used to read or write the EAs associated with arbitrary
|
|
pathnames. An application program can either ask for the value of a specific
|
|
EA (supplying a name to be matched) or can obtain all of the EAs for the file
|
|
or directory at once.
|
|
|
|
Although application programs can begin to take advantage of EAs as soon as
|
|
the HPFS is released, support for EAs is an essential component in
|
|
Microsoft's long-range plans for object-oriented file systems, Information of
|
|
almost any type can be stored in EAs, ranging from the name of the
|
|
application that owns the file to names of dependent files to icons to
|
|
executable code. As the HPFS evolves, its facilities for manipulating EAs
|
|
are likely to become much more sophisticated. It's easy to imagine, for
|
|
example, that in future versions the API might be extended with EA functions
|
|
that are analogous to DosFindFirst and DosFindNext and EA data might get
|
|
organized into B -Trees.
|
|
|
|
I should note here that in addition to EAs, the LAN Manager version of HPFS
|
|
will support another class of file-associated information called Access
|
|
Control Lists (ACLs). ACLs have the same general appearance as EAs and are
|
|
manipulated in a similar manner, but they are used to store access rights,
|
|
passwords, and other information of interest in a networking multiuser
|
|
environment.
|
|
|
|
Installable File Systems
|
|
|
|
Support for installable file system has been one of the most eagerly
|
|
anticipated features of OS/2 Version 1.2. It will make it possible to access
|
|
multiple incompatible volume structures-FAT, HPFS, CD ROM, and perhaps even
|
|
UNIX(R)--on the same OS/2 system at the same time, will simplify the life of
|
|
network implementors, and will open the door to rapid file system evolution
|
|
and innovation. Installable file systems are, however, only relevant to the
|
|
HPFS insofar as they make use of the HPFS optional. The FAT file system is
|
|
still embedded in the OS/2 kernel, as it was in OS/2 1.0 and 1.1, and will
|
|
remain there as the compatibility file system for some time to come.
|
|
|
|
An installable file system driver (FSD) is analogous in many ways to a device
|
|
driver. An FSD resides on the disk in a file that is structured like a
|
|
dynamic-link library (DLL), typically with a SYS or IFS extension, and is
|
|
loaded during system initialization by IFS= statements in the CONFIG.SYS
|
|
file. IFS= directives are processed in the order they are encountered and
|
|
are also sensitive to the order of DEVICE= statements for device drivers.
|
|
This lets you load a device driver for a nonstandard device, load a file
|
|
system driver from a volume on that device, and so on.
|
|
|
|
Once an FSD is installed and initialized, the kernel communicates with it in
|
|
terms of logical requests for file opens, reads, writes, seeks, closes, and
|
|
so on. The FSD translates these requests-using control structures and tables
|
|
found on the volume itself-into requests for sector reads and writes for
|
|
which it can call special kemel entry points called File System Helpers
|
|
(FsHlps). The kemel passes the demands for sector 1/0 to the appropriate
|
|
device driver and retums the results to the FSD (Figure 6).
|
|
|
|
The procedure used by the operating system to associate volumes with FSDs is
|
|
called dynamic mounting and works as follows. Whenever a volume is first
|
|
accessed, or after it has been locked for direct access and then unlocked
|
|
(for example, by a FORMAT operation), OS/2 presents identifying information
|
|
from the volume to each of the FSDs in tum until one of them recognizes the
|
|
information. When an FSD claims the volume, the volume is mounted and all
|
|
subsequent file 1/0 requests for the volume are routed to that FSD.
|
|
|
|
Performance Issues
|
|
|
|
The HPFS attacks potential bottlenecks in disk throughput at multiple levels.
|
|
It uses advanced data structures, contiguous sector allocation, intelligent
|
|
caching, read-ahead, and deferred writes in order to boost performance.
|
|
|
|
First, the HPFS matches its data structures to the task at hand:
|
|
sophisticated data structures (B-Trees and B+ Trees) for fast random access
|
|
to filenames, directory names, and lists of sectors allocated to files or
|
|
directories, and simple compact data structures (bitmaps) for locating chunks
|
|
of free space of the appropriate size. The routines that manipulate these
|
|
data structures are written in assembly language and have been painstakingly
|
|
tuned, with special focus on the routines that search the freespace bitmaps
|
|
for patterns of set bits (unused sectors).
|
|
|
|
Next, the HPFS's main goal -its prime directive, if you will- is to assign
|
|
consecutive sectors to files whenever possible. The time required to move
|
|
the disk's read/write head from one track to another far outweighs the other
|
|
possible delays, so the HPFS works hard to avoid or minimize such head
|
|
movements by allocating file space contiguously and by keeping control
|
|
structures such as Fnodes and freespace bitmaps near the things they control.
|
|
Highly contiguous files also help the file system make fewer requests of the
|
|
disk driver for more sectors at a time, allow the disk driver to exploit the
|
|
multisector transfer capabilities of the disk controller, and reduce the
|
|
number of disk completion interrupts that must be serviced.
|
|
|
|
Of course, trying to keep files from becoming fragmented in a multitasking
|
|
system in which many files are being updated concurrently is no easy chore.
|
|
One strategy the HPFS uses is to scatter newly created files across the
|
|
disk-in separate bands, if possible-so that the sectors allocated to the
|
|
files as they are extended will not be interleaved. Another strategy is to
|
|
preallocate approximately 4Kb of contiguous space to the file each time it
|
|
must be extended and give back any excess when the file is closed.
|
|
|
|
If an application knows the ultimate size of a new file in advance, it can
|
|
assist the file system by specifying an initial file allocation when it
|
|
creates the file. The system will then search all the free space bitmaps to
|
|
find a run of consecutive sec tors large enough to hold the file. That
|
|
failing, it will search for two runs that are half the size of the file, and
|
|
so on.
|
|
|
|
The HPFS relies on several different kinds of caching to minimize the number
|
|
of physical disk transfers it must request. Naturally, it caches sectors, as
|
|
did the FAT file system. But unlike the FAT file system, the HPFS can manage
|
|
very large caches efficiently and adjusts sector caching on a perhandle basis
|
|
to the manner in which a file is used. The HPFS also caches pathnames and
|
|
directories, transforming disk directory entries into an even more compact
|
|
and efficient inmemory representation.
|
|
|
|
Another technique that the HPFS uses to improve performance is to preread
|
|
data it believes the program is likely to need. For example, when a file is
|
|
opened, the file system will preread and cache the Fnode and the first few
|
|
sectors of the file's contents. If the file is an executable program or the
|
|
history information in the file's Fnode shows that an open operation has
|
|
typically been followed by an immediate sequential read of the entire file,
|
|
the file system will preread and cache much more of the file's contents.
|
|
When a program issues relatively small read requests, the file system always
|
|
fetches data from the file in 2Kb chunks and caches the excess, allowing most
|
|
read operations to be satisfied from the cache.
|
|
|
|
Finally, the OS/2 operating system's support for multitasking makes it
|
|
possible for the HPFS to rely heavity on lazy writes (sometimes called
|
|
deferred writes or write behind) to improve performance. When a program
|
|
requests a disk write, the data is placed in the cache and the cache buffer
|
|
is flagged as dirty (that is, inconsistent with the state of the data on
|
|
disk). When the disk becomes idle or the cache becomes saturated with dirty
|
|
buffers, the file system uses a captive thread from a daemon process to write
|
|
the buffers to disk, starting with the oldest data.
|
|
|
|
In general, lazy writes mean that programs run faster because their read
|
|
requests will almost never be stalled waiting for a write request to
|
|
complete. For programs that repeatedly read, modify, and write a small
|
|
working set of records, it also means that many unnecessary or redundant
|
|
physical disk writes may be avoided. Lazy writes have their dangers, of
|
|
course, so a program can defeat them on a per-handle basis by setting the
|
|
writethrough flag in the OpenMode parameter for DosOpen, or it can commit
|
|
data to disk on a perhandle basis with the DosBufReset function.
|
|
|
|
Fault Tolerance
|
|
|
|
The HPFS's extensive use of lazy writes makes it imperative for the HPFS to
|
|
be able to recover gracefully from write errors under any but the most dire
|
|
circumstances. After all, by the time a write is known to have failed, the
|
|
application has long since gone on its way under the illusion that it has
|
|
safely shipped the data into disk storage. The errors may be detected by the
|
|
hardware (such as a "sector not found" error returned by the disk adapter),
|
|
or they may be detected by the disk driver in spite of the hardware during a
|
|
read-after-write verification of the data.
|
|
|
|
The primary mechanism for handling write errors is called a hotfix. When an
|
|
error is detected, the file system takes a free block out of a reserved
|
|
hotfix pool, writes the data to that block, and updates the hotfix map. (The
|
|
hotfix map is simply a series of pairs of doublewords, with each pair
|
|
containing the number of a bad sector associated with the number of its
|
|
hotfix replacement. A pointer to the hotfix map is maintained in the
|
|
SpareBlock.) A copy of the hotfix map is t written to disk, and a waming
|
|
message is displayed to let the user know that all is not well with the disk
|
|
device.
|
|
|
|
Each time the file system requests a sector read or write from the disk
|
|
driver, it scans the hotfix map and replaces any bad sector numbers with the
|
|
corresponding good sector holding the actual data. This lookaside translation
|
|
of sector numbers is not as expensive as it sounds, since the hotfix list
|
|
need only be scanned when a sector is physically read or written, not each
|
|
time it is accessed in the cache.
|
|
|
|
One of CHKDSK's duties is to empty the hotfix map. For each replacement
|
|
block on the hotfix map, it allocates a new sector that is in a favorable
|
|
location for the file that owns the data, moves the data from the hotfix
|
|
block to the newly allocated sector, and updates the file's allocation
|
|
information (which may involve rebalancing allocation trees and other
|
|
elaborate operations). It then adds the bad sector to the bad block list,
|
|
releases the replacement sector back to the hotfix pool, deletes the hotfix
|
|
entry from the hotfix map, and writes the updated hotfix map to disk.
|
|
|
|
Of course, write errors that can be detected and fixed on the fly are not the
|
|
only calamity that can befall a file system. The HPFS designers also had to
|
|
consider the inevitable damage to be wreaked by power failures, program
|
|
crashes, malicious viruses and Trojan horses, and those users who tum off the
|
|
machine without selecting Shutdown in the Presentation Manager Shell.
|
|
(Shutdown notifies the file system to flush the disk cache, update
|
|
directories, and do whatever else is necessary to bring the disk to a
|
|
consistent state.)
|
|
|
|
The HPFS defends itself against the user who is too abrupt with the Big Red
|
|
Switch by maintaining a Dirty FS flag in the SpareBlock of each HPFS volume.
|
|
The flag is only cleared when all files on the volume have been closed and
|
|
all dirty buffers in the cache have been written out or, in the case of the
|
|
boot volume (since OS2.INI and the swap file are never closed), when Shutdown
|
|
has been selected and has completed its work.
|
|
|
|
During the OS/2 boot sequence, the file system inspects the DirtyFS flag on
|
|
each HPFS volume and, if the flag is set, will not allow further access to
|
|
that volume until CHKDSK has been run. If the DirtyFS flag is set on the boot
|
|
volume, the system will refuse to boot; the user must boot OS/2 in
|
|
maintenance mode from a diskette and run CHKDSK to check and possibly repair
|
|
the boot volume.
|
|
|
|
In the event of a truly major catastrophe, such as loss of the SuperBlock or
|
|
the root directory, the HPFS is designed to give data recovery the best
|
|
possible chance of success. Every type of crucial file object-including
|
|
Fnodes, allocation sectors, and directory blocks-is doubly linked to both its
|
|
parent and its children and contains a unique 32-bit signature. Fnodes also
|
|
contain the initial portion of the name of their file or directory.
|
|
Consequently, CHKDSK can rebuild an entire volume by methodically scanning
|
|
the disk for Fnodes, allocation sectors, and directory blocks, using them to
|
|
reconstruct the files and directories and finally regenerating the freespace
|
|
bitmaps.
|
|
|
|
Application Programs and the HPFS
|
|
|
|
Each of the OS/2 releases thus far have carried with them a major
|
|
discontinuity for application programmers who learned their trade in the
|
|
MS-DOS environment. In OS/2 1.0, such programmers were faced for the first
|
|
time with virtual memory, multitasking, interprocess communications, and the
|
|
protected mode restrictions on addressing and direct control of the hardware
|
|
and were challenged to master powerful new concepts such as threading and
|
|
dynamic linking. In OS/2 Version 1.1, the stakes were raised even further.
|
|
Programmers were offered a powerful hardware-independent graphical user
|
|
interface but had to restructure their applications drastically for an
|
|
event-driven environment based on objects and message passing.
|
|
|
|
In OS/2 Version 1.2, it is time for many of the file-oriented programming
|
|
habits and assumptions carried forward from the MS-DOS environment to fall by
|
|
the wayside. An application that wishes to take full advantage of the HPFS
|
|
must allow for long, free-form, mixed-case filenames and paths with few
|
|
restrictions on punctuation and must be sensitive to the presence of EAs and
|
|
ACLs. After all, if EAs are to be of any use, it won't suffice for
|
|
applications to update a file by renaming the old file and creating a new one
|
|
without also copying the EAs.
|
|
|
|
But the necessary changes for OS/2 Version 1.2 are not tricky to make. A new
|
|
API function, DosCopy, helps applications create backups-it essentially
|
|
duplicates an existing file together with its EAs. EAs can also be
|
|
manipulated explicitly with DosQFileInfo, DosSetFileInfo, DosQPathInfo, and
|
|
DosSetPathInfo. A program should call DosQSysInfo at run time to find the
|
|
maximum possible path length for the system, and ensure that all buffers used
|
|
by DosChDir, DosQCurDir, and related functions are sufficiently large.
|
|
Similarly, the buffers used by DosOpen, DosMove, D o s G e t M o d N a m e ,
|
|
DosFindFirst, DosFindNext, and like functions must allow for longer
|
|
filenames. Any logic that folds cases in filenames or tests for the
|
|
occurrence of only one dot delimiter in a filename must be rethought or
|
|
eliminated
|
|
|
|
The other changes in the API will not affect the average application. The
|
|
functions DosQFileInfo, DosFindFirst, and DosFindNext now retum all three
|
|
sets of times and dates (created, last accessed, last modified) for a file on
|
|
an HPFS volume, but few programs are concerned with time and date stamps
|
|
anyway. DosQFsInfo is used to obtain volume labels or disk characteristics
|
|
just as before, and the use of DosSetFsInfo for volume labels is unchanged.
|
|
There are a few totally new API functions such as DosFsCtl (analogous to
|
|
DosDev IOCtl but used for communication between an application and an FSD),
|
|
DosFsAttach (a sort of explicit mount call), and DosQFsAttach (determines
|
|
which FSD owns a volume); these are intended mainly for use by disk utility
|
|
programs.
|
|
|
|
In order to prevent old OS/2 applications and MS-DOS applications running in
|
|
the DOS box from inadvertently damaging HPDS files, a new flag bit has been
|
|
defined in the EXE file header that indicates whether an application is
|
|
HPFS-aware. If this bit is not set, the application will only be able to
|
|
search for, open, or create files on HPFS volumes that are compatible with
|
|
the FAT file system's 8.3 naming conventions. If the bit is set, OS/2 allows
|
|
access to all files on an HPFS volume, because it assumes that the program
|
|
knows how to handle long, free-form filenames and will take the
|
|
responsibility of conserving a file's EAs and ACLs.
|
|
|
|
Summary
|
|
|
|
The HPFS solves all of the historical problems of the FAT file system .It
|
|
achieves excellent throughput even in extreme cases-many very small files or
|
|
a few very large files-by means of advanced data structures and techniques
|
|
such as intelligent caching, read-ahead, and write-behind. Disk space is used
|
|
economically because it is managed on a sector basis. Existing application
|
|
programs will need modification to take advantage of the HPFS's support for
|
|
extended attributes and long filenames, but these changes will not be
|
|
difficult. All application programs will benefit from the HPFS's improved
|
|
performance and decreased CPU use whether they are modified or not. This
|
|
article is based on a prerelease version of the HPFS that was still
|
|
undergoing modification and tuning. therefore, the final release of the HPFS
|
|
may differ in some details from the description given here--Ed.
|
|
|
|
B- Trees and B+ Trees
|
|
|
|
Most programmers are at least passingly familiar with the data structure
|
|
known as a binary tree. Binary trees are a technique for imposing a logical
|
|
ordering on a collection of data items by means of pointers, without regard
|
|
to the physical order of the data.
|
|
|
|
In a simple binary tree, each node contains some data, including a key value
|
|
that determines the node's logical position in the tree, as well as pointers
|
|
to the node's left and right subtrees. The node that begins the tree is known
|
|
as the root; the nodes that sit at the ends of the tree's branches are
|
|
sometimes called the leaves.
|
|
|
|
To find a particular piece of data, the binary tree is traversed from the
|
|
root, At each node, the desired key is compared with the node's key; if they
|
|
don't match, one branch of the node's subtree or another is selected based on
|
|
whether the desired key is less than or greater than the node's key. This
|
|
process continues until a match is found or an empty subtree is encountered
|
|
(see Figure A).
|
|
|
|
Such simple binary trees, although easy to understand and implement, have
|
|
disadvantages in practice. If keys are not well distributed or are added to
|
|
the tree in a non-random fashion, the tree can become quite asymmetric,
|
|
leading to wide variations in tree traversal times.
|
|
|
|
In order to make access times uniform, many programmers prefer a particular
|
|
type of balanced tree known as a B-Tree. For the purposes of this
|
|
discussion, the important points about a B-Tree are that data is stored in
|
|
all nodes, more than one data item might be stored in a node, and all of the
|
|
branches of the tree are of identical length (see Figure B).
|
|
|
|
The worst-case behavior of a B-Tree is predictable and much better than that
|
|
of a simple binary tree, but the maintenance of a B-Tree is correspondingly
|
|
more complex. Adding a new data item, changing a key value, or deleting a
|
|
data item may result in the splitting or merging of a node, which in tum
|
|
forces a cascade of other operations on the tree to rebalance it.
|
|
|
|
A B+ Tree is a specialized form of B-Tree that has two types of nodes:
|
|
internal which only point to other nodes, and external, which contain the
|
|
actual data (see Figure C).
|
|
|
|
The advantage of a B+ Tree over a B- Tree is that the internal nodes of the
|
|
B+ Tree can hold many more decision values than the intermediate-level nodes
|
|
of a B-Tree, so the fan out of the tree is faster and the average length of a
|
|
branch is shorter. This makes up for the fact that you must always follow a
|
|
B+ Tree branch to its end to get the data for which you are looking, whereas
|
|
in a B-Tree you may discover the data at an intermediate node or even at the
|
|
root.
|
|
|
|
|
|
--
|
|
Robert Kelley Cook rcook@darwin.cc.nd.edu
|
|
Univ. of Notre Dame
|
|
Physics Dept. If you think these are ND opinions, speak to the pope!
|
|
|
|
Hmm. guess thats where I got it from... ;)
|
|
|
|
O /
|
|
X -------- cut here --------------------------------------------
|
|
O \
|
|
|
|
Work hard and prosper,
|
|
|
|
Pete W. De Bonte (x4173) pwd+@cmu.edu
|
|
"Knolege is powef, Speling is unimportnt" (but pd2n is unaceptable)
|
|
|
|
|